
AES Convention Papers Forum

Native High-Resolution versus Red Book Standard Audio: A Perceptual Discrimination Survey



The perceptibility of high-resolution versus CD standard audio has been the subject of research and debate since the introduction of hi-res audio distribution formats twenty years ago. The author conducted a large survey to determine whether experienced listeners could differentiate between a diverse set of twenty native high-resolution PCM stereo recordings and down conversions of the same masters at 44.1 kHz/16-bit fidelity. Participants were encouraged to audition the files using their own systems, which ranged from modest headphone-based personal setups, through audiophile-quality rooms costing in excess of $50,000, to professional studio environments. They were not allowed to use analytical tools or other non-listening means to assist in their observations. Over 400 responses were received from professional audio engineers, experienced audiophiles, casual music enthusiasts, and novices aged eleven to eighty-one years. The online survey submissions show that high-resolution audio was undetectable by a substantial majority of the respondents regardless of experience level, equipment cost, or process, with almost 25% choosing "No Choice". However, some evidence exists that specific genres and recordings produced moderately higher positives.




Comments on this paper

John Stuart


Comment posted October 29, 2020 @ 15:48:43 UTC

Up to now, the design of listening tests to discriminate audio resolution has tended to increase rigour by limiting the number of variables – in particular regarding suitability of playback equipment, sound level, listening environment and a clear decision task. So, it was brave to imagine that the test described in this paper would yield results. 

It might have been helpful to include an element of training, which was strongly recommended in the seminal meta-analysis on this topic by Reiss [1].

Given that each listener had a different playback scenario and that no record was kept of the number of repeats, there are uncertainties around the pooling of data. 

However, there seems to be a fundamental flaw of analysis in the paper which should first be addressed.

In section 5, where pooling of data is accepted, the calculation is incorrect. Using the data in the paper, we calculate an extremely low value of p < 0.01, meaning that the conclusion should be reversed to: “There is less than 1% chance that the listeners could not discriminate between the variants.” Because the number of trials is high, it can also be examined using a normal distribution, where the result is more than 6 standard deviations from the mean – again, highly significant.

It is difficult to read the graphs in Figs. 10 & 11, but there are hints of a trend that favours experienced listeners over novices and better equipment over basic, which may have been obscured by the incorrect calculation method.

In view of the scale of the study, it would be helpful if the author could share the raw data with AES so that more trends could be studied, including between listeners and between songs.

J Robert Stuart, Michael Capp.

[1] J. D. Reiss, "A Meta-Analysis of High Resolution Audio Perceptual Evaluation," J. Audio Eng. Soc., vol. 64, no. 6, pp. 364-379, June 2016.
doi: https://doi.org/10.17743/jaes.2016.0015                    


Vicki R. Melchior
TC-HRA Chair


Comment posted November 4, 2020 @ 16:37:25 UTC

The author describes a large-scale listening test, termed a survey rather than a rigorous test, and done with a goal of researching whether high resolution audio is distinguishable from CD over a wide range of listeners using the audio gear they normally listen to. Level-matched A/B file pairs, one at 96 kHz/24-bit and one downsampled to CD rate and upsampled again, were made available as downloads, with no instructions to the listeners beyond cautioning against cheating and asking them to identify which file is the original 96 kHz file.

A test like this would be of interest since the question of high res audibility to a general audience under ordinary listening conditions hasn't been studied. However, a 'survey' is not a meaningful idea because it is still a test, just not a well-done one. A listening test without any structure, training, or controls is problematic at many levels. Rigorous testing is done because many well-known errors, biases, and methodological flaws can arise, and collectively they are likely to bias the outcome toward a null if uncontrolled. When untrained listeners are told to "do whatever you prefer" in listening, it doesn't mean that they explore different listening options and select the best; more likely they settle into 'something' or vary the test inconsistently. Structuring a good test is the responsibility of the researcher and not of the listeners.

What is Really Shown in the Paper

 

As previous commenters noted, the author's analysis in section 5.1 ("The Null Hypothesis") is wrong and should be corrected or disregarded. Having pooled all of the data, he states that the null hypothesis is supported, i.e. there is no discrimination between high res and CD. But accurate analysis of the percent correct data (3364 correct out of 6181 trials) shows a discrimination effect with extremely high significance (p = .000001). Listeners could clearly discriminate, although the effect, measured under the circumstances and with all data pooled, was modest (54.4%).
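For reference, a minimal sketch of that recalculation, using only the pooled counts quoted above (3364 correct of 6181 trials); the precise p-value depends on whether an exact binomial or a normal approximation, one- or two-sided, is used, but it is far below conventional significance thresholds in every case:

```python
# Check of the pooled result quoted above: 3364 correct of 6181 trials
# against a 50% chance level (normal approximation to the binomial).
from math import sqrt, erfc

correct, trials, chance = 3364, 6181, 0.5

proportion = correct / trials                 # ~0.544 (the 54.4% figure above)
mean = trials * chance                        # expected number correct under the null
sd = sqrt(trials * chance * (1 - chance))     # binomial standard deviation
z = (correct - mean) / sd                     # ~7 standard deviations above chance
p_one_sided = 0.5 * erfc(z / sqrt(2))         # upper-tail probability under the null

print(f"proportion correct = {proportion:.3f}")
print(f"z = {z:.2f}, one-sided p = {p_one_sided:.1e}")
```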

Listed below are some of the major issues in the paper relative to what is normally done in testing, and a consideration of why they can lead to errors or biases that, in turn, promote a null result. The primary conclusion is that, had this test been better structured and controlled, it almost certainly would have shown better discrimination than the modest effect measured by the author.

As a matter of data presentation and argument, the author also divides the data according to listener experience, equipment quality, listening environment, and age (sections 5.4-5.7), and presents each as a histogram, with the conclusion that "this does not seem to improve the ability to discern HD over CD". But there is no analytic support given for these conclusions; they are simply statements!

The author needs to revise the conclusions in the paper since they are either incorrect (discriminability) or indeterminate (histogram outcome) and lead to unwarranted decisions about high res discrimination.

Some Requirements of Listening Tests

a) The ability to quickly compare short phrases (5-30 sec) to avoid limits of short-term memory

b) Limiting the duration of listening times to forestall perceptual fatigue

c) Maintenance of listener attention and effort level

d) Control of volume (not too low or too high) and instructions to keep A & B at the same volume

e) Listener pre-test training for unskilled listeners on the meaning of "what is high resolution"

f) For rigorous tests, the ability to verify listeners' skill levels, test procedures, and gear

If quick comparison of short phrases (5-30 sec) is not observed or possible, then discrimination is very likely to drop due to failures of short-term memory. This is especially true for high res, which is an audible refinement but not typically a "day and night difference" (1). Listeners weren't instructed on listening times, switching, or repeats, and additionally, a great deal of consumer gear makes it difficult to switch quickly and start play at specific times. (Workstation software or prepared control scripts are ideal.)

Extended listening and too many successive repeats of A/B are known to cause loss of acuity due to perceptual fatigue, memory limitations and adaptation. These aren't obvious to anyone unfamiliar with listening tests, so require instruction by the researcher.

In general, listening tests are hard and intensive. Focused attention and motivation are required of the listener over periods of time, which is why formal testing includes methods to monitor and encourage participants.

In posing the question "which is high res", this test relies on an individual's internal concept of what high resolution means. This may be reasonable for pros and experienced listeners, but with the inexperienced listeners sought by the author, no such internal reference exists, and pre-training is essential in order to understand what "high res" means. Previous publications have shown that pre-training very substantially improves the outcome of tests of high resolution vs. CD (1).

Methodological and Analytic Problems with the Test

 

a) Improper response options for 2AFC ("no choice"), which yields biased answers

b) Lack of anchors in the test (they are especially needed here to "test the test")

c) ABX or same/different protocol is preferable to 2AFC (A/B) to avoid misidentifications

d) Pooling of different categories requires analysis for suitability and multiple comparisons

Inclusion of a "no choice" answer violates the assumptions of analysis based on the binomial distribution. Although the best choice after the fact is to throw out these answers, the presence of the response during the test already biases the result.

Anchors are hidden references included in a listening test that help to "test the test". In an online test like this one, with many sources of error, they would help to bound the limits to which fine discrimination can be expected at all in the circumstances. Here, anchors might have been A/B pairs that are more easily discriminated than the CD/96 kHz pair; for example, 96 kHz versus very low bit rate MP3.

A/B (2AFC) is not the best choice of protocol. 2AFC (which is what A/B is without the "no choice" option) is formally a "directed test", meaning that the researcher instructs listeners what variable to listen for in comparing A to B. 2AFC is right for files where a single well-understood variable is under test, e.g. which is loudest, which has the most treble.

High resolution, on the other hand, influences at least a half dozen parameters including transient resolution, detail definition, spatial clarity, immediacy or naturalness of voice and instruments, and lowered noise. Because it is multi-faceted and perceptual in nature, two listeners may put different weight on these parameters and arrive at different conclusions as to whether A or B is the high res one. (And hardware differences may affect such parameters too.) Detecting a difference between A and B but labelling them the wrong way is known as "Simpson's Paradox", and is a well-known problem in statistics. Note that this is not the same as not hearing a difference or guessing; it is the result of accurate discrimination but mislabeling based on judgement. When Simpson's Paradox occurs, correct and "incorrect" responses can add to produce a null when pooled.

Thus a better test from this standpoint would have been either ABX or same/different. These tests pose a different question: "is A the same as B" or "is A or B the same as the reference X". In both cases they ask whether a difference is heard, but avoid asking for an identification based on the listener's internal reference.
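To illustrate the pooling point with invented numbers (purely hypothetical, not data from the paper): two listeners who both discriminate reliably, but with opposite labelling, sum to chance performance.

```python
# Hypothetical illustration: both listeners hear a difference on most trials,
# but listener B consistently attaches the "high res" label to the wrong file.
listener_a = {"correct": 80, "trials": 100}   # discriminates and labels correctly
listener_b = {"correct": 20, "trials": 100}   # discriminates but labels in reverse

pooled_correct = listener_a["correct"] + listener_b["correct"]   # 100
pooled_trials = listener_a["trials"] + listener_b["trials"]      # 200

print(pooled_correct / pooled_trials)   # 0.5 -> looks like guessing once pooled
```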

 

 

(1) J. D. Reiss, "A Meta-Analysis of High Resolution Audio Perceptual Evaluation," J. Audio Eng. Soc., vol. 64, no. 6, pp. 364-379, June 2016.


Robert Orban


Comment posted April 16, 2021 @ 15:34:52 UTC

Level matching by using peak levels may have introduced audible errors unrelated to "high resolution" because removal of weak ultrasonic elements in the material via downward sample rate conversion affects peak level to a significantly greater extent than it does loudness. I would be interested if the author could compare the BS.1770-4 integrated loudness of the sample pairs. If they differ, then this could also influence the results of the tests.
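As one possible way to run that comparison (a sketch only; it assumes the pyloudnorm and soundfile Python packages, and the file names are placeholders rather than the actual survey material):

```python
# Compare BS.1770-4 integrated loudness of a hi-res/CD-rate file pair.
# pyloudnorm's Meter implements the ITU-R BS.1770-4 algorithm.
import soundfile as sf
import pyloudnorm as pyln

def integrated_loudness(path):
    data, rate = sf.read(path)        # samples x channels, float
    meter = pyln.Meter(rate)          # BS.1770-4 meter at the file's sample rate
    return meter.integrated_loudness(data)

lufs_hires = integrated_loudness("pair01_96k_24bit.wav")   # placeholder file name
lufs_cd = integrated_loudness("pair01_44k1_16bit.wav")     # placeholder file name
print(f"96k: {lufs_hires:.2f} LUFS, 44.1k: {lufs_cd:.2f} LUFS, "
      f"difference: {lufs_hires - lufs_cd:.2f} LU")
```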

 
I would also be interested in seeing the frequency response of the filter used in the downsampler, with the goal of discovering whether it could have introduced audible aliasing into the downsampled material.
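The paper does not disclose which sample-rate converter or anti-aliasing filter was used, so its response cannot be reproduced here, but an aliasing check of the kind suggested can be sketched empirically; the snippet below uses scipy's polyphase resampler purely as a stand-in for the unknown converter:

```python
# Empirical aliasing check: pass a 30 kHz tone (inaudible, above the 22.05 kHz
# Nyquist limit of CD) through a 96 kHz -> 44.1 kHz conversion and look for
# folded energy at |44.1 kHz - 30 kHz| = 14.1 kHz, well inside the audible band.
import numpy as np
from scipy.signal import resample_poly

fs_in, fs_out = 96_000, 44_100
t = np.arange(fs_in) / fs_in                        # 1 second of signal
tone = np.sin(2 * np.pi * 30_000 * t)               # ultrasonic test tone

converted = resample_poly(tone, up=147, down=320)   # 44100/96000 = 147/320

spectrum = np.abs(np.fft.rfft(converted)) * 2 / len(converted)   # approx. amplitudes
freqs = np.fft.rfftfreq(len(converted), d=1 / fs_out)

alias_bin = np.argmin(np.abs(freqs - 14_100))
print(f"aliased level near 14.1 kHz: "
      f"{20 * np.log10(spectrum[alias_bin] + 1e-12):.1f} dB re full-scale tone")
```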
