Perceived Audio Quality of Sounds Degraded by Non-linear Distortions and Single-Ended Assessment Using HASQI

For field recordings and user generated content recorded on phones, tablets, and other mobile 
devices nonlinear distortions caused by clipping and limiting at pre-amplification stages, and 
dynamic range control (DRC) are common causes of poor audio quality. A single-ended 
method to detect these distortions and predict perceived degradation in speech, music, and 
soundscapes has been developed. This was done by training an ensemble of decision trees. 
During training, both clean and distorted audio was available and so the perceived quality 
could be gauged using HASQI (Hearing Aid Sound Quality Index). The new single-ended 
method can predict HASQI from distorted samples to an accuracy of ±0.19 (95% 
confidence interval) on a quality scale from 0.0 to 1.0. The method also has potential 
for estimating HASQI when other types of degradation are present. Subsequent perceptual 
tests validated the method for music and soundscapes: the single-ended method estimated 
the mean opinion score for perceived audio quality, on a scale from 0 to 1, to within 
±0.33.


INTRODUCTION
Modern technologies have enabled handy recording devices, large data storage, and diverse outlets of User Generated Content (UGC). Three hundred hours of video are uploaded to YouTube every minute, and along with other online databases such as freesound.org and soundcloud.com, much user-generated audio is widely available. UGC is now used extensively in news broadcasting: on average, a news agency adopts 11 pieces of UGC daily [1]. This necessitates a rapid assessment method to determine if the UGC is broadcast-worthy, and so media asset management systems would benefit from automatically generated audio quality metadata. Furthermore, if audio problems can be detected while recording, feedback can be given to the operator of the device and many disappointing end results can be avoided. A survey of both amateur and expert recordists [2] found that the four most commonly reported errors were: background noise (59%), wind noise (46%), handling noise (31%), and other distortions (19%). Wind noise problems in recordings have been addressed recently by the authors [3]. Motivated by the need to tackle other recording errors, this paper develops a method that can predict the perceived quality of audio contaminated by distortion. Distortion problems also arise with other audio systems such as hearing aids, sound reinforcement, and public address systems, and consequently the method developed has a wider applicability than just UGC.
Three of the most common objective measures to quantify non-linear distortions are Total Harmonic Distortion (THD) [4], Inter-Modulation Distortion (IMD) [5], and Total Difference-Frequency Distortion (TDFD) [6], [7]. Lee and Geddes [8], [9] showed that there is a poor correlation between the perceived amount of distortion and the THD and IMD for a piece of music. They proposed an alternative measure with improved correlation based on integrating the second derivative of the non-linear amplitude transfer function. A number of perceptual measures have been developed to better model the perceived quality after degradation. These include double-ended methods for speech [10]-[13] that have been standardized, such as Perceptual Evaluation of Speech Quality (PESQ) [14] and its updated version POLQA [15]. Perceptual Evaluation of Audio Quality (PEAQ) [16] has also been developed to assess audio quality. PEAQ and PESQ are primarily used for assessing quality degradations caused by digital coding, complex audio processing, or transmission chains [17]. The Distortion Score (DS) [18], Rnonlin [17], and the Hearing Aid Sound Quality Index (HASQI) [19] are double-ended methods able to predict the degradation in quality caused by overloading of transducers and preamplifiers. Recent studies have shown that HASQI generalizes well for normal-hearing listeners [20], achieving good accuracy when predicting mean opinion scores. For music, HASQI was found to predict the perceived degradation in audio quality due to clipping effectively [21]. HASQI can therefore be used to assess distortion on transmission channels, but only if both the original and degraded signals are available.
There are many occasions where the undistorted sound is unknown. UGC is a good example, where a single-ended method working just from the corrupted audio is needed. An example of a single-ended method is ITU Recommendation P.563 [22], but this is restricted to narrow-band speech. Maré [23] presented a method to detect clipping in audio signals using a supervised artificial neural network. The test set was not sufficiently distinct from the training set, however, raising doubts about the capability of the method to generalize to unknown sources.
The new method presented below exploits a different machine learning regime to map features extracted from the corrupted audio to predict human perceived quality monitored using HASQI. A broader database of samples is used, demonstrating the need for more features to achieve generalization.

METHOD
A machine learning regime is used to take features extracted from the distorted audio and predict human perceived quality. Fig. 1 gives an overview of the proposed method. Speech, music, and soundscape samples were artificially distorted in a controlled manner using a diverse range of non-linear processes. The distortion of each sample was quantified using HASQI to form a teacher value for the machine learning algorithm that is used during supervised training. Before passing the audio to the machine learning algorithm it is necessary to reduce the amount of data, and this is done by extracting key features.

Database Formation
The machine learning scheme will learn to map from audio features to HASQI using a large database of training examples. The inclusion of a sufficient number of cases in the dataset is vital. The cases need to represent the wide range of likely audio samples in terms of what might be recorded and also the distortion likely to be encountered.

Audio Database
Speech, music, and soundscape samples were used to represent the most likely sources of recorded audio. An audio database was collected from a large collection of CDs, including speech, music of various genres, and soundscapes covering a range of geophonic, biophonic, and anthrophonic sound sources. The database contains 404 music files with an average length of 2 minutes 45 seconds, 182 speech files with an average length of 4 minutes 48 seconds, and 469 soundscape clips with an average duration of 1 minute 48 seconds. At least one 10-second excerpt was randomly taken from each of these files, resulting in 1500 10-second excerpts in total, with approximately 500 each of speech, music, and soundscapes.

Distorting Samples
To create distortion algorithms to degrade the samples, it was necessary to better understand common recording problems and technologies. In microphones and preamplifiers, overloading can occur when the signals go beyond a device's dynamic range. This causes the peaks in a waveform to be clipped, generating harmonics of the original signal. In addition, when the analogue signal exceeds the dynamic range of an AD converter, aliased distortions may also be introduced.
Many devices incorporate Dynamic Range Control (DRC) to protect against overloading. The DRC reduces the amplification gain when the peak or root mean square (rms) level of the signal is likely to overload the circuit. Instead of reducing the gain instantaneously, the DRC often incorporates an integration period, characterized by an attack and release time, and the gain reduction is usually characterized by a compression ratio. Dynamic range control systems can inadvertently degrade perceived quality, and careful choice of parameters is important [24]: (i) Audible distortion may occur if the release time is too short and the amplitude gain is modulated too quickly. (ii) Dropouts are likely to happen if the release time is too long, because the suppressed gain does not recover quickly enough to handle subsequent weak signals. This produces a "pumping" effect that is obvious to the listener. (iii) When the attack time is too short, transients are suppressed excessively, resulting in a lack of punch and clarity; the effectiveness of the compression can also be compromised. In addition, the DRC system is a dynamic compressor and so may also introduce other artifacts or nonlinear distortions and degrade the signal-to-noise ratio [25]. Table 1 describes the ranges of the three key parameters found in the devices that had DRC [26]. DRC may not completely eliminate overloading, in which case the compression ratio would be inadequate when the signal level is high. Therefore, to detect nonlinear distortions in audio, all three scenarios must be carefully considered in constructing the database of examples: overloading at the preamplifier, distortions due to the DRC system, and overloading during analogue-to-digital conversion.
Distortion was emulated using the method developed by De Man and Reiss [27], in which an amplitude transfer function is applied with the following parameters: x_B = x + B, where x is the instantaneous value of the input signal (ranging between -1 and 1); T is the threshold (a value between 0 and 1); K is the knee parameter (K = 1 for a hard knee, K > 1 for a soft knee), where a Hermite spline is used to connect the linear part (which ends where |x| = T/√K) and the non-linear part; and B is a bias parameter that adds a small DC offset to the signal. Signal components above 22050 Hz can be aliased. To simulate distortion without significant aliasing, the signal was up-sampled four times to 176.4 kHz before applying the amplitude transfer function and then down-sampled to 44.1 kHz afterwards. The oversampling rate was chosen by computing the signal power above 22050 Hz in the oversampled signal for typical sources and parameters. As the oversampling rate is increased, the signal power above 22050 Hz in the digital domain converges towards the power in the analogue signal above 22050 Hz. This convergence indicates that above a certain oversampling level aliasing becomes insignificant; an oversampling rate of 4 was found to be sufficient.
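As an illustrative sketch (function names are ours; only the hard-knee case K = 1 is shown, since the soft-knee Hermite-spline interpolation depends on details not reproduced here), the biased hard clipper could be written as:

```python
def hard_clip(sample, T=0.5, B=0.0):
    """Hard-knee clipping (K = 1): add the DC bias B, then clamp to +/-T.

    sample is assumed to lie in [-1, 1]; T is the clipping threshold.
    """
    x_b = sample + B
    return max(-T, min(T, x_b))

# Apply to a buffer of samples; in the paper's pipeline the signal is 4x
# oversampled before this stage and downsampled afterwards to limit aliasing.
clipped = [hard_clip(s, T=0.5) for s in (-0.9, -0.2, 0.1, 0.8)]
```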
The Dynamic Range Control was emulated using the method by Giannoulis et al. [28]. Peak level detection was chosen for its prevalence in DRC systems. Giannoulis et al. modeled four peak detection methods in DRC systems: branching, smoothed-branching, decoupled, and smoothed-decoupled. Decoupling is where the peak level is measured using a separate circuit, ensuring that the peak level measure is not dependent on the attack time. This is simulated by

    peak1[n] = max(x_L[n], α_R · peak1[n-1])
    peak_L[n] = α_A · peak_L[n-1] + (1 - α_A) · peak1[n]

where α_A = e^(-1/(τ_a·Fs)) and α_R = e^(-1/(τ_r·Fs)); τ_a is the attack time; τ_r the release time; peak_L[n] is the peak level at sample n; x_L[n] is the absolute value of sample n; and Fs is the sampling frequency. In this method the attack envelope is imposed on the release envelope, and therefore a branching simulation is also developed that ensures the attack and release envelopes are decoupled. If the signal does not completely decay away after the compressor is released, the release envelope decays at the prescribed rate and meets the background plateau more quickly than expected.
To ensure that the release time is always the same, the release envelope can be smoothed so that it decays gently to the background level rather than silencing abruptly. Smoothing can be applied to both methods; for the branching method the peak detection becomes

    peak_L[n] = α_A · peak_L[n-1] + (1 - α_A) · x_L[n],   x_L[n] > peak_L[n-1]
    peak_L[n] = α_R · peak_L[n-1] + (1 - α_R) · x_L[n],   x_L[n] ≤ peak_L[n-1]

and for the decoupled peak detection

    peak1[n] = max(x_L[n], α_R · peak1[n-1] + (1 - α_R) · x_L[n]).

These four methods introduce varying levels of harmonic distortion [24].
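A minimal sketch of the decoupled peak detector, following the recursions of Giannoulis et al. [28] (variable names are illustrative, and this is a simplified reconstruction rather than the paper's exact implementation):

```python
import math

def decoupled_peak_level(x_abs, tau_a, tau_r, fs, smooth=False):
    """Decoupled peak level detection (sketch, after Giannoulis et al. [28]).

    x_abs:  sequence of absolute sample values x_L[n]
    tau_a:  attack time in seconds; tau_r: release time in seconds
    The release runs in a separate recursion (peak1), so the measured
    peak level does not depend on the attack time.
    """
    a_a = math.exp(-1.0 / (tau_a * fs))
    a_r = math.exp(-1.0 / (tau_r * fs))
    peak1 = peak_l = 0.0
    out = []
    for x in x_abs:
        if smooth:
            peak1 = max(x, a_r * peak1 + (1.0 - a_r) * x)  # smoothed release
        else:
            peak1 = max(x, a_r * peak1)                    # abrupt release
        peak_l = a_a * peak_l + (1.0 - a_a) * peak1        # attack smoothing
        out.append(peak_l)
    return out
```

Feeding a step followed by silence shows the expected behavior: the envelope rises towards the peak at the attack rate, then decays at the release rate once the signal stops.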
A Monte Carlo simulation was carried out with each of the 10-second audio samples being distorted or compressed in six ways, as shown in Table 2. As this is a system that learns from data, care was taken to ensure that the distribution of samples was well balanced in terms of the types of non-linear processing that may be encountered. For the clipping distortion, the parameters used for the simulation are described in Table 3 and for the DRC the parameters in Table 4. These parameters are randomly generated, but with rules applied to the generating functions to ensure a balanced distribution of examples. The reasons for each choice are explained in more detail in Appendices 1 and 2.

Teacher Values
Supervised machine learning needs large quantities of labeled data for training. The massive number of samples, arising from the combination of distortion types, distortion levels, and the huge number of original sources, makes labeling them by subjective testing impossible. Taking advantage of having both the original and distorted audio during the training phase, the double-ended HASQI [19] was used to compute the teacher values. The original and distorted audio samples were segmented using rectangular windows of one second with 50% overlap. Each window was normalized to its rms value before estimating HASQI.
HASQI is a continuous value from 0 to 1 but is based on subjective tests that returned a five-level quality score from Bad to Excellent, as suggested by ITU-R BS.1284-1 [29]. As a supervised classifier was adopted to perform the prediction, HASQI is first quantized back to the five classes shown in Table 5. The class determined by HASQI over one second using the double-ended method will be referred to as Class D, and the single-ended estimate of that class is referred to as Class S. The nonuniform scale divisions arise from the definition of the ends of the HASQI scale, where Bad = 0 and Excellent = 1: spacing the other descriptors equally over the scale and then quantizing causes the Good, Fair, and Poor classes to have a width of 0.25, while Excellent and Bad have a smaller width of 0.125.
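The quantization can be sketched as follows; the boundary values (not reproduced from Table 5) are inferred from the class widths stated above, with Bad and Excellent occupying the half-width end bins:

```python
def hasqi_class(h):
    """Quantize a HASQI value in [0, 1] to one of the five quality labels.

    Boundaries are inferred from the stated class widths: Bad [0, 0.125),
    Poor [0.125, 0.375), Fair [0.375, 0.625), Good [0.625, 0.875),
    Excellent [0.875, 1].
    """
    labels = ["Bad", "Poor", "Fair", "Good", "Excellent"]
    bounds = [0.125, 0.375, 0.625, 0.875]
    for bound, label in zip(bounds, labels):
        if h < bound:
            return label
    return "Excellent"
```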

Machine Learning Algorithms
Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), Hidden Markov Models (HMMs), and Gaussian Mixture Models (GMMs) are well-known machine learning algorithms in audio classification and pattern recognition. Decision trees have recently gained much attention in related applications and the authors have applied them to wind noise assessment [3]. Consequently, the random decision forest [30], also known as a random forest, was adopted. The Matlab class "TreeBagger" is used to train the random forest [31].
Machine learning systems are often evaluated using k-fold cross-validation, which tests how well the trained system deals with cases that were not present in the training data; this approach is used in the present study. In addition, perceptual experiments were carried out to more rigorously validate the method (see Sec. 3).

Audio Features
Features were extracted from the distorted audio to be used as the input to the random decision forest. Features were extracted within frames of 1024 samples (23 ms) with 50% overlap. Clipping and DRC are known to cause sample values to be redistributed. This can be captured by the probability mass function (PMF), which is the discrete form of the probability density function. Fig. 2 shows four example PMFs for the same one second of audio, one with no clipping and the others with clipping applied. Hard clipping (K = 1) causes an increase in the probability that a sample will occur around a relative sample value of ±1. Amplitude transfer functions with a soft knee also show a peak at ±1, but with a smoother transition and a lower peak value. A bias causes a translation of the PMF in the direction of the DC offset. To compute the PMF, each audio frame was normalized to its maximum absolute sample value, and the histogram was then computed using 255 equally spaced sample levels from -1 to 1. The normalization in each window ensured that the PMF was represented with an optimal resolution for that window.
Maré [23] showed how the PMF could be used to identify distortions. To achieve generalization to audio not seen in training, we found that more features are necessary to represent a wide range of signal properties, including timbral and spectral features. These were calculated using the MIR toolbox [32] and are listed in Table 6. The mean of each feature was then computed over 1 second.
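The per-frame PMF computation described above can be sketched as follows (a minimal reconstruction; the function name is ours):

```python
def frame_pmf(frame, n_bins=255):
    """Probability mass function of sample values within one frame.

    The frame is normalized to its maximum absolute sample value so that
    the PMF spans the full [-1, 1] range, then counted into n_bins
    equally spaced bins and normalized to sum to 1.
    """
    peak = max(abs(s) for s in frame) or 1.0  # guard against silent frames
    counts = [0] * n_bins
    for s in frame:
        idx = min(int((s / peak + 1.0) / 2.0 * n_bins), n_bins - 1)
        counts[idx] += 1
    return [c / len(frame) for c in counts]
```

A hard-clipped frame concentrates probability mass in the end bins, which is the signature the classifier exploits.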

Feature Selection and Training
To identify which features should be presented to the random decision forest, a sequential forward feature selection was carried out using 2-fold cross validation. Random decision forests allow some integration of automatic feature selection within the learning process. This is particularly useful when handling empirical data with no explicit model or clue for heuristic feature selection.
The random decision forest is an ensemble learning method that uses bagging, whereby a number of classification decision trees are each trained on a bootstrap-sampled (with replacement) subset of the data, and at each node a randomized subset of features is selected and used for classification. Breiman [30] suggested that an optimal size of the feature subset is √m (rounded to the nearest integer), where m is the total number of features.
Using √m features for each split, greedy forward feature selection (FFS) [33] was carried out using a wrapper method, meaning that the output error of the trained classifier is used to gauge the quality of a candidate feature set. Two-fold cross-validation was carried out for every feature set, each time ensuring that the same source of audio did not appear in both the training and validation sets.
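The greedy wrapper loop can be sketched as below. Here score_fn is a stand-in (hypothetical name) for the cross-validated MCC of a forest trained on the candidate subset:

```python
def forward_feature_selection(features, score_fn):
    """Greedy wrapper FFS: repeatedly add the single feature that most
    improves score_fn (e.g., cross-validated MCC), stopping when no
    candidate improves on the best score so far."""
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining:
        cand = max(remaining, key=lambda f: score_fn(selected + [f]))
        score = score_fn(selected + [cand])
        if score <= best:
            break  # no further improvement
        selected.append(cand)
        remaining.remove(cand)
        best = score
    return selected, best
```

In the paper, multi-valued features such as the 255 PMF bins are added or removed as a single block, which this sketch accommodates by treating each block as one element of `features`.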
The performance was quantified using the Matthews Correlation Coefficient (MCC), where 1 represents optimal performance; the MCC is calculated from the confusion matrix [34]. The FFS was initialized by training a predictor using each feature separately. The best performing feature was the one that produced the highest MCC averaged over all folds. Having determined the first feature to be used, the second, third, fourth, etc., were then determined: the training was undertaken with every possible additional feature added to the best-performing set found so far. If the added feature increased the MCC, it was retained. This procedure was repeated until all the features under investigation were exhausted or there was no further improvement in performance. If a feature contained multiple values, such as the 255 values in the PMF, these were treated as a single feature, i.e., all 255 values were included or removed in one block. The random forest is a stochastic method and will yield different results in every training phase due to both the bootstrap sampling and the random selection of features at each node. By increasing the size of the forest, the variance between the outputs from the trees is decreased, so there is a trade-off between variance and speed of processing. As a rule of thumb, the number of trees in the forest needs to be sufficient that the ranking of the features no longer changes as the number of trees is increased [35]. To determine the optimal forest size, a significance test of the performance improvement was carried out between two forest sizes after feature selection. The feature selection procedure was repeated for a number of forest sizes, increasing the number of trees by a factor of 2 starting at 12 (multiples of 12 were a convenient choice because the parallel code was running on a 12-core machine).
McNemar's hypothesis test was used to determine significance [36]. The null hypothesis (that there is no difference between predictors) is rejected if χ² > χ²₁,₀.₀₅ = 3.841 (significance level p < 0.05) and the MCC of the larger forest is greater than that of the smaller one, where

    χ² = (|M_ab - M_ba| - 1)² / (M_ab + M_ba) ∼ χ²₁

M_ab is the number of misclassifications made by the smaller forest that were correctly classified by the larger forest; M_ba is the number of misclassifications made by the larger forest that were correctly classified by the smaller forest; and ∼χ²₁ expresses that the statistic has a chi-square distribution with 1 degree of freedom. Table 7 presents the results of the forest size investigation, showing no significant improvement in performance above a forest size of 96.
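The decision rule can be sketched as follows; the continuity-corrected form of the statistic is an assumption on our part, since the paper's exact formula is not legible here, and the function names are illustrative:

```python
def mcnemar_chi2(m_ab, m_ba):
    """Continuity-corrected McNemar statistic from the two discordant
    counts; compare against the chi-square(1) critical value 3.841."""
    n = m_ab + m_ba
    if n == 0:
        return 0.0
    return (abs(m_ab - m_ba) - 1.0) ** 2 / n

def larger_forest_is_better(m_ab, m_ba, mcc_small, mcc_large):
    # Reject the null hypothesis only if the difference is significant
    # at p < 0.05 AND the larger forest also achieves the higher MCC.
    return mcnemar_chi2(m_ab, m_ba) > 3.841 and mcc_large > mcc_small
```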
The feature selection algorithm produces a different permutation of features every time. Therefore, to select the best set of features, the FFS was run repeatedly and the features most frequently selected were used. The FFS was repeated until the rank order of the top N features stabilized (no change after two FFS repeats). On average, seven features were selected and stability occurred after 12 runs. The rank order and the frequency with which a feature was selected are shown in Table 6. The PMF being ranked joint top supports the work of Maré [23]. Alongside it was spectral flux, the mean Euclidean distance between the spectra of successive frames. Other important features were Spectral Kurtosis, Spectral Entropy, Spectral Roughness (the average of the dissonance between all possible pairs of peaks [37]), Spectral Skewness, and the Zero-crossing rate.
Much of the information contained in the spectral and timbral features is already available from the PMF. This indicates that in a lower computational power environment (e.g., a smart phone), where a compact algorithm may be required, the PMF might be sufficient. Table 8 shows a confusion matrix from a system averaged over 2 folds using the 7 chosen features and 96 trees. The MCC was 0.616. Fig. 3 illustrates the performance for different signal and distortion types. Aliasing had little effect on the performance of the algorithm, so non-aliasing and aliasing cases were pooled for each distortion type. Fig. 3 shows that the performance is generally similar for both soft and hard clipping, but there are small differences between source types, with the estimation being best for music and worst for speech. The relatively poor performance occurs when the degradation to quality is due to DRC alone. The confusion matrix for DRC-only cases in Table 9 shows 96% were rated Good or Excellent: DRC does not degrade the audio as badly as the other types of distortion. While there appears to be confusion between the two highest quality classes, a sample will very rarely be mislabeled more than two classes above or below its true class.

Aggregation Over Longer Samples
Human judgments of audio quality are usually made over periods longer than one second, so a method to aggregate the results over a longer time period is needed. Similar judgments of temporally varying phenomena have been studied in soundscapes research and VoIP speech quality. Dittrich and Oberfeld [38] showed primacy (first sound heard) and recency (last sound heard) effects for annoyance from broadband noises. Västfjäll showed that listeners consistently preferred in-flight soundscapes with a better ending [39]. The peak-end rule hypothesis states that the most recent and the most extreme affective events are the most salient for retrospective judgments. While in some studies this was found to explain the variance of the judgments [40], other researchers disagree [41]. Ariely and Carmon [42] suggested that this was due to the recent exposure to affective peaks moderating the judgments. Recent work by Steffens and Guastavino on soundscape pleasantness [41] suggested that the best predictors might be a combination of the average instantaneous rating and the trend over the same judgments (modeled by a linear regression). The rationale is that the linear regression models the expectation of how the soundscape will evolve in the future.
In summary, there is no agreement about exactly how best to model how humans aggregate sensory judgments over longer periods of time, and consequently this study simply averages the results from each one-second window over the whole sample.
Comparing a HASQI value formed from the whole 10-second sample with the average of the one-second HASQI values reveals a 95% confidence limit of ±0.16. Weighting the one-second HASQI values according to the rms over the one-second window reduces the error to ±0.13. Consequently, weighting by frame rms is adopted to give bHASQI_A, the aggregated single-ended HASQI estimate:

    bHASQI_A = ( Σ_{i=1..M} rms_i · ClassS_i ) / ( Σ_{i=1..M} rms_i )

where M is the total number of windows, ClassS_i is the single-ended estimate of the HASQI class over window i, and rms_i is the root mean square value over window i. Fig. 4 compares bHASQI_A with HASQI integrated over the whole 10-second clip. This dataset was computed using 10-fold cross-validation, and each of the 10 folds is overlaid in Fig. 4 (all types of audio and distortion). The Pearson correlation coefficient is 0.97 and 95% of the estimates are within ±0.19 of HASQI, with previous results indicating that much of this error is due to the aggregation. If bHASQI_A is quantized into five classes, using the specifications in Table 5, the MCC is 0.7; Table 10 displays the averaged confusion matrix for this result. Seventy-nine percent of HASQI classes are correctly identified by the single-ended method, and of those incorrectly identified, 95% are wrong by a single class. The Pearson correlation coefficient is likely inflated by the clusters of data near the origin and the top right corner of Fig. 4; the MCC, however, is a balanced measure of classifier performance and is immune to this inflation. Fig. 4 exhibits some quantization of the bHASQI_A results around 0, 0.25, 0.5, 0.75, and 1; this is due to all windows in a sample having the same estimated Class S.
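The rms-weighted aggregation can be sketched as below (an illustrative implementation; the per-window estimates are assumed to already be expressed as HASQI values on the 0-1 scale):

```python
def aggregate_hasqi(window_estimates, window_rms):
    """rms-weighted average of per-window single-ended HASQI estimates.

    window_estimates: one HASQI estimate per one-second window;
    window_rms: the rms level of each window, used as the weight so that
    louder windows dominate the overall quality judgment.
    """
    total = sum(window_rms)
    if total == 0.0:
        return 0.0  # silent clip: no meaningful quality estimate
    weighted = sum(h * r for h, r in zip(window_estimates, window_rms))
    return weighted / total
```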

SUBJECTIVE VALIDATION
For the single-ended method, HASQI was an intermediate tool to generate a large number of training and testing samples. How does this relate to perceived quality? Since HASQI has been extensively validated on speech, the focus of the subjective validations in this project has been music and soundscapes. Excerpts of music and soundscapes were distorted by varying amounts of hard clipping and then presented to subjects for quality ratings. The perceptual results were compared with the true HASQI value and the single-ended estimate, bHASQI_A.

Music
A small number of music samples, which had to represent the diversity of all music, were needed. As the primary effect of distortion is to change the timbre, it was decided to select the test samples based on music with contrasting timbre. First, a large number of music samples were gathered: 351 music extracts were taken from an exemplar set of music samples suggested by Rentfrow and Gosling [43]. For each of the 117 pieces for which high-quality recordings could be obtained, three 7-second excerpts representing key sections such as an intro, verse, and chorus were extracted. The three music samples used by Arehart et al. [44] to develop HASQI were also included in the test set.
Then a method was devised to extract contrasting timbre examples from the hundreds of excerpts. The samples were distorted by hard clipping, using a threshold set to give a HASQI value of 0.5 for each sample. Each stereo example was sampled at 44.1 kHz (all HASQI values were averaged over both channels). All samples, clean and distorted, were clustered according to their timbre using the method by Aucouturier and Pachet [45]. Two samples were drawn from each of the six clusters, selected as the two with the shortest Euclidean distance to the cluster centres. The three music samples used by Arehart et al. [44] were retained regardless of which cluster they had grouped within. The 14 pieces from which the test stimuli were taken are listed in Table 11.

Perceptual Test Design
A total of 30 participants (mean age: 23.7 years; SD: 4.7 years) completed the experiment. None reported any known hearing impairments. Each participant was presented with 140 7-second clips, consisting of 9 different thresholds of hard clipping distortion and 1 clean version for each of the 14 music pieces. All samples were presented in stereo at the same A-weighted sound pressure level, integrated over 7 seconds and both channels, over Sennheiser HD 650 headphones via a Focusrite Scarlett 2i4 audio interface. To ensure that the distortion applied to each music sample covered a wide range of quality degradations, nine thresholds for each clip were computed by setting target HASQI values between 0.1 and 1. A participant training session was held before the actual testing, with three pairs of samples not included in the test. Participants were reminded that they were judging overall quality, not any musical preference. Ratings were entered via a mouse using a continuous slider labelled "Bad" and "Excellent" at each endpoint with no other markers, based on the ITU-R BS.1284-1 recommendations adopted in the development of HASQI [29]. Participants were asked to make absolute quality judgments on individual samples with no reference. The use of relative judgments of quality against a reference sample was not adopted for the following reasons: 1) HASQI was also developed using absolute category ratings, and a direct comparison was important. 2) One of the research questions in [21], on which some of this data is based, was whether there is any link between the underlying quality of a sample and the degradation due to amplitude clipping. 3) A high priority was placed on maximizing the number of music pieces and soundscapes to increase the validity of the resulting algorithm performance analysis; the large number of samples made the use of an impairment scale time-prohibitive.
The slider's initial position was at the "Bad" end of the scale on each trial. Progression from one trial to the next was conditional on listening to the sample in full and providing a rating. There were no limits on the number of times each sample could be repeated. There was no time limit for completion of the test and participants were prompted to take a short break at the half-way stage if required. Presentation order of the samples was fully randomized. The test session typically lasted around 40 minutes and participants were financially reimbursed for their time.

Validations with Soundscapes
Twelve sound samples (field-recorded soundscapes) were selected from the freefield1010 database [46], a selection of ten-second audio clips uploaded to the freesound.org database and tagged as "field-recording." First, the 20 most popular tags were identified and all files with those tags were used. Then, the crest factors were computed; the crest factor is the ratio of the peak to the rms level. A signal with a low crest factor will exhibit fairly constant levels of clipping, while a signal with a high crest factor might have some highly distorted regions while other regions remain relatively clean. The four examples closest to each of the 10th, 50th, and 90th percentiles of the crest factor distribution were selected and are listed in Table 12. The perceptual test procedure was the same as that used for the music clips; 18 subjects participated in the test.
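The crest factor used for sample selection is straightforward to compute (a minimal sketch; the function name is ours):

```python
import math

def crest_factor(x):
    """Peak-to-rms ratio of a signal. Impulsive signals (e.g., thunder)
    score high; steady signals (e.g., constant machinery hum) score low."""
    peak = max(abs(s) for s in x)
    rms = math.sqrt(sum(s * s for s in x) / len(x))
    return peak / rms
```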

Results
For the music clips, Cox et al. [21] found that the MOS (Mean Opinion Score) of even the clean samples varied considerably because of different styles of audio production for the originals. As the interest is in distortions that degrade the quality, the MOS scores were normalized to the average MOS calculated from all subjects for the clean, undistorted signal of the particular audio file. The standard deviation of the opinion scores for each clip and distortion condition provides a gauge of the inter-subject variability of opinion; the average standard deviation over all conditions was 0.17. Fig. 5 shows the relationship between double-ended HASQI (x-axis) and the normalized MOS (y-axis); the Pearson correlation coefficient is 0.916. The results seem more promising than those reported by Arehart et al. [44], whose correlation between HASQI and the MOS for three pieces of music was 0.838. The better correlation found in our experiments might be attributed to the fact that only clipping and DRC were considered. Ninety-five percent of the HASQI estimates are within ±0.24 of the normalized MOS.
A few samples showed relatively large prediction errors. For example, "Packin' Truck" has HASQI overestimating the MOS by up to 40%. This track was recorded in 1935 and the recording quality is poor with noise and distortion already present. There appears to be some leniency in quality ratings of degraded audio when the expected technical quality of the original audio is already low.
For the soundscape samples there was an increase in the variability of the opinion scores compared with music: the average standard deviation of the opinion scores was 0.29, as can be seen in Fig. 6. This increase may be due to the smaller number of listeners (18 rather than 30). Despite the greater variability of opinion, HASQI and the normalized MOS still correlate well, with a correlation coefficient of 0.85 and 95% of HASQI estimates within ±0.29 of the normalized MOS.
For the soundscapes, HASQI over-estimated the level of degradation for two clips in particular; both contained mainly high-frequency bird and insect sounds. There were also cases where HASQI under-estimated the degradation, such as for thunder, rain, and machinery sounds. These clips differ from the others in that they do not contain harmonic sounds. The lower performance on soundscapes is probably because HASQI was primarily aimed at speech quality during its development and naturally performs better on such material.
Next, the proposed single-ended algorithm was trained using every sample from the audio library described in Sec. 1.1.1, excluding those used in the perceptual studies. Figs. 7 and 8 show the relationship between the normalized MOS and the single-ended estimates, ĤASQI_A, for music and soundscapes. For music, the correlation coefficient between ĤASQI_A and the normalized MOS is 0.861, and 95% of the single-ended estimates are within ±0.30 of the MOS. For the soundscapes, similar results are found: the correlation coefficient between ĤASQI_A and the normalized MOS is 0.802, and 95% of the estimates are within ±0.33 of the MOS.
As previously mentioned, the average standard deviation of the opinion scores for each clip gives an estimate of the intersubject variability; this was 0.17 for music and 0.29 for soundscapes. The intersubject variability can be compared with the error in the single-ended estimation of quality via the standard deviation of the error in the MOS estimates from ĤASQI_A, which was 0.17 for both music and soundscapes. On average, therefore, the error in the single-ended estimate of quality for a single clip is of the same order as, or lower than, the intersubject variability of opinion.
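The comparison above amounts to checking the estimator's error against the spread of human opinion for the same clip. A small sketch with invented numbers (the per-subject scores and the single-ended estimate are hypothetical):

```python
import numpy as np

# Hypothetical per-subject opinion scores for one clip/condition,
# and a hypothetical single-ended estimate for the same clip.
opinions = np.array([0.55, 0.70, 0.40, 0.62, 0.48])
estimate = 0.58  # single-ended quality estimate (ĤASQI_A)

intersubject_sd = opinions.std(ddof=1)           # spread of human opinion
estimation_error = estimate - opinions.mean()    # error relative to the MOS

# The estimator is "as good as another listener" when its error is
# no larger than the disagreement between listeners.
print(abs(estimation_error) <= intersubject_sd)  # -> True here
```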

CONCLUSION
A single-ended method to quantify perceived audio quality in the presence of non-linear distortions has been developed and presented in this paper. The single-ended method estimates HASQI (Hearing Aid Sound Quality Index) using a machine-learning model (an ensemble of decision trees) that learns from examples and generalizes. Validation on a set of music and soundscape samples not seen during training yields single-ended estimates within ±0.19 of HASQI, on a quality range between 0.0 and 1.0. HASQI has also been shown to predict quality degradations for processes other than non-linear distortions, including additive noise, linear filtering, and spectral changes. By including these other causes of quality degradation, the current model for non-linear distortion assessment might be expanded, although additional features and validation would be required.
A series of perceptual measurements on music and soundscapes was undertaken. The subjective testing provided further evidence that HASQI can be used to quantify perceived non-linear distortion for normal-hearing listeners. The new single-ended method was used to estimate quality, and the estimates were compared to the Mean Opinion Scores (MOS) from the subjective tests. The standard deviation of the error in the single-ended MOS estimates was 0.17. This is of a similar order to the variability between human subjects: the average standard deviation of the opinion scores from the perceptual tests was 0.17 for music and 0.29 for soundscapes.
The code to estimate ĤASQI is freely available for download at [47] for non-commercial purposes under an Attribution-NonCommercial 4.0 International (CC BY-NC) license. The databases used to develop the algorithm are not available due to copyright issues with the audio samples.

ACKNOWLEDGMENTS
This project was funded by the Engineering and Physical Sciences Research Council, UK (EPSRC EP/J013013/1) and carried out in collaboration with BBC R&D and the British Library Sound Archive. The perceptual tests were carried out by Stephen David Groves-Kirkby. This work is published under a CC-BY license (http://creativecommons.org/licenses/by/3.0/).