Localization Experiments with Reporting by Head Orientation : Statistical Framework and Case Study ∗

This paper is concerned with sound localization experiments in which subjects report the position of an active sound source by turning toward it. A statistical framework for the analysis of the data from this type of experiment is presented together with a case study from a largescale listening experiment. The statistical framework is based on a model that is robust to the presence of front/back confusions and random errors. Closed-form natural estimators are derived, and one-sample and two-sample statistical tests are presented. The framework is used to analyze the data of an auralized experiment undertaken by nearly nine hundred subjects. Results show that responses had a rightward bias and that speech was harder to localize than percussion sounds, which are results consistent with the literature. Results also show that it was harder to localize sound in a simulated room with high ceiling, despite having a higher direct-to-reverberant ratio than other simulated rooms.


INTRODUCTION
The phenomena governing human sound localization have been the subject of intense study since the turn of the twentieth century [1].A large variety of characteristics have been studied, ranging from the just-noticeable-differences in localization accuracy, adaptation, and learning effects, to the influence of the source's spectral content and room reflections [1][2][3].Recent experiments also studied the contribution of high frequency content in the presence of a noise masker [4], the degradation of localization accuracy with outer ears occlusions [5] and bilateral hearing aids [6], and the localization of multiple coherent sound sources [7].
Subjects are typically asked to indicate the direction of the perceived sound source by (a) reporting the closest loud-speaker, fixed acoustic pointer or label [7,4,2]; (b) steering a movable pointer [8]; (c) reporting the direction on a graphical user interface (GUI) or on paper [6]; or (d) turning their face toward the perceived sound source after the stimulus has been presented [9,5].This paper is concerned with experiments where subjects report the position of the perceived sound source by turning toward it while the stimulus is being presented.This methodology makes it possible to study the dynamics of how subjects rotate themselves to find a sound source, to study the mechanisms that enable them to resolve front/back confusions, and to study the reported direction of the perceived sound source.This paper focuses on the latter of the three.
Metrics of interest for the the perceived sound source include the mean direction and concentration of responses and how many subjects experience a front/back confusion or make a random error.Since the subjects turn toward an active sound source and give their answer once they believe the sound source is in front of them, the methodology considered in this paper is limited to the study of localization in frontal directions.This restriction allows subjects to fine tune their initial decisions and is particularly useful in cases where the stimuli are hard to localize (e.g., in echolocation tasks or when the auditory system is interfered with) and in experiments involving untrained subjects.The task of turning toward a sound source is, in fact, easy to understand and is a natural and intuitive reaction to sound.
The main contribution of this paper is a statistical framework designed to analyze the data obtained with this experimental methodology.The proposed statistical framework is robust to the presence of front/back confusions and random errors.The framework is then used to analyze the data of a large-scale auralized experiment.The objective of this experiment was to study localization performance in the horizontal plane in an informal setting and with little training, which are conditions of interest because they are similar to those typically encountered in consumer applications of binaural audio.An earlier version of the experiment description with partial results was presented at the 60th AES International Conference [10].
This paper is organized as follows.Sec. 1 outlines the experimental context considered here.Sec. 2 reviews concepts of circular statistics that form the basis of the proposed statistical framework, which is presented in Sec. 3. Sec. 4 describes in detail the design of the large-scale auralized experiment and presents an analysis of the data based on the proposed statistical framework.Sec. 5 concludes the paper.

EXPERIMENTAL CONTEXT
The experimental context considered in this paper has the following characteristics.The subject is presented with a sound stimulus and is asked to indicate the direction of the perceived sound source by turning themselves toward it.The sound stimulus stays active throughout the test, including while the subject is turning to identify the source.The sound stimulus may consist of a single sound source in free field or more complex acoustical situations, e.g., a sound source in a reverberant room or multiple coherent sound sources.
The task of the subject is to rotate their head or body until the sound source is perceived to be in front of them.Once confident about the direction of the perceived sound source, the subject confirms the choice.The perceived sound source stays in a fixed position in space.
The experiment could be carried out in an actual physical setting, e.g., with a loudspeaker in a reverberant room.Alternatively, the desired physical setting can be simulated and the resulting binaural stimulus played back through headphones.In this case, the binaural stimulus has to be smoothly updated in real-time as the subject turns, so as to mimic the change that the subject would experience in an actual physical setting with an external stationary sound source.
In order to isolate sound perception as the only factor influencing the decision, no visual cue about the position of the sound source is available.Furthermore, the initial look direction of the subject with respect to the sound source is random and uniformly distributed.
Fig. 1 shows the apparatus used in the large-scale experiment described in detail later in Sec. 4. In this experiment subjects wore headphones and stood on a rotating platform.They could freely turn themselves by applying force on a stationary wheel in the center of the platform.A gyroscope fixed to the platform measures the platform rotation, and this information is used to update the binaural stimulus in real time.Here, the subject is trying to localize a stationary sound source in the stationary virtual room by rotating themselves on the platform.
Another example of the methodology described in this section is the echolocation experiment of the type considered by Pelegrin-Garcia et al. in [11] and subsequent works by the same authors.In this class of experiments, subjects wear head-tracked headphones and a lavalier microphone.Self-generated oral sounds are picked up by the microphone and are processed by a real-time audio processor that simulates the presence of a stationary virtual wall somewhere around the subject.Subjects are asked to turn toward the virtual wall.Here, the perceived sound source sought by the subjects is the acoustic echo of their own voice.
User responses can be divided into three classes.The first class consists of responses in which the subject correctly identified the sound source within a certain angular tolerance.The second class consists of responses where the subject experienced a front/back confusion.In this case the responses are concentrated around the opposite direction.This is due to the fact that when the subject turns toward the perceived sound source, the cone of confusion [1] collapses onto the median sagittal plane.The third class consists of erroneous responses; these include cases where, for instance, the subject could not identify the sound source, did not understand the task, or ended the task early.

ELEMENTS OF CIRCULAR STATISTICS
The data analysis of localization experiments typically involves aperiodic statistical moments, e.g., mean, variance and mean squared errors, and statistical tests that assume normally distributed data, e.g., t-test and ANOVA [3].While the normal distribution is an acceptable approximation in some cases, angular data is periodic in nature, thus circular statistical moments and circular distributions should be used instead.This section briefly reviews elements of circular statistics.Thorough treatments of this topic can be found in Mardia and Jupp [12] and in Fisher [13].
Let f (ϑ) be the probability density function (PDF) of the continous circular random variable , with f (ϑ) ≥ 0, f (ϑ + 2π) = f (ϑ) and 2π 0 f (ϑ)dϑ = 1.The l-th trigonometric moment of is defined as which can be written in polar coordinates as γ l = ρ l e iμ l , with i = √ −1.The parameter ρ 1 is denoted as mean resultant length, and μ 1 as the mean direction.Due to the importance of these two statistics, ρ 1 and μ 1 are usually written simply as ρ and μ, respectively.In the context of this paper μ indicates the direction of the perceived sound source.The cosine and sine moments are defined as the real and imaginary parts of γ l : The l-th central trigonometric moment of is defined as the l-th trigonometric moment of the random variable − μ and are denoted here by γ l : ( The corresponding central cosine and sine moments are respectively. The central trigonometric moment can be expressed as a function of the (non-central) trigonometric moment as ( Therefore γ l = ρ l e iμ l with μ l = μ l − lμ and ρ l = ρ l .Consider now N sample observations of , denoted in the following as θ = [θ 1 , . .., θ N ] T .In the context of this paper the sample observations θ are the angles reported by the subjects, and N is the number of experiments for a certain condition.The sample equivalents of α l and β l are given by From the sample moments a l and b l , one can derive the sample equivalents of ρ and μ as The von Mises (vM) distribution is among the most extensively studied circular distributions.The PDF of the vM distribution is given by where I 0 (κ) is the modified Bessel function of the first kind of order zero.The parameter κ is the concentration parameter.For κ = 0, the vM distribution degenerates to a uniform distribution.On the other hand, for large κ the vM distribution tends to a normal distribution with variance 1/κ.
Closed-form maximum likelihood (ML) estimators of the parameters of the vM distribution are available in the literature, together with one-sample and two-sample statistical tests.
The vM distribution is well suited to model the angular dispersion around the perceived angle in cases where the subject correctly identified the sound source.However, as will be shown later in this paper, in the presence of front/back confusions and random errors the vM distribution and the associated statistical tests fail.In order to model front/back confusions, a suitable distribution is the so-called 3-parameter von Mises mixture (vMM3), which is a mixture of two von Mises distributions having the same concentration parameter κ but mean directions that are π apart.This distribution has a PDF given by where p ∈ [0, 1] is the convex combination parameter.The shape of the vM and vMM3 distributions can be seen, for example, in Fig. 9. Closed-form natural estimators (i.e., method of moments-based) exist for the vMM3 distribution [12].One-sample tests using numerical ML optimization were studied by Grimshaw et al. [14].

VON MISES AND UNIFORM MIXTURE (vMUM) MODEL
As will be shown later in this paper, the vMM3 model and the associated one-sample and two-sample statistical tests perform poorly in the presence of uniformly-distributed random errors.This motivates the von Mises and uniform mixture (vMUM) statistical model, which is presented in this section.

Model Definition
Since the initial look direction of the subject is drawn from a uniform distribution, it is reasonable to model the erroneous decisions as uniformly distributed.Consider then the following statistical model: with p 1 , p 2 , p 3 ∈ [0, 1] and p 1 + p 2 + p 3 = 1.This model will be referred to as vMUM in the following.Here, the values p 1 , p 2 , p 3 can be seen as simple parameters of the model.A different interpretation of these values is to consider them as the probability mass function (PMF) of an unobserved latent variable describing whether the subject experienced a frontal image, a front/back confusion or made a random error.With this interpretation, the terms e κ cos(ϑ−μ) 2πI 0 (κ) , e −κ cos(ϑ−μ) 2πI 0 (κ) and 1 2π take the meaning of the PDFs of the incomplete data while f (ϑ; μ, κ, p 1 , p 2 , p 3 ) takes the meaning of PDF of the complete data.
The central moments of can be written as where δ l is the Kronecker delta function.Appendix A.1 provides a proof of this result.

Method of Moments Estimator (vMUM-MME)
Similarly to the derivation of the method of moments estimator (MME) for the vMM3 distribution [12], consider the random variable associated with the double-wrapped angle, i.e., = 2 .The PDF of a double-wrapped variable can be written as [12] , and thus, with simple trigonometric and algebraic manipulations: where the dependency on the parameters is omitted for clarity.The advantage of considering the random variable instead of the original random variable is that the parameters p 1 and p 2 do not appear separately but only as p 1 + p 2 .This enables all the parameters to be estimated one at a time, as explained in the following.
The central moments of can be calculated as β w l = 0, where p w = p 1 + p 2 .Appendix A.2 provides a proof of this result.Since β w l = 0, then γ w l = α w l + iβ w l = α w l .Using Eq. ( 3), the l-th trigonometric moment can therefore be written as Since p w I 2l (κ) Applying the method of moments to the phase of the first trigonometric moment, γ w 1 , gives φ = 2 μ, where φ is the mean sample direction of , and thus The first and second moments of are given by respectively.Assuming that I 2 (κ) I 0 (κ) = 0, or, in other words, that κ = 0, one can isolate p w from α w 1 and replace it in the expression of α w 2 , which gives By replacing the moments with their sampled equivalents, an estimate of the concentration parameter κ can be taken as the solution of The value of κ is found using non-linear optimization and is called MME of κ.Notice that a similar step is necessary to obtain the parameter κ of the vMM3 model.The convex parameter p w can now be estimated using the expression of the first central moment: Notice that isolating pw from the above equation may produce a solution outside the closed interval [0, 1].This is the same problem encountered in the estimation of the vMM3 distribution parameters [12], and, more in general, in method of moments estimates.When this happens, one approach is to find the value of p w ∈ [0, 1] that best satisfies Eq. ( 17), e.g., in the least square sense.Another, simpler approach is to associate the value pw = 0 to all negative estimates and the value pw = 1 to all estimates larger than one.This approach is the one used in the simulations presented in this paper for both the vMM3 and the vMUM estimates.
It only remains to estimate one of the two parameters p 1 and p 2 , with the second being determined via the expression p w = p 1 + p 2 .Using Eq. ( 10), the first central moment of the unwrapped random variable can be written as By applying again the method of moments, the parameter p 1 can be estimated as the solution of In summary, an estimate of the model parameters can be obtained as follows: which will be referred to as the vMUM method of moments estimator (vMUM-MME) below.
In practice, this procedure yields large values of κ when the ratio of the sample moments a w 2 /a w 1 is close to unity.In order to alleviate this issue, after obtaining a first estimate of κ and p w , one can refine the estimate of κ by solving Eq. ( 17) for κ (instead of for pw as done in the procedure described above).It was found empirically that small further improvements can be obtained by iteratively solving Eq. ( 17) for κ and then for pw .In the results presented in this paper, two such additional iterations are carried out.

vMUM Maximum Likelihood Estimator (vMUM-MLE)
The maximum likelihood estimator (MLE) of the vMUM model parameters is obtained as subject to p 1 , p 2 , p 3 ∈ [0, 1] and p 1 + p 2 + p 3 = 1.Here, the PDF f (θ; μ, κ, p 1 , p 2 , p 3 ) is seen as a function of the parameters for a fixed set of sample observations, θ, and represents the likelihood function.Since a closed-form solution of this problem is not known, it is necessary to resort to nonlinear optimization.As noted in Sec.3.3, it is important to initialize the algorithm carefully so as to avoid convergence to a local maximum.Unless stated otherwise, the initialization used in the simulations of this paper will be the vMUM-MME estimate.

Performance Analysis of Parameter Estimation
The performance of the proposed estimator is assessed via Monte Carlo simulations.A total of 10,000 sets of N samples are generated using the vMUM statistical model.The random samples are generated using the algorithm proposed by Best and Fisher [15].For each set, the parameters are drawn from uniform distributions with μ ∼ U(0, 2π), κ ∼ U(0, 100), p 2 ∼ U(0, 0.3) and p 3 ∼ U(0, 0.3), where the symbol ∼ stands for "distributed as."The parameter p 1 is then obtained as The performance is compared to the standard non-Bayesian MME of the vMM3 parameters [12], which is termed here vMM3 method of moments estimator (vMM3-MME).
The simulations are run using Matlab R2015b with the default random seeding.The trust-region-dogleg algorithm (Matlab command fsolve) is used to find κ in Eq. ( 16) and in a similar step required by the vMM3-MME [12].The sequential quadratic programming (SQP) algorithm (Matlab command fmincon) is used to solve the constrained optimization problem involved in the ML estimations.
The mean squared error (MSE) is used here as metric to assess the estimators performance.MSE values above the 95-percentile are considered as outliers and are removed from the data.Unless stated otherwise, the sample size is N = 20.

Performance of vMUM-MLE for Different Starting Points
In order to assess the sensitivity of the vMUM-MLE estimate to its initialization, Monte Carlo simulations with different starting points were run.The first case is the vMUM-MME estimate.The second case is a random starting point.Here, μ and κ are drawn from uniform distributions with   μ ∼ U(0, 2π) and κ ∼ U(0, 100).Notice that this choice of κ gives the random estimate a slight advantage because the range corresponds to the actual range used to generate the data.The parameters p 1 , p 2 , and p 3 are all drawn from a uniform distribution between 0 and 1 and subsequently normalized such that p 1 + p 2 + p 3 = 1.The third case is obtained from a grid-search using 10 points across each of the five dimensions in the parameter space (it is assumed that κ ∈ [0, 100], again giving it a slight advantage), which together amount to a total of 10 4 points.
Fig. 2 shows that the random starting point results in a very poor performance, indicating that the likelihood function has multiple local maxima.The full-search starting point and the vMUM-MME starting point perform equally well.Given that the latter also requires significantly fewer computations, the vMUM-MME estimate is shown to be the most suitable starting point for the vMUM-MLE optimization problem and is used throughout the rest of this paper.

Performance as a Function of p 3
Fig. 3 shows the mean squared error of the different estimators as a function of p 3 .It can be observed that all the estimators perform equally well for values of p 3 close to zero.However, as p 3 increases, the vMUM-based estimators significantly outperform the vMM3-based one.Hence, the vMUM-based estimators can be seen to be more robust to the presence of uniformly-distributed errors.       of the concentration parameter κ, where the vMUM-MME outperforms vMUM-MLE for sample sizes smaller than around N = 25.This may seem surprising due to the fact that the MLE has a higher likelihood function than the MME.This however does not imply a lower MSE, and, in fact, some ML estimators are known to have poorer MSE than method of moments (MM) estimators in cases with small sample sizes.

Single-Sample Test of the Mean Direction
Consider the case where one wishes to test whether the data is drawn from a distribution with a given mean angle μ 0 .For instance, a hypothesis tested later in the paper is whether the data has a zero directional mean, i.e., μ 0 = 0. Assume that the a priori probability on whether this is true or not is unknown, as is the case in typical listening experiments.This amounts to the following non-Bayesian hypothesis test: with null hypothesis H 0 and alternative hypothesis H 1 .Assume also that the model parameters are unknown.Hence, this is a non-Bayesian test of composite hypotheses [16].
Notice that whenever front/back confusions are present, the statistical model is bi-modal and π-symmetric, and therefore testing for μ = μ 0 is the same as testing for μ = μ 0 + π.
In this paper P d = Pr( Ĥ1 ; H 1 ) denotes the probability of rejecting the null hypothesis given that the alternative hypothesis is true, and P f a = Pr( Ĥ1 ; H 0 ) denotes the probability of rejecting the null hypothesis given that the null hypothesis is true.
A typical approach in this context is to seek the best P d for a given P fa .Toward this end, a commonly used test is the generalized likelihood ratio test (GLRT), which can be shown to have certain optimality properties [16]: where μ, κ, p1 , p2 , p3 are the ML estimates under the hypothesis H 1 and are termed unrestricted MLE, κ, p1 , p2 , p3 are the ML estimates under the hypothesis H 0 and are termed restricted MLE, and λ is a desired threshold.The restricted MLE is calculated as in Sec.3.2.2 but with the constraint μ = μ 0 .The unrestricted MLE is also calculated as in Sec.3.2.2, using the restricted MLE as starting point, which guarantees that R(θ) is larger than one.Using the restricted MLE as starting point, however, sometimes causes the optimization algorithm to become stuck in a local maximum near μ = μ 0 .To overcome this issue, the unrestricted MLE is calculated again using the vMUM-MLE as starting point.Between the two so-obtained unrestricted MLEs, the one with higher likelihood value is chosen.
Regarding the choice of the threshold ξ, consider the transformation L(θ) = 2ln R(θ) of the likelihood ratio: where ξ = 2ln λ.The above ratio is typically termed loglikelihood ratio (LLR).Consider the PDFs of the random variable L( ) for observations obtained under the two different hypotheses, f L (l; H 1 ) and f L (l; H 0 ).The probabilities P d and P fa can be written as P d = ∞ ξ f L (l; H 1 )dl and P f a = ∞ ξ f L (l; H 0 )dl, respectively.The threshold ξ is chosen so as to obtain a P fa that is equal to a desired value, typically P fa = 0.05.This requires knowledge of f L (l; H 0 ).A powerful result in detection theory [16] states that for N → ∞ the log-likelihood ratio under H 0 is chi-squared distributed, i.e., f L (l; H 0 ) ∼ χ 2 r , with r the number of degrees of freedom.For the test Eq.( 21), r = 1.
In order to assess whether χ 2 1 is a reasonable approximation for finite N, Fig. 5 shows the PDFs of f L (l; H 0 ) generated using Monte Carlo simulations (10000 tests) with the same setup of Sec.3.3.It may be observed that the χ 2 1 asymptotic approximation is already reasonably accurate at N = 20.Thus, if a P fa = 0.05 is sought, the threshold can be simply chosen as ξ = f χ 2 1 (1 − 0.05) = 3.8415.With smaller sample sets, however, it is advisable to choose the threshold in some other way, e.g., using Monte Carlo simulations directly.For instance, if one uses the threshold ξ = 3.8415 for N = 5, then the actual P fa is not P fa = 0.05 but rather P fa = 0.13.If one wishes to achieve P fa = 0.05 with N = 5, then the threshold ξ ≈ 5.5 should be used instead.

Performance of Single-Sample Test of the Mean Direction
This section analyzes the performance of various singlesample tests using Monte Carlo simulations.The setup of the simulations is the same used in Sec.3.3 and the number of samples is N = 20.
The proposed vMUM test is compared to a vMM3 test, a vM test, and the standard t-test.The vMM3 test is that described by Grimshaw et al. [14].Here, the algorithm's starting point (which was not specified in [14]), is taken as the vMM3-MME.The vM test is the standard likelihood ratio test with unknown concentration parameter (see [12], page 122).Notice that all these tests assume the χ 2 1 asymptotic approximation.The threshold ξ is chosen such that P fa = 0.05 in all cases.
The comparison also includes the standard t-test [16].The t-test assumes that the data is normally distributed and, more importantly, aperiodic.The interval chosen to represent the angles is thus crucial.Indeed, if one uses the angles representation in the [0, 2π) interval, then the sample mean, N n=1 θ n , will not be zero even when μ = 0.In this section the interval is taken symmetrically around μ 0 .Notice, however, that this device does not solve the issue of treating a periodic random variable as aperiodic.A typical countermeasure used in the literature is to select errors close to the hypothesized direction, which are referred to as genuine errors [9].In this section genuine errors are taken as those within ±π/2 of the hypothesized angle μ 0 , and the associated test is referred to as selected t-test.Without loss of generality, all simulations use μ 0 = 0.

Receiver Operating Characteristics (ROC)
Fig. 6 shows the ROC curves for different tests.Here, the vMUM test has a higher P d than all other tests for all P fa .In particular, for P fa = 0.05, the test power is P d = 0.93 for vMUM and P d = 0.89 for vMM3 when averaging across all the Monte Carlo simulations.In some cases the difference in performance is much larger, especially when the mean angle μ is close to μ 0 .For instance, in simulations where |μ − μ 0 | = π/50 (i.e., 3.6 • ), one has P d = 0.40 for vMUM and P d = 0.17 for vMM3, while for |μ − μ 0 | = π/15 (i.e., 12 • ), one has P d = 0.90 for vMUM and P d = 0.64 for vMM3.The standard vM test has a relatively poor performance (P d = 0.82 for P fa = 0.05).For small |μ − μ 0 | the performance is particularly poor, with P d = 0.11 for |μ − μ 0 | = π/15 (i.e., 12 • ).The t-test has a poor performance.The selected t-test is, on average, even worse.However, if one only considers cases where |μ − μ 0 | < π/4, then the test performs on a par with the vM test.In other words, if it can be reasonably assumed that the mean direction is close to the hypothesized direction, then the selected t-test performs on a par with the vM test.

Performance as a Function of p 2 and p 3
Fig. 7 shows the performance of the vM, vMM3, and vMUM tests as a function of p 2 and p 3 .All tests have essentially the same power when p 2 = 0 and p 3 = 0.As p 2 increases, the power of the vM test drops considerably while both the vMM3 and vMUM tests perform well.As p 3 increases, the power of the vMM3 test starts dropping while the power of the vMUM test remains about the same.In conclusion, the vMUM test performs on a par or better than the other tests and should be preferred whenever front/back confusions and/or random errors may be present in the data.

Two-Sample Tests
Given two data samples, θ X of length N X and θ Y of length N Y , consider the following tests: Here, the a priori probabilities of H 0 and H 1 are unknown.
The distribution parameters are also unknown and are not necessarily the same for the two distributions.
The GLRT associated to these tests is where ν r is the parameter being tested (either μ or κ), with r ∈ {X, Y}, while a r contains the remaining parameters (e.g., when testing the mean direction, ν r = μ r and a r = [κ r , p 1r , p 2r , p 3r ]).
The restricted MLE at the denominator is found numerically.As a starting point, the value of ν is taken as the vMUM-MLE estimate of the combined data set θ = θ T X , θ T Y T , while the vectors containing the remaining parameters, ãr , are taken as the restricted vMUM-MLE estimates of the two data samples separately.The maximization of the unrestricted MLE at the numerator is separable.Thus the optimal parameters (ν r , âr ) are the unrestricted vMUM-MLE estimates of each distribution, which can be calculated using the method in Sec.3.2.2.Similarly to the single-sample test, two starting points are used-the parameters of the restricted MLE and the vMUM-MLE estimates.
The one with the higher likelihood value is then chosen.

Performance of Two-Sample Tests
For space reasons, only the performance of the twosample test of mean directions is analyzed here.The proposed vMUM test is compared to the two-sample vM test with the same concentration parameter1 , which is, to the best of the authors' knowledge, the only comparable test available in the literature.For this reason, the data was drawn from two distributions with the same concentration parameter, κ ∼ U(0, 100), and probabilities p 2 ∼ U(0, 0.3), p 3 ∼ U(0, 0.3), and p 1 = 1 − p 2 − p 3 .In order to run a fair comparison, the vMUM test was amended so as to take into account that a X = a Y , which is a straightforward modification.The mean directions were set as μ X = 0 and μ Y ∼ U(0, π), without loss of generality.The sample size was set to N = 20.The results of this comparison show that for P fa = 0.05, the power of the vM test is P d = 0.76, while the power of the vMUM test is P d = 0.90.

CASE STUDY
This section describes a large-scale auralized experiment that was held in London during the 2015 Summer Science Exhibition of the Royal Society.The objective of the experiment was to study the localization performance of a large number of untrained subjects in an informal setting and with little training.Furthermore, as explained in detail in the following, the study aimed at testing several hypothesis: (a) whether two different head related transfer function (HRTF) datasets led to a significant difference in localization performance, (b) whether two different sound samples led to a significant difference in localization performance, and (c) whether the localization performance was different for a free-field simulation and reverberant rooms with different ceiling heights.The latter objective was inspired by a study by Hartmann [2] where it was shown that a room with high ceiling resulted in poorer localization, despite having a higher direct-to-reverberant ratio (DRR) than rooms with lower ceiling.Hartmann used large rooms with long reverberation times.The objective here was to independently confirm this hypothesis in an auralized experiment and with typically-sized rooms.

Participants and Data Monitoring
A total of 893 subjects participated in the experiment.No personal information of individual subjects was collected.Their age varied greatly and no gender bias was observed.
The data associated with 40 of the subjects was removed because they declared that (a) they were deaf in one ear, or (b) they did not understand the task, or (c) they made a mistake in using the interface, or (d) because the subject was only playing with the apparatus rather than executing the experiment.An additional 27 tests were excluded because the equipment was incorrectly initialized.Another 22 tests were removed because the response was given in less than one second, which indicated that the subject touched the interface without engaging in the test or by mistake.In case a subject performed the task under identical conditions more than once, the additional data points were also excluded.The above data selection reduced the number of subjects from 893 to 844, and the total number of tests reduced from 1979 to 1655.

Apparatus and Procedure
The apparatus consisted of a circular rotating platform with a diameter of about 70 cm and a height of 20 cm.The subjects stood on the rotating platform and they could freely turn themselves by applying force on a stationary wheel.The wheel was positioned in the center of the platform at a height of about 94 cm.Subjects wore a pair of Bang & Olufsen BeoPlay H6 headphones connected to an iPad Air that was mounted on a pole in front of them at eye level.
The iPad's motion sensor was used to measure the rotation of the platform.In order to assess the accuracy of the motion sensor, the iPad was turned 10 times around itself and back, which gave an average deviation of about 2 • from the expected 180 • .Leaving the iPad stationary on a stable horizontal surface gave an average maximum drift of 0.67 • over 10 repetitions.Please notice that the iPad's motion sensor was used as the reference for both the audio rendering and for recording the subjects' angular response, and therefore a drift of the motion sensor will have the same effect on both.
The subjects controlled the experiment using a simple custom-made GUI displayed on the iPad Air (see [10] for screenshots of the GUI).They could choose between two conditions-an anechoic condition and a reverberant condition the details of which are described in the next section 2 .They could run the conditions in any desired order and any number of times.
The subjects were asked to remain still and to keep looking at the iPad which displayed a real-time animation of the rotation of the platform to hold their attention.Once the audio started, their task was to rotate the platform until the sound source appeared to be in front of them.Subjects were told that they could take as long as needed to make a decision.The audio sample was looped until they recorded their decision.Subjects recorded their decision by tapping on a button stating "Touch here when the sound source appears to be in front of you" on the GUI.The GUI would then show their performance in terms of angular error.
It is worth observing here that the apparatus and measurement system is of such a specific nature that one should be careful in making conclusions in absolute terms.The data analysis will thus focus on the relative differences between conditions, for which the impact of the apparatus and measurement system can be factored out.

Sound Stimuli
Two anechoic sound samples from the "Music for Archimedes" CD were used.One was track 4, a female speech sample, and the other was track 26, a sample of an African percussion instrument.The two samples were cut at 28 seconds and 25 seconds, respectively, in order to reduce memory spooling.The levels were manually adjusted until the two sound samples had the same perceived loudness, which resulted in the speech sample being reduced by 3 dB with respect to the original level.
Two HRTF datasets were used-the MIT KEMAR mannequin [17] and subject number 58 of the CIPIC database [18], both having a 5 • horizontal resolution.For every subject one of the two HRTF datasets and one of the two anechoic samples was chosen at random.
The anechoic signal was convolved with HRTFs using time-domain filters running in real-time on the iPad.When the platform (and thus the iPad) rotated, the coefficients of the filters were updated accordingly in real-time.In other words, the iPad was used as a head tracker in this experiment.As mentioned earlier, subjects were asked to remain as still as possible and to keep looking at the iPad, but their head and body were not physically restrained.In updating the HRTF filters, no interpolation was used.No audible artifacts were reported, owing to the slow rotation speed allowed by the rotating platform.
The subjects could choose between two conditions-an anechoic condition, where the sound source was placed in free field at the same height as the listener and at a distance of 1.4 m, and a reverberant condition, where the room acoustic response was simulated in real-time using a scattering delay network (SDN) [19].SDN was chosen to generate the reverberation because of its ability to reproduce faithfully the important physical (e.g., early reflections, reverberation time) and perceptual features (e.g., normalized echo density) while running in real-time.
Table 1 lists the three room setups that were used and in each case shows the value of the DRR and the reverberation time (RT60).The dimensions of the "typical room" are ITU-R-compliant (BS.1116-1) and are identical to those of the "high reverberation" case.The "high reverberation" and "high ceiling" cases have the same T 60 .The listener and sound sources were placed in the room as depicted in Fig. 8.The setup was chosen so as to be simple to describe, while at the same time avoiding the occurrence of sweeping echoes, which have been shown to occur in rectangular rooms with regular setups [20].For each test, one of the three room setups and one of the two sound source positions was selected at random.
The frequency response of the Bang & Olufsen BeoPlay H6 headphones was equalized via monophonic minimumphase inverse filters provided by the manufacturer.

Model Comparison
Fig. 9 shows the empirical PDF of the localization errors under the anechoic condition (N = 751) and the result of fitting various statistical models to the data.Fig. 9a includes the Gaussian model and the vM model, both with ML estimate of the parameters, and the vMM3 model with   MM estimate of the parameters.It is clear that the Gaussian model fits the data poorly.The Pearson correlation between the empirical PDF and the Gaussian PDF, used here as a measure of goodness-of-fit, is 0.42.The vM model has an improved fit to the data but does not account for the front/back confusions, and the model's concentration is insufficient for frontal sound sources (Pearson correlation 0.55).The vMM3 model achieves a better fit in respect of the front/back confusions (Pearson correlation 0.85).However, also in this case, the model fails to represent the higher concentration of the data, which is due to the fact that it is trying to fit the uniformly distributed errors by means of the front and back vM marginal distributions.By including the error marginal distribution explicitly, the vMUM model is able to provide a better fit to the data than the other models investigated, as shown in Fig. 9b.The Pearson correlation for both the vMUM-MLE and vMUM-MME models is 0.97.
In the vMUM model, the concentration parameters for the frontal and front/back confusions vM marginal distributions are identical.However, it could be hypothesized that front/back confusions are associated with a higher uncertainty and thus a lower concentration.For this reason, an additional model allowing for different concentration parameters was also investigated.This model was termed vMUMk and is shown in Fig. 9.As may be observed, the vMUMk and vMUM models fit the data equally well, with vMUMk also having a Pearson correlation of 0.97.Therefore, owing to its simplicity and the availability of closed-form MM estimators, the vMUM model remains the preferred model here.

Angular Error and Concentration Parameter
Table 2 presents a summary of the statistics for the angular error, i.e., the difference between the response angle and true angle.The probability of identifying the frontal source is p 1 = 0.76, while the probability of experiencing a front/back confusion is p 2 = 0.09.The probability of making a uniformly distributed decision, or, in other words, a mistake, is p 3 = 0.15.From a frequentist point of view, 15% of subjects made a mistake.Notice that this includes cases where the subject made a mistake that was close to the correct direction (or to the opposite direction) by chance.These probabilities are similar across individual conditions.
In the anechoic case, the mean directional error is +1.4 • .Using the single-sample test proposed in Sec.3.4, the hypothesis H 0 : μ = 0 can be rejected at the 0.05 significance level.The p-value, i.e. p = +∞ L(θ) f L (l; H 0 )dl, is p < 0.001, which indicates a strongly significant result.
This rightward bias has been observed before in the literature [21] and implies some kind of physiological asymmetry in the auditory system itself.While the present experiment supports the findings in the literature, it cannot be ruled out that the bias observed here is due to systematic experimental errors such as asymmetries in the headphones or in the HRTF datasets.
Considering all the conditions together, the KEMAR and CIPIC datasets have a mean directional error of +2.3 • and +1.1 • , respectively.The two-sample test proposed in Sec.3.6 reveals that the difference is statistically significant (p = 0.02).Furthermore, there is a borderline significant trend in comparing the concentration parameters (p = 0.10), with the KEMAR having larger concentration than CIPIC.This difference is strongly statistically significant if one considers the room simulation with the typical room setup alone (p<0.001).
The data indicates that speech yields a stronger rightward bias than the percussion instrument sound.Across all conditions there is only a borderline significant trend (p = 0.11).The difference is not significant in the anechoic case (p = 0.99) but is significant in the reverberant conditions (p = 0.04).To the best of the authors' knowledge, this bias has not been observed before.
In the high reverberation condition, there is a statistically significant difference between speech and percussion sounds (p = 0.02), with percussion sounds having a larger concentration than speech.This suggests that, in the presence of reverberation, percussion sounds are easier to localize in comparison to the speech signal, which is consistent with results in the literature [1].
Finally, results of this experiment not included in Table 2 show that a larger number of subjects experience a front/back confusion in cases where the sound source starts behind the subject.Indeed, for tests where the initial look direction is less than 90 • away from the source, the vMUM-MLE parameters are p 1 = 0.79, p 2 = 0.06, p 3 = 0.15, while for tests where the initial look direction is more than 90 • away from the source, the vMUM-MLE parameters are p 1 = 0.74, p 2 = 0.11, p 3 = 0.15, which shows that the percentage of front/back confusions has nearly doubled.The percentage of front/back confusions increases even more as the initial look direction approaches 180 • , with p 2 = 0.25 for initial look directions larger than 160 • from the source.
The results presented have not been corrected for multiple comparisons.

Time to Complete Experiment
Subjects could choose whether to run the anechoic or the reverberant conditions first.Unsurprisingly, there was a clear learning effect, with the first condition taking much longer than the second.Those who started with the anechoic condition took on average 30.4s for the anechoic condition and 23.4 s for the reverberant condition.Conversely, those who started with the reverberant condition took on average 22.3 s for the anechoic condition and 31.7 s for the reverberant condition.In order to compare the two conditions fairly, the dataset was pruned until it contained an equal number of subjects who started with each one 3 . 3All subjects who started with the reverberant condition were kept (because there were fewer of them).Then, of the subjects starting with the anechoic condition, only those who carried out the experiment at a similar time to the ones starting with the The results are reported in Table 3 and show that subjects take longer in the high ceiling condition than in all the other conditions.Two-sample t-tests with right tail reveal that the result is statistically significant in all cases.
This result may seem surprising.Indeed, the high ceiling condition has the highest DRR of all three (see Table 1) and has the same reverberation time as the high reverberation condition.In contrast, the high reverberation condition took about the same time as the typical room condition (25.9 s and 25.5 s, respectively).As mentioned earlier this phenomenon has been observed before in the work of Hartmann [2].Hartmann hypothesized that it is easier for the auditory system to localize the sound source if the ceiling reflection arrives before the onset of the precedence effect.The rationale is that there is an additional reflection that is temporally fused with the line of sight component and is azimuthally co-directional with the sound source.While Hartmann supported his conclusions based on experiments in a large room with long reverberation times the results presented here suggest that the same phenomenon also arises in small, ITU-R-compliant rooms.
Finally, results of this experiment not included in Table 3 reveal that the speech sample took 30.5 s on average, which is significantly longer than the percussion sample, 22.9 s (p<0.001).This difference could be due to subjects actively paying attention to what was being said in the speech sample, or to the fact that percussive signals with sharp onsets are easier to localize [1].

CONCLUSIONS
This paper has presented a circular statistical model for sound localization experiments that is robust to the presence of front/back confusions and random errors.Closed-form MM and ML estimators were presented, with the MM estimators outperforming the ML estimators for small sample sizes.Single-sample and two-sample tests were proposed and were shown to have a performance on a par with or reverberant condition were kept.This particular choice was made because it allows to factor out possible time dependances of the experiment (e.g., at peak times a queue would sometimes form that could put subjects under pressure to complete the test faster).The statistical framework was used to analyze the data of a large-scale auralized experiment.Using the proposed statistical framework, it was shown that (a) a rightward bias is present in the subjects' localization responses, (b) the speech sample is found to have a larger rightward bias than the percussion sample under reverberant conditions, (c) the KEMAR HRTF dataset results in a larger bias than the CIPIC HRTF dataset, (d) under high reverberation, responses given with the percussions sample had a higher concentration than with the speech sample.Furthermore, an analysis of the time it took to complete the experiment showed that (a) the speech sample took significantly longer to localize than the percussions sample, and (b) it took significantly longer to localize sound in a simulated room with high ceiling.
Matlab routines implementing the proposed statistical estimators and statistical tests, together with the data of the listening experiment, are available online 4 .
MSE of the parameter p1 .
MSE of the parameter p3 .

Fig. 4
shows the mean squared error of the different estimators as a function of the sample size N.It can be observed that the vMUM-based estimators outperform the vMM3-based one.The vMUM-MLE and vMUM-MME estimators perform equally well, except for the estimation MSE of the parameter p3 .
MSE of the mean angle μ.
MSE of the parameter κ .
MSE of the parameter p1 .
MSE of the parameter p3 .

Fig. 5 .
Fig. 5. Cumulative density function (CDF) of the log-likelihood ratio of the single-sample statistical test for different values of N in comparison to the χ 2 1 asymptotic approximation.Results obtained from Monte Carlo simulations.

Fig. 6 .Fig. 7 .
Fig. 6.Comparison of the ROC curves for the single-sample statistical test.These curves represent the available trade-offs between P d and P fa obtained by varying the value of the threshold ξ.ROC curves with points close to (P d = 1, P fa = 0) indicate a test with high performance.Results obtained from Monte Carlo simulations with sample size N = 20.

Fig. 8 .
Fig. 8. Setup of the reverberant simulation.The black circles denote the position of the sound sources (only one source playing at any one time).
Data fitting with Gaussian, vM and vMM3 models.Data fitting with vMUM model.

Fig. 9 .
Fig. 9. Empirical PDF of the anechoic condition and fitting of the statistical models discussed in this paper.The sound source is positioned at 0 • .

Table 1 .
Characteristics of the room simulation in the reverberant condition.

Table 2 .
Statistics of the angular error.Positive values of μ indicate a rightward bias.The first column denotes the number of samples, N. The following 5 columns denote the vMUM-MLE estimates.The following 4 groups of 3 columns analyze differences in mean and concentration parameter between the KEMAR and CIPIC HRTF datsets, and mean and concentration parameter between speech and percussions audio samples.The p-values are calculated using the proposed two-sample vMUM tests.Values in boldface indicate statistical significance at the 0.05 significance level.p 2 p 3 μ KEM.μ CIP.p κ KEM.κ CIP.p μ spe.μ per.p κ spe.κ per.p

Table 3 .
Statistics of the time to complete the test.The dataset was pruned to factor out learning effects (see text).The first two columns denote the number of samples, N and the mean of the time to complete the test, respectively.The last four columns present the p-value of the interactions across conditions using a two-sample t-test with right tail.Values in boldface indicate statistical significance at the 0.05 significance level.than prior art tests.The performance of the proposed data modelling was substantially better in the presence of front/back confusions and/or random errors.These errors are common if the experiment is auralized (especially when using non-individualized HRTFs), if the subjects are untrained, if the auditory system is intentionally interfered with, or if the sound stimuli are particularly hard to localize, e.g., in echolocation tasks. better