Microphone array geometries for horizontal spatial audio object capture with beamforming

Microphone array beamforming can be used to enhance and separate sound sources, with applications in the capture of object-based audio. Many beamforming methods have been proposed and assessed against each other. However, the effects of compact microphone array design on beamforming performance have not been studied for this kind of application. This study investigates how to maximize the quality of audio objects extracted from a horizontal sound field by filter-and-sum beamforming, through appropriate choice of microphone array design. Eight uniform geometries with practical constraints of a limited number of microphones and maximum array size are evaluated over a range of physical metrics. Results show that baffled circular arrays outperform the other geometries in terms of perceptually relevant frequency range, spatial resolution, directivity and robustness. Moreover, a subjective evaluation of microphone arrays and beamformers is conducted with regard to the quality of the target sound, interference suppression and overall quality of simulated music performance recordings. Baffled circular arrays achieve higher target quality and interference suppression than alternative geometries with wideband signals. Furthermore, subjective scores of beamformers regarding target quality and interference suppression agree well with beamformer on-axis and off-axis responses; with wideband signals the superdirective beamformer achieves the highest overall quality.


INTRODUCTION
Object-based audio is a spatial audio representation in which the sound field is composed of individual objects [1]. The advantage of this paradigm over channel-based and scene-based approaches is that objects can be controlled individually before being rendered, allowing for compatibility with arbitrary reproduction systems and user personalization [1,2]. This results in an improved listening experience, e.g. by controlling the dialogue-to-background sound level [3], automatic optimal rendering exploiting the semantic information from the object metadata [4], and customization for hearing-impaired people [5].
Sound sources composing audio objects can be captured individually with minimum spill by separate multi-tracked or close microphone recordings [6]. However, there are situations where close-miked recordings may not be feasible due to production constraints: insufficient resources (microphones, preamplifiers, digital converters, etc.); restricted set-up time; the impracticality of wearing a clip microphone and transmitter; and/or moving sources that cannot be followed dynamically with a microphone. In all these situations, spatial filtering (or beamforming) with a single, compact microphone array to isolate [7] or enhance [1] certain audio objects in the sound scene may be desirable.
Many of the findings from the beamforming literature apply to object capture with microphone arrays. The array output depends on the beamforming method and physical array design. Beamforming methods can be classified as [8] filter-and-sum beamformers (FSBs), differential microphones (DMs) and modal beamformers (MBs). Within these approaches, numerous contributions in filter design optimization based on different criteria have been proposed and reviewed [9,10,11,12,13]. However, it is not obvious how to design the microphone array to maximize the beamforming performance with respect to various metrics. In the literature, arrays are typically chosen either to simplify the formulation, e.g. linear arrays with FSBs [9] and DMs [14,15,12] or circular and spherical arrays with MBs [16,17]; or to show improved performance regarding a single physical metric of interest (e.g. resolution [18,19] or sidelobe level [20]), thus only partially rating their performance.
Most of the contributions relate to physical performance measures. While there exist some perceptual studies in beamforming, they either relied on objective models trained on perceptual features [7,21,22,23] such as PEASS [24], or performed a listening test only using speech and without stating an attribute to be rated [25]. This paper makes two main contributions.
1) We perform a thorough comparative evaluation of the physical beamforming performance of compact array designs. Uniform arrays are for the first time compared based on the two most practical design constraints: a given number of microphones, which impacts on the cost and processing power of the system; and a maximum array size, determining its compactness and portability. To achieve such a consistent and systematic comparison, their performance with widely used space-domain beamformers is assessed over a range of metrics through simulations, since off-the-shelf arrays do not have the same number of microphones or comparable dimensions [26]. As a result, we show which array is the optimal uniform array geometry in terms of perceptually relevant frequency range, resolution, directivity and robustness for horizontal sound fields.
2) We conduct the first formal comparative listening evaluations of microphone array beamforming for audio applications. Two main experiments are undertaken assessing different arrays and beamformers in terms of quality of the target sound, interference suppression and overall quality, for simulated music performance recordings. Results show how critically target quality and interference suppression affect the overall quality and rankings. A further listening test is employed to discriminate amongst the best performing arrays, obtaining statistical significance of the favored array.
The paper is structured as follows: Sec. 1 reviews the signal model, microphone array designs and beamforming methods to assess the arrays; Sec. 2 presents the evaluation metrics, setup and results from the physical analysis; Sec. 3 presents the methodology and results from the perceptual evaluation of beamformers and arrays; Sec. 4 discusses the physical and perceptual results and their implications for object capture. Finally, the main conclusions are highlighted in Sec. 5.

BACKGROUND
This section introduces the signal model, array manifold transfer functions for open, cylindrical and spherical baffles, and beamformers used for the array evaluation.

Signal model
Consider a collection of $S$ sound source signals expressed in the frequency domain as $\mathbf{s}(\omega) = [s_1(\omega), s_2(\omega), \ldots, s_S(\omega)]^T$ in the far field from an $M$-element microphone array at $S$ different directions. The signal captured by the array $\mathbf{x}(\omega) = [x_1(\omega), x_2(\omega), \ldots, x_M(\omega)]^T$ can be expressed as

$$\mathbf{x}(\omega) = \mathbf{A}(\omega)\,\mathbf{s}(\omega) + \mathbf{v}(\omega), \qquad (1)$$

where $\mathbf{A}(\omega) = [\mathbf{a}_1(\omega), \mathbf{a}_2(\omega), \ldots, \mathbf{a}_S(\omega)]$ is the array manifold steering matrix representing the transfer function between each sound source $s_s(\omega)$ and each of the microphones; $\mathbf{a}_s(\omega) = [a_{1s}(\omega), a_{2s}(\omega), \ldots, a_{Ms}(\omega)]^T$ is the equivalent vector between sound source $s_s(\omega)$ and all microphones; and $\mathbf{v}(\omega) = [v_1(\omega), v_2(\omega), \ldots, v_M(\omega)]^T$ is a noise signal with arbitrary spatial characteristics [27,28]. The output signal of an FSB $y(\omega)$ is obtained by filtering and summing the array input $\mathbf{x}(\omega)$ with the beamformer weights $\mathbf{w}(\omega) = [w_1(\omega), w_2(\omega), \ldots, w_M(\omega)]^T$:

$$y(\omega) = \mathbf{w}^H(\omega)\,\mathbf{x}(\omega). \qquad (2)$$

The directional response $\mathbf{d}(\omega)$ can be regarded as the transfer function between a source signal at any point over the sound field considered and the array output [29]:

$$\mathbf{d}(\omega) = \mathbf{w}^H(\omega)\,\mathbf{A}(\omega) = [d_1(\omega), d_2(\omega), \ldots, d_S(\omega)], \qquad (3)$$

where $d_s(\omega) = \mathbf{w}^H(\omega)\,\mathbf{a}_s(\omega)$ is the response at each angle $\Omega_s$ over $S$ steering directions, with $\Omega \equiv (\theta, \varphi)$ comprising the inclination and azimuth angles, respectively. The steering matrix $\mathbf{A}(\omega)$ depends among other things on the microphone positions and potential acoustic wave phenomena (e.g. diffraction and scattering), thus leading to different analytical expressions for the uniform array designs included in this study.
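As a minimal numerical sketch of the signal model and filter-and-sum output described above, the following Python snippet builds a steering matrix, forms the array signal and applies delay-and-sum weights at a single frequency bin. The free-field open circular array, the `exp(+j k r cos(.))` phase convention, the speed of sound of 343 m/s, the noise level, the 1/M weight normalization and the helper name `open_circular_manifold` are illustrative assumptions rather than details taken from the paper; the source angles reuse the scene described in Sec. 3.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def open_circular_manifold(freq, mic_az, src_az, radius=0.1):
    """Free-field plane-wave manifold A (M x S) of an open circular array.

    Hypothetical helper: assumes horizontal incidence and an
    exp(+j k r cos(.)) phase convention.
    """
    k = 2.0 * np.pi * freq / C
    return np.exp(1j * k * radius * np.cos(mic_az[:, None] - src_az[None, :]))

M = 32
mic_az = 2.0 * np.pi * np.arange(M) / M              # microphone azimuths
src_az = np.deg2rad([0.0, -60.0, 45.0, 100.0])       # one direction per source

A = open_circular_manifold(1000.0, mic_az, src_az)   # steering matrix at 1 kHz
rng = np.random.default_rng(0)
s = rng.standard_normal(4) + 1j * rng.standard_normal(4)            # source spectra
v = 1e-3 * (rng.standard_normal(M) + 1j * rng.standard_normal(M))   # sensor noise

x = A @ s + v        # array signal: x = A s + v
w = A[:, 0] / M      # delay-and-sum weights steered at the first source
y = np.vdot(w, x)    # beamformer output w^H x (np.vdot conjugates w)
d = w.conj() @ A     # directional response w^H A at the four source directions

print(np.round(np.abs(d), 3))
```

With unit-modulus manifold entries, the response at the look direction equals one by construction, while the responses at the other three directions show how much of each interferer leaks into the output at this frequency.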

Microphone arrays
There exist many possible designs for compact microphone arrays. Uniform linear arrays are commonly used due to their ability to simplify the formulation of a proposed beamformer or feature to be shown. However, they are unable to resolve the direction of arrival (DoA) in three dimensions (due to unavoidable front-back and elevation ambiguities). Horizontal planar arrays also feature up-down confusion. However, they have been used for noise control applications to reduce the sidelobes [20].
On the other hand, circular and spherical arrays have been widely used for 2D and 3D sound field capture in the circular/spherical harmonic domain, i.e. higher-order Ambisonic (HOA) and MB. While all circular/spherical array designs are sensitive to noise at low frequencies, their open counterparts are ill-conditioned at frequencies where Bessel function singularities occur [17]. The latter can be remedied with dual- and multiple-radius spheres/circles [30,31,32] or a combination of pressure and velocity microphones [33], at the cost of at least twice as many microphones; or using cardioid microphones, although their directivity is frequency dependent in practice [31,34]. Alternatively, mounting the array on a cylindrical or spherical baffle also overcomes the robustness issue [17].
This study performs a systematic evaluation of the performance of eight of these array geometries (see Fig. 1): linear (L), rectangular (R), circular (C), dual-circular (DC), spherical (S), circular on rigid cylinder (C-RC), circular on rigid sphere (C-RS) and spherical on rigid sphere (S-RS). These were designed to provide an unbiased comparison by fixing the two most practical design factors: the number of (omnidirectional) microphones, $M = 32$, which impacts on the cost and processing power of the array; and the aperture limit (maximum distance between two microphones), which determines its compactness and portability, set through a maximum radius of $r = 0.1$ m. This results in different spacing $\Delta d$ (minimum distance between two microphones). The inner radius of DC is 0.08 m. Unbaffled arrays (L, R, C, DC, S) can be modeled as: where $J_n$ is the Bessel function of order $n$. While open array manifolds can be expressed either in complex exponential (4) or harmonic decomposition (5) forms, the sound pressure on baffled arrays can only be represented via inverse cylindrical or spherical harmonic transforms.
The transfer function of a microphone array on an infinitely long rigid cylinder in a horizontal plane is an accurate approximation to that of its finite-length counterpart, provided the cylinder length is at least 2.8 times the radius [37]. Using this assumption, the array manifold of C-RC is [18,38]: where $H_n^{(2)\prime}$ is the derivative of the Hankel function of the second kind and $\theta \in \{0, \pi\}$.
The plane-wave transfer function for a microphone array mounted on a rigid sphere (C-RS and S-RS) is: where $h_n^{(2)\prime}$ is the derivative of the spherical Hankel function of the second kind, $\Theta = \Omega_m - \Omega$, and $P_n$ is the Legendre polynomial of order $n$ comprising the sum over the spherical harmonics of all degrees $|p| \leq n$. For S and S-RS, sensors are nearly uniformly distributed [17], placed in the center of the faces of a truncated icosahedron [39].
Eqs. (4-7) assume plane-wave incidence, which is valid for sources at a distance $R \geq 8r^2 f/c$ [40], i.e. 2.3 m for $r = 0.1$ m and frequencies up to 10 kHz. This is satisfied in a practical performance capture, where the sound sources are spaced apart from each other while being equally distant from the array, as in the scene presented in Sec. 3. Note that (5), (6) and (7) are truncated approximations of the equivalent infinite series, which are accurate up to a maximum frequency $f_{\max}$ provided $N_a = 1.1\,k_{\max} r_m$ [41], where $f_{\max} = 20$ kHz in this study.
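The two rules of thumb quoted above can be checked with a few lines of Python. This is a sketch only: the speed of sound of 343 m/s, the function names and the rounding of the truncation order up to the next integer are assumptions made for illustration.

```python
import math

def min_plane_wave_distance(radius_m, freq_hz, c=343.0):
    """Source distance beyond which plane-wave incidence is assumed: R >= 8 r^2 f / c."""
    return 8.0 * radius_m ** 2 * freq_hz / c

def truncation_order(radius_m, f_max_hz, c=343.0):
    """Series truncation order from N_a = 1.1 k_max r, rounded up (assumed rounding)."""
    k_max = 2.0 * math.pi * f_max_hz / c
    return math.ceil(1.1 * k_max * radius_m)

r_min = min_plane_wave_distance(0.1, 10_000.0)   # r = 0.1 m, f = 10 kHz
n_a = truncation_order(0.1, 20_000.0)            # r = 0.1 m, f_max = 20 kHz
print(f"R_min = {r_min:.2f} m, N_a = {n_a}")
```

For the array radius used here, the first function reproduces the distance of about 2.3 m stated in the text.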

Beamforming
Four beamformers are used to evaluate the performance of the arrays: delay-and-sum (DSB), superdirective (SDB), minimum variance distortionless response (MVDRB) and least-squares (LSB), where B refers to beamformers. They are optimal in some way as reviewed below.

Delay and sum
The simplest FSB is DSB, whose weights are the array manifold vector at the look direction, $\mathbf{a}_l(k, r) \equiv \mathbf{a}(k, \Omega_l, r)$, to steer the array in that direction. DSB is very robust against deviations in microphone characteristics [28,42].

Superdirective
SDB, also known as supergain beamformer [43] or superdirective array [27,29], maximizes the directivity factor [29,42] (see Sec. 2.1) by minimizing the array output power at all directions subject to a distortionless constraint in the target direction. The robust weights are [10]: where $\mathbf{I}$ is the $M \times M$ identity matrix, $\beta(\omega)$ is the regularization parameter controlling the array's sensitivity to sensor self-noise and to gain, phase and positioning errors, and $\boldsymbol{\Gamma}_{\mathrm{diff}}(k, r)$ is the diffuse field coherence matrix (10).

MVDR
Unlike DSB and SDB, MVDRB [11] is a data-dependent beamformer that minimizes the array output based on the array input covariance $\mathbf{R}_{xx}(k, r) = E\{\mathbf{x}(k, r)\,\mathbf{x}^H(k, r)\}$, subject to the distortionless constraint. The weights [10] resemble those for SDB. In fact (11) simplifies to (8) in a purely diffuse field, i.e. $\mathbf{R}_{xx}(k, r) = \boldsymbol{\Gamma}_{\mathrm{diff}}(k, r)$.
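The relationship between the two distortionless beamformers can be illustrated numerically. The sketch below uses the commonly used regularized form w = (C + βI)⁻¹a / (aᴴ(C + βI)⁻¹a), with C set either to a diffuse-field coherence matrix or to an input covariance; whether this matches the paper's expressions term by term is an assumption. The open circular array manifold, the coherence matrix estimated by averaging over a horizontal grid of plane waves, the value of β, the interferer power and the helper names are all hypothetical choices.

```python
import numpy as np

C = 343.0  # assumed speed of sound (m/s)

def manifold(freq, mic_az, az, radius=0.1):
    """Open circular array, free-field plane waves (illustrative assumption)."""
    k = 2.0 * np.pi * freq / C
    az = np.atleast_1d(az)
    return np.exp(1j * k * radius * np.cos(mic_az[:, None] - az[None, :]))

def distortionless_weights(cov, a_look, beta):
    """Regularized minimum-power weights with unit gain at the look direction."""
    m = a_look.shape[0]
    num = np.linalg.solve(cov + beta * np.eye(m), a_look)
    return num / np.vdot(a_look, num)

M, freq, beta = 32, 1000.0, 1e-2
mic_az = 2.0 * np.pi * np.arange(M) / M
grid = np.deg2rad(np.arange(360.0))

A = manifold(freq, mic_az, grid)
a_look = manifold(freq, mic_az, 0.0)[:, 0]
a_int = manifold(freq, mic_az, np.deg2rad(60.0))[:, 0]

# Horizontal diffuse field approximated by equal-power plane waves on the grid.
gamma_diff = (A @ A.conj().T) / A.shape[1]

w_sdb = distortionless_weights(gamma_diff, a_look, beta)

# Covariance of a diffuse field plus one interferer (power 10 is hypothetical).
r_xx = gamma_diff + 10.0 * np.outer(a_int, a_int.conj())
w_mvdr = distortionless_weights(r_xx, a_look, beta)

# With a purely diffuse covariance the two designs coincide.
w_mvdr_diffuse = distortionless_weights(gamma_diff, a_look, beta)

print(abs(np.vdot(w_sdb, a_int)), abs(np.vdot(w_mvdr, a_int)))
```

Both designs keep unit gain at the look direction; adding the interferer term to the covariance can only lower the response towards 60° relative to the diffuse-only design, which is consistent with the deeper attenuation at the interferer reported for MVDRB.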

Least-squares
Since all array manifolds are frequency dependent as shown in (4-7), so is the directional response of the above beamformers. Conversely, a particular frequency-independent desired directivity response can be approximated using the LSB [23,44,45] by minimizing the error with respect to the synthesized response: resulting in the following closed-form solution: The target patterns are high-order hypercardioids, which maximize the directivity index for a given order $N$ [14]: where $\varphi_l$ and $\varphi_s$ are the azimuths at the look and $s$th steering directions, respectively, and $\mathbf{b} = [b_0, b_1, \ldots, b_N]$ are the real coefficients for natural ($n \geq 0$) cylindrical harmonics [13], with $b_0 = 1$ and $b_n = 2\ \forall\, n \neq 0$ [37]. The chosen target directivity patterns are similar to those designed by DMs [14,15] and similar approaches [13]. Unlike those, LSB is regularized, stabilizing the steering matrix inversion in (13) and thus limiting the array's sensitivity to mismatches in microphone characteristics and self-noise.
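A regularized least-squares pattern synthesis of this kind can be sketched as follows. The snippet assumes a Tikhonov-regularized normal-equation solution, a target built as a cosine series with coefficients b_0 = 1 and b_n = 2 normalized to unit gain at the look direction, and an open circular array manifold; the regularization value, the frequency and all function names are hypothetical and are not the paper's settings.

```python
import numpy as np

C = 343.0  # assumed speed of sound (m/s)

def manifold(freq, mic_az, az, radius=0.1):
    """Open circular array, free-field plane waves (illustrative assumption)."""
    k = 2.0 * np.pi * freq / C
    return np.exp(1j * k * radius * np.cos(mic_az[:, None] - az[None, :]))

def hypercardioid_target(az, look_az, order):
    """Cosine-series target with b_0 = 1, b_n = 2, scaled to unit look-direction gain."""
    b = np.r_[1.0, 2.0 * np.ones(order)]
    n = np.arange(order + 1)
    pattern = (b[:, None] * np.cos(n[:, None] * (az[None, :] - look_az))).sum(axis=0)
    return pattern / b.sum()

def lsb_weights(A, target, beta):
    """Minimize ||w^H A - target||^2 + beta ||w||^2 in closed form."""
    m = A.shape[0]
    return np.linalg.solve(A @ A.conj().T + beta * np.eye(m), A @ target.conj())

M, freq = 32, 1500.0
mic_az = 2.0 * np.pi * np.arange(M) / M
az = np.deg2rad(np.arange(360.0))

A = manifold(freq, mic_az, az)
target = hypercardioid_target(az, 0.0, order=4)
beta = 1e-3 * np.trace(A @ A.conj().T).real / M   # hypothetical regularization level

w = lsb_weights(A, target, beta)
d = w.conj() @ A                                  # synthesized response w^H A
nse_db = 10.0 * np.log10(np.sum(np.abs(d - target) ** 2) / np.sum(np.abs(target) ** 2))
print(f"normalized squared error: {nse_db:.1f} dB")
```

Increasing `beta` trades pattern accuracy for smaller weight norms, which is the mechanism by which regularization limits the sensitivity to microphone mismatches and self-noise.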

PHYSICAL EVALUATION
This section evaluates the objective performance of the array geometries in Fig. 1 with physical metrics by means of simulations. These are introduced below.

Evaluation metrics
The beampattern |d(ω)| is the magnitude of the directional response (3). It fully quantifies the array processing transfer function over steering angle and frequency. Additional metrics that summarize aspects of the beampattern are also considered: Beam width (BW) is a measure of spatial resolution. It is defined as the angular distance between the two nulls in the beampattern delimiting the mainlobe. The sidelobe suppression level (SSL) is a measure of the minimum acoustic rejection with respect to any single direction outside of the mainlobe. It is defined as the ratio in dB of the directional response at the look direction to that given by the highest sidelobe. Similarly, the acoustic contrast (AC) is a measure of the acoustic rejection at a predefined direction (e.g. interferer direction) with respect to the look direction.
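The null-to-null beam width and the sidelobe suppression level can be extracted from a sampled beampattern as sketched below. The mainlobe is delimited by the first local minima on either side of the look direction, which is a discrete approximation of the nulls; the function names and the test pattern (a 4th-order cosine-series pattern on a 1° grid) are assumptions for illustration.

```python
import numpy as np

def mainlobe_edges(mag, look_idx):
    """Offsets of the first local minima either side of the look direction (circular grid)."""
    n = mag.size
    right = look_idx
    while right - look_idx < n and mag[(right + 1) % n] <= mag[right % n]:
        right += 1
    left = look_idx
    while look_idx - left < n and mag[(left - 1) % n] <= mag[left % n]:
        left -= 1
    return left, right

def beamwidth_and_ssl(mag, az_deg, look_idx):
    """Return (beam width in degrees, sidelobe suppression level in dB)."""
    n = mag.size
    left, right = mainlobe_edges(mag, look_idx)
    step = az_deg[1] - az_deg[0]
    bw = min((right - left) * step, 360.0)       # no nulls found: full circle
    outside = np.ones(n, dtype=bool)
    outside[np.arange(left, right + 1) % n] = False
    if not outside.any():
        return bw, np.inf
    ssl = 20.0 * np.log10(mag[look_idx] / mag[outside].max())
    return bw, ssl

az_deg = np.arange(360.0)
phi = np.deg2rad(az_deg)
order = 4
pattern = (1.0 + 2.0 * sum(np.cos(n * phi) for n in range(1, order + 1))) / (2 * order + 1)

bw_deg, ssl_db = beamwidth_and_ssl(np.abs(pattern), az_deg, look_idx=0)
print(f"BW = {bw_deg:.0f} deg, SSL = {ssl_db:.1f} dB")
```

An omnidirectional pattern has no nulls, so the function returns a beam width of 360° and an infinite SSL, in keeping with the convention that directionality starts where BW drops below 2π.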
The directivity index (DI) measures the directionality of the array-beamformer as the ratio in dB of the response at the look direction to the average diffuse power [42]: .
The white noise gain (WNG) is a measure of robustness of the beamforming weights against microphone self-noise, and phase, gain and positioning deviations from nominal values. It represents the gain in signal-to-noise ratio (SNR) at the beamformer output compared to a single sensor, in presence of spatially uncorrelated noise [42]:
Finally, the frequency range of the array is bounded by the minimum frequency $f_{\min}$ whose BW is smaller than $2\pi$ and the spatial aliasing frequency $f_a$, defined here as the frequency at which grating lobes due to aliasing exceed the amplitude of the sidelobes. The frequency-invariant range is set by the onset frequency $f_o$ and aliasing frequency and calculated as the range within which the directional response normalized squared error NSE ≤ −20 dB, ensuring a minimum target response accuracy, where NSE(ω) = 10 log 10
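The verbal definitions of DI and WNG given above correspond to the widely used ratios |wᴴa_l|²/(wᴴΓw) and |wᴴa_l|²/(wᴴw), which can be evaluated as in the following sketch. The function names are hypothetical, and the example uses a random unit-modulus manifold vector with delay-and-sum weights purely as a sanity check.

```python
import numpy as np

def white_noise_gain_db(w, a_look):
    """SNR gain over a single sensor for spatially uncorrelated noise."""
    return 10.0 * np.log10(np.abs(np.vdot(w, a_look)) ** 2 / np.vdot(w, w).real)

def directivity_index_db(w, a_look, gamma_diff):
    """Look-direction power relative to the average diffuse-field output power."""
    diffuse_power = np.vdot(w, gamma_diff @ w).real
    return 10.0 * np.log10(np.abs(np.vdot(w, a_look)) ** 2 / diffuse_power)

M = 32
rng = np.random.default_rng(1)
a_look = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, M))   # unit-modulus manifold vector
w_dsb = a_look / M                                       # delay-and-sum weights

wng_db = white_noise_gain_db(w_dsb, a_look)
di_uncorrelated_db = directivity_index_db(w_dsb, a_look, np.eye(M))
print(f"WNG = {wng_db:.2f} dB")
```

For delay-and-sum weights on a unit-modulus manifold the WNG equals 10 log10(M), and when the coherence matrix is the identity the DI reduces to the WNG, which gives two quick consistency checks for any implementation.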

Setup
The performance of the eight array geometries (L, R, C, DC, S, C-RC, C-RS, S-RS) shown in Fig. 1 is evaluated with the beamformers introduced in Sec. 1.3 and the metrics from Sec. 2.1 over a horizontal sound field.
All beamformer weights were calculated for a look direction $\phi_l = 0^\circ$ (where $\phi = 90^\circ - \varphi$), subject to a WNG constraint ($\mathrm{WNG}_{\min}$) of −10 dB unless otherwise stated, to limit the sensitivity to mismatches between nominal and actual array manifold responses encountered in practice. Thus, $\beta(\omega)$ is derived to meet $\mathrm{WNG}_{\min}$. L was pointed endfire to $\phi_l = 0^\circ$. MVDRB was computed as a data-independent beamformer, assuming a diffuse field with an interferer at $\phi_i = 60^\circ$.

Results
This section presents the performance of the arrays under study in terms of the beampattern, frequency range, beamwidth, directivity, robustness and sidelobe suppression.

Beampattern
The beampattern characterizes the effect of beamformer and array design choices for an arbitrary sound field. Fig. 2(a) shows the beampattern for DSB, SDB, MVDRB and 4th-order hypercardioid LSB (shortened as LSB henceforth) with the C-RC. The shape of the beampattern changes significantly for these beamformers: DSB is the most frequency-dependent beamformer, with an omnidirectional response below 300 Hz that narrows rapidly with frequency; SDB is the most directive, with gradual beam narrowing and a larger attenuated region as frequency increases; MVDRB's response approaches that of SDB with greater attenuation at the interferer; LSB provides a fixed beampattern within the array design's operating bandwidth, while at low frequencies it becomes broader and attenuated due to the regularization to meet $\mathrm{WNG}_{\min}$.
On the other hand, the overall shape of the beampattern is more similar across different arrays with the same beamformer. An example is shown in Fig. 2(b) for LSB with different array geometries. However, the array design has significant effects in terms of frequency range, resolution, directivity, robustness and sidelobe suppression. These are analyzed in more detail below.

Frequency range
The main effect of the array geometry is the operating frequency range. This can be seen in Fig. 2(b) where the onset and aliasing frequencies differ significantly across arrays. The operating frequency ranges of all arrays with DSB, SDB and LSB are shown in Fig. 3. For a fixed number of sensors, the more dimensions the array spans, the smaller the operating bandwidth. In this case, with $M = 32$ and fixed maximum aperture of 0.2 m ($r = 0.1$ m), different spacing leads to different $f_a$, ranging with DSB from 3 kHz for S to over 27 kHz for L, with circular arrangements achieving the second highest value of 8.9 kHz.
On the other hand, the minimum frequency is rarely reported. Despite physically constraining the maximum aperture to a fixed size for all arrays, $f_{\min}$ varies due to their sensor phase differences. The concept of effective or virtual modal aperture was previously used to describe the effect of a baffle on circular arrays' modal response [37]. Here, the effective aperture refers to the equivalent wave traveling distance derived from the microphone phase responses, which is used to show the effect of array geometry on the acoustic response (i.e. before applying the beamformer) and to explain the values of $f_{\min}$, which are the result of the acoustical and signal processing stages (see [46]). Results show R has the highest $f_{\min}$ and smallest effective aperture, due to its sensors' proximity to the origin. L follows, whose $f_{\min}$ is 25% higher than that of the highly-separated circular arrangement, C. With diffraction around a baffle, larger phase differences arise with the same array aperture, hence baffled arrays have larger effective apertures resulting in lower $f_{\min}$. For C-RC and C-RS, $f_{\min}$ is reduced with respect to C by factors of 2.0 and 1.5 respectively, in line with those obtained from the effective apertures between the closest and farthest microphones from a plane wave incidence for $kr < 1$ [46]. The rigid-sphere factor of 1.5 is also derived in [40,47]. Note that the ranking of these arrays in terms of $f_{\min}$ and $f_a$ is consistent for the three beamformers, showing that these physical characteristics of the arrays impact on their operating bandwidth for multiple beamformers.
The beamformer, on the other hand, can further extend the arrays' operating range. SDB and LSB significantly lower $f_{\min}$ compared to DSB for all arrays. This shows that the improved low frequency performance shown in Fig. 2(a) has the equivalent effect of extending the minimum frequency of directionality. Some configurations also extend $f_a$ beyond the theoretical values ($c/(2\Delta d)$) and those obtained numerically by DSB: DC, S and S-RS with SDB, and all arrays except L and R for LSB (Fig. 3). In addition, for circular arrangements with LSB, $f_a$ extends even beyond that of SDB, e.g. baffled circular arrays extend their upper limit from 8.5 kHz to 12 kHz. This is also seen in Fig. 2(a) with the red dashed lines indicating the theoretical $f_a$. Unlike DSB and SDB, LSB's first aliased lobes occur nearly at the same frequency at all angles, thus extending its upper limit when synthesizing low order patterns. This is in line with the findings in [23] for various hypercardioid orders.
Finally, the frequency-invariant range for the LSB is also shown in Fig. 3 with vertical lines. Its onset frequency $f_o$ is higher than $f_{\min}$ for all arrays, yet with nearly identical ranking of arrays, with C-RC and L achieving the lowest and highest $f_o$, respectively. Observe that in Fig. 3, $\mathrm{WNG}_{\min} = 0$ dB so the differences in $f_{\min}$ among arrays become apparent in the frequency range of interest. Note that for a different constraint on $r$, the directional response will scale inversely proportionally with frequency due to the dependence of (4-7) on $kr$. Thus, the frequency ranges shown in Fig. 3 would be fixed in terms of $kr$ while shifting with respect to the frequency axis. However, for $r \gg 0.1$ m, the arrays will no longer meet the compactness requirement for a practical portable recording device. On the other hand, increasing $M$ maintains the same ranking and extends the theoretical aliasing frequencies shown in Fig. 3 [46], where $f_a = c(M - 1)/(4r)$ for L, $f_a \approx cM/(4\pi r)$ for circular arrays and $f_a \approx c\sqrt{M}/(8r)$ for spherical arrays.
Summarizing, the array geometry has a strong impact on the frequency range of the array-beamformer response, which is very important in object capture. Baffled circular arrays (C-RC and C-RS) achieve the widest perceptually relevant bandwidth for all beamformers under study, with R and S having the narrowest ranges.
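The theoretical aliasing expressions quoted above for linear, circular and spherical arrays are easy to evaluate for other values of M and r. The sketch below assumes a speed of sound of 343 m/s and a hypothetical function name; the resulting values are theoretical estimates and differ somewhat from the numerically obtained DSB figures reported in the text.

```python
import math

def theoretical_aliasing_frequencies(num_mics=32, radius_m=0.1, c=343.0):
    """Theoretical spatial aliasing frequencies in Hz for three array families."""
    return {
        "linear": c * (num_mics - 1) / (4.0 * radius_m),
        "circular": c * num_mics / (4.0 * math.pi * radius_m),
        "spherical": c * math.sqrt(num_mics) / (8.0 * radius_m),
    }

f_a = theoretical_aliasing_frequencies()
for name, value in f_a.items():
    print(f"{name:9s}: {value / 1000.0:.1f} kHz")

# Doubling the number of microphones raises every limit but keeps the ranking.
f_a_64 = theoretical_aliasing_frequencies(num_mics=64)
```

Because the linear limit grows with M, the circular limit with M and the spherical limit only with the square root of M, the ordering linear > circular > spherical is preserved as M increases.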

Resolution and directivity
Beam resolution and directivity are important to improve the isolation from adjacent sources and, in addition to the beamforming method, depend on the array design. Spatial resolution is inversely proportional to frequency and array size [48,20]. Given a maximum aperture limit, we show how the effective aperture of the array also determines the resolution and directivity. Due to the inverse relationship of resolution and frequency, at low frequencies BW follows the same ranking as $f_{\min}$ (Fig. 3), in turn determining DI. The latter is shown in Fig. 4. Hence, while L can theoretically achieve the highest directivity ($\mathrm{DI}_{\max} = 10\log_{10}(2M - 1)$) [14,27], this is only for unconstrained SDB or DMs, which are extremely sensitive to deviations from ideal microphone characteristics [43]. For robust beamforming required for practical recordings, baffled circular arrays achieve the highest directivity (and resolution) due to their increased effective aperture, being up to 3 dB higher than that for L with SDB. Note that increasing $M$ will increase the maximum directivity, which for SDB is $\mathrm{DI}_{\max} = 10\log_{10}(2N_{\max} + 1)$, where $N_{\max} = M - 1$ for L and $N_{\max} = M/2$ for circular arrangements [37,49]. However, this may only be achieved at high frequencies (or not at all) given the robustness constraint, so the ranking of array performance remains unaltered for other practical choices of $M$.
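As a worked instance of the maximum-directivity expression above, the following sketch evaluates DI_max = 10 log10(2 N_max + 1) for the linear and circular cases with M = 32, and for a 4th-order pattern. The function name is hypothetical; the formula and the values of N_max are those quoted in the text.

```python
import math

def max_directivity_index_db(n_max):
    """Maximum horizontal directivity index for a pattern of order n_max."""
    return 10.0 * math.log10(2 * n_max + 1)

M = 32
di_linear = max_directivity_index_db(M - 1)      # N_max = M - 1 for L
di_circular = max_directivity_index_db(M // 2)   # N_max = M / 2 for circular arrays
di_order4 = max_directivity_index_db(4)          # 4th-order hypercardioid

print(f"L: {di_linear:.1f} dB, circular: {di_circular:.1f} dB, N = 4: {di_order4:.1f} dB")
```

These are unconstrained upper bounds; under a robustness constraint the achievable DI is lower, which is why the ranking in practice differs from the theoretical one.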
Finally, the same ranking and DI are seen with LSB at low frequencies in Fig. 5 (top). While SDB can be regarded as an $N_{\max}$th-order hypercardioid, given $\mathrm{WNG}_{\min}$, both regularized beamformers synthesize the same directivity below $f_o$, thus exhibiting the same array differences.

Robustness
Practical recordings with microphone arrays require the actual array response to be robust to typical deviations in microphone positioning, gain and phase and to sensor noise. While a minimum robustness constraint on the weights limits the sensitivity to these deviations at low frequencies, the array geometry impacts on the absolute robustness at mid-to-high frequencies as shown in Fig. 5 (bottom) with LSB. Baffled circular arrays feature the highest WNG whereas L achieves the lowest. WNG for C and DC shows a significant number of dips at particular frequencies. These correspond to Bessel function singular frequencies (5), at which the array manifold becomes ill-conditioned when inverted. While this has been widely reported in MB/HOA [17,32,34], here it is shown that it also applies to FSBs relying on the array manifold inversion, including LSB, SDB and MVDRB, thus being inherent to the open circular arrangement. Due to the robustness constraint, the WNG dips are limited to −10 dB. This constraint causes the directional response of these arrays to differ from the ideal response at those frequencies (even in ideal conditions). These differences manifest as dips in the response, e.g. DI for LSB in Fig. 5 (top). Unlike C, DC overcomes the singularities below 5 kHz, since it samples the sound field at different radial positions, thus preventing the singularities from occurring at the same frequencies. At high frequencies the number of modes is so large that the singularities overlap for different radii. Hence, careful choice of array radii is crucial as shown in [32].
$\mathrm{WNG}_{\min}$ can be modified depending on the expected deviations from nominal microphone characteristics. A very low $\mathrm{WNG}_{\min}$ will lead to significant performance degradation due to minor deviations in microphone characteristics, whereas a very high $\mathrm{WNG}_{\min}$ would result in a response close to that of DSB [23], thus exhibiting similar relative differences across arrays to those shown here in terms of frequency range, BW, DI and WNG.

Sidelobe suppression
The SSL varies significantly with array geometry for DSB. A constant SSL of 13 dB is achieved by R, S and L, being only 7 dB for C. Baffled arrays have an SSL with larger attenuation in the lower range, with S-RS having the highest SSL yet over a narrow range. Conversely, SSL for SDB, MVDRB and LSB is insensitive to the choice of array, being around 14 dB for all arrays with SDB. Thus, the effect of array geometry on SSL is not significant for beamformers with amplitude weights.

Fig. 6. Simulated music performance recording for the MUSHRA listening test. All arrays within central circle.

Summary
The array geometry has a significant impact on frequency range, resolution, directivity and robustness, with baffled circular arrays performing best in all these attributes, which are important in object capture. R and S result in the narrowest bandwidths. L achieves the highest $f_a$, yet with the highest $f_o$, and performs the worst in resolution and directivity. Finally, open circular arrangements are less robust than their baffled counterparts.

PERCEPTUAL EVALUATION
This section perceptually evaluates the performance of different array designs and beamformers in terms of sound quality and interference suppression of the isolated audio object from a scene recording.

Procedure
Two listening test comparisons were conducted: one of arrays and one of beamformers. In the array comparison, a 4th-order hypercardioid LSB (LSB-N4) was synthesized with L, R, C-RC and S-RS, since their frequency ranges vary significantly both in terms of $f_o$ and $f_a$ as shown in Sec. 2.3.2. The beamformer comparison used C-RC, since this was shown to perform best overall in Sec. 2.3, and included DSB, SDB and LSB for orders 1, 4 and 8, providing different levels of on-axis and off-axis responses.
For each comparison, three different attributes were evaluated: 1) target quality refers to the quality of the target sound with respect to the reference; 2) interference suppression refers to any and all effects of interfering sources in each stimulus compared to the reference; 3) overall quality refers to the combined score considering the target quality 1) and interference suppression 2).
Each comparative test was undertaken with two target sounds (vocals and drums), which were repeated to check intra-participant agreement, resulting in 4 trials per test. In each trial participants were asked to rate the stimuli (beamformed signals and hidden reference and anchors) with respect to the reference according to each of the tasks above, using a MUSHRA-style interface [50]. To familiarize themselves with the stimuli and the interface, subjects undertook a training phase prior to the formal evaluation [50], where they could adjust the volume of the headphones.

Stimuli
Stimuli were obtained from the Mixing Secret Dataset [51], which includes stems from professionally produced music recordings. Vocals, drums, bass and guitar tracks were collected from the song "A reason to leave". Ten-second clips from these tracks were downmixed to mono and loudness normalized [52], to provide a fair comparison for different instruments. The reference signal was either the vocals or drums track for all trials. The interference task included one hidden anchor corresponding to the loudness normalized mono mixture (Mix) of the four stems. The target quality and overall quality tasks included two hidden anchors. The low and mid quality anchors for the target quality task (LA and MA) were the low-pass filtered versions of the reference signal with a cut-off frequency of 3.5 kHz and 7 kHz, respectively [50]. In the overall task, equivalent low and mid quality anchors from the mixture were used (LAMix and MAMix).

Setup
The remaining stimuli were created from simulated microphone array beamformed signals. A sound scene comprising musical instruments was simulated (Fig. 6) by positioning them on the horizontal plane at angles that resemble a practical setup from a music performance or band practice: vocals at 0°, bass at −60°, drums at 45° and guitar at 100°. The microphone arrays were assumed to be in the center of the scene (with L pointing at 0°) and were steered towards the vocals or drums. Array transfer functions were modeled with a 1024-point FIR filter per sensor with a sampling frequency of 44.1 kHz. Microphone array and beamformed signals were calculated by filtering the stimuli as per (1) and (2), respectively.

Pre-analysis
24 participants from the University of Surrey conducted the experiment, 11 of whom had formal critical listening training. Among all of them, 19 were considered in the analysis: 4 failed to rate the reference above 90 for over 85% of the items [50]; and 1 rated the interference task in terms of quality, which was confirmed by a post-test questionnaire and by the mixture (anchor) ratings above 70 for the beamformer test. Each participant's scores were normalized in each trial [50].
A repeated measures analysis of variance (RMANOVA) was performed for each attribute (target quality, interference suppression and overall quality) and comparison (beamformers and arrays) to obtain a statistical analysis of the results [50]. The multivariate normality of the residuals (differences between systems) was tested using the Henze-Zirkler's method, which failed to reject the null hypothesis (normal) for all tests, except for the overall quality task with vocals. The within factors of the two-way RMANOVAs were system (i.e. array or beamformer excluding reference and anchors) and instrument. The results from repeated tests were averaged before the analysis, as repeat was not a significant factor when included.
The results of the RMANOVAs for all tests are shown in Table 1 in terms of the F-statistic with significant factors in bold (p < 0.05). All tests showed significant deviation from sphericity using Mauchly's test and the Huynh-Feldt correction was applied [50]. RMANOVAs show significant differences within beamformers and arrays for all attributes. Moreover, the three levels are significant at least for one attribute in each comparison. Thus, post-hoc comparisons are performed to investigate the differences between the scores of SDB and the other beamformers and between the scores of C-RC and the other arrays as both performed best overall, and since comparisons of all conditions are discouraged [50]. Hochberg's sequentially acceptive step-up Bonferroni procedure was applied to control Type I error [50]. These t-test comparisons are described in the following and tabulated in [46].
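For readers unfamiliar with the step-up procedure, the following minimal sketch shows how Hochberg's method decides which hypotheses to reject. The function name `hochberg` and the p-values are hypothetical and do not correspond to the comparisons reported in this study.

```python
def hochberg(p_values, alpha=0.05):
    """Hochberg step-up procedure.

    Sort the p-values in ascending order, p(1) <= ... <= p(m), find the
    largest rank k with p(k) <= alpha / (m - k + 1), and reject the
    hypotheses with the k smallest p-values.
    Returns a list of booleans in the original order (True = rejected).
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending p
    reject = [False] * m
    for rank in range(m, 0, -1):          # start from the largest p-value
        index = order[rank - 1]
        if p_values[index] <= alpha / (m - rank + 1):
            for j in order[:rank]:        # reject this and all smaller p-values
                reject[j] = True
            break
    return reject


# Hypothetical p-values from post-hoc paired comparisons.
print(hochberg([0.010, 0.040, 0.030, 0.200]))
print(hochberg([0.040, 0.030, 0.045]))
```

In the second call all three hypotheses are rejected because the largest p-value already falls below the significance level, which is where the step-up procedure is less conservative than a plain Bonferroni correction.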

Results
The listening test results in terms of means and 95% confidence intervals (CIs) for the two comparisons and three tasks are shown in Fig. 7, and analyzed below.

Beamformer comparison
The scores of the different beamformers for the target quality task are shown in Fig. 7(a). For drums, SDB achieves the highest target score of 88, being significantly higher than those for all other methods. This is because SDB achieves a flat response compared to DSB's high frequency boost from the baffle scattering and LSB's inherent high pass filter from regularization, as shown in Fig. 8 (left). In fact, LSB's mean scores drop from 70 to 42 when increasing the order from 1 to 8, as a result of the higher low frequency roll-off at the look direction. On the other hand, for vocals similar target quality scores are seen with DSB, SDB, LSB-N1 and LSB-N4 with the latter having the highest yet not significant mean of 85. This more similar subjective performance is probably due to the reduced frequency range of the vocals (shaded area in Fig. 8 (left)), where the response difference among these beamformers is deemphasized.
The results in terms of the interference rejection are shown in Fig. 7(b). LSB-N8 and SDB perform best with statistically higher scores than for all other methods. This is because they achieve the highest attenuation at the interfering instruments as shown in Fig. 8 (right). Despite SDB's more frequency-dependent response, both beamformers achieve similar interference suppression scores. On the other hand, LSB-N1 achieves the lowest scores due to its flat 1.9 dB attenuation, followed by DSB, given its omnidirectional and comb filtering responses at low and high frequencies, respectively.
Fig. 7(c) shows the overall quality scores. For drums SDB achieves significantly higher scores than all other methods, confirming its higher combined performance from each of the previous tasks. LSB-N8 is significantly worse than LSB-N4, indicating that the target quality degradation seen in Fig. 7(a) becomes important in the overall score too. For vocals, SDB, LSB-N4 and LSB-N8 obtain very similar values with means 66-67, suggesting that the reduced vocal range flattens the differences across beamformers, as seen for the target quality.

Array comparison
The array comparison is shown in Fig. 7(d-f). C-RC achieves the highest scores for all attributes and instruments, yet not necessarily significant in all cases. For the target quality (Fig. 7(d)), C-RC is significantly higher than all other arrays for drums. For vocals C-RC is only significantly higher than R, since the differences in array responses reduce within the narrower vocal range.
In terms of the interference rejection (Fig. 7(e)), C-RC is significantly better than the other three arrays for drums. The scores for the linear array are exceptionally low with a mean of 26 as a result of its reduced performance when steered to a direction other than endfire, resulting in a mirrored mainlobe with respect to the endfire direction (i.e. -45° in this case). This results in very poor attenuation of the bass guitar located at -60°. For vocals the four arrays perform similarly, including L since the vocals are located at the endfire direction.
The overall score (Fig. 7(f)) for C-RC is significantly higher than those for L and R but not S-RS with drums, and only significantly higher than that for L for vocals.
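The mirrored mainlobe of the linear array follows from its geometry: with free-field omnidirectional sensors on a line, the inter-sensor phase depends only on the cosine of the arrival angle, so directions at +φ and -φ from endfire cannot be distinguished. The sketch below illustrates this with a simple delay-and-sum response rather than the LSB used in the listening test. The function name, the number of microphones, the spacing and the frequency are hypothetical and do not describe the array evaluated in this study.

```python
import cmath
import math


def linear_array_response(look_deg, arrival_deg, num_mics=8, spacing=0.02,
                          freq=2000.0, speed_of_sound=343.0):
    """Normalized delay-and-sum magnitude response of a uniform linear array
    lying on the 0-degree axis (endfire at 0 degrees), for a far-field
    plane wave and free-field omnidirectional sensors."""
    wavenumber = 2 * math.pi * freq / speed_of_sound
    look = math.cos(math.radians(look_deg))
    arrival = math.cos(math.radians(arrival_deg))
    total = sum(cmath.exp(1j * wavenumber * m * spacing * (arrival - look))
                for m in range(num_mics))
    return abs(total) / num_mics


# Steer to the drums at 45 degrees and inspect the mirrored direction and
# the bass direction at -60 degrees.
for angle in (45, -45, -60):
    print(angle, round(linear_array_response(45, angle), 3))
```

The response at -45° equals the response at the look direction, and for these assumed parameters the bass direction at -60° lies inside the mirrored lobe, which is consistent with the poor interference scores reported for L with drums.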

3AFC test
The MUSHRA test revealed higher mean scores for C-RC in all tests. However, some of these differences could not be shown to be statistically significant with the vocals excerpt for both target quality and interference. In order to show whether C-RC consistently achieves higher scores than R and S-RS, a three-alternative forced choice (3AFC) test was designed. L was discarded due to its notable performance drop when steered to off-axis directions, since off-axis steering is essential in multi-source capture with array beamforming.
The 3AFC test consisted of a clean reference and three stimuli corresponding to the beamformed signals from R, C-RC and S-RS, with LSB-N4. The two tasks were to select a single stimulus that resulted in 1) highest quality and 2) least interference with respect to the reference. Since the performance of these different arrays with frequency-invariant LSB beampatterns is mainly related to the onset and aliasing frequencies, wideband signals are required. Thus, the quality of the target sound was evaluated for drums. For the interference task the drums acted as one of the interfering instruments, with the target instrument being bass or guitar. To generalize the results to multiple setups, different combinations of the angles in Fig. 6 were considered for all instruments: 5 for the quality task and 3 for each instrument in the interference task. To account for intra-participant agreement, each trial was repeated 3 times, resulting in 15 and 18 trials for the quality and interference tasks, respectively.
14 participants with formal critical listening training conducted the experiment. All of them were selected for the analysis since their mean normalized mode frequency was above 2/3 (1/3 implies random scoring and 1 corresponds to fully correlated scores). Fig. 9 shows the percentage of votes for each array for the quality and interference tasks. C-RC clearly outperforms the other two arrays in both tasks with 64% and 69% of votes. To determine whether this result is statistically significant, binomial distributions of the probability of selecting any array by chance ($p_0 = 1/3$) were implemented with $t = 15 \times 14 = 210$ and $t = 18 \times 14 = 252$ trials for the quality and interference tasks, respectively. The critical value $c$ of this binomial chance probability is calculated from the cumulative distribution function $F$ as the smallest number of votes satisfying $1 - F(c - 1; t, p_0) \leq \alpha$, where $\alpha = 0.05$ is the significance level. Since the percentage of votes from C-RC exceeds these critical values for both tasks as shown in Fig. 9, C-RC's higher quality and interference rejection is statistically significant. Moreover, since the 95% CI of the votes from C-RC does not overlap with the chance critical region, these results can be said to extrapolate to a larger population. Hence, the 3AFC test shows that C-RC achieves statistically significantly higher quality and interference rejection than R and S-RS.
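The chance-level test can be sketched with the trial counts and vote shares quoted above. This is an illustrative calculation rather than the authors' code: the function names are hypothetical, and the critical value is taken here as the smallest vote count whose upper-tail probability under chance does not exceed the significance level, which is an assumed (one-sided) convention.

```python
from math import comb


def binomial_tail(c, t, p0):
    """P(X >= c) for X ~ Binomial(t, p0)."""
    return sum(comb(t, k) * p0**k * (1 - p0)**(t - k) for k in range(c, t + 1))


def critical_votes(t, p0=1/3, alpha=0.05):
    """Smallest vote count c such that P(X >= c) <= alpha under chance."""
    for c in range(t + 2):
        if binomial_tail(c, t, p0) <= alpha:
            return c


# Trials per task (stimulus presentations times participants) and the
# reported share of votes for C-RC.
for task, t, share in [("quality", 15 * 14, 0.64), ("interference", 18 * 14, 0.69)]:
    c = critical_votes(t)
    print(f"{task}: critical share {c / t:.3f}, C-RC share {share:.2f}, "
          f"significant: {share * t >= c}")
```

Under this convention the critical share lies only a few percentage points above the 1/3 chance level for both tasks, far below the shares obtained by C-RC.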

DISCUSSION
The results from the physical and perceptual evaluations have shown evidence of the higher performance of the baffled circular arrays over the alternative array geometries considered. These results are discussed in the context of the desired properties of captured objects, although they also extrapolate to other beamforming applications.
One of the most important requirements for multi-source 2D capture is to synthesize a beampattern that is independent of the steering azimuth. This is achieved by all arrays considered here except for L, whose mirrored response when steered off the endfire direction showed a significant drop in perceptual interference attenuation, making it inadequate for this application.
Another very important aspect in object capture is frequency range, since audio objects may include wideband signals such as music. Baffled circular arrays achieve the widest perceptually relevant bandwidth with the lowest onset frequency and the second highest aliasing frequency. This explains C-RC's significantly higher quality scores with drums when synthesizing a 4th-order hypercardioid pattern in both MUSHRA and 3AFC tests. On the other hand, R achieves the narrowest bandwidth, L has the highest onset frequency and S-RS has a similar onset frequency to that of C-RC yet a lower aliasing frequency. The lowest quality scores achieved by R, followed by L, suggest that they are penalized by their bass drop, whereas S-RS performs better than R and L but worse than C-RC, probably due to its lower upper frequency limit.
On the other hand, for vocals, which are not as wideband, the results for target quality and interference across arrays become much more similar. However, the 3AFC test shows the significantly higher interference suppression of C-RC evaluated with bass and guitar as target instruments, and over different relative instrument positions. This indicates that even though the differences in target quality and interference across arrays are not fully exploited with bandlimited target signals, the extended frequency range of the baffled circular array may become important to attenuate low frequencies and/or aliasing effects that may be audible from the other arrays in presence of interfering wideband signals like drums.
Since the performance of the captured object also depends on the beamformer, a perceptual evaluation of different beamformers was conducted. The quality of the target sound is one of the most important aspects of object capture, with SDB achieving excellent quality, due to its distortionless constraint, compared to DSB's good quality as a result of its high frequency boost from C-RC's baffle scattering. LSB's quality degrades as the order increases due to the higher low frequency roll-off from its regularized response. However, this could be compensated through equalization at the look direction.
The ability to suppress other sources is important for object capture, where SDB and LSB-N8 perform best with good attenuation. This indicates that listeners mainly consider the overall level difference, compared to LSB-N4's lower yet more frequency-consistent AC. However, the equivalent interference scores from SDB (which can be regarded as N = 16) and LSB-N8 suggest that increasing the order beyond N = 8 with a robustness constraint may not lead to greater perceptual attenuation.
The overall quality is highest for SDB with drums, followed by LSB-N4 and LSB-N8, showing that the low frequency roll-off from high-ordered LSB becomes detrimental in the overall quality too. On the other hand, the same overall performance is seen for these three beamformers with vocals, suggesting that LSB's high pass filter is not as important for such band-limited signals.
For future work, target patterns other than hypercardioid, and other beamformers, may be explored to maximize the signal-to-interferer ratio for isolating the target object. The perceptual properties of these arrays may also be investigated for capturing performances in reverberant conditions.

CONCLUSION
This study evaluated the performance of uniform microphone array designs with the same number of microphones and maximum array size for object capture with beamforming in 2D. Simulation results show that baffled circular arrays performed best in terms of physical measures, including resolution, directivity, robustness and perceptually relevant frequency range, compared to alternative geometries. Listening tests were conducted to perceptually evaluate the performance of arrays and beamformers on simulated music performance recordings. The cylindrical array showed higher overall quality than linear, rectangular and baffled spherical arrays for a 4th-order hypercardioid LSB, yet not always significantly, especially for vocals. However, the cylindrical array showed statistically significantly higher quality of target sound and interference suppression than all other arrays in the presence of wideband signals, as confirmed by the 3AFC test. Hence, these conclusions quantitatively motivate the use of baffled circular arrays for practical capture and separation of horizontal sources. In terms of beamformers, perceptual scores for target quality and interference suppression agreed well with beamformer on-axis and off-axis responses, respectively, with SDB achieving higher overall quality than LSB for wideband signals, potentially due to LSB's regularized high-pass response.

ACKNOWLEDGMENT
This work was supported by the EPSRC Programme Grant S3A: Future Spatial Audio for an Immersive Listener Experience at Home (EP/L000539/1) and the BBC as part of the BBC Audio Research Partnership.