AES Warsaw 2015
Paper Session P11

P11 - (Lecture) Sound Localization and Separation

Saturday, May 9, 09:00 — 12:00 (Room: Belweder)

Chair:
Christof Faller, Illusonic GmbH - Zurich, Switzerland; EPFL - Lausanne, Switzerland

P11-1 Classification of Spatial Audio Location and Content Using Convolutional Neural Networks—Toni Hirvonen, Dolby Laboratories - Stockholm, Sweden
This paper investigates the use of Convolutional Neural Networks for spatial audio classification. In contrast to traditional methods that use hand-engineered features and algorithms, we show that a Convolutional Network in combination with generic preprocessing can give good results and allows for specialization to challenging conditions. The method can adapt to e.g. different source distances and microphone arrays, as well as estimate both spatial location and audio content type jointly. For example, with typical single-source material in a simulated reverberant room, we can achieve cross-validation accuracy of 94.3% for 40-ms frames across 16 classes (eight spatial directions, content type speech vs. music).
Convention Paper 9294

P11-2 A Theoretical Analysis of Sound Localization, with Application to Amplitude Panning—Dylan Menzies, University of Southampton - Southampton, UK; Filippo Maria Fazi, University of Southampton - Southampton, Hampshire, UK
Below 700 Hz, sound fields can be approximated well over a region of space that encloses the human head, using the acoustic pressure and gradient. With this representation, convenient expressions are found for the resulting Interaural Time Difference (ITD) and Interaural Level Difference (ILD). This formulation facilitates the investigation of various head-related phenomena of natural and synthesized fields. As an example, perceived image direction is related to head direction and the sound field description. This result is then applied to a general amplitude panning system and can be used to create images that are stable with respect to head direction.
Convention Paper 9295

P11-3 Audio Object Separation Using Microphone Array Beamforming—Philip Coleman, University of Surrey - Guildford, Surrey, UK; Philip Jackson, University of Surrey - Guildford, Surrey, UK; Jon Francombe, University of Surrey - Guildford, Surrey, UK
Audio production is moving toward an object-based approach, where content is represented as audio together with metadata that describe the sound scene. From current object definitions, it would usually be expected that the audio portion of the object is free from interfering sources. This poses a potential problem for object-based capture, if microphones cannot be placed close to a source. This paper investigates the application of microphone array beamforming to separate a mixture into distinct audio objects. Real mixtures recorded by a 48-channel microphone array in reflective rooms were separated, and the results were evaluated using perceptual models in addition to physical measures based on the beam pattern. The effect of interfering objects was reduced by applying the beamforming techniques.
Convention Paper 9296

P11-4 Limits of Speech Source Localization in Acoustic Wireless Sensor Networks—David Ayllón, University of Alcalá - Alcalá de Henares, Spain; Roberto Gil-Pita, University of Alcalá - Alcalá de Henares, Madrid, Spain; Manuel Rosa-Zurera, University of Alcalá - Alcalá de Henares, Madrid, Spain; Guillermo Ramos-Auñón, University of Alcalá - Alcalá de Henares, Madrid, Spain
Acoustic Wireless Sensor Networks (AWSN) have become very popular in recent years due to the sharp increase in the number of wireless nodes with microphones and computational capability. In such networks, accurate knowledge of sensor node locations is often not available, but this information is crucial for processing the collected data with array processing techniques. In this paper we treat the error in the estimated node positions as a traditional microphone mismatch with large values, and we perform a detailed study of the effect that a large microphone mismatch has on the accuracy of TDOA-based source localization techniques.
Convention Paper 9297
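
To illustrate why node position errors matter for TDOA-based localization, the following is a minimal sketch, not the authors' analysis. It computes the time difference of arrival for one source and two microphones, first with the true position of the second microphone and then with a mistaken one. The geometry, the 0.5 m position error, and the speed of sound of 343 m/s are all assumptions chosen for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def tdoa(source, mic_a, mic_b, c=SPEED_OF_SOUND):
    """Time difference of arrival in seconds between mic_a and mic_b."""
    return (math.dist(source, mic_a) - math.dist(source, mic_b)) / c


# Hypothetical 2-D geometry in meters.
source = (3.0, 4.0)
mic_a = (0.0, 0.0)
mic_b_true = (6.0, 0.0)      # where the node really is
mic_b_assumed = (6.5, 0.0)   # where the network believes it is (0.5 m off)

true_tdoa = tdoa(source, mic_a, mic_b_true)
assumed_tdoa = tdoa(source, mic_a, mic_b_assumed)

# Difference between the TDOA the geometry model predicts and the true one.
tdoa_error = assumed_tdoa - true_tdoa
print(f"true TDOA:    {true_tdoa * 1e3:.3f} ms")
print(f"assumed TDOA: {assumed_tdoa * 1e3:.3f} ms")
print(f"model error:  {tdoa_error * 1e3:.3f} ms")
```

In this made-up geometry the source is equidistant from both true microphone positions, so the true TDOA is zero, while the mistaken position shifts the predicted TDOA by roughly 0.9 ms. A localizer that relies on the assumed geometry would therefore misplace the source.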

P11-5 Improving Speech Mixture Synchronization in Blind Source Separation Problems—Cosme Llerena-Aguilar, Sr., University of Alcalá - Alcala de Henares (Madrid), Spain; Guillermo Ramos-Auñón, University of Alcalá - Alcalá de Henares, Madrid, Spain; Francisco J. Llerena-Aguilar, University of Alcalá - Alcalá de Henares, Madrid, Spain; Héctor A. Sánchez-Hevia, University of Alcala - Alcalá de Henares, Madrid, Spain; Manuel Rosa-Zurera, University of Alcalá - Alcalá de Henares, Madrid, Spain
The use of wireless acoustic sensor networks carries many advantages in the speech separation framework. Since nodes are separated by distances greater than a few centimeters, they can cover rooms completely, although these larger distances introduce certain problems. For instance, large time differences of arrival between the speech mixtures captured at the different microphones can appear, affecting the performance of classical sound separation algorithms. One solution consists of synchronizing the speech mixtures captured at the microphones. Following this idea, we put forward in this paper a new time delay estimation method for synchronizing speech mixtures that outperforms classical methods. The results obtained show the feasibility of using our proposal to synchronize speech mixtures.
Convention Paper 9298
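
The idea of synchronizing mixtures by estimating a time delay can be sketched with a plain cross-correlation peak search, which is one of the classical approaches and not the new method proposed in the paper. The function names `estimate_delay` and `align` and the toy signals are hypothetical, chosen only for this illustration.

```python
def estimate_delay(x, y, max_lag):
    """Return the lag in samples that maximizes the cross-correlation
    sum(x[n] * y[n + lag]). A positive lag means y is a delayed copy of x."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, xn in enumerate(x):
            m = n + lag
            if 0 <= m < len(y):
                score += xn * y[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align(y, lag):
    """Shift y by -lag samples with zero padding so it lines up with x."""
    if lag >= 0:
        return y[lag:] + [0] * lag
    return [0] * (-lag) + y[:lag]


# Toy signals: y is x delayed by 3 samples.
x = [0, 0, 1, 3, -2, 1, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 1, 3, -2, 1, 0]

lag = estimate_delay(x, y, max_lag=5)
y_synced = align(y, lag)
print(lag, y_synced)
```

The brute-force search is quadratic in the signal length and is kept only for readability. Practical systems typically compute the correlation with FFTs and often apply a weighting such as PHAT.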

P11-6 Direction of Arrival Estimation of Multiple Sound Sources Based on Frequency-Domain Minimum Variance Distortionless Response Beamforming—Seung Woo Yu, Gwangju Institute of Science and Technology - Gwangju, Korea; Kwang Myung Jeon, Gwangju Institute of Science and Technology (GIST) - Gwangju, Korea; Dong Yun Lee, Gwangju Institute of Science and Technology (GIST) - Gwangju, Korea; Hong Kook Kim, Gwangju Institute of Science and Tech (GIST) - Gwangju, Korea; City University of New York - New York, NY, USA
In this paper a method for estimating the directions of arrival (DOAs) of multiple non-stationary sound sources is proposed on the basis of a frequency-domain minimum variance distortionless response (FD-MVDR) beamformer. First, an FD-MVDR beamformer is applied to multiple sound sources, where the beamformer weights are updated according to the surrounding environment to reduce the sidelobe effect of the beamformer. Then, multistage DOA estimation is performed to reduce the computational complexity of the beam search. Finally, a median filter is applied to improve the DOA estimation accuracy. It is demonstrated that the average DOA estimation error of the proposed method is smaller than those of methods based on conventional GCC-PHAT, MVDR-PHAT, and FD-MVDR, and that its computational complexity is lower than that of the conventional FD-MVDR-based DOA estimation method.
Convention Paper 9299
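
The final median-filtering step mentioned in the abstract can be sketched as a sliding median over frame-wise DOA estimates. This is a generic illustration under assumptions, not the paper's implementation: the window length of 3, the shrinking window at the edges, and the example track are all made up, and angle wrap-around at ±180° is ignored.

```python
from statistics import median


def median_filter(values, window=3):
    """Sliding median with an odd window; the window shrinks at the edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(median(values[lo:hi]))
    return out


# Hypothetical frame-wise DOA estimates in degrees, with two outliers.
doa_track = [30, 31, 90, 30, 29, -45, 30, 31]
smoothed = median_filter(doa_track, window=3)
print(smoothed)
```

Isolated outliers such as the 90° and -45° frames are replaced by values close to their neighbors, while the slowly varying part of the track is left largely unchanged.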



Audio Engineering Society
