AES London 2010
Semantic Audio Event Details
Saturday, May 22, 14:00 — 15:30
(Room C6)
T1 - Do-it-Yourself Semantic Audio
Presenter:
Jörn Loviscach, Fachhochschule Bielefeld (University of Applied Sciences) - Bielefeld, Germany
Abstract:
Content-based music information retrieval (MIR) and similar applications require advanced algorithms that often overburden non-expert developers. However, many building blocks are available, mostly for free, to significantly ease software development, for instance of similarity search methods, or to serve as components for ad-hoc solutions, for instance in forensics or linguistics. This tutorial looks into software libraries/frameworks (e.g., MARSYAS and CLAM), toolboxes (e.g., MIRtoolbox), Web-based services (e.g., Echo Nest), and stand-alone software (e.g., Sonic Visualiser) that help with the extraction of audio features and/or execute basic machine learning algorithms. Focusing on solutions that require little to no programming in the classical sense, the tutorial's major part consists of live demos of hand-picked routes to roll one's own semantic audio application.
Saturday, May 22, 14:00 — 18:00 (Room C3)
P4 - Spatial Signal Processing
Chair: Francis Rumsey
P4-1 Classification of Time-Frequency Regions in Stereo Audio—Aki Härmä, Philips Research Europe - Eindhoven, The Netherlands
The paper is about classification of time-frequency (TF) regions in stereo audio data by the type of mixture the region represents. The detection of the type of mixing is necessary, for example, in source separation, upmixing, and audio manipulation applications. We propose a generic signal model and a method to classify the TF regions into six classes that are different combinations of central, panned, and uncorrelated sources. We give an overview of traditional techniques for comparing frequency-domain data and propose a new approach for classification that is based on measures specially trained for the six classes. The performance of the new measures is studied and demonstrated using synthetic and real audio data.
Convention Paper 7980
P4-2 A Comparison of Computational Precedence Models for Source Separation in Reverberant Environments—Christopher Hummersone, Russell Mason, Tim Brookes, University of Surrey - Guildford, UK
Reverberation continues to be problematic in many areas of audio and speech processing, including source separation. The precedence effect is an important psychoacoustic tool utilized by humans to assist in localization by suppressing reflections arising from room boundaries. Numerous computational precedence models have been developed over the years and all suggest quite different strategies for handling reverberation. However, relatively little work has been done on incorporating precedence into source separation. This paper details a study comparing several computational precedence models and their impact on the performance of a baseline separation algorithm. The models are tested in a range of reverberant rooms and with a range of other mixture parameters. Large differences in the performance of the models are observed. The results show that a model based on interaural coherence produces the greatest performance gain over the baseline algorithm.
Convention Paper 7981
P4-3 Converting Stereo Microphone Signals Directly to MPEG-Surround—Christophe Tournery, Christof Faller, Illusonic LLC - Lausanne, Switzerland; Fabian Kuech, Jürgen Herre, Fraunhofer Institute for Integrated Circuits IIS - Erlangen, Germany
We have previously proposed a way to use stereo microphones with spatial audio coding to record and code surround sound. In this paper we describe further considerations and improvements needed to convert stereo microphone signals directly to MPEG Surround, i.e., a downmix signal plus a bit stream. We describe in detail how to obtain from the microphone channels the information needed for computing MPEG Surround spatial parameters and how to process the microphone signals to transform them into an MPEG Surround compatible downmix.
Convention Paper 7982
P4-4 Modification of Spatial Information in Coincident-Pair Recordings—Jeremy Wells, University of York - York, UK
A novel method is presented for modifying the spatial information contained in the output from a stereo coincident pair of microphones. The purpose of this method is to provide additional decorrelation of the audio at the left and right replay channels for sound arriving at the sides of a coincident pair but to retain the imaging accuracy for sounds arriving to the front or rear or where the entire sound field is highly correlated. Details of how this is achieved are given and results for different types of sound field are presented.
Convention Paper 7983
P4-5 Unitary Matrix Design for Diffuse Jot Reverberators—Fritz Menzer, Christof Faller, Ecole Polytechnique Fédérale de Lausanne - Lausanne, Switzerland
This paper presents different methods for designing unitary mixing matrices for Jot reverberators with a particular emphasis on cases where no early reflections are to be modeled. Possible applications include diffuse sound reverberators and decorrelators. The trade-off between effective mixing between channels and the number of multiply operations per channel and output sample is investigated as well as the relationship between the sparseness of powers of the mixing matrix and the sparseness of the impulse response.
Convention Paper 7984
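As a rough illustration of the mixing-versus-multiplies trade-off mentioned in P4-5, the sketch below applies a Householder reflection, a well-known unitary feedback matrix for feedback delay networks. It is not taken from the paper; the function name `householder_mix` and the choice of an all-ones reflection vector are assumptions made for this example. The matrix H = I - (2/N)·11ᵀ mixes every channel into every other channel, yet it can be applied with a single sum and one scaling instead of N² multiplies.

```python
def householder_mix(channels):
    """Apply H = I - (2/N) * ones * ones^T to a list of channel samples.

    H is orthogonal (real unitary), so signal energy is preserved.
    Hypothetical helper for illustration; not the paper's design method.
    """
    n = len(channels)
    shared = 2.0 * sum(channels) / n  # one sum and one scaling, reused by all channels
    return [x - shared for x in channels]


x = [1.0, 2.0, 3.0, 4.0]
y = householder_mix(x)

# Energy before and after mixing is identical because H is unitary.
energy_in = sum(v * v for v in x)
energy_out = sum(v * v for v in y)
```

Because a Householder reflection is its own inverse, applying `householder_mix` twice returns the original samples.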
P4-6 Sound Field Indicators for Hearing Activity and Reverberation Time Estimation in Hearing Instruments—Andreas P. Streich, ETH Zurich - Zurich, Switzerland; Manuela Feilner, Alfred Stirnemann, Phonak AG - Stäfa, Switzerland; Joachim M. Buhmann, ETH Zurich - Zurich, Switzerland
Sound field indicators (SFI) are proposed as a new feature set to estimate the hearing activity and reverberation time in hearing instruments. SFIs are based on physical measurements of the sound field. A variant thereof, called SFI short-time statistics (SFIst2), is obtained by computing means and standard deviations of SFIs over 10 subframes. To show the utility of these feature sets for the mentioned prediction tasks, experiments are carried out on artificially reverberated recordings of a large variety of sounds encountered in daily life. In a classification scenario where the hearing activity is to be predicted, both SFI and SFIst2 yield clearly superior accuracy even compared to hand-tailored features used in state-of-the-art hearing instruments. For regression on the reverberation time, the SFI-based features yield a lower residual error than standard feature sets and reach the performance of specially designed features. The hearing activity classification is mainly based on the average of the SFIs, while the standard deviation over the subframes is used heavily to predict the reverberation time.
Convention Paper 7985
P4-7 Stereo-to-Binaural Conversion Using Interaural Coherence Matching—Fritz Menzer, Christof Faller, Ecole Polytechnique Fédérale de Lausanne - Lausanne, Switzerland
In this paper a method of converting stereo recordings to simulated binaural recordings is presented. The stereo signal is separated into coherent and diffuse sound based on the assumption that the signal comes from a coincident symmetric microphone setup. The coherent part is reproduced using HRTFs and the diffuse part is reproduced using filters adapting the interaural coherence to the interaural coherence a binaural recording of diffuse sound would have.
Convention Paper 7986
P4-8 Linear Simulation of Spaced Microphone Arrays Using B-Format Recordings—Andreas Walther, Christof Faller, Ecole Polytechnique Fédérale de Lausanne - Lausanne, Switzerland
A novel approach for linear post-processing of B-Format recordings is presented. The goal is to simulate spaced microphone arrays by approximating and virtually recording the sound field at the position of each single microphone. The delays occurring in non-coincident recordings are simulated by translating an approximative plane wave representation of the sound field to the positions of the microphones. The directional responses of the spaced microphones are approximated by linear combination of the corresponding translated B-format channels.
Convention Paper 7987
Saturday, May 22, 14:00 — 15:30 (Room C4-Foyer)
P6 - Audio Equipment and Emerging Technologies
P6-1 Study and Evaluation of MOSFET Rds(ON) Impedance Efficiency Losses in High Power Multilevel DCI-NPC Amplifiers—Vicent Sala, G. Ruiz, Luis Romeral, UPC-Universitat Politècnica de Catalunya - Terrassa, Spain
This paper justifies the usefulness of multilevel power amplifiers with DCI-NPC (Diode Clamped Inverter – Neutral Point Converter) topology in applications where size and weight must be optimized. These amplifiers can work at high frequencies, thereby reducing the size and weight of the filter elements. However, it is necessary to study, analyze, and evaluate the efficiency losses because this amplifier has double the number of switching elements. This paper models the behavior of the MOSFET Rds(ON) in a DCI-NPC topology for different conditions.
Convention Paper 7996
P6-2 Modeling Distortion Effects in Class-D Amplifier Filter Inductors—Arnold Knott, Tore Stegenborg-Andersen, Ole C. Thomsen, Technical University of Denmark - Lyngby, Denmark; Dominik Bortis, Johann W. Kolar, Swiss Federal Institute of Technology in Zurich - Zurich, Switzerland; Gerhard Pfaffinger, Harman/Becker Automotive Systems GmbH - Straubing, Germany; Michael A. E. Andersen, Technical University of Denmark - Lyngby, Denmark
Distortion is generally accepted as a quantifier to judge the quality of audio power amplifiers. In switch-mode power amplifiers various mechanisms influence this performance measure. After giving an overview of those, this paper focuses on the particular effect of the nonlinearity of the output filter components on the audio performance. While the physical reasons for both the capacitor-induced and the inductor-induced distortion are given, the practical in-depth demonstration is done for the inductor only. This includes measuring the inductor's performance, modeling it through fitting, and deriving simulation models. The fitted models achieve distortion values between 0.03% and 0.20% as a basis to enable the design of a 200 W amplifier.
Convention Paper 7997
P6-3 Multilevel DCI-NPC Power Amplifier High-Frequency Distortion Analysis through Parasitic Inductance Dynamic Model—Vicent Sala, G. Ruiz, E. López, Luis Romeral, UPC-Universitat Politècnica de Catalunya - Terrassa, Spain
The high-frequency distortion sources in DCI-NPC (Diode Clamped Inverter – Neutral Point Converter) amplifier topologies are studied and analyzed. The need is justified for a model that contains the different parasitic inductive circuits that this kind of amplifier presents dynamically, as a function of the combination of its active transistors. By means of a proposed pattern layout we present a dynamic model of the parasitic inductances of the full-bridge DCI-NPC amplifier, and this model is used to propose some simple rules for the optimal design of layouts for these types of amplifiers. Simulation and experimental results are presented to support the proposed model and the statements and recommendations given in this paper.
Convention Paper 7998
P6-4 How Much Gain Should a Professional Microphone Preamplifier Have?—Douglas McKinnie, Middle Tennessee State University - Murfreesboro, TN, USA
Many tradeoffs are required in the design of microphone preamplifier circuits. Characteristics such as noise figure, stability, bandwidth, and complexity may be dependent upon the gain of the design. Three factors determine the gain required from a microphone preamp: sound-pressure level of the sound source, distance of the microphone from that sound source (within the critical distance), and sensitivity of the microphone. This paper is an effort to find a probability distribution of the gain settings used with professional microphones. This is done by finding the distribution of max SPL in real use and by finding the sensitivity of the most commonly used current and classic microphones.
Convention Paper 7999
P6-5 Equalizing Force Contributions in Transducers with Partitioned Electrode—Libor Husník, Czech Technical University in Prague - Prague, Czech Republic
A partitioned electrode in an electrostatic transducer offers, among other things, the possibility of building a transducer with direct D/A conversion. Nevertheless, partitioned electrodes whose sizes are proportional to powers of 2 or to terms of another convenient series do not have the corresponding force action on the membrane. The reason is that the membrane does not vibrate in a piston-like mode, and electrode parts close to the membrane periphery do not excite membrane vibrations in the same way as the elements near the center. The aim of this paper is to suggest equalization of force contributions from different partitioned electrodes by varying their sizes. Principles presented here can also be used for other membrane-electrode arrangements.
Convention Paper 8000
P6-6 Low-End Device to Convert EEG Waves to MIDI—Adrian Attard Trevisan, St. Martins Institute of Information Technology - Hamrun, Malta; Lewis Jones, London Metropolitan University - London, UK
This research provides a simple and portable system that generates MIDI output from data collected through an EEG device. The context is beneficial in many ways, as the therapeutic effects of listening to music created from brain waves have been documented in many cases of treating health problems. The approach is influenced by the interface described in the article “Brain-Computer Music Interface for Composition and Performance” by Eduardo Reck Miranda, where different frequency bands trigger corresponding piano notes and the complexity of the signal represents the tempo of the sound. The correspondence between the sound and the notes has been established through experimental work, in which data from participants in a test group were gathered and analyzed to set brain-frequency intervals for different notes. The study is an active contribution to the field of neurofeedback, providing criteria tools for assessment.
Convention Paper 8001
P6-7 Implementation and Development of Interfaces for Music Performance through Analysis of Improvised Dance Movements—Richard Hoadley, Anglia Ruskin University - Cambridge, UK
Electronic music, even when designed to be interactive, can lack performance interest and is frequently musically unsophisticated. This is unfortunate because there are many aspects of electronic music that can be interesting, elegant, demonstrative, and musically informative. The use of dancers to interact with prototypical interfaces comprising clusters of sensors generating music algorithmically provides a method of investigating human actions in this environment. This is achieved through collaborative work involving software and hardware designers, composers, sculptors, and choreographers who examine aesthetically and practically the interstices of these disciplines. This paper investigates these interstices.
Convention Paper 8002
P6-8 Violence Prediction through Emotional Speech—José Higueras-Soler, Roberto Gil-Pita, Enrique Alexandre, Manuel Rosa-Zurera, Universidad de Alcalá - Alcalá de Henares, Madrid, Spain
Preventing violence is an absolute necessity in our society. Whether in homes with a particular risk of domestic violence or in prisons or schools, there is a need for systems capable of detecting risk situations for preventive purposes. One of the most important factors that precede a violent situation is an emotional state of anger. In this paper we discuss the features that must be provided to decision makers dedicated to the detection of emotional states of anger from speech signals. For this purpose, we present a set of experiments and results with the aim of studying the combination of features extracted from the literature and their effects on the detection performance (relationship between probability of detection of anger and probability of false alarm) of a neural network and a least-squares linear detector.
Convention Paper 8003
P6-9 FoleySonic: Placing Sounds on a Timeline through Gestures—David Black, Kristian Gohlke, University of Applied Sciences, Bremen - Bremen, Germany; Jörn Loviscach, University of Applied Sciences, Bielefeld - Bielefeld, Germany
The task of sound placement on video timelines is usually a time-consuming process that requires the sound designer or foley artist to carefully calibrate the position and length of each sound sample. For novice and home video producers, friendlier and more entertaining input methods are needed. We demonstrate a novel approach that harnesses the motion-sensing capabilities of readily available input devices, such as the Nintendo Wii Remote or modern smart phones, to provide intuitive and fluid arrangement of samples on a timeline. Users can watch a video while simultaneously adding sound effects, providing a near real-time workflow. The system leverages the user’s motor skills for enhanced expressiveness and provides a satisfying experience while accelerating the process.
Convention Paper 8004
P6-10 A Computer-Aided Audio Effect Setup Procedure for Untrained Users—Sebastian Heise, Michael Hlatky, Hochschule Bremen (University of Applied Sciences) - Bremen, Germany; Jörn Loviscach, Fachhochschule Bielefeld (University of Applied Sciences) - Bielefeld, Germany
The number of parameters of modern audio effects easily ranges in the dozens. Expert knowledge is required to understand which parameter change results in a desired effect. Yet, such sound processors are also making their way into consumer products, where they overburden most users. Hence, we propose a procedure to achieve a desired effect without technical expertise based on a black-box genetic optimization strategy: Users are only confronted with a series of comparisons of two processed examples. Learning from the users’ choices, our software optimizes the parameter settings. We conducted a study on hearing-impaired persons without expert knowledge, who used the system to adjust a third-octave equalizer and a multiband compressor to improve the intelligibility of a TV set.
Convention Paper 8005
Sunday, May 23, 09:00 — 13:00 (Room C3)
P8 - Music Analysis and Processing
Chair: David Malham, University of York - York, UK
P8-1 Automatic Detection of Audio Effects in Guitar and Bass Recordings—Michael Stein, Jakob Abeßer, Christian Dittmar, Fraunhofer Institute for Digital Media Technology IDMT - Ilmenau, Germany; Gerald Schuller, Ilmenau University of Technology - Ilmenau, Germany
This paper presents a novel method to detect and distinguish 10 frequently used audio effects in recordings of electric guitar and bass. It is based on spectral analysis of audio segments located in the sustain part of previously detected guitar tones. Overall, 541 spectral, cepstral and harmonic features are extracted from short time spectra of the audio segments. Support Vector Machines are used in combination with feature selection and transform techniques for automatic classification based on the extracted feature vectors. With correct classification rates up to 100% for the detection of single effects and 98% for the simultaneous distinction of 10 different effects, the method has successfully proven its capability—performing on isolated sounds as well as on multitimbral, stereophonic musical recordings.
Convention Paper 8013
P8-2 Time Domain Emulation of the Clavinet—Stefan Bilbao, University of Edinburgh - Edinburgh, UK; Matthias Rath, Technische Universität Berlin - Berlin, Germany
The simulation of classic electromechanical musical instruments and audio effects has seen a great deal of activity in recent years, due in part to great recent increases in computing power. It is now possible to perform full emulations of relatively complex musical instruments in real time, or near real time. In this paper time domain finite difference schemes are applied to the emulation of the Hohner Clavinet, an electromechanical stringed instrument exhibiting special features such as sustained hammer/string contact, pinning of the string to a metal stop, and a distributed damping mechanism. Various issues, including numerical stability, implementation details, and computational cost will be discussed. Simulation results and sound examples will be presented.
Convention Paper 8014
P8-3 Polyphony Number Estimator for Piano Recordings Using Different Spectral Patterns—Ana M. Barbancho, Isabel Barbancho, Javier Fernandez, Lorenzo J. Tardón, Universidad de Málaga - Málaga, Spain
One of the main tasks of a polyphonic transcription system is the estimation of the number of voices, i.e., the polyphony number. Although the correct estimation of this parameter is very important for polyphonic transcription systems, this task has not been discussed in depth in the known transcription systems. The aim of this paper is to propose a novel estimation method of the polyphony number for piano recordings. This new method is based on the use of two different types of spectral patterns: single-note patterns and composed-note patterns. The usage of composed-note patterns in the estimation of the polyphony number and in the polyphonic detection process has not been previously reported in the literature.
Convention Paper 8015
P8-4 String Ensemble Vibrato: A Spectroscopic Study—Stijn Mattheij, AVANS University - Breda, The Netherlands
A systematic observation of the presence of ensemble vibrato on early twentieth century recordings of orchestral works has been carried out by studying spectral line shapes of individual musical notes. Broadening of line shapes was detected in recordings of Beethoven’s Fifth Symphony and Brahms’s Hungarian Dance no. 5; this effect was attributed to ensemble vibrato. From these observations it may be concluded that string ensemble vibrato was common practice in orchestras from the continent throughout the twentieth century. British orchestras did not use much vibrato before 1940.
Convention Paper 8016
P8-5 Influence of Psychoacoustic Roughness on Musical Intonation Preference—Julián Villegas, Michael Cohen, Ian Wilson, University of Aizu - Aizu, Japan; William Martens, University of Sydney - Sydney, NSW, Australia
An experiment to compare the acceptability of three different music fragments rendered with three different intonations is presented. These preference results were contrasted with those of isolated chords also rendered with the same three intonations. The least rough renditions were found to be those using Twelve-Tone Equal-Temperament (12-tet). Just Intonation (ji) renditions were the roughest. A negative correlation between preference and psychoacoustic roughness was also found.
Convention Paper 8017
P8-6 Music Emotion and Genre Recognition Toward New Affective Music Taxonomy—Jonghwa Kim, Lars Larsen, University of Augsburg - Augsburg, Germany
Exponentially increasing electronic music distribution creates a natural pressure for fine-grained musical metadata. On the basis of the fact that a primary motive for listening to music is its emotional effect, diversion, and the memories it awakens, we propose a novel affective music taxonomy that combines the global music genre taxonomy, e.g., classical, jazz, rock/pop, and rap, with emotion categories such as joy, sadness, anger, and pleasure, in a complementary way. In this paper we deal with all essential stages of an automatic genre/emotion recognition system, i.e., from reasonable music data collection up to performance evaluation of various machine learning algorithms. In particular, a novel classification scheme called consecutive dichotomous decomposition tree (CDDT) is presented, which is specifically parameterized for multi-class classification problems with an extremely high number of classes, e.g., sixteen music categories in our case. The average recognition accuracy of 75% for the 16 music categories shows a realistic possibility of the affective music taxonomy we proposed.
Convention Paper 8018
P8-7 Perceptually-Motivated Audio Morphing: Warmth—Duncan Williams, Tim Brookes, University of Surrey - Guildford, UK
A system for morphing the warmth of a sound independently from its other timbral attributes was coded, building on previous work morphing brightness only, and morphing brightness and softness. The new warmth-softness-brightness morpher was perceptually validated using a series of listening tests. A multidimensional scaling analysis of listener responses to paired-comparisons showed perceptually orthogonal movement in two dimensions within a warmth-morphed and everything-else-morphed stimulus set. A verbal elicitation experiment showed that listeners’ descriptive labeling of these dimensions was as intended. A further “quality control” experiment provided evidence that no “hidden” timbral attributes were altered in parallel with the intended ones. A complete timbre morpher can now be considered for further work and evaluated using the tri-stage procedure documented here.
Convention Paper 8019
P8-8 A Novel Envelope-Based Generic Dynamic Range Compression Model—Adam Weisser, Oticon A/S - Smørum, Denmark
A mathematical model is presented that reproduces typical dynamic range compression when given the nominal input envelope of the signal and the compression constants. The model is derived geometrically in a qualitative approach and the governing differential equation for an arbitrary input and an arbitrary compressor is found. Step responses compare well to commercial compressors tested. The compression effect on speech using the general equation in its discrete version is also demonstrated. The model's applicability is especially appealing for hearing aids, where the input-output curve and time constants of the nonlinear instrument are frequently consulted and the qualitative theoretical effect of compression may be crucial for speech perception.
Convention Paper 8020
Sunday, May 23, 09:00 — 10:30 (Room C4-Foyer)
P10 - Audio Processing—Analysis and Synthesis of Sound
P10-1 Cellular Automata Sound Synthesis with an Extended Version of the Multitype Voter Model—Jaime Serquera, Eduardo R. Miranda, University of Plymouth - Plymouth, UK
In this paper we report on the synthesis of sounds with cellular automata (CA), specifically with an extended version of the multitype voter model (MVM). Our mapping process is based on DSP analysis of automata evolutions and consists of mapping histograms onto sound spectrograms. This mapping allows a flexible sound design process, but due to the non-deterministic nature of the MVM such a process acquires its maximum potential after the CA run is finished. Our extended version of the model presents a high degree of predictability and controllability, making the system suitable for an in-advance sound design process with all the advantages that this entails, such as real-time possibilities and performance applications. This research focuses on the synthesis of damped sounds.
Convention Paper 8029
P10-2 Stereophonic Rendering of Source Distance Using DWM-FDN Artificial Reverberators—Saul Maté-Cid, Hüseyin Hacihabiboglu, Zoran Cvetkovic, King's College London - London, UK
Artificial reverberators are used in audio recording and production to enhance the perception of spaciousness. It is well known that reverberation is a key factor in the perception of the distance of a sound source. The ratio of direct and reverberant energies is one of the most important distance cues. A stereophonic artificial reverberator is proposed that allows panning the perceived distance of a sound source. The proposed reverberator is based on feedback delay network (FDN) reverberators and uses a perceptual model of direct-to-reverberant (D/R) energy ratio to pan the source distance. The equivalence of FDNs and digital waveguide mesh (DWM) scattering matrices is exploited in order to devise a reverberator relevant in the room acoustics context.
Convention Paper 8030
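The direct-to-reverberant (D/R) energy ratio named in P10-2 as a distance cue can be sketched as follows. This is only the textbook energy ratio computed from an impulse response, not the perceptual model used in the paper. The function name `direct_to_reverberant_db`, the toy impulse response, and the choice of where to split direct sound from reverberation are all assumptions made for this example.

```python
import math


def direct_to_reverberant_db(impulse_response, split_index):
    """Return 10*log10(direct energy / reverberant energy) in dB.

    Samples before split_index are treated as direct sound, the rest as
    reverberation. The split point is an assumption of this sketch.
    """
    direct = sum(h * h for h in impulse_response[:split_index])
    reverberant = sum(h * h for h in impulse_response[split_index:])
    return 10.0 * math.log10(direct / reverberant)


# Toy impulse response: one direct sample followed by a short reverberant tail.
near = [1.0, 0.0, 0.5, 0.5, 0.5, 0.5]
# Halving the direct amplitude with the same tail mimics a more distant source.
far = [0.5, 0.0, 0.5, 0.5, 0.5, 0.5]

dr_near = direct_to_reverberant_db(near, 2)
dr_far = direct_to_reverberant_db(far, 2)
```

In this toy case the D/R ratio falls by about 6 dB when the direct amplitude is halved and the reverberant tail is left unchanged.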
P10-3 Separation of Music+Effects Sound Track from Several International Versions of the Same Movie—Antoine Liutkus, Télécom ParisTech - Paris, France; Pierre Leveau, Audionamix - Paris, France
This paper concerns the separation of the music+effects (ME) track from a movie soundtrack, given the observation of several international versions of the same movie. The approach chosen is strongly inspired by existing stereo audio source separation and especially by spatial filtering algorithms such as DUET that can extract a constant panned source from a mixture very efficiently. The problem is indeed similar, for we aim here at separating the ME track, which is the common background of all international versions of the movie soundtrack. The algorithm has been adapted to a number of channels greater than 2. Preprocessing techniques have also been proposed to adapt the algorithm to realistic cases. The performance of the algorithm has been evaluated on realistic and synthetic cases.
Convention Paper 8031
P10-4 A Differential Approach for the Implementation of Superdirective Loudspeaker Array—Jung-Woo Choi, Youngtae Kim, Sangchul Ko, Jungho Kim, SAIT, Samsung Electronics Co. Ltd. - Gyeonggi-do, Korea
A loudspeaker arrangement and corresponding analysis method to obtain a robust superdirective beam are proposed. The superdirectivity technique requires precise matching of the sound sources modeled to calculate excitation patterns and those used for the loudspeaker array. To resolve the robustness issue arising from the modeling mismatch error, we show that the overall sensitivity to the model-mismatch error can be reduced by rearranging loudspeaker positions. Specifically, a beam pattern obtained by a conventional optimization technique is represented as a product of robust delay-and-sum patterns and error-sensitive differential patterns. The excitation pattern driving the loudspeaker array is then reformulated such that the error-sensitive pattern is only applied to the outermost loudspeaker elements, and the array design that fits to the new excitation pattern is discussed.
Convention Paper 8032
P10-5 Improving the Performance of Pitch Estimators—Stephen J. Welburn, Mark D. Plumbley, Queen Mary University of London - London, UK
We are looking to use pitch estimators to provide an accurate high-resolution pitch track for resynthesis of musical audio. We found that current evaluation measures such as gross error rate (GER) are not suitable for algorithm selection. In this paper we examine the issues relating to evaluating pitch estimators and use these insights to improve performance of existing algorithms such as the well-known YIN pitch estimation algorithm.
Convention Paper 8033
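To make the gross error rate (GER) mentioned in P10-5 concrete, here is a minimal sketch. It assumes the common convention that a frame counts as a gross error when the estimate deviates from the reference by more than 20%. The threshold, the function name `gross_error_rate`, and the sample values are assumptions for this example, not details from the paper.

```python
def gross_error_rate(estimated_hz, reference_hz, tolerance=0.2):
    """Fraction of frames whose pitch estimate is off by more than
    `tolerance` (relative) from the reference pitch."""
    errors = sum(
        1
        for est, ref in zip(estimated_hz, reference_hz)
        if abs(est - ref) > tolerance * ref
    )
    return errors / len(reference_hz)


reference = [100.0, 200.0, 440.0, 440.0]
# Frame 1 is 1% sharp, frame 2 is an octave too high,
# frame 3 is about 1% sharp, frame 4 is an octave too low.
estimate = [101.0, 400.0, 445.0, 220.0]

ger = gross_error_rate(estimate, reference)
```

Here only the two octave errors are counted, so the 1% deviations are scored the same as exact estimates. A coarse threshold of this kind says little about how closely an estimator tracks pitch.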
P10-6 Reverberation Analysis via Response and Signal Statistics—Eleftheria Georganti, Thomas Zarouchas, John Mourjopoulos, University of Patras - Patras, Greece
This paper examines statistical quantities (i.e., kurtosis, skewness) of room transfer functions and audio signals (anechoic, reverberant, speech, music). Measurements are taken under various reverberation conditions in different real enclosures ranging from a small office to a large auditorium and for varying source–receiver positions. Here, the statistical properties of the room responses and signals are examined in the frequency domain. From these properties, the relationship between the spectral statistics of the room transfer function and the corresponding reverberant signal is derived.
Convention Paper 8034 (Purchase now)
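As a rough illustration of the two statistical quantities named in the abstract above, the following sketch computes the skewness and kurtosis of a list of values, for instance spectral magnitudes. The function names and the choice of population (biased) moment estimators are assumptions made for this example; the paper may define the quantities differently.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third standardized moment (0 for a symmetric distribution)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth standardized moment (3 for a Gaussian distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2
```

A symmetric set of values such as 1, 2, 3, 4, 5 has zero skewness, while a set dominated by a single large value has positive skewness.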
P10-7 An Investigation of Low-Level Signal Descriptors Characterizing the Noise-Like Nature of an Audio Signal—Christian Uhle, Fraunhofer Institute for Integrated Circuits IIS - Erlangen, Germany
This paper presents an overview and an evaluation of low-level features characterizing the noise-like or tone-like nature of an audio signal. Such features are widely used for content classification, segmentation, identification, coding of audio signals, blind source separation, speech enhancement, and voice activity detection. Besides the very prominent Spectral Flatness Measure, various alternative descriptors exist. These features are reviewed and the requirements for them are discussed. The features in scope are evaluated using synthetic signals and, by way of example, a real-world application related to audio content classification, namely voiced-unvoiced discrimination for speech signals and speech detection.
Convention Paper 8035
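The Spectral Flatness Measure mentioned in the abstract above is commonly defined as the ratio of the geometric mean to the arithmetic mean of a power spectrum. The sketch below follows that common definition; the function name, the `eps` guard, and the two toy spectra are assumptions made for illustration and are not taken from the paper.

```python
import math


def spectral_flatness(power_spectrum, eps=1e-12):
    """Ratio of geometric to arithmetic mean of a power spectrum.

    Values near 1 indicate a flat, noise-like spectrum; values near 0
    indicate a peaky, tone-like spectrum. `eps` (an assumption of this
    sketch) guards against log(0) for empty bins.
    """
    n = len(power_spectrum)
    log_mean = sum(math.log(p + eps) for p in power_spectrum) / n
    geometric_mean = math.exp(log_mean)
    arithmetic_mean = sum(power_spectrum) / n + eps
    return geometric_mean / arithmetic_mean


flat = [1.0] * 64              # white-noise-like toy spectrum
peaky = [1e-6] * 63 + [1.0]    # a single dominant partial
```

Calling `spectral_flatness(flat)` gives a value close to 1, while `spectral_flatness(peaky)` gives a value close to 0.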
P10-8 Algorithms for Digital Subharmonic Distortion—Zlatko Baracskai, Ryan Stables, Birmingham City University - Birmingham, UK
This paper presents a comparison between existing digital subharmonic generators and a new algorithm developed with the intention of having a more pronounced subharmonic frequency and reduced harmonic, intermodulation and aliasing distortions. The paper demonstrates that by introducing inversions of a waveform at the minima and maxima instead of the zero crossings, the discontinuities are mitigated and various types of distortion are significantly attenuated.
Convention Paper 8036
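For orientation, the zero-crossing approach that the abstract above contrasts with can be sketched as a simple octave divider: the polarity of the signal is flipped at every positive-going zero crossing, which doubles the period of a periodic input. This is only a generic baseline written for this listing. It is not the authors' new algorithm, whose inversion at minima and maxima is not specified in enough detail here to reproduce, and all names and test values are assumptions.

```python
import math


def octave_down_by_inversion(samples):
    """Flip the signal's polarity at every positive-going zero crossing.

    For a periodic input this doubles the period, which creates a
    component one octave below the fundamental. The sign flips also
    add other partials, which this sketch does not address.
    """
    sign = 1.0
    output = [sign * samples[0]]
    for previous, current in zip(samples, samples[1:]):
        if previous < 0.0 <= current:   # positive-going zero crossing
            sign = -sign
        output.append(sign * current)
    return output


def magnitude_at(samples, frequency, sample_rate):
    """Magnitude of a single DFT bin, evaluated directly."""
    re = sum(s * math.cos(2 * math.pi * frequency * n / sample_rate)
             for n, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * frequency * n / sample_rate)
             for n, s in enumerate(samples))
    return math.hypot(re, im) / len(samples)


sample_rate = 8000
tone = [math.sin(2 * math.pi * 100 * n / sample_rate + 0.3)
        for n in range(sample_rate)]    # one second of a 100 Hz sine
subharmonic = octave_down_by_inversion(tone)
```

Comparing `magnitude_at(subharmonic, 50, sample_rate)` with `magnitude_at(tone, 50, sample_rate)` shows the energy that appears at half the input frequency.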
Monday, May 24, 09:00 — 11:00 (Room C2)
W8 - Interacting with Semantic Audio—Bridging the Gap between Humans and Algorithms
Chair:
Michael Hlatky, University of Applied Sciences, Bremen - Bremen, Germany
Panelists:
Masataka Goto, Media Interaction Group, National Institute of Advanced Industrial Science and Technology - Tsukuba, Japan
Anssi Klapuri, Queen Mary University of London - London, UK
Jörn Loviscach, Fachhochschule Bielefeld, University of Applied Sciences - Bielefeld, Germany
Yves Raimond, BBC Audio & Music Interactive - London, UK
Abstract:
Technologies under the heading Semantic Audio have undergone a fascinating development in the past few years. Hundreds of algorithms have been developed; first applications have made their way from research into possible mainstream application. However, the current level of awareness among prospective users and the amount of actual practical use do not seem to live up to the potential of semantic audio technologies. We argue that this is more an issue concerning interface and interaction than a problem concerning the robustness of the applied algorithms or a lack of need in audio production. The panelists of this workshop offer ways to improve the usability of semantic audio techniques. They look into current applications in off-the-shelf products, discuss the use in a variety of specialized applications such as custom-tailored archival solutions, demonstrate and showcase their own developments in interfaces for semantic audio, and propose future directions in interface and interaction development for semantic audio technologies ranging from audio file retrieval to intelligent audio effects.
The second half of this workshop includes hands-on interactive experiences provided by the panel.
Monday, May 24, 14:00 — 17:30 (Room C5)
P18 - Audio Coding and Compression
Chair: Jamie A. S. Angus, University of Salford - Salford, Greater Manchester, UK
P18-1 High-Level Sound Coding with Parametric Blocks—Daniel Möhlmann, Otthein Herzog, Universität Bremen - Bremen, Germany
This paper proposes a new parametric encoding model for sound blocks that is specifically designed for manipulation, block-based comparison, and morphing operations. Unlike other spectral models, only the temporal evolution of the dominant tone and its time-varying spectral envelope are encoded, thus greatly reducing perceptual redundancy. All sounds are synthesized from the same set of model parameters, regardless of their length. Therefore, new instances can be created with greater variability than through simple interpolation. A method for creating the parametric blocks from an audio stream through partitioning is also presented. An example of sound morphing is shown and applications of the model are discussed.
Convention Paper 8096
P18-2 Exploiting High-Level Music Structure for Lossless Audio Compression—Florin Ghido, Tampere University of Technology - Tampere, Finland
We present a novel concept of "noncontiguous" audio segmentation by exploiting the high-level music structure. The existing lossless audio compressors working in asymmetrical mode divide the audio into quasi-stationary segments of variable length by recursive splitting (MPEG-4 ALS) or by dynamic programming (asymmetrical OptimFROG) before computing a set of linear prediction coefficients for each segment. Instead, we combine several variable length segments into a group and use a single set of linear prediction coefficients for each group. The optimal algorithm for combining has exponential complexity and we propose a quadratic time approximation algorithm. Integrated into asymmetrical OptimFROG, the proposed algorithm obtains up to 1.20% (on average 0.23%) compression improvements with no increase in decoder complexity.
Convention Paper 8097
P18-3 Interactive Teleconferencing Combining Spatial Audio Object Coding and DirAC Technology—Jürgen Herre, Cornelia Falch, Dirk Mahne, Giovanni del Galdo, Markus Kallinger, Oliver Thiergart, Fraunhofer Institute for Integrated Circuits IIS - Erlangen, Germany
The importance of telecommunication continues to grow in our everyday lives. An ambitious goal for developers is to provide the most natural way of audio communication by giving users the impression of being located next to each other. MPEG Spatial Audio Object Coding (SAOC) is a technology for coding, transmitting, and interactively reproducing spatial sound scenes on any conventional multi-loudspeaker setup (e.g., ITU 5.1). This paper describes how Directional Audio Coding (DirAC) can be used as recording front-end for SAOC-based teleconference systems to capture acoustic scenes and to extract the individual objects (talkers). By introducing a novel DirAC to SAOC parameter transcoder, a highly efficient way of combining both technologies is presented that enables interactive, object-based spatial teleconferencing.
Convention Paper 8098
P18-4 A New Parametric Stereo- and Multichannel Extension for MPEG-4 Enhanced Low Delay AAC (AAC-ELD)—María Luis Valero, Fraunhofer Institute for Integrated Circuits IIS - Erlangen, Germany; Andreas Hölzer, DSP Solutions GmbH & Co. - Regensburg, Germany; Markus Schnell, Johannes Hilpert, Manfred Lutzky, Fraunhofer Institute for Integrated Circuits IIS - Erlangen, Germany; Jonas Engdegård, Heiko Purnhagen, Per Ekstrand, Kristofer Kjörling, Dolby Sweden AB - Stockholm, Sweden
ISO/MPEG standardizes two communication codecs with low delay: AAC-LD is a well established low delay codec for high quality communication applications such as video conferencing, tele-presence, and Voice over IP. Its successor AAC-ELD offers enhanced bit rate efficiency, making it an ideal solution for broadcast audio gateway codecs. Many existing and upcoming communication applications benefit from the transmission of stereo or multichannel signals at low bitrates. With low delay MPEG Surround, ISO has recently standardized a low delay parametric extension for AAC-LD and AAC-ELD. It is based on MPEG Surround technology with specific adaptation for low delay operation. This extension offers significantly improved coding efficiency for transmission of stereo and multichannel signals.
Convention Paper 8099
P18-5 Efficient Combination of Acoustic Echo Control and Parametric Spatial Audio Coding—Fabian Kuech, Markus Schmidt, Meray Zourub, Fraunhofer Institute for Integrated Circuits IIS - Erlangen, Germany
High-quality teleconferencing systems utilize surround sound to provide a natural communication experience. Directional Audio Coding (DirAC) is an efficient parametric approach to capture and reproduce spatial sound. It uses a monophonic audio signal together with parametric spatial cue information. For reproduction, multiple loudspeaker signals are determined based on the DirAC stream. To allow for hands-free operation, multichannel acoustic echo control (AEC) has to be employed. Standard approaches apply multichannel adaptive filtering to address this problem. However, computational complexity constraints and convergence issues inhibit practical applications. This paper proposes an efficient combination of AEC and DirAC by explicitly exploiting its parametric sound field representation. The approach suppresses the echo components in the microphone signals solely based on the single channel audio signal used for the DirAC synthesis of the loudspeaker signals.
Convention Paper 8100
P18-6 Sampling Rate Discrimination: 44.1 kHz vs. 88.2 kHz—Amandine Pras, Catherine Guastavino, McGill University - Montreal, Quebec, Canada
It is currently common practice for sound engineers to record digital music using high-resolution formats, and then down-sample the files to 44.1 kHz for commercial release. This study investigates whether listeners can perceive differences between musical files recorded at 44.1 kHz and 88.2 kHz with the same analog chain and type of AD-converter. Sixteen expert listeners were asked to compare 3 versions (44.1 kHz, 88.2 kHz, and the 88.2 kHz version down-sampled to 44.1 kHz) of 5 musical excerpts in a blind ABX task. Overall, participants were able to discriminate between files recorded at 88.2 kHz and their 44.1 kHz down-sampled version. Furthermore, for the orchestral excerpt, they were able to discriminate between files recorded at 88.2 kHz and files recorded at 44.1 kHz.
Convention Paper 8101
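The abstract above does not say how the ABX responses were analyzed. One common way to judge such results, shown here purely as an assumed illustration, is a one-sided binomial test against the 50% chance level. The function name and the trial counts in the usage note are hypothetical and are not data from the study.

```python
from math import comb


def abx_p_value(correct, trials, chance=0.5):
    """One-sided binomial p-value: probability of at least `correct`
    right answers in `trials` ABX trials if the listener is guessing."""
    return sum(comb(trials, k) * chance ** k * (1 - chance) ** (trials - k)
               for k in range(correct, trials + 1))
```

For example, `abx_p_value(12, 16)` returns the probability that a guessing listener scores 12 or more out of 16 hypothetical trials, which can then be compared with a chosen significance level.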
P18-7 Comparison of Multichannel Audio Decoders for Use in Mobile and Handheld Devices—Manish Nema, Ashish Malot, Nokia India Pvt. Ltd. - Bangalore, Karnataka, India
Multichannel audio provides an immersive experience to listeners. Consumer demand coupled with technological improvements will drive consumption of high-definition content in mobile and handheld devices. Several multichannel audio coding algorithms are available in the market, both proprietary ones like Dolby Digital, Dolby Digital Plus, Windows Media Audio Professional (WMA Pro), and Digital Theater Surround High Definition (DTS-HD), and standard ones like Advanced Audio Coding (AAC) and MPEG Surround. This paper presents salient features/coding techniques of important multichannel audio decoders and a comparison of these decoders on key parameters like processor complexity, memory requirements, complexity/features for stereo playback, and quality/coding efficiency. The paper also presents a ranking of these multichannel audio decoders on the key parameters in a single table for easy comparison.
Convention Paper 8102
Monday, May 24, 16:30 — 18:00 (Room C4-Foyer)
P20 - Audio Content Management—Audio Information Retrieval
P20-1 Complexity Scalable Perceptual Tempo Estimation From HE-AAC Encoded Music—Danilo Hollosi, Ilmenau University of Technology - Ilmenau, Germany; Arijit Biswas, Dolby Germany GmbH - Nürnberg, Germany
A modulation frequency-based method for perceptual tempo estimation from HE-AAC encoded music is proposed. The method is designed to work in three domains: the fully decoded PCM domain, the intermediate HE-AAC transform domain after partial decoding, and the HE-AAC compressed domain directly, using the Spectral Band Replication (SBR) payload. This offers complexity-scalable solutions. We demonstrate that the SBR payload is an ideal proxy for tempo estimation directly from HE-AAC bit-streams without even decoding them. A perceptual tempo correction stage based on rhythmic features is proposed to correct for octave errors in every domain. Experimental results show that the proposed method significantly outperforms two commercially available systems, both in terms of accuracy and computational speed.
Convention Paper 8109
P20-2 On the Effect of Reverberation on Musical Instrument Automatic Recognition—Mathieu Barthet, Mark Sandler, Queen Mary University of London - London, UK
This paper investigates the effect of reverberation on the accuracy of a musical instrument recognition model based on Line Spectral Frequencies and K-means clustering. One-hundred-eighty experiments were conducted by varying the type of music databases (isolated notes, solo performances), the stage in which the reverberation is added (learning, and/or testing), and the type of reverberation (3 different reverberation times, 10 different dry-wet levels). The performance of the model systematically decreased when reverberation was added at the testing stage (by up to 40%). Conversely, when reverberation was added at the training stage, a 3% increase in performance was observed for the solo performances database. The results suggest that pre-processing the signals with a dereverberation algorithm before classification may be a means to improve musical instrument recognition systems.
Convention Paper 8110
P20-3 Harmonic Components Extraction in Recorded Piano Tones—Carmine Emanuele Cella, Università di Bologna - Bologna, Italy
When analyzing recorded piano tones, it is sometimes desirable to remove from the original signal the noisy components generated by the hammer strike and by other elements involved in the piano action. In this paper we propose an efficient method to achieve this, based on adaptive filtering and automatic estimation of fundamental frequency and inharmonicity; the final method, applied to a recorded piano tone, produces two separate signals containing, respectively, the hammer knock and the harmonic components. Sound examples for evaluation by listening are available on the web as specified in the paper.
Convention Paper 8111
P20-4 Browsing Sound and Music Libraries by Similarity—Stéphane Dupont, Université de Mons - Mons, Belgium; Christian Frisson, Université Catholique de Louvain - Louvain-la-Neuve, Belgium; Xavier Siebert, Damien Tardieu, Université de Mons - Mons, Belgium
This paper presents a prototype tool for browsing through multimedia libraries using content-based multimedia information retrieval techniques. It is composed of several groups of components for multimedia analysis, data mining, interactive visualization, as well as connection with external hardware controllers. The musical application of this tool uses descriptors of timbre, harmony, and rhythm, and offers two different approaches for exploring/browsing content. In the first mode, dynamic data mining allows the user to group sounds into clusters according to those different criteria, whose importance can be weighted interactively. In the second mode, sounds that are similar to a query are returned to the user and can be used to continue the search. This approach also borrows from the concept of multi-criteria optimization to return a relevant list of similar sounds.
Convention Paper 8112
P20-5 On the Development and Use of Sound Maps for Environmental Monitoring—Maria Rangoussi, Stelios M. Potirakis, Ioannis Paraskevas, Technological Education Institute of Piraeus - Aigaleo-Athens, Greece; Nicolas–Alexander Tatlas, University of Patras - Patras, Greece
The development, update, and use of sound maps for the monitoring of areas of environmental interest are addressed in this paper. Sound maps constitute a valuable tool for environmental monitoring. They rely on networks of microphones distributed over the area of interest to record and process signals, extract and characterize sound events, and finally form the map; time constraints are imposed by the need for timely information representation. A stepwise methodology is proposed and a series of practical considerations are discussed with the aim of obtaining a multi-layer sound map that is periodically updated and visualizes the sound content of a “scene.” Alternative time-frequency-based features are investigated as to their efficiency within the framework of a hierarchical classification structure.
Convention Paper 8113
P20-6 The Effects of Reverberation on Onset Detection Tasks—Thomas Wilmering, György Fazekas, Mark Sandler, Queen Mary University of London - London, UK
The task of onset detection is relevant in various contexts such as music information retrieval and music production, while reverberation has always been an important part of the production process. The effect may be a product of the recording space or may be added artificially and, in our context, is destructive. In this paper we evaluate the effect of reverberation on onset detection tasks. We compare state-of-the-art techniques and show that the algorithms have varying degrees of robustness in the presence of reverberation depending on the content of the analyzed audio material.
Convention Paper 8114
P20-7 Segmentation and Discovery of Podcast Content—Steven Hargreaves, Chris Landone, Mark Sandler, Panos Kudumakis, Queen Mary University of London - London, UK
With ever increasing amounts of radio broadcast material being made available as podcasts, sophisticated methods of enabling the listener to quickly locate material matching their own personal tastes become essential. Given the ability to segment a podcast that may be in the order of one or two hours duration into individual song previews, the time the listener spends searching for material of interest is minimized. This paper investigates the effectiveness of applying multiple feature extraction techniques to podcast segmentation and describes how such techniques could be exploited by a vast number of digital media delivery platforms in a commercial cloud-based radio recommendation and summarization service.
Convention Paper 8115
Tuesday, May 25, 10:30 — 12:00 (Room C4-Foyer)
P23 - Audio Processing—Music and Speech Signal Processing
P23-1 Beta Divergence for Clustering in Monaural Blind Source Separation—Martin Spiertz, Volker Gnann, RWTH Aachen University - Aachen, Germany
General purpose audio blind source separation algorithms have to deal with a large dynamic range for the different sources to be separated. In the algorithm used here, the mixture is separated into single notes. These notes are clustered to construct the melodies played by the active sources. Non-negative matrix factorization (NMF) leads to good results in clustering the notes according to spectral features. The cost function for the NMF is controlled by the parameter beta, which should be adjusted depending on the dynamic difference of the sources. The novelty of this paper is a simple unsupervised decision scheme that estimates the optimal parameter beta for increasing the separation quality over a large range of dynamic differences.
Convention Paper 8130
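To make the role of the parameter beta in the abstract above more concrete, the sketch below evaluates the beta divergence in its widely used general form, with its three well-known special cases. It only shows how beta changes the NMF cost; it does not reproduce the paper's decision scheme for choosing beta, and the function names are assumptions.

```python
import math


def beta_divergence(x, y, beta):
    """Beta divergence d_beta(x | y) between two positive scalars.

    beta = 2 gives half the squared Euclidean distance, beta = 1 the
    generalized Kullback-Leibler divergence, and beta = 0 the
    Itakura-Saito divergence.
    """
    if beta == 0:
        return x / y - math.log(x / y) - 1.0
    if beta == 1:
        return x * math.log(x / y) - x + y
    return (x ** beta + (beta - 1) * y ** beta
            - beta * x * y ** (beta - 1)) / (beta * (beta - 1))


def nmf_cost(v, w_h, beta):
    """Sum of element-wise divergences between a spectrogram `v` and its
    approximation `w_h`, both given as equally sized lists of rows."""
    return sum(beta_divergence(a, b, beta)
               for row_v, row_wh in zip(v, w_h)
               for a, b in zip(row_v, row_wh))
```

With beta = 0 the divergence depends only on the ratio x / y, so quiet and loud components contribute equally, whereas beta = 2 weights errors by their absolute size. This is one reason the appropriate beta depends on the dynamic difference between sources.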
P23-2 On the Effects of Room Reverberation in 3-D DOA Estimation Using a Tetrahedral Microphone Array—Maximo Cobos, Jose J. Lopez, Amparo Marti, Universidad Politécnica de Valencia - Valencia, Spain
This paper studies the accuracy in the estimation of the Direction-Of-Arrival (DOA) of multiple sound sources using a small microphone array. Like other sparsity-based algorithms, the proposed method is able to work in underdetermined scenarios, where the number of sound sources exceeds the number of microphones. Moreover, the tetrahedral shape of the array allows easy estimation of DOAs in three-dimensional space, which is an advantage over other existing approaches. However, since the proposed processing is based on an anechoic signal model, the estimated DOA vectors are severely affected by room reflections. Experiments to analyze the resultant DOA distribution under different room conditions and source arrangements are discussed using both simulations and real recordings.
Convention Paper 8131
P23-3 Long Term Cepstral Coefficients for Violin Identification—Ewa Lukasik, Poznan University of Technology - Poznan, Poland
Cepstral coefficients in mel scale have proved to be efficient features for speaker and musical instrument recognition. In this paper Long Term Cepstral Coefficients (LTCCs) of solo musical phrases are used as features for identification of individual violins. LTCCs represent the envelope of the Long Term Average Spectrum (LTAS) in linear scale, which is useful for characterizing the subtleties of violin sound in the frequency domain. Results of the classification of 60 instruments are presented and discussed. It was shown that if the experts’ knowledge is applied to analyze violin sound, the results may be promising.
Convention Paper 8132
P23-4 Adaptive Source Separation Based on Reliability of Spatial Feature Using Multichannel Acoustic Observations—Mitsunori Mizumachi, Kyushu Institute of Technology - Kitakyushu, Fukuoka, Japan
Separation of sound sources can be achieved by spatial filtering with multichannel acoustic observations. However, an appropriate algorithm must be chosen for each acoustic scene, and it is difficult to provide a suitable algorithm in real acoustic environments. In this paper an adaptive source separation scheme is proposed based on the reliability of a spatial feature, which gives an estimate of direction of arrival (DOA). As confidence measures for DOA estimates, the third and fourth moments of the spatial features are employed to measure how sharp their main lobes are. This paper proposes to selectively use either spatial filters or frequency-selective filters without spatial filtering depending on the reliability of each DOA estimate.
Convention Paper 8133
P23-5 A Heuristic Text-Driven Approach for Applied Phoneme Alignment—Konstantinos Avdelidis, Charalampos Dimoulas, George Kalliris, George Papanikolaou, Aristotle University of Thessaloniki - Thessaloniki, Greece
The paper introduces a phoneme matching algorithm based on a novel functional strategy. In contrast to classic methodologies that focus on convergence to a fixed expected phonemic sequence (EPS), the presented method follows a more realistic approach. Based on text input, a soft EPS is populated, taking into consideration the structural and linguistic deviations that may appear in a naturally spoken sequence. The result of the matching process is evaluated using fuzzy inference and consists of both the phoneme transition positions and the actual phonemic content of the utterance. An overview of convergence quality performance through a series of runs for the Greek language is presented.
Convention Paper 8134
P23-6 Speech Enhancement with Hybrid Gain Functions—Xuejing Sun, Kuan-Chieh Yen, Cambridge Silicon Radio - Auburn Hills, MI, USA
This paper describes a hybrid gain function for single-channel acoustic noise suppression systems. The proposed gain function consists of Wiener filter and Minimum Mean Square Error – Log Spectral Amplitude estimator (MMSE-LSA) gain functions and selects the respective gain values accordingly. Objective evaluation using a composite measure shows that the hybrid gain function yields better results than using either of the two functions alone.
Convention Paper 8135
P23-7 Human Voice Modification Using Instantaneous Complex Frequency—Magdalena Kaniewska, Gdansk University of Technology - Gdansk, Poland
The paper presents the possibilities of changing human voice by modifying instantaneous complex frequency (ICF) of the speech signal. The proposed method provides a flexible way of altering voice without the necessity of finding fundamental frequency and formants’ positions or detecting voiced and unvoiced fragments of speech. The algorithm is simple and fast. Apart from ICF it uses signal factorization into two factors: one fully characterized by its envelope and the other with positive instantaneous frequency. ICFs of the factors are modified individually for different sound effects.
Convention Paper 8136
P23-8 Designing Optimal Phoneme-Wise Fuzzy Cluster Analysis—Konstantinos Avdelidis, Charalampos Dimoulas, George Kalliris, George Papanikolaou, Aristotle University of Thessaloniki - Thessaloniki, Greece
A large number of pattern classification algorithms and methodologies have been proposed for the phoneme recognition task during the last decades. The current paper presents a prototype distance-based fuzzy classifier, optimized for the needs of phoneme recognition. This is accomplished by a specially designed objective function and a corresponding training strategy. In particular, each phonemic class is represented by a number of arbitrary-shaped clusters that adaptively match the corresponding feature space distribution. The formulation of the approach is capable of delivering a variety of related conclusions based on fuzzy logic arithmetic. An overview of the inference capability is presented in combination with performance results for the Greek language.
Convention Paper 8137
Tuesday, May 25, 14:00 — 15:45 (Room C2)
AES/APRS—Life in the Old Dogs Yet—Part Three: After the Ball—Protecting the Crown Jewels
Moderator:
John Spencer, BMS CHACE
Panelists:
Chris Clark, British Library Sound Archive
Tommy D, Producer
Tony Dunne, A&R Coordinator, DECCA Records and UMTV/UMR - UK
Simon Hutchinson, PPL
Paul Jessop, Consultant IFI/RIAA
George Massenburg, P&E Wing, NARAS
Abstract:
A fascinating peek into the unspoken worlds of archiving and asset protection. It examines the issues surrounding retrievable formats that promise to future-proof recorded assets and the increasing importance of accurate recording information (metadata). A unique group of experts from the archiving and royalty distribution communities will hear a presentation from John Spencer, from BMS CHACE in Nashville, explaining his work with NARAS and the U.S. Library of Congress to establish an information schema for sound recording and Film and TV audio, and will then engage in a group discussion. The discussion then moves on to probably the most important topic affecting the future of the sound and music economies: how to keep what we’ve got and reward those who made it.
Sir George Martin CBE was also awarded an AES Honorary Membership just before this session started. The award was introduced by AES Past President Jim Anderson and presented to Sir George by AES President Diemer de Vries.