42nd AES Conference: Abstracts

Friday, July 22
Paper Session 1 - Music Information Retrieval [Part 1]
1-1 [ Invited talk ] [ Submission ID: 18 ] New Developments in Music Information Retrieval - Meinard Müller, Saarland University and Max Planck Institute Informatik, Saarbrücken, Germany

The digital revolution has brought about a massive increase in the availability and distribution of music-related documents of various modalities comprising textual, audio, as well as visual material. Therefore, the development of techniques and tools for organizing, structuring, retrieving, navigating, and presenting music-related data has become a major strand of research; the field is often referred to as music information retrieval (MIR). Major challenges arise because of the richness and diversity of music in form and content, leading to novel and exciting research problems. In this article, we give an overview of new developments in the MIR field with a focus on content-based music analysis tasks including audio retrieval, music synchronization, structure analysis, and performance analysis.
1-2 [ Invited talk ] [ Submission ID: 68 ] Music Listening in the Future: Augmented Music-Understanding Interfaces and Crowd Music Listening - Masataka Goto, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Japan

In the future, music listening can be more active, more immersive, richer, and deeper by using automatic music-understanding technologies (semantic audio analysis). In the first half of this invited talk, four Augmented Music-Understanding Interfaces that facilitate deeper understanding of music are introduced. In our interfaces, visualization of music content and music touch-up (customization) play important roles in augmenting people's understanding of music because understanding is deepened through seeing and editing. In the second half, a new style of music listening called Crowd Music Listening is discussed. By posting, sharing, and watching time-synchronous comments (semantic information), listeners can enjoy music together with the crowd. Such Internet-based music listening with shared semantic information also helps music understanding because understanding is deepened through communication. Two systems that deal with new trends in music listening --- time-synchronous comments and mashup music videos --- are finally introduced.
Poster Session 1
P1-1 [ Submission ID: 10 ] Improving a Multiple Pitch Estimation Method With AR Models - Tiago Fernandes Tavares, Jayme Garcia Arnal Barbedo, Amauri Lopes, School of Electrical and Computer Engineering, University of Campinas, Campinas, Brazil

Multiple pitch estimation (MPE) methods aim to detect the pitches of the sounds that are part of a certain mixture. A possible approach to this problem is applying a FIR filter bank in the frequency domain and choosing the filter that presents the most energy. This process is equivalent to performing a linear combination of frequency domain representations of a signal, hence it is a linear classification tool. When spectral lobes corresponding to existing partials merge, such a process may fail. In this paper, AR models were used to provide a spectral representation where lobes tend to merge less. The proper choice of model significantly improved the MPE method.
P1-2 [ Submission ID: 26 ] Polyphonic Music Transcription using Weighted CQT and Non-Negative Matrix Factorization - Sang Ha Park, Seokjin Lee, Koeng-Mo Sung, INMC, Seoul National University, Seoul, Republic of Korea

Non-negative Matrix Factorization (NMF) is a useful method in music transcription. It achieves high-speed computation and high performance. However, low frequency components are not fully detected and frequency confusion occurs occasionally in the conventional NMF-based transcription algorithm. We propose a music transcription method using NMF with Weighted Constant-Q Transform (WCQT) to solve this problem. The filter bank of CQT is the same as that of the Western music scale interval, so the frequency components are well analyzed. In addition, the weights on the CQT compensate for the relatively small energy in low frequencies. We successfully transcribed polyphonic piano music with the proposed transcription algorithm, and the performance was better than that of the conventional method.
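
As a rough illustration of the factorization step only (not the authors' WCQT-based implementation), the sketch below factorizes a non-negative time-frequency matrix with the standard multiplicative updates for the Euclidean cost. The names `nmf` and `weight_rows`, the iteration count, and the idea of applying the low-frequency weights as a simple per-row scaling are assumptions made for this example.

```python
import numpy as np

def weight_rows(V, weights):
    """Scale each frequency row of V by a weight (hypothetical stand-in for CQT weighting)."""
    return np.asarray(V, dtype=float) * np.asarray(weights, dtype=float)[:, None]

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (freq x time) as W @ H.

    W holds spectral templates (one per column), H holds their activations
    over time. Uses multiplicative updates for the Euclidean cost.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_time)) + eps
    for _ in range(n_iter):
        # Update activations, then templates; both stay non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

In a transcription setting, the rows of `H` would be thresholded to decide when each template (ideally one per note) is active; that step is omitted here.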
P1-3 [ Submission ID: 19 ] Towards Context-Sensitive Music Recommendations Using Multifaceted User Profiles - Rafael Schirru1,2, Stephan Baumann1, Christian Freye3, Andreas Dengel1,2, 1German Research Center for Artificial Intelligence, Kaiserslautern, Germany, 2University of Kaiserslautern, Kaiserslautern, Germany, 3Brandenburg University of Applied Sciences, Brandenburg, Germany

In this paper we present an approach for extracting multifaceted user profiles that enable recommendations according to a user's different preferred music styles. We describe the artists a user has listened to by making use of metadata obtained from Semantic Web data sources. After preprocessing the data we cluster a user's preferred artists and extract a descriptive label for each cluster. These labels are then aggregated to form multifaceted user profiles representing a user's full range of preferred music styles. Our evaluation experiments show that the extracted labels are specific to the artists in the clusters and can thus be used to recommend, e.g., internet radio stations and allow for an integration into existing recommendation strategies.
P1-4 [ Submission ID: 39 ] Automatic Classification of Musical Pieces Into Global Cultural Areas - Anna Kruspe, Hanna Lukashevich, Jakob Abeßer, Holger Großmann, Christian Dittmar, Fraunhofer Institute for Digital Media Technology, Ilmenau, Germany

Music Information Retrieval (MIR) has a large variety of applications. One aspect that has not gathered a lot of attention yet is the application to non-western music ("world music"). In a task comparable to genre classification, this work's goal is the classification of musical pieces into their corresponding cultural regions of origin. As a basis for such a classification, a three-tier taxonomy based on musical and geographic properties is created. A database consisting of approximately 4400 musical pieces representing the taxonomical classes is assembled and annotated. Based on rhythmic, tonal, and timbre-related audio features, different classification experiments are performed. We achieved an accuracy of approx. 70% for the classification of musical pieces into nine large world regions. Twelve new features that are especially suited for non-western music are implemented. They improve the classification result slightly. For the purpose of comparison, we carried out a listening test with musical laymen with an average accuracy of 52%.
P1-5 [ Submission ID: 49 ] Regression-Based Tempo Recognition from Chroma and Energy Accents for Slow Audio Recordings - Thorsten Deinert, Igor Vatolkin, Günter Rudolph, TU Dortmund, Dortmund, Germany

Although the performance of automatic tempo estimation methods has improved through recent research, some problems remain unsolved. One of them is the analysis of slow music or songs without a strong drum pulse that corresponds to the correct tempo. One of the most frequent errors is the prediction of the doubled tempo; however, further error sources exist. In our work we reimplemented, extended, and optimized the original tempo recognition method from Eronen and Klapuri with the concrete goal of achieving reliable classification accuracy, especially for slow songs. The results of the experimental study confirm the increased quality of the adapted algorithm chain. Several possible error sources are discussed in detail, and further ideas beyond the scope of this work are proposed for future research.
P1-6 [ Submission ID: 38 ] Blind Estimation of Reverberation Time from Monophonic Instrument Recordings Based on Non-Negative Matrix Factorization - Maximo Cobos1, Pedro Vera-Candeas2, Julio Jose Carabias-Orti2, Nicolas Ruiz-Reyes2, Jose J. Lopez1, 1Institute for Telecommunications and Multimedia Applications (iTEAM), Universitat Politècnica de València, Valencia, Spain, 2Escuela Politécnica Superior de Linares, Universidad de Jaen, Linares, Spain

Reverberation time is the most important acoustic parameter that describes the acoustic behavior of a room. To measure this parameter, test signals such as pink or impulse noises are usually employed. However, despite its importance, there are few methods aimed at estimating reverberation time from natural signals such as music and speech. In this paper, we propose a method for estimating reverberation time from monophonic instrument recordings by detecting decay parts in the performance using a Non-Negative Matrix Factorization (NMF) algorithm with a Basic Harmonic Constrained (BHC) model. Preliminary results using different databases and instrument models are given for mid-long reverberation times.
Paper Session 2 - Speech Processing and Analysis
2-1 [ Invited Talk ] [ Submission ID: 26 ] Semantic Speech Tagging: Towards Combined Analysis of Speaker Traits - Bjoern W. Schuller, Institute for Human-Machine Communication, Technische Universität München, München, Germany

A number of paralinguistic problems are often dealt with in isolation, such as emotion, health state or personality. However, there are also good examples of mutual benefit, mostly incorporating speaker gender knowledge. In this paper we deal with the question of how further paralinguistic information, such as speaker age, height, or race, can provide beneficial information when their ground truth knowledge is provided within single-task speaker classification. Tests with openSMILE's 1.5 k Paralinguistic Challenge feature set on the TIMIT corpus of 630 speakers reveal a significant boost in accuracy or cross-correlation depending on the representation form of the problem at hand.
2-2 [ Submission ID: 45 ] SyncTS: Automatic synchronization of speech and text documents - David Damm1, Harald Grohganz1, Frank Kurth2, Sebastian Ewert1, Michael Clausen1, 1University of Bonn, Department of Computer Science III, Bonn, Germany, 2Fraunhofer Institute for Communication, Information Processing and Ergonomics, Wachtberg, Germany

In this paper, we present an automatic approach for aligning speech signals to corresponding text documents. To this end, we propose to first use text-to-speech synthesis (TTS) to obtain a speech signal from the textual representation. Subsequently, both speech signals are transformed to sequences of audio features which are then time-aligned using a variant of greedy dynamic time-warping (DTW). The proposed approach is efficient (with linear running time) and computationally simple, and does not rely on a prior training phase as is necessary when using HMM-based approaches. It benefits from the combination of a) a novel type of speech feature, being correlated to the phonetic progression of speech, b) a greedy left-to-right variant of DTW, and c) the TTS-based approach for creating a feature representation from the input text documents. The feasibility of the proposed method is demonstrated in several experiments.
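
To make the idea of a greedy left-to-right alignment with linear running time concrete, here is a minimal sketch. It is not the SyncTS algorithm itself: the step set (diagonal, right, down), the Euclidean local cost, and the name `greedy_dtw` are assumptions chosen for illustration. Unlike full DTW, no cost matrix is filled; the path simply takes the locally cheapest step, so at most n + m - 2 steps are made.

```python
import numpy as np

def greedy_dtw(X, Y):
    """Greedy left-to-right alignment of feature sequences X (n x d) and Y (m x d).

    Returns a list of index pairs (i, j) from (0, 0) to (n - 1, m - 1).
    """
    n, m = len(X), len(Y)
    i = j = 0
    path = [(0, 0)]
    while i < n - 1 or j < m - 1:
        candidates = []
        if i < n - 1 and j < m - 1:
            candidates.append((i + 1, j + 1))  # advance in both sequences
        if i < n - 1:
            candidates.append((i + 1, j))      # advance in X only
        if j < m - 1:
            candidates.append((i, j + 1))      # advance in Y only
        # Take the step whose frames are closest; ties favor earlier candidates.
        i, j = min(candidates, key=lambda c: np.linalg.norm(X[c[0]] - Y[c[1]]))
        path.append((i, j))
    return path
```

A greedy path cannot recover from a locally wrong decision, which is presumably why the choice of feature matters so much in such a scheme.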
2-3 [ Submission ID: 58 ] Extraction of spectro-temporal speech cues for robust automatic speech recognition - Bernd T. Meyer, International Computer Science Institute, Berkeley, CA, USA

This work analyzes the use of spectro-temporal signal characteristics with the aim of improving the robustness of automatic speech recognition (ASR) systems. Experiments that aim at the robustness against extrinsic sources of variability (such as additive noise) as well as intrinsic variation of speech (changes in speaking rate, style, and effort) are presented. Results are compared to scores for the most common features in ASR (mel-frequency cepstral coefficients and perceptual linear prediction features), which account for the spectral properties of short-time segments of speech, but mostly neglect temporal or spectro-temporal cues. Intrinsic variations were found to severely degrade the overall ASR performance. The performance of the two most common feature types was degraded in much the same way, whereas the proposed spectro-temporal features exhibit a different sensitivity against intrinsic variations, which suggests that classic and spectro-temporal feature types carry complementary information. Furthermore, spectro-temporal features were shown to be more robust than the baseline system in the presence of additive noise.
Paper Session 3 - Automatic Music Transcription
3-1 [ Submission ID: 44 ] Automatic Recognition and Parametrization of Frequency Modulation Techniques in Bass-Guitar Recordings - Jakob Abeßer1, Christian Dittmar1, Gerald Schuller2, 1Fraunhofer Institute for Digital Media Technology, Ilmenau, Germany, 2Ilmenau University of Technology, Ilmenau, Germany

In this paper, we propose a novel method to parametrize and classify different frequency modulation techniques in bass guitar recordings. A parametric spectral estimation technique is applied to refine the fundamental frequency estimates derived from an existing bass transcription algorithm. We apply a two-stage taxonomy of bass playing styles with special focus on the frequency modulation techniques slide, bending, and vibrato. An existing database of isolated note recordings is extended by approx. 900 samples to evaluate the presented algorithm. We achieve comparable classification accuracy values of 85.1% and 81.5% for classification on class-level and subclass-level. Furthermore, two potential application scenarios are outlined.
3-2 [ Submission ID: 11 ] Note Clustering based on 2D Source-Filter Modeling for Underdetermined Blind Source Separation - Martin Spiertz, Volker Gnann, Institut für Nachrichtentechnik, RWTH Aachen University, Aachen, Germany

For blind source separation, the non-negative matrix factorization extracts single notes out of a mixture. These notes can be clustered to form the melodies played by a single instrument. A current approach for clustering utilizes a source-filter model to describe the envelope over the first dimension of the spectrogram: the frequency-axis. The novelty of this paper is to extend this approach by a second source-filter model, characterizing the second dimension of a spectrogram: the time-axis. The latter one models the temporal evolution of the energy of one note: an instrument-specific envelope is convolved with an activation vector, corresponding to tempo, rhythm, and amplitudes of single note instances. We introduce an unsupervised clustering framework for both models and a simple, yet effective combination strategy. Finally, we show the advantages of our separation algorithm compared with two other state-of-the-art separation frameworks: the separation quality is comparable, but our algorithm needs much less computational load, is independent of other BSS algorithms as initialization, and works with a unique set of parameters for a wide range of audio data.
3-3 [ Submission ID: 43 ] Pitch Estimation by the Pair-Wise Evaluation of Spectral Peaks - Karin Dressler, Fraunhofer Institute for Digital Media Technology, Ilmenau, Germany

In this paper, a new approach for pitch estimation in polyphonic musical audio is presented. The algorithm is based on the pair-wise analysis of spectral peaks. The idea of the technique lies in the identification of partials with successive (odd) harmonic numbers. Since successive partials of a harmonic sound have well-defined frequency ratios, a possible fundamental can be derived from the instantaneous frequencies of the two spectral peaks. Subsequently, the identified harmonic pairs are rated according to harmonicity, timbral smoothness, the appearance of intermediate spectral peaks, and harmonic number. Finally, the resulting pitch strengths are added to a pitch spectrogram. The pitch estimation was developed for the identification of the predominant voice (e.g. melody) in polyphonic music recordings. It was evaluated as part of a melody extraction algorithm during the Music Information Retrieval Evaluation eXchange (MIREX 2006 and 2009), where the algorithm reached the best overall accuracy as well as very good performance measures.
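
The step of deriving a possible fundamental from two spectral peaks can be sketched as follows. This is a simplified reading of the abstract, not the published algorithm: it assumes the two peaks are either successive harmonics (h, h + 1) or successive odd harmonics (h, h + 2), and it applies only a crude harmonicity check. The function name, the tolerance, and the harmonic limit are hypothetical, and the rating by timbral smoothness and intermediate peaks is omitted.

```python
def f0_candidates(f_low, f_high, max_harmonic=10, tol=0.03):
    """Candidate fundamentals (Hz) implied by two peak frequencies.

    Returns tuples (f0, h_low, h_high), where h_low and h_high are the
    harmonic numbers assigned to the lower and upper peak.
    """
    out = []
    for gap in (1, 2):  # 1: successive harmonics, 2: successive odd harmonics
        f0 = (f_high - f_low) / gap
        if f0 <= 0:
            continue
        h = round(f_low / f0)
        if h < 1 or h + gap > max_harmonic:
            continue
        if gap == 2 and h % 2 == 0:
            continue  # a gap of two is only accepted between odd harmonics
        # Harmonicity check: the lower peak must sit close to h * f0.
        if abs(f_low - h * f0) <= tol * f0:
            out.append((f0, h, h + gap))
    return out
```

For example, peaks at 440 Hz and 660 Hz are explained as harmonics 2 and 3 of 220 Hz, while peaks at 300 Hz and 500 Hz are explained as odd harmonics 3 and 5 of 100 Hz.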
Saturday, July 23
Paper Session 4 - Music Information Retrieval [Part 2]
4-1 [ Invited talk ] [ Submission ID: 33 ] Adaptive Distance Measures for Exploration and Structuring of Music Collections - Sebastian Stober, Otto-von-Guericke-University Magdeburg, Magdeburg, Germany

Music similarity plays an important role in many Music Information Retrieval applications. However, it has many facets and its perception is highly subjective - very much depending on a person's background or task. This paper presents a generalized approach to modeling and learning individual distance measures for comparing music pieces based on multiple facets that can be weighted. The learning process is described as an optimization problem guided by generic distance constraints. Three application scenarios with different objectives exemplify how the proposed method can be employed in various contexts by deriving distance constraints either from domain-specific expert information or user actions in an interactive setting.
4-2 [ Submission ID: 32 ] Expressivity in musical timing in relation to musical structure and interpretation: a cross-performance, audio-based approach - Cynthia C.S. Liem1, Alan Hanjalic1, Craig Stuart Sapp2, 1Delft University of Technology, The Netherlands, 2CCRMA/CCARH, Stanford University, USA

Classical music performances are personal, expressive renditions, representing a performing musician's artistic view on a written music score. Typically, many interpretations are available for the same music piece. We believe that the variation in expressive renditions across performances can be exploited to gain insight into the musical content and provide supporting information for existing Music Information Retrieval tasks. In this paper, we focus on timing as one aspect of an individual performer’s expressivity and propose a light-weight, unsupervised and audio-based method to study timing deviations among different performances. The results of our qualitative study obtained for five Chopin mazurkas show that timing individualism as inferred by our method can be related to the structure of a music piece, and even highlight interpretational aspects of a piece that are not necessarily visible from the musical score.
4-3 [ Submission ID: 34 ] Interactive Classification of Sound Objects for Polyphonic Electro-Acoustic Music Annotation - Sebastien Gulluni1,2, Slim Essid2, Olivier Buisson1, Gaël Richard2, 1Institut National de l’Audiovisuel, Bry-sur-marne Cedex, France, 2Institut Telecom, Telecom ParisTech, Paris, France

In this paper, we present an interactive approach for the classification of sound objects in electro-acoustic music. For this purpose, we use relevance feedback combined with active-learning segment selection in an interactive loop. Validation and correction information given by the user is injected in the learning process at each iteration to achieve more accurate classification. Three active learning criteria are compared in the evaluation of a system classifying polyphonic pieces (with a varying degree of polyphony). The results show that the interactive approach achieves satisfying performance in a reasonable number of iterations.
Paper Session 5 - Audio Source Separation [Part 2]
5-1 [ Submission ID: 51 ] Singing Voice Separation from Stereo Recordings using Spatial Clues and Robust F0 Estimation - Pablo Cabañas-Molero1, Damián Martínez-Muñoz1, Maximo Cobos2, José J. López2, 1University of Jaén, Polytechnic School, Linares, Jaén, Spain, 2Institute for Telecommunications and Multimedia Applications (iTEAM), Technical University of Valencia, Valencia, Spain

Separation of singing voice from music accompaniment is a topic of great utility in many applications of Music Information Retrieval. In the context of stereophonic music mixtures, many algorithms face this problem making use of the spatial diversity of the sound sources to localize and isolate the singing voice. Although these spatial approaches can obtain acceptable results, the separated signal is usually affected by a high level of distortion and artifacts. In this paper, we propose a method for improving the isolation of the singing voice in stereo recordings based on incorporating the fundamental frequency (F0) information into the separation process. First, the singing voice is pre-separated from the input mixture using a state-of-the-art stereo source separation method, the MuLeTs algorithm. Then, the F0 of this pre-separated signal is obtained using a robust pitch estimator based on the computation of the difference function and Hidden Markov Models, obtaining a smooth pitch contour with voiced/unvoiced decisions. A binary mask is finally constructed from F0 to isolate the singing voice from the original mix. The method has been tested on studio music recordings, obtaining good separation results.
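
The final step, building a binary time-frequency mask from an F0 contour, might look roughly like the sketch below. The abstract does not specify how the mask is constructed, so the harmonic-comb rule, the fixed bandwidth around each harmonic, and all names and default values here are assumptions. Unvoiced frames are marked with an F0 of 0 and receive an empty mask.

```python
import numpy as np

def harmonic_binary_mask(f0_track, freqs, n_harmonics=20, half_width_hz=40.0):
    """Binary mask (n_bins x n_frames) selecting bins near harmonics of F0.

    f0_track: per-frame F0 in Hz, with 0 meaning unvoiced.
    freqs:    center frequency in Hz of each spectrogram bin.
    """
    f0_track = np.asarray(f0_track, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    mask = np.zeros((len(freqs), len(f0_track)), dtype=bool)
    for t, f0 in enumerate(f0_track):
        if f0 <= 0:
            continue  # unvoiced frame: keep nothing
        harmonics = f0 * np.arange(1, n_harmonics + 1)
        # Distance from every bin to its nearest harmonic of this frame's F0.
        dist = np.abs(freqs[:, None] - harmonics[None, :]).min(axis=1)
        mask[:, t] = dist <= half_width_hz
    return mask
```

Multiplying such a mask element-wise with the mixture spectrogram and inverting the transform would yield the voice estimate; the complement of the mask gives an accompaniment estimate.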
5-2 [ Submission ID: 42 ] Interaction of phase, magnitude and location of harmonic components in the perceived quality of extracted solo signals - Estefanía Cano1, Christian Dittmar1, Gerald Schuller2, 1Fraunhofer Institute for Digital Media Technology, Ilmenau, Germany, 2Ilmenau University of Technology, Ilmenau, Germany

During the last year, many research efforts have been directed to the refinement of sound source separation algorithms. However, little or no effort has been made to assess the impact and interaction of different spectral parameters such as phase, magnitude and location of harmonic components in the resulting quality of the extracted signals. Recent developments in objective measures for sound quality that also fit subjective ratings have made this possible. This paper presents a study where spectral phase, magnitude and location of harmonic components are systematically changed to assess the impact of such variations in the perceived quality of extracted solo signals. To properly evaluate results, multi-track recordings that allow comparison with original tracks were used.
Poster Session 2
P2-1 [ Submission ID: 16 ] A Psychoacoustic Approach to Wave Field Synthesis - Tim Ziemer, Institute of Musicology, University of Hamburg, Hamburg, Germany

Conventional audio systems use psychoacoustic knowledge to create a sound which is perceived equivalent to natural auditory events. Wave field synthesis (WFS) has overcome several disadvantages of conventional stereophonic audio systems by physically synthesizing natural wave fields. A practical implementation of a wave field synthesis system leads to errors which the literature proposes to compensate by physical means (e.g. compensation of shadow waves via compensation sources, modeling reflections from the third dimension via 2½D-operator) or by a combination of WFS with conventional stereophonic sound (e.g. compensating aliasing errors by optimized phantom source imaging (OPSI)). This paper introduces a psychoacoustic approach to compensate synthesis errors to ensure a proper localization, sound coloration and spaciousness. Incitements for further psychoacoustic WFS research topics are given.
P2-2 [ Submission ID: 50 ] Comparative evaluation and combination of audio tempo estimation approaches - Jose R. Zapata, Emilia Gómez, Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain

The automatic analysis of musical tempo from audio is still an open research task in the Music Information Retrieval (MIR) community. The goal of this paper is to provide an updated comparative evaluation of different methods for audio tempo estimation. We overview, following the same block diagram, 23 documented methods. We then analyze their accuracy, error distribution and statistical differences, and we discuss which strategies can provide better performance for different input material. We then take advantage of their complementarity to improve the results by combining different methods, and we finally analyze the limitations of current approaches and give some ideas for future work on the task.
P2-3 [ Submission ID: 70 ] Observing uncertainty in music tagging by automatic gaze tracking - Bozana Kostek, Multimedia Systems Department of Gdansk University of Technology, Gdansk, Poland

In this paper, a new approach to observing the music file tagging process by employing a gaze tracking system is proposed. The study was conducted with the participation of twenty subjects having different musical experience. For the purpose of the experiments a website survey based on a musical database was prepared. It made it possible to gather information about the music experience of subjects along with music characteristics such as genre, tempo, dynamics, etc. The results obtained from the preliminary tests show that it is also possible to use a gaze tracking system to automatically tag music characteristics; however, this process should be optimized. Conclusions are derived with respect to the outcomes of the experiments. Future directions aimed at optimizing the experimental set-up are also discussed.
P2-4 [ Submission ID: 56 ] Tempo Estimation from Urban Music using Non-Negative Matrix Factorization - Daniel Gärtner, Fraunhofer Institute for Digital Media Technology, Ilmenau, Germany

Automatic tempo estimation is a useful tool for DJs. Several algorithms have been introduced in recent years. In this paper, a system for tempo induction of urban music is presented. While other algorithms are designed to work on all kinds of music, this one utilizes characteristics that are typical for urban music, e.g., hiphop, like constant tempo and a time signature of 4/4. Activation functions are obtained from non-negative matrix factorization of the spectrogram, and then used as periodicity detection functions. An initial pool of dominant periodicities is collected, using a comb grid approach on a combined autocorrelation representation of the periodicity detection functions. From this pool, the most promising periodicity and its power-of-two-multiples are determined and rated. Based on this rating, the periodicity candidates are sorted and eventually transformed into bpm values. The system’s performance is close to perfect (99.85% accuracy), if octave errors are accepted as correct estimates. Two reference systems are outperformed by the proposed approach.
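
The core idea of turning a periodicity detection function into a bpm value can be illustrated with a plain autocorrelation peak search. This sketch is a deliberate simplification and not the system described above: it omits the comb grid, the pooling of candidates, and the rating of power-of-two multiples. The function name, the bpm search range, and the frame rate in the usage note are assumptions.

```python
import numpy as np

def tempo_from_activation(activation, frame_rate, bpm_range=(60.0, 180.0)):
    """Estimate tempo (bpm) from one activation function sampled at frame_rate Hz."""
    x = np.asarray(activation, dtype=float)
    x = x - x.mean()
    # Autocorrelation for non-negative lags only.
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    # Convert the bpm search range into a lag range in frames.
    min_lag = int(np.ceil(60.0 * frame_rate / bpm_range[1]))
    max_lag = int(np.floor(60.0 * frame_rate / bpm_range[0]))
    max_lag = min(max_lag, len(acf) - 1)
    lags = np.arange(min_lag, max_lag + 1)
    best_lag = lags[np.argmax(acf[lags])]
    return 60.0 * frame_rate / best_lag
```

With a frame rate of 100 Hz, an activation that fires every 50 frames gives a best lag of 50 frames and hence 120 bpm. Lags of 25 or 100 frames would correspond to the octave errors (240 or 60 bpm) mentioned in the abstract.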
P2-5 [ Submission ID: 28 ] A Musical Source Separation System using a Source-Filter Model and Beta-Divergence Non-Negative Matrix Factorization - Seokjin Lee, Sang Ha Park, Koeng-Mo Sung, INMC, Seoul National University, Seoul, Republic of Korea

A musical source separation algorithm for mono channel signals is presented in this paper. The algorithm is based on a non-negative matrix factorization (NMF) method which factorizes the magnitude spectrum of the input signal into a sum of components, each of which has a fixed magnitude spectrum and a time-varying gain. In order to factorize the input spectrum, the input signal is modeled using a source-filter model. The parameters of the source-filter model are estimated by minimizing the beta-divergence from the input spectrum to the reconstructed model. This source-filter model takes advantage of the reliability of the estimated parameter. Simulation experiments were carried out using mixed signals composed of piano and cello. The performance of the proposed algorithm was compared to the basic NMF algorithm using a linear signal model and to a source-filter model NMF algorithm using Kullback-Leibler divergence instead of beta-divergence. According to the results of these simulations, the proposed algorithm has a better separation quality than that found in the previous algorithms.
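
For readers unfamiliar with the cost function, the sketch below evaluates the commonly used definition of the beta-divergence between an observed spectrum `x` and a model spectrum `y`. The abstract does not state which value of beta or which exact parameterization the authors use, so this is the textbook form, offered as an assumption. It reduces to the generalized Kullback-Leibler divergence at beta = 1, to the Itakura-Saito divergence at beta = 0, and to half the squared Euclidean distance at beta = 2.

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Sum of element-wise beta-divergences d_beta(x | y) for positive arrays x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if beta == 1:
        # Generalized Kullback-Leibler divergence.
        return float(np.sum(x * np.log(x / y) - x + y))
    if beta == 0:
        # Itakura-Saito divergence.
        return float(np.sum(x / y - np.log(x / y) - 1.0))
    # General case, valid for any other beta.
    num = x**beta + (beta - 1.0) * y**beta - beta * x * y**(beta - 1.0)
    return float(np.sum(num / (beta * (beta - 1.0))))
```

Because the Kullback-Leibler case is one member of this family, the comparison in the abstract can be read as a comparison between beta = 1 and some other choice of beta.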
P2-6 [ Submission ID: 29 ] Design of a Karaoke System for Commercial Stereophonic Audio Tracks aiming a Musical Learning Aid for Amateur Singers - Karthik. R1, Jeyasingh Pathrose1, M. Madheswaran2, 1Jasmin Infotech Pvt Ltd, Chennai, India, 2Muthayammal Engineering College, Rasipuram, India

This paper analyzes the extraction of singing voice from polyphonic audio using panning information, an enhancement applied to refine the extraction process, and their performance measures. Overall, a karaoke system design that works for commercial stereophonic audio tracks is proposed, which can act as a musical learning aid for amateurs.
P2-7 [ Submission ID: 35 ] Geometric Source Separation Method of Audio Signals based on Beamforming and NMF - Seokjin Lee, Sang Ha Park, Koeng-Mo Sung, INMC, Seoul National University, Seoul, Republic of Korea

In this paper, two geometric source separation methods are proposed. The first one is composed of an NMF-based separation module and a clustering module with spatial information. The purpose of the algorithm is twofold: the first is to enhance the interference injection performance relative to conventional beamformer algorithms, and the second is to provide a useful clustering algorithm for decomposed bases after using the NMF method. The second proposed algorithm is composed of a beamformer module, an NMF-based separation and clustering module, and a selection module. The proposed system is compiled to assure robustness to differences in the incidence angles between the desired signal and the interference signal. The evaluation was performed with recorded speech signals and the results are illustrated.
Paper Session 6 - Informed Source Separation
6-1 [ Invited talk ] [ Submission ID: 61 ] Parametric Coding of Audio Objects: Technology, Performance and Opportunities - Juergen Herre, Leon Terentiv, Fraunhofer Institute for Integrated Circuits, Erlangen, Germany

While efficient low-bitrate coding of multi-channel audio (surround sound) has been an active topic of research for about one decade, the quest for representing audio signals in a semantically more relevant way has recently led to the development of highly efficient schemes for representing multiple audio objects, thus enabling the exciting prospect of user-based semantic interactivity for a broad palette of applications. This paper discusses the evolution and current state of the art in parametric audio object coding, its current performance, and possible applications with special focus on the recent "Spatial Audio Object Coding" standard produced by the MPEG standardization group.
6-2 [ Submission ID: 30 ] Informed Audio Source Separation from Compressed Linear Stereo Mixtures - Laurent Girin, Jonathan Pinel, Grenoble Laboratory of Images, Speech, Signal and Automation, Grenoble, France

In this paper, new developments concerning a system for informed source separation (ISS) of music signals are presented. Such a system makes it possible to separate I>2 musical instruments and singing voices from linear instantaneous stationary stereo (2-channel) mixtures, based on audio signal natural sparsity, pre-mix source signal analysis, and side-information embedding (within the mix signal). The foundations of the system have been presented in previous papers, within the framework of uncompressed (16-bit PCM) mix signals. In the present paper, we extend the study to compressed mix signals. For instance, we use an MPEG-AAC codec and we show that the ISS process is quite robust to compression, opening the way for real-world karaoke/soloing/remixing applications for downloadable music.
6-3 [ Submission ID: 21 ] Compressive Sensing for Music Signals: Comparison of transforms with coherent dictionaries - Ch. Srikanth Raj, T. V. Sreenivas, Department of Electrical Communication Engineering, Indian Institute Of Science, Bangalore, India

Compressive Sensing (CS) is a new sensing paradigm which permits sampling of a signal at its intrinsic information rate, which could be much lower than the Nyquist rate, while guaranteeing good quality reconstruction for signals sparse in a linear transform domain. We explore the application of the CS formulation to music signals. Since music signals comprise both tonal and transient components, we examine several transforms such as DCT, DWT, Fourier basis and also non-orthogonal warped transforms to explore the effectiveness of CS theory and the reconstruction algorithms. We show that for a given sparsity level, DCT, overcomplete and warped Fourier dictionaries result in better reconstruction, with the warped Fourier dictionary giving a perceptually better reconstruction. 'MUSHRA' test results show that a moderate quality reconstruction is possible with about half the Nyquist sampling rate.
6-4 [ Submission ID: 37 ] 'Sparsification' of Audio Signals using the MDCT/IntMDCT and a Psychoacoustic Model - Application to Informed Audio Source Separation - Jonathan Pinel, Laurent Girin, Grenoble Institute of Technology, Grenoble, France

Sparse representations have proved a very useful tool in a variety of domains, e.g. speech/music source separation. As strictly sparse representations (in the sense of l0) are often impossible to achieve, other ways of studying signal sparsity have been proposed. In this paper, we revisit the irrelevance filtering analysis-synthesis approach proposed in (Balazs et al., IEEE Trans. ASLP, 18(1), 2010), where the TF coefficients that are below some masking threshold are set to zero. Instead of using the Gabor transform and a specific psychoacoustic model, we use tools directly inspired by perceptual audio coding, for instance MPEG-AAC. We show that significantly better “sparsification performances” are obtained on music signals, at lower computational cost. We then apply the sparsification process to the informed source separation (ISS) problem and show that it makes it possible to significantly decrease the computational cost at the ISS decoder.
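As a rough illustration of the irrelevance filtering idea described in this abstract (zeroing the time-frequency coefficients that fall below a masking threshold), a minimal sketch follows. It is not the authors' implementation: the function names `sparsify` and `zero_fraction` are hypothetical, and the per-coefficient thresholds are assumed to be supplied by some psychoacoustic model and expressed as magnitudes on the same scale as the coefficients.

```python
def sparsify(coeffs, masking_thresholds):
    """Zero every coefficient whose magnitude is below its masking threshold.

    Assumption: `masking_thresholds` holds one magnitude threshold per
    coefficient, computed elsewhere (e.g. by a psychoacoustic model).
    """
    if len(coeffs) != len(masking_thresholds):
        raise ValueError("need exactly one threshold per coefficient")
    return [c if abs(c) >= t else 0.0
            for c, t in zip(coeffs, masking_thresholds)]


def zero_fraction(coeffs):
    """Fraction of coefficients that are exactly zero (a crude sparsity measure)."""
    return sum(1 for c in coeffs if c == 0) / len(coeffs)


# Toy frame of transform coefficients with made-up thresholds.
frame = [0.5, -0.01, 0.2, 0.003]
thresholds = [0.05, 0.05, 0.1, 0.01]
sparse_frame = sparsify(frame, thresholds)
```

In this toy frame, the second and fourth coefficients fall below their thresholds and are zeroed, so half of the coefficients end up at zero.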
Paper Session 7 - Music Information Retrieval [Part 3]
7-2 [ Submission ID: 17 ] Analyzing Chroma Feature Types for Automated Chord Recognition - Nanzhu Jiang, Peter Grosche, Verena Konz, Meinard Müller, Saarland University and MPI Informatik, Saarbrücken, Germany

The computer-based harmonic analysis of music recordings, with the goal of automatically extracting chord labels directly from the given audio data, constitutes a major task in music information retrieval. In most automated chord recognition procedures, the given music recording is first converted into a sequence of chroma-based audio features, and then pattern matching techniques are applied to map the chroma features to chord labels. In this paper, we analyze the role of the feature extraction step within the recognition pipeline of various chord recognition procedures based on template matching strategies and hidden Markov models. In particular, we report on numerous experiments which show how the various procedures depend on the type of the underlying chroma feature as well as on parameters that control temporal and spectral aspects.
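The template matching step mentioned in this abstract can be sketched roughly as follows. This is a generic illustration under stated assumptions, not the procedure evaluated in the paper: each frame is assumed to be a 12-dimensional chroma vector ordered from C to B, the templates are binary major and minor triads, cosine similarity is the matching score, and the names `chord_templates`, `cosine`, and `recognize` are hypothetical.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def chord_templates():
    """Build 24 binary chroma templates: 12 major and 12 minor triads."""
    templates = {}
    for root in range(12):
        # A major third is 4 semitones above the root, a minor third is 3.
        for suffix, third in (("", 4), ("m", 3)):
            vec = [0.0] * 12
            for interval in (0, third, 7):
                vec[(root + interval) % 12] = 1.0
            templates[PITCH_CLASSES[root] + suffix] = vec
    return templates


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recognize(chroma_frames):
    """Label each chroma frame with the best-matching triad template."""
    templates = chord_templates()
    return [max(templates, key=lambda name: cosine(frame, templates[name]))
            for frame in chroma_frames]


# Two toy frames: energy on C-E-G, then on A-C-E.
frames = [
    [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
]
labels = recognize(frames)
```

Here each frame is labeled independently. Procedures based on hidden Markov models, as mentioned in the abstract, would additionally model transitions between successive chords.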
Paper Session 8 - Intelligent Audio Effects
8-1 Intelligent Multi Channel Audio Effects for Live Sound - Josh Reiss, Centre for Digital Music (C4DM), Queen Mary University of London, London, United Kingdom

Mixing consoles and digital audio workstations are capable of saving static mix scenes but lack the ability to intelligently take decisions that would assist in the production process. We describe recent research towards the creation of intelligent software tools for live sound mixing. Multichannel signal processing is usually concerned with extracting information about sources from several received signals. Instead, our tools are based on multichannel audio effects where the inter-channel relationships are exploited in order to manipulate and combine the multichannel content. Applications to real-time, automatic audio production are described and the necessary technologies and the architecture of such systems are presented. The current state of the art is reviewed, and directions of future research are also discussed.
8-2 [ Invited talk ] [ Submission ID: 69 ] Surround Recording Based on a Coincident Pair of Microphones - Christof Faller, ILLUSONIC LLC, St-Sulpice, Switzerland

Recently, we have proposed an adaptive beamformer with the aim of simulating a highly directive microphone, as opposed to the conventional signal-to-noise ratio target. It is shown that by applying this beamformer in different ways to a coincident cardioid or tailed cardioid signal pair, audio channels suitable for surround front left, front right, surround left, and surround right can be generated. It is also described how to generate a center channel from the same microphone signals. While the proposed surround sound generation is not as flexible as linear B-Format decoding in terms of determining the direction of the response for each channel, the directional responses of the audio channels overlap less, resulting in higher channel separation.
8-3 [ Submission ID: 57 ] A Knowledge Representation Framework for Context-Dependent Audio Processing - György Fazekas, Thomas Wilmering, Mark B. Sandler, Centre for Digital Music (C4DM), Queen Mary University of London, London, United Kingdom

This paper presents a general framework for using appropriately structured information about audio recordings in music processing, and shows how this framework can be utilised in multitrack music production tools. The information, often referred to as metadata, is commonly represented in a highly domain- and application-specific format. This prevents interoperability and its ubiquitous use across applications. In this paper, we address this issue. The basis for the formalism we use is provided by Semantic Web ontologies rooted in formal logic. A set of ontologies is used to describe structured representations of information such as tempo, the names of instruments, or onset times extracted from audio. This information is linked to audio tracks in music production environments as well as to processing blocks such as audio effects. We also present specific case studies, for example, the use of audio effects capable of processing and predicting metadata associated with the processed signals. We show how this increases the accuracy of description and reduces the computational cost by omitting repeated application of feature extraction algorithms.