42nd AES Conference: Abstracts
|Friday, July 22|
|Paper Session 1 - Music Information Retrieval [Part 1]|
|1-1||[ Invited talk ] [ Submission ID: 18 ] New Developments in Music Information Retrieval - Meinard Müller, Saarland University and Max Planck Institute Informatik, Saarbrücken, Germany
The digital revolution has brought about a massive increase in the availability and distribution of music-related documents of various modalities comprising textual, audio, as well as visual material. Therefore, the development of techniques and tools for organizing, structuring, retrieving, navigating, and presenting music-related data has become a major strand of research; the field is often referred to as music information retrieval (MIR). Major challenges arise because of the richness and diversity of music in form and content, leading to novel and exciting research problems. In this article, we give an overview of new developments in the MIR field with a focus on content-based music analysis tasks including audio retrieval, music synchronization, structure analysis, and performance analysis.
|1-2||[ Invited talk ] [ Submission ID: 68 ] Music Listening in the Future: Augmented Music-Understanding Interfaces and Crowd Music Listening - Masataka Goto, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Japan
In the future, music listening can be more active, more immersive, richer, and deeper by using automatic music-understanding technologies (semantic audio analysis). In the first half of this invited talk, four Augmented Music-Understanding Interfaces that facilitate deeper understanding of music are introduced. In our interfaces, visualization of music content and music touch-up (customization) play important roles in augmenting people's understanding of music because understanding is deepened through seeing and editing. In the second half, a new style of music listening called Crowd Music Listening is discussed. By posting, sharing, and watching time-synchronous comments (semantic information), listeners can enjoy music together with the crowd. Such Internet-based music listening with shared semantic information also helps music understanding because understanding is deepened through communication. Two systems that deal with new trends in music listening --- time-synchronous comments and mashup music videos --- are finally introduced.
|Poster Session 1|
|P1-1||[ Submission ID: 10 ] Improving a Multiple Pitch Estimation Method With AR Models - Tiago Fernandes Tavares, Jayme Garcia Arnal Barbedo, Amauri Lopes, School of Electrical and Computer Engineering, University of Campinas, Campinas, Brazil
Multiple pitch estimation (MPE) methods aim to detect the pitches of the sounds that are part of a certain mixture. A possible approach to such a problem is applying a FIR filter bank in the frequency domain and choosing the filter that presents the most energy. This process is equivalent to performing a linear combination of frequency domain representations of a signal; hence it is a linear classification tool. When spectral lobes corresponding to existing partials merge, such a process may fail. In this paper, AR models were used to provide a spectral representation where lobes tend to merge less. The proper choice of model significantly improved the MPE method.
|P1-2||[ Submission ID: 26 ] Polyphonic Music Transcription using Weighted CQT and Non-Negative Matrix Factorization - Sang Ha Park, Seokjin Lee, Koeng-Mo Sung, INMC, Seoul National University, Seoul, Republic of Korea
Non-negative Matrix Factorization (NMF) is a useful method in music transcription. It achieves high-speed computing and high performance. However, low-frequency components are not fully detected and frequency confusion occurs occasionally in the conventional NMF-based transcription algorithm. We propose a music transcription method using NMF with Weighted Constant-Q Transform (WCQT) to solve this problem. The filter bank of the CQT follows the intervals of the Western music scale, so the frequency components are well analyzed. In addition, the weights on the CQT compensate for the relatively small energy in low frequencies. We successfully transcribed polyphonic piano music with the proposed transcription algorithm, and the performance was better than that of the conventional method.
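As an illustration of the factorization step only, the following is a minimal sketch of NMF with the standard multiplicative updates for the Euclidean cost, written in plain Python. It is not the authors' implementation: the toy matrix `V`, the function names, and all parameter values are assumptions made for this example, and the CQT weighting described in the abstract is not modeled.

```python
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (bins x frames) into W (bins x rank)
    and H (rank x frames) using multiplicative updates."""
    rng = random.Random(seed)
    n_bins, n_frames = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_bins)]
    H = [[rng.random() + 0.1 for _ in range(n_frames)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num = matmul(Wt, V)
        den = matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num = matmul(V, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H

# Hypothetical toy "spectrogram": two note templates with overlapping activations.
V = [
    [1.0, 1.0, 0.0, 0.5, 1.0, 0.0],
    [0.5, 0.5, 0.0, 0.25, 0.5, 0.0],
    [0.0, 1.0, 1.0, 0.5, 0.0, 1.0],
    [0.0, 0.5, 0.5, 0.25, 0.0, 0.5],
]
W, H = nmf(V, rank=2)
```

In a transcription setting, the columns of `W` would play the role of note spectra and the rows of `H` their activations over time; thresholding `H` to obtain note events is left out of this sketch.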
|P1-3||[ Submission ID: 19 ] Towards Context-Sensitive Music Recommendations Using Multifaceted User Profiles - Rafael Schirru1,2, Stephan Baumann1, Christian Freye3, Andreas Dengel1,2, 1German Research Center for Artificial Intelligence, Kaiserslautern, Germany, 2University of Kaiserslautern, Kaiserslautern, Germany, 3Brandenburg University of Applied Sciences, Brandenburg, Germany
In this paper we present an approach for extracting multifaceted user profiles that enable recommendations according to a user's different preferred music styles. We describe the artists a user has listened to by making use of metadata obtained from Semantic Web data sources. After preprocessing the data, we cluster a user's preferred artists and extract a descriptive label for each cluster. These labels are then aggregated to form multifaceted user profiles representing a user's full range of preferred music styles. Our evaluation experiments show that the extracted labels are specific to the artists in the clusters and can thus be used to recommend, e.g., internet radio stations and allow for an integration into existing recommendation strategies.
|P1-4||[ Submission ID: 39 ] Automatic Classification of Musical Pieces Into Global Cultural Areas - Anna Kruspe, Hanna Lukashevich, Jakob Abeßer, Holger Großmann, Christian Dittmar, Fraunhofer Institute for Digital Media Technology, Ilmenau, Germany
Music Information Retrieval (MIR) has a large variety of applications. One aspect that has not gathered a lot of attention yet is the application to non-western music ("world music"). In a task comparable to genre classification, this work's goal is the classification of musical pieces into their corresponding cultural regions of origin. As a basis for such a classification, a three-tier taxonomy based on musical and geographic properties is created. A database consisting of approximately 4400 musical pieces representing the taxonomical classes is assembled and annotated. Based on rhythmic, tonal, and timbre-related audio features, different classification experiments are performed. We achieved an accuracy of approx. 70% for the classification of musical pieces into nine large world regions. Twelve new features that are especially suited for non-western music are implemented. They improve the classification result slightly. For the purpose of comparison, we carried out a listening test with musical laymen with an average accuracy of 52%.
|P1-5||[ Submission ID: 49 ] Regression-Based Tempo Recognition from Chroma and Energy Accents for Slow Audio Recordings - Thorsten Deinert, Igor Vatolkin, Günter Rudolph, TU Dortmund, Dortmund, Germany
Although the performance of automatic tempo estimation methods has improved through recent research, some problems remain unsolved. One of them is the analysis of slow music or songs without a strong drum pulse that corresponds to the correct tempo. One of the most frequent errors is the prediction of the doubled tempo; however, further error sources exist. In our work we reimplemented, extended, and optimized the original tempo recognition method from Eronen and Klapuri with the concrete goal of achieving reliable classification accuracy, especially for slow songs. The results from the experimental study confirm the increased quality of the adapted algorithm chain. Several possible error sources are discussed in detail, and further ideas beyond the scope of this work are proposed for future research.
|P1-6||[ Submission ID: 38 ] Blind Estimation of Reverberation Time from Monophonic Instrument Recordings Based on Non-Negative Matrix Factorization - Maximo Cobos1, Pedro Vera-Candeas2, Julio Jose Carabias-Orti2, Nicolas Ruiz-Reyes2, Jose J. Lopez1, 1Institute for Telecommunications and Multimedia Applications (iTEAM), Universitat Politècnica de València, Valencia, Spain, 2Escuela Politécnica Superior de Linares, Universidad de Jaen, Linares, Spain
Reverberation time is the most important acoustic parameter that describes the acoustic behavior of a room. To measure this parameter, test signals such as pink or impulse noises are usually employed. However, despite its importance, there are few methods aimed at estimating reverberation time from natural signals such as music and speech. In this paper, we propose a method for estimating reverberation time from monophonic instrument recordings by detecting decay parts in the performance using a Non-Negative Matrix Factorization (NMF) algorithm with a Basic Harmonic Constrained (BHC) model. Preliminary results using different databases and instrument models are given for mid-long reverberation times.
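Once a decay segment has been located, a reverberation time can be read from its slope. The sketch below assumes the conventional definition (the time needed for the level to drop by 60 dB) and fits a straight line to the level in dB; the abstract does not state how the authors fit the decay, so the function name, the envelope input, and the least-squares fit are assumptions of this example. The NMF-based detection of decay parts is not shown.

```python
import math

def rt60_from_decay(envelope, frame_rate):
    """Estimate the 60 dB decay time (seconds) from an amplitude envelope
    sampled at frame_rate frames per second, via a least-squares line fit
    of the level in dB against time."""
    times = [n / frame_rate for n in range(len(envelope))]
    levels = [20.0 * math.log10(max(a, 1e-12)) for a in envelope]
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(levels) / n
    cov = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, levels))
    var = sum((t - mean_t) ** 2 for t in times)
    slope = cov / var  # dB per second, negative for a decaying segment
    if slope >= 0:
        raise ValueError("segment does not decay")
    return -60.0 / slope

# Synthetic decay with a known 60 dB decay time of 1.2 s (hypothetical data).
envelope = [10 ** (-60.0 * (n / 100) / 1.2 / 20.0) for n in range(50)]
estimate = rt60_from_decay(envelope, frame_rate=100)
```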
|Paper Session 2 - Speech Processing and Analysis|
|2-1||[ Invited Talk ] [ Submission ID: 26 ] Semantic Speech Tagging: Towards Combined Analysis of Speaker Traits - Bjoern W. Schuller, Institute for Human-Machine Communication, Technische Universität München, München, Germany
A number of paralinguistic problems are often dealt with in isolation, such as emotion, health state, or personality. However, there are also good examples of mutual benefit, mostly incorporating speaker gender knowledge. In this paper we deal with the question of how further paralinguistic information, such as speaker age, height, or race, can provide beneficial information when its ground truth knowledge is provided within single-task speaker classification. Tests with openSMILE's 1.5k Paralinguistic Challenge feature set on the TIMIT corpus of 630 speakers reveal a significant boost in accuracy or cross-correlation depending on the representation form of the problem at hand.
|2-2||[ Submission ID: 45 ] SyncTS: Automatic synchronization of speech and text documents - David Damm1, Harald Grohganz1, Frank Kurth2, Sebastian Ewert1, Michael Clausen1, 1University of Bonn, Department of Computer Science III, Bonn, Germany, 2Fraunhofer Institute for Communication, Information Processing and Ergonomics, Wachtberg, Germany
In this paper, we present an automatic approach for aligning speech signals to corresponding text documents. To this end, we propose to first use text-to-speech synthesis (TTS) to obtain a speech signal from the textual representation. Subsequently, both speech signals are transformed to sequences of audio features which are then time-aligned using a variant of greedy dynamic time-warping (DTW). The proposed approach is efficient (with linear running time) and computationally simple, and it does not rely on a prior training phase as is necessary when using HMM-based approaches. It benefits from the combination of a) a novel type of speech feature that is correlated to the phonetic progression of speech, b) a greedy left-to-right variant of DTW, and c) the TTS-based approach for creating a feature representation from the input text documents. The feasibility of the proposed method is demonstrated in several experiments.
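To make the idea of a greedy left-to-right alignment concrete, here is a minimal sketch that walks two feature sequences from start to end and always takes the locally cheapest step, which gives linear running time. It is an assumption-laden illustration rather than the SyncTS algorithm: the scalar features, the absolute-difference cost, the three allowed step types, and the function name `greedy_align` are all chosen for this example.

```python
def greedy_align(x, y, dist=lambda a, b: abs(a - b)):
    """Greedily align sequences x and y from left to right.

    At each position, move diagonally, down, or right, whichever target
    cell has the lowest local cost (diagonal wins ties). Returns the list
    of (i, j) index pairs visited, from (0, 0) to the final cell."""
    i, j = 0, 0
    path = [(0, 0)]
    last_i, last_j = len(x) - 1, len(y) - 1
    while i < last_i or j < last_j:
        candidates = []
        if i < last_i and j < last_j:
            candidates.append((dist(x[i + 1], y[j + 1]), i + 1, j + 1))
        if i < last_i:
            candidates.append((dist(x[i + 1], y[j]), i + 1, j))
        if j < last_j:
            candidates.append((dist(x[i], y[j + 1]), i, j + 1))
        # min() returns the first minimal candidate, so the diagonal is preferred.
        _, i, j = min(candidates, key=lambda c: c[0])
        path.append((i, j))
    return path

# Hypothetical one-dimensional feature sequences.
path = greedy_align([0, 1, 2, 3], [0, 0, 1, 2, 2, 3])
```

Unlike full DTW, a greedy walk never revisits a decision, so one poor local choice cannot be corrected later; this is the price paid for linear time.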
|2-3||[ Submission ID: 58 ] Extraction of spectro-temporal speech cues for robust automatic speech recognition - Bernd T. Meyer, International Computer Science Institute, Berkeley, CA, USA
This work analyzes the use of spectro-temporal signal characteristics with the aim of improving the robustness of automatic speech recognition (ASR) systems. Experiments that aim at the robustness against extrinsic sources of variability (such as additive noise) as well as intrinsic variation of speech (changes in speaking rate, style, and effort) are presented. Results are compared to scores for the most common features in ASR (mel-frequency cepstral coefficients and perceptual linear prediction features), which account for the spectral properties of short-time segments of speech, but mostly neglect temporal or spectro-temporal cues. Intrinsic variations were found to severely degrade the overall ASR performance. The performance of the two most common feature types was degraded in much the same way, whereas the proposed spectro-temporal features exhibit a different sensitivity against intrinsic variations, which suggests that classic and spectro-temporal feature types carry complementary information. Furthermore, spectro-temporal features were shown to be more robust than the baseline system in the presence of additive noise.
|Paper Session 3 - Automatic Music Transcription|
|3-1||[ Submission ID: 44 ] Automatic Recognition and Parametrization of Frequency Modulation Techniques in Bass-Guitar Recordings - Jakob Abeßer1, Christian Dittmar1, Gerald Schuller2, 1Fraunhofer Institute for Digital Media Technology, Ilmenau, Germany, 2Ilmenau University of Technology, Ilmenau, Germany
In this paper, we propose a novel method to parametrize and classify different frequency modulation techniques in bass guitar recordings. A parametric spectral estimation technique is applied to refine the fundamental frequency estimates derived from an existing bass transcription algorithm. We apply a two-stage taxonomy of bass playing styles with special focus on the frequency modulation techniques slide, bending, and vibrato. An existing database of isolated note recordings is extended by approx. 900 samples to evaluate the presented algorithm. We achieve comparable classification accuracy values of 85.1% and 81.5% for classification on class-level and subclass-level. Furthermore, two potential application scenarios are outlined.
|3-2||[ Submission ID: 11 ] Note Clustering based on 2D Source-Filter Modeling for Underdetermined Blind Source Separation - Martin Spiertz, Volker Gnann, Institut für Nachrichtentechnik, RWTH Aachen University, Aachen, Germany
For blind source separation, the non-negative matrix factorization extracts single notes out of a mixture. These notes can be clustered to form the melodies played by a single instrument. A current approach for clustering utilizes a source-filter model to describe the envelope over the first dimension of the spectrogram: the frequency axis. The novelty of this paper is to extend this approach by a second source-filter model, characterizing the second dimension of a spectrogram: the time axis. The latter models the temporal evolution of the energy of one note: an instrument-specific envelope is convolved with an activation vector, corresponding to tempo, rhythm, and amplitudes of single note instances. We introduce an unsupervised clustering framework for both models and a simple, yet effective combination strategy. Finally, we show the advantages of our separation algorithm compared with two other state-of-the-art separation frameworks: the separation quality is comparable, but our algorithm requires much less computation, is independent of other BSS algorithms for initialization, and works with a single set of parameters for a wide range of audio data.
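The time-axis model can be pictured as a plain discrete convolution: each non-zero entry of the activation vector triggers a scaled copy of the instrument's envelope. The sketch below shows only that forward model with made-up numbers; how the envelope and activations are estimated, and how notes are then clustered, is not covered, and the function name is an assumption.

```python
def convolve(activation, envelope):
    """Full discrete convolution of an activation vector with an envelope.

    Each non-zero activation places a copy of the envelope at that frame,
    scaled by the activation amplitude."""
    out = [0.0] * (len(activation) + len(envelope) - 1)
    for i, a in enumerate(activation):
        if a == 0:
            continue
        for j, e in enumerate(envelope):
            out[i + j] += a * e
    return out

# Hypothetical values: a decaying envelope, one full-strength note onset at
# frame 0 and a half-strength onset at frame 3.
energy = convolve([1.0, 0.0, 0.0, 0.5], [1.0, 0.5, 0.25])
```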
|3-3||[ Submission ID: 43 ] Pitch Estimation by the Pair-Wise Evaluation of Spectral Peaks - Karin Dressler, Fraunhofer Institute for Digital Media Technology, Ilmenau, Germany
In this paper, a new approach for pitch estimation in polyphonic musical audio is presented. The algorithm is based on the pair-wise analysis of spectral peaks. The idea of the technique lies in the identification of partials with successive (odd) harmonic numbers. Since successive partials of a harmonic sound have well-defined frequency ratios, a possible fundamental can be derived from the instantaneous frequencies of the two spectral peaks. Subsequently, the identified harmonic pairs are rated according to harmonicity, timbral smoothness, the appearance of intermediate spectral peaks, and harmonic number. Finally, the resulting pitch strengths are added to a pitch spectrogram. The pitch estimation was developed for the identification of the predominant voice (e.g. melody) in polyphonic music recordings. It was evaluated as part of a melody extraction algorithm during the Music Information Retrieval Evaluation eXchange (MIREX 2006 and 2009), where the algorithm reached the best overall accuracy as well as very good performance measures.
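The frequency-ratio argument can be sketched as follows. If two peaks are partials h and h+1 of the same tone, their difference equals the fundamental; if they are successive odd partials h and h+2, the difference is twice the fundamental. The code below derives candidates this way and rates them by harmonicity alone. The tolerance, the harmonic limit, the scoring formula, and the function name are assumptions for illustration; the other rating criteria named in the abstract are not modeled.

```python
def pair_candidates(f_lo, f_hi, max_harmonic=10, tol=0.03):
    """Derive fundamental-frequency candidates from two peak frequencies (Hz).

    Returns a list of (f0, harmonic_number_of_lower_peak, score) tuples,
    where score is 1.0 for a perfectly harmonic pair and falls to 0.0 at
    the tolerance limit."""
    candidates = []
    for step in (1, 2):  # 1: successive partials, 2: successive odd partials
        f0 = (f_hi - f_lo) / step
        if f0 <= 0:
            continue
        h = f_lo / f0
        h_int = round(h)
        if h_int < 1 or h_int > max_harmonic:
            continue
        if step == 2 and h_int % 2 == 0:
            continue  # the odd-partial hypothesis needs an odd harmonic number
        deviation = abs(h - h_int) / h_int
        if deviation <= tol:
            candidates.append((f0, h_int, 1.0 - deviation / tol))
    return candidates

# Peaks at 440 Hz and 660 Hz fit partials 2 and 3 of a 220 Hz tone.
result = pair_candidates(440.0, 660.0)
```

In a complete system, the scores from all peak pairs in a frame would be accumulated per candidate pitch to form one column of the pitch spectrogram.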
|Saturday, July 23|
|Paper Session 4 - Music Information Retrieval [Part 2]|
|4-1||[ Invited talk ] [ Submission ID: 33 ] Adaptive Distance Measures for Exploration and Structuring of Music Collections - Sebastian Stober, Otto-von-Guericke-University Magdeburg, Magdeburg, Germany
Music similarity plays an important role in many Music Information Retrieval applications. However, it has many facets and its perception is highly subjective - very much depending on a person's background or task. This paper presents a generalized approach to modeling and learning individual distance measures for comparing music pieces based on multiple facets that can be weighted. The learning process is described as an optimization problem guided by generic distance constraints. Three application scenarios with different objectives exemplify how the proposed method can be employed in various contexts by deriving distance constraints either from domain-specific expert information or user actions in an interactive setting.
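One way to picture a weighted multi-facet distance learned from constraints is the sketch below. The distance is a weighted sum of per-facet distances, and each constraint says that one item should be closer to a seed than another. The update rule used here (a perceptron-style step followed by clipping and renormalization) is a simple stand-in chosen for this example and is not claimed to be the optimization used in the paper; all names and numbers are hypothetical.

```python
def weighted(weights, facet_distances):
    """Weighted sum of per-facet distances."""
    return sum(w * d for w, d in zip(weights, facet_distances))

def learn_weights(constraints, n_facets, lr=0.1, epochs=100):
    """Learn non-negative facet weights that sum to 1.

    Each constraint is a pair (near, far) of per-facet distance vectors,
    meaning the weighted distance for `near` should be smaller than the
    weighted distance for `far`."""
    w = [1.0 / n_facets] * n_facets
    for _ in range(epochs):
        violated = False
        for near, far in constraints:
            if weighted(w, near) >= weighted(w, far):
                violated = True
                # Shift weight away from facets that contradict the constraint.
                w = [max(0.0, wi - lr * (n - f)) for wi, n, f in zip(w, near, far)]
                total = sum(w)
                w = [wi / total for wi in w] if total > 0 else [1.0 / n_facets] * n_facets
        if not violated:
            break
    return w

# Facet 0 contradicts the constraint, facet 1 supports it (hypothetical values).
constraints = [([0.9, 0.1], [0.1, 0.8])]
weights = learn_weights(constraints, n_facets=2)
```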
|4-2||[ Submission ID: 32 ] Expressivity in musical timing in relation to musical structure and interpretation: a cross-performance, audio-based approach - Cynthia C.S. Liem1, Alan Hanjalic1, Craig Stuart Sapp2, 1Delft University of Technology, The Netherlands, 2CCRMA/CCARH, Stanford University, USA
Classical music performances are personal, expressive renditions, representing a performing musician's artistic view on a written music score. Typically, many interpretations are available for the same music piece. We believe that the variation in expressive renditions across performances can be exploited to gain insight into the musical content and provide supporting information for existing Music Information Retrieval tasks. In this paper, we focus on timing as one aspect of an individual performer’s expressivity and propose a light-weight, unsupervised and audio-based method to study timing deviations among different performances. The results of our qualitative study obtained for five Chopin mazurkas show that timing individualism as inferred by our method can be related to the structure of a music piece, and even highlight interpretational aspects of a piece that are not necessarily visible from the musical score.
|4-3||[ Submission ID: 34 ] Interactive Classification of Sound Objects for Polyphonic Electro-Acoustic Music Annotation - Sebastien Gulluni1,2, Slim Essid2, Olivier Buisson1, Gaël Richard2, 1Institut National de l’Audiovisuel, Bry-sur-marne Cedex, France, 2Institut Telecom, Telecom ParisTech, Paris, France
In this paper, we present an interactive approach for the classification of sound objects in electro-acoustic music. For this purpose, we use relevance feedback combined with active-learning segment selection in an interactive loop. Validation and correction information given by the user is injected in the learning process at each iteration to achieve more accurate classification. Three active learning criteria are compared in the evaluation of a system classifying polyphonic pieces (with a varying degree of polyphony). The results show that the interactive approach achieves satisfying performance in a reasonable number of iterations.
|Paper Session 5 - Audio Source Separation [Part 2]|
|5-1||[ Submission ID: 51 ] Singing Voice Separation from Stereo Recordings using Spatial Clues and Robust F0 Estimation - Pablo Cabañas-Molero1, Damián Martínez-Muñoz1, Maximo Cobos2, José J. López2, 1University of Jaén, Polytechnic School, Linares, Jaén, Spain, 2Institute for Telecommunications and Multimedia Applications (iTEAM), Technical University of Valencia, Valencia, Spain
Separation of singing voice from music accompaniment is a topic of great utility in many applications of Music Information Retrieval. In the context of stereophonic music mixtures, many algorithms address this problem by making use of the spatial diversity of the sound sources to localize and isolate the singing voice. Although these spatial approaches can obtain acceptable results, the separated signal is usually affected by a high level of distortions and artifacts. In this paper, we propose a method for improving the isolation of the singing voice in stereo recordings based on incorporating the fundamental frequency (F0) information into the separation process. First, the singing voice is pre-separated from the input mixture using a state-of-the-art stereo source separation method, the MuLeTs algorithm. Then, the F0 of this pre-separated signal is obtained using a robust pitch estimator based on the computation of the difference function and Hidden Markov Models, obtaining a smooth pitch contour with voiced/unvoiced decisions. A binary mask is finally constructed from F0 to isolate the singing voice from the original mix. The method has been tested on studio music recordings, obtaining good separation results.
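The final masking step might look like the following sketch, which marks the spectrogram bins that lie near integer multiples of the estimated F0 in each voiced frame. The abstract does not specify how the mask is built, so the number of harmonics, the band width around each harmonic, the frames-by-bins layout, and the function name are assumptions of this example. The pre-separation and pitch tracking stages are not shown.

```python
import math

def harmonic_mask(f0_track, n_bins, sample_rate, n_fft, n_harmonics=20, width_hz=40.0):
    """Build a binary time-frequency mask (frames x bins) from an F0 contour.

    f0_track holds one F0 value in Hz per frame, with 0 for unvoiced frames.
    A bin is set to 1 if its center frequency lies within width_hz / 2 of a
    harmonic of the frame's F0."""
    bin_hz = sample_rate / n_fft
    mask = []
    for f0 in f0_track:
        row = [0] * n_bins
        if f0 and f0 > 0:
            for k in range(1, n_harmonics + 1):
                center = k * f0
                lo = max(0, math.ceil((center - width_hz / 2) / bin_hz))
                hi = min(n_bins - 1, math.floor((center + width_hz / 2) / bin_hz))
                for b in range(lo, hi + 1):
                    row[b] = 1
        mask.append(row)
    return mask

# Hypothetical setup: 100 Hz bin spacing, one unvoiced and one voiced frame.
mask = harmonic_mask([0.0, 200.0], n_bins=41, sample_rate=8000, n_fft=80,
                     n_harmonics=3, width_hz=100.0)
```

Multiplying such a mask element-wise with the mixture spectrogram and inverting the transform would yield the isolated voice estimate; unvoiced frames are simply zeroed in this simplified version.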
|5-2||[ Submission ID: 42 ] Interaction of phase, magnitude and location of harmonic components in the perceived quality of extracted solo signals - Estefanía Cano1, Christian Dittmar1, Gerald Schuller2, 1Fraunhofer Institute for Digital Media Technology, Ilmenau, Germany, 2Ilmenau University of Technology, Ilmenau, Germany
During the last year, many research efforts have been directed to the refinement of sound source separation algorithms. However, little or no effort has been made to assess the impact and interaction of different spectral parameters, such as phase, magnitude, and location of harmonic components, on the resulting quality of the extracted signals. Recent developments in objective measures for sound quality that also fit subjective ratings have made this possible. This paper presents a study where spectral phase, magnitude, and location of harmonic components are systematically changed to assess the impact of such variations on the perceived quality of extracted solo signals. To properly evaluate results, multi-track recordings that allow comparison with original tracks were used.
|Poster Session 2|
|P2-1||[ Submission ID: 16 ] A Psychoacoustic Approach to Wave Field Synthesis - Tim Ziemer, Institute of Musicology, University of Hamburg, Hamburg, Germany
Conventional audio systems use psychoacoustic knowledge to create a sound which is perceived as equivalent to natural auditory events. Wave field synthesis (WFS) has overcome several disadvantages of conventional stereophonic audio systems by physically synthesizing natural wave fields. A practical implementation of a wave field synthesis system leads to errors which the literature proposes to compensate by physical means (e.g. compensation of shadow waves via compensation sources, modeling reflections from the third dimension via 2½D-operator) or by a combination of WFS with conventional stereophonic sound (e.g. compensating aliasing errors by optimized phantom source imaging (OPSI)). This paper introduces a psychoacoustic approach to compensating synthesis errors in order to ensure proper localization, sound coloration, and spaciousness. Suggestions for further psychoacoustic WFS research topics are given.
|P2-2||[ Submission ID: 50 ] Comparative evaluation and combination of audio tempo estimation approaches - Jose R. Zapata, Emilia Gómez, Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
The automatic analysis of musical tempo from audio is still an open research task in the Music Information Retrieval (MIR) community. The goal of this paper is to provide an updated comparative evaluation of different methods for audio tempo estimation. We overview, following the same block diagram, 23 documented methods. We then analyze their accuracy, error distribution and statistical differences, and we discuss which strategies can provide better performance for different input material. We then take advantage of their complementarity to improve the results by combining different methods, and we finally analyze the limitations of current approaches and give some ideas for future work on the task.
|P2-3||[ Submission ID: 70 ] Observing uncertainty in music tagging by automatic gaze tracking - Bozana Kostek, Multimedia Systems Department of Gdansk University of Technology, Gdansk, Poland
In this paper, a new approach to observing the music file tagging process by employing a gaze tracking system is proposed. The study was conducted with the participation of twenty subjects with different levels of musical experience. For the purpose of the experiments, a website survey based on a musical database was prepared. It allowed to gather information abo