AES 45th conference, Helsinki, Finland, March 1-4, 2012.

NOTE! The first day (Thursday, March 1) is held at the Aalto University Electrical Engineering building. The street address is Otakaari 5A, and the venue is most easily reached by buses 102 and 102T.

The program of the conference is available HERE.

HERE you can find links to optional cultural events on Sunday afternoon, if you are not traveling home immediately after the conference.

The invited speakers and their topics are as follows.

Bernd Edler 
International Audio Laboratories, Erlangen (AudioLabs)
Fraunhofer-Gesellschaft and Friedrich-Alexander University Erlangen-Nürnberg, Germany

Mathematical Background for Time-Frequency Processing of Audio
Systems that map audio signals from the time domain to a spectral representation are nowadays used in almost all audio codecs and in many other processing applications. Usually a method for the conversion from the spectral representation back to the time domain is needed as well. This tutorial gives an introduction to this class of systems. Different approaches, such as filter banks, block transforms, wavelet transforms, and linear transforms in multi-dimensional vector spaces, are compared, and their equivalence relations are shown. Knowing about these relations allows the selection of the most appropriate approach for the analysis or design of any of these systems. For example, regarding a system as a filter bank makes it possible to determine frequency responses, while other properties can be examined by interpreting it as a transform. Important system properties like temporal and spectral resolution, critical sampling, perfect reconstruction, (over-)completeness, orthogonality, and system delay are explained. The concepts of aliasing cancellation in the frequency domain and in the time domain are introduced. In this context, the fundamentals of sampling, sampling rate changes, duality of time and frequency domains, multi-rate, and polyphase systems will be reviewed. Methods for the design of systems yielding perfect reconstruction and approaches to efficient implementations are presented. Furthermore, some approaches for achieving time-varying temporal and spectral resolutions are explained, which allow adaptation to the variations in characteristics that frequently occur in audio signals. Complex-valued spectral representations are compared to real-valued representations. Approaches for achieving non-uniform time and frequency resolutions are shown, and their advantages and disadvantages compared with uniform resolutions are discussed. The gains that can be achieved in coding applications by redundancy and irrelevance reduction are analyzed.
Some optimization methods are indicated, which take into account special needs in coding and processing of audio signals. Some systems used in standard audio coding schemes like the cascaded structure of mp3, the MDCT used in MPEG Advanced Audio Coding, and the filter banks used in parametric extensions are examined in detail.
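As an informal illustration of critical sampling, perfect reconstruction, and time-domain aliasing cancellation (not material from the tutorial itself), the following minimal sketch implements a direct, unoptimized MDCT with a sine window and 50% overlap-add. The function names (sine_window, mdct, imdct, tdac_roundtrip), the choice of the sine window, and the 2/N scaling in the inverse are assumptions made for this example; practical codecs use fast algorithms and other windows.

```python
import math

def sine_window(N):
    # Length-2N sine window; it satisfies w[n]^2 + w[n+N]^2 = 1,
    # the condition needed for the overlapping blocks to sum to the input.
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

def mdct(block, window):
    # Maps 2N windowed time samples to N coefficients (critical sampling
    # once consecutive blocks overlap by N samples).
    N = len(block) // 2
    n0 = 0.5 + N / 2
    return [
        sum(window[n] * block[n] * math.cos(math.pi / N * (n + n0) * (k + 0.5))
            for n in range(2 * N))
        for k in range(N)
    ]

def imdct(coeffs, window):
    # Maps N coefficients back to 2N samples. Each output block still
    # contains time-domain aliasing; it cancels only after overlap-add.
    N = len(coeffs)
    n0 = 0.5 + N / 2
    return [
        window[n] * (2.0 / N) * sum(
            coeffs[k] * math.cos(math.pi / N * (n + n0) * (k + 0.5))
            for k in range(N))
        for n in range(2 * N)
    ]

def tdac_roundtrip(signal, N):
    # Analyse and resynthesise a signal with hop size N and block length 2N.
    w = sine_window(N)
    tail = (-len(signal)) % N
    # Zero-pad so that every input sample is covered by exactly two blocks.
    padded = [0.0] * N + list(signal) + [0.0] * (tail + N)
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * N + 1, N):
        y = imdct(mdct(padded[start:start + 2 * N], w), w)
        for n, value in enumerate(y):
            out[start + n] += value
    return out[N:N + len(signal)]

signal = [math.sin(0.3 * i) + 0.1 * i for i in range(16)]
reconstructed = tdac_roundtrip(signal, 4)
max_error = max(abs(a - b) for a, b in zip(signal, reconstructed))
```

Each block of 2N samples yields only N coefficients, yet max_error stays at the level of floating-point rounding, because the aliasing terms of neighbouring blocks have opposite signs and cancel in the overlap-add.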

Anssi Klapuri
Queen Mary, University of London, Department of Electronic Engineering (Centre for Digital Music)

Time-frequency processing of music signals
This talk discusses the time-frequency domain processing of music signals. Both the tonal and the percussive elements of music signals are addressed. The first part of the talk describes a mapping of audio signals from the time-frequency to the time-pitch domain. The human auditory system has a strong tendency to associate complex sounds with a single frequency value that we call pitch. A time-varying pitch trajectory summarizes certain aspects of sound events and is useful in organizing complex auditory scenes where several sound sources are active at the same time. A time-pitch domain representation of audio signals provides an informative visualization of music signals and is useful for acoustic signal analysis tasks. Furthermore, the time-differential of such a representation can be used to model certain aspects of human auditory organization. Applications to main melody transcription and vocals separation from complex music are discussed. In the second part of the talk, the percussive element of music signals is addressed. In particular, the signal model of non-negative matrix factorization (NMF) is described, and its use and modeling power for music signals are discussed. NMF is an unsupervised learning method that can be used to estimate the spectral shape and time-varying gain of component sounds in a polyphonic audio signal. Typically, the input consists of the magnitude or power spectrogram of an audio signal. Applications of NMF to drum transcription and audio effects are demonstrated, and extensions of NMF utilizing prior information, weighted NMF, or non-negative matrix deconvolution are briefly introduced. In certain audio analysis tasks, the constant-Q transform (CQT) is more suitable than the Fourier transform. CQT refers to a wavelet transform with relatively high frequency resolution (12-120 bins per octave).
The recent introduction of toolboxes for the inverse CQT and CQT-domain processing enables some flexibility in transform-domain audio signal processing, including NMF-based methods.
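As an informal illustration of the NMF signal model mentioned above (not the speaker's implementation), the sketch below factorizes a small non-negative matrix V into spectral shapes W and time-varying gains H using the basic multiplicative updates for the Euclidean cost. The function names, the toy matrix, the random initialization, and the fixed iteration count are all assumptions made for this example; the abstract does not say which cost function or algorithm the talk uses.

```python
import random

def matmul(A, B):
    # Plain list-of-lists matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    # Squared Euclidean distance between V and the model W @ H.
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, rank, iterations=200, seed=0, eps=1e-12):
    # V: non-negative "spectrogram" (frequency bins x time frames).
    # Returns W (bins x rank, spectral shapes) and H (rank x frames, gains).
    rng = random.Random(seed)
    n_bins, n_frames = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_bins)]
    H = [[rng.random() + 0.1 for _ in range(n_frames)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H

# Toy input: two hypothetical spectral shapes mixed with time-varying gains.
W_true = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
H_true = [[1.0, 0.0, 2.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 2.0, 1.0, 1.0]]
V = matmul(W_true, H_true)
W, H = nmf(V, rank=2)
```

Because the updates only multiply non-negative quantities, W and H stay non-negative throughout, which is what lets the columns of W be read as spectral shapes and the rows of H as their gains over time.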

Torsten Dau
Department of Electrical Engineering, DTU, Copenhagen, Denmark

Human auditory signal processing in complex acoustic environments
In everyday life, the speech we listen to is often mixed with many other sound sources as well as reverberation. In such situations, people with normal hearing are able to almost effortlessly segregate a single voice out of the background – a skill commonly known as the 'cocktail party effect'. In contrast, hearing-impaired people have great difficulty understanding speech when more than one person is talking, even when reduced audibility has been fully compensated for by a hearing aid. As with the hearing impaired, the performance of automatic speech recognition systems deteriorates dramatically with additional sound sources. The reasons for these difficulties are not well understood. This presentation highlights recent concepts of the monaural and binaural signal processing strategies employed by the normal as well as impaired auditory system. The aim is to develop a computational auditory signal-processing model, capable of describing the transformation from the acoustical input signal into its “internal” (neural) representations. Several stages of processing, including cochlear, midbrain and central stages, are considered to be important for a robust signal representation, and a deficiency in any of these processing stages is likely to result in a deterioration of the entire system’s performance. A state-of-the-art model of auditory signal processing would be of major practical significance for technical applications, in digital hearing aids, cochlear implants, speech and audio coding, and automatic speech recognition.

Karlheinz Brandenburg
Fraunhofer IDMT, Ilmenau, Germany

35 years of time-frequency domain based audio coding

Nearly all current low bit-rate audio coding systems are based on filterbanks. From early ideas using analog filterbanks (!) through systems using QMF filterbanks or the DFT, considerable work has been put into optimizing filterbank structures for audio coding. The independent development of cosine-modulated filterbanks at several places in the late 1980s (e.g. the MDCT, directly using the papers on TDAC by Princen/Bradley) improved the basic parameters to a large degree. Since then, a number of other modifications have been applied with varying degrees of success. The talk will give examples of techniques and codecs through these 35 years.









