Journal of the AES

2017 May - Volume 65 Number 5


Modeling Perceptual Characteristics of Loudspeaker Reproduction in a Stereo Setup

Open Access



Loudspeaker specifications traditionally describe physical characteristics rather than the perceptual properties of the sound reproduction. This study explores three metrics for predicting the perceived characteristics of loudspeakers’ sound in a stereo setup evaluated in a standardized listening room. Perceptual evaluations of eleven loudspeakers were conducted on the basis of six sensory descriptors chosen by trained listeners during consensus meetings. Four of these descriptors were found suitable for modeling: Bass depth, Punch, Brilliance, and Dark-Bright. Because Bass depth and Punch were highly correlated, they were combined, which yields three metrics. The metrics took as input recordings made using a head-and-torso simulator and processed using a loudness model. The prediction models were trained on a subset of seven sets of loudspeakers and validated on four others. The correlation coefficients between perceptual evaluations and metric outputs ranged from r = 0.85 to r = 0.96.
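
The correlation figure quoted above can be illustrated with a minimal sketch of a Pearson coefficient between listener ratings and metric outputs. The function name and the numbers below are placeholders invented for illustration; they are not data from the paper, and the paper may compute its coefficients differently.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for four validation loudspeakers (NOT from the paper).
perceived = [2.0, 3.5, 5.0, 6.5]   # hypothetical listener ratings
predicted = [2.2, 3.1, 5.4, 6.0]   # hypothetical metric outputs
r = pearson_r(perceived, predicted)
```

A value of r near 1 means the metric ranks and spaces the loudspeakers much as the listeners did.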

  Download: PDF (HIGH Res) (718KB)

  Download: PDF (LOW Res) (338KB)


Real-Time Emulation of the Acoustic Violin using Convolution and Arbitrary Equalization


This report describes a real-time audio effects processor that modifies the signal generated by an electric violin to produce a sound that closely resembles the tonal qualities of an equivalent acoustic instrument. The processor convolves the incoming signal with an impulse response measured from an acoustic violin in the far field. This approach is justified if the violin body behaves as a linear system, which is typically a good approximation. Because the processor can store sixteen such responses, it is well suited to listening studies in which the timbres of different emulations are assessed or compared. The device further incorporates an adjustable arbitrary equalizer and a blender after the convolution stage, which allow the performer to modify the instrument voice to match personal preferences or room acoustics. The system has been evaluated in a blind listening study, in which participants were asked to rank emulations based on a range of violins of varying quality, including Old Italian models. Statistical analysis suggests that high-quality instruments were favored over the raw electric sound and a cheap student model. The study confirmed that the improvements in tonal quality were convincing and realistic, conveying many of the tonal nuances of the emulated wooden instrument.
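
The convolution step described above can be sketched as a direct time-domain sum. This is only an offline illustration with a hypothetical function name; the abstract does not say how the processor implements convolution, and a real-time device would more plausibly use a block-based FFT method.

```python
def convolve(signal, impulse_response):
    """Full linear convolution: y[n] = sum over k of h[k] * x[n - k]."""
    if not signal or not impulse_response:
        raise ValueError("both inputs must be non-empty")
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            # Each input sample excites a scaled, delayed copy of the response.
            out[n + k] += x * h
    return out


# Toy example: a three-sample "pickup signal" and a two-tap "body response".
electric = [1.0, 2.0, 3.0]
body_ir = [1.0, 1.0]
emulated = convolve(electric, body_ir)
```

Because the operation is linear, scaling or summing input signals scales or sums the outputs, which is why the method relies on the violin body behaving as a linear system.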

  Download: PDF (HIGH Res) (4.6MB)

  Download: PDF (LOW Res) (286KB)


Singing Voice Separation by Low-Rank and Sparse Spectrogram Decomposition with Prelearned Dictionaries


Although the human auditory system can easily distinguish the singing voice from the background music in a music recording, it is extremely difficult for computer systems to replicate this ability, especially when the music mixture is a single channel. The challenge arises from the variety of simultaneous sound sources as well as from the rich pitch and timbre variations of a singing voice. Unsupervised spectrogram decomposition involves separating the mixture spectrogram into a sparse spectrogram for the singing voice and a low-rank spectrogram for the background music. This approach has two limitations: its unsupervised nature prevents the prelearning of voice and background music dictionaries, and some components of the singing voice and background music may not show the preferred sparse and low-rank properties. In contrast, the authors propose to decompose the mixture spectrogram into three parts: a sparse spectrogram representing the singing voice, a low-rank spectrogram representing the background music, and a residual spectrogram for the components that are not identified by either the sparse or the low-rank spectrogram. Universal dictionaries for the singing voice and background music are prelearned from isolated singing voice and background music training data, through which prior knowledge of the voice and background music is introduced to the separation process. Evaluations on two datasets show that the proposed method is effective and efficient for both the separated singing voice and music accompaniment at various voice-to-music ratios.
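
To make the three-part split concrete, here is a toy sketch that divides a small matrix into a low-rank part, a sparse part, and a residual. It is not the authors' method: it uses no prelearned dictionaries, fixes the rank at one via power iteration, and applies a single soft-thresholding pass. All names and the threshold value are assumptions chosen for illustration.

```python
import math


def rank1_approx(M, iters=50):
    """Best rank-1 approximation of M, found by power iteration on M^T M."""
    rows, cols = len(M), len(M[0])
    v = [1.0] * cols
    for _ in range(iters):
        u = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(M[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    # u = M v carries the singular value, so the outer product u v^T is the approximation.
    u = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return [[u[i] * v[j] for j in range(cols)] for i in range(rows)]


def soft_threshold(x, t):
    """Shrink x toward zero by t; small entries become exactly zero."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def decompose(M, threshold, iters=50):
    """Return (low_rank, sparse, residual) with low_rank + sparse + residual == M."""
    low_rank = rank1_approx(M, iters)
    sparse, residual = [], []
    for row_m, row_l in zip(M, low_rank):
        diff = [m - l for m, l in zip(row_m, row_l)]
        s_row = [soft_threshold(d, threshold) for d in diff]
        sparse.append(s_row)
        residual.append([d - s for d, s in zip(diff, s_row)])
    return low_rank, sparse, residual


# Toy "spectrogram": a repetitive rank-1 background plus one isolated spike.
mixture = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
mixture[0][2] += 5.0
background, voice, leftover = decompose(mixture, threshold=0.5)
```

The residual collects whatever neither of the other two parts explains, which is the role the abstract assigns to the third spectrogram.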

  Download: PDF (HIGH Res) (4.7MB)

  Download: PDF (LOW Res) (478KB)


Engineering Reports

The Subwoofer Room Impulse Response (SUBRIR) Database


The Subwoofer Room Impulse Response (SUBRIR) database introduced in this report is a collection of room impulse responses (RIRs) measured in an empty rectangular domestic listening room using a subwoofer as the sound source. Two subwoofers with different characteristics and two types of omnidirectional microphones were used to measure the RIR at different locations, for a total of 96 measurements. Performing acoustic measurements at very low frequencies (LFs) presents some difficulties, mainly related to LF ambient noise and unavoidable nonlinear distortions of the subwoofer. It is shown that these issues can be addressed and partially solved by means of the Exponential Sine-Sweep (ESS) technique and a careful calibration of the measurement equipment. The purpose of this database is to provide acoustic measurements within the frequency region of modal resonances of such small spaces. A procedure for estimating the reverberation time at very low frequencies is proposed, which uses a cosine-modulated filterbank and an approximation of the RIRs using parametric models in order to reduce problems related to low signal-to-noise ratio (SNR) and to the length of typical band-pass filter responses. The ESS technique was chosen to estimate the RIRs because of its robustness to nonlinear distortions and its capability of providing a higher SNR at LFs. However, not all distortions can be isolated using the ESS technique: impulsive distortions and odd-order harmonic distortions partially overlap with the causal RIR.
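
As a rough illustration of the excitation signal behind the ESS technique, the sketch below generates an exponential sine sweep using the commonly cited closed form, in which the instantaneous frequency rises exponentially from a start to an end frequency. The function names and the sweep parameters (10 Hz to 200 Hz, 2 s, 1 kHz sample rate) are assumptions for illustration and are not taken from the report.

```python
import math


def exponential_sine_sweep(f_start, f_end, duration, sample_rate):
    """Samples of sin(2*pi*f_start*T/R * (exp(R*t/T) - 1)), with R = ln(f_end/f_start)."""
    rate = math.log(f_end / f_start)
    n_samples = int(round(duration * sample_rate))
    scale = 2.0 * math.pi * f_start * duration / rate
    return [
        math.sin(scale * (math.exp(rate * n / (sample_rate * duration)) - 1.0))
        for n in range(n_samples)
    ]


def instantaneous_frequency(f_start, f_end, duration, t):
    """Frequency of the sweep at time t: f_start * (f_end / f_start) ** (t / T)."""
    return f_start * (f_end / f_start) ** (t / duration)


# Hypothetical low-frequency sweep, short enough to inspect by hand.
sweep = exponential_sine_sweep(10.0, 200.0, 2.0, 1000)
```

Because the sweep spends equal time in each octave, more energy goes into the low end than with a linear sweep, which is consistent with the higher SNR at LFs mentioned above.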

  Download: PDF (HIGH Res) (2.5MB)

  Download: PDF (LOW Res) (743KB)


Low-Distortion, Low-Noise Composite Operational Amplifier


In state-of-the-art implementations of instrument devices, a single amplifier stage may be required to provide a THD figure of better than -140 dB for a 5 V RMS, 20 kHz signal in order to support a total instrument dynamic range of 120 dB in an 80 kHz measurement bandwidth. Currently it is not possible to achieve this performance level using available commercial monolithic operational amplifiers in a standard configuration. A proposed design approach achieves this goal. A unity-gain-stable composite operational amplifier is presented that consists of a cascade of two operational amplifiers, an intermediate compensation network, and a frequency-selective feedback network for the second amplifier. This configuration achieves very high open-loop gain (100 dB at 100 kHz) and thus shows exceptionally good distortion characteristics. Furthermore, the noise characteristics of the first operational amplifier are preserved. The open-loop response is designed for conditional stability, such that a very large gain-bandwidth product at signal frequencies can be achieved. A numerical optimization procedure is then introduced to derive the frequency compensation, based on specific stability criteria. Measurement results confirm the predicted high gain-bandwidth product (10 GHz at 100 kHz) and excellent distortion performance (-180 dB). Applications for the new composite operational amplifier include audio frequency distortion measurement equipment.
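
The two open-loop figures quoted above agree with each other, as a short arithmetic sketch shows. It assumes the usual definition of gain-bandwidth product at a given frequency as the linear open-loop gain multiplied by that frequency; the function names are illustrative.

```python
def db_to_linear_gain(gain_db):
    """Convert a voltage gain in dB to a linear ratio (20*log10 convention)."""
    return 10.0 ** (gain_db / 20.0)


def gain_bandwidth_product(open_loop_gain_db, frequency_hz):
    """Gain-bandwidth product in Hz: linear open-loop gain times frequency."""
    return db_to_linear_gain(open_loop_gain_db) * frequency_hz


# 100 dB of open-loop gain at 100 kHz, as stated in the abstract.
gbw_hz = gain_bandwidth_product(100.0, 100e3)   # about 1e10 Hz, i.e. 10 GHz
```

A gain of 100 dB is a factor of 100,000, and 100,000 times 100 kHz gives the 10 GHz reported in the measurement results.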

  Download: PDF (HIGH Res) (767KB)

  Download: PDF (LOW Res) (241KB)


A Matlab-Based Signal Processing Toolbox for the Characterization and Analysis of Musical Vibrato


To assess and manipulate the vibrato in musical sounds, audio engineers either informally listen to the audio or visually inspect waveform envelopes or spectrographic representations. Unfortunately, detailed descriptions of the amplitude and frequency trajectories of harmonic partials are difficult to infer from audio spectrograms, which means quantitative information is limited. This paper describes a collection of signal processing methods and a toolbox for extracting and analyzing vibrato-related parameters from solo audio recordings. The Vibrato Analysis Toolbox (VAT) uses a method based on the Hilbert transform to extract the amplitude and frequency variations as feature tracks. A parameterization algorithm then extracts various descriptive parameters including vibrato depth, frequency, spectral centroid, relative amplitude-frequency modulation phase and time delay, and other relationships based on the vibrato tracks. Together, these parameters provide a quantitative characterization of vibrato. The VAT also provides visualization and resynthesis functions that enable users to interactively explore many musical features. Algorithms are written in the Matlab programming language for easy adaptation, enabling further development by researchers and developers. Applications include music performance pedagogy, musicological studies, music production, and voice analysis.
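
The Hilbert-transform idea mentioned above can be sketched in a few lines. This is not the VAT code (which is written in Matlab and is not shown in the abstract); it is a Python illustration with hypothetical function names, using a naive O(N^2) DFT so that it needs only the standard library. It builds the analytic signal, then reads amplitude from its magnitude and frequency from the phase step between neighbouring samples.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal: keep DC (and Nyquist), double positive bins, zero negative bins."""
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            continue                      # DC and Nyquist bins stay unchanged
        if k < (n + 1) // 2:
            spectrum[k] *= 2              # positive frequencies
        else:
            spectrum[k] = 0               # negative frequencies
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def amplitude_and_frequency(x, sample_rate):
    """Return (amplitude track, instantaneous-frequency track in Hz)."""
    z = analytic_signal(x)
    amplitude = [abs(v) for v in z]
    # The phase of z[t+1] * conj(z[t]) is the phase step, so no unwrapping is needed.
    frequency = [
        cmath.phase(z[t + 1] * z[t].conjugate()) * sample_rate / (2.0 * math.pi)
        for t in range(len(z) - 1)
    ]
    return amplitude, frequency


# Steady 8 Hz test tone sampled at 64 Hz: amplitude stays near 1, frequency near 8 Hz.
fs = 64
tone = [math.cos(2.0 * math.pi * 8.0 * t / fs) for t in range(64)]
amp_track, freq_track = amplitude_and_frequency(tone, fs)
```

For a note with vibrato, the same two tracks would oscillate over time, and parameters such as vibrato rate and depth could then be read from those oscillations.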

  Download: PDF (HIGH Res) (9.0MB)

  Download: PDF (LOW Res) (632KB)



Real-World Perception: Life Beyond the Listening Room


In perceptual experiments there is an increasing emphasis on listening situations that represent real-world contexts such as mobile and in-car listening, where background noise can be high. The effect of background noise on listeners’ mix preferences is not easy to predict, but the advent of object-based systems may make it possible to adjust balance at the replay end of the chain. Distortion in car audio systems may only become a big problem for sound quality when the system is pushed to its limits; otherwise, other factors are probably more important. In-ear headphones may be evaluated using simulation methods without undue disadvantages, provided that leakage is controlled and accounted for. Time may also be saved in certain types of listening tests, as long as care is taken with stimuli and listeners are reasonably diligent. Finally, the attribute of “punch” may be predicted reasonably successfully using a new model.

  Download: PDF (262KB)


2017 Audio Forensics Conference Preliminary Program, Arlington

Page: 432

Download: PDF (106KB)

2017 Semantic Audio Conference Preliminary Program, Erlangen

Page: 436

Download: PDF (131KB)


Section News

Page: 428

Download: PDF (223KB)

Book Reviews

Page: 441

Download: PDF (180KB)


Page: 442

Download: PDF (175KB)

AES Conventions and Conferences

Page: 444

Download: PDF (146KB)


Table of Contents

Download: PDF (38KB)

Cover & Sustaining Members List

Download: PDF (77KB)

AES Officers, Committees, Offices & Journal Staff

Download: PDF (79KB)
