The Journal of the Audio Engineering Society — the official publication of the AES — is the only peer-reviewed journal devoted exclusively to audio technology. Published 10 times each year, it is available to all AES members and subscribers.
The Journal contains state-of-the-art technical papers and engineering reports; feature articles covering timely topics; reports before and after AES conventions and other society activities; news from AES sections around the world; Standards and Education Committee work; membership news, patents, new products, and newsworthy developments in the field of audio.
Authors:Ben Ali, Faten; Djaziri-Larbi, Sonia; Girin, Laurent
Affiliation:University of Tunis El Manar, National Engineering School of Tunis, Signal and Systems Lab, Tunis, Tunisia; GIPSA Lab, University Grenoble Alpes, France, and INRIA Grenoble Rhone-Alpes, France
In speech/music coders and analysis/synthesis systems, spectral modeling is generally performed on a short-term (ST) frame-by-frame basis, which is justified by the fact that the signal is only locally (quasi-) stationary. The vocal tract configuration moves slowly and smoothly, thereby resulting in a high correlation between the spectral parameters of successive frames. This correlation property is exploited in long-term modeling of the ST parameters, which, however, results in longer modeling/coding delays. The short delay constraint can be relaxed in many applications, such as text-to-speech modification/synthesis, telephony surveillance data, digital answering machines, electronic voicemail, digital voice logging, electronic toys, and video games. The long-term harmonic plus noise model (LT-HNM) for speech shows additional data compression possibilities since it exploits the smooth evolution of the time trajectories of the short-term harmonic plus noise model parameters by applying a discrete cosine model (DCM). In this paper, the authors extend the LT-HNM to a complete low bit-rate speech coder that is based on a long-term approach (ca. 200 ms). The proposed LT-HNM coder reaches a bit rate of 2.7 kbps for wideband speech.
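The idea of compressing a smooth parameter trajectory with a cosine model can be illustrated with a minimal sketch. It assumes a truncated orthonormal DCT-II as a stand-in for the DCM (the paper's actual fitting procedure is not described in the abstract), and the F0 trajectory, the frame count, and the function names are all hypothetical.

```python
import math

def dct_ii(x):
    """Orthonormal DCT-II of a sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct_ii(coeffs, n):
    """Rebuild n samples from the first len(coeffs) orthonormal DCT-II coefficients."""
    out = []
    for i in range(n):
        s = 0.0
        for k, ck in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            s += scale * ck * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(s)
    return out

def compress_trajectory(traj, order):
    """Keep only the `order` lowest cosine coefficients of a trajectory."""
    coeffs = dct_ii(traj)[:order]
    return coeffs, idct_ii(coeffs, len(traj))

def squared_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

# Hypothetical F0 trajectory (Hz) over 20 short-term frames: a smooth arch.
f0 = [120.0 + 15.0 * math.sin(math.pi * i / 19) for i in range(20)]

coeffs4, approx4 = compress_trajectory(f0, 4)    # 4 numbers instead of 20
coeffs2, approx2 = compress_trajectory(f0, 2)
coeffs20, approx20 = compress_trajectory(f0, 20)  # full order: lossless
```

Because the trajectory is smooth, most of its energy sits in the lowest coefficients, so a handful of values per long-term segment can replace one value per frame. The price is that the whole segment must be available before coding, which is the delay trade-off the abstract mentions.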
Authors:Mo, Ronald; Choi, Ga Lam; Lee, Chung; Horner, Andrew
Affiliation:Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong; United Overseas Bank, Singapore
Musical instrument sounds have distinct timbral and emotional characteristics that can change when audio processing is applied. This paper investigates the effects of MP3 compression on the emotional characteristics of eight sustained instrument sounds using listening tests. The experimental paradigm involved a pairwise comparison of compressed and uncompressed samples at several bit rates over ten emotional categories. The results showed that MP3 compression strengthened neutral and negative emotional characteristics such as Mysterious, Shy, Scary, and Sad, and weakened positive emotional characteristics such as Happy, Heroic, Romantic, Comic, and Calm. Angry was relatively unaffected by MP3 compression, probably because the background “growl” artifacts added by MP3 compression decreased positive emotional characteristics and increased negative characteristics such as Mysterious and Scary. Compression affected some instruments more than others; the trumpet was the most affected and the horn the least.
Authors:Lecomte, Pierre; Gauthier, Philippe-Aubert; Langrenne, Christophe; Berry, Alain; Garcia, Alexandre
Affiliation:Conservatoire National des Arts et Métiers, Paris, France; Groupe d’Acoustique de l’Université de Sherbrooke, Université de Sherbrooke, Québec, Canada; Centre for Interdisciplinary Research in Music, Media and Technology, McGill University, Montréal, Québec, Canada
Physical reconstruction or synthesis of three-dimensional sound fields can be implemented with Near Field Compensated Higher Order Ambisonics. This paper investigates the use of a fifty-node Lebedev grid, which is derived from rotationally invariant quadrature rules. Special attention is paid to spatial aliasing artifacts at the capture and reproduction steps. A comparison of the fifty-node Lebedev grid with a Fliege grid and a t-design grid, both of which use almost the same number of nodes, shows that the Lebedev grid provides the best performance in terms of sound field capture and reproduction. Multiband multiorder decoders are also presented. These decoders take advantage of the nested subgrids inherent in the rotationally invariant quadrature approach. The importance of orthonormality of the spherical harmonics is highlighted in the context of physical encoding or reconstruction of a sound field with the Ambisonics approach. Simulation results are provided for the case of a three-band decoder using the three grids contained in the Lebedev grid. It was found that a multifrequency sound field can be reproduced accurately in the sweet spot by combining a low-order decoder for low frequencies with a higher-order decoder for higher frequencies.
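The role of quadrature in keeping the sampled spherical harmonics orthonormal can be checked numerically on a small case. The sketch below does not use the paper's fifty-node grid. It assumes the smallest Lebedev rule (the six octahedron vertices with equal weights) and real spherical harmonics up to order 1, and all names are illustrative.

```python
import math

# Smallest Lebedev rule: the six octahedron vertices, equal weights summing to 1.
NODES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
WEIGHTS = [1.0 / 6.0] * 6

def real_sh_order1(x, y, z):
    """Orthonormal real spherical harmonics up to order 1 at a unit vector."""
    c0 = 1.0 / math.sqrt(4.0 * math.pi)
    c1 = math.sqrt(3.0 / (4.0 * math.pi))
    return [c0, c1 * y, c1 * z, c1 * x]

def gram_matrix():
    """Quadrature estimate of the integrals of Y_i * Y_j over the sphere."""
    n = 4
    g = [[0.0] * n for _ in range(n)]
    for (x, y, z), w in zip(NODES, WEIGHTS):
        v = real_sh_order1(x, y, z)
        for i in range(n):
            for j in range(n):
                # Weights sum to 1, so scale by the sphere's area 4*pi.
                g[i][j] += 4.0 * math.pi * w * v[i] * v[j]
    return g

G = gram_matrix()
for row in G:
    print(["%.3f" % value for value in row])
```

The Gram matrix comes out as the identity because this rule integrates the degree-2 products of order-1 harmonics exactly. When that holds, a decoder can be formed from the weighted transpose of the sampled harmonics rather than from a matrix inverse.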
Authors:Bomhardt, Ramona; Lins, Marcia; Fels, Janina
Affiliation:RWTH Aachen University, Institute of Technical Acoustics, Medical Acoustics Group, Aachen, Germany
To enhance the localization performance of listeners using head-related impulse response (HRIR) datasets from dummy heads, individualization was added. An ellipsoidal model is used to adapt the Interaural Time Difference (ITD) of the dataset to individual subjects by using their anthropometric data. Head measurements from 23 subjects were used to validate the model. The ITD model is based on an ellipsoid shape and the analytical solution of the sound transmission around a sphere. A comparison of the measured and adapted ITDs shows that the average absolute error was 25 ± 9 µs, which is below the just-noticeable difference. However, the ellipsoidal model underestimates the ITD. In contrast to similar approaches, this model calculates both azimuth- and elevation-dependent ITDs. Since only the ITD of a given head-related transfer function (HRTF) dataset is individualized, the shoulder reflections and the ear offset are maintained.
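For orientation, the simplest geometric ITD model can be written in a few lines. This sketch is not the paper's ellipsoidal model. It assumes Woodworth's classical rigid-sphere formula, which covers azimuth only, together with a hypothetical radius-ratio scaling of a dummy-head ITD. The speed of sound and the head radii are assumed values.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def woodworth_itd(head_radius_m, azimuth_rad):
    """ITD in seconds for a rigid sphere and a far-field source, |azimuth| <= pi/2."""
    return (head_radius_m / SPEED_OF_SOUND) * (azimuth_rad + math.sin(azimuth_rad))

def individualized_itd(dataset_itd_s, dummy_radius_m, subject_radius_m):
    """Hypothetical adaptation: scale a dataset ITD by the head-radius ratio."""
    return dataset_itd_s * subject_radius_m / dummy_radius_m

dummy_radius = 0.0875    # m, assumed dummy-head radius
subject_radius = 0.0800  # m, assumed value derived from anthropometric data

lateral = woodworth_itd(dummy_radius, math.pi / 2)
adapted = individualized_itd(lateral, dummy_radius, subject_radius)
print("dummy head: %.0f us, adapted: %.0f us" % (lateral * 1e6, adapted * 1e6))
```

Since the spherical formula is linear in the radius, scaling the ITD by the radius ratio gives the same result as recomputing it for the subject's head. An ellipsoidal model, as in the abstract, additionally lets the ITD vary with elevation.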
Authors:Kanai, Sekitoshi; Sugaya, Maho; Adachi, Shuichi; Matsui, Kentaro
Affiliation:Keio University, Yokohama, Japan; NHK Science & Technology Research Laboratories, Tokyo, Japan
This research explores a fast measurement method for computing a head-related transfer function (HRTF). The method uses a multidirectional intermediate directional transfer function (IDTF) with a multiple-input single-output structure. An efficient procedure is then used to calculate the model parameters. Experiments showed that a simultaneous estimation method made it possible to estimate IDTFs on the horizontal plane as accurately as those measured one by one in the frequency range from 375 to 19,875 Hz. Even though the IDTFs in the directions contralateral to each ear are difficult to estimate because of low signal-to-noise ratio, the estimated IDTFs preserved the spectral cues. The effectiveness of the proposed method was verified through a simultaneous estimation experiment on a set of IDTFs of 24 directions measured using a dummy head. In this experiment, the intermediate directional impulse responses were approximated by the 128-order FIR models. Through the experiments, it was confirmed that the average spectral distortion between the simultaneously estimated IDTFs and IDTFs measured one direction at a time was less than 1 dB in the frequency range from 375 to 19,875 Hz. The method can also be applied to room impulse responses.
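The multiple-input single-output idea can be illustrated on a toy problem: several sources play at once, one microphone records the sum, and all FIR responses are estimated jointly. The sketch below assumes ordinary least squares via the normal equations as a stand-in for the paper's efficient procedure, which the abstract does not specify. It uses two inputs and 3-tap filters instead of 24 directions and 128-order models, and the signals and names are hypothetical.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fir(x, h):
    """Causal FIR filtering of x by taps h with zero initial conditions."""
    return [sum(h[k] * x[t - k] for k in range(len(h)) if t - k >= 0)
            for t in range(len(x))]

def estimate_miso_fir(inputs, y, order):
    """Jointly estimate one FIR response per input from a single output."""
    rows = []
    for t in range(len(y)):
        row = []
        for x in inputs:
            # Regressors for this input: x[t], x[t-1], ..., x[t-order+1].
            row.extend(x[t - k] if t - k >= 0 else 0.0 for k in range(order))
        rows.append(row)
    p = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    aty = [sum(r[i] * yt for r, yt in zip(rows, y)) for i in range(p)]
    theta = solve(ata, aty)
    return [theta[i * order:(i + 1) * order] for i in range(len(inputs))]

# Hypothetical test signals: two uncorrelated noise sources played together.
rng = random.Random(0)
x1 = [rng.uniform(-1, 1) for _ in range(200)]
x2 = [rng.uniform(-1, 1) for _ in range(200)]
h1_true = [1.0, 0.5, -0.25]
h2_true = [0.3, -0.6, 0.1]
y = [a + b for a, b in zip(fir(x1, h1_true), fir(x2, h2_true))]

h1_est, h2_est = estimate_miso_fir([x1, x2], y, order=3)
```

The responses can be separated only if the excitation signals are sufficiently uncorrelated with one another. With added measurement noise the estimates would no longer be exact, which is consistent with the abstract's remark about low signal-to-noise ratio on the contralateral side.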
Authors:Olivieri, Ferdinando; Fazi, Filippo Maria; Nelson, Philip A.; Fontana, Simone
Affiliation:Institute of Sound and Vibration Research, University of Southampton, Southampton, Hampshire, United Kingdom; Huawei European Research Center, Munich, Germany
Compact loudspeaker arrays can be driven so that the radiation patterns produce zones of private sound and zones of silence. This research compares the performance of two strategies, both based on the Pressure Matching Method (PMM), for accurate reproduction of a target signal. The first strategy is the Weighted PMM (WPMM) with low values for the weight of the reproduction error in the zone where accurate reproduction is not targeted. The second strategy is the Linearly-Constrained PMM (LCPMM), wherein a performance constraint on the accuracy of the target signal in a given zone is added to the cost function for the calculation of the input signals. Performance of the two methods was evaluated using numerical simulations of monopoles in free field and a linear array prototype with measured transfer functions in an anechoic environment. The two strategies were evaluated using a target signal with large amplitude variations between the so-called acoustically bright and dark zones. Results show that input signals designed with the WPMM provide better trade-offs between accuracy of the target field reproduction in the bright zone and directivity performance compared to that of the LCPMM.
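A common textbook way to write the two problems is given below as a sketch. The notation is assumed rather than taken from the paper: $\mathbf{G}$ is the matrix of transfer functions from the loudspeakers to the control points, $\mathbf{p}$ the target pressures, $\mathbf{q}$ the loudspeaker input signals, $\mathbf{W}$ a diagonal matrix of non-negative weights, $\beta$ a regularization parameter, and the subscripts $B$ and $D$ denote the bright and dark zones. The paper's exact cost functions and its choice of constrained zone may differ.

```latex
% Weighted pressure matching: small entries of W de-emphasize the zone
% in which accurate reproduction is not targeted.
J_{\mathrm{W}}(\mathbf{q})
  = (\mathbf{G}\mathbf{q}-\mathbf{p})^{H}\,\mathbf{W}\,(\mathbf{G}\mathbf{q}-\mathbf{p})
  + \beta\,\mathbf{q}^{H}\mathbf{q},
\qquad
\mathbf{q}_{\mathrm{W}}
  = \left(\mathbf{G}^{H}\mathbf{W}\mathbf{G}+\beta\mathbf{I}\right)^{-1}
    \mathbf{G}^{H}\mathbf{W}\,\mathbf{p}.

% Linearly constrained pressure matching (one possible form): match the
% target exactly in one zone while minimizing the error in the other.
\min_{\mathbf{q}}\;
  \lVert \mathbf{G}_{D}\mathbf{q}-\mathbf{p}_{D} \rVert^{2}
  + \beta\,\lVert \mathbf{q} \rVert^{2}
\quad \text{subject to} \quad
  \mathbf{G}_{B}\mathbf{q} = \mathbf{p}_{B}.
```

In the weighted form, the trade-off between bright-zone accuracy and dark-zone suppression is tuned continuously through $\mathbf{W}$. In the constrained form, accuracy in the constrained zone is fixed and the remaining degrees of freedom serve the other zone.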
Authors:Chau, Chuck-jee; Mo, Ronald; Horner, Andrew
Affiliation:The Hong Kong University of Science and Technology, Hong Kong
Previous research has shown that both sustained and nonsustained musical instrument sounds have strong emotional characteristics. This report explores how the effects of pitch and dynamics influence the emotional characteristics of isolated one-second piano sounds. Listeners compared the sounds pairwise over ten emotion categories. The results showed that all ten emotional categories were significantly affected by pitch and nine of them by dynamics. In particular, the emotional characteristics Happy, Romantic, Comic, Calm, Mysterious, and Shy generally increased with pitch, but sometimes decreased at the highest pitches. The characteristics Heroic, Angry, and Sad generally decreased with pitch. Scary was strong in the extreme low and high registers. With regard to dynamics, the results showed that the characteristics Heroic, Comic, Angry, and Scary were stronger for loud notes, while Romantic, Calm, Mysterious, Shy, and Sad were stronger for soft notes. Surprisingly, Happy was not affected by dynamics. These results help quantify the emotional characteristics of piano sounds.
Headphones are almost certainly now the dominant means by which many people listen to reproduced sound. Still, the quality of the devices used is often remarkably low, and there is a very wide range of frequency responses represented. The need for a new “preferred” target curve has been suggested. Loudspeaker reproduction can now be simulated on headphones in such a way that timbral coloration is minimized. Virtual room simulation, head tracking, and personalization could make this even more successful. There is continued debate about whether consumer preference trumps objective accuracy.