The Journal of the Audio Engineering Society — the official publication of the AES — is the only peer-reviewed journal devoted exclusively to audio technology. Published 10 times each year, it is available to all AES members and subscribers.
The Journal contains state-of-the-art review papers, technical papers, and engineering reports, as well as standards committee work, convention and conference announcements, membership news, and book reviews.
Authors: Cecchi, Stefania; Evangelista, Gianpaolo; Germain, François G.; Kondo, Katsunobu
Authors: Cecchi, Stefania; Bruschi, Valeria; Nobili, Stefano; Terenzi, Alessandro; Välimäki, Vesa
Affiliation: Università Politecnica delle Marche, Ancona, Italy; Università Politecnica delle Marche, Ancona, Italy; Università Politecnica delle Marche, Ancona, Italy; Università Politecnica delle Marche, Ancona, Italy; Acoustics Lab, Department of Information and Communications Engineering, Aalto University, Espoo, Finland
Crossover networks for multi-way loudspeaker systems and audio processing are reviewed, including both analog and digital designs. A high-quality crossover network must maintain a flat overall magnitude response, within small tolerances, and a sufficiently linear phase response. Simultaneously, the crossover filters for each band must provide a steep transition to properly separate the bands, also accounting for the frequency ranges of the drivers. Furthermore, crossover filters affect the polar response of the loudspeaker, which should vary smoothly and symmetrically in the listening window. The crossover filters should additionally be economical to implement and not cause much latency. Perceptual aspects and the inclusion of equalization in the crossover network are discussed. Various applications of crossover filters in audio engineering are explained, such as in multiband compressors and in effects processing. Several methods are compared in terms of the basic requirements and computational cost. The results lead to the recommendation of an all-pass-filter-based Linkwitz-Riley crossover network when a computationally efficient minimum-phase solution is desired. When a linear-phase crossover network is selected, the throughput delay becomes larger than with minimum-phase filters. Digital linear-phase crossover filters having a finite impulse response may be designed by optimization and implemented efficiently using a complementary structure.
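The flat-magnitude requirement mentioned in the abstract can be checked numerically on a textbook case. The following is a minimal sketch, not the article's implementation: it assumes an analog fourth-order Linkwitz-Riley (LR4) prototype built from two cascaded second-order Butterworth sections, and the function names and the 2 kHz crossover frequency are illustrative choices. The low-pass and high-pass outputs each sit at about -6 dB at the crossover frequency, and their sum behaves as an all-pass response, so its level stays at 0 dB while the phase is not linear.

```python
import math


def lr4_responses(freq_hz, crossover_hz):
    """Return (lowpass, highpass) complex responses of an analog LR4 crossover."""
    s = 1j * (freq_hz / crossover_hz)  # normalized Laplace variable, s = j*f/f_c
    butter2 = s * s + math.sqrt(2.0) * s + 1.0  # 2nd-order Butterworth denominator
    denom = butter2 * butter2  # LR4 = two cascaded 2nd-order Butterworth sections
    return 1.0 / denom, s ** 4 / denom


def to_db(value):
    """Convert a (possibly complex) gain to a level in dB."""
    return 20.0 * math.log10(abs(value))


def summed_level_db(freq_hz, crossover_hz):
    """Level in dB of the recombined low-pass and high-pass bands."""
    low, high = lr4_responses(freq_hz, crossover_hz)
    return to_db(low + high)


if __name__ == "__main__":
    fc = 2000.0  # hypothetical crossover frequency, chosen for illustration
    for f in (100.0, 1000.0, 2000.0, 4000.0, 20000.0):
        low, high = lr4_responses(f, fc)
        print(f"{f:7.0f} Hz  low {to_db(low):8.2f} dB  "
              f"high {to_db(high):8.2f} dB  sum {summed_level_db(f, fc):6.2f} dB")
```

A digital design would need a discretization step (for example a bilinear transform), which this sketch leaves out.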
Authors: Renault, Lenny; Mignot, Rémi; Roebel, Axel
Affiliation: STMS - UMR9912, IRCAM, Sorbonne Université, CNRS, Ministère de la Culture, Paris, France; STMS - UMR9912, IRCAM, Sorbonne Université, CNRS, Ministère de la Culture, Paris, France; STMS - UMR9912, IRCAM, Sorbonne Université, CNRS, Ministère de la Culture, Paris, France
Instrument sound synthesis using deep neural networks has seen numerous improvements over the last couple of years. Among them, the Differentiable Digital Signal Processing (DDSP) framework has modernized the spectral modeling paradigm by including signal-based synthesizers and effects into fully differentiable architectures. The present work extends the applications of DDSP to the task of polyphonic sound synthesis, with the proposal of a differentiable piano synthesizer conditioned on MIDI inputs. The model architecture is motivated by high-level acoustic modeling knowledge of the instrument, which, along with the sound structure priors inherent to the DDSP components, makes for a lightweight, interpretable, and realistic-sounding piano model. A subjective listening test revealed that the proposed approach achieves better sound quality than a state-of-the-art neural-based piano synthesizer, but physical-modeling-based models still achieve the best quality. Leveraging its interpretability and modularity, a qualitative analysis of the model behavior was also conducted: it highlights where additional modeling knowledge and optimization procedures could be inserted in order to improve the synthesis quality and the manipulation of sound properties. Ultimately, the proposed differentiable synthesizer can be further used with other deep learning models for alternative musical tasks handling polyphonic audio and symbolic data.
Authors: Mannall, Joshua; Savioja, Lauri; Calamia, Paul; Mason, Russell; De Sena, Enzo
Affiliation: Department of Music and Media, University of Surrey, Guildford, UK; Aalto University, Department of Computer Science, Espoo, Finland; Reality Labs Research at Meta, Redmond, WA, USA; Department of Music and Media, University of Surrey, Guildford, UK; Department of Music and Media, University of Surrey, Guildford, UK
Creating plausible geometric acoustic simulations in complex scenes requires the inclusion of diffraction modeling. Current real-time diffraction implementations use the Uniform Theory of Diffraction, which assumes all edges are infinitely long. The authors utilize recent advances in machine learning to create an efficient infinite impulse response model trained on data generated using the physically accurate Biot-Tolstoy-Medwin model. The authors propose an approach to data generation that allows their model to be applied to higher-order diffraction. They show that their model is able to approximate the Biot-Tolstoy-Medwin model with a mean absolute level difference of 1.0 dB for first-order diffraction while maintaining a higher computational efficiency than the current state of the art using the Uniform Theory of Diffraction.
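The accuracy figure quoted above is a mean absolute level difference in dB. The sketch below shows one plausible way to compute such a metric from paired magnitude samples. It is an assumption-laden illustration: the abstract does not give the exact definition (frequency sampling, weighting, or averaging over geometries), and the function names and sample values are made up.

```python
import math


def level_db(magnitude):
    """Convert a linear magnitude to a level in dB."""
    return 20.0 * math.log10(magnitude)


def mean_absolute_level_difference(reference, approximation):
    """Average |level difference| in dB over paired magnitude samples."""
    if len(reference) != len(approximation) or not reference:
        raise ValueError("inputs must be non-empty and equally long")
    diffs = [abs(level_db(a) - level_db(r))
             for r, a in zip(reference, approximation)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Made-up magnitudes for illustration only (not data from the paper).
    reference = [1.0, 0.5, 0.25, 0.125]
    approximation = [1.0, 0.55, 0.22, 0.13]
    print(f"{mean_absolute_level_difference(reference, approximation):.2f} dB")
```

Because the absolute value is taken per sample, positive and negative deviations do not cancel each other out.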
Authors: Vahidi, Cyrus; Han, Han; Wang, Changhong; Lagrange, Mathieu; Fazekas, György; Lostanlen, Vincent
Affiliation: Centre for Digital Music, Queen Mary University of London, London, UK; Nantes Université, École Centrale Nantes, Centre National de la Recherche Scientifique (CNRS), Laboratoire des Sciences du Numérique de Nantes (LS2N), UMR 6004, F-44000 Nantes, France; Nantes Université, École Centrale Nantes, Centre National de la Recherche Scientifique (CNRS), Laboratoire des Sciences du Numérique de Nantes (LS2N), UMR 6004, F-44000 Nantes, France; Nantes Université, École Centrale Nantes, Centre National de la Recherche Scientifique (CNRS), Laboratoire des Sciences du Numérique de Nantes (LS2N), UMR 6004, F-44000 Nantes, France; Centre for Digital Music, Queen Mary University of London, London, UK; Nantes Université, École Centrale Nantes, Centre National de la Recherche Scientifique (CNRS), Laboratoire des Sciences du Numérique de Nantes (LS2N), UMR 6004, F-44000 Nantes, France
Computer musicians refer to mesostructures as the intermediate levels of articulation between the microstructure of waveshapes and the macrostructure of musical forms. Examples of mesostructures include melody, arpeggios, syncopation, polyphonic grouping, and textural contrast. Despite their central role in musical expression, they have received limited attention in recent applications of deep learning to the analysis and synthesis of musical audio. Currently, autoencoders and neural audio synthesizers are only trained and evaluated at the scale of microstructure, i.e., local amplitude variations up to 100 ms or so. In this paper, the authors formulate and address the problem of mesostructural audio modeling via a composition of a differentiable arpeggiator and time-frequency scattering. The authors empirically demonstrate that time-frequency scattering serves as a differentiable model of similarity between synthesis parameters that govern mesostructure. By exposing the sensitivity of short-time spectral distances to time alignment, the authors motivate the need for a time-invariant and multiscale differentiable time-frequency model of similarity at the level of both local spectra and spectrotemporal modulations.
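The sensitivity of short-time spectral distances to time alignment can be illustrated with a toy experiment. The sketch below is not the authors' arpeggiator or time-frequency scattering: it assumes a hypothetical four-note pattern of sinusoids centred on DFT bins, a non-overlapping rectangular-window spectrogram, and plain time averaging as a crude stand-in for a time-invariant representation. Shifting the pattern by one note leaves the musical content unchanged, yet the frame-by-frame spectrogram distance is large while the time-averaged distance is essentially zero.

```python
import cmath
import math

FRAME = 64  # samples per note and per analysis frame (illustrative choice)


def dft_magnitudes(frame):
    """Magnitudes of the non-negative-frequency DFT bins of one frame."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def arpeggio(bins, frame=FRAME):
    """Concatenate one frame-long sinusoid per entry of `bins`."""
    return [math.sin(2 * math.pi * b * t / frame)
            for b in bins for t in range(frame)]


def spectrogram(signal, frame=FRAME):
    """Non-overlapping rectangular-window magnitude spectrogram."""
    return [dft_magnitudes(signal[i:i + frame])
            for i in range(0, len(signal), frame)]


def framewise_distance(spec_a, spec_b):
    """Euclidean distance between spectrograms, compared frame by frame."""
    return math.sqrt(sum((a - b) ** 2
                         for fa, fb in zip(spec_a, spec_b)
                         for a, b in zip(fa, fb)))


def time_averaged(spec):
    """Average each frequency bin over time (discards temporal order)."""
    return [sum(column) / len(spec) for column in zip(*spec)]


def averaged_distance(spec_a, spec_b):
    """Euclidean distance between time-averaged spectra."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(time_averaged(spec_a),
                                         time_averaged(spec_b))))


if __name__ == "__main__":
    pattern = [4, 8, 12, 16] * 2           # hypothetical arpeggio (DFT bins)
    shifted = pattern[1:] + pattern[:1]    # same arpeggio, delayed by one note
    spec_a = spectrogram(arpeggio(pattern))
    spec_b = spectrogram(arpeggio(shifted))
    print("frame-wise distance:   ", framewise_distance(spec_a, spec_b))
    print("time-averaged distance:", averaged_distance(spec_a, spec_b))
```

Time averaging achieves invariance only by discarding the order of the notes, which is the kind of limitation that motivates a multiscale model sensitive to spectrotemporal modulations.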
Authors: Colone, Joseph T; Reiss, Joshua
Affiliation: Centre for Digital Music, Queen Mary University of London, London, UK; Centre for Digital Music, Queen Mary University of London, London, UK
In the field of intelligent audio production, neural networks have been trained to automatically mix a multitrack to a stereo mixdown. Although these algorithms contain latent models of mix engineering, there is still a lack of approaches that explicitly model the decisions a mix engineer makes while mixing. In this work, a method to retrieve the parameters used to create a multitrack mix using only raw tracks and the stereo mixdown is presented. This method is able to model a multitrack mix using gain, panning, equalization, dynamic range compression, distortion, delay, and reverb with the aid of greybox differentiable digital signal processing modules. This method allows for a fully interpretable representation of the mixing signal chain by explicitly modeling the audio effects one may expect in a typical engineer's mixing chain. The modeling capacities of several different mixing chains are measured using both objective and subjective measures on a dataset of student mixes. Results show that the full signal chain performs best on objective measures and that there is no statistically significant difference between the participants' perception of the full mixing chain and reference mixes.
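The idea of retrieving mix parameters from the raw tracks and the mixdown can be sketched for the simplest possible case. The example below is a toy under strong assumptions and not the authors' system: it models only a per-track gain on a mono mix, uses two synthetic sinusoidal tracks, and recovers the gains by plain gradient descent on the mean squared error, which loosely mirrors how differentiable modules are optimized. All names, gain values, and hyperparameters are hypothetical.

```python
import math


def make_tracks(n=256):
    """Two synthetic 'raw tracks': sinusoids with whole numbers of periods."""
    track_a = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
    track_b = [math.sin(2 * math.pi * 13 * t / n) for t in range(n)]
    return [track_a, track_b]


def mix(tracks, gains):
    """Mono mixdown: gain-weighted sum of the tracks."""
    return [sum(g * track[t] for g, track in zip(gains, tracks))
            for t in range(len(tracks[0]))]


def estimate_gains(tracks, target, steps=200, learning_rate=0.5):
    """Recover per-track gains by gradient descent on the mean squared error."""
    gains = [0.0] * len(tracks)
    n = len(target)
    for _ in range(steps):
        error = [e - y for e, y in zip(mix(tracks, gains), target)]
        # d(MSE)/d(gain_i) = 2 * mean(error * track_i)
        grads = [2.0 * sum(e * x for e, x in zip(error, track)) / n
                 for track in tracks]
        gains = [g - learning_rate * gr for g, gr in zip(gains, grads)]
    return gains


if __name__ == "__main__":
    tracks = make_tracks()
    reference_mix = mix(tracks, [0.8, 0.3])  # hypothetical engineer's gains
    print(estimate_gains(tracks, reference_mix))
```

For gains alone the problem is linear and could be solved in closed form by least squares. Gradient descent is used here only because nonlinear effects such as compression or distortion would require an iterative, differentiable formulation.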
Authors: Marchand, Sylvain; Meaux, Eric
Affiliation: L3i, University of La Rochelle, La Rochelle, France; L3i, University of La Rochelle, La Rochelle, France
The Synthetic Transaural Audio Rendering (STAR) method was recently published in the Journal of the Audio Engineering Society. That article proposed a perceptual method for sound spatialization, based on reproducing acoustic cues derived from models, together with tests for its validation. In that article, the authors focused on azimuth and gave only hints for extensions to distance and elevation. Since then, the implementation and testing of these extensions have been carried out, and this article aims to complete the STAR method. For distance, the authors simulate physical phenomena, whereas for elevation they propose to reproduce monaural cues by shaping the Head-Related Transfer Functions with peaks and notches controlled by models, in order to give the listener the sensation of elevation. The extensions to distance and elevation have been validated by subjective listening tests. The independence of these two parameters is also demonstrated. For the azimuth, there is a robust localization method giving objective results consistent with human hearing. Thanks to this method, the independence of azimuth and distance or elevation is also demonstrated. Finally, there is now a full 3D system for sound spatialization, managing each parameter of each sound source position (azimuth, elevation, and distance) independently.