AES New York 2019
Poster Session P6
P6 - Posters: Audio Signal Processing
Wednesday, October 16, 3:00 pm — 4:30 pm
P6-1 Modal Representations for Audio Deep Learning—Travis Skare, Stanford University - Stanford, CA, USA; Jonathan S. Abel, Stanford University - Stanford, CA, USA; Julius O. Smith, III, Stanford University - Stanford, CA, USA
Deep learning models for both discriminative and generative tasks have a choice of domain representation. For audio, candidates are often raw waveform data, spectral data, transformed spectral data, or perceptual features. For deep learning tasks related to modal synthesizers or processors, we propose new, modal representations for data. We experiment with representations such as an N-hot binary vector of frequencies, or learning a set of modal filterbank coefficients directly. We use these representations discriminatively, classifying cymbal models from samples, as well as generatively. An intentionally naive application of a basic modal representation to a CVAE designed for MNIST digit images quickly yielded results, which we found surprising given the limited prior success of traditional representations such as spectrogram images. We discuss applications for Generative Adversarial Networks, toward creating a modal reverberator generator.
Convention Paper 10248
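The N-hot binary frequency vector mentioned in the abstract can be pictured with a short sketch. This is an illustrative assumption, not the paper's actual encoding: each active mode sets one bin of a binary vector spanning 0 to the Nyquist frequency, and the bin count, sample rate, and modal frequencies below are invented for the example.

```python
import numpy as np

def n_hot_modes(freqs_hz, n_bins=512, fs=44100):
    """Encode a set of modal frequencies as an N-hot binary vector.

    Each frequency activates the nearest of n_bins linearly spaced
    bins covering 0..fs/2. Sizes here are illustrative only.
    """
    vec = np.zeros(n_bins, dtype=np.float32)
    for f in freqs_hz:
        bin_idx = int(round(f / (fs / 2) * (n_bins - 1)))
        vec[bin_idx] = 1.0
    return vec

# Three cymbal-like partials (hypothetical values) activate three bins.
v = n_hot_modes([420.0, 1310.0, 3875.0])
```

Such a vector is a far sparser target than a spectrogram image, which may help explain why a naive CVAE trains readily on it.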
P6-2 Distortion Modeling of Nonlinear Systems Using Ramped-Sines and Lookup Table—Paul Mayo, University of Maryland - College Park, MD, USA; Wesley Bulla, Belmont University - Nashville, TN, USA
Nonlinear system identification is used to synthesize black-box models of nonlinear audio effects and as such is a widespread topic of interest within the audio industry. Because a variety of implementation algorithms provide a myriad of approaches, questions arise as to whether there are major functional differences between methods and implementations. This paper presents a novel method for the black-box measurement of distortion characteristic curves and an analysis of the popular “lookup table” implementation of nonlinear effects. Pros and cons of the techniques are examined from a signal processing perspective, and the basic limitations and efficiencies of the approaches are discussed.
Convention Paper 10249
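The “lookup table” implementation the abstract refers to is the classic static-waveshaper idiom: a measured characteristic curve is sampled into a table, then applied per-sample by interpolation. The sketch below is a generic illustration of that idiom, not the paper's measured curves; the tanh characteristic and table size are assumptions.

```python
import numpy as np

TABLE_SIZE = 4096
x_axis = np.linspace(-1.0, 1.0, TABLE_SIZE)
table = np.tanh(3.0 * x_axis)  # stand-in distortion characteristic

def waveshape(x):
    """Apply the tabulated nonlinearity via linear interpolation.

    Input is clipped to the table's domain, as a real implementation
    would either clip or extrapolate at the edges.
    """
    return np.interp(np.clip(x, -1.0, 1.0), x_axis, table)

y = waveshape(np.array([0.0, 0.5, -0.5]))
```

In a black-box measurement setting, `table` would instead be filled with the characteristic curve recovered from the device under test, e.g., from a ramped-sine excitation.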
P6-3 An Open Audio Processing Platform Using SoC FPGAs and Model-Based Development—Trevor Vannoy, Montana State University - Bozeman, MT, USA; Flat Earth Inc. - Bozeman, MT, USA; Tyler Davis, Flat Earth Inc. - Bozeman, MT, USA; Connor Dack, Flat Earth Inc. - Bozeman, MT, USA; Dustin Sobrero, Flat Earth Inc. - Bozeman, MT, USA; Ross Snider, Montana State University - Bozeman, MT, USA; Flat Earth Inc. - Bozeman, MT, USA
The development cycle for high-performance audio applications using System-on-Chip (SoC) Field Programmable Gate Arrays (FPGAs) is long and complex. Due to their inherently parallel nature, SoC FPGAs are ideal for low-latency, high-performance signal processing, but these devices require a complex development process. To address these challenges, an open-source audio processing platform based on SoC FPGAs is presented. We deploy a model-based hardware/software co-design methodology that increases productivity and accessibility for non-experts. A modular multi-effects processor was developed and demonstrated on our hardware platform. This demonstration shows how a design can be constructed and provides a framework for developing more complex audio designs that can be used on our platform.
Convention Paper 10250
P6-4 Objective Measurement of Stereophonic Audio Quality in the Directional Loudness Domain—Pablo Delgado, International Audio Laboratories Erlangen - Erlangen, Germany; Fraunhofer Institute for Integrated Circuits IIS - Erlangen, Germany; Jürgen Herre, International Audio Laboratories Erlangen - Erlangen, Germany; Fraunhofer IIS - Erlangen, Germany
Automated audio quality prediction is still considered a challenge for stereo or multichannel signals carrying spatial information. A system that accurately and reliably predicts quality scores obtained by time-consuming listening tests can be of great advantage in saving resources, for instance, in the evaluation of parametric spatial audio codecs. Most of the solutions so far work with individual comparisons of distortions of interchannel cues across time and frequency, known to correlate with distortions in the spatial image evoked in the listener. We propose a scene analysis method that considers signal loudness distributed across estimations of perceived source directions on the horizontal plane. The calculation of distortion features in the directional loudness domain (as opposed to the time-frequency domain) seems to provide equal or better correlation with subjectively perceived quality degradation than previous methods, as confirmed by experiments with an extensive database of parametric audio codec listening tests. We investigate the effect of a number of design alternatives (based on psychoacoustic principles) on the overall prediction performance of the associated quality measurement system.
Convention Paper 10251
P6-5 Detection of the Effect of Window Duration in an Audio Source Separation Paradigm—Ryan Miller, Belmont University - Nashville, TN, USA; Wesley Bulla, Belmont University - Nashville, TN, USA; Eric Tarr, Belmont University - Nashville, TN, USA
Non-negative matrix factorization (NMF) is a commonly used method for audio source separation in applications such as polyphonic music separation and noise removal. Previous research evaluated the use of additional algorithmic components and systems in efforts to improve the effectiveness of NMF. This study examined how the short-time Fourier transform (STFT) window duration used in the algorithm might affect detectable differences in separation performance. An ABX listening test compared speech extracted from two types of noise-contaminated mixtures at different window durations to determine if listeners could discriminate between them. It was found that the window duration had a significant impact on subject performance in both white- and conversation-noise cases, with lower scores for the latter condition.
Convention Paper 10252
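The NMF core of such a separation pipeline is compact enough to sketch. This is a generic illustration of NMF with multiplicative updates on a magnitude spectrogram, not the study's configuration; the matrix dimensions, rank, and iteration count are assumptions, and the STFT window duration under study would determine the shape of `V`.

```python
import numpy as np

# V stands in for a magnitude spectrogram (freq bins x frames); in the
# study, its time-frequency resolution depends on the STFT window.
rng = np.random.default_rng(0)
V = rng.random((64, 100)) + 1e-9

rank = 4
W = rng.random((64, rank))   # spectral basis vectors
H = rng.random((rank, 100))  # per-frame activations

# Multiplicative updates for the Euclidean NMF objective ||V - WH||^2;
# non-negativity is preserved because all factors stay non-negative.
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Source estimates are then obtained by grouping basis vectors per source and masking the mixture STFT, so the window duration affects both the factorization and the resynthesis.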
P6-6 Use of DNN-Based Beamforming Applied to Different Microphone Array Configurations—Tae Woo Kim, Gwangju Institute of Science and Technology (GIST) - Gwangju, South Korea; Nam Kyun Kim, Gwangju Institute of Science and Technology (GIST) - Gwangju, South Korea; Geon Woo Lee, Gwangju Institute of Science and Technology (GIST) - Gwangju, South Korea; Inyoung Park, Gwangju Institute of Science and Technology (GIST) - Gwangju, South Korea; Hong Kook Kim, Gwangju Institute of Science and Technology (GIST) - Gwangju, South Korea
Minimum variance distortionless response (MVDR) beamforming is one of the most popular multichannel signal processing techniques for dereverberation and/or noise reduction. However, the MVDR beamformer has the limitation that it must be designed to depend on the receiver array geometry. This paper presents an experimental setup and results obtained by designing a deep learning-based MVDR beamformer and applying it to different microphone array configurations. Consequently, it is shown that the deep learning-based MVDR beamformer provides more robust performance under mismatched microphone array configurations than the conventional statistical MVDR beamformer.
Convention Paper 10253
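The geometry dependence of the conventional statistical MVDR beamformer is visible in its closed form, w = R⁻¹d / (dᴴR⁻¹d): the steering vector d encodes the array layout. A minimal sketch, with an invented 4-microphone steering vector and noise covariance standing in for estimated quantities:

```python
import numpy as np

def mvdr_weights(R, d):
    """Classical MVDR weights w = R^{-1} d / (d^H R^{-1} d).

    Minimizes output noise power subject to a distortionless response
    (w^H d = 1) toward the steering direction d.
    """
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

M = 4  # microphones (illustrative uniform linear array)
d = np.exp(-1j * np.pi * np.arange(M) * np.sin(0.3))  # steering vector
R = np.eye(M) + 0.1 * np.ones((M, M))  # stand-in noise covariance
w = mvdr_weights(R, d)
```

Because d is tied to the array geometry, a weight set designed for one configuration degrades on another; this is the mismatch the DNN-based beamformer in the paper is shown to tolerate better.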
P6-7 Deep Neural Network Based Guided Speech Bandwidth Extension—Konstantin Schmidt, Friedrich-Alexander-University (FAU) - Erlangen, Germany; International Audio Laboratories Erlangen - Erlangen; Bernd Edler, Friedrich Alexander University - Erlangen-Nürnberg, Germany; Fraunhofer IIS - Erlangen, Germany
To this day, telephone speech is still limited to the range of 200 to 3400 Hz, since the predominant codecs in public switched telephone networks are AMR-NB, G.711, and G.722 [1, 2, 3]. Blind bandwidth extension (blind BWE, BBWE) can improve the perceived quality as well as the intelligibility of coded speech without changing the transmission network or the speech codec. The BBWE used in this work is based on deep neural networks (DNNs) and has already shown good performance. Although this BBWE enhances the speech without producing too many artifacts, it sometimes fails to enhance prominent fricatives, which can result in muffled speech. In order to better synthesize prominent fricatives, the BBWE is extended by sending a single bit of side information—here referred to as guided BWE. This bit may be transmitted, e.g., by watermarking, so that no changes to the transmission network or the speech codec have to be made. Different DNN configurations (including convolutional (Conv.) layers as well as long short-term memory (LSTM) layers) making use of this bit have been evaluated. The BBWE has a low computational complexity and an algorithmic delay of only 12 ms, and can be applied in state-of-the-art speech and audio codecs.
Convention Paper 10254
P6-8 Analysis of the Sound Emitted by Honey Bees in a Beehive—Stefania Cecchi, Università Politecnica delle Marche - Ancona, Italy; Alessandro Terenzi, Università Politecnica delle Marche - Ancona, Italy; Simone Orcioni, Università Politecnica delle Marche - Ancona, Italy; Francesco Piazza, Università Politecnica delle Marche - Ancona (AN), Italy
The increase in honey bee mortality in recent years has drawn great attention to the possibility of intensive beehive monitoring in order to better understand the problems that are seriously affecting honey bee health. It is well known that the sound emitted inside a beehive is one of the key parameters for non-invasive monitoring capable of determining some aspects of the bees’ condition. The proposed work aims to analyze the bees’ sound, introducing feature extraction useful for sound classification techniques and for detecting dangerous situations. Considering a real scenario, several experiments have been performed focusing on particular events, such as swarming, to highlight the potential of the proposed approach.
Convention Paper 10255
P6-9 Improvement of DNN-Based Speech Enhancement with Non-Normalized Features by Using an Automatic Gain Control—Linjuan Cheng, Institute of Acoustics, Chinese Academy of Sciences - Beijing, China; Chengshi Zheng, Institute of Acoustics, Chinese Academy of Sciences - Beijing, China; Renhua Peng, Chinese Academy of Sciences - Beijing, China; Xiaodong Li, Chinese Academy of Sciences - Beijing, China; Chinese Academy of Sciences - Shanghai, China
In Deep Neural Network (DNN)-based speech enhancement algorithms, performance may degrade when the peak level of the noisy speech differs significantly from that of the training data, especially when non-normalized features, such as log-power spectra, are used in practical applications. To overcome this shortcoming, we introduce an automatic gain control (AGC) method as a preprocessing technique. By doing so, we can train the model with the same peak level for all the speech utterances. To further improve the proposed DNN-based algorithm, a feature compensation method is combined with the AGC method. Experimental results indicate that the proposed algorithm maintains consistent performance when the peak level of the noisy speech varies over a large range.
Convention Paper 10256
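The AGC preprocessing idea described above amounts to scaling each utterance to a fixed peak before feature extraction, so that non-normalized features such as log-power spectra always see a consistent level. A minimal sketch, assuming a simple peak-based gain (the target value and returned-gain convention are assumptions, not the paper's exact method):

```python
import numpy as np

TARGET_PEAK = 0.5  # illustrative training-time peak level

def agc_normalize(x, target=TARGET_PEAK):
    """Scale a signal so max(|x|) equals the target peak.

    Returns the gain as well, so the original level can be restored
    after enhancement.
    """
    peak = np.max(np.abs(x)) + 1e-12  # avoid division by zero
    gain = target / peak
    return x * gain, gain

# A quiet 440 Hz test tone at 16 kHz gets boosted to the target peak.
x = 0.05 * np.sin(2 * np.pi * 440 * np.arange(8000) / 16000)
y, g = agc_normalize(x)
```

At inference time the same normalization is applied to the noisy input, and the inverse gain `1 / g` can be applied to the enhanced output to restore the original playback level.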