AES New York 2018
Paper Session P02
P02 - Signal Processing—Part 1
Wednesday, October 17, 9:00 am — 11:00 am
Chair: Emre Çakir, Tampere University of Technology - Tampere, Finland
P02-1 Musical Instrument Synthesis and Morphing in Multidimensional Latent Space Using Variational, Convolutional Recurrent Autoencoders—Emre Çakir, Tampere University of Technology - Tampere, Finland; Tuomas Virtanen, Tampere University of Technology - Tampere, Finland
In this work we propose a deep-learning-based method for musical instrument synthesis: variational, convolutional recurrent autoencoders (VCRAE). The method uses the higher-level time-frequency representations extracted by the convolutional and recurrent layers to learn a Gaussian distribution during training; at usage time, this distribution is used to infer unique samples by interpolating between multiple instruments. The reconstruction performance of VCRAE is evaluated by proxy through an instrument classifier, and it yields significantly better accuracy than two baseline autoencoder methods. Synthesized samples for combinations of 15 different instruments are available on the companion website.
Convention Paper 10035
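The two ideas named in the P02-1 abstract, a learned Gaussian latent distribution and interpolation between instruments, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `sample_latent` and `morph_latents`, the two-dimensional latent vectors, and the use of a weighted average as the interpolation rule are all assumptions made for illustration, and the encoder and decoder networks are omitted entirely.

```python
import math
import random


def sample_latent(mean, log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, 1) (reparameterization).

    `mean` and `log_var` stand in for the per-dimension outputs of a
    hypothetical encoder; they are plain lists of floats here.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def morph_latents(latents, weights):
    """Weighted average of several latent vectors (one per instrument).

    Weights are normalized to sum to 1, so the result lies inside the
    convex hull of the input vectors.
    """
    if len(latents) != len(weights):
        raise ValueError("need exactly one weight per latent vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(latents[0])
    return [sum((w / total) * z[i] for z, w in zip(latents, weights))
            for i in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical latent codes for two instruments.
    z_a = sample_latent([0.0, 2.0], [-4.0, -4.0], rng)
    z_b = sample_latent([2.0, 0.0], [-4.0, -4.0], rng)
    # An equal-weight morph; a decoder (not shown) would turn this into audio.
    print(morph_latents([z_a, z_b], [1.0, 1.0]))
```

In a full system the morphed vector would be passed to the trained decoder to synthesize the hybrid instrument sound; that step depends on the model architecture and is left out here.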
P02-2 Music Enhancement by a Novel CNN Architecture—Anton Porov, PDMI RAS - St. Petersburg, Russia; Eunmi Oh, Samsung Electronics Co., Ltd. - Seoul, Korea; Kihyun Choo, Samsung Electronics Co., Ltd. - Suwon, Korea; Hosang Sung, Samsung Electronics - Korea; Jonghoon Jeong, Samsung Electronics Co. Ltd. - Seoul, Korea; Konstantin Osipov, PDMI RAS - Russia; Holly Francois, Samsung Electronics R&D Institute UK - Staines-Upon Thames, Surrey, UK
This paper is concerned with music enhancement by removal of coding artifacts and recovery of acoustic characteristics that preserve the sound quality of the original music content. To achieve this, we propose a novel convolutional neural network (CNN) architecture called FTD (Frequency-Time Dependent) CNN, which exploits correlation and context information across the spectral and temporal dependencies of music signals. Experimental results show that both subjective and objective sound quality metrics are significantly improved. This unique way of applying a CNN to exploit global dependency across frequency bins may effectively restore information that is corrupted by coding artifacts in compressed music content.
Convention Paper 10036
P02-3 The New Dynamics Processing Effect in Android Open Source Project—Ricardo Garcia, Google - Mountain View, CA, USA
The new Dynamics Processing Effect (DPE) in the Android “P” Audio Framework, part of the Android Open Source Project (AOSP), provides developers with controls to fine-tune the audio experience using several stages of equalization, multi-band compressors, and linked limiters. The API allows developers to configure the DPE’s multichannel architecture to exercise real-time control over thousands of audio parameters. This talk additionally discusses the design and use of DPE in the recently announced Sound Amplifier accessibility service for Android and outlines other uses for acoustic compensation and hearing applications.
Convention Paper 10037
P02-4 On the Physiological Validity of the Group Delay Response of All-Pole Vocal Tract Modeling—Aníbal Ferreira, University of Porto - Porto, Portugal
Magnitude-oriented approaches dominate the voice analysis front-ends of most current technologies addressing, e.g., speaker identification, speech coding/compression, and voice reconstruction and re-synthesis. A popular technique is all-pole vocal tract modeling. The phase response of all-pole models is known to be non-linear and highly dependent on the magnitude frequency response. In this paper we use a shift-invariant phase-related feature, estimated from signal harmonics, to study the impact of all-pole models on the phase structure of voiced sounds. We relate that impact to the phase structure found in natural voiced sounds in order to draw conclusions about the physiological validity of the group delay of all-pole vocal tract modeling. Our findings emphasize that harmonic phase models are idiosyncratic, which is important in speaker identification and in fostering the quality and naturalness of synthetic and reconstructed speech.
Convention Paper 10038
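For readers unfamiliar with the quantity discussed in the P02-4 abstract, the group delay of an all-pole model H(z) = 1/A(z) can be computed directly from the denominator coefficients. The sketch below uses the standard textbook identity for the group delay of a polynomial in z^-1; it is not the shift-invariant harmonic phase feature used in the paper, and the function name `allpole_group_delay` and the example coefficients are assumptions made for illustration.

```python
import cmath
import math


def allpole_group_delay(a, omega):
    """Group delay, in samples, of H(z) = 1 / A(z) at frequency omega.

    `a` holds the denominator coefficients [1, a_1, ..., a_p] of
    A(z) = sum_k a_k z^-k, and `omega` is in radians per sample.
    For A alone the group delay is Re{ sum(k a_k e^{-j w k}) / A(e^{jw}) };
    inverting the filter negates the phase, hence the leading minus sign.
    """
    num = sum(k * a_k * cmath.exp(-1j * omega * k) for k, a_k in enumerate(a))
    den = sum(a_k * cmath.exp(-1j * omega * k) for k, a_k in enumerate(a))
    return -(num / den).real


if __name__ == "__main__":
    # Hypothetical single resonance: pole pair at radius 0.95, angle pi/4.
    r, theta = 0.95, math.pi / 4
    a = [1.0, -2.0 * r * math.cos(theta), r * r]
    for w in (0.0, theta, math.pi / 2, math.pi):
        print(f"omega={w:.3f}  group delay={allpole_group_delay(a, w):.3f}")
```

The delay peaks near the pole angle, which mirrors the abstract's point that the phase behavior of an all-pole model is tied to its magnitude response.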