AES New York 2019
Poster Session P9
P9 - Posters: Applications in Audio
Thursday, October 17, 9:00 am — 10:30 am
P9-1 Analyzing Loudness Aspects of 4.2 Million Musical Albums in Search of an Optimal Loudness Target for Music Streaming—Eelco Grimm, HKU University of the Arts - Utrecht, Netherlands; Grimm Audio - Eindhoven, The Netherlands
In cooperation with music streaming service Tidal, 4.2 million albums were analyzed for loudness aspects such as loudest and softest track loudness. Evidence of the development of the loudness war was found, and from the data set and a limited subject study a recommendation was derived: music streaming services should use album normalization at –14 LUFS on mobile platforms and at –18 LUFS or lower on stationary platforms. Tidal has implemented the recommendation and reports positive results.
Convention Paper 10268
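The album normalization recommended above amounts to applying one gain to the whole album so that its integrated loudness meets the platform target, which preserves the track-to-track loudness relationships chosen in mastering. A minimal sketch of that gain computation (the function name and example loudness value are hypothetical; the –14/–18 LUFS targets are from the abstract):

```python
def album_normalization_gain_db(album_loudness_lufs: float,
                                target_lufs: float = -14.0) -> float:
    """Gain in dB that brings an album's integrated loudness to the target.

    Album normalization applies this single gain to every track on the
    album, so relative track loudness is preserved.
    """
    return target_lufs - album_loudness_lufs

# Hypothetical album mastered to -9.5 LUFS integrated loudness:
mobile_gain = album_normalization_gain_db(-9.5, target_lufs=-14.0)      # -4.5 dB
stationary_gain = album_normalization_gain_db(-9.5, target_lufs=-18.0)  # -8.5 dB
```

A loudly mastered album therefore receives more attenuation on the quieter stationary target, while quiet albums may receive positive gain.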
P9-2 Audio Data Augmentation for Road Objects Classification—Ohad Barak, Mentor Graphics - Mountain View, CA, USA; Nizar Sallem, Mentor Graphics - Mountain View, CA, USA
Following the resurgence of machine learning within the context of autonomous driving, the need for acquiring and labeling data has grown manyfold. Despite the large amount of available visual data (images, point clouds, etc.), researchers apply augmentation techniques to extend the training dataset, which improves classification accuracy. When trying to exploit audio data for autonomous driving, two challenges immediately surface: first, the lack of available data, and second, the absence of augmentation techniques. In this paper we introduce a series of augmentation techniques suitable for audio data. We apply several procedures, inspired by data augmentation for image classification, that transform and distort the original data to produce analogous effects on sound. We show the increase in the overall accuracy of our neural network for sound classification by comparing it to the non-augmented version.
Convention Paper 10269
P9-3 Is Binaural Spatialization the Future of Hip-Hop?—Kierian Turner, University of Lethbridge - Lethbridge, AB, Canada; Amandine Pras, Digital Audio Arts - University of Lethbridge - Lethbridge, Alberta, Canada; School for Advanced Studies in the Social Sciences - Paris, France
Modern hip-hop is typically associated with samples and MIDI and not so much with creative source spatialization, since the energy-driving elements are usually located in the center of the stereo image. To evaluate how placing certain elements behind, above, or below the listener affects the listening experience, we experimented beyond standard mixing practices by spatializing the beats and vocals of two hip-hop tracks in different ways. Then 16 hip-hop musicians, producers, and enthusiasts, and three audio engineers compared stereo and binaural versions of these two tracks in a perceptual experiment. Results showed that hip-hop listeners expect a few elements, including the vocals, to be mixed conventionally in order to create a cohesive mix and to minimize distraction.
Convention Paper 10270
P9-4 Alignment and Timeline Construction for Incomplete Analogue Audience Recordings of Historical Live Music Concerts—Thomas Wilmering, Queen Mary University of London - London, UK; Centre for Digital Music (C4DM); Florian Thalmann, Queen Mary University of London - London, UK; Mark Sandler, Queen Mary University of London - London, UK
Analogue recordings pose specific problems for automatic alignment, such as distortion due to physical degradation and differences in tape speed during recording, copying, and digitization. Recordings are often incomplete, exhibiting gaps of different lengths. In this paper we propose a method to align multiple digitized analogue recordings of the same concerts that vary in quality and song segmentation. The process includes the automatic construction of a reference concert timeline. We evaluate alignment methods on a synthetic dataset and apply our algorithm to real-world data.
Convention Paper 10271
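The abstract does not specify the alignment algorithm, but aligning recordings with differing tape speeds is commonly approached with dynamic time warping (DTW) over per-frame audio features, since a non-linear warp lets frames of one recording map to stretched or compressed runs of frames in the other. A minimal DTW cost sketch over scalar features (a real system would use feature vectors such as chroma frames):

```python
import math

def dtw_cost(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic time warping cost between two feature sequences a and b."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            D[i][j] = d + min(D[i - 1][j],      # a advances (b runs slower here)
                              D[i][j - 1],      # b advances (a runs slower here)
                              D[i - 1][j - 1])  # both advance in step
    return D[n][m]
```

A sequence aligns at zero cost against a speed-varied copy of itself, e.g. `dtw_cost([1, 2, 3], [1, 2, 2, 3])` is `0`; handling gaps and building the reference timeline are the harder problems the paper addresses on top of pairwise alignment.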
P9-5 Noise Robustness Automatic Speech Recognition with Convolutional Neural Network and Time Delay Neural Network—Jie Wang, Guangzhou University - Guangzhou, China; Dunze Wang, Guangzhou University - Guangzhou, China; Yunda Chen, Guangzhou University - Guangzhou, China; Xun Lu, Power Grid Planning Center, Guangdong Power Grid Company - Guangdong, China; Chengshi Zheng, Institute of Acoustics, Chinese Academy of Sciences - Beijing, China
To improve the performance of automatic speech recognition in noisy environments, a convolutional neural network (CNN) combined with a time-delay neural network (TDNN), referred to as CNN-TDNN, is introduced. The CNN-TDNN model is further optimized by factoring the parameter matrices in the time-delay hidden layers and by adding a time-restricted self-attention layer after the CNN-TDNN hidden layers. Experimental results show that the optimized CNN-TDNN model outperforms the DNN, CNN, TDNN, and unoptimized CNN-TDNN models. The average word error rate (WER) is reduced by 11.76% compared with the baselines.
Convention Paper 10272
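The word error rate used in the evaluation is the standard ASR metric: the word-level Levenshtein distance (substitutions, deletions, and insertions) divided by the number of reference words. A self-contained sketch (function name hypothetical):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via a rolling-row Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (r != h)))      # substitution or match
        prev = cur
    return prev[-1] / len(ref)

# One substitution in three reference words: WER = 1/3.
wer = word_error_rate("the cat sat", "the bat sat")
```

Note that WER can exceed 1.0 when the hypothesis has many insertions, which is why relative reductions such as the 11.76% reported above are the usual way to compare systems.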