AES Dublin 2019
Poster Session P17
P17 - Poster Session 3
Friday, March 22, 15:00 — 17:00 (The Liffey B)
P17-1 Audio Event Identification in Sports Media Content: The Case of Basketball—Panagiotis-Marios Filippidis, Aristotle University of Thessaloniki - Thessaloniki, Greece; Nikolaos Vryzas, Aristotle University of Thessaloniki - Thessaloniki, Greece; Rigas Kotsakis, Aristotle University of Thessaloniki - Thessaloniki, Greece; Iordanis Thoidis, Aristotle University of Thessaloniki - Thessaloniki, Greece; Charalampos A. Dimoulas, Aristotle University of Thessaloniki - Thessaloniki, Greece; Charalampos Bratsas, Aristotle University of Thessaloniki - Thessaloniki, Greece
This paper presents a methodology for audio event recognition in basketball content. The proposed method leverages low-level features of the audio component of basketball videos to identify basic events of the game. Through the process of detecting and defining audio event classes, a sound event taxonomy of the sport is formed. The tasks of detecting acoustic events related to basketball games, namely referee whistles and court air horns, are investigated. For audio event detection, a feature vector is extracted and evaluated for the training of one-class classifiers. The detected events are used to segment basketball games, and the results are combined with speech-to-text and text mining to pinpoint keywords in every segment.
Convention Paper 10190 (Purchase now)
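The one-class detection step described in the abstract can be sketched as follows. This is an illustrative outline only: synthetic Gaussian feature vectors stand in for the paper's low-level audio features, and a simple Gaussian model with a Mahalanobis-distance threshold stands in for whatever one-class classifier the authors trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: feature vectors drawn only from the target (whistle) class.
# Real systems would extract low-level audio features per frame; random
# vectors stand in for them here.
whistle_train = rng.normal(loc=5.0, scale=0.5, size=(200, 4))

# Fit a simple Gaussian model of the target class.
mu = whistle_train.mean(axis=0)
cov = np.cov(whistle_train, rowvar=False)
cov_inv = np.linalg.inv(cov)

def mahalanobis(x):
    d = x - mu
    return np.sqrt(d @ cov_inv @ d)

# Decision threshold taken from the training set (99th percentile).
train_dist = np.array([mahalanobis(x) for x in whistle_train])
threshold = np.quantile(train_dist, 0.99)

def is_whistle(x):
    """One-class decision: inside the learned distribution or not."""
    return mahalanobis(x) <= threshold

# Fresh whistle-like frames are mostly accepted; a crowd-like frame far
# from the modeled class is rejected.
fresh_whistles = rng.normal(5.0, 0.5, size=(100, 4))
accept_rate = np.mean([is_whistle(x) for x in fresh_whistles])
crowd = rng.normal(0.0, 0.5, size=4)
print(accept_rate, is_whistle(crowd))
```

Detected event frames would then mark segment boundaries for the downstream speech-to-text and keyword steps.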
P17-2 Objective and Subjective Comparison of Several Machine Learning Techniques Applied for the Real-Time Emulation of the Guitar Amplifier Nonlinear Behavior—Thomas Schmitz, University of Liege - Liege, Belgium; Jean-Jacques Embrechts, University of Liege - Liege, Belgium
Recent progress in the field of nonlinear system identification has improved the ability to emulate nonlinear audio systems such as tube guitar amplifiers. In particular, machine learning techniques have enabled accurate emulation of such devices. The next challenge lies in reducing the computation time of these models. The first purpose of this paper is to compare different neural-network architectures in terms of accuracy and computation time. The second is to select the fastest model that maintains the same perceived accuracy, assessed through a subjective listening test.
Convention Paper 10191 (Purchase now)
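The accuracy side of this comparison can be illustrated, very loosely, with memoryless stand-ins: two polynomial waveshapers of different capacity fitted by least squares to a tanh soft-clipping curve (a common stand-in for tube-like distortion). The paper's actual models are neural networks with memory; this sketch only shows the accuracy-versus-capacity trade-off being measured.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 2001)   # input sample values
target = np.tanh(x)                # "measured" soft-clipping response

# Two candidate emulation models of different capacity, fitted by
# least squares to the target curve.
small = np.polynomial.Polynomial.fit(x, target, deg=3)
large = np.polynomial.Polynomial.fit(x, target, deg=9)

def rmse(model):
    """Root-mean-square emulation error against the target response."""
    return np.sqrt(np.mean((model(x) - target) ** 2))

print(rmse(small) > rmse(large))  # higher-capacity model fits better
```

In the paper this accuracy axis is traded off against computation time and validated perceptually; the listening test decides how much objective error is actually audible.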
P17-3 A Generalized Subspace Approach for Multichannel Speech Enhancement Using Machine Learning-Based Speech Presence Probability Estimation—Yuxuan Ke, University of Chinese Academy of Sciences - Beijing, China; Yi Hu, University of Wisconsin - Milwaukee - Milwaukee, WI, USA; Jian Li, University of Chinese Academy of Sciences - Beijing, China; Institute of Acoustics, Chinese Academy of Sciences - Beijing, China; Chengshi Zheng, Institute of Acoustics, Chinese Academy of Sciences - Beijing, China; Xiaodong Li, Chinese Academy of Sciences - Beijing, China; Chinese Academy of Sciences - Shanghai, China
A generalized subspace-based multichannel speech enhancement method in the frequency domain is proposed, in which the multichannel speech presence probability is estimated using machine learning. An efficient, low-latency neural network (NN) model is introduced to discriminatively learn a gain mask that separates the speech and noise components in noisy scenarios. In addition, a generalized subspace-based approach in the frequency domain is proposed, where the speech power spectral density (PSD) matrix and the noise PSD matrix are estimated over short-term and long-term averaging periods, respectively. Experimental results show that the proposed method outperforms existing NN-based beamforming methods in terms of the perceptual evaluation of speech quality score and the segmental signal-to-noise ratio improvement.
Convention Paper 10192 (Purchase now)
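The role of a gain mask, and of the short-term/long-term PSD averaging mentioned in the abstract, can be sketched in the power-spectrum domain. Here the mask is computed by simple spectral subtraction rather than learned by an NN, and all spectra are synthetic; this only illustrates what a mask does, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_bins = 200, 129

speech_pow = rng.uniform(0.0, 4.0, size=(n_frames, n_bins))  # |S|^2
noise_pow = rng.uniform(0.5, 1.0, size=(n_frames, n_bins))   # |N|^2
noisy_pow = speech_pow + noise_pow   # uncorrelated speech + noise

# Long-term average of the noise PSD (noise assumed more stationary
# than speech, as in the abstract's long-term averaging).
noise_psd = noise_pow.mean(axis=0, keepdims=True)
# Short-term speech PSD estimate via spectral subtraction.
speech_psd = np.maximum(noisy_pow - noise_psd, 1e-8)

# Wiener-style gain mask in [0, 1]: near 1 where speech dominates,
# near 0 where noise dominates.
mask = speech_psd / (speech_psd + noise_psd)
enhanced_pow = mask * noisy_pow

err_before = np.mean((noisy_pow - speech_pow) ** 2)
err_after = np.mean((enhanced_pow - speech_pow) ** 2)
print(err_before > err_after)  # masking moves the estimate toward speech
```

In the paper the mask is predicted discriminatively by a low-latency NN and feeds a subspace beamformer, rather than being applied directly as above.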
P17-4 Detecting Road Surface Wetness Using Microphones and Convolutional Neural Networks—Giovanni Pepe, Università Politecnica delle Marche - Ancona, Italy; ASK Industries S.p.A. - Montecavolo di Quattro Castella (RE), Italy; Leonardo Gabrielli, Università Politecnica delle Marche - Ancona, Italy; Livio Ambrosini, Università Politecnica delle Marche - Ancona, Italy; ASK Industries S.p.A. - Montecavolo di Quattro Castella (RE), Italy; Stefano Squartini, Università Politecnica delle Marche - Ancona, Italy; Luca Cattani, ASK Industries S.p.A. - Montecavolo di Quattro Castella (RE), Italy
The automatic detection of road conditions in next-generation vehicles is an important task that is attracting increasing interest from the research community. Its main applications concern driver safety, autonomous vehicles, and in-car audio equalization. These applications rely on sensors that must be deployed following a trade-off between installation and maintenance costs and effectiveness. In this paper we tackle road surface wetness classification using microphones, comparing convolutional neural networks (CNN) with bi-directional long short-term memory (BLSTM) networks, following previous motivating works. We introduce a new dataset to assess the role of different tire types and discuss the deployment of the microphones. We find a solution that is immune to water and sufficiently robust to in-cabin interference and tire type changes. Classification results with the recorded dataset reach a 95% F-score and a 97% F-score using the CNN and BLSTM methods, respectively.
Convention Paper 10193 (Purchase now)
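The F-score used to report these results is the harmonic mean of precision and recall. A minimal computation on illustrative wet/dry labels (1 = wet, 0 = dry; these are made-up labels, not the paper's data):

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])  # ground truth
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])  # classifier output

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives  -> 4
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives -> 1
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives -> 1

precision = tp / (tp + fp)   # 4/5 = 0.8
recall = tp / (tp + fn)      # 4/5 = 0.8
f_score = 2 * precision * recall / (precision + recall)
print(f_score)  # 0.8
```

Unlike plain accuracy, the F-score stays informative when wet and dry examples are imbalanced, which is why it is a common choice for this kind of binary detection task.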
P17-5 jReporter: A Smart Voice-Recording Mobile Application—Lazaros Vrysis, Aristotle University of Thessaloniki - Thessaloniki, Greece; Nikolaos Vryzas, Aristotle University of Thessaloniki - Thessaloniki, Greece; Efstathios Sidiropoulos, Aristotle University of Thessaloniki - Thessaloniki, Greece; Evangelia Avraam, Aristotle University of Thessaloniki - Thessaloniki, Greece; Charalampos A. Dimoulas, Aristotle University of Thessaloniki - Thessaloniki, Greece
The evaluation of sound level measuring mobile applications shows that a sophisticated audio analysis framework for voice-recording purposes may be useful for journalists. In many audio recording scenarios, repeating the procedure is not an option, and under unwanted conditions the quality of the capture may be degraded. Many problems can be fixed in post-production, but others may render the source material useless. This work introduces a framework for monitoring voice-recording sessions that is capable of detecting common mistakes and providing the user with feedback to avoid unwanted conditions, thereby improving recording quality. The framework specifies techniques for measuring sound level, estimating reverberation time, and performing audio semantic analysis by employing audio processing and feature-based classification.
Convention Paper 10194 (Purchase now)
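One building block of such a monitoring framework, sound-level measurement, can be sketched as a per-frame RMS level in dBFS; the frame length and signals below are assumptions for illustration, not details from the paper.

```python
import numpy as np

def rms_dbfs(frame):
    """RMS level of a float signal frame (full scale = 1.0), in dBFS."""
    rms = np.sqrt(np.mean(np.square(frame)))
    return 20.0 * np.log10(max(rms, 1e-12))  # floor avoids log10(0)

fs = 48000
t = np.arange(fs) / fs
full_scale = np.sin(2 * np.pi * 440 * t)  # peak-1.0 sine
quiet = 0.01 * full_scale                 # 40 dB lower

# A full-scale sine has RMS 1/sqrt(2), i.e. about -3.01 dBFS.
print(round(rms_dbfs(full_scale), 1))  # -3.0
print(round(rms_dbfs(quiet), 1))       # -43.0
```

A monitoring app would evaluate this per buffer during recording and warn the user when the level is persistently too low or near clipping, alongside the reverberation-time and semantic-analysis checks the framework specifies.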
P17-6 Two-Channel Sine Sweep Stimuli: A Case Study Evaluating 2-n Channel Upmixers—Laurence Hobden, Meridian Audio Ltd. - Huntingdon, Cambridgeshire, UK; Christopher Gribben, Meridian Audio Ltd. - Huntingdon, Cambridgeshire, UK
This paper presents new two-channel test stimuli for the evaluation of systems where traditional monophonic test signals are not suitable. The test stimuli consist of a series of exponential sine sweep signals with varying inter-channel level difference and inter-channel phase difference. As a case study, the test signals have been used to evaluate a selection of 2-n channel upmixers within a consumer audio-visual receiver. The new stimuli have been shown to provide useful insight for the improvement and development of future upmixers.
Convention Paper 10195 (Purchase now)
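A two-channel exponential sine sweep with a fixed inter-channel level difference (ICLD) and inter-channel phase difference (ICPD) can be generated as follows; the sweep range, duration, and offset values are illustrative choices, not the paper's.

```python
import numpy as np

fs = 48000
dur = 2.0                  # sweep duration, seconds
f1, f2 = 20.0, 20000.0     # start/end frequencies, Hz
t = np.arange(int(fs * dur)) / fs

# Exponential (log) sine sweep phase, Farina-style:
# phi(t) = 2*pi*f1*L*(exp(t/L) - 1), with L = T / ln(f2/f1)
L = dur / np.log(f2 / f1)
phase = 2 * np.pi * f1 * L * (np.exp(t / L) - 1.0)

icld_db = 6.0       # right channel 6 dB below left
icpd = np.pi / 2    # 90-degree inter-channel phase offset

left = np.sin(phase)
right = 10 ** (-icld_db / 20) * np.sin(phase + icpd)
stimulus = np.stack([left, right], axis=1)  # shape: (samples, 2)
print(stimulus.shape)  # (96000, 2)
```

Sweeping a grid of ICLD/ICPD values then exercises the upmixer's steering behavior across frequency, which a mono test signal cannot do.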
P17-7 A Rendering Method for Diffuse Sound—Akio Ando, University of Toyama - Toyama, Japan
This paper proposes a new audio rendering method that tries to preserve the sound inputs to both ears rather than the sound direction. It uses a conversion matrix that maps the original sound signal to a signal with a different number of channels. The least squares method optimizes the matrix so as to minimize the difference between the binaural input signals produced by the original signal and those produced by the rendered signal. To calculate the error function, the method uses head-related impulse responses. Two rendering experiments were conducted to evaluate the method. In the first experiment, the 22 channel signals of the 22.2 multichannel format (without the two LFE channels) were rendered into three-dimensional 8-channel signals by the conventional direction-based method and by the new method. The result showed that the new method preserved the diffuseness of sound better than the conventional method. In the second experiment, the 22 channel signals were converted into 2-channel signals by the conventional downmix method and by the new method. The evaluation based on the cross-correlation coefficient showed little difference between the downmix method and the new method; however, an informal listening test suggested that the new method might preserve the diffuseness of sound better.
Convention Paper 10196 (Purchase now)
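The least-squares optimization described in the abstract can be sketched per frequency bin: find the conversion matrix T that best preserves the two ear signals, given ear responses for the source and target layouts. Random complex values stand in for measured head-related transfer functions here, and the closed form below is only one way to solve the stated minimization.

```python
import numpy as np

rng = np.random.default_rng(2)
n_orig, n_out = 22, 8   # e.g., 22.2 without LFE, down to 8 channels

def crand(shape):
    """Random complex placeholder for HRTF values at one frequency bin."""
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)

A = crand((2, n_orig))  # ears x original channels (source-layout HRTFs)
B = crand((2, n_out))   # ears x output channels (target-layout HRTFs)

# Minimize ||B @ T - A||_F over the conversion matrix T (n_out x n_orig):
# the pseudoinverse gives the least-squares solution.
T = np.linalg.pinv(B) @ A

x = crand((n_orig,))     # original channel signals at this bin
ears_orig = A @ x        # ear signals from the original layout
ears_rend = B @ (T @ x)  # ear signals after rendering through T

# With only 2 ears and 8 output channels, B has full row rank here,
# so the ear signals are matched exactly.
print(np.allclose(ears_orig, ears_rend))
```

Repeating this per bin (and converting back to the time domain) yields the channel-count conversion; because the objective is the ear signals rather than source directions, diffuse content is preserved more faithfully than with direction-based rendering.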