Affiliation:Department of Computer Science, University of Milan, Via Celoria 18, 20133 Milan, Italy
Cross-language speech emotion recognition is receiving increased attention due to its extensive real-world applicability. This work proposes a language-agnostic speech emotion recognition algorithm focusing on the Italian and German languages. Mel-scaled and temporal modulation spectral representations are combined and subsequently modeled by means of Gaussian mixture models. Emotion prediction is carried out via a Kullback-Leibler divergence scheme. The proposed methodology is applied to two problem settings: one involving positive vs. negative emotion classification and a second in which all Big Six emotional states are considered. A thorough experimental campaign demonstrates the efficacy of the method, as well as its superiority over other generative modeling schemes and state-of-the-art approaches. The results show the feasibility of recognizing emotional states in a language-, gender-, and speaker-independent setting.
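As a rough illustration of the modeling scheme described in this abstract (one Gaussian mixture model per emotion, prediction via Kullback-Leibler divergence), the sketch below fits a GMM per class and classifies an utterance by a Monte-Carlo KL estimate, since the KL divergence between GMMs has no closed form. Function names, feature shapes, and mixture settings are assumptions for illustration, not the authors' implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_emotion_models(features_by_emotion, n_components=4):
    """Fit one GMM per emotional class on its training feature frames."""
    return {emo: GaussianMixture(n_components=n_components,
                                 covariance_type="diag",
                                 random_state=0).fit(X)
            for emo, X in features_by_emotion.items()}

def kl_divergence_mc(gmm_p, gmm_q, n_samples=2000):
    """Monte-Carlo estimate of KL(p || q): the KL divergence between two
    GMMs has no closed form, so sample from p and average the log-ratio."""
    X, _ = gmm_p.sample(n_samples)
    return float(np.mean(gmm_p.score_samples(X) - gmm_q.score_samples(X)))

def predict_emotion(utterance_frames, models, n_components=4):
    """Model the test utterance with its own GMM, then choose the class
    whose trained model is closest in KL divergence."""
    gmm_u = GaussianMixture(n_components=n_components,
                            covariance_type="diag",
                            random_state=0).fit(utterance_frames)
    return min(models, key=lambda emo: kl_divergence_mc(gmm_u, models[emo]))
```

In practice the feature frames would be the combined mel-scaled and temporal modulation spectral representations the abstract describes.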
Download: PDF (HIGH Res) (1.5MB)
Download: PDF (LOW Res) (491KB)
Authors:Vryzas, Nikolaos; Vrysis, Lazaros; Matsiola, Maria; Kotsakis, Rigas; Dimoulas, Charalampos; Kalliris, George
Affiliation:Aristotle University of Thessaloniki, Thessaloniki, Greece
Emotional speech is a separate channel of communication that carries the paralinguistic aspects of spoken language. Knowledge of affective information can be crucial for contextual speech recognition and can also reveal elements of the speaker's personality and psychological state, enriching communication. Such data may play an important role as semantic analysis features of web content and would also apply in intelligent affective new media and social interaction domains. A model for Speech Emotion Recognition (SER), based on a Convolutional Neural Network (CNN) architecture, is proposed and evaluated. Recognition is performed on successive time frames of continuous speech. The dataset used for training and testing the model is the Acted Emotional Speech Dynamic Database (AESDD), a publicly available corpus in the Greek language. Experiments involving the subjective evaluation of the AESDD are presented to serve as a reference for human-level recognition accuracy. The proposed CNN architecture outperforms previous baseline machine learning models (Support Vector Machines) by 8.4% in terms of accuracy, and it is also more efficient because it bypasses the stage of handcrafted feature extraction. Data augmentation of the database did not affect classification accuracy in the validation tests but is expected to improve robustness and generalization. Besides performance improvements, the unsupervised feature-extraction stage of the proposed topology also makes it feasible to create real-time systems.
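Recognition on successive time frames of continuous speech, as described above, amounts to slicing the signal into fixed-length segments and converting each to a time-frequency representation for the CNN. A minimal sketch, with illustrative window, hop, and FFT sizes that are assumptions rather than the paper's settings:

```python
import numpy as np

def spectrogram_windows(x, fs, win_s=1.0, hop_s=0.5, n_fft=512):
    """Slice continuous speech into successive fixed-length segments and
    turn each into a magnitude spectrogram suitable as 2D CNN input."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    out = []
    for start in range(0, len(x) - win + 1, hop):
        seg = x[start:start + win]
        # 50%-overlapped analysis frames within the segment
        frames = np.lib.stride_tricks.sliding_window_view(seg, n_fft)[::n_fft // 2]
        out.append(np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)).T)
    return np.stack(out)  # shape: (n_segments, freq_bins, time_frames)
```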
Download: PDF (HIGH Res) (1.6MB)
Download: PDF (LOW Res) (195KB)
Authors:Liew, Kongmeng; Lindborg, PerMagnus
Affiliation:Kyoto University, Kyoto, Japan; Seoul National University, Seoul, South Korea
Sonification can be defined as any technique that translates data into non-speech sound with a systematic, describable, and reproducible method, in order to reveal or facilitate communication, interpretation, or discovery of meaning that is latent in the data. This paper describes an approach for communicating cross-cultural differences in sentiment data through sonification, a powerful technique for translating patterns into sounds that are understandable, accessible, and musically pleasant. A machine-learning classifier was trained on sentiment information from two samples of Tweets, from Singapore and New York, retrieved with the keyword "happiness." Positive-valence words related to the concept of happiness showed stronger influences on the classifier than negative words. For the mapping, Tweet-frequency differences in the semantic variable "anticipation" affected tempo, "positive" affected pitch, and "joy" affected loudness, while "trust" affected rhythmic regularity. The authors evaluated sonifications of the original data from the two cities, together with a control condition generated from random mappings, in a listening experiment. Results suggest that the original sonifications were rated as significantly more pleasant.
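The mapping described above (sentiment variables driving tempo, pitch, loudness, and rhythmic regularity) could be sketched as a simple linear parameter mapping. The parameter names and ranges below are illustrative assumptions, not the authors' exact design:

```python
def map_sentiment_to_sound(anticipation, positive, joy, trust):
    """Linear data-to-sound mapping sketch; input scores are assumed
    normalized to [0, 1]."""
    return {
        "tempo_bpm": 60 + 80 * anticipation,   # "anticipation" -> tempo
        "pitch_midi": 48 + 24 * positive,      # "positive" -> pitch (C3-C5)
        "loudness_db": -30 + 24 * joy,         # "joy" -> loudness
        "rhythmic_regularity": trust,          # "trust": 0 free .. 1 strict grid
    }
```

Any systematic, reproducible mapping of this kind satisfies the definition of sonification given above; the design choice lies in which perceptual dimension each variable drives.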
Download: PDF (HIGH Res) (380KB)
Download: PDF (LOW Res) (204KB)
Authors:Jeong, Dasaem; Kwon, Taegyun; Nam, Juhan
Affiliation:Graduate School of Culture Technology, Korea Advanced Institute of Science and Technology, Daejeon, South Korea
Automatic Music Transcription (AMT) is the process of inferring score notation from audio recordings, which depends on such subtasks as multipitch estimation, onset detection, and tempo estimation. The dynamics of music is one of the main elements that explain the characteristics of a performance, but it has not yet been thoroughly investigated in the context of automatic music transcription. This report proposes a system for estimating the intensity of individual notes from piano recordings. The algorithm is based on a score-informed nonnegative matrix factorization (NMF) that takes the spectrogram of an audio recording and a corresponding MIDI score as inputs and factorizes the spectrogram into a set of spectral templates and their activations. The intensity of each note is obtained from the maximum activation of the corresponding pitch template around the onset of the note. The authors improved their system by employing an NMF model that can learn the temporal progress of the timbre of piano notes. While the previous research was evaluated only with perfectly aligned scores, this paper also presents an evaluation with coarsely aligned scores. The results show that this approach is robust to alignment errors within 100 ms.
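The intensity-estimation step can be sketched as follows: with score-informed spectral templates W held fixed (one column per pitch), only the activations H are refined, here with the standard multiplicative update for the Euclidean objective, and each note's intensity is read off as the peak activation of its pitch template near the score onset. The function name, the `notes` format, and all sizes are illustrative assumptions, and the sketch omits the paper's temporal timbre modeling:

```python
import numpy as np

def estimate_note_intensities(V, W, notes, frames_around=5, n_iter=200):
    """Score-informed NMF sketch: W stays fixed while H is refined by the
    multiplicative update rule minimizing ||V - WH||_F; each note's
    intensity is the peak activation of its pitch template in a small
    window around the score onset. `notes` is a list of
    (pitch_index, onset_frame) pairs (an assumed format)."""
    eps = 1e-10
    H = np.full((W.shape[1], V.shape[1]), 0.1)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return [float(H[p, max(0, t - frames_around):t + frames_around + 1].max())
            for p, t in notes]
```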
Download: PDF (HIGH Res) (3.1MB)
Download: PDF (LOW Res) (469KB)
Authors:Coleman, William; Delany, Sarah Jane; Yan, Ming; Cullen, Charlie
Affiliation:Technological University Dublin, Ireland; DTS Inc. now part of Xperi
With the advent of new audio delivery technologies, object-based audio conceives of content as being assembled at the delivery end of the chain: not a fixed mix, but a series of auditory objects that can be controlled either by consumers or by content creators and providers via the accompanying metadata. The proliferation of consumption modes (stereo headphones, home cinema systems, "hearables"), media formats (mp3, CD, video and audio streaming), and content types (gaming, music, drama, and current affairs broadcasting) has given rise to a complicated landscape where content must often be adapted for multiple end-use scenarios. Such a separation of audio assets facilitates the concept of Variable Asset Compression, where the elements most important from a perceptual standpoint are prioritized over others. To implement such a system, however, insight is first required into which objects are most important and how this importance changes over time. This research investigates the first of these questions: the hierarchical classification of isolated auditory objects using machine learning techniques. The results suggest that audio object hierarchies can be successfully modeled.
Download: PDF (HIGH Res) (630KB)
Download: PDF (LOW Res) (295KB)
Authors:Koszewski, Damian; Kostek, Bozena
Affiliation:Audio Acoustics Laboratory, Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Gdansk, Poland
This paper describes a promising method for an automatic musical instrument tagging system using neural networks. Developing signal processing methods to extract information automatically has potential utility in many applications, such as searching multimedia by audio content, building context-aware mobile applications, and pre-processing for automatic mixing systems. The last of these, however, requires a significant amount of research before real musical instruments in recordings can be recognized reliably. This research focuses on how to obtain data for efficiently training, validating, and testing a deep-learning model by using a data augmentation technique. The data are transformed into 2D feature spaces, i.e., mel-scale spectrograms. The neural network used in the experiments consists of a single-block DenseNet architecture and a multihead softmax classifier for efficient learning with mixup augmentation. For automatic labeling of noisy data, batch-wise loss masking, which is robust to corrupting outliers in the data, was applied. The method provides promising recognition scores even with real-world recordings that contain noisy data.
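The mixup augmentation mentioned above blends random pairs of training examples and their labels with a shared Beta-distributed coefficient. A minimal sketch (parameter values and the one-hot label format are illustrative assumptions):

```python
import numpy as np

def mixup_batch(X, y_onehot, alpha=0.2, rng=None):
    """Mixup sketch: draw a Beta(alpha, alpha) coefficient, then blend the
    batch with a shuffled copy of itself, mixing features and one-hot
    labels with the same coefficient."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    idx = rng.permutation(len(X))
    return lam * X + (1 - lam) * X[idx], lam * y_onehot + (1 - lam) * y_onehot[idx]
```

Because the labels are mixed with the same coefficient as the inputs, the blended targets remain valid probability distributions over the classes.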
Download: PDF (HIGH Res) (2.2MB)
Download: PDF (LOW Res) (269KB)
Authors:Vrysis, Lazaros; Tsipas, Nikolaos; Thoidis, Iordanis; Dimoulas, Charalampos
Affiliation:Aristotle University of Thessaloniki, Thessaloniki, Greece
Semantic audio analysis has become a fundamental task in modern audio applications, making the improvement and optimization of classification algorithms a necessity. Standard frame-based audio classification methods have been optimized, and modern approaches introduce engineering methodologies that capture the temporal dependency between successive feature observations, following the process of temporal feature integration. Moreover, the deployment of convolutional neural networks has defined a new era in semantic audio analysis. This paper attempts a thorough comparison between standard feature-based classification strategies, state-of-the-art temporal feature integration tactics, and 1D/2D deep convolutional neural network setups on typical audio classification tasks. Experiments focus on optimizing a lightweight configuration for convolutional network topologies on a Speech/Music/Other classification scheme that can be deployed in various audio information retrieval tasks, such as voice activity detection, speaker diarization, or speech emotion recognition. The main target of this work is the establishment of an optimized protocol for constructing deep convolutional topologies for general audio detection and classification schemes, minimizing complexity and computational needs.
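Temporal feature integration, one of the strategies compared above, summarizes a sequence of short-term frame features over a longer texture window; a common variant stacks the per-window mean and standard deviation. A minimal sketch with illustrative window sizes (not the paper's settings):

```python
import numpy as np

def integrate_features(frame_feats, texture_win=43, hop=21):
    """Temporal feature integration sketch: collapse each texture window
    of short-term frame features (rows) into one observation by stacking
    the per-dimension mean and standard deviation."""
    out = []
    for s in range(0, len(frame_feats) - texture_win + 1, hop):
        block = frame_feats[s:s + texture_win]
        out.append(np.concatenate([block.mean(axis=0), block.std(axis=0)]))
    return np.array(out)
```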
Download: PDF (HIGH Res) (1.4MB)
Download: PDF (LOW Res) (1MB)
Authors:Das, Orchisama; Smith III, Julius O.; Chafe, Chris
Affiliation:Center for Computer Research in Music and Acoustics, Stanford University, Stanford, CA, USA
This paper proposes a real-time, sample-by-sample pitch tracker for monophonic audio signals using the Extended Kalman Filter in the complex domain, called an Extended Complex Kalman Filter (ECKF). It improves upon the algorithm proposed in a previous paper by fixing its slow tracking of rapid note changes: it detects harmonic change in the signal and resets the filter whenever a significant harmonic change is found. Along with the fundamental frequency, the ECKF also tracks the amplitude envelope and instantaneous phase of the input audio signal. The pitch tracker is well suited to detecting ornaments, such as slides and vibratos, in solo instrumental music. The improved algorithm is tested by tracking the pitch of bowed string (double bass), plucked string (guitar), and vocal singing samples. Parameter selection for the ECKF pitch tracker requires knowledge of the type of signal whose pitch is to be tracked, which is a potential drawback; it would be interesting to pick the optimal parameter set automatically by training on instrument-specific datasets.
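The sketch below is not the ECKF itself (which also tracks amplitude and phase): it only illustrates the sample-by-sample, complex-domain flavor and the reset-on-harmonic-change idea, using the analytic signal's phase increment as the raw frequency estimate and a one-pole smoother in place of the Kalman filter. All parameters are illustrative assumptions:

```python
import numpy as np
from scipy.signal import hilbert

def track_pitch(x, fs, smooth=0.99, reset_semitones=1.0):
    """Simplified sample-by-sample frequency tracker: the analytic
    signal's phase increment gives a raw instantaneous-frequency
    estimate; the smoothed estimate is reset whenever the raw estimate
    jumps by more than `reset_semitones`, mimicking the
    reset-on-harmonic-change strategy."""
    z = hilbert(x)                                   # analytic signal
    f = np.zeros(len(x))
    est = None
    for n in range(1, len(x)):
        inst = np.angle(z[n] * np.conj(z[n - 1])) * fs / (2 * np.pi)
        if inst > 0:
            if est is None or abs(12 * np.log2(inst / est)) > reset_semitones:
                est = inst                           # note change: reset
            else:
                est = smooth * est + (1 - smooth) * inst
        f[n] = 0.0 if est is None else est
    return f
```

Without the reset, the smoother would lag behind rapid note changes, which is exactly the failure mode the paper's harmonic-change detection addresses.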
Download: PDF (HIGH Res) (1.6MB)
Download: PDF (LOW Res) (1.4MB)
Authors:Malecki, Pawel; Piotrowska, Magdalena; Sochaczewska, Katarzyna; Piotrowski, Szymon
Affiliation:AGH University of Science and Technology, Faculty of Mechanical Engineering and Robotics, Department of Mechanics and Vibroacoustics, Kraków, Poland; Faculty of ETI, Multimedia Systems Department Gdansk University of Technology, Gdansk, Poland; Psychosound Studio, Kraków, Poland
This paper presents a case study of an original electronic music production in stereo and the means for then creating an Ambisonic remix. The main goal of this work was to explore the potential of extending the production into the full surrounding space. The stereo and Ambisonic mixes, as well as the stereo and binaural renders, were subjectively evaluated by expert performers. When compared, the stereo and Ambisonic mixes differ not only in terms of space but also with regard to timbre and dynamics. In general, the listeners preferred the spaciousness and selectivity of the Ambisonic version over the stereo one. The obtained results are consistent with outcomes from other studies focused on the differences between stereo and multichannel reproduction. The most interesting conclusion comes from a comparison between the stereo and binaural renders of the Ambisonic mix: the results show a clear preference for the spaciousness of the binaural version, and the general-preference ratings also favored binaural. Given the popularity of headphone playback, these results show the potential of Ambisonic productions targeted at binaural playback.
Download: PDF (HIGH Res) (3.2MB)
Download: PDF (LOW Res) (380KB)
The Audio Engineering Society will host its first-ever education event focused on the technical teams at houses of worship (HoWs) - the AES Worship Sound Academy - set to take place March 10-11, 2020, at the Johnson Center on the campus of Belmont University in Nashville. Offering presentations covering core audio topics along with practical, actionable advice, the Academy will bring together experts in worship sound and related fields for two packed days of experiential sessions. The Academy will address the special needs of technical worship ministry teams - often staffed by lay volunteers - with a program designed to expand their knowledge, build their confidence, and deliver tips and techniques that they will be eager to apply. Manufacturer and service provider partners will be on hand to lend their expertise and demonstrate products in the Academy's exhibition space. The AES Worship Sound Academy will serve as a prototype for a new generation of AES events, targeted at regional needs and developed cooperatively with local sections. The adoption of sophisticated technical infrastructure by HoWs is evidenced by the demand for audio, video, and lighting equipment in a perennially strong market segment; at the same time, technical ministry personnel frequently lack prior audio production experience or audio education. The AES, the AES Nashville Section, and Belmont University are working collaboratively to provide foundational training for HoW tech teams. Whether providing sound reinforcement for a startup congregation or for the most advanced purpose-built HoW, the AES Worship Sound Academy offers tech team members an unprecedented chance to listen, learn, and connect with industry professionals and with peers who share their challenges and goals. Specific topics will include mixing FOH and monitors, personal monitoring, miking, wireless systems, streaming, podcasting, recording, acoustics, system optimization, and more.
Early Bird online registration for the AES Worship Sound Academy is open through February 17 at a special rate of $150 (includes lunch both days) at www.aesworship.org. The AES Worship Sound Academy main session track will be held in the Belmont Large Theater (the Dolby Atmos environment is used by Belmont for immersive sound mixing training).
Download: PDF (509KB)
Dynamic range compressors are a good example of regularly used effects that rely on nonlinear processing, and it is common to attempt to emulate classic designs in digital plug-ins. The techniques used to profile or model such devices have included deep-learning neural networks, the estimation of polynomial functions, and the use of lookup tables. It is also possible to construct a digital version of a parametric filter that gives the impression of having the correct response within the audio range even when its center frequency is above the Nyquist frequency. Finally, we encounter a novel approach to reducing the room effect of reproduced audio signals that preserves the direct sound and early reflections while reducing the influence of late reverberation.
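As a point of reference for the kind of nonlinearity being emulated, the static gain computer of a textbook feed-forward compressor (hard knee) can be written as below; the threshold and ratio values are illustrative, and real emulations add knee smoothing and attack/release ballistics:

```python
import numpy as np

def compressor_gain_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Static gain computer of a feed-forward compressor with a hard knee:
    above threshold, the output level rises only 1/ratio dB per input dB,
    so the applied gain (in dB, always <= 0) is over/ratio - over."""
    over = np.maximum(0.0, level_db - threshold_db)
    return over / ratio - over
```

For example, with a -20 dB threshold and a 4:1 ratio, an input at -10 dB sits 10 dB over threshold and receives 7.5 dB of gain reduction, while signals below threshold pass unchanged.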
Download: PDF (375KB)