AES E-Library Search Results

Search Results (Displaying 1-10 of 11 matches)

1D/2D Deep CNNs vs. Temporal Feature Integration for General Audio Classification


Semantic audio analysis has become a fundamental task in modern audio applications, making the improvement and optimization of classification algorithms a necessity. Standard frame-based audio classification methods have been optimized, and modern approaches introduce engineering methodologies that capture the temporal dependency between successive feature observations, following the process of temporal feature integration. Moreover, the deployment of convolutional neural networks has defined a new era in semantic audio analysis. This paper attempts a thorough comparison between standard feature-based classification strategies, state-of-the-art temporal feature integration tactics, and 1D/2D deep convolutional neural network setups on typical audio classification tasks. Experiments focus on optimizing a lightweight configuration for convolutional network topologies on a Speech/Music/Other classification scheme that can be deployed in various audio information retrieval tasks, such as voice activity detection, speaker diarization, or speech emotion recognition. The main target of this work is the establishment of an optimized protocol for constructing deep convolutional topologies for general audio detection and classification schemes, minimizing complexity and computational needs.
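The 1D/2D distinction above comes down to how a convolution slides over the input: along time only, with each filter spanning the full feature vector, or along both time and frequency of a spectrogram. The sketch below illustrates the two receptive-field shapes with plain NumPy loops; the function names are illustrative and not from the paper, which uses trained network topologies rather than single hand-set filters.

```python
import numpy as np

def conv1d_time(features, kernel):
    """Valid 1D convolution across the time axis of a (time, feat) array.

    In a 1D audio CNN each filter spans the whole feature dimension and
    slides along time only; here a single such filter is applied.
    """
    T, F = features.shape
    K = kernel.shape[0]                 # kernel shape: (K, F)
    out = np.empty(T - K + 1)
    for t in range(T - K + 1):
        out[t] = np.sum(features[t:t + K] * kernel)
    return out

def conv2d_patch(spec, kernel):
    """Valid 2D convolution over a (time, freq) spectrogram patch.

    A 2D audio CNN slides the filter along both time and frequency,
    capturing local spectro-temporal patterns.
    """
    T, F = spec.shape
    Kt, Kf = kernel.shape
    out = np.empty((T - Kt + 1, F - Kf + 1))
    for t in range(out.shape[0]):
        for f in range(out.shape[1]):
            out[t, f] = np.sum(spec[t:t + Kt, f:f + Kf] * kernel)
    return out
```

A 1D layer therefore collapses the feature axis immediately, while a 2D layer preserves it; this is the main structural difference the paper's comparison explores.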

Authors: Lazaros Vrysis, Nikolaos Tsipas, Iordanis Thoidis, and Charalampos Dimoulas
JAES Volume 68 Issue 1/2 pp. 66-77; January 2020 Permalink

Click to purchase paper as a non-member or login as an AES member. If your company or school subscribes to the E-Library then switch to the institutional version. If you are not an AES member and would like to subscribe to the E-Library then Join the AES!

This paper costs $33 for non-members and is free for AES members and E-Library subscribers.

Start a discussion about this paper!


A Machine Learning Approach to Hierarchical Categorization of Auditory Objects


With the advent of new audio delivery technologies, object-based audio conceives of the audio content as being created at the delivery end of the chain: content is delivered not via a fixed mix but as a series of auditory objects that can then be controlled either by consumers or by content creators and providers via the accompanying metadata. The proliferation of a variety of consumption modes (stereo headphones, home cinema systems, "hearables"), media formats (mp3, CD, video and audio streaming), and content types (gaming, music, drama, and current affairs broadcasting) has given rise to a complicated landscape where content must often be adapted for multiple end-use scenarios. Such a separation of audio assets facilitates the concept of Variable Asset Compression, where the elements that are most important from a perceptual standpoint are prioritized over others. In order to implement such a system, however, insight is first required into which objects are most important and how this importance changes over time. This research investigates the first of these questions: the hierarchical classification of isolated auditory objects using machine learning techniques. The results suggest that audio object hierarchies can be successfully modeled.

JAES Volume 68 Issue 1/2 pp. 48-56; January 2020 Permalink



A Sonification of Cross-Cultural Differences in Happiness-Related Tweets


Sonification can be defined as any technique that translates data into non-speech sound with a systematic, describable, and reproducible method, in order to reveal or facilitate communication, interpretation, or discovery of meaning that is latent in the data. This paper describes an approach for communicating cross-cultural differences in sentiment data through sonification, a powerful technique for translating patterns into sounds that are understandable, accessible, and musically pleasant. A machine-learning classifier was trained on sentiment information from two samples of Tweets, from Singapore and New York, containing the keyword "happiness." Positive-valence words related to the concept of happiness showed stronger influences on the classifier than negative words. For the mapping, differences in Tweet frequency for the semantic variable "anticipation" affected tempo, "positive" affected pitch, "joy" affected loudness, and "trust" affected rhythmic regularity. The authors evaluated sonifications of the original data from the two cities, together with a control condition generated from random mappings, in a listening experiment. Results suggest that the original was rated as significantly more pleasant.
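The variable-to-parameter mappings described above can be sketched as a small function. The parameter ranges and scaling factors here are invented for illustration and are not taken from the paper; only the pairing of variables with musical parameters follows the abstract.

```python
def map_to_music(anticipation, positive, joy, trust):
    """Map normalized (0..1) sentiment-variable differences to musical
    parameters, following the pairings described in the abstract.

    All ranges below are hypothetical illustration values.
    """
    return {
        "tempo_bpm": 60 + 80 * anticipation,       # "anticipation" -> tempo
        "pitch_midi": 48 + round(24 * positive),   # "positive"     -> pitch
        "loudness_db": -30 + 24 * joy,             # "joy"          -> loudness
        "rhythmic_regularity": trust,              # "trust"        -> regularity
    }
```

Such a deterministic, describable mapping is what distinguishes sonification from arbitrary data-driven sound design, and it is what the random-mapping control condition perturbs.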

JAES Volume 68 Issue 1/2 pp. 25-33; January 2020 Permalink



Continuous Speech Emotion Recognition with Convolutional Neural Networks


Emotional speech is a separate channel of communication that carries the paralinguistic aspects of spoken language. Affective information can be crucial for contextual speech recognition and can also provide elements of the personality and psychological state of the speaker, enriching the communication. That kind of data may play an important role as semantic analysis features of web content and would also apply in intelligent affective new media and social interaction domains. A model for Speech Emotion Recognition (SER) based on a Convolutional Neural Network (CNN) architecture is proposed and evaluated. Recognition is performed on successive time frames of continuous speech. The dataset used for training and testing the model is the Acted Emotional Speech Dynamic Database (AESDD), a publicly available corpus in the Greek language. Experiments involving the subjective evaluation of the AESDD are presented to serve as a reference for human-level recognition accuracy. The proposed CNN architecture outperforms previous baseline machine learning models (Support Vector Machines) by 8.4% in terms of accuracy, and it is also more efficient because it bypasses the stage of handcrafted feature extraction. Data augmentation of the database did not affect classification accuracy in the validation tests but is expected to improve robustness and generalization. Besides performance improvements, the unsupervised feature-extraction stage of the proposed topology also makes it feasible to create real-time systems.
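Recognition on successive time frames of continuous speech presupposes a segmentation step like the one sketched below. The function name and parameter values are illustrative, not the paper's configuration.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into successive, possibly overlapping frames.

    Continuous-speech recognition proceeds frame by frame: the stream is
    segmented like this, and each frame is classified independently.
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
```

With hop smaller than frame_len the frames overlap, giving a denser sequence of per-frame emotion predictions over the utterance.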

JAES Volume 68 Issue 1/2 pp. 14-24; January 2020 Permalink



Electronic Music Production in Ambisonics-Case Study


This paper presents a case study of an original electronic music production in stereo and the means for then creating an Ambisonic remix. The main goal of this work was to explore the potential for extending all dimensions of the mix into extended space. The stereo and Ambisonic mixes, as well as the stereo and binaural renders, were subjectively evaluated by expert performers. When compared, the stereo and Ambisonic mixes differ not only in terms of space but also with regard to timbre and dynamics. In general, the listeners preferred the spaciousness and selectivity of the Ambisonic version over the stereo one. The obtained results are consistent with the outcomes of other studies that focused on the differences between stereo and multichannel reproduction. The most interesting conclusion can be drawn from a comparison between the stereo and binaural renders of the Ambisonic mix: listeners clearly preferred the spaciousness of the binaural version, and overall preference likewise favored the binaural render. Given the popularity of headphone playback, the obtained results show the potential of Ambisonic productions targeted at binaural playback.

JAES Volume 68 Issue 1/2 pp. 87-94; January 2020 Permalink



Improved Real-Time Monophonic Pitch Tracking with the Extended Complex Kalman Filter


This paper proposes a real-time, sample-by-sample pitch tracker for monophonic audio signals using the Extended Kalman Filter in the complex domain, called an Extended Complex Kalman Filter (ECKF). It improves upon the algorithm proposed in a previous paper by fixing the issue of slow tracking of rapid note changes. It does so by detecting harmonic change in the signal and resetting the filter whenever a significant harmonic change is detected. Along with the fundamental frequency, the ECKF also tracks the amplitude envelope and instantaneous phase of the input audio signal. The pitch tracker is ideal for detecting ornaments in solo instrument music, such as slides and vibratos. The improved algorithm is tested by tracking the pitch of bowed-string (double bass), plucked-string (guitar), and vocal singing samples. Parameter selection for the ECKF pitch tracker requires knowledge of the type of signal whose pitch is to be tracked, which is a potential drawback. Future work could automatically pick the optimum set of parameters for a given audio signal by training on instrument-specific datasets.
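The ECKF itself is a full Kalman recursion in the complex domain, but the quantity it tracks, the instantaneous frequency of a complex (analytic) signal, can be illustrated with a simple phase-difference estimator. This is not the authors' algorithm, only the raw, unsmoothed estimate that a Kalman filter would refine with noise smoothing.

```python
import numpy as np

def phase_diff_freq(z, fs):
    """Instantaneous frequency (Hz) from successive samples of a complex
    signal: f = fs * angle(z[n] * conj(z[n-1])) / (2*pi).

    A Kalman-style tracker estimates this recursively with smoothing;
    this one-liner shows the raw sample-by-sample estimate.
    """
    return fs * np.angle(z[1:] * np.conj(z[:-1])) / (2 * np.pi)

# Usage: a 440 Hz complex exponential sampled at 8 kHz.
fs = 8000.0
n = np.arange(64)
z = np.exp(2j * np.pi * 440.0 * n / fs)
f_est = phase_diff_freq(z, fs)   # all values equal 440 Hz
```

Unlike this estimator, the ECKF also carries amplitude and phase in its state and can be reset on detected harmonic change, which is what enables fast tracking across note transitions.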

JAES Volume 68 Issue 1/2 pp. 78-86; January 2020 Permalink



Modeling Audio Effects


[Feature] Dynamic range compressors are a good example of regularly used effects that rely on nonlinear processing, and it is common to attempt to emulate classic designs in digital plug-ins. The techniques used to profile or model such devices have included deep neural networks, the estimation of polynomial functions, and the use of lookup tables. It is possible to construct a digital version of a parametric filter that gives the impression of having the correct response within the audio range even when its center frequency is above the Nyquist frequency. Also described is a novel approach to reducing the room effect of reproduced audio signals that preserves the direct sound and early reflections while reducing the influence of late reverberation.
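As a concrete example of the nonlinear behavior being modeled, here is a textbook hard-knee compressor gain computer with simple one-pole attack/release ballistics. This is a generic design sketch, not any specific device profiled in the feature, and the parameter values are illustrative.

```python
import numpy as np

def compressor_gain_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Static gain curve of a hard-knee downward compressor.

    Below threshold the gain is 0 dB; above it, output level rises at
    1/ratio of the input rate, so gain = (1/ratio - 1) * overshoot.
    """
    overshoot = np.maximum(level_db - threshold_db, 0.0)
    return (1.0 / ratio - 1.0) * overshoot

def smooth_gain(gain_db, attack_coef=0.9, release_coef=0.99):
    """One-pole ballistics: fast attack (gain falling), slow release."""
    out = np.empty_like(gain_db)
    g = 0.0
    for i, target in enumerate(gain_db):
        coef = attack_coef if target < g else release_coef
        g = coef * g + (1.0 - coef) * target
        out[i] = g
    return out
```

The static curve is memoryless, but the ballistics make the overall effect time-variant and nonlinear, which is why profiling such devices calls for the learning-based or table-based techniques listed above.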

JAES Volume 68 Issue 1/2 pp. 100-104; January 2020 Permalink



Musical Instrument Tagging Using Data Augmentation and Effective Noisy Data Processing


This paper describes a promising method for an automatic music instrument tagging system using neural networks. Developing signal processing methods to extract information automatically has potential utility in many applications, such as searching for multimedia based on its audio content, making context-aware mobile applications, and pre-processing for automatic mixing systems. However, the last-mentioned application requires a significant amount of research to reliably recognize real musical instruments in recordings. This research focuses on how to obtain data for efficiently training, validating, and testing a deep-learning model by using a data augmentation technique. These data are transformed into 2D feature spaces, i.e., mel-scale spectrograms. The neural network used in the experiments consists of a single-block DenseNet architecture and a multihead softmax classifier for efficient learning with mixup augmentation. For automatic noisy-data labeling, batch-wise loss masking, which is robust to corrupting outliers in the data, was applied. The method provides promising recognition scores even with real-world recordings that contain noisy data.
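Mixup, the augmentation named above, blends pairs of training examples and their labels with a weight drawn from a Beta distribution. A minimal sketch, treating spectrograms as plain arrays with one-hot labels; the alpha value is a common default, not necessarily the paper's setting:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup augmentation: blend two spectrograms and their one-hot labels
    with a weight lam ~ Beta(alpha, alpha)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

Because the blended label is a convex combination of the originals, the network is trained against soft targets, which tends to regularize decision boundaries between instrument classes.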

JAES Volume 68 Issue 1/2 pp. 57-65; January 2020 Permalink



Note-Intensity Estimation of Piano Recordings Using Coarsely Aligned MIDI Score


Automatic Music Transcription (AMT) is the process of inferring score notation from audio recordings, which depends on such subtasks as multipitch estimation, onset detection, and tempo estimation. The dynamics of music is one of the main elements that explain the characteristics of a performance, but dynamics has not yet been thoroughly investigated in the context of automatic music transcription. This report proposes a system for estimating the intensity of individual notes in piano recordings. The algorithm is based on a score-informed nonnegative matrix factorization (NMF) that takes the spectrogram of an audio recording and a corresponding MIDI score as inputs and factorizes the spectrogram into a set of spectral templates and their activations. The intensity of each note is obtained from the maximum activation of the corresponding pitch template around the onset of the note. The authors improved their system by employing an NMF model that can learn the temporal progression of the timbre of piano notes. While the previous research was evaluated only with perfectly aligned scores, this paper also presents an evaluation with coarsely aligned scores. The results show that this approach is robust to alignment errors within 100 ms.
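The NMF step described above can be sketched in a few lines: with the spectral templates W held fixed, only the activations H are updated, and each note's intensity is then read from the activation peak near its onset. The multiplicative update below uses the Euclidean cost; the paper's exact cost function, score constraints, and timbre model are richer than this sketch, and the helper names are hypothetical.

```python
import numpy as np

def activations_fixed_templates(V, W, n_iter=500, eps=1e-9):
    """Solve V ~ W @ H for H >= 0 with W (spectral templates) fixed,
    using multiplicative updates for the Euclidean cost.

    In a score-informed setting, W holds one template per pitch and the
    MIDI score constrains which activations may be nonzero.
    """
    H = np.random.default_rng(0).random((W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def note_intensity(H, pitch_idx, onset_frame, window=5):
    """Note intensity = activation peak near the note's onset frame."""
    lo = max(onset_frame - window, 0)
    return H[pitch_idx, lo:onset_frame + window].max()
```

Reading the maximum over a window around the onset, rather than at the onset frame itself, is what makes the estimate tolerant of small score-alignment errors.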

JAES Volume 68 Issue 1/2 pp. 34-47; January 2020 Permalink



Toward Language-Agnostic Speech Emotion Recognition


Cross-language speech emotion recognition is receiving increased attention due to its extensive real-world applicability. This work proposes a language-agnostic speech emotion recognition algorithm focusing on the Italian and German languages. The mel-scaled and temporal modulation spectral representations are combined and subsequently modeled by means of Gaussian mixture models. Emotion prediction is carried out via a Kullback-Leibler divergence scheme. The proposed methodology is applied to two problem settings: one involving positive vs. negative emotion classification and a second in which all Big Six emotional states are considered. A thorough experimental campaign demonstrated the efficacy of the method, as well as its superiority over other generative modeling schemes and state-of-the-art approaches. The results demonstrate the feasibility of recognizing emotional states in a language-, gender-, and speaker-independent setting.
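The generative decision rule can be illustrated with a simplified stand-in: a single diagonal-covariance Gaussian per emotion class instead of a full GMM, with the closed-form KL divergence between Gaussians. This sketches only the nearest-model-in-KL idea, not the paper's exact features or divergence computation.

```python
import numpy as np

def fit_diag_gaussian(X):
    """Fit a diagonal-covariance Gaussian to the rows of X."""
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def kl_diag_gaussians(mu0, var0, mu1, var1):
    """Closed-form KL( N(mu0,var0) || N(mu1,var1) ), diagonal covariances."""
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def classify(X_query, class_models):
    """Label the query utterance's feature frames with the class whose
    model is nearest in KL divergence."""
    mu_q, var_q = fit_diag_gaussian(X_query)
    return min(class_models,
               key=lambda c: kl_diag_gaussians(mu_q, var_q, *class_models[c]))
```

For full GMMs the KL divergence has no closed form and is typically approximated, e.g. by sampling; the single-Gaussian case above keeps the decision rule exact and compact.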

JAES Volume 68 Issue 1/2 pp. 7-13; January 2020 Permalink



AES - Audio Engineering Society