In recent years, convolutional neural networks have become the standard for audio semantics, surpassing traditional classification approaches that used hand-crafted feature engineering as the front end and various classifiers as the back end. Early studies adapted prominent 2D convolutional topologies from image recognition to audio classification tasks. Following the surge of deep learning over the past decade, true end-to-end audio learning, which employs algorithms that process waveforms directly, is set to become the standard. This paper attempts a comparison between deep neural setups on typical audio classification tasks, focusing on optimizing 1D convolutional neural networks that can be deployed on various audio information retrieval tasks, such as general audio detection and classification, environmental sound recognition, or speech emotion recognition.