Emotional speech is a separate channel of communication that carries the paralinguistic aspects of spoken language. Affective information can be crucial for contextual speech recognition, and it can also reveal elements of the speaker's personality and psychological state, enriching the communication. Such data may play an important role as features for the semantic analysis of web content, and would also apply in intelligent affective new media and social interaction domains. A model for Speech Emotion Recognition (SER) based on a Convolutional Neural Network (CNN) architecture is proposed and evaluated. Recognition is performed on successive time frames of continuous speech. The dataset used for training and testing the model is the Acted Emotional Speech Dynamic Database (AESDD), a publicly available corpus in the Greek language. Experiments involving the subjective evaluation of the AESDD are presented to serve as a reference for human-level recognition accuracy. The proposed CNN architecture outperforms the previous baseline machine learning models (Support Vector Machines) by 8.4% in terms of accuracy, and it is also more efficient because it bypasses the stage of handcrafted feature extraction. Data augmentation of the database did not affect classification accuracy in the validation tests, but it is expected to improve robustness and generalization. Besides the performance improvements, the unsupervised feature-extraction stage of the proposed topology also makes real-time systems feasible.
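As an illustration of what performing recognition on successive time frames can involve, the minimal sketch below cuts a mono signal into fixed-length frames that could each be passed to a classifier. It is not the paper's implementation: the function name `split_into_frames`, the 1.0 s frame length, the 0.5 s hop, and the choice to drop a trailing partial frame are all assumptions made for the example, since the abstract does not state these values.

```python
from typing import List, Sequence


def split_into_frames(
    samples: Sequence[float],
    sample_rate: int,
    frame_seconds: float = 1.0,   # assumed value, not given in the abstract
    hop_seconds: float = 0.5,     # assumed value, not given in the abstract
) -> List[List[float]]:
    """Cut a mono signal into fixed-length, possibly overlapping frames."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")

    # Convert durations to whole numbers of samples.
    frame_len = int(round(frame_seconds * sample_rate))
    hop_len = int(round(hop_seconds * sample_rate))
    if frame_len < 1 or hop_len < 1:
        raise ValueError("frame and hop must each span at least one sample")

    frames: List[List[float]] = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop_len
    return frames
```

Each returned frame has the same length, which suits a network with a fixed input size. A hop shorter than the frame produces overlapping frames, so a continuous stream yields a prediction more often than once per frame length.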