AES E-Library
Mapping voice gender and emotion to acoustic properties of natural speech

This study concerns listeners' natural ability to identify an anonymous speaker's gender and emotion from voice alone. We attempt to map psychological characteristics of the speaker, such as gender image and emotion, to acoustic properties. The acoustic parameters of the voice samples were pitch (mean, maximum, and minimum), pitch variation over time, jitter, shimmer, and harmonics-to-noise ratio (HNR). Participants listened to 2-second voice clips and rated each voice's gender image and emotion on a 7-point scale. Emotional responses were obtained for 7 opposing pairs of affective attributes (Gobl and Ní Chasaide, 2003): relaxed/stressed, content/angry, friendly/hostile, sad/happy, bored/interested, intimate/formal, and timid/confident. The experimental results show that listeners were able to identify voice gender and assess emotional status from short utterances. Statistical analyses revealed that these acoustic parameters were related to listeners' perception of a voice's gender image and its affective attributes. For voice gender perception, there were significant correlations with jitter, shimmer, and HNR in addition to the pitch parameters. For the perception of affective attributes, the acoustic parameters were analyzed with respect to the valence-arousal dimensions. Voices perceived as positive tended to have greater pitch variance and a higher maximum pitch than those perceived as negative, and voices perceived as strongly active tended to have more voice breaks, higher jitter and shimmer, and lower HNR than those perceived as passive. We expect that our experimental results on mapping acoustic parameters to perceived voice gender and emotion could be applied in the field of Artificial Intelligence (AI) when assigning a specific tone or quality to voice agents. Moreover, such psychoacoustic mapping can improve the naturalness of synthesized speech, especially neural text-to-speech (TTS), because it can assist in selecting an appropriate speech database for voice interaction and for situations where particular voice gender and affective expressions are needed.
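The jitter and shimmer measures used in the abstract have several standard variants; the following is a minimal sketch of the common "local" definitions (mean cycle-to-cycle variation normalized by the mean value), assuming the glottal period and peak-amplitude sequences have already been extracted from the waveform. The function names and the synthetic input values are illustrative, not taken from the paper:

```python
import numpy as np

def jitter_local(periods):
    """Local jitter: mean absolute difference between consecutive
    glottal periods, divided by the mean period (dimensionless)."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def shimmer_local(amplitudes):
    """Local shimmer: the same measure applied to cycle peak amplitudes."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# A perfectly periodic voice has zero jitter (hypothetical 125 Hz periods).
print(jitter_local([0.008] * 10))        # 0.0

# Alternating peak amplitudes give a nonzero shimmer.
print(round(shimmer_local([1.0, 1.1] * 5), 4))
```

Tools such as Praat report these quantities as percentages (multiply by 100); higher values correspond to the rougher, more perturbed voices that the study associates with strongly active affect.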

Permalink: https://www.aes.org/e-lib/browse.cfm?elib=21054

