This study examines listeners' natural ability to identify an anonymous speaker's emotions from speech samples spanning a broad range of emotional intensity. It compares emotion ratings between posed and spontaneous speech samples and analyzes how basic acoustic parameters are used in these judgments. The spontaneous samples were extracted from the Korean Spontaneous Speech corpus, which consists of casual conversations; the posed samples, expressing four emotions (happiness, neutrality, anger, sadness), were obtained from the Emotion Classification dataset. Non-native listeners rated each sample on seven pairs of opposing affective attributes. Listeners perceived fewer spontaneous samples as having negative valence, and the posed samples received higher mean ratings than the spontaneous ones only for negative valence. Listeners thus reacted more sensitively to posed than to spontaneous speech in negative valence, and they had difficulty detecting happiness in the posed samples. Spontaneous samples perceived as positive showed higher pitch variance and higher maximum pitch than those perceived as negative; in contrast, perceived negative valence in the posed samples was positively correlated with higher values of these pitch parameters. These results can be used to assign specific vocal affects to artificial-intelligence voice agents or virtual humans, rendering their voices more human-like.