Thursday 12th February
Damian Murphy, Audio Lab, University of York, UK
Paper Session 4: Speech Processing and Analysis
4-1: Vowel-Based Voice Conversion and its Application to Singing-Voice Manipulation
Yuri Yoshida, Ryuichi Nisimura, Toshio Irino, Hideki Kawahara, Wakayama University, Wakayama City, Japan
A novel and lightweight voice conversion method is applied to manipulate a singer's identity and singing style in real time. The method uses nonlinear spectral morphing driven by proximity information for vowel templates of the source and target singing materials. It is built on the STRAIGHT speech analysis, modification, and resynthesis system, and it yields highly natural manipulated sounds. To apply this vowel-based voice conversion to singing voices, singular-value decomposition and robust statistical measures are introduced to handle the large variability of vowel spectra and fundamental frequencies in singing. Distance measures for preparing vowel templates and calculating proximity information are designed on a psychophysical frequency scale, the equivalent rectangular bandwidth (ERB_N) rate.
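As a rough illustration of the ERB_N-rate scale mentioned in this abstract, the sketch below converts frequency in Hz to ERB_N number and back using the widely cited Glasberg and Moore (1990) approximation. The abstract does not say which formula the authors use, so the formula choice and the function names `erb_rate` and `erb_rate_to_hz` are assumptions made for illustration only.

```python
import math


def erb_rate(freq_hz: float) -> float:
    """Map frequency in Hz to ERB_N number (Glasberg & Moore 1990 approximation).

    Assumed formula for illustration; the paper may use a different variant.
    """
    return 21.4 * math.log10(4.37 * freq_hz / 1000.0 + 1.0)


def erb_rate_to_hz(erb_number: float) -> float:
    """Inverse mapping: ERB_N number back to frequency in Hz."""
    return (10.0 ** (erb_number / 21.4) - 1.0) / 4.37 * 1000.0


# Equal steps in Hz shrink on the ERB_N-rate axis as frequency rises,
# which is why a distance measured on this scale weights low frequencies more.
low_step = erb_rate(600.0) - erb_rate(500.0)
high_step = erb_rate(4100.0) - erb_rate(4000.0)
```

A spectral distance computed after warping the frequency axis with `erb_rate` would therefore emphasise differences in the region where auditory frequency resolution is finest.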
4-2: Fast and Reliable F0 Estimation Method Based on the Period Extraction of Vocal Fold Vibration of Singing Voice and Speech
Masanori Morise, Kwansei Gakuin University, Nishinomiya, Hyogo, Japan; Hideki Kawahara, Wakayama University, Wakayama City, Japan; Haruhiro Katayose, Kwansei Gakuin University, Nishinomiya, Hyogo, Japan
A fast and reliable fundamental frequency (F0) extraction method is proposed for real-time interactive applications using a singing voice. It is based on detecting the period of vocal fold vibration, so it does not require expensive computation such as the STFT or autocorrelation. A parallel processing architecture and a new cost function make this simple idea competitive with state-of-the-art F0 estimation methods. A series of tests using publicly accessible F0 reference databases revealed that the proposed method outperforms conventional methods in terms of speed and accuracy. Finally, comparative tests using artificial test signals with fast and deep vibrato were conducted to demonstrate the effectiveness of the proposed method in interactive real-time applications for singing sounds.
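The general idea of estimating F0 from the signal period, without an STFT or autocorrelation, can be sketched as follows. This is not the authors' algorithm: it omits their parallel architecture and cost function, and simply measures the spacing of positive-going zero crossings in a clean test tone. The function name `f0_from_zero_crossings` and the test signal are hypothetical.

```python
import math


def f0_from_zero_crossings(signal, sample_rate):
    """Estimate F0 in Hz from the mean interval between positive-going zero crossings.

    Illustrative only: assumes a clean, roughly sinusoidal input such as a
    low-pass filtered voice signal. Returns None if fewer than two crossings exist.
    """
    crossing_times = []
    for i in range(1, len(signal)):
        prev, cur = signal[i - 1], signal[i]
        if prev < 0.0 <= cur:
            # Linear interpolation gives a sub-sample estimate of the crossing instant.
            frac = -prev / (cur - prev)
            crossing_times.append((i - 1 + frac) / sample_rate)
    if len(crossing_times) < 2:
        return None
    periods = [b - a for a, b in zip(crossing_times, crossing_times[1:])]
    return len(periods) / sum(periods)


# Synthetic 220 Hz tone, 0.1 s at 16 kHz, with an arbitrary phase offset.
sample_rate = 16000
tone = [math.sin(2.0 * math.pi * 220.0 * n / sample_rate + 0.3) for n in range(1600)]
estimate = f0_from_zero_crossings(tone, sample_rate)
```

The cost per sample is a comparison and an occasional division, which suggests why period-based approaches are attractive for real-time use; handling noise, harmonics, and vibrato robustly is where a method like the one proposed has to do the real work.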
4-3: Kaleivoicecope: Voice Transformation from Interactive Installations to Video Games
Oscar Mayor, Jordi Bonada, Jordi Janer, Pompeu Fabra University, Barcelona, Spain
This paper presents a real-time voice transformation technology and its applications. The technology transforms a human voice, for example by changing gender from male to female or by making a teenager sound like an old woman. More exotic transformations are also possible, such as robotizing the voice or giving it an alien character as if it were taken from a science fiction film. The technology has already been used for real-time installations in museums and in postproduction applications. It is now being adapted to interactive video games to transform the voice of the user or of any game character.
4-4: Natural Transformation of Type and Nature of the Voice for Extending Vocal Repertoire in High-Fidelity Applications
Snorre Farner, Axel Roebel, Xavier Rodet, Ircam, Paris, France
Natural voice transformation will reduce the need for authentic voices in many situations, ranging from vocal services via education and entertainment to artistic applications. Transforming one voice to match that of another person has been studied for decades but still suffers from limitations, which we propose to overcome with an alternative approach: modifying pitch, spectral envelope, durations, etc., in a global way. While this sacrifices the possibility of attaining a specific target voice, it allows the production of new, highly natural voices with a different sex and age, modified vocal quality (soft, breathy, or whispered), or another speech style (dullness or eagerness). The transformation of sex and age has been evaluated in a listening test.