Affiliation:College of Intelligence and Computing, Tianjin University, Tianjin, P.R. China
Given a face image and a speech audio clip, talking face generation refers to synthesizing a face video that speaks the given speech. It has wide applications in movie dubbing, teleconferencing, virtual assistants, etc. This paper gives an overview of recent research progress on talking face generation. The author first reviews traditional talking face generation methods. Then, deep-learning-based talking face generation methods, including talking face synthesis for a specific identity and for an arbitrary identity, are summarized. The author then surveys recent detail-aware talking face generation methods, including noise-based, eye-conversion-based, and facial-anatomy-based approaches. Next, the author surveys talking head generation methods, such as video/image-driven, pose-information-driven, and audio-driven talking head generation. Finally, some future directions for talking face generation are highlighted.
Download: PDF (HIGH Res) (2.2MB)
Download: PDF (LOW Res) (325KB)
Authors:Bergner, Jakob; Schössow, Daphne; Preihs, Stephan; Peissig, Jürgen
Affiliation:Institute of Communications Technology, Leibniz University Hannover, Germany
This work is motivated by the question of whether different loudspeaker-based multichannel playback methods can be robustly characterized by measurable acoustic properties. For that, underlying acoustic dimensions were identified that allow for a discriminative sound field analysis within a music reproduction scenario. The subject of investigation is a set of different musical pieces available in different multichannel playback formats. Re-recordings of the stimuli at a listening position using a spherical microphone array enable a sound field analysis that includes, in total, 237 signal-based indicators in the categories of loudness, quality, spaciousness, and time. The indicators are fed to a factor and time series analysis to identify the most relevant acoustic dimensions that reflect and explain significant parts of the variance within the acoustical data. The results show that of the eight relevant dimensions, the dimensions "High-Frequency Diffusivity," "Elevational Diffusivity," and "Mid-Frequency Diffusivity" are capable of identifying statistically significant differences between the loudspeaker setups. The presented approach leads to plausible results that are in accordance with the expected differences between the loudspeaker configurations used. The findings may be used for a better understanding of the effects of different loudspeaker configurations on human perception and emotional response when listening to music.
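The abstract describes reducing a large set of signal-based indicators to a few latent acoustic dimensions via factor analysis. As a minimal illustration of that reduction step, the sketch below uses a principal-component decomposition of the indicator correlation matrix as a stand-in for the paper's factor and time series analysis; the toy data, indicator count, and variance threshold are assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the re-recorded stimuli: 200 analysis frames, each
# described by 12 signal-based indicators (the paper uses 237, spanning
# loudness, quality, spaciousness, and time categories).
n_frames, n_indicators = 200, 12
latent = rng.normal(size=(n_frames, 3))        # 3 hidden "acoustic dimensions"
loadings = rng.normal(size=(3, n_indicators))
X = latent @ loadings + 0.1 * rng.normal(size=(n_frames, n_indicators))

# Standardize, then eigendecompose the correlation matrix (a PCA-style
# stand-in for the factor analysis described in the abstract).
Xz = (X - X.mean(0)) / X.std(0)
corr = np.corrcoef(Xz, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
explained = eigvals[order] / eigvals.sum()

# Keep the dimensions that explain most of the variance within the data.
n_keep = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)
scores = Xz @ eigvecs[:, order[:n_keep]]
print(n_keep, scores.shape)
```

The retained `scores` columns play the role of the relevant dimensions whose time series could then be compared across loudspeaker setups.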
Download: PDF (HIGH Res) (3.9MB)
Download: PDF (LOW Res) (698KB)
Authors:Master, Aaron S.; Lu, Lie; Swedlow, Nathan
Affiliation:Dolby Laboratories, Inc., San Francisco, CA
Speech enhancement (SE) systems typically operate on monaural input and are used for applications including voice communications and capture cleanup for user-generated content. Recent advancements and changes in the devices used for these applications are likely to lead to an increase in the amount of two-channel content for the same applications. However, SE systems are typically designed for monaural input; stereo results produced using trivial methods such as channel-independent or mid-side processing may be unsatisfactory and can introduce substantial speech distortions. To address this, the authors propose a system that creates a novel representation of stereo signals called custom mid-side signals (CMSS). CMSS allow the benefits of mid-side signals for center-panned speech to be extended to a much larger class of input signals. This, in turn, allows any existing monaural SE system to operate as an efficient stereo system by processing the custom mid signal. This paper describes how the parameters needed for CMSS can be efficiently estimated by a component of the spatio-level-filtering source separation system. Subjective listening using state-of-the-art deep-learning-based SE systems on stereo content with various speech mixing styles shows that CMSS processing leads to improved speech quality at approximately half the cost of channel-independent processing.
Download: PDF (HIGH Res) (2.6MB)
Download: PDF (LOW Res) (580KB)
Authors:Salmon, François; Changenet, Frédéric; Colas, Tom; Verron, Charles; Paquier, Mathieu
Affiliation:Noise Makers, 3 Rue Brossay Saint-Marc, 35700 Rennes, France; Radio France, 116 Avenue du Président Kennedy, 75220 Paris, France; University of Brest, CNRS, Lab-STICC UMR 6285, 6 avenue Victor Le Gorgeu, CS 93837, 29238 Brest Cedex 3, France; Radio France, 116 Avenue du Président Kennedy, 75220 Paris, France; Noise Makers, 3 Rue Brossay Saint-Marc, 35700 Rennes, France; University of Brest, CNRS, Lab-STICC UMR 6285, 6 avenue Victor Le Gorgeu, CS 93837, 29238 Brest Cedex 3, France
With the advent of object-based audio productions, a major challenge for sound recordists is to determine multichannel microphone arrays that are suitable for several sound reproduction systems. Various multichannel 3D microphone arrays have been designed for the production of immersive content, and it seems necessary to assess their qualities on several playback systems. This study concerns the subjective evaluation of six multichannel microphone arrays used for the recording of classical music: Decca Tree, ESMA-3D, MMAD, 2L-Cube, and first-order and second-order ambisonic microphone arrays. Subjects evaluated the sound recordings according to four perceptual attributes (precision of localization, envelopment, spectral quality, and preference) on two reproduction systems (a 5.1.4 multichannel loudspeaker setup and a dynamic binaural playback). As observed previously with stereophonic reproduction, results showed that coincident systems can provide good localization accuracy but can lack the sensation of envelopment by reverberation. Moreover, they are more likely to be perceived differently under different rendering conditions. The greatest sense of envelopment was produced by ESMA-3D for the two rendering conditions. No particular system was preferred by the subjects for creating a mix with spot microphones.
Download: PDF (HIGH Res) (2.3MB)
Download: PDF (LOW Res) (902KB)
Authors:Riedel, Stefan; Frank, Matthias; Zotter, Franz
Affiliation:Institute of Electronic Music and Acoustics, University of Music and Performing Arts, Graz, Austria
Listener envelopment refers to the sensation of being surrounded by sound, either by multiple direct sound events or by a diffuse reverberant sound field. More recently, a specific attribute for the sensation of being covered by sound from elevated directions has been proposed by Sazdov et al. and was termed listener engulfment. The first experiment presented here investigates how the temporal and directional density of sound events affects listener envelopment. The second experiment studies how elevated loudspeaker layers affect envelopment versus engulfment. A spatial granular synthesis technique is used to precisely control the temporal and directional density of sound events. Experimental results indicate that a directionally uniform distribution of sound events at time intervals Δt < 20 ms is required to elicit a sensation of diffuse envelopment, whereas longer time intervals lead to localized auditory events. The results also show that elevated loudspeaker layers do not increase envelopment but contribute specifically to listener engulfment. Low-pass-filtered stimuli enhance envelopment in directionally sparse conditions but impede control over engulfment due to a reduction of height localization cues. The results can be exploited in the technical design and creative application of spatial sound synthesis and reverberation algorithms.
Download: PDF (HIGH Res) (935KB)
Download: PDF (LOW Res) (792KB)
Authors:Fierro, Leonardo; Välimäki, Vesa
Affiliation:Acoustics Lab, Department of Information and Communication Engineering, Aalto University, Espoo, Finland
The decomposition of sounds into sines, transients, and noise is a long-standing research problem in audio processing. The current solutions for this three-way separation detect either horizontal and vertical structures or anisotropy and orientations in the spectrogram to identify the properties of each spectral bin and classify it as sinusoidal, transient, or noise. This paper proposes an enhanced three-way decomposition method based on fuzzy logic, enabling soft masking while preserving the perfect reconstruction property. The proposed method allows each spectral bin to simultaneously belong to two classes, sine and noise or transient and noise. Results of a subjective listening test against three other techniques are reported, showing that the proposed decomposition yields better or comparable quality. The main improvement appears in transient separation, which exhibits little or no energy loss or leakage from the other components and performs well for test signals presenting strong transients. The audio quality of the separation is shown to depend on the complexity of the input signal for all tested methods. The proposed method helps improve the quality of various audio processing applications. A successful implementation over a state-of-the-art time-scale modification method is reported as an example.
Download: PDF (HIGH Res) (7.8MB)
Download: PDF (LOW Res) (1.2MB)
Affiliation:Laboratory of Music Informatics, Department of Computer Science, Università degli Studi di Milano, Milan, Italy
The Fourier Transform (FT) is a widely used analysis tool. However, FT alone is not suited for the analysis of bivariate signals, e.g., stereophonic recordings, because it is not sensitive to the relationship between channels. Different works addressing this problem can be found in the literature; the Bivariate Mixture Space (BMS) is introduced here as an alternative representation to the existing techniques. BMS is still based on the FT and can be thought of as an extension of it, such that the relationship between two signals is considered as additional information in the frequency domain. Despite being simpler than other techniques aimed at representing bivariate signals, this representation is shown to have some desirable characteristics that are absent in traditional representations, which lead to novel ways to perform linear and non-linear decomposition, feature extraction, and data visualization. As a demonstrative application, an Independent Component Analysis algorithm is derived from the BMS, which shows promising results with respect to existing implementations in terms of performance and robustness.
Download: PDF (HIGH Res) (2.7MB)
Download: PDF (LOW Res) (776KB)
Authors:Young, Kat; Kearney, Gavin
Affiliation:AudioLab, School of Physics, Engineering and Technology, University of York, UK
Traditional head-related transfer function acoustic measurement methods can be time-consuming, repetitive, and require complex equipment. Although numerical simulation such as the boundary element method offers an alternative approach, creating the required accurate 3D mesh of a subject can also be time-consuming and complex, typically involving scanning the subject and a number of manual post-processing steps. This paper presents an alternative solution specifically for the Knowles Electronics Manikin for Acoustic Research (KEMAR) and reports the results of comparisons between simulated and acoustically measured head-related transfer functions. Such comparisons show good consistency: differences in interaural time difference, spectral magnitude, and interaural spectral difference are close to just-noticeable-difference values and similar to values reported by others. The mesh can therefore be used as a viable representation of direct measurement within virtual acoustic simulations, allowing researchers with unusual requirements to access the benefits of the boundary element method without having to first scan their manikin.
Download: PDF (HIGH Res) (7.5MB)
Download: PDF (LOW Res) (837KB)
Authors:Brinkmann, Fabian; Kreuzer, Wolfgang; Thomsen, Jeffrey; Dombrovskis, Sergejs; Pollack, Katharina; Weinzierl, Stefan; Majdak, Piotr
Affiliation:Audio Communication Group, Technische Universität Berlin, Germany; Acoustics Research Institute, Austrian Academy of Sciences, Vienna, Austria; Audio Communication Group, Technische Universität Berlin, Germany; China Euro Vehicle Technology AB, Gothenburg, Sweden; Acoustics Research Institute, Austrian Academy of Sciences, Vienna, Austria; Audio Communication Group, Technische Universität Berlin, Germany; Acoustics Research Institute, Austrian Academy of Sciences, Vienna, Austria
Mesh2HRTF 1.x is an open-source and fully scriptable end-to-end pipeline for the numerical calculation of head-related transfer functions (HRTFs). The calculations are based on 3D meshes of a listener's body parts such as the head, pinna, and torso. The numerical core of Mesh2HRTF is written in C++ and employs the boundary-element method for solving the Helmholtz equation. It is accelerated by a multilevel fast multipole method and can easily be parallelized to further speed up the computations. The recently refactored framework of Mesh2HRTF 1.x contains tools for preparing the meshes as well as specific post-processing and inspection of the calculated HRTFs. The resulting HRTFs are saved in the Spatially Oriented Format for Acoustics (SOFA), making them directly applicable in virtual and augmented reality applications and psychoacoustic research. The Mesh2HRTF 1.x code is automatically tested to assure high quality and reliability. A comprehensive online documentation enables easy access for users without in-depth knowledge of acoustic simulations.
Download: PDF (HIGH Res) (6.8MB)
Download: PDF (LOW Res) (999KB)