Authors:Breebaart, Jeroen; Cengarle, Giulio; Lu, Lie; Mateos, Toni; Purnhagen, Heiko; Tsingos, Nicolas
Object-based audio (OBA) program material is challenging to distribute over low-bandwidth channels and costly to render for thin clients. This research proposes a dynamic object-grouping solution that can represent a complex object-based scene as an equivalent reduced set of object groups while maintaining perceptually transparent rendering quality. This solution is a type of spatial coding. This paper introduces a real-time greedy simplification technique that addresses limitations of previous approaches by modeling spatial release from masking and distributing input objects into multiple output groups. The core algorithm is extended to preserve other types of artistic metadata beyond object position. Results of perceptual tests show that this solution can achieve a 10:1 reduction in object count while maintaining high-quality audio playback and rendering flexibility at the endpoint. Spatial coding does not require perceptual coding of the objects’ audio essence but can be further combined with audio coding tools to deliver OBA content at low bit rates. This makes spatial coding a key component of an OBA production and distribution workflow. Object-based content creation, distribution, and rendering workflows require novel methods to process, combine, encode, and simplify complex auditory scenes to allow end-point rendering flexibility, efficiency, and adaptability as well as the means to cater for personalized experiences.
Authors:Franck, Andreas; Francombe, Jon; Woodcock, James; Hughes, Richard; Coleman, Philip; Menzies, Dylan; Cox, Trevor J.; Jackson, Philip J.B.; Fazi, Filippo Maria
Affiliation:Institute of Sound and Vibration Research, University of Southampton, Southampton, Hampshire, UK; BBC Research and Development, Dock House, MediaCityUK, Salford, UK; Acoustics Research Centre, University of Salford, Salford, UK; Institute of Sound Recording, University of Surrey, Guildford, Surrey, UK; Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, Surrey, UK
Object-based audio promises format-agnostic reproduction and extensive personalization of spatial audio content. However, in practical listening scenarios, such as in consumer audio, ideal reproduction is typically not possible. To maximize the quality of listening experience, a different approach is required, for example modifications of metadata to adjust for the reproduction layout or personalization choices. This paper proposes a novel system architecture for semantically informed rendering (SIR) that combines object audio rendering with high-level processing of object metadata. In many cases, this processing uses novel, advanced metadata describing the objects to optimally adjust the audio scene to the reproduction system or listener preferences. The proposed system is evaluated with several adaptation strategies, including semantically motivated downmix to layouts with few loudspeakers, manipulation of perceptual attributes, perceptual reverberation compensation, and orchestration of mobile devices for immersive reproduction. These examples demonstrate how SIR can significantly improve the media experience and provide advanced personalization controls, for example by maintaining smooth object trajectories on systems with few loudspeakers, or providing personalized envelopment levels. An example implementation of the proposed system architecture is described and provided as an open, extensible software framework that combines object-based audio rendering and high-level processing of advanced object metadata.
Authors:Paulus, Jouni; Torcoli, Matteo; Uhle, Christian; Herre, Jürgen; Disch, Sascha; Fuchs, Harald
Affiliation:Fraunhofer Institute for Integrated Circuits IIS, Erlangen, Germany; International Audio Laboratories Erlangen, Erlangen, Germany, a joint institution of Universität Erlangen-Nürnberg and Fraunhofer IIS
Low intelligibility of narration or dialogue resulting from high background level is one of the most common complaints in broadcasting. Even when the intelligibility is not compromised, listeners may have personal preferences that differ from the mix being broadcast. Dialogue Enhancement (DE) enables the delivery of optimal dialogue mixing to each listener, be it in terms of intelligibility or for aesthetic preference. This makes DE one of the most promising applications of user interactivity enabled by object-based audio broadcasting, such as MPEG-H. This paper investigates the use of source separation methods to extract dialogue and background from the complex sound mixture for enabling object-based broadcasting when dialogue is not available from the production process, as, for example, with legacy content. The presented source separation technology integrates several separation approaches with known limitations into a more powerful overall architecture. In addition, the paper evaluates the subjective benefit of DE using the Adjustment/Satisfaction Test, in which the listeners made extensive use of the dialogue level personalization. The fact that the preferred dialogue level had a high variance among the listeners indicates the need for this functionality. Even when an imperfect separation result was used for enabling DE, the possibility of personalizing the dialogue level led to increased listener satisfaction.
Authors:Wilson, Alex; Fazenda, Bruno M.
Affiliation:Acoustics Research Centre, University of Salford, Salford, UK
One of the advantages of object-based audio/broadcast over traditional channel-based delivery is that it allows for the rendering of personalized content when delivered to the listeners. The methods by which personalization is achieved often require an in-depth understanding of the problem domain. This paper describes the design and evaluation of an interactive audio renderer, which is used to optimize an audio mix based on the feedback of the listener. A panel of 14 trained participants was recruited to try the system. When using the proposed system in a simple music mixing task, participants were able to create a range of mixes of audio objects comparable to those made using the conventional fader-based system. This suggests that the system is not an obstacle to the creation of desired content, and does not impose noticeable limits on what content can be created. Evaluation using the System Usability Scale showed a low level of physical and mental burden, so it is predicted that the system would be suitable for a variety of applications where physical interaction is to be kept low, such as an interface for users with vision and/or mobility impairments.
Authors:Heilemann, Michael C.; Anderson, David A.; Bocko, Mark F.
Affiliation:University of Rochester, Rochester, NY, USA; University of Pittsburgh, Pittsburgh, PA, USA
Devices such as smartphones and televisions are beginning to employ screens as both a video display and a loudspeaker. This multimodal device is well suited for object-based encoding of audio, where audio objects may be rendered at the location corresponding to the visual images. The audio object renderer must be configured to account for variations in panel behavior at different excitation frequencies. This research proposes a multiband crossover network for the audio object renderer that separates the signal for each audio object into low, midrange, and high-frequency bands. Each band is then reproduced on the panel using a different vibration rendering technique. The different rendering techniques are realized by employing a combination of actuator array processing and the natural vibration localization characteristics of point-driven panels. The cutoff frequencies for each band are determined by the physical properties of the panel. Experiments on a prototype panel employing the multiband crossover system demonstrate that the vibration response behaves as predicted in each frequency range. This system provides a platform for rendering spatial audio on devices when listeners are close to the screen, and where there are restrictions related to weight, power consumption, and form-factor.
Authors:Kon, Homare; Koike, Hideki
Affiliation:School of Computing, Tokyo Institute of Technology, Tokyo, Japan
In augmented-reality (AR) applications, reproducing acoustic reverberation is essential for the immersive audio experience. The audio components of an AR system should simulate the acoustics of the environment that is experienced by the users. Earlier, in virtual-reality (VR) applications, sound engineers could program all of the reverberation parameters for a particular scene in advance or when the user is at a fixed position. However, adjusting the reverberation parameters using conventional procedures is difficult because the unlimited range of such parameters cannot be programmed for AR applications. Therefore, it is necessary to dynamically estimate the reverberation characteristics based on the environments in which the users move. Considering that skilled acoustic engineers can estimate the reverberation parameters using the images of a room without performing any measurements, we trained convolutional neural networks to estimate the reverberation parameters using two-dimensional images. The proposed method does not require simulations of sound propagation using 3D reconstruction techniques.
Authors:Menzies, Dylan; Fazi, Filippo Maria
Affiliation:Institute of Sound and Vibration Research, University of Southampton, Southampton, UK
Conventional approaches for surround sound panning require loudspeakers to be distributed over the regions where images are required. However, in many listening situations it is not practical or desirable to place loudspeakers at some positions, such as behind or above the listener. Compensated Amplitude Panning (CAP) is an object-based reproduction method that adapts dynamically to the listener’s head orientation to provide stable images in any direction in the frequency range up to approximately 1000 Hz. This is achieved by accurately controlling the Interaural Time Difference cue. CAP can also provide images in the near-field range, by controlling the Interaural Level Difference. Using two loudspeakers and with full 6-degrees-of-freedom head tracking, it was previously shown possible to create low-band images in any direction, although excessive gain is required for some listener orientations. But with 3 loudspeakers all image directions can be reproduced with moderate gain. Adding more loudspeakers to a stereo configuration does not worsen performance. For comparison, an Ambisonic approach with position tracking and 3 frontal loudspeakers can reproduce horizontal surround images, and 4 loudspeakers can reproduce full 3D.
Authors:Woodcock, James; Davies, William J.; Cox, Trevor J.
Affiliation:Acoustics Research Centre, University of Salford, Salford, UK
Although audio is often reproduced with a visual counterpart, the audio technology for these systems is often researched and evaluated in isolation from the visual component. Previous research indicates that the auditory and visual modalities are not processed separately by the brain. For example, visual stimuli can influence ratings of audio quality and vice versa. This paper presents an experiment to investigate the influence of visual stimuli on a set of attributes relevant to the perception of spatial audio. Eighteen participants took part in a paired comparison listening test where they were asked to judge pairs of stimuli rendered with fourteen-, five-, and two-channel systems using ten perceptual attributes. The stimuli were presented in both audio-only and audiovisual conditions. The results show a significant and large effect of the loudspeaker configuration for all the tested attributes other than overall spectral balance and depth of field. The effect of visual stimuli was found to be small and significant for the attributes realism, sense of space, and spatial clarity. These results suggest that evaluations of audiovisual technologies that aim to evoke a sense of realism or presence should consider the influence of both the audio and visual modalities.
Authors:Silzle, Andreas; Schmidt, Rebekka; Bleisteiner, Werner; Epain, Nicolas; Ragot, Martin
Affiliation:Fraunhofer IIS, Erlangen, Germany; Fraunhofer SCS, Nürnberg, Germany; Bayerischer Rundfunk, München, Germany; b<>com, Cesson-Sévigné, France
Object-based audio (OBA) provides many enhancements and new features; yet, many of these require the user to be active in choosing and selecting the functionalities in visual representations and graphical interfaces. Basic investigations of the user experience of OBA within the EU research project ORPHEUS helped to identify the necessary criteria and dimensions. The user experience in object-based media comprises three dimensions: audio, information, and usability experience. During the project, a radio app for mobile devices was designed, developed, and tested. It includes many of the end-user features available with OBA. A first Quality of Experience (QoE) test to evaluate the radio app was carried out at JOSEPHS, an open innovation lab located in Nuremberg, Germany. The second QoE test took place at b<>com’s user experience lab in Rennes, France. For both investigations, the main objective was to find out how users can access, interact with, and appreciate the various new features of OBA. For the first test, two typical user and listening scenarios were simulated: mobile listening and at home. The general acceptance of the new features and functions that come along with OBA is very high. The usability is rated high. Further possibilities for improvements were provided by the test users. The very good perceived sound quality with surround sound over loudspeakers or binaural reproduction over headphones impressed the listeners most. The second test focused mainly on the approach of comparing and evaluating the features from acceptability to acceptance, or from expectations to fulfillment. In the second test, the most appreciated feature was to set fore-to-background balance. This feature was number two in the first test. The importance of speech intelligibility for Radio and TV is a known and well discussed issue. Now, with OBA and the Next Generation Audio (NGA) codec MPEG-H, solutions are at hand to address it.
Authors:Ward, Lauren A.; Shirley, Ben G.
Affiliation:Acoustics Research Centre, University of Salford, Manchester, UK
Hearing loss is widespread and significantly impacts an individual’s ability to engage with broadcast media. Access for people with impaired hearing can be improved through new object-based audio personalization methods. Utilizing the literature on hearing loss and intelligibility, this paper develops three dimensions that have the potential to improve intelligibility: spatial separation, speech-to-noise ratio, and redundancy. These can be personalized, individually or concurrently, using object-based audio. A systematic review of all work in object-based audio personalization is then undertaken. These dimensions are utilized to evaluate each project’s approach to personalization, identifying successful approaches, commercial challenges, and the next steps required to ensure continuing improvements to broadcast audio for hard-of-hearing individuals. Although no single solution will address all problems faced by individuals with hearing impairments when accessing broadcast audio, several approaches covered in this review show promise.
Some interesting issues in audio forensics arise from the widespread use of smartphones for recording chunks of audio, and how one can detect edits and copying or moving of the resulting files. The idea that microphones leave unique signatures on the signal is an intriguing one for further investigation, and identification based on frequency-response features seems a promising avenue. In the area of forgery detection there are big challenges for the development of new technology as systems get more and more clever at faking human characteristics or hiding the results of artificial processing. Papers from the 2019 Audio Forensics Conference are summarized.