Within audio-tactile playback systems, the induced vibration is often calibrated subjectively with no objective frame of reference. Using a broadband excitation signal, the sound-induced vibration characteristics of the torso were identified, including the magnitude response, amplitude conversion efficiency and subjective perceptual thresholds. The effect of additional factors such as Body Mass Index (BMI) was also considered. The human torso was shown to act as a Helmholtz cavity, while an increase in BMI was shown to reduce the peak vibration amplitude. The body was further shown to behave as a linear transducer of sound into vibration, leading to the production of a novel conversion table. Perceptual tests identified a frequency-dependent threshold of 94-107 dBZ required to induce a perceivable whole-body vibration.
A VR training application is built in this paper to help improve human sound localization performance with generic head-related transfer functions. Subjects go through four phases in the experiment (tutorial, pre-test, training and post-test), in which they are instructed to trigger a sound stimulus and report its perceived location by rotating their head to face that direction. The data captured automatically during each trial includes the correct and reported positions of the stimulus, the reaction time, and the head rotation sampled every 50 ms. The analysis shows a statistically significant improvement in subjects' performance.
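A localization trial of this kind is commonly scored by the angle between the correct and the reported direction. The following is a minimal sketch of that calculation, assuming both directions are given as azimuth and elevation in degrees; the function name and the angle convention are illustrative assumptions and are not taken from the paper.

```python
import math


def angular_error_deg(az_true, el_true, az_reported, el_reported):
    """Great-circle angle in degrees between two directions.

    Assumption: azimuth and elevation are in degrees, elevation measured
    from the horizontal plane (-90 to +90).
    """
    a1, e1, a2, e2 = (math.radians(v) for v in
                      (az_true, el_true, az_reported, el_reported))
    # Spherical law of cosines for the angle between two unit vectors.
    cos_angle = (math.sin(e1) * math.sin(e2)
                 + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))
```

For example, a stimulus at 0° azimuth reported at 90° azimuth on the horizontal plane gives an error of 90°.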
The practical benefits of conducting evaluations of acoustical scenes in laboratory settings are evident in the literature. Such approaches, however, may implicate an audio-visual incongruity, as assessors are physically in a laboratory room, whilst auditioning another, e.g., an auditorium.
In this report it is hypothesised that presenting congruent audio-visual stimuli improves the experience of an auralised sound field. Measured sound fields were reproduced over a 3D loudspeaker array. Expert assessors evaluated them in two visual conditions: a congruent room, and a dark environment. The results indicate a tendency towards improved plausibility and decreased task difficulty for congruent conditions. The effect of visual condition was not statistically significant, indicating the need for further experiments with a larger sample size, interface improvements, and realistic graphics.
Spatial audio rendering techniques often make assumptions about head-related transfer functions (HRTFs) to simplify the reproduction process; however, state-of-the-art techniques require critical consideration of the transfer function accuracy. This paper uses boundary-element method HRTFs to investigate the perceptual spectral difference (PSD) that arises with distance and angle discrepancies. PSD is calculated between HRTFs at various radial distances, and between HRTFs in head- and ear-centred systems as a function of radial distance. The distance at which the average and maximum PSD fall below the threshold of perception is determined. The average HRTF variation reaches perceptual limits by 2-3 m; however, perceivable PSD values occur up to and beyond 10 m, indicating that care must be taken when approximating either the distance or angular location of HRTFs.
Binaural rendering allows us to reproduce auditory scenes through headphones while preserving spatial cues. The best results are achieved if the headphone effect is compensated with an individualized filter, which depends on the headphone transfer function, ear morphology and fitting. However, due to the high complexity of remeasuring a new filter every time the user repositions the headphone, generic compensation may be of interest. In this study, the effects of generic headphone equalization in binaural rendering are evaluated objectively and subjectively, with respect to unequalized and individually-equalized cases. Results show that generic headphone equalization yields perceptual benefits similar to individual equalization for non-individual binaural renderings, and it increases overall quality, reduces coloration, and improves distance perception compared to unequalized renderings.
Traditional sound localization studies are often performed in anechoic chambers and in complete darkness. In our daily life, however, we are exposed to rich auditory scenes with multiple sound sources and complementary visual information. Although it is understood that the presence of maskers hinders auditory spatial awareness, it is not known whether competing sound sources can provide spatial information that helps in localizing a target stimulus. In this study, we explore the effect of presenting controlled auditory scenes with different amounts of visual and spatial cues during a sound localization task. A novel, gamified localization task is also presented. Preliminary results suggest that subjects who are exposed to audio-visual anchors show faster improvements than those who are not.
Verbal descriptors used by participants when describing audio-visual distance match and mismatch conditions in cinematic VR are analyzed to expose underlying similarities and structures. The participants are analyzed from two perspectives: accuracy in auditory distance discrimination and audio expertise. Similarities are found in the verbal descriptors used between accurate groups and inaccurate groups, as opposed to groups split by expertise. We propose that the use of descriptors can be explained by internalized certainty. Audio experts and non-experts were equally likely to be accurate or consistent; thus audio expertise is not synonymous with spatial audio expertise, which demands unique consideration.
The ability of human listeners to segregate two sound sources was examined in an experiment in which the sources were presented concurrently from different directions in the median plane. High-pass filtered pink noise was used as the sound stimulus in a free-field condition and presented as either a pair of incoherent sound sources or a single source. Subjects were tested with both monaural and binaural hearing, and reported whether they perceived sound from one or two directions. "Two directions" responses for pairwise stimuli exceeded 50% above 33.75° separation and reached above 70% at 67.5° in both hearing sessions. The ability to segregate sources in the median plane did not differ prominently between binaural and monaural hearing.
The increasing popularity of Ambisonics as a spatial audio format for streaming services poses new challenges to the existing audio coding techniques. This paper seeks to evaluate timbral distortion and localization accuracy of Ambisonic audio encoded using Opus low-bitrate compression. The study was conducted for first, third and fifth order Ambisonic signals at various bitrates reproduced over a 50-channel spherical loudspeaker configuration. This study has identified how lower bitrates reduce timbral fidelity, though this changes depending on the audio content. Localization accuracy has been found to be relatively robust even at very low bitrates. The results suggest that the user experience of spatial audio streaming services would improve significantly if third order Ambisonics were used instead of first order.
This study investigated the influence of binaural Ambisonic rendering on the perceived spatial and timbral fidelities of recordings made using main microphone arrays. Eight recordings made for different types of sound sources using various microphone techniques were encoded in Ambisonics with the orders of 1 to 5. In MUSHRA listening tests, 1st to 5th order binaural Ambisonic decoders based on the “magnitude least square (magLS)” method as well as the conventional 1st order “Cube” basic decoder were compared against the direct binauralisation of the original recording. Results generally indicated that perceived spatial and timbral quality degradations for the magLS stimuli were minimal for complex musical ensemble recordings, regardless of the order. However, significant source and microphone technique dependencies were observed.
Multichannel and immersive audio technologies are usually tied to specific loudspeaker layouts. Any loudspeaker misplacement relative to these predefined positions might degrade the overall sound quality. To avoid this situation, some Next Generation Audio codecs offer the possibility to compensate for angular misplacement of loudspeakers. The experiment presented in this paper investigates the perceptual impact of angular loudspeaker misplacement in a 5.0 multichannel setup, with and without the angular compensation implemented in the MPEG-H 3D Audio decoder. Three audio formats are tested: Channel, Scene, and Object-Based contents. The experiment is conducted on two test sites and reveals that the benefit of the angular compensation essentially depends on the sound stimulus. No significant effect of the audio formats tested was found.
Sound propagation is the result of several wave phenomena that need to be modeled in real time to achieve realistic and immersive audio in games and virtual reality (VR). A physical and perceptual comparison is conducted between two different approaches: an image source model (ISM) of a shoebox room, and a ray-tracing simulation using custom geometries. The physical analysis compares the results with those of an industry-standard commercial room acoustics package. The perceptual evaluation is implemented in an ecologically valid immersive VR framework. Results suggest that the ISM is subjectively preferred in small to medium rooms, while ray-tracing is more appropriate for large reverberant spaces. Thus, a combination of both methods could suit a larger variety of spaces.
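To make the image source idea concrete, the sketch below mirrors a source across the six walls of a shoebox room and computes the arrival delays of the direct path and the first-order reflections. It is a simplified illustration only: the function names, the coordinate convention (walls at 0 and L on each axis), the restriction to first order and the speed of sound of 343 m/s are assumptions, and wall absorption is ignored.

```python
import math


def first_order_image_sources(source, room):
    """Return the six first-order image positions of `source`.

    `room` is (Lx, Ly, Lz); walls are assumed to lie at 0 and L on each axis.
    """
    images = []
    for axis in range(3):
        for wall in (0.0, room[axis]):
            image = list(source)
            # Mirror the source coordinate across the wall plane.
            image[axis] = 2.0 * wall - source[axis]
            images.append(tuple(image))
    return images


def arrival_delays(source, receiver, room, c=343.0):
    """Delays in seconds: direct path first, then the six reflections."""
    positions = [tuple(source)] + first_order_image_sources(source, room)
    return [math.dist(p, receiver) / c for p in positions]
```

Higher reflection orders follow by mirroring the image sources again, which is why the number of images grows quickly with order.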
In the current effort to improve sound for virtual auditory environments, realism and audio quality in head-tracked binaural rendering are again becoming important. While rendering based on static dummy-head measurements achieves high audio quality and externalization, its realism suffers from a lack of interactivity with changes of head orientation. Motion-tracked binaural (MTB) has been presented as a head-tracked rendering algorithm for recordings made with circular arrays on rigid spheres. In this contribution, we investigate the algorithm proposed for MTB rendering and adapt it for variable-orientation rendering using binaural room impulse responses (BRIRs) measured for multiple, discrete orientations of an artificial head. In particular, the experiment investigates the perceptual implications of the angular resolution of the multi-orientation BRIR sets and the time/frequency resolution of the algorithm.
The present study aims to investigate the influence of head movement on perceived externalization of a virtual sound source with various lengths of binaural room impulse responses (BRIRs). For this purpose, non-individual BRIRs were measured in a listening room and truncated to different lengths. Such modified BRIRs were convolved with speech and music signals, and the resulting binaural signals were presented over headphones. During each presentation, subjects were either asked to perform head movements or to remain stationary. The experimental results revealed that head movements can substantially improve externalization of virtual sound sources rendered by short BRIRs, especially for frontal sound sources. In contrast, head movements have no substantial influence on externalization for virtual sound sources generated by long BRIRs.
The equalization of room acoustic influences on the sound transmission from a source (loudspeaker) to a receiver (listener/microphone) is a well-known problem in audio engineering and academia that has led to various equalization methods. However, many highly effective approaches lack robustness, practicability or usability. This paper thus presents an algorithm for room response equalization that uses parametric peak filters, which are popular in audio engineering for their intuitive handling and performance. Motivated by the need for controllable acoustic conditions for listening tests within immersive multi-loudspeaker audio reproduction, the presented algorithm was designed with respect to Recommendation ITU-R BS.1116-3. Results show that spectral coloration can be reduced significantly, reaching conformity with the recommendation of up to 96%.
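As an illustration of a parametric peak filter, the sketch below computes the coefficients of a single second-order peaking filter using the widely used Audio EQ Cookbook formulas, and evaluates its magnitude response. This is not the equalization algorithm of the paper; the function names and the example centre frequency, gain and Q are assumptions chosen for illustration.

```python
import cmath
import math


def peak_filter_coeffs(f0, gain_db, q, fs):
    """Normalized biquad coefficients (b, a) of a peaking EQ filter."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    b = [1.0 + alpha * a_lin, -2.0 * math.cos(w0), 1.0 - alpha * a_lin]
    a = [1.0 + alpha / a_lin, -2.0 * math.cos(w0), 1.0 - alpha / a_lin]
    # Normalize so that a[0] == 1.
    return [x / a[0] for x in b], [x / a[0] for x in a]


def magnitude_db(b, a, f, fs):
    """Magnitude response in dB of the biquad at frequency f."""
    z1 = cmath.exp(-2j * math.pi * f / fs)  # z^-1 on the unit circle
    h = (b[0] + b[1] * z1 + b[2] * z1 * z1) / (a[0] + a[1] * z1 + a[2] * z1 * z1)
    return 20.0 * math.log10(abs(h))


# Hypothetical example: attenuate a 120 Hz room mode by 6 dB with Q = 4.
b, a = peak_filter_coeffs(f0=120.0, gain_db=-6.0, q=4.0, fs=48000.0)
```

The three parameters (centre frequency, gain, Q) map directly onto what an engineer would adjust by hand, which is the intuitive handling the abstract refers to.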
Amplitude panning, for example Vector Base Amplitude Panning (VBAP), is one of the most widely used methods to reproduce spatial and object-based audio scenes over loudspeakers. It creates, however, a position-dependent perceived source extent, or spread, which is undesirable in many situations. This paper proposes an efficient algorithm to generate image sources with exactly constant spread for arbitrary loudspeaker layouts. This algorithm is based on a previously proposed approach using convex optimization, but avoids run-time optimization to achieve a complexity very close to that of VBAP. In this paper, the algorithm is derived for two-dimensional loudspeaker layouts, but it can be adapted to 3D configurations as well.
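For reference, the sketch below shows plain two-dimensional VBAP for a single loudspeaker pair, which is the baseline the abstract compares against, not the constant-spread algorithm it proposes. The function name and the angle convention (degrees, counter-clockwise from the front) are assumptions for illustration.

```python
import math


def vbap_2d_pair(source_deg, spk1_deg, spk2_deg):
    """Energy-normalized VBAP gains for one loudspeaker pair in 2D."""
    px, py = math.cos(math.radians(source_deg)), math.sin(math.radians(source_deg))
    l1x, l1y = math.cos(math.radians(spk1_deg)), math.sin(math.radians(spk1_deg))
    l2x, l2y = math.cos(math.radians(spk2_deg)), math.sin(math.radians(spk2_deg))
    # Solve p = g1 * l1 + g2 * l2 with Cramer's rule.
    det = l1x * l2y - l2x * l1y
    if abs(det) < 1e-12:
        raise ValueError("loudspeaker directions must not be collinear")
    g1 = (px * l2y - l2x * py) / det
    g2 = (l1x * py - px * l1y) / det
    # Normalize so that g1^2 + g2^2 == 1 (constant power).
    norm = math.hypot(g1, g2)
    return g1 / norm, g2 / norm
```

A source exactly at a loudspeaker direction is reproduced by that loudspeaker alone, while a source between the two is shared by both, which is the origin of the position-dependent spread noted in the abstract.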
The Department of Music at the University of York was one of the first, possibly the first, adopters of Ambisonics for use by composers of electronic music. This paper looks at the history of Ambisonics, concentrating on the York perspective.
We present a statistical analysis of ear-related anthropometric data measured from 162 subjects and its subsets divided by gender and race. To analyze the data efficiently, we have developed a measurement technique that is semi-automatic and can therefore scale to larger data sets. The results show that the ear dimensions of Asian subjects tend to be larger than those of non-Asian subjects. Statistical tests confirmed significant differences in ear dimensions between gender and racial categories. These findings suggest the importance of taking into account the subject’s demographic information, such as gender or race, for generalized or individualized HRTF data based on ear shape modeling for immersive audio applications.
The paper describes a method for obtaining spherical sets of head-related transfer functions (HRTFs) based on a small number of measurements in reverberant environments. For spatial upsampling, we apply HRTF interpolation in the spherical harmonics (SH) domain. However, the number of measured directions limits the maximal accessible SH order, resulting in order-limitation errors and a restricted spatial resolution. Thus, we propose a method which reduces these errors by a directional equalization based on a spherical head model prior to the SH transform. To enhance the valid range of a subsequent low-frequency extension towards higher frequencies, we perform the extension on the equalized dataset. Finally, we apply windowing to the impulse responses to eliminate room reflections from the measured HRTF set. The analysis shows that the method for spatial upsampling influences the resulting HRTF sets more than degradations due to room reflections or due to distortions of the loudspeakers.
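The link between the number of measured directions and the accessible SH order can be illustrated with a short calculation. An expansion up to order N has (N + 1)^2 coefficients, so at least that many directions are needed to solve for them. The sketch below returns the resulting upper bound; the function name is an assumption, and the bound is only a necessary condition, since a poorly distributed sampling grid can lower the usable order further.

```python
import math


def max_sh_order(num_directions):
    """Largest order N such that (N + 1)**2 <= num_directions."""
    if num_directions < 1:
        raise ValueError("at least one measured direction is required")
    return math.isqrt(num_directions) - 1
```

For example, 16 measured directions allow at most order 3, which is why sparse measurement sets lead to the order-limitation errors described above.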
Direct to Reverberant Ratio (DRR) is measured for three non-idealised rooms of different sizes, using a variety of methods. The DRR calculated from binaural room impulse responses is compared to the DRR calculated from an omnidirectional room impulse response. Consistent differences were found in the absolute DRR values calculated from each type of impulse response, and also when the source is positioned close to a room boundary. As expected, DRR decreased with distance, but certain room features produced some inconsistencies. The binaural DRR is also shown to vary with source angle, particularly for nearfield sources. DRR values are calculated with a variety of direct sound integration window sizes. The results suggest that in smaller rooms, a smaller window size produces more consistent changes in DRR.
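A common way to compute DRR from a single impulse response is to integrate the energy in a short window around the direct-sound peak and divide it by the energy of everything that follows. The sketch below follows that approach; the symmetric window around the largest absolute sample, the default window length and the function name are assumptions for illustration and may differ from the methods compared in the paper.

```python
import math


def drr_db(rir, fs, window_ms=2.5):
    """DRR in dB of an impulse response given as a sequence of samples.

    Assumption: the direct sound is the largest absolute sample, and the
    direct window extends window_ms on either side of it.
    """
    peak = max(range(len(rir)), key=lambda n: abs(rir[n]))
    half = int(round(window_ms * 1e-3 * fs))
    start = max(0, peak - half)
    end = min(len(rir), peak + half + 1)
    direct = sum(x * x for x in rir[start:end])
    reverberant = sum(x * x for x in rir[end:])
    if reverberant == 0.0:
        raise ValueError("no energy after the direct-sound window")
    return 10.0 * math.log10(direct / reverberant)
```

Widening the window moves early reflections from the reverberant term into the direct term, so the DRR value depends on the window size, as the abstract notes.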
The acquisition of higher-order ambisonic signals presents a technical challenge which has been met by several authors proposing a number of different array geometries and sensor types. The current paper presents a class of arrays whose performance has not been assessed before; arrays comprising only pressure-sensitive sensors on both sides of an acoustically hard plate. The combined acoustical and signal processing system is analyzed and numerical experiments on an optimized third order array provide a performance comparison with a more conventional hard-shell spherical microphone array. The model is verified by measurements.
Frequency-invariant beamformers are useful for spatial audio capture since their attenuation of sources outside the look direction is consistent across frequency. In particular, the least-squares beamformer (LSB) approximates arbitrary frequency-invariant beampatterns with generic microphone configurations. This paper investigates the effects of array geometry, directivity order and regularization for robust hypercardioid synthesis up to 15th order with the LSB, using three 2D 32-microphone array designs (rectangular grid, open circular, and circular with cylindrical baffle). While the directivity increases with order, the frequency range is inversely proportional to the order and is widest for the cylindrical array. Regularization results in broadening of the mainlobe and reduced on-axis response at low frequencies. The PEASS toolkit was used to perceptually evaluate the beamformed speech signals.
Ambisonics is a spatial audio rendering method appropriate for dynamic binaural synthesis due to its sound field rotation and transformation capabilities. An issue of low-order Ambisonics is that interaural level differences (ILDs), a crucial cue for lateral localisation, are often reproduced lower than they should be, which reduces lateral localisation accuracy. This paper introduces a method for Ambisonic ILD Optimisation (AIO), aiming to bring the ILDs produced by binaural Ambisonic rendering closer to those of head-related impulse responses (HRIRs). AIO is evaluated versus a reference dataset of HRIRs for all locations on the sphere using estimated ILD, perceptual spectral difference and horizontal plane localisation. Results show an overall improvement in all tested metrics.
Ambisonics has been widely adopted for conveying immersive audio experiences via headphone-based sound scene reproduction. It is well known, however, that order limitation causes blurred source images, reduced spaciousness and direction-dependent timbral artifacts when rendered for binaural playback. Signal-dependent methods aim to remedy these shortcomings based on a set of estimated sound field parameters or by injecting decorrelated signals; however, parameter estimation errors and decorrelators often introduce audible artifacts. In this work we propose a signal-dependent method for binaural rendering of Ambisonic signals based on a constrained least-squares decoder. The method enables faithful reproduction of a direct and a diffuse signal per time-frequency tile, requiring only a single parameter and without the need to inject decorrelated signals.
This demo paper introduces a novel VST binaural audio plugin based on the 3D Tune-In (3DTI) Toolkit, a multiplatform open-source C++ library which includes several functionalities for headphone-based sound spatialisation, together with generalised hearing aid and hearing loss simulators. The VST plugin integrates all the binaural spatialisation functionalities of the 3DTI Toolkit for one single audio source. The spatialisation is based on direct convolution with any user-imported Head Related Transfer Function (HRTF) set. Interaural Time Differences (ITDs) are customised in real-time according to the listener’s head circumference. Binaural reverberation is performed using a virtual-loudspeakers Ambisonic approach and convolution with user-imported Binaural Room Impulse Responses (BRIRs). Additional processes for simulating near- and far-field sound sources are also included.
This paper presents theoretical, analytical and experimental studies on synthesizing virtual auditory space for multiple listeners using binaural synthesis over two-channel loudspeakers. The optimal source distribution (OPSODIS) alleviates many of the problems arising with binaural reproduction over loudspeakers. The principle utilizes the idea of a pair of conceptual monopole transducers whose azimuthal location varies continuously as a function of frequency. In this paper, theoretical considerations under free-field conditions and numerical calculations including the effect of head-related transfer functions for off-axis listeners are presented. It is revealed that the OPSODIS can also provide the same binaural signals as for the on-axis target listener to the ears of multiple off-axis listeners.
Cross-talk cancellation makes possible the reproduction of binaural audio through loudspeakers. This is typically achieved by employing a digital signal processing network that controls the acoustic pressure at the listener's ears. Although this can be achieved by using only two loudspeakers, there has been a recent tendency to use loudspeaker arrays, which increase the robustness to source errors and reduce the influence of the room response. This document introduces a numerical study on the trade-off between cross-talk cancellation performance and the number of channels of a loudspeaker array. Special attention is given to the conditioning of the array and to how this is affected by inaccuracies in the driver response for different numbers of loudspeakers: 2, 3, 4, 5, and 7.
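The notion of conditioning can be illustrated with the simplest case of two loudspeakers and two ears in a symmetric arrangement. The sketch below assumes free-field monopole sources, so the plant matrix is [[a, b], [b, a]] with a the same-side and b the opposite-side transfer function; its singular values are then |a + b| and |a - b|, and their ratio is the condition number. The function names, the path lengths in the example and the speed of sound are assumptions, and the model ignores the head and the room, so it does not represent the array study of the paper.

```python
import cmath
import math


def symmetric_plant(freq_hz, r_ipsi, r_contra, c=343.0):
    """Free-field monopole transfer functions (a, b) for a symmetric setup."""
    k = 2.0 * math.pi * freq_hz / c  # wavenumber
    a = cmath.exp(-1j * k * r_ipsi) / r_ipsi
    b = cmath.exp(-1j * k * r_contra) / r_contra
    return a, b


def condition_number(a, b):
    """Condition number of [[a, b], [b, a]].

    The matrix is normal, so its singular values are the magnitudes of its
    eigenvalues: a + b (in-phase mode) and a - b (out-of-phase mode).
    """
    s_in, s_out = abs(a + b), abs(a - b)
    return max(s_in, s_out) / min(s_in, s_out)


# Hypothetical geometry: 2.0 m to the near ear, 2.1 m to the far ear.
low = condition_number(*symmetric_plant(100.0, 2.0, 2.1))
high = condition_number(*symmetric_plant(1000.0, 2.0, 2.1))
```

In this example the condition number is much larger at 100 Hz than at 1000 Hz, meaning that small errors in the loudspeaker responses are amplified more strongly by the inverse filters at low frequencies.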
The singular value decomposition is used to analyse the generalised modes of the radiation matrix for a symmetric three-channel crosstalk cancellation loudspeaker system assuming plane wave sources in free-field. The addition of a third centre source acts as a physical lossless regularisation that dramatically improves the efficiency of the in-phase mode. In addition, the singular value decomposition is used to analyse the source strength solution for a plane wave virtual source. The validity of the derived source strengths for real systems is shown by comparison of the analytical function to source strengths calculated using KEMAR head-related transfer functions. The analytical formula is shown to be accurate up to approximately 700 Hz in far-field and anechoic conditions.
This paper describes the design and development of rhythm-based music game technology to support upper arm rehabilitation following a stroke. The potential benefit of game technology for rehabilitation is well established. However, there is a significant gap in research incorporating Rhythmic Auditory Cueing (RAC), the practice of synchronising rehabilitation movements with a periodic rhythm. This paper describes the design and development of a prototype game system which incorporates RAC, using music as the rhythmic stimulus. The system can cater for a range of impairments and provide metrics to monitor user performance and progress. The operation of the game system algorithms is discussed in detail, focusing on issues surrounding player interaction, rhythm synchronisation and the performance metrics gathered during game play.
There is increasing interest in the maritime industry in the potential for use of uncrewed vessels to improve efficiency and safety. This requires remote monitoring of vessels for maintenance purposes. This project examined the potential to enhance remote monitoring by replicating the audio and vibration of a real vessel in a VR simulation. Seven experienced marine engineers were asked to assess several scenarios in which simulated faults were presented in different elements of the simulated engine room. Users were able to diagnose simulated mechanical failures with a high degree of accuracy, particularly utilising audio and vibration stimuli, and reported specifically that the immersive audio and vibration improved realism and increased their ability to diagnose system failures from a remote location.
This paper reports the results of a pilot study evaluating how a group of children with autism spectrum disorders respond to SoundFields, an interactive spatial audio game. Based in a 360-degree collaborative virtual environment with 3rd order ambisonic audio delivered over headphones, the system is designed to promote natural communication and cooperation between users with autism through shared interactions and game mechanics centred around spatialized auditory events. Head rotation and game data were collected to evaluate the participants' experience and behaviour. The results show a positive response to 3D sound stimuli and engagement with in-game tasks. Furthermore, observations noted an increase in interaction during game play, demonstrating SoundFields' potential for addressing communication and social impairments in children with autism.
One of the tough but rewarding challenges of interactive audio synthesis is the continuous representation of reflecting and occluding objects in the simulated world. For decades it’s been normal for game engines to support two sorts of 3D geometry, for graphics and physics, but neither of those is well-suited for audio. This paper explains how the geometric needs of audio differ from those others, and describes techniques hit games have used to fill the gaps. It presents easily-programmed methods to tailor object-based audio, physics data, pre-rendered 3D audio soundfields and reverb characteristics to account for occlusion and gaps in the reverberant environment, including those caused by movements in the simulated world, or the collapse of nearby objects.
Reactive virtual acoustic environments (VAEs) that respond to any user-generated sound with an appropriate acoustic room response enable immersive audio applications with enhanced sonic interaction between the user and the VAE. This paper presents a reactive VAE that has two clear advantages when compared to other systems introduced so far: it generally works with any type of sound source, and the dynamic directivity of the source is adequately considered in the binaural reproduction. The paper describes the implementation of the reactive VAE and completes the technical evaluation of the overall system focusing on the recently added software components. Regarding use of the system in research, the study briefly discusses challenges of conducting psychoacoustic experiments with such a reactive VAE.
Music production has always been influenced by and evolved alongside the newest technological standards and listener demands. This paper discusses the 3D mix aesthetics of Ambisonics beyond 6th order taking a classical Turkish music production as a musical case. An ensemble recording was made in the recording studio of Istanbul Technical University (ITÜ) MIAM. The channels of that session were mixed on the High Density Loudspeaker Array in the Immersive Audio Lab of University 2, exploring generic ways of spatial music production. The results were rated by means of a survey grading immersive audio parameters.
Perhaps the most pervasive immersive format at present is 360° video, which can be panned whilst being viewed. Typically, such footage is captured with a specialist camera. Approaches and workflow for the creation of 3-D audio for this medium are seldom documented, and methods are often heuristic. This paper offers insight into such approaches, and whilst centered on post-production, also discusses some aspects of audio capture. This is done via a number of case studies that draw from the commercial work of immersive-audio company, 1.618 Digital. Although these case studies are unified by certain common approaches, they also include unusual aspects such as binaural recording of insects, sonic capture of moving vehicles and the use of drones.
Of the many sounds we encounter throughout the day, some stay lodged in our minds more easily than others; these may serve as powerful triggers of our memories. In this paper, we measure the memorability of everyday sounds across 20,000 crowd-sourced aural memory games, and then analyze the relationship between memorability and acoustic/cognitive salience features; we also assess the relationship between memorability and higher-level gestalt features such as familiarity, valence, arousal, source type, causal certainty, and verbalizability. We suggest that modeling these cognitive processes opens the door for human-inspired compression of sound environments, automatic curation of large-scale environmental recording datasets, and real-time modification of aural events to alter their likelihood of memorability.
Immersive sandboxes for music creation in Virtual Reality (VR) are becoming widely available. Some sandboxes host Virtual Reality Musical Instruments (VRMIs), but usually only the basic components, such as oscillators, sample-based instruments, or simplistic step-sequencers. In this paper, after describing MuX (a VR sandbox) and its basic components, we present new elements developed for the environment. We focus on the lumped and distributed physically-inspired models for sound synthesis. A simple interface was developed to control the physical models with gestures, expanding the interaction possibilities within the sandbox. A preliminary evaluation shows that, as the number and complexity of the components increase, it becomes important to provide users with ready-made machines instead of requiring them to build everything from scratch.
Interactive auralization workflows in games and virtual reality today employ manual markup coupled to designer specified acoustic effects that lack spatial detail. Acoustic simulation can model such detail, yet is uncommon because realism often does not perfectly align with aesthetic goals. We show how to integrate realistic acoustic simulation while retaining designer control over aesthetics. Our method eliminates manual zone placement, provides spatially smooth transitions, and automates re-design for scene changes. It proceeds by computing perceptual parameters from simulated impulse responses, then applying transformations based on novel modification controls presented to the user. The result is an end-to-end physics-based auralization system with designer control. We present case studies that show the viability of such an approach.
Virtual Reality (VR) systems have been intensely explored, with several research communities investigating the different modalities involved. Regarding the audio modality, one of the main issues is the generation of sound that is perceptually coherent with the visual reproduction. Here, we propose a pipeline for creating plausible interactive reverb using visual information: first, we characterize real environment acoustics given a pair of spherical cameras; then, we reproduce reverberant spatial sound, by using the estimated acoustics, within a VR scene. The evaluation is made by extracting the room impulse responses (RIRs) of four virtually rendered rooms. Results show agreement, in terms of objective metrics, between the synthesized acoustics and the ones calculated from RIRs recorded within the respective real rooms.
Resonance Audio is an open source project designed for creating and controlling dynamic spatial sound in Virtual & Augmented Reality (VR/AR), gaming or video experiences. It also provides integrations with popular game development platforms and digital audio workstations as a preview plugin. The Resonance Audio binaural decoder is used in YouTube VR to provide cinematic spatial audio experiences. This paper describes the core sound spatialization algorithms used in Resonance Audio.
Virtual reality is increasingly used throughout performance research, offering an exciting opportunity to access experiences which people may otherwise be unable to participate in. This project extends an existing VR singing performance tool, placing it in an outdoor context, introducing significant challenges for audiovisual data collection. A choral performance in the Lake District was captured using immersive recording technologies, and turned into a VR experience, allowing members of the public to join the choir through singing. Initial feedback indicates that users responded well to the experience, feeling immersed in the performance and not inhibited by the VR equipment. This work highlights the positive impact of VR as a tool to increase accessibility to remote places and unique experiences.
This study investigated to what extent and how trained singers adapt to the room acoustic conditions in physical and in virtual acoustic environments. Two musical pieces were recorded for four singers in eight different performance venues by means of a near-field microphone. The first experiment was replicated in the anechoic chamber, using an interactive auralization provided by dynamic binaural synthesis of the same performance venues. The 128 recordings were analyzed in terms of audio features related to tempo, loudness, and timbre. Their interrelation with room acoustical parameters was analysed by linear regression. The results show individual patterns of adaptation, although similar interactions between room acoustics and musical performance could be observed in the physical and virtual environments.