Wednesday, October 18, 9:00 am — 11:30 am
P01-1 Generation and Evaluation of Isolated Audio Coding Artifacts—Sascha Dick, Fraunhofer Institute for Integrated Circuits IIS - Erlangen, Germany; Nadja Schinkel-Bielefeld, Fraunhofer Institute for Integrated Circuits IIS - Erlangen, Germany; Silvantos GmbH - Erlangen, Germany; Sascha Disch, Fraunhofer IIS - Erlangen, Germany
Many existing perceptual audio codec standards define only the bit stream syntax and associated decoder algorithms, but leave many degrees of freedom to the encoder design. For systematic optimization of encoder parameters, as well as for education and training of experienced test listeners, it is instrumental to provoke and subsequently assess individual coding artifact types in an isolated fashion with controllable strength. The approach presented in this paper combines a pre-selection of suitable test audio content with forcing a specially modified encoder into uncommon operation modes to deliberately generate controlled coding artifacts. Finally, subjective listening tests were conducted to assess the subjective quality for different parameters and test content.
Convention Paper 9809
P01-2 Enhancement of Voice Intelligibility for Mobile Speech Communication in Noisy Environments—Kihyun Choo, Samsung Electronics Co., Ltd. - Suwon, Korea; Anton Porov, Samsung R&D Institute Russia - Moscow, Russia; Maria Koutsogiannaki, Samsung R&D Institute UK; Holly Francois, Samsung Electronics R&D Institute UK - Staines-Upon Thames, Surrey, UK; Jonghoon Jeong, Samsung Electronics Co. Ltd. - Seoul, Korea; Hosang Sung, Samsung Electronics - Korea; Eunmi Oh, Samsung Electronics Co., Ltd. - Seoul, Korea
One of the biggest challenges still encountered in speech communication via a mobile phone is that it is sometimes very difficult to understand what is said when listening in a noisy place. In this paper a novel approach based on two models is introduced to increase speech intelligibility for a listener surrounded by environmental noise. One model perceptually optimizes the speech with respect to the simultaneous background noise; the other modifies the speech towards a more intelligible, naturally elicited speaking style. The two models are combined to provide more understandable speech even in a loud, noisy environment, even when the speech volume cannot be increased. The improvements in perceptual quality and intelligibility are shown by Perceptual Objective Listening Quality Assessment and Listening Effort Mean Opinion Score evaluation.
Convention Paper 9810
P01-3 Application of Spectral-Domain Matching and Pseudo Non-Linear Convolution to Down-Sample-Rate-Conversion (DSRC)—Malcolm O. J. Hawksford, University of Essex - Colchester, Essex, UK
A method of down-sample-rate conversion is discussed that exploits processes of spectral-domain matching and pseudo non-linear convolution applied to discrete data frames as an alternative to conventional convolutional filter and sub-sampling techniques. Spectral-domain matching yields a complex sample sequence that can subsequently be converted into a real sequence using the Discrete Hilbert Transform. The method is shown to result in substantially reduced time dispersion compared to the standard convolutional approach and circumvents filter symmetry selection such as linear phase or minimum phase. The formal analytic process is presented and validated through simulation then adapted to digital-audio sample-rate conversion by using a multi-frame overlap and add process. It has been tested in both LPCM-to-LPCM and DSD-to-LPCM applications where the latter can be simplified using a look-up code table.
Convention Paper 9811
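As an illustrative aside, the core idea behind frame-based spectral-domain down-sample-rate conversion can be sketched as discarding the spectral bins above the new Nyquist frequency of each frame. This is only a minimal sketch: the paper's method additionally employs spectral-domain matching, the Discrete Hilbert Transform, and a multi-frame overlap-and-add process, none of which are reproduced here.

```python
import numpy as np

def spectral_downsample(frame, factor):
    """Downsample one frame by an integer factor via spectral truncation.

    Illustrative sketch only: keeps the FFT bins representable at the
    lower sample rate and discards the rest, then scales so amplitudes
    are preserved after the inverse FFT.
    """
    n = len(frame)
    m = n // factor                      # output frame length
    spectrum = np.fft.fft(frame)
    # Keep the low-frequency bins and their conjugate partners.
    kept = np.concatenate([spectrum[: m // 2], spectrum[-(m // 2):]])
    # Scale by m/n so sample amplitudes survive the shorter inverse FFT.
    return np.real(np.fft.ifft(kept)) * (m / n)
```

For a band-limited input whose energy lies entirely below the new Nyquist bin, this truncation reproduces the ideally decimated samples of the frame exactly.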
P01-4 Detection of Piano Pedaling Techniques on the Sustain Pedal—Beici Liang, Queen Mary University of London - London, UK; György Fazekas, Queen Mary University of London - London, UK; Mark B. Sandler, Queen Mary University of London - London, UK
Automatic detection of piano pedaling techniques is challenging as it involves subtle nuances of piano timbre. In this paper we address this problem on single notes using decision-tree-based support vector machines. Features are extracted from harmonics and residuals based on physical acoustics considerations and signal observations. We consider four distinct pedaling techniques on the sustain pedal (anticipatory full, anticipatory half, legato full, and legato half pedaling) and create a new isolated-note dataset consisting of different pitches and velocities for each pedaling technique plus notes played without pedal. Cross-validation experiments show the effectiveness of the designed features and the learned classifiers for discriminating pedaling techniques.
Convention Paper 9812
Wednesday, October 18, 9:00 am — 10:30 am
P02-1 Audio Education: Audio Recording Production Students Report Skills Learned or Focused on in Their Programs—Doug Bielmeier, Purdue School Of Engineering and Technology, IUPUI - Indianapolis, IN, USA
Previous research polled employers, new hires, and educators in the audio industry to identify what skills were most important, what skills new hires had, and what skills educators focused on in Audio Recording Production (ARP) programs. The Skills Students Learned (SSL) survey used in this study polled 40 students from the U.S. and abroad to identify skills learned in ARP programs. Students reported their skill level before and after attending a formal ARP program via an online mixed-methods survey instrument. In the quantitative section, students reported an improvement in all skill levels upon completing their ARP training. In the qualitative section, students reported communication skills and in-depth technical skills missing from their programs and personal skill sets. This study recommends infusing these skills into existing ARP curricula.
Convention Paper 9814
P02-2 Audio Archive Preservation Challenges and Pedagogical Opportunities: School of Music RePlayed—Samantha Bennett, Australian National University - Canberra, ACT, Australia
This paper considers the various challenges, implications, and pedagogical opportunities presented by a small-scale audio archiving project: School of Music RePlayed. Housed in the Australian National University’s School of Music, this historical archive of more than 1,200 recital and concert tape recordings features multiple recordings of historical significance, yet presents a number of issues pertaining to storage and tape deterioration. This paper first considers the challenges presented in the digitization of such an archive before focusing on the pedagogical opportunities afforded by such a unique project. Developed and run in conjunction with the National Film and Sound Archive of Australia, the project addresses both technological and pedagogical matters of preservation, heritage, and digitization.
Convention Paper 9815
P02-3 The Education of the Next Generation of Pro-Audio Professionals—Curig Huws, University of South Wales - Cardiff, UK
Since the late 1990s and early 2000s, the changing nature of the music industry has led to the demise of recording studios, which have decreased dramatically in number. This decline has led to a corresponding disappearance of the “teaboy” route, the traditional route whereby engineers, producers, and mixers (EPM) learned their craft. In the training vacuum that the demise of recording studios creates, how do EPM professionals now learn the skills and knowledge necessary to succeed in the music industry? Through primary research and in-depth interviews with leading EPM professionals and online education providers, this paper assesses the skills needed to become a successful EPM and explores whether the internet can ever replace the traditional teaboy route in educating the next generation of professionals. It concludes that there are currently significant limitations to internet learning of EPM skills, some of which might be overcome by new technological developments such as virtual reality.
Convention Paper 9816
Wednesday, October 18, 10:45 am — 12:15 pm
P03-1 Study on Objective Evaluation Technique for Small Differences of Sound Quality—Yuki Fukuda, Hiroshima City University - Hiroshima-shi, Japan; Kenta Ueyama, Hiroshima City University - Hiroshima-shi, Japan; Shunsuke Ishimitsu, Hiroshima City University - Hiroshima, Japan; Ryoji Higashi, Memory-Tech Corporation - Tokyo, Japan; Seiji Yumoto, Memory-Tech Corporation - Tokyo, Japan; Takashi Numanou, Memory-Tech Corporation - Tokyo, Japan
In recent years, differences in auditory impression arising from differences in materials and media have been discussed. To investigate the causes of these differences, we analyzed the differences in sound pressure level and interaural time difference between three different Compact Discs by using wavelet analysis. These analyses detected objective differences in sound between discs carrying the same data but made of different materials, and showed that the new Compact Disc called the “Ultimate Hi Quality Compact Disc,” made of photopolymer with a special alloy employed as the reflection film, reproduces more of the master sound than the conventional Compact Disc. We present the method for analyzing and evaluating these differences and consider its application to various sound quality evaluations.
Convention Paper 9817
P03-2 Alternative Weighting Filters for Multi-Track Program Loudness Measurement—Steven Fenton, University of Huddersfield - Huddersfield, West Yorkshire, UK; Hyunkook Lee, University of Huddersfield - Huddersfield, UK
The ITU-Recommendation BS.1770 is now established throughout most of the broadcast industry. Program loudness measurement is undertaken through the summation of K-weighted energy and this summation typically involves material that is broadband in nature. We undertook listening tests to investigate the performance of the K-weighting filter in relation to perceived loudness of narrower band stimuli, namely octave-band pink noise and individual stems of a multitrack session. We propose two alternative filters based on the discrepancies found and evaluate their performance using different measurement window sizes. The new filters yield better performance accuracy for both pink noise stimuli and certain types of multitrack stem. Finally, we propose an informed set of parameters that may improve loudness prediction in auto mixing systems.
Convention Paper 9818
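For context, the BS.1770 measurement that this paper proposes alternatives to can be sketched as follows: the signal is passed through the two-stage K-weighting filter (a high-shelf "head" filter followed by the RLB high-pass), and the mean square of the result is converted to loudness units. This is a minimal ungated mono sketch using the published 48 kHz coefficients, not the authors' proposed alternative filters.

```python
import numpy as np

# BS.1770 K-weighting biquad coefficients for 48 kHz sampling
# (from the Recommendation); other rates need re-derived values.
SHELF_B = [1.53512485958697, -2.69169618940638, 1.19839281085285]
SHELF_A = [1.0, -1.69065929318241, 0.73248077421585]
RLB_B = [1.0, -2.0, 1.0]
RLB_A = [1.0, -1.99004745483398, 0.99007225036621]

def biquad(x, b, a):
    """Direct-form I biquad filter (simple reference implementation)."""
    y = np.zeros(len(x))
    x1 = x2 = y1 = y2 = 0.0
    for n, xn in enumerate(x):
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y[n] = yn
    return y

def k_loudness(x):
    """Ungated K-weighted loudness of a mono 48 kHz signal, in LUFS."""
    y = biquad(biquad(x, SHELF_B, SHELF_A), RLB_B, RLB_A)
    return -0.691 + 10.0 * np.log10(np.mean(y ** 2))
```

As a sanity check, a full-scale 997 Hz sine should read approximately -3.01 LUFS, the reference value stated in BS.1770.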
P03-3 An Audio Loudness Compression and Compensation Method for Miniature Loudspeaker Playback—Ziran Jiang, Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences - Beijing, China; Jinqiu Sang, Institute of Acoustics, Chinese Academy of Science - Beijing, China; Jie Wang, Guangzhou University - Guangzhou, China; Chengshi Zheng, Chinese Academy of Sciences - Beijing, China; Chinese Academy of Sciences - Shanghai, China; Fangjie Zhang, Institute of Acoustics, Chinese Academy of Science - Beijing, China; Xiaodong Li, Chinese Academy of Sciences - Beijing, China; Chinese Academy of Sciences - Shanghai, China
Audio playback through miniature loudspeakers is bounded by the loudspeaker’s limited dynamic range. How to compress the audio and simultaneously preserve the original artistic effect is worthy of study. Traditional peak-based and RMS-based dynamic range compression (DRC) methods do not consider the audio loudness characteristic that may influence the perceptual artistic effect. This paper proposes a novel compression and compensation method based on Zwicker’s loudness model and equal-loudness contours. The proposed method aims to provide a high-quality audio playback by mapping the audio’s loudness to a smaller range, while preserving the perceived spectral balance of the original audio. Subjective listening tests are performed to demonstrate the benefits of the proposed method.
Convention Paper 9819
P03-4 Assessing the Authenticity of the KEMAR Mouth Simulator as a Repeatable Speech Source—Thomas McKenzie, University of York - York, UK; Damian Murphy, University of York - York, UK; Gavin Kearney, University of York - York, UK
In audio engineering research, repeatability is paramount. Speech is a great stimulus to use when evaluating audio systems as it is a real world sound highly familiar to the human auditory system. With a view to the comparison of real and virtual sound fields, a repeatable speech source is therefore highly advantageous. This paper presents both an objective and subjective evaluation of the G.R.A.S. Knowles Electronic Manikin for Acoustic Research mouth simulator as a repeatable speech source, assessing its accuracy and perceptual authenticity.
Convention Paper 9820
P03-5 Pilot Experiment on Verbal Attributes Classification of Orchestral Timbres—Ivan Simurra, Sr., University of Sao Paolo - São Paulo, Brazil; Marcelo Queiroz, University of São Paulo - São Paulo, Brazil
This paper presents a listening test of an ongoing research related to timbre perception, using a set of 33 orchestral music excerpts that are subjectively rated using quantitative scales based on 13 pairs of opposing verbal attributes. The aim of the experiment is to identify significant verbal descriptions potentially associated with timbre aspects of musical excerpts that explore technical aspects of contemporary music such as extended techniques and nonstandard music orchestration. Preliminary results suggest that these scales are able to describe timbral qualities in a way that is consistent among different listeners.
Convention Paper 9821
P03-6 Precedence Effect Using Simultaneous High and Low-Passed Stimuli—Austin Arnold, Belmont University - Nashville, TN, USA; Ellicott City, MD; Wesley Bulla, Belmont University - Nashville, TN, USA
This study was an exploration of interaural suppression in the context of two simultaneous auditory precedence scenarios. The experiment investigated the nature of aural precedence by presenting subjects with two sets of stimuli simultaneously. Combinations of lead-lag signals employed a series of low- and high-passed noise bursts presented as either leading on the same side or on opposite sides of the listener. Subjects were asked to localize each noise burst. Findings suggest that when signals originated at opposite loudspeakers, performance for both signals was degraded. However, degradation appeared to be dependent upon the frequency span between the two stimuli. This novel study of the precedence effect more broadly addresses the manner in which the brain resolves bilaterally conflicting information and provides evidence that binaural suppression is not band limited, is possibly object oriented, and may change with the content of the objects of interest.
Convention Paper 9822
Wednesday, October 18, 2:00 pm — 5:00 pm
P04-1 Estimation of Magnitude Response of Reflecting Loudspeaker System in Listening Area Using Near-Box Measurement—Ge Zhu, Nanjing University - Nanjing, China; Ziyun Liu, Nanjing University - Nanjing, China; Yong Shen, Nanjing University - Nanjing, Jiangsu Province, China; Yuchen Shen, Nanjing University - Nanjing, China
This paper presents a simple and robust method to estimate the general magnitude response of reflecting loudspeaker systems in the listening area. The method utilizes statistical acoustics and is based on a near-box impulse response measurement, which after truncation post-processing holds information across the entire listening area. The estimation was investigated in different acoustic environments, showing that the more diffusive the room, the more precise the result. The measured response can serve as a reliable reference for a correction system in a reflecting loudspeaker system.
Convention Paper 9823
P04-2 Optimal Modulator with Loudspeaker Parameter Inclusion—Nicolai Dahl, Technical University of Denmark - Lyngby, Denmark; Niels Elkjær Iversen, Technical University of Denmark - Kogens Lyngby, Denmark; Arnold Knott, Technical University of Denmark - Kgs. Lyngby, Denmark
Today most class-D amplifier designs are able to deliver high efficiency and low distortion. However, the effects of parasitic components and speaker dynamics are not taken into account, resulting in a degradation of performance. This paper proposes a new PWM modulator that is able to capture an arbitrary amount of dynamics through optimization-based design methods. This makes it possible to include the amplifier's parasitic components and the loudspeaker parameters in the design, thus creating a more linear response.
Convention Paper 9824
P04-3 Fast Loudspeaker Measurement in Non-Anechoic Environments—Christian Bellmann, Klippel GmbH - Dresden, Germany; Wolfgang Klippel, Klippel GmbH - Dresden, Germany
The evaluation of the loudspeaker performance requires a measurement of the sound pressure output in the far field of the source under free field condition. If the available test room does not fulfil this condition, it is common practice to generate a simulated free field response by separating the direct sound from the room reflection based on windowing and holographic processing. This paper presents a new technique that performs a filtering of the measured sound pressure signal with a complex compensation function prior to other time and frequency analysis. The influence of room, nearfield and positioning error is compensated in the measured fundamental and nonlinear distortion characteristics. Different methods are presented for the generation of the compensation function based on a reference response measured under anechoic conditions and a test response measured under in-situ conditions. Benefits and particularities are demonstrated by practical measurements using different kinds of test signals.
Convention Paper 9825
P04-4 Analog Circuit Model for Loudspeakers including Eddy Current Behavior and Suitable for Time Domain Simulation—Stephen C. Thompson, Pennsylvania State University - State College, PA, USA; Daniel M. Warren, GN Advanced Science - Glenview, IL, USA
This paper presents two analog circuit models for the blocked electrical impedance for a moving coil loudspeaker. The first includes an exact model of the effects of eddy currents as derived by Vanderkooy. The model is implemented using a partial fraction expansion that allows an implementation using conventional electrical circuit components. An alternative circuit suggested by Leach uses fewer components and can model not only a purely semi-inductive behavior, but also other frequency variations that are sometimes observed. Because these eddy current models do not use frequency dependent components, they can be used in time domain simulations of loudspeaker behavior that are capable of modeling mechanical and magnetic nonlinearities.
Convention Paper 9826
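To illustrate the blocked-impedance behavior discussed above, the eddy-current contribution is commonly written in Leach's form as Ze = Re + K(jω)^n, where n = 1 is an ideal inductor and n = 0.5 a pure semi-inductance. The sketch below is a generic frequency-domain evaluation of that form; the parameter values are illustrative, not fitted to any driver, and the paper's partial-fraction circuit realizations are not reproduced here.

```python
import numpy as np

def blocked_impedance(f, re, k, n):
    """Leach-style blocked electrical impedance Ze = Re + K*(j*omega)^n.

    n = 1 models an ideal voice coil inductance; n = 0.5 models the
    limiting semi-inductive behavior produced by eddy currents in a
    solid pole piece. Re, K, n here are illustrative placeholders.
    """
    jw = 2j * np.pi * np.asarray(f, dtype=float)
    return re + k * jw ** n
```

A pure semi-inductance (Re = 0, n = 0.5) has a magnitude rising at 10 dB per decade (a factor of sqrt(10) per decade) and a constant 45-degree phase, which is the signature of eddy-current losses in impedance measurements.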
P04-5 Use of Repetitive Multi-Tone Sequences to Estimate Nonlinear Response of a Loudspeaker to Music—Pascal Brunet, Samsung Research America - Valencia, CA USA; Audio Group - Digital Media Solutions; William Decanio, Samsung Research America - Valencia, CA, USA; Ritesh Banka, Samsung Research America - Valencia, CA USA; Shenli Yuan, Center for Computer Research in Music and Acoustics (CCRMA), Stanford University - Stanford, CA, USA
Aside from frequency response, loudspeaker distortion measurements are perhaps the most commonly used metrics to appraise loudspeaker performance. Unfortunately, the stimuli utilized for many types of distortion measurements are not complex waveforms such as music or speech, so the measured distortion characteristics of the DUT may not reflect the performance of the device when reproducing typical program material. To this end, this paper explores a new multi-tone sequence stimulus for measuring loudspeaker system distortion. This method gives a reliable estimation of the average nonlinear distortion produced with music on a loudspeaker system and delivers a global objective assessment of the distortion for a DUT in normal use.
Convention Paper 9827
P04-6 Non-Invasive Audio Performance Measurement on Wireless Speakers—Srinath Arunachalam, Harman International - South Jordan, UT, USA; Douglas J. Button, Harman International - Northridge, CA USA; Jay Kirsch, Harman International - South Jordan, UT, USA
Wireless audio systems are gaining market share due to their portability, flexibility, and simply because users do not want to be entangled in wires. As with any technology, the advantages come with many challenges, one of which is creating a meaningful measurement of performance. In this paper we propose a non-invasive testing methodology for manufacturers to measure audio performance in their wireless speaker products. The method begins with baseline acoustic measurements using electrical (line-in) inputs, which are used as a reference for measurements of other wireless input types such as Bluetooth and Wi-Fi. The results show the degradations due to the wireless transport.
Convention Paper 9828
Wednesday, October 18, 2:00 pm — 5:00 pm
P05-1 Direct and Indirect Listening Test Methods—A Discussion Based on Audio-Visual Spatial Coherence Experiments—Cleopatra Pike, University of St Andrews - Scotland, Fife, UK; Hanne Stenzel, University of Surrey - Guildford, Surrey, UK
This paper reviews the pros and cons of using direct measures (e.g., preference, annoyance) and indirect measures (e.g., “subconscious” EEG measures and reaction times, “RTs”) to determine how viewers perceive audio and audio-visual attributes. The methodologies are discussed in relation to spatial coherence testing (whether audio/visual signals arrive from the same direction). Experimental results in coherence testing are described to illustrate problems with direct measures and improvements seen with RTs. Suggestions are made for the use of indirect measures in testing, including more sophisticated uses of RTs. It is concluded that indirect measures offer novel insights into listener evaluations of audio-visual experiences but are not always suitable.
Convention Paper 9829
P05-2 Identification of Perceived Sound Quality Attributes of 360º Audiovisual Recordings in VR Using a Free Verbalization Method—Marta Olko, New York University - New York, NY, USA; Dennis Dembeck, New York University - New York, NY, USA; Yun-Han Wu, New York University - New York, NY, USA; Andrea Genovese, New York University - New York, NY, USA; Agnieszka Roginska, New York University - New York, NY, USA
Recent advances in Virtual Reality (VR) technology have led to fast development of 3D binaural sound rendering methods that work in conjunction with head-tracking technology. As the production of 360° media grows, new subjective experiments that can appropriately evaluate and compare the sound quality of VR production tools are required. In this preliminary study a Free Verbalization Method is employed to uncover auditory features within 360° audio-video experiences when paired with a 3-degrees-of-freedom head-tracking VR device and binaural sound over headphones. Subjects were first asked to identify perceived differences and similarities between different versions of audiovisual stimuli. In a second stage, subjects developed bipolar scales based on the verbal descriptions obtained previously. The verbal constructs created during the experiment were then combined by the authors and experts into parent attributes by means of semantic analysis, similar to previous research on sound quality attributes. Analysis of the results indicated three main groups of sound quality attributes: attributes describing the general impression of the 360° sound environment, attributes describing sound in relation to head movement, and attributes describing audio and video congruency. Overall, the consistency of sound between different positions in the 360° environment appears to constitute a new fundamental aspect of sound evaluation for VR and AR multimedia content.
Convention Paper 9830
P05-3 Tonal Component Coding in MPEG-H 3D Audio Standard—Tomasz Zernicki, Zylia sp. z o.o. - Poznan, Poland; Lukasz Januszkiewicz, Zylia Sp. z o.o. - Poznan, Poland; Andrzej Ruminski, Zylia sp. z.o.o. - Poznan, Poland; Marcin Chryszczanowicz, Zylia sp. z.o.o. - Poznan, Poland
This paper describes a Tonal Component Coding (TCC) technique that is an extension tool for the MPEG-H 3D Audio Standard. The method is used in order to enhance the perceptual quality of audio signals with strong and time-varying high frequency (HF) tonal components. At the MPEG-H 3D Audio Core Coder, the TCC tool exploits sinusoidal modeling in order to detect substantial HF tonal content and transform it into the so-called sinusoidal trajectories. A novel parametric coding scheme is applied and the additional data are multiplexed into the bitstream. At the decoder side, the trajectories are reconstructed and merged with the output of the 3D Audio Core Decoder. The TCC was tested as an extension to MPEG-H Audio Reference Quality Encoder in low bitrate (enhanced Spectral Band Replication) and low complexity (Intelligent Gap Filling) operating mode. The subjective listening tests prove the statistical improvement of perceptual quality of signals encoded with proposed technique.
Convention Paper 9831
P05-4 Lead-Signal Localization Accuracy for Inter-Channel Time Difference in Higher and Lower Vertical, Side, and Diagonal Loudspeaker Configurations—Paul Mayo, Belmont University - Nashville, TN, USA; Wesley Bulla, Belmont University - Nashville, TN, USA
The effects of inter-channel time difference (ICTD) on a sound source’s perceived location are well understood for horizontal loudspeaker configurations. This experiment tested the effect of novel loudspeaker configurations on a listener’s ability to localize the leading signal in ICTD scenarios. The experiment was designed as a comparison to standard horizontal precedence-effect experiments but with non-traditional loudspeaker arrangements, including vertical, elevated, and lowered configurations. Data will be analyzed using sign tests and ANOVA, with listeners’ responses visualized graphically. Outcomes are expected to follow a predicted precedence-based suppression model, with localization concentrated at the leading loudspeaker.
Convention Paper 9832
P05-5 Non-Intrusive Polar Pattern Estimation in Diffuse Noise Conditions for Time Variant Directional Binaural Hearing Aids—Changxue Ma, GN Resound Inc. - Glenview, IL, USA; Andrew B. Dittberner, GN Resound Inc. - Glenview, IL, USA; Rob de Vries, GN Resound Inc. - Eindhoven, The Netherlands
The directivity index is often used to represent the performance of beamforming algorithms on hearing aids. For binaural listening modeling, we also need to measure the directivity patterns. The common method of estimating directivity patterns uses a single rotating sound source. With this method it is difficult to obtain a good directivity pattern when the performance of the system depends adaptively on the acoustic environment. The directivity pattern can also be confounded by other signal processing components, such as direction-of-arrival (DOA) based nonlinear post-filtering. This paper proposes a method to extract the directivity patterns of a beamforming algorithm under diffuse noise conditions with a rotating probe signal. The directivity pattern is obtained from the probe signal by spectral subtraction of the diffuse noise sound field in a non-intrusive manner.
Convention Paper 9833
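The spectral-subtraction step mentioned above can be illustrated generically: an estimate of the diffuse-noise power spectrum is subtracted from the measured power spectrum of the probe, with the result clamped at a small floor to avoid negative power. This is a textbook sketch of power spectral subtraction, not the paper's specific implementation.

```python
import numpy as np

def spectral_subtract_power(probe_psd, noise_psd, floor=1e-12):
    """Power spectral subtraction with a noise floor clamp.

    probe_psd: measured power spectrum of probe plus diffuse noise.
    noise_psd: estimated power spectrum of the diffuse noise alone.
    Returns the estimated probe power spectrum, never below `floor`.
    """
    diff = np.asarray(probe_psd, dtype=float) - np.asarray(noise_psd, dtype=float)
    return np.maximum(diff, floor)
```

In bins where the noise estimate exceeds the measurement, the clamp (rather than a negative value) is returned, which is the usual guard in spectral-subtraction schemes.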
P05-6 Training on the Acoustical Identification of the Listening Position in a Virtual Environment—Florian Klein, Technische Universität Ilmenau - Ilmenau, Germany; Annika Neidhardt, Technische Universität Ilmenau - Ilmenau, Germany; Marius Seipel, TU Ilmenau - Ilmenau, Germany; Thomas Sporer, Fraunhofer Institute for Digital Media Technology IDMT - Ilmenau, Germany
This paper presents an investigation of the effect of training on the perception of position-dependent room acoustics. Listeners are trained to distinguish the acoustics at different listening positions and to detect mismatches between the visual and acoustical representations. In virtual acoustic environments, simplified representations of room acoustics are often used. This works well when fictive or unknown rooms are auralized but may be critical for real rooms. The results show that 10 out of 20 participants could significantly increase their accuracy in choosing the correct combinations after training. The publication investigates the underlying processes of the adaptation effect and the reasons for the individual differences. The relevance of these findings for acoustic virtual/augmented reality applications is discussed.
Convention Paper 9834
Thursday, October 19, 9:00 am — 11:30 am
P06-1 Objective Testing of High-End Audio Systems—Gregor Schmidle, NTi Audio AG - Schaan, Liechtenstein; Gerd Köck, Art Déco Acoustics by Audio Manufaktur Köck - Kornwestheim, Germany; Brian MacMillan, NTi Audio Inc. - Portland, OR, USA
The high-end audio equipment market is filled with extraordinary products. Although the engineering and the materials utilized are often of the finest available, the quality control of such systems is frequently done subjectively rather than objectively. This paper shows some best practice examples of how to deploy effective quality measurement systems through the complete life cycle (R&D, QC installation, and repair) of high-end audio systems.
Convention Paper 9835
P06-2 Theory of Constant Directivity Circular-Arc Line Arrays—Richard Taylor, Thompson Rivers University - Kamloops, BC, Canada; D. B. (Don) Keele, Jr., DBK Associates and Labs - Bloomington, IN, USA
We develop the theory for a broadband constant-beamwidth transducer (CBT) formed by a continuous circular-arc isophase line source. Appropriate amplitude shading of the source distribution leads to a far-field radiation pattern that is constant above a cutoff frequency determined by the prescribed beam width and arc radius. We derive two shading functions, with cosine and Chebyshev polynomial forms, optimized to minimize this cutoff frequency and thereby extend constant-beamwidth behavior over the widest possible band. We illustrate the theory with simulations of magnitude responses, full-sphere radiation patterns and directivity index, for example designs with both wide- and narrow-beam radiation patterns.
Convention Paper 9836
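The amplitude shading central to the CBT theory above can be pictured with a generic cosine taper over the arc: unity drive at the arc center, falling smoothly to zero at the ends. This is only an illustrative taper in the spirit of the paper's cosine shading; the optimized shading functions derived in the paper differ in detail.

```python
import numpy as np

def cosine_shading(theta, theta0):
    """Cosine amplitude shading over a circular arc.

    theta: angular position(s) on the arc, in radians.
    theta0: half-angle of the arc, in radians.
    Returns a weight of 1 at the arc center, tapering to 0 at the
    arc ends, and 0 outside the arc. Illustrative only.
    """
    theta = np.asarray(theta, dtype=float)
    shade = np.cos(np.pi * theta / (2.0 * theta0))
    return np.where(np.abs(theta) <= theta0, shade, 0.0)
```

Smooth tapering to zero at the arc ends is what suppresses the edge diffraction lobes that would otherwise break the constant-beamwidth behavior.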
P06-3 Constant Directivity Circular-Arc Arrays of Dipole Elements—Richard Taylor, Thompson Rivers University - Kamloops, BC, Canada; Kurtis Manke, Thompson Rivers University - Kamloops, BC, Canada; D. B. (Don) Keele, Jr., DBK Associates and Labs - Bloomington, IN, USA
We develop the theory for a broadband constant-beamwidth transducer (CBT) formed by a conformal circular-arc line array of dipole elements. Just as for CBT arrays of point sources, with suitable amplitude shading of the source distribution the far-field radiation pattern is constant above a cutoff frequency. This cutoff frequency is determined by the prescribed beam width and arc radius. We illustrate the theory with examples, including numerical simulations of magnitude responses, full-sphere radiation patterns, and directivity index. Unlike a circular-arc array of monopole elements, a dipole CBT maintains directivity control at low frequency. We give an example of one such array that achieves just 1 dB variation in directivity index over all frequencies.
Convention Paper 9837
P06-4 Voice Coil Temperature—Non Linearity Compensations for Ultra Audio Band Impedance Probing—Isao Anazawa, Ny Works - Toronto, ON, Canada
As the loudspeaker output power of mobile devices increases for a better audio experience, accurate measurement or estimation of the voice coil temperature becomes necessary in order to protect the loudspeaker from overheating. A voice coil designed with a shorting ring, a metal pole piece, or an underhung voice coil will most likely exhibit impedance nonlinearity. A resistance-based temperature measurement method may therefore be adversely affected by voice coil impedance nonlinearity when the resistance is measured using high-frequency probing. For this reason, the nonlinearity must be known and compensated. This paper analyzes and explains voice coil high-frequency impedance characteristics due to eddy current losses and impedance nonlinearity, and develops a method to compensate the nonlinearity to obtain an accurate voice coil temperature measurement.
Convention Paper 9838
P06-5 Variable Fractional Order Analysis of Loudspeaker Transducers: Theory, Simulations, Measurements, and Synthesis—Andri Bezzola, Samsung Research America - Valencia, CA USA; Pascal Brunet, Samsung Research America - Valencia, CA USA; Audio Group - Digital Media Solutions; Shenli Yuan, Center for Computer Research in Music and Acoustics (CCRMA), Stanford University - Stanford, CA, USA
Loudspeaker transducer models with fractional derivatives can accurately approximate the inductive part of the voice coil impedance of a transducer over a wide frequency band while keeping the number of fitting parameters to a minimum. Analytical solutions to Maxwell's equations in infinite lossy coils can also be interpreted as fractional derivative models. However, they suggest that the fractional order α cannot be a constant, but rather a function of frequency that takes on values between 1/2 and 1. This paper uses Finite Element Method (FEM) simulations to bridge the gap between the theoretical first-principles approach and lumped parameter models using fractional derivatives. The study explores the dependence of α on frequency for idealized infinite and finite cores as well as in four real loudspeaker transducers. To better match the measured impedances and frequency-dependent α values, we propose to represent the voice coil impedance by a cascade of R-L sections.
Convention Paper 9839
Thursday, October 19, 9:00 am — 11:00 am
P07-1 A Statistical Model that Predicts Listeners’ Preference Ratings of In-Ear Headphones: Part 1—Listening Test Results and Acoustic Measurements—Sean Olive, Harman International - Northridge, CA, USA; Todd Welti, Harman International Inc. - Northridge, CA, USA; Omid Khonsaripour, Harman International - Northridge, CA, USA
A series of controlled listening tests was conducted on 30 different models of in-ear (IE) headphones to measure their relative sound quality. A total of 71 listeners, both trained and untrained, rated the headphones on a 100-point preference scale using a multiple stimulus method with a hidden reference and low anchor. A virtual headphone test method was used wherein each headphone was simulated over a high-quality replicator headphone equalized to match its measured magnitude response. Leakage was monitored and eliminated for each subject. The results revealed that both trained and untrained listeners preferred the hidden reference, which was the replicator headphone equalized to our new IE headphone target response curve. The further the other headphones deviated from the target response, the less they were preferred. Part two of this paper develops a statistical model that predicts the headphone preference ratings based on their acoustic measurements.
Convention Paper 9840
P07-2 Perceptual Assessment of Headphone Distortion—Louis Fielder, Dolby - San Francisco, CA, USA
A perceptually-driven distortion metric for headphones is proposed that is based on a critical-band spectral comparison of the distortion and noise to an appropriate masked threshold when the headphone is excited by a sine wave signal. Additionally, new headphone-based masking curves for 20, 50, 100, 200, 315, 400, and 500 Hz sine waves are derived by subjective tests using bands of narrow-band noise masked by a sine wave signal. The ratios of measured distortion and noise levels in critical bands over the appropriate masking curve values are compared, with the critical bands starting at the second harmonic. Once this is done, the audibility of all these contributions is combined into a single audibility value. Extension to loudspeaker measurements is briefly discussed.
Convention Paper 9841
P07-3 The Adjustment / Satisfaction Test (A/ST) for the Subjective Evaluation of Dialogue Enhancement—Matteo Torcoli, Fraunhofer Institute for Integrated Circuits IIS - Erlangen, Germany; Jürgen Herre, International Audio Laboratories Erlangen - Erlangen, Germany; Fraunhofer IIS - Erlangen, Germany; Jouni Paulus, Fraunhofer IIS - Erlangen, Germany; International Audio Laboratories Erlangen - Erlangen, Germany; Christian Uhle, Fraunhofer Institute for Integrated Circuits IIS - Erlangen, Germany; International Audio Laboratories Erlangen - Erlangen, Germany; Harald Fuchs, Fraunhofer Institute for Integrated Circuits IIS - Erlangen, Germany; Oliver Hellmuth, Fraunhofer Institute for Integrated Circuits IIS - Erlangen, Germany
Media consumption is heading towards high degrees of content personalization. It is thus crucial to assess the perceptual performance of personalized media delivery. This work proposes the Adjustment/Satisfaction Test (A/ST), a perceptual test where subjects interact with a user-adjustable system and their adjustment preferences and the resulting satisfaction levels are studied. We employ the A/ST to evaluate an object-based audio system that enables the personalization of the balance between dialogue and background, i.e., a Dialogue Enhancement system. Both the case in which the original audio objects are readily available and the case in which they are estimated by blind source separation are compared. Personalization is extensively used, resulting in clearly increased satisfaction, even in the case with blind source separation.
Convention Paper 9842
P07-4 Automatic Text Clustering for Audio Attribute Elicitation Experiment Responses—Jon Francombe, University of Surrey - Guildford, Surrey, UK; Tim Brookes, University of Surrey - Guildford, Surrey, UK; Russell Mason, University of Surrey - Guildford, Surrey, UK
Collection of text data is an integral part of descriptive analysis, a method commonly used in audio quality evaluation experiments. Where large text data sets will be presented to a panel of human assessors (e.g., to group responses that have the same meaning), it is desirable to reduce redundancy as much as possible in advance. Text clustering algorithms have been used to achieve such a reduction. A text clustering algorithm was tested on a dataset for which manual annotation by two experts was also collected. The comparison between the manual annotations and automatically-generated clusters enabled evaluation of the algorithm. While the algorithm could not match human performance, it could produce a similar grouping with a significant redundancy reduction (approximately 48%).
Convention Paper 9843
Thursday, October 19, 11:00 am — 12:30 pm
P08-1 A Simplified 2-Layer Text-Dependent Speaker Authentication System—Giacomo Valenti, NXP Software - Mougins, France; EURECOM - Biot, France; Adrien Daniel, NXP Software - Mougins, France; Nicholas Evans, EURECOM - Sophia Antipolis, France
This paper describes a variation of the well-known HiLAM approach to speaker authentication that enables reliable text-dependent speaker recognition with short-duration enrollment. The modifications introduced in this system eliminate the need for an intermediate text-independent speaker model. While the simplified system is admittedly a modest modification to the original work, it delivers comparable levels of automatic speaker verification performance while requiring 97% less speaker enrollment data. Such a significant reduction in enrollment data improves usability and supports speaker authentication for smart device and Internet of Things applications.
Convention Paper 9844
P08-2 Binaural Sound Source Separation Based on Directional Power Spectral Densities—Joel Augusto Luft, Instituto Federal de Educação, Ciência e Tecnologia do Rio Grande do Sul - Canoas, RS, Brazil; Universidade Federal do Rio Grande do Sul - Porto Alegre, RS, Brazil; Fabio I. Pereira, Federal University of Rio Grande do Sul - Porto Alegre, Brazil; Altamiro Susin, Federal University of Rio Grande do Sul - Porto Alegre, Brazil
Microphone arrays are a common choice for spatial sound source separation. In this paper a new method for binaural source separation is presented. The separation is performed using the spatial positions of the sound sources, the Head-Related Transfer Function, and the Power Spectral Density of fixed beamformers. A non-negative constrained least-squares minimization approach is used to solve the Head-Related Transfer Function based directivity gain formulation, and the Power Spectral Density is used as a magnitude estimate of the sound sources. Simulation examples are presented to demonstrate the performance of the proposed algorithm.
Convention Paper 9845
P08-3 Improving Neural Net Auto Encoders for Music Synthesis—Joseph Colonel, The Cooper Union for the Advancement of Science and Art - New York, NY, USA; Christopher Curro, The Cooper Union for the Advancement of Science and Art - New York, NY, USA; Sam Keene, The Cooper Union for the Advancement of Science and Art - New York, NY, USA
We present a novel architecture for a synthesizer based on an autoencoder that compresses and reconstructs magnitude short-time Fourier transform frames. This architecture outperforms previous topologies by using improved regularization, employing several activation functions, creating a focused training corpus, and implementing the Adam learning method. By applying gains to the hidden layer, users can alter the autoencoder's output, which opens up a palette of sounds unavailable to additive/subtractive synthesizers. Furthermore, our architecture can be quickly re-trained on any sound domain, making it flexible for music synthesis applications. Samples of the autoencoder's outputs can be found at http://soundcloud.com/ann_synth, and the code used to generate and train the autoencoder is open source, hosted at http://github.com/JTColonel/ann_synth.
Convention Paper 9846
P08-4 Comparative Study of Self-Organizing Maps vs Subjective Evaluation of Quality of Allophone Pronunciation for Non-native English Speakers—Bozena Kostek, Gdansk University of Technology - Gdansk, Poland; Audio Acoustics Lab.; Magdalena Piotrowska, Gdansk University of Technology - Gdansk, Poland; Tomasz Ciszewski, University of Gdansk - Gdansk, Poland; Andrzej Czyzewski, Gdansk University of Technology - Gdansk, Poland
The purpose of this study was to apply Self-Organizing Maps to differentiate between the correct and the incorrect allophone pronunciations and to compare the results with subjective evaluation. Recordings of a list of target words, containing selected allophones of English plosive consonants, the velar nasal and the lateral consonant, were made twice. First, the target words were read from the list by nine non-native speakers and then repeated after a phonology expert’s recorded sample. Afterwards, two recorded signal sets were segmented into allophones and parameterized. For that purpose, a set of descriptors, commonly employed in music information retrieval, was utilized to determine whether they are effective in allophone analysis. The phonology expert’s task was to evaluate the pronunciation accuracy of each uttered allophone. Extracted feature vectors along with the assigned ratings were applied to SOMs.
Convention Paper 9847
P08-5 Automatic Masking Reduction in Balance Mixes Using Evolutionary Computing—Nicholas Jillings, Birmingham City University - Birmingham, UK; Ryan Stables, Birmingham City University - Birmingham, UK
Music production is a highly subjective task, which can be difficult to automate. Simple session structures can quickly expose complex mathematical tasks which are difficult to optimize. This paper presents a method for the reduction of masking in an unknown mix using genetic programming. The model uses results from a series of listening tests to guide its cost function. The program then returns a vector that best minimizes this cost. The paper explains the limitations of using such a method for audio as well as validating the results.
Convention Paper 9813
Thursday, October 19, 1:30 pm — 5:30 pm
P09-1 Analysis and Prediction of the Audio Feature Space when Mixing Raw Recordings into Individual Stems—Marco A. Martinez Ramirez, Queen Mary University of London - London, UK; Joshua D. Reiss, Queen Mary University of London - London, UK
Processing individual stems from raw recordings is one of the first steps of multitrack audio mixing. In this work we explore which set of low-level audio features is sufficient to design a prediction model for this transformation. We extract a large set of audio features from bass, guitar, vocal, and keys raw recordings and stems. We show that a procedure based on random forest classifiers can significantly reduce the number of features, and we use the selected audio features to train various multi-output regression models. Thus, we investigate stem processing as a content-based transformation, where the inherent content of raw recordings leads us to predict the change in feature values that occurred within the transformation.
Convention Paper 9848
P09-2 The Beat Goes Static: A Tempo Analysis of U.S. Billboard Hot 100 #1 Songs from 1955–2015—Stephen Roessner, University of Rochester - Rochester, NY, USA
The Billboard Hot 100 is a rich source of information for tracking musical trends. Using available data analysis tools, we devised a method to accurately track tempo throughout a song. In this paper we demonstrate through an analysis of all number one songs from the chart that tempo variation within a song has declined over a 60-year period. In the 5-year span from 1955–1959, the average standard deviation of tempo was 5.01 beats per minute, or about 4.8%. Conversely, from 2010–2014, the average standard deviation was less than 1 beat per minute, or only about 0.85% of the average tempo.
Convention Paper 9849
P09-3 An Even-Order Harmonics Control Technique for Analog Pedal Effector—Kanako Takemoto, Hiroshima Institute of Technology - Hiroshima, Japan; Shiori Oshimo, Hiroshima Institute of Technology - Hiroshima, Hiroshima-ken, Japan; Toshihiko Hamasaki, Hiroshima Institute of Technology - Hiroshima, Japan
The primary distortion mechanism of an analog guitar effects pedal is the saturating nonlinearity of its transfer function, which consists of an operational amplifier and diode clippers with filters. The output spectrum of this system is dominated by odd-order harmonics, but it also contains even-order harmonics. We found that the intensity of these even-order harmonics varies depending on the power supply voltage, and we clarified the mechanism by analyzing the internal circuit topology of the operational amplifier. The analysis was validated by comparison with the operation of a conventional single-ended transistor pedal. Based on the analysis, we propose a new even-harmonic control technique, which was applied to analog "Distortion" pedals and demonstrated distinctive experimental results with a prototype.
Convention Paper 9850
P09-4 Unified Modeling for Series of Miniature Twin Triode Tube—Shiori Oshimo, Hiroshima Institute of Technology - Hiroshima, Hiroshima-ken, Japan; Kanako Takemoto, Hiroshima Institute of Technology - Hiroshima, Japan; Toshihiko Hamasaki, Hiroshima Institute of Technology - Hiroshima, Japan
A unified high-precision SPICE model for the series of miniature twin (MT) triode vacuum tubes has been achieved for the first time. The model formula was validated by comparing the electrode and space physical dimensions of the 12AX7, 12AU7, 12AY7, and 12AT7, together with various properties beyond the general Ip-Vp family of curves. As a result, the nonlinear behavior of the grid current and the plate current as a function of positive/negative grid voltage can be expressed entirely by the 17 parameters of the newly proposed SPICE model, in which 4 tube-type-specific parameters and 4 universal parameters are constant, while the matching of the twin valves of each tube, as well as product dispersion, is fitted by 9 variable parameters.
Convention Paper 9851
P09-5 Virtual Analog Modeling of a UREI 1176LN Dynamic Range Control System—Etienne Gerat, Helmut Schmidt University Hamburg - Hamburg, Germany; Felix Eichas, Helmut Schmidt University Hamburg - Hamburg, Germany; Udo Zölzer, Helmut-Schmidt-University Hamburg - Hamburg, Germany
This paper discusses an application of block-oriented modeling to a popular analog dynamic range compressor using iterative minimization. The reference device studied here is the UREI 1176LN, which has been widely used in music production and recording. A clone of the circuit built in a previous project has been used as a reference device to compare the results of the implementation. A parametric block-oriented model has been designed, improved, and tuned using the Levenberg-Marquardt iterative error minimization algorithm. Only input/output measurements have been performed following a gray-box modeling approach. Finally the model has been evaluated with objective scores and a listening test. This work led to very convincing modeling results.
Convention Paper 9852
P09-6 Amplitude Panning and the Interior Pan—Mark R. Thomas, Dolby Laboratories - San Francisco, CA, USA; Charles Q. Robinson, Dolby Laboratories - San Francisco, CA, USA
The perception of source location using multi-loudspeaker amplitude panning is considered. While there exist many perceptual models for pairwise panning, relatively few studies consider the general multi-loudspeaker case. This paper evaluates panning scenarios in which a source is panned on the boundary or within the volume bounded by discrete loudspeakers, referred to as boundary and interior pans respectively. Listening results reveal the following: (1) pans to a single loudspeaker yield lowest localization error, (2) pairwise pans tend to be consistently localized closer to the listener than single loudspeaker pans, (3) largest errors occur when the virtual source is panned close to the listener, (4) interior pans are accurately perceived and, surprisingly, in some cases more accurately than pairwise pans.
Convention Paper 9853
P09-7 Recording in a Virtual Acoustic Environment—Jonathan S. Abel, Stanford University - Stanford, CA, USA; Elliot K. Canfield-Dafilou, Center for Computer Research in Music and Acoustics (CCRMA), Stanford University - Stanford, CA, USA
A method is presented for high-quality recording of voice and acoustic instruments in loudspeaker-generated virtual acoustics. Auralization systems typically employ close-mic'ing to avoid feedback, while classical recording methods prefer high-quality room microphones to capture the instruments integrated with the space. Popular music production records dry tracks, and applies reverberation after primary edits are complete. Here a hybrid approach is taken, using close mics to produce real-time, loudspeaker-projected virtual acoustics, and room microphones to capture a balanced, natural sound. The known loudspeaker signals are then used to cancel the virtual acoustics from the room microphone tracks, providing a set of relatively dry tracks for use in editing and post-production. Example recordings of Byzantine chant in a virtual Hagia Sophia are described.
Convention Paper 9854
P09-8 A Study of Listener Bass and Loudness Preferences over Loudspeakers and Headphones—Elisabeth McMullin, Samsung Research America - Valencia, CA USA
In order to study listener bass and loudness preferences over loudspeakers and headphones, a series of experiments using a method of adjustment was run. Listeners adjusted the bass and loudness levels of multiple genres of music to their personal preference in separate listening sessions, over loudspeakers in a listening room and over headphones equalized to simulate loudspeakers in a listening room. The results indicated that listeners who preferred more bass over both headphones and loudspeakers also tended to listen at higher levels. Furthermore, the majority of listeners preferred slightly higher bass and loudness levels over loudspeakers than over headphones. Listener factors including musical preferences, hearing ability, and training level are also explored.
Convention Paper 9855
Thursday, October 19, 1:30 pm — 3:30 pm
P10-1 Loudspeakers as Microphones for Infrasound—John Vanderkooy, University of Waterloo - Waterloo, ON, Canada
This paper shows that a sealed-box loudspeaker can be used as the sensor for a very high-performance infrasound microphone. Since the cone displacement essentially responds directly to infrasound pressure, the velocity-induced loudspeaker output must be electronically integrated to give a flat response. The undamped resonance peak of the loudspeaker is avoided by feeding the short-circuit voice coil current into the virtual ground input of an integrator op-amp. Design equations are given and a complete response analysis is presented. A prototype is compared with a conventional microphone used for infrasound measurement, showing the improved performance of the sealed-box loudspeaker design.
Convention Paper 9856
P10-2 A Low-Cost, High-Quality MEMS Ambisonic Microphone—Gabriel Zalles, New York University - New York, NY, USA; Yigal Kamel, New York University - New York, NY, USA; Ian Anderson, New York University - New York, NY, USA; Ming Yang Lee, New York University - New York, NY, USA; Chris Neil, New York University - New York, NY, USA; Monique Henry, New York University - New York, NY, USA; Spencer Cappiello, New York University - New York, NY, USA; Charlie Mydlarz, New York University, CUSP - New York, NY, USA; Melody Baglione, New York University - New York, NY, USA; Agnieszka Roginska, New York University - New York, NY, USA
While public interest in technologies that produce and deliver immersive VR content has been growing, the price point for these tools has remained relatively high. This paper presents a low-cost, high-quality first-order ambisonics (FOA) microphone based on low-noise microelectromechanical systems (MEMS). The paper details the design, fabrication, and testing of the MEMS FOA microphone, including its frequency and directivity response. To facilitate high-resolution directivity measurements, a low-cost, automatically rotating microphone mount using an Arduino was designed. The automatic control of this platform was integrated into an in-house acoustic measurement library built in MATLAB, allowing the user to generate polar plots at resolutions down to 1.8°. Subjective assessments compared the FOA microphone prototype to commercially available FOA solutions at higher price points.
Convention Paper 9857
P10-3 Automated Auditory Monitoring Using Evoked Potentials and Consumer Headphones—Thomas Rouse, Plextek - Great Chesterford, Essex, UK; Loek Janssen, Plextek - Great Chesterford, UK
Auditory Evoked Potentials (AEP) are electrical signals resulting from activity in the auditory system in response to stimuli. The characteristic waveforms can be indicative of cochlea and auditory brainstem function and may change after the onset of tinnitus or hearing threshold shifts, whether permanent or temporary. AEP measurement is currently used by audiologists for hearing assessment in infants and to aid the diagnosis of some diseases. Measurements were made using a variety of consumer headphones and integrated electrodes and compared with a reference audiology system. The results showed the ability to record a consistent response and indicated that AEPs can be reliably measured outside a clinical environment. This could be used to automatically monitor for changes in a user's hearing.
Convention Paper 9858
P10-4 A Digital Class D Audio Amplifier with Pulse Density Modulation and Distortion Suppression Feedback Loop—Robert McKenzie, University of Toronto - Toronto, ON, Canada; Xinchang Li, Graduate University of the Chinese Academy of Sciences - Beijing, China; Martin Snelgrove, Kapik Integration - Toronto, ON, Canada; Wai Tung Ng, University of Toronto - Toronto, ON, Canada
A novel fully digital Class D amplifier is presented in which the output stage error is digitized by a 10-bit ADC and fed back into the modulation path to suppress distortion. This technique attenuates the in-band noise introduced by the output stage, and can tolerate large latency. A fully digital Pulse Density Modulation (PDM) Class D amplifier with output stage noise shaping is implemented on a PCB prototype. Feedback loop functionality is verified experimentally, and a 10 dB improvement in Total Harmonic Distortion plus Noise (THD+N) is realized.
Convention Paper 9859
Thursday, October 19, 2:00 pm — 3:30 pm
P11-1 Deep Neural Network Based HRTF Personalization Using Anthropometric Measurements—Chan Jun Chun, Korea Institute of Civil Engineering and Building Technology (KICT) - Goyang, Korea; Jung Min Moon, Gwangju Institute of Science and Technology (GIST) - Gwangju. Korea; Geon Woo Lee, Gwangju Institute of Science and Technology (GIST) - Gwangju. Korea; Nam Kyun Kim, Gwangju Institute of Science and Technology (GIST) - Gwangju, Korea; Hong Kook Kim, Gwangju Institute of Science and Tech (GIST) - Gwangju, Korea
A head-related transfer function (HRTF) is a very simple and powerful tool for producing spatial sound by filtering monaural sound. It represents the effects of the head, body, and pinna, as well as the pathway from a given source position to a listener's ears. Unfortunately, although the characteristics of the HRTF differ slightly from person to person, it is usual to use an HRIR that is averaged over all subjects. In addition, it is difficult to measure individual HRTFs for all horizontal and vertical directions. Thus, this paper proposes a deep neural network (DNN)-based HRTF personalization method using anthropometric measurements. To this end, the CIPIC HRTF database, a public-domain database of HRTF measurements, is analyzed to generate a DNN model for HRTF personalization. The input features for the DNN are the anthropometric measurements, including head, torso, and pinna information, and the output labels are the head-related impulse response (HRIR) samples of a left ear. The performance of the proposed method is evaluated by computing the root-mean-square error (RMSE) and log-spectral distortion (LSD) between the reference HRIR and the one estimated by the proposed method. The results show that the RMSE and LSD for the estimated HRIR are smaller than those of the HRIR averaged over all subjects in the CIPIC HRTF database.
Convention Paper 9860
P11-2 The Upmix Method for 22.2 Multichannel Sound Using Phase Randomized Impulse Responses—Toru Kamekawa, Tokyo University of the Arts - Adachi-ku, Tokyo, Japan; Atsushi Marui, Tokyo University of the Arts - Tokyo, Japan
An upmix technique for 22.2 multichannel sound was studied using room impulse responses (RIRs) processed by a phase randomization technique. The results of the first experiment showed that the spatial impression of the proposed method was close to that of the original sound, but the timbre differed. In the second experiment we divided the RIRs at the moment when the diffuse reverberation tail begins (the mixing time), defined in two ways: fixed at 80 ms, and set separately for each frequency band. The results showed that the similarity between the proposed methods and the original sound improved; however, they suggest that the similarity of the timbre depends on the sound sources and on a suitable choice of RIR mixing time.
Convention Paper 9861
P11-3 A 3D Sound Localization System Using Two Side Loudspeaker Matrices—Yoshihiko Sato, University of Aizu - Aizuwakamatsu-shi, Fukushima, Japan; Akira Saji, University of Aizu - Aizuwakamatsu City, Japan; Jie Huang, University of Aizu - Aizuwakamatsu City, Japan
We propose a new 3D sound reproduction system that consists of two side loudspeaker matrices, each with four loudspeakers. The 3D sound images applied to this system were created by the amplitude panning method and by convolution with head-related transfer functions (HRTFs). In our past research we arranged the loudspeaker matrices in a square shape, but the accuracy of sound image localization needed improvement. We therefore rotated the loudspeaker matrices by 45 degrees, changing their shape from a square to a diamond, to improve direction perception. As a result, the diamond-shaped loudspeaker matrices localized sound images closer to the intended directions than the square-shaped ones.
Convention Paper 9862
P11-4 Optimization of Interactive Binaural Processing—François Salmon, École Nationale Supérieure Louis-Lumière - Paris, France; CMAP - Ecole Polytechnique - Paris, France; Matthieu Aussal, CMAP - Ecole Polytechnique - Paris, France; Etienne Hendrickx, Paris Conservatory (CNSMDP) - Paris, France; Jean-Christophe Messonnier, CNSMDP Conservatoire de Paris - Paris, France; Laurent Millot, ENS Louis-Lumière - Paris, France; Acte Institute (UMR 8218, CNRS/University Paris 1) - Paris, France
Several monitoring devices may be involved during post-production. Given its lower cost and practical advantages, head-tracked binaural processing could help professionals monitor spatialized audio content. However, this technology introduces significant spectral coloration at some sound incidences and suffers when compared directly to a stereophonic signal reproduced over headphones. Therefore, different processing methods are proposed to optimize the binaural rendering and to find a new balance between externalization and timbral coloration. For this purpose, the alteration of HRTF spectral cues in the frontal area only was studied. To evaluate the accuracy of these treatments, listening tests were conducted. One HRTF processing method offered as much externalization as the original HRTFs while having a timbre closer to that of the original stereo signal.
Convention Paper 9863
P11-5 A Direct Comparison of Localization Performance When Using First, Third, and Fifth Ambisonics Order for Real Loudspeaker and Virtual Loudspeaker Rendering—Lewis Thresh, University of York - York, UK; Calum Armstrong, University of York - York, UK; Gavin Kearney, University of York - York, UK
Ambisonics is being used in applications such as virtual reality to render 3-dimensional sound fields over headphones through the use of virtual loudspeakers, the performance of which has previously been assessed up to third order. Through a localization test, the performance of first-, third-, and fifth-order Ambisonics is investigated for optimized real and virtual loudspeaker arrays utilizing a generic HRTF set. Results indicate a minor improvement in localization accuracy when using fifth order over third, though both show vast improvement over first. It is shown that individualized HRTFs are required to fully investigate the performance of Ambisonic binaural rendering.
Convention Paper 9864
Friday, October 20, 9:00 am — 12:00 pm
P12-1 Efficient Structures for Virtual Immersive Audio Processing—Jean-Marc Jot, Magic Leap - Sunnyvale, CA, USA; Daekyoung Noh, Xperi Corp - Santa Ana, CA, USA; Themis Katsianos, Xperi Corp - Highland, CA, USA
New consumer audio formats have been developed in recent years for the production and distribution of immersive multichannel audio recordings including surround and height channels. HRTF-based binaural synthesis and cross-talk cancellation techniques can simulate virtual loudspeakers, localized in the horizontal plane or at elevated apparent positions, for audio reproduction over headphones or convenient loudspeaker playback systems. In this paper we review and discuss the practical design and implementation challenges of immersive audio virtualization methods, and describe computationally efficient processing approaches and topologies enabling more robust and consistent reproduction of directional audio cues in consumer applications.
Convention Paper 9865
P12-2 Robust 3D Sound Capturing with Planar Microphone Arrays Using Directional Audio Coding—Oliver Thiergart, International Audio Laboratories Erlangen - Erlangen, Germany; Guendalina Milano, International Audio Laboratories Erlangen - Erlangen, Germany; Tobias Ascherl, International Audio Laboratories Erlangen - Erlangen, Germany; Emanuël A. P. Habets, International Audio Laboratories Erlangen - Erlangen, Germany
Real-world VR applications require capturing 3D sound with microphone setups that are hidden from the field of view of the 360-degree camera. Directional audio coding (DirAC) is a spatial sound capturing approach that can be applied to a wide range of compact microphone arrays. Unfortunately, its underlying parametric sound field model is often violated, which degrades the spatial sound quality. We therefore combine the non-linear DirAC processing with a linear beamforming approach that approximates the panning gains in DirAC, reducing the required amount of non-linear processing while increasing the robustness against model violations. Additionally, we derive a DOA estimator that enables 3D sound capturing with DirAC using compact 2D microphone arrays, which are often preferred in VR applications.
Convention Paper 9866
P12-3 Frequency Bands Distribution for Virtual Source Widening in Binaural Synthesis—Hengwei Su, Tokyo University of the Arts - Tokyo, Japan; Atsushi Marui, Tokyo University of the Arts - Tokyo, Japan; Toru Kamekawa, Tokyo University of the Arts - Adachi-ku, Tokyo, Japan
The aim of this study is to investigate perceived source width in binaural synthesis. To synthesize sounds with extended source widths, monophonic signals were divided by a 1/3-octave filter bank, and each band component was then distributed to a different direction within the intended width by convolution with the corresponding head-related transfer function. A subjective listening experiment using pairwise comparison was conducted to evaluate differences in perceived width between stimuli with different synthesis widths and distribution methods. The results showed that this processing method can achieve a wider sound source width in binaural synthesis. However, its effectiveness may vary with the spectral characteristics of the source signals; further revision of the method is needed to improve its stability and performance.
Convention Paper 9867
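The band-distribution idea in P12-3 can be illustrated with a short sketch. The function names, band limits, and filter orders below are our own choices, and constant-power stereo panning stands in for the HRTF convolution used in the paper, which requires a measured HRTF set.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def third_octave_bands(f_lo=100.0, f_hi=8000.0):
    """Center frequencies of 1/3-octave bands between f_lo and f_hi."""
    centers = []
    f = f_lo
    while f <= f_hi:
        centers.append(f)
        f *= 2 ** (1 / 3)
    return centers

def widen_source(mono, fs, width_deg=60.0, seed=0):
    """Split a mono signal into 1/3-octave bands and give each band its own
    direction within +/- width_deg/2. Here each direction is rendered with
    simple constant-power panning as a stand-in for HRTF convolution."""
    rng = np.random.default_rng(seed)
    left = np.zeros_like(mono)
    right = np.zeros_like(mono)
    for fc in third_octave_bands():
        # 1/3-octave bandpass around fc
        sos = butter(4, [fc / 2 ** (1 / 6), fc * 2 ** (1 / 6)],
                     btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos, mono)
        az = rng.uniform(-width_deg / 2, width_deg / 2)   # band direction
        theta = (az / width_deg + 0.5) * (np.pi / 2)      # map to [0, pi/2]
        left += np.cos(theta) * band                      # constant-power pan
        right += np.sin(theta) * band
    return left, right
```

Replacing the panning gains with per-direction HRTF convolutions would yield the binaural version studied in the paper.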
P12-4 Improving Elevation Perception in Single-Layer Loudspeaker Array Display Using Equalizing Filters and Lateral Grouping—Julian Villegas, University of Aizu - Aizu Wakamatsu, Fukushima, Japan; Naoki Fukasawa, University of Aizu - Aizu Wakamatsu, Japan; Yurina Suzuki, University of Aizu - Aizu Wakamatsu, Japan
A system to improve the perception of elevated sources is presented. This method relies on “equalizing filters,” a technique that aims to compensate for unintended changes in the magnitude spectrum produced by the placement of loudspeakers with respect to the desired location. In the proposed method, when sources are on the horizon, a maximum of two loudspeakers are used for reproduction. Otherwise, the horizon spatialization is mixed with one that uses side loudspeakers grouped by lateral direction. Results from a subjective experiment suggest that the proposed method is capable of producing elevated images, but the perceived elevation range is somewhat compressed.
Convention Paper 9868
P12-5 Development and Application of a Stereophonic Multichannel Recording Technique for 3D Audio and VR—Helmut Wittek, SCHOEPS GmbH - Karlsruhe, Germany; Günther Theile, VDT - Geretsried, Germany
A newly developed microphone arrangement is presented that aims at an optimal pickup of ambient sound for 3D audio. The ORTF-3D is a discrete eight-channel setup that can be routed to the channels of a 3D stereo format such as Dolby Atmos or Auro-3D. It is also well suited for immersive sound formats such as wave field synthesis or VR/binaural, as it creates a complex 3D ambience that can be mixed or binauralized. The ORTF-3D setup was developed on the basis of stereophonic rules. It creates an optimal directional image in all directions as well as a high spatial sound quality due to highly uncorrelated signals in the diffuse sound field. Reports from sound engineers affirm that it creates a highly immersive sound in a large listening area and is still compact and practical to use.
Convention Paper 9869
P12-6 Apparent Sound Source De-Elevation Using Digital Filters Based on Human Sound Localization—Adrian Celestinos, Samsung Research America - Valencia, CA, USA; Elisabeth McMullin, Samsung Research America - Valencia, CA USA; Ritesh Banka, Samsung Research America - Valencia, CA USA; William Decanio, Samsung Research America - Valencia, CA, USA; Allan Devantier, Samsung Research America - Valencia, CA, USA
This study presents the possibility of creating an apparent sound source elevated or de-elevated from its physical location. For situations where loudspeakers must be placed away from the ideal locations for accurate sound reproduction, digital filters are created and inserted in the audio reproduction chain to either elevate or de-elevate the perceived sound relative to its physical location. The filters are based on head-related transfer functions (HRTFs) measured on human subjects; they capture the average head, ear, and torso transfer functions, isolating the effect of elevation/de-elevation only. Preliminary tests in a movie theater setup indicate that an apparent de-elevation of about –20 degrees from the physical location can be achieved.
Convention Paper 9870
Friday, October 20, 11:00 am — 12:30 pm
P13-1 Equalization of Localized Sources on Flat-Panel Audio Displays—Michael Heilemann, University of Rochester - Rochester, NY, USA; David Anderson, University of Rochester - Rochester, NY, USA; Mark F. Bocko, University of Rochester - Rochester, NY, USA
An equalization method is presented for sound sources rendered by eigenmode superposition on flat-panel audio displays. A filter is designed to provide a constant mechanical acceleration for each localized source region at all frequencies below the spatial aliasing frequency of the actuator array used to excite the panel’s bending modes. Within this bandwidth, the vibration profile of the source remains consistent with the application of the equalization filter, preserving any spatial information conveyed to the listener from the source position. Directivity simulations and measurements show that these localized source regions do not exhibit the irregular directivity characteristic of single and multi-actuator distributed mode loudspeakers, but instead exhibit radiation characteristics similar to conventional piston loudspeakers within the array bandwidth.
Convention Paper 9871
P13-2 Loudspeaker 3D Directivity Estimation with First Order Microphone Measurements on a 2D Plane—Lachlan Birnie, Australian National University - Canberra, Australia; Thushara Abhayapala, Australian National University - Canberra, ACT, Australia; Prasanga Samarasinghe, Australian National University - Canberra, Australia
This paper proposes an efficient method to estimate the 3D directivity pattern of loudspeakers or portable devices with embedded speakers. We place the loudspeaker on a horizontal turntable and use a first order microphone located on the horizontal plane to measure pressure and pressure gradients along three orthogonal directions, constructing equivalent virtual arrays of first order microphones on the horizontal plane. By exploiting the properties of the associated Legendre functions, we construct the 3D directivity pattern of the loudspeaker over frequency. The method is equivalent to a measurement setup consisting of a dense spherical array encompassing the loudspeaker. The underlying theory and method are corroborated by simulations as well as measurements of the directivity of a physical loudspeaker.
Convention Paper 9872
P13-3 A Headphone Measurement System Covers both Audible Frequency and beyond 20 kHz (Part 3)—Naotaka Tsunoda, Sony Corporation - Shinagawa-ku, Tokyo, Japan; Takeshi Hara, Sony Video & Sound Products Inc. - Tokyo, Japan; Koji Nageno, Sony Video and Sound Corporation - Tokyo, Japan
A new headphone frequency response measurement scheme was standardized as JEITA RC-8140B-1 in March 2016. The basic idea of the scheme is that the frequency response is measured on a head and torso simulator (HATS) and compensated by the free-field HRTF of the HATS used in the measurement. One advantage of this scheme is that the obtained results are directly comparable to free-field frequency response measurements of loudspeakers. This report supplements the previous report, which proposed the basic idea of the scheme, by adding topics regarding a newly developed HATS with ear simulators that improves the signal-to-noise ratio above 20 kHz.
Convention Paper 9873
P13-4 Novel Type of MEMS Loudspeaker Featuring Membrane-Less Two-Way Sound Generation—Fabian Stoppel, Fraunhofer Institute for Silicon Technology ISIT - Itzehoe, Germany; Florian Niekiel, Fraunhofer Institute for Silicon Technology ISIT - Itzehoe, Germany; Thorsten Giese, Fraunhofer Institute for Silicon Technology ISIT - Itzehoe, Germany; Shanshan Gu-Stoppel, Fraunhofer Institute for Silicon Technology ISIT - Itzehoe, Germany; Andreas Männchen, Fraunhofer Institute for Digital Media Technology IDMT - Ilmenau, Germany; Johannes Nowak, Fraunhofer Institute for Digital Media Technology IDMT - Ilmenau, Germany; Daniel Beer, Fraunhofer Institute for Digital Media Technology IDMT - Ilmenau, Germany; Bernhard Wagner, Fraunhofer Institute for Silicon Technology ISIT - Itzehoe, Germany
In this paper a novel type of piezoelectric microelectromechanical loudspeaker is presented. The device concept is based on concentrically cascaded lead zirconate titanate actuators, making it the first integrated two-way MEMS speaker reported. As a further novelty, the device is designed to operate without a closed membrane, significantly improving acoustic performance, energy efficiency, and manufacturability. Extensive finite element analysis studies have revealed a very high SPL of more than 79 dB at a distance of 10 cm at 500 Hz for a device 1 cm² in size operated at 30 V. At higher frequencies even larger SPL values are calculated, enabling a flat frequency response at 89 dB for frequencies above 800 Hz. Based on this concept, first speaker prototypes have been fabricated using MEMS technology.
Convention Paper 9874
P13-5 Analysis of the Mechanical Vibration and Acoustic Behavior of a Piezoelectric MEMS Microspeaker—Andreas Männchen, Fraunhofer Institute for Digital Media Technology IDMT - Ilmenau, Germany; Daniel Beer, Fraunhofer Institute for Digital Media Technology IDMT - Ilmenau, Germany; Florian Niekiel, Fraunhofer Institute for Silicon Technology ISIT - Itzehoe, Germany; Johannes Nowak, Fraunhofer Institute for Digital Media Technology IDMT - Ilmenau, Germany; Fabian Stoppel, Fraunhofer Institute for Silicon Technology ISIT - Itzehoe, Germany; Bernhard Wagner, Fraunhofer Institute for Silicon Technology ISIT - Itzehoe, Germany
This paper investigates the performance of a piezoelectric MEMS-based microspeaker and compares it to state-of-the-art electrodynamic microspeakers for mobile applications. The analysis is twofold: first, the mechanical behavior is evaluated using laser interferometry and discussed for different stimuli such as sine sweeps and static sinusoidal excitation; second, the acoustic performance is assessed by measurements under anechoic conditions. Results show that the speaker performs well for its size while providing low power consumption. However, further design improvements are necessary to achieve high broadband reproduction quality.
Convention Paper 9875
P13-6 Auditory-Based Smoothing for Equalization of Headphone-to-Eardrum Transfer Function—Guangju Li, Key Laboratory of Noise and Vibration Research,Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences - Beijing, China; Ziran Jiang, Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences - Beijing, China; Jinqiu Sang, Institute of Acoustics, Chinese Academy of Science - Beijing, China; Chengshi Zheng, Chinese Academy of Sciences - Beijing, China; Chinese Academy of Sciences - Shanghai, China; Renhua Peng, Chinese Academy of Sciences - Beijing, China; Xiaodong Li, Chinese Academy of Sciences - Beijing, China; Chinese Academy of Sciences - Shanghai, China
Binaural headphone reproduction can be improved by appropriate equalization of the headphone-to-eardrum transfer function (HETF). Direct inversion of the HETF targeting a flat frequency response cannot preserve the peaks and notches due to pinna and ear canal filtering that may aid auditory perception. Moreover, direct inversion may introduce annoying high-Q peaks due to variability across listeners. Smoothing the HETF before inversion avoids such over-equalization. Two auditory-based spectral smoothing methods were studied in this research: one based on roex filtering, which simulates auditory filtering on the basilar membrane, and one based on cepstral smoothing, which simulates the frequency resolution of auditory perception. Subjective experiments show that, in comparison to direct inversion, the two proposed methods improve binaural headphone reproduction.
Convention Paper 9876
P13-7 Interpolation and Display of Microphone Directivity Measurements Using Higher Order Spherical Harmonics—Jonathan D. Ziegler, Stuttgart Media University - Stuttgart, Germany; Eberhard Karls University Tübingen - Tübingen, Germany; Mark Rau, Center for Computer Research in Music and Acoustics, Stanford University - Palo Alto, CA, USA; McGill University - Montreal, QC, Canada; Andreas Schilling, Eberhard Karls University Tuebingen - Tuebingen, Germany; Andreas Koch, Stuttgart Media University - Stuttgart, Germany
The accurate display of frequency dependent polar response data of microphones has largely relied on the use of a defined set of test frequencies and a simple overlay of two-dimensional plots. In recent work, a novel approach to digital displays without fixed frequency points was introduced. Building on this, an enhanced interpolation algorithm is presented, using higher-order spherical harmonics for angular interpolation. The presented approach is compared to conventional interpolation methods in terms of computational cost and accuracy. In addition, a three-dimensional data processing prototype for the creation of interactive, frequency-dependent, three-dimensional microphone directivity plots is presented.
Convention Paper 9877
Friday, October 20, 1:30 pm — 4:00 pm
P14-1 A Statistical Model that Predicts Listeners’ Preference Ratings of In-Ear Headphones: Part 2—Development and Validation of the Model—Sean Olive, Harman International - Northridge, CA, USA; Todd Welti, Harman International Inc. - Northridge, CA, USA; Omid Khonsaripour, Harman International - Northridge, CA, USA
Part 1 of this paper presented the results of controlled listening tests in which 71 listeners, both trained and untrained, gave preference ratings for 30 different models of in-ear (IE) headphones. Both trained and untrained listeners preferred the headphone equalized to the Harman IE target curve. Objective measurements indicated that the magnitude response of a headphone is a predictor of its preference rating: the further it deviated from the Harman IE target curve, the less it was generally preferred. Part 2 presents a linear regression model that accurately predicts the headphone preference ratings (r = 0.91) based on the size, standard deviation, and slope of the magnitude response deviation from the Harman IE headphone target curve.
Convention Paper 9878
P14-2 Comparison of Hedonic and Quality Rating Scales for Perceptual Evaluation of High- and Intermediate Quality Stimuli—Nick Zacharov, DELTA SenseLab - Hørsholm, Denmark; Christer Volk, DELTA SenseLab - Hørsholm, Denmark; Tore Stegenborg-Andersen, DELTA SenseLab - Hørsholm, Denmark
In this study four rating scales for perceptual evaluation of preference were compared: the 9-point hedonic scale, the Continuous Quality Scale (CQS) (e.g., as used in ITU-R BS.1534-3, “MUSHRA”), the Labelled Hedonic Scale (LHS), and a modified version of the LHS. The CQS was tested in three configurations to study the role and impact of the reference and anchor stimuli: a full MUSHRA test with anchors and references, a test without references, and a test with neither references nor anchors. The six test configurations were evaluated with two groups of AAC codec qualities covering high and intermediate quality ranges. Results showed that the largest difference in scale usage was caused by having a declared reference, but also that scale range usage is not strongly related to stimulus discrimination power.
Convention Paper 9879
P14-3 Perceptual Evaluation of Source Separation for Remixing Music—Hagen Wierstorf, University of Surrey - Guildford, Surrey, UK; Dominic Ward, University of Surrey - Guildford, Surrey, UK; Russell Mason, University of Surrey - Guildford, Surrey, UK; Emad M. Grais, University of Surrey - Guildford, Surrey, UK; Chris Hummersone, University of Surrey - Guildford, Surrey, UK; Mark D. Plumbley, University of Surrey - Guildford, Surrey, UK
Music remixing is difficult when the original multitrack recording is not available. One solution is to estimate the elements of a mixture using source separation. However, existing techniques suffer from imperfect separation and perceptible artifacts on single separated sources. To investigate their influence on a remix, five state-of-the-art source separation algorithms were used to remix six songs by increasing the level of the vocals. A listening test was conducted to assess the remixes in terms of loudness balance and sound quality. The results show that some source separation algorithms are able to increase the level of the vocals by up to 6 dB at the cost of introducing a small but perceptible degradation in sound quality.
Convention Paper 9880
P14-4 Adaptive Low-frequency Extension Using Auditory Filterbanks—Sunil G. Bharitkar, HP Labs., Inc. - San Francisco, CA, USA; Timothy Mauer, HP, Inc. - San Francisco, CA, USA; Charles Oppenheimer, HP, Inc. - San Francisco, CA, USA; Teresa Wells, HP, Inc. - San Francisco, CA, USA; David Berfanger, HP, Inc. - San Francisco, CA, USA
Microspeakers used in mobile devices and PCs have band-limited frequency responses, a consequence of small drivers in tight enclosures, resulting in the loss of low-frequency playback content. The lack of low frequencies in turn degrades audio quality and perceived loudness. One way to overcome this physical limitation is to leverage the auditory phenomenon of the missing fundamental, whereby the auditory system infers the missing fundamental frequency from a synthesized harmonic structure. The proposed approach employs side-chain processing that synthesizes harmonics from only the dominant portions of the low-frequency signal, selected using critical-band filters. Additionally, a parametric filter is used to shape the harmonics. Listening tests reveal that the proposed technique is preferred in terms of both overall sound quality and bass-only quality.
Convention Paper 9881
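The missing-fundamental idea behind P14-4 can be sketched in a few lines. This is a generic virtual-bass side chain, not the paper's method: the rectifier nonlinearity, crossover frequency, and filter choices are our own stand-ins for the paper's critical-band selection and parametric harmonic shaping.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def virtual_bass(x, fs, fc=150.0, gain=0.5):
    """Illustrative virtual-bass side chain: isolate the band below fc,
    generate harmonics with a memoryless nonlinearity, band-limit them to
    the reproducible range, and mix with the highpassed main path."""
    sos_lp = butter(4, fc, btype="lowpass", fs=fs, output="sos")
    sos_hp = butter(4, fc, btype="highpass", fs=fs, output="sos")
    low = sosfilt(sos_lp, x)            # side-chain input: the "lost" band
    harm = np.abs(low)                  # full-wave rectifier creates even harmonics
    harm -= np.mean(harm)               # remove the DC the rectifier introduces
    sos_band = butter(4, [fc, 4 * fc], btype="bandpass", fs=fs, output="sos")
    harm = sosfilt(sos_band, harm)      # keep only harmonics the speaker can play
    return sosfilt(sos_hp, x) + gain * harm
```

For an 80 Hz tone and fc = 150 Hz, the output carries energy at 160 Hz and above, which the auditory system associates with the suppressed 80 Hz fundamental.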
P14-5 The Bandwidth of Human Perception and its Implications for Pro Audio—Thomas Lund, Genelec Oy - Iisalmi, Finland; Aki Mäkivirta, Genelec Oy - Iisalmi, Finland
Locked away inside its shell, the brain has only ever learned about the world through our five primary senses. Through them we receive just a fraction of the information actually available, and we perceive far less still: a fraction of a fraction, the perceptual bandwidth. Conscious perception is furthermore subject to about 400 ms of latency and is associated with a temporal grey zone that can only be tapped into via reflexes or training. Based on a broad review of physiological, clinical, and psychological research, the paper proposes three listening strategies that should be distinguished, not only in our daily lives but also when conducting subjective tests: easy listening, trained listening, and slow listening.
Convention Paper 9882
Friday, October 20, 2:00 pm — 3:30 pm
P15-1 Noise Shaping Scheme Suppressing Quantization Noise Amount—Akihiko Yoneya, Nagoya Institute of Technology - Nagoya, Aichi-pref., Japan
A noise shaping scheme for multi-bit pulse-code modulation that suppresses the amount of quantization noise is proposed. With ordinary digital processing or digital-to-analog converters, noise is added over the whole frequency range, and the shaped quantization noise of the source signal may only worsen the total signal-to-noise ratio. The amount of quantization noise should therefore be small even when the noise spectrum is shaped. In the proposed method, the magnitude of the quantization noise is restricted at each sample, and the optimal additional quantization pattern over a receding horizon with respect to a specified perception filter is searched for, in the manner of a look-ahead sigma-delta modulator. With the proposed method the amplitude of the quantization noise is about 0.72 LSB regardless of the perception filter, but a higher-order perception filter requires a wide optimization horizon and a large amount of computation. An example is presented.
Convention Paper 9883
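For readers unfamiliar with noise shaping, a first-order error-feedback quantizer shows the basic mechanism the paper builds on. This is far simpler than the proposed look-ahead, horizon-optimized scheme; the function name and parameters are ours.

```python
import numpy as np

def noise_shape(x, nbits=8):
    """First-order error-feedback requantizer for signals in [-1, 1).
    Feeding the previous quantization error back shapes the noise with a
    (1 - z^-1) highpass characteristic, pushing it toward high frequencies."""
    q = 2.0 ** -(nbits - 1)              # quantization step
    out = np.empty_like(x)
    err = 0.0
    for n, s in enumerate(x):
        v = s - err                      # subtract previous quantization error
        out[n] = np.clip(np.round(v / q) * q, -1.0, 1.0 - q)
        err = out[n] - v                 # error to feed back at the next sample
    return out
```

Because the output error at each sample is the difference of two consecutive quantization errors, its magnitude never exceeds one quantization step, illustrating the kind of per-sample noise bound the paper is concerned with.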
P15-2 Evaluation of the Acoustics of the Roman Theater in Benevento for Discrete Listening Points—Gino Iannace, Università della Campania "Luigi Vanvitelli" - Aversa, Italy; Amelia Trematerra, Università della Campania "Luigi Vanvitelli" - Aversa, Italy
This work reports on the acoustics of the Roman Theater in Benevento, evaluated at discrete listening points positioned in the cavea along three radial directions. The theater, built in the second century A.D., was abandoned due to historical circumstances and natural events; recovery work ended in 1950, and the theater is now the center of important social activities. Acoustic measurements were taken by placing an omnidirectional spherical sound source on the stage and in the orchestra, with the microphone along three distinct radial directions on the steps of the cavea, and the acoustic properties in the various seating areas were measured. The aim of the work is to evaluate in which sectors of the cavea the acoustic parameters are optimal for listening to different types of theatrical performances.
Convention Paper 9884
P15-3 Modeling the Effects of Rooms on Frequency Modulated Tones—Sarah R. Smith, University of Rochester - Rochester, NY, USA; Mark F. Bocko, University of Rochester - Rochester, NY, USA
This paper describes how reverberation impacts the instantaneous frequency tracks of modulated audio signals. Although this effect has been observed in a number of contexts, less work has been done relating these deviations to acoustical parameters of the reverberation. This paper details the instantaneous frequency deviations resulting from a sum of echoes or a set of resonant modes and emphasizes the conditions that maximize the resulting effect. Results of these models are compared with the observed instantaneous frequencies of musical vibrato tones filtered with the corresponding impulse responses. It is demonstrated that these reduced models may adequately reproduce the deviations when the signal is filtered by only the early or low frequency portion of a recorded impulse response.
Convention Paper 9885
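The measurement at the heart of P15-3, extracting an instantaneous frequency track from a vibrato tone before and after room filtering, can be sketched with a Hilbert-transform analytic signal. The two-tap echo "room" below is a hypothetical example of ours, not one of the paper's measured impulse responses.

```python
import numpy as np
from scipy.signal import hilbert, fftconvolve

def inst_freq(x, fs):
    """Instantaneous frequency (Hz) from the analytic signal's phase."""
    phase = np.unwrap(np.angle(hilbert(x)))
    return np.diff(phase) * fs / (2 * np.pi)

fs = 16000
t = np.arange(int(fs * 1.0)) / fs
# 440 Hz tone with 6 Hz, +/-10 Hz vibrato (phase is the integral of frequency)
x = np.sin(2 * np.pi * (440 * t - (10 / (2 * np.pi * 6)) * np.cos(2 * np.pi * 6 * t)))
# hypothetical "room": direct path plus a single 5 ms echo
h = np.zeros(200)
h[0] = 1.0
h[80] = 0.6
y = fftconvolve(x, h)[:len(x)]
dev = inst_freq(y, fs) - inst_freq(x, fs)   # deviation caused by the echo
```

Comparing `dev` for different echo patterns or modal responses reproduces, in miniature, the kind of frequency-track deviations the paper models.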
P15-4 New Research on Low-Frequency Absorption Using Membranes—John Calder, Acoustic Geometry - Minneapolis, MN, USA
Room modes are one of the greatest concerns for accurate sound recording and reproduction. Low-frequency (LF) absorbers are used to mitigate modes; however, most independent testing laboratories are only large enough to measure absorption accurately above 160 Hz. One lab is large enough to be accurate down to 40 Hz. A new LF absorber was designed to complement the capabilities of an original LF absorber. In summary, the type of absorber, and its location and orientation in a room, are all critical to LF absorber effectiveness. Without standardized absorption testing in a laboratory capable of accurate measurement down to 40 Hz, it is difficult to state conclusively that low-frequency absorber products perform as claimed.
Convention Paper 9886
P15-5 Analysis of Drum Machine Kick and Snare Sounds—Jordie Shier, University of Victoria - Victoria, Canada; Kirk McNally, University of Victoria, School of Music - Victoria, BC, Canada; George Tzanetakis, University of Victoria - Victoria, BC, Canada
The use of electronic drum samples is widespread in contemporary music productions, with music producers having an unprecedented number of samples available to them. The development of new tools to assist users in organizing and managing libraries of this type requires comprehensive audio analysis that is distinct from that used for general classification or onset detection tasks. In this paper 4230 kick and snare samples, representing 250 individual electronic drum machines, are evaluated. Samples are segmented into different lengths and analyzed using comprehensive audio feature analysis. Audio classification is used to evaluate and compare the effect of this time segmentation and to establish the overall effectiveness of the selected feature set. Results demonstrate an improvement in classification scores when time segmentation is used as a pre-processing step.
Convention Paper 9887
P15-6 Dynamic Range Controller Ear Training: Description of a Methodology, Software Application, and Required Stimuli—Denis Martin, McGill University - Montreal, QC, Canada; CIRMMT - Montreal, QC, Canada; George Massenburg, Schulich School of Music, McGill University - Montreal, Quebec, Canada; Centre for Interdisciplinary Research in Music Media and Technology (CIRMMT) - Montreal, Quebec, Canada; Richard King, McGill University - Montreal, Quebec, Canada; The Centre for Interdisciplinary Research in Music Media and Technology - Montreal, Quebec, Canada
Several successful spectral ear training software applications are now available and being used by individuals and audio institutions around the world. While some listener training applications address other audio attributes, they have not received the same level of development and refinement. A methodology, software application, and the required stimuli for a dynamic range controller ear training program are described herein. This program, based on ideas developed for spectral ear training, addresses several limitations of earlier dynamic range controller ear training programs. It has been designed for web access, making use of the Web Audio API for audio processing, a custom audio compressor design, and a wide range of musical stimuli.
Convention Paper 9888
P15-7 An “Infinite” Sustain Effect Designed for Live Guitar Performance—Mark Rau, Center for Computer Research in Music and Acoustics, Stanford University - Palo Alto, CA, USA; McGill University - Montreal, QC, Canada; Orchisama Das, Center for Computer Research in Music and Acoustics (CCRMA), Stanford University - Palo Alto, CA, USA
An audio effect that extends the sustain of a musical note in real time is implemented on a fixed-point, standalone processor. Onset detection is used to detect new musical notes; once a note decays to steady state, the audio is looped indefinitely until a new note onset occurs. To loop the audio properly, pitch detection is performed to extract one period, and the new output buffer is written in a phase-aligned manner.
Convention Paper 9889
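The looping step of P15-7 can be sketched offline. This sketch uses a simple autocorrelation pitch estimator and omits the paper's real-time onset detection and any crossfading; all names and parameters are our own.

```python
import numpy as np

def pitch_period(frame, fs, f_lo=80.0, f_hi=1000.0):
    """Estimate the pitch period in samples via autocorrelation,
    searching lags between fs/f_hi and fs/f_lo."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / f_hi), int(fs / f_lo)
    return lo + int(np.argmax(ac[lo:hi]))

def sustain(note, fs, seconds):
    """Loop one pitch period of the (assumed steady-state) tail of `note`
    to extend it. Repeating a whole period keeps the loop point
    approximately phase-aligned, as in the paper's buffer-writing scheme."""
    p = pitch_period(note[-4 * int(fs / 80):], fs)  # analyze the last few periods
    period = note[-p:]                              # one whole period from the tail
    reps = int(np.ceil(seconds * fs / p))
    return np.concatenate([note, np.tile(period, reps)[:int(seconds * fs)]])
```

A real-time version would run the pitch estimate once at steady state and then stream the stored period into the output buffer until the next onset.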
Saturday, October 21, 9:00 am — 12:30 pm
P16-1 On Data-Driven Approaches to Head-Related-Transfer Function Personalization—Haytham Fayek, Oculus Research and Facebook - Redmond, WA, USA; Laurens van der Maaten, Facebook AI Research - New York, NY, USA; Griffin Romigh, Oculus Research - Redmond, WA, USA; Ravish Mehra, Oculus Research - Redmond, WA, USA
Head-Related Transfer Function (HRTF) personalization is key to improving spatial audio perception and localization in virtual auditory displays. We investigate the task of personalizing HRTFs from anthropometric measurements, which can be decomposed into two subtasks: Interaural Time Delay (ITD) prediction and HRTF magnitude spectrum prediction. We explore both problems using state-of-the-art machine learning (ML) techniques. First, we show that ITD prediction can be significantly improved by smoothing the ITD using a spherical harmonics representation. Second, our results indicate that prior unsupervised dimensionality-reduction-based approaches may be unsuitable for HRTF personalization. Last, we show that neural network models trained on the full HRTF representation improve HRTF prediction compared to prior methods.
Convention Paper 9890
P16-2 Eigen-Images of Head-Related Transfer Functions—Christoph Hold, Technische Universität Berlin - Berlin, Germany; Fabian Seipel, TU Berlin - Berlin, Germany; Fabian Brinkmann, Technical University of Berlin - Berlin, Germany; Athanasios Lykartsis, TU Berlin - Berlin, Germany; Stefan Weinzierl, Technical University of Berlin - Berlin, Germany
The individualization of head-related transfer functions (HRTFs) leads to perceptually enhanced virtual environments. In particular, the peak-notch structure in HRTF spectra, which depends on the listener's specific head and pinna anthropometry, contains crucial auditory cues, e.g., for the perception of sound source elevation. Inspired by the eigen-faces approach, we have decomposed image representations of individual full spherical HRTF data sets into linear combinations of orthogonal eigen-images by principal component analysis (PCA). These eigen-images reveal regions of inter-subject variability across sets of HRTFs depending on direction and frequency. Results show common features as well as spectral variation within the individual HRTFs. Moreover, we can statistically de-noise the measured HRTFs using dimensionality reduction.
Convention Paper 9891
P16-3 A Method for Efficiently Calculating Head-Related Transfer Functions Directly from Head Scan Point Clouds—Rahulram Sridhar, Princeton University - Princeton, NJ, USA; Edgar Choueiri, Princeton University - Princeton, NJ, USA
A method is developed for efficiently calculating head-related transfer functions (HRTFs) directly from head scan point clouds of a subject, using a database of HRTFs, and corresponding head scans, of many subjects. Consumer applications require that HRTFs be estimated accurately and efficiently, but existing methods do not meet both requirements simultaneously. The presented method uses efficient matrix multiplications to compute HRTFs from spherical harmonic representations of head scan point clouds, which may be obtained from consumer-grade cameras. The method was applied to a database of only 23 subjects; while the calculated interaural time difference errors are above estimated perceptual thresholds for some spatial directions, HRTF spectral distortions up to 6 kHz fall below perceptual thresholds for most directions.
Convention Paper 9892
P16-4 Head Rotation Data Extraction from Virtual Reality Gameplay Using Non-Individualized HRTFs—Juan Simon Calle, New York University - New York, NY, USA; THX; Agnieszka Roginska, New York University - New York, NY, USA
A game was created to analyze subjects' head rotation while localizing a sound in a 360-degree sphere during VR gameplay. In this game the subjects are asked to locate a series of sounds randomly placed on a sphere around their heads, rendered using generalized HRTFs. The only instruction given is to locate each sound as quickly and accurately as possible by looking at where the sound is and then pressing a trigger. The tool was tested with 16 subjects. On average, subjects took 3.7 ± 1.8 seconds to locate a sound, with an average accuracy error of 15.4 degrees. Subjects began moving their heads after approximately 0.2 seconds, and the average rotation speed peaked at 0.8 seconds, reaching approximately 102 degrees per second.
Convention Paper 9893
P16-5 Relevance of Headphone Characteristics in Binaural Listening Experiments: A Case Study—Florian Völk, Technische Universität München - Munich, Germany; WindAcoustics - Windach, Germany; Jörg Encke, Technical University of Munich - Munich, Germany; Jasmin Kreh, Technical University of Munich - Munich, Germany; Werner Hemmert, Technical University of Munich - Munich, Germany
Listening experiments typically target performance and capabilities of the auditory system. Another common application scenario is the perceptual validation of algorithms and technical systems. In both cases, systems other than the device or subject under test must not affect the results in an uncontrolled manner. Binaural listening experiments require that two signals with predefined amplitude or phase differences stimulate the left and right ear, respectively. Headphone playback is a common method for presenting the signals. This study quantifies potential headphone-induced interaural differences by physical measurements on selected circum-aural headphones and by comparison to psychoacoustic data. The results indicate that perceptually relevant effects may occur in binaural listening experiments, traditional binaural headphone listening, and virtual acoustics rendering such as binaural synthesis.
Convention Paper 9894
P16-6 Evaluating Binaural Reproduction Systems from Behavioral Patterns in a Virtual Reality—A Case Study with Impaired Binaural Cues and Tracking Latency—Olli Rummukainen, International Audio Laboratories Erlangen - Erlangen, Germany; Fraunhofer IIS - Erlangen, Germany; Sebastian Schlecht, International Audio Laboratories - Erlangen, Germany; Axel Plinge, International Audio Laboratories Erlangen - Erlangen, Germany; Emanuël A. P. Habets, International Audio Laboratories Erlangen - Erlangen, Germany
This paper proposes a method for evaluating real-time binaural reproduction systems by means of a wayfinding task in six degrees of freedom. Participants physically walk to sound objects in a virtual reality created by a head-mounted display and binaural audio. The method allows for comparative evaluation of different rendering and tracking systems. We show how the localization accuracy of spatial audio rendering is reflected by objective measures of the participants' behavior and task performance. As independent variables we add tracking latency or reduce the binaural cues. We provide a reference scenario with loudspeaker reproduction and an anchor scenario with monaural reproduction for comparison.
Convention Paper 9895
P16-7 Coding Strategies for Multichannel Wiener Filters in Binaural Hearing Aids—Roberto Gil-Pita, University of Alcalá - Alcalá de Henares, Madrid, Spain; Beatriz Lopez-Garrido, Servicio de Salud de Castilla la Mancha (SESCAM) - Castilla-Mancha, Spain; Manuel Rosa-Zurera, University of Alcalá - Alcalá de Henares, Madrid, Spain
Binaural hearing aids use spatial sound techniques to increase intelligibility, but the design of algorithms for these devices faces strong constraints. To minimize power consumption and maximize battery life, the digital signal processors embedded in these devices have very low clock frequencies and small amounts of available memory. In the binaural case the wireless communication between the two hearing devices also increases power consumption, making it necessary to study the relationship between intelligibility improvement and required transmission bandwidth. In this sense, this paper proposes and compares several coding strategies for the implementation of binaural multichannel Wiener filters, with the aim of keeping communication bandwidth and transmission power minimal. The obtained results demonstrate the suitability of the proposed coding strategies.
Convention Paper 9896
Saturday, October 21, 9:00 am — 12:00 pm
P17-1 Challenges of Audio Forensic Evaluation from Personal Recording Devices—Robert C. Maher, Montana State University - Bozeman, MT, USA
Typical law enforcement audio forensic investigations involve audio evidence recorded under less-than-ideal circumstances by mobile phones, surveillance systems, and personal audio recorders. Moreover, the audio information is often transmitted and stored using a data compression algorithm such as a speech coder (e.g., VSELP) or a wideband audio coder (e.g., MP3). There are few systematic studies of the signal behavior of these systems for forensically-relevant audio material, and this may discourage a forensic examiner from using such acoustic evidence to draw reliable conclusions. This paper includes simulation and evaluation of personal audio recording systems in the context of audio forensics. The results indicate areas of strength and weakness in the forensic realm.
Convention Paper 9897
P17-2 An Acoustic Study of Airbag Deployment in Vehicles—John Vanderkooy, University of Waterloo - Waterloo, ON, Canada; Kevin Krauel, University of Waterloo - Waterloo, Ontario, Canada
This study shows the acoustic pressures produced in typical airbag deployments and introduces the topic to the AES. Two representative vehicles were tested: a 2005 Pontiac Montana SV6 minivan and a 2006 Mazda 3 hatchback. Microphones were placed at the left driver ear, right passenger ear, and rear seat positions. Wideband pressure data were obtained for each of the steering wheel, passenger, and any optional side airbags. Our data agree with earlier studies. Weighted and unweighted peak sound pressure levels are calculated for various deployment scenarios. The influence of the cabin volume and the vents of the vehicles is discussed. Concerns over hearing loss, possible eardrum perforation, and other hearing-related symptoms are considered, drawing mainly on important earlier studies. Some aspects are counterintuitive.
Convention Paper 9898
P17-3 CLEAR: Conditionally Lossless Encoding under Allowed Rates for Low-Delay Sound Data Transmission—Ryosuke Sugiura, NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation - Kanagawa, Japan; Yutaka Kamamoto, NTT Communication Science Laboratories - Kanagawa, Japan; Noboru Harada, NTT Communication Science Labs - Atsugi-shi, Kanagawa-ken, Japan; Takahito Kawanishi, NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation - Kanagawa, Japan; Takehiro Moriya, NTT Communication Science Labs - Atsugi-shi, Kanagawa-ken, Japan
We present in this paper a near-lossless full-band stereo compression scheme, Conditionally Lossless Encoding under Allowed Rates (CLEAR), aimed at real-time transmission of sound data that is to be mixed or processed after transmission. Using a uniform quantizer with MPEG-4 Audio Lossless Coding (ALS) and adaptive pre- and post-processing, CLEAR controls the encoding bit rate while maximizing the fidelity of the reconstructed signals. Objective experiments show an enhancement in signal-to-noise ratio (SNR) over conventional low-delay codecs with comparable perceptual quality. Additionally, a companding-based perceptual weighting designed for CLEAR is shown to improve Perceptual Evaluation of Audio Quality (PEAQ) scores.
Convention Paper 9899
P17-4 A New THD+N Algorithm for Measuring Today's High Resolution Audio Systems—Alfred Roney, MathWorks, Inc. - Natick, MA, USA; Steve Temme, Listen, Inc. - Boston, MA, USA
We present a mathematical definition of Total Harmonic Distortion + Noise (THD+N) suitable for testing high-resolution digital audio systems. This formal definition of the "distortion analyzer" mentioned in AES17 defines THD+N as the RMS error of fitting a sinusoid to a noisy and distorted sequence of measurements. We present the key theoretical result that under realistic conditions a modern THD+N analyzer is well described by a normal probability distribution with a simple relationship between relative error and analysis dwell time. These findings are illustrated by comparing the output of a commercial distortion analyzer to our proposed method using Monte Carlo simulations of noisy signal channels. We demonstrate that the bias of a well-designed distortion analyzer is negligible.
Convention Paper 9900
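The sinusoid-fitting definition of THD+N described in the abstract above can be sketched in a few lines. This is an illustrative least-squares implementation of the general idea, not the authors' analyzer or the AES17 text:

```python
import numpy as np

def thd_n(x, f0, fs):
    """THD+N as the RMS of the residual after a least-squares sinusoid
    fit at frequency f0, relative to the RMS of the fitted sinusoid.
    A sketch of the 'distortion analyzer' idea, not the AES17 wording."""
    n = np.arange(len(x))
    # Design matrix: cosine, sine, and DC terms at the test frequency.
    M = np.column_stack([np.cos(2 * np.pi * f0 * n / fs),
                         np.sin(2 * np.pi * f0 * n / fs),
                         np.ones(len(x))])
    coef, *_ = np.linalg.lstsq(M, x, rcond=None)
    fit = M @ coef
    residual = x - fit          # everything the sinusoid model can't explain
    return np.sqrt(np.mean(residual ** 2)) / np.sqrt(np.mean(fit ** 2))

# Usage: 100 exact cycles of a 1 kHz tone with a small 3rd harmonic;
# the harmonic alone survives the fit, so THD+N comes out near 0.001.
fs, f0 = 48000, 1000.0
t = np.arange(4800) / fs
x = np.sin(2 * np.pi * f0 * t) + 0.001 * np.sin(2 * np.pi * 3 * f0 * t)
print(thd_n(x, f0, fs))  # ≈ 0.001
```

Because the fit is linear in its parameters, the estimator is cheap and its error statistics lend themselves to the kind of distributional analysis the abstract mentions.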
P17-5 Influences of a Key Map on Soundwalk Exploration with a Textile Sonic Map—Alessia Milo, Queen Mary University of London - London, UK; Nick Bryan-Kinns, Queen Mary University of London - London, UK; Media and Arts Technology Centre for Doctoral Training; Joshua D. Reiss, Queen Mary University of London - London, UK
Sonic maps are an increasingly popular form of exploring soundscapes and are a possible means of communicating the experience of a soundwalk. We describe how a printed key influenced exploration of an interactive textile sonic map. We explain the technology behind the map, employing capacitive sensing and real-time audio processing. The sonic map contained 18 binaural recordings extracted from a soundwalk. Thirty participants explored the map. The strengths and limitations of the interfaces were established, and participants’ modes of exploration were identified. Results show how the use of the key map levelled the location preference. The participants’ experience with the interface suggested possible uses of e-textiles for promoting soundscape awareness, for soundscape studies, and in the field of interactive audio.
Convention Paper 9901
P17-6 Challenges of IoT Smart Speaker Testing—Glenn Hess, Indy Acoustic Research LLC - Indianapolis, IN, USA; Daniel Knighten, Listen, Inc. - Boston, MA, USA
Quantitatively measuring the audio characteristics of IoT (Internet of Things) smart speakers presents several novel challenges. We discuss overcoming the practical challenges of testing such devices and demonstrate how to measure frequency response, distortion, and other common audio characteristics. In order to make these measurements, several measurement techniques and algorithms are presented that allow us to move past the practical difficulties presented by this class of emerging audio devices. We discuss test equipment requirements, selection of test signals, and especially overcoming the challenges around injecting and extracting test signals from the device.
Convention Paper 9902
Saturday, October 21, 2:00 pm — 4:30 pm
P18-1 Dynamic Diffuse Signal Processing for Low-Frequency Spatial Variance Minimization across Wide Audience Areas—Jonathan Moore, University of Derby - Stockport, UK; Adam J. Hill, University of Derby - Derby, Derbyshire, UK; Gand Concert Sound - Elk Grove Village, IL, USA
Diffuse signal processing (DiSP) is a method of decorrelating coherent audio signals that is applicable to various components of sound reinforcement systems. Previous tests have indicated that DiSP can successfully decorrelate multiple low-frequency sources, leading to the reduction of comb filtering effects. However, results also show that performance is variable with source material and that effectiveness is reduced in closed acoustic spaces. In this work a dynamic variant of DiSP is examined where the decorrelation algorithm varies over time. The effectiveness of the processing is analyzed and compared to static DiSP and unprocessed systems. Results show that dynamic DiSP provides superior low-frequency spatial variance reduction to static DiSP due to improved decorrelation between direct sounds and early reflections.
Convention Paper 9903
P18-2 A Novel Procedure for Direct-Method Measurement of the Full-Matrix Speech Transmission Index—Jan A. Verhave, Embedded acoustics BV - Delft, The Netherlands; Sander van Wijngaarden, Embedded acoustics BV - Delft, The Netherlands
When measuring the Speech Transmission Index (STI), until now one had to choose between two alternatives: impulse-response based full STI measurements (indirect method), or measurements based on modulated STIPA signals (direct method). Limitations apply when using either method. A novel procedure is proposed to measure the full STI through the direct method. The procedure combines advantages of indirect full STI measurements and direct STIPA measurements, completing a full STI measurement in 65.52 seconds. Similar to STIPA, the test signal is simultaneously modulated with 2 modulation frequencies per octave band. However, a rotation scheme is applied that uses a different set of modulation frequencies during different stages of the measurement, ending up with a full matrix (7 octaves x 14 modulation frequencies).
Convention Paper 9904
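The rotation idea in the abstract above (2 modulation frequencies per octave band per stage, rotated so that 7 stages cover the full 7 × 14 matrix) can be illustrated with a toy schedule. The index assignment below is invented for illustration and is not the paper's actual scheme:

```python
# Toy rotation schedule: each stage modulates each octave band with 2 of
# 14 modulation-frequency indices; the offset rotates per stage so that
# 7 stages cover all 14 indices in every band. Indices are illustrative.
N_BANDS, N_MOD, PER_STAGE = 7, 14, 2
N_STAGES = N_MOD // PER_STAGE  # 7 stages

def schedule(stage, band):
    # Offset each band so simultaneously measured bands use different
    # modulation frequencies, and rotate the pair from stage to stage.
    base = (2 * stage + band * PER_STAGE) % N_MOD
    return [base, (base + 1) % N_MOD]

covered = {(band, f)
           for stage in range(N_STAGES)
           for band in range(N_BANDS)
           for f in schedule(stage, band)}
print(len(covered))  # 98 = the full 7 x 14 modulation matrix
```

The point of such a rotation is that each stage still looks like a STIPA-style signal (2 modulations per band), yet the union over stages yields the complete full-STI matrix.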
P18-3 Blind Estimation of the Reverberation Fingerprint of Unknown Acoustic Environments—Prateek Murgai, Center for Computer Research in Music and Acoustics (CCRMA), Stanford University - Palo Alto, CA, USA; Mark Rau, Center for Computer Research in Music and Acoustics, Stanford University - Palo Alto, CA, USA; McGill University - Montreal, QC, Canada; Jean-Marc Jot, Magic Leap - Sunnyvale, CA, USA
Methods for blind estimation of a room’s reverberation properties have been proposed for applications including speech dereverberation and audio forensics. In this paper, we study and evaluate algorithms for online estimation of a room’s “reverberation fingerprint”, defined by its volume and its frequency-dependent diffuse reverberation decay time. Both quantities are derived adaptively by analyzing a single-microphone reverberant signal recording, without access to acoustic source reference signals. The accuracy and convergence of the proposed techniques is evaluated experimentally against the ground truth obtained from geometric and impulse response measurements. The motivations of the present study include the development of improved headphone 3D audio rendering techniques for mobile computing devices.
Convention Paper 9905
P18-4 Microphone Selection Based on Direct to Reverberant Ratio Estimation—Alexis Favrot, Illusonic GmbH - Lausanne, Switzerland; Christof Faller, Illusonic GmbH - Uster, Zürich, Switzerland; EPFL - Lausanne, Switzerland
Microphone recording in a room is ideally carried out with a close microphone to avoid reverberation and noise, but this restricts the flexibility and the area covered by the recording. When using multiple distant microphones, a microphone selection algorithm is needed to select the momentarily best microphone, namely the one with the least reverberation. Given several microphones arbitrarily distributed in a room, this paper describes an algorithm that, based on an estimate of the direct-to-reverberant ratio for each microphone, switches to the best microphone. The algorithm allows prioritizing a microphone and compensating for different directivity patterns.
Convention Paper 9906
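Once per-microphone direct-to-reverberant ratio (DRR) estimates are available, the selection step described above reduces to an argmax. The sketch below, including the priority offsets, is one hypothetical reading of the abstract, not the authors' algorithm:

```python
import numpy as np

def select_microphone(drr_db, priority_db=None):
    """Return the index of the microphone with the highest (optionally
    priority-weighted) estimated DRR in dB. The additive priority
    offsets are an assumed mechanism for 'prioritizing a microphone'."""
    drr = np.asarray(drr_db, dtype=float)
    if priority_db is not None:
        drr = drr + np.asarray(priority_db, dtype=float)
    return int(np.argmax(drr))

# Usage: three mics; mic 1 has the best raw DRR, but a +4 dB priority
# on mic 0 overrides the raw estimates.
print(select_microphone([3.0, 5.0, -1.0]))                   # 1
print(select_microphone([3.0, 5.0, -1.0], [4.0, 0.0, 0.0]))  # 0
```

In practice the switching decision would be smoothed over time (e.g., with hysteresis) to avoid rapid toggling between microphones with similar DRR estimates.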
P18-5 Experimental Investigation on Varied Degrees of Sound Field Diffuseness in Full Scale Rooms—Alejandro Bidondo, Universidad Nacional de Tres de Febrero - UNTREF - Caseros, Buenos Aires, Argentina; Sergio Vazquez, Universidad Nacional de Tres de Febrero - Buenos Aires, Argentina; Javier Vazquez, Universidad Nacional de Tres de Febrero - Buenos Aires, Argentina
Sound field diffusion in enclosures should be experimentally quantified from measured room impulse responses, at least to know how many scattering surfaces produce a sufficiently diffuse sound field for each application. To achieve this, a parameter still under development, the Sound Field Diffusion Coefficient (SFDC), was applied. The SFDC expresses the control of reflection amplitudes and the Gaussianity of their temporal distribution, using third-octave-band, energy-decay-compensated impulse responses, referenced to average SFDC results from a set of impulse responses synthesized with Gaussian white noise. To demonstrate the quantification capability of the SFDC, a systematic investigation was conducted in which varied room configurations with carefully designed scattering interior surfaces were examined, under the hypothesis that varied degrees of surface scattering lead to varied degrees of sound field diffusion inside two full-scale rooms. To this end, each room's floor was covered with configurations ranging from no diffusers to 16.74 m2 of diffusely reflecting surfaces, in 3 steps. This paper discusses the experimental design and evaluates the results of data collected with systematic modifications of the degree of surface scattering, each with combinations of different source orientations and microphone positions.
Convention Paper 9907