Authors: Hoffbauer, Elias; Frank, Matthias
Affiliation: Institute of Electronic Music and Acoustics, University of Music and Performing Arts, Graz, Austria
For the creation of convincing virtual acoustics of existing rooms and spaces, it is useful to apply measured Ambisonic room impulse responses (ARIRs) as a convolution reverb. Typically, tetrahedral arrays offering only first-order resolution are the preferred practical choice for measurements, because they are easily available and processed. In contrast, higher order is preferred in playback because it is superior in terms of localization accuracy and spatial clarity. There are a number of algorithms that enhance the spatial resolution of first-order ARIRs. However, these algorithms may introduce coloration and artifacts. This paper presents an improvement of the Ambisonic Spatial Decomposition Method by using four directions simultaneously. The additional signals increase the echo density and thereby better preserve the diffuse sound field components during the process of enhancing measured first-order ARIRs to higher orders. An instrumental validation and a series of listening experiments compare the proposed Four-Directional Ambisonic Spatial Decomposition Method with other existing algorithms and show that it matches the best algorithm in terms of enhanced spatial clarity and coloration while producing the fewest artifacts.
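The four-directional idea can be illustrated with a minimal sketch: steering four cardioid beams at tetrahedral directions from the four first-order (B-format) channels of an ARIR. This is not the paper's implementation (which additionally performs the spatial decomposition and re-encoding to higher order); the direction set, channel ordering, and SN3D-like scaling are assumptions for illustration.

```python
import numpy as np

# Tetrahedral beam directions (unit vectors); an illustrative assumption
DIRS = np.array([
    [ 1,  1,  1],
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
]) / np.sqrt(3)

def four_directional_signals(w, x, y, z):
    """Extract four cardioid beam signals from a first-order (B-format) ARIR.

    w, x, y, z: 1-D arrays of equal length (omni and the three dipole
    channels; SN3D-like scaling assumed). Returns shape (4, n_samples).
    """
    xyz = np.stack([x, y, z])                # (3, n)
    # Cardioid beam: 0.5 * (omni + dipole projected onto the beam axis)
    return 0.5 * (w + DIRS @ xyz)
```

For a plane wave arriving exactly along one beam axis, that beam recovers the full signal while the other three attenuate it, which is what lets four simultaneous directions capture more of the reverberant field than a single steered beam.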
Authors: O’Dwyer, Hugh; Boland, Francis
Affiliation: Trinity College Dublin, Dublin, Ireland
This study shows how spherical sound source localization of binaural audio signals in the mismatched head-related transfer function (HRTF) condition can be improved by implementing HRTF clustering when using machine learning. A new feature set of cross-correlation function, interaural level difference, and Gammatone cepstral coefficients is introduced and shown to outperform state-of-the-art methods in vertical localization in the mismatched HRTF condition by up to 5%. By examining the performance of Deep Neural Networks trained on single HRTF sets from the CIPIC database on other HRTFs, it is shown that HRTF sets can be clustered into groups of similar HRTFs. This results in the formulation of central HRTF sets representative of their specific cluster. By training a machine learning algorithm on these central HRTFs, it is shown that a more robust algorithm can be trained capable of improving sound source localization accuracy by up to 13% in the mismatched HRTF condition. Concurrently, localization accuracy is decreased by approximately 6% in the matched HRTF condition, which accounts for less than 9% of all test conditions. Results demonstrate that HRTF clustering can vastly improve the robustness of binaural sound source localization to unseen HRTF conditions.
Authors: Dipassio, Tre; Heilemann, Michael C.; Bocko, Mark F.
Affiliation: University of Rochester, Rochester, NY
The microphones and loudspeakers of modern compact electronic devices such as smartphones and tablets typically require case penetrations that leave the device vulnerable to environmental damage. To address this, the authors propose a surface-based audio interface that employs force actuators for reproduction and structural vibration sensors to record the vibrations of the display panel induced by incident acoustic waves. This paper reports experimental results showing that recorded speech signals are of sufficient quality to enable high-reliability automatic speech recognition despite degradation by the panel's resonant properties. The authors report the results of experiments in which acoustic waves containing speech were directed to several panels, and the subsequent vibrations of the panels' surfaces were recorded using structural sensors. The recording quality was characterized by measuring the speech transmission index, and the recordings were transcribed to text using an automatic speech recognition system from which the resulting word error rate was determined. Experiments showed that the word error rate (10%--13%) achieved for the audio signals recorded by the method described in this paper was comparable to that for audio captured by a high-quality studio microphone (10%). The authors also demonstrated a crosstalk cancellation method that enables the system to simultaneously record and play audio signals.
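The word error rate used to evaluate the transcriptions is the word-level Levenshtein distance (substitutions, insertions, and deletions) divided by the reference length. A minimal sketch of that standard metric (not code from the paper):

```python
def word_error_rate(reference, hypothesis):
    """WER: (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # delete all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution/match
    return d[len(ref)][len(hyp)] / len(ref)
```

On this scale, the panel recordings' reported 10%--13% WER sits close to the studio microphone's 10%.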
Authors: Schwabe, Markus; Murgul, Sebastian; Heizmann, Michael
Affiliation: Institute of Industrial Information Technology (IIIT), Karlsruhe Institute of Technology, Karlsruhe, Germany
Automatic music transcription with note-level output is an active task in the field of music information retrieval. In contrast to the piano case, with very good results obtained using available large datasets, transcription of non-professional singing has rarely been investigated with deep learning approaches because of the lack of note-level annotated datasets. In this work, two datasets of amateur singing recordings are created, one for training (synthetic singing dataset) and one for evaluation (SingReal dataset). The synthetic training dataset is generated by synthesizing a large set of vocal melodies from artificial songs. Because the evaluation should represent a realistic scenario, the SingReal dataset is created from real recordings of non-professional singers. To transcribe singing notes, a new method called Dual Task Monophonic Singing Transcription is proposed, which divides the problem of singing transcription into the two subtasks of onset detection and pitch estimation, realized by two small independent neural networks. This approach achieves a note-level F1 score of 74.19% on the SingReal dataset, outperforming all state-of-the-art transcription systems investigated by at least 3.5%. Furthermore, Dual Task Monophonic Singing Transcription can be adapted very easily to the real-time transcription case.
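The dual-task split implies a final assembly step: detected onsets segment the recording, and the frame-wise pitch estimates within each segment are reduced to one note pitch. A minimal sketch of such an assembly stage, with the frame duration, unvoiced convention (pitch 0), and median reduction all assumptions rather than the paper's exact post-processing:

```python
def assemble_notes(onsets, pitches, frame_dur=0.01):
    """Combine onset-detector and pitch-estimator outputs into note events.

    onsets: sorted frame indices where the onset network fires.
    pitches: per-frame MIDI pitch estimates (0 = unvoiced), one per frame.
    Returns a list of (onset_time, offset_time, note_pitch) tuples.
    """
    notes = []
    bounds = list(onsets) + [len(pitches)]          # close the last segment
    for start, end in zip(bounds[:-1], bounds[1:]):
        voiced = [p for p in pitches[start:end] if p > 0]
        if voiced:
            pitch = sorted(voiced)[len(voiced) // 2]  # median frame pitch
            notes.append((start * frame_dur, end * frame_dur, pitch))
    return notes
```

Keeping the two networks independent means either one can be swapped or run incrementally, which is what makes the real-time adaptation straightforward.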
Authors: Lanterman, Aaron D.; Hasler, Jennifer O.
Affiliation: School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA
This paper explores voltage-controlled amplifier (VCA) designs based on operational transconductance amplifiers (OTAs) on a floating-gate--based Field-Programmable Analog Array (FPAA). Although preconfigured OTAs are available on the target FPAA, their gain must be fixed during the programming stage. Hence, the OTA that forms the variable-gain element of the VCA must be constructed from the individual transistors that are also available on the FPAA. The current output of this more-flexible OTA is converted to a voltage via one of the built-in fixed-gain OTAs. The authors show how the use of a special floating-gate OTA with voltage attenuation at its inputs arising from capacitor dividers (analogous to resistor dividers used in traditional printed circuit board--level VCA designs) helps prevent a diverging nonlinearity from ruining the current-to-voltage conversion process. This exercise highlights the counterintuitive challenges facing engineers moving from board-level audio design with off-the-shelf chips and discrete bipolar junction transistors to very large--scale integration--level design with complementary metal oxide semiconductor technology.
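The role of the input attenuation can be seen from the textbook subthreshold OTA model, whose output current follows a tanh of the differential input voltage: scaling the input down keeps operation near the linear region of the tanh. A numerical sketch of that model (parameter values and the attenuation factor are illustrative assumptions, not values from the paper):

```python
import math

def ota_output_current(v_diff, i_bias, attenuation=1.0, ut=0.0257, kappa=0.7):
    """Subthreshold OTA output current: i_bias * tanh of the scaled input.

    v_diff: differential input voltage (V); i_bias: tail bias current (A);
    attenuation: capacitive-divider factor at the inputs (< 1 shrinks the
    effective input swing); ut: thermal voltage; kappa: gate coupling.
    """
    return i_bias * math.tanh(kappa * attenuation * v_diff / (2 * ut))
```

With attenuation well below 1, a given input swing stays on the gentle slope of the tanh instead of driving the output current into saturation, which is the effect the capacitor dividers exploit.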