Meeting Topic: Recognizing and Separating Sounds: Deep Learning in Real-World Audio Signal Processing
Speaker Name: John Woodruff, Knowles Electronics
Meeting Location: Shure Incorporated, 5800 W. Touhy Ave, Niles, IL 60714
The presentation began by introducing the basic concepts of auditory scene analysis (ASA) and discussing some of the work of Al Bregman. A well-known analogy compares the human ability to hear multiple sound sources to determining the number and position of boats on a lake solely from the waves arriving at the shore. Many tasks, such as separating the sounds of multiple instruments, are easy for humans but difficult for computers.
Next, attention turned to the concepts of deep learning. The current tool of choice is the neural network. Large neural networks with many parameters can be trained on large amounts of data; these are called "deep networks." Research has shown that deep networks (with many layers) tend to perform better than broad networks (many nodes in each layer). John showed an example of using a neural network for facial recognition. The shallow layers accomplish simple tasks, like edge detection. The middle layers handle more complex entities, such as the nose, eyes, and other features. The deepest layers are where the whole face can finally be identified.
A few different approaches to sound separation algorithms were described. In each case the goal is to isolate the target/desired sound (typically someone talking) from any interfering/distracting sounds (noise).
With a speech enhancement approach, the assumption is that the target signal has different spectro-temporal characteristics from the interference. A Wiener filter can be used to attenuate the time-frequency locations that are primarily noise. While this can work well, there is often some distortion of the target signal.
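The Wiener-style attenuation described above can be sketched as a per-bin gain computed from power spectra. This is a minimal illustration and not the speaker's implementation: the function name `wiener_gain`, the `gain_floor` parameter, and the toy power values are all assumptions, and a real system would also need a noise estimator and an STFT front end.

```python
import numpy as np

def wiener_gain(noisy_power, noise_power, gain_floor=0.1):
    """Per-bin Wiener-style gain from power spectra (hypothetical helper)."""
    # Estimate target power by subtracting the noise estimate, clamped at zero.
    speech_power = np.maximum(noisy_power - noise_power, 0.0)
    # Wiener gain = S / (S + N); the small epsilon avoids division by zero.
    gain = speech_power / (speech_power + noise_power + 1e-12)
    # A gain floor limits how far noise-dominated bins are attenuated.
    return np.maximum(gain, gain_floor)

# Toy 2x2 time-frequency grid (made-up values, rows = frames, cols = bins).
noisy_power = np.array([[10.0, 1.0], [1.0, 4.0]])
noise_power = np.array([[1.0, 1.0], [1.0, 1.0]])
gain = wiener_gain(noisy_power, noise_power)
```

Bins dominated by noise are pushed down to the floor, while bins dominated by the target pass almost unchanged. The floor is one common way to trade residual noise against distortion of the target.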
With a beamforming approach, the goal is to use multiple microphone elements to create a tight acoustic pickup pattern pointed at the target. The assumption is that the target and interference are spatially separated. Performance depends largely on the number and position of the microphones, and on whether the beam is correctly pointed at the target.
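One of the simplest beamformers, delay-and-sum, illustrates why pointing the beam correctly matters. The sketch below is an assumption-laden toy and not something taken from the talk: the function name `delay_and_sum` is hypothetical, delays are whole samples, and `np.roll` wraps around at the edges, which a real implementation would avoid.

```python
import numpy as np

def delay_and_sum(mic_signals, delays_samples):
    """Align and average channels (hypothetical helper).

    mic_signals: array of shape (n_mics, n_samples).
    delays_samples: integer arrival delay of the target at each microphone.
    """
    n_mics, n_samples = mic_signals.shape
    out = np.zeros(n_samples)
    for sig, d in zip(mic_signals, delays_samples):
        # Advance each channel by its arrival delay so the target lines up.
        out += np.roll(sig, -d)
    return out / n_mics

# Toy example: a single pulse reaches three microphones 0, 2, and 4 samples late.
n = 32
target = np.zeros(n)
target[10] = 1.0
delays = [0, 2, 4]
mics = np.stack([np.roll(target, d) for d in delays])

steered = delay_and_sum(mics, delays)       # beam pointed at the target
unsteered = delay_and_sum(mics, [0, 0, 0])  # beam pointed elsewhere
```

When the delays match the target's direction, the pulse adds coherently and keeps its full amplitude. With the wrong delays, the same pulse is smeared across three samples at one third of the amplitude.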
Supervised sound separation is a technique that is showing promise, and can outperform conventional signal processing in many cases. The idea is to train a neural network to correctly isolate "target" sounds from "interference." This can be done with a frame of data in the frequency domain, or sample by sample in the time domain. Generalizing to any acoustic environment is a key challenge.
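As an illustration of the frequency-domain case, one common formulation (assumed here, and not necessarily the one presented) trains a network to predict a time-frequency mask from features of the mixture. The sketch below only builds one training pair. The names `ideal_binary_mask` and `make_training_pair` are hypothetical, the magnitudes are made-up values, and summing magnitudes to form the mixture ignores phase.

```python
import numpy as np

def ideal_binary_mask(target_mag, interference_mag):
    """Label each time-frequency bin 1 where the target dominates, else 0."""
    return (target_mag > interference_mag).astype(np.float32)

def make_training_pair(target_mag, interference_mag):
    """Build (features, labels) for one clip (hypothetical helper)."""
    # Simplification: treat magnitudes as additive, ignoring phase.
    mixture_mag = target_mag + interference_mag
    # Log compression is a typical input feature for a network.
    features = np.log1p(mixture_mag)
    labels = ideal_binary_mask(target_mag, interference_mag)
    return features, labels

# Toy magnitude spectrograms (rows = frames, cols = frequency bins).
target_mag = np.array([[2.0, 0.1], [0.5, 3.0]])
interference_mag = np.array([[0.5, 1.0], [0.5, 0.2]])
features, labels = make_training_pair(target_mag, interference_mag)
```

A network trained on many such pairs would see only `features` at run time and would estimate the mask itself. The generalization challenge noted above shows up here as the need for training mixtures that cover the acoustic conditions expected in use.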
Written By: Sandy Guzman and Ross Penniman