In the classic signal processing context, identifying and resolving acoustic objects from a small, compact array of directional microphones is a challenging problem. A practical example is developing a robust system for understanding voice activity in a reverberant conference room from a small number of coincident directional microphones. In an application setting, many assumptions of the classic academic problem formulation are violated. The actual problem is inherently broadband with a wide dynamic range, and simultaneous voice activity and multi-path acoustic responses lead to source correlation and ambiguity. Room and occupant noise is rarely stationary, and irrelevant acoustic events are not easily classified as separate from voice. There is, however, a useful set of assumptions that can be utilized. Whilst these can be difficult to formally specify, they correspond to the understandings, common sense and constraints of a real meeting environment. The higher order statistical independence of typical acoustic scenes and voice activity can be utilized to gather information selectively in time. The system discussed in this work combines a simple statistical framework, physical source object modeling and operational heuristics to decompose a meeting scene with low latency from an array of three coincident directional microphones. An overview of the system architecture is presented with specific details of the raw features, a convenient mapping utilized for clustering, and heuristics over several time scales driven by a voice activity classifier. Longer time frames and suitable constraints on the object state provide robust operation and allow for the use of scene information for an interactive sound field application. Rather than an objective assessment of localization accuracy, the comparative assessment of algorithms was based on field testing, with the key requirements being reliability, testability and understanding potential failure modes.
The work is presented as a demonstration and suggestion for the use of lightweight computational auditory scene analysis in a deployed voice conference system.