Although the human auditory system can easily distinguish the singing voice from the background music in a music recording, it is extremely difficult for computer systems to replicate this ability, especially when the music mixture is a single channel. The challenge arises from the variety of simultaneous sound sources as well from the rich pitch and timbre variations of a singing voice. Unsupervised spectrogram decomposition involves separating the mixture spectrogram into a sparse spectrogram for the singing voice and a low-rank spectrogram for the background music. This approach has two limitations: the unsupervised nature prevents the prelearning of voice and background in music dictionaries; some components of the singing voice and background music may not show the preferred sparse and low-rank properties. In contrast, the authors propose to decompose the mixture spectrogram into three parts: a sparse spectrogram representing the singing voice, a low-rank spectrogram representing the background music, and a residual spectrogram for the components that are not identified by either the sparse or the low-rank spectrogram. Universal dictionaries for the singing voice and background music are prelearned from isolated singing voice and background music training data, through which prior knowledge of the voice and background music is introduced to the separation process. Evaluations on two datasets show that the proposed method is effective and efficient for both the separated singing voice and music accompaniment at various voice-to-music ratios.
Click to purchase paper as a non-member or login as an AES member. If your company or school subscribes to the E-Library then switch to the institutional version. If you are not an AES member and would like to subscribe to the E-Library then Join the AES!
This paper costs $33 for non-members and is free for AES members and E-Library subscribers.