Singing Voice Separation by Low-Rank and Sparse Spectrogram Decomposition with Prelearned Dictionaries
Citation & Abstract
S. Yu, H. Zhang, and Z. Duan, "Singing Voice Separation by Low-Rank and Sparse Spectrogram Decomposition with Prelearned Dictionaries," J. Audio Eng. Soc., vol. 65, no. 5, pp. 377-388 (2017 May). doi: https://doi.org/10.17743/jaes.2017.0009
Abstract: Although the human auditory system can easily distinguish the singing voice from the background music in a music recording, it is extremely difficult for computer systems to replicate this ability, especially when the music mixture is a single channel. The challenge arises from the variety of simultaneous sound sources as well as from the rich pitch and timbre variations of a singing voice. Unsupervised spectrogram decomposition involves separating the mixture spectrogram into a sparse spectrogram for the singing voice and a low-rank spectrogram for the background music. This approach has two limitations: the unsupervised nature prevents the prelearning of voice and background music dictionaries, and some components of the singing voice and background music may not show the preferred sparse and low-rank properties. In contrast, the authors propose to decompose the mixture spectrogram into three parts: a sparse spectrogram representing the singing voice, a low-rank spectrogram representing the background music, and a residual spectrogram for the components that are not identified by either the sparse or the low-rank spectrogram. Universal dictionaries for the singing voice and background music are prelearned from isolated singing voice and background music training data, through which prior knowledge of the voice and background music is introduced to the separation process. Evaluations on two datasets show that the proposed method is effective and efficient for both the separated singing voice and music accompaniment at various voice-to-music ratios.
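The core idea in the abstract — prelearn dictionaries from isolated voice and music, then decompose the mixture spectrogram into a voice part, a music part, and a residual — can be illustrated with a toy sketch. This is not the authors' exact algorithm (which imposes sparse and low-rank regularization); here plain multiplicative-update NMF with fixed prelearned dictionaries stands in, and random matrices replace real magnitude spectrograms. All names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "isolated training" spectrograms (frequency x time), nonnegative.
voice_train = rng.random((64, 200))
music_train = rng.random((64, 200))

def learn_dictionary(V, k, iters=100):
    """Prelearn a nonnegative spectral dictionary via Euclidean NMF updates."""
    F, T = V.shape
    W = rng.random((F, k)) + 1e-3
    H = rng.random((k, T)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    # Normalize dictionary atoms to unit column sum.
    return W / (W.sum(axis=0, keepdims=True) + 1e-9)

Dv = learn_dictionary(voice_train, 20)  # "universal" voice dictionary
Dm = learn_dictionary(music_train, 20)  # "universal" music dictionary

# Mixture magnitude spectrogram to separate (toy data again).
mix = rng.random((64, 150))

# Decompose the mixture on the fixed, concatenated dictionaries:
# only the activations H are updated, injecting the prior knowledge.
D = np.hstack([Dv, Dm])
H = rng.random((D.shape[1], mix.shape[1])) + 1e-3
for _ in range(200):
    H *= (D.T @ mix) / (D.T @ D @ H + 1e-9)

Sv = Dv @ H[:20]            # voice spectrogram estimate
Sm = Dm @ H[20:]            # music spectrogram estimate
resid = mix - (Sv + Sm)     # residual: energy explained by neither part

# Wiener-style masking redistributes the mixture energy between sources.
mask_v = Sv / (Sv + Sm + 1e-9)
voice_est = mask_v * mix
music_est = mix - voice_est
```

The residual plays the same role as the third spectrogram in the paper: it absorbs components that the voice and music models fail to capture, rather than forcing them into either source estimate.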
@article{yu2017singing,
author={Yu, Shiwei and Zhang, Hongjuan and Duan, Zhiyao},
journal={Journal of the Audio Engineering Society},
title={Singing Voice Separation by Low-Rank and Sparse Spectrogram Decomposition with Prelearned Dictionaries},
year={2017},
volume={65},
number={5},
pages={377-388},
doi={https://doi.org/10.17743/jaes.2017.0009},
month={May},}
TY - JOUR
TI - Singing Voice Separation by Low-Rank and Sparse Spectrogram Decomposition with Prelearned Dictionaries
SP - 377
EP - 388
AU - Yu, Shiwei
AU - Zhang, Hongjuan
AU - Duan, Zhiyao
PY - 2017
JO - Journal of the Audio Engineering Society
IS - 5
VL - 65
Y1 - May 2017
ER -
Authors:
Yu, Shiwei; Zhang, Hongjuan; Duan, Zhiyao
Affiliations:
Department of Mathematics, Shanghai University, Shanghai, P. R. China; Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY, USA
JAES Volume 65 Issue 5 pp. 377-388; May 2017
Publication Date:
May 26, 2017
Permalink:
http://www.aes.org/e-lib/browse.cfm?elib=18731