Single-Channel Audio Source Separation Using Deep Neural Network Ensembles
Citation & Abstract
E. M. Grais, G. Roma, A. J. R. Simpson, and M. D. Plumbley, "Single-Channel Audio Source Separation Using Deep Neural Network Ensembles," Paper 9494, (2016 May).
Abstract: Deep neural networks (DNNs) are often used to tackle the single channel source separation (SCSS) problem by predicting time-frequency masks. The predicted masks are then used to separate the sources from the mixed signal. Different types of masks produce separated sources with different levels of distortion and interference. Some types of masks produce separated sources with low distortion, while other masks produce low interference between the separated sources. In this paper a combination of different DNNs’ predictions (masks) is used for SCSS to achieve better quality of the separated sources than using each DNN individually. We train four different DNNs by minimizing four different cost functions to predict four different masks. The first and second DNNs are trained to approximate reference binary and soft masks. The third DNN is trained to predict a mask from the reference sources directly. The last DNN is trained similarly to the third DNN but with an additional discriminative constraint to maximize the differences between the estimated sources. Our experimental results show that combining the predictions of different DNNs achieves separated sources with better quality than using each DNN individually.
@article{grais2016single-channel,
  author={Grais, Emad M. and Roma, Gerard and Simpson, Andrew J. R. and Plumbley, Mark D.},
  journal={Journal of the Audio Engineering Society},
  title={Single-Channel Audio Source Separation Using Deep Neural Network Ensembles},
  year={2016},
  volume={},
  number={},
  pages={},
  doi={},
  month={May},
  abstract={Deep neural networks (DNNs) are often used to tackle the single channel source separation (SCSS) problem by predicting time-frequency masks. The predicted masks are then used to separate the sources from the mixed signal. Different types of masks produce separated sources with different levels of distortion and interference. Some types of masks produce separated sources with low distortion, while other masks produce low interference between the separated sources. In this paper a combination of different DNNs' predictions (masks) is used for SCSS to achieve better quality of the separated sources than using each DNN individually. We train four different DNNs by minimizing four different cost functions to predict four different masks. The first and second DNNs are trained to approximate reference binary and soft masks. The third DNN is trained to predict a mask from the reference sources directly. The last DNN is trained similarly to the third DNN but with an additional discriminative constraint to maximize the differences between the estimated sources. Our experimental results show that combining the predictions of different DNNs achieves separated sources with better quality than using each DNN individually.},
}
TY - paper
TI - Single-Channel Audio Source Separation Using Deep Neural Network Ensembles
SP -
EP -
AU - Grais, Emad M.
AU - Roma, Gerard
AU - Simpson, Andrew J. R.
AU - Plumbley, Mark D.
PY - 2016
JO - Journal of the Audio Engineering Society
IS -
VO -
VL -
Y1 - May 2016
AB - Deep neural networks (DNNs) are often used to tackle the single channel source separation (SCSS) problem by predicting time-frequency masks. The predicted masks are then used to separate the sources from the mixed signal. Different types of masks produce separated sources with different levels of distortion and interference. Some types of masks produce separated sources with low distortion, while other masks produce low interference between the separated sources. In this paper a combination of different DNNs’ predictions (masks) is used for SCSS to achieve better quality of the separated sources than using each DNN individually. We train four different DNNs by minimizing four different cost functions to predict four different masks. The first and second DNNs are trained to approximate reference binary and soft masks. The third DNN is trained to predict a mask from the reference sources directly. The last DNN is trained similarly to the third DNN but with an additional discriminative constraint to maximize the differences between the estimated sources. Our experimental results show that combining the predictions of different DNNs achieves separated sources with better quality than using each DNN individually.
Authors:
Grais, Emad M.; Roma, Gerard; Simpson, Andrew J. R.; Plumbley, Mark D.
Affiliation:
University of Surrey, Guildford, Surrey, UK
AES Convention:
140 (May 2016)
Paper Number:
9494
Publication Date:
May 26, 2016
Subject:
Audio Signal Processing: Coding, Encoding, and Perception
Permalink:
http://www.aes.org/e-lib/browse.cfm?elib=18193