Deep Neural Network Based Guided Speech Bandwidth Extension
Citation & Abstract
K. Schmidt and B. Edler, "Deep Neural Network Based Guided Speech Bandwidth Extension," Paper 10254, (2019 October).
Abstract: Even today, telephone speech is still limited to the range of 200 to 3400 Hz, since the predominant codecs in public switched telephone networks are AMR-NB, G.711, and G.722 [1, 2, 3]. Blind bandwidth extension (blind BWE, BBWE) can improve the perceived quality as well as the intelligibility of coded speech without changing the transmission network or the speech codec. The BBWE used in this work is based on deep neural networks (DNNs) and has already shown good performance [4]. Although this BBWE enhances the speech without producing too many artifacts, it sometimes fails to enhance prominent fricatives, which can result in muffled speech. In order to better synthesize prominent fricatives, the BBWE is extended by sending a single bit of side information, here referred to as guided BWE. This bit may be transmitted, e.g., by watermarking, so that no changes to the transmission network or the speech codec are required. Different DNN configurations (including convolutional (Conv.) layers as well as long short-term memory (LSTM) layers) making use of this bit have been evaluated. The BBWE has a low computational complexity and an algorithmic delay of only 12 ms, and can be applied in state-of-the-art speech and audio codecs.
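The abstract describes the guided BWE as a DNN that combines convolutional and LSTM layers and consumes low-band features together with a single guidance bit. As a rough illustration of that idea only (not the authors' actual model; the class name GuidedBWENet, all layer sizes, feature dimensions, and the per-frame guidance-bit interface are assumptions), a minimal PyTorch sketch could look as follows:

import torch
import torch.nn as nn

class GuidedBWENet(nn.Module):  # hypothetical name, not from the paper
    """Conv + LSTM regressor: low-band features plus one guidance bit per
    frame in, a high-band parameter vector per frame out."""
    def __init__(self, n_features=32, n_out=16, conv_channels=32, lstm_units=64):
        super().__init__()
        # Convolve over the time axis; the guidance bit is appended to each
        # per-frame feature vector, hence n_features + 1 input channels.
        self.conv = nn.Sequential(
            nn.Conv1d(n_features + 1, conv_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(conv_channels, conv_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(conv_channels, lstm_units, batch_first=True)
        self.out = nn.Linear(lstm_units, n_out)

    def forward(self, lowband_feats, guidance_bit):
        # lowband_feats: (batch, frames, n_features); guidance_bit: (batch, frames, 1)
        x = torch.cat([lowband_feats, guidance_bit], dim=-1)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (batch, channels, time)
        x, _ = self.lstm(x)
        return self.out(x)                                # (batch, frames, n_out)

# Toy usage: 100 frames of 32-dim low-band features and one guidance bit per frame.
net = GuidedBWENet()
feats = torch.randn(1, 100, 32)
bit = torch.randint(0, 2, (1, 100, 1)).float()
print(net(feats, bit).shape)  # torch.Size([1, 100, 16])

In a low-delay deployment such as the 12 ms system described in the paper, the convolution would additionally have to be causal or use only a short lookahead; the sketch above ignores that constraint for brevity.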
@article{schmidt2019deep,
author={Schmidt, Konstantin and Edler, Bernd},
journal={Journal of the Audio Engineering Society},
title={Deep Neural Network Based Guided Speech Bandwidth Extension},
year={2019},
volume={},
number={},
pages={},
doi={},
month={October},
abstract={Even today, telephone speech is still limited to the range of 200 to 3400 Hz, since the predominant codecs in public switched telephone networks are AMR-NB, G.711, and G.722 [1, 2, 3]. Blind bandwidth extension (blind BWE, BBWE) can improve the perceived quality as well as the intelligibility of coded speech without changing the transmission network or the speech codec. The BBWE used in this work is based on deep neural networks (DNNs) and has already shown good performance [4]. Although this BBWE enhances the speech without producing too many artifacts, it sometimes fails to enhance prominent fricatives, which can result in muffled speech. In order to better synthesize prominent fricatives, the BBWE is extended by sending a single bit of side information, here referred to as guided BWE. This bit may be transmitted, e.g., by watermarking, so that no changes to the transmission network or the speech codec are required. Different DNN configurations (including convolutional (Conv.) layers as well as long short-term memory (LSTM) layers) making use of this bit have been evaluated. The BBWE has a low computational complexity and an algorithmic delay of only 12 ms, and can be applied in state-of-the-art speech and audio codecs.},}
TY - paper
TI - Deep Neural Network Based Guided Speech Bandwidth Extension
SP -
EP -
AU - Schmidt, Konstantin
AU - Edler, Bernd
PY - 2019
JO - Journal of the Audio Engineering Society
IS -
VO -
VL -
Y1 - October 2019
AB - Even today, telephone speech is still limited to the range of 200 to 3400 Hz, since the predominant codecs in public switched telephone networks are AMR-NB, G.711, and G.722 [1, 2, 3]. Blind bandwidth extension (blind BWE, BBWE) can improve the perceived quality as well as the intelligibility of coded speech without changing the transmission network or the speech codec. The BBWE used in this work is based on deep neural networks (DNNs) and has already shown good performance [4]. Although this BBWE enhances the speech without producing too many artifacts, it sometimes fails to enhance prominent fricatives, which can result in muffled speech. In order to better synthesize prominent fricatives, the BBWE is extended by sending a single bit of side information, here referred to as guided BWE. This bit may be transmitted, e.g., by watermarking, so that no changes to the transmission network or the speech codec are required. Different DNN configurations (including convolutional (Conv.) layers as well as long short-term memory (LSTM) layers) making use of this bit have been evaluated. The BBWE has a low computational complexity and an algorithmic delay of only 12 ms, and can be applied in state-of-the-art speech and audio codecs.
Authors:
Schmidt, Konstantin; Edler, Bernd
Affiliations:
Friedrich Alexander University, Erlangen-Nürnberg, Germany; Fraunhofer IIS, Erlangen, Germany (see document for exact affiliation information)
AES Convention:
147 (October 2019)
Paper Number:
10254
Publication Date:
October 8, 2019
Subject:
Posters: Audio Signal Processing
Permalink:
http://www.aes.org/e-lib/browse.cfm?elib=20627