Acoustic Scene Classification Using Pixel-Based Attention
Citation & Abstract
X. Wang, Y. Xu, J. Shi, and X. Teng, "Acoustic Scene Classification Using Pixel-Based Attention," J. Audio Eng. Soc., vol. 68, no. 11, pp. 843-855, November 2020. doi: https://doi.org/10.17743/jaes.2020.0052
Abstract: In this paper we propose a pixel-based attention (PBA) module for acoustic scene classification (ASC). By compressing the features of the input spectrogram along the spatial dimension, PBA obtains global information about the spectrogram. PBA then applies attention weights to each pixel of each channel through two convolutional layers that combine this global information with the local features. The attention-weighted spectrogram is multiplied by a gamma coefficient and superimposed on the original spectrogram to obtain more effective spectrogram features for training the network model. This paper further implements a convolutional neural network (CNN) based on PBA (PB-CNN) and compares its classification performance on Task 1 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 Challenge with a CNN based on time attention (TB-CNN), a CNN based on frequency attention (FB-CNN), and a pure CNN. The experimental results show that the proposed PB-CNN achieves the highest accuracy of the four CNNs at 89.2%: 1.9% higher than TB-CNN (87.3%), 2.6% higher than FB-CNN (86.6%), and 3.0% higher than the pure CNN (86.2%). Compared with the DCASE 2016 baseline system, PB-CNN improves accuracy by 12%, and its 89.2% accuracy is the highest among all submitted single models.
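The PBA mechanism sketched in the abstract is compact enough to illustrate directly. Below is a minimal PyTorch sketch, assuming the global information is a spatial mean over each channel, that it is fused with the local features by addition, and that the two convolutional layers use 1x1 kernels with a channel-reduction ratio of 4 followed by a sigmoid gate; none of these specifics appear in the abstract, so the code illustrates the idea rather than reproducing the authors' implementation.

import torch
import torch.nn as nn

class PixelBasedAttention(nn.Module):
    """Sketch of pixel-based attention (PBA): per-pixel, per-channel gating
    driven by global spectrogram information, with a gamma-scaled residual
    connection back to the original input. Layer shapes are assumptions."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)  # reduction ratio is an assumption
        # Two convolutional layers that turn the globally informed features
        # into an attention weight for every pixel of every channel.
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(hidden, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()
        # Gamma coefficient scaling the attended spectrogram before it is
        # superimposed on the original; initialized to zero so training
        # starts from the identity mapping.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frequency, time) spectrogram feature maps.
        # Feature compression along the spatial dimension -> global information.
        global_info = x.mean(dim=(2, 3), keepdim=True)  # (B, C, 1, 1)
        fused = x + global_info                         # broadcast fusion (assumed)
        weights = self.sigmoid(self.conv2(self.relu(self.conv1(fused))))
        # Attention-weighted spectrogram, gamma-scaled, added to the original.
        return x + self.gamma * (weights * x)

# Usage: apply PBA to a batch of log-mel spectrogram feature maps.
pba = PixelBasedAttention(channels=32)
features = torch.randn(8, 32, 64, 431)  # (batch, channels, mel bins, frames); example sizes
out = pba(features)                     # same shape as the input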
@article{wang2020acoustic,
author={Wang, Xingmei and Xu, Yichao and Shi, Jiahao and Teng, Xuyang},
journal={Journal of the Audio Engineering Society},
title={Acoustic Scene Classification Using Pixel-Based Attention},
year={2020},
volume={68},
number={11},
pages={843--855},
doi={10.17743/jaes.2020.0052},
month={November},}
TY - JOUR
TI - Acoustic Scene Classification Using Pixel-Based Attention
SP - 843
EP - 855
AU - Wang, Xingmei
AU - Xu, Yichao
AU - Shi, Jiahao
AU - Teng, Xuyang
PY - 2020
JO - Journal of the Audio Engineering Society
IS - 11
VL - 68
Y1 - 2020/11//
DO - 10.17743/jaes.2020.0052
ER -
Open Access
Authors:
Wang, Xingmei; Xu, Yichao; Shi, Jiahao; Teng, Xuyang
Affiliations:
College of Computer Science and Technology, Harbin Engineering University, Harbin, 150001, People’s Republic of China; College of Communication Engineering, Hangzhou Dianzi University, Hangzhou, 310018, People’s Republic of China (see document for exact affiliation information)
JAES Volume 68, Issue 11, pp. 843-855; November 2020
Publication Date:
December 21, 2020
Permalink:
http://www.aes.org/e-lib/browse.cfm?elib=20998