To improve the performance of automatic speech recognition in noisy environments, a convolutional neural network (CNN) is combined with a time-delay neural network (TDNN); the resulting model is referred to as CNN-TDNN. The CNN-TDNN model is further optimized by factoring the parameter matrices of the TDNN hidden layers and by adding a time-restricted self-attention layer after the CNN-TDNN hidden layers. Experimental results show that the optimized CNN-TDNN model outperforms the DNN, CNN, TDNN, and CNN-TDNN baselines. Compared with these baselines, the average word error rate (WER) is reduced by 11.76%.
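The two optimizations named in the abstract can be illustrated with a small, hedged sketch. The first function compares the parameter count of a full weight matrix with that of a low-rank factorization, and the second restricts self-attention to a window of neighboring frames. The function names, the layer sizes, the rank, and the window widths are all assumptions chosen for illustration; the abstract does not state them, and the sketch omits the learned projections and any constraints on the factors that a real model would use.

```python
import math


def factored_param_count(d_in, d_out, rank):
    """Return (full, factored) parameter counts for a d_in x d_out matrix
    replaced by two factors of shape d_in x rank and rank x d_out."""
    full = d_in * d_out
    factored = d_in * rank + rank * d_out
    return full, factored


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def time_restricted_attention(queries, keys, values, left, right):
    """Scaled dot-product attention in which frame t attends only to
    frames t-left .. t+right (clipped at the sequence boundaries)."""
    n = len(queries)
    dim = len(values[0])
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for t in range(n):
        window = range(max(0, t - left), min(n, t + right + 1))
        scores = [scale * dot(queries[t], keys[s]) for s in window]
        peak = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(x - peak) for x in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[s][j] for w, s in zip(weights, window)) / total
            for j in range(dim)
        ])
    return outputs


# Hypothetical sizes: a 1024 x 1024 layer factored through rank 128.
full, factored = factored_param_count(1024, 1024, 128)

# Hypothetical 3-frame sequence of 2-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = time_restricted_attention(frames, frames, frames, left=1, right=1)
```

With these illustrative sizes the factored layer holds a quarter of the parameters of the full one, and each output frame in `context` is a weighted average of at most three neighboring input frames rather than of the whole utterance.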