Learning Self-Informed Feature Contribution for Deep Learning-Based Acoustic Modeling

Published: 01 November 2018

Abstract

In this paper, we introduce a new feature engineering approach for deep learning-based acoustic modeling that utilizes input feature contributions. For this purpose, we propose an auxiliary deep neural network (DNN), called a feature contribution network (FCN), whose output layer is composed of sigmoid-based contribution gates. In our framework, the FCN learns element-level discriminative contributions of the input features, and an acoustic model network (AMN) is trained on gated features generated by element-wise multiplication between the contribution gate outputs and the input features. In addition, we propose a regularization method for the FCN that encourages it to activate the minimum number of gates. The proposed methods were evaluated on the TED-LIUM release 1 corpus, with the FCN applied to both DNN- and long short-term memory-based AMNs. Experimental results showed that AMNs with FCNs consistently improved recognition performance compared with AMN-only frameworks.
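To make the gating idea concrete, below is a minimal sketch of the mechanism described in the abstract: an auxiliary FCN with a sigmoid output layer produces per-element gates, the AMN is trained on the element-wise product of gates and features, and a sparsity penalty discourages unnecessary gate activation. The sketch assumes PyTorch; the layer sizes, optimizer, dummy minibatch, and the L1-style penalty on gate outputs are illustrative assumptions, not the paper's exact configuration or regularizer.

```python
import torch
import torch.nn as nn

class FeatureContributionNetwork(nn.Module):
    """Maps an input feature vector to element-wise contribution gates in (0, 1)."""
    def __init__(self, feat_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
            nn.Sigmoid(),                       # sigmoid-based contribution gates
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class AcousticModelNetwork(nn.Module):
    """A plain feed-forward acoustic model over the gated features."""
    def __init__(self, feat_dim: int, num_states: int, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_states),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Illustrative dimensions and hyperparameters (assumptions, not from the paper).
feat_dim, num_states = 440, 4000
fcn = FeatureContributionNetwork(feat_dim)
amn = AcousticModelNetwork(feat_dim, num_states)
ce_loss = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(list(fcn.parameters()) + list(amn.parameters()), lr=1e-4)
sparsity_weight = 1e-3                          # assumed weight for the gate penalty

features = torch.randn(32, feat_dim)            # dummy minibatch of acoustic features
targets = torch.randint(0, num_states, (32,))   # dummy senone targets

gates = fcn(features)                           # element-level contribution gates
gated_features = gates * features               # element-wise multiplication
logits = amn(gated_features)

# Cross-entropy for the AMN plus an L1-style penalty that pushes the FCN toward
# activating as few gates as possible (a stand-in for the paper's regularizer).
loss = ce_loss(logits, targets) + sparsity_weight * gates.mean()
loss.backward()
optimizer.step()
```

Since the FCN and AMN are optimized jointly on the same loss, the gates can only stay open for feature elements that actually help the acoustic classification, which is the intuition behind the sparsity regularizer.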

Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 26, Issue 11
November 2018
302 pages
ISSN: 2329-9290
EISSN: 2329-9304

    Publisher

    IEEE Press

    Qualifiers

    • Research-article
