Learning Self-Informed Feature Contribution for Deep Learning-Based Acoustic Modeling

Published: 01 November 2018

Abstract

In this paper, we introduce a new feature engineering approach for deep learning-based acoustic modeling that utilizes input feature contributions. For this purpose, we propose an auxiliary deep neural network (DNN), called a feature contribution network (FCN), whose output layer is composed of sigmoid-based contribution gates. In our framework, the FCN learns element-level discriminative contributions of the input features, and an acoustic model network (AMN) is trained on gated features generated by element-wise multiplication between the contribution gate outputs and the input features. In addition, we propose a regularization method for the FCN that encourages it to activate the minimum number of gates. The proposed methods were evaluated on the TED-LIUM release 1 corpus, with the FCN applied to both DNN- and long short-term memory-based AMNs. Experimental results showed that AMNs with FCNs consistently improved recognition performance compared with AMN-only frameworks.
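To make the gating idea concrete, below is a minimal sketch of the mechanism described in the abstract: an auxiliary FCN with a sigmoid output layer produces per-element gates, the AMN is trained on the element-wise product of gates and features, and a sparsity penalty discourages unnecessary gate activation. The sketch assumes PyTorch; the layer sizes, optimizer, dummy minibatch, and the L1-style penalty on gate outputs are illustrative assumptions, not the paper's exact configuration or regularizer.

```python
import torch
import torch.nn as nn

class FeatureContributionNetwork(nn.Module):
    """Maps an input feature vector to element-wise contribution gates in (0, 1)."""
    def __init__(self, feat_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
            nn.Sigmoid(),                       # sigmoid-based contribution gates
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class AcousticModelNetwork(nn.Module):
    """A plain feed-forward acoustic model over the gated features."""
    def __init__(self, feat_dim: int, num_states: int, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_states),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Illustrative dimensions and hyperparameters (assumptions, not from the paper).
feat_dim, num_states = 440, 4000
fcn = FeatureContributionNetwork(feat_dim)
amn = AcousticModelNetwork(feat_dim, num_states)
ce_loss = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(list(fcn.parameters()) + list(amn.parameters()), lr=1e-4)
sparsity_weight = 1e-3                          # assumed weight for the gate penalty

features = torch.randn(32, feat_dim)            # dummy minibatch of acoustic features
targets = torch.randint(0, num_states, (32,))   # dummy senone targets

gates = fcn(features)                           # element-level contribution gates
gated_features = gates * features               # element-wise multiplication
logits = amn(gated_features)

# Cross-entropy for the AMN plus an L1-style penalty that pushes the FCN toward
# activating as few gates as possible (a stand-in for the paper's regularizer).
loss = ce_loss(logits, targets) + sparsity_weight * gates.mean()
loss.backward()
optimizer.step()
```

Since the FCN and AMN are optimized jointly on the same loss, the gates can only stay open for feature elements that actually help the acoustic classification, which is the intuition behind the sparsity regularizer.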

Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 26, Issue 11
November 2018
302 pages
ISSN: 2329-9290
EISSN: 2329-9304

    Publisher

    IEEE Press

    Qualifiers

    • Research-article
