
Convolutional neural networks for speech recognition

Published: 01 October 2014

Abstract

Recently, the hybrid deep neural network (DNN)-hidden Markov model (HMM) has been shown to significantly improve speech recognition performance over the conventional Gaussian mixture model (GMM)-HMM. The performance improvement is partially attributed to the ability of the DNN to model complex correlations in speech features. In this paper, we show that further error rate reduction can be obtained by using convolutional neural networks (CNNs). We first present a concise description of the basic CNN and explain how it can be used for speech recognition. We further propose a limited-weight-sharing scheme that can better model speech features. The special structure of CNNs, namely local connectivity, weight sharing, and pooling, provides some degree of invariance to small shifts of speech features along the frequency axis, which is important for dealing with speaker and environment variations. Experimental results show that CNNs reduce the error rate by 6%-10% compared with DNNs on the TIMIT phone recognition and the voice search large vocabulary speech recognition tasks.
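The frequency-shift invariance mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the 12-band toy frame, the three-tap kernel, the pool size, and the function names `conv_along_frequency` and `max_pool` are all assumptions chosen for illustration, and it shows only full weight sharing (one kernel reused at every band), not the limited-weight-sharing scheme the paper proposes.

```python
import numpy as np

def conv_along_frequency(frame, kernel):
    """Slide one shared kernel over the frequency bands (valid positions only).

    This is a cross-correlation (no kernel flip); the toy kernel below is
    symmetric, so the result equals a true convolution.
    """
    width = len(kernel)
    return np.array([frame[i:i + width] @ kernel
                     for i in range(len(frame) - width + 1)])

def max_pool(activations, pool_size):
    """Non-overlapping max pooling; trailing positions that do not fill a pool are dropped."""
    usable = len(activations) - len(activations) % pool_size
    return activations[:usable].reshape(-1, pool_size).max(axis=1)

# Hypothetical filter-bank frame: 12 frequency bands with a single spectral peak.
frame = np.zeros(12)
frame[4] = 1.0
shifted = np.roll(frame, 1)          # same peak, moved up by one band

kernel = np.array([0.25, 0.5, 0.25])  # illustrative weights, not learned

conv = conv_along_frequency(frame, kernel)
conv_shifted = conv_along_frequency(shifted, kernel)
pooled = max_pool(conv, pool_size=3)
pooled_shifted = max_pool(conv_shifted, pool_size=3)

# The strongest convolution response moves with the peak...
print(int(np.argmax(conv)), int(np.argmax(conv_shifted)))      # 3 4
# ...but after pooling it lands in the same pooled unit.
print(int(np.argmax(pooled)), int(np.argmax(pooled_shifted)))  # 1 1
```

In this toy case the one-band shift changes which convolution unit fires most strongly, while the pooled unit with the largest response stays the same. The pooled vectors are not identical, which matches the abstract's wording of "some degree of invariance" rather than exact invariance.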




Information & Contributors

Information

Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 22, Issue 10
October 2014
116 pages
ISSN: 2329-9290
EISSN: 2329-9304
Editor: Li Deng

Publisher

IEEE Press

Publication History

Published: 01 October 2014
Accepted: 05 July 2014
Revised: 04 February 2014
Received: 11 October 2013
Published in TASLP Volume 22, Issue 10

Author Tags

  1. convolution
  2. convolutional neural networks
  3. limited weight sharing (LWS) scheme
  4. pooling

Qualifiers

  • Article

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 75
  • Downloads (Last 6 weeks): 12
Reflects downloads up to 21 Nov 2024

Citations

Cited By

  • (2024) Detection of cotton leaf curl disease’s susceptibility scale level based on deep learning. Journal of Cloud Computing: Advances, Systems and Applications 13:1. DOI: 10.1186/s13677-023-00582-9. Online publication date: 26-Feb-2024
  • (2024) Scratchpad Memory Management for Deep Learning Accelerators. Proceedings of the 53rd International Conference on Parallel Processing, pp. 629-639. DOI: 10.1145/3673038.3673115. Online publication date: 12-Aug-2024
  • (2024) ACO-Pruning for Deep Neural Networks: A Case Study in CNNs. Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 1895-1903. DOI: 10.1145/3638530.3664125. Online publication date: 14-Jul-2024
  • (2024) End-to-end Multi-modal Low-resourced Speech Keywords Recognition Using Sequential Conv2D Nets. ACM Transactions on Asian and Low-Resource Language Information Processing 23:1, pp. 1-21. DOI: 10.1145/3606019. Online publication date: 15-Jan-2024
  • (2024) Toward Better Low-Rate Deep Learning-Based CSI Feedback: A Test Channel-Based Approach. IEEE Transactions on Wireless Communications 23:8_Part_1, pp. 8773-8786. DOI: 10.1109/TWC.2024.3354238. Online publication date: 1-Aug-2024
  • (2024) Towards Understanding Convergence and Generalization of AdamW. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:9, pp. 6486-6493. DOI: 10.1109/TPAMI.2024.3382294. Online publication date: 1-Sep-2024
  • (2024) On the Number of Linear Regions of Convolutional Neural Networks With Piecewise Linear Activations. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:7, pp. 5131-5148. DOI: 10.1109/TPAMI.2024.3361155. Online publication date: 1-Feb-2024
  • (2024) WOOD: Wasserstein-Based Out-of-Distribution Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:2, pp. 944-956. DOI: 10.1109/TPAMI.2023.3328883. Online publication date: 1-Feb-2024
  • (2024) Dynamically Shifting Multimodal Representations via Hybrid-Modal Attention for Multimodal Sentiment Analysis. IEEE Transactions on Multimedia 26, pp. 2740-2755. DOI: 10.1109/TMM.2023.3303711. Online publication date: 1-Jan-2024
  • (2024) Neural Moderation of ASMR Erotica Content in Social Networks. IEEE Transactions on Knowledge and Data Engineering 36:1, pp. 275-280. DOI: 10.1109/TKDE.2023.3283501. Online publication date: 1-Jan-2024
