Convolutional neural networks for speech recognition

Editor: Li Deng

Authors: Ossama Abdel-Hamid, Abdel-Rahman Mohamed, Dong Yu

IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 22, Issue 10

Pages 1533 - 1545

https://doi.org/10.1109/TASLP.2014.2339736

Published: 01 October 2014

Abstract

Recently, the hybrid deep neural network (DNN)-hidden Markov model (HMM) has been shown to significantly improve speech recognition performance over the conventional Gaussian mixture model (GMM)-HMM. The improvement is partially attributed to the ability of the DNN to model complex correlations in speech features. In this paper, we show that further error rate reduction can be obtained by using convolutional neural networks (CNNs). We first present a concise description of the basic CNN and explain how it can be used for speech recognition. We then propose a limited-weight-sharing scheme that can better model speech features. The special structures of CNNs, such as local connectivity, weight sharing, and pooling, provide some degree of invariance to small shifts of speech features along the frequency axis, which is important for dealing with speaker and environment variations. Experimental results show that CNNs reduce the error rate by 6%-10% compared with DNNs on the TIMIT phone recognition task and on a voice search large-vocabulary speech recognition task.
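
The invariance claim in the abstract can be illustrated with a toy example. The sketch below is not the paper's model: it assumes a single hand-picked filter shared across all frequency positions (full weight sharing, not the proposed limited-weight-sharing scheme), a rectifier nonlinearity, and non-overlapping max-pooling. The names PEAK_FILTER, POOL_SIZE, peak_at, and pooled_response are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: full weight sharing with one hand-picked filter.
# All names and values here are hypothetical, not taken from the paper.

PEAK_FILTER = [-1, 2, -1]   # assumed filter: responds to a narrow spectral peak
POOL_SIZE = 3               # assumed pooling width, in frequency positions


def conv_along_frequency(bands, kernel):
    """Slide one shared filter over every local window of frequency bands."""
    width = len(kernel)
    return [
        max(0, sum(w * x for w, x in zip(kernel, bands[i:i + width])))  # ReLU
        for i in range(len(bands) - width + 1)
    ]


def max_pool(activations, pool_size):
    """Keep the largest activation in each non-overlapping group of positions."""
    return [
        max(activations[i:i + pool_size])
        for i in range(0, len(activations) - pool_size + 1, pool_size)
    ]


def pooled_response(bands):
    """Convolve along frequency with the shared filter, then max-pool."""
    return max_pool(conv_along_frequency(bands, PEAK_FILTER), POOL_SIZE)


def peak_at(position, num_bands=11):
    """Toy frame: a single spectral peak at the given frequency band."""
    return [1 if b == position else 0 for b in range(num_bands)]


print(pooled_response(peak_at(4)))  # [0, 2, 0]
print(pooled_response(peak_at(5)))  # [0, 2, 0]  (one-band shift, same output)
print(pooled_response(peak_at(8)))  # [0, 0, 2]  (larger shift moves the response)
```

In this toy setting, a one-band shift of the peak leaves the pooled output unchanged, while a shift larger than the pooling width moves the response to a different pooled unit. That matches the abstract's wording of "some degree of invariance to small shifts" and nothing stronger.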

Cited By

Nazeer R, Ali S, Hu Z, Ansari G, Al-Razgan M, Awwad E, Ghadi Y (2024). Detection of cotton leaf curl disease’s susceptibility scale level based on deep learning. Journal of Cloud Computing: Advances, Systems and Applications 13:1. doi:10.1186/s13677-023-00582-9. Online publication date: 26-Feb-2024
https://dl.acm.org/doi/10.1186/s13677-023-00582-9
Zouzoula S, Maleki M, Azhar M, Trancoso P (2024). Scratchpad Memory Management for Deep Learning Accelerators. Proceedings of the 53rd International Conference on Parallel Processing, pp. 629-639. doi:10.1145/3673038.3673115. Online publication date: 12-Aug-2024
https://dl.acm.org/doi/10.1145/3673038.3673115
Dorighello R, Delgado M, Lüders R, Pigatto D, Li X, Handl J (2024). ACO-Pruning for Deep Neural Networks: A Case Study in CNNs. Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 1895-1903. doi:10.1145/3638530.3664125. Online publication date: 14-Jul-2024
https://dl.acm.org/doi/10.1145/3638530.3664125

Information & Contributors

Information

Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing Volume 22, Issue 10

October 2014

116 pages

ISSN:2329-9290

EISSN:2329-9304

Editor: Li Deng, Microsoft Research, Redmond, WA

Publisher

IEEE Press

Publication History

Published: 01 October 2014

Accepted: 05 July 2014

Revised: 04 February 2014

Received: 11 October 2013

Published in TASLP Volume 22, Issue 10

Qualifiers

Article

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 330
Total Downloads: 1,663
Downloads (Last 12 months): 75
Downloads (Last 6 weeks): 12

Reflects downloads up to 21 Nov 2024

Citations

Cited By

Gambhir P, Dev A, Bansal P, Sharma D (2024). End-to-end Multi-modal Low-resourced Speech Keywords Recognition Using Sequential Conv2D Nets. ACM Transactions on Asian and Low-Resource Language Information Processing 23:1, pp. 1-21. doi:10.1145/3606019. Online publication date: 15-Jan-2024
https://dl.acm.org/doi/10.1145/3606019
Liang X, Jia Z, Gu X, Zhang L (2024). Toward Better Low-Rate Deep Learning-Based CSI Feedback: A Test Channel-Based Approach. IEEE Transactions on Wireless Communications 23:8 (Part 1), pp. 8773-8786. doi:10.1109/TWC.2024.3354238. Online publication date: 1-Aug-2024
https://dl.acm.org/doi/10.1109/TWC.2024.3354238
Zhou P, Xie X, Lin Z, Yan S (2024). Towards Understanding Convergence and Generalization of AdamW. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:9, pp. 6486-6493. doi:10.1109/TPAMI.2024.3382294. Online publication date: 1-Sep-2024
https://dl.acm.org/doi/10.1109/TPAMI.2024.3382294
Xiong H, Huang L, Zang W, Zhen X, Xie G, Gu B, Song L (2024). On the Number of Linear Regions of Convolutional Neural Networks With Piecewise Linear Activations. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:7, pp. 5131-5148. doi:10.1109/TPAMI.2024.3361155. Online publication date: 1-Feb-2024
https://dl.acm.org/doi/10.1109/TPAMI.2024.3361155
Wang Y, Sun W, Jin J, Kong Z, Yue X (2024). WOOD: Wasserstein-Based Out-of-Distribution Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:2, pp. 944-956. doi:10.1109/TPAMI.2023.3328883. Online publication date: 1-Feb-2024
https://dl.acm.org/doi/10.1109/TPAMI.2023.3328883
Lin R, Hu H (2024). Dynamically Shifting Multimodal Representations via Hybrid-Modal Attention for Multimodal Sentiment Analysis. IEEE Transactions on Multimedia 26, pp. 2740-2755. doi:10.1109/TMM.2023.3303711. Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.1109/TMM.2023.3303711
Chen Y, Jiang D, Tan C, Song Y, Zhang C, Chen L (2024). Neural Moderation of ASMR Erotica Content in Social Networks. IEEE Transactions on Knowledge and Data Engineering 36:1, pp. 275-280. doi:10.1109/TKDE.2023.3283501. Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.1109/TKDE.2023.3283501
