Abstract
In this paper, we propose a cluster-based senone selection method to speed up the computation of deep neural networks (DNNs) at decoding time in automatic speech recognition (ASR) systems. In DNN-based acoustic models, the large number of senones at the output layer is one of the main causes of the high computational complexity of DNNs. Inspired by the mixture selection method designed for Gaussian mixture model (GMM)-based acoustic models, our method selects only a subset of the senones at the output layer of the DNN to calculate posterior probabilities. The senone selection strategy is derived by clustering acoustic features according to their transformed representations at the top hidden layer of the DNN acoustic model. Experimental results on Mandarin speech recognition tasks show that our method reduces the average number of DNN parameters used in computation by 22% and accelerates the overall recognition process by 13% without significant performance degradation. On the Switchboard task, our method reduces the average number of DNN parameters used in computation by 38.8% for conventional DNN modeling and by 22.7% for low-rank DNN modeling, with negligible performance loss.
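The idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the dimensions, the k-means clustering step, the rule for choosing each cluster's active senones (highest average logit over the cluster's frames), and the posterior floor for unselected senones are all simplifying assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; not taken from the paper).
hidden_dim, num_senones, num_clusters = 16, 100, 4

# Stand-in for top-hidden-layer activations of training frames.
train_hidden = rng.normal(size=(500, hidden_dim))

# Step 1: cluster frames in the top-hidden-layer space
# (a plain k-means sketch standing in for the paper's clustering).
centroids = train_hidden[:num_clusters].copy()
for _ in range(10):
    dists = ((train_hidden[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = np.argmin(dists, axis=1)
    for k in range(num_clusters):
        members = train_hidden[assign == k]
        if len(members):
            centroids[k] = members.mean(axis=0)

# Output-layer parameters of the acoustic model (random stand-ins).
W = rng.normal(size=(hidden_dim, num_senones))
b = rng.normal(size=num_senones)

# Step 2: per cluster, keep only the senones with the highest
# average logit over that cluster's frames (assumed selection rule).
keep = 30  # senones retained per cluster (illustrative)
active = {}
for k in range(num_clusters):
    mean_logit = (train_hidden[assign == k] @ W + b).mean(axis=0)
    active[k] = np.argsort(mean_logit)[-keep:]

def select_posteriors(h, floor=1e-7):
    """Compute posteriors only over the nearest cluster's active senones."""
    k = int(np.argmin(((centroids - h) ** 2).sum(axis=1)))
    idx = active[k]
    logits = h @ W[:, idx] + b[idx]   # partial output-layer product
    e = np.exp(logits - logits.max())
    p = np.full(num_senones, floor)   # floor unselected senones
    p[idx] = e / e.sum()
    return p

p = select_posteriors(rng.normal(size=hidden_dim))
```

The saving comes from the partial matrix product `h @ W[:, idx]`: with `keep` of `num_senones` columns retained, the output-layer cost drops proportionally, which matters because the output layer dominates when the senone set is large.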
Acknowledgements
This work was partly supported by the National Key Research and Development Program (Grant No. 2016YFB1001300), the Fundamental Research Funds for the Central Universities (Grant No. WK2350000001), and the CAS Strategic Priority Research Program (Grant No. XDB02070006).
Cite this article
Liu, JH., Ling, ZH., Wei, S. et al. Improving the Decoding Efficiency of Deep Neural Network Acoustic Models by Cluster-Based Senone Selection. J Sign Process Syst 90, 999–1011 (2018). https://doi.org/10.1007/s11265-017-1288-9