Abstract
In this paper, we propose a cluster-based senone selection method to speed up the computation of deep neural networks (DNNs) at decoding time in automatic speech recognition (ASR) systems. In DNN-based acoustic models, the large number of senones at the output layer is one of the main causes of the high computational complexity of DNNs. Inspired by the mixture selection method designed for Gaussian mixture model (GMM)-based acoustic models, our method selects only a subset of the senones at the output layer of the DNN to calculate posterior probabilities. The senone selection strategy is derived by clustering acoustic features according to their transformed representations at the top hidden layer of the DNN acoustic model. Experimental results on Mandarin speech recognition tasks show that our method reduces the average number of DNN parameters used in computation by 22% and accelerates the overall recognition process by 13% without significant performance degradation. On the Switchboard task, our method reduces the average number of DNN parameters used in computation by 38.8% for conventional DNN modeling and by 22.7% for low-rank DNN modeling, with negligible performance loss.
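The idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the dimensions, the k-means clustering step, the rule for choosing each cluster's active senones (highest average logit over the cluster's frames), and the posterior floor for unselected senones are all simplifying assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; not taken from the paper).
hidden_dim, num_senones, num_clusters = 16, 100, 4

# Stand-in for top-hidden-layer activations of training frames.
train_hidden = rng.normal(size=(500, hidden_dim))

# Step 1: cluster frames in the top-hidden-layer space
# (a plain k-means sketch standing in for the paper's clustering).
centroids = train_hidden[:num_clusters].copy()
for _ in range(10):
    dists = ((train_hidden[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = np.argmin(dists, axis=1)
    for k in range(num_clusters):
        members = train_hidden[assign == k]
        if len(members):
            centroids[k] = members.mean(axis=0)

# Output-layer parameters of the acoustic model (random stand-ins).
W = rng.normal(size=(hidden_dim, num_senones))
b = rng.normal(size=num_senones)

# Step 2: per cluster, keep only the senones with the highest
# average logit over that cluster's frames (assumed selection rule).
keep = 30  # senones retained per cluster (illustrative)
active = {}
for k in range(num_clusters):
    mean_logit = (train_hidden[assign == k] @ W + b).mean(axis=0)
    active[k] = np.argsort(mean_logit)[-keep:]

def select_posteriors(h, floor=1e-7):
    """Compute posteriors only over the nearest cluster's active senones."""
    k = int(np.argmin(((centroids - h) ** 2).sum(axis=1)))
    idx = active[k]
    logits = h @ W[:, idx] + b[idx]   # partial output-layer product
    e = np.exp(logits - logits.max())
    p = np.full(num_senones, floor)   # floor unselected senones
    p[idx] = e / e.sum()
    return p

p = select_posteriors(rng.normal(size=hidden_dim))
```

The saving comes from the partial matrix product `h @ W[:, idx]`: with `keep` of `num_senones` columns retained, the output-layer cost drops proportionally, which matters because the output layer dominates when the senone set is large.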
Acknowledgements
This work was partly supported by the National Key Research and Development Program (Grant No. 2016YFB1001300), the Fundamental Research Funds for the Central Universities (Grant No. WK2350000001), and the CAS Strategic Priority Research Program (Grant No. XDB02070006).
Cite this article
Liu, JH., Ling, ZH., Wei, S. et al. Improving the Decoding Efficiency of Deep Neural Network Acoustic Models by Cluster-Based Senone Selection. J Sign Process Syst 90, 999–1011 (2018). https://doi.org/10.1007/s11265-017-1288-9