Abstract
In this paper we investigate GMM-derived features for the adaptation of context-dependent deep neural network HMM (CD-DNN-HMM) acoustic models, focusing on the fusion of the adapted GMM-derived features with conventional bottleneck features. We analyze and compare different types of fusion, such as feature-level, posterior-level, and lattice-level fusion, in order to discover the most effective combination. Experimental results on the TED-LIUM corpus show that the proposed adaptation technique can be effectively integrated into a DNN setup at different levels and provides additional gains in recognition performance: up to 6% relative word error rate reduction (WERR) over a strong speaker-adapted DNN baseline, and up to 22% relative WERR in comparison with a speaker-independent DNN baseline trained on conventional features.
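The fusion levels named in the abstract can be sketched schematically. The following is a minimal illustration, not the authors' implementation: it assumes per-utterance NumPy feature matrices of shape (frames, dims), and the function names, dimensions, and the simple linear posterior interpolation are all hypothetical stand-ins for the techniques compared in the paper.

```python
import numpy as np

def feature_level_fusion(gmmd_feats, bottleneck_feats):
    """Feature-level fusion: concatenate adapted GMM-derived (GMMD)
    features with bottleneck features frame by frame, so a single
    CD-DNN-HMM acoustic model can be trained on the fused vectors."""
    assert gmmd_feats.shape[0] == bottleneck_feats.shape[0]
    return np.concatenate([gmmd_feats, bottleneck_feats], axis=1)

def posterior_level_fusion(post_a, post_b, weight=0.5):
    """Posterior-level fusion: linearly interpolate the senone
    posteriors of two DNNs trained on the two feature types, then
    renormalize each frame to a valid distribution."""
    fused = weight * post_a + (1.0 - weight) * post_b
    return fused / fused.sum(axis=1, keepdims=True)

# Illustrative dimensions only (e.g. 40-dim GMMD, 30-dim bottleneck).
gmmd = np.random.rand(100, 40)
bn = np.random.rand(100, 30)
fused = feature_level_fusion(gmmd, bn)
print(fused.shape)  # (100, 70)
```

Lattice-level fusion, by contrast, operates after decoding, combining the recognition lattices produced by the separate systems rather than their features or posteriors.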
Acknowledgements
This work was partially funded by the European Commission through the EUMSSI project, under contract number 611057, in the framework of the FP7-ICT-2013-10 call, and by the Government of the Russian Federation, Grant 074-U01.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Tomashenko, N., Khokhlov, Y., Estève, Y. (2016). A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation. In: Král, P., Martín-Vide, C. (eds.) Statistical Language and Speech Processing. SLSP 2016. Lecture Notes in Computer Science, vol. 9918. Springer, Cham. https://doi.org/10.1007/978-3-319-45925-7_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45924-0
Online ISBN: 978-3-319-45925-7