Abstract
In this paper we investigate GMM-derived features for the adaptation of context-dependent deep neural network HMM (CD-DNN-HMM) acoustic models, focusing on the fusion of the adapted GMM-derived features with conventional bottleneck features. We analyze and compare different types of fusion, such as feature-level, posterior-level, and lattice-level fusion, in order to discover the most effective combination. Experimental results on the TED-LIUM corpus show that the proposed adaptation technique can be effectively integrated into a DNN setup at different levels and provides additional gains in recognition performance: up to 6% relative word error rate reduction (WERR) over a strong speaker-adapted DNN baseline, and up to 22% relative WERR in comparison with a speaker-independent DNN baseline trained on conventional features.
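The fusion levels named in the abstract can be sketched schematically. The following is a minimal illustration, not the authors' implementation: it assumes per-utterance NumPy feature matrices of shape (frames, dims), and the function names, dimensions, and the simple linear posterior interpolation are all hypothetical stand-ins for the techniques compared in the paper.

```python
import numpy as np

def feature_level_fusion(gmmd_feats, bottleneck_feats):
    """Feature-level fusion: concatenate adapted GMM-derived (GMMD)
    features with bottleneck features frame by frame, so a single
    CD-DNN-HMM acoustic model can be trained on the fused vectors."""
    assert gmmd_feats.shape[0] == bottleneck_feats.shape[0]
    return np.concatenate([gmmd_feats, bottleneck_feats], axis=1)

def posterior_level_fusion(post_a, post_b, weight=0.5):
    """Posterior-level fusion: linearly interpolate the senone
    posteriors of two DNNs trained on the two feature types, then
    renormalize each frame to a valid distribution."""
    fused = weight * post_a + (1.0 - weight) * post_b
    return fused / fused.sum(axis=1, keepdims=True)

# Illustrative dimensions only (e.g. 40-dim GMMD, 30-dim bottleneck).
gmmd = np.random.rand(100, 40)
bn = np.random.rand(100, 30)
fused = feature_level_fusion(gmmd, bn)
print(fused.shape)  # (100, 70)
```

Lattice-level fusion, by contrast, operates after decoding, combining the recognition lattices produced by the separate systems rather than their features or posteriors.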
Acknowledgements
This work was partially funded by the European Commission through the EUMSSI project, under contract number 611057, in the framework of the FP7-ICT-2013-10 call, and by the Government of the Russian Federation, Grant 074-U01.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Tomashenko, N., Khokhlov, Y., Estève, Y. (2016). A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation. In: Král, P., Martín-Vide, C. (eds.) Statistical Language and Speech Processing. SLSP 2016. Lecture Notes in Computer Science, vol. 9918. Springer, Cham. https://doi.org/10.1007/978-3-319-45925-7_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45924-0
Online ISBN: 978-3-319-45925-7