Offline to online speaker adaptation for real-time deep neural network based LVCSR systems

Published in: Multimedia Tools and Applications

Abstract

In this study, we investigate an offline-to-online strategy for speaker adaptation of automatic speech recognition systems. These systems are trained using traditional feed-forward and the recently proposed lattice-free maximum mutual information (MMI) time-delay deep neural networks. In this strategy, the test speaker identity is modeled as an iVector that is estimated offline and then used in an online fashion during speech decoding. To ensure the quality of the iVectors, we introduce a speaker enrollment stage that collects sufficient reliable speech for estimating an accurate and stable offline iVector. Furthermore, different iVector estimation techniques are reviewed and investigated for speaker adaptation in large vocabulary continuous speech recognition (LVCSR) tasks. Experimental results on several real-time speech recognition tasks demonstrate that the proposed strategy not only provides fast decoding but also yields significant reductions in word error rate (WER) compared with traditional iVector-based speaker adaptation frameworks.
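The core idea of the strategy, as described above, is that the speaker iVector is computed once offline from enrollment speech and then held fixed while it is appended to the acoustic features of every incoming frame during online decoding. The following is a minimal sketch of that flow, assuming NumPy; the `estimate_offline_ivector` function here is a toy stand-in (a fixed random projection of the enrollment mean), not the paper's total-variability iVector extractor.

```python
import numpy as np

def estimate_offline_ivector(enroll_frames: np.ndarray, dim: int = 100) -> np.ndarray:
    """Stand-in for offline iVector extraction.

    The paper uses a total-variability model; here we simply project the
    mean of the enrollment frames into a fixed-dimensional space and
    length-normalise, purely to illustrate the offline step.
    """
    rng = np.random.default_rng(0)  # fixed projection for illustration only
    projection = rng.standard_normal((enroll_frames.shape[1], dim))
    ivector = enroll_frames.mean(axis=0) @ projection
    return ivector / (np.linalg.norm(ivector) + 1e-8)

def adapt_frames_online(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Online step: append the same offline iVector to every acoustic
    frame before it is fed to the acoustic model."""
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)

# Enrollment: e.g. 300 frames of 40-dim features, collected once offline.
enroll = np.random.randn(300, 40)
ivec = estimate_offline_ivector(enroll)

# Decoding: a 50-frame utterance is adapted with the fixed offline iVector.
utt = np.random.randn(50, 40)
adapted = adapt_frames_online(utt, ivec)
print(adapted.shape)  # (50, 140)
```

Because the iVector is fixed at decode time, the online adaptation cost is a single vector concatenation per frame, which is what enables the fast decoding speed the abstract reports.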



Corresponding author

Correspondence to Yanhua Long.


Cite this article

Long, Y., Li, Y. & Zhang, B. Offline to online speaker adaptation for real-time deep neural network based LVCSR systems. Multimed Tools Appl 77, 28101–28119 (2018). https://doi.org/10.1007/s11042-018-6041-2
