Abstract
In this study, we investigate an offline-to-online strategy for speaker adaptation of automatic speech recognition systems. These systems are trained using traditional feed-forward and recently proposed lattice-free maximum mutual information (MMI) time-delay deep neural networks. In this strategy, the test speaker's identity is modeled as an iVector that is estimated offline and then used in an online fashion during decoding. To ensure the quality of the iVectors, we introduce a speaker enrollment stage that collects sufficient reliable speech for estimating an accurate and stable offline iVector. Furthermore, different iVector estimation techniques are reviewed and investigated for speaker adaptation in large vocabulary continuous speech recognition (LVCSR) tasks. Experimental results on several real-time speech recognition tasks demonstrate that the proposed strategy not only provides fast decoding but also yields significant reductions in word error rate (WER) compared with traditional iVector-based speaker adaptation frameworks.
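The core of the online use of an offline iVector is simple: the speaker iVector, computed once from enrollment speech, is appended as a fixed auxiliary input to every acoustic frame the network sees during decoding. The following is a minimal sketch of that augmentation step only; the function name, dimensions, and the assumption that features are stored as a frames-by-coefficients matrix are illustrative, not from the paper, and a real iVector extractor would replace the precomputed vector used here.

```python
import numpy as np

def append_speaker_ivector(features, ivector):
    """Append one fixed, offline-estimated speaker iVector to every acoustic
    frame (hypothetical helper; rows of `features` are frames)."""
    num_frames = features.shape[0]
    tiled = np.tile(ivector, (num_frames, 1))          # shape (T, ivector_dim)
    # Network input becomes [frame_features ; speaker_ivector] per frame.
    return np.concatenate([features, tiled], axis=1)   # shape (T, feat_dim + ivector_dim)

# Illustrative sizes: 5 frames of 40-dim features, a 100-dim iVector.
feats = np.zeros((5, 40))
ivec = np.ones(100)        # stands in for the enrollment-estimated iVector
augmented = append_speaker_ivector(feats, ivec)
print(augmented.shape)     # (5, 140)
```

Because the iVector is fixed for the whole utterance, this augmentation adds negligible per-frame cost at decode time, which is what makes the offline-estimation/online-use split attractive for real-time systems.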
Long, Y., Li, Y. & Zhang, B. Offline to online speaker adaptation for real-time deep neural network based LVCSR systems. Multimed Tools Appl 77, 28101–28119 (2018). https://doi.org/10.1007/s11042-018-6041-2