Exploiting variable length segments with coarticulation effect in online speech recognition based on deep bidirectional recurrent neural network and context-sensitive segment

International Journal of Speech Technology

Abstract

A deep bidirectional recurrent neural network (DBRNN) is a powerful acoustic model that can capture the dynamics and coarticulation effects of the speech signal. It models temporal sequences that depend on both left and right context, whereas a deep unidirectional recurrent neural network models temporal sequences that depend only on past information. When a traditional DBRNN is used for online speech recognition, context-sensitive segments of a carefully chosen fixed length are exploited to balance recognition accuracy against latency, because the decoder would otherwise have to wait for the whole input sequence before each evaluation. At the same time, the acoustic realization of a phoneme depends not only on the phoneme to its left but also on the phoneme to its right, and acoustic modeling for speech recognition should take this into account. In this paper, we propose a DBRNN-based online speech recognition method that selects and exploits variable-length chunks to account for the coarticulation effects that arise in speech production. To select variable-length segments that preserve these coarticulation effects, we use vowel identification points predicted by a deep unidirectional recurrent neural network, and we train the DBRNN for online recognition on the resulting segments. The deep unidirectional recurrent neural network that predicts the variable-length segments is trained with the connectionist temporal classification (CTC) method. In experiments on Korean speech recognition, we show that the online DBRNN acoustic model built from variable-length chunks with coarticulation effects effectively limits recognition latency, achieves performance comparable to a traditional offline DBRNN, and outperforms online recognition based on fixed-length context-sensitive chunks.
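The abstract only outlines the two-stage pipeline, so the following is a minimal, hypothetical illustration of it. All module sizes, label sets, function names, and context margins below are assumptions made for illustration, not the authors' implementation. First, a deep unidirectional LSTM is trained with CTC to emit vowel labels (PyTorch's `nn.CTCLoss` is used here as a stand-in for the paper's CTC training):

```python
# Hypothetical sketch: CTC training of a unidirectional RNN that spots
# vowels, standing in for the paper's vowel-identification-point predictor.
# Feature dimension, depth, and the 5-vowel label set are assumptions.
import torch
import torch.nn as nn

class VowelSpotter(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, num_labels=6):  # 5 vowels + CTC blank
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=3)      # unidirectional
        self.out = nn.Linear(hidden, num_labels)

    def forward(self, x):                        # x: (T, N, feat_dim)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(dim=-1)   # (T, N, num_labels)

model = VowelSpotter()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

x = torch.randn(120, 4, 40)                      # 120 frames, batch of 4
targets = torch.randint(1, 6, (4, 8))            # 8 vowel labels per utterance
loss = ctc(model(x), targets,
           input_lengths=torch.full((4,), 120, dtype=torch.long),
           target_lengths=torch.full((4,), 8, dtype=torch.long))
loss.backward()
```

At run time, frames where the greedy CTC output is non-blank can serve as vowel identification points (CTC predictions are typically spiky), and the variable-length chunks fed to the DBRNN are then the vowel-to-vowel intervals, padded with a few context frames so coarticulation at the chunk edges remains visible. A sketch under those assumptions:

```python
# Hypothetical decoding + segmentation; context sizes are illustrative.
def vowel_points(log_probs, blank=0):
    """log_probs: (T, num_labels) for one utterance; returns spike frames."""
    best = log_probs.argmax(dim=-1).tolist()     # greedy label per frame
    return [t for t, lab in enumerate(best) if lab != blank]

def variable_length_segments(num_frames, points, left=10, right=10):
    """(start, end) frame ranges: one chunk per vowel-to-vowel interval."""
    bounds = [0] + sorted(points) + [num_frames]
    return [(max(0, s - left), min(num_frames, e + right))
            for s, e in zip(bounds[:-1], bounds[1:]) if e > s]
```

Each (start, end) range would then be cut from the utterance and used both to train the DBRNN and, online, to bound how far ahead the recognizer must look before emitting output.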



Acknowledgements

We thank Dr. Kim and Dr. Ri for helpful discussions, and the anonymous reviewers and editors for their many invaluable comments and suggestions, which improved this paper.

Author information


Corresponding author

Correspondence to Song-Il Mun.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Mun, SI., Han, CJ. & Hong, HS. Exploiting variable length segments with coarticulation effect in online speech recognition based on deep bidirectional recurrent neural network and context-sensitive segment. Int J Speech Technol 25, 135–146 (2022). https://doi.org/10.1007/s10772-021-09885-1


