End-to-end attention-based large vocabulary speech recognition

D Bahdanau, J Chorowski, D Serdyuk… - 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016 - ieeexplore.ieee.org
Many state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems are hybrids of neural networks and Hidden Markov Models (HMMs). Recently, more direct end-to-end methods have been investigated, in which neural architectures are trained to model sequences of characters [1, 2]. To our knowledge, all of these approaches have relied on Connectionist Temporal Classification (CTC) [3] modules. We investigate an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels. We show how this setup can be applied to LVCSR by integrating the decoding RNN with an n-gram language model, and how its operation can be sped up by constraining the selections made by the attention mechanism and by pooling information over time to shorten the source sequences. Recognition accuracies similar to those of other HMM-free RNN-based approaches are reported on the Wall Street Journal corpus.
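
The two speed-ups named in the abstract, constrained (windowed) attention and pooling over time, are easy to picture in code. Below is a minimal NumPy sketch of a single Bahdanau-style content-based attention read over pooled encoder states. The function names (attention_read, pool_over_time), the weight shapes, and the hard window are illustrative assumptions, not the authors' implementation, and the n-gram language-model integration used during beam-search decoding is omitted.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def pool_over_time(h, stride=2):
        # Shorten the source sequence by keeping every `stride`-th encoder
        # state, standing in for the pooling over time named in the abstract.
        return h[::stride]

    def attention_read(s, h, W_s, W_h, w, window=None):
        # Content-based attention: score each encoder state h[j] against the
        # decoder state s, normalize to alignment weights, and return the
        # expected encoder state as the context vector.
        scores = np.tanh(h @ W_h.T + s @ W_s.T) @ w   # one score per frame
        if window is not None:
            lo, hi = window                            # constrain selection to
            masked = np.full_like(scores, -np.inf)     # a window of frames
            masked[lo:hi] = scores[lo:hi]
            scores = masked
        alpha = softmax(scores)                        # alignment weights
        return alpha @ h, alpha                        # context vector, weights

    # Toy usage: 50 input frames of 16-dim encoder states, pooled to 25.
    rng = np.random.default_rng(0)
    T, n_enc, n_dec, n_att = 50, 16, 16, 8
    h = pool_over_time(rng.standard_normal((T, n_enc)), stride=2)
    s = rng.standard_normal(n_dec)
    W_h = rng.standard_normal((n_att, n_enc))
    W_s = rng.standard_normal((n_att, n_dec))
    w = rng.standard_normal(n_att)
    context, alpha = attention_read(s, h, W_s, W_h, w, window=(5, 15))

In a full recognizer, the context vector would condition the decoding RNN's next character prediction, and beam search would combine those character scores with n-gram language-model scores; the abstract does not specify the combination scheme, so it is left out of the sketch.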