End-to-end attention-based large vocabulary speech recognition

D Bahdanau, J Chorowski, D Serdyuk… - 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016 - ieeexplore.ieee.org
Many state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems are hybrids of neural networks and Hidden Markov Models (HMMs). Recently, more direct end-to-end methods have been investigated, in which neural architectures are trained to model sequences of characters [1, 2]. To our knowledge, all of these approaches have relied on Connectionist Temporal Classification (CTC) [3] modules. We investigate an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels. We show how this setup can be applied to LVCSR by integrating the decoding RNN with an n-gram language model, and how its operation can be sped up by constraining the selections made by the attention mechanism and by pooling information over time to shorten the source sequences. Recognition accuracies similar to those of other HMM-free RNN-based approaches are reported on the Wall Street Journal corpus.
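
The two speed-ups named in the abstract, constrained (windowed) attention and pooling over time, are easy to picture in code. Below is a minimal NumPy sketch of a single Bahdanau-style content-based attention read over pooled encoder states. The function names (attention_read, pool_over_time), the weight shapes, and the hard window are illustrative assumptions, not the authors' implementation, and the n-gram language-model integration used during beam-search decoding is omitted.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def pool_over_time(h, stride=2):
        # Shorten the source sequence by keeping every `stride`-th encoder
        # state, standing in for the pooling over time named in the abstract.
        return h[::stride]

    def attention_read(s, h, W_s, W_h, w, window=None):
        # Content-based attention: score each encoder state h[j] against the
        # decoder state s, normalize to alignment weights, and return the
        # expected encoder state as the context vector.
        scores = np.tanh(h @ W_h.T + s @ W_s.T) @ w   # one score per frame
        if window is not None:
            lo, hi = window                            # constrain selection to
            masked = np.full_like(scores, -np.inf)     # a window of frames
            masked[lo:hi] = scores[lo:hi]
            scores = masked
        alpha = softmax(scores)                        # alignment weights
        return alpha @ h, alpha                        # context vector, weights

    # Toy usage: 50 input frames of 16-dim encoder states, pooled to 25.
    rng = np.random.default_rng(0)
    T, n_enc, n_dec, n_att = 50, 16, 16, 8
    h = pool_over_time(rng.standard_normal((T, n_enc)), stride=2)
    s = rng.standard_normal(n_dec)
    W_h = rng.standard_normal((n_att, n_enc))
    W_s = rng.standard_normal((n_att, n_dec))
    w = rng.standard_normal(n_att)
    context, alpha = attention_read(s, h, W_s, W_h, w, window=(5, 15))

In a full recognizer, the context vector would condition the decoding RNN's next character prediction, and beam search would combine those character scores with n-gram language-model scores; the abstract does not specify the combination scheme, so it is left out of the sketch.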