2024
pdf
bib
abs
Massive End-to-end Speech Recognition Models with Time Reduction
Weiran Wang
|
Rohit Prabhavalkar
|
Haozhe Shan
|
Zhong Meng
|
Dongseong Hwang
|
Qiujia Li
|
Khe Chai Sim
|
Bo Li
|
James Qin
|
Xingyu Cai
|
Adam Stooke
|
Chengjian Zheng
|
Yanzhang He
|
Tara Sainath
|
Pedro Moreno Mengibar
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
We investigate massive end-to-end automatic speech recognition (ASR) models with efficiency improvements achieved by time reduction. The encoders of our models use the neural architecture of Google’s universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference. We also explore a few practical methods to mitigate potential accuracy loss due to time reduction, while enjoying most efficiency gain. Our methods are demonstrated to work with both Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), with up to 2B model parameters, and over two domains. For a large-scale voice search recognition task, we perform extensive studies on vocabulary size, time reduction strategy, and its generalization performance on long-form test sets, and show that a 900M RNN-T is very tolerant to severe time reduction, with as low encoder output frame rate as 640ms. We also provide ablation studies on the Librispeech benchmark for important training hyperparameters and architecture designs, in training 600M RNN-T models at the frame rate of 160ms.
pdf
bib
abs
Deferred NAM: Low-latency Top-K Context Injection via Deferred Context Encoding for Non-Streaming ASR
Zelin Wu
|
Gan Song
|
Christopher Li
|
Pat Rondon
|
Zhong Meng
|
Xavier Velez
|
Weiran Wang
|
Diamantino Caseiro
|
Golan Pundak
|
Tsendsuren Munkhdalai
|
Angad Chandorkar
|
Rohit Prabhavalkar
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)
Contextual biasing enables speech recognizers to transcribe important phrases in the speaker’s context, such as contact names, even if they are rare in, or absent from, the training data. Attention-based biasing is a leading approach which allows for full end-to-end cotraining of the recognizer and biasing system and requires no separate inference-time components. Such biasers typically consist of a context encoder; followed by a context filter which narrows down the context to apply, improving per-step inference time; and, finally, context application via cross attention. Though much work has gone into optimizing per-frame performance, the context encoder is at least as important: recognition cannot begin before context encoding ends. Here, we show the lightweight phrase selection pass can be moved before context encoding, resulting in a speedup of up to 16.1 times and enabling biasing to scale to 20K phrases with a maximum pre-decoding delay under 33ms. With the addition of phrase- and wordpiece-level cross-entropy losses, our technique also achieves up to a 37.5% relative WER reduction over the baseline without the losses and lightweight phrase selection pass.
2019
pdf
bib
abs
Multimodal and Multi-view Models for Emotion Recognition
Gustavo Aguilar
|
Viktor Rozgic
|
Weiran Wang
|
Chao Wang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Studies on emotion recognition (ER) show that combining lexical and acoustic information results in more robust and accurate models. The majority of the studies focus on settings where both modalities are available in training and evaluation. However, in practice, this is not always the case; getting ASR output may represent a bottleneck in a deployment pipeline due to computational complexity or privacy-related constraints. To address this challenge, we study the problem of efficiently combining acoustic and lexical modalities during training while still providing a deployable acoustic model that does not require lexical inputs. We first experiment with multimodal models and two attention mechanisms to assess the extent of the benefits that lexical information can provide. Then, we frame the task as a multi-view learning problem to induce semantic information from a multimodal model into our acoustic-only network using a contrastive loss function. Our multimodal model outperforms the previous state of the art on the USC-IEMOCAP dataset reported on lexical and acoustic information. Additionally, our multi-view-trained acoustic network significantly surpasses models that have been exclusively trained with acoustic features.
2015
pdf
bib
Deep Multilingual Correlation for Improved Word Embeddings
Ang Lu
|
Weiran Wang
|
Mohit Bansal
|
Kevin Gimpel
|
Karen Livescu
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies