research-article

Learning to Listen... On-Device: Present and future perspectives of on-device ASR

Authors:

Ravichander Vipperla,

Sourav Bhattacharya,

Ilias Leontiadis,

Nicholas D. Lane

GetMobile: Mobile Computing and Communications, Volume 23, Issue 4

Pages 5 - 9

https://doi.org/10.1145/3400713.3400715

Published: 18 May 2020

Abstract

We have reached an important milestone in Automatic Speech Recognition (ASR) technology: major industrial AI companies such as Samsung, Google, Apple, and Amazon are releasing high-quality ASR models that run completely on-device, e.g., on consumer smartphones. This is the consequence of giant strides in technology: from making commercial-grade ASR systems feasible, to large-scale cloud deployments, to the present-day state-of-the-art models that run on resource-constrained devices.


Published In

GetMobile: Mobile Computing and Communications Volume 23, Issue 4

December 2019

34 pages

ISSN:2375-0529

EISSN:2375-0537

DOI:10.1145/3400713

Editor:
Landon Cox
Microsoft Research

Copyright © 2020 held by the owner/author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 May 2020

Published in SIGMOBILE-GETMOBILE Volume 23, Issue 4
