Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2204.03793 (eess)

[Submitted on 8 Apr 2022 (v1), last revised 25 Jun 2022 (this version, v3)]

Title: Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

Authors: Shaojin Ding, Rajeev Rikhye, Qiao Liang, Yanzhang He, Quan Wang, Arun Narayanan, Tom O'Malley, Ian McGraw

Abstract: Personalization of on-device speech recognition (ASR) has seen explosive growth in recent years, largely due to the increasing popularity of personal assistant features on mobile devices and smart home speakers. In this work, we present Personal VAD 2.0, a personalized voice activity detector that detects the voice activity of a target speaker, as part of a streaming on-device ASR system. Although previous proof-of-concept studies have validated the effectiveness of Personal VAD, there are still several critical challenges to address before this model can be used in production: first, the quality must be satisfactory in both enrollment and enrollment-less scenarios; second, it should operate in a streaming fashion; and finally, the model size should be small enough to fit a limited latency and CPU/Memory budget. To meet the multi-faceted requirements, we propose a series of novel designs: 1) advanced speaker embedding modulation methods; 2) a new training paradigm to generalize to enrollment-less conditions; 3) architecture and runtime optimizations for latency and resource restrictions. Extensive experiments on a realistic speech recognition system demonstrated the state-of-the-art performance of our proposed method.

Comments:	Accepted by INTERSPEECH 2022
Subjects:	Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2204.03793 [eess.AS]
	(or arXiv:2204.03793v3 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2204.03793

Submission history

From: Shaojin Ding
[v1] Fri, 8 Apr 2022 00:49:19 UTC (177 KB)
[v2] Wed, 13 Apr 2022 04:17:24 UTC (177 KB)
[v3] Sat, 25 Jun 2022 02:12:45 UTC (178 KB)
