Computer Science > Computation and Language

arXiv:2302.14030 (cs)

[Submitted on 27 Feb 2023 (v1), last revised 9 Oct 2023 (this version, v3)]

Title: Multimodal Speech Recognition for Language-Guided Embodied Agents

Authors: Allen Chang, Xiaoyuan Zhu, Aarav Monga, Seoho Ahn, Tejas Srinivasan, Jesse Thomason

Abstract: Benchmarks for language-guided embodied agents typically assume text-based instructions, but deployed agents will encounter spoken instructions. While Automatic Speech Recognition (ASR) models can bridge the input gap, erroneous ASR transcripts can hurt the agents' ability to complete tasks. In this work, we propose training a multimodal ASR model to reduce errors in transcribing spoken instructions by considering the accompanying visual context. We train our model on a dataset of spoken instructions, synthesized from the ALFRED task completion dataset, where we simulate acoustic noise by systematically masking spoken words. We find that utilizing visual observations facilitates masked word recovery, with multimodal ASR models recovering up to 30% more masked words than unimodal baselines. We also find that a text-trained embodied agent successfully completes tasks more often by following transcribed instructions from multimodal ASR models.
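
As a rough illustration of the masking-and-recovery evaluation the abstract describes, the sketch below masks a fraction of the words in an instruction and scores how many masked positions a transcript gets right. It is a text-level stand-in under stated assumptions, not the authors' pipeline: the paper masks words in synthesized speech, whereas this operates on word lists, and the names `MASK`, `mask_words`, and `recovery_rate`, the uniform random choice of words, and the position-wise comparison are all hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token for a masked word


def mask_words(words, mask_rate, seed=0):
    """Replace round(len(words) * mask_rate) randomly chosen words with MASK.

    Returns the masked word list and the sorted indices that were masked.
    Uniform random selection is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    n_mask = round(len(words) * mask_rate)
    masked_idx = sorted(rng.sample(range(len(words)), n_mask))
    chosen = set(masked_idx)
    masked = [MASK if i in chosen else w for i, w in enumerate(words)]
    return masked, masked_idx


def recovery_rate(reference, hypothesis, masked_idx):
    """Fraction of masked positions where the hypothesis matches the reference.

    Assumes the hypothesis is already word-aligned with the reference;
    positions missing from the hypothesis count as not recovered.
    """
    if not masked_idx:
        return 0.0
    hits = sum(
        1
        for i in masked_idx
        if i < len(hypothesis) and hypothesis[i] == reference[i]
    )
    return hits / len(masked_idx)


if __name__ == "__main__":
    instruction = "put the mug on the table".split()
    masked, idx = mask_words(instruction, mask_rate=0.5, seed=1)
    print(masked, idx)
    # A transcript identical to the reference recovers every masked word.
    print(recovery_rate(instruction, instruction, idx))
```

Comparing `recovery_rate` for transcripts from two different models over the same masked positions gives a relative measure of masked-word recovery, in the spirit of the unimodal-versus-multimodal comparison reported in the abstract.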

Comments:	5 pages, 5 figures
Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2302.14030 [cs.CL]
	(or arXiv:2302.14030v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2302.14030
Journal reference:	Proceedings of Interspeech 2023, 1608-1612
Related DOI:	https://doi.org/10.21437/Interspeech.2023-2262

Submission history

From: Allen Chang
[v1] Mon, 27 Feb 2023 18:41:48 UTC (1,271 KB)
[v2] Wed, 31 May 2023 21:02:09 UTC (1,286 KB)
[v3] Mon, 9 Oct 2023 22:19:02 UTC (1,286 KB)
