
Textless Speech-to-Speech Translation With Limited Parallel Data

Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi


Abstract
Existing speech-to-speech translation (S2ST) models fall into two camps: they either leverage text as an intermediate step or require hundreds of hours of parallel speech data. Both approaches are incompatible with textless languages or language pairs with limited parallel data. We present PFB, a framework for training textless S2ST models that require just dozens of hours of parallel speech data. We first pretrain a model on large-scale monolingual speech data, finetune it with a small amount of parallel speech data (20–60 hours), and lastly train with an unsupervised backtranslation objective. We train and evaluate our models for English-to-German, German-to-English, and Marathi-to-English translation on three different domains (European Parliament, Common Voice, and All India Radio) with single-speaker synthesized speech. Evaluated using the ASR-BLEU metric, our models achieve reasonable performance on all three domains, with some being within 1–2 points of our higher-resourced topline.
Anthology ID:
2024.findings-emnlp.951
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
16208–16224
URL:
https://aclanthology.org/2024.findings-emnlp.951
Cite (ACL):
Anuj Diwan, Anirudh Srinivasan, David Harwath, and Eunsol Choi. 2024. Textless Speech-to-Speech Translation With Limited Parallel Data. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16208–16224, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Textless Speech-to-Speech Translation With Limited Parallel Data (Diwan et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.951.pdf