
Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts

Published: 28 October 2024

Abstract

In the burgeoning field of Audio-Visual Speech Recognition (AVSR), extant research has predominantly concentrated on training paradigms tailored to high-quality resources. However, owing to the challenges inherent in real-world data collection, audio-visual data are frequently affected by modality-distortion, which encompasses audio-visual asynchrony, video noise, and audio noise. The recognition accuracy of existing AVSR methods is significantly compromised when multiple modality-distortions coexist in low-resource data. In light of the above challenges, we propose PCD: cluster-Prompt with Contrastive Decomposition, a robust framework for speech recognition under modality-distortion, specifically devised to transfer pre-trained knowledge from the high-resource domain to the target domain by leveraging contrast-augmented prompts. In contrast to previous studies, we take into consideration the possibility of various types of distortion in both the audio and visual modalities. Concretely, we design bespoke prompts to delineate each modality-distortion, guiding the model toward speech recognition applicable to various distortion scenarios with very few learnable parameters. To materialize the prompt mechanism, we employ multiple cluster-based strategies that better suit the pre-trained audio-visual model. Additionally, we design a contrastive decomposition mechanism to constrain the explicit relationships among the various modality conditions, given their shared task knowledge and disparate modality priors. Extensive results on the LRS2 dataset demonstrate that PCD achieves state-of-the-art performance for audio-visual speech recognition under the constraints of distorted resources. Code is available at https://github.com/ballooncatt/PCD.
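The abstract names two mechanisms: distortion-specific prompts realized through cluster-based strategies, and a contrastive decomposition that separates the prompt representations of different modality conditions. Since only the abstract is reproduced here, the sketch below is a hypothetical illustration of how such a pair of mechanisms could look in PyTorch; the names (DistortionPrompts, contrastive_decomposition_loss) are placeholders of ours, not the authors' API, and the actual method is in the linked repository.

```python
# Hypothetical sketch only -- not the authors' implementation.
# See https://github.com/ballooncatt/PCD for the real code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistortionPrompts(nn.Module):
    """One bank of learnable prompt vectors per modality-distortion
    condition (e.g. audio noise, video noise, A/V asynchrony)."""

    def __init__(self, num_conditions: int, prompt_len: int, dim: int):
        super().__init__()
        # Only these parameters would be trained; the pre-trained
        # audio-visual backbone stays frozen.
        self.prompts = nn.Parameter(
            torch.randn(num_conditions, prompt_len, dim) * 0.02
        )

    def forward(self, condition_ids: torch.Tensor) -> torch.Tensor:
        # condition_ids: [batch] integer labels of the active distortion.
        # Returns [batch, prompt_len, dim] prompts to prepend to the
        # backbone's input sequence.
        return self.prompts[condition_ids]

def contrastive_decomposition_loss(
    prompt_bank: torch.Tensor, temperature: float = 0.1
) -> torch.Tensor:
    """InfoNCE-style term that pushes the pooled prompt of each
    distortion condition away from the others, on the assumption that
    shared task knowledge lives in the frozen backbone while the
    prompts encode condition-specific priors."""
    pooled = F.normalize(prompt_bank.mean(dim=1), dim=-1)   # [C, dim]
    logits = pooled @ pooled.t() / temperature              # [C, C]
    # Each condition is its own positive and every other condition is
    # a negative, so the loss decorrelates the prompt banks.
    labels = torch.arange(pooled.size(0), device=pooled.device)
    return F.cross_entropy(logits, labels)

# Example: three distortion conditions, prompts of length 8, dim 256.
prompts = DistortionPrompts(num_conditions=3, prompt_len=8, dim=256)
batch_prompts = prompts(torch.tensor([0, 2, 1]))            # [3, 8, 256]
loss = contrastive_decomposition_loss(prompts.prompts)
```

In a setup like this, a frozen pre-trained AVSR backbone would consume the selected prompts prepended to its input features, and only the prompt parameters would be updated, which is consistent with the abstract's claim of very few learnable parameters.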


Cited By

  • (2024) Calibrating Prompt from History for Continual Vision-Language Retrieval and Grounding. Proceedings of the 32nd ACM International Conference on Multimedia, 4302-4311. https://doi.org/10.1145/3664647.3681387. Online publication date: 28-Oct-2024.
  • (2024) SyncTalklip: Highly Synchronized Lip-Readable Speaker Generation with Multi-Task Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 8149-8158. https://doi.org/10.1145/3664647.3681386. Online publication date: 28-Oct-2024.
  • (2024) Low-rank Prompt Interaction for Continual Vision-Language Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia, 8257-8266. https://doi.org/10.1145/3664647.3681264. Online publication date: 28-Oct-2024.


      Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024, 11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. audio-visual speech recognition
  2. modality-distortion
  3. multi-modal learning

Qualifiers

  • Research-article

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (last 12 months): 62
  • Downloads (last 6 weeks): 53

Reflects downloads up to 14 Dec 2024

