
CM-BERT: Cross-Modal BERT for Text-Audio Sentiment Analysis

Published: 12 October 2020

Abstract

Multimodal sentiment analysis is an emerging research field that aims to enable machines to recognize, interpret, and express emotion. Through cross-modal interaction, we can obtain more comprehensive emotional characteristics of the speaker. Bidirectional Encoder Representations from Transformers (BERT) is an efficient pre-trained language representation model; fine-tuning it has obtained new state-of-the-art results on eleven natural language processing tasks, such as question answering and natural language inference. However, most previous work fine-tunes BERT based only on text data, and how to learn a better representation by introducing multimodal information is still worth exploring. In this paper, we propose the Cross-Modal BERT (CM-BERT), which relies on the interaction of the text and audio modalities to fine-tune the pre-trained BERT model. As the core unit of CM-BERT, masked multimodal attention is designed to dynamically adjust the weight of words by combining the information of the text and audio modalities. We evaluate our method on the public multimodal sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experimental results show that it significantly improves performance on all the metrics over previous baselines and text-only fine-tuning of BERT. Besides, we visualize the masked multimodal attention and show that it can reasonably adjust the weight of words by introducing audio modality information.
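
A minimal, assumption-based sketch of how such a masked multimodal attention unit could be realized is shown below: word-level text features (e.g., BERT's last-layer hidden states) and word-aligned audio features each produce attention scores, the scores are fused, padding positions are masked out, and the resulting weights re-weight the text representation. The layer names, the additive score fusion, and the dimensions (text_dim=768 for BERT-base, audio_dim=74 as a typical COVAREP feature size) are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal PyTorch sketch of a masked multimodal attention layer in the spirit of
# CM-BERT. All design details here (projections, additive score fusion, residual
# connection, feature dimensions) are illustrative assumptions, not the paper's
# exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMultimodalAttention(nn.Module):
    def __init__(self, text_dim=768, audio_dim=74, hidden_dim=768):
        super().__init__()
        # Project both modalities into a shared space before scoring.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)

    def forward(self, text_feats, audio_feats, pad_mask):
        """
        text_feats:  (batch, seq_len, text_dim)   word-level BERT hidden states
        audio_feats: (batch, seq_len, audio_dim)  audio features aligned to each word
        pad_mask:    (batch, seq_len) bool tensor, True for real tokens, False for padding
        """
        t = torch.tanh(self.text_proj(text_feats))    # (B, L, H)
        a = torch.tanh(self.audio_proj(audio_feats))  # (B, L, H)

        # Word-to-word attention scores from each modality, fused by addition so
        # that the audio modality can raise or lower the weight a word receives.
        scale = t.size(-1) ** 0.5
        text_scores = torch.matmul(t, t.transpose(1, 2)) / scale    # (B, L, L)
        audio_scores = torch.matmul(a, a.transpose(1, 2)) / scale   # (B, L, L)
        fused_scores = text_scores + audio_scores

        # Mask padding positions so they contribute no attention weight.
        key_mask = pad_mask.unsqueeze(1)                            # (B, 1, L)
        fused_scores = fused_scores.masked_fill(~key_mask, float("-inf"))
        weights = F.softmax(fused_scores, dim=-1)                   # (B, L, L)

        # Re-weight the original text representation; keep a residual connection.
        attended = torch.matmul(weights, text_feats)                # (B, L, text_dim)
        return attended + text_feats
```

In use, such a layer would sit on top of a pre-trained BERT encoder, e.g. `MaskedMultimodalAttention()(bert_hidden_states, aligned_audio, attention_mask.bool())`, with the re-weighted sequence then pooled for sentiment prediction; that wiring is likewise an assumption for illustration.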

Supplementary Material

MP4 File (3394171.3413690.mp4)
In this paper, we propose the Cross-Modal BERT (CM-BERT), which relies on the interaction of the text and audio modalities to fine-tune the pre-trained BERT model. The experimental results show that it significantly improves performance on all the metrics over previous baselines and text-only fine-tuning of BERT.




Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN:9781450379885
DOI:10.1145/3394171
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. attention network
  2. multimodal sentiment analysis
  3. pretrained model

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • National Key R&D Program Projects of China

Conference

MM '20

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


