Abstract
Cross-modal attention mechanisms (CMA) have driven significant advances in multimodal sentiment analysis. However, because of the inherent limitations of CMA, the modality-specific information needed to distinguish similar samples is often overlooked. To address this issue, we propose a Text-centered Cross-sample Fusion Network (TeCaFN), which employs cross-sample fusion to capture modality-specific information during modal fusion. Specifically, we develop a cross-sample fusion method that merges modalities from distinct samples and preserves fine-grained modality-specific information through adversarial training combined with a pairwise prediction task. Furthermore, a two-stage text-centered contrastive learning mechanism is introduced to stabilize cross-sample fusion learning. TeCaFN achieves state-of-the-art results on the CMU-MOSI, CMU-MOSEI, and UR-FUNNY datasets, and ablation studies confirm that its contrastive learning and adversarial training components contribute to the improved performance. The code implementation of this paper is available at https://github.com/TheShy-Dream/MSA-TeCaFN.
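To make the idea concrete, the following is a minimal PyTorch-style sketch of cross-sample fusion with a text-centered contrastive objective: text features from one sample attend to the audio and visual features of a different sample, and an InfoNCE-style loss ties each text representation to its own fused representation. This is an illustrative assumption, not the TeCaFN implementation (see the repository above); all module and variable names are hypothetical.

```python
# Illustrative sketch only (PyTorch); not the authors' implementation.
# Text is the query; audio/visual streams come from a shuffled (cross-sample) batch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossSampleFusion(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4, temperature: float = 0.07):
        super().__init__()
        # One cross-modal attention block per non-text modality (hypothetical design).
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temperature = temperature

    def forward(self, text, audio, visual):
        # text/audio/visual: (batch, seq_len, dim)
        # Shuffle the batch so each text is fused with another sample's audio/visual stream.
        perm = torch.randperm(text.size(0), device=text.device)
        h_a, _ = self.attn_a(text, audio[perm], audio[perm])
        h_v, _ = self.attn_v(text, visual[perm], visual[perm])
        return h_a + h_v, perm

    def contrastive_loss(self, text_emb, fused_emb):
        # Text-centered InfoNCE: matching (text, fused) pairs are positives,
        # every other pair in the batch is a negative.
        t = F.normalize(text_emb, dim=-1)
        f = F.normalize(fused_emb, dim=-1)
        logits = t @ f.t() / self.temperature
        labels = torch.arange(t.size(0), device=t.device)
        return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    batch, seq, dim = 8, 20, 128
    model = CrossSampleFusion(dim)
    text, audio, visual = (torch.randn(batch, seq, dim) for _ in range(3))
    fused, perm = model(text, audio, visual)
    loss = model.contrastive_loss(text.mean(1), fused.mean(1))
    print(fused.shape, loss.item())
```

In this sketch the returned permutation index would let a pairwise prediction head decide whether a fused representation came from the same sample as the text, which is one plausible way to realize the adversarial/pairwise objective described in the abstract.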
Data availability
No datasets were generated or analysed during the current study.
Funding
This work was supported by the "Pioneer" and "Leading Goose" R&D Program of Zhejiang (No. 2022C03106), the National Natural Science Foundation of China (Nos. 62207028 and 62337001), and in part by the Zhejiang Provincial Natural Science Foundation (No. LY23F020009), the Scientific Research Fund of the Zhejiang Provincial Education Department (No. 2023SCG367), and the Special Research Project of Zhejiang Normal University on Serving Provincial Strategic Planning and Promoting the Construction of Common Prosperity.
Author information
Contributions
Q.H. completed conceptualization, methodology, software, validation, writing and editing. J.C. completed conceptualization, methodology, formal analysis, software, validation, visualization and writing the original draft. C.H. completed validation, visualization, writing and editing. Y.W. completed validation, visualization, writing and editing. X.H. completed methodology, validation, visualization, writing and editing. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Communicated by Bing-kun Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Huang, Q., Chen, J., Huang, C. et al. Text-centered cross-sample fusion network for multimodal sentiment analysis. Multimedia Systems 30, 228 (2024). https://doi.org/10.1007/s00530-024-01421-w