Abstract
Cross-modal attention mechanisms (CMA) have driven significant advances in multimodal sentiment analysis. However, because of the inherent limitations of CMA, the modality-specific information needed to distinguish similar samples is often overlooked. To address this issue, we propose a Text-centered Cross-sample Fusion Network (TeCaFN), which employs cross-sample fusion to capture modality-specific information during modal fusion. Specifically, we develop a cross-sample fusion method that merges modalities from distinct samples and preserves fine-grained modality-specific information through adversarial training combined with a pairwise prediction task. Furthermore, a two-stage text-centered contrastive learning mechanism is introduced to stabilize cross-sample fusion learning. TeCaFN achieves state-of-the-art results on the CMU-MOSI, CMU-MOSEI, and UR-FUNNY datasets, and ablation studies confirm that its contrastive learning and adversarial training components contribute to the improved performance. The code implementation of this paper is available at https://github.com/TheShy-Dream/MSA-TeCaFN.
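To make the idea concrete, the following is a minimal PyTorch-style sketch of cross-sample fusion with a text-centered contrastive objective: text features from one sample attend to the audio and visual features of a different sample, and an InfoNCE-style loss ties each text representation to its own fused representation. This is an illustrative assumption, not the TeCaFN implementation (see the repository above); all module and variable names are hypothetical.

```python
# Illustrative sketch only (PyTorch); not the authors' implementation.
# Text is the query; audio/visual streams come from a shuffled (cross-sample) batch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossSampleFusion(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4, temperature: float = 0.07):
        super().__init__()
        # One cross-modal attention block per non-text modality (hypothetical design).
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temperature = temperature

    def forward(self, text, audio, visual):
        # text/audio/visual: (batch, seq_len, dim)
        # Shuffle the batch so each text is fused with another sample's audio/visual stream.
        perm = torch.randperm(text.size(0), device=text.device)
        h_a, _ = self.attn_a(text, audio[perm], audio[perm])
        h_v, _ = self.attn_v(text, visual[perm], visual[perm])
        return h_a + h_v, perm

    def contrastive_loss(self, text_emb, fused_emb):
        # Text-centered InfoNCE: matching (text, fused) pairs are positives,
        # every other pair in the batch is a negative.
        t = F.normalize(text_emb, dim=-1)
        f = F.normalize(fused_emb, dim=-1)
        logits = t @ f.t() / self.temperature
        labels = torch.arange(t.size(0), device=t.device)
        return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    batch, seq, dim = 8, 20, 128
    model = CrossSampleFusion(dim)
    text, audio, visual = (torch.randn(batch, seq, dim) for _ in range(3))
    fused, perm = model(text, audio, visual)
    loss = model.contrastive_loss(text.mean(1), fused.mean(1))
    print(fused.shape, loss.item())
```

In this sketch the returned permutation index would let a pairwise prediction head decide whether a fused representation came from the same sample as the text, which is one plausible way to realize the adversarial/pairwise objective described in the abstract.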
Data availability
No datasets were generated or analysed during the current study.
Funding
This work was supported by the "Pioneer" and "Leading Goose" R&D Program of Zhejiang (No. 2022C03106), the National Natural Science Foundation of China (Nos. 62207028 and 62337001), and in part by the Zhejiang Provincial Natural Science Foundation (No. LY23F020009), the Scientific Research Fund of the Zhejiang Provincial Education Department (No. 2023SCG367), and the Special Research Project of Zhejiang Normal University on Serving Provincial Strategic Planning and Promoting the Construction of Common Prosperity.
Author information
Contributions
Q.H. completed conceptualization, methodology, software, validation, writing and editing. J.C. completed conceptualization, methodology, formal analysis, software, validation, visualization and writing the original draft. C.H. completed validation, visualization, writing and editing. Y.W. completed validation, visualization, writing and editing. X.H. completed methodology, validation, visualization, writing and editing. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Communicated by Bing-kun Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Huang, Q., Chen, J., Huang, C. et al. Text-centered cross-sample fusion network for multimodal sentiment analysis. Multimedia Systems 30, 228 (2024). https://doi.org/10.1007/s00530-024-01421-w