Text-centered cross-sample fusion network for multimodal sentiment analysis

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

Significant advancements in multimodal sentiment analysis have been achieved through cross-modal attention mechanisms (CMA). However, the importance of modality-specific information for distinguishing similar samples is often overlooked due to the inherent limitations of CMA. To address this issue, we propose a Text-centered Cross-sample Fusion Network (TeCaFN), which employs cross-sample fusion to perceive modality-specific information during modal fusion. Specifically, we develop a cross-sample fusion method that merges modalities from distinct samples. This method preserves fine-grained modality-specific information through adversarial training combined with a pairwise prediction task. Furthermore, a two-stage text-centric contrastive learning mechanism is developed to enhance the stability of cross-sample fusion learning. TeCaFN achieves state-of-the-art results on the CMU-MOSI, CMU-MOSEI, and UR-FUNNY datasets. Moreover, ablation studies demonstrate the effectiveness of contrastive learning and adversarial training as components of TeCaFN in improving model performance. The code implementation of this paper is available at https://github.com/TheShy-Dream/MSA-TeCaFN.
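
To make the fusion strategy described above concrete, the following is a minimal PyTorch-style sketch of text-centered cross-sample fusion with a text-anchored contrastive loss and a pairwise ("same sample or not") prediction target. This is an illustrative sketch, not the authors' implementation: module names, feature dimensions, and the in-batch shuffling strategy are assumptions, and the adversarial-training component is omitted; the official code is available at the repository linked above.

```python
# Illustrative sketch only -- not the official TeCaFN code (see the linked repo).
# All names and dimensions below are assumptions for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossSampleFusion(nn.Module):
    """Fuse text features with audio/visual features taken from a *different*
    sample in the batch, so the fused output must retain modality-specific cues."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, other):
        # text:  (batch, seq_t, dim) -- query (text-centered)
        # other: (batch, seq_o, dim) -- audio or visual key/value
        fused, _ = self.attn(query=text, key=other, value=other)
        return self.norm(text + fused)


def shuffle_within_batch(x):
    """Pair each text with a non-text modality drawn from another sample."""
    perm = torch.randperm(x.size(0))
    return x[perm], perm


def text_centric_infonce(text_repr, other_repr, temperature: float = 0.07):
    """InfoNCE-style contrastive loss with text as the anchor."""
    t = F.normalize(text_repr, dim=-1)
    o = F.normalize(other_repr, dim=-1)
    logits = t @ o.t() / temperature          # (batch, batch) similarities
    labels = torch.arange(t.size(0))          # matched pairs sit on the diagonal
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    batch, dim = 8, 128
    text = torch.randn(batch, 20, dim)        # stand-in textual features
    audio = torch.randn(batch, 50, dim)       # stand-in acoustic features

    fusion = CrossSampleFusion(dim)
    shuffled_audio, perm = shuffle_within_batch(audio)
    fused = fusion(text, shuffled_audio)      # cross-sample fusion

    # Pairwise prediction: does this (text, audio) pair come from the same
    # original sample? True only where the permutation maps i back to i.
    pair_label = (perm == torch.arange(batch)).float()
    pair_head = nn.Linear(dim, 1)
    pair_logit = pair_head(fused.mean(1)).squeeze(-1)
    loss_pair = F.binary_cross_entropy_with_logits(pair_logit, pair_label)

    # Text-anchored contrastive term on pooled unimodal representations.
    loss_contrast = text_centric_infonce(text.mean(1), audio.mean(1))
    print(fused.shape, loss_pair.item(), loss_contrast.item())
```

In this sketch, keeping text as the attention query reflects the text-centered design, while shuffling the non-text modality within the batch forces the pairwise head to rely on modality-specific details rather than cross-modal redundancy; the actual model additionally uses adversarial training and a two-stage contrastive schedule, as described in the abstract.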


Data availability

No datasets were generated or analysed during the current study.


Funding

This work was supported by the "Pioneer" and "Leading Goose" R&D Program of Zhejiang (No. 2022C03106) and the National Natural Science Foundation of China (Nos. 62207028 and 62337001), and in part by the Zhejiang Provincial Natural Science Foundation (No. LY23F020009), the Scientific Research Fund of the Zhejiang Provincial Education Department (No. 2023SCG367), and the Special Research Project of Zhejiang Normal University on Serving Provincial Strategic Planning and Promoting the Construction of Common Prosperity.

Author information

Contributions

Q.H. completed conceptualization, methodology, software, validation, writing and editing. J.C. completed conceptualization, methodology, formal analysis, software, validation, visualization and writing the original draft. C.H. completed validation, visualization, writing and editing. Y.W. completed validation, visualization, writing and editing. X.H. completed methodology, validation, visualization, writing and editing. All authors reviewed the manuscript.

Corresponding author

Correspondence to Qionghao Huang.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Communicated by Bing-kun Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Huang, Q., Chen, J., Huang, C. et al. Text-centered cross-sample fusion network for multimodal sentiment analysis. Multimedia Systems 30, 228 (2024). https://doi.org/10.1007/s00530-024-01421-w

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s00530-024-01421-w

Keywords
