Text-Dominant Interactive Attention for Cross-Modal Sentiment Analysis

  • Conference paper
  • First Online:
Pattern Recognition and Computer Vision (PRCV 2024)

Abstract

Cross-modal Sentiment Analysis (CSA) is now widely applied: combining image and text data allows a richer expression of emotional content, so effectively capturing the semantic information shared by associated image and text data is crucial. This paper proposes a Text-Dominant Interactive Attention cross-modal sentiment analysis network (TDiA), which exploits cross-modal interactive encoding and the semantic correlation of uni-modal features for sentiment analysis. In TDiA, a text-dominant Cross-Modal Attention (CMA) encoding process lets the encoded uni-modal features interact fully to form cross-modal features. In parallel, a Semantic Correlation Space (SCS) associates the semantically similar parts of heterogeneous uni-modal features in a low-dimensional space. The encoded cross-modal features then pass through modal fusion modules to obtain the final classification. TDiA is evaluated on the publicly available cross-modal sentiment analysis dataset MVSA, where it demonstrates improved performance.

This work is supported by the National Key R&D Program of China (2022YFC3301800) and the National Natural Science Foundation of China (Grant No. 62072135).
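To make the pipeline described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of the two mechanisms it names: a text-dominant cross-modal attention step in which text features act as queries over image features, and a shared low-dimensional semantic correlation space into which both modalities are projected before fusion and classification. All module names, feature dimensions, pooling choices, and the fusion head are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: text-dominant cross-modal attention plus a shared
# low-dimensional "semantic correlation" projection, followed by a simple
# fusion classifier. Dimensions and module structure are assumptions.

import torch
import torch.nn as nn


class TextDominantCrossModalAttention(nn.Module):
    """Text tokens attend to image patch features (text acts as the query)."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (B, L_t, dim), image_feats: (B, L_v, dim)
        attended, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        # Residual on the text branch keeps text as the dominant modality.
        return self.norm(text_feats + attended)


class SemanticCorrelationSpace(nn.Module):
    """Project both modalities into a shared low-dimensional space."""

    def __init__(self, dim: int = 256, corr_dim: int = 64):
        super().__init__()
        self.text_proj = nn.Linear(dim, corr_dim)
        self.image_proj = nn.Linear(dim, corr_dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor):
        # Mean-pool the token/patch sequences, then project to the shared space.
        t = self.text_proj(text_feats.mean(dim=1))    # (B, corr_dim)
        v = self.image_proj(image_feats.mean(dim=1))  # (B, corr_dim)
        return t, v


class TDiASketch(nn.Module):
    """Cross-modal encoding -> correlation space -> fusion -> sentiment logits."""

    def __init__(self, dim: int = 256, corr_dim: int = 64, num_classes: int = 3):
        super().__init__()
        self.cma = TextDominantCrossModalAttention(dim)
        self.scs = SemanticCorrelationSpace(dim, corr_dim)
        self.classifier = nn.Linear(dim + 2 * corr_dim, num_classes)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        fused_text = self.cma(text_feats, image_feats)       # (B, L_t, dim)
        t_corr, v_corr = self.scs(text_feats, image_feats)   # (B, corr_dim) each
        pooled = fused_text.mean(dim=1)                      # (B, dim)
        return self.classifier(torch.cat([pooled, t_corr, v_corr], dim=-1))


if __name__ == "__main__":
    model = TDiASketch()
    text = torch.randn(2, 32, 256)   # e.g. text encoder token features projected to 256-d
    image = torch.randn(2, 49, 256)  # e.g. ViT patch features projected to 256-d
    print(model(text, image).shape)  # torch.Size([2, 3])
```

The residual connection on the text branch is one simple way to keep text dominant while still injecting visual context; the paper's actual encoding and fusion modules may differ.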

Author information

Corresponding author

Correspondence to Haiwei Pan.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Zhang, Z., Yang, S., Pan, H. (2025). Text-Dominant Interactive Attention for Cross-Modal Sentiment Analysis. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15035. Springer, Singapore. https://doi.org/10.1007/978-981-97-8620-6_14

  • DOI: https://doi.org/10.1007/978-981-97-8620-6_14

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8619-0

  • Online ISBN: 978-981-97-8620-6

  • eBook Packages: Computer Science, Computer Science (R0)
