Abstract
Cross-modal Sentiment Analysis (CSA) is now widely applied, since combining image and text data allows a richer expression of emotional content; effectively capturing the semantic information shared by associated image and text data is therefore crucial. This paper proposes a Text-Dominant Interactive Attention cross-modal sentiment analysis network (TDiA), which performs sentiment analysis through cross-modal interactive encoding and semantic correlation of uni-modal features. In TDiA, a text-dominant Cross-Modal Attention (CMA) encoding process fully interacts the encoded uni-modal features to produce cross-modal features. In parallel, a Semantic Correlation Space (SCS) is designed to associate the semantically similar parts of heterogeneous uni-modal features in a low-dimensional space. The encoded cross-modal features then pass through modal fusion modules to obtain the final classification. Experiments on the publicly available cross-modal sentiment analysis dataset MVSA validate the improvement achieved by TDiA.
This work is supported by the National Key R&D Program of China (2022YFC3301800) and the National Natural Science Foundation of China (Grant No. 62072135).
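To make the text-dominant cross-modal attention described above more concrete, the following is a minimal sketch, not the authors' implementation: text features serve as queries and image features as keys/values, so the fused representation stays anchored to the textual modality. The dimensions, the choice of token/patch encoders, and the module name are assumptions for illustration only.

```python
# Illustrative sketch of text-dominant cross-modal attention (assumed design,
# not the TDiA reference implementation).
import torch
import torch.nn as nn


class TextDominantCrossModalAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Text queries attend over image keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (batch, text_len,  dim), e.g. token embeddings from a text encoder
        # image_feats: (batch, patch_len, dim), e.g. patch embeddings from a vision encoder
        fused, _ = self.cross_attn(query=text_feats, key=image_feats, value=image_feats)
        # Residual connection keeps the text stream dominant in the fused output.
        return self.norm(text_feats + fused)


if __name__ == "__main__":
    t = torch.randn(2, 32, 768)   # dummy text tokens
    v = torch.randn(2, 49, 768)   # dummy 7x7 image patches
    print(TextDominantCrossModalAttention()(t, v).shape)  # torch.Size([2, 32, 768])
```

The output keeps the text sequence length, reflecting the text-dominant design in which visual information is injected into, rather than concatenated with, the textual representation.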
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, Z., Yang, S., Pan, H. (2025). Text-Dominant Interactive Attention for Cross-Modal Sentiment Analysis. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15035. Springer, Singapore. https://doi.org/10.1007/978-981-97-8620-6_14
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8619-0
Online ISBN: 978-981-97-8620-6