Abstract
Cross-modal Sentiment Analysis (CSA) is now widely applied, since combining image and text data allows a richer expression of emotional content; effectively capturing the semantic information shared by associated image and text data is therefore crucial. This paper proposes a Text-Dominant Interactive Attention cross-modal sentiment analysis network (TDiA), which performs sentiment analysis through cross-modal interactive encoding and semantic correlation of uni-modal features. In TDiA, a text-dominant Cross-Modal Attention (CMA) encoding process fully interacts the encoded uni-modal features to produce cross-modal features. In parallel, a Semantic Correlation Space (SCS) is designed to associate the semantically similar parts of heterogeneous uni-modal features in a low-dimensional space. The encoded cross-modal features then pass through modal fusion modules to obtain the final classification. Experiments on the publicly available cross-modal sentiment analysis dataset MVSA validate the improvement achieved by TDiA.
This work is supported by the National Key R&D Program of China (2022YFC3301800) and the National Natural Science Foundation of China (Grant No. 62072135).
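To make the text-dominant cross-modal attention described above more concrete, the following is a minimal sketch, not the authors' implementation: text features serve as queries and image features as keys/values, so the fused representation stays anchored to the textual modality. The dimensions, the choice of token/patch encoders, and the module name are assumptions for illustration only.

```python
# Illustrative sketch of text-dominant cross-modal attention (assumed design,
# not the TDiA reference implementation).
import torch
import torch.nn as nn


class TextDominantCrossModalAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Text queries attend over image keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (batch, text_len,  dim), e.g. token embeddings from a text encoder
        # image_feats: (batch, patch_len, dim), e.g. patch embeddings from a vision encoder
        fused, _ = self.cross_attn(query=text_feats, key=image_feats, value=image_feats)
        # Residual connection keeps the text stream dominant in the fused output.
        return self.norm(text_feats + fused)


if __name__ == "__main__":
    t = torch.randn(2, 32, 768)   # dummy text tokens
    v = torch.randn(2, 49, 768)   # dummy 7x7 image patches
    print(TextDominantCrossModalAttention()(t, v).shape)  # torch.Size([2, 32, 768])
```

The output keeps the text sequence length, reflecting the text-dominant design in which visual information is injected into, rather than concatenated with, the textual representation.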
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, Z., Yang, S., Pan, H. (2025). Text-Dominant Interactive Attention for Cross-Modal Sentiment Analysis. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15035. Springer, Singapore. https://doi.org/10.1007/978-981-97-8620-6_14
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8619-0
Online ISBN: 978-981-97-8620-6