Abstract
Social media platforms have seen an influx of multimodal data, drawing increased attention to image-text multimodal classification. Existing methods primarily focus on fusing features from the different modalities. However, owing to the heterogeneity and high dimensionality of multimodal data, the fusion process frequently introduces redundant information and noise, limiting accuracy and generalization. To address this limitation, we propose a Compact Bilinear pooling and Multi-Loss network (CBMLNet). Compact bilinear pooling is used for feature fusion to efficiently learn low-dimensional yet expressive multimodal representations. Furthermore, a multi-loss function is proposed to incorporate the specific information carried by each individual modality. CBMLNet therefore considers both the correlation across modalities and the specificity of each single modality for image-text classification. We evaluate CBMLNet on two publicly available datasets, Twitter-15 and Twitter-17, and on a private dataset, AIFUN, comparing it with advanced methods such as multimodal BERT with Max Pooling, the Multi-Interactive Memory Network, the Multi-level Multi-modal Cross-attention Network, the Image-Text Correlation model (ITC), target-oriented multimodal BERT, and the multimodal hierarchical attention model (MHA). Experimental results show that CBMLNet improves the F1 score by 0.28% and 0.44% on average over the best fine-grained baseline (MHA) and the best coarse-grained baseline (ITC), respectively, illustrating that CBMLNet is practical for real-world fuzzy applications as a coarse-grained model.
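To make the two components named in the abstract concrete, the following is a minimal PyTorch sketch of compact bilinear pooling (implemented via Count Sketch and an FFT-based circular convolution, in the spirit of Fukui et al.) and of a multi-loss objective that adds weighted unimodal classification losses to the fused loss. The feature dimensions, sketch size, loss weights, and all module and function names are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch, assuming text/image features are already extracted
# (e.g. by BERT and a CNN/ViT backbone); dimensions and weights are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CountSketch(nn.Module):
    """Fixed random Count Sketch projection (hash indices h, signs s)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.out_dim = out_dim
        self.register_buffer("h", torch.randint(out_dim, (in_dim,)))            # hash buckets
        self.register_buffer("s", 2 * torch.randint(2, (in_dim,)).float() - 1)  # random +/-1 signs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        sketch = x.new_zeros(x.size(0), self.out_dim)
        return sketch.index_add_(1, self.h, x * self.s)  # scatter-add signed features


class CompactBilinearFusion(nn.Module):
    """Approximate outer-product fusion: sketch both modalities, multiply in the FFT domain."""
    def __init__(self, text_dim: int, image_dim: int, fused_dim: int = 8000):
        super().__init__()
        self.sketch_t = CountSketch(text_dim, fused_dim)
        self.sketch_v = CountSketch(image_dim, fused_dim)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        ft = torch.fft.rfft(self.sketch_t(text_feat))
        fv = torch.fft.rfft(self.sketch_v(image_feat))
        fused = torch.fft.irfft(ft * fv, n=self.sketch_t.out_dim)  # circular convolution
        # signed square-root and L2 normalisation, as is common after bilinear pooling
        fused = torch.sign(fused) * torch.sqrt(fused.abs() + 1e-8)
        return F.normalize(fused, dim=-1)


def multi_loss(fused_logits, text_logits, image_logits, labels,
               alpha: float = 0.3, beta: float = 0.3) -> torch.Tensor:
    """Fused cross-entropy plus weighted unimodal losses (alpha, beta are assumed weights)."""
    return (F.cross_entropy(fused_logits, labels)
            + alpha * F.cross_entropy(text_logits, labels)
            + beta * F.cross_entropy(image_logits, labels))
```

The FFT-based circular convolution keeps the fused representation at the chosen sketch dimension rather than the full outer-product dimension, which is what makes the bilinear fusion compact; the unimodal loss terms are what inject the modality-specific information described in the abstract.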
Data availability
The dataset generated during the current study is not publicly available due to the data availability terms of the supporting funding, but it is available from the corresponding author on reasonable request.
References
Xue, X., Zhang, C., Niu, Z., Wu, X.: Multi-level attention map network for multimodal sentiment analysis. IEEE Trans. Knowl. Data Eng. 35(5), 5105–5118 (2022)
Bansal, S., Gowda, K., Kumar, N.: A hybrid deep neural network for multimodal personalized hashtag recommendation. IEEE Trans. Comput. Soc. Syst. 10(5), 2439–2459 (2022)
Zhao, J., Dong, W., Shi, L., Qiang, W., Kuang, Z., Xu, D., An, T.: Multimodal feature fusion method for unbalanced sample data in social network public opinion. Sensors 22(15), 5528 (2022)
Borth, D., Ji, R., Chen, T., Breuel, T., Chang, S.-F.: Large-scale visual sentiment ontology and detectors using adjective noun pairs. In: Proceedings of the 21st ACM International Conference on Multimedia, pp. 223–232 (2013)
Cao, D., Ji, R., Lin, D., Li, S.: A cross-media public sentiment analysis system for microblog. Multimed. Syst. 22, 479–486 (2016)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 (2012)
Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. Adv. Neural Inf. Process. Syst. 1, 649–657 (2015)
Ying, L., Yu, H., Wang, J., Ji, Y., Qian, S.: Multi-level multi-modal cross-attention network for fake news detection. IEEE Access 9, 132363–132373 (2021)
Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Mach. Vis. Appl. 32(6), 121 (2021)
Zhang, K., Geng, Y., Zhao, J., Liu, J., Li, W.: Sentiment analysis of social media via multimodal feature fusion. Symmetry 12(12), 2010 (2020)
Han, W., Chen, H., Poria, S.: Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis. arXiv preprint arXiv:2109.00412 (2021)
Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Comput. 32(5), 829–864 (2020)
Chua, W.W., Li, L., Goh, A.: Classifying multimodal data using transformers. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 4780–4781 (2022)
Peng, Y., Qi, J.: Cm-gans: cross-modal generative adversarial networks for common representation learning. ACM Trans. Multimed. Comput. Commun. Appl. 15(1), 1–24 (2019)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Gandhi, A., Adhvaryu, K., Poria, S., Cambria, E., Hussain, A.: Multimodal sentiment analysis: a systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Inf. Fusion 91, 424–444 (2023)
Atrey, P.K., Hossain, M.A., El Saddik, A., Kankanhalli, M.S.: Multimodal fusion for multimedia analysis: a survey. Multimed. Syst. 16, 345–379 (2010)
Sun, H., Dhingra, B., Zaheer, M., Mazaitis, K., Salakhutdinov, R., Cohen, W.W.: Open domain question answering using early fusion of knowledge bases and text. arXiv preprint arXiv:1809.00782 (2018)
Xu, H., He, K., Sigal, L., Sclaroff, S., Saenko, K.: Text-to-clip video retrieval with early fusion and re-captioning. arXiv preprint arXiv:1804.05113 (2018)
Joze, H.R.V., Shaban, A., Iuzzolino, M.L., Koishida, K.: Mmtm: multimodal transfer module for cnn fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13289–13299 (2020)
Ding, N., Tian, S.-W., Yu, L.: A multimodal fusion method for sarcasm detection based on late fusion. Multimed. Tools Appl. 81(6), 8597–8616 (2022)
Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847 (2016)
Wang, J., Mao, H., Li, H.: Fmfn: fine-grained multimodal fusion networks for fake news detection. Appl. Sci. 12(3), 1093 (2022)
Baecchi, C., Uricchio, T., Bertini, M., Del Bimbo, A.: A multimodal feature learning approach for sentiment analysis of social network multimedia. Multimed. Tools Appl. 75, 2507–2525 (2016)
Xu, N., Mao, W.: Multisentinet: a deep semantic network for multimodal sentiment analysis. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 2399–2402 (2017)
Yu, J., Jiang, J., Xia, R.: Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 429–439 (2019)
Yang, X., Feng, S., Wang, D., Zhang, Y.: Image-text multimodal emotion classification via multi-view attentional network. IEEE Trans. Multimed. 23, 4014–4026 (2020)
Cheung, T.-H., Lam, K.-M.: Crossmodal bipolar attention for multimodal classification on social media. Neurocomputing 514, 1–12 (2022)
Song, Z., Xue, Y., Gu, D., Zhang, H., Ding, W.: Target-oriented multimodal sentiment classification by using topic model and gating mechanism. Int. J. Mach. Learn. Cybern. 14(7), 2289–2299 (2023)
Yadav, A., Vishwakarma, D.K.: A deep multi-level attentive network for multimodal sentiment analysis. ACM Trans. Multimed. Comput. Commun. Appl. 19(1), 1–19 (2023)
Le, H.-D., Lee, G.-S., Kim, S.-H., Kim, S., Yang, H.-J.: Multi-label multimodal emotion recognition with transformer-based fusion and emotion-level representation learning. IEEE Access 11, 14742–14751 (2023)
Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: International Colloquium on Automata, Languages, and Programming, pp. 693–703. Springer (2002)
Zhang, Q., Fu, J., Liu, X., Huang, X.: Adaptive co-attention network for named entity recognition in tweets. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Wang, B., Lu, W.: Learning latent opinions for aspect-level sentiment classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Yu, J., Jiang, J.: Adapting BERT for target-oriented multimodal sentiment classification. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI) (2019)
Xu, N., Mao, W., Chen, G.: Multi-interactive memory network for aspect based multimodal sentiment analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 371–378 (2019)
Li, Y., Jiang, S., et al.: Multimodal sentiment analysis with image-text correlation modal. In: 2023 IEEE International Conferences on Internet of Things (iThings) and IEEE Green Computing & Communications (GreenCom) and IEEE Cyber, Physical & Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics), pp. 281–286. IEEE (2023)
Li, Z., An, Z., Cheng, W., Zhou, J., Zheng, F., Hu, B.: Mha: a multimodal hierarchical attention model for depression detection in social media. Health Inf. Sci. Syst. 11(1), 6 (2023)
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant Nos. 62071189 and 62201220).
Author information
Authors and Affiliations
Contributions
Y.L. and X.Z. designed and wrote the programs. X.Z. was in charge of data curation. Y.L. and M.Z. wrote the main manuscript text. J.M. and Z.C. reviewed and edited the manuscript. Y.T. prepared Figures 1–3. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, Y., Zheng, X., Zhu, M. et al. Compact bilinear pooling and multi-loss network for social media multimodal classification. SIViP 18, 8403–8412 (2024). https://doi.org/10.1007/s11760-024-03482-w