Abstract
Social media platforms have seen an influx of multimodal data, drawing increased attention to image-text multimodal classification. Existing methods primarily focus on fusing features from the different modalities. However, owing to the heterogeneity and high dimensionality of multimodal data, the fusion process frequently introduces redundant information and noise, limiting accuracy and generalization. To address this limitation, we propose a Compact Bilinear pooling and Multi-Loss network (CBMLNet). Compact bilinear pooling is used for feature fusion to efficiently learn low-dimensional yet expressive multimodal representations. Furthermore, a multi-loss function is proposed to incorporate the specific information carried by each individual modality. CBMLNet therefore considers both the correlation across modalities and the specificity of each single modality for image-text classification. We evaluate CBMLNet on two publicly available datasets, Twitter-15 and Twitter-17, and on a private dataset, AIFUN, comparing it with advanced methods such as multimodal BERT with Max Pooling, the Multi-Interactive Memory Network, the Multi-level Multi-modal Cross-attention Network, the Image-Text Correlation model (ITC), target-oriented multimodal BERT, and the multimodal hierarchical attention model (MHA). Experimental results show that CBMLNet improves the F1 score by 0.28% and 0.44% on average over the best fine-grained baseline (MHA) and the best coarse-grained baseline (ITC), respectively, illustrating that CBMLNet is practical for real-world fuzzy applications as a coarse-grained model.
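To make the two components named in the abstract concrete, the following is a minimal PyTorch sketch of compact bilinear pooling (implemented via Count Sketch and an FFT-based circular convolution, in the spirit of Fukui et al.) and of a multi-loss objective that adds weighted unimodal classification losses to the fused loss. The feature dimensions, sketch size, loss weights, and all module and function names are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch, assuming text/image features are already extracted
# (e.g. by BERT and a CNN/ViT backbone); dimensions and weights are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CountSketch(nn.Module):
    """Fixed random Count Sketch projection (hash indices h, signs s)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.out_dim = out_dim
        self.register_buffer("h", torch.randint(out_dim, (in_dim,)))            # hash buckets
        self.register_buffer("s", 2 * torch.randint(2, (in_dim,)).float() - 1)  # random +/-1 signs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        sketch = x.new_zeros(x.size(0), self.out_dim)
        return sketch.index_add_(1, self.h, x * self.s)  # scatter-add signed features


class CompactBilinearFusion(nn.Module):
    """Approximate outer-product fusion: sketch both modalities, multiply in the FFT domain."""
    def __init__(self, text_dim: int, image_dim: int, fused_dim: int = 8000):
        super().__init__()
        self.sketch_t = CountSketch(text_dim, fused_dim)
        self.sketch_v = CountSketch(image_dim, fused_dim)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        ft = torch.fft.rfft(self.sketch_t(text_feat))
        fv = torch.fft.rfft(self.sketch_v(image_feat))
        fused = torch.fft.irfft(ft * fv, n=self.sketch_t.out_dim)  # circular convolution
        # signed square-root and L2 normalisation, as is common after bilinear pooling
        fused = torch.sign(fused) * torch.sqrt(fused.abs() + 1e-8)
        return F.normalize(fused, dim=-1)


def multi_loss(fused_logits, text_logits, image_logits, labels,
               alpha: float = 0.3, beta: float = 0.3) -> torch.Tensor:
    """Fused cross-entropy plus weighted unimodal losses (alpha, beta are assumed weights)."""
    return (F.cross_entropy(fused_logits, labels)
            + alpha * F.cross_entropy(text_logits, labels)
            + beta * F.cross_entropy(image_logits, labels))
```

The FFT-based circular convolution keeps the fused representation at the chosen sketch dimension rather than the full outer-product dimension, which is what makes the bilinear fusion compact; the unimodal loss terms are what inject the modality-specific information described in the abstract.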
Data availability
The dataset generated during the current study is not publicly available due to the data availability terms of the supporting funding, but it is available from the corresponding author on reasonable request.
References
Xue, X., Zhang, C., Niu, Z., Wu, X.: Multi-level attention map network for multimodal sentiment analysis. IEEE Trans. Knowl. Data Eng. 35(5), 5105–5118 (2022)
Bansal, S., Gowda, K., Kumar, N.: A hybrid deep neural network for multimodal personalized hashtag recommendation. IEEE Trans. Comput. Soc. Syst. 10(5), 2439–2459 (2022)
Zhao, J., Dong, W., Shi, L., Qiang, W., Kuang, Z., Xu, D., An, T.: Multimodal feature fusion method for unbalanced sample data in social network public opinion. Sensors 22(15), 5528 (2022)
Borth, D., Ji, R., Chen, T., Breuel, T., Chang, S.-F.: Large-scale visual sentiment ontology and detectors using adjective noun pairs. In: Proceedings of the 21st ACM International Conference on Multimedia, pp. 223–232 (2013)
Cao, D., Ji, R., Lin, D., Li, S.: A cross-media public sentiment analysis system for microblog. Multimed. Syst. 22, 479–486 (2016)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 (2012)
Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. Adv. Neural Inf. Process. Syst. 1, 649–657 (2015)
Ying, L., Yu, H., Wang, J., Ji, Y., Qian, S.: Multi-level multi-modal cross-attention network for fake news detection. IEEE Access 9, 132363–132373 (2021)
Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Mach. Vis. Appl. 32(6), 121 (2021)
Zhang, K., Geng, Y., Zhao, J., Liu, J., Li, W.: Sentiment analysis of social media via multimodal feature fusion. Symmetry 12(12), 2010 (2020)
Han, W., Chen, H., Poria, S.: Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis. arXiv preprint arXiv:2109.00412 (2021)
Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Comput. 32(5), 829–864 (2020)
Chua, W.W., Li, L., Goh, A.: Classifying multimodal data using transformers. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 4780–4781 (2022)
Peng, Y., Qi, J.: Cm-gans: cross-modal generative adversarial networks for common representation learning. ACM Trans. Multimed. Comput. Commun. Appl. 15(1), 1–24 (2019)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Gandhi, A., Adhvaryu, K., Poria, S., Cambria, E., Hussain, A.: Multimodal sentiment analysis: a systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Inf. Fusion 91, 424–444 (2023)
Atrey, P.K., Hossain, M.A., El Saddik, A., Kankanhalli, M.S.: Multimodal fusion for multimedia analysis: a survey. Multimed. Syst. 16, 345–379 (2010)
Sun, H., Dhingra, B., Zaheer, M., Mazaitis, K., Salakhutdinov, R., Cohen, W.W.: Open domain question answering using early fusion of knowledge bases and text. arXiv preprint arXiv:1809.00782 (2018)
Xu, H., He, K., Sigal, L., Sclaroff, S., Saenko, K.: Text-to-clip video retrieval with early fusion and re-captioning. arXiv preprint arXiv:1804.05113 (2018)
Joze, H.R.V., Shaban, A., Iuzzolino, M.L., Koishida, K.: Mmtm: multimodal transfer module for cnn fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13289–13299 (2020)
Ding, N., Tian, S.-W., Yu, L.: A multimodal fusion method for sarcasm detection based on late fusion. Multimed. Tools Appl. 81(6), 8597–8616 (2022)
Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847 (2016)
Wang, J., Mao, H., Li, H.: Fmfn: fine-grained multimodal fusion networks for fake news detection. Appl. Sci. 12(3), 1093 (2022)
Baecchi, C., Uricchio, T., Bertini, M., Del Bimbo, A.: A multimodal feature learning approach for sentiment analysis of social network multimedia. Multimed. Tools Appl. 75, 2507–2525 (2016)
Xu, N., Mao, W.: Multisentinet: a deep semantic network for multimodal sentiment analysis. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 2399–2402 (2017)
Yu, J., Jiang, J., Xia, R.: Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 429–439 (2019)
Yang, X., Feng, S., Wang, D., Zhang, Y.: Image-text multimodal emotion classification via multi-view attentional network. IEEE Trans. Multimed. 23, 4014–4026 (2020)
Cheung, T.-H., Lam, K.-M.: Crossmodal bipolar attention for multimodal classification on social media. Neurocomputing 514, 1–12 (2022)
Song, Z., Xue, Y., Gu, D., Zhang, H., Ding, W.: Target-oriented multimodal sentiment classification by using topic model and gating mechanism. Int. J. Mach. Learn. Cybern. 14(7), 2289–2299 (2023)
Yadav, A., Vishwakarma, D.K.: A deep multi-level attentive network for multimodal sentiment analysis. ACM Trans. Multimed. Comput. Commun. Appl. 19(1), 1–19 (2023)
Le, H.-D., Lee, G.-S., Kim, S.-H., Kim, S., Yang, H.-J.: Multi-label multimodal emotion recognition with transformer-based fusion and emotion-level representation learning. IEEE Access 11, 14742–14751 (2023)
Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: International Colloquium on Automata, Languages, and Programming, pp. 693–703. Springer (2002)
Zhang, Q., Fu, J., Liu, X., Huang, X.: Adaptive co-attention network for named entity recognition in tweets. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Wang, B., Lu, W.: Learning latent opinions for aspect-level sentiment classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Yu, J., Jiang, J.: Adapting BERT for target-oriented multimodal sentiment classification. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI) (2019)
Xu, N., Mao, W., Chen, G.: Multi-interactive memory network for aspect based multimodal sentiment analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 371–378 (2019)
Li, Y., Jiang, S., et al.: Multimodal sentiment analysis with image-text correlation modal. In: 2023 IEEE International Conferences on Internet of Things (iThings) and IEEE Green Computing & Communications (GreenCom) and IEEE Cyber, Physical & Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics), pp. 281–286. IEEE (2023)
Li, Z., An, Z., Cheng, W., Zhou, J., Zheng, F., Hu, B.: Mha: a multimodal hierarchical attention model for depression detection in social media. Health Inf. Sci. Syst. 11(1), 6 (2023)
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant Nos. 62071189 and 62201220).
Author information
Authors and Affiliations
Contributions
Y.L. and X.Z. designed and wrote the programs. X.Z. was in charge of data curation. Y.L. and M.Z. wrote the main manuscript text. J.M. and Z.C. reviewed and edited the manuscript. Y.T. prepared Figures 1–3. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, Y., Zheng, X., Zhu, M. et al. Compact bilinear pooling and multi-loss network for social media multimodal classification. SIViP 18, 8403–8412 (2024). https://doi.org/10.1007/s11760-024-03482-w