Compact bilinear pooling and multi-loss network for social media multimodal classification

  • Original Paper
  • Published:
Signal, Image and Video Processing

Abstract

Social media platforms have seen an influx of multimodal data, drawing increased attention to image-text multimodal classification. Existing multimodal classification methods focus primarily on fusing features from different modalities. However, owing to the heterogeneity and high dimensionality of multimodal data, the fusion process frequently introduces redundant information and noise, limiting accuracy and generalization. To address this limitation, we propose a Compact Bilinear pooling and Multi-Loss network (CBMLNet). Compact bilinear pooling is used for feature fusion to learn low-dimensional yet expressive multimodal representations efficiently. Furthermore, a multi-loss function is proposed to incorporate the specific information carried by each single modality. CBMLNet therefore considers both the correlation across modalities and the specificity of each single modality for image-text classification. We evaluate the proposed CBMLNet on two publicly available datasets, Twitter-15 and Twitter-17, and on a private dataset, AIFUN. CBMLNet is compared with advanced methods such as multimodal BERT with Max Pooling, the Multi-Interactive Memory Network, the Multi-level Multi-modal Cross-attention Network, the Image-Text Correlation model (ITC), target-oriented multimodal BERT, and the multimodal hierarchical attention model (MHA). Experimental results demonstrate that CBMLNet improves the F1-score by 0.28% on average over the best fine-grained baseline, MHA, and by 0.44% over the best coarse-grained baseline, ITC. This indicates that CBMLNet, as a coarse-grained model, is practical for real-world fuzzy applications.
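
To make the two ideas named in the abstract concrete, the following is a minimal sketch of compact bilinear pooling (via Count Sketch and FFT, as in Fukui et al., ref. 23) combined with a multi-loss that adds per-modality classification terms to the fused loss. This is not the authors' released implementation; the feature dimensions, auxiliary-head structure, and the loss weight `alpha` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompactBilinearPooling(nn.Module):
    """Approximate the outer product of two feature vectors with Count Sketch + FFT."""

    def __init__(self, dim_img: int, dim_txt: int, dim_out: int = 1024):
        super().__init__()
        self.dim_out = dim_out
        # Random (but fixed) hash indices h and signs s for each modality.
        for name, dim in (("img", dim_img), ("txt", dim_txt)):
            self.register_buffer(f"h_{name}", torch.randint(dim_out, (dim,)))
            self.register_buffer(f"s_{name}", 2 * torch.randint(2, (dim,)).float() - 1)

    def _count_sketch(self, x, h, s):
        # Scatter-add the sign-flipped features into dim_out buckets.
        sketch = x.new_zeros(x.size(0), self.dim_out)
        return sketch.index_add_(1, h, x * s)

    def forward(self, x_img, x_txt):
        # Circular convolution of the two sketches equals the sketch of the
        # outer product; it is computed element-wise in the FFT domain.
        fft_img = torch.fft.rfft(self._count_sketch(x_img, self.h_img, self.s_img))
        fft_txt = torch.fft.rfft(self._count_sketch(x_txt, self.h_txt, self.s_txt))
        fused = torch.fft.irfft(fft_img * fft_txt, n=self.dim_out)
        # Signed square-root and L2 normalization, as is customary for bilinear features.
        fused = torch.sign(fused) * torch.sqrt(fused.abs() + 1e-8)
        return F.normalize(fused, dim=-1)


class MultiLossClassifier(nn.Module):
    """Fused head plus auxiliary unimodal heads, trained with a weighted multi-loss."""

    def __init__(self, dim_img=2048, dim_txt=768, dim_fused=1024, num_classes=3):
        super().__init__()
        self.cbp = CompactBilinearPooling(dim_img, dim_txt, dim_fused)
        self.head_fused = nn.Linear(dim_fused, num_classes)
        self.head_img = nn.Linear(dim_img, num_classes)   # keeps image-specific cues
        self.head_txt = nn.Linear(dim_txt, num_classes)   # keeps text-specific cues

    def forward(self, x_img, x_txt, labels=None, alpha=0.3):
        logits = self.head_fused(self.cbp(x_img, x_txt))
        if labels is None:
            return logits
        # Multi-loss: fused term plus weighted single-modality terms
        # (alpha is an assumed hyperparameter, not a value from the paper).
        loss = (F.cross_entropy(logits, labels)
                + alpha * F.cross_entropy(self.head_img(x_img), labels)
                + alpha * F.cross_entropy(self.head_txt(x_txt), labels))
        return logits, loss
```

The sketch keeps the fused representation low-dimensional (dim_out rather than dim_img x dim_txt), while the auxiliary unimodal losses preserve modality-specific signals that pure fusion tends to wash out, which is the trade-off the abstract describes.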

Data availability

The dataset generated during the current study is not publicly available due to the data availability policy of the supporting funding, but is available from the corresponding author on reasonable request.

References

  1. Xue, X., Zhang, C., Niu, Z., Wu, X.: Multi-level attention map network for multimodal sentiment analysis. IEEE Trans. Knowl. Data Eng. 35(5), 5105–5118 (2022)

  2. Bansal, S., Gowda, K., Kumar, N.: A hybrid deep neural network for multimodal personalized hashtag recommendation. IEEE Trans. Comput. Soc. Syst. 10(5), 2439–2459 (2022)

  3. Zhao, J., Dong, W., Shi, L., Qiang, W., Kuang, Z., Xu, D., An, T.: Multimodal feature fusion method for unbalanced sample data in social network public opinion. Sensors 22(15), 5528 (2022)

  4. Borth, D., Ji, R., Chen, T., Breuel, T., Chang, S.-F.: Large-scale visual sentiment ontology and detectors using adjective noun pairs. In: Proceedings of the 21st ACM International Conference on Multimedia, pp. 223–232 (2013)

  5. Cao, D., Ji, R., Lin, D., Li, S.: A cross-media public sentiment analysis system for microblog. Multimed. Syst. 22, 479–486 (2016)

  6. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 (2012)

  7. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. Adv. Neural Inf. Process. Syst. 1, 649–657 (2015)

  8. Ying, L., Yu, H., Wang, J., Ji, Y., Qian, S.: Multi-level multi-modal cross-attention network for fake news detection. IEEE Access 9, 132363–132373 (2021)

  9. Boulahia, S.Y., Amamra, A., Madi, M.R., Daikh, S.: Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Mach. Vis. Appl. 32(6), 121 (2021)

  10. Zhang, K., Geng, Y., Zhao, J., Liu, J., Li, W.: Sentiment analysis of social media via multimodal feature fusion. Symmetry 12(12), 2010 (2020)

  11. Han, W., Chen, H., Poria, S.: Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis. arXiv preprint arXiv:2109.00412 (2021)

  12. Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Comput. 32(5), 829–864 (2020)

  13. Chua, W.W., Li, L., Goh, A.: Classifying multimodal data using transformers. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 4780–4781 (2022)

  14. Peng, Y., Qi, J.: Cm-gans: cross-modal generative adversarial networks for common representation learning. ACM Trans. Multimed. Comput. Commun. Appl. 15(1), 1–24 (2019)

  15. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  16. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  17. Gandhi, A., Adhvaryu, K., Poria, S., Cambria, E., Hussain, A.: Multimodal sentiment analysis: a systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Inf. Fusion 91, 424–444 (2023)

  18. Atrey, P.K., Hossain, M.A., El Saddik, A., Kankanhalli, M.S.: Multimodal fusion for multimedia analysis: a survey. Multimed. Syst. 16, 345–379 (2010)

  19. Sun, H., Dhingra, B., Zaheer, M., Mazaitis, K., Salakhutdinov, R., Cohen, W.W.: Open domain question answering using early fusion of knowledge bases and text. arXiv preprint arXiv:1809.00782 (2018)

  20. Xu, H., He, K., Sigal, L., Sclaroff, S., Saenko, K.: Text-to-clip video retrieval with early fusion and re-captioning. arXiv preprint arXiv:1804.05113 (2018)

  21. Joze, H.R.V., Shaban, A., Iuzzolino, M.L., Koishida, K.: Mmtm: multimodal transfer module for cnn fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13289–13299 (2020)

  22. Ding, N., Tian, S.-W., Yu, L.: A multimodal fusion method for sarcasm detection based on late fusion. Multimed. Tools Appl. 81(6), 8597–8616 (2022)

  23. Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847 (2016)

  24. Wang, J., Mao, H., Li, H.: Fmfn: fine-grained multimodal fusion networks for fake news detection. Appl. Sci. 12(3), 1093 (2022)

  25. Baecchi, C., Uricchio, T., Bertini, M., Del Bimbo, A.: A multimodal feature learning approach for sentiment analysis of social network multimedia. Multimed. Tools Appl. 75, 2507–2525 (2016)

  26. Xu, N., Mao, W.: Multisentinet: a deep semantic network for multimodal sentiment analysis. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 2399–2402 (2017)

  27. Yu, J., Jiang, J., Xia, R.: Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 429–439 (2019)

  28. Yang, X., Feng, S., Wang, D., Zhang, Y.: Image-text multimodal emotion classification via multi-view attentional network. IEEE Trans. Multimed. 23, 4014–4026 (2020)

  29. Cheung, T.-H., Lam, K.-M.: Crossmodal bipolar attention for multimodal classification on social media. Neurocomputing 514, 1–12 (2022)

  30. Song, Z., Xue, Y., Gu, D., Zhang, H., Ding, W.: Target-oriented multimodal sentiment classification by using topic model and gating mechanism. Int. J. Mach. Learn. Cybern. 14(7), 2289–2299 (2023)

  31. Yadav, A., Vishwakarma, D.K.: A deep multi-level attentive network for multimodal sentiment analysis. ACM Trans. Multimed. Comput. Commun. Appl. 19(1), 1–19 (2023)

  32. Le, H.-D., Lee, G.-S., Kim, S.-H., Kim, S., Yang, H.-J.: Multi-label multimodal emotion recognition with transformer-based fusion and emotion-level representation learning. IEEE Access 11, 14742–14751 (2023)

  33. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: International Colloquium on Automata, Languages, and Programming, pp. 693–703. Springer (2002)

  34. Zhang, Q., Fu, J., Liu, X., Huang, X.: Adaptive co-attention network for named entity recognition in tweets. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)

  35. Wang, B., Lu, W.: Learning latent opinions for aspect-level sentiment classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)

  36. Yu, J., Jiang, J.: Adapting bert for target-oriented multimodal sentiment classification. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI) (2019)

  37. Xu, N., Mao, W., Chen, G.: Multi-interactive memory network for aspect based multimodal sentiment analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 371–378 (2019)

  38. Li, Y., Jiang, S., et al.: Multimodal sentiment analysis with image-text correlation modal. In: 2023 IEEE International Conferences on Internet of Things (iThings) and IEEE Green Computing & Communications (GreenCom) and IEEE Cyber, Physical & Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics), pp. 281–286. IEEE (2023)

  39. Li, Z., An, Z., Cheng, W., Zhou, J., Zheng, F., Hu, B.: Mha: a multimodal hierarchical attention model for depression detection in social media. Health Inf. Sci. Syst. 11(1), 6 (2023)

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62071189 and 62201220).

Author information

Contributions

Y.L. and X.Z. designed and implemented the programs. X.Z. was responsible for data curation. Y.L. and M.Z. wrote the main manuscript text. J.M. and Z.C. reviewed and edited the manuscript. Y.T. prepared Figs. 1–3. All authors reviewed the manuscript.

Corresponding author

Correspondence to Ming Zhu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, Y., Zheng, X., Zhu, M. et al. Compact bilinear pooling and multi-loss network for social media multimodal classification. SIViP 18, 8403–8412 (2024). https://doi.org/10.1007/s11760-024-03482-w

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11760-024-03482-w
