DOI: 10.1145/3470482.3479636
Short paper

Multimodal intent classification with incomplete modalities using text embedding propagation

Published: 05 November 2021

Abstract

Determining the author's intent in a social media post is a challenging multimodal task that requires identifying complex relationships between the image and text in the post. For example, the image can depict an object, person, product, or company, while the text is an ironic message about the image content. Similarly, the text can be a news headline, while the image is a provocation, meme, or satire about the news. Existing approaches propose intent classification techniques that combine both modalities. However, some posts lack textual annotations. Hence, we investigate a graph-based approach that propagates available text embedding data from complete multimodal posts to incomplete ones. This paper presents a text embedding propagation method that transfers embeddings from BERT neural language models to image-only posts (i.e., posts with an incomplete modality), considering the topology of a graph constructed from both the visual and textual modalities available during the training step. With this inference approach, our method achieves competitive results when the textual modality is available at different completeness levels, even compared to reference methods that require complete modalities.
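To make the propagation idea concrete, the sketch below shows one plausible implementation in Python: image-only posts receive text embeddings propagated over a k-nearest-neighbor graph of image features, using the local-and-global-consistency iteration of Zhou et al. (2003). This is a minimal sketch under stated assumptions, not the paper's exact method: the graph construction, `k`, `alpha`, the iteration count, and the toy feature dimensions are all illustrative choices.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def propagate_text_embeddings(img_feats, txt_embs, has_text,
                              k=10, alpha=0.85, n_iters=50):
    """Propagate known text embeddings (e.g., from BERT) to image-only
    posts over a kNN graph built from image features.

    Illustrative sketch using the local-and-global-consistency update
    F <- alpha * S @ F + (1 - alpha) * Y (Zhou et al., 2003), where Y
    holds known text embeddings and zero vectors for missing ones.
    Hyperparameters here are assumptions, not the paper's settings.
    """
    # Symmetric kNN adjacency from image-feature similarity.
    W = kneighbors_graph(img_feats, n_neighbors=k, mode="connectivity").toarray()
    W = np.maximum(W, W.T)
    # Symmetrically normalized affinity S = D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Y: BERT embeddings where text exists, zeros where it is missing.
    Y = np.where(has_text[:, None], txt_embs, 0.0)
    F = Y.copy()
    for _ in range(n_iters):
        F = alpha * (S @ F) + (1.0 - alpha) * Y
    return F  # row i: (propagated) text embedding for post i

# Toy usage with random features: 100 posts, roughly 30% missing text.
rng = np.random.default_rng(0)
img_feats = rng.normal(size=(100, 2048))  # e.g., ResNet image features
txt_embs = rng.normal(size=(100, 768))    # e.g., BERT sentence embeddings
has_text = rng.random(100) > 0.3
filled = propagate_text_embeddings(img_feats, txt_embs, has_text)
```

Note that the paper builds the graph from both the visual and textual modalities available at training time; this toy version uses image similarity alone for brevity.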


Cited By

  • (2024) Human and Large Language Model Intent Detection in Image-Based Self-Expression of People with Intellectual Disability. In Proceedings of the 2024 Conference on Human Information Interaction and Retrieval, 199-208. https://doi.org/10.1145/3627508.3638347
  • (2024) SDIF-DA: A Shallow-to-Deep Interaction Framework with Data Augmentation for Multi-Modal Intent Detection. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 10206-10210. https://doi.org/10.1109/ICASSP48485.2024.10446922
  • (2024) Towards determining perceived audience intent for multimodal social media posts using the theory of reasoned action. Scientific Reports 14(1). https://doi.org/10.1038/s41598-024-60299-w
  • (2023) VETE: improving visual embeddings through text descriptions for eCommerce search engines. Multimedia Tools and Applications 82(26), 41343-41379. https://doi.org/10.1007/s11042-023-14595-8

      Published In

      WebMedia '21: Proceedings of the Brazilian Symposium on Multimedia and the Web
      November 2021
      271 pages
      ISBN: 9781450386098
      DOI: 10.1145/3470482
      © 2021 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

      In-Cooperation

      • SBC: Brazilian Computer Society
      • CNPq: Conselho Nacional de Desenvolvimento Científico e Tecnológico
      • CAPES: Brazilian Higher Education Funding Council

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 05 November 2021

      Author Tags

      1. multimodal learning
      2. network embedding
      3. social networks

      Qualifiers

      • Short-paper
      • Research
      • Refereed limited

      Funding Sources

      • Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
      • Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
      • Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)

      Conference

      WebMedia '21: Brazilian Symposium on Multimedia and the Web
      November 5 - 12, 2021
      Belo Horizonte, Minas Gerais, Brazil

      Acceptance Rates

      WebMedia '21 paper acceptance rate: 24 of 75 submissions (32%)
      Overall acceptance rate: 270 of 873 submissions (31%)

      Article Metrics

      • Downloads (last 12 months): 27
      • Downloads (last 6 weeks): 5

      Reflects downloads up to 18 Nov 2024
