DOI: 10.1145/3470482.3479636
Short paper

Multimodal intent classification with incomplete modalities using text embedding propagation

Published: 05 November 2021

Abstract

Determining the author's intent in a social media post is a challenging multimodal task that requires identifying complex relationships between the image and text in the post. For example, the image can depict an object, person, product, or company, while the text is an ironic message about the image content. Similarly, the text can be a news headline, while the image is a provocation, meme, or satire about the news. Existing approaches propose intent classification techniques that combine both modalities. However, some posts lack textual annotations. Hence, we investigate a graph-based approach that propagates available text embedding data from complete multimodal posts to incomplete ones. This paper presents a text embedding propagation method that transfers embeddings from BERT neural language models to image-only posts (i.e., posts with an incomplete modality), considering the topology of a graph constructed from both the visual and textual modalities available during the training step. With this inference approach, our method achieves competitive results when the textual modality is available at different completeness levels, even compared to reference methods that require complete modalities.
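To make the propagation idea concrete, the sketch below shows one plausible implementation in Python: image-only posts receive text embeddings propagated over a k-nearest-neighbor graph of image features, using the local-and-global-consistency iteration of Zhou et al. (2003). This is a minimal sketch under stated assumptions, not the paper's exact method: the graph construction, `k`, `alpha`, the iteration count, and the toy feature dimensions are all illustrative choices.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def propagate_text_embeddings(img_feats, txt_embs, has_text,
                              k=10, alpha=0.85, n_iters=50):
    """Propagate known text embeddings (e.g., from BERT) to image-only
    posts over a kNN graph built from image features.

    Illustrative sketch using the local-and-global-consistency update
    F <- alpha * S @ F + (1 - alpha) * Y (Zhou et al., 2003), where Y
    holds known text embeddings and zero vectors for missing ones.
    Hyperparameters here are assumptions, not the paper's settings.
    """
    # Symmetric kNN adjacency from image-feature similarity.
    W = kneighbors_graph(img_feats, n_neighbors=k, mode="connectivity").toarray()
    W = np.maximum(W, W.T)
    # Symmetrically normalized affinity S = D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Y: BERT embeddings where text exists, zeros where it is missing.
    Y = np.where(has_text[:, None], txt_embs, 0.0)
    F = Y.copy()
    for _ in range(n_iters):
        F = alpha * (S @ F) + (1.0 - alpha) * Y
    return F  # row i: (propagated) text embedding for post i

# Toy usage with random features: 100 posts, roughly 30% missing text.
rng = np.random.default_rng(0)
img_feats = rng.normal(size=(100, 2048))  # e.g., ResNet image features
txt_embs = rng.normal(size=(100, 768))    # e.g., BERT sentence embeddings
has_text = rng.random(100) > 0.3
filled = propagate_text_embeddings(img_feats, txt_embs, has_text)
```

Note that the paper builds the graph from both the visual and textual modalities available at training time; this toy version uses image similarity alone for brevity.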


Cited By

  • (2024) Human and Large Language Model Intent Detection in Image-Based Self-Expression of People with Intellectual Disability. In Proceedings of the 2024 Conference on Human Information Interaction and Retrieval, 199-208. https://doi.org/10.1145/3627508.3638347
  • (2024) SDIF-DA: A Shallow-to-Deep Interaction Framework with Data Augmentation for Multi-Modal Intent Detection. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 10206-10210. https://doi.org/10.1109/ICASSP48485.2024.10446922
  • (2024) Towards determining perceived audience intent for multimodal social media posts using the theory of reasoned action. Scientific Reports 14(1). https://doi.org/10.1038/s41598-024-60299-w
  • (2023) VETE: improving visual embeddings through text descriptions for eCommerce search engines. Multimedia Tools and Applications 82(26), 41343-41379. https://doi.org/10.1007/s11042-023-14595-8

      Published In

      WebMedia '21: Proceedings of the Brazilian Symposium on Multimedia and the Web
      November 2021
      271 pages
      ISBN: 9781450386098
      DOI: 10.1145/3470482
      © 2021 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

      In-Cooperation

      • SBC: Brazilian Computer Society
      • CNPq: Conselho Nacional de Desenvolvimento Científico e Tecnológico
      • CAPES: Brazilian Higher Education Funding Council

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 05 November 2021

      Author Tags

      1. multimodal learning
      2. network embedding
      3. social networks

      Qualifiers

      • Short-paper
      • Research
      • Refereed limited

      Funding Sources

      • Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
      • Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
      • Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)

      Conference

      WebMedia '21: Brazilian Symposium on Multimedia and the Web
      November 5 - 12, 2021
      Belo Horizonte, Minas Gerais, Brazil

      Acceptance Rates

      WebMedia '21 paper acceptance rate: 24 of 75 submissions (32%)
      Overall acceptance rate: 270 of 873 submissions (31%)

      Article Metrics

      • Downloads (last 12 months): 27
      • Downloads (last 6 weeks): 5

      Reflects downloads up to 18 Nov 2024
