Research Article
DOI: 10.1145/3462244.3481002
Gaze-based Multimodal Meaning Recovery for Noisy / Complex Environments

Published: 18 October 2021

Abstract

Reference resolution is an important problem with enormous practical implications in daily life, for example in recovering the intended meaning in communication when the environment is noisy (acoustic noise in the spoken channel, or clutter and occlusion in the visual world). Recent literature indicates that cross-modal processing of all contributing modalities improves reference resolution in such settings. In this paper, we investigate the contribution of eye tracking, a substantial but underrepresented component of face-to-face communication in NLP systems, to recovering meaning in noisy settings. We integrate gaze features into state-of-the-art language models and test the models on data in which parts of the sentences are masked, mimicking noise in the acoustic channel. The results indicate that eye movements can compensate for the missing information and support communication when the language and visual modalities fail.
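The abstract does not spell out the architecture, but the following is a minimal sketch of how per-token gaze features might be fused with a pretrained masked language model to recover a word hidden by "acoustic noise" (a masked token). The model name (bert-base-uncased), the two-dimensional gaze feature vector (e.g., fixation count and total fixation duration), and the concatenation-based fusion are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch, NOT the authors' implementation: concatenate per-token gaze
# features with the hidden states of a pretrained masked language model and
# predict the word hidden by a [MASK] token.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class GazeFusionMLM(nn.Module):
    def __init__(self, model_name="bert-base-uncased", gaze_dim=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Project the concatenated [text ; gaze] representation back to the
        # encoder's hidden size, then score every vocabulary item per position.
        self.fuse = nn.Linear(hidden + gaze_dim, hidden)
        self.mlm_head = nn.Linear(hidden, self.encoder.config.vocab_size)

    def forward(self, input_ids, attention_mask, gaze_feats):
        # gaze_feats: (batch, seq_len, gaze_dim), aligned with the tokenizer's tokens
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        fused = torch.tanh(
            self.fuse(torch.cat([out.last_hidden_state, gaze_feats], dim=-1))
        )
        return self.mlm_head(fused)  # (batch, seq_len, vocab_size) logits


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = GazeFusionMLM()

# A "noisy channel" sentence: the referent is masked, as if lost to acoustic noise.
enc = tokenizer("The woman picks up the [MASK].", return_tensors="pt")
gaze = torch.zeros(1, enc.input_ids.size(1), 2)  # placeholder gaze vectors
logits = model(enc.input_ids, enc.attention_mask, gaze)
mask_pos = (enc.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.decode([logits[0, mask_pos].argmax().item()]))
```

Note that the fusion layer and prediction head above are randomly initialized; in practice they would have to be fine-tuned on gaze-annotated sentences with masked tokens before the predictions become meaningful.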

    Published In

    ICMI '21: Proceedings of the 2021 International Conference on Multimodal Interaction
    October 2021
    876 pages
    ISBN: 9781450384810
    DOI: 10.1145/3462244

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. gaze detection
    2. meaning recovery
    3. multimodal communication

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICMI '21
    Sponsor:
    ICMI '21: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION
    October 18 - 22, 2021
    Montréal, QC, Canada

    Acceptance Rates

    Overall Acceptance Rate 453 of 1,080 submissions, 42%
