Research Article
DOI: 10.1145/3462244.3481002
Gaze-based Multimodal Meaning Recovery for Noisy / Complex Environments

Published: 18 October 2021

Abstract

Reference resolution is an important problem with enormous practical implications in daily life, for example in recovering the intended meaning in communication when the environment is noisy (acoustic noise in the spoken channel, or clutter and occlusion in the visual world). Recent literature indicates that cross-modal processing of all contributing modalities improves reference resolution in such settings. In this paper, we investigate the contribution of eye tracking, a substantial but underrepresented component of face-to-face communication in NLP systems, to recovering meaning in noisy settings. We integrate gaze features into state-of-the-art language models and test the models on data in which parts of the sentences are masked, mimicking noise in the acoustic channel. The results indicate that eye movements can compensate for the missing information and support communication when the language and visual modalities fail.
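The abstract does not spell out the architecture, but the following is a minimal sketch of how per-token gaze features might be fused with a pretrained masked language model to recover a word hidden by "acoustic noise" (a masked token). The model name (bert-base-uncased), the two-dimensional gaze feature vector (e.g., fixation count and total fixation duration), and the concatenation-based fusion are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch, NOT the authors' implementation: concatenate per-token gaze
# features with the hidden states of a pretrained masked language model and
# predict the word hidden by a [MASK] token.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class GazeFusionMLM(nn.Module):
    def __init__(self, model_name="bert-base-uncased", gaze_dim=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Project the concatenated [text ; gaze] representation back to the
        # encoder's hidden size, then score every vocabulary item per position.
        self.fuse = nn.Linear(hidden + gaze_dim, hidden)
        self.mlm_head = nn.Linear(hidden, self.encoder.config.vocab_size)

    def forward(self, input_ids, attention_mask, gaze_feats):
        # gaze_feats: (batch, seq_len, gaze_dim), aligned with the tokenizer's tokens
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        fused = torch.tanh(
            self.fuse(torch.cat([out.last_hidden_state, gaze_feats], dim=-1))
        )
        return self.mlm_head(fused)  # (batch, seq_len, vocab_size) logits


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = GazeFusionMLM()

# A "noisy channel" sentence: the referent is masked, as if lost to acoustic noise.
enc = tokenizer("The woman picks up the [MASK].", return_tensors="pt")
gaze = torch.zeros(1, enc.input_ids.size(1), 2)  # placeholder gaze vectors
logits = model(enc.input_ids, enc.attention_mask, gaze)
mask_pos = (enc.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.decode([logits[0, mask_pos].argmax().item()]))
```

Note that the fusion layer and prediction head above are randomly initialized; in practice they would have to be fine-tuned on gaze-annotated sentences with masked tokens before the predictions become meaningful.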

    Published In

    ICMI '21: Proceedings of the 2021 International Conference on Multimodal Interaction
    October 2021
    876 pages
    ISBN: 9781450384810
    DOI: 10.1145/3462244

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. gaze detection
    2. meaning recovery
    3. multimodal communication

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICMI '21
    Sponsor:
    ICMI '21: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION
    October 18 - 22, 2021
    Montréal, QC, Canada

    Acceptance Rates

    Overall Acceptance Rate 453 of 1,080 submissions, 42%
