
Multimodal Driver Interaction with Gesture, Gaze and Speech

Published: 14 October 2019 (DOI: 10.1145/3340555.3356093)

Abstract

The ever-growing body of research in computer vision has created new avenues for user interaction. Speech commands and gesture recognition are already being used alongside various touch-based inputs. It is therefore foreseeable that multimodal input methods are the next phase in the development of user interaction. In this paper, I propose a research plan for novel methods that use multimodal inputs for the semantic interpretation of human-computer interaction, applied specifically to a car driver. A fusion methodology must be designed that adequately combines a recognized gesture (specifically finger pointing), eye gaze and head pose to identify referenced objects, while using the semantics of speech to create a natural interactive environment for the driver. The proposed plan includes several techniques based on artificial neural networks for fusing the camera-based modalities (gaze, head pose and gesture); features extracted from speech are then combined with the output of the fusion algorithm to determine the driver's intent.
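
As a rough illustration of the late-fusion architecture the plan describes, the PyTorch snippet below is a minimal sketch, assuming hypothetical inputs and dimensions (3-D gaze vectors, yaw/pitch/roll head pose, 21 hand joints from a depth sensor, and a 128-dimensional utterance-level speech embedding). None of the module names or sizes come from the paper; they are illustrative stand-ins.

```python
# Minimal late-fusion sketch (illustrative only): encode the camera-based
# modalities (gaze, head pose, gesture) per frame, summarize them with an
# LSTM, then concatenate utterance-level speech features to classify intent.
import torch
import torch.nn as nn

class CameraFusion(nn.Module):
    """Fuses per-frame gaze, head-pose and gesture features over time."""
    def __init__(self, gaze_dim=3, head_dim=3, gesture_dim=63, hidden=64):
        super().__init__()
        self.gaze_enc = nn.Linear(gaze_dim, hidden)      # per-modality encoders
        self.head_enc = nn.Linear(head_dim, hidden)
        self.gesture_enc = nn.Linear(gesture_dim, hidden)
        self.temporal = nn.LSTM(3 * hidden, hidden, batch_first=True)

    def forward(self, gaze, head_pose, gesture):
        # Each input has shape (batch, time, feature_dim).
        x = torch.relu(torch.cat([self.gaze_enc(gaze),
                                  self.head_enc(head_pose),
                                  self.gesture_enc(gesture)], dim=-1))
        _, (h_n, _) = self.temporal(x)
        return h_n[-1]                       # final hidden state: (batch, hidden)

class IntentClassifier(nn.Module):
    """Late fusion: camera embedding concatenated with speech features."""
    def __init__(self, speech_dim=128, hidden=64, n_intents=10):
        super().__init__()
        self.camera = CameraFusion(hidden=hidden)
        self.classifier = nn.Sequential(
            nn.Linear(hidden + speech_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_intents))

    def forward(self, gaze, head_pose, gesture, speech):
        fused = self.camera(gaze, head_pose, gesture)
        return self.classifier(torch.cat([fused, speech], dim=-1))

# Forward pass with random stand-in data: batch of 4, 30 video frames.
model = IntentClassifier()
logits = model(torch.randn(4, 30, 3),    # gaze direction vectors
               torch.randn(4, 30, 3),    # head pose (yaw, pitch, roll)
               torch.randn(4, 30, 63),   # 21 hand joints x 3 coordinates
               torch.randn(4, 128))      # utterance-level speech embedding
print(logits.shape)                      # torch.Size([4, 10])
```

Encoding the camera streams jointly with a recurrent layer before appending the speech features mirrors the late-fusion, RNN/LSTM-based direction suggested by the author tags; the actual fusion method is left open in the research plan.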



Published In

ICMI '19: 2019 International Conference on Multimodal Interaction
October 2019
601 pages
ISBN:9781450368605
DOI:10.1145/3340555
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CNN
  2. data fusion
  3. LSTM
  4. RNN
  5. eye-tracking
  6. gesture recognition
  7. head pose
  8. late fusion
  9. speech commands


Conference

ICMI '19

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%
