
Multimodal Driver Interaction with Gesture, Gaze and Speech

Published: 14 October 2019 (DOI: 10.1145/3340555.3356093)

Abstract

The ever-growing body of research in computer vision has created new avenues for user interaction. Speech commands and gesture recognition are already being used alongside various touch-based inputs. It is therefore foreseeable that multimodal input methods are the next phase in the development of user interaction. In this paper, I propose a research plan for novel methods that use multimodal inputs for the semantic interpretation of human-computer interaction, applied specifically to a car driver. A fusion methodology must be designed that adequately combines a recognized gesture (specifically finger pointing), eye gaze and head pose to identify referenced objects, while using the semantics of speech to create a natural interactive environment for the driver. The proposed plan includes several techniques based on artificial neural networks for fusing the camera-based modalities (gaze, head pose and gesture); features extracted from speech are then combined with the output of the fusion algorithm to determine the driver's intent.
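
As a rough illustration of the late-fusion architecture the plan describes, the PyTorch snippet below is a minimal sketch, assuming hypothetical inputs and dimensions (3-D gaze vectors, yaw/pitch/roll head pose, 21 hand joints from a depth sensor, and a 128-dimensional utterance-level speech embedding). None of the module names or sizes come from the paper; they are illustrative stand-ins.

```python
# Minimal late-fusion sketch (illustrative only): encode the camera-based
# modalities (gaze, head pose, gesture) per frame, summarize them with an
# LSTM, then concatenate utterance-level speech features to classify intent.
import torch
import torch.nn as nn

class CameraFusion(nn.Module):
    """Fuses per-frame gaze, head-pose and gesture features over time."""
    def __init__(self, gaze_dim=3, head_dim=3, gesture_dim=63, hidden=64):
        super().__init__()
        self.gaze_enc = nn.Linear(gaze_dim, hidden)      # per-modality encoders
        self.head_enc = nn.Linear(head_dim, hidden)
        self.gesture_enc = nn.Linear(gesture_dim, hidden)
        self.temporal = nn.LSTM(3 * hidden, hidden, batch_first=True)

    def forward(self, gaze, head_pose, gesture):
        # Each input has shape (batch, time, feature_dim).
        x = torch.relu(torch.cat([self.gaze_enc(gaze),
                                  self.head_enc(head_pose),
                                  self.gesture_enc(gesture)], dim=-1))
        _, (h_n, _) = self.temporal(x)
        return h_n[-1]                       # final hidden state: (batch, hidden)

class IntentClassifier(nn.Module):
    """Late fusion: camera embedding concatenated with speech features."""
    def __init__(self, speech_dim=128, hidden=64, n_intents=10):
        super().__init__()
        self.camera = CameraFusion(hidden=hidden)
        self.classifier = nn.Sequential(
            nn.Linear(hidden + speech_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_intents))

    def forward(self, gaze, head_pose, gesture, speech):
        fused = self.camera(gaze, head_pose, gesture)
        return self.classifier(torch.cat([fused, speech], dim=-1))

# Forward pass with random stand-in data: batch of 4, 30 video frames.
model = IntentClassifier()
logits = model(torch.randn(4, 30, 3),    # gaze direction vectors
               torch.randn(4, 30, 3),    # head pose (yaw, pitch, roll)
               torch.randn(4, 30, 63),   # 21 hand joints x 3 coordinates
               torch.randn(4, 128))      # utterance-level speech embedding
print(logits.shape)                      # torch.Size([4, 10])
```

Encoding the camera streams jointly with a recurrent layer before appending the speech features mirrors the late-fusion, RNN/LSTM-based direction suggested by the author tags; the actual fusion method is left open in the research plan.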



Published In

ICMI '19: 2019 International Conference on Multimodal Interaction
October 2019
601 pages
ISBN:9781450368605
DOI:10.1145/3340555
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CNN
  2. data fusion
  3. LSTM
  4. RNN
  5. eye-tracking
  6. gesture recognition
  7. head pose
  8. late fusion
  9. speech commands


Conference

ICMI '19

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%
