Research article (Open Access)
DOI: 10.1145/3313831.3376479

Enhancing Mobile Voice Assistants with WorldGaze

Published: 23 April 2020

Abstract

Contemporary voice assistants require that objects of interest be specified in spoken commands. Of course, users are often looking directly at the object or place of interest: fine-grained, contextual information that is currently unused. We present WorldGaze, a software-only method for smartphones that provides the real-world gaze location of a user that voice agents can utilize for rapid, natural, and precise interactions. We achieve this by simultaneously opening the front and rear cameras of a smartphone. The front-facing camera is used to track the head in 3D, including estimating its direction vector. As the geometry of the front and back cameras is fixed and known, we can raycast the head vector into the 3D world scene as captured by the rear-facing camera. This allows the user to intuitively define an object or region of interest using their head gaze. We started our investigations with a qualitative exploration of competing methods, before developing a functional, real-time implementation. We conclude with an evaluation that shows WorldGaze can be quick and accurate, opening new multimodal gaze+voice interactions for mobile voice agents.
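The core geometric idea, casting the head-pose ray estimated by the front camera into the rear camera's view via the phone's fixed inter-camera geometry, can be illustrated with a minimal sketch. This is not the authors' implementation; the function and parameter names, and the fixed-depth fallback (a stand-in for a real intersection against the rear camera's reconstructed 3D scene), are assumptions for illustration only.

```python
# Hedged sketch, not the paper's implementation. A head-gaze ray estimated in
# the front-camera frame is mapped into the rear-camera frame (the two cameras
# have a fixed, known relative pose) and projected into the rear image.
import numpy as np

def head_gaze_to_rear_pixel(head_pos_front, head_dir_front,
                            T_rear_from_front, K_rear, assumed_depth_m=2.0):
    """Return the rear-image pixel (u, v) that the head-gaze ray points at.

    head_pos_front    -- (3,) head position in the front-camera frame, meters
    head_dir_front    -- (3,) unit head-direction vector, front-camera frame
    T_rear_from_front -- (4, 4) fixed extrinsic: front-camera -> rear-camera
    K_rear            -- (3, 3) rear-camera pinhole intrinsics
    assumed_depth_m   -- placeholder depth along the ray; a real system would
                         intersect the ray with the reconstructed 3D scene
    """
    # Point on the gaze ray at the assumed depth, in the front-camera frame.
    p_front = np.asarray(head_pos_front) + assumed_depth_m * np.asarray(head_dir_front)

    # Map that point into the rear-camera frame using the fixed camera geometry.
    p_rear = (T_rear_from_front @ np.append(p_front, 1.0))[:3]

    # Standard pinhole projection into the rear image.
    u = K_rear[0, 0] * p_rear[0] / p_rear[2] + K_rear[0, 2]
    v = K_rear[1, 1] * p_rear[1] / p_rear[2] + K_rear[1, 2]
    return u, v
```

The returned pixel could then be matched against object detections or a segmentation of the rear-camera frame to resolve which real-world object the user is gazing at.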

Supplementary Material

• SRT File (paper352pvc.srt): Preview video captions
• MP4 File (paper352vf.mp4): Supplemental video
• MP4 File (paper352pv.mp4): Preview video
• MP4 File (a352-mayer-presentation.mp4): Presentation video




    Published In

    CHI '20: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems
April 2020, 10,688 pages
ISBN: 9781450367080
DOI: 10.1145/3313831

    Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. interaction techniques
    2. mobile interaction
    3. worldgaze


    Conference

    CHI '20

    Acceptance Rates

Overall Acceptance Rate: 6,199 of 26,314 submissions (24%)


    Article Metrics

• Downloads (last 12 months): 548
• Downloads (last 6 weeks): 85
Reflects downloads up to 14 Nov 2024

    Cited By

• (2024) WatchThis: A Wearable Point-and-Ask Interface powered by Vision-Language Models for Contextual Queries. Adjunct Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1-4. DOI: 10.1145/3672539.3686776. Online publication date: 13-Oct-2024.
• (2024) Augmented Object Intelligence with XR-Objects. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1-15. DOI: 10.1145/3654777.3676379. Online publication date: 13-Oct-2024.
• (2024) Body Language for VUIs: Exploring Gestures to Enhance Interactions with Voice User Interfaces. Proceedings of the 2024 ACM Designing Interactive Systems Conference, 133-150. DOI: 10.1145/3643834.3660691. Online publication date: 1-Jul-2024.
• (2024) UHead. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(1), 1-28. DOI: 10.1145/3643551. Online publication date: 6-Mar-2024.
• (2024) GazePointAR: A Context-Aware Multimodal Voice Assistant for Pronoun Disambiguation in Wearable Augmented Reality. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-20. DOI: 10.1145/3613904.3642230. Online publication date: 11-May-2024.
• (2024) Combining IMU With Acoustics for Head Motion Tracking Leveraging Wireless Earphone. IEEE Transactions on Mobile Computing 23(6), 6835-6847. DOI: 10.1109/TMC.2023.3325826. Online publication date: Jun-2024.
• (2024) Show & Tell: Visual and Verbal Cues for Controlling Digital Content. HCI International 2024 Posters, 255-264. DOI: 10.1007/978-3-031-62110-9_27. Online publication date: 1-Jun-2024.
• (2024) Voice Assistants - Research Landscape. Information Systems, 18-37. DOI: 10.1007/978-3-031-56478-9_2. Online publication date: 30-Mar-2024.
• (2023) WorldPoint: Finger Pointing as a Rapid and Natural Trigger for In-the-Wild Mobile Interactions. Proceedings of the ACM on Human-Computer Interaction 7(ISS), 357-375. DOI: 10.1145/3626478. Online publication date: 1-Nov-2023.
• (2023) Investigating Privacy Perceptions and Subjective Acceptance of Eye Tracking on Handheld Mobile Devices. Proceedings of the ACM on Human-Computer Interaction 7(ETRA), 1-16. DOI: 10.1145/3591133. Online publication date: 18-May-2023.
    • Show More Cited By
