Research article (Open Access)
DOI: 10.1145/3313831.3376479

Enhancing Mobile Voice Assistants with WorldGaze

Published: 23 April 2020

Abstract

Contemporary voice assistants require that objects of interest be specified in spoken commands. Of course, users are often looking directly at the object or place of interest: fine-grained, contextual information that is currently unused. We present WorldGaze, a software-only method for smartphones that provides the real-world gaze location of a user that voice agents can utilize for rapid, natural, and precise interactions. We achieve this by simultaneously opening the front and rear cameras of a smartphone. The front-facing camera is used to track the head in 3D, including estimating its direction vector. As the geometry of the front and back cameras is fixed and known, we can raycast the head vector into the 3D world scene as captured by the rear-facing camera. This allows the user to intuitively define an object or region of interest using their head gaze. We started our investigations with a qualitative exploration of competing methods, before developing a functional, real-time implementation. We conclude with an evaluation that shows WorldGaze can be quick and accurate, opening new multimodal gaze+voice interactions for mobile voice agents.
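The core geometric idea, casting the head-pose ray estimated by the front camera into the rear camera's view via the phone's fixed inter-camera geometry, can be illustrated with a minimal sketch. This is not the authors' implementation; the function and parameter names, and the fixed-depth fallback (a stand-in for a real intersection against the rear camera's reconstructed 3D scene), are assumptions for illustration only.

```python
# Hedged sketch, not the paper's implementation. A head-gaze ray estimated in
# the front-camera frame is mapped into the rear-camera frame (the two cameras
# have a fixed, known relative pose) and projected into the rear image.
import numpy as np

def head_gaze_to_rear_pixel(head_pos_front, head_dir_front,
                            T_rear_from_front, K_rear, assumed_depth_m=2.0):
    """Return the rear-image pixel (u, v) that the head-gaze ray points at.

    head_pos_front    -- (3,) head position in the front-camera frame, meters
    head_dir_front    -- (3,) unit head-direction vector, front-camera frame
    T_rear_from_front -- (4, 4) fixed extrinsic: front-camera -> rear-camera
    K_rear            -- (3, 3) rear-camera pinhole intrinsics
    assumed_depth_m   -- placeholder depth along the ray; a real system would
                         intersect the ray with the reconstructed 3D scene
    """
    # Point on the gaze ray at the assumed depth, in the front-camera frame.
    p_front = np.asarray(head_pos_front) + assumed_depth_m * np.asarray(head_dir_front)

    # Map that point into the rear-camera frame using the fixed camera geometry.
    p_rear = (T_rear_from_front @ np.append(p_front, 1.0))[:3]

    # Standard pinhole projection into the rear image.
    u = K_rear[0, 0] * p_rear[0] / p_rear[2] + K_rear[0, 2]
    v = K_rear[1, 1] * p_rear[1] / p_rear[2] + K_rear[1, 2]
    return u, v
```

The returned pixel could then be matched against object detections or a segmentation of the rear-camera frame to resolve which real-world object the user is gazing at.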

Supplementary Material

• SRT File (paper352pvc.srt): Preview video captions
• MP4 File (paper352vf.mp4): Supplemental video
• MP4 File (paper352pv.mp4): Preview video
• MP4 File (a352-mayer-presentation.mp4): Presentation video




    Published In

    CHI '20: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems
April 2020, 10,688 pages
ISBN: 9781450367080
DOI: 10.1145/3313831

    Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. interaction techniques
    2. mobile interaction
    3. worldgaze


    Conference

    CHI '20

    Acceptance Rates

Overall Acceptance Rate: 6,199 of 26,314 submissions (24%)


    Article Metrics

• Downloads (last 12 months): 548
• Downloads (last 6 weeks): 85
Reflects downloads up to 14 Nov 2024

    Cited By

• (2024) WatchThis: A Wearable Point-and-Ask Interface powered by Vision-Language Models for Contextual Queries. Adjunct Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1-4. DOI: 10.1145/3672539.3686776. Online publication date: 13-Oct-2024.
• (2024) Augmented Object Intelligence with XR-Objects. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1-15. DOI: 10.1145/3654777.3676379. Online publication date: 13-Oct-2024.
• (2024) Body Language for VUIs: Exploring Gestures to Enhance Interactions with Voice User Interfaces. Proceedings of the 2024 ACM Designing Interactive Systems Conference, 133-150. DOI: 10.1145/3643834.3660691. Online publication date: 1-Jul-2024.
• (2024) UHead. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(1), 1-28. DOI: 10.1145/3643551. Online publication date: 6-Mar-2024.
• (2024) GazePointAR: A Context-Aware Multimodal Voice Assistant for Pronoun Disambiguation in Wearable Augmented Reality. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-20. DOI: 10.1145/3613904.3642230. Online publication date: 11-May-2024.
• (2024) Combining IMU With Acoustics for Head Motion Tracking Leveraging Wireless Earphone. IEEE Transactions on Mobile Computing 23(6), 6835-6847. DOI: 10.1109/TMC.2023.3325826. Online publication date: Jun-2024.
• (2024) Show & Tell: Visual and Verbal Cues for Controlling Digital Content. HCI International 2024 Posters, 255-264. DOI: 10.1007/978-3-031-62110-9_27. Online publication date: 1-Jun-2024.
• (2024) Voice Assistants - Research Landscape. Information Systems, 18-37. DOI: 10.1007/978-3-031-56478-9_2. Online publication date: 30-Mar-2024.
• (2023) WorldPoint: Finger Pointing as a Rapid and Natural Trigger for In-the-Wild Mobile Interactions. Proceedings of the ACM on Human-Computer Interaction 7(ISS), 357-375. DOI: 10.1145/3626478. Online publication date: 1-Nov-2023.
• (2023) Investigating Privacy Perceptions and Subjective Acceptance of Eye Tracking on Handheld Mobile Devices. Proceedings of the ACM on Human-Computer Interaction 7(ETRA), 1-16. DOI: 10.1145/3591133. Online publication date: 18-May-2023.
    • Show More Cited By
