Gaze Assisted Visual Grounding

Kritika Johari, Christopher Tay Zi Tong, Vigneshwaran Subbaraju, Jung-Jae Kim & U-Xuan Tan

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13086)

Included in the following conference series:

International Conference on Social Robotics

Abstract

Demand for visual grounding is growing across human-robot interaction applications. However, grounding accuracy is often limited by the size of the dataset that can be collected, and collecting a large dataset is itself a challenge. This paper therefore proposes using human gaze, a natural and implicit input modality, to improve the accuracy with which human instructions to robotic agents are visually grounded. We demonstrate the approach on mechanical gear objects. We use a transformer-based text classifier and a small corpus to develop a baseline phrase grounding model, and we evaluate this system with and without gaze input. Gaze information (obtained from a Microsoft HoloLens 2) improves accuracy from 26% to 65%, which can lead to more efficient human-robot collaboration and makes the approach applicable to hands-free scenarios. The approach is data-efficient: it requires only a small training dataset to ground natural language referring expressions.
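As a rough illustration of how gaze could disambiguate a phrase grounding decision, the sketch below re-weights per-object text scores by their proximity to a gaze point. This is not the method described in the chapter, whose fusion details are not given in the abstract. The `Candidate` type, the Gaussian weighting, the `sigma` value, and all scores and coordinates are hypothetical and chosen only for the example.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Candidate:
    """A detected object with a score from a text classifier (hypothetical type)."""
    label: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    text_score: float  # how well the referring phrase matches this object


def gaze_weight(box, gaze, sigma):
    """Gaussian falloff with distance between the box center and the gaze point."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    dist_sq = (cx - gaze[0]) ** 2 + (cy - gaze[1]) ** 2
    return math.exp(-dist_sq / (2.0 * sigma ** 2))


def ground(candidates: Sequence[Candidate],
           gaze: Optional[Tuple[float, float]] = None,
           sigma: float = 50.0) -> Candidate:
    """Return the best candidate, using text scores alone or text scores times gaze weight."""
    def score(c: Candidate) -> float:
        if gaze is None:
            return c.text_score
        return c.text_score * gaze_weight(c.box, gaze, sigma)
    return max(candidates, key=score)


# Two visually similar gears that the text classifier can barely tell apart.
gears = [
    Candidate("gear_a", (0.0, 0.0, 100.0, 100.0), text_score=0.50),
    Candidate("gear_b", (300.0, 0.0, 400.0, 100.0), text_score=0.45),
]

text_only = ground(gears)                          # decided by text score alone
with_gaze = ground(gears, gaze=(350.0, 50.0))      # user is looking at gear_b
print(text_only.label, with_gaze.label)
```

In this toy setup the text-only decision picks the marginally higher-scoring gear, while the gaze-weighted decision picks the gear the user is looking at. A multiplicative prior is only one possible fusion choice; filtering candidates by gaze region before classification would be another.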

This research is supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (Project # A18A2b0046).


Author information

Authors and Affiliations

Singapore University of Technology and Design, Singapore, Singapore
Kritika Johari, Christopher Tay Zi Tong & U-Xuan Tan
Institute of High Performance Computing, A*STAR, Singapore, Singapore
Vigneshwaran Subbaraju
Institute for Infocomm Research, A*STAR, Singapore, Singapore
Jung-Jae Kim

Corresponding author

Correspondence to Kritika Johari.

Editor information

Editors and Affiliations

Department of Electronic and Communication Engineering, National University of Singapore, Faculty of Engineering, Singapore, Singapore
Haizhou Li
The National University of Singapore, Singapore, Singapore
Shuzhi Sam Ge
A*STAR Institute for Infocomm Research, Singapore, Singapore
Yan Wu
Center for Human Technologies, Istituto Italiano Tecnologia, Genoa, Italy
Agnieszka Wykowska
Department of Electrical Engineering and Computer Science, Wichita State University, Wichita, KS, USA
Hongsheng He
Qingdao University, Qingdao, China
Xiaorui Liu
School of Cyber Science and Technology, Beihang University, Beijing, Beijing, China
Dongyu Li
Social Cognition Human-Robot Interaction, Istituto Italiano di Tecnologia, Genoa, Italy
Jairo Perez-Osorio

About this paper

Cite this paper

Johari, K., Tong, C.T.Z., Subbaraju, V., Kim, JJ., Tan, UX. (2021). Gaze Assisted Visual Grounding. In: Li, H., et al. Social Robotics. ICSR 2021. Lecture Notes in Computer Science, vol 13086. Springer, Cham. https://doi.org/10.1007/978-3-030-90525-5_17

DOI: https://doi.org/10.1007/978-3-030-90525-5_17
Published: 02 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-90524-8
Online ISBN: 978-3-030-90525-5
eBook Packages: Computer Science, Computer Science (R0)
