Abstract
There has been an increasing demand for visual grounding in various human-robot interaction applications. However, the accuracy is often limited by the size of the dataset that can be collected, which is often a challenge. Hence, this paper proposes using the natural implicit input modality of human gaze to assist and improve the visual grounding accuracy of human instructions to robotic agents. To demonstrate the capability, mechanical gear objects are used. To achieve that, we utilized a transformer-based text classifier and a small corpus to develop a baseline phrase grounding model. We evaluate this phrase grounding system with and without gaze input to demonstrate the improvement. Gaze information (obtained from Microsoft Hololens2) improves the performance accuracy from 26% to 65%, leading to more efficient human-robot collaboration and applicable to hands-free scenarios. This approach is data-efficient as it requires only a small training dataset to ground the natural language referring expressions.
This research is supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (Project # A18A2b0046).
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Bhardwaj, R., Majumder, N., Poria, S., Hovy, E.: More identifiable yet equally performant transformers for text classification. arXiv preprint arXiv:2106.01269 (2021)
Bloss, R.: Collaborative robots are rapidly providing major improvements in productivity, safety, programing ease, portability and cost while addressing many new applications. Ind. Robot Int. J. (2016)
Chen, L., Ma, W., Xiao, J., Zhang, H., Chang, S.F.: Ref-NMS: breaking proposal bottlenecks in two-stage referring expression grounding. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1036–1044 (2021)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010). https://doi.org/10.1007/s11263-009-0275-4
Johari, K., Karumpulli, N., Tan, U.X.: Complementing speech interaction design with touch for multi-robot systems. In: TENCON 2019–2019 IEEE Region 10 Conference (TENCON), pp. 1400–1405. IEEE (2019)
Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: Referitgame: referring to objects in photographs of natural scenes. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 787–798 (2014)
Krawczyk, B.: Learning from imbalanced data: open challenges and future directions. Prog. Artif. Intell. 5(4), 221–232 (2016). https://doi.org/10.1007/s13748-016-0094-0
Krishna, R., et al.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332 (2016)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part V. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Majaranta, P., Bulling, A.: Eye tracking and eye-based human–computer interaction. In: Fairclough, S.H., Gilleade, K. (eds.) Advances in Physiological Computing. HIS, pp. 39–65. Springer, London (2014). https://doi.org/10.1007/978-1-4471-6392-3_3
Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11–20 (2016)
Palinko, O., Rea, F., Sandini, G., Sciutti, A.: Robot reading human gaze: why eye tracking is better than head tracking for human-robot collaboration. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5048–5054. IEEE (2016)
Park, K.B., Choi, S.H., Lee, J.Y., Ghasemi, Y., Mohammed, M., Jeong, H.: Hands-free human-robot interaction using multimodal gestures and deep learning in wearable mixed reality. IEEE Access 9, 55448–55464 (2021)
Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2641–2649 (2015)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497 (2015)
Sadhu, A., Chen, K., Nevatia, R.: Zero-shot grounding of objects from natural language queries. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4694–4703 (2019)
Scalise, R., Li, S., Admoni, H., Rosenthal, S., Srinivasa, S.S.: Natural language instructions for human-robot collaborative manipulation. Int. J. Robot. Res. 37(6), 558–565 (2018)
Sharma, V.K., Murthy, L., Saluja, K.S., Mollyn, V., Sharma, G., Biswas, P.: Eye gaze controlled robotic arm for persons with ssmi. arXiv preprint arXiv:2005.11994 (2020)
Shridhar, M., Mittal, D., Hsu, D.: Ingress: interactive visual grounding of referring expressions. Int. J. Robot. Res. 39(2–3), 217–232 (2020)
Sidenmark, L., Mardanbegi, D., Gomez, A.R., Clarke, C., Gellersen, H.: Bimodalgaze: seamlessly refined pointing with gaze and filtered gestural head movement. In: ACM Symposium on Eye Tracking Research and Applications, pp. 1–9 (2020)
Stiefelhagen, R., Fugen, C., Gieselmann, R., Holzapfel, H., Nickel, K., Waibel, A.: Natural human-robot interaction using speech, head pose and gestures. In: 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No. 04CH37566), vol. 3, pp. 2422–2427. IEEE (2004)
Wang, M.Y., Kogkas, A.A., Darzi, A., Mylonas, G.P.: Free-view, 3D gaze-guided, assistive robotic system for activities of daily living. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2355–2361. IEEE (2018)
Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2 (2019). https://github.com/facebookresearch/detectron2
Yang, Z., Gong, B., Wang, L., Huang, W., Yu, D., Luo, J.: A fast and accurate one-stage approach to visual grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4683–4693 (2019)
Yu, L., et al.: Mattnet: modular attention network for referring expression comprehension. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1307–1315 (2018)
Zhou, Y., et al.: A real-time global inference network for one-stage referring expression comprehension. IEEE Trans. Neural Netw. Learn. Syst. (2021)
Zitnick, C.L., Dollár, P.: Edge boxes: locating object proposals from edges. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part V. LNCS, vol. 8693, pp. 391–405. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_26
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Johari, K., Tong, C.T.Z., Subbaraju, V., Kim, JJ., Tan, UX. (2021). Gaze Assisted Visual Grounding. In: Li, H., et al. Social Robotics. ICSR 2021. Lecture Notes in Computer Science(), vol 13086. Springer, Cham. https://doi.org/10.1007/978-3-030-90525-5_17
Download citation
DOI: https://doi.org/10.1007/978-3-030-90525-5_17
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-90524-8
Online ISBN: 978-3-030-90525-5
eBook Packages: Computer ScienceComputer Science (R0)