
DOI: 10.1109/ICRA.2018.8460699
Research article

Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions

Published: 21 May 2018

Abstract

Comprehension of spoken natural language is an essential skill for robots to communicate with humans effectively. However, handling unconstrained spoken instructions is challenging due to (1) the complex structures and wide variety of expressions used in spoken language, and (2) the inherent ambiguity of human instructions. In this paper, we propose the first comprehensive system for controlling robots with unconstrained spoken language, which is able to effectively resolve ambiguity in spoken instructions. Specifically, we integrate deep learning-based object detection with natural language processing technologies to handle unconstrained spoken instructions, and we propose a method for robots to resolve instruction ambiguity through dialogue. Through experiments in both a simulated environment and on a physical industrial robot arm, we demonstrate that our system understands natural instructions from human operators effectively, and we show that higher success rates on the object-picking task can be achieved through an interactive clarification process.
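As a rough illustration of the interactive clarification step described in the abstract, the sketch below scores detected candidate objects against a spoken instruction and asks a follow-up question when the top two matches are too close to call. The data structures, the word-overlap scoring heuristic, and the ambiguity threshold are illustrative assumptions only; the paper's system instead grounds instructions using deep learning-based object detection combined with natural language processing.

# Illustrative sketch only: names, heuristics, and thresholds below are
# assumptions for explanation, not the authors' implementation.

from dataclasses import dataclass
from typing import List, Optional

AMBIGUITY_MARGIN = 0.1  # ask for clarification when the top two scores are this close


@dataclass
class ObjectCandidate:
    label: str            # short description from an object detector, e.g. "brown plush toy"
    box: tuple            # (x_min, y_min, x_max, y_max) in image coordinates
    score: float = 0.0    # how well this candidate matches the instruction


def score_candidates(instruction: str, candidates: List[ObjectCandidate]) -> None:
    """Toy matcher: fraction of instruction words appearing in the candidate label.
    A real system would ground the instruction in learned visual-linguistic features."""
    words = instruction.lower().split()
    for c in candidates:
        label_words = set(c.label.lower().split())
        c.score = sum(w in label_words for w in words) / max(len(words), 1)


def pick_or_clarify(instruction: str, candidates: List[ObjectCandidate]) -> Optional[ObjectCandidate]:
    """Return the target object, or ask a clarification question when the instruction is ambiguous."""
    score_candidates(instruction, candidates)
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    if len(ranked) < 2:
        return ranked[0] if ranked else None
    best, runner_up = ranked[0], ranked[1]
    if best.score - runner_up.score < AMBIGUITY_MARGIN:
        # Ambiguous instruction: ask the operator to disambiguate instead of guessing.
        print(f"Did you mean the {best.label} or the {runner_up.label}?")
        return None
    return best


if __name__ == "__main__":
    scene = [
        ObjectCandidate("brown plush toy", (10, 20, 60, 90)),
        ObjectCandidate("brown coffee mug", (120, 30, 170, 100)),
        ObjectCandidate("blue box", (200, 40, 260, 110)),
    ]
    # "brown" matches two candidates equally well, so the sketch asks a question.
    target = pick_or_clarify("pick up the brown object", scene)
    if target is not None:
        print(f"Picking the {target.label} at {target.box}")

The question-and-answer step would then repeat with the operator's answer until a single candidate clearly wins, mirroring the interactive clarification process evaluated in the paper's experiments.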



Published In

2018 IEEE International Conference on Robotics and Automation (ICRA)
May 2018
5954 pages

Publisher

IEEE Press
