
DOI: 10.1109/ICRA.2018.8460699
Research article

Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions

Published: 21 May 2018

Abstract

Comprehension of spoken natural language is an essential skill for robots to communicate with humans effectively. However, handling unconstrained spoken instructions is challenging due to (1) the complex structures and wide variety of expressions used in spoken language, and (2) the inherent ambiguity of human instructions. In this paper, we propose the first comprehensive system for controlling robots with unconstrained spoken language, which is able to effectively resolve ambiguity in spoken instructions. Specifically, we integrate deep learning-based object detection with natural language processing technologies to handle unconstrained spoken instructions, and we propose a method for robots to resolve instruction ambiguity through dialogue. Through experiments in both a simulated environment and on a physical industrial robot arm, we demonstrate that our system understands natural instructions from human operators effectively, and we show that higher success rates on the object-picking task can be achieved through an interactive clarification process.
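As a rough illustration of the interactive clarification step described in the abstract, the sketch below scores detected candidate objects against a spoken instruction and asks a follow-up question when the top two matches are too close to call. The data structures, the word-overlap scoring heuristic, and the ambiguity threshold are illustrative assumptions only; the paper's system instead grounds instructions using deep learning-based object detection combined with natural language processing.

# Illustrative sketch only: names, heuristics, and thresholds below are
# assumptions for explanation, not the authors' implementation.

from dataclasses import dataclass
from typing import List, Optional

AMBIGUITY_MARGIN = 0.1  # ask for clarification when the top two scores are this close


@dataclass
class ObjectCandidate:
    label: str            # short description from an object detector, e.g. "brown plush toy"
    box: tuple            # (x_min, y_min, x_max, y_max) in image coordinates
    score: float = 0.0    # how well this candidate matches the instruction


def score_candidates(instruction: str, candidates: List[ObjectCandidate]) -> None:
    """Toy matcher: fraction of instruction words appearing in the candidate label.
    A real system would ground the instruction in learned visual-linguistic features."""
    words = instruction.lower().split()
    for c in candidates:
        label_words = set(c.label.lower().split())
        c.score = sum(w in label_words for w in words) / max(len(words), 1)


def pick_or_clarify(instruction: str, candidates: List[ObjectCandidate]) -> Optional[ObjectCandidate]:
    """Return the target object, or ask a clarification question when the instruction is ambiguous."""
    score_candidates(instruction, candidates)
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    if len(ranked) < 2:
        return ranked[0] if ranked else None
    best, runner_up = ranked[0], ranked[1]
    if best.score - runner_up.score < AMBIGUITY_MARGIN:
        # Ambiguous instruction: ask the operator to disambiguate instead of guessing.
        print(f"Did you mean the {best.label} or the {runner_up.label}?")
        return None
    return best


if __name__ == "__main__":
    scene = [
        ObjectCandidate("brown plush toy", (10, 20, 60, 90)),
        ObjectCandidate("brown coffee mug", (120, 30, 170, 100)),
        ObjectCandidate("blue box", (200, 40, 260, 110)),
    ]
    # "brown" matches two candidates equally well, so the sketch asks a question.
    target = pick_or_clarify("pick up the brown object", scene)
    if target is not None:
        print(f"Picking the {target.label} at {target.box}")

The question-and-answer step would then repeat with the operator's answer until a single candidate clearly wins, mirroring the interactive clarification process evaluated in the paper's experiments.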



Published In

2018 IEEE International Conference on Robotics and Automation (ICRA)
May 2018
5954 pages

Publisher

IEEE Press
