STMMI: A Self-Tuning Multi-Modal Fusion Algorithm Applied in Assist Robot Interaction

Published: 01 January 2022

Abstract

When operating in complex surroundings, robots must recognize the same intention even when it is expressed in different ways. To help assistive robots understand intentions more reliably, this paper proposes a self-tuning multimodal fusion algorithm that is not restricted by the expressions of the interacting participants or by the environment. The fusion algorithm can be transferred to different application platforms, and a robot can acquire understanding competence and adapt to new tasks simply by changing the content of its knowledge base. In contrast to other multimodal fusion algorithms, this paper transfers the basic structure of feed-forward neural networks to discrete sets, which strengthens the consistency and improves the complementary relations between the modalities and allows the self-tuning of the fusion operator and the intention search to run simultaneously. Three modalities are used: speech, gesture, and scene objects, for which the single-modal classifiers are trained separately. The method was evaluated in a human-computer interaction experiment on the bionic robot Pepper platform; the results show that it effectively improves the accuracy and robustness of robots in understanding human intentions and reduces the uncertainty of intention judgment found in single-modal interaction.
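
To make the fusion step concrete, the following is a minimal sketch, not taken from the paper, of late fusion over a discrete intention set with one self-tuned weight per modality (speech, gesture, scene objects). The intention labels, the weight-update rule, and all identifiers are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch (illustrative, not the authors' algorithm): weighted late fusion
# of three single-modal classifiers over a discrete intention set, with the
# fusion weights tuned online from interaction feedback.
import numpy as np

# Hypothetical discrete intention set.
INTENTIONS = ["bring_water", "open_door", "turn_on_light"]

class SelfTuningFusion:
    def __init__(self, n_modalities=3, lr=0.05):
        # One weight per modality, initialised uniformly and kept on the simplex.
        self.weights = np.full(n_modalities, 1.0 / n_modalities)
        self.lr = lr

    def fuse(self, modal_probs):
        # modal_probs: (n_modalities, n_intentions) posteriors from the
        # separately trained single-modal classifiers.
        scores = self.weights @ modal_probs      # weighted late fusion
        return scores / scores.sum()             # normalised intention distribution

    def intention_search(self, modal_probs):
        # Pick the most probable intention from the fused distribution.
        fused = self.fuse(modal_probs)
        return INTENTIONS[int(np.argmax(fused))], fused

    def self_tune(self, modal_probs, true_intention_idx):
        # Self-tuning step (assumed rule): raise the weights of modalities that
        # supported the confirmed intention, then re-project onto the simplex.
        support = modal_probs[:, true_intention_idx]
        self.weights += self.lr * (support - support.mean())
        self.weights = np.clip(self.weights, 1e-3, None)
        self.weights /= self.weights.sum()

# Usage: speech, gesture, and scene-object classifiers each emit a posterior.
fusion = SelfTuningFusion()
probs = np.array([[0.7, 0.2, 0.1],    # speech
                  [0.5, 0.3, 0.2],    # gesture
                  [0.6, 0.1, 0.3]])   # scene objects
intention, dist = fusion.intention_search(probs)
fusion.self_tune(probs, true_intention_idx=0)  # feedback: intention 0 was correct
print(intention, dist, fusion.weights)
```

In this reading, "self-tuning" and "intention search" operate simultaneously: every interaction both selects the most likely intention and adjusts the fusion operator, so no separate offline retraining pass is needed when the robot's knowledge base or task set changes.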

          Published In

Scientific Programming, Volume 2022
11290 pages
ISSN: 1058-9244 | EISSN: 1875-919X
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

          Publisher

          Hindawi Limited

          London, United Kingdom
