An Underwater Human–Robot Interaction Using a Visual–Textual Model for Autonomous Underwater Vehicles
Figure 1. The CADDY system for assistance in diver missions as an example of U-HRI [6,7].
Figure 2. The current traditional framework (a) and our multimodal framework (b).
Figure 3. An overview of our proposed visual–textual baseline (VT-UHGR).
Figure 4. The overall structure of the transformer block.
Figure 5. The overall structure of the ViT model.
Figure 6. The overall structure of the BERT model.
Figure 7. Example images from the CADDY dataset [6,7].
Figure 8. Example images from the SCUBANet dataset [48].
Abstract
1. Introduction
- Underwater gesture recognition is formulated as a multimodal problem, and the U-HRI performance of an AUV is improved by introducing a text modality that fully exploits the feature associations between images and text (a minimal illustrative sketch follows this list).
- We propose a new underwater visual–textual gesture recognition model (VT-UHGR).
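To make the multimodal formulation above concrete, here is a minimal sketch, not the authors' released implementation, of a visual–textual gesture classifier in the spirit of VT-UHGR: a pretrained ViT encodes the diver image, a pretrained BERT encodes a short textual prompt about the gesture, and a single transformer encoder layer fuses the two token sequences before classification. The checkpoint names, prompt wording, fusion depth, and the 16-class output size are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of a ViT + BERT visual-textual gesture classifier
# (assumed checkpoints, prompt, and layer sizes; see the note above).
import torch
import torch.nn as nn
from transformers import ViTModel, BertModel, BertTokenizer


class VisualTextualGestureClassifier(nn.Module):
    def __init__(self, num_classes: int, hidden_dim: int = 768):
        super().__init__()
        # Pretrained backbones; the paper specifies ViT and BERT, but the
        # exact checkpoints below are assumptions.
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Multimodal interaction: one transformer encoder layer over the
        # concatenated visual and textual token sequences.
        fusion_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=8, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=1)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, pixel_values, input_ids, attention_mask):
        vis = self.vit(pixel_values=pixel_values).last_hidden_state        # (B, Nv, D)
        txt = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state   # (B, Nt, D)
        fused = self.fusion(torch.cat([vis, txt], dim=1))                  # (B, Nv+Nt, D)
        return self.classifier(fused[:, 0])  # classify from the leading (ViT [CLS]) token


if __name__ == "__main__":
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    # Hypothetical prompt wording for one gesture class.
    text = tokenizer(["a diver making a hand gesture meaning boat"],
                     return_tensors="pt", padding=True)
    model = VisualTextualGestureClassifier(num_classes=16)  # class count is an assumption
    images = torch.randn(1, 3, 224, 224)  # dummy RGB image batch
    logits = model(images, text["input_ids"], text["attention_mask"])
    print(logits.shape)  # torch.Size([1, 16])
```

In a setup like this, the class prompts could be tokenized once and their BERT features cached, so the text branch would add little cost at inference time.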
2. Related Work
2.1. Hand Gesture Recognition
2.2. Underwater Gesture Recognition
2.3. Vision–Text Multimodality
3. Method
3.1. Pretraining
3.2. Transformer Block
3.3. Visual Feature Extraction
3.4. Textual Feature Extraction
3.5. Multimodal Interaction
3.6. Loss Function
4. Results
4.1. Datasets
4.2. Implementation Details
4.3. Comparison with the State of the Art
4.4. Ablation Study
5. Limitations
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Birk, A. A Survey of Underwater Human-Robot Interaction (U-HRI). Curr. Robot. Rep. 2022, 3, 199–211. [Google Scholar] [CrossRef]
- Mišković, N.; Egi, M.; Nad, D.; Pascoal, A.; Sebastiao, L.; Bibuli, M. Human-robot interaction underwater: Communication and safety requirements. In Proceedings of the 2016 IEEE Third Underwater Communications and Networking Conference (UComms), Lerici, Italy, 30 August–1 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–5. [Google Scholar]
- Sun, K.; Cui, W.; Chen, C. Review of Underwater Sensing Technologies and Applications. Sensors 2021, 21, 7849. [Google Scholar] [CrossRef] [PubMed]
- Pan, S.; Shi, L.; Guo, S. A Kinect-Based Real-Time Compressive Tracking Prototype System for Amphibious Spherical Robots. Sensors 2015, 15, 8232–8252. [Google Scholar] [CrossRef] [PubMed]
- Qin, R.; Zhao, X.; Zhu, W.; Yang, Q.; He, B.; Li, G.; Yan, T. Multiple Receptive Field Network (MRF-Net) for Autonomous Underwater Vehicle Fishing Net Detection Using Forward-Looking Sonar Images. Sensors 2021, 21, 1933. [Google Scholar] [CrossRef]
- Chiarella, D.; Bibuli, M.; Bruzzone, G.; Caccia, M.; Ranieri, A.; Zereik, E.; Marconi, L.; Cutugno, P. A novel gesture-based language for underwater human–robot interaction. J. Mar. Sci. Eng. 2018, 6, 91. [Google Scholar] [CrossRef] [Green Version]
- Gomez Chavez, A.; Ranieri, A.; Chiarella, D.; Zereik, E.; Babić, A.; Birk, A. CADDY Underwater Stereo-Vision Dataset for Human–Robot Interaction (HRI) in the Context of Diver Activities. J. Mar. Sci. Eng. 2019, 7, 16. [Google Scholar] [CrossRef] [Green Version]
- Blizard, M.A. Ocean optics: Introduction and overview. In Ocean Optics VIII; SPIE: Bellingham, WA, USA, 1986; Volume 637, pp. 2–17. [Google Scholar]
- Schettini, R.; Corchs, S. Underwater image processing: State of the art of restoration and image enhancement methods. EURASIP J. Adv. Signal Process. 2010, 2010, 1–14. [Google Scholar] [CrossRef] [Green Version]
- Li, C.; Guo, C.; Ren, W.; Cong, R.; Hou, J.; Kwong, S.; Tao, D. An underwater image enhancement benchmark dataset and beyond. IEEE Trans. Image Process. 2019, 29, 4376–4389. [Google Scholar] [CrossRef] [Green Version]
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
- Fang, H.; Xiong, P.; Xu, L.; Chen, Y. Clip2video: Mastering video-text retrieval via image clip. arXiv 2021, arXiv:2106.11097. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Miech, A.; Laptev, I.; Sivic, J. Learning a text-video embedding from incomplete and heterogeneous data. arXiv 2018, arXiv:1804.02516. [Google Scholar]
- Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; Hoi, S.C.H. Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 2021, 34, 9694–9705. [Google Scholar]
- Lei, J.; Li, L.; Zhou, L.; Gan, Z.; Berg, T.L.; Bansal, M.; Liu, J. Less is more: Clipbert for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 7331–7341. [Google Scholar]
- Wang, M.; Xing, J.; Liu, Y. Actionclip: A new paradigm for video action recognition. arXiv 2021, arXiv:2109.08472. [Google Scholar]
- Cheng, X.; Jia, M.; Wang, Q.; Zhang, J. A Simple Visual-Textual Baseline for Pedestrian Attribute Recognition. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6994–7004. [Google Scholar] [CrossRef]
- Chen, Q.; Georganas, N.D.; Petriu, E.M. Real-time vision-based hand gesture recognition using haar-like features. In Proceedings of the 2007 IEEE Instrumentation & Measurement Technology Conference IMTC, Warsaw, Poland, 1–3 May 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 1–6. [Google Scholar]
- Saha, S.; Lahiri, R.; Konar, A.; Banerjee, B.; Nagar, A.K. HMM-based gesture recognition system using Kinect sensor for improvised human-computer interaction. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2776–2783. [Google Scholar]
- Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 2014, 27, 568–576. [Google Scholar]
- Zhu, Y.; Lan, Z.; Newsam, S.; Hauptmann, A. Hidden two-stream convolutional networks for action recognition. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 363–378. [Google Scholar]
- Devineau, G.; Moutarde, F.; Xi, W.; Yang, J. Deep learning for hand gesture recognition on skeletal data. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 106–113. [Google Scholar]
- Nguyen, X.S.; Brun, L.; Lézoray, O.; Bougleux, S. A neural network based on SPD manifold learning for skeleton-based hand gesture recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 12036–12045. [Google Scholar]
- Bakar, M.Z.A.; Samad, R.; Pebrianti, D.; Mustafa, M.; Abdullah, N.R.H. Finger application using K-Curvature method and Kinect sensor in real-time. In Proceedings of the 2015 International Symposium on Technology Management and Emerging Technologies (ISTMET), Langkawi Island, Malaysia, 25–27 August 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 218–222. [Google Scholar]
- Wu, X.; Finnegan, D.; O’Neill, E.; Yang, Y.L. Handmap: Robust hand pose estimation via intermediate dense guidance map supervision. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 237–253. [Google Scholar]
- Ge, L.; Ren, Z.; Li, Y.; Xue, Z.; Wang, Y.; Cai, J.; Yuan, J. 3d hand shape and pose estimation from a single rgb image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 10833–10842. [Google Scholar]
- Cai, Y.; Ge, L.; Cai, J.; Yuan, J. Weakly-supervised 3d hand pose estimation from monocular rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 666–682. [Google Scholar]
- Miao, Q.; Li, Y.; Ouyang, W.; Ma, Z.; Xu, X.; Shi, W.; Cao, X. Multimodal gesture recognition based on the resc3d network. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 3047–3055. [Google Scholar]
- Zhu, G.; Zhang, L.; Mei, L.; Shao, J.; Song, J.; Shen, P. Large-scale isolated gesture recognition using pyramidal 3d convolutional networks. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 19–24. [Google Scholar]
- Kim, H.G.; Seo, J.; Kim, S.M. Underwater Optical-Sonar Image Fusion Systems. Sensors 2022, 22, 8445. [Google Scholar] [CrossRef]
- Du, W.; Yang, Y.; Liu, L. Research on the Recognition Performance of Bionic Sensors Based on Active Electrolocation for Different Materials. Sensors 2020, 20, 4608. [Google Scholar] [CrossRef]
- Yang, J.; Wilson, J.P.; Gupta, S. Diver gesture recognition using deep learning for underwater human-robot interaction. In Proceedings of the OCEANS 2019 MTS/IEEE SEATTLE, Seattle, WA, USA, 27–31 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef] [Green Version]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
- Chavez, A.G.; Ranieri, A.; Chiarella, D.; Birk, A. Underwater Vision-Based Gesture Recognition: A Robustness Validation for Safe Human–Robot Interaction. IEEE Robot. Autom. Mag. 2021, 28, 67–78. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef] [Green Version]
- Zhao, M.; Hu, C.; Wei, F.; Wang, K.; Wang, C.; Jiang, Y. Real-time underwater image recognition with FPGA embedded system for convolutional neural network. Sensors 2019, 19, 350. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Piergiovanni, A.; Ryoo, M. Learning multimodal representations for unseen activities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 2–5 March 2020; pp. 517–526. [Google Scholar]
- Alayrac, J.B.; Recasens, A.; Schneider, R.; Arandjelović, R.; Ramapuram, J.; De Fauw, J.; Smaira, L.; Dieleman, S.; Zisserman, A. Self-supervised multimodal versatile networks. Adv. Neural Inf. Process. Syst. 2020, 33, 25–37. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
- Peng, Y.; Yan, S.; Lu, Z. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. arXiv 2019, arXiv:1906.05474. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
- Cambria, E.; White, B. Jumping NLP curves: A review of natural language processing research. IEEE Comput. Intell. Mag. 2014, 9, 48–57. [Google Scholar] [CrossRef]
- Codd-Downey, R.; Jenkin, M. Finding divers with SCUBANet. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 5746–5751. [Google Scholar] [CrossRef]
Methods | Acc | Params (M) | Time (ms) |
---|---|---|---|
AlexNet | 0.83 | 61.1 | 0.84 |
ResNet | 0.88 | 11.7 | 1.26 |
GoogLeNet | 0.90 | 6.8 | 1.65 |
VGGNet | 0.95 | 138.4 | 2.14 |
VT-UHGR (ours) | 0.98 | 178.4 | 2.87 |
Methods | Acc |
---|---|
ResNet | 0.75 |
GoogLeNet | 0.78 |
VGGNet | 0.82 |
VT-UHGR (ours) | 0.86 |
Methods | Acc |
---|---|
MD-NCMF | 0.77 |
SSD with MobileNets | 0.85 |
FC-CNN with ResNet-50 | 0.95 |
Deformable Faster R-CNN | 0.98 |
VT-UHGR (ours) | 0.98 |
Methods | Textual Encoder | Acc |
---|---|---|
ViT | − | 95.81 |
VT-UHGR | One-hot | 97.13 |
VT-UHGR | BERT | 98.32 |
Methods | Structure | Acc |
---|---|---|
VT-UHGR | − | 97.19 |
VT-UHGR | + Transformer Encoder | 97.78 |
VT-UHGR | + | 98.11 |
VT-UHGR | + | 98.32 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).