Integration of Tracking, Re-Identification, and Gesture Recognition for Facilitating Human–Robot Interaction
Figure 1. Overall flow architecture of the proposed Robot-Facilitated Interaction System (RFIS). Note that the number 1 in the red circle indicates the cropped body-part image sets detected from the cascaded YOLOv3.
Figure 2. The proposed person and body-part detection and classification network, configured with the cascaded YOLOv3 and ResNet18-based transfer learning for feature embedding and classification.
Figure 3. Schematic description of the proposed person/body-part tracking process with person prediction and fusion.
Figure 4. The schematic flow of the proposed caretaker recognition and re-identification process, particularly highlighting the SSIM-based re-identification of the caretaker: (a) the caretaker recognition network based on EfficientNet; (b) the caretaker re-identification based on SSIM of the face and upper-body images.
Figure 5. Four categories of pre-defined gestures to communicate with robots: (a) “Here I am”, (b) “Stay there”, (c) “Come here”, and (d) “Follow me”. Note that in each panel, the white and yellow arrows indicate the pixel coordinates from the center of the face box to the center of the hand box.
Figure 6. The proposed stacked LSTM architecture for gesture recognition.
Figure 7. Dataset examples: (a) customized dataset; (b) benchmark datasets (MOT17, COCO, ChokePoint).
Figure 8. Person/body-part detection and classification results on each dataset: (a) MOT17, (b) COCO, (c) customized dataset.
Figure 9. Person tracking results on the MOT17 benchmark dataset.
Figure 10. Typical instances of person detection, classification, and tracking. (a–c) illustrate the robustness of the proposed person tracking under ill-conditioned situations: (a) tracking in a narrow passage with overlapping, (b) tracking across a short-term interval of 2 s, and (c) tracking under temporary disappearance with crossing.
Figure 11. Caretaker recognition results on each dataset: (a) ChokePoint, (b) customized dataset. Note that the cropped face image is small because the face occupies only a very small area of the overall scene image.
Figure 12. Various typical instances that occurred during the caretaker re-identification process: (a) the head and upper-body images of a caretaker; (b) the re-identification of a newly appeared candidate as a caretaker; (c) the re-identification of a newly appeared candidate as a non-caretaker.
Figure 13. The confusion matrix detailing the performance of the proposed gesture recognition for the five categories of gestures: (a) results of the level 1 to 4 scenarios; (b) results of the level 5 scenario.
Figure 14. Typical instances of the 5 categories of gestures recognized in real time in an authentic environment, using the gesture recognition network integrated into RFIS. Note: the use of a low-resolution camera for computational efficiency somewhat compromises the image quality.
Figure 15. The caretaker recognition and re-identification results of the level 4 scenario experiment: (a) caretaker recognition; (b) caretaker re-identification.
Figure 16. The caretaker recognition and re-identification results of the level 5 scenario experiment: (a) recognition of the female caretaker, with a switch to the male caretaker by recognition after the female caretaker’s disappearance; (b) recognition of the male caretaker, with a switch to the female caretaker by recognition after the male caretaker’s disappearance.
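The captions of Figures 5 and 6 describe the gesture cue (the pixel offset from the face-box center to the hand-box center) and the stacked LSTM that classifies a sequence of such cues into the pre-defined gestures. Below is a minimal PyTorch sketch of that idea; the hidden size, number of layers, sequence length, and input features are illustrative choices, not values reported in the paper:

```python
import torch
import torch.nn as nn

class GestureLSTM(nn.Module):
    """Stacked LSTM over per-frame (dx, dy) face-to-hand offsets -> 5 gesture classes."""
    def __init__(self, in_dim=2, hidden=64, layers=2, n_classes=5):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, seq_len, 2)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # logits from the last time step

def face_to_hand_offset(face_box, hand_box):
    """Pixel offset from the face-box center to the hand-box center.
    Boxes are (x, y, w, h) in image coordinates."""
    fcx, fcy = face_box[0] + face_box[2] / 2, face_box[1] + face_box[3] / 2
    hcx, hcy = hand_box[0] + hand_box[2] / 2, hand_box[1] + hand_box[3] / 2
    return (hcx - fcx, hcy - fcy)

# Example: a 30-frame sequence of offsets classified into one of the 5 gestures.
seq = torch.randn(1, 30, 2)               # stand-in for real face-to-hand offsets
logits = GestureLSTM()(seq)
gesture_id = int(logits.argmax(dim=1))
```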
Abstract
1. Introduction
1.1. Related Work
1.1.1. Person Detection and Tracking
1.1.2. Person Recognition and Re-Identification
1.1.3. Gesture Recognition
1.2. Problem Statement and Contribution
2. Overview of the Proposed System
3. Person and Body-Part Detection and Classification
4. Person Tracking
Algorithm 1. Person Tracking Algorithm
Input: individual (tracked) boxes at frames k − 1 and k; YOLOv3 boxes at frame k + 1
Require: Δt = 30, ρ = 0.04, iou_threshold = 0.4, T = 500
1: for each tracked person i with boxes at frames k − 1 and k:
2:   (frame k + 1 box linear estimation) ΔX, ΔY, ΔW, ΔH ← (i-th box at frame k − 1) − (i-th box at frame k)
3:   (uncertainty box) U_i ← [ΔX, ΔY, Δt · ρ · ΔW, Δt · ρ · ΔH]
4:   (box overlap) for each YOLOv3 box j at frame k + 1: compute the overlap ratio between box j and the uncertainty box U_i
5:   (distance) if the box overlap ratio > iou_threshold: compute the distance between box j and the predicted box of person i
6:   (conditional probability) for each candidate k among the N results of step 5 (N is the total number of step 5 results): assign its positive and negative conditional probabilities of matching person i, and assign the complementary positive/negative conditional probabilities to every other candidate n ≠ k
7:   (assign tracking number) the candidate with the maximum conditional probability is taken to be the same person as track i: trackingID ← argmax(conditional probability)
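A minimal Python sketch of the tracking step above, using the Δt, ρ, and iou_threshold values from Algorithm 1. The uncertainty-box construction, the center-distance measure, and the softmax used as a stand-in for the conditional-probability terms are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x, y, w, h) with (x, y) the top-left corner."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def track_step(tracks_km1, tracks_k, dets_kp1, dt=30, rho=0.04, iou_thr=0.4):
    """Assign each track a YOLOv3 detection at frame k+1 (or None).

    tracks_km1, tracks_k: dict track_id -> (x, y, w, h) at frames k-1 and k.
    dets_kp1: list of YOLOv3 boxes (x, y, w, h) at frame k+1.
    """
    assignments = {}
    for tid, box_k in tracks_k.items():
        box_km1 = tracks_km1.get(tid, box_k)
        # Linear motion prediction: extrapolate the box one frame ahead and
        # inflate its size by dt * rho to form an "uncertainty box".
        dx, dy = box_k[0] - box_km1[0], box_k[1] - box_km1[1]
        w = box_k[2] * (1 + dt * rho)
        h = box_k[3] * (1 + dt * rho)
        ub = (box_k[0] + dx - (w - box_k[2]) / 2,
              box_k[1] + dy - (h - box_k[3]) / 2, w, h)
        # Gate detections by IoU with the uncertainty box, then score the
        # surviving candidates by center distance turned into a softmax.
        cands, dists = [], []
        for j, det in enumerate(dets_kp1):
            if iou(ub, det) > iou_thr:
                cx_u, cy_u = ub[0] + ub[2] / 2, ub[1] + ub[3] / 2
                cx_d, cy_d = det[0] + det[2] / 2, det[1] + det[3] / 2
                cands.append(j)
                dists.append(np.hypot(cx_u - cx_d, cy_u - cy_d))
        if not cands:
            assignments[tid] = None
            continue
        probs = np.exp(-np.array(dists))
        probs /= probs.sum()            # pseudo "conditional probability"
        assignments[tid] = cands[int(np.argmax(probs))]
    return assignments
```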
5. Caretaker Recognition and Re-Identification
5.1. Caretaker Recognition
5.2. Caretaker Re-Identification
6. Caretaker Gesture Recognition
7. Experimental Results
7.1. Experiment of Person/Body-Part Detection and Classification
7.2. Experiment of Person Tracking
7.3. Experiment of Caretaker Recognition and Re-Identification
7.4. Experiment of Gesture Recognition
7.5. Experiment of RFIS as an Integrated System
7.6. Discussion
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Lee, I. Service robots: A systematic literature review. Electronics 2021, 10, 2658. [Google Scholar] [CrossRef]
- Lee, S.; Lee, S.; Kim, S.; Kim, A. Robot-Facilitated Human–Robot Interaction with Integrated Tracking, Re-identification and Gesture Recognition. In Proceedings of the International Conference on Intelligent Autonomous Systems, Suwon, Republic of Korea, 4–7 July 2023; Springer International Publishing: Cham, Switzerland, 2023. [Google Scholar]
- Sanjeewa, E.D.G.; Herath, K.K.L.; Madhusanka, B.G.D.A.; Priyankara, H.D.N.S. Visual attention model for mobile robot navigation in domestic environment. GSJ 2020, 8, 1960–1965. [Google Scholar]
- Zhao, X.; Naguib, A.M.; Lee, S. Kinect based calling gesture recognition for taking order service of elderly care robot. In Proceedings of the 23rd IEEE International Symposium on Robot and Human Interactive Communication, Edinburgh, UK, 25–29 August 2014; IEEE: Piscataway, NJ, USA, 2014. [Google Scholar]
- Liu, C.; Szirányi, T. Real-time human detection and gesture recognition for on-board UAV rescue. Sensors 2021, 21, 2180. [Google Scholar] [CrossRef] [PubMed]
- Rollo, F.; Zunino, A.; Raiola, G.; Amadio, F.; Ajoudani, A.; Tsagarakis, N. Followme: A robust person following framework based on visual re-identification and gestures. In Proceedings of the 2023 IEEE International Conference on Advanced Robotics and Its Social Impacts (ARSO), Berlin, Germany, 5–7 June 2023; IEEE: Piscataway, NJ, USA, 2023. [Google Scholar]
- Müller, S.; Wengefeld, T.; Trinh, T.Q.; Aganian, D.; Eisenbach, M.; Gross, H.-M. A multi-modal person perception framework for socially interactive mobile service robots. Sensors 2020, 20, 722. [Google Scholar] [CrossRef] [PubMed]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
- Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
- Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Qiao, S.; Chen, L.C.; Yuille, A. Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021. [Google Scholar]
- Lee, T.-H.; Kim, K.-J.; Yun, K.-S.; Kim, K.-J.; Choi, D.-H. A method of Counting Vehicle and Pedestrian using Deep Learning based on CCTV. J. Korean Inst. Intell. Syst. 2018, 28, 219–224. [Google Scholar] [CrossRef]
- Mukhtar, A.; Cree, M.J.; Scott, J.B.; Streeter, L. Mobility aids detection using convolution neural network (cnn). In Proceedings of the 2018 International Conference on Image and Vision Computing New Zealand (IVCNZ), Auckland, New Zealand, 19–21 November 2018; IEEE: Piscataway, NJ, USA, 2018. [Google Scholar]
- Fernando, T.; Denman, S.; Sridharan, S.; Fookes, C. Tracking by prediction: A deep generative model for multi-person localisation and tracking. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: Piscataway, NJ, USA, 2018. [Google Scholar]
- Choi, W. Near-online multi-target tracking with aggregated local flow descriptor. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Manzoor, S.; Kim, E.-J.; Bae, S.-H.; Kuc, T.-Y. Edge Deployment of Vision-Based Model for Human Following Robot. In Proceedings of the 2023 23rd International Conference on Control, Automation and Systems (ICCAS), Yeosu, Republic of Korea, 17–20 October 2023; IEEE: Piscataway, NJ, USA, 2023. [Google Scholar]
- Jader, G.; Fontineli, J.; Ruiz, M.; Abdalla, K.; Pithon, M.; Oliveira, L. Deep face recognition: A survey. In Proceedings of the 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Parana, Brazil, 29 October–1 November 2018; IEEE: Piscataway, NJ, USA, 2018. [Google Scholar]
- Sohail, M.; Shoukat, I.A.; Khan, A.U.; Fatima, H.; Jafri, M.R.; Yaqub, M.A.; Liotta, A. Deep Learning Based Multi Pose Human Face Matching System. IEEE Access 2024, 12, 26046–26061. [Google Scholar] [CrossRef]
- Condés, I.; Fernández-Conde, J.; Perdices, E.; Cañas, J.M. Robust Person Identification and Following in a Mobile Robot Based on Deep Learning and Optical Tracking. Electronics 2023, 12, 4424. [Google Scholar] [CrossRef]
- Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.C. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2872–2893. [Google Scholar] [CrossRef]
- Wei, W.; Yang, W.; Zuo, E.; Qian, Y.; Wang, L. Person re-identification based on deep learning—An overview. J. Vis. Commun. Image Represent. 2022, 88, 103418. [Google Scholar] [CrossRef]
- Wang, G.; Lai, J.; Huang, P.; Xie, X. Spatial-temporal person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33. [Google Scholar]
- Rollo, F.; Zunino, A.; Tsagarakis, N.; Hoffman, E.M.; Ajoudani, A. Carpe-id: Continuously adaptable re-identification for personalized robot assistance. arXiv 2023, arXiv:2310.19413. [Google Scholar]
- He, T.; Jin, X.; Shen, X.; Huang, J.; Chen, Z.; Hua, X.S. Dense interaction learning for video-based person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021. [Google Scholar]
- Narayana, P.; Ross, B.; Draper, B.A. Gesture recognition: Focus on the hands. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Zhang, L.; Zhu, G.; Shen, P.; Song, J.; Shah, S.A.; Bennamoun, M. Learning spatiotemporal features using 3dcnn and convolutional lstm for gesture recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017. [Google Scholar]
- Al-Hammadi, M.; Muhammad, G.; Abdul, W.; Alsulaiman, M.; Bencherif, M.A.; Mekhtiche, M.A. Hand gesture recognition for sign language using 3DCNN. IEEE Access 2020, 8, 79491–79509. [Google Scholar] [CrossRef]
- Dadashzadeh, A.; Targhi, A.T.; Tahmasbi, M.; Mirmehdi, M. HGR-Net: A fusion network for hand gesture segmentation and recognition. IET Comput. Vis. 2019, 13, 700–707. [Google Scholar] [CrossRef]
- Yu, J.; Qin, M.; Zhou, S. Dynamic gesture recognition based on 2D convolutional neural network and feature fusion. Sci. Rep. 2022, 12, 4345. [Google Scholar] [CrossRef] [PubMed]
- Zhu, J.; Zou, W.; Xu, L.; Hu, Y.; Zhu, Z.; Chang, M.; Huang, J.; Huang, G.; Du, D. Action machine: Rethinking action recognition in trimmed videos. arXiv 2018, arXiv:1812.05770. [Google Scholar]
- Chen, C.; Jafari, R.; Kehtarnavaz, N. UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; IEEE: Piscataway, NJ, USA, 2015. [Google Scholar]
- Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; PMLR: London, UK, 2019. [Google Scholar]
- Nepal, U.; Eslamiat, H. Comparing YOLOv3, YOLOv4 and YOLOv5 for autonomous landing spot detection in faulty UAVs. Sensors 2022, 22, 464. [Google Scholar] [CrossRef] [PubMed]
- Stanojevic, V.D.; Todorovic, B.T. BoostTrack: Boosting the similarity measure and detection confidence for improved multiple object tracking. Mach. Vis. Appl. 2024, 35, 53. [Google Scholar] [CrossRef]
- Hore, A.; Ziou, D. Image quality metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; IEEE: Piscataway, NJ, USA, 2010. [Google Scholar]
- Milan, A.; Leal-Taixe, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer International Publishing: Berlin/Heidelberg, Germany, 2014. [Google Scholar]
- Kasturi, R.; Goldgof, D.; Soundararajan, P.; Manohar, V.; Garofolo, J.; Bowers, R.; Boonstra, M.; Korzhova, V.; Zhang, J. Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 319–336. [Google Scholar] [CrossRef]
- Wong, Y.; Chen, S.; Mau, S.; Sanderson, C.; Lovell, B.C. Patch-based probabilistic image quality assessment for face selection and improved video-based face recognition. In Proceedings of the CVPR 2011 Workshops, Colorado Springs, CO, USA, 20–25 June 2011; IEEE: Piscataway, NJ, USA, 2011. [Google Scholar]
- Dodd, L.E.; Pepe, M.S. Partial AUC estimation and regression. Biometrics 2003, 59, 614–623. [Google Scholar] [CrossRef] [PubMed]
- Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision Workshops, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Cham, Switzerland, 2016. [Google Scholar]
- Zheng, L.; Bie, Z.; Sun, Y.; Wang, J.; Su, C.; Wang, S.; Tian, Q. Mars: A video benchmark for large-scale person re-identification. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VI 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
- Hyun, J.; Kang, M.; Wee, D.; Yeung, D.Y. Detection recovery in online multi-object tracking with sparse graph tracker. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023. [Google Scholar]
| Task | IoU Threshold | MODA | MODP |
|---|---|---|---|
| Person Detection | 0.6 | 94.7% | 94.6% |
| Body-part Detection | 0.6 | 96.63% | 97.34% |
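MODA and MODP follow the detection metrics of Kasturi et al. [94]: MODA penalizes missed detections and false positives relative to the number of ground-truth objects, while MODP averages the spatial overlap of matched detection/ground-truth pairs. A simplified sketch, assuming unit cost weights and a single aggregated count rather than the exact per-frame formulation:

```python
def moda(missed, false_positives, num_ground_truth):
    """Multiple Object Detection Accuracy: 1 - (misses + false positives) / #ground truth."""
    return 1.0 - (missed + false_positives) / num_ground_truth

def modp(matched_overlap_ratios):
    """Multiple Object Detection Precision: mean IoU over matched detection/ground-truth pairs."""
    if not matched_overlap_ratios:
        return 0.0
    return sum(matched_overlap_ratios) / len(matched_overlap_ratios)

# Illustrative example: 1000 ground-truth persons, 30 misses, 23 false positives -> MODA = 0.947
print(moda(30, 23, 1000))
```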
| Tracking | MOTA |
|---|---|
| Person Tracking | 74.96% |
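MOTA, the tracking metric reported above, aggregates the three error sources of multi-object tracking (misses, false positives, and identity switches) relative to the total number of ground-truth objects. A sketch of the standard formula:

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multiple Object Tracking Accuracy = 1 - (FN + FP + IDSW) / GT."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_ground_truth
```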
| Dataset | p-Value | pAUC |
|---|---|---|
| ChokePoint Dataset | 1.0 | 1.0 |
| ChokePoint Dataset | 0.5 | 1.0 |
| ChokePoint Dataset | 0.1 | 1.0 |
| Customized Dataset (Single Caretaker) | 1.0 | 0.9965 |
| Customized Dataset (Single Caretaker) | 0.5 | 0.9955 |
| Customized Dataset (Single Caretaker) | 0.1 | 0.9835 |
| Customized Dataset (Multiple Caretakers) | 1.0 | 0.9958 |
| Customized Dataset (Multiple Caretakers) | 0.5 | 0.9948 |
| Customized Dataset (Multiple Caretakers) | 0.1 | 0.9829 |
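The partial AUC (pAUC) values above summarize caretaker-recognition ROC performance over a restricted false-positive range, following Dodd and Pepe [96]. Assuming the p-Value column denotes the upper false-positive-rate bound of the integration range, a hedged sketch using scikit-learn (note that roc_auc_score with max_fpr returns the McClish-standardized partial AUC, which may differ from the paper's exact definition; the scores below are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# y_true = 1 for the caretaker, 0 for non-caretakers; y_score = recognition confidence.
y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.97, 0.91, 0.88, 0.79, 0.41, 0.33, 0.28, 0.15, 0.09, 0.05])

for max_fpr in (1.0, 0.5, 0.1):
    print(max_fpr, roc_auc_score(y_true, y_score, max_fpr=max_fpr))
```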
| T − n Frame | Measure | Person A (Caretaker) at Frame T | Person B (Non-Caretaker) at Frame T |
|---|---|---|---|
| Person A (Caretaker) | Face | 0.66830 | 0.05613 |
| Person A (Caretaker) | Upper-Body | 0.41907 | 0.13146 |
| Person A (Caretaker) | Conditional Probability | 0.5283 | 0.0286 |
| Person B (Non-Caretaker) | Face | 0.08085 | 0.73129 |
| Person B (Non-Caretaker) | Upper-Body | 0.11232 | 0.69456 |
| Person B (Non-Caretaker) | Conditional Probability | 0.0350 | 0.6701 |
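The table above compares SSIM scores of face and upper-body crops between a newly appeared person at frame T and the identities stored at frame T − n, and converts them into matching probabilities. A minimal sketch of that comparison using scikit-image; the equal face/upper-body weighting and the sum-normalization into a pseudo-probability are illustrative assumptions, not the paper's conditional-probability formulation (grayscale crops scaled to [0, 1] are assumed):

```python
from skimage.metrics import structural_similarity as ssim
from skimage.transform import resize

def reid_scores(query_face, query_body, gallery):
    """Score a newly appeared person (query) against stored identity crops.

    query_face, query_body: grayscale crops of the new person at frame T.
    gallery: dict person_id -> (face_crop, body_crop) stored at frame T - n.
    Returns a dict person_id -> normalized matching score.
    """
    raw = {}
    for pid, (g_face, g_body) in gallery.items():
        # Resize to a common shape so SSIM is well defined.
        f = resize(query_face, g_face.shape)
        b = resize(query_body, g_body.shape)
        s_face = ssim(f, g_face, data_range=1.0)
        s_body = ssim(b, g_body, data_range=1.0)
        raw[pid] = 0.5 * (s_face + s_body)   # equal weighting is an assumption
    total = sum(max(v, 0.0) for v in raw.values()) or 1.0
    return {pid: max(v, 0.0) / total for pid, v in raw.items()}
```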
| Prediction \ GT | Caretaker | Non-Caretaker | Recall | Precision | Accuracy |
|---|---|---|---|---|---|
| Caretaker | 11 | 1 | 1.0 | 0.9167 | 0.9643 |
| Non-Caretaker | 0 | 16 | 0.9412 | 1.0 | |
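For reference, the recall, precision, and accuracy figures in the confusion matrix above can be reproduced directly from the four counts:

```python
# Confusion matrix counts: rows = prediction, columns = ground truth.
tp, fp = 11, 1     # predicted caretaker: 11 true caretakers, 1 non-caretaker
fn, tn = 0, 16     # predicted non-caretaker: 0 caretakers, 16 non-caretakers

recall    = tp / (tp + fn)                    # 1.0
precision = tp / (tp + fp)                    # 0.9167
accuracy  = (tp + tn) / (tp + fp + fn + tn)   # 0.9643

# Non-caretaker class, treated as the "positive" class instead:
nc_recall    = tn / (tn + fp)                 # 0.9412
nc_precision = tn / (tn + fn)                 # 1.0
```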
| Gesture Type | Accuracy, Case 1 (Single) | Accuracy, Case 2 (Multiple) |
|---|---|---|
| “Here I am” | 100% | 100% |
| “Stay there” | 94.8% | 94.1% |
| “Come here” | 93.3% | 85.3% |
| “Follow me” | 95.9% | 98.8% |
| “No gesture” | 92.8% | 84.2% |

Average accuracy over all gestures and both cases: 93.9%.
| Metric | 1 C + 0 N/C | 1 C + 2 N/C | 1 C + 2 N/C + ReID | 1 C + 3 N/C + ReID | 2 C + 3 N/C + ReID |
|---|---|---|---|---|---|
| Caretaker Recognition (Accuracy) | 100% | 100% | 100% | 100% | 100% |
| Caretaker Tracking (MOTA) | 100% | 100% | 87.5% (1 failure) | 100% | 100% |
| Caretaker Re-identification (Accuracy) | N/A | N/A | 100% | 100% | 93.75% (1 failure) |
| Caretaker Gesture Recognition (Accuracy), “Here I am” | 95.8% | 100% | 95.8% | 98.7% | 100% |
| Caretaker Gesture Recognition (Accuracy), “Stay there” | 95.6% | 100% | 89.2% | 91.8% | 90.6% |
| Caretaker Gesture Recognition (Accuracy), “Come here” | 89.2% | 86.3% | 80.0% | 92.2% | 85.6% |
| Caretaker Gesture Recognition (Accuracy), “Follow me” | 96.1% | 97.1% | 100% | 94.6% | 95.3% |
| Caretaker Gesture Recognition (Accuracy), None | 97.4% | 100% | 88.1% | 92.9% | 86.4% |

Note: C denotes a caretaker, N/C a non-caretaker, and ReID a scenario with re-identification.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Lee, S.; Lee, S.; Park, H. Integration of Tracking, Re-Identification, and Gesture Recognition for Facilitating Human–Robot Interaction. Sensors 2024, 24, 4850. https://doi.org/10.3390/s24154850