Abstract
The task of recognizing human behavior in a collaborative robotic system is crucial for organizing seamless and productive collaboration. We design a vision system for an industrial scenario of riveting a metal plate and concentrate on the task of recognizing in-hand mechanical tools. However, hand-object interaction causes severe occlusion. Incorporating attention modules into the backbone is a common way to handle occlusion and to enhance the extraction of features with contextual information. Accordingly, we propose three modified occlusion-aware models based on YOLOv5 for in-hand mechanical tool recognition: adding SimAM to each bottleneck in the backbone, inserting a Criss-Cross attention layer between the last C3 block and the SPPF block of the backbone, and replacing the last C3 block of the backbone with a Criss-Cross attention layer. We create a dataset specifically for the task of in-hand mechanical tool recognition and validate the three modified models after training them separately; the results demonstrate the effectiveness of the SimAM module and the ineffectiveness of the Criss-Cross attention module. Real-time detection remains imperfect when hands occlude the tool from various directions.
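For context on the first modification, SimAM is a parameter-free attention module that reweights each activation by an energy-based saliency term. The PyTorch sketch below follows the published SimAM formulation (Yang et al., ICML 2021); the class name and the idea of applying it at the end of a YOLOv5 bottleneck are illustrative assumptions, not the authors' exact code.

```python
import torch


class SimAM(torch.nn.Module):
    """Parameter-free attention (Yang et al., ICML 2021): each neuron is
    weighted by an inverse-energy score and gated with a sigmoid."""

    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda  # regularizer from the SimAM paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w - 1
        # Squared deviation of each activation from its channel mean.
        d = (x - x.mean(dim=[2, 3], keepdim=True)).pow(2)
        # Channel-wise variance estimate over spatial positions.
        v = d.sum(dim=[2, 3], keepdim=True) / n
        # Inverse energy: lower energy marks a more distinctive neuron.
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        return x * torch.sigmoid(e_inv)


# Hypothetical usage on a YOLOv5-style bottleneck output:
feat = torch.randn(1, 64, 32, 32)  # (batch, channels, H, W)
out = SimAM()(feat)                # same shape, attention-reweighted
```

Because SimAM adds no learnable parameters, inserting it into every backbone bottleneck leaves the model's parameter count unchanged, which is one plausible reason it transfers well to the occlusion-heavy setting described above.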
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wu, G., Shen, X., Serebrenny, V. (2023). Attention Guided In-hand Mechanical Tools Recognition in Human-Robot Collaborative Process. In: Ronzhin, A., Sadigov, A., Meshcheryakov, R. (eds) Interactive Collaborative Robotics. ICR 2023. Lecture Notes in Computer Science, vol. 14214. Springer, Cham. https://doi.org/10.1007/978-3-031-43111-1_1
DOI: https://doi.org/10.1007/978-3-031-43111-1_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43110-4
Online ISBN: 978-3-031-43111-1
eBook Packages: Computer Science, Computer Science (R0)