Garment Recognition and Reconstruction Using Object Simultaneous Localization and Mapping
Figure 1. Workflow of the integrated SLAM. The SLAM backbone detects feature points and reconstructs a 3D map of sparse points. The object reconstruction thread detects garments and desks and reconstructs them in the map using DeepSDF [24]. The plane detection thread detects and reconstructs planes using CAPE [21] and matches them to the map.
Figure 2. Examples of the 6 garment states.
Figure 3. Example images from the different dataset versions.
Figure 4. Comparison of a potential false association case with the correct result of the global association method. When two objects, A and B, are in close proximity, B’s true match might be misidentified as A’s closest candidate by the original method, leading to its removal from B’s candidate list. A global association method, however, can associate the objects correctly.
Figure 5. Illustration of graph optimization. Orange vertices represent camera poses, purple vertices represent feature points, green vertices represent object poses, and blue vertices represent planes. The edges connecting these vertices represent the error terms to be optimized.
Figure 6. Charts of Average Precision (AP) and Average Recall (AR) evaluated on different test sets for the four models.
Figure 7. Variations in the models’ performance when trained on different datasets, tested on v1.
Figure 8. Variations in the models’ performance when trained on different datasets, tested on v2c.
Figure 9. Variations in the models’ performance when trained on different datasets, tested on v3m.
Figure 10. Variations in the models’ performance when trained on different datasets, tested on v4sd.
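Figure 4 above summarizes the data-association issue: a greedy nearest-candidate match can let object A consume object B’s true match. As a rough, hedged illustration only (not the authors’ exact implementation; the cost matrix and gating threshold are assumptions), the sketch below contrasts greedy matching with a global assignment solved by the Hungarian algorithm via scipy.optimize.linear_sum_assignment:

```python
# Hypothetical illustration of greedy vs. global association between map
# objects (rows) and newly detected candidates (columns); the costs are
# assumed to be, e.g., centroid distances.
import numpy as np
from scipy.optimize import linear_sum_assignment

def greedy_associate(cost):
    """Give each map object its nearest unused candidate, in row order."""
    matches, used = {}, set()
    for obj in range(cost.shape[0]):
        for cand in np.argsort(cost[obj]):
            if int(cand) not in used:
                matches[obj] = int(cand)
                used.add(int(cand))
                break
    return matches

def global_associate(cost, gate=0.5):
    """Minimise the total association cost jointly (Hungarian algorithm)."""
    rows, cols = linear_sum_assignment(cost)
    return {int(r): int(c) for r, c in zip(rows, cols) if cost[r, c] < gate}

cost = np.array([
    [0.25, 0.20],   # object A vs. candidates 1 and 2 (A's true match is 1)
    [0.90, 0.30],   # object B vs. candidates 1 and 2 (B's true match is 2)
])
print(greedy_associate(cost))   # {0: 1, 1: 0}: A steals candidate 2, B is mismatched
print(global_associate(cost))   # {0: 0, 1: 1}: jointly optimal, correct assignment
```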
Abstract
1. Introduction
- An object-oriented SLAM that incorporates plane detection is proposed, which includes planes in the joint optimization process and reconstructs them to facilitate understanding of the environment (a generic point-to-plane residual is sketched after this list).
- The proposed SLAM-based method is tailored specifically to address the challenges of garment recognition and reconstruction, accommodating the highly deformable and irregular nature of garments.
- Six common garment states observed in industrial processes are identified and defined, providing a semantic basis for understanding and processing garments in various conditions. Segmentation and mesh reconstruction models are trained using datasets collected and annotated based on the defined garment states.
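As a minimal sketch of how a plane landmark can contribute to a joint optimization (a generic point-to-plane residual in Hesse normal form; an assumption for illustration, not necessarily the exact error term used in this work), each plane (n, d) with ||n|| = 1 adds residuals n·p_i + d for its associated map points p_i, which can be stacked with the usual reprojection errors:

```python
# Generic point-to-plane residual sketch (assumption, not the paper's exact
# formulation): fit a plane (n, d), ||n|| = 1, to associated map points by
# least squares; in a full SLAM back end these residuals would be stacked
# with reprojection and object-pose error terms.
import numpy as np
from scipy.optimize import least_squares

points = np.array([[0.0, 0.0, 1.00],
                   [1.0, 0.0, 1.02],
                   [0.0, 1.0, 0.98],
                   [1.0, 1.0, 1.01]])   # noisy samples of the plane z = 1

def residuals(params, pts):
    n, d = params[:3], params[3]
    n = n / np.linalg.norm(n)        # keep the normal at unit length
    return pts @ n + d               # signed point-to-plane distances

x0 = np.array([0.1, 0.1, 1.0, 0.0])  # rough initial guess
sol = least_squares(residuals, x0, args=(points,))
n = sol.x[:3] / np.linalg.norm(sol.x[:3])
print("normal:", n.round(3), "offset:", round(sol.x[3], 3))  # ~ (0, 0, 1), ~ -1.0
```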
2. Related Works
3. Materials and Methods
3.1. System Overview
3.2. Garment Detection
- Flat: This state denotes a t-shirt laid out largely flat on a surface, maintaining its recognizable shape. It may exhibit minor folds, particularly at the corners of the cuff or hem and sleeves.
- Strip: In this state, the t-shirt is folded vertically but spread horizontally, assuming a strip-like shape. This can occur either by casually picking up the garment around the collar or both shoulders and then placing and dragging it on the table, or by carefully folding it from the flat state.
- Stack: This state occurs when the t-shirt is picked up randomly at one or two points and then casually dropped on the table.
- 2-fold (or fold-2 as referred to later): Here, the t-shirt is folded once horizontally from the flat state, with the positions of the sleeves being random.
- 4-fold (or fold-4 as referred to later): In this state, the t-shirt is folded into a square-like shape. This can be achieved either by folding one more time vertically from the 2-fold state, or by folding in the style of a “dress shirt fold” or “military fold”.
- Strip-fold (or fold-s as referred to later): Here, the t-shirt is folded once more horizontally from the strip state. This mimics the method typically employed when one wishes to fold a t-shirt swiftly and casually.
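For concreteness, the six states above translate directly into detection class labels; the identifiers below are illustrative only and may not match the names used in the released annotations (the operating-table class added in dataset v4 is included as a comment):

```python
# Illustrative label mapping for the defined garment states; the actual class
# names/IDs in the annotated dataset are an assumption of this sketch.
from enum import IntEnum

class GarmentState(IntEnum):
    FLAT = 0     # laid out largely flat, minor folds allowed
    STRIP = 1    # folded vertically, spread horizontally into a strip
    STACK = 2    # picked up at one or two points and dropped casually
    FOLD_2 = 3   # folded once horizontally from the flat state
    FOLD_4 = 4   # folded into a square-like shape
    FOLD_S = 5   # folded once more horizontally from the strip state
    # DESK = 6   # operating table, added as the 7th class in dataset v4

print([s.name.lower() for s in GarmentState])
```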
- V1 comprises 200 RGB images of a single t-shirt exhibiting painted sections, inclusive of both long and short-sleeved variants. The camera is strategically positioned above the table, capturing a bird’s-eye view of the garment and the table beneath it.
- V2 adds colored and stripe-patterned t-shirts. The test set featuring these new images is denoted v2c in subsequent references.
- V3 features multiple t-shirts (ranging from 2 to 6) within the same frame. The general test set featuring v3’s new data is identified as v3m. These images can be further divided into sets based on the interaction among the t-shirts: (1) v3ma: several t-shirts of identical or differing colors placed separately from each other. (2) v3mcd: several t-shirts of varying colors placed adjacent to each other. (3) v3mcs: several t-shirts of the same color positioned right next to each other, with some arranged intentionally to create confusion. The v3mcs type is used only for testing and is not included in the training set.
- V4’s test set is referred to as v4sd. Two elements are changed or added: (1) Camera view: the camera is no longer static above the table. Instead, the images are taken from videos in which the camera’s movements mimic a mobile robot focusing on the operating table with the garments placed on it. (2) A new object class: the operating table is introduced as the 7th class in the detection.
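The four dataset versions can be summarized in a small configuration structure consistent with the descriptions above and the size table later in this section; the field names are assumptions of this sketch:

```python
# Illustrative summary of the dataset versions; image counts follow the
# dataset size table in the paper, field names are assumptions.
DATASET_VERSIONS = {
    "v1": {"top_view_only": True,  "multi_color": False, "multi_tshirt": False,
           "desk_class": False, "train": 180, "val": 10, "test": 10},
    "v2": {"top_view_only": True,  "multi_color": True,  "multi_tshirt": False,
           "desk_class": False, "train": 419, "val": 23, "test": 23},
    "v3": {"top_view_only": True,  "multi_color": True,  "multi_tshirt": True,
           "desk_class": False, "train": 614, "val": 34, "test": 34},
    "v4": {"top_view_only": False, "multi_color": True,  "multi_tshirt": True,
           "desk_class": True,  "train": 774, "val": 43, "test": 43},
}

for name, cfg in DATASET_VERSIONS.items():
    print(name, "total images:", cfg["train"] + cfg["val"] + cfg["test"])
    # -> 200, 465, 682, 860
```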
3.3. Garment Reconstruction
3.4. Plane Detection and Reconstruction
3.5. SLAM Integration
3.5.1. Object Reconstruction and Association
3.5.2. Plane Matching
3.5.3. Joint Optimization
4. Results
4.1. 2D Garment Recognition
- Single t-shirt. For the detection of a single clothing item, the models are ranked based on mask precision as follows: SOLOv2 >= SOLOv2-light > YOLACT > Mask-RCNN. The primary distinction in this scenario is the accuracy of mask edges.
- Multiple t-shirts. For detecting multiple clothing items, the ranking shifts to Mask-RCNN > YOLACT >> SOLOv2-light >= SOLOv2, with the main distinction being the models’ ability to differentiate between individual pieces of clothing. Notably, even though SOLOv2 and SOLOv2-light yield the most accurate masks among the four models, in most cases they struggle to distinguish different t-shirt pieces. Their mask accuracy may nevertheless explain why they rank comparatively high when only mask precision on the v3m test set is considered, where the ranking is Mask-RCNN > SOLOv2 > SOLOv2-light > YOLACT.
- Shifting view angle. In the context of recognizing objects from various angles and identifying desks, the ranking shifts to SOLOv2-light >= SOLOv2 > YOLACT > Mask-RCNN.
- Inference speed. Finally, in terms of inference speed, SOLOv2-light and YOLACT are the fastest, trailed by Mask-RCNN and SOLOv2.
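The AP/AR numbers in the tables that follow use COCO-style instance-segmentation metrics [32]. A minimal, hedged evaluation sketch with pycocotools (the file names are placeholders, and setting params.maxDets to 100/300/1000 is one way to reproduce the AR settings reported here) is:

```python
# Minimal COCO-style mask AP/AR evaluation sketch using pycocotools.
# "annotations.json" and "detections.json" are placeholder file names.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations.json")            # ground-truth instance masks
coco_dt = coco_gt.loadRes("detections.json")  # predictions in COCO result format

evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
evaluator.params.maxDets = [100, 300, 1000]   # assumed to match the reported AR settings
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()    # prints AP [IoU=0.50:0.95] and AR at the chosen maxDets
```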
4.2. Garment Mesh Reconstruction
4.3. SLAM Integration
5. Discussion
6. Conclusions
- Instance segmentation models trained on our dataset achieved strong performance, with an Average Precision of 0.882 and an Average Recall of 0.913, demonstrating their effectiveness in recognizing garment states and shapes.
- The DeepSDF-based garment reconstruction achieved an average Chamfer distance of 0.000998, successfully reconstructing garments but showing limitations with flatter or more complex shapes.
- The integrated SLAM system achieved an absolute trajectory error of 0.0202 in sequences with a small number of point features, highlighting its feasibility in industrial environments with plane reconstruction.
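For reference, the absolute trajectory error reported above is conventionally computed in the spirit of the TUM RGB-D benchmark [35]: the estimated trajectory is rigidly aligned to the ground truth, and the RMSE of the residual translations is taken. A hedged numpy sketch (the alignment and error definition are the standard ones, not necessarily the exact evaluation script used here):

```python
# Hedged ATE sketch: Horn/Kabsch alignment of estimated positions to ground
# truth, followed by the translational RMSE.
import numpy as np

def ate_rmse(gt, est):
    """gt, est: (N, 3) arrays of time-associated camera positions."""
    gt_c, est_c = gt - gt.mean(0), est - est.mean(0)
    U, _, Vt = np.linalg.svd(est_c.T @ gt_c)            # best-fit rotation (no scale)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    aligned = est_c @ (U @ S @ Vt) + gt.mean(0)
    return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))

# sanity check: a purely translated copy of the ground truth gives ~0 error
gt = np.random.rand(100, 3)
print(ate_rmse(gt, gt + np.array([0.3, -0.1, 0.2])))    # ~0.0
```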
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Foresti, G.L.; Pellegrino, F.A. Automatic visual recognition of deformable objects for grasping and manipulation. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2004, 34, 325–333. [Google Scholar] [CrossRef]
- Wang, X.; Zhao, J.; Jiang, X.; Liu, Y.H. Learning-based fabric folding and box wrapping. IEEE Robot. Autom. Lett. 2022, 7, 5703–5710. [Google Scholar] [CrossRef]
- Willimon, B.; Birchfield, S.; Walker, I. Classification of clothing using interactive perception. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 1862–1868. [Google Scholar]
- Doumanoglou, A.; Stria, J.; Peleka, G.; Mariolis, I.; Petrik, V.; Kargakos, A.; Wagner, L.; Hlaváč, V.; Kim, T.K.; Malassiotis, S. Folding clothes autonomously: A complete pipeline. IEEE Trans. Robot. 2016, 32, 1461–1478. [Google Scholar] [CrossRef]
- Avigal, Y.; Berscheid, L.; Asfour, T.; Kröger, T.; Goldberg, K. Speedfolding: Learning efficient bimanual folding of garments. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 1–8. [Google Scholar]
- He, C.; Meng, L.; Wang, J.; Meng, M.Q.H. FabricFolding: Learning Efficient Fabric Folding without Expert Demonstrations. arXiv 2023, arXiv:2303.06587. [Google Scholar] [CrossRef]
- Wu, R.; Lu, H.; Wang, Y.; Wang, Y.; Dong, H. UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16340–16350. [Google Scholar]
- Huang, Z.; Lin, X.; Held, D. Mesh-based dynamics with occlusion reasoning for cloth manipulation. arXiv 2022, arXiv:2206.02881. [Google Scholar]
- Wang, W.; Li, G.; Zamora, M.; Coros, S. Trtm: Template-based reconstruction and target-oriented manipulation of crumpled cloths. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 12522–12528. [Google Scholar]
- Ye, L.; Wu, F.; Zou, X.; Li, J. Path planning for mobile robots in unstructured orchard environments: An improved kinematically constrained bi-directional RRT approach. Comput. Electron. Agric. 2023, 215, 108453. [Google Scholar] [CrossRef]
- Hu, K.; Chen, Z.; Kang, H.; Tang, Y. 3D vision technologies for a self-developed structural external crack damage recognition robot. Autom. Constr. 2024, 159, 105262. [Google Scholar] [CrossRef]
- Tang, Y.; Qi, S.; Zhu, L.; Zhuo, X.; Zhang, Y.; Meng, F. Obstacle avoidance motion in mobile robotics. J. Syst. Simul. 2024, 36, 1–26. [Google Scholar]
- Cheng, J.; Zhang, L.; Chen, Q.; Hu, X.; Cai, J. A review of visual SLAM methods for autonomous driving vehicles. Eng. Appl. Artif. Intell. 2022, 114, 104992. [Google Scholar] [CrossRef]
- Ren, Z.; Wang, L.; Bi, L. Robust GICP-based 3D LiDAR SLAM for underground mining environment. Sensors 2019, 19, 2915. [Google Scholar] [CrossRef] [PubMed]
- Li, J.; Yang, B.; Chen, D.; Wang, N.; Zhang, G.; Bao, H. Survey and evaluation of monocular visual-inertial SLAM algorithms for augmented reality. Virtual Real. Intell. Hardw. 2019, 1, 386–410. [Google Scholar]
- Yang, S.; Scherer, S. Monocular object and plane slam in structured environments. IEEE Robot. Autom. Lett. 2019, 4, 3145–3152. [Google Scholar] [CrossRef]
- Wu, Y.; Zhang, Y.; Zhu, D.; Deng, Z.; Sun, W.; Chen, X.; Zhang, J. An object slam framework for association, mapping, and high-level tasks. IEEE Trans. Robot. 2023, 39, 2912–2932. [Google Scholar] [CrossRef]
- Mur-Artal, R.; Tardós, J.D. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
- Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
- Li, Y.; Yunus, R.; Brasch, N.; Navab, N.; Tombari, F. RGB-D SLAM with structural regularities. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11581–11587. [Google Scholar]
- Proença, P.F.; Gao, Y. Fast cylinder and plane extraction from depth cameras for visual odometry. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 6813–6820. [Google Scholar]
- Wang, J.; Rünz, M.; Agapito, L. DSP-SLAM: Object oriented SLAM with deep shape priors. In Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; pp. 1362–1371. [Google Scholar]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Park, J.J.; Florence, P.; Straub, J.; Newcombe, R.; Lovegrove, S. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 165–174. [Google Scholar]
- Torralba, A.; Russell, B.C.; Yuen, J. Labelme: Online image annotation and applications. Proc. IEEE 2010, 98, 1467–1484. [Google Scholar] [CrossRef]
- Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. Yolact: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9157–9166. [Google Scholar]
- Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. Solov2: Dynamic and fast instance segmentation. Adv. Neural Inf. Process. Syst. 2020, 33, 17721–17732. [Google Scholar]
- Community, B.O. Blender—A 3D Modelling and Rendering Package; Blender Foundation, Stichting Blender Foundation: Amsterdam, The Netherlands, 2018. [Google Scholar]
- Chang, A.X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. ShapeNet: An Information-Rich 3D Model Repository. arXiv 2015, arXiv:1512.03012. [Google Scholar]
- Hasson, Y.; Varol, G.; Tzionas, D.; Kalevatykh, I.; Black, M.J.; Laptev, I.; Schmid, C. Learning joint reconstruction of hands and manipulated objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11807–11816. [Google Scholar]
- Besl, P.J.; McKay, N.D. Method for registration of 3-D shapes. In Proceedings of the Sensor Fusion IV: Control Paradigms and Data Structures, Boston, MA, USA, 12–15 November 1991; SPIE: Bellingham, WA, USA, 1992; Volume 1611, pp. 586–606. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
- Choi, S.; Zhou, Q.Y.; Miller, S.; Koltun, V. A large dataset of object scans. arXiv 2016, arXiv:1602.02481. [Google Scholar]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
- Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 573–580. [Google Scholar]
Dataset versions: features and size (number of images).

| Version | Top View Only | Multiple Colors | Multiple T-Shirts | Including Desk Class | Train | Validation | Test | All |
|---|---|---|---|---|---|---|---|---|
| v1 | ∘ | | | | 180 | 10 | 10 | 200 |
| v2 | ∘ | ∘ | | | 419 | 23 | 23 | 465 |
| v3 | ∘ | ∘ | ∘ | | 614 | 34 | 34 | 682 |
| v4 | | ∘ | ∘ | ∘ | 774 | 43 | 43 | 860 |
Example images of the six garment states (flat, strip, stack, fold-2, fold-4, fold-s), each shown from a top view and a trimetric view.
AP [IoU 0.50:0.95] (maxDets = 100) on each evaluation set:

| Segmentation | val | test | v1 | v2c | v3m | v3m-a | v3m-cd | v3m-cs | v4sd |
|---|---|---|---|---|---|---|---|---|---|
| Mask-RCNN [23] | 0.814 | 0.883 | 0.745 | 0.846 | 0.892 | 0.859 | 0.836 | 0.410 | 0.846 |
| YOLACT [26] | 0.833 | 0.822 | 0.769 | 0.875 | 0.694 | 0.751 | 0.688 | 0.052 | 0.874 |
| SOLOv2 [27] | 0.867 | 0.882 | 0.800 | 0.917 | 0.797 | 0.741 | 0.780 | 0.084 | 0.918 |
| SOLOv2-light [27] | 0.840 | 0.882 | 0.780 | 0.909 | 0.749 | 0.698 | 0.747 | 0.064 | 0.921 |
AR (maxDets = 100, 300, 1000) on each evaluation set:

| Segmentation | val | test | v1 | v2c | v3m | v3m-a | v3m-cd | v3m-cs | v4sd |
|---|---|---|---|---|---|---|---|---|---|
| Mask-RCNN [23] | 0.872 | 0.908 | 0.752 | 0.892 | 0.899 | 0.869 | 0.877 | 0.511 | 0.864 |
| YOLACT [26] | 0.877 | 0.875 | 0.773 | 0.917 | 0.759 | 0.791 | 0.725 | 0.100 | 0.890 |
| SOLOv2 [27] | 0.910 | 0.912 | 0.800 | 0.917 | 0.828 | 0.764 | 0.855 | 0.152 | 0.940 |
| SOLOv2-light [27] | 0.883 | 0.913 | 0.780 | 0.908 | 0.807 | 0.789 | 0.818 | 0.109 | 0.934 |
Qualitative segmentation results on the v1, v2c, v3ma, and v3mcd test sets: input image, ground truth, and predictions from Mask R-CNN, YOLACT, SOLOv2, and SOLOv2-light.
Qualitative segmentation results on the v3mcs and v4sd test sets and on the sequence and camera videos: input image, ground truth, and predictions from Mask R-CNN, YOLACT, SOLOv2, and SOLOv2-light.
Inference speed of the four models on the sequence and camera test videos (higher is faster):

| | Mask-RCNN | YOLACT | SOLOv2 | SOLOv2-Light |
|---|---|---|---|---|
| sequence | 4.62 | 5.54 | 4.29 | 6.00 |
| camera | 2.67 | 4.40 | 1.63 | 3.41 |
Chamfer distance of the reconstructed meshes for each class:

| | Flat | Strip | Stack | Fold-2 | Fold-4 | Fold-s | Desk |
|---|---|---|---|---|---|---|---|
| Chamfer distance | 0.005761 | 0.000670 | 0.000051 | 0.000156 | 0.000100 | 0.000084 | 0.000166 |
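The Chamfer distances above compare points sampled from the reconstructed and ground-truth meshes. A minimal sketch using the common squared-nearest-neighbour convention (the exact variant and sampling density used for these numbers are assumptions of this sketch):

```python
# Minimal symmetric Chamfer distance between two sampled point sets.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(a, b):
    """a: (N, 3), b: (M, 3) points sampled from the two surfaces."""
    d_ab, _ = cKDTree(b).query(a)   # nearest neighbour in b for each point of a
    d_ba, _ = cKDTree(a).query(b)   # nearest neighbour in a for each point of b
    return np.mean(d_ab ** 2) + np.mean(d_ba ** 2)

# toy example: two random samplings of the unit square in the z = 0 plane
rng = np.random.default_rng(0)
a = np.c_[rng.random((2000, 2)), np.zeros(2000)]
b = np.c_[rng.random((2000, 2)), np.zeros(2000)]
print(chamfer_distance(a, b))       # small, shrinking as the sampling densifies
```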
Ground-truth and reconstructed meshes for the desk and the six garment states (flat, strip, stack, fold-2, fold-4, fold-s).
Detection and reconstruction results of the SLAM system with only the object thread, with only the plane thread, and with both threads integrated.
Processing speed of the individual modules and of the integrated SLAM configurations:

| Module / configuration | Speed |
|---|---|
| CAPE [21] | 300 |
| YOLACT | 4.40 |
| with only plane thread | 1.77 |
| with only object thread | 0.69 |
| with both integrated | 0.54 |
Share and Cite

Zhang, Y.; Hashimoto, K. Garment Recognition and Reconstruction Using Object Simultaneous Localization and Mapping. Sensors 2024, 24, 7622. https://doi.org/10.3390/s24237622