Abstract
Current 3D scene segmentation methods rely heavily on manually annotated 3D training datasets. Such manual annotations are labor-intensive and often lack fine-grained detail. Moreover, models trained on this data typically struggle to recognize object classes beyond the annotated training classes, i.e., they do not generalize well to unseen domains and require additional domain-specific annotations. In contrast, recent 2D foundation models demonstrate strong generalization and impressive zero-shot abilities, which inspires us to bring these characteristics from 2D models into 3D models. We therefore explore using image segmentation foundation models to automatically generate high-quality training labels for 3D segmentation models. The resulting model, Segment3D, generalizes significantly better than models trained on costly manual 3D labels, and new training data can easily be added to further boost segmentation performance.
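The core idea of generating 3D training labels from a 2D segmentation foundation model (e.g., SAM) can be illustrated with a minimal sketch: run the 2D model on posed RGB-D frames, unproject each frame's pixels into 3D, and transfer the 2D mask IDs onto the resulting points. The sketch below is our own illustration, not the authors' exact pipeline; the function name lift_masks_to_points and the assumption that per-frame boolean masks, a metric depth map, intrinsics K, and a camera-to-world pose T_wc are available are hypothetical.

import numpy as np

def lift_masks_to_points(masks, depth, K, T_wc):
    """Assign 2D mask IDs to the 3D points unprojected from one RGB-D frame.

    masks: list of HxW boolean arrays (e.g., from SAM's automatic mask generator)
    depth: HxW metric depth map; K: 3x3 intrinsics; T_wc: 4x4 camera-to-world pose.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel coordinates, each HxW
    valid = depth > 0  # skip pixels without a depth measurement

    # Unproject valid pixels to camera space via the pinhole model, then to world space.
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)])  # (4, N) homogeneous coordinates
    pts_world = (T_wc @ pts_cam)[:3].T              # (N, 3) world coordinates

    # Transfer each 2D mask to the 3D points it covers (0 = unlabeled).
    # Note: overlapping masks overwrite earlier ones in this simple sketch.
    labels = np.zeros(pts_world.shape[0], dtype=np.int64)
    for mask_id, mask in enumerate(masks, start=1):
        labels[mask[valid]] = mask_id
    return pts_world, labels

Repeating this over many frames and fusing the per-frame labels on the reconstructed point cloud (for instance, by majority vote across views) is one plausible way to obtain class-agnostic pseudo-masks for supervising a 3D segmentation model; the paper's actual label-generation procedure may differ in its details.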
Acknowledgement
Gao Huang is supported in part by the National Natural Science Foundation of China under Grants 62321005 and 42327901. Francis Engelmann is partially supported by an ETH AI Center postdoctoral research fellowship and an ETH Zurich Career Seed Award. Songyou Peng is supported by Innosuisse funding (100.567 IP-ICT), and Ayça Takmaz is supported by an Innosuisse grant (48727.1 IP-ICT).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Huang, R. et al. (2025). Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation Without Manual Labels. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15092. Springer, Cham. https://doi.org/10.1007/978-3-031-72754-2_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72753-5
Online ISBN: 978-3-031-72754-2