DOI: 10.1145/3581783.3612107

Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation

Published: 27 October 2023

Abstract

Traditional 3D segmentation methods can recognize only the fixed set of classes that appear in the training set, which limits their application in real-world scenarios due to a lack of generalization ability. Large-scale visual-language pre-trained models such as CLIP have demonstrated strong generalization in zero-shot 2D vision tasks, but they cannot be applied directly to 3D semantic segmentation. In this work, we focus on zero-shot point cloud semantic segmentation and propose a simple yet effective baseline that transfers the visual-linguistic knowledge embedded in CLIP to a point cloud encoder at both the feature and output levels. Alignment between the 2D and 3D encoders is conducted at both levels for effective knowledge transfer. Concretely, a Multi-granularity Cross-modal Feature Alignment (MCFA) module aligns 2D and 3D features from global semantic and local positional perspectives for feature-level alignment. At the output level, per-pixel pseudo labels of unseen classes are extracted with the pre-trained CLIP model and used as supervision, so that the 3D segmentation model mimics the behavior of the CLIP image encoder. Extensive experiments are conducted on two popular point cloud segmentation benchmarks. Our method significantly outperforms previous state-of-the-art methods under the zero-shot setting (+29.2% mIoU on SemanticKITTI and +31.8% mIoU on nuScenes), and further achieves promising results in the annotation-free point cloud semantic segmentation setting, showing its great potential for label-efficient learning.
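The two transfer mechanisms described above can be illustrated with a minimal sketch. This is not the paper's implementation; it is an illustrative NumPy toy in which `feature_alignment_loss` stands in for the feature-level alignment (a cosine-distance loss between paired 3D point features and CLIP 2D pixel features) and `clip_pseudo_labels` stands in for the output-level supervision (per-pixel pseudo labels taken as the class whose CLIP text embedding is most similar). All function names and shapes here are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def feature_alignment_loss(point_feats, pixel_feats):
    """Feature-level alignment (illustrative): cosine-distance loss pulling
    each projected 3D point feature toward its paired CLIP 2D pixel feature.
    Both inputs are (N, D) arrays of paired point/pixel features."""
    p3 = l2_normalize(point_feats)
    p2 = l2_normalize(pixel_feats)
    return float(np.mean(1.0 - np.sum(p3 * p2, axis=-1)))

def clip_pseudo_labels(pixel_feats, text_embeds):
    """Output-level supervision (illustrative): label each pixel with the
    class whose CLIP text embedding is most cosine-similar, yielding the
    per-pixel pseudo labels used to supervise the 3D segmentation model."""
    sims = l2_normalize(pixel_feats) @ l2_normalize(text_embeds).T
    return np.argmax(sims, axis=-1)
```

In this sketch, training the 3D encoder against both signals would drive its features toward the CLIP embedding space, which is what lets the 3D model inherit CLIP's zero-shot recognition of unseen classes.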



Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cross-modal distillation
  2. point cloud segmentation
  3. semantic segmentation
  4. zero-shot learning

Qualifiers

  • Research-article

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%

Article Metrics

  • Downloads (Last 12 months)313
  • Downloads (Last 6 weeks)13
Reflects downloads up to 01 Dec 2024

Cited By

  • Gait Recognition in Large-scale Free Environment via Single LiDAR. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 380-389. DOI: 10.1145/3664647.3681166
  • LaneCMKT: Boosting Monocular 3D Lane Detection with Cross-Modal Knowledge Transfer. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 4283-4291. DOI: 10.1145/3664647.3681038
  • Towards Practical Human Motion Prediction with LiDAR Point Clouds. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 7629-7638. DOI: 10.1145/3664647.3680720
  • Affinity3D: Propagating Instance-Level Semantic Affinity for Zero-Shot Point Cloud Semantic Segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 9019-9028. DOI: 10.1145/3664647.3680651
  • Class Probability Space Regularization for semi-supervised semantic segmentation. Computer Vision and Image Understanding 249 (2024), 104146. DOI: 10.1016/j.cviu.2024.104146
  • Pseudo-embedding for Generalized Few-Shot 3D Segmentation. In Computer Vision - ECCV 2024 (2024), 383-400. DOI: 10.1007/978-3-031-72764-1_22
  • LASS3D: Language-Assisted Semi-Supervised 3D Semantic Segmentation with Progressive Unreliable Data Exploitation. In Computer Vision - ECCV 2024 (2024), 252-269. DOI: 10.1007/978-3-031-72646-0_15
