DOI: 10.1145/3581783.3612107

Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation

Published: 27 October 2023

Abstract

Traditional 3D segmentation methods can recognize only the fixed set of classes that appear in the training set, which limits their application in real-world scenarios due to a lack of generalization ability. Large-scale visual-language pre-trained models such as CLIP have demonstrated strong generalization in zero-shot 2D vision tasks, but they cannot be applied directly to 3D semantic segmentation. In this work, we focus on zero-shot point cloud semantic segmentation and propose a simple yet effective baseline that transfers the visual-linguistic knowledge embedded in CLIP to a point cloud encoder at both the feature and output levels. Alignment between the 2D and 3D encoders is conducted at both levels for effective knowledge transfer. Concretely, a Multi-granularity Cross-modal Feature Alignment (MCFA) module aligns 2D and 3D features from global semantic and local positional perspectives for feature-level alignment. At the output level, per-pixel pseudo labels of unseen classes are extracted with the pre-trained CLIP model and used as supervision, so that the 3D segmentation model mimics the behavior of the CLIP image encoder. Extensive experiments are conducted on two popular point cloud segmentation benchmarks. Our method significantly outperforms previous state-of-the-art methods under the zero-shot setting (+29.2% mIoU on SemanticKITTI and +31.8% mIoU on nuScenes), and further achieves promising results in the annotation-free point cloud semantic segmentation setting, showing its great potential for label-efficient learning.
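The two transfer mechanisms described above can be illustrated with a minimal sketch. This is not the paper's implementation; it is an illustrative NumPy toy in which `feature_alignment_loss` stands in for the feature-level alignment (a cosine-distance loss between paired 3D point features and CLIP 2D pixel features) and `clip_pseudo_labels` stands in for the output-level supervision (per-pixel pseudo labels taken as the class whose CLIP text embedding is most similar). All function names and shapes here are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def feature_alignment_loss(point_feats, pixel_feats):
    """Feature-level alignment (illustrative): cosine-distance loss pulling
    each projected 3D point feature toward its paired CLIP 2D pixel feature.
    Both inputs are (N, D) arrays of paired point/pixel features."""
    p3 = l2_normalize(point_feats)
    p2 = l2_normalize(pixel_feats)
    return float(np.mean(1.0 - np.sum(p3 * p2, axis=-1)))

def clip_pseudo_labels(pixel_feats, text_embeds):
    """Output-level supervision (illustrative): label each pixel with the
    class whose CLIP text embedding is most cosine-similar, yielding the
    per-pixel pseudo labels used to supervise the 3D segmentation model."""
    sims = l2_normalize(pixel_feats) @ l2_normalize(text_embeds).T
    return np.argmax(sims, axis=-1)
```

In this sketch, training the 3D encoder against both signals would drive its features toward the CLIP embedding space, which is what lets the 3D model inherit CLIP's zero-shot recognition of unseen classes.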



Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cross-modal distillation
  2. point cloud segmentation
  3. semantic segmentation
  4. zero-shot learning

Qualifiers

  • Research-article

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%

Article Metrics

  • Downloads (Last 12 months)313
  • Downloads (Last 6 weeks)13
Reflects downloads up to 01 Dec 2024

Cited By

  • Gait Recognition in Large-scale Free Environment via Single LiDAR. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 380-389. DOI: 10.1145/3664647.3681166
  • LaneCMKT: Boosting Monocular 3D Lane Detection with Cross-Modal Knowledge Transfer. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 4283-4291. DOI: 10.1145/3664647.3681038
  • Towards Practical Human Motion Prediction with LiDAR Point Clouds. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 7629-7638. DOI: 10.1145/3664647.3680720
  • Affinity3D: Propagating Instance-Level Semantic Affinity for Zero-Shot Point Cloud Semantic Segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 9019-9028. DOI: 10.1145/3664647.3680651
  • Class Probability Space Regularization for semi-supervised semantic segmentation. Computer Vision and Image Understanding 249 (2024), 104146. DOI: 10.1016/j.cviu.2024.104146
  • Pseudo-embedding for Generalized Few-Shot 3D Segmentation. In Computer Vision - ECCV 2024 (2024), 383-400. DOI: 10.1007/978-3-031-72764-1_22
  • LASS3D: Language-Assisted Semi-Supervised 3D Semantic Segmentation with Progressive Unreliable Data Exploitation. In Computer Vision - ECCV 2024 (2024), 252-269. DOI: 10.1007/978-3-031-72646-0_15
