DOI: 10.1145/3664647.3680998

RefMask3D: Language-Guided Transformer for 3D Referring Segmentation

Published: 28 October 2024

Abstract

3D referring segmentation is an emerging and challenging vision-language task that aims to segment the object described by a natural language expression in a point cloud scene. The key challenge in this task is vision-language feature fusion and alignment. In this work, we propose RefMask3D to explore comprehensive multi-modal feature interaction and understanding. First, we propose a Geometry-Enhanced Group-Word Attention that integrates language with geometrically coherent sub-clouds through cross-modal group-word attention, effectively addressing the challenges posed by the sparse and irregular nature of point clouds. Then, we introduce a Linguistic Primitives Construction to produce semantic primitives representing distinct semantic attributes, which greatly enhance vision-language understanding at the decoding stage. Furthermore, we introduce an Object Cluster Module that analyzes the interrelationships among linguistic primitives to consolidate their insights and pinpoint common characteristics, helping to capture holistic information and improve the precision of target identification. The proposed RefMask3D achieves new state-of-the-art performance on 3D referring segmentation, 3D visual grounding, and 2D referring image segmentation. Notably, RefMask3D outperforms the previous state-of-the-art method by a large margin of 3.16% mIoU on the challenging ScanRefer dataset. Code is available at https://github.com/heshuting555/RefMask3D.
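The cross-modal group-word attention described above can be illustrated with a minimal sketch: each geometrically coherent sub-cloud (group) attends over the word tokens of the expression, and the attended language context is fused back into the group features. This is not the paper's implementation; the function name, shapes, and residual-fusion choice here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def group_word_attention(group_feats, word_feats):
    """Toy cross-modal attention: each point-cloud group attends over all words.

    group_feats: (G, D) features of geometrically coherent sub-clouds
    word_feats:  (W, D) language token features
    Returns language-enhanced group features of shape (G, D).
    """
    d = group_feats.shape[-1]
    scores = group_feats @ word_feats.T / np.sqrt(d)  # (G, W) group-word similarity
    attn = softmax(scores, axis=-1)                   # each group's weights over words
    return group_feats + attn @ word_feats            # residual fusion of language context

rng = np.random.default_rng(0)
groups = rng.standard_normal((8, 64))   # 8 hypothetical sub-cloud groups
words = rng.standard_normal((12, 64))   # 12 hypothetical language tokens
fused = group_word_attention(groups, words)
print(fused.shape)  # (8, 64)
```

In the actual model the projections, multi-head structure, and geometry enhancement are learned; this sketch only shows the attention pattern the abstract refers to.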


Cited By

  • (2024) SegPoint: Segment Any Point Cloud via Large Language Model. Computer Vision – ECCV 2024, pp. 349–367. DOI: 10.1007/978-3-031-72670-5_20. Online publication date: 29 Sep 2024.


    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. 3d referring segmentation
    2. language-guided transformer
    3. vision-language learning

    Qualifiers

    • Research-article

    Conference

    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 – November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
    Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (last 12 months): 37
    • Downloads (last 6 weeks): 37
    Reflects downloads up to 28 Nov 2024

