DOI: 10.1145/3664647.3680998

RefMask3D: Language-Guided Transformer for 3D Referring Segmentation

Published: 28 October 2024

Abstract

3D referring segmentation is an emerging and challenging vision-language task that aims to segment the object described by a natural language expression in a point cloud scene. The key challenge in this task is vision-language feature fusion and alignment. In this work, we propose RefMask3D to explore comprehensive multi-modal feature interaction and understanding. First, we propose a Geometry-Enhanced Group-Word Attention that integrates language with geometrically coherent sub-clouds through cross-modal group-word attention, effectively addressing the challenges posed by the sparse and irregular nature of point clouds. Then, we introduce a Linguistic Primitives Construction to produce semantic primitives representing distinct semantic attributes, which greatly enhance vision-language understanding at the decoding stage. Furthermore, we introduce an Object Cluster Module that analyzes the interrelationships among linguistic primitives to consolidate their insights and pinpoint common characteristics, helping to capture holistic information and improve the precision of target identification. The proposed RefMask3D achieves new state-of-the-art performance on 3D referring segmentation, 3D visual grounding, and 2D referring image segmentation. Notably, RefMask3D outperforms the previous state-of-the-art method by a large margin of 3.16% mIoU on the challenging ScanRefer dataset. Code is available at https://github.com/heshuting555/RefMask3D.
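The cross-modal group-word attention described above can be illustrated with a minimal sketch: each geometrically coherent sub-cloud (group) attends over the word tokens of the expression, and the attended language context is fused back into the group features. This is not the paper's implementation; the function name, shapes, and residual-fusion choice here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def group_word_attention(group_feats, word_feats):
    """Toy cross-modal attention: each point-cloud group attends over all words.

    group_feats: (G, D) features of geometrically coherent sub-clouds
    word_feats:  (W, D) language token features
    Returns language-enhanced group features of shape (G, D).
    """
    d = group_feats.shape[-1]
    scores = group_feats @ word_feats.T / np.sqrt(d)  # (G, W) group-word similarity
    attn = softmax(scores, axis=-1)                   # each group's weights over words
    return group_feats + attn @ word_feats            # residual fusion of language context

rng = np.random.default_rng(0)
groups = rng.standard_normal((8, 64))   # 8 hypothetical sub-cloud groups
words = rng.standard_normal((12, 64))   # 12 hypothetical language tokens
fused = group_word_attention(groups, words)
print(fused.shape)  # (8, 64)
```

In the actual model the projections, multi-head structure, and geometry enhancement are learned; this sketch only shows the attention pattern the abstract refers to.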


Cited By

  • (2024) SegPoint: Segment Any Point Cloud via Large Language Model. Computer Vision – ECCV 2024, pp. 349–367. DOI: 10.1007/978-3-031-72670-5_20. Online publication date: 29 Sep 2024.


    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. 3d referring segmentation
    2. language-guided transformer
    3. vision-language learning

    Qualifiers

    • Research-article

    Conference

    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 – November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
    Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (last 12 months): 37
    • Downloads (last 6 weeks): 37
    Reflects downloads up to 28 Nov 2024

