DOI: 10.1145/3591106.3592235

Intra-inter Modal Attention Blocks for RGB-D Semantic Segmentation

Published: 12 June 2023

Abstract

In this paper, we introduce a novel approach to address the challenge of effectively utilizing both RGB and depth information for semantic segmentation. Our approach, Intra-inter Modal Attention (IMA) blocks, considers both intra-modal and inter-modal relationships and thereby produces better results than prior methods, which focus primarily on inter-modal relationships. The IMA blocks consist of a cross-modal non-local module and an adaptive channel-wise fusion module. The cross-modal non-local module captures both intra-modal and inter-modal variations at the spatial level through inter-modality parameter sharing, while the adaptive channel-wise fusion module refines the spatially correlated features. Experimental results on RGB-D benchmark datasets demonstrate consistent performance improvements over various baseline segmentation networks when the IMA blocks are used. Our in-depth analysis provides comprehensive results on the impact of intra-, inter-, and intra-inter modal attention on RGB-D segmentation.
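
To make the architecture description above more concrete, the following is a minimal, illustrative sketch of an IMA-style block in PyTorch. It is not the authors' implementation: the class names (CrossModalNonLocal, AdaptiveChannelFusion, IMABlock), the choice to share one set of query/key/value projections across modalities, and the SE-style channel gating are assumptions inferred only from the abstract's wording about inter-modality parameter sharing and adaptive channel-wise fusion.

# Illustrative sketch (not the paper's code): an IMA-style block in PyTorch.
# CrossModalNonLocal, AdaptiveChannelFusion, and IMABlock are hypothetical
# names; the shared projections and SE-style gate are assumptions based on
# the abstract's description.
import torch
import torch.nn as nn


class CrossModalNonLocal(nn.Module):
    """Non-local attention whose projections are shared by both modalities,
    so intra-modal (RGB->RGB, depth->depth) and inter-modal (RGB->depth,
    depth->RGB) affinities are computed with the same parameters."""

    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or max(channels // 2, 1)
        # One set of projections, reused for both RGB and depth features.
        self.query = nn.Conv2d(channels, reduced, kernel_size=1)
        self.key = nn.Conv2d(channels, reduced, kernel_size=1)
        self.value = nn.Conv2d(channels, reduced, kernel_size=1)
        self.out = nn.Conv2d(reduced, channels, kernel_size=1)

    def attend(self, q_feat, kv_feat):
        # Standard dot-product attention between two spatial feature maps.
        b, _, h, w = q_feat.shape
        q = self.query(q_feat).flatten(2).transpose(1, 2)    # B x HW x C'
        k = self.key(kv_feat).flatten(2)                      # B x C' x HW
        v = self.value(kv_feat).flatten(2).transpose(1, 2)    # B x HW x C'
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return self.out(ctx)

    def forward(self, rgb, depth):
        # Each stream gathers intra-modal and inter-modal context, residual added.
        rgb_out = rgb + self.attend(rgb, rgb) + self.attend(rgb, depth)
        depth_out = depth + self.attend(depth, depth) + self.attend(depth, rgb)
        return rgb_out, depth_out


class AdaptiveChannelFusion(nn.Module):
    """SE-style gate that re-weights the two spatially correlated streams
    channel-wise before summing them into a single fused feature map."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 2 * channels),
        )

    def forward(self, rgb, depth):
        b, c, _, _ = rgb.shape
        pooled = torch.cat([rgb, depth], dim=1).mean(dim=(2, 3))   # B x 2C
        w = torch.sigmoid(self.gate(pooled)).view(b, 2 * c, 1, 1)
        w_rgb, w_depth = w[:, :c], w[:, c:]
        return w_rgb * rgb + w_depth * depth


class IMABlock(nn.Module):
    """Cross-modal non-local attention followed by adaptive channel fusion."""

    def __init__(self, channels):
        super().__init__()
        self.non_local = CrossModalNonLocal(channels)
        self.fusion = AdaptiveChannelFusion(channels)

    def forward(self, rgb, depth):
        rgb, depth = self.non_local(rgb, depth)
        return self.fusion(rgb, depth)


if __name__ == "__main__":
    block = IMABlock(channels=64)
    rgb = torch.randn(2, 64, 30, 40)
    depth = torch.randn(2, 64, 30, 40)
    print(block(rgb, depth).shape)   # torch.Size([2, 64, 30, 40])

Under these assumptions, sharing a single set of projections keeps intra-modal (RGB-to-RGB, depth-to-depth) and inter-modal (RGB-to-depth, depth-to-RGB) affinities in one embedding space, while the channel gate decides, per channel, how much of each spatially refined stream enters the fused output.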

Cited By

• Perceptual metric for face image quality with pixel-level interpretability. Neurocomputing 614 (January 2025), Article 128780. https://doi.org/10.1016/j.neucom.2024.128780

Published In

ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval
June 2023, 694 pages
ISBN: 9798400701788
DOI: 10.1145/3591106

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. Multi-modal Fusion
2. Non-local Attention
3. RGB-D Semantic Segmentation

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

ICMR '23

Acceptance Rates

Overall acceptance rate: 254 of 830 submissions, 31%
