DOI: 10.1145/3591106.3592235

Intra-inter Modal Attention Blocks for RGB-D Semantic Segmentation

Published: 12 June 2023

Abstract

In this paper, we introduce a novel approach to address the challenge of effectively utilizing both RGB and depth information for semantic segmentation. Our approach, Intra-inter Modal Attention (IMA) blocks, considers both intra-modal and inter-modal relationships and thereby produces better results than prior methods, which focus primarily on inter-modal relationships. The IMA blocks consist of a cross-modal non-local module and an adaptive channel-wise fusion module. The cross-modal non-local module captures both intra-modal and inter-modal variations at the spatial level through inter-modality parameter sharing, while the adaptive channel-wise fusion module refines the spatially correlated features. Experimental results on RGB-D benchmark datasets demonstrate consistent performance improvements over various baseline segmentation networks when the IMA blocks are used. Our in-depth analysis provides comprehensive results on the impact of intra-, inter-, and intra-inter modal attention on RGB-D segmentation.
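
To make the architecture description above more concrete, the following is a minimal, illustrative sketch of an IMA-style block in PyTorch. It is not the authors' implementation: the class names (CrossModalNonLocal, AdaptiveChannelFusion, IMABlock), the choice to share one set of query/key/value projections across modalities, and the SE-style channel gating are assumptions inferred only from the abstract's wording about inter-modality parameter sharing and adaptive channel-wise fusion.

# Illustrative sketch (not the paper's code): an IMA-style block in PyTorch.
# CrossModalNonLocal, AdaptiveChannelFusion, and IMABlock are hypothetical
# names; the shared projections and SE-style gate are assumptions based on
# the abstract's description.
import torch
import torch.nn as nn


class CrossModalNonLocal(nn.Module):
    """Non-local attention whose projections are shared by both modalities,
    so intra-modal (RGB->RGB, depth->depth) and inter-modal (RGB->depth,
    depth->RGB) affinities are computed with the same parameters."""

    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or max(channels // 2, 1)
        # One set of projections, reused for both RGB and depth features.
        self.query = nn.Conv2d(channels, reduced, kernel_size=1)
        self.key = nn.Conv2d(channels, reduced, kernel_size=1)
        self.value = nn.Conv2d(channels, reduced, kernel_size=1)
        self.out = nn.Conv2d(reduced, channels, kernel_size=1)

    def attend(self, q_feat, kv_feat):
        # Standard dot-product attention between two spatial feature maps.
        b, _, h, w = q_feat.shape
        q = self.query(q_feat).flatten(2).transpose(1, 2)    # B x HW x C'
        k = self.key(kv_feat).flatten(2)                      # B x C' x HW
        v = self.value(kv_feat).flatten(2).transpose(1, 2)    # B x HW x C'
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return self.out(ctx)

    def forward(self, rgb, depth):
        # Each stream gathers intra-modal and inter-modal context, residual added.
        rgb_out = rgb + self.attend(rgb, rgb) + self.attend(rgb, depth)
        depth_out = depth + self.attend(depth, depth) + self.attend(depth, rgb)
        return rgb_out, depth_out


class AdaptiveChannelFusion(nn.Module):
    """SE-style gate that re-weights the two spatially correlated streams
    channel-wise before summing them into a single fused feature map."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 2 * channels),
        )

    def forward(self, rgb, depth):
        b, c, _, _ = rgb.shape
        pooled = torch.cat([rgb, depth], dim=1).mean(dim=(2, 3))   # B x 2C
        w = torch.sigmoid(self.gate(pooled)).view(b, 2 * c, 1, 1)
        w_rgb, w_depth = w[:, :c], w[:, c:]
        return w_rgb * rgb + w_depth * depth


class IMABlock(nn.Module):
    """Cross-modal non-local attention followed by adaptive channel fusion."""

    def __init__(self, channels):
        super().__init__()
        self.non_local = CrossModalNonLocal(channels)
        self.fusion = AdaptiveChannelFusion(channels)

    def forward(self, rgb, depth):
        rgb, depth = self.non_local(rgb, depth)
        return self.fusion(rgb, depth)


if __name__ == "__main__":
    block = IMABlock(channels=64)
    rgb = torch.randn(2, 64, 30, 40)
    depth = torch.randn(2, 64, 30, 40)
    print(block(rgb, depth).shape)   # torch.Size([2, 64, 30, 40])

Under these assumptions, sharing a single set of projections keeps intra-modal (RGB-to-RGB, depth-to-depth) and inter-modal (RGB-to-depth, depth-to-RGB) affinities in one embedding space, while the channel gate decides, per channel, how much of each spatially refined stream enters the fused output.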

Cited By

• Perceptual metric for face image quality with pixel-level interpretability. Neurocomputing 614 (January 2025), Article 128780. https://doi.org/10.1016/j.neucom.2024.128780

Published In

ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval
June 2023, 694 pages
ISBN: 9798400701788
DOI: 10.1145/3591106

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. Multi-modal Fusion
2. Non-local Attention
3. RGB-D Semantic Segmentation

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

ICMR '23

Acceptance Rates

Overall acceptance rate: 254 of 830 submissions, 31%
