PGDENet: Progressive Guided Fusion and Depth Enhancement Network for RGB-D Indoor Scene Parsing

Published: 01 January 2023

Abstract

Scene parsing is a fundamental task in computer vision. Various RGB-D (color and depth) scene parsing methods based on fully convolutional networks have achieved excellent performance. However, color and depth information are different in nature, and existing methods fail to coordinate high-level and low-level information when aggregating the two modalities, which introduces noise or loses key information in the aggregated features and produces inaccurate segmentation maps. In addition, the features extracted from the depth branch are weak because of the low quality of the depth maps, which results in unsatisfactory feature representations. To address these drawbacks, we propose a progressive guided fusion and depth enhancement network (PGDENet) for RGB-D indoor scene parsing. First, high-quality RGB images are used to improve the depth data through a depth enhancement module, in which the depth maps are strengthened in terms of channel and spatial correlations. Then, information from the RGB and enhanced depth modalities is integrated using a progressive complementary fusion module, which starts from high-level semantic information and moves down layer by layer to guide the fusion of adjacent layers while reducing hierarchy-based differences. Extensive experiments are conducted on two public indoor scene datasets, and the results show that the proposed PGDENet outperforms state-of-the-art RGB-D scene parsing methods.
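
The abstract describes two components: a depth enhancement module in which RGB features reweight the depth features along channel and spatial dimensions, and a progressive complementary fusion module that propagates high-level semantics downward to guide the fusion of adjacent layers. The sketch below is a minimal, hypothetical PyTorch illustration of how such RGB-guided channel/spatial enhancement and top-down guided fusion could be wired; the module names, layer sizes, and exact attention formulation are assumptions for illustration, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthEnhancement(nn.Module):
    """RGB-guided depth enhancement (assumed form): channel and spatial reweighting."""

    def __init__(self, channels):
        super().__init__()
        # Channel gate: squeeze RGB features into a per-channel weight.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial gate: collapse RGB features into a per-pixel weight.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, depth_feat):
        # Reweight depth channels using channel statistics of the RGB features.
        enhanced = depth_feat * self.channel_gate(rgb_feat)
        # Reweight depth locations using avg/max-pooled RGB maps.
        pooled = torch.cat(
            [rgb_feat.mean(dim=1, keepdim=True),
             rgb_feat.max(dim=1, keepdim=True)[0]], dim=1)
        enhanced = enhanced * self.spatial_gate(pooled)
        # Residual connection preserves the original depth signal.
        return depth_feat + enhanced


class TopDownGuidedFusion(nn.Module):
    """Progressive fusion (assumed form): deep semantics guide shallower levels."""

    def __init__(self, channels, num_levels=4):
        super().__init__()
        # One merge convolution per adjacent pair of levels (assumes equal channel counts).
        self.merge = nn.ModuleList(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
            for _ in range(num_levels - 1)
        )

    def forward(self, fused_levels):
        # fused_levels: per-level RGB+depth features, index 0 = deepest (most semantic) level.
        out = fused_levels[0]
        for i, feat in enumerate(fused_levels[1:]):
            # Upsample the higher-level feature to guide the adjacent shallower one.
            out = F.interpolate(out, size=feat.shape[-2:], mode="bilinear",
                                align_corners=False)
            out = self.merge[i](torch.cat([out, feat], dim=1))
        return out
```

In this reading, each encoder stage would first pass its RGB and depth features through DepthEnhancement, and the resulting per-level fused features would then be merged from the deepest level downward by TopDownGuidedFusion before a decoder produces the segmentation map.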

Published In

IEEE Transactions on Multimedia, Volume 25, 2023, 8932 pages

Publisher

IEEE Press

Publication History

Published: 01 January 2023

Qualifiers

  • Research-article

Cited By

  • MaskMentor: Unlocking the Potential of Masked Self-Teaching for Missing Modality RGB-D Semantic Segmentation. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 1915–1923, Oct. 2024. DOI: 10.1145/3664647.3681698
  • PrimKD: Primary Modality Guided Multimodal Fusion for RGB-D Semantic Segmentation. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 1943–1951, Oct. 2024. DOI: 10.1145/3664647.3681253
  • EISNet: A Multi-Modal Fusion Network for Semantic Segmentation With Events and Images. IEEE Transactions on Multimedia, vol. 26, pp. 8639–8650, 2024. DOI: 10.1109/TMM.2024.3380255
  • EGFNet: Edge-Aware Guidance Fusion Network for RGB–Thermal Urban Scene Parsing. IEEE Transactions on Intelligent Transportation Systems, vol. 25, no. 1, pp. 657–669, Jan. 2024. DOI: 10.1109/TITS.2023.3306368
  • DGPINet-KD: Deep Guided and Progressive Integration Network With Knowledge Distillation for RGB-D Indoor Scene Analysis. IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 9, pp. 7844–7855, Sep. 2024. DOI: 10.1109/TCSVT.2024.3382354
  • Pixel Difference Convolutional Network for RGB-D Semantic Segmentation. IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 3, pp. 1481–1492, Mar. 2024. DOI: 10.1109/TCSVT.2023.3296162
  • MMSMCNet: Modal Memory Sharing and Morphological Complementary Networks for RGB-T Urban Scene Semantic Segmentation. IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 12, pp. 7096–7108, Dec. 2023. DOI: 10.1109/TCSVT.2023.3275314
  • Dual-Space Graph-Based Interaction Network for RGB-Thermal Semantic Segmentation in Electric Power Scene. IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 4, pp. 1577–1592, Apr. 2023. DOI: 10.1109/TCSVT.2022.3216313
