DGPINet-KD: Deep Guided and Progressive Integration Network With Knowledge Distillation for RGB-D Indoor Scene Analysis

Published: 27 March 2024

Abstract

Significant advances in RGB-D semantic segmentation have been made owing to the increasing availability of robust depth information. Most researchers combine depth with RGB data to capture complementary information in images. Although this approach improves segmentation performance, it substantially increases the number of model parameters. To address this problem, we propose DGPINet-KD, a deep-guided and progressive integration network with knowledge distillation (KD) for RGB-D indoor scene analysis. First, we use branching attention and depth guidance to capture coordinated, precise location information and to extract more complete spatial information from the depth map, complementing the semantic information of the encoded features. Second, we train the student network (DGPINet-S) with a well-trained teacher network (DGPINet-T) using multilevel KD. Third, an integration unit is developed to explore the contextual dependencies of the decoding features and to enhance relational KD. Comprehensive experiments on two challenging indoor benchmark datasets, NYUDv2 and SUN RGB-D, demonstrate that DGPINet-KD achieves improved performance in indoor scene analysis compared with existing methods. Notably, on the NYUDv2 dataset, DGPINet-KD (DGPINet-S with KD) achieves a pixel accuracy gain of 1.7% and a class accuracy gain of 2.3% over DGPINet-S. In addition, compared with DGPINet-T, DGPINet-KD uses significantly fewer parameters (29.3M) while maintaining accuracy. The source code is available at https://github.com/XUEXIKUAIL/DGPINet.
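The distillation pipeline summarized above (a compact student, DGPINet-S, supervised by a well-trained teacher, DGPINet-T, at several levels) follows the general recipe of combining a softened logit term with feature-matching terms. The following is a minimal, generic PyTorch sketch of such a multilevel distillation loss; the class name MultiLevelKDLoss, the temperature and weighting values, and the dummy tensors are illustrative assumptions and do not reproduce the authors' DGPINet-KD losses or their relational KD term.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiLevelKDLoss(nn.Module):
        # Generic multilevel distillation: pixel-wise KL divergence on softened
        # logits (Hinton-style) plus MSE feature matching at several stages.
        # Hypothetical sketch, not the DGPINet-KD implementation.
        def __init__(self, temperature=4.0, feat_weight=0.5):
            super().__init__()
            self.t = temperature
            self.feat_weight = feat_weight

        def forward(self, student_logits, teacher_logits, student_feats, teacher_feats):
            # Logit-level KD: soften both class distributions with a temperature
            # and minimize the per-pixel KL divergence.
            c = student_logits.size(1)
            log_p_s = F.log_softmax(
                student_logits.permute(0, 2, 3, 1).reshape(-1, c) / self.t, dim=1)
            p_t = F.softmax(
                teacher_logits.permute(0, 2, 3, 1).reshape(-1, c) / self.t, dim=1)
            kd_logit = F.kl_div(log_p_s, p_t, reduction="batchmean") * (self.t ** 2)

            # Feature-level KD: resize each student feature map to the teacher's
            # spatial size and match it with an MSE term (a learned 1x1 projection
            # would normally be added when channel counts differ).
            kd_feat = student_logits.new_zeros(())
            for f_s, f_t in zip(student_feats, teacher_feats):
                f_s = F.interpolate(f_s, size=f_t.shape[2:], mode="bilinear",
                                    align_corners=False)
                kd_feat = kd_feat + F.mse_loss(f_s, f_t)

            return kd_logit + self.feat_weight * kd_feat

    # Hypothetical usage with dummy tensors: 4 classes, two feature stages whose
    # channel counts already match between student and teacher.
    s_logits, t_logits = torch.randn(2, 4, 60, 80), torch.randn(2, 4, 60, 80)
    s_feats = [torch.randn(2, 64, 30, 40), torch.randn(2, 128, 15, 20)]
    t_feats = [torch.randn(2, 64, 30, 40), torch.randn(2, 128, 15, 20)]
    kd_loss = MultiLevelKDLoss()(s_logits, t_logits, s_feats, t_feats)

In practice such a distillation term is added to the ordinary cross-entropy loss on the ground-truth labels when training the student.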



Published In

IEEE Transactions on Circuits and Systems for Video Technology, Volume 34, Issue 9, September 2024, 1180 pages

Publisher: IEEE Press
