DGPINet-KD: Deep Guided and Progressive Integration Network With Knowledge Distillation for RGB-D Indoor Scene Analysis

Published: 27 March 2024

Abstract

Significant advances in RGB-D semantic segmentation have been made owing to the increasing availability of robust depth information. Most researchers combine depth with RGB data to capture complementary information in images. Although this approach improves segmentation performance, it substantially increases the number of model parameters. To address this problem, we propose DGPINet-KD, a deep-guided and progressive integration network with knowledge distillation (KD) for RGB-D indoor scene analysis. First, we use branching attention and depth guidance to capture coordinated, precise location information and to extract more complete spatial information from the depth map, complementing the semantic information of the encoded features. Second, we train the student network (DGPINet-S) with a well-trained teacher network (DGPINet-T) using multilevel KD. Third, an integration unit is developed to explore the contextual dependencies of the decoding features and to enhance relational KD. Comprehensive experiments on two challenging indoor benchmark datasets, NYUDv2 and SUN RGB-D, demonstrate that DGPINet-KD achieves improved performance in indoor scene analysis compared with existing methods. Notably, on the NYUDv2 dataset, DGPINet-KD (DGPINet-S with KD) achieves a pixel accuracy gain of 1.7% and a class accuracy gain of 2.3% over DGPINet-S. In addition, compared with DGPINet-T, DGPINet-KD uses significantly fewer parameters (29.3M) while maintaining accuracy. The source code is available at https://github.com/XUEXIKUAIL/DGPINet.
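The distillation pipeline summarized above (a compact student, DGPINet-S, supervised by a well-trained teacher, DGPINet-T, at several levels) follows the general recipe of combining a softened logit term with feature-matching terms. The following is a minimal, generic PyTorch sketch of such a multilevel distillation loss; the class name MultiLevelKDLoss, the temperature and weighting values, and the dummy tensors are illustrative assumptions and do not reproduce the authors' DGPINet-KD losses or their relational KD term.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiLevelKDLoss(nn.Module):
        # Generic multilevel distillation: pixel-wise KL divergence on softened
        # logits (Hinton-style) plus MSE feature matching at several stages.
        # Hypothetical sketch, not the DGPINet-KD implementation.
        def __init__(self, temperature=4.0, feat_weight=0.5):
            super().__init__()
            self.t = temperature
            self.feat_weight = feat_weight

        def forward(self, student_logits, teacher_logits, student_feats, teacher_feats):
            # Logit-level KD: soften both class distributions with a temperature
            # and minimize the per-pixel KL divergence.
            c = student_logits.size(1)
            log_p_s = F.log_softmax(
                student_logits.permute(0, 2, 3, 1).reshape(-1, c) / self.t, dim=1)
            p_t = F.softmax(
                teacher_logits.permute(0, 2, 3, 1).reshape(-1, c) / self.t, dim=1)
            kd_logit = F.kl_div(log_p_s, p_t, reduction="batchmean") * (self.t ** 2)

            # Feature-level KD: resize each student feature map to the teacher's
            # spatial size and match it with an MSE term (a learned 1x1 projection
            # would normally be added when channel counts differ).
            kd_feat = student_logits.new_zeros(())
            for f_s, f_t in zip(student_feats, teacher_feats):
                f_s = F.interpolate(f_s, size=f_t.shape[2:], mode="bilinear",
                                    align_corners=False)
                kd_feat = kd_feat + F.mse_loss(f_s, f_t)

            return kd_logit + self.feat_weight * kd_feat

    # Hypothetical usage with dummy tensors: 4 classes, two feature stages whose
    # channel counts already match between student and teacher.
    s_logits, t_logits = torch.randn(2, 4, 60, 80), torch.randn(2, 4, 60, 80)
    s_feats = [torch.randn(2, 64, 30, 40), torch.randn(2, 128, 15, 20)]
    t_feats = [torch.randn(2, 64, 30, 40), torch.randn(2, 128, 15, 20)]
    kd_loss = MultiLevelKDLoss()(s_logits, t_logits, s_feats, t_feats)

In practice such a distillation term is added to the ordinary cross-entropy loss on the ground-truth labels when training the student.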



Published In

IEEE Transactions on Circuits and Systems for Video Technology, Volume 34, Issue 9, September 2024, 1180 pages

Publisher: IEEE Press
