Triple fusion and feature pyramid decoder for RGB-D semantic segmentation

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Current RGB-D semantic segmentation networks treat depth as an additional modality and merge RGB and depth features with equal-weighted concatenation or other simple fusion strategies, which limits how effectively cross-modal information is used. To address this, we propose an RGB-D semantic segmentation network based on triple fusion and feature pyramid decoding. Its three-stage cross-modal fusion module (TCFM) enables bidirectional interaction and fusion of RGB and depth features. The TCFM uses cross-modal cross-attention to mix information from each modality into the other, and fuses the RGB and depth features with a channel-adaptive weighted fusion module. In addition, we introduce a lightweight feature pyramid decoder that fuses the multi-scale features extracted by the encoder. Experiments on the NYU Depth V2 and SUN RGB-D datasets demonstrate that the proposed cross-modal feature fusion network efficiently segments complex scenes.
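
As a rough illustration of the two operations named in the abstract, the following minimal sketch shows bidirectional cross-attention between RGB and depth tokens followed by a per-channel weighted fusion. It is not the TCFM itself: the function names, the identity query/key/value projections, the residual connection, and the use of a softmax over channel means as fusion weights are all assumptions made for this example, and no learned parameters are included.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_feats, context_feats):
    """Scaled dot-product attention of one modality over the other.

    Each feature set is a list of tokens, each token a list of channels.
    Assumption: identity projections (no learned Q/K/V weights) and a
    residual connection that adds the attended context to each query token.
    """
    dim = len(query_feats[0])
    output = []
    for q in query_feats:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in context_feats
        ]
        weights = softmax(scores)
        attended = [
            sum(w * k[c] for w, k in zip(weights, context_feats))
            for c in range(dim)
        ]
        output.append([qi + ai for qi, ai in zip(q, attended)])
    return output


def channel_adaptive_fusion(rgb_feats, depth_feats):
    """Fuse two modalities with one weight pair per channel.

    Assumption: the weights come from a softmax over the two
    global-average channel descriptors; a trained network would
    normally learn this mapping instead.
    """
    num_tokens = len(rgb_feats)
    dim = len(rgb_feats[0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    channel_weights = []
    for c in range(dim):
        rgb_mean = sum(tok[c] for tok in rgb_feats) / num_tokens
        depth_mean = sum(tok[c] for tok in depth_feats) / num_tokens
        w_rgb, w_depth = softmax([rgb_mean, depth_mean])
        channel_weights.append((w_rgb, w_depth))
        for t in range(num_tokens):
            fused[t][c] = w_rgb * rgb_feats[t][c] + w_depth * depth_feats[t][c]
    return fused, channel_weights


# Toy input: 3 spatial tokens with 2 channels per modality (made-up values).
rgb = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
depth = [[0.2, 0.8], [0.9, 0.1], [0.4, 0.6]]

rgb_enhanced = cross_attention(rgb, depth)    # RGB queries attend to depth
depth_enhanced = cross_attention(depth, rgb)  # depth queries attend to RGB
fused, weights = channel_adaptive_fusion(rgb_enhanced, depth_enhanced)
```

Running attention in both directions before fusing is what makes the interaction bidirectional in this sketch; the per-channel weights always sum to one, so the fused feature is a convex combination of the two enhanced modalities.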

Data availability

No datasets were generated or analysed during the current study.

Funding

Funding was provided by National Natural Science Foundation of China (Grant No. 62102003).

Author information

Contributions

Zhu was responsible for writing the manuscript text and experimental design, Ge and Xia for revising the manuscript, Tang for data analysis, Lu for experimental design and data collection, and Chen for organizing the experimental data.

Corresponding author

Correspondence to Xu Zhu.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ge, B., Zhu, X., Tang, Z. et al. Triple fusion and feature pyramid decoder for RGB-D semantic segmentation. Multimedia Systems 30, 272 (2024). https://doi.org/10.1007/s00530-024-01459-w

  • DOI: https://doi.org/10.1007/s00530-024-01459-w
