Joint Cross-Modal and Unimodal Features for RGB-D Salient Object Detection

Published: 01 January 2021

Abstract

RGB-D salient object detection is one of the fundamental tasks in computer vision. Most existing models focus on finding efficient ways to fuse the complementary information from RGB and depth images for better saliency detection. However, in many real-life cases, where one of the input images has poor visual quality or one modality alone already contains abundant saliency cues, fusing cross-modal features does not improve detection accuracy compared to using unimodal features only. In view of this, a novel RGB-D salient object detection model is proposed that simultaneously exploits the cross-modal features of the RGB-D image pair and the unimodal features of the individual RGB and depth images. To this end, a Multi-branch Feature Fusion Module is presented to capture the cross-level and cross-modal complementary information between the RGB and depth images, as well as the cross-level unimodal features of the RGB images and the depth images separately. On top of that, a Feature Selection Module is designed to adaptively select the most discriminative features for the final saliency prediction from the fused cross-modal features and the unimodal features. Extensive evaluations on four benchmark datasets demonstrate that the proposed model outperforms state-of-the-art approaches by a large margin.
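The abstract only outlines the Multi-branch Feature Fusion Module and the Feature Selection Module at a high level, so the PyTorch snippet below is a minimal illustrative sketch of one plausible reading of that design, not the authors' implementation: the module names MultiBranchFeatureFusion and FeatureSelection, the channel widths, the three-branch layout (RGB-only, depth-only, cross-modal), and the SE-style channel gating are all assumptions introduced for exposition.

```python
# Illustrative sketch only: exact MFFM/FSM designs are not given in the abstract,
# so the branch layout, channel sizes, and gating below are assumptions.
import torch
import torch.nn as nn


class MultiBranchFeatureFusion(nn.Module):
    """Fuses two adjacent-level features per branch: an RGB-only branch,
    a depth-only branch, and a cross-modal RGB-D branch."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.depth_branch = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.cross_branch = nn.Sequential(
            nn.Conv2d(4 * channels, channels, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, rgb_low, rgb_high, dep_low, dep_high):
        f_rgb = self.rgb_branch(torch.cat([rgb_low, rgb_high], dim=1))
        f_dep = self.depth_branch(torch.cat([dep_low, dep_high], dim=1))
        f_cross = self.cross_branch(
            torch.cat([rgb_low, rgb_high, dep_low, dep_high], dim=1))
        return f_rgb, f_dep, f_cross


class FeatureSelection(nn.Module):
    """Channel-attention gate that re-weights the concatenated unimodal and
    cross-modal features before saliency prediction (assumed SE-style gating)."""

    def __init__(self, channels: int = 64, branches: int = 3):
        super().__init__()
        total = branches * channels
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(total, total // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(total // 4, total, 1), nn.Sigmoid())
        self.predict = nn.Conv2d(total, 1, 1)  # single-channel saliency map

    def forward(self, f_rgb, f_dep, f_cross):
        fused = torch.cat([f_rgb, f_dep, f_cross], dim=1)
        selected = fused * self.gate(fused)           # adaptive feature selection
        return torch.sigmoid(self.predict(selected))  # saliency in [0, 1]


if __name__ == "__main__":
    mffm, fsm = MultiBranchFeatureFusion(64), FeatureSelection(64)
    feats = [torch.randn(1, 64, 56, 56) for _ in range(4)]  # toy feature maps
    saliency = fsm(*mffm(*feats))
    print(saliency.shape)  # torch.Size([1, 1, 56, 56])
```

The gating step mirrors the stated motivation: by re-weighting the unimodal and cross-modal branches per channel, the contribution of an unreliable modality can be suppressed before the final saliency prediction.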




Published In

IEEE Transactions on Multimedia, Volume 23, 2021 (1967 pages)

Publisher

IEEE Press

Publication History

Published: 01 January 2021

Qualifiers

  • Research-article


Cited By

  • (2024) UniTR: A Unified TRansformer-Based Framework for Co-Object and Multi-Modal Saliency Detection. IEEE Transactions on Multimedia, vol. 26, pp. 7622-7635. DOI: 10.1109/TMM.2024.3369922. Online publication date: 26-Feb-2024.
  • (2024) A Volumetric Saliency Guided Image Summarization for RGB-D Indoor Scene Classification. IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 11 (Part 1), pp. 10917-10929. DOI: 10.1109/TCSVT.2024.3412949. Online publication date: 11-Jun-2024.
  • (2024) Feature Calibrating and Fusing Network for RGB-D Salient Object Detection. IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 3, pp. 1493-1507. DOI: 10.1109/TCSVT.2023.3296581. Online publication date: 1-Mar-2024.
  • (2023) Coordinate Attention Filtering Depth-Feature Guide Cross-Modal Fusion RGB-Depth Salient Object Detection. Advances in Multimedia, vol. 2023. DOI: 10.1155/2023/9921988. Online publication date: 1-Jan-2023.
  • (2023) C2DFNet: Criss-Cross Dynamic Filter Network for RGB-D Salient Object Detection. IEEE Transactions on Multimedia, vol. 25, pp. 5142-5154. DOI: 10.1109/TMM.2022.3187856. Online publication date: 1-Jan-2023.
  • (2023) PGDENet: Progressive Guided Fusion and Depth Enhancement Network for RGB-D Indoor Scene Parsing. IEEE Transactions on Multimedia, vol. 25, pp. 3483-3494. DOI: 10.1109/TMM.2022.3161852. Online publication date: 1-Jan-2023.
  • (2023) Radio-Assisted Human Detection. IEEE Transactions on Multimedia, vol. 25, pp. 2613-2623. DOI: 10.1109/TMM.2022.3149129. Online publication date: 1-Jan-2023.
  • (2023) A Feature Divide-and-Conquer Network for RGB-T Semantic Segmentation. IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 6, pp. 2892-2905. DOI: 10.1109/TCSVT.2022.3229359. Online publication date: 1-Jun-2023.
  • (2023) RGB-D saliency detection via complementary and selective learning. Applied Intelligence, vol. 53, no. 7, pp. 7957-7969. DOI: 10.1007/s10489-022-03612-2. Online publication date: 1-Apr-2023.
  • (2023) Research on Improved Algorithm of Significance Object Detection Based on ATSA Model. Advances in Brain Inspired Cognitive Systems, pp. 154-165. DOI: 10.1007/978-981-97-1417-9_15. Online publication date: 5-Aug-2023.
