Abstract
Saliency detection plays an important role in computer vision and scene understanding and has attracted increasing attention in recent years. Compared with the widely studied problem of image saliency prediction, many problems remain to be solved in video saliency. Unlike images, video data contains motion information, and describing and exploiting this information effectively is a critical issue. In this paper, we propose a spatial and motion dual-stream framework for video saliency detection. Coarse motion features extracted from optical flow are refined with higher-level semantic spatial features via a residual cross-connection. A hierarchical fusion structure is proposed to preserve contextual information by integrating spatial and motion features at each level. To model inter-frame correlation in the video, a convolutional gated recurrent unit (convGRU) is used to maintain global consistency of the salient region across neighboring frames. Experimental results on four widely used datasets demonstrate the effectiveness of the proposed method in comparison with other state-of-the-art methods. Our source code is available at https://github.com/banhuML/MFHF.
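The temporal module mentioned in the abstract can be illustrated with a minimal, single-channel NumPy sketch of one convGRU step. This is not the authors' implementation: the kernel shapes, parameter names, and the naive "same"-padding convolution are all illustrative assumptions; the gate equations follow the standard convGRU formulation, with convolutions replacing the matrix products of an ordinary GRU.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive single-channel 2D convolution with 'same' zero padding.
    x: (H, W) feature map, w: (k, k) kernel with odd k."""
    H, W = x.shape
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, p)
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convgru_step(x, h, params):
    """One convGRU update: x is the current frame's feature map,
    h is the hidden state carried over from the previous frame.
    params = (Wz, Uz, Wr, Ur, Wn, Un), all (k, k) kernels (illustrative)."""
    Wz, Uz, Wr, Ur, Wn, Un = params
    z = sigmoid(conv2d_same(x, Wz) + conv2d_same(h, Uz))   # update gate
    r = sigmoid(conv2d_same(x, Wr) + conv2d_same(h, Ur))   # reset gate
    n = np.tanh(conv2d_same(x, Wn) + conv2d_same(r * h, Un))  # candidate state
    return (1.0 - z) * h + z * n
```

Because the gates are spatial convolutions, the recurrence propagates saliency evidence between neighboring frames while keeping the 2D layout of the feature map, which is what lets the hidden state enforce consistency of the salient region over time.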
Data Availability
The datasets generated and/or analysed during the current study are available in the DHF1K repository, https://github.com/wenguanwang/DHF1K.
Acknowledgements
This research was supported by the National Science and Technology Major Project (Grant No. 2020YFA0713504), the National Natural Science Foundation of China (Nos. 62376238, 62372170), and the Scientific Research Foundation of the Education Department of Hunan Province of China (Grant No. 21A0109).
Ethics declarations
Conflicts of interest
The authors declare that they have no conflicts of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xiao, F., Luo, H., Zhang, W. et al. Fusion hierarchy motion feature for video saliency detection. Multimed Tools Appl 83, 32301–32320 (2024). https://doi.org/10.1007/s11042-023-16593-2