A systematic review of deep learning frameworks for moving object segmentation

Multimedia Tools and Applications

Abstract

Moving Object Segmentation (MOS) has attracted growing interest among machine learning and computer vision researchers in recent years. This area of dynamic scene understanding aims to understand every pixel of a video by segmenting objects across a frame sequence from beginning to end. The goal is to generate temporally consistent and accurate pixel masks for the target object in a video. This paper systematically surveys the available state-of-the-art literature in the field of MOS. It includes a detailed analysis of semi-supervised, unsupervised, interactive, and referring MOS. Next, various deep learning-based approaches for MOS are elaborated. In Section 4, the basic architecture of MOS is explained, along with a summary of the benchmark datasets currently in use. Further, evaluation metrics and qualitative and quantitative results for MOS are summarized and discussed. Finally, the latest available results are provided and future research directions for MOS are detailed.

Data Availability

The datasets generated and/or analysed during the current study are available in the DAVIS and YouTube-VOS repositories, at https://davischallenge.org/ and https://youtube-vos.org/, respectively.

Author information

Corresponding author

Correspondence to Dipika Gupta.

Ethics declarations

Funding and/or Conflicts of interests/Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Manish Kumar and Sachin Chaudhary contributed equally to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gupta, D., Kumar, M. & Chaudhary, S. A systematic review of deep learning frameworks for moving object segmentation. Multimed Tools Appl 83, 24715–24748 (2024). https://doi.org/10.1007/s11042-023-16417-3

  • DOI: https://doi.org/10.1007/s11042-023-16417-3
