Abstract
Moving Object Segmentation (MOS) has attracted growing interest among machine learning and computer vision researchers in recent years. As a branch of dynamic scene understanding, it aims to understand every pixel of a video by segmenting objects throughout the frame sequence, from the first frame to the last. The goal is to generate temporally consistent and accurate pixel masks for the target object in a video. This paper systematically reviews the state-of-the-art literature on MOS. It includes a detailed analysis of semi-supervised, unsupervised, interactive, and referring MOS. Next, various deep learning-based approaches for MOS are elaborated. Section 4 explains the basic architecture of MOS and summarizes the benchmark datasets currently in use. Further, the evaluation metrics are summarized, and qualitative and quantitative results for MOS are discussed. Finally, the latest available results are reported and future research directions for MOS are outlined.
Data Availability
The datasets generated and/or analysed during the current study are available in the DAVIS and YouTube-VOS repositories, at https://davischallenge.org/ and https://youtube-vos.org/, respectively.
Ethics declarations
Funding and/or Conflicts of interests/Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Manish Kumar and Sachin Chaudhary contributed equally to this work.
About this article
Cite this article
Gupta, D., Kumar, M. & Chaudhary, S. A systematic review of deep learning frameworks for moving object segmentation. Multimed Tools Appl 83, 24715–24748 (2024). https://doi.org/10.1007/s11042-023-16417-3
DOI: https://doi.org/10.1007/s11042-023-16417-3