Article

Learning What to Learn for Video Object Segmentation

Authors: Felix Järemo Lawin, Martin Danelljan, Andreas Robinson, Michael Felsberg, Radu Timofte

Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II

Pages 777–794

https://doi.org/10.1007/978-3-030-58536-5_46

Published: 23 August 2020

Abstract

Video object segmentation (VOS) is a highly challenging problem, since the target object is only defined by a first-frame reference mask during inference. The problem of how to capture and utilize this limited information to accurately segment the target remains a fundamental research question. We address this by introducing an end-to-end trainable VOS architecture that integrates a differentiable few-shot learner. Our learner is designed to predict a powerful parametric model of the target by minimizing a segmentation error in the first frame. We further go beyond the standard few-shot learning paradigm by learning what our target model should learn in order to maximize segmentation accuracy. We perform extensive experiments on standard benchmarks. Our approach sets a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5, corresponding to a 2.6% relative improvement over the previous best result. The code and models are available at https://github.com/visionml/pytracking.
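The idea of predicting a target model by minimizing a first-frame segmentation error can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes a linear per-pixel target model, a regularized squared error, and a few steepest-descent steps with an exact step length. The synthetic features produced by `make_frame` stand in for a deep backbone, and all function names here are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_target_model(features, mask, num_steps=10, reg=1e-2):
    """Fit linear weights w by steepest descent on
    L(w) = ||Xw - y||^2 / (2n) + reg * ||w||^2 / 2  (assumed loss)."""
    n, dim = len(features), len(features[0])
    weights = [0.0] * dim
    for _ in range(num_steps):
        residual = [dot(f, weights) - m for f, m in zip(features, mask)]
        grad = [sum(f[k] * r for f, r in zip(features, residual)) / n
                + reg * weights[k] for k in range(dim)]
        # Curvature along the gradient: g^T H g = ||Xg||^2 / n + reg * ||g||^2
        proj = [dot(f, grad) for f in features]
        curvature = dot(proj, proj) / n + reg * dot(grad, grad)
        if curvature <= 1e-12:
            break
        step = dot(grad, grad) / curvature  # exact line search for a quadratic
        weights = [w - step * g for w, g in zip(weights, grad)]
    return weights


def segment(features, weights, threshold=0.5):
    """Apply the fitted target model to the features of any frame."""
    return [1 if dot(f, weights) > threshold else 0 for f in features]


def box_mask(size, top, left, side):
    """Flat 0/1 mask of a square object in a size x size frame."""
    return [1 if top <= r < top + side and left <= c < left + side else 0
            for r in range(size) for c in range(size)]


def make_frame(mask, rng, dim=4, noise=0.05):
    """Hypothetical backbone stand-in: target pixels respond in channel 0,
    background pixels in channel 1, plus Gaussian noise in every channel."""
    frame = []
    for m in mask:
        f = [rng.gauss(0.0, noise) for _ in range(dim)]
        f[0] += m
        f[1] += 1 - m
        frame.append(f)
    return frame


def iou(pred, target):
    inter = sum(p and t for p, t in zip(pred, target))
    union = sum(p or t for p, t in zip(pred, target))
    return inter / union if union else 1.0


if __name__ == "__main__":
    rng = random.Random(0)
    first_mask = box_mask(16, 2, 2, 8)   # reference mask, first frame
    later_mask = box_mask(16, 6, 5, 8)   # object has moved in a later frame
    weights = fit_target_model(make_frame(first_mask, rng), first_mask)
    pred = segment(make_frame(later_mask, rng), weights)
    print("IoU on later frame:", iou(pred, later_mask))
```

Because every operation in `fit_target_model` is differentiable, an optimizer of this kind could in principle be unrolled inside a network and trained end to end, which is the property the abstract emphasizes.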


Published In

Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II

Aug 2020

839 pages

ISBN: 978-3-030-58535-8

DOI: 10.1007/978-3-030-58536-5

Editors: Andrea Vedaldi (University of Oxford, Oxford, UK), Horst Bischof (Graz University of Technology, Graz, Austria), Thomas Brox (University of Freiburg, Freiburg im Breisgau, Germany), Jan-Michael Frahm (University of North Carolina at Chapel Hill, Chapel Hill, NC, USA)

© Springer Nature Switzerland AG 2020.

Publisher

Springer-Verlag

Berlin, Heidelberg
