Abstract
Video Object Segmentation (VOS) is a fundamental computer vision task with many real-world applications, and it is challenging because of distractors and background clutter. Many existing online learning approaches have limited practical value because of the high computational cost of fine-tuning network parameters. Matching-based and propagation-based approaches, on the other hand, are computationally efficient but may suffer from degraded performance in cluttered backgrounds and from object drift. To handle these issues, we propose an offline, end-to-end model that learns guided feature transfer for VOS. We introduce guided feature modulation based on the target mask to capture video context information, and we use a generative appearance model to provide cues for both the target and the background. The proposed guided feature modulation learns target semantic information from modulation activations, while the generative appearance model learns the probability that each pixel belongs to the target or to the background. In addition, low-resolution features from deeper networks may not capture global contextual information, which can reduce performance during feature refinement. We therefore also propose a guided pooled decoder that learns both global and local context information for better feature refinement. Evaluations on two VOS benchmark datasets, DAVIS2016 and DAVIS2017, show excellent performance of the proposed framework compared with more than 20 existing state-of-the-art methods.
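The two mask-derived guidance signals named in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation and rests on explicit assumptions: the guided feature modulation is approximated by an AdaIN-style per-channel affine transform whose scale and shift are computed directly from the target region's statistics in a reference frame (a trained network would presumably learn its modulation parameters instead), and the generative appearance model is simplified to one diagonal Gaussian for the target and one for the background, combined with Bayes' rule into a per-pixel target probability. The function names `mask_guided_modulation`, `fit_appearance_model`, and `target_probability` are hypothetical, and the guided pooled decoder is not covered.

```python
import numpy as np


def mask_guided_modulation(feat, ref_feat, ref_mask, eps=1e-5):
    """AdaIN-style stand-in for mask-guided feature modulation (assumption).

    feat, ref_feat: (C, H, W) features of the current and reference frames.
    ref_mask: (H, W) reference-frame target mask with values in [0, 1].
    """
    w = ref_mask[None]                                   # (1, H, W), broadcast over channels
    n = w.sum() + eps
    # Per-channel statistics of the target region in the reference frame.
    tgt_mean = (ref_feat * w).sum(axis=(1, 2)) / n       # (C,)
    tgt_var = (((ref_feat - tgt_mean[:, None, None]) ** 2) * w).sum(axis=(1, 2)) / n
    # Per-channel statistics of the whole current frame.
    cur_mean = feat.mean(axis=(1, 2), keepdims=True)
    cur_std = np.sqrt(feat.var(axis=(1, 2), keepdims=True) + eps)
    # Normalize the current frame, then scale (gamma) and shift (beta)
    # it with the target's statistics.
    gamma = np.sqrt(tgt_var + eps)[:, None, None]
    beta = tgt_mean[:, None, None]
    return gamma * (feat - cur_mean) / cur_std + beta


def fit_appearance_model(feat, mask, eps=1e-5):
    """Fit one diagonal Gaussian to target pixels and one to background pixels."""
    c = feat.shape[0]
    x = feat.reshape(c, -1).T                            # (H*W, C), one row per pixel
    is_tgt = mask.reshape(-1) > 0.5
    assert is_tgt.any() and (~is_tgt).any(), "mask needs target and background pixels"
    model = []
    for sel in (is_tgt, ~is_tgt):                        # index 0: target, index 1: background
        # (mean, variance, class prior) for each class.
        model.append((x[sel].mean(axis=0), x[sel].var(axis=0) + eps, sel.mean()))
    return model


def target_probability(feat, model):
    """Per-pixel posterior P(target | feature) computed with Bayes' rule."""
    c, h, w = feat.shape
    x = feat.reshape(c, -1).T
    # Log prior plus diagonal-Gaussian log-likelihood for each class.
    log_joint = np.stack([
        np.log(prior) - 0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)
        for mu, var, prior in model
    ])                                                   # (2, H*W)
    log_joint -= log_joint.max(axis=0, keepdims=True)    # numerical stability
    p = np.exp(log_joint)
    return (p[0] / p.sum(axis=0)).reshape(h, w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mask = np.zeros((8, 8))
    mask[2:6, 2:6] = 1.0
    # Toy reference frame: target features near +2, background near -2.
    ref = rng.normal(0.0, 0.3, (4, 8, 8)) + np.where(mask > 0, 2.0, -2.0)
    cur = np.roll(ref, 1, axis=2)                        # toy next frame: target shifted right
    prob = target_probability(cur, fit_appearance_model(ref, mask))
    mod = mask_guided_modulation(cur, ref, mask)
    print(prob.round(2))
    print(mod.shape)
```

In a complete model, the soft target map and the modulated features would typically be combined (for example, concatenated along the channel axis) before decoding; that wiring is likewise an assumption rather than something the abstract specifies.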
Acknowledgment
This study was supported by the BK21 FOUR project (AI-driven Convergence Software Education Research Program) funded by the Ministry of Education, School of Computer Science and Engineering, Kyungpook National University, Korea (4199990214394).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fiaz, M., Mahmood, A., Shahzad Farooq, S., Ali, K., Shaheryar, M., Jung, S.K. (2022). Video Object Segmentation Based on Guided Feature Transfer Learning. In: Sumi, K., Na, I.S., Kaneko, N. (eds) Frontiers of Computer Vision. IW-FCV 2022. Communications in Computer and Information Science, vol 1578. Springer, Cham. https://doi.org/10.1007/978-3-031-06381-7_14
DOI: https://doi.org/10.1007/978-3-031-06381-7_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06380-0
Online ISBN: 978-3-031-06381-7
eBook Packages: Computer Science, Computer Science (R0)