Abstract
Referring Expression Segmentation (RES) aims to provide a segmentation mask of the target object in an image referred to by a text description (i.e., a referring expression). Existing methods require large-scale mask annotations and do not generalize well to unseen/zero-shot scenarios. To address these issues, we propose a weakly-supervised bootstrapping architecture for RES with several new algorithmic innovations. To the best of our knowledge, ours is the first approach that considers only a fraction of both mask and box annotations (shown in Fig. 1 and Table 1) for training. To enable principled training of models in such low-annotation settings, improve image-text region-level alignment, and further enhance spatial localization of the target object in the image, we propose a Cross-modal Fusion with Attention Consistency module. For automatic pseudo-labeling of unlabeled samples, we introduce a novel Mask Validity Filtering routine based on a spatially aware zero-shot proposal scoring approach. Extensive experiments show that with just 30% annotations, our model SafaRi achieves mIoUs of 59.31 and 48.26 on the RefCOCO+ testA and testB splits, compared to the 58.93 and 48.19 obtained by the fully-supervised SOTA method SeqTR. SafaRi also outperforms SeqTR by 11.7% (on RefCOCO+ testA) and 19.6% (on RefCOCO+ testB) in a fully-supervised setting, and demonstrates strong generalization capabilities in unseen/zero-shot tasks. Our project page can be found at https://sayannag.github.io/safari_eccv2024/.
S. Nag—Work done during internship at Adobe Research.
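The abstract reports results in mean intersection-over-union (mIoU), the standard RES metric obtained by averaging the per-sample IoU between predicted and ground-truth masks over an evaluation split. For reference, below is a minimal sketch of that computation; the function names and toy masks are illustrative assumptions, not code from the paper.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary segmentation masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection / union) if union > 0 else 0.0

def mean_iou(preds, gts) -> float:
    """mIoU: average the per-sample mask IoU over the evaluation split."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(preds, gts)]))

# Toy usage: the masks share 2 of 6 total foreground pixels, so IoU = 2/6.
pred = np.zeros((4, 4), dtype=bool); pred[0, :] = True       # 4 pixels
gt = np.zeros((4, 4), dtype=bool); gt[0:2, 2:] = True        # 4 pixels
print(mask_iou(pred, gt))  # 0.333...
```

Note that mIoU weights every expression-image pair equally, so small-object failures count as much as large-object ones; this is why the reported gaps of a fraction of a point at 30% annotations are meaningful.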
References
Bolya, D., Zhou, C., Xiao, F., Lee, Y.J.: YOLACT: real-time instance segmentation. In: ICCV, pp. 9157–9166 (2019)
Botach, A., Zheltonozhskii, E., Baskin, C.: End-to-end referring video object segmentation with multimodal transformers. In: CVPR, pp. 4985–4995 (2022)
Chen, D.J., Jia, S., Lo, Y.C., Chen, H.T., Liu, T.L.: See-through-text grouping for referring image segmentation. In: ICCV, pp. 7454–7463 (2019)
Chen, K., et al.: Hybrid task cascade for instance segmentation. In: CVPR, pp. 4974–4983 (2019)
Chen, Y.-C., et al.: UNITER: UNiversal image-TExt representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7
Dai, J., He, K., Sun, J.: Instance-aware semantic segmentation via multi-task network cascades. In: CVPR, pp. 3150–3158 (2016)
Ding, H., Liu, C., Wang, S., Jiang, X.: Vision-language transformer and query generation for referring segmentation. In: ICCV, pp. 16321–16330 (2021)
Dou, Z.Y., et al.: An empirical study of training end-to-end vision-and-language transformers. In: CVPR, pp. 18166–18176 (2022)
Feng, G., Hu, Z., Zhang, L., Lu, H.: Encoder fusion network with co-attention embedding for referring image segmentation. In: CVPR, pp. 15506–15515 (2021)
Gavrilyuk, K., Ghodrati, A., Li, Z., Snoek, C.G.: Actor and action video segmentation from a sentence. In: CVPR, pp. 5958–5966 (2018)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV, pp. 2961–2969 (2017)
Hu, R., Rohrbach, M., Darrell, T.: Segmentation from natural language expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 108–124. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_7
Hu, Z., Feng, G., Sun, J., Zhang, L., Lu, H.: Bi-directional relationship inferring network for referring image segmentation. In: CVPR, pp. 4424–4433 (2020)
Huang, S., et al.: Referring image segmentation via cross-modal progressive comprehension. In: CVPR, pp. 10488–10497 (2020)
Hui, T., et al.: Linguistic structure guided context modeling for referring image segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12355, pp. 59–75. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58607-2_4
Jing, Y., Kong, T., Wang, W., Wang, L., Li, L., Tan, T.: Locate then segment: a strong pipeline for referring image segmentation. In: CVPR, pp. 9858–9867 (2021)
Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: MDETR: modulated detection for end-to-end multi-modal understanding. In: ICCV, pp. 1780–1790 (2021)
Khoreva, A., Rohrbach, A., Schiele, B.: Video object segmentation with language referring expressions. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11364, pp. 123–141. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20870-7_8
Kim, D., Kim, N., Lan, C., Kwak, S.: Shatter and gather: learning referring image segmentation with text supervision. In: ICCV, pp. 15547–15557 (2023)
Kim, N., Kim, D., Lan, C., Zeng, W., Kwak, S.: ReSTR: convolution-free referring image segmentation using transformers. In: CVPR, pp. 18145–18154 (2022)
Lee, J., Lee, S., Nam, J., Yu, S., Do, J., Taghavi, T.: Weakly supervised referring image segmentation with intra-chunk and inter-chunk consistency. In: ICCV, pp. 21870–21881 (2023)
Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: vision and language representation learning with momentum distillation. In: NeurIPS (2021)
Li, L.H., et al.: Grounded language-image pre-training. In: CVPR, pp. 10965–10975 (2022)
Li, M., Sigal, L.: Referring transformer: a one-step approach to multi-task visual grounding. In: NeurIPS (2021)
Li, R., et al.: Referring image segmentation via recurrent refinement networks. In: CVPR, pp. 5745–5753 (2018)
Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8
Liu, C., Lin, Z., Shen, X., Yang, J., Lu, X., Yuille, A.: Recurrent multimodal interaction for referring image segmentation. In: ICCV, pp. 1271–1280 (2017)
Liu, F., et al.: Referring image segmentation using text supervision. In: ICCV, pp. 22124–22134 (2023)
Liu, J., et al.: PolyFormer: referring image segmentation as sequential polygon generation. In: CVPR, pp. 18653–18663 (2023)
Liu, S., Qi, L., Qin, H., Shi, J., Jia, J.: Path aggregation network for instance segmentation. In: CVPR, pp. 8759–8768 (2018)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: ICCV, pp. 10012–10022 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: NeurIPS (2019)
Lu, J., Goswami, V., Rohrbach, M., Parikh, D., Lee, S.: 12-in-1: multi-task vision and language representation learning. In: CVPR, pp. 10437–10446 (2020)
Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: CVPR, pp. 11–20 (2016)
Margffoy-Tuay, E., Pérez, J.C., Botero, E., Arbeláez, P.: Dynamic multimodal instance segmentation guided by natural language queries. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 656–672. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01252-6_39
Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: fully convolutional neural networks for volumetric medical image segmentation. In: 3DV, pp. 565–571. IEEE (2016)
Nagaraja, V.K., Morariu, V.I., Davis, L.S.: Modeling context between objects for referring expression understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 792–807. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_48
Qu, M., Wu, Y., Wei, Y., Liu, W., Liang, X., Zhao, Y.: Learning to segment every referring object point by point. In: CVPR, pp. 3021–3030 (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML, pp. 8748–8763 (2021). https://proceedings.mlr.press/v139/radford21a.html
Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., Dosovitskiy, A.: Do vision transformers see like convolutional neural networks? NeurIPS 34, 12116–12128 (2021)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. NeurIPS 28 (2015)
Shi, H., Li, H., Meng, F., Wu, Q.: Key-word-aware network for referring expression image segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 38–54. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_3
Shtedritski, A., Rupprecht, C., Vedaldi, A.: What does CLIP know about a red circle? Visual prompt engineering for VLMs. arXiv preprint arXiv:2304.06712 (2023)
Strudel, R., Laptev, I., Schmid, C.: Weakly-supervised segmentation of referring expressions. arXiv preprint arXiv:2205.04725 (2022)
Wang, Z., et al.: CRIS: CLIP-driven referring image segmentation. In: CVPR, pp. 11686–11695 (2022)
Wu, J., Jiang, Y., Sun, P., Yuan, Z., Luo, P.: Language as queries for referring video object segmentation. In: CVPR, pp. 4974–4984 (2022)
Yang, S., Xia, M., Li, G., Zhou, H.Y., Yu, Y.: Bottom-up shift and reasoning for referring image segmentation. In: CVPR, pp. 11266–11275 (2021)
Yang, Z., Wang, J., Tang, Y., Chen, K., Zhao, H., Torr, P.H.: LAVT: language-aware vision transformer for referring image segmentation. In: CVPR, pp. 18155–18165 (2022)
Yang, Z., et al.: UniTAB: unifying text and box outputs for grounded vision-language modeling. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13696, pp. 521–539. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20059-5_30
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: CVPR, pp. 10502–10511 (2019)
Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 69–85. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_5
Zhang, H., et al.: GLIPv2: unifying localization and vision-language understanding. NeurIPS 35, 36067–36080 (2022)
Zhang, P., et al.: VinVL: revisiting visual representations in vision-language models. In: CVPR, pp. 5579–5588 (2021)
Zhu, C., et al.: SeqTR: a simple yet universal network for visual grounding. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13695, pp. 598–615. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19833-5_35
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Nag, S., Goswami, K., Karanam, S. (2025). SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15102. Springer, Cham. https://doi.org/10.1007/978-3-031-72784-9_27
Print ISBN: 978-3-031-72783-2
Online ISBN: 978-3-031-72784-9