Abstract
Hierarchical segmentation entails creating segmentations at varying levels of granularity. We introduce the first hierarchical semantic segmentation dataset with subpart annotations for natural images, which we call SPIN (SubPartImageNet). We also introduce two novel evaluation metrics to evaluate how well algorithms capture spatial and semantic relationships across hierarchical levels. We benchmark modern models across three different tasks and analyze their strengths and weaknesses across objects, parts, and subparts. To facilitate community-wide progress, we publicly release our dataset at https://joshmyersdean.github.io/spin/index.html.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
No code is publicly available for VDT, and ViRReq does not offer complete code. At the time of writing, Semantic-SAM has not released their semantic prediction code.
- 2.
We prompted GPT-4 with “Please list the canonical subparts of a <object>-<part>. Only include subparts that are clearly visible and recognizable to a layperson.”.
- 3.
For a quadruped, for instance, pairs such as \(\{\)(eyes, head), (chest, torso), (torso, quadruped)\(\}\) could be present in \(\mathcal {R}\).
- 4.
For subparts, we refer to the parent object rather than the parent part, as broader context aids in processing finer details [47].
- 5.
Results are shown in the supplementary materials for two prompts asking “Is the category not present” and “Is the [different category] present”.
References
Explore images. https://support.apple.com/guide/iphone/use-voiceover-for-images-and-videos-iph37e6b3844/ios
Achiam, J., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
Berglund, L., et al.: The reversal curse: LLMs trained on “a is b” fail to learn “b is a”. arXiv preprint arXiv:2309.12288 (2023)
Cai, M., et al.: Making large multimodal models understand arbitrary visual prompts. In: IEEE Conference on Computer Vision and Pattern Recognition (2024)
Chang, A.X., et al.: ShapeNet: an information-rich 3D moel repository. arXiv preprint arXiv:1512.03012 (2015)
Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: unleashing multimodal LLM’s referential dialogue magic. arXiv preprint arXiv:2306.15195 (2023)
Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., Yuille, A.: Detect what you can: detecting and representing objects using holistic models and body parts. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1971–1978 (2014)
Deitke, M., et al.: RoboTHOR: an open simulation-to-real embodied AI platform. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Deng, B., Genova, K., Yazdani, S., Bouaziz, S., Hinton, G., Tagliasacchi, A.: CVXNet: learnable convex decomposition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 31–44 (2020)
Desai, K., Nickel, M., Rajpurohit, T., Johnson, J., Vedantam, S.R.: Hyperbolic image-text representations. In: International Conference on Machine Learning, pp. 7694–7731. PMLR (2023)
Ding, M., et al.: Visual dependency transformers: Dependency tree emerges from reversed attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14528–14539 (2023)
Dosovitskiy, A., et al.: An image is worth 16\(\times \)16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Geng, H., et al.: GAPartNet: cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7081–7091 (2023)
de Geus, D., Meletis, P., Lu, C., Wen, X., Dubbelman, G.: Part-aware panoptic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5485–5494 (2021)
Gong, K., Liang, X., Li, Y., Chen, Y., Yang, M., Lin, L.: Instance-level human parsing via part grouping network. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 770–785 (2018)
He, J., Chen, J., Lin, M.X., Yu, Q., Yuille, A.L.: Compositor: bottom-up clustering and compositing for robust part and object segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11259–11268 (2023)
He, J., et al.: PartImageNet: a large, high-quality dataset of parts. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13668, pp. 128–145. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20074-8_8
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hong, Y., Li, Q., Zhu, S.C., Huang, S.: VLGrammar: grounded grammar induction of vision and language. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1665–1674 (2021)
Hong, Y., Yi, L., Tenenbaum, J., Torralba, A., Gan, C.: PTR: a benchmark for part-based conceptual, relational, and physical reasoning. Adv. Neural. Inf. Process. Syst. 34, 17427–17440 (2021)
Jia, M., et al.: Fashionpedia: ontology, segmentation, and an attribute localization dataset. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020, Part I. LNCS, vol. 12346, pp. 316–332. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_19
Kirillov, A., et al.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4015–4026 (2023)
Koo, J., Huang, I., Achlioptas, P., Guibas, L.J., Sung, M.: PartGlot: learning shape part segmentation from language reference games. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16505–16514 (2022)
Lai, X., et al.: LISA: reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692 (2023)
Lee, J., Peng, Y.H., Herskovitz, J., Guo, A.: Image explorer: multi-layered touch exploration to make images accessible. In: Proceedings of the 23rd International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS 2021. Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3441852.3476548
Li, F., et al.: Semantic-SAM: segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767 (2023)
Li, L., Zhou, T., Wang, W., Li, J., Yang, Y.: Deep hierarchical semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1246–1257 (2022)
Li, T., Gupta, V., Mehta, M., Srikumar, V.: A logic-driven framework for consistency of neural models. arXiv preprint arXiv:1909.00126 (2019)
Li, X., Xu, S., Yang, Y., Cheng, G., Tong, Y., Tao, D.: Panoptic-partformer: learning a unified model for panoptic part segmentation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13687, pp. 729–747. Springer, Cham (2022)
Li, X., et al.: Panoptic-PartFormer++: a unified and decoupled view for panoptic part segmentation. arXiv preprint arXiv:2301.00954 (2023)
Liang, X., Gong, K., Shen, X., Lin, L.: Look into person: joint body parsing & pose estimation network and a new benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 41(4), 871–885 (2018)
Liang, X., Shen, X., Feng, J., Lin, L., Yan, S.: Semantic object parsing with graph LSTM. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016, Part I. LNCS, vol. 9905, pp. 125–143. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_8
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part V. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, J., Min, S., Zettlemoyer, L., Choi, Y., Hajishirzi, H.: Infini-gram: scaling unbounded n-gram language models to a trillion tokens. arXiv preprint arXiv:2401.17377 (2024)
Liu, Q., et al.: CGPart: a part segmentation dataset based on 3D computer graphics models. arXiv preprint arXiv:2103.14098 (2021)
Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings Eighth IEEE International Conference on Computer Vision, ICCV 2001, vol. 2, pp. 416–423. IEEE (2001)
Michieli, U., Zanuttigh, P.: Edge-aware graph matching network for part-based semantic segmentation. Int. J. Comput. Vision 130(11), 2797–2821 (2022)
Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995). https://doi.org/10.1145/219717.219748
Mo, K., et al.: PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 909–918 (2019)
Mo, K., et al.: PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Myers-Dean, J., Fan, Y., Price, B., Chan, W., Gurari, D.: Interactive segmentation for diverse gesture types without context. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 7198–7208 (2024)
Nair, V., Zhu, H.H., Smith, B.A.: ImageAssist: tools for enhancing touchscreen-based image exploration systems for blind and low vision users. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI 2023. Association for Computing Machinery, New York (2023). https://doi.org/10.1145/3544548.3581302
Peng, Z., et al.: Kosmos-2: grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Ramanathan, V., et al.: PACO: parts and attributes of common objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7141–7151 (2023)
Rasheed, H., et al.: GLaMM: pixel grounding large multimodal model. arXiv preprint arXiv:2311.03356 (2023)
Raymond, W., Gibbs, J., Matlock, T.: Psycholinguistics and mental representations (2000)
Ren, Z., et al.: PixeLLM: pixel reasoning with large multimodal model (2023)
Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 (2015)
Song, X., et al.: ApolloCar3D: a large 3D car instance understanding benchmark for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5452–5462 (2019)
Sun, P., et al.: Going denser with open-vocabulary part segmentation. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 15407–15419 (2023). https://api.semanticscholar.org/CorpusID:258762519
Tang, C., Xie, L., Zhang, X., Hu, X., Tian, Q.: Visual recognition by request. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15265–15274 (2023)
Tong, K., Wu, Y.: Deep learning-based detection from the perspective of small or tiny objects: a survey. Image Vis. Comput. 123, 104471 (2022)
Touvron, H., et al.: LLaMA: open and efficient foundation language models (2023)
Tsogkas, S., Kokkinos, I., Papandreou, G., Vedaldi, A.: Deep learning for semantic part segmentation with high-level guidance. arXiv preprint arXiv:1505.02438 (2015)
Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The caltech-UCSD birds-200-2011 dataset (2011)
Wang, J., Yuille, A.L.: Semantic part segmentation using compositional model combining shape and appearance. In: Proceedings of the IEEE Conference on Computer Vision And Pattern Recognition, pp. 1788–1797 (2015)
Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., Yuille, A.L.: Joint object and part segmentation using deep learned potentials. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1573–1581 (2015)
Wang, W., et al.: CogVLM: visual expert for pretrained language models. arXiv preprint arXiv:2311.03079 (2023)
Wang, X., Li, S., Kallidromitis, K., Kato, Y., Kozuka, K., Darrell, T.: Hierarchical open-vocabulary universal image segmentation. Adv. Neural Inf. Process. Syst. 36 (2024)
Wei, M., Yue, X., Zhang, W., Kong, S., Liu, X., Pang, J.: OV-PARTS: towards open-vocabulary part segmentation. Adv. Neural Inf. Process. Syst 36 (2024)
Wu, T.H., et al.: See, say, and segment: teaching LMMs to overcome false premises. arXiv preprint arXiv:2312.08366 (2023)
Xiang, F., et al.: SAPIEN: a simulated part-based interactive environment. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
You, H., et al.: FERRET: refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704 (2023)
Yu, F., Liu, K., Zhang, Y., Zhu, C., Xu, K.: PartNet: a recursive part decomposition network for fine-grained and hierarchical shape segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9491–9500 (2019)
Yuan, Y., et al.: Osprey: pixel understanding with visual instruction tuning (2023)
Zhao, J., Li, J., Cheng, Y., Sim, T., Yan, S., Feng, J.: Understanding humans in crowded scenes: deep nested adversarial learning and a new benchmark for multi-human parsing. In: Proceedings of the 26th ACM international conference on Multimedia, pp. 792–800 (2018)
Zheng, S., Yang, F., Kiapour, M.H., Piramuthu, R.: ModaNet: a large-scale street fashion dataset with polygon annotations. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 1670–1678 (2018)
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 633–641 (2017)
Acknowledgments
This work was supported by Adobe Research Gift Funds and utilized the Blanca condo computing resource at the University of Colorado Boulder. Josh Myers-Dean is supported by a NSF GRFP fellowship (#1917573). We thank the crowdworkers for contributing their time for the construction of SPIN and the authors of our benchmarked models for open-sourcing their work.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Myers-Dean, J., Reynolds, J., Price, B., Fan, Y., Gurari, D. (2025). SPIN: Hierarchical Segmentation with Subpart Granularity in Natural Images. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15082. Springer, Cham. https://doi.org/10.1007/978-3-031-72691-0_16
Download citation
DOI: https://doi.org/10.1007/978-3-031-72691-0_16
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72690-3
Online ISBN: 978-3-031-72691-0
eBook Packages: Computer ScienceComputer Science (R0)