Computer Science > Computer Vision and Pattern Recognition

arXiv:2310.15308 (cs)

[Submitted on 23 Oct 2023 (v1), last revised 10 Jun 2024 (this version, v4)]

Title: SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

Authors: Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, Hadi Pouransari

Abstract: The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise. Our method integrates techniques of multi-task learning, continual learning, and distillation. Further, it demands significantly less computational cost compared to traditional multi-task training from scratch, and it only needs a small fraction of the pre-training datasets that were initially used to train individual models. By applying our method to SAM and CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM and CLIP into a single vision transformer. Compared with deploying SAM and CLIP independently, our merged model, SAM-CLIP, reduces storage and compute costs for inference, making it well-suited for edge device applications. We show that SAM-CLIP not only retains the foundational strengths of SAM and CLIP, but also introduces synergistic functionalities, notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively.
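
For readers unfamiliar with the mean IoU metric quoted in the abstract, the following is a minimal sketch of how it is commonly defined: per-class intersection-over-union between predicted and ground-truth labels, averaged over classes. The function name `mean_iou` and its arguments are illustrative assumptions, not the paper's evaluation code; benchmark protocols typically accumulate counts over a whole dataset and may treat ignore labels specially, which this sketch does not do.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over one flattened label map (hypothetical helper)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels labeled c in both maps (intersection) or in either map (union).
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Four pixels, two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Under this definition, a "+6.8% mean IoU improvement" refers to a gain in this class-averaged score, expressed in percentage points.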

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2310.15308 [cs.CV]
	(or arXiv:2310.15308v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2310.15308

Submission history

From: Haoxiang Wang
[v1] Mon, 23 Oct 2023 19:21:57 UTC (6,394 KB)
[v2] Mon, 20 Nov 2023 00:56:15 UTC (8,063 KB)
[v3] Wed, 22 May 2024 05:53:11 UTC (9,140 KB)
[v4] Mon, 10 Jun 2024 19:19:16 UTC (9,140 KB)
