Abstract
The rapid development of AI systems has been greatly influenced by the emergence of foundation models. A common approach for targeted problems involves fine-tuning these pre-trained foundation models for specific target tasks, resulting in a rapid spread of models fine-tuned across a diverse array of tasks. This work focuses on the problem of merging multiple fine-tunings of the same foundation model derived from a spectrum of auxiliary tasks. We introduce a new, simple method, Model Breadcrumbs, which consists of a sparsely defined set of weights that guides model adaptation within the weight space of a pre-trained model. These breadcrumbs are constructed by taking the difference between the weights of a model before and after fine-tuning, followed by a sparsification process that eliminates weight outliers and negligible perturbations. Our experiments demonstrate the effectiveness of Model Breadcrumbs in simultaneously improving performance across multiple tasks. This contribution aligns with the evolving paradigm of updatable machine learning, reminiscent of the collaborative principles underlying open-source software development, fostering a community-driven effort to reliably update machine learning models. Our method is more efficient and, unlike previous proposals, does not require hyperparameter tuning for each newly added task. Through extensive experimentation involving various models, tasks, and modalities, we establish that integrating Model Breadcrumbs offers a simple, efficient, and highly effective approach for constructing multi-task models and facilitating updates to foundation models. The code to reproduce our results is publicly available at https://github.com/rezazzr/breadcrumbs.
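To make the recipe described above concrete, the following is a minimal sketch of how sparse task vectors could be built and merged for several fine-tunings of a shared pre-trained model. It assumes plain PyTorch state dicts; the function and parameter names (`breadcrumb_mask`, `merge_breadcrumbs`, `alpha`, `lower_q`, `upper_q`) and the per-layer quantile-based masking are illustrative assumptions rather than the authors' reference implementation, which is available at the repository linked above.

```python
import torch

def breadcrumb_mask(delta: torch.Tensor, lower_q: float = 0.85, upper_q: float = 0.99) -> torch.Tensor:
    """Keep only entries whose magnitude lies between two per-layer quantiles,
    discarding negligible perturbations (below lower_q) and outliers (above upper_q)."""
    mags = delta.abs().flatten()
    n = mags.numel()
    k_lo = max(1, int(lower_q * n))
    k_hi = max(k_lo, int(upper_q * n))
    lo = mags.kthvalue(k_lo).values  # magnitude below which changes are treated as noise
    hi = mags.kthvalue(k_hi).values  # magnitude above which changes are treated as outliers
    return (delta.abs() > lo) & (delta.abs() <= hi)

def merge_breadcrumbs(pretrained: dict, finetuned_models: list, alpha: float = 0.3,
                      lower_q: float = 0.85, upper_q: float = 0.99) -> dict:
    """Merge several fine-tunings of the same pre-trained model by adding their
    sparsified weight differences ("breadcrumbs") back onto the pre-trained weights."""
    merged = {name: w.clone() for name, w in pretrained.items()}
    for finetuned in finetuned_models:
        for name, w_pre in pretrained.items():
            delta = finetuned[name] - w_pre          # weight difference induced by fine-tuning
            if not delta.is_floating_point() or delta.numel() < 2:
                continue                             # skip integer buffers and scalar parameters
            mask = breadcrumb_mask(delta, lower_q, upper_q)
            merged[name] += alpha * delta * mask     # add only the sparse breadcrumb trail
    return merged
```

In this sketch a single scaling factor `alpha` and one pair of sparsity thresholds are shared across all merged tasks, mirroring the claim that no per-task hyperparameter tuning is required; the particular default values shown are assumptions, not values taken from the paper.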
Acknowledgements
We acknowledge funding from the NSERC Discovery Grant RGPIN-2021-04104 and FRQNT New Scholar. This research was enabled in part by compute resources provided by Digital Research Alliance of Canada (the Alliance) and Calcul Québec.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Davari, M., Belilovsky, E. (2025). Model Breadcrumbs: Scaling Multi-task Model Merging with Sparse Masks. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15133. Springer, Cham. https://doi.org/10.1007/978-3-031-73226-3_16
DOI: https://doi.org/10.1007/978-3-031-73226-3_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73225-6
Online ISBN: 978-3-031-73226-3
eBook Packages: Computer Science, Computer Science (R0)