
DOI: 10.1145/3661167.3661215
short-paper
Open access

Automated categorization of pre-trained models in software engineering: A case study with a Hugging Face dataset

Published: 18 June 2024

Abstract

Software engineering (SE) activities have been revolutionized by the advent of pre-trained models (PTMs), defined as large machine learning (ML) models that can be fine-tuned to perform specific SE tasks. However, users with limited expertise may need help to select the appropriate model for their current task. To tackle the issue, the Hugging Face (HF) platform simplifies the use of PTMs by collecting, storing, and curating several models. Nevertheless, the platform currently lacks a comprehensive categorization of PTMs designed specifically for SE, i.e., the existing tags are more suited to generic ML categories.
This paper introduces an approach to bridge this gap by enabling the automatic classification of PTMs for SE tasks. First, we use a public dump of HF to extract PTM information, including model documentation and associated tags. Then, we employ a semi-automated method to identify SE tasks and their corresponding PTMs from the existing literature. The approach creates an initial mapping between HF tags and specific SE tasks, using a similarity-based strategy to identify PTMs with relevant tags. The evaluation shows that model cards are informative enough to classify PTMs by their pipeline tag. Moreover, we provide a mapping between SE tasks and stored PTMs by relying on model names.
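The similarity-based mapping between HF tags and SE tasks can be illustrated with a minimal sketch. The abstract does not state which similarity measure the authors use, so this example assumes a plain string-similarity ratio from Python's standard `difflib` module. The task list, the example tags, and the 0.8 threshold are all hypothetical values chosen for illustration, not data from the paper.

```python
from difflib import SequenceMatcher

# Hypothetical SE tasks, invented for illustration only (not from the paper).
SE_TASKS = ["code generation", "code summarization", "bug fixing"]


def normalize(text: str) -> str:
    """Lowercase a label and treat hyphens and underscores as spaces."""
    return text.lower().replace("-", " ").replace("_", " ").strip()


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1] computed by difflib on the normalized labels."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def map_tags_to_tasks(tags, tasks=SE_TASKS, threshold=0.8):
    """Map each tag to its most similar task, or None if below the threshold.

    The threshold is an assumed cut-off; a real study would tune it.
    """
    mapping = {}
    for tag in tags:
        best = max(tasks, key=lambda task: similarity(tag, task))
        mapping[tag] = best if similarity(tag, best) >= threshold else None
    return mapping


if __name__ == "__main__":
    # Example tags written in the hyphenated style used for HF tags.
    print(map_tags_to_tasks(["code-generation", "image-classification"]))
```

In this sketch a tag with no sufficiently similar task maps to `None`, which mirrors the observation that many existing tags describe generic ML categories rather than SE tasks.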

Cited By

  • (2024) Automatic Categorization of GitHub Actions with Transformers and Few-shot Learning. Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 468-474. DOI: 10.1145/3674805.3690752. Online publication date: 24-Oct-2024

Information & Contributors

Information

Published In

EASE '24: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering
June 2024
728 pages
ISBN: 9798400717017
DOI: 10.1145/3661167
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Hugging Face
  2. Model classification
  3. Pre-trained models

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Funding Sources

  • FRINGE PNRR Project
  • EMELIOT PRIN project
  • TRex-SE PRIN Project

Conference

EASE 2024

Acceptance Rates

Overall Acceptance Rate 71 of 232 submissions, 31%

Article Metrics

  • Downloads (Last 12 months): 117
  • Downloads (Last 6 weeks): 38
Reflects downloads up to 20 Nov 2024
