DOI: 10.1145/3589335.3651582
short-paper
Open access

Who is Creating Malware Repositories on GitHub and Why?

Published: 13 May 2024

Abstract

Recent studies have found thousands of malware source code repositories on GitHub. We present the first study of the origins and motivations behind the creation of such repositories. To that end, we collect and profile the authors of malware repositories using a three-step systematic approach. First, we identify 14K GitHub users who have authored at least one malware repository. Second, we use a pretrained large language model (LLM) to estimate the likelihood of malicious intent for each author, which yields 3339 Malicious, 3354 Likely Malicious, and 7574 Benign authors. To validate the accuracy and reliability of this classification, we manually review 200 randomly selected authors. Third, we analyze the authors' profiles and motivations. We find that Malicious authors often have sparse profiles and focus on creating and spreading malware, while Benign authors typically have complete profiles with a focus on cybersecurity research and education. Likely Malicious authors show varying levels of engagement and ambiguous intentions. We see our study as a key step towards understanding the ecosystem of malware authorship on GitHub.
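As a rough illustration of the numbers in the abstract, the sketch below tallies the three reported classes and draws a 200-author sample for manual review. It is not the authors' pipeline: the helper `sample_for_review`, the fixed seed, and the placeholder author IDs are assumptions made only for this example, and the abstract does not say how the random sample was drawn.

```python
import random

# Class counts as reported in the abstract.
counts = {"Malicious": 3339, "Likely Malicious": 3354, "Benign": 7574}

# Total number of profiled authors (the abstract rounds this to "14K").
total = sum(counts.values())

# Share of each class, in percent, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}


def sample_for_review(author_ids, k=200, seed=0):
    """Draw k distinct authors uniformly at random (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(author_ids, k)


# Placeholder IDs standing in for real GitHub accounts (assumption).
author_ids = [f"author_{i}" for i in range(total)]
review_set = sample_for_review(author_ids)

print(total)            # 14267
print(shares)
print(len(review_set))  # 200
```

Under these counts, just over half of the profiled authors fall in the Benign class, and the Malicious and Likely Malicious classes are nearly the same size.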

Published In

WWW '24: Companion Proceedings of the ACM Web Conference 2024
May 2024
1928 pages
ISBN:9798400701726
DOI:10.1145/3589335
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. classification
  2. github
  3. hacker
  4. llm
  5. malware
  6. repository
  7. user

Qualifiers

  • Short-paper

Conference

WWW '24: The ACM Web Conference 2024
May 13 - 17, 2024
Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
