DOI: 10.1145/3691620.3695493
research-article

1+1>2: Integrating Deep Code Behaviors with Metadata Features for Malicious PyPI Package Detection

Published: 27 October 2024

Abstract

PyPI, the official package registry for Python, has seen a surge in malicious package uploads in recent years. Prior studies have demonstrated the effectiveness of learning-based solutions for malicious package detection. However, manually crafted expert rules are expensive and struggle to keep pace with rapidly evolving malicious behaviors, while deep features automatically extracted from code are still inaccurate in certain cases. To mitigate these issues, we propose Ea4mp, a novel approach that integrates deep code behaviors with metadata features to detect malicious PyPI packages. Specifically, Ea4mp extracts code behavior sequences from all script files and fine-tunes a BERT model to learn deep semantic features of malicious code. In addition, recognizing the value of metadata information, we construct an ensemble classifier that combines the strengths of deep code behavior features and metadata features for more effective detection. We evaluated Ea4mp against three state-of-the-art baselines on a newly constructed dataset. The experimental results show that Ea4mp improves precision by 6.9%-24.6% and recall by 10.5%-18.4%. With Ea4mp, we identified 119 previously unknown malicious packages from a pool of 46,573 newly uploaded packages over a three-week period, 82 of which have been removed by PyPI officials.
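
To make the two ideas in the abstract concrete, the following is a minimal sketch and not Ea4mp's implementation. It assumes that a "code behavior sequence" can be approximated by the names of the functions a script calls, in source order, and that the ensemble can be approximated by a weighted average of two probabilities. The abstract does not specify either detail, so `call_sequence`, `soft_vote`, `w_code`, and `SAMPLE` are all hypothetical names introduced for illustration.

```python
import ast


def call_sequence(source: str) -> list[str]:
    """Return the names of called functions in source order (Python 3.9+).

    Assumption: this stands in for a 'behavior sequence'; the paper's
    actual extraction procedure is not described in the abstract.
    The source is only parsed, never executed.
    """
    tree = ast.parse(source)
    calls = [node for node in ast.walk(tree) if isinstance(node, ast.Call)]
    # ast.walk is breadth-first, so sort by position to recover source order.
    calls.sort(key=lambda c: (c.lineno, c.col_offset))
    return [ast.unparse(c.func) for c in calls]


def soft_vote(p_code: float, p_meta: float, w_code: float = 0.5) -> float:
    """Combine two malicious-probability estimates by weighted averaging.

    Assumption: a plain soft vote replaces the paper's ensemble classifier,
    whose exact form is not given in the abstract.
    """
    if not 0.0 <= w_code <= 1.0:
        raise ValueError("w_code must lie in [0, 1]")
    return w_code * p_code + (1.0 - w_code) * p_meta


# A harmless toy script that decodes a string and passes it to a shell call.
SAMPLE = (
    "import os, base64\n"
    "payload = base64.b64decode('ZWNobyBoaQ==')\n"
    "os.system(payload.decode())\n"
)

if __name__ == "__main__":
    print(call_sequence(SAMPLE))
    print(soft_vote(0.9, 0.3))
```

In a pipeline of the kind the abstract describes, a sequence like the one returned by `call_sequence` would be tokenized and passed to a fine-tuned BERT classifier to produce `p_code`, while a separate model over package metadata would produce `p_meta`.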



      Published In

      ASE '24: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering
      October 2024
      2587 pages
      ISBN:9798400712487
      DOI:10.1145/3691620
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. open-source software
      2. malicious packages
      3. PyPI
      4. BERT

      Qualifiers

      • Research-article

      Conference

      ASE '24

      Acceptance Rates

      Overall Acceptance Rate 82 of 337 submissions, 24%

      Bibliometrics

      • Total Citations: 0
      • Total Downloads: 82
      • Downloads (Last 12 months): 82
      • Downloads (Last 6 weeks): 82
      Reflects downloads up to 26 Nov 2024
