1+1>2: Integrating Deep Code Behaviors with Metadata Features for Malicious PyPI Package Detection

Authors:

Kaifeng Huang

ASE '24: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering

Pages 1159 - 1170

https://doi.org/10.1145/3691620.3695493

Published: 27 October 2024

Abstract

PyPI, the official package registry for Python, has seen a surge in malicious package uploads in recent years. Prior studies have demonstrated the effectiveness of learning-based solutions in malicious package detection. However, manually crafted expert rules are expensive and struggle to keep pace with rapidly evolving malicious behaviors, while deep features automatically extracted from code are still inaccurate in certain cases. To mitigate these issues, we propose Ea4mp, a novel approach that integrates deep code behaviors with metadata features to detect malicious PyPI packages. Specifically, Ea4mp extracts code behavior sequences from all script files and fine-tunes a BERT model to learn deep semantic features of malicious code. In addition, we recognize the value of metadata information and construct an ensemble classifier that combines the strengths of deep code behavior features and metadata features for more effective detection. We evaluated Ea4mp against three state-of-the-art baselines on a newly constructed dataset. The experimental results show that Ea4mp improves precision by 6.9%-24.6% and recall by 10.5%-18.4%. With Ea4mp, we identified 119 previously unknown malicious packages from a pool of 46,573 newly uploaded packages over a three-week period, and 82 of them have been removed by PyPI officials.
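
The idea of combining a code-behavior score with a metadata score, and of measuring the result with precision and recall, can be illustrated with a minimal sketch. The abstract does not say how the ensemble classifier is built, so a simple weighted average (soft voting) is swapped in here purely for illustration. The names `PackageScores`, `ensemble_probability`, `is_malicious`, and `precision_recall`, the weight of 0.6, and the threshold of 0.5 are all assumptions, not part of Ea4mp.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class PackageScores:
    """Hypothetical per-package scores; both fields are assumed inputs."""
    name: str
    code_behavior_prob: float  # assumed output of a fine-tuned sequence model
    metadata_prob: float       # assumed output of a metadata-feature classifier


def ensemble_probability(scores: PackageScores, code_weight: float = 0.6) -> float:
    """Soft voting: weighted average of the two malicious-class probabilities."""
    if not 0.0 <= code_weight <= 1.0:
        raise ValueError("code_weight must lie in [0, 1]")
    return (code_weight * scores.code_behavior_prob
            + (1.0 - code_weight) * scores.metadata_prob)


def is_malicious(scores: PackageScores, threshold: float = 0.5,
                 code_weight: float = 0.6) -> bool:
    """Flag a package when the combined probability reaches the threshold."""
    return ensemble_probability(scores, code_weight) >= threshold


def precision_recall(predicted: Sequence[bool],
                     actual: Sequence[bool]) -> Tuple[float, float]:
    """Compute precision and recall for the malicious (True) class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

In this sketch a package whose code looks suspicious but whose metadata looks ordinary can still be flagged, and vice versa, which is the intuition behind combining the two feature sources. A learned combiner trained on labeled packages would replace the fixed weight in practice.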

Published In

ASE '24: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering

October 2024

2587 pages

ISBN:9798400712487

DOI:10.1145/3691620

General Chair: Vladimir Filkov

Program Co-chairs: Baishakhi Ray (Columbia University, USA; AWS AI Lab), Minghui Zhou (Peking University, China)

Copyright © 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China

Conference

ASE '24: 39th IEEE/ACM International Conference on Automated Software Engineering

October 27 - November 1, 2024

Sacramento, CA, USA

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%
