DOI: 10.1145/3663529.3663800
Research article · Open access

VinJ: An Automated Tool for Large-Scale Software Vulnerability Data Generation

Published: 10 July 2024

Abstract

We present VinJ, an efficient automated tool for large-scale, diverse vulnerability data generation. VinJ automatically generates vulnerability data by injecting vulnerabilities into given programs, based on knowledge learned from existing vulnerability data. VinJ generates diverse vulnerability data covering 18 CWEs with a 69% success rate, and produces 686k vulnerability samples in 74 hours (i.e., 0.4 seconds per sample), demonstrating its efficiency. The generated data significantly improves existing DL-based vulnerability detection, localization, and repair models. The demo video of VinJ is available at https://youtu.be/-oKoUqBbxD4, and the tool website at https://github.com/NewGillig/VInj. We also release the generated large-scale vulnerability dataset at https://zenodo.org/records/10574446.
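
The abstract describes VinJ as injecting vulnerabilities into given programs based on patterns learned from existing vulnerability data. The Python sketch below is only a rough illustration of that general idea of pattern-based injection, under the assumption that a mined pattern amounts to a code transformation applied to safe code; the snippet, regular expression, and function names are hypothetical and do not reflect VinJ's actual interface or algorithm.

    import re

    # Purely illustrative sketch (hypothetical names, not VinJ's API): a mined
    # "injection pattern" deletes the bounds check guarding a memcpy, turning a
    # safe C snippet into a CWE-787-style out-of-bounds-write variant.

    SAFE_SNIPPET = """
    void copy_item(char *dst, const char *src, size_t n, size_t cap) {
        /* bounds check guarding the write */
        if (n < cap) {
            memcpy(dst, src, n);
        }
    }
    """

    # Pattern: an if-guard whose body is a single memcpy call.
    GUARDED_WRITE = re.compile(
        r"if\s*\([^)]*\)\s*\{\s*\n(\s*)(memcpy\([^;]*;)\s*\n\s*\}"
    )

    def inject_oob_write(code: str) -> str:
        """Return a variant of `code` with the guarding bounds check removed."""
        return GUARDED_WRITE.sub(lambda m: m.group(1) + m.group(2), code)

    if __name__ == "__main__":
        print(inject_oob_write(SAFE_SNIPPET))

In this toy setting, the transformation removes the if-guard while keeping the write, so the injected variant performs the memcpy unconditionally; a real injection tool would apply many such learned transformations across CWEs and validate the results.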



    Published In

    FSE 2024: Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering
    July 2024
    715 pages
    ISBN:9798400706585
    DOI:10.1145/3663529
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Vulnerability analysis
    2. data augmentation
    3. deep learning

    Qualifiers

    • Research-article

    Funding Sources

    • Army Research Office

    Conference

    FSE '24

    Acceptance Rates

    Overall Acceptance Rate 112 of 543 submissions, 21%
