DOI: 10.1145/3663529.3663800
Research article · Open access

VinJ: An Automated Tool for Large-Scale Software Vulnerability Data Generation

Published: 10 July 2024

Abstract

We present VinJ, an efficient automated tool for large-scale, diverse vulnerability data generation. VinJ automatically generates vulnerability data by injecting vulnerabilities into given programs, based on knowledge learned from existing vulnerability data. VinJ generates diverse vulnerability data covering 18 CWEs with a 69% success rate, and produces 686k vulnerability samples in 74 hours (i.e., 0.4 seconds per sample), demonstrating its efficiency. The generated data significantly improves existing DL-based vulnerability detection, localization, and repair models. The demo video of VinJ is available at https://youtu.be/-oKoUqBbxD4, and the tool website at https://github.com/NewGillig/VInj. We also release the generated large-scale vulnerability dataset at https://zenodo.org/records/10574446.
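
The abstract describes VinJ as injecting vulnerabilities into given programs based on patterns learned from existing vulnerability data. The Python sketch below is only a rough illustration of that general idea of pattern-based injection, under the assumption that a mined pattern amounts to a code transformation applied to safe code; the snippet, regular expression, and function names are hypothetical and do not reflect VinJ's actual interface or algorithm.

    import re

    # Purely illustrative sketch (hypothetical names, not VinJ's API): a mined
    # "injection pattern" deletes the bounds check guarding a memcpy, turning a
    # safe C snippet into a CWE-787-style out-of-bounds-write variant.

    SAFE_SNIPPET = """
    void copy_item(char *dst, const char *src, size_t n, size_t cap) {
        /* bounds check guarding the write */
        if (n < cap) {
            memcpy(dst, src, n);
        }
    }
    """

    # Pattern: an if-guard whose body is a single memcpy call.
    GUARDED_WRITE = re.compile(
        r"if\s*\([^)]*\)\s*\{\s*\n(\s*)(memcpy\([^;]*;)\s*\n\s*\}"
    )

    def inject_oob_write(code: str) -> str:
        """Return a variant of `code` with the guarding bounds check removed."""
        return GUARDED_WRITE.sub(lambda m: m.group(1) + m.group(2), code)

    if __name__ == "__main__":
        print(inject_oob_write(SAFE_SNIPPET))

In this toy setting, the transformation removes the if-guard while keeping the write, so the injected variant performs the memcpy unconditionally; a real injection tool would apply many such learned transformations across CWEs and validate the results.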



    Published In

    FSE 2024: Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering
    July 2024
    715 pages
    ISBN:9798400706585
    DOI:10.1145/3663529
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Vulnerability analysis
    2. data augmentation
    3. deep learning

    Qualifiers

    • Research-article

    Funding Sources

    • Army Research Office

    Conference

    FSE '24

    Acceptance Rates

    Overall Acceptance Rate 112 of 543 submissions, 21%
