A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries

Authors:

Jiahao Fan,

Yi Li,

Shaohua Wang,

Tien N. Nguyen

MSR '20: Proceedings of the 17th International Conference on Mining Software Repositories

Pages 508 - 512

https://doi.org/10.1145/3379597.3387501

Published: 18 September 2020

Abstract

We present Big-Vul, a large C/C++ code vulnerability dataset collected from open-source GitHub projects. To build it, we crawled the public Common Vulnerabilities and Exposures (CVE) database and the source code repositories linked to CVE entries. From the CVE database we collected descriptive information about each vulnerability, e.g., CVE IDs, CVE severity scores, and CVE summaries. Using the CVE information and the GitHub repository links published with it, we downloaded all of the code repositories and extracted the vulnerability-related code changes. In total, Big-Vul contains 3,754 code vulnerabilities spanning 91 different vulnerability types, all extracted from 348 GitHub projects. All information is stored in CSV format, and the code changes are linked with the CVE descriptive information. Big-Vul can therefore be used for various research topics, e.g., detecting and fixing vulnerabilities and analyzing vulnerability-related code changes. Big-Vul is publicly available on GitHub.
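As a rough illustration of how a CSV that links CVE metadata to code changes might be consumed, the following minimal sketch parses a tiny in-memory sample, filters rows by severity score, and prints a diff between the vulnerable and fixed code. The column names (`cve_id`, `cvss_score`, `summary`, `project`, `func_before`, `func_after`), the two sample rows, and the helper functions are all assumptions made for this example; they are not the actual Big-Vul schema or data.

```python
import csv
import difflib
import io

# Hypothetical two-row sample. The column names and CVE IDs are made up for
# illustration and are NOT the actual Big-Vul schema or real CVE entries.
SAMPLE_CSV = '''cve_id,cvss_score,summary,project,func_before,func_after
CVE-0000-0001,7.5,Illustrative buffer overflow,demo,"strcpy(dst, src);","strncpy(dst, src, n);"
CVE-0000-0002,4.3,Illustrative missing null check,demo,"use(p);","if (p) use(p);"
'''


def load_records(text):
    """Parse CSV text into a list of dict rows, one per vulnerability."""
    return list(csv.DictReader(io.StringIO(text)))


def code_change(record):
    """Return a unified diff between the vulnerable and fixed code of one row."""
    before = record["func_before"].splitlines()
    after = record["func_after"].splitlines()
    return "\n".join(
        difflib.unified_diff(before, after, "before", "after", lineterm="")
    )


def high_severity(records, threshold=7.0):
    """Keep rows whose severity score is at or above the threshold."""
    return [r for r in records if float(r["cvss_score"]) >= threshold]


records = load_records(SAMPLE_CSV)
for rec in high_severity(records):
    # Each row carries both the CVE description and the linked code change.
    print(rec["cve_id"], rec["summary"])
    print(code_change(rec))
```

Because each row holds the descriptive fields and the code change together, filtering on metadata and inspecting the corresponding fix needs no join across files in this sketch; a real dataset may be organized differently.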

Index Terms

Security and privacy

Information & Contributors

Information

Published In

MSR '20: Proceedings of the 17th International Conference on Mining Software Repositories

June 2020

675 pages

ISBN:9781450375177

DOI:10.1145/3379597

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper
Research
Refereed limited

Conference

MSR '20

Sponsor:

SIGSOFT

MSR '20: 17th International Conference on Mining Software Repositories

June 29 - 30, 2020

Seoul, Republic of Korea
