Computer Science > Cryptography and Security

arXiv:2304.00409 (cs)

[Submitted on 1 Apr 2023 (v1), last revised 9 Aug 2023 (this version, v2)]

Title: DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

Authors: Yizheng Chen, Zhoujie Ding, Lamya Alowain, Xinyun Chen, David Wagner

Abstract: We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites and extracting vulnerability-fixing commits and source code from the corresponding projects. Our new dataset contains 18,945 vulnerable functions spanning 150 CWEs and 330,492 non-vulnerable functions extracted from 7,514 commits. Our dataset covers 295 more projects than all previous datasets combined.
Combining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to a high false positive rate, a low F1 score, and the difficulty of detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. We show that increasing the volume of training data may not further improve the performance of deep learning models for vulnerability detection, but might be useful to improve the generalization ability to unseen projects.
We also identify hopeful future research directions. We demonstrate that large language models (LLMs) are a promising research direction for ML-based vulnerability detection, outperforming Graph Neural Networks (GNNs) with code-structure features in our experiments. Moreover, developing source-code-specific pre-training objectives is a promising research direction to improve the vulnerability detection performance.

Comments:	Published at RAID 2023
Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Cite as:	arXiv:2304.00409 [cs.CR]
	(or arXiv:2304.00409v2 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2304.00409

Submission history

From: Yizheng Chen
[v1] Sat, 1 Apr 2023 23:29:14 UTC (249 KB)
[v2] Wed, 9 Aug 2023 01:21:50 UTC (306 KB)
