research-article

CVE-assisted large-scale security bug report dataset construction method

Authors: Dejun Mu

Volume 160, Issue C

https://doi.org/10.1016/j.jss.2019.110456

Published: 01 February 2020

Highlights

• An automatic approach to large-scale dataset construction is provided that uses only a small group of initial labeled ground-truth samples.
• A large-scale dataset from the “OpenStack” project for security bug report prediction is constructed.
• The quality of the “Chromium” dataset is improved by identifying and removing noise introduced during the labeling process.
• Scripts for our dataset construction approach, as well as the two constructed large-scale datasets, are shared.

Abstract

Identifying security bug reports (SBRs) is crucial for eliminating security issues during software development. Machine learning is a promising way to predict SBRs. However, the effectiveness of state-of-the-art machine learning models depends on high-quality datasets, and gathering large-scale datasets is expensive and tedious. To address this issue, we propose an automated data labeling approach based on iterative voting classification. It starts with a small group of ground-truth training samples, which can be labeled with the help of authoritative vulnerability records hosted in CVE (Common Vulnerabilities and Exposures). The accuracy of the prediction model is improved with an iterative voting strategy. Using this approach, we label over 80k bug reports from OpenStack and 40k bug reports from Chromium. The correctness of these labels is then manually reviewed by three experienced security testing members. Finally, we construct a large-scale SBR dataset with 191 SBRs and 88,472 NSBRs (non-security bug reports) from OpenStack, and we improve the quality of the existing Chromium SBR dataset by identifying 64 new SBRs among previously labeled NSBRs and filtering out 173 noisy bug reports. The shared datasets and the proposed dataset construction method help promote research progress in SBR prediction.
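
The abstract does not spell out the classifiers, features, or acceptance rule behind the iterative voting strategy, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes three hypothetical voters (word, word-bigram, and character-trigram frequency scorers), a unanimity rule for accepting a label, and made-up example reports. Reports the voters cannot agree on stay unlabeled, which is where a manual review step could take over.

```python
from collections import Counter

SBR, NSBR = 1, 0  # security / non-security bug report

def words(text):
    return text.lower().split()

def bigrams(text):
    w = words(text)
    return [" ".join(pair) for pair in zip(w, w[1:])]

def char_trigrams(text):
    t = text.lower()
    return [t[i:i + 3] for i in range(len(t) - 2)]

def train_voter(samples, featurize):
    """Hypothetical voter: counts how often each feature appears per label."""
    counts = {SBR: Counter(), NSBR: Counter()}
    for text, label in samples:
        counts[label].update(featurize(text))

    def vote(text):
        feats = featurize(text)
        sec = sum(counts[SBR][f] for f in feats)
        non = sum(counts[NSBR][f] for f in feats)
        if sec == non:
            return None  # abstain when there is no evidence either way
        return SBR if sec > non else NSBR

    return vote

def iterative_voting_label(seed, unlabeled, max_rounds=10):
    """Grow the labeled set from a small seed using unanimous votes."""
    labeled, pending = list(seed), list(unlabeled)
    for _ in range(max_rounds):
        # Retrain every voter on everything labeled so far.
        voters = [train_voter(labeled, f)
                  for f in (words, bigrams, char_trigrams)]
        accepted, still_pending = [], []
        for text in pending:
            votes = {v(text) for v in voters}
            if len(votes) == 1 and None not in votes:
                accepted.append((text, votes.pop()))
            else:
                still_pending.append(text)
        if not accepted:
            break  # nothing new was labeled, so further rounds cannot help
        labeled.extend(accepted)
        pending = still_pending
    return labeled, pending

# Made-up seed reports standing in for CVE-assisted ground truth.
seed = [
    ("buffer overflow allows remote code execution", SBR),
    ("xss vulnerability in login form", SBR),
    ("button misaligned on settings page", NSBR),
    ("typo in documentation page", NSBR),
]
unlabeled = [
    "remote code execution via buffer overflow",
    "typo on settings page",
    "crash",
]
labeled, pending = iterative_voting_label(seed, unlabeled)
```

Requiring unanimity trades coverage for precision: fewer reports are labeled automatically, but each accepted label has the support of every voter. A majority rule would label more reports at the cost of more noise.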



Published In

Journal of Systems and Software, Volume 160, Issue C
Feb 2020, 77 pages
ISSN: 0164-1212
Elsevier Inc.
Publisher: Elsevier Science Inc., United States
