DOI: 10.1145/3661167.3661207

Research Article

Reality Check: Assessing GPT-4 in Fixing Real-World Software Vulnerabilities

Published: 18 June 2024 Publication History

Abstract

Discovering and mitigating software vulnerabilities is a challenging task. These vulnerabilities are often caused by simple code snippets that would be harmless in other contexts (e.g., an unchecked path traversal). Large Language Models (LLMs) promise to revolutionize not only human-machine interaction but a variety of software engineering tasks as well, including the automatic repair of vulnerabilities. However, it is currently hard to assess the performance, robustness, and reliability of these models, as most of their evaluation has been done on small, synthetic examples. In our work, we systematically evaluate the automatic vulnerability-fixing capabilities of GPT-4, a popular LLM, using Vul4J, a database of real-world Java vulnerabilities. We expect the model to provide fixes for vulnerable methods, which we evaluate manually and against the unit tests included in the Vul4J database. GPT-4 consistently provided perfect fixes for at least 12 of the 46 examined vulnerabilities, which could be applied as-is. In an additional 5 cases, the accompanying textual instructions would help fix the vulnerability in a practical scenario, even though the provided code was incorrect. In line with prior work, our findings also show that prompting has a significant effect.
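The evaluation scheme the abstract describes — check whether a candidate fix applies and passes the Vul4J unit tests, and otherwise fall back to judging the textual instructions — can be sketched as a small classifier. This is an illustrative sketch only: the function and the category labels are our own shorthand for the abstract's outcomes, not part of the paper's tooling or of Vul4J.

```python
def classify_fix(applies: bool, tests_pass: bool, instructions_helpful: bool) -> str:
    """Classify a candidate LLM fix, loosely following the outcomes in the
    abstract. Hypothetical helper; labels are ours, not the authors'."""
    if applies and tests_pass:
        # The "perfect fix" case: the patch can be applied as-is and the
        # vulnerability-triggering unit tests pass.
        return "perfect fix"
    if instructions_helpful:
        # Code is wrong, but the accompanying textual instructions would
        # still help a developer fix the vulnerability in practice.
        return "helpful instructions"
    return "incorrect"
```

Under this labeling, the abstract's numbers correspond to at least 12 of 46 vulnerabilities in the first category and 5 more in the second.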



    Information

    Published In

    EASE '24: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering
    June 2024, 728 pages
    ISBN: 9798400717017
    DOI: 10.1145/3661167

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. Automated program repair
    2. GPT
    3. Machine learning
    4. Vulnerability fixing

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Hungarian National Research, Development and Innovation Office

    Conference

    EASE 2024

    Acceptance Rates

    Overall Acceptance Rate 71 of 232 submissions, 31%
