DOI: 10.1145/3661167.3661207

Research Article

Reality Check: Assessing GPT-4 in Fixing Real-World Software Vulnerabilities

Published: 18 June 2024 Publication History

Abstract

Discovering and mitigating software vulnerabilities is a challenging task. These vulnerabilities are often caused by simple code snippets that would be harmless in other contexts (e.g., an unchecked path traversal). Large Language Models (LLMs) promise to revolutionize not only human-machine interaction but a variety of software engineering tasks as well, including the automatic repair of vulnerabilities. However, it is currently hard to assess the performance, robustness, and reliability of these models, as most of their evaluation has been done on small, synthetic examples. In our work, we systematically evaluate the automatic vulnerability-fixing capabilities of GPT-4, a popular LLM, using Vul4J, a database of real-world Java vulnerabilities. We expect the model to provide fixes for vulnerable methods, which we evaluate manually and against the unit tests included in the Vul4J database. GPT-4 consistently provided perfect fixes for at least 12 of the 46 examined vulnerabilities, which could be applied as-is. In an additional 5 cases, the accompanying textual instructions would help fix the vulnerability in a practical scenario, even though the provided code was incorrect. In line with prior work, our findings also show that prompting has a significant effect.
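The evaluation scheme the abstract describes — check whether a candidate fix applies and passes the Vul4J unit tests, and otherwise fall back to judging the textual instructions — can be sketched as a small classifier. This is an illustrative sketch only: the function and the category labels are our own shorthand for the abstract's outcomes, not part of the paper's tooling or of Vul4J.

```python
def classify_fix(applies: bool, tests_pass: bool, instructions_helpful: bool) -> str:
    """Classify a candidate LLM fix, loosely following the outcomes in the
    abstract. Hypothetical helper; labels are ours, not the authors'."""
    if applies and tests_pass:
        # The "perfect fix" case: the patch can be applied as-is and the
        # vulnerability-triggering unit tests pass.
        return "perfect fix"
    if instructions_helpful:
        # Code is wrong, but the accompanying textual instructions would
        # still help a developer fix the vulnerability in practice.
        return "helpful instructions"
    return "incorrect"
```

Under this labeling, the abstract's numbers correspond to at least 12 of 46 vulnerabilities in the first category and 5 more in the second.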



    Information

    Published In

    EASE '24: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering
    June 2024, 728 pages
    ISBN: 9798400717017
    DOI: 10.1145/3661167

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. Automated program repair
    2. GPT
    3. Machine learning
    4. Vulnerability fixing

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Hungarian National Research, Development and Innovation Office

    Conference

    EASE 2024

    Acceptance Rates

    Overall Acceptance Rate 71 of 232 submissions, 31%
