DOI: 10.1145/3639478.3639792
Short paper · Open access

Beyond Accuracy and Robustness Metrics for Large Language Models for Code

Published: 23 May 2024

Abstract

In recent years, Large Language Models for code (LLMc) have transformed the landscape of software engineering (SE), demonstrating significant efficacy in tasks such as code completion, summarization, review, tracing, translation, test case generation, clone detection, and bug fixing. Notably, GitHub Copilot [31] and Google's CodeBot [21] exemplify how LLMc contribute to substantial time and effort savings in software development. However, despite their widespread use, there is a growing need to thoroughly assess LLMc, as current evaluation processes rely heavily on accuracy and robustness metrics and lack consensus on the additional factors that influence code generation. This gap hinders a holistic understanding of LLMc performance, affecting interpretability, efficiency, bias, fairness, and robustness. Challenges in benchmarking and data maintenance compound the issue, underscoring the need for a comprehensive evaluation approach. To address these issues, this dissertation proposes a benchmarking infrastructure, named HolBench, aimed at closing the gaps in evaluating LLMc quality. The goal is to standardize testing scenarios, facilitate meaningful comparisons across LLMc, and provide multi-metric measurements that go beyond a sole focus on accuracy. This approach aims to decrease the costs of advancing LLMc research and to enhance the reliability of LLMc for adoption in academia and industry.
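
To make the multi-metric goal concrete, the following sketch shows what a minimal HolBench-style evaluation loop might look like in Python. It is an illustrative assumption rather than the dissertation's implementation: the Problem container and evaluate function are hypothetical names, pass@k follows the unbiased estimator of Chen et al. [4], and the robustness score is a simplified stand-in for perturbation-based checks in the spirit of ReCode [29].

    import math
    from dataclasses import dataclass
    from typing import Callable, List

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased pass@k estimator from Chen et al. [4]:
        # 1 - C(n - c, k) / C(n, k), for n samples of which c pass the tests.
        if n - c < k:
            return 1.0
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    @dataclass
    class Problem:  # hypothetical container, not the actual HolBench API
        prompt: str
        perturbed_prompt: str  # e.g., a semantics-preserving rewrite as in ReCode [29]
        check: Callable[[str], bool]  # functional-correctness oracle (runs hidden tests)

    def evaluate(model: Callable[[str], List[str]],
                 problems: List[Problem], k: int = 1) -> dict:
        # Report accuracy (mean pass@k) alongside a simple robustness score:
        # the fraction of problems solved under both the original and the perturbed prompt.
        accuracy, robustness = [], []
        for p in problems:
            samples = model(p.prompt)
            correct = sum(p.check(s) for s in samples)
            accuracy.append(pass_at_k(len(samples), correct, k))
            solved_perturbed = any(p.check(s) for s in model(p.perturbed_prompt))
            robustness.append(correct > 0 and solved_perturbed)
        return {
            "pass@k": sum(accuracy) / len(accuracy),
            "robustness": sum(robustness) / len(robustness),
        }

Adding further dimensions (e.g., efficiency or fairness measurements) then amounts to computing additional entries in the returned dictionary, which is the kind of multi-metric reporting the abstract argues for over a sole focus on accuracy.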

References

[1]
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, et al. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732 [cs.PL].
[2]
Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, et al. 2022. MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation. arXiv:2208.08227 [cs].
[3]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, et al. 2021. Generation Probabilities are Not Enough: Improving Error Highlighting for AI Code Suggestions. arXiv preprint, version 2.
[4]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs].
[5]
Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, et al. 2021. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair. IEEE Transactions on Software Engineering 47, 9 (2021), 1943--1959.
[6]
Matteo Ciniselli, Nathan Cooper, Luca Pascarella, Antonio Mastropaolo, Emad Aghajani, et al. 2022. An Empirical Study on the Usage of Transformer Models for Code Completion. IEEE Transactions on Software Engineering 48, 12 (2022), 4818--4837.
[7]
A. Connor, A. Harris, N. Cooper, and D. Poshyvanyk. 2022. Can We Automatically Fix Bugs by Learning Edit Operations?. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE Computer Society, Los Alamitos, CA, USA, 782--792.
[8]
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, et al. 2021. Measuring Coding Challenge Competence With APPS. arXiv:2105.09938 [cs].
[9]
Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, et al. 2023. Large Language Models for Software Engineering: A Systematic Literature Review. arXiv:2308.10620 [cs].
[10]
Alexander LeClair, Aakash Bansal, and Collin McMillan. 2021. Ensemble Models for Neural Source Code Summarization of Subroutines. arXiv:2107.11423 [cs].
[11]
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, et al. 2022. Holistic Evaluation of Language Models. arXiv:2211.09110 [cs].
[12]
Chao Liu, Xuanlin Bao, Hongyu Zhang, Neng Zhang, Haibo Hu, et al. 2023. Improving ChatGPT Prompt for Code Generation. arXiv:2305.08360 [cs.SE].
[13]
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. arXiv:2305.01210 [cs.SE].
[14]
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, et al. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. arXiv:2102.04664 [cs].
[15]
Kevin Moran, David N. Palacio, Carlos Bernal-Cardenas, Daniel McCrystal, Denys Poshyvanyk, et al. 2020. Improving the Effectiveness of Traceability Link Recovery using Hierarchical Bayesian Networks. In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE). 873--885.
[16]
Kevin Moran, Ali Yachnes, George Purnell, Junayed Mahmud, Michele Tufano, et al. 2022. An Empirical Investigation into the Use of Image Captioning for Automated Software Documentation. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). 514--525.
[17]
Anh Tuan Nguyen and Tien N. Nguyen. 2015. Graph-Based Statistical Language Model for Code. In ICSE'15. IEEE Press, 858--868.
[18]
OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL].
[19]
Veselin Raychev, Martin T. Vechev, and Eran Yahav. 2014. Code Completion with Statistical Language Models. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14).
[20]
Daniel Rodriguez-Cardenas, David N. Palacio, Dipin Khati, Henry Burke, and Denys Poshyvanyk. 2023. Benchmarking Causal Study to Interpret Large Language Models for Source Code. arXiv:2308.12415 [cs].
[21]
Doug Rosenberg, Barry Boehm, Matt Stephens, Charles Suscheck, Shobha Rani Dhalipathi, et al. 2020. CodeBots: From Domain Model to Executable Architecture. In Parallel Agile: Faster Delivery, Fewer Defects, Lower Cost. 27--51.
[22]
Michele Tufano, Jevgenija Pantiuchina, Cody Watson, Gabriele Bavota, and Denys Poshyvanyk. 2019. On Learning Meaningful Code Changes Via Neural Machine Translation. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). 25--36.
[23]
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, et al. 2018. Deep Learning Similarities from Different Representations of Source Code. In 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR). 542--553.
[24]
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano di Penta, Martin White, et al. 2018. An Empirical Investigation into Learning Bug-Fixing Patches in the Wild via Neural Machine Translation. In 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE). 832--837.
[25]
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, et al. 2019. Learning How to Mutate Source Code from Bug-Fixes. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). 301--312.
[26]
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, et al. 2019. An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation. ACM Transactions on Software Engineering and Methodology 28, 4 (2019), 1--29.
[27]
Rosalia Tufano, Simone Masiero, Antonio Mastropaolo, Luca Pascarella, Denys Poshyvanyk, et al. 2022. Using Pre-Trained Models to Boost Code Review Automation. In 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE). 2291--2302.
[28]
Rosalia Tufano, Luca Pascarella, Michele Tufano, Denys Poshyvanyk, and Gabriele Bavota. 2021. Towards Automating Code Review Activities. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). 163--174.
[29]
Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, et al. 2022. ReCode: Robustness Evaluation of Code Generation Models. arXiv:2212.10264 [cs].
[30]
Cody Watson, Michele Tufano, Kevin Moran, Gabriele Bavota, and Denys Poshyvanyk. 2020. On Learning Meaningful Assert Statements for Unit Test Cases. In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE). 1398--1409.
[31]
Michel Wermelinger. 2023. Using GitHub Copilot to solve simple programming problems. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1. 172--178.
[32]
Martin White, Michele Tufano, Matias Martínez, Martin Monperrus, and Denys Poshyvanyk. 2019. Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). 479--490.
[33]
Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep learning code fragments for code clone detection. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). 87--98.
[34]
Martin White, Christopher Vendome, Mario Linares-Vasquez, and Denys Poshyvanyk. 2015. Toward Deep Learning Software Repositories. In 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories. 334--345.
[35]
Robert White and Jens Krinke. 2020. ReAssert: Deep Learning for Assert Generation. arXiv:2011.09784 [cs].
[36]
Frank F. Xu, Uri Alon, Graham Neubig, and Vincent J. Hellendoorn. 2022. A Systematic Evaluation of Large Language Models of Code. arXiv:2202.13169 [cs].
[37]
Wojciech Zaremba, Greg Brockman, and OpenAI. 2021. OpenAI Codex. https://openai.com/blog/openai-codex/.
[38]
Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS).

    Published In

    ICSE-Companion '24: Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings
    April 2024, 531 pages
    ISBN: 9798400705021
    DOI: 10.1145/3639478
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    In-Cooperation

    • Faculty of Engineering of University of Porto

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 23 May 2024


    Author Tags

    1. deep learning
    2. code generation
    3. interpretability
    4. transformers

    Qualifiers

    • Short-paper

    Conference

    ICSE-Companion '24

    Acceptance Rates

    Overall Acceptance Rate 276 of 1,856 submissions, 15%
