DOI: 10.1145/3639478.3639792
Short paper · Open access

Beyond Accuracy and Robustness Metrics for Large Language Models for Code

Published: 23 May 2024

Abstract

In recent years, Large Language Models for code (LLMc) have transformed the landscape of software engineering (SE), demonstrating significant efficacy in tasks such as code completion, summarization, review, tracing, translation, test case generation, clone detection, and bug fixing. Notably, GitHub Copilot [31] and Google's CodeBot [21] exemplify how LLMc contribute to substantial time and effort savings in software development. However, despite their widespread use, there is a growing need to thoroughly assess LLMc, as current evaluation processes rely heavily on accuracy and robustness metrics and lack consensus on the additional factors that influence code generation. This gap hinders a holistic understanding of LLMc performance, affecting interpretability, efficiency, bias, fairness, and robustness. Challenges in benchmarking and data maintenance compound the issue, underscoring the need for a comprehensive evaluation approach. To address these issues, this dissertation proposes a benchmarking infrastructure, named HolBench, aimed at closing the gaps in evaluating LLMc quality. The goal is to standardize testing scenarios, facilitate meaningful comparisons across LLMc, and provide multi-metric measurements that go beyond a sole focus on accuracy. This approach aims to decrease the costs of advancing LLMc research and to enhance the reliability of LLMc for adoption in academia and industry.
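
To make the multi-metric goal concrete, the following sketch shows what a minimal HolBench-style evaluation loop might look like in Python. It is an illustrative assumption rather than the dissertation's implementation: the Problem container and evaluate function are hypothetical names, pass@k follows the unbiased estimator of Chen et al. [4], and the robustness score is a simplified stand-in for perturbation-based checks in the spirit of ReCode [29].

    import math
    from dataclasses import dataclass
    from typing import Callable, List

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased pass@k estimator from Chen et al. [4]:
        # 1 - C(n - c, k) / C(n, k), for n samples of which c pass the tests.
        if n - c < k:
            return 1.0
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    @dataclass
    class Problem:  # hypothetical container, not the actual HolBench API
        prompt: str
        perturbed_prompt: str  # e.g., a semantics-preserving rewrite as in ReCode [29]
        check: Callable[[str], bool]  # functional-correctness oracle (runs hidden tests)

    def evaluate(model: Callable[[str], List[str]],
                 problems: List[Problem], k: int = 1) -> dict:
        # Report accuracy (mean pass@k) alongside a simple robustness score:
        # the fraction of problems solved under both the original and the perturbed prompt.
        accuracy, robustness = [], []
        for p in problems:
            samples = model(p.prompt)
            correct = sum(p.check(s) for s in samples)
            accuracy.append(pass_at_k(len(samples), correct, k))
            solved_perturbed = any(p.check(s) for s in model(p.perturbed_prompt))
            robustness.append(correct > 0 and solved_perturbed)
        return {
            "pass@k": sum(accuracy) / len(accuracy),
            "robustness": sum(robustness) / len(robustness),
        }

Adding further dimensions (e.g., efficiency or fairness measurements) then amounts to computing additional entries in the returned dictionary, which is the kind of multi-metric reporting the abstract argues for over a sole focus on accuracy.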

References

[1]
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, et al. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732 [cs.PL].
[2]
Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, et al. 2022. MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation. arXiv:2208.08227 [cs].
[3]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, et al. 2021. Generation Probabilities are Not Enough: Improving Error Highlighting for AI Code Suggestions. arXiv preprint, version 2.
[4]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs].
[5]
Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, et al. 2021. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair. IEEE Transactions on Software Engineering 47, 9 (2021), 1943--1959.
[6]
Matteo Ciniselli, Nathan Cooper, Luca Pascarella, Antonio Mastropaolo, Emad Aghajani, et al. 2022. An Empirical Study on the Usage of Transformer Models for Code Completion. IEEE Transactions on Software Engineering 48, 12 (2022), 4818--4837.
[7]
A. Connor, A. Harris, N. Cooper, and D. Poshyvanyk. 2022. Can We Automatically Fix Bugs by Learning Edit Operations?. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE Computer Society, Los Alamitos, CA, USA, 782--792.
[8]
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, et al. 2021. Measuring Coding Challenge Competence With APPS. arXiv:2105.09938 [cs].
[9]
Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, et al. 2023. Large Language Models for Software Engineering: A Systematic Literature Review. arXiv:2308.10620 [cs].
[10]
Alexander LeClair, Aakash Bansal, and Collin McMillan. 2021. Ensemble Models for Neural Source Code Summarization of Subroutines. arXiv:2107.11423 [cs].
[11]
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, et al. 2022. Holistic Evaluation of Language Models. arXiv:2211.09110 [cs].
[12]
Chao Liu, Xuanlin Bao, Hongyu Zhang, Neng Zhang, Haibo Hu, et al. 2023. Improving ChatGPT Prompt for Code Generation. arXiv:2305.08360 [cs.SE].
[13]
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. arXiv:2305.01210 [cs.SE].
[14]
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, et al. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. arXiv:2102.04664 [cs].
[15]
Kevin Moran, David N. Palacio, Carlos Bernal-Cardenas, Daniel McCrystal, Denys Poshyvanyk, et al. 2020. Improving the Effectiveness of Traceability Link Recovery using Hierarchical Bayesian Networks. In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE). 873--885.
[16]
Kevin Moran, Ali Yachnes, George Purnell, Junayed Mahmud, Michele Tufano, et al. 2022. An Empirical Investigation into the Use of Image Captioning for Automated Software Documentation. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). 514--525.
[17]
Anh Tuan Nguyen and Tien N. Nguyen. 2015. Graph-Based Statistical Language Model for Code. In ICSE'15. IEEE Press, 858--868.
[18]
OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL].
[19]
Veselin Raychev, Martin T. Vechev, and Eran Yahav. 2014. Code Completion with Statistical Language Models. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14).
[20]
Daniel Rodriguez-Cardenas, David N. Palacio, Dipin Khati, Henry Burke, and Denys Poshyvanyk. 2023. Benchmarking Causal Study to Interpret Large Language Models for Source Code. arXiv:2308.12415 [cs].
[21]
Doug Rosenberg, Barry Boehm, Matt Stephens, Charles Suscheck, Shobha Rani Dhalipathi, et al. 2020. CodeBots: From Domain Model to Executable Architecture. In Parallel Agile: Faster Delivery, Fewer Defects, Lower Cost. 27--51.
[22]
Michele Tufano, Jevgenija Pantiuchina, Cody Watson, Gabriele Bavota, and Denys Poshyvanyk. 2019. On Learning Meaningful Code Changes Via Neural Machine Translation. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). 25--36.
[23]
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, et al. 2018. Deep Learning Similarities from Different Representations of Source Code. In 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR). 542--553.
[24]
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano di Penta, Martin White, et al. 2018. An Empirical Investigation into Learning Bug-Fixing Patches in the Wild via Neural Machine Translation. In 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE). 832--837.
[25]
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, et al. 2019. Learning How to Mutate Source Code from Bug-Fixes. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). 301--312.
[26]
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, et al. 2019. An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation. ACM Transactions on Software Engineering and Methodology 28, 4 (2019), 1--29.
[27]
Rosalia Tufano, Simone Masiero, Antonio Mastropaolo, Luca Pascarella, Denys Poshyvanyk, et al. 2022. Using Pre-Trained Models to Boost Code Review Automation. In 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE). 2291--2302.
[28]
Rosalia Tufano, Luca Pascarella, Michele Tufano, Denys Poshyvanyk, and Gabriele Bavota. 2021. Towards Automating Code Review Activities. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). 163--174.
[29]
Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, et al. 2022. ReCode: Robustness Evaluation of Code Generation Models. arXiv:2212.10264 [cs].
[30]
Cody Watson, Michele Tufano, Kevin Moran, Gabriele Bavota, and Denys Poshyvanyk. 2020. On Learning Meaningful Assert Statements for Unit Test Cases. In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE). 1398--1409.
[31]
Michel Wermelinger. 2023. Using GitHub Copilot to solve simple programming problems. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1. 172--178.
[32]
Martin White, Michele Tufano, Matias Martínez, Martin Monperrus, and Denys Poshyvanyk. 2019. Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). 479--490.
[33]
Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep learning code fragments for code clone detection. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). 87--98.
[34]
Martin White, Christopher Vendome, Mario Linares-Vasquez, and Denys Poshyvanyk. 2015. Toward Deep Learning Software Repositories. In 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories. 334--345.
[35]
Robert White and Jens Krinke. 2020. ReAssert: Deep Learning for Assert Generation. arXiv:2011.09784 [cs].
[36]
Frank F. Xu, Uri Alon, Graham Neubig, and Vincent J. Hellendoorn. 2022. A Systematic Evaluation of Large Language Models of Code. arXiv:2202.13169 [cs].
[37]
Wojciech Zaremba, Greg Brockman, and OpenAI. 2021. OpenAI Codex. https://openai.com/blog/openai-codex/.
[38]
Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS).

    Published In

    ICSE-Companion '24: Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings
    April 2024, 531 pages
    ISBN: 9798400705021
    DOI: 10.1145/3639478
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    In-Cooperation

    • Faculty of Engineering of University of Porto

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 23 May 2024


    Author Tags

    1. deep learning
    2. code generation
    3. interpretability
    4. transformers

    Qualifiers

    • Short-paper

    Conference

    ICSE-Companion '24

    Acceptance Rates

    Overall Acceptance Rate 276 of 1,856 submissions, 15%
