No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT

Abstract

Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing (NLP) tasks, such as machine translation, question answering, and summarization. LLMs are also highly valuable in supporting software engineering tasks, particularly code generation. Automatic code generation is the process of producing source code or executable code from given specifications or requirements, improving developer productivity. In this study, we perform a systematic empirical assessment of the quality of code generated by ChatGPT, a recent state-of-the-art LLM product. We leverage 728 algorithm problems in five languages (C, C++, Java, Python, and JavaScript) and 18 CWEs with 54 code scenarios for the code generation task. Our evaluation encompasses a comprehensive analysis of the code snippets generated by ChatGPT, focusing on three critical aspects: correctness, complexity, and security. We also specifically investigate ChatGPT's ability to engage in a multi-round fixing process (i.e., ChatGPT's dialog ability, where users chat with ChatGPT to fix generated buggy code) to facilitate code generation. By delving into the generated code and examining the experimental results, this work provides valuable insights into the performance of ChatGPT on code generation tasks across the three critical aspects. The experimental results demonstrate that (1) ChatGPT is better at generating functionally correct code for problems from before 2021 than for problems from after 2021 across languages, with a 48.14% advantage in Accepted rate on the judgment platform, but ChatGPT's ability to directly fix erroneous code through the multi-round fixing process to achieve correct functionality is relatively weak; (2) the distribution of cyclomatic and cognitive complexity levels for code snippets varies across languages, and the multi-round fixing process generally preserves or increases the complexity levels of code snippets; (3) in algorithm scenarios with C, C++, and Java, and in CWE scenarios with C and Python3, the code generated by ChatGPT contains relevant vulnerabilities; however, the multi-round fixing process for vulnerable code snippets shows promising results, with more than 89% of vulnerabilities successfully addressed; and (4) code generation may be affected by ChatGPT's non-determinism, resulting in variations in the functional correctness, complexity, and security of code snippets. Overall, our findings uncover potential issues and limitations of ChatGPT-based code generation and lay the groundwork for improving AI- and LLM-based code generation techniques.
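For illustration, the multi-round fixing process described above can be viewed as a feedback loop between a test harness and the chat model. Below is a minimal Python sketch of such a loop; it is not the interface used in the study. The OpenAI client call, the model name "gpt-3.5-turbo", and the run_tests() helper are assumptions for illustration only (the abstract states that correctness was evaluated via a judgment platform, whose details are not given here).

    from openai import OpenAI

    client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

    def run_tests(code, test_cases):
        """Placeholder for a judge/test harness: run `code` against `test_cases`
        and return (passed, feedback). Stand-in for the study's judgment platform."""
        raise NotImplementedError

    def generate_and_fix(problem, test_cases, max_rounds=5):
        """Ask the model for a solution, then feed failure reports back in a chat loop."""
        messages = [{"role": "user",
                     "content": f"Write a Python solution for this problem:\n{problem}"}]
        code = ""
        for _ in range(max_rounds):
            reply = client.chat.completions.create(model="gpt-3.5-turbo",
                                                   messages=messages)
            code = reply.choices[0].message.content
            passed, feedback = run_tests(code, test_cases)
            if passed:
                return code
            # Multi-round fixing: return the error report to the model and retry.
            messages.append({"role": "assistant", "content": code})
            messages.append({"role": "user",
                             "content": f"The code fails with:\n{feedback}\nPlease fix it."})
        return code  # best effort after max_rounds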

Cited By

  • (2024) On the Effectiveness of Large Language Models in Domain-Specific Code Generation. ACM Transactions on Software Engineering and Methodology. https://doi.org/10.1145/3697012. Online publication date: 1 October 2024.
  • (2024) Are Large Language Models a Threat to Programming Platforms? An Exploratory Study. Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 292-301. https://doi.org/10.1145/3674805.3686689. Online publication date: 24 October 2024.

Information

Published In

IEEE Transactions on Software Engineering, Volume 50, Issue 6
June 2024
352 pages

Publisher

IEEE Press

Publication History

Published: 23 April 2024

Qualifiers

  • Research-article
