1 Introduction
Program generation (e.g., code repair and code translation) involves producing new source code sequences to facilitate software development and maintenance. These activities require software engineers to manually analyze, understand, and even execute programs to ensure the quality of newly developed code. Thus, learning-based automated program generation has been proposed to automatically generate new code based on context. Recently, language models have shown promise in semantic modeling and natural language understanding [
7,
75]. Due to similarities between text and source code [
12], language models (e.g., LSTM and Transformer) have gained immense research interest for automated code generation and understanding [
37,
72]. Specifically, pre-trained language models for code (e.g., CodeT5 [
81] and CodeBERT [
19]), typically pre-trained on massive unlabeled code corpora using self-supervised objectives, can learn generic source code representations. These can then be transferred to diverse downstream tasks, decreasing pre-training costs and improving performance on code understanding and generation [
78]. Consequently, pre-trained language models have become the state of the art for many code intelligence tasks, such as code review [
28,
38], code summarization [
76], code completion [
39,
76], and program repair [
87].
Despite the effectiveness and advanced capabilities of language models for code demonstrated in prior work, these powerful program generation models are not yet reliable enough for practical application [
63,
84]. For example, prior studies [
62,
83,
84] have shown that language models are extremely vulnerable to adversarial attacks, where attackers can induce incorrect outputs through minor perturbations. Additional analysis and demonstration of model robustness are clearly needed. Furthermore, impressive results in lab-only settings may not translate to real-world efficacy [
63]. Next, we discuss three significant issues and implications that must be addressed before widespread adoption of program generation models.
Insufficient Interpretability Analysis. One severe drawback of prior research is a lack of explainability analysis on language models for code. In contrast to linear models, language models have complex architectures (e.g., standard CodeT5 with 12 layers, 220M parameters), making it challenging to explain specific predictions [
14,
67,
68,
69]. In other words, the recommendations made by language models are opaque to practitioners and researchers, who are unlikely to accept code recommendations without evidence, especially for non-trivial cases [
33,
48,
60]. For example, although prior studies have shown that language models can automatically update code to fix bugs, the question of which input characteristics contribute to a prediction is largely overlooked [
84,
87]. This gap in interpretability analysis significantly hinders the real-world adoption of program generation models.
Experimental Bias. While prior research has demonstrated the effectiveness of language models on code-based tasks, their performance evaluations can be susceptible to experimental bias [
84]. A common issue is dataset bias [
9,
65,
86], where most code snippets are collected from open source projects with little curation, which potentially introduces noise. For example, Zhang et al. [86] trained Transformer-based methods on a noisy dataset containing both bot-generated and trivial code commit messages, achieving a 42.4% BLEU-4 score. However, removing the noisy data resulted in a sharp performance drop to 26.2% BLEU-4. Therefore, the actual capabilities of language models trained on such biased datasets are difficult to evaluate accurately, and these models are likely to suffer severe performance degradation when moved from the lab into practice.
Poor Practicability. Substantial research indicates promise for using language models in automated program generation; however, most evaluation results demonstrate that these techniques are not yet practical for real-world application. For example, Tufano et al. [72] proposed a Transformer-based automated code change approach, achieving just 21.16% accuracy. With roughly 21% accuracy (i.e., the proportion of successfully predicted code transformations), the vast majority of recommendations are incorrect, which means that practitioners cannot know whether a suggested code update fits or not. Further experiments showed that increasing the beam size from 1 to 10 brings a significant performance improvement (from 21.16% to 36.02%); however, picking the correct sequence among 10 candidates remains challenging for practitioners.
To the best of our knowledge, no systematic research study has investigated the reliability and explainability of language models for program generation. In response to the observations and concerns raised previously, we conduct the first comprehensive empirical evaluation of popular pre-trained language models on automated program generation tasks. Specifically, we adopt eight mainstream pre-trained models: T5 [
61], CodeT5 [
81], CoTexT [
58], CodeTrans [
17], CodeGPT [
44], CodeBERT [
19], CodeT5+ [
79], and CodeReviewer [
38]. We evaluate performance on four popular program generation tasks—code repair, code review, code translation, and text-to-code generation—using five state-of-the-art benchmark datasets (i.e.,
Tufano et al. [72],
Bugs2Fix [73],
CodeReview [38],
CodeTrans-Dataset [44], and
CONCODE [31]). Our results confirm the superior accuracy of studied models reported in prior work, with some achieving nearly 70% on certain datasets. However, we discovered that performance is skewed by inappropriate experimental design, yielding unreliable and unrealistic evaluations. Specifically, duplication in benchmark datasets skews results. Some have explicit duplicates between training and testing sets, directly inflating performance. In addition, even where testing sets lack identical training examples, they may contain duplicated testing cases. This overlapping test data further distorts evaluation. For code-to-code tasks, a concerning observation is that a substantial percentage of generated code exactly matches inputs, instead of generating refined or updated code sequences. Moreover, explanation results suggest that program generation models can recognize code grammar and structural information. However, they present poor robustness even to minor changes in input sequences. In contrast to previous research, our findings indicate that automated program generation remains imperfect, with ample room for improvement through future work.
Contribution. The main contributions of this article are summarized as follows:
—
To the best of our knowledge, we conduct the first comprehensive benchmark study of pre-trained language models for program generation, evaluating reliability and explainability.
—
Our analysis reveals significant experimental biases in prior work, including dataset duplication and overlapping inputs that inflate performance claims. Explanation analysis demonstrates that models overlook critical tokens and lack robustness, highlighting key challenges for practical deployment.
—
Results provide insights to guide future research toward more rigorous and reliable language models for neural program generation.
Open Science. To support the open science initiative, we publish the studied datasets and a replication package, which are publicly available on GitHub.
Article Organization. The rest of the article is structured as follows. Section
2 presents the background. Section
3 details our experimental design. Section
4 provides our experimental results and analysis. Section
5 discusses the study’s implications and potential threats to its validity. Related work is briefly introduced in Section
6, followed by a conclusion in Section
7.
3 Experimental Design
In this section, we describe the setup and methodologies of our empirical study, focusing on the reliability and explainability of language model based program generation.
3.1 Language Models
Recently, pre-trained language models for automated code understanding and generation tasks have been studied extensively in academia and industry. These existing approaches can be categorized into three types: encoder-based models such as CodeBERT [
19], decoder-based models such as CodeGPT [
44], and encoder-decoder-based models like CodeT5 [
81]. Although prior works [
8,
38,
84] have shown that encoder-based and decoder-based models are not well suited to generation tasks, we include them in our study for completeness. While many state-of-the-art models could potentially be studied, we choose eight representative pre-trained language models (i.e., T5 [
61], CoTexT [
58], CodeTrans [
17], CodeBERT [
19], CodeGPT [
44], CodeT5 [
81], CodeReviewer [
38], and CodeT5+ [
79]) that are publicly available for evaluation.
T5 [
61]. Raffel et al. [
61] proposed the T5 (Text-To-Text Transfer Transformer) architecture, pre-trained on a large natural language corpus. T5 has demonstrated state-of-the-art performance when fine-tuned for many NLP tasks. As T5 is pre-trained only on natural language data, we use it as a baseline to compare against models pre-trained on programming languages.
CoTexT [
58]. CoTexT utilizes the same architecture as T5 and is pre-trained on the CodeSearchNet Corpus [
30] and Google BigQuery [
1]. CoTexT has achieved state-of-the-art results on code generation benchmarks [
58].
CodeTrans [
17]. CodeTrans is a large pre-trained Transformer-based encoder-decoder inspired by T5. Since the official online repository provides many versions, we use the version with the most downloads. The CodeTrans model we used in this study is trained on 9,714 Java open source projects from GitHub.
CodeBERT [
19]. CodeBERT is an encoder-based model built on the RoBERTa architecture and pre-trained on natural language and programming language corpora from the CodeSearchNet Corpus [
30]. It has demonstrated strong performance on code search and summarization [
84].
CodeGPT [
44]. CodeGPT is a decoder-only Transformer model pre-trained on multiple programming languages. It is designed for code generation tasks such as method name prediction, code completion, and code translation [
44].
CodeT5 [
81]. CodeT5 is a state-of-the-art unified pre-trained encoder-decoder programming language model. Inspired by T5, Wang et al. [
81] pre-trained the T5 architecture on eight programming languages together with their comments collected from open source repositories. CodeT5 has demonstrated promising performance on code-related generation tasks [
23,
28,
36].
CodeReviewer [
38]. CodeReviewer is built on CodeT5 and is pre-trained on a large dataset of code updates and corresponding review comments from code review scenarios. Recent work has shown that CodeReviewer achieves strong performance on code refinement [
38].
CodeT5+ [
79]. CodeT5+ is an enhanced version of CodeT5, introducing architectural improvements and advanced pre-training techniques such as span denoising and contrastive learning. These innovations enable CodeT5+ models to achieve state-of-the-art performance across diverse code intelligence tasks, including zero-shot text-to-code generation [
79].
3.2 Downstream Tasks and Corresponding Datasets
Program generation tasks involve producing sequences of tokens in a programming language. To better understand how well language models perform automated program generation, we extensively evaluate four task scenarios across five datasets, as shown in Table
1. The details of each scenario are described next.
Code Review. The objective of the code review task is to automatically implement the code changes requested during pull requests, including bug fixing, refactoring, and optimization. Code review is a critical part of software development and serves as a common program generation task, extensively examined in prior research [
38,
70,
72]. In our empirical study, we assess two different datasets related to code review tasks:
—
Tufano et al.: Tufano et al. [
72] collected code review data from three large Gerrit [
2] code review repositories (i.e.,
Android,
Google, and
Ovirt). This collection forms a code-to-code corpus, where the source code prior to the pull request is transformed into the target code, reflecting the changes post-review. The dataset is segregated according to method length into 10,783
Small instances (with fewer than 50 tokens) and 10,991
Medium instances (with 50 to 100 tokens). Consequently, six subsets are obtained:
Android_S,
Android_M,
Google_S,
Google_M,
Ovirt_S, and
Ovirt_M.
—
CodeReview: Li et al. [
38] collected pull request data from the most popular open source projects on GitHub. The dataset covers projects in nine programming languages, resulting in approximately 1.3 million pairs of pre- and post-review code. It forms a “code+text to code” corpus, where the input is the initial code along with the associated review comments, and the output is the revised code following the pull request.
Code Repair. Code repair aims to automatically fix bugs in code. This is one of the most common program generation tasks and has been widely investigated in prior work [
44,
73]:
—
Bugs2Fix: Tufano et al. [
73] systematically extracted bug-fixing commit data from a large number of GitHub repositories, obtaining method-level pairs of buggy and fixed Java code snippets. This dataset presents a code-to-code transformation task, where the source is the buggy code and the target is the fixed code. The dataset is split into two subsets based on the code length,
B2F_S for small fixes (\(\le 50\) tokens) and
B2F_M for medium fixes (\(\gt 50\) and \(\le 100\) tokens). These subsets contain 58,350 and 65,455 bug-fix instances, respectively.
Code Translation [
50]. This task aims to translate code from one programming language to another while preserving its functionality:
—
CodeTrans-Dataset: CodeTrans-Dataset [
44], part of the CodeXGLUE benchmark [
44], provides paired examples for code translation between Java and C#, totaling 11,800 functionally equivalent method pairs. This dataset enables the evaluation of translation from Java to C# and vice versa, resulting in two distinct subsets:
Java2C# and
C#2Java.
Code Generation. Code generation refers to the creation of executable code based on natural language descriptions:
—
CONCODE: The
CONCODE [
31] dataset is widely used in code generation research and collects examples from approximately 33,000 Java projects on GitHub. It contains 100,000 training examples along with 4,000 examples for validation and testing. Each example is a triple consisting of a natural language description, a code environment, and a code snippet. The dataset exemplifies the “text-to-code” challenge, requiring models to understand textual descriptions and transform them into syntactically and semantically correct code.
In summary, our empirical study spans multiple task scenarios including code review, code repair, code translation, and code generation. We have obtained 12 distinct subsets for this purpose: Android_S, Android_M, Google_S, Google_M, Ovirt_S, Ovirt_M, CodeReview, B2F_S, B2F_M, Java2C#, C#2Java, and CONCODE. These datasets facilitate the exploration of three input-output paradigms: code-to-code, code+text-to-code, and text-to-code, each pivotal to the domain of program generation.
3.3 Evaluation Metrics
To evaluate model performance, we primarily use accuracy (i.e., exact match rate or perfect prediction rate), which is the commonly used metric for program generation tasks [
38,
70,
72,
73]. Specifically, accuracy is calculated by dividing the number of perfectly predicted code snippets by the total snippets in the test set. Additionally, we incorporate the BLEU-4 score from the NLP domain [
56] to evaluate the similarity between the generated and target code. BLEU helps evaluate partial matches and overall fluency, complementing the exact match rate. To further analyze potential data duplication within datasets, we introduce a modified BLEU-4 metric to quantify dataset-level similarity. This enables identifying and measuring repetitive patterns that could skew model training and evaluation. For explanation results, we analyze and compare the average feature importance vectors
\(\bar{r}\). This provides insight into how models utilize input features for prediction.
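For concreteness, the following minimal sketch shows how the exact match rate and a smoothed sentence-level BLEU-4 score can be computed with NLTK; the whitespace-based normalization and the helper names are our illustrative assumptions, not the exact implementation of the replication package.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

SMOOTH = SmoothingFunction().method1  # avoids zero scores for short sequences

def exact_match_rate(predictions, references):
    """Fraction of generated snippets whose tokens exactly match the target code."""
    hits = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return hits / len(references)

def bleu4(candidate: str, reference: str) -> float:
    """Smoothed sentence-level BLEU-4 between two whitespace-tokenized sequences."""
    return sentence_bleu([reference.split()], candidate.split(),
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=SMOOTH)

preds = ["return a + b ;", "return x ;"]
refs = ["return a + b ;", "return y ;"]
print(exact_match_rate(preds, refs))                     # 0.5
print(round(bleu4("return a + b ;", "return a - b ;"), 3))
```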
3.4 Implementation
We build language models on top of two Python libraries: PyTorch [
57] and Transformers [
82]. The studied models from Section
3.1 are obtained via the Transformers API. To enable fair comparison, we use the 12-layer base version for all models, consistent with common practice in relevant literature [
38,
84]. In this study, encoder-based models (CodeBERT) are appended with a Transformer decoder to generate code autoregressively, whereas decoder-based (CodeGPT) and encoder-decoder-based models (CodeT5) generate code autoregressively by design, as adopted by prior studies [
44,
84]. For each dataset, we use identical training, validation, and test splits as well as fine-tuning approaches described in CodeXGLUE [
44] and original papers. The
Bugs2Fix,
CodeTrans-Dataset, and
CONCODE datasets were collected by CodeXGLUE, a benchmark for program understanding and generation. The
Tufano et al. and
CodeReview datasets are used in their original format. For our explainable AI analysis, we adopt the implementations provided by the Captum [
34] and Ecco [
4] packages, both of which are designed for Python. Experiments are run on a machine with an AMD Ryzen 9 5950X 16-core @ 3.4-GHz CPU, 64 GB of RAM, and an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory.
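As an illustration of how the studied models might be obtained through the Transformers API, the sketch below loads CodeT5 as an encoder-decoder model and wraps CodeBERT in a generic encoder-decoder shell so that it can generate code; the checkpoint names and the wrapper choice are assumptions for illustration, not the exact configuration of our replication package.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, EncoderDecoderModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Encoder-decoder model (CodeT5): directly usable for sequence-to-sequence generation.
codet5_tok = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
codet5 = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base").to(device)

# Encoder-only model (CodeBERT): paired with a decoder so that it can generate code.
# Initializing the decoder from the same checkpoint is one simple choice; the
# decoder_start_token_id and pad_token_id must still be set on the config before generation.
codebert_tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
codebert_seq2seq = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/codebert-base", "microsoft/codebert-base").to(device)
```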
3.5 Research Questions
We structure our study around three key research questions to comprehensively evaluate language models for program generation:
RQ1: How Do Language Models Perform on Program Generation Tasks? Numerous pre-trained language models have been proposed, with some demonstrating promising capabilities on certain tasks. However, a systematic and extensive exploration of performance across diverse datasets and models is lacking. We address this gap by evaluating the latest state-of-the-art models on tasks spanning code repair, code review, code translation, and code generation, examining not only whether previously reported successes can be replicated but also whether the models generalize across distinct program generation tasks and datasets.
RQ2: How Reliable Are Automated Program Generation Approaches? This research question critically analyzes the evaluation approaches used to assess automated program generation models, to identify potential experimental flaws or biases that could undermine performance evaluation. Specifically, we investigate the representativeness and diversity of training and testing datasets. Performance heavily relies on dataset quality (i.e., “garbage in, garbage out”) [
63]. Widespread duplication or lack of diversity could skew results. In this research question, we aim to uncover potential limitations that lead to overestimating or underestimating realistic capabilities. Addressing these concerns is vital for establishing robust and reliable benchmarking practices that will contribute to accurate characterization and continued improvement of automated program generation approaches.
RQ3: Can We Explain Why Automated Program Generation Approaches Can (or Fail to) Generate Code Sequences Reliably? Analyzing the generated sequences alone does not reveal why language models perform effectively or ineffectively, because the input features these pre-trained models rely on to predict new code sequences are largely unknown. Therefore, we employ explainable AI approaches to understand which tokens contribute to the generated code sequences. We expect our exploratory experiments on explainable automated program generation to provide practical insights for future research. Specifically, we employ a state-of-the-art model-agnostic explainable approach for interpreting language models and then use the explanation results to understand why program generation models output new code sequences effectively or ineffectively.
4 Results
To answer the aforementioned research questions, we perform an extensive analysis of the generated code sequences and their explanation results. Next, we present the results with respect to each of our three research questions.
4.1 RQ1: How Do Language Models Perform on Program Generation Tasks?
Approach. To better understand how well the pre-trained language models perform in automated program generation, we extensively study and compare their performance on the studied datasets. We follow the same training/validation/testing data splits and fine-tuning procedure for each dataset introduced in Section
3.2. We employed a beam search setting of 1 for our evaluations. Then, we calculated the accuracy (i.e., the exact match rate) for each model. Table
2 presents a detailed summary of the accuracy achieved by eight different language models across these datasets.
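A minimal sketch of this evaluation loop is shown below, assuming a fine-tuned CodeT5-style checkpoint (the checkpoint name is a placeholder) and whitespace-normalized exact matching.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

CKPT = "Salesforce/codet5-base"  # placeholder: substitute the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSeq2SeqLM.from_pretrained(CKPT).eval()

def generate_code(source: str) -> str:
    """Greedy decoding (beam size 1), matching the RQ1 setting."""
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model.generate(**inputs, num_beams=1, max_length=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)

def accuracy(test_pairs):
    """Exact match rate over (source, target) pairs."""
    hits = sum(generate_code(src).split() == tgt.split() for src, tgt in test_pairs)
    return hits / len(test_pairs)
```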
Result. We observe from Table
2 several key findings on language model performance on program generation tasks. First, we find that language models exclusively pre-trained on natural language corpora, such as T5, demonstrate limited effectiveness on programming tasks compared to models pre-trained on code data. For example, T5 achieves just 6.42% and 2.75% accuracy on
Android_S and
Android_M, respectively, highlighting the importance of domain-specific pre-training in improving performance on automated program generation. Additionally, we observe that model architecture plays a significant role in performance. Decoder-only (CodeGPT) and encoder-only (CodeBERT) models exhibit inferior results across multiple datasets compared to encoder-decoder architectures like CodeT5. Furthermore, we observe that models such as CodeT5, CodeReviewer, and CodeT5+ consistently outperform others across various benchmarks. For example, on the code+comment-to-code
CodeReview dataset, CodeReviewer obtains state-of-the-art performance with 30.43% accuracy. Similarly, it leads text-to-code performance on the
CONCODE dataset with 22.65% accuracy. In other code-to-code scenarios, like the Tufano et al. benchmarks, CodeT5+ performs better on multiple datasets (e.g., achieving 11.36% accuracy on
Android_M). These findings are in alignment with previous studies, confirming the superiority of these models in program generation tasks [
28,
38,
79].
However, a critical observation is that the overall accuracy levels across all models are relatively low, especially on more complex tasks, which raises significant concerns about their reliability in real-world settings. When comparing subsets by code length, it is evident that all models perform better on the small subsets than on the medium ones. For example, CodeT5+’s accuracy drops markedly from 18.44% on the small B2F_S subset to just 7.84% on the medium B2F_M subset. Additionally, the performance gap across different datasets is striking, as seen with CodeT5+, which achieves a high of 70.6% on C#2Java but only 7.27% on Google_M. Such fluctuations and inconsistencies make it difficult to understand why models perform well or poorly on specific datasets, and understanding these reasons is crucial for future research, particularly for improving the reliability and practical application of automated program generation.
4.2 RQ2: How Reliable Are Automated Program Generation Approaches?
Reliability and trustworthiness are critically important in automated program generation, especially within the evolving landscape of software engineering [
43]. Our preliminary investigation (RQ1) revealed a large disparity in the performance of language models: some models demonstrated exceptionally high performance on certain datasets yet exhibited markedly lower effectiveness on others. This considerable fluctuation raises concerns about the potential overestimation or underestimation of their capabilities due to experimental biases. Considering the potential for both overconfidence and undue skepticism resulting from these inaccuracies, it is important to measure these tools’ effectiveness accurately, ensuring their proper application and deployment in software engineering practices. In response to these issues, we undertake an in-depth analysis to uncover potential sources of unrealistic performance evaluation. Our analysis is structured along three different aspects:
—
Data duplication between training and testing sets: We investigate how similarities or duplications within training and testing datasets might inflate the perceived performance of language models. Such overlaps may create an illusion of high accuracy, masking the true capabilities of these models in novel or diverse scenarios.
—
Data duplication across testing sets: We investigate the presence of duplicate examples within testing datasets. Duplication within these sets can lead to a misleading evaluation.
—
Output-input similarity analysis: Finally, we examine the correlation between the outputs generated by the models and their inputs. In automated program generation, the expectation is for models to update or refine input code creatively and accurately. However, when outputs are the same as inputs, it raises questions about the true generative capacity of the models.
In our analysis, we utilize the BLEU score, specifically the
BLEU-4 variant, to systematically evaluate the similarity within the studied datasets.
BLEU-4, which assesses the co-occurrence of 4-gram sequences, is widely used in prior research [
29,
70,
84]. A
BLEU-4 score of 0 corresponds to no similarity, indicating completely unique content, whereas a score of 1 reflects total duplication or exact replication. By applying this metric, we can discern the extent to which our datasets contain unique or duplicative examples, thereby providing an empirical basis for evaluating the potential impact of data duplication on evaluation performance.
4.2.1 Data Duplication between Training and Testing Sets.
In automated program generation, the robustness of language model evaluations is crucial. A key threat to this robustness is ‘data snooping,’ a pitfall where models, due to improper data handling, gain inadvertent access to testing information during training [
63]. Such exposure can lead to exaggerated performance metrics, as models may simply recall information rather than apply learned patterns to new data. To prevent this and ensure genuine model generalization, it is essential to assess the overlap between training and testing datasets.
Approach. To determine the similarity score for a test instance
t, we compare its source sequence against each instance in the training set
\(T = \lbrace t_1, t_2, \ldots , t_n\rbrace\). The similarity score,
\(S(t)\), is the maximum BLEU-4 score obtained from these comparisons:
\(S(t) = \max_{t_i \in T} \text{BLEU-4}(t, t_i)\).
Additionally, we keep track of the index
\(i\) that yields this maximum score, which can be defined as follows:
\(i_{\max }(t) = \operatorname{arg\,max}_{i \in \lbrace 1, \ldots , n\rbrace } \text{BLEU-4}(t, t_i)\).
This iterative process is performed for each test instance \(t\), allowing us to quantify the extent of data overlap and identify potential duplication within the datasets.
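A minimal sketch of this computation, assuming whitespace-tokenized source sequences and NLTK’s smoothed sentence-level BLEU-4 (the helper names are ours):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

SMOOTH = SmoothingFunction().method1

def bleu4(a: str, b: str) -> float:
    """Smoothed sentence-level BLEU-4 between two whitespace-tokenized sequences."""
    return sentence_bleu([b.split()], a.split(),
                         weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=SMOOTH)

def max_train_similarity(test_source: str, train_sources: list):
    """Return S(t) and the index i_max of the most similar training instance."""
    scores = [bleu4(test_source, train_source) for train_source in train_sources]
    i_max = max(range(len(scores)), key=scores.__getitem__)
    return scores[i_max], i_max
```

In practice, this exhaustive pairwise comparison is expensive for large training sets and would typically be parallelized or pre-filtered; the sketch only conveys the definition.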
Result. Figure
2 and Table
3 present a detailed analysis of the data similarity between training and testing datasets and its influence on model performance. Figure
2 presents the distribution of test data similarity to training data across various datasets and corresponding recalculations of model performance for each similarity range. Furthermore, we have recorded the average similarity score of the output sequences for each test instance with their most similar training instances, as shown in Figure
2.
Our analysis uncovers a wide variation in the similarity scores across the datasets, with the notable exception of the CodeReview dataset, which stands out due to its lack of highly similar instances. However, datasets such as Ovirt_M show that a majority (exceeding 60%) of test source sequences have a similarity score above 0.8, highlighting considerable overlap with the training data. In particular, in the Tufano et al. datasets (Android_S, Android_M, Google_S, Google_M, Ovirt_S, and Ovirt_M) and the CodeTrans-Dataset subsets (Java2C# and C#2Java), we find a significant number of test samples that are duplicates of training samples (i.e., similarity score = 1). For example, in all Tufano et al. datasets, more than 20% of the test samples are identical to those in the training set. These findings suggest the presence of data duplication within the datasets, raising important questions about the possibility of data snooping that could distort the evaluation of model performance.
In Figure
2, we observe an increase in the similarity of target sequences corresponding to an increase in source sequence similarity, up until the sources are completely identical (
\(S(t)=1\)). Taking the
Android_S dataset as an example, the average similarity for target sequences climbs from 0.1 to approximately 0.6 as the similarity of the source sequences increases. This indicates that target sequences are generally more aligned when the sources are more similar. However, when the source sequences are identical (
\(S(t)=1\)), the similarity between target sequences notably drops. For example, in
Android_S, the average similarity on target sequences drops from 0.6 to 0.4. This observation may indicate a research oversight during dataset preparation, where instances with identical “source + target” pairs are usually removed, but those with identical sources or targets, when considered separately, remain unfiltered.
Figure
2 illustrates that as the similarity between test and training instances increases, so does the performance of language models. For instance, CodeT5+ shows an accuracy of around 10% for test instances with a similarity score below 0.2, which increases to nearly 25% for instances with a similarity score above 0.8 in the
B2F_S dataset. As shown in Table
2, the average performance of CodeT5+ on the
B2F_S dataset is 18.44%. This suggests that models may be leveraging memorized patterns from the training data rather than demonstrating true generalization capabilities. Consequently, the ability of language models to handle unseen or novel test instances remains a considerable challenge for future research. A notable drop in model performance is also observed when the source sequences are exactly the same, a decrease that corresponds with the fall in similarity among output sequences. This could be due to the common practice during dataset preparation where instances of identical “source + target” pairs are removed, possibly leading to an oversight of singular duplicates in sources or targets.
Table
3 further describes the impact on the performance of models such as CodeReviewer and CodeT5+ when instances with high similarity scores are excluded from the test sets. Whereas the
CodeReview dataset shows no change due to a lack of closely matched samples, other datasets usually display a marked decrease in accuracy once we remove test cases with high similarity scores. In the
Android_S dataset, for example, the accuracy of
CodeT5+ decreases from 15.2% to 13.09%. In some cases, the drop is even more apparent;
CodeT5+ falls from 70.6% to 53.09% on the
C#2Java dataset. This highlights the critical role that a varied and comprehensive test set plays in ensuring the accuracy of a model’s performance evaluation.
4.2.2 Data Duplication across Testing Sets.
Building upon the test-training analysis, this section examines intra-set duplications within the test datasets. The presence of duplicated instances in test sets can lead to unreliable evaluation results. If such duplications go unaddressed, the model’s efficacy may end up being evaluated on a narrow subset of repeated test instances rather than a diverse range of samples. To address this concern, we investigate the prevalence of data duplication within the testing sets based on our studied datasets.
Approach. For each test instance
\(t\), we investigate potential duplication within the test set by comparing its source sequence with that of every other test instance. The maximum BLEU-4 score from these comparisons is recorded, along with the index of the test instance that yields this maximum score:
\(D(t) = \max_{t^{\prime } \in T_{\text{test}} \setminus \lbrace t\rbrace } \text{BLEU-4}(t, t^{\prime })\) and \(j_{\text{max}}(t) = \operatorname{arg\,max}_{t^{\prime } \in T_{\text{test}} \setminus \lbrace t\rbrace } \text{BLEU-4}(t, t^{\prime })\).
Here, \(D(t)\) is the similarity score for the test instance \(t\) within the testing data, and \(j_{\text{max}}(t)\) is the index of the test instance \(t^{\prime }\) that is most similar to \(t\). This process is iteratively performed for each test instance in the dataset, allowing us to identify exact and near-duplicates within the test set.
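The within-test-set variant differs only in that each instance is compared against the rest of the test set, excluding itself; a sketch under the same assumptions, reusing the bleu4 helper defined in the earlier sketch:

```python
def max_test_similarity(idx: int, test_sources: list):
    """Return D(t) and j_max(t) for the test instance at position idx."""
    best_score, best_j = -1.0, -1
    for j, other in enumerate(test_sources):
        if j == idx:
            continue  # skip self-comparison
        score = bleu4(test_sources[idx], other)  # bleu4 as in the earlier sketch
        if score > best_score:
            best_score, best_j = score, j
    return best_score, best_j
```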
Result. Figure
3 shows the distribution of similarity within the test data of our studied datasets. We observe that except for the
B2F_S and
B2F_M datasets, which exhibit no duplicate instances, a considerable number of datasets include duplicates. For example,
Android_S records 11% of its test instances as duplicates based on source sequences, and similarly,
Ovirt_M possesses more than 10% duplication. In line with earlier observations from Figure
2, the average similarity scores of the target sequences for these duplicate instances are substantially lower compared to the dataset at large. This indicates that while testing language model performance on program generation, identical source sequences are often provided, but the models are expected to provide different outputs. Figure
4 provides an example from the
Android_S dataset where two test instances with the same source code require different correct outputs from the model. This practice leads to an inherently unfair evaluation scenario, where the same test instance is associated with different performance expectations. Figure
3 indicates that performance on these duplicated test instances can vary greatly from the average, potentially giving an inaccurate representation of model performance. Our findings emphasize the need to carefully construct test sets, avoiding the pitfalls of duplications that can compromise reliable model performance evaluation. It is crucial to construct detailed and impartial evaluation approaches that truly measure the language models’ ability to generate code across a wide array of test scenarios.
4.2.3 Output-Input Similarity Analysis.
The goal of automated program generation is to refine or create new code sequences that accurately address specified functionality changes or requirements. In most prior works, model predictions are compared against target outputs to calculate accuracy and other performance metrics. We extend this evaluation to compare the generated outputs with the original inputs, assessing whether the models are merely mirroring the inputs or actually generating updated code sequences.
Approach. Different from our earlier focus on similarity between training and testing sets and within testing sets, here we focus specifically on the percentage of model outputs that are identical to their source code sequences. For this comparison, we use a direct string comparison, discounting special tokens (e.g., “<pad>”, “<s>”, and “</s>”), which are used only for formatting and do not contribute to the functional content of the code. Through this process, for each model’s predictions, we calculate the proportion of outputs that are exactly the same as the source sequences.
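A sketch of this comparison is given below; the special-token list and whitespace normalization are our assumptions.

```python
SPECIAL_TOKENS = {"<pad>", "<s>", "</s>"}

def normalize(sequence: str) -> str:
    """Drop formatting-only special tokens and collapse whitespace."""
    return " ".join(tok for tok in sequence.split() if tok not in SPECIAL_TOKENS)

def output_input_duplication_rate(outputs, sources):
    """Fraction of generated outputs that are identical to their inputs."""
    same = sum(normalize(o) == normalize(s) for o, s in zip(outputs, sources))
    return same / len(sources)
```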
Result. Table
4 presents the rates of duplication between model outputs and inputs across several language models on the studied datasets. The table reveals a considerable variance in the rates across different datasets and models. Notably, only the
CodeReview,
Java2C#, and
C#2Java datasets display minimal duplication between source and target sequences, with rates as low as 1% to 2%, suggesting that these datasets effectively evaluate the models’ capacity for generating new code. When comparing model outputs with source code sequences, language models present high rates of output-input duplication. For example, the rate of T5 on
Android_S and
Android_M reaches 78% and 80%, respectively. Such a high duplication rate suggests that a substantial portion of the output code is replicated from the inputs, calling into question the models’ generative capabilities. Although models like CodeReviewer and CodeT5+ show superior performance, as indicated in RQ1, they still exhibit a significant degree of output duplication with their inputs (e.g., 35% for CodeT5+ on the
B2F_S dataset and 32% on
Google_M). The analysis of the
CodeReview dataset, even within the context of code+comment-to-code tasks, reveals a significant level of duplication between the source code sequences and the models’ outputs. Note that the
CONCODE dataset, which focuses on text-to-code tasks, demonstrates zero duplication, indicating the models’ potential to generate entirely new code from textual prompts. In the context of language models applied to code translation tasks within the
Java2C# and
C#2Java datasets, we observe a commendably low duplication rate. The observed variability in duplication rates across datasets and tasks underscores the imperative for nuanced and robust evaluation metrics that can accurately reflect the true generative capabilities of language models.
4.3 RQ3: Can We Explain Why Automated Program Generation Approaches Can (or Fail to) Generate Code Sequences Reliably?
Most pre-trained language models operate as black-box systems, obscuring their internal decision-making processes. As highlighted in RQ1, the accuracy of these models with the beam size set to 1 is not particularly high, underscoring the uncertainty in their output reliability. Furthermore, the results from RQ2 suggest that the evaluation of these models may be compromised by certain impractical experimental settings. Consequently, it is necessary to employ explainable AI approaches to demystify the internal workings of these models. In this section, we employ gradient-based SHAP to examine and understand the decision-making processes of automated program generation models. This approach aims to uncover the factors behind a model’s ability or failure to generate accurate and reliable code sequences. To ensure a focused and relevant analysis, we limit our examination to the models demonstrating superior performance in the previous sections (i.e., CodeT5, CodeReviewer, and CodeT5+), with the aim of understanding what contributes to their effective and reliable performance in program generation and guiding future developments in this field.
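While our analysis relies on a gradient-based SHAP implementation (via Captum), the simplified gradient-times-input sketch below conveys the underlying idea of attributing a model’s output to its input tokens; it is an approximation for illustration, not the exact method used in this study, and the checkpoint and example are placeholders.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")  # placeholder checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base").eval()

source = "public int add ( int a , int b ) { return a - b ; }"
target = "public int add ( int a , int b ) { return a + b ; }"

enc = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# Embed the input tokens explicitly so gradients can be taken w.r.t. the embeddings.
inputs_embeds = model.get_input_embeddings()(enc.input_ids).detach().requires_grad_(True)

out = model(inputs_embeds=inputs_embeds, attention_mask=enc.attention_mask, labels=labels)
out.loss.backward()

# Gradient-times-input saliency per source token (L2 norm over the embedding dimension).
saliency = (inputs_embeds.grad * inputs_embeds.detach()).norm(dim=-1).squeeze(0)
for token, score in zip(tokenizer.convert_ids_to_tokens(enc.input_ids[0].tolist()),
                        saliency.tolist()):
    print(f"{token:>12s}  {score:.4f}")
```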
4.3.1 Understanding Token Importance.
In this section, we examine the learning patterns of language models in program generation by analyzing the average feature importance of different types of tokens. These token types represent the fundamental building blocks of programming languages and are crucial for understanding the model’s focus during the program generation process. We analyze the following five token types (a rough illustrative classifier is sketched after the list):
—
Identifiers are unique names for variables, classes, methods, and so on.
—
Keywords refer to programming language-specific words with pre-determined standard meanings.
—
Operators are special symbols that represent some operation on data.
—
Separators are special symbols used to delimit groups of code (e.g., braces, parentheses, and semicolons).
—
Literals refer to constant values.
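For illustration, a rough (and deliberately incomplete) classifier that maps whitespace-separated Java tokens to these five categories might look as follows; the keyword and operator lists are abbreviated assumptions.

```python
import re

JAVA_KEYWORDS = {"public", "private", "protected", "static", "final", "void", "int",
                 "boolean", "class", "return", "if", "else", "for", "while", "new",
                 "try", "catch", "throw"}
OPERATORS = {"+", "-", "*", "/", "%", "=", "==", "!=", "<", ">", "<=", ">=",
             "&&", "||", "!", "++", "--", "+=", "-=", "?", ":"}
SEPARATORS = {"(", ")", "{", "}", "[", "]", ";", ",", "."}

def token_type(tok: str) -> str:
    """Rough mapping of a whitespace-separated Java token to a coarse category."""
    if tok in JAVA_KEYWORDS:
        return "keyword"
    if tok in OPERATORS:
        return "operator"
    if tok in SEPARATORS:
        return "separator"
    if tok in {"true", "false", "null"} or re.fullmatch(r"\d+(\.\d+)?[fFlL]?", tok) \
            or (len(tok) >= 2 and tok[0] == tok[-1] and tok[0] in {'"', "'"}):
        return "literal"
    if re.fullmatch(r"[A-Za-z_$][A-Za-z0-9_$]*", tok):
        return "identifier"
    return "other"

print([token_type(t) for t in "return a + b ;".split()])
# ['keyword', 'identifier', 'operator', 'identifier', 'separator']
```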
Result. Figure
5 presents the average feature importance of token types across various datasets when analyzed through the CodeT5, CodeReviewer, and CodeT5+ models. The bar chart provides a comparative visualization of how each model weighs the significance of identifiers, keywords, operators, separators, and literals during the program generation process. A consistent pattern is apparent across the models: identifiers and keywords tend to be assigned higher importance scores, reflecting their critical role in capturing the syntactic and semantic structure of the code. This suggests that the models prioritize the recognition of variable names, function calls, and control structures, which are key to the functionality of the code. Operators and separators, while varied across models and datasets, generally exhibit moderate importance. This suggests that the models capture, to a lesser degree, the operational logic and structural delineation within the code; although less emphasized than identifiers and keywords, these tokens are still recognized as essential components of program logic.
4.3.2 Model Performance Under Input Token Reduction.
In this section, we investigate the robustness of program generation models when faced with reduced input tokens. Drawing on the explanation results, we examine how the selective deletion of tokens classified as least important (i.e., those with the lowest feature importance scores) impacts the models’ ability to generate accurate code sequences.
Impacts of Token Reduction Size. In this study, we examine the impact of selectively removing tokens from the input sequences based on their feature importance scores. For each test case, tokens are ranked by their importance, and we methodically remove a specified number of the least important tokens, with placeholders inserted to preserve the format of the code. This process generates new “source-target” pairs, which are then fed back into the models. The experiment is conducted with incremental token removals, specifically at counts of 1, 3, 5, 10, and 15, to understand how the absence of certain tokens affects the models’ code generation capabilities. In our analysis, we denote the original “source-target” testing instances with a 0 to differentiate them from the modified instances.
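A minimal sketch of this reduction step, assuming tokens and per-token importance scores are already aligned and using a generic placeholder token of our choosing:

```python
def remove_least_important(tokens, importance, k, placeholder="<mask>"):
    """Replace the k tokens with the lowest importance scores by a placeholder,
    keeping their positions so that the overall code format is preserved."""
    drop = set(sorted(range(len(tokens)), key=lambda i: importance[i])[:k])
    return [placeholder if i in drop else tok for i, tok in enumerate(tokens)]

tokens = "public int add ( int a , int b ) { return a + b ; }".split()
scores = [0.2, 0.6, 0.9, 0.1, 0.5, 0.8, 0.1, 0.4, 0.7, 0.1,
          0.3, 0.9, 0.8, 0.5, 0.8, 0.2, 0.3]  # hypothetical importance scores
print(" ".join(remove_least_important(tokens, scores, k=3)))
# public int add <mask> int a <mask> int b <mask> { return a + b ; }
```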
Result. The results presented in Figure
6, which focus on the
Android_S dataset, provide a clear illustration of the performance degradation associated with the incremental removal of the least important tokens. For example, in the CodeT5+ model, eliminating just one token leads to a significant drop in performance, from 15.2% to 9.19%. As the number of tokens removed increases to 5, the performance further declines to a mere 3%. Beyond the removal of 10 tokens, the model’s performance drops to zero, indicating a complete inability to generate the correct code sequence. This pattern is not unique to CodeT5+; similar trends are observed across all three studied models. The significant decrease in performance, even with the removal of tokens previously considered less important, suggests a complex interdependence among the various input features. These findings raise important questions about the robustness of program generation models, particularly their sensitivity to changes in input and their ability to adapt and maintain accuracy under modified conditions.
Comparative Analysis of Token Removal Strategies. To evaluate the reliability of the explainable AI approach, we compare strategies for removing tokens from the input sequences, specifically targeting the lowest- and highest-importance tokens identified for the model, and also using random token removal as a baseline. This comparison is designed to investigate how each removal strategy affects the model’s performance. By comparing targeted versus random token removal, we assess how accurately the explainable AI approach identifies the features that matter for program generation. In each of the three strategies, we consistently remove five tokens for a standardized comparison.
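The three strategies can be expressed as small variations of the same helper; a sketch under the same assumptions as before:

```python
import random

def remove_tokens(tokens, importance, k=5, strategy="lowest", placeholder="<mask>"):
    """Remove k tokens chosen by importance ranking ('lowest' or 'highest'),
    or uniformly at random, replacing them with a placeholder."""
    if strategy == "random":
        drop = set(random.sample(range(len(tokens)), k))
    else:
        order = sorted(range(len(tokens)), key=lambda i: importance[i],
                       reverse=(strategy == "highest"))
        drop = set(order[:k])
    return [placeholder if i in drop else tok for i, tok in enumerate(tokens)]
```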
Result. As shown in Figure
7, the removal of tokens demonstrates a clear impact on performance across different datasets. Specifically, removing tokens identified with the lowest importance typically leads to a smaller decline in performance compared to either random removal or the removal of the most important tokens. For example, within the
Android_S dataset, the accuracy of the CodeT5+ model falls to 3% after removing the least important tokens but drops almost to zero with random or most important token removal. Similarly, within the
B2F_S and
CONCODE datasets, the accuracy of the CodeT5+ model under lowest-importance and random removal is much higher than when the most important tokens are removed. These findings suggest that explainable AI approaches can help determine the significance of different input tokens for program generation. Moreover, the consistent decrease in performance across various datasets following the removal of important tokens reveals a lack of robustness in these language models.
5 Discussion
In this study, we have identified several interesting findings regarding the reliability and explainability of automated program generation approaches. We now discuss the main implications and limitations of our study.
5.1 Implications
Reliability. High reliability is essential for the real-world usage of language model-based automated program generation systems. As highlighted by prior works [
62,
83], state-of-the-art language models are vulnerable to adversarial attacks and can be fooled into recommending wrong code, indicating that existing language models still suffer from reliability issues. However, most prior work focuses on improving the accuracy of automated program generation systems and neglects to evaluate whether the proposed methodologies are sufficiently reliable.
Our study first replicates the good performance reported in previous research. Our results further indicate that data duplication commonly exists in state-of-the-art program generation datasets (i.e., duplication between training and testing sets, as well as duplication within testing sets), and that these issues lead to unreliable or unrealistic performance evaluations in prior research. Further, our results show that pre-trained language models frequently replicate their inputs, outputting numerous unchanged code sequences. These findings provide evidence that language model based program generation approaches suffer from serious reliability threats, since their performance is overestimated or underestimated by flawed experimental analysis. This not only undermines the evaluation of deep learning systems but also raises concerns regarding their deployment in real-world applications. Consequently, there is a need for research on enhancing the quality and reliability of automated program generation studies, which can further benefit the deployment of deep learning systems in the real world. First, more research effort should focus on dataset quality. Beyond the data duplication issues [
5], Shi et al. [
64] and Sun et al. [
65] have demonstrated that data noise was prevalent in widely used benchmark datasets in code summarization and code search. Therefore, for future research, it is important to construct a rigorous evaluation methodology supported by reliable and standardized benchmarks. In addition, instead of only relying on accuracy, more comprehensive evaluation metrics should be employed to provide a reliable evaluation for language models.
Explainability. Pre-trained models are black boxes, and prior research usually makes little attempt to understand why language model based program generation approaches make a specific prediction. Our study employs a model-agnostic explainable AI approach to explain language models, and the results reveal several insightful findings that can inspire future research. First, we observe that program generation models pay much more attention to keywords and identifiers than to operators and separators, indicating that language models are capable of recognizing code grammar and structural information. However, we also find a significant decline in performance when even a few tokens are removed, including those considered least important, highlighting a lack of robustness in these models. These observations help us better understand the inference behaviors and learning abilities of program generation models. Consequently, our study demonstrates that explainable AI is an effective and promising means of analyzing and improving the reliability of language model based automated program generation systems, and more research on this aspect is warranted.
5.2 Threats to Validity
The primary threat to internal validity lies in the model architectures and hyper-parameter settings. We use eight program generation models configured with the same settings as in the original papers. Hyper-parameter tuning would likely bring further performance improvements; however, the goal of our work is not to find the best setting but to fairly investigate the reliability and explainability of program generation models. The external threats to validity mainly lie in the studied datasets and the generalizability of the results. In this study, we used five different types of program generation datasets, covering four types of code generation tasks (i.e., code review, code repair, code translation, and code generation), different token sizes (i.e., small and medium), different programming languages (e.g., Java, Python, C#), and three types of inputs (i.e., code only, code+comments, and text only). For reproducibility purposes, we provide a replication package to facilitate replication of our approach on more repositories and tasks.