4 Major Findings
In this section, we present our results and discuss the main findings regarding the research questions.
RQ1. What is the number of comments generated based on templates in the method comment dataset and the inline comment dataset?
A previous study [61] found that duplication in datasets of code-comment pairs directly affects the results of comment generation models (i.e., it produces better results than the real value). For example, if some samples in the training set and the test set are exactly the same, the model will perform better on these samples. However, the model is not actually as good as it appears: it likely overfits these samples and generalizes poorly. To analyze how much influence templates have on training comment generation models later, we first propose this RQ. Specifically, we analyze the use of templates in real comments.
We notice that some comments in the dataset are almost the same, differing only in a few words. For example, consider the comments “Find the _Fields constant that matches name, or null if its not found.” and “Find the _Fields constant that matches fieldId, or null if its not found.”, which differ only in the words name and fieldId. After investigation, we find that most of these similar comments were generated by IDEs or other code conversion tools from pre-defined comment templates: the tools add a few specific words to the templates to generate the comments. We refer to comments generated from predefined templates as “template comments”. Template comments are almost identical when they share the same template.
We identify “template comments” in the comment datasets using the AEL algorithm introduced in Section 3. Three people work on this process: two check the results of the AEL algorithm, and a third makes the final decision when the first two disagree. Executing the AEL process takes about 3 days. Table 4 summarizes the results of the AEL algorithm on the method comment dataset and the inline comment dataset. We can see that in the method comment dataset, the proportion of template comments is very high: it is more than 10 times that of the inline comment dataset. Some comments are exactly the same, and we call them “duplicate comments”. Duplicate comments are also template comments and can be categorized by the AEL algorithm as well. Therefore, the numbers in Table 4 include both template comments and duplicate comments.
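To make the template detection concrete, the following is a minimal sketch, not the paper's AEL implementation, of how comments that differ in only a few tokens can be abstracted into a shared template with <*> wildcards. The binning by token count and the greedy merging are simplifying assumptions; duplicate comments naturally fall out as templates with no wildcards.

```python
from collections import defaultdict

WILDCARD = "<*>"

def merge(template, tokens, max_diff):
    """Try to merge a tokenized comment into a same-length template:
    positions already abstracted to <*> match anything; other mismatches
    are abstracted, allowing up to max_diff new wildcards."""
    merged, new_diff = [], 0
    for t, w in zip(template, tokens):
        if t == w or t == WILDCARD:
            merged.append(t)
        else:
            merged.append(WILDCARD)
            new_diff += 1
    return merged if new_diff <= max_diff else None

def cluster_comments(comments, max_diff=2):
    """Greedy AEL-style clustering: bin comments by token count, then
    grow templates within each bin. Returns {template string: members}."""
    bins = defaultdict(list)
    for c in comments:
        bins[len(c.split())].append(c)
    result = {}
    for _, group in bins.items():
        clusters = []  # each entry: [template_tokens, member_comments]
        for c in group:
            tokens = c.split()
            for entry in clusters:
                merged = merge(entry[0], tokens, max_diff)
                if merged is not None:
                    entry[0] = merged
                    entry[1].append(c)
                    break
            else:
                clusters.append([list(tokens), [c]])
        for template, members in clusters:
            result[" ".join(template)] = members
    return result
```

Running this on the two “Find the _Fields constant …” comments from above groups them under one template whose varying position is abstracted to <*>.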
In the 975,765 method comments, the AEL algorithm produces a total of 457,672 clusters, of which 140 clusters contain more than 100 items. In the 973,525 inline comments, the AEL algorithm produces a total of 433,169 clusters, of which 168 clusters contain more than 100 items. We manually check a few clusters with more than 100 items to study the causes of the templates. Tables 5 and 6 summarize the templates detected by AEL in the method and inline comment datasets, and Table 7 summarizes the noisy data in both datasets. The summaries include the explanation of each comment cluster, an example of the template, and the corresponding count.
We can observe from Table 5 that the first cause of comment templates is: “Comments are generated by the predefined comment template in the IDE comment plugin”. That is, the comments are generated from a template predefined in an IDE comment plugin. For example, the comment “Returns true if field <*> is set (has been <*> a value) and false otherwise” describes that the return value is determined by the field <*>, where <*> can be replaced by a variable when generating the real comment. Another cause of comment templates is: “Comments generated when Java methods are automatically generated”. This template shows that the comments are generated along with automatically generated source code. A motivating example is the comments of the methods getter() and setter(), “Auto generated getter method @return <*>” and “Auto generated setter method @param param <*>”, as shown in Table 5. These two comments are generated along with the automatically generated methods getter() and setter().
Table 6 shows the template comments detected by AEL in the inline comment dataset. We summarize the five most common template types. We can see that most of the template comments are duplicate comments: there are no variable tokens in the templates. For example, there are 1,108 identical comments “check for required fields check for sub-struct validity” in the first template, “Comments generated by the open source tool Thrift”. Therefore, the most recurrent and less recurrent comments for each template are the same. We also observe from Table 6 that most of the template comments in the inline comment dataset come from programming frameworks, such as Android (template 2), AWS SDK (Amazon Web Services, template 3), and WSDL (Web Services Description Language, template 4). In addition, a large number of template comments appear in the same project, such as template 5 in Table 6.
Table 7 shows the noisy comments detected by AEL in the method and inline comment datasets. Noisy data can be meaningless symbols or commented-out code. The most common noisy comment in the method comment dataset is the “inheritance document”. These comments use the marks “{@inheritDoc}” or “@inheritDoc” as placeholders in the source code, but they are meaningless for explaining the source code, so we classify them as noisy comments. We can also observe from Table 7 that two other common templates, “Symbolic noise comments” and “commented out code”, can be found in both the method and inline comment datasets. Obviously, these two kinds of comments do nothing to explain the source code.
Summary: using the AEL algorithm, we find a large number of “template comments” in these comment datasets. Most of the template comments in the method comment dataset are generated by predefined comment templates in IDE comment plugins or along with automatically generated source code. Most of the template comments in the inline comment dataset are duplicate comments. Besides, the number of template comments among method comments is larger than that among inline comments. This is because current IDEs mainly provide template-based generation of method comments, but rarely of inline comments. Moreover, the noisy comments “Symbolic noise comments” and “commented out code” can be found in both the method and inline comment datasets.
RQ2. Are there different writing styles for method comment and inline comment?
In this RQ, we explore the difference in writing style between method comments and inline comments. This difference might explain why the same model behaves differently on the two comment generation tasks in the next RQ.
Word usage. We focus on the overall usage of words in the two types of comments. After a series of preprocessing steps on the comment datasets, we analyze the word dictionaries composed of the method and inline comments. The preprocessing is as follows. First, we take the first sentence of each comment as the subject of study. According to the statistics, the average numbers of words in the first sentence of method and inline comments are 14.71 and 9.84, respectively. We then conduct CamelCase splitting and snake_case splitting, and lemmatize each word in the comment. Finally, we turn every word into lower case. The first sentence of a method comment is usually considered a summary sentence [51], so we take the first sentence of the method comment for study. Similarly, we take the first sentence of the inline comment for study, although inline comments often have only one sentence. In addition, we consider it necessary to split the words in comments with CamelCase and snake_case, because some variable or API names from the corresponding code may be mixed into the comments [55]. Word lemmatization restores a word to its original form according to its POS. For example, the past participle “broken” is restored to “break”, and the comparative adjective “bigger” is restored to “big”.
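The splitting and lowercasing steps above can be sketched as follows. This is an illustrative sketch only; the lemmatization step, which the study performs with an NLP tool, is deliberately omitted here.

```python
import re

def split_identifier(token):
    """Split a token on snake_case underscores and CamelCase boundaries,
    e.g. 'getFieldId' -> ['get', 'Field', 'Id'], 'field_id' -> ['field', 'id']."""
    parts = []
    for piece in token.split("_"):
        # acronym runs, capitalized words, and lowercase/digit runs
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z0-9]+", piece))
    return [p for p in parts if p]

def preprocess(first_sentence):
    """CamelCase/snake_case splitting followed by lowercasing. The
    lemmatization described in the text (e.g. 'bigger' -> 'big') would
    normally be applied here and is omitted in this sketch."""
    words = []
    for token in first_sentence.split():
        words.extend(w.lower() for w in split_identifier(token))
    return words
```

For example, `preprocess("Returns the fieldId value")` yields `["returns", "the", "field", "id", "value"]`, so code identifiers mixed into comments are counted as ordinary words.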
After applying the preprocessing, we count the frequency of the words used in the two types of comments to form the corresponding dictionaries. The details of the dictionaries are shown in Table 8: the words used in method comments are more concentrated, while the words used in inline comments are more dispersed. The dictionary of method comments is smaller than that of inline comments, containing 57,553 words versus 87,665 words. We sort the words in each dictionary by frequency. In method comments, only the first 54 words are needed for the cumulative word frequency to reach 50%; for inline comments, 84 words are needed. Similarly, when the cumulative frequency reaches 60%, 70%, 80%, 90%, 95%, 96%, 97%, 98%, and 99%, the number of words needed in the method comment dictionary is clearly smaller than in the inline comment dictionary, as shown in Table 8. In particular, 1,431 words are needed to reach 90% cumulative frequency in inline comments, but the same number of words covers 92.75% of method comments. This shows that developers use a richer vocabulary when writing inline comments, perhaps because inline comments need to describe source code in more specific situations and therefore require more varied expression. Method comments, in contrast, may describe the features of methods at a higher level, which may lead to more concentrated word usage, especially since we only take the first sentence of each method comment into consideration.
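The cumulative-frequency measurement behind these numbers can be sketched with a small helper (an illustrative sketch, not the study's code):

```python
from collections import Counter

def coverage_size(words, target):
    """Number of most-frequent distinct words needed for their cumulative
    frequency to reach `target` (e.g. 0.5 for 50% coverage)."""
    counts = Counter(words)
    total = sum(counts.values())
    covered = 0
    for rank, (_, c) in enumerate(counts.most_common(), start=1):
        covered += c
        if covered / total >= target:
            return rank
    return len(counts)
```

Applied to the two corpora, this is the computation that yields, e.g., 54 words for 50% coverage of method comments versus 84 for inline comments.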
Tokens in comments. We would like to discuss which kind of comment is more likely to reference tokens in the code. Intuitively, we expect inline comments to reference code tokens more readily, because an inline comment directly explains the next few lines of code, whereas a method comment generally explains the function of the entire method. We study this issue by counting the proportion of comments that mention a code token among all comments.
As shown in Table 9, we explore the proportion of comments mentioning a certain type of code token among all comments. The token types include variable, API, basic data type, and reference data type. Basic data types refer to the 8 primitive types provided by the Java language, such as int, float, and boolean. Reference data types refer to user-defined data types, usually Java classes, such as Student or Employee. An inner API is an API that belongs to the same project, and an outer API is an external API coming from another project. To distinguish between inner and outer APIs, we employ JavaParser to identify the APIs of the current project and generate an API list for it. JavaParser converts Java code into a corresponding Abstract Syntax Tree (AST), and we identify the inner APIs of the current project by traversing the AST. Then, if an API is found in the API list of the current project, it is identified as an inner API; otherwise, it is an outer API.
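The final classification step can be sketched as below, assuming the project API list has already been extracted from the ASTs (the function and variable names here are hypothetical, not from the study's tooling):

```python
def classify_api_calls(called_apis, project_api_list):
    """Split the APIs invoked by a code snippet into inner APIs (defined
    in the current project, i.e. present in the extracted API list) and
    outer APIs (coming from other projects)."""
    inner = [a for a in called_apis if a in project_api_list]
    outer = [a for a in called_apis if a not in project_api_list]
    return inner, outer
```

Set membership is all that is needed once the AST traversal has produced the project's API list.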
The data in Table 9 refer to the proportion of comments that mention the corresponding token type among all comments. For example, if there are 10 method comment samples and 3 of them mention an API name in the corresponding code, the proportion of comments mentioning APIs is 30%.
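This proportion can be computed as in the following sketch, where `extract_tokens` stands in for whichever extractor (variables, APIs, data types) is being measured; the names are hypothetical:

```python
def mention_proportion(pairs, extract_tokens):
    """Proportion of (code, comment) pairs whose comment mentions at
    least one token of a given type from its code. `extract_tokens`
    maps a code snippet to the set of tokens of that type."""
    if not pairs:
        return 0.0
    hits = sum(
        1 for code, comment in pairs
        if any(tok in comment.split() for tok in extract_tokens(code))
    )
    return hits / len(pairs)
```

With 3 of 10 comments mentioning an API token, this returns 0.3, matching the 30% example above.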
We are surprised to find that, for every type of token, the statistical results of method comments are higher than those of inline comments. This is contrary to the intuition stated above. In addition, we can see that in the two types of comments, the proportions of comments mentioning APIs are similar. The number of comments mentioning outer APIs is significantly higher than that mentioning inner APIs in both types of comments. This may be because outer APIs are more difficult to understand, since their source code is harder to access, and because the number of outer APIs is larger than that of inner APIs. For the other token types, the proportion of method comments is significantly higher than that of inline comments.
Tokens in commented code. Based on the analysis of tokens in comments, we find that some special tokens exist in comments. To analyze where and how comments capture these special tokens from source code, we also count the proportion of these special tokens appearing in the source code, as shown in Table 10. We find that method code contains more information about variables, APIs, basic data types, and reference data types. Besides, since inline code is located within method code, the method code forms the context of the inline code. We therefore also count the proportion of these special tokens in inline code together with its context, that is, in the method code that contains the commented inline code. The proportion of all four kinds of special tokens increases when the context is added; thus, the context carries additional special token information.
POS. We are also interested in the POS of words and phrases in the two types of comments. We use the well-known natural language processing toolkit NLTK to parse the comment sentences and obtain the POS of each word. We count the various POS structures in method comments and inline comments, and select the 10 most important and useful POS structures, as shown in Tables 11 and 12.
Tables 11 and 12 give the statistics of the POS of words, and the statistics and examples of the POS of phrases, in the two comment datasets. Here “prep or conj” abbreviates “preposition or conjunction”. From Table 11, we can see that the top 10 most frequently used POS of the two types of comments are almost the same. In terms of phrases, the top 3 POS of the two-word phrases are the same, and the top 10 lists are also the same except for slight differences in order. We can conclude that the two types of comments are similar in their use of POS: there is no phenomenon of method comments tending toward words of a certain POS while inline comments tend toward phrases with specific POS combinations. We can also see from the percentages in Table 12 that inline comments are indeed more diverse than method comments, which is consistent with the conclusion reached in the discussion of word usage above. To illustrate the phrases in more detail, we also present a representative example for each kind of phrase in Table 12.
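Given POS-tagged comments (the study obtains the tags with NLTK; here they are assumed to be provided as (word, tag) pairs), counting word-level POS and two-word POS-bigram structures can be sketched as:

```python
from collections import Counter

def pos_statistics(tagged_comments):
    """Count single-word POS tags and two-word POS-bigram structures
    over a list of POS-tagged comments, each a list of (word, tag) pairs."""
    word_pos, phrase_pos = Counter(), Counter()
    for tagged in tagged_comments:
        tags = [t for _, t in tagged]
        word_pos.update(tags)
        phrase_pos.update(zip(tags, tags[1:]))
    return word_pos, phrase_pos
```

The `most_common` view of both counters yields the top-10 lists compared in Tables 11 and 12.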
As shown in Table 8, the dictionary of inline comments is 52.3% larger than that of method comments. However, from Table 12, we find that the differences in two-word phrases are not obvious. Therefore, we analyze the sentence diversity of inline comments and conduct a clustering experiment based on sentence similarity to reduce the diversity of inline comments.
Summary: There exist many differences in writing style between method comments and inline comments. Compared with method comments, inline comments have a more diverse dictionary that includes more tokens; this can guide us to adjust the size of the inline comment dictionary when designing an inline comment generation model. Also, method comments use more tokens of certain types, such as variables, APIs, basic data types, and reference data types; this can guide us to utilize these kinds of tokens when designing a method comment generation model. Besides, the POS distributions of method comments and inline comments are similar, so we can provide a POS mapping table to capture comment generation rules when designing a comment generation model. To corroborate these findings, we also conduct a questionnaire survey to investigate the habits of real developers when writing method and inline comments, as shown in Section 5. The results also show that developers have different writing styles when writing method comments and inline comments.
RQ3. Can method comment generation models be applied to generating inline comments, and why?
From RQ1 and RQ2, we know the characteristics of method comments and inline comments, and we have explored the differences between them. For example, method comments include more templates and have a more concentrated dictionary. These characteristics may influence neural machine translation (NMT) models. We want to analyze whether the differences between method and inline comments influence comment generation models.
We use three existing NMT models for the comparison experiments: Seq2Seq, DeepCom, and Code2Seq. The experimental results are shown in Table 13 (the second and fourth columns). The BLEU-4 score for generating method comments is better than that for generating inline comments for all models; the models perform about 10% better on method comments than on inline comments. Seq2Seq has the best performance in inline comment generation. Besides, Code2Seq performs reasonably well in method comment generation but poorly on inline comments (17–18 points lower than the other two models). This may be due to the feature extraction approach of Code2Seq. Its main idea is to select several AST paths from the AST, but for an inline AST, this approach may cause many LCA (lowest common ancestor) nodes to be selected multiple times. These LCA nodes are only introduced to complete the structure of the inline AST; they have nothing to do with the inline comments and carry no semantic information for generating them. Compared with other traversal approaches, this approach may therefore introduce more noise, which explains the poor result.
Because the same models are used for method and inline comment generation, it seems that the features of the comments affect the performance of the generation models. We further explore the possible reasons.
Template comment. Note that we have detected many template comments in the method comment dataset. These comments are highly consistent and may affect the measured effectiveness of the models. As described in RQ1, when multiple identical comments exist in both the training set and the test set, models perform better on the repeated comments: the training set and test set overlap, which always makes the models appear better than their real performance. To eliminate the influence of template comments, we reconstruct a method comment dataset of the same size; when randomly selecting samples, we avoid samples detected as template comments by the AEL method. We then test the performance of the models on the new method comment dataset. The experimental results are shown in Table 13 (the third column). Comparing the results, the performance of all models degrades by almost 4% after removing the template comment samples, but remains better than on inline comment generation. This shows that the existence of template comments does make the models appear to perform better.
OOV issues. The Out-Of-Vocabulary (OOV) issue refers to out-of-vocabulary or unregistered words appearing in a sample sequence; it is a common issue in NLP models. To mitigate this problem, we use CamelCase splitting and snake_case splitting to reduce token granularity for the three models. Because the input of the models is a code-related sequence, we also count the distribution of code tokens (as in Table 8), which is shown in Table 14. The vocabulary sizes of method code and inline code are similar (212,578 and 214,696). To analyze the influence of OOV tokens, we change the vocabulary size of the method comment generation models and observe how it affects performance, as shown in Table 15. When the vocabulary size decreases from 50,000 tokens to 30,000 tokens, the performance of the comment generation models decreases by 3.44%–8.76%. The result shows that fewer OOV tokens, that is, a larger vocabulary, leads to better model performance. Therefore, mitigating the OOV issue is an effective way to improve the performance of comment generation models.
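The effect of vocabulary size on OOV tokens can be illustrated with a minimal sketch of standard NMT vocabulary truncation (not the exact preprocessing used by these models):

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(token_sequences, size):
    """Keep only the `size` most frequent tokens; the rest become OOV."""
    counts = Counter(t for seq in token_sequences for t in seq)
    return {tok for tok, _ in counts.most_common(size)}

def apply_vocab(seq, vocab):
    """Replace OOV tokens with <unk>, as NMT preprocessing typically does."""
    return [t if t in vocab else UNK for t in seq]
```

Shrinking `size` from 50,000 to 30,000 turns more input tokens into `<unk>`, which is the mechanism behind the performance drop observed in Table 15.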
Tokens in comments. As shown above, even after removing the template comments, the quality of the method comments generated by the models is still better than that of the inline comments. It seems that other factors make it difficult for the models to generate inline comments. In RQ2, when studying which kind of comment is more likely to use tokens from the code, we found that the statistical results for method comments differ considerably from those for inline comments. This may be one of the reasons why the quality of the generated method comments is better.
We continue to use the test set without template comments to study the “tokens” factor. We divide the test set into two parts based on whether tokens in the code are mentioned in the original comments, and compare the quality of the comments generated for the two parts. We can observe from Table 16 that whether APIs and reference data types are mentioned in the comments has an impact on the quality of the generated method comments. This finding differs from a previous study [27]. For the other cases, we are surprised to find that whether or not code tokens appear in the inline comments, the quality of the generated comments is similar. In particular, when variable and reference data type tokens appear in the inline comments, the generation performance improves slightly.
We further analyze this phenomenon. According to our assumption, if a token appears in both the code and the comment, it should be easier for models to generate this token when using the code as a feature. However, the experimental result is quite the contrary. Two hypotheses could explain this result. On the one hand, the popular models we use are encoder-decoder architectures. If a token appears in both the code and the comment, it has two different vectors, one in the encoder's vector space and one in the decoder's vector space; unless some restriction aligns the two vectors, the token behaves like two different tokens in the code and the comment. On the other hand, generating comments is a very complex process: even if tokens appearing in both the code and the comment could improve generation quality locally, this does not guarantee the quality of the whole generated sentence.
We also notice that the models are more effective at generating method comments that do not mention APIs or reference data types, so we conduct further research. The results are shown in Table 17. We find that comments that do not mention APIs or reference data types have a shorter average length and fewer AST nodes. It seems that these samples are relatively uncomplicated, and the models can learn and memorize them more easily.
Word usage. In RQ2, we also learned that the two types of comments differ in word diversity: the vocabulary used in method comments is more concentrated, while the vocabulary used in inline comments is more diverse. Specifically, as shown in Table 8, the dictionary of method comments is 34.35% smaller than that of inline comments. Besides, only 1,128 words cover 90% of the method comment tokens, while 1,431 words are needed to cover 90% of the inline comment tokens. This is probably one of the most important reasons why generated method comments have higher quality than generated inline comments. On the one hand, with a more diverse vocabulary (i.e., the inline comment dictionary), the model has more choices of words at every generation step; with a relatively concentrated vocabulary (i.e., the method comment dictionary), even a randomly picked word has a higher probability of being correct, so generated sentences tend to have higher quality. On the other hand, models learn patterns from the dataset, and patterns are relatively easier to find in a concentrated vocabulary. To demonstrate the influence of word diversity, we try to reduce the diversity of inline comments. Specifically, we conduct a clustering experiment on the inline comment dataset according to word usage similarity, as shown in Table 18. We classify the dataset into 20 clusters, extract the cluster with the largest amount of data, and evaluate the Seq2Seq model on it. The BLEU-4 result is 31.93, which is 6.61% better than the performance on the whole dataset (31.93 versus 29.95).
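A clustering by word-usage similarity can be sketched as follows. This greedy, threshold-based variant over bag-of-words cosine similarity is a deliberate simplification and not necessarily the clustering method used in the experiment:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def greedy_cluster(comments, threshold=0.5):
    """Assign each comment to the first cluster whose representative
    (bag of words of its first member) is similar enough; otherwise
    open a new cluster. Returns the clusters as lists of comments."""
    clusters = []  # each entry: (representative Counter, member list)
    for c in comments:
        bag = Counter(c.split())
        for rep, members in clusters:
            if cosine(rep, bag) >= threshold:
                members.append(c)
                break
        else:
            clusters.append((bag, [c]))
    return [members for _, members in clusters]
```

Comments that share most of their words end up in one cluster, so training on the largest cluster corresponds to the reduced-diversity setting evaluated above.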
Summary: We find that the existence of template comments is one of the main reasons why the quality of generated method comments is better than that of generated inline comments. After eliminating this factor, the method comments generated by the models are still better than the generated inline comments. We also find that the distribution of the vocabulary is another main reason, similar to the general NMT task: for a fixed vocabulary size, a larger word set means more unknown words, which decreases the accuracy of the translation (i.e., comment generation) [60]. We are surprised to find that there is no correlation between the quality of generated comments and whether the same tokens appear in both the code and the comment.
From the findings of the three RQs, we see many differences in the characteristics of method comments and inline comments. These findings motivate us to propose more specialized approaches for method or inline comment generation to improve performance. For instance, there are more template comments among method comments, so we should remove template comments when training a method comment generation model to reduce their influence. In addition, the dictionary of inline comments is more diverse than that of method comments, so it is reasonable to adjust the dictionary size: we can increase the inline comment dictionary size or reduce the method comment dictionary size to improve comment generation. Besides, from Table 2, we find that method code is longer than inline code, and therefore carries more complete semantic and syntactic information. We can consider adding the context of inline code to complete its semantic and syntactic information, but directly adding all the context would make the code sequence so long that it includes redundant information. From Table 10, we find that when the context of inline code is added, there is more information about variables, APIs, basic data types, and reference data types, and from Table 16, we find that comments that mention variables and reference data types have better generation performance. Therefore, it is possible to improve the performance of an inline comment generation model by extracting variable and reference data type tokens from the context as one of the model inputs. These findings can also guide the tuning of the hyperparameters of NLP models. As shown in Tables 1 and 17, the code length, comment length, and number of AST nodes can guide us in choosing the maximum input size of a comment generation model, and the dictionary size can guide the design of the model's embedding size.