4 Major Findings
In this section, we present our results and discuss the main findings regarding the research questions.
RQ1. What is the number of comments generated based on templates in the method comment dataset and the inline comment dataset?
A previous study [61] found that duplication in datasets of code-comment pairs directly affects the results of comment generation models (i.e., it produces better results than the real value). For example, if some samples in the training set and the test set are exactly the same, the model will perform better on these samples. However, the model is not actually as good as it appears: it likely overfits these samples and generalizes poorly. To analyze how much influence templates have on training comment generation models later, we first propose this RQ. Specifically, we analyze the use of templates in real comments.
We notice that some comments in the dataset are almost the same, differing only in a few words. For example, consider the comments “Find the _Fields constant that matches name, or null if its not found.” and “Find the _Fields constant that matches fieldId, or null if its not found.”, which differ only in the words name and fieldId. After investigation, we find that most of these similar comments were generated by IDEs or other code conversion tools from pre-defined comment templates: the tools add a few specific words to the templates to generate the comments. We refer to comments generated from predefined templates as “template comments”. Template comments are almost identical when they share the same template.
We identify “template comments” in the comment datasets using the AEL algorithm introduced in Section 3. Three people work on this process: two check the results of the AEL algorithm, and a third makes the final decision when the first two disagree. Executing the AEL process takes about 3 days. Table 4 summarizes the results of the AEL algorithm on the method comment dataset and the inline comment dataset. We can see that in the method comment dataset, the proportion of template comments is very high: it is more than 10 times that of the inline comment dataset. Some comments are exactly the same, and we call them “duplicate comments”. Duplicate comments are also template comments and can be categorized by the AEL algorithm as well. Therefore, the numbers in Table 4 include both template comments and duplicate comments.
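To make the template detection concrete, the following is a minimal sketch, not the paper's AEL implementation, of how comments that differ in only a few tokens can be abstracted into a shared template with <*> wildcards. The binning by token count and the greedy merging are simplifying assumptions; duplicate comments naturally fall out as templates with no wildcards.

```python
from collections import defaultdict

WILDCARD = "<*>"

def merge(template, tokens, max_diff):
    """Try to merge a tokenized comment into a same-length template:
    positions already abstracted to <*> match anything; other mismatches
    are abstracted, allowing up to max_diff new wildcards."""
    merged, new_diff = [], 0
    for t, w in zip(template, tokens):
        if t == w or t == WILDCARD:
            merged.append(t)
        else:
            merged.append(WILDCARD)
            new_diff += 1
    return merged if new_diff <= max_diff else None

def cluster_comments(comments, max_diff=2):
    """Greedy AEL-style clustering: bin comments by token count, then
    grow templates within each bin. Returns {template string: members}."""
    bins = defaultdict(list)
    for c in comments:
        bins[len(c.split())].append(c)
    result = {}
    for _, group in bins.items():
        clusters = []  # each entry: [template_tokens, member_comments]
        for c in group:
            tokens = c.split()
            for entry in clusters:
                merged = merge(entry[0], tokens, max_diff)
                if merged is not None:
                    entry[0] = merged
                    entry[1].append(c)
                    break
            else:
                clusters.append([list(tokens), [c]])
        for template, members in clusters:
            result[" ".join(template)] = members
    return result
```

Running this on the two “Find the _Fields constant …” comments from above groups them under one template whose varying position is abstracted to <*>.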
In the 975,765 method comments, the AEL algorithm produces a total of 457,672 clusters, of which 140 clusters contain more than 100 items. In the 973,525 inline comments, the AEL algorithm produces a total of 433,169 clusters, of which 168 clusters contain more than 100 items. We manually check a few clusters with more than 100 items to study the causes of the templates. Tables 5 and 6 summarize the templates detected by AEL in the method and inline comment datasets, and Table 7 summarizes the noisy data in both datasets. The summaries include the explanation of each comment cluster, an example of the template, and the corresponding count.
We can observe from Table 5 that the first cause of comment templates is: “Comments are generated by the predefined comment template in the IDE comment plugin”. That is, the comments are generated from a template predefined in an IDE comment plugin. For example, the comment “Returns true if field <*> is set (has been <*> a value) and false otherwise” describes that the return value is determined by the field <*>, where <*> can be replaced by a variable when generating the real comment. Another cause of comment templates is: “Comments generated when Java methods are automatically generated”. This template shows that the comments are generated along with automatically generated source code. A motivating example is the comments of the methods getter() and setter(), “Auto generated getter method @return <*>” and “Auto generated setter method @param param <*>”, as shown in Table 5. These two comments are generated along with the automatically generated methods getter() and setter().
Table 6 shows the template comments detected by AEL in the inline comment dataset. We summarize the five most common template types. We can see that most of the template comments are duplicate comments: there are no variable tokens in the templates. For example, there are 1,108 identical comments “check for required fields check for sub-struct validity” in the first template, “Comments generated by the open source tool Thrift”. Therefore, the most recurrent and less recurrent comments for each template are the same. We also observe from Table 6 that most of the template comments in the inline comment dataset come from programming frameworks, such as Android (template 2), AWS SDK (Amazon Web Services, template 3), and WSDL (Web Services Description Language, template 4). In addition, a large number of template comments appear in the same project, such as template 5 in Table 6.
Table 7 shows the noisy comments detected by AEL in the method and inline comment datasets. Noisy data can be meaningless symbols or commented-out code. The most common noisy comment in the method comment dataset is the “inheritance document”. These comments use the marks “{@inheritDoc}” or “@inheritDoc” as placeholders in the source code, but they are meaningless for explaining the source code, so we classify them as noisy comments. We can also observe from Table 7 that two other common templates, “Symbolic noise comments” and “commented out code”, can be found in both the method and inline comment datasets. Obviously, these two kinds of comments do nothing to explain the source code.
Summary: using the AEL algorithm, we find a large number of “template comments” in these comment datasets. Most of the template comments in the method comment dataset are generated by predefined comment templates in IDE comment plugins or along with automatically generated source code. Most of the template comments in the inline comment dataset are duplicate comments. Besides, the number of template comments among method comments is larger than that among inline comments. This is because current IDEs mainly provide template-based generation of method comments, but rarely of inline comments. Moreover, the noisy comments “Symbolic noise comments” and “commented out code” can be found in both the method and inline comment datasets.
RQ2. Are there different writing styles for method comment and inline comment?
In this RQ, we explore the difference in writing style between method comments and inline comments. This difference might explain why the same model behaves differently on the two comment generation tasks in the next RQ.
Word usage. We focus on the overall usage of words in the two types of comments. After a series of preprocessing steps on the comment datasets, we analyze the word dictionaries composed of the method and inline comments. The preprocessing is as follows. First, we take the first sentence of each comment as the subject of study. According to the statistics, the average numbers of words in the first sentence of method and inline comments are 14.71 and 9.84, respectively. We then conduct CamelCase splitting and snake_case splitting, and lemmatize each word in the comment. Finally, we turn every word into lower case. The first sentence of a method comment is usually considered a summary sentence [51], so we take the first sentence of the method comment for study. Similarly, we take the first sentence of the inline comment for study, although inline comments often have only one sentence. In addition, we consider it necessary to split the words in comments with CamelCase and snake_case, because some variable or API names from the corresponding code may be mixed into the comments [55]. Word lemmatization restores a word to its original form according to its POS. For example, the past participle “broken” is restored to “break”, and the comparative adjective “bigger” is restored to “big”.
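The splitting and lowercasing steps above can be sketched as follows. This is an illustrative sketch only; the lemmatization step, which the study performs with an NLP tool, is deliberately omitted here.

```python
import re

def split_identifier(token):
    """Split a token on snake_case underscores and CamelCase boundaries,
    e.g. 'getFieldId' -> ['get', 'Field', 'Id'], 'field_id' -> ['field', 'id']."""
    parts = []
    for piece in token.split("_"):
        # acronym runs, capitalized words, and lowercase/digit runs
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z0-9]+", piece))
    return [p for p in parts if p]

def preprocess(first_sentence):
    """CamelCase/snake_case splitting followed by lowercasing. The
    lemmatization described in the text (e.g. 'bigger' -> 'big') would
    normally be applied here and is omitted in this sketch."""
    words = []
    for token in first_sentence.split():
        words.extend(w.lower() for w in split_identifier(token))
    return words
```

For example, `preprocess("Returns the fieldId value")` yields `["returns", "the", "field", "id", "value"]`, so code identifiers mixed into comments are counted as ordinary words.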
After applying the preprocessing, we count the frequency of the words used in the two types of comments to form the corresponding dictionaries. The details of the dictionaries are shown in Table 8: the words used in method comments are more concentrated, while the words used in inline comments are more dispersed. The dictionary of method comments is smaller than that of inline comments, containing 57,553 words versus 87,665 words. We sort the words in each dictionary by frequency. In method comments, only the first 54 words are needed for the cumulative word frequency to reach 50%; for inline comments, 84 words are needed. Similarly, when the cumulative frequency reaches 60%, 70%, 80%, 90%, 95%, 96%, 97%, 98%, and 99%, the number of words needed in the method comment dictionary is clearly smaller than in the inline comment dictionary, as shown in Table 8. In particular, 1,431 words are needed to reach 90% cumulative frequency in inline comments, but the same number of words covers 92.75% of method comments. This shows that developers use a richer vocabulary when writing inline comments, perhaps because inline comments need to describe source code in more specific situations and therefore require more varied expression. Method comments, in contrast, may describe the features of methods at a higher level, which may lead to more concentrated word usage, especially since we only take the first sentence of each method comment into consideration.
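The cumulative-frequency measurement behind these numbers can be sketched with a small helper (an illustrative sketch, not the study's code):

```python
from collections import Counter

def coverage_size(words, target):
    """Number of most-frequent distinct words needed for their cumulative
    frequency to reach `target` (e.g. 0.5 for 50% coverage)."""
    counts = Counter(words)
    total = sum(counts.values())
    covered = 0
    for rank, (_, c) in enumerate(counts.most_common(), start=1):
        covered += c
        if covered / total >= target:
            return rank
    return len(counts)
```

Applied to the two corpora, this is the computation that yields, e.g., 54 words for 50% coverage of method comments versus 84 for inline comments.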
Tokens in comments. We would like to discuss which kind of comment is more likely to reference tokens in the code. Intuitively, we expect inline comments to reference code tokens more readily, because an inline comment directly explains the next few lines of code, whereas a method comment generally explains the function of the entire method. We study this issue by counting the proportion of comments that mention a code token among all comments.
As shown in Table 9, we explore the proportion of comments mentioning a certain type of code token among all comments. The token types include variable, API, basic data type, and reference data type. Basic data types refer to the 8 primitive types provided by the Java language, such as int, float, and boolean. Reference data types refer to user-defined data types, usually Java classes, such as Student or Employee. An inner API is an API that belongs to the same project, and an outer API is an external API coming from another project. To distinguish between inner and outer APIs, we employ JavaParser to identify the APIs of the current project and generate an API list for it. JavaParser converts Java code into a corresponding Abstract Syntax Tree (AST), and we identify the inner APIs of the current project by traversing the AST. Then, if an API is found in the API list of the current project, it is identified as an inner API; otherwise, it is an outer API.
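The final classification step can be sketched as below, assuming the project API list has already been extracted from the ASTs (the function and variable names here are hypothetical, not from the study's tooling):

```python
def classify_api_calls(called_apis, project_api_list):
    """Split the APIs invoked by a code snippet into inner APIs (defined
    in the current project, i.e. present in the extracted API list) and
    outer APIs (coming from other projects)."""
    inner = [a for a in called_apis if a in project_api_list]
    outer = [a for a in called_apis if a not in project_api_list]
    return inner, outer
```

Set membership is all that is needed once the AST traversal has produced the project's API list.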
The data in Table 9 refer to the proportion of comments that mention the corresponding token type among all comments. For example, if there are 10 method comment samples and 3 of them mention an API name in the corresponding code, the proportion of comments mentioning APIs is 30%.
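This proportion can be computed as in the following sketch, where `extract_tokens` stands in for whichever extractor (variables, APIs, data types) is being measured; the names are hypothetical:

```python
def mention_proportion(pairs, extract_tokens):
    """Proportion of (code, comment) pairs whose comment mentions at
    least one token of a given type from its code. `extract_tokens`
    maps a code snippet to the set of tokens of that type."""
    if not pairs:
        return 0.0
    hits = sum(
        1 for code, comment in pairs
        if any(tok in comment.split() for tok in extract_tokens(code))
    )
    return hits / len(pairs)
```

With 3 of 10 comments mentioning an API token, this returns 0.3, matching the 30% example above.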
We are surprised to find that, for every type of token, the statistical results of method comments are higher than those of inline comments. This is contrary to the intuition stated above. In addition, we can see that in the two types of comments, the proportions of comments mentioning APIs are similar. The number of comments mentioning outer APIs is significantly higher than that mentioning inner APIs in both types of comments. This may be because outer APIs are more difficult to understand, since their source code is harder to access, and because the number of outer APIs is larger than that of inner APIs. For the other token types, the proportion of method comments is significantly higher than that of inline comments.
Tokens in commented code. Based on the analysis of tokens in comments, we find that some special tokens exist in comments. To analyze where and how comments capture these special tokens from source code, we also count the proportion of these special tokens appearing in the source code, as shown in Table 10. We find that method code contains more information about variables, APIs, basic data types, and reference data types. Besides, since inline code is located within method code, the method code forms the context of the inline code. We therefore also count the proportion of these special tokens in inline code together with its context, that is, in the method code that contains the commented inline code. The proportion of all four kinds of special tokens increases when the context is added; thus, the context carries additional special token information.
POS. We are also interested in the POS of words and phrases in the two types of comments. We use the well-known natural language processing toolkit NLTK to parse the comment sentences and obtain the POS of each word. We count the various POS structures in method comments and inline comments, and select the 10 most important and useful POS structures, as shown in Tables 11 and 12.
Tables 11 and 12 give the statistics of the POS of words, and the statistics and examples of the POS of phrases, in the two comment datasets. Here “prep or conj” abbreviates “preposition or conjunction”. From Table 11, we can see that the top 10 most frequently used POS of the two types of comments are almost the same. In terms of phrases, the top 3 POS of the two-word phrases are the same, and the top 10 lists are also the same except for slight differences in order. We can conclude that the two types of comments are similar in their use of POS: there is no phenomenon of method comments tending toward words of a certain POS while inline comments tend toward phrases with specific POS combinations. We can also see from the percentages in Table 12 that inline comments are indeed more diverse than method comments, which is consistent with the conclusion reached in the discussion of word usage above. To illustrate the phrases in more detail, we also present a representative example for each kind of phrase in Table 12.
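Given POS-tagged comments (the study obtains the tags with NLTK; here they are assumed to be provided as (word, tag) pairs), counting word-level POS and two-word POS-bigram structures can be sketched as:

```python
from collections import Counter

def pos_statistics(tagged_comments):
    """Count single-word POS tags and two-word POS-bigram structures
    over a list of POS-tagged comments, each a list of (word, tag) pairs."""
    word_pos, phrase_pos = Counter(), Counter()
    for tagged in tagged_comments:
        tags = [t for _, t in tagged]
        word_pos.update(tags)
        phrase_pos.update(zip(tags, tags[1:]))
    return word_pos, phrase_pos
```

The `most_common` view of both counters yields the top-10 lists compared in Tables 11 and 12.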
As shown in Table 8, the dictionary of inline comments is 52.3% larger than that of method comments. However, from Table 12, we find that the differences in two-word phrases are not obvious. Therefore, we analyze the sentence diversity of inline comments and conduct a clustering experiment based on sentence similarity to reduce the diversity of inline comments.
Summary: There exist many differences in writing style between method comments and inline comments. Compared with method comments, inline comments have a more diverse dictionary that includes more tokens; this can guide us to adjust the size of the inline comment dictionary when designing an inline comment generation model. Also, method comments use more tokens of certain types, such as variables, APIs, basic data types, and reference data types; this can guide us to utilize these kinds of tokens when designing a method comment generation model. Besides, the POS distributions of method comments and inline comments are similar, so we can provide a POS mapping table to capture comment generation rules when designing a comment generation model. To corroborate these findings, we also conduct a questionnaire survey to investigate the habits of real developers when writing method and inline comments, as shown in Section 5. The results also show that developers have different writing styles when writing method comments and inline comments.
RQ3. Can method comment generation models be applied to generating inline comments, and why?
From RQ1 and RQ2, we know the characteristics of method comments and inline comments, and we have explored the differences between them. For example, method comments include more templates and have a more concentrated dictionary. These characteristics may influence neural machine translation (NMT) models. We want to analyze whether the differences between method and inline comments influence comment generation models.
We use three existing NMT models for the comparison experiments: Seq2Seq, DeepCom, and Code2Seq. The experimental results are shown in Table 13 (the second and fourth columns). The BLEU-4 score for generating method comments is better than that for generating inline comments for all models; the models perform about 10% better on method comments than on inline comments. Seq2Seq has the best performance in inline comment generation. Besides, Code2Seq performs reasonably well in method comment generation but poorly on inline comments (17–18 points lower than the other two models). This may be due to the feature extraction approach of Code2Seq. Its main idea is to select several AST paths from the AST, but for an inline AST, this approach may cause many LCA (lowest common ancestor) nodes to be selected multiple times. These LCA nodes are only introduced to complete the structure of the inline AST; they have nothing to do with the inline comments and carry no semantic information for generating them. Compared with other traversal approaches, this approach may therefore introduce more noise, which explains the poor result.
Because the same models are used for method and inline comment generation, it seems that the features of the comments affect the performance of the generation models. We further explore the possible reasons.
Template comment. Note that we have detected many template comments in the method comment dataset. These comments are highly consistent and may affect the measured effectiveness of the models. As described in RQ1, when multiple identical comments exist in both the training set and the test set, models perform better on the repeated comments: the training set and test set overlap, which always makes the models appear better than their real performance. To eliminate the influence of template comments, we reconstruct a method comment dataset of the same size; when randomly selecting samples, we avoid samples detected as template comments by the AEL method. We then test the performance of the models on the new method comment dataset. The experimental results are shown in Table 13 (the third column). Comparing the results, the performance of all models degrades by almost 4% after removing the template comment samples, but remains better than on inline comment generation. This shows that the existence of template comments does make the models appear to perform better.
OOV issues. The Out-Of-Vocabulary (OOV) issue refers to out-of-vocabulary or unregistered words appearing in a sample sequence; it is a common issue in NLP models. To mitigate this problem, we use CamelCase splitting and snake_case splitting to reduce token granularity for the three models. Because the input of the models is a code-related sequence, we also count the distribution of code tokens (as in Table 8), which is shown in Table 14. The vocabulary sizes of method code and inline code are similar (212,578 and 214,696). To analyze the influence of OOV tokens, we change the vocabulary size of the method comment generation models and observe how it affects performance, as shown in Table 15. When the vocabulary size decreases from 50,000 tokens to 30,000 tokens, the performance of the comment generation models decreases by 3.44%–8.76%. The result shows that fewer OOV tokens, that is, a larger vocabulary, leads to better model performance. Therefore, mitigating the OOV issue is an effective way to improve the performance of comment generation models.
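The effect of vocabulary size on OOV tokens can be illustrated with a minimal sketch of standard NMT vocabulary truncation (not the exact preprocessing used by these models):

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(token_sequences, size):
    """Keep only the `size` most frequent tokens; the rest become OOV."""
    counts = Counter(t for seq in token_sequences for t in seq)
    return {tok for tok, _ in counts.most_common(size)}

def apply_vocab(seq, vocab):
    """Replace OOV tokens with <unk>, as NMT preprocessing typically does."""
    return [t if t in vocab else UNK for t in seq]
```

Shrinking `size` from 50,000 to 30,000 turns more input tokens into `<unk>`, which is the mechanism behind the performance drop observed in Table 15.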
Tokens in comments. As shown above, even after removing the template comments, the quality of the method comments generated by the models is still better than that of the inline comments. It seems that other factors make it difficult for the models to generate inline comments. In RQ2, when studying which kind of comment is more likely to use tokens from the code, we found that the statistical results for method comments differ considerably from those for inline comments. This may be one of the reasons why the quality of the generated method comments is better.
We continue to use the test set without template comments to study the “tokens” factor. We divide the test set into two parts based on whether tokens in the code are mentioned in the original comments, and compare the quality of the comments generated for the two parts. We can observe from Table 16 that whether APIs and reference data types are mentioned in the comments has an impact on the quality of the generated method comments. This finding differs from a previous study [27]. For the other cases, we are surprised to find that whether or not code tokens appear in the inline comments, the quality of the generated comments is similar. In particular, when variable and reference data type tokens appear in the inline comments, the generation performance improves slightly.
We further analyze this phenomenon. According to our assumption, if a token appears in both the code and the comment, it should be easier for models to generate this token when using the code as a feature. However, the experimental result is quite the contrary. Two hypotheses could explain this result. On the one hand, the popular models we use are encoder-decoder architectures. If a token appears in both the code and the comment, it has two different vectors, one in the encoder's vector space and one in the decoder's vector space; unless some restriction aligns the two vectors, the token behaves like two different tokens in the code and the comment. On the other hand, generating comments is a very complex process: even if tokens appearing in both the code and the comment could improve generation quality locally, this does not guarantee the quality of the whole generated sentence.
We also notice that the models are more effective at generating method comments that do not mention APIs or reference data types, so we conduct further research. The results are shown in Table 17. We find that comments that do not mention APIs or reference data types have a shorter average length and fewer AST nodes. It seems that these samples are relatively uncomplicated, and the models can learn and memorize them more easily.
Word usage. In RQ2, we also learned that the two types of comments differ in word diversity: the vocabulary used in method comments is more concentrated, while the vocabulary used in inline comments is more diverse. Specifically, as shown in Table 8, the dictionary of method comments is 34.35% smaller than that of inline comments. Besides, only 1,128 words cover 90% of the method comment tokens, while 1,431 words are needed to cover 90% of the inline comment tokens. This is probably one of the most important reasons why generated method comments have higher quality than generated inline comments. On the one hand, with a more diverse vocabulary (i.e., the inline comment dictionary), the model has more choices of words at every generation step; with a relatively concentrated vocabulary (i.e., the method comment dictionary), even a randomly picked word has a higher probability of being correct, so generated sentences tend to have higher quality. On the other hand, models learn patterns from the dataset, and patterns are relatively easier to find in a concentrated vocabulary. To demonstrate the influence of word diversity, we try to reduce the diversity of inline comments. Specifically, we conduct a clustering experiment on the inline comment dataset according to word usage similarity, as shown in Table 18. We classify the dataset into 20 clusters, extract the cluster with the largest amount of data, and evaluate the Seq2Seq model on it. The BLEU-4 result is 31.93, which is 6.61% better than the performance on the whole dataset (31.93 versus 29.95).
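A clustering by word-usage similarity can be sketched as follows. This greedy, threshold-based variant over bag-of-words cosine similarity is a deliberate simplification and not necessarily the clustering method used in the experiment:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def greedy_cluster(comments, threshold=0.5):
    """Assign each comment to the first cluster whose representative
    (bag of words of its first member) is similar enough; otherwise
    open a new cluster. Returns the clusters as lists of comments."""
    clusters = []  # each entry: (representative Counter, member list)
    for c in comments:
        bag = Counter(c.split())
        for rep, members in clusters:
            if cosine(rep, bag) >= threshold:
                members.append(c)
                break
        else:
            clusters.append((bag, [c]))
    return [members for _, members in clusters]
```

Comments that share most of their words end up in one cluster, so training on the largest cluster corresponds to the reduced-diversity setting evaluated above.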
Summary: We find that the existence of template comments is one of the main reasons why the quality of generated method comments is better than that of generated inline comments. After eliminating this factor, the method comments generated by the models are still better than the generated inline comments. We also find that the distribution of the vocabulary is another main reason, similar to the general NMT task: for a fixed vocabulary size, a larger word set means more unknown words, which decreases the accuracy of the translation (i.e., comment generation) [60]. We are surprised to find that there is no correlation between the quality of generated comments and whether the same tokens appear in both the code and the comment.
From the findings of the three RQs, we see many differences in the characteristics of method comments and inline comments. These findings motivate us to propose more specialized approaches for method or inline comment generation to improve performance. For instance, there are more template comments among method comments, so we should remove template comments when training a method comment generation model to reduce their influence. In addition, the dictionary of inline comments is more diverse than that of method comments, so it is reasonable to adjust the dictionary size: we can increase the inline comment dictionary size or reduce the method comment dictionary size to improve comment generation. Besides, from Table 2, we find that method code is longer than inline code, and therefore carries more complete semantic and syntactic information. We can consider adding the context of inline code to complete its semantic and syntactic information, but directly adding all the context would make the code sequence so long that it includes redundant information. From Table 10, we find that when the context of inline code is added, there is more information about variables, APIs, basic data types, and reference data types, and from Table 16, we find that comments that mention variables and reference data types have better generation performance. Therefore, it is possible to improve the performance of an inline comment generation model by extracting variable and reference data type tokens from the context as one of the model inputs. These findings can also guide the tuning of the hyperparameters of NLP models. As shown in Tables 1 and 17, the code length, comment length, and number of AST nodes can guide us in choosing the maximum input size of a comment generation model, and the dictionary size can guide the design of the model's embedding size.