1 Introduction
Software is eating the world [15]. With the advancement of Artificial Intelligence (AI), it is time to expand that maxim: software ate the world, and AI is now eating software. Since software is primarily composed of code, we define the emerging concept of code intelligence as the application of AI techniques to extract knowledge from large-scale code repositories, with the aim of developing intelligent tools that improve the quality and productivity of computer programming [163]. This concept is fueled by the ever-expanding reservoir of source code, often referred to as “Big Code” [7], which is harvested from platforms such as GitHub [1] and StackOverflow [2]. In this paper, our research scope is confined to code intelligence, with a particular focus on the application of deep learning techniques.
Achieving code intelligence necessitates collaborative research across the domains of software engineering, machine learning, Natural Language Processing (NLP), programming languages, and security. From our investigation, precise and reliable code representation learning (or code embedding), which aims to efficiently and effectively encode the semantics of source code into distributed vector representations, is the foundation of code intelligence. Such embedding vectors are then used in various downstream tasks, such as code completion [121, 153, 203, 229], code search [83, 111, 240], code summarization [11, 108, 112, 243, 293], type inference [8, 104, 193, 260], and program synthesis [17, 181, 183, 204]. In terms of code representation learning, significant progress has been made by utilizing deep learning and NLP techniques to encode code.
Analogous to word2vec [170] in NLP, Alon et al. [14] proposed code2vec, a distributed representation of code based on a collection of paths extracted from the Abstract Syntax Tree (AST) of code. Recently, a multitude of neural networks tailored for specific tasks have been proposed and trained using supervised methods. As pre-trained language models (e.g., BERT [64] and GPT-3 [28]) have been widely applied to NLP, many pre-trained language models for code have been proposed [72, 89, 119] to better represent the semantics of code. More recently, the emergence of Large Language Models (LLMs), exemplified by ChatGPT, has illuminated the pathway for further advancement of pre-trained language models, with a notable trend of increasing model sizes. This trend has extended to the domain of code intelligence, resulting in the development of various LLMs tailored for code, including but not limited to CodeT5 [253], StarCoder [134], and Code Llama [204]. In this paper, we examine code intelligence through the lenses of code representation learning, deep learning methods, and their applications.
Related Surveys and Differences. Within our literature review, we identified several surveys related to ours. Notably, Allamanis et al. [7] conducted an exhaustive examination of machine learning approaches for modeling the naturalness of programming languages. They primarily emphasize machine learning algorithms, with a specific focus on probabilistic models, as opposed to those based on deep learning. Recently, Watson et al. [256], Wang et al. [248], and Yang et al. [276] conducted thorough reviews of the literature on applications of deep learning in software engineering research. They investigated mostly software engineering and AI conferences and journals, focusing on various software engineering tasks (not limited to code) that are based on deep learning. The report in [63] summarizes the current status of research at the intersection of deep learning and software engineering, and suggests several future directions. In [163], the authors introduced CodeXGLUE, a benchmark dataset for code representation and generation. They also presented benchmark results, notably leveraging pre-trained language models like CodeBERT.
Table 1 summarizes the differences between our paper and several related surveys in code intelligence. In contrast to [7], which focuses on traditional machine learning approaches, this paper places greater emphasis on leveraging deep learning techniques for code intelligence. In contrast to [256], [248], [276], and [63], which cover various tasks in broad software engineering, our study narrows its focus to tasks associated with source code, examining them specifically from the perspective of deep learning. In addition, we survey papers from various fields including software engineering, programming languages, machine learning, NLP, and security. Furthermore, existing surveys neither provide comprehensive benchmark evaluation results nor develop an open-source toolkit to facilitate further research. This paper addresses this gap by presenting an open-source toolkit, referred to as NaturalCC (which stands for Natural Code Comprehension) [239]. The toolkit is designed to streamline the prototyping of code intelligence models and to serve as a benchmarking platform for evaluating various state-of-the-art models. In contrast to CodeXGLUE [163], our focus lies in building infrastructure that supports diverse model implementations and enables users to conduct rapid prototyping. Compared to CodeXGLUE, our toolkit contains a more extensive array of tools covering the entire pipeline of constructing code intelligence models, offering greater flexibility.
Our Contributions. This paper is targeted at researchers and practitioners intrigued by the convergence of code intelligence and deep learning, with a specific emphasis on intelligent software engineering, NLP, and programming languages. We begin by providing a thorough review of existing research on deep learning for code intelligence. We then develop an open-source toolkit, referred to as NaturalCC, that incorporates state-of-the-art models across various downstream tasks. Employing NaturalCC, we conduct a comprehensive performance benchmark of each model across five downstream tasks: code summarization, code search, code completion, program synthesis, and type inference. The major contributions of this paper are summarized as follows.
– We conduct a comprehensive review of deep learning for code intelligence. Specifically, we have collected 276 papers from various top-tier venues and arXiv, covering multiple domains including software engineering, artificial intelligence, NLP, programming languages, and security.
– We benchmark the performance of 18 leading models across five different tasks (i.e., code summarization, code search, code completion, program synthesis, and type inference). All the resources, datasets, and source code are publicly available.
– We introduce NaturalCC, an open-source toolkit featuring integrated state-of-the-art baselines across various tasks, aimed at streamlining research in code intelligence. Researchers in software engineering, NLP, and related domains can leverage this toolkit for rapid prototyping.
5 Toolkit and Demonstration
This section introduces the design of NaturalCC and its user interface. Figure 4 (left) shows the code structure of NaturalCC. The dataset folder contains data preprocessing code. The ncc folder is the core module. The third_party folder holds model evaluation packages. The gui folder contains graphical user interface files and assets. As shown in Figure 4 (right), NaturalCC is composed of four components, i.e., data preprocessing, code representation, downstream tasks, and their corresponding evaluations. At the stage of data preprocessing, we process the source code with a series of steps, including word tokenization, vocabulary building, and feature extraction. Additionally, a data loader is used to iteratively yield batches of code samples with their features. The resulting batches are then fed into the code representation models, which facilitate a variety of downstream tasks, including code summarization, code search, code completion, and type inference. To evaluate the performance of each task, we also implement several corresponding metrics that have been widely adopted previously.
5.1 Data Preprocessing Module
In NaturalCC, we have collected and processed four datasets: CodeSearchNet [111], Python-Doc [243], Py150 [202], and DeepTyper [104]. First, we tokenize the input source code, and then build a vocabulary to map the code tokens into indexes. Currently, we support two types of tokenization: a space tokenizer and a BPE tokenizer [120]. Along with code tokens, we also extract different features of code, such as ASTs, IRs, CFGs, and DFGs. All the related scripts for data preprocessing are located in the data and dataset folders.
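To make the preprocessing step concrete, below is a minimal sketch of space tokenization, vocabulary building, and token-to-index encoding. The function names and the min_freq cutoff are illustrative assumptions, not NaturalCC's actual API.

# Illustrative sketch of NaturalCC-style preprocessing (not the actual API).
from collections import Counter

def space_tokenize(code: str) -> list[str]:
    # Naive whitespace tokenization; NaturalCC also supports BPE.
    return code.split()

def build_vocab(corpus: list[str], min_freq: int = 2) -> dict[str, int]:
    # Reserve indexes for special symbols, then map frequent tokens to ids.
    counts = Counter(tok for code in corpus for tok in space_tokenize(code))
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode(code: str, vocab: dict[str, int]) -> list[int]:
    # Out-of-vocabulary tokens fall back to <unk>.
    return [vocab.get(tok, vocab["<unk>"]) for tok in space_tokenize(code)]

corpus = ["def add ( a , b ) : return a + b", "def sub ( a , b ) : return a - b"]
vocab = build_vocab(corpus)
print(encode("def mul ( a , b ) : return a * b", vocab))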
5.2 Code Representation Module
As the core component of NaturalCC, we have implemented several encoders that are widely used in state-of-the-art approaches for source code representation, including RNN, GNN, and Transformer encoders. For example, we have implemented LSTM, TreeLSTM, and Transformer networks for sequential tokens and (linearized) ASTs. We have also implemented a GNN, i.e., GGNN, to represent the control-flow graph of source code. It is worth mentioning that NaturalCC also incorporates pre-training approaches for source code: we have implemented several state-of-the-art pre-trained code models, including CodeBERT [72], PLBART [3], and GPT-2 [163]. The models and modules folders contain all the implemented networks for code representation.
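To illustrate the sequential encoders above, the following is a minimal PyTorch sketch of an LSTM-based code encoder. It mirrors the spirit of the LSTM encoder in NaturalCC but is not the toolkit's actual implementation; all hyperparameters are placeholders.

# Illustrative LSTM code encoder (not NaturalCC's actual implementation).
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer ids -> (batch, hidden_dim) code vectors
        states, _ = self.lstm(self.embed(tokens))
        return states[:, -1, :]  # last hidden state as the code embedding

encoder = LSTMEncoder(vocab_size=10_000)
code_vec = encoder(torch.randint(1, 10_000, (4, 32)))  # 4 snippets, 32 tokens each
print(code_vec.shape)  # torch.Size([4, 512])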
5.3 Tool Implementation
NaturalCC is mainly implemented in PyTorch and builds upon other successful open-source NLP toolkits, such as Fairseq and AllenNLP.
Registry Mechanism. To remain flexible, NaturalCC is designed to be easily extended to different tasks and model implementations with minimal modification. Similar to Fairseq, we design a registry decorator for instantiating new tasks or models, implemented in the corresponding __init__.py of each folder. The registry mechanism creates a global variable that stores all the available tasks, models, and objects at the initialization stage, so that users can easily access them throughout the whole project.
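The sketch below shows the general shape of such a registry decorator; the identifiers are hypothetical and do not correspond to NaturalCC's actual names.

# Hypothetical registry decorator in the style described above.
MODEL_REGISTRY: dict = {}

def register_model(name: str):
    # Decorator that records a model class in a global table at import time.
    def wrapper(cls):
        if name in MODEL_REGISTRY:
            raise ValueError(f"model '{name}' is already registered")
        MODEL_REGISTRY[name] = cls
        return cls
    return wrapper

@register_model("transformer_summarization")
class TransformerSummarizer:
    ...

# Anywhere in the project, models can then be looked up by name.
model_cls = MODEL_REGISTRY["transformer_summarization"]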
Efficient Training. NaturalCC supports efficient distributed training through torch.distributed and can utilize multiple GPUs across different servers. Furthermore, NaturalCC supports mixed-precision computation to further increase training speed, covering both FP32 and FP16 training. Typically, the gradients are computed in FP16 while the parameters are stored in FP32.
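A minimal sketch of such a mixed-precision loop using PyTorch's torch.cuda.amp is shown below; the model and data are stand-ins, and a real run requires a CUDA device.

# Illustrative mixed-precision loop; stand-in model and dummy data.
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()  # scales the loss so FP16 gradients do not underflow

loader = [torch.randn(8, 512, device="cuda") for _ in range(10)]  # dummy batches
for batch in loader:
    optimizer.zero_grad()
    with autocast():                 # forward pass runs in FP16 where safe
        loss = model(batch).pow(2).mean()
    scaler.scale(loss).backward()    # backward on the scaled loss
    scaler.step(optimizer)           # unscales, then updates FP32 master weights
    scaler.update()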
Flexible Configuration. Instead of employing argparse to manage command-line options as Fairseq does, we adopt an individual YAML configuration file for each model. We contend that the flexibility offered by editing these YAML configuration files is better suited for model exploration.
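For illustration, a configuration in this style can be loaded with PyYAML as follows; the keys and values shown are hypothetical, not NaturalCC's actual schema.

# Hypothetical configuration; keys/values are not NaturalCC's actual schema.
import yaml  # PyYAML

CONFIG = """
model: transformer_summarization
optimizer:
  lr: 5.0e-4
  betas: [0.9, 0.98]
training:
  max_epochs: 50
  fp16: true
"""

# Each model ships its own yaml file; tweaking it requires no code changes.
cfg = yaml.safe_load(CONFIG)
print(cfg["optimizer"]["lr"], cfg["training"]["fp16"])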
5.4 Graphical User Interface
We also develop a web-based graphical user interface to facilitate users in exploring the outcomes of trained models. The design is based on the open-source demonstration of AllenNLP [78]. Figure 5(a) displays a screenshot of our demonstration system, which currently features three code intelligence tasks: code summarization, code search, and code completion. We leave the integration of other code intelligence tasks to future work.
5.5 Leaderboard
We release a leaderboard so that researchers can report the results of their own models and compete with others, as shown in Figure 5(b). Currently, we only support researchers and developers who use NaturalCC to implement their approaches and submit the experimental results via pull requests on GitHub. In future work, we will build a web-based service that allows users to upload their predicted results and automatically evaluates model performance against the ground-truth labels.
6 Challenges and Opportunities
Although much effort has been devoted to deep learning for code intelligence, this area of research is still in its infancy, with many open challenges and opportunities. To inspire future research, this section suggests several potential directions worth pursuing.
Comprehensive Code Comprehension. Designing a representation approach to effectively and efficiently preserve the semantics of programs has always been a fundamental problem in code intelligence. Despite much effort on code representation, as mentioned in this paper, there are still three main obstacles to be overcome.
(a) Open Vocabulary. Building a vocabulary to index the textual tokens of code is the first step toward applying deep learning models to code intelligence. Because developers can name identifiers freely, the vocabulary of code is much more open and complicated than that of natural languages. The vocabulary of programming languages often consists of keywords, identifiers, customized method names, and variable names. This large vocabulary contains much “noise”, making it difficult to comprehend the code. Although many attempts [53, 60, 120] have been made toward mitigating the OOV issue, it remains a challenge to design a simple yet effective approach to map source code into indexes while preserving its semantics.
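One common mitigation is to split identifiers into subwords before indexing, as sketched below; this simple camelCase/snake_case splitter is illustrative and is not the technique of any single cited work.

# Illustrative identifier splitter, not from any single cited approach.
import re

def split_identifier(name: str) -> list[str]:
    # 'parseHTTPResponse_fromSocket' -> ['parse', 'http', 'response', 'from', 'socket']
    parts = name.replace("_", " ")
    parts = re.sub(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", parts)
    return parts.lower().split()

print(split_identifier("parseHTTPResponse_fromSocket"))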
(b) Complex Structure of Programs. Unlike natural language, code is written under a strict grammar. The computations described by code can be executed in an order that differs from the order in which the code was written, as is often the case with loops, recursion, and pointer manipulation. Although there have been many attempts to capture the structure of code from different modalities, as surveyed in this paper, we believe that the structures of code are not yet sufficiently preserved, and more effort is needed here. Inspired by GNNs, there is potential to design specialized GNNs to better represent the structure of programs. For example, from our analysis, ASTs, CFGs, DFGs, and CPGs are all highly heterogeneous; it is desirable to design heterogeneous-information-network-based approaches [223] to represent such heterogeneous code graphs.
(c) How to Feed Program Structures into LLMs? Current LLM architectures predominantly rely on pre-training with sequential data and lack support for structural inputs. However, source code exhibits rich structural features, including CFGs, DFGs, and PDGs. Consequently, integrating these structural features into LLMs has emerged as a prominent research endeavor. Recent studies [6, 244] have explored incorporating such features into LLMs through prompt design.
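As a toy illustration of the prompt-based route, structural facts (here, data-flow edges) can be serialized into text alongside the code; the format below is an assumption for illustration, not the scheme used in [6] or [244].

# Assumed prompt format: serialize data-flow edges as text so a
# sequence-only LLM can see program structure.
code = "x = read()\ny = x + 1\nprint(y)"
dfg_edges = [("x", "line 1", "line 2"), ("y", "line 2", "line 3")]  # var, def, use

structure = "\n".join(f"- value of '{v}' defined at {d} flows to {u}"
                      for v, d, u in dfg_edges)
prompt = (f"Code:\n{code}\n\nData-flow facts:\n{structure}\n\n"
          "Question: what does this program print if read() returns 41?")
print(prompt)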
Data Hungry and Data Quality. Despite much progress in deep-learning-based approaches for code intelligence, we argue that existing approaches still suffer from the data-hungry issue: the effectiveness of cutting-edge techniques depends heavily on the availability of vast quantities of expensive, labor-intensive, well-labeled training data. Training a model on a small dataset will yield far less precise results, especially for new programming languages or languages with an inadequate number of labeled samples. Therefore, it is important to design approaches that reduce the reliance on large quantities of labeled data. A similar problem exists in the field of machine learning at large. One promising solution to this dilemma is transfer learning, which has achieved great success in alleviating the data-hungry issue in computer vision and NLP. Similarly, to model an emerging programming language with limited data, it is desirable to mitigate the data-hungry issue by leveraging models trained on programming languages with sufficient labeled training data [42, 45, 57]. Data quality is also a crucial issue for code intelligence, and one that may exacerbate the data-hungry problem. From our analysis, datasets collected from online resources, such as GitHub and StackOverflow, are not quality-assured. Sun et al. [224] and Shi et al. [215] investigated the importance of data quality and verified it on the tasks of code search and code summarization, respectively. In the era of LLMs, the significance of data quality has surged. For instance, Gunasekar et al. [87] demonstrated the efficacy of training an LLM with a relatively modest parameter count of 1.3 billion on textbook data. Their work underscores the growing importance of meticulously selecting training data and harnessing synthetic data, a trend expected to intensify in the foreseeable future.
Multi-Lingual and Cross-Language. A codebase written in multiple programming languages can be considered a multi-lingual corpus, as in NLP. However, the multi-lingual problem in programming languages has not been well investigated. Different from the multi-lingual problems studied in NLP, corpora of multiple programming languages bring additional opportunities and challenges for future research. Recently, several attempts have been made to learn common knowledge shared among multiple programming languages and to transfer that knowledge across languages. For example, Zhang et al. [291] proposed obtaining better interpretability and generalizability by disentangling the semantics of source code from multiple programming languages based on variational autoencoders. Zügner et al. [309] introduced a language-agnostic code representation based on features extracted directly from the AST. Ahmed and Devanbu [5] conducted an exploratory study and revealed evidence that a multilingual property indeed exists in source code corpora; for example, programs that solve the same problem in different languages are likely to use the same or similar identifier names. They also investigated the effect of multilingual (pre-)training on code summarization and code search. Nafi et al. [175] proposed CLCDSA, a cross-language clone detector based on syntactical features and API documentation. Bui et al. [30] proposed a bilateral neural network for the task of cross-language algorithm classification. Bui et al. [31] proposed SAR, which can learn cross-language API mappings with minimal knowledge. Recently, Chai et al. [42] proposed CDCS, a novel approach for domain-specific code search through transfer learning across programming languages. Gui et al. [86] proposed an approach that matches source code and binary code across different languages based on intermediate representations.
Model Interpretability. Lack of interpretability is a common challenge for most deep-learning-based techniques for code intelligence, as deep learning is a black-box method. New methods and studies for interpreting the working mechanisms of deep neural networks are a promising research direction. Recently, several efforts have been made toward increasing the interpretability of deep-learning-based models. As an example, Li et al. [140] presented a novel approach to explaining the predictions of GNN-based vulnerability detection by extracting sub-graphs of the program dependency graph. In addition, Zou et al. [308] proposed interpreting a deep-learning-based vulnerability detection model by identifying a small number of tokens that contribute significantly to the detector's final prediction. Zhang et al. [295] proposed interpretable program synthesis that allows users to observe the synthesis process and exercise control over the synthesizer. Pornprasit et al. [192] proposed PyExplainer, a local rule-based model-agnostic approach for explaining the predictions of just-in-time defect models. Rabin et al. [196] proposed a model-agnostic explainer based on program simplification, inspired by delta debugging algorithms. Wan et al. [242], López et al. [161], and Sharma et al. [212] investigated the explainability of pre-trained code models by probing their attention and hidden representations. We believe that enhancing the interpretability of current deep-learning-based approaches is essential for code intelligence.
Effective and Efficient Use of Models. Recently, substantial advancements have been observed in the capabilities of LLMs for code intelligence. However, how best to utilize LLMs to harness their full potential poses a pressing challenge. To this end, several efforts have been undertaken along the following three directions.
(a) Prompt Engineering. One prominent avenue of research revolves around crafting prompts to improve interactions with LLMs at inference time. Numerous studies have delved into designing prompts customized for tasks such as code generation [150], code summarization [79], program repair [270], and vulnerability detection [68].
(b) Parameter-Efficient Tuning. As the size of LLMs increases, the expense of fully fine-tuning them escalates. Consequently, prompt tuning has emerged as a way to mitigate this cost. In [245], the authors empirically explored prompt tuning techniques on code intelligence tasks.
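A minimal sketch of prompt tuning is given below: the pre-trained model is frozen and only a few “soft prompt” vectors, prepended to the input embeddings, are trained. This follows the general recipe rather than the exact setup of [245].

# Sketch of prompt tuning; not the exact setup of [245].
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_tokens: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the learned prompt vectors to each sequence in the batch.
        batch = input_embeds.size(0)
        return torch.cat([self.prompt.expand(batch, -1, -1), input_embeds], dim=1)

soft_prompt = SoftPrompt(n_tokens=20, embed_dim=768)
# for p in pretrained_model.parameters(): p.requires_grad = False  # freeze the LLM
x = torch.randn(4, 128, 768)   # stand-in for embeddings from a frozen code LLM
print(soft_prompt(x).shape)    # torch.Size([4, 148, 768])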
(c) Model Compression. As pre-trained code models continue to grow in size, the computational expense of pre-training on large-scale code corpora remains significant, resulting in high costs for both training and inference. Zhang et al. [297] and Shi et al. [214] proposed improving the efficiency of the training process through model compression. Reducing the computational footprint of pre-trained code models is a promising research direction.
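One standard compression recipe is knowledge distillation, sketched below: a small student is trained to match the softened output distribution of a large teacher. This illustrates the generic objective; [297] and [214] differ in their details.

# Generic distillation objective; [297] and [214] differ in detail.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft target: KL between temperature-softened student and teacher outputs.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    # Hard target: the usual cross-entropy on ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10),
                         torch.randint(0, 10, (8,)))
print(loss.item())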
Robustness and Security. Despite significant progress in training accurate models for code intelligence, the robustness and security of these models have rarely been explored. As seen in the fields of NLP and CV, deep neural networks are frequently not robust [40]. Specifically, current deep learning models can be easily deceived by adversarial examples, which are created by making small, seemingly benign changes to the model's inputs. Many techniques for producing adversarial samples have been developed in the computer vision and NLP communities, particularly for image classification [37, 40, 71] and sentiment classification [296]. Source code models are similarly vulnerable to adversarial attacks. Recently, there have been several efforts to investigate the robustness and security of deep-learning-based models for code intelligence. For example, Ramakrishnan et al. [201] and Yefet et al. [282] investigated how to improve the robustness of source code models through adversarial training. Nguyen et al. [178] empirically investigated the use of adversarial learning techniques for API recommendation. Bielik and Vechev [25] introduced a method that combines adversarial training and representation refinement to create accurate and robust models of source code. Zhou et al. [303], Yang et al. [278], and Zhang et al. [290] proposed black-box attacks on neural code models that generate adversarial examples while preserving the semantics of the source code. Based on semantics-preserving code transformations, Quiring et al. [195] and Liu et al. [157] developed novel attacks against authorship attribution of source code. Ramakrishnan and Albarghouthi [200] investigated the possibility of injecting common backdoors into deep-learning-based models and developed a protection approach based on spectral signatures. Schuster et al. [209] and Wan et al. [241] proposed attacking neural code models through data poisoning, and verified the attacks on code completion and code search, respectively. Severi et al. [210] proposed an explanation-guided backdoor approach to attack malware classifiers. Overall, exploring the robustness and security of code intelligence models is an interesting and important research direction.
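To make the attack surface concrete, below is one semantics-preserving transformation that such black-box attacks search over: renaming an identifier never changes program behavior, yet can flip a model's prediction. The sketch is illustrative and does not reproduce any cited attack.

# Illustrative transformation; not a reproduction of any cited attack.
import re

def rename_variable(code: str, old: str, new: str) -> str:
    # Whole-word substitution keeps the program's semantics intact.
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

code = "def mean(values):\n    total = sum(values)\n    return total / len(values)"
adversarial = rename_variable(code, "values", "x7q")
# A robust model should behave identically on `code` and `adversarial`.
print(adversarial)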
Privacy. Despite the significant progress in code intelligence achieved through deep learning techniques, particularly LLMs, these models have been criticized for their strong propensity to memorize training data, raising privacy concerns about the inadvertent exposure of sensitive information. One line of work studies membership inference attacks, which aim to infer whether a data sample was used in training [217, 283]. A dominant approach here is shadow model training [217], which trains a binary attack classifier by creating multiple shadow models that mimic the behavior of the target model. Another line of work studies training data extraction attacks [38, 39, 197]. In [39], the authors empirically revealed that LLMs memorize privacy-sensitive information present in their training data, including personally identifiable information, URLs, code snippets, and UUIDs.
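The shadow-model idea can be sketched as follows, simplified to a single shadow model and a threshold attack on top-1 confidence; after real training, member confidences skew higher than non-member ones, which is the gap the attack exploits. All names and the threshold rule are illustrative assumptions, not the protocol of [217].

# Simplified shadow-model sketch; real attacks train many shadow models
# and a learned attack classifier rather than a single threshold.
import torch
import torch.nn as nn

shadow = nn.Linear(16, 4)            # stand-in for a trained shadow model
members = torch.randn(64, 16)        # samples the shadow model was trained on
non_members = torch.randn(64, 16)    # held-out samples

def top1_confidence(model, x):
    with torch.no_grad():
        return torch.softmax(model(x), dim=-1).max(dim=-1).values

# Predict "member" when confidence exceeds the average non-member confidence.
threshold = top1_confidence(shadow, non_members).mean()
predicted_member = top1_confidence(shadow, members) > threshold
print(predicted_member.float().mean())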