1 Introduction
Software is eating the world [15]. With the advancement of Artificial Intelligence (AI), it is time to expand that maxim: software ate the world, and AI is now eating software. Since software is primarily composed of code, we define the emerging concept of code intelligence as the application of AI techniques to extract knowledge from large-scale code repositories, with the aim of developing intelligent tools that improve the quality and productivity of computer programming [163]. This concept is fueled by the ever-expanding reservoir of source code, often referred to as “Big Code” [7], which is harvested from platforms such as GitHub [1] and StackOverflow [2]. In this paper, our research scope is confined to code intelligence, with a particular focus on the application of deep learning techniques.
Achieving code intelligence necessitates collaborative research across the domains of software engineering, machine learning, Natural Language Processing (NLP), programming languages, and security. From our investigation, precise and reliable code representation learning (or code embedding), which aims to efficiently and effectively encode the semantics of source code into distributed vector representations, is the foundation of code intelligence. Such embedding vectors are then used in various downstream tasks, such as code completion [121, 153, 203, 229], code search [83, 111, 240], code summarization [11, 108, 112, 243, 293], type inference [8, 104, 193, 260], and program synthesis [17, 181, 183, 204]. In terms of code representation learning, significant progress has been made by utilizing deep learning and NLP techniques to encode code.
Analogous to word2vec [170] in NLP, Alon et al. [14] proposed code2vec, a distributed representation of code based on a collection of paths extracted from the Abstract Syntax Tree (AST) of code. Recently, a multitude of neural networks tailored for specific tasks have been proposed and trained using supervised methods. As pre-trained language models (e.g., BERT [64] and GPT-3 [28]) have been widely applied to NLP, many pre-trained language models for code have been proposed [72, 89, 119] to better represent the semantics of code. More recently, the emergence of Large Language Models (LLMs), exemplified by ChatGPT, has illuminated the pathway for further advancement of pre-trained language models, with a notable trend of increasing model sizes. This trend has extended to the domain of code intelligence, resulting in the development of various LLMs tailored for code, including but not limited to CodeT5 [253], StarCoder [134], and Code Llama [204]. In this paper, we examine code intelligence through the lenses of code representation learning, deep learning methods, and their applications.
Related Surveys and Differences. Within our literature review, we identified several surveys related to ours. Notably, Allamanis et al. [7] conducted an exhaustive examination of machine learning approaches for modeling the naturalness of programming languages. They primarily emphasize machine learning algorithms, with a specific focus on probabilistic models, as opposed to those based on deep learning. Recently, Watson et al. [256], Wang et al. [248], and Yang et al. [276] conducted thorough reviews of the literature on applications of deep learning in software engineering research. They investigated mostly software engineering and AI conferences and journals, focusing on various software engineering tasks (not limited to code) that are based on deep learning. The report in [63] summarizes the current status of research at the intersection of deep learning and software engineering, and suggests several future directions. In [163], the authors introduced CodeXGLUE, a benchmark dataset for code representation and generation. They also presented benchmark results, notably leveraging pre-trained language models like CodeBERT.
Table 1 summarizes the differences between our paper and several related surveys in code intelligence. In contrast to [7], which focuses on traditional machine learning approaches, this paper places greater emphasis on leveraging deep learning techniques for code intelligence. In contrast to [256], [248], [276], and [63], which cover various tasks in broad software engineering, our study narrows its focus to tasks associated with source code, examining them specifically from the perspective of deep learning. In addition, we survey papers from various fields including software engineering, programming languages, machine learning, NLP, and security. Furthermore, existing surveys neither provide comprehensive benchmark evaluation results nor develop an open-source toolkit to facilitate further research. This paper addresses this gap by presenting an open-source toolkit, referred to as NaturalCC (which stands for Natural Code Comprehension) [239]. The toolkit is designed to streamline the prototyping of code intelligence models and to serve as a benchmarking platform for evaluating various state-of-the-art models. In contrast to CodeXGLUE [163], our focus lies in building infrastructure that supports diverse model implementations and enables users to conduct rapid prototyping. Compared to CodeXGLUE, our toolkit contains a more extensive array of tools covering the entire pipeline of constructing code intelligence models, offering greater flexibility.
Our Contributions. This paper is targeted at researchers and practitioners intrigued by the convergence of code intelligence and deep learning, with a specific emphasis on intelligent software engineering, NLP, and programming languages. We begin by providing a thorough review of existing research on deep learning for code intelligence. We then develop an open-source toolkit, referred to as NaturalCC, that incorporates state-of-the-art models across various downstream tasks. Employing NaturalCC, we conduct a comprehensive performance benchmark of each model across five downstream tasks: code summarization, code search, code completion, program synthesis, and type inference. The major contributions of this paper are summarized as follows.
– We conduct a comprehensive review of deep learning for code intelligence. Specifically, we have collected 276 papers from various top-tier venues and arXiv, covering multiple domains including software engineering, artificial intelligence, NLP, programming languages, and security.
– We benchmark the performance of 18 leading models across five different tasks (i.e., code summarization, code search, code completion, program synthesis, and type inference). All the resources, datasets, and source code are publicly available.
– We introduce NaturalCC, an open-source toolkit featuring integrated state-of-the-art baselines across various tasks, aimed at streamlining research in code intelligence. Researchers in software engineering, NLP, and related domains can leverage this toolkit for rapid prototyping.
5 Toolkit and Demonstration
This section introduces the design of NaturalCC and its user interface. Figure 4 (left) shows the code structure of NaturalCC. The dataset folder contains data preprocessing code. The ncc folder is the core module. The third_party folder holds model evaluation packages. The gui folder contains graphical user interface files and assets. As shown in Figure 4 (right), NaturalCC is composed of four components, i.e., data preprocessing, code representation, downstream tasks, and their corresponding evaluations. At the stage of data preprocessing, we process the source code with a series of steps, including word tokenization, vocabulary building, and feature extraction. Additionally, a data loader is used to iteratively yield batches of code samples with their features. The resulting batches are then fed into the code representation models, which facilitate a variety of downstream tasks, including code summarization, code search, code completion, and type inference. To evaluate the performance of each task, we also implement several corresponding metrics that have been widely adopted previously.
5.1 Data Preprocessing Module
In NaturalCC, we have collected and processed four datasets: CodeSearchNet [111], Python-Doc [243], Py150 [202], and DeepTyper [104]. First, we tokenize the input source code, and then build a vocabulary to map the code tokens into indexes. Currently, we support two types of tokenization: a space tokenizer and a BPE tokenizer [120]. Along with code tokens, we also extract different features of code, such as ASTs, IRs, CFGs, and DFGs. All the related scripts for data preprocessing are located in the data and dataset folders.
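To make the preprocessing step concrete, below is a minimal sketch of space tokenization, vocabulary building, and token-to-index encoding. The function names and the min_freq cutoff are illustrative assumptions, not NaturalCC's actual API.

# Illustrative sketch of NaturalCC-style preprocessing (not the actual API).
from collections import Counter

def space_tokenize(code: str) -> list[str]:
    # Naive whitespace tokenization; NaturalCC also supports BPE.
    return code.split()

def build_vocab(corpus: list[str], min_freq: int = 2) -> dict[str, int]:
    # Reserve indexes for special symbols, then map frequent tokens to ids.
    counts = Counter(tok for code in corpus for tok in space_tokenize(code))
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode(code: str, vocab: dict[str, int]) -> list[int]:
    # Out-of-vocabulary tokens fall back to <unk>.
    return [vocab.get(tok, vocab["<unk>"]) for tok in space_tokenize(code)]

corpus = ["def add ( a , b ) : return a + b", "def sub ( a , b ) : return a - b"]
vocab = build_vocab(corpus)
print(encode("def mul ( a , b ) : return a * b", vocab))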
5.2 Code Representation Module
As the core component of NaturalCC, we have implemented several encoders that are widely used in state-of-the-art approaches for source code representation, including RNN, GNN, and Transformer encoders. For example, we have implemented LSTM, TreeLSTM, and Transformer networks for sequential tokens and (linearized) ASTs. We have also implemented a GNN, i.e., GGNN, to represent the control-flow graph of source code. It is worth mentioning that NaturalCC also incorporates pre-training approaches for source code: we have implemented several state-of-the-art pre-trained code models, including CodeBERT [72], PLBART [3], and GPT-2 [163]. The models and modules folders contain all the implemented networks for code representation.
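To illustrate the sequential encoders above, the following is a minimal PyTorch sketch of an LSTM-based code encoder. It mirrors the spirit of the LSTM encoder in NaturalCC but is not the toolkit's actual implementation; all hyperparameters are placeholders.

# Illustrative LSTM code encoder (not NaturalCC's actual implementation).
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer ids -> (batch, hidden_dim) code vectors
        states, _ = self.lstm(self.embed(tokens))
        return states[:, -1, :]  # last hidden state as the code embedding

encoder = LSTMEncoder(vocab_size=10_000)
code_vec = encoder(torch.randint(1, 10_000, (4, 32)))  # 4 snippets, 32 tokens each
print(code_vec.shape)  # torch.Size([4, 512])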
5.3 Tool Implementation
NaturalCC is mainly implemented in PyTorch and builds upon other successful open-source NLP toolkits, such as Fairseq and AllenNLP.
Registry Mechanism. To remain flexible, NaturalCC is designed to be easily extended to different tasks and model implementations with minimal modification. Similar to Fairseq, we design a registry decorator for instantiating new tasks or models, implemented in the corresponding __init__.py of each folder. The registry mechanism creates a global variable that stores all the available tasks, models, and objects at the initialization stage, so that users can easily access them throughout the whole project.
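The sketch below shows the general shape of such a registry decorator; the identifiers are hypothetical and do not correspond to NaturalCC's actual names.

# Hypothetical registry decorator in the style described above.
MODEL_REGISTRY: dict = {}

def register_model(name: str):
    # Decorator that records a model class in a global table at import time.
    def wrapper(cls):
        if name in MODEL_REGISTRY:
            raise ValueError(f"model '{name}' is already registered")
        MODEL_REGISTRY[name] = cls
        return cls
    return wrapper

@register_model("transformer_summarization")
class TransformerSummarizer:
    ...

# Anywhere in the project, models can then be looked up by name.
model_cls = MODEL_REGISTRY["transformer_summarization"]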
Efficient Training. NaturalCC supports efficient distributed training through torch.distributed and can utilize multiple GPUs across different servers. Furthermore, NaturalCC supports mixed-precision computation to further increase training speed, covering both FP32 and FP16 training. Typically, the gradients are computed in FP16 while the parameters are stored in FP32.
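A minimal sketch of such a mixed-precision loop using PyTorch's torch.cuda.amp is shown below; the model and data are stand-ins, and a real run requires a CUDA device.

# Illustrative mixed-precision loop; stand-in model and dummy data.
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()  # scales the loss so FP16 gradients do not underflow

loader = [torch.randn(8, 512, device="cuda") for _ in range(10)]  # dummy batches
for batch in loader:
    optimizer.zero_grad()
    with autocast():                 # forward pass runs in FP16 where safe
        loss = model(batch).pow(2).mean()
    scaler.scale(loss).backward()    # backward on the scaled loss
    scaler.step(optimizer)           # unscales, then updates FP32 master weights
    scaler.update()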
Flexible Configuration. Instead of employing argparse to manage command-line options as Fairseq does, we adopt an individual YAML configuration file for each model. We contend that the flexibility offered by editing these YAML configuration files is better suited for model exploration.
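For illustration, a configuration in this style can be loaded with PyYAML as follows; the keys and values shown are hypothetical, not NaturalCC's actual schema.

# Hypothetical configuration; keys/values are not NaturalCC's actual schema.
import yaml  # PyYAML

CONFIG = """
model: transformer_summarization
optimizer:
  lr: 5.0e-4
  betas: [0.9, 0.98]
training:
  max_epochs: 50
  fp16: true
"""

# Each model ships its own yaml file; tweaking it requires no code changes.
cfg = yaml.safe_load(CONFIG)
print(cfg["optimizer"]["lr"], cfg["training"]["fp16"])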
5.4 Graphical User Interface
We also develop a web-based graphical user interface to facilitate users in exploring the outcomes of trained models. The design is based on the open-source demonstration of AllenNLP [78]. Figure 5(a) displays a screenshot of our demonstration system, which currently features three code intelligence tasks: code summarization, code search, and code completion. We leave the integration of other code intelligence tasks to future work.
5.5 Leaderboard
We release a leaderboard so that researchers can report the results of their own models and compete with others, as shown in Figure 5(b). Currently, we only support researchers and developers who use NaturalCC to implement their approaches and submit the experimental results via pull requests on GitHub. In future work, we will build a web-based service that allows users to upload their predicted results and automatically evaluates model performance against the ground-truth labels.
6 Challenges and Opportunities
Although much effort has been devoted to deep learning for code intelligence, this area of research is still in its infancy, with many open challenges and opportunities. To inspire future research, this section suggests several potential directions worth pursuing.
Comprehensive Code Comprehension. Designing a representation approach to effectively and efficiently preserve the semantics of programs has always been a fundamental problem in code intelligence. Despite much effort on code representation, as mentioned in this paper, there are still three main obstacles to be overcome.
(a) Open Vocabulary. Building a vocabulary to index the textual tokens of code is the first step toward applying deep learning models to code intelligence. Because developers can name identifiers freely, the vocabulary of code is much more open and complicated than that of natural languages. The vocabulary of programming languages often consists of keywords, identifiers, customized method names, and variable names. This large vocabulary contains much “noise”, making it difficult to comprehend the code. Although many attempts [53, 60, 120] have been made toward mitigating the OOV issue, it remains a challenge to design a simple yet effective approach to map source code into indexes while preserving its semantics.
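One common mitigation is to split identifiers into subwords before indexing, as sketched below; this simple camelCase/snake_case splitter is illustrative and is not the technique of any single cited work.

# Illustrative identifier splitter, not from any single cited approach.
import re

def split_identifier(name: str) -> list[str]:
    # 'parseHTTPResponse_fromSocket' -> ['parse', 'http', 'response', 'from', 'socket']
    parts = name.replace("_", " ")
    parts = re.sub(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", parts)
    return parts.lower().split()

print(split_identifier("parseHTTPResponse_fromSocket"))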
(b) Complex Structure of Programs. Unlike natural language, code is written under a strict grammar. The computations described by code can be executed in an order that differs from the order in which the code was written, as is often the case with loops, recursion, and pointer manipulation. Although there have been many attempts to capture the structure of code from different modalities, as surveyed in this paper, we believe that the structures of code are not yet sufficiently preserved, and more effort is needed here. Inspired by GNNs, there is potential to design specialized GNNs to better represent the structure of programs. For example, from our analysis, ASTs, CFGs, DFGs, and CPGs are all highly heterogeneous; it is desirable to design heterogeneous-information-network-based approaches [223] to represent such heterogeneous code graphs.
(c) How to Feed Program Structures into LLMs? Current LLM architectures predominantly rely on pre-training with sequential data and lack support for structural inputs. However, source code exhibits rich structural features, including CFGs, DFGs, and PDGs. Consequently, integrating these structural features into LLMs has emerged as a prominent research endeavor. Recent studies [6, 244] have explored incorporating such features into LLMs through prompt design.
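As a toy illustration of the prompt-based route, structural facts (here, data-flow edges) can be serialized into text alongside the code; the format below is an assumption for illustration, not the scheme used in [6] or [244].

# Assumed prompt format: serialize data-flow edges as text so a
# sequence-only LLM can see program structure.
code = "x = read()\ny = x + 1\nprint(y)"
dfg_edges = [("x", "line 1", "line 2"), ("y", "line 2", "line 3")]  # var, def, use

structure = "\n".join(f"- value of '{v}' defined at {d} flows to {u}"
                      for v, d, u in dfg_edges)
prompt = (f"Code:\n{code}\n\nData-flow facts:\n{structure}\n\n"
          "Question: what does this program print if read() returns 41?")
print(prompt)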
Data Hungry and Data Quality. Despite much progress in deep-learning-based approaches for code intelligence, we argue that existing approaches still suffer from the data-hungry issue: the effectiveness of cutting-edge techniques depends heavily on the availability of vast quantities of expensive, labor-intensive, well-labeled training data. Training a model on a small dataset will yield far less precise results, especially for new programming languages or languages with an inadequate number of labeled samples. Therefore, it is important to design approaches that reduce the reliance on large quantities of labeled data. A similar problem exists in the field of machine learning at large. One promising solution to this dilemma is transfer learning, which has achieved great success in alleviating the data-hungry issue in computer vision and NLP. Similarly, to model an emerging programming language with limited data, it is desirable to mitigate the data-hungry issue by leveraging models trained on programming languages with sufficient labeled training data [42, 45, 57]. Data quality is also a crucial issue for code intelligence, and one that may exacerbate the data-hungry problem. From our analysis, datasets collected from online resources, such as GitHub and StackOverflow, are not quality-assured. Sun et al. [224] and Shi et al. [215] investigated the importance of data quality and verified it on the tasks of code search and code summarization, respectively. In the era of LLMs, the significance of data quality has surged. For instance, Gunasekar et al. [87] demonstrated the efficacy of training an LLM with a relatively modest parameter count of 1.3 billion on textbook data. Their work underscores the growing importance of meticulously selecting training data and harnessing synthetic data, a trend expected to intensify in the foreseeable future.
Multi-Lingual and Cross-Language. A codebase written in multiple programming languages can be considered a multi-lingual corpus, as in NLP. However, the multi-lingual problem in programming languages has not been well investigated. Different from the multi-lingual problems studied in NLP, corpora of multiple programming languages bring additional opportunities and challenges for future research. Recently, several attempts have been made to learn common knowledge shared among multiple programming languages and to transfer that knowledge across languages. For example, Zhang et al. [291] proposed obtaining better interpretability and generalizability by disentangling the semantics of source code from multiple programming languages based on variational autoencoders. Zügner et al. [309] introduced a language-agnostic code representation based on features extracted directly from the AST. Ahmed and Devanbu [5] conducted an exploratory study and revealed evidence that a multilingual property indeed exists in source code corpora; for example, programs that solve the same problem in different languages are likely to use the same or similar identifier names. They also investigated the effect of multilingual (pre-)training on code summarization and code search. Nafi et al. [175] proposed CLCDSA, a cross-language clone detector based on syntactical features and API documentation. Bui et al. [30] proposed a bilateral neural network for the task of cross-language algorithm classification. Bui et al. [31] proposed SAR, which can learn cross-language API mappings with minimal knowledge. Recently, Chai et al. [42] proposed CDCS, a novel approach for domain-specific code search through transfer learning across programming languages. Gui et al. [86] proposed an approach that matches source code and binary code across different languages based on intermediate representations.
Model Interpretability. Lack of interpretability is a common challenge for most deep-learning-based techniques for code intelligence, as deep learning is a black-box method. New methods and studies for interpreting the working mechanisms of deep neural networks are a promising research direction. Recently, several efforts have been made toward increasing the interpretability of deep-learning-based models. As an example, Li et al. [140] presented a novel approach to explaining the predictions of GNN-based vulnerability detection by extracting sub-graphs of the program dependency graph. In addition, Zou et al. [308] proposed interpreting a deep-learning-based vulnerability detection model by identifying a small number of tokens that contribute significantly to the detector's final prediction. Zhang et al. [295] proposed interpretable program synthesis that allows users to observe the synthesis process and exercise control over the synthesizer. Pornprasit et al. [192] proposed PyExplainer, a local rule-based model-agnostic approach for explaining the predictions of just-in-time defect models. Rabin et al. [196] proposed a model-agnostic explainer based on program simplification, inspired by delta debugging algorithms. Wan et al. [242], López et al. [161], and Sharma et al. [212] investigated the explainability of pre-trained code models by probing their attention and hidden representations. We believe that enhancing the interpretability of current deep-learning-based approaches is essential for code intelligence.
Effective and Efficient Use of Models. Recently, substantial advancements have been observed in the capabilities of LLMs for code intelligence. However, how best to utilize LLMs to harness their full potential poses a pressing challenge. To this end, several efforts have been undertaken along the following three directions.
(a) Prompt Engineering. One prominent avenue of research revolves around crafting prompts to improve interactions with LLMs at inference time. Numerous studies have delved into designing prompts customized for tasks such as code generation [150], code summarization [79], program repair [270], and vulnerability detection [68].
(b) Parameter-Efficient Tuning. As the size of LLMs increases, the expense of fully fine-tuning them escalates. Consequently, prompt tuning has emerged as a way to mitigate this cost. In [245], the authors empirically explored prompt tuning techniques on code intelligence tasks.
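A minimal sketch of prompt tuning is given below: the pre-trained model is frozen and only a few “soft prompt” vectors, prepended to the input embeddings, are trained. This follows the general recipe rather than the exact setup of [245].

# Sketch of prompt tuning; not the exact setup of [245].
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_tokens: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the learned prompt vectors to each sequence in the batch.
        batch = input_embeds.size(0)
        return torch.cat([self.prompt.expand(batch, -1, -1), input_embeds], dim=1)

soft_prompt = SoftPrompt(n_tokens=20, embed_dim=768)
# for p in pretrained_model.parameters(): p.requires_grad = False  # freeze the LLM
x = torch.randn(4, 128, 768)   # stand-in for embeddings from a frozen code LLM
print(soft_prompt(x).shape)    # torch.Size([4, 148, 768])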
(c) Model Compression. As pre-trained code models continue to grow in size, the computational expense of pre-training on large-scale code corpora remains significant, resulting in high costs for both training and inference. Zhang et al. [297] and Shi et al. [214] proposed improving the efficiency of the training process through model compression. Reducing the computational footprint of pre-trained code models is a promising research direction.
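One standard compression recipe is knowledge distillation, sketched below: a small student is trained to match the softened output distribution of a large teacher. This illustrates the generic objective; [297] and [214] differ in their details.

# Generic distillation objective; [297] and [214] differ in detail.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft target: KL between temperature-softened student and teacher outputs.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    # Hard target: the usual cross-entropy on ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10),
                         torch.randint(0, 10, (8,)))
print(loss.item())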
Robustness and Security. Despite significant progress in training accurate models for code intelligence, the robustness and security of these models have rarely been explored. As seen in the fields of NLP and CV, deep neural networks are frequently not robust [40]. Specifically, current deep learning models can be easily deceived by adversarial examples, which are created by making small, seemingly benign changes to the model's inputs. Many techniques for producing adversarial samples have been developed in the computer vision and NLP communities, particularly for image classification [37, 40, 71] and sentiment classification [296]. Source code models are similarly vulnerable to adversarial attacks. Recently, there have been several efforts to investigate the robustness and security of deep-learning-based models for code intelligence. For example, Ramakrishnan et al. [201] and Yefet et al. [282] investigated how to improve the robustness of source code models through adversarial training. Nguyen et al. [178] empirically investigated the use of adversarial learning techniques for API recommendation. Bielik and Vechev [25] introduced a method that combines adversarial training and representation refinement to create accurate and robust models of source code. Zhou et al. [303], Yang et al. [278], and Zhang et al. [290] proposed black-box attacks on neural code models that generate adversarial examples while preserving the semantics of the source code. Based on semantics-preserving code transformations, Quiring et al. [195] and Liu et al. [157] developed novel attacks against authorship attribution of source code. Ramakrishnan and Albarghouthi [200] investigated the possibility of injecting common backdoors into deep-learning-based models and developed a protection approach based on spectral signatures. Schuster et al. [209] and Wan et al. [241] proposed attacking neural code models through data poisoning, and verified the attacks on code completion and code search, respectively. Severi et al. [210] proposed an explanation-guided backdoor approach to attack malware classifiers. Overall, exploring the robustness and security of code intelligence models is an interesting and important research direction.
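To make the attack surface concrete, below is one semantics-preserving transformation that such black-box attacks search over: renaming an identifier never changes program behavior, yet can flip a model's prediction. The sketch is illustrative and does not reproduce any cited attack.

# Illustrative transformation; not a reproduction of any cited attack.
import re

def rename_variable(code: str, old: str, new: str) -> str:
    # Whole-word substitution keeps the program's semantics intact.
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

code = "def mean(values):\n    total = sum(values)\n    return total / len(values)"
adversarial = rename_variable(code, "values", "x7q")
# A robust model should behave identically on `code` and `adversarial`.
print(adversarial)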
Privacy. Despite the significant progress in code intelligence achieved through deep learning techniques, particularly LLMs, these models have been criticized for their strong propensity to memorize training data, raising privacy concerns about the inadvertent exposure of sensitive information. One line of work studies membership inference attacks, which aim to infer whether a data sample was used in training [217, 283]. A dominant approach here is shadow model training [217], which trains a binary attack classifier by creating multiple shadow models that mimic the behavior of the target model. Another line of work studies training data extraction attacks [38, 39, 197]. In [39], the authors empirically revealed that LLMs memorize privacy-sensitive information present in their training data, including personally identifiable information, URLs, code snippets, and UUIDs.
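The shadow-model idea can be sketched as follows, simplified to a single shadow model and a threshold attack on top-1 confidence; after real training, member confidences skew higher than non-member ones, which is the gap the attack exploits. All names and the threshold rule are illustrative assumptions, not the protocol of [217].

# Simplified shadow-model sketch; real attacks train many shadow models
# and a learned attack classifier rather than a single threshold.
import torch
import torch.nn as nn

shadow = nn.Linear(16, 4)            # stand-in for a trained shadow model
members = torch.randn(64, 16)        # samples the shadow model was trained on
non_members = torch.randn(64, 16)    # held-out samples

def top1_confidence(model, x):
    with torch.no_grad():
        return torch.softmax(model(x), dim=-1).max(dim=-1).values

# Predict "member" when confidence exceeds the average non-member confidence.
threshold = top1_confidence(shadow, non_members).mean()
predicted_member = top1_confidence(shadow, members) > threshold
print(predicted_member.float().mean())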