
Towards Learning Generalizable Code Embeddings Using Task-agnostic Graph Convolutional Networks

Published: 30 March 2023

Abstract

Code embeddings have seen increasing applications in software engineering (SE) research and practice in recent years. Despite advances in the embedding techniques applied in SE research, one of the main challenges is their generalizability: a recent study finds that code embeddings may not readily transfer to downstream tasks for which they were not specifically trained. Therefore, in this article, we propose GraphCodeVec, which represents source code as graphs and leverages Graph Convolutional Networks to learn more generalizable code embeddings in a task-agnostic manner. The nodes in the graph representation are constructed from the tokens in the source code, and the edges are automatically derived from the paths in the abstract syntax trees. To evaluate the effectiveness of GraphCodeVec, we consider three downstream benchmark tasks (i.e., code comment generation, code authorship identification, and code clone detection) that were used in a prior benchmarking of code embeddings and add three new downstream tasks (i.e., source code classification, logging statement prediction, and software defect prediction), for a total of six downstream tasks in our evaluation. For each downstream task, we apply the embeddings learned by GraphCodeVec and those learned by four baseline approaches and compare their respective performance. We find that GraphCodeVec outperforms all the baselines in five of the six downstream tasks, and its performance is relatively stable across different tasks and datasets. In addition, we perform ablation experiments to understand the impacts of the training context (i.e., the graph context extracted from the abstract syntax trees) and the training model (i.e., the Graph Convolutional Networks) on the effectiveness of the generated embeddings.
The results show that both the graph context and the Graph Convolutional Networks benefit GraphCodeVec in producing high-quality embeddings for the downstream tasks, while the improvement from the Graph Convolutional Networks is more robust across different downstream tasks and datasets. Our findings suggest that future research and practice may consider using graph-based deep learning methods to capture the structural information of the source code for SE tasks.
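To make the core idea concrete (source-code tokens as graph nodes, AST-derived edges between them, and a graph-convolution step that propagates node features), the pipeline can be sketched in a few lines. This is an illustrative toy example rather than the paper's implementation: the edge heuristic here (connecting identifier tokens whose AST nodes share a parent) and the single randomly initialized GCN layer are simplified stand-ins for the actual AST-path context extraction and training procedure.

```python
import ast
import numpy as np

code = "def add(a, b):\n    return a + b"
tree = ast.parse(code)

tokens, edges = [], []

def walk(node, parent_tokens):
    """Collect identifier tokens and connect tokens that share an AST parent."""
    local = []
    for child in ast.iter_child_nodes(node):
        # Name nodes carry .id, function parameters carry .arg.
        name = getattr(child, "id", None) or getattr(child, "arg", None)
        if name is not None:
            if name not in tokens:
                tokens.append(name)
            local.append(tokens.index(name))
        walk(child, local)
    for i in local:                      # sibling tokens under the same parent
        for j in local:
            if i != j:
                edges.append((i, j))
        for p in parent_tokens:          # link to tokens one level up
            edges.append((i, p))
            edges.append((p, i))

walk(tree, [])

# One GCN propagation step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)
n = len(tokens)
A = np.eye(n)                            # adjacency with self-loops
for i, j in edges:
    A[i, j] = 1.0
d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = d_inv_sqrt @ A @ d_inv_sqrt

rng = np.random.default_rng(0)
H = rng.normal(size=(n, 8))              # initial token features
W = rng.normal(size=(8, 8))              # layer weights (learned in practice)
H_next = np.maximum(A_hat @ H @ W, 0.0)  # updated token embeddings
print(tokens, H_next.shape)              # prints ['a', 'b'] (2, 8)
```

The symmetric normalization follows the standard GCN formulation of Kipf and Welling; in the actual approach, stacking such layers and training the weights against a context-prediction objective over the AST-derived graph yields the task-agnostic token embeddings.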




      Published In

      ACM Transactions on Software Engineering and Methodology  Volume 32, Issue 2
      March 2023
      946 pages
      ISSN:1049-331X
      EISSN:1557-7392
      DOI:10.1145/3586025
      • Editor:
      • Mauro Pezzè

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 30 March 2023
      Online AM: 08 June 2022
      Accepted: 07 May 2022
      Revised: 31 March 2022
      Received: 30 December 2020
      Published in TOSEM Volume 32, Issue 2


      Author Tags

      1. Machine learning
      2. Source code representation
      3. Code embeddings
      4. Neural network

      Qualifiers

      • Research-article


      Article Metrics

      • Downloads (Last 12 months)229
      • Downloads (Last 6 weeks)31
      Reflects downloads up to 13 Nov 2024

      Cited By
      • (2024) "Assessment of Software Vulnerability Contributing Factors by Model-Agnostic Explainable AI." Machine Learning and Knowledge Extraction 6(2), 1087–1113. DOI: 10.3390/make6020050. Online publication date: 16-May-2024.
      • (2024) "AI-Assisted Programming Tasks Using Code Embeddings and Transformers." Electronics 13(4), 767. DOI: 10.3390/electronics13040767. Online publication date: 15-Feb-2024.
      • (2024) "CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking." Proceedings of the ACM on Software Engineering 1(FSE), 562–585. DOI: 10.1145/3643752. Online publication date: 12-Jul-2024.
      • (2023) "LoGenText-Plus: Improving Neural Machine Translation Based Logging Texts Generation with Syntactic Templates." ACM Transactions on Software Engineering and Methodology 33(2), 1–45. DOI: 10.1145/3624740. Online publication date: 22-Dec-2023.
      • (2023) "DF4RT: Deep Forest for Requirements Traceability Recovery Between Use Cases and Source Code." 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 617–622. DOI: 10.1109/SMC53992.2023.10394259. Online publication date: 1-Oct-2023.
