Unsupervised Graph Neural Networks for Source Code Similarity Detection

Authors: Julien Cassagne, Guy-Vincent Jourdan, Iosif-Viorel Onut

Discovery Science: 26th International Conference, DS 2023, Porto, Portugal, October 9–11, 2023, Proceedings

Pages 535 - 549

https://doi.org/10.1007/978-3-031-45275-8_36

Published: 09 October 2023

Abstract

In this paper, we propose a novel unsupervised approach for code similarity and clone detection based on Graph Neural Networks. The approach is hybrid: it detects similarities within source code using centroid distances and a Graph Auto-Encoder that takes raw abstract syntax trees as input. Compared to $R_{TV}NN$ [33], the state-of-the-art unsupervised approach for code similarity and clone detection, our method significantly improves training and inference time efficiency while preserving or improving precision. In our experiments, our algorithm is on average 77 times faster during training and 21 times faster during inference. This shows that using Graph Auto-Encoders in the domain of source code similarity analysis is the better option in an industrial context or in a production environment. We illustrate this by using our approach to compute source code similarity within a large dataset of phishing kits written in PHP provided by our industry partner.
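To make the two ingredients named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes Python snippets parsed with the standard `ast` module (the paper's dataset is PHP), and it replaces the learned Graph Auto-Encoder embeddings with one-hot node-type vectors, so the "centroid" here is simply the mean of those vectors. The function names `ast_to_graph`, `centroid`, and `centroid_distance` are hypothetical and chosen only for illustration.

```python
import ast
import math


def ast_to_graph(source):
    """Turn a Python snippet into (node_types, edges) taken from its raw AST.

    Each AST node becomes a graph node labelled with its type name, and each
    parent-child relation becomes a directed edge (parent_index, child_index).
    """
    node_types, edges = [], []

    def visit(node, parent_index):
        index = len(node_types)
        node_types.append(type(node).__name__)
        if parent_index is not None:
            edges.append((parent_index, index))
        for child in ast.iter_child_nodes(node):
            visit(child, index)

    visit(ast.parse(source), None)
    return node_types, edges


def centroid(node_types, vocab):
    """Mean of one-hot node vectors (a stand-in for learned node embeddings)."""
    counts = [0.0] * len(vocab)
    for name in node_types:
        counts[vocab[name]] += 1.0
    return [c / len(node_types) for c in counts]


def centroid_distance(source_a, source_b):
    """Euclidean distance between the centroids of two snippets' AST graphs."""
    types_a, _ = ast_to_graph(source_a)
    types_b, _ = ast_to_graph(source_b)
    # Shared vocabulary so both centroids live in the same vector space.
    vocab = {t: i for i, t in enumerate(sorted(set(types_a) | set(types_b)))}
    return math.dist(centroid(types_a, vocab), centroid(types_b, vocab))


if __name__ == "__main__":
    original = "def f(a, b):\n    return a + b\n"
    renamed = "def g(x, y):\n    return x + y\n"
    unrelated = "for i in range(3):\n    print(i)\n"
    print(centroid_distance(original, renamed))
    print(centroid_distance(original, unrelated))
```

In this simplified setting, two snippets that differ only in identifier names have identical node-type distributions and therefore a distance of zero, while structurally different snippets are further apart. The edge list is not used by the distance itself; in a learned setup it is what a graph encoder would consume alongside the node features.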

References

[1]

Repository. https://gitlab.com/polymtl-static-analysis/vgae-code-analysis

[2]

Baxter, I., Yahin, A., Moura, L., Sant’Anna, M., Bier, L.: Clone detection using abstract syntax trees. In: Proceedings of the International Conference on Software Maintenance (Cat. No. 98CB36272), pp. 368–377 (1998)

[3]

Ducasse, S., Nierstrasz, O., Rieger, M.: On the effectiveness of clone detection by string matching: research articles. J. Softw. Maint. Evol. 18(1) (2006)

[4]

Feng, S., Duarte, M.F.: Graph autoencoder-based unsupervised feature selection with broad and local data structure preservation. Neurocomputing (2018)

[5]

Fey, M., Lenssen, J.E.: Fast Graph Representation Learning with PyTorch Geometric (2019)

[6]

Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. CoRR abs/1704.01212 (2017)

[7]

Gori, M., Monfardini, G., Scarselli, F.: A new model for learning in graph domains. In: Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (2005)

[8]

Jiang, S., Hong, Y., Fu, C., Qian, Y., Han, L.: Function-level obfuscation detection method based on graph convolutional networks. J. Inf. Secur. Appl. 61 (2021)

[9]

Kingma, D.P., Welling, M.: Auto-encoding variational bayes (2014)

[10]

Kipf, T.N., Welling, M.: Variational graph auto-encoders. arXiv:1611.07308 [cs, stat] (2016)

[11]

Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks (2017)

[12]

Li, Y., Gu, C., Dullien, T., Vinyals, O., Kohli, P.: Graph matching networks for learning the similarity of graph structured objects (2019)

[13]

Li, Y., Tarlow, D., Brockschmidt, M., Zemel, R.: Gated graph sequence neural networks (2015)

[14]

Liu, C., Lin, Z., Lou, J.G., Wen, L., Zhang, D.: Can neural clone detection generalize to unseen functionalities? In: 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 617–629 (2021)

[15]

Liu, S.: A unified framework to learn program semantics with graph neural networks. In: 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2020)

[16]

Ma, G., Ahmed, N.K., Willke, T.L., Yu, P.S.: Deep graph similarity learning: a survey. Data Min. Knowl. Disc. 35(3), 688–725 (2021)

[17]

McInnes, L., Healy, J., Melville, J.: UMAP: uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 (2020)

[18]

Mehrotra, N., Agarwal, N., Gupta, P., Anand, S., Lo, D., Purandare, R.: Modeling functional similarity in source code with graph-based siamese networks. arXiv:2011.11228 [cs] (2020)

[19]

Merlo, E., Antoniol, G., Di Penta, M., Rollo, V.: Linear complexity object-oriented similarity for clone detection and software evolution analyses. In: Proceedings of the 20th IEEE International Conference on Software Maintenance, pp. 412–416 (2004)

[20]

Nair, A., Roy, A., Meinke, K.: funcGNN: a graph neural network approach to program similarity. In: Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 1–11 (2020). arXiv: 2007.13239

[21]

Nguyen, V.A., Nguyen, D.Q., Nguyen, V., Le, T., Tran, Q.H., Phung, D.: ReGVD: revisiting graph neural networks for vulnerability detection. In: 2022 IEEE/ACM 44th International Conference on Software Engineering: Companion Proceedings (2022)

[22]

Pan, S., Hu, R., Long, G., Jiang, J., Yao, L., Zhang, C.: Adversarially regularized graph autoencoder for graph embedding. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI 2018. AAAI Press (2018)

[23]

Park, J., Lee, M., Chang, H., Lee, K., Choi, J.: Symmetric graph convolutional autoencoder for unsupervised graph representation learning. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019)

[24]

Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (2019)

[25]

Roy, C.K., Cordy, J.R., Koschke, R.: Comparison and evaluation of code clone detection techniques and tools: a qualitative approach. Sci. Comput. Program. 74(7), 470–495 (2009)

[26]

Rozi, M.F., Ban, T., Ozawa, S., Kim, S., Takahashi, T., Inoue, D.: JStrack: enriching malicious JavaScript detection based on AST graph analysis and attention mechanism. In: Neural Information Processing: ICONIP (2021)

[27]

Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE Trans. Neural Netw. 20(1), 61–80 (2009)

[28]

Siow, J.K., Liu, S., Xie, X., Meng, G., Liu, Y.: Learning program semantics with code representations: an empirical study. In: 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 554–565 (2022)

[29]

Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured long short-term memory networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, Beijing, China, pp. 1556–1566. Association for Computational Linguistics (2015)

[30]

Wang, L., et al.: Inductive and unsupervised representation learning on graph structured objects. In: International Conference on Learning Representations (2020)

[31]

Wang, W., Li, G., Ma, B., Xia, X., Jin, Z.: Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 261–271 (2020)

[32]

Wei, H., Li, M.: Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI 2017 (2017)

[33]

White, M., Tufano, M., Vendome, C., Poshyvanyk, D.: Deep learning code fragments for code clone detection. In: 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 87–98 (2016)

[34]

Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Yu, P.S.: A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32, 4–24 (2020)

[35]

Yahya, M.A., Kim, D.K.: CLCD-I: cross-language clone detection by using deep learning with infercode. Computers 12(1) (2023)

[36]

Yu, H., Lam, W., Chen, L., Li, G., Xie, T., Wang, Q.: Neural detection of semantic code clones via tree-based convolution. In: 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), pp. 70–80 (2019)

[37]

Zeng, J., Ben, K., Li, X., Zhang, X.: Fast code clone detection based on weighted recursive autoencoders. IEEE Access 7, 125062–125078 (2019)

[38]

Zhang, J., Wang, X., Zhang, H., Sun, H., Wang, K., Liu, X.: A novel neural source code representation based on abstract syntax tree. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 783–794 (2019)

[39]

Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Sun, M.: Graph neural networks: a review of methods and applications. AI Open 1, 57–81 (2020)

Published In

Discovery Science: 26th International Conference, DS 2023, Porto, Portugal, October 9–11, 2023, Proceedings

Oct 2023

724 pages

ISBN: 978-3-031-45274-1

DOI: 10.1007/978-3-031-45275-8

Editors:
Albert Bifet (Waikato University, Hamilton, New Zealand, https://ror.org/013fsnh78),
Ana Carolina Lorena (Aeronautics Institute of Technology, São José dos Campos, Brazil, https://ror.org/05vh67662),
Rita P. Ribeiro (University of Porto, Porto, Portugal, https://ror.org/043pwc612),
João Gama (University of Porto, Porto, Portugal, https://ror.org/043pwc612),
Pedro H. Abreu (University of Coimbra, Coimbra, Portugal, https://ror.org/04z8k9a98)

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023.

Publisher

Springer-Verlag

Berlin, Heidelberg
