DOI: 10.1007/978-3-031-45275-8_36

Unsupervised Graph Neural Networks for Source Code Similarity Detection

Published: 09 October 2023

Abstract

In this paper, we propose a novel unsupervised approach to code similarity and clone detection based on Graph Neural Networks. The approach is hybrid: it detects similarities within source code by combining centroid distances with a Graph Auto-Encoder that takes raw abstract syntax trees as input. Compared to RTVNN [33], the state-of-the-art unsupervised approach for code similarity and clone detection, our method significantly improves training and inference time efficiency while preserving or improving precision. In our experiments, our algorithm is on average 77 times faster during training and 21 times faster during inference. This shows that Graph Auto-Encoders are the better option for source code similarity analysis in an industrial context or in a production environment. We illustrate this by applying our approach to compute source code similarity within a large dataset of phishing kits written in PHP, provided by our industry partner.
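As a rough illustration of the two ingredients named above (raw abstract syntax trees as graph input, and a centroid distance between code fragments), the following minimal sketch may help. It is not the authors' implementation and rests on several assumptions: Python's standard `ast` module stands in for a PHP parser, one-hot node-type features stand in for the embeddings a trained Graph Auto-Encoder would produce, and "centroid distance" is read as the Euclidean distance between the mean node feature vectors of two fragments. The function names `ast_to_graph`, `centroid`, and `centroid_distance` are hypothetical.

```python
import ast
import math
from collections import Counter


def ast_to_graph(source):
    """Turn source code into a raw AST graph: (node_types, parent->child edges)."""
    node_types, edges = [], []

    def visit(node, parent_index):
        index = len(node_types)
        node_types.append(type(node).__name__)
        if parent_index is not None:
            edges.append((parent_index, index))
        for child in ast.iter_child_nodes(node):
            visit(child, index)

    visit(ast.parse(source), None)
    return node_types, edges


def centroid(node_types, vocabulary):
    """Mean of one-hot node-type vectors (ASSUMPTION: stand-in for learned embeddings)."""
    counts = Counter(node_types)
    total = len(node_types)
    return [counts[name] / total for name in vocabulary]


def centroid_distance(source_a, source_b):
    """Euclidean distance between the node-feature centroids of two fragments."""
    types_a, _ = ast_to_graph(source_a)  # edges are unused here; a graph
    types_b, _ = ast_to_graph(source_b)  # auto-encoder would consume them
    vocabulary = sorted(set(types_a) | set(types_b))
    ca = centroid(types_a, vocabulary)
    cb = centroid(types_b, vocabulary)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


if __name__ == "__main__":
    original = "def f(x):\n    return x + 1\n"
    renamed = "def g(y):\n    return y + 1\n"
    unrelated = "for i in range(3):\n    print(i)\n"
    print(centroid_distance(original, renamed))
    print(centroid_distance(original, unrelated))
```

Because only node types enter the features in this sketch, a fragment and its identifier-renamed copy have identical centroids and therefore a distance of zero, while structurally different fragments have a positive distance. In the approach described in the abstract, the node representations would instead come from a Graph Auto-Encoder trained on the AST graphs.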

References

[2]
Baxter, I., Yahin, A., Moura, L., Sant’Anna, M., Bier, L.: Clone detection using abstract syntax trees. In: Proceedings of the International Conference on Software Maintenance (Cat. No. 98CB36272), pp. 368–377 (1998)
[3]
Ducasse, S., Nierstrasz, O., Rieger, M.: On the effectiveness of clone detection by string matching: research articles. J. Softw. Maint. Evol. 18(1) (2006)
[4]
Feng, S., Duarte, M.F.: Graph autoencoder-based unsupervised feature selection with broad and local data structure preservation. Neurocomputing (2018)
[5]
Fey, M., Lenssen, J.E.: Fast Graph Representation Learning with PyTorch Geometric (2019)
[6]
Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. CoRR abs/1704.01212 (2017)
[7]
Gori, M., Monfardini, G., Scarselli, F.: A new model for learning in graph domains. In: Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (2005)
[8]
Jiang, S., Hong, Y., Fu, C., Qian, Y., Han, L.: Function-level obfuscation detection method based on graph convolutional networks. J. Inf. Secur. Appl. 61 (2021)
[9]
Kingma, D.P., Welling, M.: Auto-encoding variational bayes (2014)
[10]
Kipf, T.N., Welling, M.: Variational graph auto-encoders. arXiv:1611.07308 [cs, stat] (2016)
[11]
Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks (2017)
[12]
Li, Y., Gu, C., Dullien, T., Vinyals, O., Kohli, P.: Graph matching networks for learning the similarity of graph structured objects (2019)
[13]
Li, Y., Tarlow, D., Brockschmidt, M., Zemel, R.: Gated graph sequence neural networks (2015)
[14]
Liu, C., Lin, Z., Lou, J.G., Wen, L., Zhang, D.: Can neural clone detection generalize to unseen functionalities? In: 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 617–629 (2021)
[15]
Liu, S.: A unified framework to learn program semantics with graph neural networks. In: 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2020)
[16]
Ma, G., Ahmed, N.K., Willke, T.L., Yu, P.S.: Deep graph similarity learning: a survey. Data Min. Knowl. Disc. 35(3), 688–725 (2021)
[17]
McInnes, L., Healy, J., Melville, J.: UMAP: uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 (2020)
[18]
Mehrotra, N., Agarwal, N., Gupta, P., Anand, S., Lo, D., Purandare, R.: Modeling functional similarity in source code with graph-based siamese networks. arXiv:2011.11228 [cs] (2020)
[19]
Merlo, E., Antoniol, G., Di Penta, M., Rollo, V.: Linear complexity object-oriented similarity for clone detection and software evolution analyses. In: Proceedings of the 20th IEEE International Conference on Software Maintenance, pp. 412–416 (2004)
[20]
Nair, A., Roy, A., Meinke, K.: funcGNN: a graph neural network approach to program similarity. In: Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 1–11 (2020). arXiv:2007.13239
[21]
Nguyen, V.A., Nguyen, D.Q., Nguyen, V., Le, T., Tran, Q.H., Phung, D.: ReGVD: revisiting graph neural networks for vulnerability detection. In: 2022 IEEE/ACM 44th International Conference on Software Engineering: Companion Proceedings (2022)
[22]
Pan, S., Hu, R., Long, G., Jiang, J., Yao, L., Zhang, C.: Adversarially regularized graph autoencoder for graph embedding. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI 2018. AAAI Press (2018)
[23]
Park, J., Lee, M., Chang, H., Lee, K., Choi, J.: Symmetric graph convolutional autoencoder for unsupervised graph representation learning. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019)
[24]
Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (2019)
[25]
Roy, C.K., Cordy, J.R., Koschke, R.: Comparison and evaluation of code clone detection techniques and tools: a qualitative approach. Sci. Comput. Program. 74(7), 470–495 (2009)
[26]
Rozi, M.F., Ban, T., Ozawa, S., Kim, S., Takahashi, T., Inoue, D.: JStrack: enriching malicious JavaScript detection based on AST graph analysis and attention mechanism. In: Neural Information Processing: ICONIP (2021)
[27]
Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE Trans. Neural Netw. 20(1), 61–80 (2009)
[28]
Siow, J.K., Liu, S., Xie, X., Meng, G., Liu, Y.: Learning program semantics with code representations: an empirical study. In: 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 554–565 (2022)
[29]
Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured long short-term memory networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, Beijing, China, pp. 1556–1566. Association for Computational Linguistics (2015)
[30]
Wang, L., et al.: Inductive and unsupervised representation learning on graph structured objects. In: International Conference on Learning Representations (2020)
[31]
Wang, W., Li, G., Ma, B., Xia, X., Jin, Z.: Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 261–271 (2020)
[32]
Wei, H., Li, M.: Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI 2017 (2017)
[33]
White, M., Tufano, M., Vendome, C., Poshyvanyk, D.: Deep learning code fragments for code clone detection. In: 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 87–98 (2016)
[34]
Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Yu, P.S.: A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32, 4–24 (2020)
[35]
Yahya, M.A., Kim, D.K.: CLCD-I: cross-language clone detection by using deep learning with infercode. Computers 12(1) (2023)
[36]
Yu, H., Lam, W., Chen, L., Li, G., Xie, T., Wang, Q.: Neural detection of semantic code clones via tree-based convolution. In: 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), pp. 70–80 (2019)
[37]
Zeng, J., Ben, K., Li, X., Zhang, X.: Fast code clone detection based on weighted recursive autoencoders. IEEE Access 7, 125062–125078 (2019)
[38]
Zhang, J., Wang, X., Zhang, H., Sun, H., Wang, K., Liu, X.: A novel neural source code representation based on abstract syntax tree. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 783–794 (2019)
[39]
Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Sun, M.: Graph neural networks: a review of methods and applications. AI Open 1, 57–81 (2020)


Published In

Discovery Science: 26th International Conference, DS 2023, Porto, Portugal, October 9–11, 2023, Proceedings
Oct 2023
724 pages
ISBN:978-3-031-45274-1
DOI:10.1007/978-3-031-45275-8
Editors: Albert Bifet, Ana Carolina Lorena, Rita P. Ribeiro, João Gama, Pedro H. Abreu

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Graph neural network
  2. Unsupervised Learning
  3. Machine learning
  4. Phishing kits similarity
  5. Software similarity analysis
  6. Static analysis

Qualifiers

  • Article
