
Towards Learning Generalizable Code Embeddings Using Task-agnostic Graph Convolutional Networks

Published: 30 March 2023

Abstract

Code embeddings have seen increasing applications in software engineering (SE) research and practice in recent years. Despite advances in the embedding techniques applied in SE research, one of the main challenges is their generalizability: a recent study finds that code embeddings may not readily transfer to downstream tasks for which they were not specifically trained. Therefore, in this article, we propose GraphCodeVec, which represents source code as graphs and leverages Graph Convolutional Networks to learn more generalizable code embeddings in a task-agnostic manner. The nodes in the graph representation are constructed from the tokens in the source code, and the edges are automatically derived from the paths in the abstract syntax trees. To evaluate the effectiveness of GraphCodeVec, we consider three downstream benchmark tasks (i.e., code comment generation, code authorship identification, and code clone detection) that were used in a prior benchmarking of code embeddings and add three new downstream tasks (i.e., source code classification, logging statement prediction, and software defect prediction), for a total of six downstream tasks in our evaluation. For each downstream task, we apply the embeddings learned by GraphCodeVec and those learned by four baseline approaches and compare their respective performance. We find that GraphCodeVec outperforms all the baselines in five of the six downstream tasks, and its performance is relatively stable across different tasks and datasets. In addition, we perform ablation experiments to understand the impacts of the training context (i.e., the graph context extracted from the abstract syntax trees) and the training model (i.e., the Graph Convolutional Networks) on the effectiveness of the generated embeddings.
The results show that both the graph context and the Graph Convolutional Networks benefit GraphCodeVec in producing high-quality embeddings for the downstream tasks, while the improvement from the Graph Convolutional Networks is more robust across different downstream tasks and datasets. Our findings suggest that future research and practice may consider using graph-based deep learning methods to capture the structural information of the source code for SE tasks.
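To make the core idea concrete (source-code tokens as graph nodes, AST-derived edges between them, and a graph-convolution step that propagates node features), the pipeline can be sketched in a few lines. This is an illustrative toy example rather than the paper's implementation: the edge heuristic here (connecting identifier tokens whose AST nodes share a parent) and the single randomly initialized GCN layer are simplified stand-ins for the actual AST-path context extraction and training procedure.

```python
import ast
import numpy as np

code = "def add(a, b):\n    return a + b"
tree = ast.parse(code)

tokens, edges = [], []

def walk(node, parent_tokens):
    """Collect identifier tokens and connect tokens that share an AST parent."""
    local = []
    for child in ast.iter_child_nodes(node):
        # Name nodes carry .id, function parameters carry .arg.
        name = getattr(child, "id", None) or getattr(child, "arg", None)
        if name is not None:
            if name not in tokens:
                tokens.append(name)
            local.append(tokens.index(name))
        walk(child, local)
    for i in local:                      # sibling tokens under the same parent
        for j in local:
            if i != j:
                edges.append((i, j))
        for p in parent_tokens:          # link to tokens one level up
            edges.append((i, p))
            edges.append((p, i))

walk(tree, [])

# One GCN propagation step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)
n = len(tokens)
A = np.eye(n)                            # adjacency with self-loops
for i, j in edges:
    A[i, j] = 1.0
d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = d_inv_sqrt @ A @ d_inv_sqrt

rng = np.random.default_rng(0)
H = rng.normal(size=(n, 8))              # initial token features
W = rng.normal(size=(8, 8))              # layer weights (learned in practice)
H_next = np.maximum(A_hat @ H @ W, 0.0)  # updated token embeddings
print(tokens, H_next.shape)              # prints ['a', 'b'] (2, 8)
```

The symmetric normalization follows the standard GCN formulation of Kipf and Welling; in the actual approach, stacking such layers and training the weights against a context-prediction objective over the AST-derived graph yields the task-agnostic token embeddings.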




      Published In

      ACM Transactions on Software Engineering and Methodology  Volume 32, Issue 2
      March 2023
      946 pages
      ISSN:1049-331X
      EISSN:1557-7392
      DOI:10.1145/3586025
      • Editor:
      • Mauro Pezzè

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 30 March 2023
      Online AM: 08 June 2022
      Accepted: 07 May 2022
      Revised: 31 March 2022
      Received: 30 December 2020
      Published in TOSEM Volume 32, Issue 2


      Author Tags

      1. Machine learning
      2. Source code representation
      3. Code embeddings
      4. Neural network

      Qualifiers

      • Research-article


      Article Metrics

      • Downloads (Last 12 months)229
      • Downloads (Last 6 weeks)31
      Reflects downloads up to 13 Nov 2024

      Cited By
      • (2024) "Assessment of Software Vulnerability Contributing Factors by Model-Agnostic Explainable AI." Machine Learning and Knowledge Extraction 6(2), 1087–1113. DOI: 10.3390/make6020050. Online publication date: 16-May-2024.
      • (2024) "AI-Assisted Programming Tasks Using Code Embeddings and Transformers." Electronics 13(4), 767. DOI: 10.3390/electronics13040767. Online publication date: 15-Feb-2024.
      • (2024) "CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking." Proceedings of the ACM on Software Engineering 1(FSE), 562–585. DOI: 10.1145/3643752. Online publication date: 12-Jul-2024.
      • (2023) "LoGenText-Plus: Improving Neural Machine Translation Based Logging Texts Generation with Syntactic Templates." ACM Transactions on Software Engineering and Methodology 33(2), 1–45. DOI: 10.1145/3624740. Online publication date: 22-Dec-2023.
      • (2023) "DF4RT: Deep Forest for Requirements Traceability Recovery Between Use Cases and Source Code." 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 617–622. DOI: 10.1109/SMC53992.2023.10394259. Online publication date: 1-Oct-2023.
