Abstract
Source code classification (SCC) is the task of assigning source code to categories according to a criterion such as functionality, programming language, or vulnerability. Many source code archives are organized by programming language, so that desired code fragments can be found easily by searching within the archive. However, having field experts organize source code archives manually is labor-intensive and impractical because the amount of available source code is growing rapidly. Therefore, this study proposes new convolutional neural network (CNN) architectures for building source code classifiers that automatically identify the programming language of source code. This is the first study to compare the performance of deep learning algorithms for programming language identification on both image and text files. The experiments are performed on three source code datasets to identify eight programming languages: C, C++, C#, Go, Python, Ruby, Rust, and Java. The comparative results indicate that although the text-based and image-based SCC approaches achieve very high (\(> 93.5\%\)) and similar accuracies, text-based classification performs significantly better in terms of execution time.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the topical collection “Deep learning approaches for data analysis: A practical perspective” guest edited by D. Jude Hemanth, Lipo Wang and Anastasia Angelopoulou.
Cite this article
Kiyak, E.O., Cengiz, A.B., Birant, K.U. et al. Comparison of Image-Based and Text-Based Source Code Classification Using Deep Learning. SN COMPUT. SCI. 1, 266 (2020). https://doi.org/10.1007/s42979-020-00281-1
DOI: https://doi.org/10.1007/s42979-020-00281-1