Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

The Good, the Bad, and the Missing: Neural Code Generation for Machine Learning Tasks

Published: 22 December 2023 Publication History

Abstract

Machine learning (ML) has been increasingly used in a variety of domains, while solving ML programming tasks poses unique challenges due to the fundamental difference in the nature and the construct of general programming tasks, especially for developers who do not have ML backgrounds. Automatic code generation that produces a code snippet from a natural language description can be a promising technique to accelerate ML programming tasks. In recent years, although many deep learning-based neural code generation models have been proposed with high accuracy, the fact that most of them are mainly evaluated on general programming tasks calls into question their effectiveness and usefulness in ML programming tasks. In this article, we set out to investigate the effectiveness of existing neural code generation models on ML programming tasks. For our analysis, we select six state-of-the-art neural code generation models and evaluate their performance on four widely used ML libraries, with newly created 83K pairs of natural-language described ML programming tasks. Our empirical study reveals some good, bad, and missing aspects of neural code generation models on ML tasks, with a few major ones listed below. (Good) Neural code generation models perform significantly better on ML tasks than on non-ML tasks with an average difference of 10.6 points in BLEU-4 scores. (Bad) More than 80% of the generated code is semantically incorrect. (Bad) Code generation models do not have significance in improving developers’ completion time. (Good) The generated code can help developers write correct code by providing developers with clues for using correct APIs. (Missing) The observation from our user study reveals the missing aspects of code generation for ML tasks, e.g., decomposing code generation for divide-and-conquer into API sequence identification and API usage generation.

References

[1]
Rajas Agashe, Srinivasan Iyer, and Luke Zettlemoyer. 2019. Juice: A large scale distantly supervised dataset for open domain context-based code generation. Retrieved from https://arXiv:1910.02216
[2]
Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program understanding and generation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2655–2668.
[3]
Wasi Uddin Ahmad, Md Golam Rahman Tushar, Saikat Chakraborty, and Kai-Wei Chang. 2021. AVATAR: A parallel corpus for Java-Python program translation. Retrieved from https://arXiv:2108.11590
[4]
Miltiadis Allamanis. 2019. The adverse effects of code duplication in machine learning models of code. In Proceedings of the ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software. 143–153.
[5]
Miltiadis Allamanis and Charles Sutton. 2013. Mining source code repositories at massive scale using language modeling. In Proceedings of the 10th Working Conference on Mining Software Repositories (MSR’13). IEEE, 207–216.
[6]
Saleema Amershi, Andrew Begel, Christian Bird, Robert DeLine, Harald Gall, Ece Kamar, Nachiappan Nagappan, Besmira Nushi, and Thomas Zimmermann. 2019. Software engineering for machine learning: A case study. In Proceedings of the IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP’19). IEEE, 291–300.
[7]
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72.
[8]
Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Rose Biderman. 2021. GPT-Neo: Large scale autoregressive language modeling with Mesh-tensorflow. https://api.semanticscholar.org/CorpusID:245758737
[9]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[10]
Nghi D. Q. Bui, Yijun Yu, and Lingxiao Jiang. 2021. Self-supervised contrastive learning for code retrieval and summarization via semantic-preserving transformations. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 511–521.
[11]
Wing-Kwan Chan, Hong Cheng, and David Lo. 2012. Searching connected API subgraph via text phrases. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering. 1–11.
[12]
Chi Chen, Xin Peng, Bihuan Chen, Jun Sun, Zhenchang Xing, Xin Wang, and Wenyun Zhao. 2022. “More Than Deep Learning”: Post-processing for API sequence recommendation. Empir. Softw. Eng. 27, 1 (2022), 1–32.
[13]
Chi Chen, Xin Peng, Zhenchang Xing, Jun Sun, XinWang, Yifan Zhao, andWenyun Zhao. 2022. Holistic combination of structural and textual code information for context based API recommendation. IEEE Trans. Softw. Eng. 48, 8 (aug 2022), 2987–3009.
[14]
Nadezhda Chirkova and Sergey Troshin. 2021. Empirical study of transformers for source code. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 703–715.
[15]
J. Cohen. 1988. Statistical power analysis for the behavioral 698 sciences. Stat. Power Anal. Behav. Sci. 2 (1988), 495.
[16]
Anthony E. Cozzie and Samuel King. 2012. Macho: Writing programs with natural language and examples. (2012). http://hdl.handle.net/2142/33791
[17]
Samip Dahal, Adyasha Maharana, and Mohit Bansal. 2021. Analysis of tree-structured architectures for code generation. In Proceedings of the Association for Computational Linguistics (ACL-IJCNLP’21). 4382–4391.
[18]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. Retrieved from https://arXiv:1810.04805
[19]
Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. Retrieved from https://arXiv:1601.01280
[20]
Li Dong and Mirella Lapata. 2018. Coarse-to-fine decoding for neural semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 731–742.
[21]
Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. J. Artific. Intell. Res. 61 (2018), 65–170.
[22]
Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In Proceedings of the IEEE/ACM 40th International Conference on Software Engineering (ICSE’18). IEEE, 933–944.
[23]
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2020. Graphcodebert: Pre-training code representations with data flow. Retrieved from https://arXiv:2009.08366
[24]
Shirley Anugrah Hayati, Raphael Olivier, Pravalika Avvaru, Pengcheng Yin, Anthony Tomasic, and Graham Neubig. 2018. Retrieval-based neural code generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
[25]
Jin Huang and Charles X. Ling. 2005. Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 17, 3 (2005), 299–310.
[26]
Qiao Huang, Xin Xia, Zhenchang Xing, David Lo, and Xinyu Wang. 2018. API method recommendation without worrying about the task-API knowledge gap. In Proceedings of the 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE’18). IEEE, 293–304.
[27]
Hui Jiang, Chulun Zhou, Fandong Meng, Biao Zhang, Jie Zhou, Degen Huang, Qingqiang Wu, and Jinsong Su. 2021. Exploring dynamic selection of branch expansion orders for code generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 5076–5085.
[28]
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. Retrieved from https://arXiv:1602.02410
[29]
Samina Khalid, Tehmina Khalil, and Shamila Nasreen. 2014. A survey of feature selection and feature extraction techniques in machine learning. In Proceedings of the Science and Information Conference. IEEE, 372–378.
[30]
Alex LeClair and Collin McMillan. 2019. Recommendations for datasets for source code summarization. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’19). 3931–3937.
[31]
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[32]
Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out. 74–81.
[33]
Wang Ling, Phil Blunsom, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiskỳ, Fumin Wang, and Andrew Senior. 2016. Latent predictor networks for code generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 599–609.
[34]
Chang Liu, XinWang, Richard Shin, Joseph E. Gonzalez, and Dawn Song. 2017. Neural Code Completion. https://openreview.net/forum?id=rJbPBt9lg
[35]
Hui Liu, Mingzhu Shen, Jiaqi Zhu, Nan Niu, Ge Li, and Lu Zhang. 2022. Deep learning based program generation from requirements text: Are we there yet? IEEE Transactions on Software Engineering 48, 4 (2022), 1268–1289.
[36]
Xiaoyu Liu, LiGuo Huang, and Vincent Ng. 2018. Effective API recommendation without historical software repositories. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 282–292.
[37]
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. Retrieved from https://arXiv:2102.04664
[38]
Chen Lyu, Ruyun Wang, Hongyu Zhang, Hanwen Zhang, and Songlin Hu. 2021. Embedding API dependency graph for neural code generation. Empir. Softw. Eng. 26, 4 (2021), 1–51.
[39]
Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader Palacio, Denys Poshyvanyk, Rocco Oliveto, and Gabriele Bavota. 2021. Studying the usage of text-to-text transfer transformer to support code-related tasks. In Proceedings of the IEEE/ACM 43rd International Conference on Software Engineering (ICSE’21). IEEE, 336–347.
[40]
Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Qing Xie, and Chen Fu. 2011. Portfolio: Finding relevant functions and their usage. In Proceedings of the 33rd International Conference on Software Engineering. 111–120.
[41]
Vijayaraghavan Murali, Letao Qi, Swarat Chaudhuri, and Chris Jermaine. 2018. Neural sketch learning for conditional program generation. In Proceedings of the International Conference on Learning Representations.
[42]
Muhammad Sajid Nawaz, Saif Ur Rehman Khan, Shahid Hussain, and Javed Iqbal. 2023. A study on application programming interface recommendation: State-of-the-art techniques, challenges and future directions. Library Hi Tech 41, 2 (2023), 355–385.
[43]
Giang Nguyen, Stefan Dlugolinsky, Martin Bobák, Viet Tran, Alvaro Lopez Garcia, Ignacio Heredia, Peter Malík, and Ladislav Hluchỳ. 2019. Machine learning and deep learning frameworks and libraries for large-scale data mining: A survey. Artific. Intell. Rev. 52, 1 (2019), 77–124.
[44]
Sajad Norouzi, Keyi Tang, and Yanshuai Cao. 2021. Code generation from natural language with less prior knowledge and more monolingual data. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 776–785.
[45]
Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Learning to generate pseudo-code from source code using statistical machine translation. In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE’15). IEEE, 574–584.
[46]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
[47]
Yun Peng, Shuqing Li, Wenwei Gu, Yichen Li, Wenxuan Wang, Cuiyun Gao, and Michael R. Lyu. 2023. Revisiting, benchmarking and exploring API recommendation: How far are we? IEEE Transactions on Software Engineering 49, 4 (2023), 1876–1897.
[48]
Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1139–1149.
[49]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 140 (2020), 1–67.
[50]
Mukund Raghothaman, Yi Wei, and Youssef Hamadi. 2016. Swim: Synthesizing what i mean-code search and idiomatic snippet synthesis. In Proceedings of the IEEE/ACM 38th International Conference on Software Engineering (ICSE ’16). IEEE, 357–367.
[51]
Mohammad Masudur Rahman, Chanchal K. Roy, and David Lo. 2016. Rack: Automatic api recommendation using crowdsourced knowledge. In Proceedings of the IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER’16), Vol. 1. IEEE, 349–359.
[52]
Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: A method for automatic evaluation of code synthesis. Retrieved from https://arXiv:2009.10297
[53]
Zeyu Sun, Qihao Zhu, Lili Mou, Yingfei Xiong, Ge Li, and Lu Zhang. 2019. A grammar-based structural cnn decoder for code generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 7055–7062.
[54]
Zeyu Sun, Qihao Zhu, Yingfei Xiong, Yican Sun, Lili Mou, and Lu Zhang. 2020. Treegen: A tree-based transformer architecture for code generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 8984–8991.
[55]
Gias Uddin, Foutse Khomh, and Chanchal K. Roy. 2020. Mining API usage scenarios from stack overflow. Info. Softw. Technol. 122 (2020), 106277.
[56]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[57]
Song Wang, Nishtha Shrestha, Abarna Kucheri Subburaman, Junjie Wang, Moshi Wei, and Nachiappan Nagappan. 2021. Automatic unit test generation for machine learning libraries: How far are we? In Proceedings of the IEEE/ACM 43rd International Conference on Software Engineering (ICSE’21). IEEE, 1548–1560.
[58]
Moshi Wei, Yuchao Huang, Junjie Wang, Jiho Shin, Nima Shiri Harzevili, and Song Wang. 2022. API recommendation for machine learning libraries: how far are we? In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 370–381.
[59]
F. Wilcoxon. 1945. Individual Comparisons by Ranking Methods. Biom. Bull. 1 (1945), 80–83.
[60]
Wenkai Xie, Xin Peng, Mingwei Liu, Christoph Treude, Zhenchang Xing, Xiaoxin Zhang, and Wenyun Zhao. 2020. API method recommendation via explicit matching of functionality verb phrases. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1015–1026.
[61]
Frank F. Xu, Zhengbao Jiang, Pengcheng Yin, Bogdan Vasilescu, and Graham Neubig. 2020. Incorporating external knowledge through pre-training for natural language to code generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
[62]
Ziyu Yao, Daniel S. Weld, Wei-Peng Chen, and Huan Sun. 2018. Staqc: A systematically mined question-code dataset from stack overflow. In Proceedings of the World Wide Web Conference. 1693–1703.
[63]
Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from stack overflow. In Proceedings of the IEEE/ACM 15th International Conference on Mining Software Repositories (MSR’18). IEEE, 476–486.
[64]
Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 440–450.
[65]
Pengcheng Yin and Graham Neubig. 2018. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 7–12.
[66]
Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. 2022. CERT: Continual pre-training on sketches for library-oriented code generation. Retrieved from https://arXiv:2206.06888
[67]
Jie M. Zhang, Mark Harman, Lei Ma, and Yang Liu. 2020. Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering 48, 1 (2022), 1–36.

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Software Engineering and Methodology
ACM Transactions on Software Engineering and Methodology  Volume 33, Issue 2
February 2024
947 pages
EISSN:1557-7392
DOI:10.1145/3618077
  • Editor:
  • Mauro Pezzè
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 December 2023
Online AM: 23 October 2023
Accepted: 06 October 2023
Revised: 30 June 2023
Received: 22 November 2022
Published in TOSEM Volume 33, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Neural code generation
  2. machine learning tasks
  3. empirical analysis

Qualifiers

  • Research-article

Funding Sources

  • Natural Sciences and Engineering Research Council of Canada (NSERC) and the National Natural Science Foundation of China

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 374
    Total Downloads
  • Downloads (Last 12 months)363
  • Downloads (Last 6 weeks)51
Reflects downloads up to 21 Nov 2024

Other Metrics

Citations

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Full Text

View this article in Full Text.

Full Text

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media