
DOI: 10.1145/3510003.3510172
Research article
Open access

CodeFill: multi-token code completion by jointly learning from structure and naming sequences

Published: 05 July 2022

Abstract

Code completion is an essential feature of IDEs, yet current auto-completers are restricted to either grammar-based or NLP-based single token completions. Both approaches have significant drawbacks: grammar-based autocompletion is restricted in dynamically-typed language environments, whereas NLP-based autocompleters struggle to understand the semantics of the programming language and the developer's code context.
In this work, we present CodeFill, a language model for autocompletion that combines learned structure and naming information. Using a parallel Transformer architecture and multi-task learning, CodeFill consumes sequences of source code token names and their equivalent AST token types. Uniquely, CodeFill is trained both for single-token and multi-token (statement) prediction, which enables it to learn long-range dependencies among grammatical and naming elements. We train CodeFill on two datasets, consisting of 29M and 425M lines of code, respectively. To make the evaluation more realistic, we develop a method to automatically infer points in the source code at which completion matters. We compare CodeFill against four baselines and two state-of-the-art models, GPT-C and TravTrans+. CodeFill surpasses all baselines in single token prediction (MRR: 70.9% vs. 66.2% and 67.8%) and outperforms the state of the art for multi-token prediction (ROUGE-L: 63.7% vs. 52.4% and 59.2%, for n = 4 tokens). We publicly release our source code and datasets.
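To make the abstract's dual-stream input concrete, the sketch below uses Python's standard tokenize module to derive parallel sequences of token values and token-type names from a snippet. This is an illustrative analogue only, not the paper's pipeline: CodeFill pairs token names with AST-derived token types, whereas this sketch uses lexer token categories, and the function name is hypothetical.

```python
import io
import token
import tokenize

def parallel_sequences(source: str):
    """Return parallel (token value, token type name) sequences,
    a rough analogue of CodeFill's two input streams."""
    names, types = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Skip layout-only tokens that carry no naming information.
        if tok.type in (token.NEWLINE, token.NL, token.INDENT,
                        token.DEDENT, token.ENDMARKER, token.COMMENT):
            continue
        names.append(tok.string)
        types.append(token.tok_name[tok.type])
    return names, types

names, types = parallel_sequences("total = price * quantity")
print(names)  # ['total', '=', 'price', '*', 'quantity']
print(types)  # ['NAME', 'OP', 'NAME', 'OP', 'NAME']
```

A model in the spirit of CodeFill would consume both sequences jointly, so that the type stream generalizes across identifier names while the name stream preserves naming regularities.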




    Published In

    ICSE '22: Proceedings of the 44th International Conference on Software Engineering
    May 2022
    2508 pages
    ISBN:9781450392211
    DOI:10.1145/3510003
    This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

    In-Cooperation

    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. automatic code completion
    2. dynamically-typed languages
    3. multi-task learning
    4. transformers
    5. types

    Qualifiers

    • Research-article

    Funding Sources

    • NWO MIPL project
    • European Union's Horizon 2020 research and innovation programme

    Conference

    ICSE '22

    Acceptance Rates

    Overall acceptance rate: 276 of 1,856 submissions (15%)


    Article Metrics

    • Downloads (last 12 months): 448
    • Downloads (last 6 weeks): 50
    Reflects downloads up to 29 Jan 2025


    Cited By

    • (2025) On Inter-Dataset Code Duplication and Data Leakage in Large Language Models. IEEE Transactions on Software Engineering 51(1), 192-205. DOI: 10.1109/TSE.2024.3504286. Online publication date: 1-Jan-2025.
    • (2024) AI-Assisted Programming Tasks Using Code Embeddings and Transformers. Electronics 13(4), 767. DOI: 10.3390/electronics13040767. Online publication date: 15-Feb-2024.
    • (2024) Large Language Models for Software Engineering: A Systematic Literature Review. ACM Transactions on Software Engineering and Methodology 33(8), 1-79. DOI: 10.1145/3695988. Online publication date: 20-Sep-2024.
    • (2024) SALLM: Security Assessment of Generated Code. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering Workshops, 54-65. DOI: 10.1145/3691621.3694934. Online publication date: 27-Oct-2024.
    • (2024) DroidCoder: Enhanced Android Code Completion with Context-Enriched Retrieval-Augmented Generation. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 681-693. DOI: 10.1145/3691620.3695063. Online publication date: 27-Oct-2024.
    • (2024) Automatically Recommend Code Updates: Are We There Yet? ACM Transactions on Software Engineering and Methodology 33(8), 1-27. DOI: 10.1145/3678167. Online publication date: 16-Jul-2024.
    • (2024) A Transformer-Based Approach for Smart Invocation of Automatic Code Completion. Proceedings of the 1st ACM International Conference on AI-Powered Software, 28-37. DOI: 10.1145/3664646.3664760. Online publication date: 10-Jul-2024.
    • (2024) Investigating the Performance of Language Models for Completing Code in Functional Programming Languages: A Haskell Case Study. Proceedings of the 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering, 91-102. DOI: 10.1145/3650105.3652289. Online publication date: 14-Apr-2024.
    • (2024) Quality Assessment of ChatGPT Generated Code and Their Use by Developers. Proceedings of the 21st International Conference on Mining Software Repositories, 152-156. DOI: 10.1145/3643991.3645071. Online publication date: 15-Apr-2024.
    • (2024) Exploring and Improving Code Completion for Test Code. Proceedings of the 32nd IEEE/ACM International Conference on Program Comprehension, 137-148. DOI: 10.1145/3643916.3644421. Online publication date: 15-Apr-2024.
