research-article

Learning autocompletion from real-world datasets

Authors: Gareth Ari Aye, Hongyu Li

ICSE-SEIP '21: Proceedings of the 43rd International Conference on Software Engineering: Software Engineering in Practice

Pages 131 - 139

https://doi.org/10.1109/ICSE-SEIP52600.2021.00022

Published: 17 December 2021

Abstract

Code completion is a popular software development tool integrated into all major IDEs. Many neural language models have achieved promising results in completion suggestion prediction on synthetic benchmarks. However, a recent study, "When Code Completion Fails: A Case Study on Real-World Completions," demonstrates that these results may not translate to improvements in real-world performance. To combat this effect, we train models on real-world code completion examples and find that these models outperform models trained on committed source code and working version snapshots by 12.8% and 13.8% accuracy, respectively. We observe this improvement across modeling technologies and show through A/B testing that it corresponds to a 6.2% increase in programmers' actual autocompletion usage. Furthermore, our study characterizes a large corpus of logged autocompletion usages to investigate why training on real-world examples leads to stronger models.

References

[1] G. C. Murphy, M. Kersten, and L. Findlater, "How are Java software developers using the Eclipse IDE?" IEEE Softw., vol. 23, no. 4, pp. 76--83, Jul. 2006.

[2] M. Bruch, M. Monperrus, and M. Mezini, "Learning from examples to improve code completion systems," in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ser. ESEC/FSE '09. New York, NY, USA: ACM, 2009, pp. 213--222.

[3] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, "On the naturalness of software," in Proceedings of the 34th International Conference on Software Engineering, ser. ICSE '12. Piscataway, NJ, USA: IEEE Press, 2012, pp. 837--847. [Online]. Available: http://dl.acm.org/citation.cfm?id=2337223.2337322

[4] G. A. Aye and G. E. Kaiser, "Sequence model design for code completion in the modern IDE," 2020.

[5] S. Kim, J. Zhao, Y. Tian, and S. Chandra, "Code prediction by feeding trees to transformers," 2020.

[6] J. Li, Y. Wang, M. R. Lyu, and I. King, "Code completion with neural attention and pointer networks," Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Jul. 2018.

[7] V. Raychev, M. Vechev, and E. Yahav, "Code completion with statistical language models," in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '14. New York, NY, USA: ACM, 2014, pp. 419--428.

[8] R. Karampatsis and C. Sutton, "Maybe deep neural networks are the best choice for modeling source code," CoRR, vol. abs/1903.05734, 2019. [Online]. Available: http://arxiv.org/abs/1903.05734

[9] V. Raychev, P. Bielik, and M. Vechev, "Probabilistic model for code with decision trees," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ser. OOPSLA 2016. New York, NY, USA: Association for Computing Machinery, 2016, pp. 731--747.

[10] M. Brockschmidt, M. Allamanis, A. L. Gaunt, and O. Polozov, "Generative code modeling with graphs," 2019.

[11] U. Alon, R. Sadaka, O. Levy, and E. Yahav, "Structural language models of code," 2020.

[12] V. J. Hellendoorn, S. Proksch, H. C. Gall, and A. Bacchelli, "When code completion fails: A case study on real-world completions," ser. ICSE '19. IEEE Press, 2019, pp. 960--970.

[13] K. Heafield, "KenLM: Faster and smaller language model queries," in Proceedings of the Sixth Workshop on Statistical Machine Translation. Edinburgh, Scotland: Association for Computational Linguistics, Jul. 2011, pp. 187--197. [Online]. Available: https://www.aclweb.org/anthology/W11-2123

[14] K. Heafield, I. Pouzyrevsky, J. H. Clark, and P. Koehn, "Scalable modified Kneser-Ney language model estimation," in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Sofia, Bulgaria: Association for Computational Linguistics, Aug. 2013, pp. 690--696. [Online]. Available: https://www.aclweb.org/anthology/P13-2121

[15] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, ser. ACL '96. USA: Association for Computational Linguistics, 1996, pp. 310--318.

[16] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," 2015.

[17] M. Allamanis, "The adverse effects of code duplication in machine learning models of code," 2019.

[18] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, "On the naturalness of software," in Proceedings of the 34th International Conference on Software Engineering, ser. ICSE '12. IEEE Press, 2012, pp. 837--847.

[19] T. T. Nguyen, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen, "A statistical semantic language model for source code," in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2013. New York, NY, USA: Association for Computing Machinery, 2013, pp. 532--542.

[20] V. J. Hellendoorn and P. Devanbu, "Are deep neural networks the best choice for modeling source code?" in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2017. New York, NY, USA: Association for Computing Machinery, 2017, pp. 763--773.

[21] P. Bielik, V. Raychev, and M. Vechev, "PHOG: Probabilistic model for code," in Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ser. ICML'16. JMLR.org, 2016, pp. 2933--2942.

Cited By

Ziegler A, Kalliamvakou E, Li X, Rice A, Rifkin D, Simister S, Sittampalam G, Aftandilian E (2024). Measuring GitHub Copilot's Impact on Productivity. Communications of the ACM 67:3 (54-63). Online publication date: 22-Feb-2024.
https://dl.acm.org/doi/10.1145/3633453
Izadi M, Katzy J, Van Dam T, Otten M, Popescu R, Van Deursen A, Roychoudhury A, Paiva A, Abreu R, Storey M (2024). Language Models for Code Completion: A Practical Evaluation. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). Online publication date: 20-May-2024.
https://dl.acm.org/doi/10.1145/3597503.3639138
Guo Q, Cao J, Xie X, Liu S, Li X, Chen B, Peng X, Roychoudhury A, Paiva A, Abreu R, Storey M (2024). Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). Online publication date: 20-May-2024.
https://dl.acm.org/doi/10.1145/3597503.3623306
Published In

ICSE-SEIP '21: Proceedings of the 43rd International Conference on Software Engineering: Software Engineering in Practice

May 2021

405 pages

ISBN:9780738146690

Conference Chairs:
Sigrid Eldh, Ericsson, Sweden
Davide Falessi, California Polytechnic State University

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

In-Cooperation

IEEE CS

Publisher

IEEE Press

Qualifiers

Research-article

Conference

ICSE '21: 43rd International Conference on Software Engineering

Sponsor: SIGSOFT

May 25 - 28, 2021

Virtual Event, Spain

Bibliometrics & Citations

Bibliometrics

Total Citations: 11
Total Downloads: 21
Downloads (Last 12 months): 4
Downloads (Last 6 weeks): 0

Reflects downloads up to 05 Mar 2025
Citations

Cited By

Dinh T, Zhao J, Tan S, Negrinho R, Lausen L, Zha S, Karypis G, Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S (2023). Large language models of code fail at completing code with potential bugs. Proceedings of the 37th International Conference on Neural Information Processing Systems (41386-41412). Online publication date: 10-Dec-2023.
https://dl.acm.org/doi/10.5555/3666122.3667916
Wang C, Hu J, Gao C, Jin Y, Xie T, Huang H, Lei Z, Deng Y, Chandra S, Blincoe K, Tonella P (2023). How Practitioners Expect Code Completion? Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1294-1306). Online publication date: 30-Nov-2023.
https://dl.acm.org/doi/10.1145/3611643.3616280
Weyssow M, Zhou X, Kim K, Lo D, Sahraoui H, Chandra S, Blincoe K, Tonella P (2023). On the Usage of Continual Learning for Out-of-Distribution Generalization in Pre-trained Language Models of Code. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1470-1482). Online publication date: 30-Nov-2023.
https://dl.acm.org/doi/10.1145/3611643.3616244
Bibaev V, Kalina A, Lomshakov V, Golubev Y, Bezzubov A, Povarov N, Bryksin T, Roychoudhury A, Cadar C, Kim M (2022). All you need is logs: improving code completion by learning from anonymous IDE usage logs. Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1269-1279). Online publication date: 7-Nov-2022.
https://dl.acm.org/doi/10.1145/3540250.3558968
Ziegler A, Kalliamvakou E, Li X, Rice A, Rifkin D, Simister S, Sittampalam G, Aftandilian E, Chaudhuri S, Sutton C (2022). Productivity assessment of neural code completion. Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (21-29). Online publication date: 13-Jun-2022.
https://dl.acm.org/doi/10.1145/3520312.3534864
Cito J, Dillig I, Murali V, Chandra S, Harman M, Miller H (2022). Counterfactual explanations for models of code. Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice (125-134). Online publication date: 21-May-2022.
https://dl.acm.org/doi/10.1145/3510457.3513081
Izadi M, Gismondi R, Gousios G, Dwyer M, Damian D, Zeller A (2022). CodeFill. Proceedings of the 44th International Conference on Software Engineering (401-412). Online publication date: 21-May-2022.
https://dl.acm.org/doi/10.1145/3510003.3510172