DOI: 10.1109/ICSE-SEIP52600.2021.00022
research-article

Learning autocompletion from real-world datasets

Published: 17 December 2021

Abstract

Code completion is a popular software development tool integrated into all major IDEs. Many neural language models have achieved promising results in completion suggestion prediction on synthetic benchmarks. However, a recent study, "When Code Completion Fails: A Case Study on Real-World Completions," demonstrates that these results may not translate to improvements in real-world performance. To combat this effect, we train models on real-world code completion examples and find that these models outperform models trained on committed source code and working version snapshots by 12.8% and 13.8% accuracy, respectively. We observe this improvement across modeling technologies and show through A/B testing that it corresponds to a 6.2% increase in programmers' actual autocompletion usage. Furthermore, our study characterizes a large corpus of logged autocompletion usages to investigate why training on real-world examples leads to stronger models.

References

[1] G. C. Murphy, M. Kersten, and L. Findlater, "How are Java software developers using the Eclipse IDE?" IEEE Softw., vol. 23, no. 4, pp. 76–83, Jul. 2006.
[2] M. Bruch, M. Monperrus, and M. Mezini, "Learning from examples to improve code completion systems," in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ser. ESEC/FSE '09. New York, NY, USA: ACM, 2009, pp. 213–222.
[3] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, "On the naturalness of software," in Proceedings of the 34th International Conference on Software Engineering, ser. ICSE '12. Piscataway, NJ, USA: IEEE Press, 2012, pp. 837–847. [Online]. Available: http://dl.acm.org/citation.cfm?id=2337223.2337322
[4] G. A. Aye and G. E. Kaiser, "Sequence model design for code completion in the modern IDE," 2020.
[5] S. Kim, J. Zhao, Y. Tian, and S. Chandra, "Code prediction by feeding trees to transformers," 2020.
[6] J. Li, Y. Wang, M. R. Lyu, and I. King, "Code completion with neural attention and pointer networks," in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Jul. 2018.
[7] V. Raychev, M. Vechev, and E. Yahav, "Code completion with statistical language models," in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '14. New York, NY, USA: ACM, 2014, pp. 419–428.
[8] R. Karampatsis and C. Sutton, "Maybe deep neural networks are the best choice for modeling source code," CoRR, vol. abs/1903.05734, 2019. [Online]. Available: http://arxiv.org/abs/1903.05734
[9] V. Raychev, P. Bielik, and M. Vechev, "Probabilistic model for code with decision trees," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ser. OOPSLA 2016. New York, NY, USA: Association for Computing Machinery, 2016, pp. 731–747.
[10] M. Brockschmidt, M. Allamanis, A. L. Gaunt, and O. Polozov, "Generative code modeling with graphs," 2019.
[11] U. Alon, R. Sadaka, O. Levy, and E. Yahav, "Structural language models of code," 2020.
[12] V. J. Hellendoorn, S. Proksch, H. C. Gall, and A. Bacchelli, "When code completion fails: A case study on real-world completions," ser. ICSE '19. IEEE Press, 2019, pp. 960–970.
[13] K. Heafield, "KenLM: Faster and smaller language model queries," in Proceedings of the Sixth Workshop on Statistical Machine Translation. Edinburgh, Scotland: Association for Computational Linguistics, Jul. 2011, pp. 187–197. [Online]. Available: https://www.aclweb.org/anthology/W11-2123
[14] K. Heafield, I. Pouzyrevsky, J. H. Clark, and P. Koehn, "Scalable modified Kneser-Ney language model estimation," in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Sofia, Bulgaria: Association for Computational Linguistics, Aug. 2013, pp. 690–696. [Online]. Available: https://www.aclweb.org/anthology/P13-2121
[15] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, ser. ACL '96. USA: Association for Computational Linguistics, 1996, pp. 310–318.
[16] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," 2015.
[17] M. Allamanis, "The adverse effects of code duplication in machine learning models of code," 2019.
[18] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, "On the naturalness of software," in Proceedings of the 34th International Conference on Software Engineering, ser. ICSE '12. IEEE Press, 2012, pp. 837–847.
[19] T. T. Nguyen, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen, "A statistical semantic language model for source code," in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2013. New York, NY, USA: Association for Computing Machinery, 2013, pp. 532–542.
[20] V. J. Hellendoorn and P. Devanbu, "Are deep neural networks the best choice for modeling source code?" in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2017. New York, NY, USA: Association for Computing Machinery, 2017, pp. 763–773.
[21] P. Bielik, V. Raychev, and M. Vechev, "PHOG: Probabilistic model for code," in Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ser. ICML'16. JMLR.org, 2016, pp. 2933–2942.

Published In

ICSE-SEIP '21: Proceedings of the 43rd International Conference on Software Engineering: Software Engineering in Practice
May 2021
405 pages
ISBN: 9780738146690

Sponsors

In-Cooperation

• IEEE CS

Publisher

IEEE Press

Author Tags

1. code completion
2. integrated development environments
3. machine learning
4. naturalness
5. neural networks
6. software language models
7. software tools

Qualifiers

• Research-article

Conference

ICSE '21

Cited By

• (2024) Measuring GitHub Copilot's Impact on Productivity. Communications of the ACM 67:3 (54-63). DOI: 10.1145/3633453. Online publication date: 22-Feb-2024
• (2024) Language Models for Code Completion: A Practical Evaluation. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). DOI: 10.1145/3597503.3639138. Online publication date: 20-May-2024
• (2024) Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). DOI: 10.1145/3597503.3623306. Online publication date: 20-May-2024
• (2023) Large language models of code fail at completing code with potential bugs. Proceedings of the 37th International Conference on Neural Information Processing Systems (41386-41412). DOI: 10.5555/3666122.3667916. Online publication date: 10-Dec-2023
• (2023) How Practitioners Expect Code Completion? Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1294-1306). DOI: 10.1145/3611643.3616280. Online publication date: 30-Nov-2023
• (2023) On the Usage of Continual Learning for Out-of-Distribution Generalization in Pre-trained Language Models of Code. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1470-1482). DOI: 10.1145/3611643.3616244. Online publication date: 30-Nov-2023
• (2022) All you need is logs: improving code completion by learning from anonymous IDE usage logs. Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1269-1279). DOI: 10.1145/3540250.3558968. Online publication date: 7-Nov-2022
• (2022) Productivity assessment of neural code completion. Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (21-29). DOI: 10.1145/3520312.3534864. Online publication date: 13-Jun-2022
• (2022) Counterfactual explanations for models of code. Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice (125-134). DOI: 10.1145/3510457.3513081. Online publication date: 21-May-2022
• (2022) CodeFill. Proceedings of the 44th International Conference on Software Engineering (401-412). DOI: 10.1145/3510003.3510172. Online publication date: 21-May-2022
