Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2983990.2984041acmconferencesArticle/Chapter ViewAbstractPublication PagessplashConference Proceedingsconference-collections
research-article

Probabilistic model for code with decision trees

Published: 19 October 2016 Publication History

Abstract

In this paper we introduce a new approach for learning precise and general probabilistic models of code based on decision tree learning. Our approach directly benefits an emerging class of statistical programming tools which leverage probabilistic models of code learned over large codebases (e.g., GitHub) to make predictions about new programs (e.g., code completion, repair, etc).
The key idea is to phrase the problem of learning a probabilistic model of code as learning a decision tree in a domain specific language over abstract syntax trees (called TGen). This allows us to condition the prediction of a program element on a dynamically computed context. Further, our problem formulation enables us to easily instantiate known decision tree learning algorithms such as ID3, but also to obtain new variants we refer to as ID3+ and E13, not previously explored and ones that outperform ID3 in prediction accuracy.
Our approach is general and can be used to learn a probabilistic model of any programming language. We implemented our approach in a system called Deep3 and evaluated it for the challenging task of learning probabilistic models of JavaScript and Python. Our experimental results indicate that Deep3 predicts elements of JavaScript and Python code with precision above 82% and 69%, respectively. Further, Deep3 often significantly outperforms state-of-the-art approaches in overall prediction accuracy.

References

[1]
M. Allamanis and C. Sutton. Mining source code repositories at massive scale using language modeling. In MSR, 2013.
[2]
M. Allamanis and C. Sutton. Mining idioms from source code. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 472–483, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-3056-5.
[3]
M. Allamanis, E. T. Barr, C. Bird, and C. Sutton. Suggesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 38–49, New York, NY, USA, 2015.
[4]
ACM. ISBN 978-1-4503-3675-8.
[5]
M. Allamanis, D. Tarlow, A. D. Gordon, and Y. Wei. Bimodal modelling of source code and natural language. In F. R. Bach and D. M. Blei, editors, ICML, volume 37 of JMLR Proceedings, pages 2123–2132. JMLR.org, 2015.
[6]
R. Alur, R. Bod´ık, G. Juniwal, M. M. K. Martin, M. Raghothaman, S. A. Seshia, R. Singh, A. Solar-Lezama, E. Torlak, and A. Udupa. Syntax-guided synthesis. In Formal Methods in Computer-Aided Design, FMCAD 2013, Portland, OR, USA, October 20-23, 2013, pages 1–8, 2013.
[7]
S. Barman, R. Bodik, S. Chandra, E. Torlak, A. Bhattacharya, and D. Culler. Toward tool support for interactive synthesis. In 2015 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward!), Onward! 2015, pages 121–136, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3688-8.
[8]
P. Bielik, V. Raychev, and M. T. Vechev. PHOG: probabilistic model for code. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2933–2942, 2016.
[9]
P. Garg, D. Neider, P. Madhusudan, and D. Roth. Learning invariants using decision trees and implication counterexamples. In Proceedings of the 43rd Annual ACM SIGPLANSIGACT Symposium on Principles of Programming Languages, POPL 2016, pages 499–512, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3549-2.
[10]
T. Gvero and V. Kuncak. Synthesizing java expressions from free-form queries. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, part of SLASH 2015, Pittsburgh, PA, USA, October 25-30, 2015, pages 416–432, 2015.
[11]
T. Gvero, V. Kuncak, I. Kuraj, and R. Piskac. Complete completion using types and weights. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, pages 27–38. ACM, 2013. ISBN 978-1-4503-2014-6.
[12]
A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu. On the naturalness of software. In Proceedings of the 34th International Conference on Software Engineering, ICSE ’12, pages 837–847, Piscataway, NJ, USA, 2012. IEEE Press. ISBN 978- 1-4673-1067-3.
[13]
T. Hottelier and R. Bodik. Synthesis of layout engines from relational constraints. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, pages 74–88, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3689-5.
[14]
C.-H. Hsiao, M. Cafarella, and S. Narayanasamy. Using web corpus statistics for program analysis. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA ’14, pages 49–65, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2585-1.
[15]
S. Jha, S. Gulwani, S. A. Seshia, and A. Tiwari. Oracleguided component-based program synthesis. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE ’10, pages 215–224, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-719-6.
[16]
S. Karaivanov, V. Raychev, and M. T. Vechev. Phrase-based statistical translation of programming languages. In Onward! 2014, Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, part of SLASH ’14, Portland, OR, USA, October 20-24, 2014, pages 173–184, 2014. 2661136.2661148.
[17]
E. Kneuss, I. Kuraj, V. Kuncak, and P. Suter. Synthesis modulo recursive functions. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA ’13, pages 407–426, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2374-1.
[18]
P. Liang, M. I. Jordan, and D. Klein. Learning programs: A hierarchical bayesian approach. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 639–646, 2010.
[19]
F. Long and M. Rinard. Automatic patch generation by learning correct code. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, pages 298–312, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3549-2.
[20]
C. J. Maddison and D. Tarlow. Structured generative models of natural source code. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 649–657, 2014.
[21]
T. M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1 edition, 1997.
[22]
A. T. Nguyen and T. N. Nguyen. Graph-based statistical language model for code. In Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE ’15, pages 858–868, Piscataway, NJ, USA, 2015. IEEE Press. ISBN 978-1-4799-1934-5.
[23]
T. T. Nguyen, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen. A statistical semantic language model for source code. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, pages 532–542, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2237-9.
[24]
J. R. Quinlan. Induction of decision trees. Mach. Learn., 1 (1):81–106, Mar. 1986. ISSN 0885-6125. 1022643204877.
[25]
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
[26]
ISBN 1-55860-238-0.
[27]
V. Raychev, M. Vechev, and E. Yahav. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pages 419–428, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2784-8.
[28]
V. Raychev, M. Vechev, and A. Krause. Predicting program properties from ”big code”. In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’15, pages 111–124. ACM, 2015. ISBN 978-1-4503-3300-9.
[29]
V. Raychev, P. Bielik, M. Vechev, and A. Krause. Learning programs from noisy data. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, pages 761–774, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3549-2.
[30]
M. Raza, S. Gulwani, and N. Milic-Frayling. Compositional program synthesis from natural language and examples. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 792–800, 2015.
[31]
R. Rosenfeld. Two decades of statistical language modeling: Where do we go from here. In Proceedings of the IEEE, 2000.
[32]
A. Solar-Lezama. Program sketching. STTT, 15(5-6):475– 495, 2013.
[33]
A. Solar-Lezama, L. Tancau, R. Bod´ık, S. A. Seshia, and V. A. Saraswat. Combinatorial sketching for finite programs. In ASPLOS, pages 404–415, 2006.
[34]
A. Stolcke. SRILM-an Extensible Language Modeling Toolkit. International Conference on Spoken Language Processing, 2002.
[35]
I. H. Witten and T. C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37 (4):1085–1094, 1991.
[36]
T. Zhaopeng, S. Zhendong, and D. Premkumar. On the localness of software. In Foundations of Software Engineering, FSE ’14, New York, NY, USA, 2014. ACM.

Cited By

View all
  • (2024)Your Code Secret Belongs to Me: Neural Code Completion Tools Can Memorize Hard-Coded CredentialsProceedings of the ACM on Software Engineering10.1145/36608181:FSE(2515-2537)Online publication date: 12-Jul-2024
  • (2024)Non-Autoregressive Line-Level Code CompletionACM Transactions on Software Engineering and Methodology10.1145/364959433:5(1-34)Online publication date: 26-Feb-2024
  • (2024)A practical tool for detecting cross-language code pairs with similar control structuresProceedings of the 39th ACM/SIGAPP Symposium on Applied Computing10.1145/3605098.3636134(1301-1303)Online publication date: 8-Apr-2024
  • Show More Cited By

Index Terms

  1. Probabilistic model for code with decision trees

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    OOPSLA 2016: Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications
    October 2016
    915 pages
    ISBN:9781450344449
    DOI:10.1145/2983990
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    In-Cooperation

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 October 2016

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Code Completion
    2. Decision Trees
    3. Probabilistic Models of Code

    Qualifiers

    • Research-article

    Conference

    SPLASH '16
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 268 of 1,244 submissions, 22%

    Upcoming Conference

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)173
    • Downloads (Last 6 weeks)17
    Reflects downloads up to 18 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Your Code Secret Belongs to Me: Neural Code Completion Tools Can Memorize Hard-Coded CredentialsProceedings of the ACM on Software Engineering10.1145/36608181:FSE(2515-2537)Online publication date: 12-Jul-2024
    • (2024)Non-Autoregressive Line-Level Code CompletionACM Transactions on Software Engineering and Methodology10.1145/364959433:5(1-34)Online publication date: 26-Feb-2024
    • (2024)A practical tool for detecting cross-language code pairs with similar control structuresProceedings of the 39th ACM/SIGAPP Symposium on Applied Computing10.1145/3605098.3636134(1301-1303)Online publication date: 8-Apr-2024
    • (2024)Res-Unet based blood vessel segmentation and cardio vascular disease prediction using chronological chef-based optimization algorithm based deep residual network from retinal fundus imagesMultimedia Tools and Applications10.1007/s11042-024-18810-yOnline publication date: 20-Mar-2024
    • (2024)A study of common bug fix patterns in RustEmpirical Software Engineering10.1007/s10664-023-10437-129:2Online publication date: 12-Feb-2024
    • (2024)Rethinking AI code generation: a one-shot correction approach based on user feedbackAutomated Software Engineering10.1007/s10515-024-00451-y31:2Online publication date: 12-Jul-2024
    • (2024)Guiding Enumerative Program Synthesis with Large Language ModelsComputer Aided Verification10.1007/978-3-031-65630-9_15(280-301)Online publication date: 25-Jul-2024
    • (2023)CROSSCODEEVALProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3668145(46701-46723)Online publication date: 10-Dec-2023
    • (2023)Large language models of code fail at completing code with potential bugsProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3667916(41386-41412)Online publication date: 10-Dec-2023
    • (2023)LongCoderProceedings of the 40th International Conference on Machine Learning10.5555/3618408.3618894(12098-12107)Online publication date: 23-Jul-2023
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media