research-article

Probabilistic model for code with decision trees

Authors:

Veselin Raychev,

Martin VechevAuthors Info & Claims

OOPSLA 2016: Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications

Pages 731 - 747

https://doi.org/10.1145/2983990.2984041

Published: 19 October 2016 Publication History

Abstract

In this paper we introduce a new approach for learning precise and general probabilistic models of code based on decision tree learning. Our approach directly benefits an emerging class of statistical programming tools which leverage probabilistic models of code learned over large codebases (e.g., GitHub) to make predictions about new programs (e.g., code completion, repair, etc).

The key idea is to phrase the problem of learning a probabilistic model of code as learning a decision tree in a domain specific language over abstract syntax trees (called TGen). This allows us to condition the prediction of a program element on a dynamically computed context. Further, our problem formulation enables us to easily instantiate known decision tree learning algorithms such as ID3, but also to obtain new variants we refer to as ID3+ and E13, not previously explored and ones that outperform ID3 in prediction accuracy.

Our approach is general and can be used to learn a probabilistic model of any programming language. We implemented our approach in a system called Deep3 and evaluated it for the challenging task of learning probabilistic models of JavaScript and Python. Our experimental results indicate that Deep3 predicts elements of JavaScript and Python code with precision above 82% and 69%, respectively. Further, Deep3 often significantly outperforms state-of-the-art approaches in overall prediction accuracy.

References

[1]

M. Allamanis and C. Sutton. Mining source code repositories at massive scale using language modeling. In MSR, 2013.

Digital Library

[2]

M. Allamanis and C. Sutton. Mining idioms from source code. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 472–483, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-3056-5.

Digital Library

[3]

M. Allamanis, E. T. Barr, C. Bird, and C. Sutton. Suggesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 38–49, New York, NY, USA, 2015.

Digital Library

[4]

ACM. ISBN 978-1-4503-3675-8.

[5]

M. Allamanis, D. Tarlow, A. D. Gordon, and Y. Wei. Bimodal modelling of source code and natural language. In F. R. Bach and D. M. Blei, editors, ICML, volume 37 of JMLR Proceedings, pages 2123–2132. JMLR.org, 2015.

[6]

R. Alur, R. Bod´ık, G. Juniwal, M. M. K. Martin, M. Raghothaman, S. A. Seshia, R. Singh, A. Solar-Lezama, E. Torlak, and A. Udupa. Syntax-guided synthesis. In Formal Methods in Computer-Aided Design, FMCAD 2013, Portland, OR, USA, October 20-23, 2013, pages 1–8, 2013.

[7]

S. Barman, R. Bodik, S. Chandra, E. Torlak, A. Bhattacharya, and D. Culler. Toward tool support for interactive synthesis. In 2015 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward!), Onward! 2015, pages 121–136, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3688-8.

Digital Library

[8]

P. Bielik, V. Raychev, and M. T. Vechev. PHOG: probabilistic model for code. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2933–2942, 2016.

[9]

P. Garg, D. Neider, P. Madhusudan, and D. Roth. Learning invariants using decision trees and implication counterexamples. In Proceedings of the 43rd Annual ACM SIGPLANSIGACT Symposium on Principles of Programming Languages, POPL 2016, pages 499–512, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3549-2.

Digital Library

[10]

T. Gvero and V. Kuncak. Synthesizing java expressions from free-form queries. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, part of SLASH 2015, Pittsburgh, PA, USA, October 25-30, 2015, pages 416–432, 2015.

Digital Library

[11]

T. Gvero, V. Kuncak, I. Kuraj, and R. Piskac. Complete completion using types and weights. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, pages 27–38. ACM, 2013. ISBN 978-1-4503-2014-6.

Digital Library

[12]

A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu. On the naturalness of software. In Proceedings of the 34th International Conference on Software Engineering, ICSE ’12, pages 837–847, Piscataway, NJ, USA, 2012. IEEE Press. ISBN 978- 1-4673-1067-3.

Digital Library

[13]

T. Hottelier and R. Bodik. Synthesis of layout engines from relational constraints. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, pages 74–88, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3689-5.

Digital Library

[14]

C.-H. Hsiao, M. Cafarella, and S. Narayanasamy. Using web corpus statistics for program analysis. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages &#38; Applications, OOPSLA ’14, pages 49–65, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2585-1.

Digital Library

[15]

S. Jha, S. Gulwani, S. A. Seshia, and A. Tiwari. Oracleguided component-based program synthesis. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE ’10, pages 215–224, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-719-6.

Digital Library

[16]

S. Karaivanov, V. Raychev, and M. T. Vechev. Phrase-based statistical translation of programming languages. In Onward! 2014, Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, part of SLASH ’14, Portland, OR, USA, October 20-24, 2014, pages 173–184, 2014. 2661136.2661148.

Digital Library

[17]

E. Kneuss, I. Kuraj, V. Kuncak, and P. Suter. Synthesis modulo recursive functions. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages &#38; Applications, OOPSLA ’13, pages 407–426, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2374-1.

Digital Library

[18]

P. Liang, M. I. Jordan, and D. Klein. Learning programs: A hierarchical bayesian approach. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 639–646, 2010.

[19]

F. Long and M. Rinard. Automatic patch generation by learning correct code. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, pages 298–312, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3549-2.

Digital Library

[20]

C. J. Maddison and D. Tarlow. Structured generative models of natural source code. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 649–657, 2014.

[21]

T. M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1 edition, 1997.

Digital Library

[22]

A. T. Nguyen and T. N. Nguyen. Graph-based statistical language model for code. In Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE ’15, pages 858–868, Piscataway, NJ, USA, 2015. IEEE Press. ISBN 978-1-4799-1934-5.

Digital Library

[23]

T. T. Nguyen, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen. A statistical semantic language model for source code. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, pages 532–542, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2237-9.

Digital Library

[24]

J. R. Quinlan. Induction of decision trees. Mach. Learn., 1 (1):81–106, Mar. 1986. ISSN 0885-6125. 1022643204877.

[25]

J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.

Digital Library

[26]

ISBN 1-55860-238-0.

[27]

V. Raychev, M. Vechev, and E. Yahav. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pages 419–428, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2784-8.

Digital Library

[28]

V. Raychev, M. Vechev, and A. Krause. Predicting program properties from ”big code”. In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’15, pages 111–124. ACM, 2015. ISBN 978-1-4503-3300-9.

Digital Library

[29]

V. Raychev, P. Bielik, M. Vechev, and A. Krause. Learning programs from noisy data. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, pages 761–774, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3549-2.

Digital Library

[30]

M. Raza, S. Gulwani, and N. Milic-Frayling. Compositional program synthesis from natural language and examples. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 792–800, 2015.

Digital Library

[31]

R. Rosenfeld. Two decades of statistical language modeling: Where do we go from here. In Proceedings of the IEEE, 2000.

[32]

A. Solar-Lezama. Program sketching. STTT, 15(5-6):475– 495, 2013.

Digital Library

[33]

A. Solar-Lezama, L. Tancau, R. Bod´ık, S. A. Seshia, and V. A. Saraswat. Combinatorial sketching for finite programs. In ASPLOS, pages 404–415, 2006.

Digital Library

[34]

A. Stolcke. SRILM-an Extensible Language Modeling Toolkit. International Conference on Spoken Language Processing, 2002.

[35]

I. H. Witten and T. C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37 (4):1085–1094, 1991.

Digital Library

[36]

T. Zhaopeng, S. Zhendong, and D. Premkumar. On the localness of software. In Foundations of Software Engineering, FSE ’14, New York, NY, USA, 2014. ACM.

Digital Library

Cited By

Huang YLi YWu WZhang JLyu M(2024)Your Code Secret Belongs to Me: Neural Code Completion Tools Can Memorize Hard-Coded CredentialsProceedings of the ACM on Software Engineering10.1145/36608181:FSE(2515-2537)Online publication date: 12-Jul-2024
https://dl.acm.org/doi/10.1145/3660818
Liu FFu ZLi GJin ZLiu HHao YZhang L(2024)Non-Autoregressive Line-Level Code CompletionACM Transactions on Software Engineering and Methodology10.1145/364959433:5(1-34)Online publication date: 26-Feb-2024
https://dl.acm.org/doi/10.1145/3649594
Dai FChiba SHong JPark J(2024)A practical tool for detecting cross-language code pairs with similar control structuresProceedings of the 39th ACM/SIGAPP Symposium on Applied Computing10.1145/3605098.3636134(1301-1303)Online publication date: 8-Apr-2024
https://dl.acm.org/doi/10.1145/3605098.3636134
Show More Cited By

Index Terms

Probabilistic model for code with decision trees
1. Software and its engineering
  1. Software notations and tools

Recommendations

Probabilistic model for code with decision trees
OOPSLA '16

In this paper we introduce a new approach for learning precise and general probabilistic models of code based on decision tree learning. Our approach directly benefits an emerging class of statistical programming tools which leverage probabilistic ...
Decision trees as possibilistic classifiers

This paper addresses the classification problem with imperfect data. More precisely, it extends standard decision trees to handle uncertainty in both building and classification procedures. Uncertainty here is represented by means of possibility ...
Learning probabilistic decision trees for AUC
Special issue: ROC analysis in pattern recognition

Accurate ranking, measured by AUC (the area under the ROC curve), is crucial in many real-world applications. Most traditional learning algorithms, however, aim only at high classification accuracy. It has been observed that traditional decision trees ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

OOPSLA 2016: Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications

October 2016

915 pages

ISBN:9781450344449

DOI:10.1145/2983990

General Chair:
Eelco Visser
Delft University of Technology, Netherlands
,
Program Chair:
Yannis Smaragdakis
University of Athens, Greece

ACM SIGPLAN Notices Volume 51, Issue 10
OOPSLA '16
October 2016
915 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/3022671
Editor:
Matthew Fluet
Issue’s Table of Contents

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

In-Cooperation

SIGAda: ACM Special Interest Group on Ada Programming Language

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 October 2016

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

SPLASH '16

Sponsor:

SIGPLAN

SPLASH '16: Conference on Systems, Programming, Languages, and Applications: Software for Humanity

November 2 - 4, 2016

Amsterdam, Netherlands

Acceptance Rates

Overall Acceptance Rate 268 of 1,244 submissions, 22%

Upcoming Conference

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

131
Total Citations
View Citations
1,248
Total Downloads

Downloads (Last 12 months)173
Downloads (Last 6 weeks)17

Reflects downloads up to 18 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Huang YLi YWu WZhang JLyu M(2024)Your Code Secret Belongs to Me: Neural Code Completion Tools Can Memorize Hard-Coded CredentialsProceedings of the ACM on Software Engineering10.1145/36608181:FSE(2515-2537)Online publication date: 12-Jul-2024
https://dl.acm.org/doi/10.1145/3660818
Liu FFu ZLi GJin ZLiu HHao YZhang L(2024)Non-Autoregressive Line-Level Code CompletionACM Transactions on Software Engineering and Methodology10.1145/364959433:5(1-34)Online publication date: 26-Feb-2024
https://dl.acm.org/doi/10.1145/3649594
Dai FChiba SHong JPark J(2024)A practical tool for detecting cross-language code pairs with similar control structuresProceedings of the 39th ACM/SIGAPP Symposium on Applied Computing10.1145/3605098.3636134(1301-1303)Online publication date: 8-Apr-2024
https://dl.acm.org/doi/10.1145/3605098.3636134
S BKadry SDhanaraj RK SManthiramoorthy C(2024)Res-Unet based blood vessel segmentation and cardio vascular disease prediction using chronological chef-based optimization algorithm based deep residual network from retinal fundus imagesMultimedia Tools and Applications10.1007/s11042-024-18810-yOnline publication date: 20-Mar-2024
https://doi.org/10.1007/s11042-024-18810-y
Robati Shirzad MLam P(2024)A study of common bug fix patterns in RustEmpirical Software Engineering10.1007/s10664-023-10437-129:2Online publication date: 12-Feb-2024
https://doi.org/10.1007/s10664-023-10437-1
Le KAndrzejak A(2024)Rethinking AI code generation: a one-shot correction approach based on user feedbackAutomated Software Engineering10.1007/s10515-024-00451-y31:2Online publication date: 12-Jul-2024
https://doi.org/10.1007/s10515-024-00451-y
Li YParsert JPolgreen E(2024)Guiding Enumerative Program Synthesis with Large Language ModelsComputer Aided Verification10.1007/978-3-031-65630-9_15(280-301)Online publication date: 25-Jul-2024
https://doi.org/10.1007/978-3-031-65630-9_15
Ding YWang ZAhmad WDing HTan MJain NRamanathan MNallapati RBhatia PRoth DXiang BOh ANaumann TGloberson ASaenko KHardt MLevine S(2023)CROSSCODEEVALProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3668145(46701-46723)Online publication date: 10-Dec-2023
https://dl.acm.org/doi/10.5555/3666122.3668145
Dinh TZhao JTan SNegrinho RLausen LZha SKarypis GOh ANaumann TGloberson ASaenko KHardt MLevine S(2023)Large language models of code fail at completing code with potential bugsProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3667916(41386-41412)Online publication date: 10-Dec-2023
https://dl.acm.org/doi/10.5555/3666122.3667916
Guo DXu CDuan NYin JMcAuley JKrause ABrunskill ECho KEngelhardt BSabato SScarlett J(2023)LongCoderProceedings of the 40th International Conference on Machine Learning10.5555/3618408.3618894(12098-12107)Online publication date: 23-Jul-2023
https://dl.acm.org/doi/10.5555/3618408.3618894
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents