research-article

Open access

Automatically generating features for learning program analysis heuristics for C-like languages

Authors:

Hongseok YangAuthors Info & Claims

Proceedings of the ACM on Programming Languages, Volume 1, Issue OOPSLA

Article No.: 101, Pages 1 - 25

https://doi.org/10.1145/3133925

Published: 12 October 2017 Publication History

Abstract

We present a technique for automatically generating features for data-driven program analyses. Recently data-driven approaches for building a program analysis have been developed, which mine existing codebases and automatically learn heuristics for finding a cost-effective abstraction for a given analysis task. Such approaches reduce the burden of the analysis designers, but they do not remove it completely; they still leave the nontrivial task of designing so called features to the hands of the designers. Our technique aims at automating this feature design process. The idea is to use programs as features after reducing and abstracting them. Our technique goes through selected program-query pairs in codebases, and it reduces and abstracts the program in each pair to a few lines of code, while ensuring that the analysis behaves similarly for the original and the new programs with respect to the query. Each reduced program serves as a boolean feature for program-query pairs. This feature evaluates to true for a given program-query pair when (as a program) it is included in the program part of the pair. We have implemented our approach for three real-world static analyses. The experimental results show that these analyses with automatically-generated features are cost-effective and consistently perform well on a wide range of programs.

References

[1]

Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2015. Suggesting Accurate Method and Class Names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). ACM, New York, NY, USA, 38–49.

Digital Library

[2]

Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A Convolutional Attention Network for Extreme Summarization of Source Code. In Proceedings of The 33rd International Conference on Machine Learning. 2091–2100.

[3]

L. O. Andersen. 1994. Program Analysis and Specialization for the C Programming Language. Ph.D. Dissertation. DIKU, University of Copenhagen. (DIKU report 94/19).

[4]

Thomas Ball and Sriram K. Rajamani. 2002. The SLAM Project: Debugging System Software via Static Analysis. In Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’02). ACM, New York, NY, USA, 1–3.

Digital Library

[5]

Nels E. Beckman and Aditya V. Nori. 2011. Probabilistic, Modular and Scalable Inference of Typestate Specifications. In Proceedings of the 32Nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’11). ACM, New York, NY, USA, 211–221.

Digital Library

[6]

Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35, 8 (2013), 1798–1828.

Digital Library

[7]

Pavol Bielik, Veselin Raychev, and Martin T Vechev. 2016. PHOG: probabilistic model for code. In Proceedings of the 33nd International Conference on Machine Learning, ICML. 19–24.

Digital Library

[8]

Sam Blackshear, Bor-Yuh Evan Chang, and Manu Sridharan. 2015. Selective Control-flow Abstraction via Jumping. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2015). ACM, New York, NY, USA, 163–182.

Digital Library

[9]

Leo Breiman. 2001. Random forests. Machine learning 45, 1 (2001), 5–32.

Digital Library

[10]

Sooyoung Cha, Sehun Jeong, and Hakjoo Oh. 2016. Learning a Strategy for Choosing Widening Thresholds from a Large Codebase. In Asian Symposium on Programming Languages and Systems. Springer, 25–41.

[11]

Edmund Clarke, Orna Grumberg, Somesh Jha, Yuan Lu, and Helmut Veith. 2003. Counterexample-guided Abstraction Refinement for Symbolic Model Checking. J. ACM 50, 5 (Sept. 2003), 752–794.

Digital Library

[12]

Yaniv David, Nimrod Partush, and Eran Yahav. 2016. Statistical Similarity of Binaries. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’16). ACM, New York, NY, USA, 266–280.

Digital Library

[13]

Azadeh Farzan and Zachary Kincaid. 2012. Verification of Parameterized Concurrent Programs by Modular Reasoning About Data and Control. In Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’12). ACM, New York, NY, USA, 297–308.

Digital Library

[14]

Pranav Garg, Christof Löding, P Madhusudan, and Daniel Neider. 2014. ICE: A robust framework for learning invariants. In International Conference on Computer Aided Verification. Springer, 69–87.

Digital Library

[15]

Pranav Garg, Daniel Neider, P. Madhusudan, and Dan Roth. 2016. Learning Invariants Using Decision Trees and Implication Counterexamples. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’16). ACM, New York, NY, USA, 499–512.

Digital Library

[16]

Timon Gehr, Dimitar Dimitrov, and Martin Vechev. 2015. Learning commutativity specifications. In International Conference on Computer Aided Verification. Springer, 307–323.

[17]

Sergey Grebenshchikov, Ashutosh Gupta, Nuno P Lopes, Corneliu Popeea, and Andrey Rybalchenko. 2012. HSF (C): a software verifier based on horn clauses. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 549–551.

Digital Library

[18]

Radu Grigore and Hongseok Yang. 2016. Abstraction Refinement Guided by a Learnt Probabilistic Model. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’16). ACM, New York, NY, USA, 485–498.

Digital Library

[19]

Bhargav S Gulavani, Supratik Chakraborty, Aditya V Nori, and Sriram K Rajamani. 2008. Automatically refining abstract interpretations. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 443–458.

[20]

Ashutosh Kumar Gupta, Rupak Majumdar, and Andrey Rybalchenko. 2013. From tests to proofs. International Journal on Software Tools for Technology Transfer 15, 4 (2013), 291–303.

Digital Library

[21]

Thomas A. Henzinger, Ranjit Jhala, Rupak Majumdar, and Kenneth L. McMillan. 2004. Abstractions from Proofs. In Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’04). ACM, New York, NY, USA, 232–244.

Digital Library

[22]

Kihong Heo, Hakjoo Oh, and Hongseok Yang. 2016. Learning a variable-clustering strategy for octagon from labeled data generated by a static analysis. In International Static Analysis Symposium. Springer, 237–256.

[23]

Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the Naturalness of Software. In Proceedings of the 34th International Conference on Software Engineering (ICSE ’12). IEEE Press, Piscataway, NJ, USA, 837–847. http://dl.acm.org/citation.cfm?id=2337223.2337322

[24]

Bertrand Jeannet and Antoine Miné. 2009. Apron: A library of numerical abstract domains for static analysis. In International Conference on Computer Aided Verification. Springer, 661–667.

Digital Library

[25]

Sehun Jeong, Minseok Jeon, Sungdeok Cha, and Hakjoo Oh. 2017. Data-Driven Context-Sensitivity for Points-to Analysis. Proceedings of the ACM on Programming Languages 1, OOPSLA (2017).

Digital Library

[26]

George Kastrinis and Yannis Smaragdakis. 2013. Hybrid Context-sensitivity for Points-to Analysis. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’13). ACM, New York, NY, USA, 423–434.

Digital Library

[27]

Omer Katz, Ran El-Yaniv, and Eran Yahav. 2016. Estimating Types in Binaries Using Predictive Modeling. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’16). ACM, New York, NY, USA, 313–326.

Digital Library

[28]

Sulekha Kulkarni, Ravi Mangal, Xin Zhang, and Mayur Naik. 2016. Accelerating Program Analyses by Cross-program Training. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2016). ACM, New York, NY, USA, 359–377.

Digital Library

[29]

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.

[30]

Percy Liang, Omer Tripp, and Mayur Naik. 2011. Learning Minimal Abstractions. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’11). ACM, New York, NY, USA, 31–42.

Digital Library

[31]

Benjamin Livshits, Aditya V. Nori, Sriram K. Rajamani, and Anindya Banerjee. 2009. Merlin: Specification Inference for Explicit Information Flow Problems. In Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’09). ACM, New York, NY, USA, 75–86.

Digital Library

[32]

Fan Long and Martin Rinard. 2016. Automatic Patch Generation by Learning Correct Code. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’16). ACM, New York, NY, USA, 298–312.

Digital Library

[33]

Magnus Madsen and Anders Møller. 2014. Sparse dataflow analysis with pointers and reachability. In International Static Analysis Symposium. Springer, 201–218.

[34]

Ravi Mangal, Xin Zhang, Aditya V. Nori, and Mayur Naik. 2015. A User-guided Approach to Program Analysis. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). ACM, New York, NY, USA, 462–473.

Digital Library

[35]

A. Miné. 2006. The Octagon Abstract Domain. Higher-Order and Symbolic Computation 19, 1 (2006), 31–100.

Digital Library

[36]

Alon Mishne, Sharon Shoham, and Eran Yahav. 2012. Typestate-based Semantic Code Search over Partial Programs. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA ’12). ACM, New York, NY, USA, 997–1016.

Digital Library

[37]

Mayur Naik, Hongseok Yang, Ghila Castelnuovo, and Mooly Sagiv. 2012. Abstractions from Tests. In Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’12). ACM, New York, NY, USA, 373–386.

Digital Library

[38]

Aditya V. Nori and Rahul Sharma. 2013. Termination Proofs from Tests. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2013). ACM, New York, NY, USA, 246–256.

Digital Library

[39]

Hakjoo Oh, Kihong Heo, Wonchan Lee, Woosuk Lee, and Kwangkeun Yi. 2012. Design and Implementation of Sparse Global Analyses for C-like Languages. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’12). ACM, New York, NY, USA, 229–238.

Digital Library

[40]

Hakjoo Oh, Wonchan Lee, Kihong Heo, Hongseok Yang, and Kwangkeun Yi. 2014. Selective Context-sensitivity Guided by Impact Pre-analysis. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’14). ACM, New York, NY, USA, 475–484.

Digital Library

[41]

Hakjoo Oh, Hongseok Yang, and Kwangkeun Yi. 2015. Learning a Strategy for Adapting a Program Analysis via Bayesian Optimisation. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2015). ACM, New York, NY, USA, 572–588.

Digital Library

[42]

Saswat Padhi, Rahul Sharma, and Todd Millstein. 2016. Data-driven Precondition Inference with Learned Features. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’16). ACM, New York, NY, USA, 42–56.

Digital Library

[43]

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12 (Nov. 2011), 2825–2830. http://dl.acm.org/citation.cfm?id=1953048.2078195

Digital Library

[44]

Veselin Raychev, Pavol Bielik, Martin Vechev, and Andreas Krause. 2016. Learning Programs from Noisy Data. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’16). ACM, New York, NY, USA, 761–774.

Digital Library

[45]

Veselin Raychev, Martin Vechev, and Andreas Krause. 2015. Predicting Program Properties from "Big Code". In Proceedings of the 42Nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’15). ACM, New York, NY, USA, 111–124.

Digital Library

[46]

Veselin Raychev, Martin Vechev, and Eran Yahav. 2014. Code Completion with Statistical Language Models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’14). ACM, New York, NY, USA, 419–428.

Digital Library

[47]

John Regehr, Yang Chen, Pascal Cuoq, Eric Eide, Chucky Ellison, and Xuejun Yang. 2012. Test-case Reduction for C Compiler Bugs. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’12). ACM, New York, NY, USA, 335–346.

Digital Library

[48]

Sriram Sankaranarayanan, Swarat Chaudhuri, Franjo Ivančić, and Aarti Gupta. 2008. Dynamic Inference of Likely Data Preconditions over Predicates by Tree Learning. In Proceedings of the 2008 International Symposium on Software Testing and Analysis (ISSTA ’08). ACM, New York, NY, USA, 295–306.

Digital Library

[49]

Rahul Sharma, Saurabh Gupta, Bharath Hariharan, Alex Aiken, Percy Liang, and Aditya V Nori. 2013. A data driven approach for algebraic loop invariants. In European Symposium on Programming. Springer, 574–592.

Digital Library

[50]

Rahul Sharma, Aditya V Nori, and Alex Aiken. 2012. Interpolants as classifiers. In International Conference on Computer Aided Verification. Springer, 71–87.

Digital Library

[51]

Gagandeep Singh, Markus Püschel, and Martin Vechev. 2015. Making Numerical Program Analysis Fast. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’15). ACM, New York, NY, USA, 303–313.

Digital Library

[52]

Yannis Smaragdakis, George Kastrinis, and George Balatsouras. 2014. Introspective Analysis: Context-sensitivity, Across the Board. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’14). ACM, New York, NY, USA, 485–495.

Digital Library

[53]

Xin Zhang, Ravi Mangal, Radu Grigore, Mayur Naik, and Hongseok Yang. 2014. On Abstraction Refinement for Program Analyses in Datalog. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’14). ACM, New York, NY, USA, 239–248.

Digital Library

[54]

Xin Zhang, Mayur Naik, and Hongseok Yang. 2013. Finding Optimum Abstractions in Parametric Dataflow Analysis. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’13). ACM, New York, NY, USA, 365–376.

Digital Library

[55]

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems. 649–657.

[56]

He Zhu, Aditya V. Nori, and Suresh Jagannathan. 2015. Learning Refinement Types. In Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming (ICFP 2015). ACM, New York, NY, USA, 400–411.

Digital Library

[57]

He Zhu, Gustavo Petri, and Suresh Jagannathan. 2016. Automatically Learning Shape Specifications. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’16). ACM, New York, NY, USA, 491–507.

Digital Library

Cited By

Yan ZZhang XDi P(2024)Scaling Abstraction Refinement for Program Analyses in Datalog using Graph Neural NetworksProceedings of the ACM on Programming Languages10.1145/36897658:OOPSLA2(1532-1560)Online publication date: 8-Oct-2024
https://dl.acm.org/doi/10.1145/3689765
Du YYu ZChandra SBlincoe KTonella P(2023)Pre-training Code Representation with Semantic Flow Graph for Effective Bug LocalizationProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering10.1145/3611643.3616338(579-591)Online publication date: 30-Nov-2023
https://dl.acm.org/doi/10.1145/3611643.3616338
Ko YOh HGrundy JPollock LPenta M(2023)Learning to Boost Disjunctive Static Bug-FindersProceedings of the 45th International Conference on Software Engineering10.1109/ICSE48619.2023.00099(1097-1109)Online publication date: 14-May-2023
https://dl.acm.org/doi/10.1109/ICSE48619.2023.00099
Show More Cited By

Index Terms

Automatically generating features for learning program analysis heuristics for C-like languages
1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
2. Theory of computation
  1. Semantics and reasoning
    1. Program reasoning
      1. Program analysis

Recommendations

Data-driven context-sensitivity for points-to analysis

We present a new data-driven approach to achieve highly cost-effective context-sensitive points-to analysis for Java. While context-sensitivity has greater impact on the analysis precision and performance than any other precision-improving techniques, ...
Learning graph-based heuristics for pointer analysis without handcrafting application-specific features

We present Graphick, a new technique for automatically learning graph-based heuristics for pointer analysis. Striking a balance between precision and scalability of pointer analysis requires designing good analysis heuristics. For example, because ...
AFGSL: Automatic Feature Generation based on Graph Structure Learning
Abstract
Feature engineering relies on domain knowledge and human intervention. To automate the process of feature engineering, automated feature construction methods use deep neural networks to capture feature interactions and attention ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Programming Languages

Proceedings of the ACM on Programming Languages Volume 1, Issue OOPSLA

October 2017

1786 pages

EISSN:2475-1421

DOI:10.1145/3152284

Issue’s Table of Contents

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 October 2017

Published in PACMPL Volume 1, Issue OOPSLA

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

18
Total Citations
View Citations
327
Total Downloads

Downloads (Last 12 months)86
Downloads (Last 6 weeks)21

Reflects downloads up to 23 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Yan ZZhang XDi P(2024)Scaling Abstraction Refinement for Program Analyses in Datalog using Graph Neural NetworksProceedings of the ACM on Programming Languages10.1145/36897658:OOPSLA2(1532-1560)Online publication date: 8-Oct-2024
https://dl.acm.org/doi/10.1145/3689765
Du YYu ZChandra SBlincoe KTonella P(2023)Pre-training Code Representation with Semantic Flow Graph for Effective Bug LocalizationProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering10.1145/3611643.3616338(579-591)Online publication date: 30-Nov-2023
https://dl.acm.org/doi/10.1145/3611643.3616338
Ko YOh HGrundy JPollock LPenta M(2023)Learning to Boost Disjunctive Static Bug-FindersProceedings of the 45th International Conference on Software Engineering10.1109/ICSE48619.2023.00099(1097-1109)Online publication date: 14-May-2023
https://dl.acm.org/doi/10.1109/ICSE48619.2023.00099
Kim HRaghothaman MHeo KDwyer MDamian DZeller A(2022)Learning probabilistic models for static analysis alarmsProceedings of the 44th International Conference on Software Engineering10.1145/3510003.3510098(1282-1293)Online publication date: 21-May-2022
https://dl.acm.org/doi/10.1145/3510003.3510098
Park JLee HRyu S(2021)A Survey of Parametric Static AnalysisACM Computing Surveys10.1145/346445754:7(1-37)Online publication date: 18-Jul-2021
https://dl.acm.org/doi/10.1145/3464457
Gharachorlu GSumner N(2021)Leveraging Models to Reduce Test Cases in Software Repositories2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)10.1109/MSR52588.2021.00035(230-241)Online publication date: May-2021
https://doi.org/10.1109/MSR52588.2021.00035
Raychev V(2021)Learning to Find Bugs and Code Quality Problems - What Worked and What not?2021 International Conference on Code Quality (ICCQ)10.1109/ICCQ51190.2021.9392977(1-5)Online publication date: 27-Mar-2021
https://doi.org/10.1109/ICCQ51190.2021.9392977
Jeon DJeon MOh H(2021)A practical algorithm for learning disjunctive abstraction heuristics in static program analysisInformation and Software Technology10.1016/j.infsof.2021.106564135(106564)Online publication date: Jul-2021
https://doi.org/10.1016/j.infsof.2021.106564
Xu RHe FWang BDevanbu PCohen MZimmermann T(2020)Interval counterexamples for loop invariant learningProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering10.1145/3368089.3409752(111-122)Online publication date: 8-Nov-2020
https://dl.acm.org/doi/10.1145/3368089.3409752
Romanov VIvanov VSucci G(2020)Approaches for Representing Software as Graphs for Machine Learning Applications2020 International Computer Symposium (ICS)10.1109/ICS51289.2020.00109(529-534)Online publication date: Dec-2020
https://doi.org/10.1109/ICS51289.2020.00109
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents