arXiv:1911.04687v1 [cs.SE] 12 Nov 2019

Monte Carlo Program Analysis

Marcel Böhme
Monash University, marcel.boehme@acm.org

ABSTRACT
Static program analysis today takes an analytical approach which is quite suitable for a well-scoped system. Data- and control-flow is taken into account. Special cases such as pointers, procedures, and undefined behavior must be handled. A program is analyzed precisely on the statement level. However, the analytical approach is ill-equipped to handle implementations of the complex, large-scale, heterogeneous software systems we see in the real world. Existing static analysis techniques that scale trade correctness (i.e., soundness or completeness) for scalability and build on strong assumptions (e.g., language-specificity). Scalable static analyses are well-known to report errors that do not exist (false positives) or to fail to report errors that do exist (false negatives). Then, how do we know the degree to which the analysis outcome is correct?

In this paper, we propose an approach to scale-oblivious greybox program analysis with bounded error which applies efficient approximation schemes (FPRAS) from the foundations of machine learning: PAC learnability. Given two parameters δ and ϵ, with probability at least (1 − δ), our Monte Carlo Program Analysis (MCPA) approach produces an outcome that has an average error of at most ϵ. The parameters δ > 0 and ϵ > 0 can be chosen arbitrarily close to zero (0) such that the program analysis outcome is said to be probably-approximately correct (PAC). We demonstrate the pertinent concepts of MCPA using three applications: (ϵ,δ)-approximate quantitative analysis, (ϵ,δ)-approximate software verification, and (ϵ,δ)-approximate patch verification.

Figure 1: Daily number of executions of large software systems in 2018, and the lower (Ro3) and upper bounds (HFD) on the accuracy ϵ of an estimate µ̂ of the probability µ that a binary program property φ holds (e.g., bug is found). Specifically, µ ∈ [µ̂ − ϵ, µ̂ + ϵ] with probability δ = 0.01 (i.e., 99%-CIs).

System    | User-Generated Executions Per Day   | Accuracy ϵ (HFD) | Accuracy ϵ (Ro3)
OSS-Fuzz  | 200.0 billion test inputs* [46]     | 3.6 · 10⁻⁶       | 2.3 · 10⁻¹¹
Netflix   | 86.4 billion API requests [50]      | 5.5 · 10⁻⁶       | 5.3 · 10⁻¹¹
Youtube   | 6.2 billion videos watched [47]     | 2.1 · 10⁻⁵       | 7.4 · 10⁻¹⁰
Google    | 5.6 billion searches [47]           | 2.1 · 10⁻⁵       | 8.2 · 10⁻¹⁰
Tinder    | 2.0 billion swipes [51]             | 3.6 · 10⁻⁵       | 2.3 · 10⁻⁹
Facebook  | 1.5 billion user logins [48]        | 4.2 · 10⁻⁵       | 3.1 · 10⁻⁹
Twitter   | 681.7 million tweets posted [47]    | 6.2 · 10⁻⁵       | 6.8 · 10⁻⁹
Skype     | 253.8 million calls [47]            | 1.0 · 10⁻⁴       | 1.8 · 10⁻⁸
Visa      | 150.0 million transactions [45]     | 1.3 · 10⁻⁴       | 3.1 · 10⁻⁸
Instagram | 71.1 million photos uploaded [47]   | 1.9 · 10⁻⁴       | 6.5 · 10⁻⁸
(*) per widely-used, security-critical library, on average; 10 trillion in total.

1 INTRODUCTION
In 2018, Harman and O'Hearn launched an exciting new research agenda: the innovation of frictionless¹ program analysis techniques that thrive on industrial-scale software systems [21]. Much progress has been made. Tools like Sapienz, Infer, and Error Prone are routinely used at Facebook and Google [1, 9, 37]. Yet, many challenges remain. For instance, the developers of Infer set the clear expectation that the tool may report many false alarms, does not handle certain language features, and can only report certain types of bugs.² Such tools often trade soundness or completeness for scalability. Then, just how sound or complete is an analysis which ignores, e.g., expensive pointer analysis, reflection in Java, undefined behaviors in C, or third-party libraries for which code is unavailable?

We believe that static program analysis today resembles the analytical approach in the natural sciences. However, in the natural sciences—when analytical solutions are no longer tractable—other approaches are used. Monte Carlo methods are often the only means to solve very complex systems of equations ab initio [19]. For instance, in quantum mechanics the following multi-dimensional integral gives the evolution of an atom that is driven by laser light, undergoes "quantum jumps" in a time interval (0, t), and emits exactly n photons at times t_n ≥ . . . ≥ t_1:

    ∫₀ᵗ dt_n ∫₀^{t_n} dt_{n−1} . . . ∫₀^{t₂} dt_1 p_{[0,t)}(t_1, . . . , t_n)

where p_{[0,t)} is the elementary probability density function [14]. Solving this equation is tractable only using Monte Carlo (MC) integration [33]. MC integration solves an integral F = ∫₀^{t_i} dt_i f(x) by "executing" f(x) on random inputs x ∈ Rⁿ. Let X_j ∈ [0, t_i] be the j-th of N samples from the uniform distribution; then F is estimated as ⟨F_N⟩ = (t_i/N) Σ_{j=1}^{N} f(X_j). MC integration guarantees that this estimate converges to the true value F at a rate of 1/√N. This is true no matter how many variables t_i the integral has or how the function f : Rⁿ → R is "implemented".

We argue that frictionless program analysis at the large, industrial scale requires a fundamental change of perspective. Rather than starting with a precise program analysis and attempting to carefully trade some of this precision for scale, we advocate an inherently scale-oblivious approach. In this paper, we cast scale-oblivious program analysis as a probably-approximately correct (PAC)-learning problem [25], provide the probabilistic framework, and develop several fully polynomial-time randomized approximation schemes (FPRAS). Valiant [43] introduced the PAC framework to study the computational complexity of machine learning techniques. The objective of the learner is to receive a set of samples and generate a hypothesis that with high probability has a low generalization error. We say the hypothesis is probably, approximately correct, where both adverbs are formalized and quantified.

¹ Friction is a technique-specific resistance to adoption, such as developers' reluctance to adopt tools with high false positive rates.
² https://fbinfer.com/docs/limitations.html

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Technical Report @ Arxiv, Submitted: 12. Nov. 2019, Melbourne Australia
© 2019 Copyright held by the owner/author(s).
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM.
https://doi.org/10.1145/nnnnnnn.nnnnnnn
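The MC integration scheme above fits in a few lines of Python (the language the paper itself suggests for MCPA). The integrand and interval below are our own illustrative choices, not from the paper:

```python
import math
import random

def mc_integrate(f, lo, hi, n, seed=0):
    """Estimate F = integral of f over [lo, hi] as <F_N> = (hi - lo)/N * sum_j f(X_j),
    where the X_j are uniform samples; the error shrinks at a rate of 1/sqrt(N),
    no matter how f is "implemented"."""
    rng = random.Random(seed)
    total = sum(f(rng.uniform(lo, hi)) for _ in range(n))
    return (hi - lo) * total / n

# Illustrative integrand (our choice): the integral of sin over [0, pi] is exactly 2.
estimate = mc_integrate(math.sin, 0.0, math.pi, 100_000)
```

With N = 10⁵ samples the estimate typically lands within a few thousandths of the true value, consistent with the 1/√N convergence rate quoted above.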
We call our technique Monte Carlo program analysis (MCPA). The learner is the program analysis. The samples (more formally Monte Carlo trials) are distinct program executions that are potentially generated by different users of the software system under normal workload. The hypothesis is the analysis outcome. Given two parameters δ and ϵ, an MCPA produces an outcome that with probability at least (1 − δ) has an average error of at most ϵ. It is important to note that the analysis outcome is probably-approximately correct w.r.t. the distribution of Monte Carlo trials. For instance, if the executions are generated by real users, the analysis outcome is probably-approximately correct w.r.t. the software system as it is used by real users.

In contrast to existing program analysis techniques, MCPA
• allows us to reason about the entire software system under normal workload, i.e., an MCPA run while the system is used normally produces an outcome w.r.t. normal system use,
• requires only greybox access to the software system, i.e., no source code is analyzed; instead, lightweight program instrumentation allows us to monitor software properties of interest,
• provides probabilistic guarantees on the correctness and accuracy of the analysis outcome, i.e., exactly quantifies the scalability-correctness tradeoff,
• can be implemented in a few lines of Python code, and
• is massively parallelizable, scale-oblivious, and can be interrupted at any time.

MCPA is a general approach that can produce outcomes with bounded error for arbitrary program analyses. This is because the PAC framework covers all of machine learning, which can learn arbitrary concepts as well. Shalev-Shwartz and Ben-David [41] provide an excellent overview of the theory of machine learning.

However, in this paper we focus on (ϵ,δ)-approximations of the probability that a binary property holds for an arbitrary execution of a terminating program. This has important applications, e.g., in assessing reliability [5, 6, 16], quantifying information leaks [34], or exposing side-channels [3]. Given parameters δ > 0 and ϵ > 0, our MCPA guarantees that the reported estimate µ̂ is no more than ϵ away from the true value µ with probability at least 1 − δ, i.e., P(|µ̂ − µ| ≥ ϵ) ≤ δ. In statistical terms, MCPA guarantees the (1 − δ)-confidence interval µ̂ ± ϵ. Note that the confidence and accuracy parameters δ and ϵ can be chosen arbitrarily close to zero (0) to minimize the probability of false negatives.

To illustrate the power of MCPA, we show how often well-known, industrial-scale software systems are executed per day (Fig. 1). For instance, Netflix handles 86.4 billion API requests (8.64 · 10¹⁰) per day. MCPA guarantees that the following statements are true with probability greater than 99% w.r.t. the sampled distribution:
• No matter which binary property we check or how many properties we check simultaneously, we can guarantee a priori that the error ϵ of our estimate(s) µ̂ of µ is bounded between 5.3 · 10⁻¹¹ ≤ ϵ ≤ 5.5 · 10⁻⁶ such that µ ∈ µ̂ ± ϵ.
• If 0 out of 86.4 billion API requests expose a security flaw, then the true probability that there exists a security flaw that has not been exposed is less than 5.3 · 10⁻¹¹ (Sec. 4).
• If 1000 of 86.4 billion API requests expose a vulnerability, then the true probability µ is probabilistically guaranteed to be within µ ∈ 1.16 · 10⁻⁸ ± 1.45 · 10⁻⁹ (Sec. 5).
• If a patch was submitted, it would require at most 400 million API requests (equiv. 7 minutes) to state with confidence (p-value < 10⁻³) that the failure probability has indeed reduced, if none of those 400M executions exposes the bug after the patch is applied. MCPA provides an a-priori conditional probabilistic guarantee that the one-sided null hypothesis can be rejected at significance level α = 0.01 (Sec. 6).
• For an arbitrary program analysis, let H be a finite hypothesis class containing all possible program analysis outcomes. If MCPA chooses the outcome h ∈ H which explains the analysis target (e.g., points-to analysis) for all 5.6 billion daily Google searches, then h explains the analysis targets with a maximum error of ϵ ≤ log(|H|/δ)/(5.6 · 10⁹) for "unseen" searches [41].

On a high level, MCPA combines the advantages of static and dynamic program analysis while mitigating their individual challenges. Like a static program analysis, MCPA allows us to make statements over all executions of a program. However, MCPA does not take an analytical approach. A static analysis evaluates statements in the program source code, which requires strong assumptions about the programming language and its features. This poses substantial challenges for static program analysis. For instance, alias analysis is undecidable [35]. Like a dynamic program analysis, MCPA overcomes these challenges by analyzing the execution of the software system. However, MCPA does not analyze only one execution. Instead, MCPA allows us to make statements over all executions generated from a given (operational) distribution.

The contributions of this article are as follows:
• We discuss the opportunities of employing Monte Carlo methods as an approximate, greybox approach to large-scale quantitative program analysis with bounded error.
• We develop several fully-polynomial randomized approximation schemes (FPRAS) that guarantee that the produced estimate µ̂ of the probability µ that a program property holds is no more than ϵ away from µ with probability at least (1 − δ). We tackle an open problem in automated program repair: When is it appropriate to claim with some certainty that the bug is indeed repaired?
• We evaluate the efficiency of our MCPA algorithms probabilistically, by providing upper and lower bounds for all 0 ≤ µ ≤ 1, and empirically, in more than 10¹⁸ simulation experiments.

2 MONTE CARLO PROGRAM ANALYSIS
Monte Carlo program analysis (MCPA) allows us to derive statements about properties of a software system that are probably approximately correct [42]. Given a confidence parameter 0 < δ < 1 and an accuracy parameter ϵ > 0, with probability at least (1 − δ), MCPA produces an analysis outcome that has an average error of at most ϵ.

Problem statement. In the remainder of this paper, we focus on a special case of MCPA, i.e., assessing the probability µ that a binary property φ holds (for an arbitrary number of such properties). The probability that φ does not hold (i.e., that ¬φ holds) is simply (1 − µ). Suppose that, during the analysis phase, it was observed that φ holds for the proportion µ̂ of sample executions. We call these sample executions in the analysis phase Monte Carlo trials. The analysis phase can be conducted during testing or during normal usage of the software system. The analysis outcome of an MCPA is expected to hold w.r.t. the concerned executions, e.g., further executions of
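The a-priori accuracy bounds in Figure 1 can be recomputed directly. Assuming HFD denotes the two-sided Hoeffding bound ϵ = √(log(2/δ)/(2n)) (cf. Eq. (16)) and Ro3 the generalized rule-of-three bound ϵ = 1 − δ^{1/n} (cf. Eq. (4)), the following sketch reproduces the Netflix row for δ = 0.01:

```python
import math

def eps_hoeffding(n, delta=0.01):
    # Two-sided Hoeffding accuracy: P(|mu_hat - mu| >= eps) <= delta, for any mu.
    return math.sqrt(math.log(2 / delta) / (2 * n))

def eps_rule_of_three(n, delta=0.01):
    # Generalized rule-of-three: if 0 of n trials expose the property,
    # then mu < 1 - delta**(1/n) with probability at least 1 - delta.
    return 1 - delta ** (1 / n)

n_netflix = int(86.4e9)              # 86.4 billion API requests per day (Fig. 1)
hfd = eps_hoeffding(n_netflix)       # upper bound on the accuracy, ~5.5e-6
ro3 = eps_rule_of_three(n_netflix)   # lower bound on the accuracy, ~5.3e-11
```

The two computed values match the Netflix row of Figure 1 (5.5 · 10⁻⁶ and 5.3 · 10⁻¹¹).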
the system during normal usage. Our objective is to guarantee for the concerned executions that the reported estimate µ̂ is no more than ϵ away from the true value µ with probability at least 1 − δ, i.e., P(|µ̂ − µ| ≥ ϵ) ≤ δ.

Assumptions. We adopt the classical assumptions from software testing: We assume that (i) all executions terminate, (ii) property violation is observable (e.g., an uncaught exception), and (iii) the Monte Carlo trials are representative of the concerned executions. We elaborate each assumption and its consequences in the following sections. We make no other assumptions about the program P or the property φ. For instance, P could be deterministic, probabilistic, distributed, or concurrent; written in C, Java, or Haskell; a trained neural network; or a plain old program. We could define φ to hold if an execution yields correct behavior, if a security property is satisfied, if a soft deadline is met, if an energy threshold is not exceeded, if the number of cache misses is sufficiently low, if no buffer overwrite occurs, if an assertion is not violated, et cetera.

2.1 Assumption 1: All Executions Terminate
We assume that all executions of the analyzed program P terminate. By terminating executions we mean independent units of behavior that have a beginning and an end. For instance, a user session runs from login to logout, a transaction runs from the initial handshake until the transaction is concluded, and a program method runs from the initial method call until the call is returned. Other examples are shown in Figure 1. This is a realistic assumption also made in software testing.

If the assumption does not hold. As we require all (sample and concerned) executions to terminate, MCPA cannot verify temporal properties, such as those expressed in linear temporal logic (LTL). If the reader wishes to verify temporal properties over non-terminating executions, we refer to probabilistic model checking [22, 24, 28] or statistical model checking [29, 39]. If the reader wishes to verify binary properties over non-terminating executions, we suggest using probabilistic symbolic execution [18] on partial path constraints of bounded length k.

2.2 Assumption 2: Property Outcomes Are Observable
We assume that the outcome of property φ ∈ {0, 1} can be automatically observed for each execution. Simply speaking, we cannot estimate the proportion µ̂ of concerned executions for which φ holds if we cannot observe whether φ holds for any Monte Carlo trial during the analysis phase. This is a realistic assumption that also exists in software testing.

Greybox access. Some properties φ can be observed externally, without additional code instrumentation. For instance, we can measure whether latency, performance, or energy thresholds are exceeded by measuring the time it takes to respond or to compute the final result, or how much energy is consumed. Other properties can be observed by injecting lightweight instrumentation directly into the program binary, e.g., using DynamoRIO or Intel Pin. Yet other properties can be made observable at compile time, causing very low runtime overhead. An example is AddressSanitizer [40], which is routinely compiled into security-critical program binaries to report (exploitable) memory-related errors, such as buffer overflows and use-after-frees.

2.3 Assumption 3: Property Outcomes Are Independent and Identically Distributed
The sequence of n MC trials is a stochastic process F = {X_m}_{m=1}^{n} where X_m ∈ F is a binomial random variable that is true if φ = 1 in the m-th trial. We assume that the property outcomes F are independent and identically distributed (IID). This is a classic assumption in testing, e.g., all test inputs are sampled independently from the same (operational) distribution [16]. More generally, throughout the analysis phase, the probability µ that φ holds is invariant; the binomial distribution over X_m is the same for all X_m ∈ F.

Testing IID. Several statistical tools are available to test whether F is IID. For instance, the turning point test is a statistical test of the independence of a series of random variables.

If the assumption does not hold. The assumption that executions are identically distributed does not hold for stateful programs, where the outcome of φ in one execution may depend on previous executions. In such cases, we suggest understanding each fine-grained execution as a transition from one state to another within a single non-terminating execution. The state transitions can then be modelled as a Markov chain and checked using tools such as probabilistic model checking [22, 24].

2.4 Assumption 4: Monte Carlo Trials Represent Concerned Executions
We assume that the Monte Carlo trials are representative of the concerned executions. In other words, the executions of the software system that were generated during the analysis phase are from the same distribution as the executions w.r.t. which the analysis outcome is expected to hold. This is a realistic assumption shared with software testing, and with any empirical analysis in general.

Realistic Behavior. A particular strength of MCPA compared to existing techniques is that the software system can be analyzed under normal workload to derive a probably-approximately correct analysis outcome that holds w.r.t. the software system as it is normally executed.

3 MOTIVATION: CHALLENGES OF THE ANALYTICAL APPROACH
The state-of-the-art enabling technology for quantitative program analysis is probabilistic symbolic execution (PSE), which combines symbolic execution and model counting. In a 2017 LPAR keynote [44], Visser called PSE the new hammer. Indeed, we find that PSE is an exciting new tool in the developer's repertoire for quantitative program analysis problems, particularly if an exact probability is required. However, the analytical approach of PSE introduces several challenges for the analysis of large-scale, heterogeneous software systems, which MCPA is able to overcome.

3.1 Probabilistic Symbolic Execution
Suppose property φ is satisfied for all program paths I. Conceptually, for each path i ∈ I, PSE computes the probability p_i that an input exercises i and then reports µ = Σ_{i∈I} p_i. To compute the probability p_i that an input exercises path i, PSE first uses symbolic execution to translate the source code that is executed along i into a Satisfiability Modulo Theory (SMT) formula, called the path condition. The path condition π(i) is satisfied by all inputs that exercise i. A model
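The turning point test mentioned in Section 2.3 can be sketched as follows. The moments below are the standard normal approximation for the number of turning points under independence; note that the test needs tie-free observations, so it would be applied to continuous per-trial measurements (e.g., latencies) rather than to the binary outcomes of φ themselves. This framing is our own illustration:

```python
import math

def turning_point_z(xs):
    """Z-score of the turning point test for independence of a series.
    A point is a turning point if it is strictly larger or strictly smaller
    than both neighbours. Under independence, the count T is approximately
    Normal with E[T] = 2(n-2)/3 and Var[T] = (16n-29)/90.
    A large |z| suggests the series is not independent."""
    n = len(xs)
    t = sum(1 for i in range(1, n - 1)
            if (xs[i] > xs[i - 1] and xs[i] > xs[i + 1])
            or (xs[i] < xs[i - 1] and xs[i] < xs[i + 1]))
    mean = 2 * (n - 2) / 3
    var = (16 * n - 29) / 90
    return (t - mean) / math.sqrt(var)

# A strictly increasing series has no turning points at all: strongly dependent.
trend_z = turning_point_z(list(range(100)))   # far below zero
```

A series of IID measurements would yield a z-score near zero, whereas trending or oscillating series produce large negative or positive scores, respectively.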
counting tool can then determine the number of inputs ⟦π(i)⟧ that exercise i. Given the size ⟦D⟧ of the entire input domain D, the probability p_i to exercise path i can be computed as p_i = ⟦π(i)⟧/⟦D⟧.

There are two approaches to establish the model count. Exact model counting [3, 7, 16, 18] determines the model count precisely. For instance, LattE [31] computes ⟦π(i)⟧ as the volume of a convex polytope which represents the path constraint. However, an exact count cannot be established efficiently and turns out to be intractable in practice [2, 11].³ Hence, recent research has focussed on PSE involving approximate model counting.

Approximate model counting trades accuracy for efficiency and determines ⟦π(i)⟧ approximately. Inputs are sampled from non-overlapping, axis-aligned bounding boxes that together contain all (but not only) solutions of π(i) [5, 6, 17, 32]. To determine whether a sampled input exercises i, it is plugged into π(i) and checked for satisfiability. The proportion of "hits", multiplied by the size of the bounding box and summed over all boxes, gives the approximate model count. In contrast to that of arbitrary polytopes, the size of an axis-aligned bounding box can be determined precisely and efficiently.

3.2 Challenges
Probabilistic symbolic execution uses an incremental encoding of the program's source code as a set of path conditions. Whether in static or dynamic symbolic execution, the semantic meaning of each individual program statement is translated into an equivalent statement in the supported Satisfiability Modulo Theory (SMT). The quantitative program analysis, whether exact or approximate, is then conducted upon those path conditions. This indirection introduces several limitations.
(1) Only programs that can be represented in the available SMT theory can be analyzed precisely (e.g., bounded linear integer arithmetic [18] or strings [7]). For approximate PSE, only bounded numerical input domains are allowed; otherwise, bounding boxes and their relative size cannot be derived. Programs that are built on 3rd-party libraries, are implemented in several languages, contain unbounded loops, involve complex data structures, or execute several threads simultaneously pose challenges for probabilistic symbolic execution.
(2) Only functional properties φ can be checked; those can be encoded within a path condition, e.g., as assertions over variable values. In contrast, non-functional properties, such as whether the response time or energy consumption exceeds some threshold value, are difficult to check.
(3) Only a lower bound on the probability µ that φ holds can be provided. PSE cannot efficiently enumerate all paths for which property φ holds. Reachability is a hard problem; otherwise, symbolic execution would be routinely used to prove programs correct. Hence, PSE computes µ as µ′ = Σ_{i∈I′} p_i only for a reasonable subset of paths I′ ⊆ I that satisfy φ [16], which is why the reported probability µ′ ≤ µ.
(4) No confidence intervals are available for approximate PSE. While approximate PSE trades accuracy (of the analysis outcome) for efficiency (of the analysis), it does not provide any means to assess this trade-off.⁴ Approximate PSE provides only a (negatively biased) point estimate µ̂ of µ. Due to the modular nature of the composition of the estimate µ̂, it is difficult to derive a reliable (1 − δ)-confidence interval [µ̂ − ϵ, µ̂ + ϵ].
(5) To provide an analysis outcome for programs under normal workload, a usage profile must be developed that can be integrated with the path condition. PSE requires formally encoding user behavior as a usage profile [16]. However, deriving the usage profile from a sample of typical executions can introduce uncertainty in the stochastic model that must be accounted for [30].

We believe that many of these limitations can be addressed. Yet, they do demonstrate a shortcoming of the analytical approach. A static program analysis must parse the source code and interpret each program statement on its own. It cannot rely on the compiler to inject meaning into each statement. The necessary assumptions limit the applicability of the analysis. The required machinery is a substantial performance bottleneck. For PSE, we failed to find reports of experiments on programs larger than a few hundred lines of code. Notwithstanding, we are excited about recent advances in scalable program analysis where correctness is carefully traded for scalability. However, how do we quantify the error of the analysis outcome? What is the probability of a false positive/negative?

3.3 Opportunities
Monte Carlo program analysis resolves many of these limitations.
(1) Every executable program can be analyzed. Apart from termination, we make no assumptions about the software system.
(2) Every binary property φ can be checked, including non-functional properties, as long as it can be automatically observed. Moreover, the number of required trials is independent of the number of simultaneously checked properties.
(3) The maximum likelihood estimate µ̂ for the binomial proportion µ is unbiased. Unlike PSE, MCPA does not require identifying or enumerating the program paths that satisfy φ.
(4) The error is bounded. The analysis results are probably approximately correct, i.e., given the parameters δ and ϵ, with probability (1 − δ), we have that µ ∈ [µ̂ − ϵ, µ̂ + ϵ].
(5) The best model of a system is the running system itself. The system can be analyzed directly during normal execution, with analysis results modulo the concerned executions.

It is interesting to note that approximate probabilistic symbolic execution employs Monte Carlo techniques upon each path condition, effectively simulating the execution of the program. Our key observation is that the compiler already imbues each statement with meaning, and program binaries can be executed concretely many orders of magnitude faster than they can be simulated.

³ Gulwani et al. [38] observe that the "exact volume determination for real polyhedra is quite expensive, and often runs out of time/memory on the constraints that are obtained from our benchmarks". Time is exponential in the number of input variables. While a simple constraint involving four variables is quantified in less than 3 minutes in some experiments, a similarly simple constraint involving eight variables is quantified in just under 24 hours [31].
⁴ We note that the sample variance σ̂² is not an indicator of estimator accuracy. Suppose µ = 0.5, such that the (true) variance σ² = 0.25/n is maximal. Due to randomness, we may take n samples where φ does not hold. Our estimate µ̂ = 0 is clearly inaccurate, yet the sample variance σ̂² = 0 gives no indication of this inaccuracy.
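The bounding-box scheme of Section 3.1 can be illustrated on a toy path condition. The constraint and box below are our own example, not from the paper: inputs are sampled uniformly from an axis-aligned box containing all solutions of π(i), and the hit proportion is scaled by the box size.

```python
import random

def approx_model_count(holds, box, n_samples, seed=0):
    """Approximate the number of integer models of a constraint.
    `box` is a list of inclusive (lo, hi) ranges forming an axis-aligned
    bounding box that contains all solutions; the proportion of sampled
    points satisfying the constraint, times the box size, estimates the count."""
    rng = random.Random(seed)
    box_size = 1
    for lo, hi in box:
        box_size *= hi - lo + 1
    hits = sum(1 for _ in range(n_samples)
               if holds([rng.randint(lo, hi) for lo, hi in box]))
    return box_size * hits / n_samples

# Toy "path condition" (our example): x + y < 100 with 0 <= x, y <= 99.
pc = lambda v: v[0] + v[1] < 100
estimate = approx_model_count(pc, [(0, 99), (0, 99)], 50_000)
exact = 100 * 101 // 2   # 5050 integer solutions, countable by hand here
```

For a toy constraint the exact count is available for comparison; for real path conditions over arbitrary polytopes, only the sampling estimate remains tractable, which is precisely the trade-off described above.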
4: for φ_i ∈ Φ do
5:   if φ_i does not hold in the current execution
6:   then return "Property φ_i violated for current execution of P"

[Figure: plots of the accuracy ϵ against the confidence δ; computed by Wolfram|Alpha.]

Algorithm 2 (ϵ,δ)-Approximate Quantitative Program Analysis (Fully-Polynomial Randomized Approximation Scheme)
Input: Program P, Binary property φ
Input: Accuracy ϵ, Confidence δ
1: Let n = log(2/δ)/(2ϵ²)
2: Let m = 1
3: repeat
4:   Execute P, observe the outcome of φ, and increment m

    CI_CP = [p_L, p_U]    (8)

where µ̂′ is the corrected estimate

    µ̂′ = ( µ̂ + z²_{δ/2}/(2n) ) / ( 1 + z²_{δ/2}/n )    (12)

and z_{δ/2} times the corrected standard deviation gives the radius

    ϵ′ = z_{δ/2} · √( µ̂(1 − µ̂)/n + z²_{δ/2}/(4n²) ) / ( 1 + z²_{δ/2}/n )    (13)

⁶ A probabilistic method starts with the underlying process and makes predictions about the observations the process generates. In contrast, a statistical analysis starts with the observations and attempts to derive the parameters of the underlying random process that explain the observations. In other words, a statistical method requires observations to compute the estimates, while a probabilistic method provides a-priori bounds on such estimates (e.g., based on Ro3 or HFD).
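The core of Algorithm 2 fits in a few lines of Python, as the introduction promises. The `trial` stand-in below (a biased coin) is our own placeholder for a real greybox Monte Carlo trial, and the Clopper-Pearson early exit is omitted for brevity; only the Hoeffding trial bound from Line 1 is used:

```python
import math
import random

def mcpa_estimate(trial, eps, delta, seed=0):
    """(eps, delta)-approximation of mu = P(phi holds): by Hoeffding's
    inequality, n = ceil(log(2/delta) / (2 * eps^2)) trials suffice so that
    P(|mu_hat - mu| >= eps) <= delta, for any mu and any terminating program."""
    n = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if trial(rng))
    return hits / n, n

# Placeholder trial (our stand-in for executing P): phi holds w.p. 0.3.
mu_hat, n_trials = mcpa_estimate(lambda rng: rng.random() < 0.3,
                                 eps=0.01, delta=0.01)
```

For ϵ = δ = 0.01, the bound prescribes 26,492 trials; note that the trial count is independent of the program being analyzed, which is what makes the scheme scale-oblivious.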
Algorithm 3 (ϵ,δ)-Approximate Quantitative Program Analysis (Reduced number of Clopper-Pearson CI calculations)
Input: Program P, Binary property φ
Input: Accuracy ϵ, Confidence δ
1: Let n = n_next = log(δ)/log(1 − ϵ)
2: for each of n_next Monte Carlo trials do
3:   Execute P and observe the outcome of φ
4: end for
5: Let µ̂ = X/n where X is the total frequency with which φ has held
6: if 0 < µ̂ < 1 then
7:   repeat
8:     Let n_next = ⌈( z²p₀q₀ + z·√(z²p₀²q₀² + 2ϵp₀q₀) + ϵ ) / (2ϵ²) ⌉
         where p₀ = µ̂, q₀ = 1 − p₀, and z = z_{δ/2}
9:     for each of n_next Monte Carlo trials do
10:      Execute P and observe the outcome of φ
11:    end for
12:    Let n = n + n_next
13:    Let ϵ′ be the radius of the current CP interval (cf. Eq. (9))
14:    Let µ̂ = X/n where X is the total frequency with which φ has held
15:  until ϵ′ ≤ ϵ
16: end if
return "We estimate the prob. that φ holds in P is µ̂ and guarantee that the true prob. µ ∈ µ̂ ± ϵ with prob. at least (1 − δ)."

Evaluation. We experimentally investigate properties of the three confidence intervals for binomial proportions in Section 7. We find that only the Clopper-Pearson confidence interval guarantees that µ ∈ CI with probability at least (1 − δ) for all µ: 0 ≤ µ ≤ 1. The other two intervals cannot be used for (ϵ,δ)-approximate quantitative analysis.

5.3 Quantitative Monte Carlo Program Analysis
Algorithm 2 shows the procedure of the MCPA. Hoeffding's inequality provides the upper bound on the number of Monte Carlo trials required for an (ϵ,δ)-approximation of the probability µ that property φ holds in program P (Eq. (6)). The conservative Clopper-Pearson confidence interval provides an early exit condition (Line 6).

Efficiency. Algorithm 2 runs in time that is polynomial in 1/ϵ and in log(1/δ) and is thus a fully-polynomial randomized approximation scheme (FPRAS). The worst-case running time is visualized in Figure 3. Reducing δ (i.e., increasing confidence) by an order of magnitude less than doubles the execution time, while reducing ϵ (i.e., increasing accuracy) by an order of magnitude increases the execution time by two orders of magnitude. However, computing the Clopper-Pearson (CP) CI after each Monte Carlo trial is expensive. In our experiments, computing the CP interval 10⁵ times takes 58 seconds, on average. Computing the CP interval for each of the n = 2 · 10¹¹ test inputs that OSS-Fuzz generates on a good day (cf. Figure 1) would take about 3.8 years.

Optimization. Hence, Algorithm 3 predicts the number of Monte Carlo trials needed for a Clopper-Pearson interval with radius ϵ. The estimator in Line 8 was developed by Krishnamoorthy and Peng [27] and requires an initial guess p₀ of µ. The algorithm computes the first guess p₀ = µ̂ after running the minimal number of trials required for the given confidence and accuracy parameters (Lines 1–6), as provided by the generalized rule-of-three (Line 1).

6 APPROXIMATE PATCH VERIFICATION
"It turns out that detecting whether a crash is fixed or not is an interesting challenge, and one that would benefit from further scientific investigation by the research community. [..] How long should we wait, while continually observing no re-occurrence of a failure (in testing or production) before we claim that the root cause(s) have been fixed?" [1]

Let µ_bug and µ_fix be the probability to expose an error before and after the bug was fixed, respectively. Suppose we have n_bug executions of the buggy program P_bug out of which X_bug = Bin(µ_bug, n_bug) exposed the bug. Hence, an unbiased estimator of µ_bug is µ̂_bug = X_bug/n_bug. We call an execution successful if no bug (of interest) was observed. Given a confidence parameter δ: 0 < δ < 1, we ask how many successful executions n_fix of the fixed program P_fix (i.e., 0 = Bin(µ_fix, n_fix) = X_fix) are needed to reject the null hypothesis with statistical significance α = δ.

To reject the null hypothesis at significance level δ, we require that there is no overlap between both (1 − δ)-confidence intervals. For the buggy program version, we can compute the Clopper-Pearson interval CI_CP = [p_L, p_U] (cf. Sec. 5.2). Recall that the probability that a property φ holds is simply the complement of the probability that ¬φ holds. For the fixed program version, we leverage the generalized rule-of-three CI_Ro3 = [0, 1 − δ^{1/n_fix}] (cf. Eq. (4)). In order to reject the null, we seek n_fix such that

    1 − δ^{1/n_fix} < p_L    (14)

which is true for

    n_fix > log(δ) / log(1 − p_L)    (15)

where p_L is the lower limit of the Clopper-Pearson CI (cf. Eq. (8)).

Algorithm 4 Approximate Patch Verification
Input: Program P_fix, Binary property φ, Confidence δ
Input: Total trials n_bug and unsuccessful trials X_bug in P_bug
1: Let n_fix = log(δ)/log(1 − p_L), where p_L is the lower limit of the Clopper-Pearson interval (cf. Eq. (8))
2: for each of n_fix Monte Carlo trials do
3:   Execute program P_fix and observe the outcome of φ
4:   if φ does not hold in the current execution
5:   then return "Property φ violated for current execution of P_fix"
6: end for
return "The null hypothesis can be rejected at significance level at least δ, to accept the alternative that the failure rate has decreased."

Efficiency. The CP limit p_L is upper- and lower-bounded by the generalized rule-of-three and Hoeffding's inequality, respectively,

    µ̂_bug − √( log(2/δ)/(2n_bug) ) ≤ p_L ≤ µ̂_bug − (1 − δ^{1/n_bug})    (16)

for all 0 ≤ µ_bug ≤ 1. Hence, Algorithm 4 is an FPRAS that runs in time that is polynomial in log(1/(1 − µ̂_bug))⁻¹ and in log(1/δ). The worst-case efficiency of Algorithm 4 is visualized in Figure 4. For instance, if 800k of 200 billion executions expose a bug in P_bug, then it requires less than 1.8k executions to reject the null at significance
Subsequent guesses are computed from the improved maximum level p < 0.001 in favor of the alternative that the probability of
likelihood estimate (Line 14). exposing a bug has indeed reduced for Pfix .
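A minimal sketch (in Python) of the computation behind Eqs. (14)–(15): the counts (40 bug-exposing runs out of 10,000) are hypothetical, and the pure-Python bisection for the Clopper-Pearson lower limit stands in for a statistics library.

```python
import math

def binom_sf(x, n, p):
    """P(X >= x) for X ~ Bin(n, p), summed in log space via lgamma for stability."""
    if p <= 0.0:
        return 0.0 if x > 0 else 1.0
    if p >= 1.0:
        return 1.0
    log_nfact = math.lgamma(n + 1)
    total = 0.0
    for k in range(x, n + 1):
        log_term = (log_nfact - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log1p(-p))
        total += math.exp(log_term)
        if k > n * p and log_term < -30.0:  # remaining terms are negligible
            break
    return min(total, 1.0)

def clopper_pearson_lower(x, n, delta):
    """Lower limit p_L of the two-sided (1 - delta) Clopper-Pearson interval,
    found by bisection: the p with P(Bin(n, p) >= x) = delta / 2."""
    if x == 0:
        return 0.0
    lo, hi = 0.0, x / n
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if binom_sf(x, n, mid) < delta / 2.0:
            lo = mid
        else:
            hi = mid
    return lo

def n_fix_required(p_l, delta):
    """Eq. (15): successful executions of the fixed program needed so that the
    rule-of-three interval [0, 1 - delta**(1/n)] lies entirely below p_L."""
    return math.ceil(math.log(delta) / math.log1p(-p_l))

# Hypothetical counts: 40 of 10,000 executions of the buggy version expose the bug.
p_l = clopper_pearson_lower(40, 10_000, 0.05)
n_fix = n_fix_required(p_l, 0.05)
print(f"p_L = {p_l:.5f}, required n_fix = {n_fix}")
```

With these illustrative counts, roughly a thousand consecutive bug-free executions of the fixed version suffice to separate the two intervals at δ = 0.05.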
Technical Report @ Arxiv, Submitted: 12. Nov. 2019, Melbourne Australia Marcel Böhme
[Figure 3 plot: number of Monte Carlo trials (10^2 to 10^6) as a function of δ for Algorithm 2, Algorithm 3, the upper bound (HfD), and the lower bound (Ro3), shown in three panels: Epsilon = 0.1, Epsilon = 0.01, and Epsilon = 0.001.]

Figure 4: The number n_fix of Monte Carlo trials required for the patched program to reject the null and conclude that the [..] Hoeffding's inequality (HfD) and the generalized rule-of-three (Ro3), respectively. (Plot: Wolfram Alpha LLC. 2019. https://www.wolframalpha.com/input/?i=plot+log(delta)%2Flog(1-mu%2Bsqrt(log(2%2Fdelta)%2F(2*4e6)))+where+mu=0.001..1, delta=0.001..1 (accessed June 23, 2019).)

[Figure plot: empirical coverage (80.0% to 100.0%) versus the nominal confidence level of 95%.]

Figure 6 shows the efficiency of both algorithms as µ varies, as well as the probabilistic bounds, for various values of ϵ and δ. Each experiment was repeated 1000 times and average values are reported.
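The probabilistic bounds mentioned above come from Hoeffding's inequality: since P(|µ̂ − µ| ≥ ϵ) ≤ 2·exp(−2nϵ²), any n ≥ log(2/δ)/(2ϵ²) trials suffice for an (ϵ, δ)-approximation. A small sketch (Python; the parameter grid is illustrative) shows how the required trials scale:

```python
import math

def hoeffding_trials(eps, delta):
    """Trials sufficient for an (eps, delta)-approximation of a probability µ:
    P(|estimate - µ| >= eps) <= 2*exp(-2*n*eps**2) <= delta
    implies n >= log(2/delta) / (2*eps**2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

for eps in (0.1, 0.01, 0.001):
    for delta in (0.05, 0.005):
        print(f"eps={eps:<6} delta={delta:<6} n={hoeffding_trials(eps, delta):,}")
```

Note how the bound reproduces the scaling claimed earlier: a tenfold smaller δ less than doubles n (the numerator grows by log 10), while a tenfold smaller ϵ multiplies n by 100.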
RQ3 (Experience with PSE). Geldenhuys et al. [18] developed what is now the state-of-the-art of exact quantitative program analysis, probabilistic symbolic execution (PSE). The authors implemented PSE in the JPF-ProbSymb tool and evaluated it using the two programs BinaryTree and BinomialHeap. For both programs, Figure 7 shows the exact branch probabilities µ as computed by PSE and the corresponding (ϵ, δ)-approximations as computed by Algorithm 3. To generate a Monte Carlo trial, we randomly sampled from the same domain that PSE is given.

As we can see, our Monte Carlo program analysis (MCPA) approximates the branch probabilities µ within the specified minimum accuracy. For δ = ϵ = 0.001, our estimates µ̂ are usually within ±0.0005 of the true values µ. We would like to highlight that approximate PSE [5, 6, 17, 32] does not provide any such guarantees on the accuracy of the produced estimates. We also observed that PSE is tractable only for very small integer domains (e.g., v ∈ [0, 9] [18] or v ∈ [0, 100] [10]) while our MCPA works with arbitrarily large domains.

[Figure 8 plots: number of required Monte Carlo trials n (up to 1 × 10^9) versus accuracy ε (10^−8 to 10^0), for the rule-of-three, Clopper-Pearson, and Hoeffding's inequality; two panels.]

Figure 8: Efficiency versus residual risk. Both plots show the number of required Monte Carlo trials n, where φ is observed to hold, to guarantee, with an error that exceeds ϵ with probability at most δ = 0.05, that φ always holds.

RQ4 (Residual Risk). For a given allowable residual risk ϵ, how efficient is Alg. 1 in providing the probabilistic guarantee that no bug exists when none has been observed? Figure 8 demonstrates the efficiency of approximate software verification. Alg. 1 exhibits a performance that is nearly inversely proportional to the given allowable residual risk. For instance, it requires about 3 · 10^5 trials in which no error is observed to guarantee that no error exists with an error that exceeds a residual risk of ϵ = 10^−5 with probability at most 5%. Decreasing ϵ by an order of magnitude to ϵ = 10^−6 also increases the number of required trials only by an order of magnitude, to 3 · 10^6.

Algorithm 1 is highly efficient. It exhibits a performance that is nearly inversely proportional to the given allowable residual risk. Moreover, the Clopper-Pearson interval CI_CP provides an upper limit that is very close to the probabilistic lower bound as given by the rule-of-three, confirming our observation in RQ2.

7.3 Approximate Patch Verification

[Figure 9 plot: number of required trials versus significance level, for µ_bug = 0.9, 0.99, and 0.999.]

[RQ5] [..] the null hypothesis at a desired p-value compared to existing hypothesis testing methodologies? In Figure 9, we compare our MCPA to Fisher's exact test and the Mann-Whitney U test. Fisher's exact test is the standard two-sample hypothesis test for binomial proportions when one of the estimates (here µ̂_fix) is zero or one.⁷ The Mann-Whitney U test is the standard test used in the A/B testing platform at Netflix [49, 50] to assess, e.g., performance degradation between versions. The U test is a non-parametric test of the null hypothesis that it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.

Algorithm 4 is more efficient than existing techniques for high levels of confidence (e.g., p-value = α ≤ 0.05) on this task of approximate patch verification. The Mann-Whitney test that Netflix uses for A/B testing performs worst on this task; the difference to our MCPA increases as α → 0. Fisher's exact test approaches the performance of our MCPA as α → 0.

⁷ We note that the popular Chi-squared test, while easier to compute, is an approximation method that cannot be used when one of the estimates µ̂_bug, µ̂_fix is 0 or 1.
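As a closing sanity check on the RQ4 numbers, the generalized rule-of-three can be evaluated directly: the (1 − δ) interval [0, 1 − δ^(1/n)] falls below the residual risk ϵ once n ≥ log(δ)/log(1 − ϵ) ≈ log(1/δ)/ϵ. A minimal sketch (Python; δ = 0.05 as in Figure 8):

```python
import math

def trials_for_residual_risk(eps, delta=0.05):
    """Successful trials n so that the generalized rule-of-three interval
    [0, 1 - delta**(1/n)] lies below the allowable residual risk eps."""
    return math.ceil(math.log(delta) / math.log1p(-eps))

n5 = trials_for_residual_risk(1e-5)  # about 3 * 10^5 trials
n6 = trials_for_residual_risk(1e-6)  # one order of magnitude more
print(n5, n6)
```

The computed counts match the text: roughly 3 · 10^5 error-free trials for ϵ = 10^−5, and tenfold more for ϵ = 10^−6.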
ACKNOWLEDGMENTS

We thank Prof. David Rosenblum for his inspiring ASE'16 keynote speech on probabilistic thinking [36]. This research was partially funded by the Australian Government through an Australian Research Council Discovery Early Career Researcher Award (DE190100046).

REFERENCES

[1] Nadia Alshahwan, Xinbo Gao, Mark Harman, Yue Jia, Ke Mao, Alexander Mols, Taijin Tei, and Ilya Zorin. 2018. Deploying Search Based Software Engineering with Sapienz at Facebook. In Search-Based Software Engineering. 3–45.
[2] Sanjeev Arora and Boaz Barak. 2009. Computational Complexity: A Modern Approach (1st ed.). Cambridge University Press, New York, NY, USA.
[3] Lucas Bang, Abdulbaki Aydin, Quoc-Sang Phan, Corina S. Păsăreanu, and Tevfik Bultan. 2016. String Analysis for Side Channels with Segmented Oracles. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). 193–204.
[4] Marcel Böhme. 2019. Assurance in Software Testing: A Roadmap. In Proceedings of the 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER '19). IEEE Press, Piscataway, NJ, USA, 5–8.
[5] Mateus Borges, Antonio Filieri, Marcelo D'Amorim, and Corina S. Păsăreanu. 2015. Iterative Distribution-aware Sampling for Probabilistic Symbolic Execution. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). 866–877.
[6] Mateus Borges, Antonio Filieri, Marcelo D'Amorim, Corina S. Păsăreanu, and Willem Visser. 2014. Compositional Solution Space Quantification for Probabilistic Software Analysis. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14). 123–132.
[7] Tegan Brennan, Nestan Tsiskaridze, Nicolás Rosner, Abdulbaki Aydin, and Tevfik Bultan. 2017. Constraint Normalization and Parameterized Caching for Quantitative Program Analysis. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). 535–546.
[8] Lawrence D. Brown, T. Tony Cai, and Anirban DasGupta. 2001. Interval Estimation for a Binomial Proportion. Statist. Sci. 16, 2 (05 2001), 101–133.
[9] Cristiano Calcagno, Dino Distefano, Peter W. O'Hearn, and Hongseok Yang. 2011. Compositional Shape Analysis by Means of Bi-Abduction. J. ACM 58, 6, Article 26 (Dec. 2011), 66 pages.
[10] Bihuan Chen, Yang Liu, and Wei Le. 2016. Generating Performance Distributions via Probabilistic Symbolic Execution. In Proceedings of the 38th International Conference on Software Engineering (ICSE '16). 49–60.
[11] Dmitry Chistikov, Rayna Dimitrova, and Rupak Majumdar. 2017. Approximate counting in SMT and value estimation for probabilistic programs. Acta Informatica 54, 8 (01 Dec 2017), 729–764.
[12] C. J. Clopper and E. S. Pearson. 1934. The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial. Biometrika 26, 4 (12 1934), 404–413.
[13] Sundar Dorai-Raj. 2015. R – Binom package. https://cran.r-project.org/web/packages/binom/binom.pdf. Accessed: 2019-06-17.
[14] R. Dum, P. Zoller, and H. Ritsch. 1992. Monte Carlo simulation of the atomic master equation for spontaneous emission. Physical Review A 45 (Apr 1992), 4879–4887. Issue 7.
[15] A. Filieri, C. S. Pasareanu, and G. Yang. 2015. Quantification of Software Changes through Probabilistic Symbolic Execution (N). In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE '15). 703–708.
[16] Antonio Filieri, Corina S. Păsăreanu, and Willem Visser. 2013. Reliability Analysis in Symbolic Pathfinder. In Proceedings of the 2013 International Conference on Software Engineering (ICSE '13). 622–631.
[17] Antonio Filieri, Corina S. Păsăreanu, Willem Visser, and Jaco Geldenhuys. 2014. Statistical Symbolic Execution with Informed Sampling. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014). 437–448.
[18] Jaco Geldenhuys, Matthew B. Dwyer, and Willem Visser. 2012. Probabilistic Symbolic Execution. In Proceedings of the 2012 International Symposium on Software Testing and Analysis (ISSTA 2012). 166–176.
[19] Brian L. Hammond, William A. Lester, and Peter James Reynolds. 1994. Monte Carlo methods in ab initio quantum chemistry. Vol. 1. World Scientific.
[20] J. A. Hanley and A. Lippman-Hand. 1983. If nothing goes wrong, is everything all right? Interpreting zero numerators. Journal of the American Medical Association 249, 13 (1983), 1743–1745.
[21] Mark Harman and Peter O'Hearn. 2018. From Start-ups to Scale-ups: Opportunities and Open Problems for Static and Dynamic Program Analysis. Keynote at the 18th IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM).
[22] T. Hérault, R. Lassaigne, F. Magniette, and S. Peyronnet. 2004. Approximate Probabilistic Model Checking. In Proceedings of the 5th International Conference on Verification, Model Checking and Abstract Interpretation (VMCAI '04). 307–329.
[23] Wassily Hoeffding. 1963. Probability Inequalities for Sums of Bounded Random Variables. J. Amer. Statist. Assoc. 58, 301 (1963), 13–30.
[24] Joost-Pieter Katoen. 2016. The Probabilistic Model Checking Landscape. In Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science (LICS '16). 31–45.
[25] Michael J. Kearns and Umesh V. Vazirani. 1994. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, USA.
[26] Ron Kohavi and Roger Longbotham. 2017. Online Controlled Experiments and A/B Testing. Springer US, 922–929.
[27] K. Krishnamoorthy and Jie Peng. 2007. Some Properties of the Exact and Score Methods for Binomial Proportion and Sample Size Calculation. Communications in Statistics - Simulation and Computation 36, 6 (2007), 1171–1186.
[28] Marta Kwiatkowska, Gethin Norman, and David Parker. 2011. PRISM 4.0: Verification of Probabilistic Real-time Systems. In Proceedings of the 23rd International Conference on Computer Aided Verification (CAV '11). 585–591.
[29] Axel Legay, Benoît Delahaye, and Saddek Bensalem. 2010. Statistical Model Checking: An Overview. In Proceedings of the First International Conference on Runtime Verification (RV '10). 122–135.
[30] Yamilet R. Serrano Llerena, Marcel Böhme, Marc Brünink, Guoxin Su, and David S. Rosenblum. 2018. Verifying the Long-Run Behavior of Probabilistic System Models in the Presence of Uncertainty. In Proceedings of the 12th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE). 1–11.
[31] Jesús A. De Loera, Raymond Hemmecke, Jeremiah Tauzer, and Ruriko Yoshida. 2004. Effective lattice point counting in rational convex polytopes. Journal of Symbolic Computation 38, 4 (2004), 1273–1302.
[32] Kasper Luckow, Corina S. Păsăreanu, Matthew B. Dwyer, Antonio Filieri, and Willem Visser. 2014. Exact and Approximate Probabilistic Symbolic Execution for Nondeterministic Programs. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering (ASE '14). 575–586.
[33] Klaus Mølmer, Yvan Castin, and Jean Dalibard. 1993. Monte Carlo wave-function method in quantum optics. Journal of the Optical Society of America B 10, 3 (Mar 1993), 524–538.
[34] Quoc-Sang Phan, Pasquale Malacaria, Corina S. Păsăreanu, and Marcelo D'Amorim. 2014. Quantifying Information Leaks Using Reliability Analysis. In Proceedings of the 2014 International SPIN Symposium on Model Checking of Software (SPIN 2014). 105–108.
[35] G. Ramalingam. 1994. The Undecidability of Aliasing. ACM Trans. Program. Lang. Syst. 16, 5 (Sept. 1994), 1467–1471.
[36] D. S. Rosenblum. 2016. The power of probabilistic thinking. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). 3–3.
[37] Caitlin Sadowski, Edward Aftandilian, Alex Eagle, Liam Miller-Cushon, and Ciera Jaspan. 2018. Lessons from Building Static Analysis Tools at Google. Commun. ACM 61, 4 (March 2018), 58–66.
[38] Sriram Sankaranarayanan, Aleksandar Chakarov, and Sumit Gulwani. 2013. Static Analysis for Probabilistic Programs: Inferring Whole Program Properties from Finitely Many Paths. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). 447–458.
[39] Koushik Sen, Mahesh Viswanathan, and Gul Agha. 2004. Statistical Model Checking of Black-Box Probabilistic Systems. In Computer Aided Verification. 202–215.
[40] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitry Vyukov. 2012. AddressSanitizer: A Fast Address Sanity Checker. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference (USENIX ATC '12). USENIX Association, Berkeley, CA, USA, 28–28.
[41] Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA.
[42] Leslie Valiant. 2013. Probably Approximately Correct: Nature's Algorithms for Learning and Prospering in a Complex World. Basic Books, Inc.
[43] L. G. Valiant. 1984. A Theory of the Learnable. Commun. ACM 27, 11 (Nov. 1984), 1134–1142.
[44] Willem Visser. 2017. Probabilistic Symbolic Execution: A New Hammer. https://easychair.org/smart-program/LPAR-21/LPAR-21-s3-Visser.pdf. Keynote at the 21st International Conference on Logic for Programming, Artificial Intelligence and Reasoning (LPAR-21).
[45] Website. 2010. Visa: Transactions per Day. https://usa.visa.com/run-your-business/small-business-tools/retail.html. Accessed: 2019-06-17.
[46] Website. 2017. OSS-Fuzz: Five Months Later. https://testing.googleblog.com/2017/05/oss-fuzz-five-months-later-and.html. Accessed: 2017-11-13.
[47] Website. 2018. Domo: Data Never Sleeps Report 6.0. https://www.domo.com/blog/data-never-sleeps-6/. Accessed: 2018-11-16.
[48] Website. 2018. Facebook: Company Info and Statistics. https://newsroom.fb.com/company-info/. Accessed: 2018-11-16.
[49] Website. 2018. Netflix Tech Blog: Automated Canary Analysis at Netflix with Kayenta. https://medium.com/netflix-techblog/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69. Accessed: 2018-10-13.
[50] Website. 2018. Netflix Tech Blog: Edge Load Balancing. https://medium.com/netflix-techblog/netflix-edge-load-balancing-695308b5548c. Accessed: 2018-11-16.