Coding Ockham's Razor

Ebook · 331 pages · 2 hours

About this ebook

This book explores inductive inference using the minimum message length (MML) principle, a Bayesian method which is a realisation of Ockham's Razor based on information theory. Accompanied by a library of software, the book can assist an applications programmer, student or researcher in the fields of data analysis and machine learning to write computer programs based upon this principle.

MML inference has been around for 50 years and yet only one highly technical book has been written about the subject. The majority of research in the field has been backed by specialised one-off programs but this book includes a library of general MML-based software, in Java. The Java source code is available under the GNU GPL open-source license. The software library is documented using Javadoc which produces extensive cross-referenced HTML manual pages. Every probability distribution and statistical model that is described in the book is implemented and documented in the software library. The library may contain a component that directly solves a reader's inference problem, or contain components that can be put together to solve the problem, or provide a standard interface under which a new component can be written to solve the problem.

This book will be of interest to application developers in the fields of machine learning and statistics as well as academics, postdocs, programmers and data scientists. It could also be used by third- or fourth-year undergraduate or postgraduate students.

Language: English
Publisher: Springer
Release date: May 4, 2018
ISBN: 9783319764337

    Book preview

    Coding Ockham's Razor - Lloyd Allison

    © Springer International Publishing AG, part of Springer Nature 2018

    Lloyd Allison, Coding Ockham's Razor, https://doi.org/10.1007/978-3-319-76433-7_1

    1. Introduction

    Lloyd Allison

    Faculty of Information Technology, Monash University, Melbourne, Victoria, Australia

    This book is about inductive inference using the minimum message length (MML) principle and a computer. It is accompanied by a library of software (Chap. 13) to help an applications programmer, student or researcher in the fields of data analysis or machine learning to write computer programs of this kind.

    Inductive inference is the business of learning something general from a given data-set. The thing that is learnt might be a parameter estimate, probability distribution, hypothesis, statistical model, explanation or theory, and the data-set is sometimes called examples, observations, evidence or training-data. Over-fitting is a well-known danger in which a complex model is liable to fit data better than a simpler model yet may not be justified—the improvement in fit may not be worth the added complexity of the model. MML [92, 93] relies on Bayes’s theorem [8] and Shannon’s information theory [76]. It quantifies a model’s complexity and allows it to be traded off against the model’s fit to data. It is a realisation of Ockham’s razor [78] (or Occam’s razor).

    Inductive,

    characterized by the inference of general laws from particular instances.

    Inference,

    a conclusion reached on the basis of evidence and reasoning.¹

    An inference is an answer to some question about a process that gives rise to data and that we are interested in. The question assumes some aspects of the process that are taken to be common knowledge, some aspects that are unknown but of no interest, and some aspects that are unknown but of interest and that are to be inferred. It specifies the form of an answer—what unknowns or parameters have to be filled in. It is important to be clear about exactly what question is being asked and about the form of an answer—what it includes and does not include.

    An inference can be something modest such as an estimate of a parameter of a probability distribution, or something grand such as a theory of fundamental particles in physics, say. People tend to use the terms estimate, distribution, hypothesis, model, explanation, and theory in rough order of increasing grandeur although in reality there are no hard divisions between them. Regardless, an inference may well be wrong in that it is possible for it to be invalidated by future data, although it may be good enough to be useful until that day [13].

    Q:

    What is the difference between a hypothesis and a theory?

    A:

    Think of a hypothesis as a card. A theory is a house made of hypotheses.²

    For hypotheses H, H₀, H₁, … and data D, common terminology (Fig. 1.1) includes prior and posterior, which refer to before and after seeing the data D. If X is a random variable that may take one of the values x₁, x₂, …, we often write pr(xᵢ) for pr(X = xᵢ).

    [Fig. 1.1 Standard definitions]

    Bayes’s theorem [8], in a form [52, 81] that is relevant here, states that the probability of hypothesis H and data D equals the probability of H times the probability of D given H:

    $$ {\mathrm{pr}}(H \,\&\, D) = {\mathrm{pr}}(H) \cdot {\mathrm{pr}}(D|H) = {\mathrm{pr}}(D) \cdot {\mathrm{pr}}(H|D) \tag{1.1} $$

    pr(H) is given by a prior probability distribution over hypotheses and represents general ideas about what might typically be inferred from real data before the data has been sighted. pr(H|D) is given by a posterior distribution over hypotheses and is to be calculated. pr(D) and pr(D|H) are given by distributions over data; the former is often not known in practice and the latter follows from the specification of a model. It follows from Bayes that the posterior odds ratio of two hypotheses H₁ and H₂ is

    $$ \frac{{\mathrm{pr}}(H_1|D)}{{\mathrm{pr}}(H_2|D)} = \frac{{\mathrm{pr}}(H_1) \cdot {\mathrm{pr}}(D|H_1)}{{\mathrm{pr}}(H_2) \cdot {\mathrm{pr}}(D|H_2)} $$.

    Shannon’s mathematical theory of communication [76] tells us that the length of a message transmitting event E using an optimum code is

    $$ I(E) = - \log({\mathrm{pr}}(E)) \tag{1.2} $$

    The letter I stands for information. If the log is the natural log, to base e, the units of I(E) are nits, also known as nats. If the log is to base 2 the units are bits. Incidentally, arithmetic coding algorithms [64, 70] can get arbitrarily close to I(E), even in the case of fractional bits! Sometimes I(.) is used to stand for information content and sometimes msg(.) to emphasise message length; they are synonyms,

    $$ msg(E) = I(E) = - \log({\mathrm{pr}}(E)) \tag{1.3} $$
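
    As a concrete illustration of (1.2) and (1.3), the following short Java sketch computes the information content of an event from its probability, in both nits and bits. It is not part of the book's library; the class and method names are invented for this example.

    // Information content of an event: I(E) = -log(pr(E)).
    // A minimal sketch; class and method names are illustrative only.
    public class Information {

        // information in nits (natural log)
        static double nits(double pr) { return -Math.log(pr); }

        // information in bits (log base 2)
        static double bits(double pr) { return -Math.log(pr) / Math.log(2.0); }

        public static void main(String[] args) {
            double pr = 0.25;                        // pr(E) = 1/4
            System.out.println(nits(pr) + " nits");  // about 1.386 nits
            System.out.println(bits(pr) + " bits");  // exactly 2 bits
        }
    }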

    Putting Bayes (1.1) and Shannon (1.2) together gives

    $$ I(H \,\&\, D) = I(H) + I(D|H) \tag{1.4} $$

    The length of a message transmitting hypothesis H and data D is the length of first transmitting H and then transmitting D under the assumption that H is true. This two-part message, first H and then (D|H), is the basis of the MML principle [92, 93]. Note that it transmits strictly more than the data D alone; it also transmits an opinion H about the data and so will in general be longer than an optimal message transmitting D alone. If anything, the first part H, being an answer, is generally of more interest than D|H which can be thought of as the noise after the structure, H. A sensible person would choose a good hypothesis H, but she or he is not forced to do so. I(H) is a natural measure of the complexity of hypothesis H. A more complex H₂ may fit D better than a simpler H₁, that is I(H₁) < I(H₂) and I(D|H₁) > I(D|H₂), but if I(H₁) + I(D|H₁) < I(H₂) + I(D|H₂) then H₁ is preferred over H₂. This creates a brake on choosing more and more complex hypotheses, a stopping criterion that resists over-fitting.
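
    To make the trade-off concrete, the sketch below compares two candidate hypotheses for a sequence of coin tosses: a fair coin taken as common knowledge, so I(H) = 0, and a coin whose bias is estimated from the data and must be stated. The cost of stating the bias is an assumed constant here, not the MML parameter cost developed in later chapters; the sketch only shows I(H) + I(D|H) acting as a brake on complexity.

    // Two-part message lengths, in nits, for two hypotheses about n coin
    // tosses with h heads.  A rough sketch: the cost of stating the estimated
    // bias is an assumed constant, not an MML parameter cost.
    public class TwoPartMessage {

        // I(D|H) for a 2-state (Bernoulli) model with pr(head) = p
        static double nlLH(int h, int n, double p) {
            return -(h * Math.log(p) + (n - h) * Math.log(1.0 - p));
        }

        public static void main(String[] args) {
            int n = 100, h = 60;                      // 60 heads in 100 tosses

            // H1: fair coin, common knowledge, so I(H1) = 0
            double msgH1 = 0.0 + nlLH(h, n, 0.5);

            // H2: bias estimated as h/n, stated at an assumed cost of 3 nits
            double assumedParamCost = 3.0;
            double msgH2 = assumedParamCost + nlLH(h, n, (double) h / n);

            System.out.println("fair coin:   " + msgH1 + " nits");
            System.out.println("biased coin: " + msgH2 + " nits");
            System.out.println(msgH1 <= msgH2 ? "prefer the fair coin"
                                              : "prefer the biased coin");
        }
    }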

    William of Ockham (c. 1288–1347) was a Franciscan friar who is believed to have been born in Ockham, Surrey, and who is widely remembered for the principle known as Ockham’s razor. He wrote Numquam ponenda est pluralitas sine necessitate (Plurality is never to be posited without necessity) and made other similar statements [78], although positions in favour of simplicity were taken at least as long ago as Aristotle (384–322 BC): We may assume the superiority ceteris paribus [other things being equal] of the demonstration which derives from fewer postulates or hypotheses—in short from fewer premisses [58]. In MML inference, we might interpret a reduction in I(H) + I(D|H) as being sufficient need to posit an increase in I(H).

    Note that the complexity, I(H), of a hypothesis depends not just on the number of parameters that H has but also on the precision with which those parameters are stated. It is certainly possible that H₁ has more parameters than H₂ and yet I(H₁) < I(H₂).

    Also note that making inferences with MML rarely calls for actual encoding and transmission of messages. It is sufficient to compute what would be the lengths of hypothetical messages. A data-compression program can in principle be based on MML plus an arithmetic coder but data-compression programs are almost invariably expected to run quickly, processing megabytes of data in seconds or less, and having linear or near-linear time complexity—think gzip [35] for example. It is indeed a great thing if an inference program also has these properties but we are generally prepared to give it more time, perhaps much more time, in return for a good answer.

    In what follows a hypothesis is a fully parameterised model. The word model is near the middle of the estimate-theory scale, and it is nice and short.

    1.1 Explanation versus Prediction

    Predict,

    Say or estimate that (a specified thing) will happen in the future or will be a consequence of something.³

    Inductive inference is a different kind of problem to prediction. In prediction we want to know what is likely to happen in the next sample, or in the context of certain observations. In pure prediction we do not care why this is. An inductively inferred model comes complete with its inferred parameter values but in pure prediction we are not interested in them. An inference can naturally be used to make a prediction and the best model does often give good predictions but its primary purpose is to explain what has happened, not to predict what will happen. An average over all models, integrating out the model, may give better predictions than a single model, and indeed if H 1, H 2, … are all the distinct possible causes of D, Bayes [8] gives

    $$\displaystyle \begin{aligned} {\mathrm{pr}}(D) = \sum_i {\mathrm{pr}}(H_i) \cdot {\mathrm{pr}}(D | H_i) \end{aligned}$$

    However, the average over models has no structure, has no parameters, and it cannot be examined to explain why the source of the data is the way it is. Here we are principally concerned with inductive inference.
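
    The Bayes average above can be written as a few lines of Java; the sketch below obtains a predictive probability for a datum D by weighting each candidate model's prediction by that model's probability. The names and numbers are invented for illustration; it is not code from the book's library.

    // pr(D) = sum_i pr(H_i) . pr(D|H_i): average the models' predictions,
    // weighted by the models' probabilities.  All numbers are invented.
    public class AveragePrediction {
        public static void main(String[] args) {
            double[] prH  = { 0.5, 0.3, 0.2 };    // pr(H_i), summing to 1
            double[] prDH = { 0.10, 0.40, 0.25 }; // pr(D|H_i) for the same D

            double prD = 0.0;
            for (int i = 0; i < prH.length; i++) {
                prD += prH[i] * prDH[i];
            }
            System.out.println("pr(D) = " + prD);  // 0.05 + 0.12 + 0.05 = 0.22
        }
    }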

    A subclass of models, the function-models, will be encountered later (Chaps. 5 and 8). They are commonly used to make predictions where an input id is thought to, at least partly, cause an output od = f(id). Given an input, a function-model returns a model, a probabilistic prediction, of the possible values of the output conditional on that input. The function-model inferred by an optimal inference algorithm is the best function-model, out of the set of function-models that the algorithm was allowed to choose from, at modelling (explaining) how id influences od. It may not be the best of all possible function-models at predicting od, and the best of these may be nothing like a good explanation of how id influences od.

    1.2 Models

    The main duty of a fully parameterised model over a data-space is to give a probability pr(d) to d where d is a datum, a data value, from the model’s data-space. It is sometimes convenient to deal with the negative log probability −log_e(pr(d)) in nits, or possibly in bits, −log₂(pr(d)). Continuous data can only ever be measured to limited accuracy ±ε/2 so even a distribution (model) over continuous data can deliver a probability of such a d ± ε/2, as well as a probability density pdf(d). It is also desirable that a model be able to generate random values according to its idea of what is random.
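
    For continuous data the point is that pr(d ± ε/2) is approximately pdf(d) · ε when ε is small, which gives a finite negative log probability. A minimal Java sketch of this uses the standard normal density; the names are invented for the example and it is not the book's library code.

    // For a continuous datum measured to accuracy +/- eps/2, a model can give
    // a probability pr(d +/- eps/2) ~= pdf(d) * eps, and hence a finite
    // negative log probability.  Sketch only; the density is N(0,1).
    public class ContinuousDatum {

        static double pdfN01(double d) {                  // N(0,1) density
            return Math.exp(-0.5 * d * d) / Math.sqrt(2.0 * Math.PI);
        }

        public static void main(String[] args) {
            double d = 1.3, eps = 0.01;                   // datum and accuracy
            double pr = pdfN01(d) * eps;                  // pr(d +/- eps/2)
            System.out.println("pr   ~= " + pr);
            System.out.println("nlPr  = " + (-Math.log(pr)) + " nits");
        }
    }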

    An unparameterised model needs to be supplied with statistical parameters to produce a fully parameterised model. For example, the unparameterised 2-state model over the set {head, tail} requires at least pr(head). (Then pr(tail) = 1 − pr(head).) An unparameterised model may have problem-defining parameters but these are not statistical parameters; problem-defining parameters are given, stated up front, whereas statistical parameters are inferred (estimated, fitted, learned) from data. For example, the 2-state model is a special case of the multistate model (Sect. 2.3) where the bounds on values in the data-space are 〈head, tail〉. The bounds are problem-defining parameters for this use of the multistate model (Sect. 2.2) and pr(head) is a statistical parameter. The distinction between the unparameterised and fully parameterised model here concerns its statistical parameters.

    Ideally, an unparameterised model will have a method of estimating a fully parameterised model from a given data-set, ds = [d₁, d₂, …, d_N]. An estimator may itself have parameters, perhaps to specify prior beliefs about likely estimates, or to limit how much time a search algorithm is allowed to take.

    Sufficient statistics ss = stats(ds) consist of some possibly structured value derived from a data-set ds from which a model M can calculate the negative log likelihood of the data-set,

    $$ nlLH(ss) = - \log({\mathrm{pr}}(ds|M)) $$

    and from which an estimator can estimate a fully parameterised model. In the worst case ss is just the data-set itself, but often ss is much condensed compared to ds, and nlLH(ss), while mathematically equal to $$ \sum_i - \log({\mathrm{pr}}(d_i|M)) $$ provided the dᵢ are independent, is often faster to compute than the sum.
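
    For instance, for the 2-state model over {head, tail} the counts of heads and tails are sufficient. The sketch below, with invented names rather than the book's library interface, condenses a data-set to those counts and computes the negative log likelihood from the counts alone, which equals the sum of −log pr(dᵢ) over the data yet does not revisit every datum.

    // Sufficient statistics for a 2-state model over {head, tail}: the pair
    // of counts (nHead, nTail).  nlLH is computed from the counts alone.
    // A sketch with invented names, not the book's library interface.
    public class TwoStateStats {

        static int[] stats(boolean[] ds) {         // ds[i] == true means head
            int heads = 0;
            for (boolean d : ds) if (d) heads++;
            return new int[] { heads, ds.length - heads };
        }

        // negative log likelihood from the sufficient statistics, given pr(head)
        static double nlLH(int[] ss, double prHead) {
            return -(ss[0] * Math.log(prHead) + ss[1] * Math.log(1.0 - prHead));
        }

        public static void main(String[] args) {
            boolean[] ds = { true, true, false, true, false };  // H H T H T
            int[] ss = stats(ds);
            System.out.println("heads = " + ss[0] + ", tails = " + ss[1]);
            System.out.println("nlLH  = " + nlLH(ss, 0.6) + " nits");
        }
    }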

    1.2.1 Implementation

    The notions above can be implemented as unparameterised UPModel, Estimator and fully parameterised Model in a suitable programming language:

    UPModel          // an UnParameterised Model
     { defnParams()  // UPModel's problem-defining parameters
       stats(ds)     // suff stats of data-set ds for estimation
       estimator(ps) // Estimator of a Model, parameters ps
       ...
     }

    Estimator        // an Estimator of Models
     { stats(ds)     // suff stats of ds for estimation
       ss2Model(ss)  // suff stats to Model
       ds2Model(ds)  // data-set to Model
       ...
     }

    Model            // a Fully Parameterised Model
     { statParams()  // Model's statistical parameters
       pr(d)         // probability of datum d
       nlPr(d)       // neg log_e probability of d
       nl2Pr(d)      // nlPr in bits
       stats(ds)     // stats of data-set ds for nlLH()
       nlLH(ss)      // neg log likelihood, nlLH(stats(ds))
       random()      // generate a random datum
       ...
     }

    In an implementation it is desirable that each UPModel, Estimator, and Model satisfies the relevant standard interface because it can then be used as a component of other structured UPModels, Estimators, and Models.
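
    One way to write the outline down in Java is sketched below. These interfaces merely mirror the pseudocode above; they are not the signatures of the book's actual library (Chap. 13), and the use of Object for data and parameters is an assumption made to keep the example short.

    // A sketch of the three notions as Java interfaces.  Not the signatures
    // of the book's library (Chap. 13); names and types are assumed here.
    interface UPModel {                   // an UnParameterised Model
        Object defnParams();              // problem-defining parameters
        Object stats(Object[] ds);        // suff stats of data-set ds
        Estimator estimator(Object ps);   // an Estimator, given parameters ps
    }

    interface Estimator {                 // an Estimator of Models
        Object stats(Object[] ds);        // suff stats of ds for estimation
        Model ss2Model(Object ss);        // suff stats to Model
        Model ds2Model(Object[] ds);      // data-set to Model
    }

    interface Model {                     // a Fully Parameterised Model
        Object statParams();              // statistical parameters
        double pr(Object d);              // probability of datum d
        double nlPr(Object d);            // neg log_e probability of d
        double nl2Pr(Object d);           // nlPr in bits
        Object stats(Object[] ds);        // stats of ds for nlLH()
        double nlLH(Object ss);           // negative log likelihood
        Object random();                  // generate a random datum
    }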

    Note that unparameterised and fully parameterised are sometimes dropped, and we then use model on its own to mean either or both of these as implied by context. Mathematics is written thus,

    $$ -\log({\mathrm{pr}}\ d) $$, and its implementation in computer code is written in fixed-width font thus, -log(pr(d)). There is more information on a Java implementation of the models in Chap. 13.

    1.3 Estimators

    A fully parameterised model is sometimes just a given. For example, we may have a use for $$ \mathcal{N}_{0,1} $$, the normal (Gaussian) probability distribution (Sect. 4.3) with mean 0 and standard deviation 1. However we are usually more interested in a model inferred from a data-set ds. As touched on earlier, MML judges such a model on its two-part message length msg = msg₁ + msg₂ where msg₁ corresponds to the model itself and msg₂ to the data-set from which it was inferred. These are properties of an implementation of a model.

    Model            // a Fully Parameterised Model
     { statParams()  // Model's statistical parameters
       ...
       msg()         // = msg1() + msg2()
       msg1()        // message length of this Model and
       msg2()        // of the data the Model was inferred from
       ...
     }

    In the case of a model that is a given, such as $$ \mathcal{N}_{0,1} $$ above, set msg₁ = msg₂ = 0 to indicate that the model is common knowledge and not the result of an inference. There is no need to encode something that everyone already knows.

    It can be argued that a message length is a property of the act of estimating a Model in some context rather than of the Model itself. However, given the way that Models are estimated and compared, nine times out of ten it is convenient to have msg, msg1 and msg2 directly available from the Model. Just make sure that if a Model is used twice in a message, it is only transmitted once and is free the second time.

    Message lengths allow us to compute the posterior log-odds ratio of two models,

    $$\displaystyle \begin{aligned} \log \frac{{\mathrm{pr}}(H_1|D)}{{\mathrm{pr}}(H_2|D)} &= \log\, {\mathrm{pr}}(H_1|D) - \log\, {\mathrm{pr}}(H_2|D) \\ &= msg_1(H_2) + msg_2(D|H_2) - msg_1(H_1) - msg_2(D|H_1) \end{aligned} $$

    and to compare two or more models and choose the best one—the one having the shortest two-part message length. An implemented Estimator can do the same thing. However, finding the best Model, or even a good Model, is sometimes a hard computational problem.
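
    In code the comparison is just arithmetic on message lengths. The sketch below, with invented numbers and names, computes the posterior log-odds of two fitted models from their two-part message lengths and prefers the one with the shorter total.

    // Comparing two fitted models by two-part message length (in nits).
    // The message lengths are invented numbers for illustration.
    public class CompareModels {
        public static void main(String[] args) {
            double msg1H1 = 12.5, msg2H1 = 69.3;   // model H1: msg1, msg2
            double msg1H2 = 20.1, msg2H2 = 58.7;   // model H2: msg1, msg2

            double totalH1 = msg1H1 + msg2H1;      // 81.8 nits
            double totalH2 = msg1H2 + msg2H2;      // 78.8 nits

            // log pr(H1|D) - log pr(H2|D) = totalH2 - totalH1
            double logOdds = totalH2 - totalH1;
            System.out.println("log-odds of H1 over H2 = " + logOdds + " nits");
            System.out.println("prefer " + (totalH1 <= totalH2 ? "H1" : "H2"));
        }
    }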

    Recall that the sufficient statistic(s), ss = stats(ds), of a data-set, ds, is something
