




PHIL 6334 - Probability/Statistics Lecture Notes 6: An Introduction to Bayesian Inference
Aris Spanos [Spring 2019]

1 Introduction to Bayesian Inference

The main objective is to introduce the reader to Bayesian inference by comparing and contrasting it with frequentist inference. To avoid unnecessary technicalities the discussion focuses mainly on the simple Bernoulli model.

1.1 The Bayesian inference framework

Bayesian inference begins with a statistical model:
  M_θ(x) = {f(x; θ), θ∈Θ}, x∈R_X^n, for θ∈Θ⊂R^m,
where f(x; θ) is the distribution of the sample X := (X_1, ..., X_n), R_X^n is the sample space and Θ the parameter space. Bayesian inference modifies the frequentist inferential set-up in two crucial respects:
(A) It views the unknown parameter(s) θ as random variables with their own distribution, known as the prior distribution:
  π(·): Θ → [0, 1],
which represents one's a priori assessment of how likely the various values of θ in Θ are; this amounts to ranking the different models M_θ(x) for all θ∈Θ. In frequentist inference θ is viewed as a set of unknown constants indexing f(x; θ), x∈R_X^n.
(B) It re-interprets the distribution of the sample as conditional on the unknown parameters θ, denoted by f(x|θ).
Taken together, these modifications imply that for Bayesians the joint distribution of the sample is now defined by:
  f(x, θ) = f(x|θ)·π(θ), ∀θ∈Θ, ∀x∈R_X^n,
where ∀ denotes 'for all'. In terms of the above distinguishing criteria:
[a] The Bayesian approach to statistical inference interprets probability as degrees of belief [subjective, logical or rational].
[b] In the context of Bayesian inference, the relevant information includes: (i) the data x0 := (x_1, x_2, ..., x_n) and (ii) the prior distribution π(θ), θ∈Θ.
[c] The primary aim of the Bayesian approach is to revise the initial ranking π(θ) in light of the data x0, as summarized by L(θ|x0), to derive the updated ranking in terms of the posterior distribution:
  π(θ|x0) = [f(x0|θ)·π(θ)] / [∫_Θ f(x0|θ)·π(θ)dθ] ∝ L(θ|x0)·π(θ), θ∈Θ,   (1)
where L(θ|x0) ∝ f(x0|θ), θ∈Θ, denotes the likelihood function, as re-interpreted by Bayesianism.
A famous Bayesian, Savage (1954), summarized Bayesian inference succinctly: "Inference means for us the change of opinion induced by evidence on the application of Bayes' theorem." (p. 178)
O'Hagan (1994) is more specific: "Having obtained the posterior density π(θ|x0), the final step of the Bayesian method is to derive from it suitable inference statements. The most usual inference question is this: After seeing the data x0, what do we now know about the parameter θ? The only answer to this question is to present the entire posterior distribution." (p. 6)
In this sense, learning from data in the context of the Bayesian perspective pertains to how the original beliefs π(θ) are revised in light of data x0, the revision coming in the form of the posterior π(θ|x0), ∀θ∈Θ. According to O'Hagan (1994): "The objective [of Bayesian inference] is to extract information concerning θ from the posterior distribution, and to present it helpfully via effective summaries. There are two criteria in this process. The first is to identify interesting features of the posterior distribution. ... The second criterion is good communication. Summaries should be chosen to convey clearly and succinctly all the features of interest." (p. 14)
1.2 The Bayesian approach to statistical inference

In order to avoid any misleading impressions, it is important to note that there are numerous variants of Bayesianism; more than 46656 varieties according to Good (1971)! In this section we discuss some elements of the Bayesian approach that are shared by most variants.

Bayesian inference, like frequentist inference, begins with a statistical model M_θ(x), but modifies the inferential set-up in two crucial respects: (i) the unknown parameter(s) θ are now viewed as random variables (not unknown constants) with their own distribution, known as the prior distribution, π(·): Θ → [0, 1], which represents the modeler's assessment of how likely the various values of θ in Θ are a priori, and (ii) the distribution of the sample f(x; θ) is re-interpreted by Bayesians as conditional on θ and denoted by f(x|θ). Taken together, these modifications imply that there exists a joint distribution relating the unknown parameters θ and a sample realization x:
  f(x, θ) = f(x|θ)·π(θ), ∀θ∈Θ.
Bayesian inference is based exclusively on the posterior distribution π(θ|x0), which is viewed as the revised (from the initial π(θ)) degrees of belief for the different values of θ in light of the summary of the data by L(θ|x0).

Table 1: The Bayesian approach to statistical inference
  Prior distribution: π(θ), ∀θ∈Θ
  Statistical model: M_θ(x) = {f(x; θ), θ∈Θ}, x∈R_X^n, & Data: x0 := (x_1, ..., x_n)  →  Likelihood: L(θ|x0)
  ⇒ (Bayes' rule)  Posterior distribution: π(θ|x0) ∝ π(θ)·L(θ|x0), ∀θ∈Θ

Example 10.9. Consider the simple Bernoulli model (table 10.2), and let the prior π(θ) be Beta(α, β) distributed with density function:
  π(θ) = [1/B(α, β)] θ^(α−1)(1−θ)^(β−1), α>0, β>0, 0<θ<1.   (2)
Combining the Bernoulli likelihood L(θ|x0) ∝ θ^y (1−θ)^(n−y), where y := Σ_{i=1}^n x_i denotes the number of successes, with the prior in (2) yields the posterior distribution:
  π(θ|x0) ∝ [1/B(α, β)] θ^(α−1)(1−θ)^(β−1)·θ^y (1−θ)^(n−y) = [1/B(α, β)] θ^(y+α−1)(1−θ)^((n−y)+β−1).   (3)
In view of the formula in (2), (3) is a 'non-normalized' density of a Beta(α*, β*), where:
  α* = y + α,  β* = (n − y) + β.   (4)
As the reader might have suspected, the choice of the prior in this case was not arbitrary. The Beta prior in conjunction with a Binomial-type likelihood gives rise to a Beta posterior. This is known in Bayesian terminology as a conjugate pair, where π(θ) and π(θ|x0) belong to the same family of distributions.

Savage (1954), one of the high priests of modern Bayesian statistics, summarizes Bayesian inference succinctly by asserting that: "Inference means for us the change of opinion induced by evidence on the application of Bayes' theorem." (p. 178). In terms of the main grounds stated above, the Bayesian approach:
[a] Adopts the degrees-of-belief interpretation of probability, introduced via π(θ), ∀θ∈Θ.
[b] Takes the relevant information to include both (i) the data x0 := (x_1, x_2, ..., x_n) and (ii) prior information. Such prior information comes in the form of a prior distribution π(θ), ∀θ∈Θ, which is assigned a priori and represents one's degree of belief in ranking the different values of θ in Θ as more probable or less probable.
[c] Takes as its primary aim the revision of the original ranking based on π(θ) in light of the data x0, updating it in the form of the posterior distribution:
  π(θ|x0) = [f(x0|θ)·π(θ)] / [∫_Θ f(x0|θ)·π(θ)dθ] ∝ L(θ|x0)·π(θ), ∀θ∈Θ,   (5)
where L(θ|x0) ∝ f(x0|θ) denotes the re-interpreted likelihood function, viewed as conditional on x0. The Bayesian approach is depicted in Table 1.

Since the denominator in (5), m(x0) = ∫_{θ∈Θ} π(θ)·f(x0|θ)dθ, known as the predictive distribution, is derived by integrating out θ, it can be absorbed into the constant of proportionality in (5) and ignored for most practical purposes. The only exception arises when one needs to treat π(θ|x0) as a proper density function that integrates to one; then m(x0) is needed as a normalizing constant.

Learning from data. In this context, learning from data x0 takes the form of revising one's degrees of belief for the different values of θ [i.e. the different models M_θ(x), θ∈Θ] in light of data x0, the learning taking the form π(θ|x0) − π(θ), ∀θ∈Θ. That is, the learning from data x0 about the phenomenon of interest takes place in the head of the modeler. In this sense, the underlying inductive reasoning is neither factual nor hypothetical; it is all-inclusive in nature: it pertains to all θ in Θ, as ranked by π(θ|x0). Hence, Bayesian inference does not pertain directly to the real-world phenomenon of interest per se, but to one's beliefs about M_θ(x), θ∈Θ.
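To make the conjugate updating of Example 10.9 concrete, here is a minimal numerical sketch (assuming Python with SciPy, which these notes do not themselves use); the Beta(2, 5) prior in the loop is an arbitrary illustration, the other two are the uniform and Jeffreys priors discussed below.

```python
from scipy import stats

# Sketch of Example 10.9: a Beta(a, b) prior combined with a Bernoulli/Binomial
# likelihood gives a Beta(a + y, b + n - y) posterior, y = number of successes in n trials.
def beta_bernoulli_update(y, n, a, b):
    return a + y, b + (n - y)          # (a*, b*) of the Beta posterior

y, n = 4, 20                           # the data configuration used later in the notes
for a, b in [(1.0, 1.0), (0.5, 0.5), (2.0, 5.0)]:   # uniform, Jeffreys, arbitrary prior
    a_star, b_star = beta_bernoulli_update(y, n, a, b)
    post = stats.beta(a_star, b_star)
    print(f"prior Beta({a},{b}) -> posterior Beta({a_star},{b_star}), "
          f"posterior mean = {post.mean():.3f}")
```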
1.2.1 The choice of prior distribution

Over the last two decades the focus of disagreement among Bayesians has been the choice of the prior. Although the original justification for using a prior is that it gives the modeler the opportunity to incorporate substantive information into the data analysis, the discussions among Bayesians in the 1950s and 1960s made computational convenience the priority for the choice of a prior distribution, and that led to conjugate priors, which ensure that the prior and the posterior distributions belong to the same family of distributions; see Berger (1985). More recently, discussions among Bayesians have shifted the choice-of-prior question to 'subjective' vs. 'objective' prior distributions.

The concept of an 'objective' prior was pioneered by Jeffreys (1939) in an attempt to address Fisher's (1921) criticism of Bayesian inference that routinely assumed a Uniform prior for θ as an expression of ignorance. Fisher's criticism was that if one assumes that a Uniform prior π(θ), ∀θ∈Θ, expresses ignorance because all values of θ are assigned the same prior probability, then a reparameterization of θ, say φ = h(θ), will give rise to a very informative prior for φ.

[Fig. 10.1: Uniform prior density of θ. Fig. 10.2: Logistic prior density of φ.]

Example 10.10. In the context of the simple Bernoulli model (table 10.2), let the prior be θ ~ U(0, 1), 0 ≤ θ ≤ 1 (figure 10.1). Note that U(0, 1) is a special case of the Beta(α, β) for α=β=1.

[Fig. 10: Beta(α, β) densities for different values of (α, β).]

Reparameterizing θ into φ = ln(θ/(1−θ)) implies that φ ~ Logistic(0, 1), −∞ < φ < ∞. Looking at the prior for φ (figure 10.2), it becomes clear that the 'ignorance' about θ has been transformed into substantial knowledge about the different values of φ.

In his attempt to counter Fisher's criticism, Jeffreys proposed a form of prior distribution that is invariant to such transformations. To achieve the reparameterization invariance, Jeffreys had to use the Fisher information associated with the score function and the Cramer-Rao lower bound; see chapters 11-12.

[Fig. 10.3: Jeffreys prior. Fig. 10.4: Jeffreys posterior.]
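As a quick check on the reparameterization point of Example 10.10, the following simulation sketch (an illustration assuming NumPy and SciPy, not part of the notes) draws θ from the 'ignorance' prior U(0, 1) and confirms that the implied density of φ = ln(θ/(1−θ)) matches the Logistic(0, 1) density.

```python
import numpy as np
from scipy import stats

# Example 10.10: a U(0,1) prior on theta induces a Logistic(0,1) prior on
# phi = ln(theta/(1-theta)), i.e. 'ignorance' about theta becomes information about phi.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 1.0, size=200_000)    # draws from the 'ignorance' prior
phi = np.log(theta / (1.0 - theta))            # the reparameterization

# compare the simulated density of phi with the Logistic(0,1) density at a few points
hist, edges = np.histogram(phi, bins=200, range=(-8, 8), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
for g in np.linspace(-5, 5, 11):
    simulated = hist[np.argmin(np.abs(centres - g))]
    print(f"phi={g:5.1f}: simulated={simulated:.3f}, Logistic(0,1) pdf={stats.logistic.pdf(g):.3f}")
```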
Example 10.11. In the case of the simple Bernoulli model (table 10.2), the Jeffreys prior is:
  π(θ) ~ Beta(.5, .5),  π(θ) = [1/B(.5, .5)] θ^(−.5)(1−θ)^(−.5).
In light of the posterior in (3), using y=4, n=20 gives rise to:
  π(θ|x0) ~ Beta(4.5, 16.5).
For comparison purposes, the Uniform prior θ ~ U(0, 1) in example 10.10 yields: π(θ|x0) ~ Beta(5, 17).
Attempts to extend the Jeffreys prior to models with more than one unknown parameter initiated a variant of Bayesianism that uses what are called objective (default, reference) priors, because they minimize the role of the prior distribution and maximize the contribution of the likelihood function in deriving the posterior; see Berger (1985), Bernardo and Smith (1994).

1.3 Bayesian Estimation

Given the posterior distribution π(θ|x0), the Bayesian point estimator is often chosen to be its mode θ̃:
  π(θ̃|x0) = sup_{θ∈Θ} π(θ|x0).
Since the mode of a Beta(α, β) distribution is (α−1)/(α+β−2), the Bayesian estimator based on the Beta(α*, β*) posterior is:
  θ̃ = (α*−1)/(α*+β*−2) = (y+α−1)/(n+α+β−2).   (6)
If we compare this with the MLE θ̂(X) = (1/n)Σ_{i=1}^n X_i = y/n, the two coincide algebraically only when α=β=1, in which case θ̃ = y/n. The restrictions α=β=1 imply that the prior π(θ) is Uniformly distributed.
Another 'natural' choice (depending on the implicit loss function) for the Bayesian point estimator is the mean of the posterior distribution. For θ ~ Beta(α, β), E(θ) = α/(α+β), and thus:
  θ̂_B = α*/(α*+β*) = (y+α)/(n+α+β).   (7)
Example. Let π(θ) ~ Beta(.5, .5).
(a) y=4, n=20: α* = y+α = 4.5, β* = (n−y)+β = 16.5; θ̂_B = 4.5/(4.5+16.5) = .214, θ̃ = 3.5/(21−2) = .184.
(b) y=12, n=20: α* = y+α = 12.5, β* = (n−y)+β = 8.5; θ̃ = 11.5/19 = .605, θ̂_B = 12.5/21 = .595.
When Bayesians claim that all the relevant information for any inference concerning θ is given by π(θ|x0), they only admit to half the truth. The other half is that for selecting a Bayesian 'optimal' estimator of θ one needs to invoke additional information, such as a loss (or utility) function L(θ̂(X), θ). Using different loss functions gives rise to different choices of Bayes estimate. For example:
(i) when L(θ̂, θ) = (θ̂ − θ)², the resulting Bayes estimator is the mean of π(θ|x0);
(ii) when L(θ̂, θ) = |θ̂ − θ|, the Bayes estimator is the median of π(θ|x0); and
(iii) when L(θ̂, θ) is the 0-1 loss [L = 0 for θ̂ = θ, L = 1 for θ̂ ≠ θ], the Bayes estimator is the mode of π(θ|x0).

1.4 Bayesian Credible Intervals

A Bayesian (1−α) credible interval for θ is constructed by ensuring that the posterior probability between a and b is equal to (1−α):
  P(a ≤ θ < b) = ∫_a^b π(θ|x0)dθ = 1−α.
In practice one can define an infinity of (1−α) credible intervals using the same posterior π(θ|x0). To avoid this indeterminacy one needs to impose additional restrictions, such as requiring the interval with the shortest length or one with equal tails, i.e. ∫_{−∞}^a π(θ|x0)dθ = α/2 and ∫_b^∞ π(θ|x0)dθ = α/2; see Robert (2007).
Example. For the simple (one parameter; σ² is known) Normal model, the sampling distribution of X̄_n = (1/n)Σ_{i=1}^n X_i and the posterior distribution of μ derived on the basis of an improper uniform prior [π(μ)=1 for all μ∈R] are:
  X̄_n ~[μ=μ*] N(μ*, σ²/n),   π(μ|x0) ~ N(x̄_n, σ²/n),   (8)
where the first holds under the True State of Nature (TSN) μ=μ*. The two distributions can be used, respectively, to construct (1−α) Confidence and Credible Intervals:
  P(X̄_n − c_{α/2}(σ/√n) ≤ μ ≤ X̄_n + c_{α/2}(σ/√n); μ=μ*) = 1−α,   (9)
  P(x̄_n − c_{α/2}(σ/√n) ≤ μ ≤ x̄_n + c_{α/2}(σ/√n) | x0) = 1−α.   (10)
The two intervals might appear the same, but they are drastically different.
First, in (9) the random variable is X̄_n and its sampling distribution f(x̄_n; μ) is defined over x∈R_X^n, but in (10) the random variable is μ and its posterior π(μ|x0) is defined over μ∈R. Second, the reasoning underlying (9) is factual (TSN), but that of (10) involves All Possible States of Nature (APSN). Hence, the (1−α) Confidence Interval (9) provides the shortest random upper U(X) = X̄_n + c_{α/2}(σ/√n) and lower L(X) = X̄_n − c_{α/2}(σ/√n) bounds that cover the true μ with probability (1−α). In contrast, the (1−α) Credible Interval (10) provides the shortest interval defined by two non-random, lower l(x0) = x̄_n − c_{α/2}(σ/√n) and upper u(x0) = x̄_n + c_{α/2}(σ/√n), values such that (1−α) of the posterior π(μ|x0) lies within it; i.e. (10) is the highest posterior, non-random interval of length 2c_{α/2}(σ/√n); it includes the (1−α) proportion of the highest-ranked values of μ∈R. This raises a pointed question:
► Is a Bayesian (1−α) Credible Interval an inference that pertains to the "true" θ? If not,
► what does a Bayesian (1−α) Credible Interval say about the process that generated data x0?

[Fig. 2: the Binomial (θ=.5, n=20) sampling distribution. Fig. 14: the posterior π(θ|x0) ~ Beta(12.5, 8.5) based on Jeffreys' prior.]

The contrast between the sampling distribution of X̄_n and the posterior distribution of θ in the case of the simple Bernoulli model brings the difference out more starkly; one is discrete, the other continuous. For example, in the case of y=12, n=20, the sampling distribution (under θ0 = .5) is given in fig. 2 and the posterior distribution based on a Jeffreys prior, centered at θ̂ = .6, is given in fig. 14.
Example. For the simple Bernoulli model, the end points of an equal-tail credible interval can be evaluated using the F tables and the fact that:
  θ ~ Beta(α*, β*) ⇒ (β*θ)/(α*(1−θ)) ~ F(2α*, 2β*).
Denoting the α/2 and (1−α/2) upper-tail percentiles of the F(2α*, 2β*) distribution by f(α/2) and f(1−α/2), respectively, the Bayesian (1−α) credible interval for θ is:
  [1 + β*/(α*·f(1−α/2))]^(−1) ≤ θ ≤ [1 + β*/(α*·f(α/2))]^(−1).
Example. Let π(θ) ~ Beta(.5, .5).
(a) y=2, n=20, α=.05: α* = y+α = 2.5, β* = (n−y)+β = 18.5, f(1−α/2) = .163, f(α/2) = 2.93:
  [1 + 18.5/(2.5(.163))]^(−1) ≤ θ ≤ [1 + 18.5/(2.5(2.93))]^(−1) ⇔ (.0216 ≤ θ ≤ .284).
(b) y=18, n=20, α=.05: α* = y+α = 18.5, β* = (n−y)+β = 2.5, θ̂_B = 18.5/21 = .881, f(1−α/2) = .341, f(α/2) = 6.188:
  [1 + 2.5/(18.5(.341))]^(−1) ≤ θ ≤ [1 + 2.5/(18.5(6.188))]^(−1) ⇔ (.716 ≤ θ ≤ .979).
One can also use the asymptotic Normal approximation to construct an approximate credible interval for θ:
  P(θ̂_B − c_{α/2}√[θ̂_B(1−θ̂_B)/(n+α+β+1)] ≤ θ ≤ θ̂_B + c_{α/2}√[θ̂_B(1−θ̂_B)/(n+α+β+1)]) ≈ 1−α,   (11)
where c_{α/2} denotes the Normal α/2 percentile.
Example. Let π(θ) ~ Beta(.5, .5).
(a) y=2, n=20, α=.05: α* = 2.5, β* = 18.5, θ̂_B = 2.5/21 = .119:
  θ̂_B ± c_{α/2}√[θ̂_B(1−θ̂_B)/(n+α+β+1)] = (−.0163, .254).
(b) y=18, n=20, α=.05: α* = 18.5, β* = 2.5, θ̂_B = 18.5/21 = .881:
  θ̂_B ± c_{α/2}√[θ̂_B(1−θ̂_B)/(n+α+β+1)] = (.746, 1.016).
It is important to emphasize that this approximation can be very crude in practice, when n is small and/or the posterior distribution is skewed. The approximation will be better for φ = ln(θ/(1−θ)). For additional numerical examples see the Appendix.
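The point estimates of section 1.3 and the equal-tail and approximate credible intervals above can be cross-checked numerically. The sketch below (assuming SciPy; the helper name beta_posterior_summary is ours, not from the notes) uses the Beta quantile function directly in place of the F tables.

```python
import numpy as np
from scipy import stats

# Sketch: posterior summaries for the Beta(a*, b*) posterior of the simple Bernoulli
# model under the Jeffreys prior Beta(0.5, 0.5), cross-checking sections 1.3-1.4.
def beta_posterior_summary(y, n, a=0.5, b=0.5, alpha=0.05):
    a_star, b_star = y + a, (n - y) + b
    mode = (a_star - 1) / (a_star + b_star - 2)        # Bayes estimate under 0-1 loss
    mean = a_star / (a_star + b_star)                  # Bayes estimate under quadratic loss
    lo, hi = stats.beta.ppf([alpha / 2, 1 - alpha / 2], a_star, b_star)  # equal-tail interval
    # crude Normal approximation of the (1-alpha) credible interval, as in (11)
    z = stats.norm.ppf(1 - alpha / 2)
    se = np.sqrt(mean * (1 - mean) / (n + a + b + 1))
    return mode, mean, (lo, hi), (mean - z * se, mean + z * se)

for y in (4, 12, 2, 18):
    mode, mean, exact, approx = beta_posterior_summary(y, 20)
    print(f"y={y:2d}: mode={mode:.3f}, mean={mean:.3f}, "
          f"equal-tail 95%=({exact[0]:.3f}, {exact[1]:.3f}), "
          f"approx=({approx[0]:.3f}, {approx[1]:.3f})")
```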
1.5 Bayesian Testing

Bayesian testing of hypotheses is not as easy to handle using the posterior distribution as credible intervals are, especially for point hypotheses, because of the technical difficulty in attaching probabilities to particular values of θ, since the parameter space Θ ⊂ R^m is usually uncountable. Indeed, the only prior distribution that makes technical sense assigns π(θ) = 0 to each particular value θ∈Θ. In their attempt to deflect attention away from this technical difficulty, Bayesians have criticized the use of point hypotheses such as θ=θ0 in frequentist testing as nonsensical because such hypotheses can never be exactly true. This is a nonsensical argument because the notion of 'exactly true' has no place in statistics.

1.5.1 Point null and point alternative hypotheses

There have been several attempts to address the difficulty with point hypotheses, but no agreement seems to have emerged; see Robert (2007). Let us consider one such attempt for testing the hypotheses:
  H0: θ = θ0 vs. H1: θ = θ1.
Like all Bayesian inferences, the basis is the posterior distribution. Hence, an obvious way to assess their respective degrees of belief is the posterior odds:
  π(θ0|x0)/π(θ1|x0) = [L(θ0|x0)·π(θ0)] / [L(θ1|x0)·π(θ1)] = [π(θ0)/π(θ1)]·[L(θ0|x0)/L(θ1|x0)],   (12)
where the factor π(θ0)/π(θ1) represents the prior odds and L(θ0|x0)/L(θ1|x0) the likelihood ratio. In light of the fact that the technical problem stems from the prior π(θ) assigning probabilities to particular values of θ, an obvious way to sidestep the problem is to cancel the prior odds factor by using the ratio of the posterior odds to the prior odds to define the Bayes Factor (BF):
  BF(θ0, θ1|x0) = [π(θ0|x0)/π(θ1|x0)] / [π(θ0)/π(θ1)] = L(θ0|x0)/L(θ1|x0).   (13)
This addresses the technical problem because the likelihood function is definable for particular values of θ. For this reason Bayesian testing is often based on the BF combined with certain rules of thumb concerning the strength of the degree of belief against H0 as it relates to the magnitude of BF(x0; θ0) (Robert, 2007):
► 0 ≤ BF(x0; θ0) ≤ 3.2: the degree of belief against H0 is poor,
► 3.2 < BF(x0; θ0) ≤ 10: the degree of belief against H0 is substantial,
► 10 < BF(x0; θ0) ≤ 100: the degree of belief against H0 is strong, and
► BF(x0; θ0) > 100: the degree of belief against H0 is decisive.
The Likelihoodist approach. It is important to note that the Law of Likelihood, defining the likelihood ratio:
  LR(θ0, θ1|x0) = L(θ0|x0)/L(θ1|x0),
provides the basis of the Likelihoodist approach to testing, but applies only to tests of point vs. point hypotheses. In contrast, the Bayes Factor can be extended to composite hypotheses.
Notice that, as with point estimation and credible intervals, the claim that Bayesian inference relies exclusively on the posterior distribution for inference purposes is only half the truth. The other half, for Bayesian testing, is the use of rules of thumb to go from the BF to evidence for or against the null, rules that have been called into question as largely ad hoc; see Kass and Raftery (1995).

1.5.2 Composite hypotheses

Consider testing the hypotheses:
1 :   0  0 = 5 in the context of the simple Bernoulli case with a Jeffreys invariant prior, with data =12, =20 An obvious way to evaluate the posterior odds for these two interval hypotheses is as follows: ¢ R 5 ¡ 115 Γ(21) 75  (1-) =186 ( ≤ 0 |x0 )= Γ(125)Γ(85) 0 (  0 |x0 )=1-( ≤ 0 |x0 )=814 10 One can then employ the posterior odds criterion: (≤0 |x0 ) (0 |x0 ) = 186 814 = 229 which indicates that the degree of belief against 0 is poor. 1.5.3 Point null but composite alternative hypothesis Pretending that point hypotheses are small intervals. A ‘pragmatic’ way to handle point hypotheses in Bayesian inference is to sidestep the technical difficulty in handling hypotheses of the form: 0 :  = 0 vs. 1 :  6= 0  by pretending that =0 is actually a small interval: 0 : ∈Θ0 :=(0 −  0 + ) and attaching a spiked prior of the form: R1 (=0 )=0  1 = 0 (6=0 )=1 − 0  (14) i.e. attach a prior of 0 to =0 , and then distribute the rest 1−0 to all the other values of ; see Berger (1985). Using Credible Intervals as surrogates for tests. Lindley (1965) suggested an adaptation of a frequentist procedure of using the duality between the acceptance region and Confidence Intervals as surrogates for tests, by replacing the latter with Credible Intervals. His Bayesian adaptation to handle point null hypotheses, say: 0 :  = 8 vs. 1 :  6= 8 is to construct a (1 − ) Credible Interval using an "uninformative" prior and reject 0 if it lies outside that interval. Example. For =12, =20 the likelihood function is: (; x0 ) ∝ 12 (1 − )8  ∈[0 1] when combined with () vBeta(1 1) yields: (|x0 ) v Beta(13 9) ∈[0 1] A 95 credible interval for  is: Γ(22) Γ(13)Γ(9) R1 (384 ≤   782)=95 12 (1−)8 =0975 384 Γ(22) Γ(13)Γ(9) R1 7817 12 (1−)8 =0025 This suggests that the null 0 = 8 should be rejected because it lies outside the credible interval. 11 2 The large  problem and Bayesian testing The large  problem was initially raised by Lindley (1957) in the context of the simple Normal model (??) where the variance  2  0 is assumed known, by pointing out: [a] the large  problem: frequentist testing is susceptible to the fallacious result that there is always a large enough sample size  for which any point null, say 0 : =0 , will be rejected by a frequentist -significance level test. Lindley claimed that this result is paradoxical because, when viewed from the Bayesian perspective, one can show: [b] the Jeffreys-Lindley paradox: for certain choices of the prior, the posterior probability of 0  given a frequentist -significance level rejection, will approach one as →∞. Claims [a] and [b] contrast the behavior of a frequentist test (p-value) and the posterior probability of 0 as →∞, that highlights a potential for conflict between the frequentist and Bayesian accounts of evidence: [c] Bayesian charge 1: “The Jeffreys-Lindley paradox shows that for inference about  P-values and Bayes factors may provide contradictory evidence and hence can lead to opposite decisions.” (Ghosh et. al, 2006, p. 177) [d] Bayesian charge 2: a hypothesis that is well-supported by Bayes factor can be (misleadingly) rejected by a frequentist test when  is large; see Berger and Sellke (1987), pp. 112-3. A paradox? No! 
A paradox? No! From the error statistical perspective:
(i) There is nothing fallacious about a small p-value, or a rejection of H0, when n is large [it is a feature of a consistent frequentist test], but there is a problem when such results are detached from the test itself and are treated as providing the same evidence for a particular alternative H1, regardless of the generic capacity (the power) of the test in question, which depends crucially on n.
► Hence, the real problem does not lie with the p-value or the accept/reject rules as such, but with how such results are transformed into evidence for or against a particular H. The large n problem can be circumvented by using the post-data severity assessment.
How does the Bayesian approach explain why the result π(θ0|x0)→1 as n→∞ (irrespective of the truth or falsity of H0) is conducive to a more sound evidential account?
Example. Consider the following example in Stone (1997): "A particle-physics complex plans to record the outcomes of a large number of independent particle collisions of a particular type, where the outcomes are either type A or type B. ... the results are to be used to test a theoretical prediction that the proportion of type A outcomes, h, is precisely 1/5, against the vague alternative that h could take any other value. The results arrive: 106298 type A collisions out of 527135." (p. 263)

2.1 Bayesian testing

Consider applying the Bayes factor procedure to the hypotheses H0: θ = .2 vs. H1: θ ≠ .2 using a uniform prior:
  θ ~ U(0, 1), i.e. π(θ) = 1 for all θ∈[0, 1].   (15)
This gives rise to the Bayes factor:
  BF(x0; θ0) = L(θ0; x0) / ∫_0^1 L(θ; x0)dθ
             = [C(527135, 106298)(.2)^{106298}(.8)^{420837}] / [∫_0^1 C(527135, 106298) θ^{106298}(1−θ)^{420837}dθ]
             = .000015394/.000001897 = 8.115.   (16)
Note that the same Bayes factor (16) arises in the case of the spiked prior (14) with π0 = .5, where θ=θ0 is given prior probability .5 and the other half is distributed equally among the remaining values of θ; for π0 = .5 the ratio π0/(1−π0) = 1 and cancels out of BF(x0; θ0).
► A Bayes factor result BF(x0; θ0) ≈ 8.115 indicates that data x0 favor the null against all other values of θ substantially; see Robert (2007).
Is the result as clear cut as it appears? No, because, on the basis of the same data x0, the Bayes factor 'favors' not only θ0 = .2 but each individual value θ1 inside a certain interval around θ0 = .2:
  Θ_B := [.199648, .203662] ⊂ Θ1 := Θ − {.2},   (17)
where the square brackets indicate inclusion of the end points, in the sense that, for each θ1∈Θ_B, BF(x0; θ1) > 1, i.e.
  L(θ1; x0) > ∫_0^1 L(θ; x0)dθ for all θ1∈Θ_B.   (18)
Worse, certain values θ‡ in Θ_B are favored by BF(x0; θ‡) more strongly than θ0 = .2:
  θ‡ ∈ Θ_C := (.2, .20331] ⊂ Θ_B.   (19)
It is important to emphasize that the subsets Θ_C ⊂ Θ_B ⊂ Θ exist for every data x0, and one can locate them by trial and error. However, there is a much more efficient way to do that. As shown below, Θ_B can be defined as a subset of Θ around the Maximum Likelihood Estimate (MLE):
  θ̂(x0) = 106298/527135 = .20165233.
Is this a coincidence? No; as Mayo (1996), p. 200, pointed out, θ̈ = θ̂(x0) is always the maximally likely alternative, irrespective of the null or other substantive values of interest.
1 : 6=¨ yields: (20165233)106298 (1−20165233)527135−106298 (527135 106298)  1 527135 106298 = (1−)527135−106298 ) 0 ((106298) =721911 = 0013694656 000001897  (x0 ; ¨ ) = 13 (20) indicating extremely decisive evidence for =¨ =b  (x0 ): ¨ is favored by  (x0 ; ¨ ) more than 89' 721911 times stronger than 0 =2! 8115 Indeed, if one were to test the point hypotheses: 0 : =2 vs. 0 : =¨ , (2)106298 (1−2)527135−106298 (527135 106298) = (20165233)106298 (1−20165233)527135−106298 (527135 106298) 000015394 =011241 = 001369466 the result reverses the original Bayes factor result and suggests that the degree to 1 which data x0 favor =¨ over 0 =2 is much stronger (89' 011241 ) as in (20). ¥ This result is an instance of the fallacy of acceptance: the Bayes factor  (x0 ; 0 )  8 is misinterpreted as providing evidence for 0 : 0 = 2 against any value of  in Θ1 :=Θ − {2} when in fact  (x0 ; ‡ ) provides much stronger evidence for certain values of ‡ in Θ1 ; in particular all ‡ ∈Θ ⊂ Θ1 . What is the source of the problem? The key problem is that the Bayes factor is invariant to the sample size  i.e. it is irrelevant whether  results from =10 or =1010  when going from  (x0 ; 0 )  8 to claiming that data x0 provide strong evidence for 0 : =2. Hence, going from: (0  ¨ ; x0 ) = step 1: (0 ;x0 )   (0  1 ; x0 )= ( 1 ;x0 ) (21) indicating that 0 is  times more likely than 1 , to: step 2 : fashioning   0 into the strength of evidence for 0  the Bayesian interpretation goes astray by ignoring . Where does the above post-data severity perspective leave the Bayesian (and likelihoodist) inferences? I Both approaches are plagued by two key problems: (i) the maximally likely alternative problem (Mayo, 1996), in the sense that the value b  (x0 )=¨ is always favored against every other value of , irrespective of the substantive values of interest, and (ii) the invariance of (0  1 ; x0 ) to the sample size . In contrast, the severity of the inferential claim   ¨ is always low, being equal to 5 (table S), calling into question the appropriateness of such a choice. In addition, the severity assessment in table S calls seriously into question the results associated with the two intervals Θ :=[199653 203662] and Θ :=(2 20331], because these intervals include values ‡ of  for which the severity of the relevant inferential claim   ‡ is very low, e.g.  ( ;   2033) ' 001 Conclusion. Any evidential account aiming to provide a sound answer the question: ‘when do data x0 provide evidence for or against a hypothesis (or a claim)?’ can ignore the generic capacity of a test at its peril! 14 2.2 Nonsense Bayesians utter about the frequentist approach Frequentist inference, in general "Non-Bayesians, who we hereafter refer to as frequentists, argue that situations not admitting repetition under essentially identical conditions are not within the realm of statistical enquiry, and hence ’probability’ should not be used in such situations. Frequentists define the probability of an event as its long-run relative frequency. ... that definition is nonoperational since only a finite number of trials can ever be conducted.’ (p. 2) Koop, G. D.J. Poirier and J.L. Tobias (2007), Bayesian Econometric Methods, Cambridge University Press, Cambridge. 
2.2 Nonsense Bayesians utter about the frequentist approach

Frequentist inference, in general. "Non-Bayesians, who we hereafter refer to as frequentists, argue that situations not admitting repetition under essentially identical conditions are not within the realm of statistical enquiry, and hence 'probability' should not be used in such situations. Frequentists define the probability of an event as its long-run relative frequency. ... that definition is nonoperational since only a finite number of trials can ever be conducted." (p. 2)
Koop, G., D.J. Poirier and J.L. Tobias (2007), Bayesian Econometric Methods, Cambridge University Press, Cambridge.

About p-values. Bayesians often point to the closeness of the p-value and the posterior probability of H0 to make two misleading claims. First, that Bayesian testing often enjoys good 'error probabilistic' properties, whatever that might mean. Second, that the closeness of the posterior probability and the p-value indicates the superiority of the former because it provides what modelers want: "... the applied researcher would really like to be able to place a degree of belief on the hypothesis." (Press, 2003, p. 220) Really???? This interpretation is completely at odds with the proper frequentist interpretation of the p-value, which is firmly attached to the testing procedure and is used as a measure of discordance between the data and the null value θ0. What is often ignored in such Bayesian discussions is that the p-value is susceptible to the fallacies of rejection and acceptance, rendering the posterior probability for H0 equally vulnerable to the same fallacies; see Spanos (2013).
The above comparison raises several interesting questions:
► Is there a connection between the p-value and the posterior probability of H0? More generally,
► is there a connection between frequentist error probabilities and posterior probabilities associated with the null and alternative hypotheses? If yes,
► what does that imply about the relationship between the frequentist and Bayesian approaches to testing?
Example - simple (one parameter) Normal model. Consider the simple Normal model with σ² = 1. Returning to (8)-(10), the two distributions:
  π(μ|x0) ~ N(x̄_n, 1/n),   X̄_n ~[μ=μ*] N(μ*, 1/n),
are different: one can easily draw π(μ|x0) because x̄_n is a known number, but μ* in N(μ*, 1/n) is unknown. For instance, evaluating the expectation of one and the same expression in X̄_n and μ gives two different answers, depending on whether E_f(·) is taken with respect to f(x̄_n; μ), under which X̄_n is the random variable and μ a constant, or E_π(·) is taken with respect to π(μ|x0), under which μ is the random variable and x̄_n a known constant.

2.3 Bayesian Prediction

The best Bayesian predictor for X_{n+1} is based on its posterior predictive density, defined by:
  f(x_{n+1}|x0) = ∫_0^1 f(x_{n+1}|x0; θ)·π(θ|x0)dθ ∝ ∫_0^1 f(x_{n+1}|x0; θ)·f(x0|θ)·π(θ)dθ.
Note that f(x_{n+1}, x0, θ) = f(x_{n+1}|x0; θ)·f(x0|θ)·π(θ) defines the joint distribution of (X_{n+1}, x0, θ). Since X_{n+1} ~ Ber(θ, θ(1−θ)), integrating out θ yields:
  f(x_{n+1}|x0) = (y+α)/(n+α+β) if x_{n+1} = 1,  and  ((n−y)+β)/(n+α+β) if x_{n+1} = 0,
which is a Bernoulli density with θ* = (y+α)/(n+α+β). The Bayesian predictor based on the mode of f(x_{n+1}|x0) is:
  X̃_{n+1} = 1 if max(θ*, 1−θ*) = θ*,  and 0 if max(θ*, 1−θ*) = 1−θ*.
The posterior expectation predictor is given by:
  E(X_{n+1}|x0) = (y+α)/(n+α+β),
which, in the case where α=β=1, i.e. π(θ) ~ Beta(1, 1) = U(0, 1), gives rise to Laplace's law of succession:
  E(X_{n+1}|x0) = (y+1)/(n+2).
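A small numerical sketch of the posterior predictive formulas above (assuming SciPy; the function name predictive_success is ours, not from the notes): it confirms that the closed form P(X_{n+1}=1|x0) = (y+α)/(n+α+β) agrees with direct integration against the Beta posterior, and reduces to Laplace's rule for the uniform prior.

```python
from scipy import stats
from scipy.integrate import quad

# Section 2.3: posterior predictive probability of a success for the Bernoulli model
# with a Beta(a, b) prior; with a = b = 1 this is Laplace's rule (y + 1)/(n + 2).
def predictive_success(y, n, a, b):
    return (y + a) / (n + a + b)

y, n = 12, 20
for name, (a, b) in {"uniform Beta(1,1)": (1.0, 1.0), "Jeffreys Beta(.5,.5)": (0.5, 0.5)}.items():
    closed_form = predictive_success(y, n, a, b)
    # numerical check: E(theta | x0) under the Beta(y+a, n-y+b) posterior
    numeric, _ = quad(lambda t: t * stats.beta.pdf(t, y + a, n - y + b), 0, 1)
    print(f"{name}: closed form={closed_form:.4f}, numerical check={numeric:.4f}")
```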
3 Fisher's criticisms of Bayesian inference

In "The Design of Experiments" (1935), pp. 6-7, R. A. Fisher gave three criticisms of Bayesian inference.
The first of Fisher's criticisms concerns the degrees-of-belief interpretation of probability being unscientific, in contrast to the objective relative frequency interpretation. Modern Bayesians proposed a twofold counter-argument: there is nothing problematic about their interpretation for scientific reasoning purposes, but the frequentist interpretation of probability is problematic (circular). Despite their rhetoric, there is a clear move away from subjective 'informative' priors towards priors relating to the likelihood function; reference or 'objective' priors.
Fisher's second criticism of Bayesian inference is based on the fact that the reliability of scientific research and inference is never justified by invoking Bayesian reasoning, despite its long history going back to the 18th century. In particular, Fisher (1955) objected vehemently to viewing statistical inference as a 'decision problem under uncertainty' based on arbitrary loss or utility functions. In his writings Fisher argued that notions like inductive inference, evidence and learning from data differ crucially from notions like decisions, behavior and actions with losses and gains. He considered Bayesian inference and decision-theoretic formulations highly artificial and not in tune with learning from data and scientific reasoning. In his second criticism Fisher also called into question the Bayes formula:
  π(θ|x0) = [π(θ)·f(x0|θ)] / [∫_{θ∈Θ} π(θ)·f(x0|θ)dθ], θ∈Θ,
being viewed as an axiom whose truth can be taken for granted. His criticisms focused primarily on the fact that this formula treats the existence of the prior distribution π(θ) as self-evident and straightforward. In particular, Fisher (1921) criticized the notion of prior ignorance widely used by Bayesians since the 1820s as the basis of their inference. The claim, going back to Laplace, that a uniform prior:
  π(θ) ~ U(0, 1), for all θ∈Θ,
can be used to quantify a state of ignorance about the unknown parameter θ was vigorously challenged by Fisher (1921) on non-invariance to reparameterization grounds: one is ignorant about θ but very informed about φ = h(θ); see section 1.2.1 above.
Fisher's third criticism raised a fundamental question pertaining to Bayesian inference which has not been answered in a satisfactory way to this day. The question is: How does one choose the prior distribution π(θ)? The answer 'from a priori substantive information' is both inaccurate and misleading. It is inaccurate because substantive information almost never comes in the form of a prior π(θ) defined over statistical parameter(s) θ; it often comes in the form of the sign and magnitude of unknown parameters. It is also misleading because when no information pertaining to the different values of θ is available, one faces the difficult task of framing that ignorance, falling into the above trap!

4 Where do prior distributions come from?

In what follows a number of different strategies for choosing the prior are discussed.

4.1 Conjugate prior and posterior distributions

The above Bayesian inference was based on a particular choice of a prior known as a conjugate pair. This is the case where the prior π(θ) and the posterior:
  π(θ|x0) ∝ π(θ)·L(θ; x0), for all θ∈Θ,
belong to the same family of distributions; the likelihood L(θ; x0) is family preserving.

Table 2 - Conjugate pairs (π(θ), π(θ|x0))
  Likelihood            | Prior π(θ)
  Binomial (Bernoulli)  | Beta(α, β)
  Negative Binomial     | Beta(α, β)
  Poisson               | Gamma(α, β)
  Exponential           | Gamma(α, β)
  Gamma                 | Gamma(α, β)
  Uniform               | Pareto(α, β)
  Normal, for θ = μ     | N(μ0, τ²), μ0∈R, τ² > 0
  Normal, for θ = σ²    | Inverse Gamma(α, β)

Example. Let the prior distribution be π(θ) ~ Beta(α, β); when combined with the likelihood function L(θ; x0) ∝ θ^y(1−θ)^{n−y}, θ∈[0, 1], it gives rise, as shown above, to the posterior π(θ|x0) ~ Beta(α*, β*). Table 2 presents some examples of conjugate pairs of prior and posterior distributions, as they combine with different likelihood forms.
Conjugate pairs make mathematical sense, but do they make 'modeling' sense? The various justifications in the Bayesian literature vary from 'they help the objectivity of inference' to 'they enhance the allure of the Bayesian approach as a black box', and these claims are often contradictory!
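Any row of Table 2 can be verified numerically. The sketch below (an illustration with simulated data and arbitrary hyperparameters, assuming NumPy and SciPy) checks the Poisson-Gamma pair by comparing the numerically normalized product prior × likelihood with the closed-form Gamma posterior.

```python
import numpy as np
from scipy import stats

# Conjugacy check for the Poisson-Gamma row of Table 2: a Gamma(a, rate=b) prior with a
# Poisson likelihood yields a Gamma(a + sum(x), rate=b + n) posterior.
# The simulated data and hyperparameters below are illustrative assumptions.
rng = np.random.default_rng(1)
x = rng.poisson(lam=3.0, size=25)              # simulated Poisson data
a, b = 2.0, 1.0                                # prior Gamma(shape=2, rate=1)
a_post, b_post = a + x.sum(), b + len(x)       # conjugate update

# numerical check: normalize prior x likelihood on a grid and compare with the Gamma pdf
grid = np.linspace(1e-3, 10, 2000)
unnorm = stats.gamma.pdf(grid, a, scale=1 / b) * np.exp(
    stats.poisson.logpmf(x[:, None], grid).sum(axis=0))
dx = grid[1] - grid[0]
numeric = unnorm / (unnorm.sum() * dx)
exact = stats.gamma.pdf(grid, a_post, scale=1 / b_post)
print("max abs difference:", np.max(np.abs(numeric - exact)))   # ~0 up to grid error
```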
4.2 Jeffreys invariant prior for θ

In response to Fisher's second criticism, that a uniform (proper or improper) prior distribution for θ is not invariant to reparameterizations of the form φ = h(θ), e.g. φ = ln θ, Jeffreys proposed a new class of priors which satisfy this invariance property. This family of invariant priors proposed by Jeffreys is based on Fisher's average information:
  I(θ; x) := E[(1/n)(∂ ln L(θ; x)/∂θ)²] = ∫···∫_{x∈R_X^n} (1/n)(∂ ln L(θ; x)/∂θ)² f(x; θ)dx.   (22)
Note that the above derivation involves some hand-waving, in the sense that if the likelihood function L(θ; x0) is viewed, as the Bayesians view it, as only a function of the data x0, then taking the expectation makes no sense; the expectation is with respect to the distribution of the sample f(x; θ) for all possible values of x∈R_X^n. As we can see, the derivation of I(θ; x) runs afoul of the likelihood principle, since all possible values of the sample X, not just the observed data x0, are taken into account. Note also that in the case of a random (IID) sample, the Fisher information I_n(θ; x) for the sample X := (X_1, X_2, ..., X_n) is related to the above average information via:
  I_n(θ; x) = n·I(θ; x).
In the case of a single parameter, the Jeffreys invariant prior takes the form:
  π(θ) ∝ √I(θ; x).   (23)
That is, the likelihood function determines the prior distribution. The crucial property of this prior is that it is invariant to reparameterizations of the form φ = h(θ). This follows from the fact that:
  ∂ ln L(φ; x)/∂φ = [∂ ln L(θ; x)/∂θ]·(dθ/dφ),   (24)
which implies that:
  I(φ; x) = I(θ; x)·(dθ/dφ)²  ⇒  √I(φ; x) = √I(θ; x)·|dθ/dφ|,   (25)
so that, using the change-of-variable rule with Jacobian |dθ/dφ|, the prior for φ is again of the Jeffreys form: π(φ) ∝ √I(φ; x). The intuition underlying this result is that the prior π(θ) inherits the invariance of MLE estimators to reparameterizations: the MLE φ̂ of φ = h(θ) can be derived by a simple replacement of θ with its MLE θ̂, i.e. φ̂ = h(θ̂).
The simple Bernoulli model. In view of the fact that the log-likelihood takes the form:
  ln L(θ; x) = y ln θ + (n−y) ln(1−θ),
its first two derivatives are:
  ∂ ln L(θ; x)/∂θ = y/θ − (n−y)/(1−θ),  ∂²ln L(θ; x)/∂θ² = −y/θ² − (n−y)/(1−θ)².   (26)
From the second derivative, it follows that:
  I(θ; x) = E[(1/n)(∂ ln L(θ; x)/∂θ)²] = E[−(1/n)·∂²ln L(θ; x)/∂θ²] = 1/(θ(1−θ)).   (27)
This follows directly from E(y) = nθ, since:
  (1/n)·E[y/θ² + (n−y)/(1−θ)²] = θ/θ² + (1−θ)/(1−θ)² = 1/θ + 1/(1−θ) = 1/(θ(1−θ)).
From the definition of the Jeffreys invariant prior we can deduce that for θ:
  π(θ) ∝ √I(θ; x) = √(1/(θ(1−θ))) = θ^{−1/2}(1−θ)^{−1/2}, 0 < θ < 1,   (28)
which is an 'unnormalized' Beta(1/2, 1/2) distribution; it does not integrate to one as it stands, since:
  ∫_0^1 θ^{−1/2}(1−θ)^{−1/2}dθ = π.

[Fig. 11: Jeffreys' invariant prior for the Binomial, π(θ) = θ^{−.5}(1−θ)^{−.5}/B(.5, .5).]

This is a special case of the Beta(α, β) distribution:
  π(θ) = [1/B(α, β)] θ^{α−1}(1−θ)^{β−1}, α > 0, β > 0.   (29)
The Jeffreys invariance prior (28) is also the reference prior; see Bernardo and Smith (1994).
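A Monte Carlo sketch of the derivation above (assuming NumPy, not part of the notes): it estimates Fisher's information for a single Bernoulli trial at several values of θ and confirms that it equals 1/(θ(1−θ)), so that √I(θ) is the Beta(1/2, 1/2) kernel in (28).

```python
import numpy as np

# Monte Carlo check: for one Bernoulli trial, E[score^2] = 1/(theta(1-theta)),
# hence Jeffreys' prior pi(theta) is proportional to theta^{-1/2}(1-theta)^{-1/2}.
rng = np.random.default_rng(2)
for theta in (0.1, 0.3, 0.5, 0.7, 0.9):
    x = rng.binomial(1, theta, size=500_000)
    score = x / theta - (1 - x) / (1 - theta)          # d ln f(x; theta)/d theta
    info_mc = np.mean(score**2)                        # simulated E[score^2]
    info_exact = 1 / (theta * (1 - theta))
    print(f"theta={theta}: MC info={info_mc:8.3f}, exact={info_exact:8.3f}, "
          f"Jeffreys kernel sqrt(I)={np.sqrt(info_exact):.3f}")
```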
4.3 Examples of alternative Priors for θ

A. Uniform prior: α=β=1, π(θ) ~ Beta(1, 1) = U(0, 1). This is a proper prior used by both Bayes (1763) and Laplace (1774) in conjunction with the simple Bernoulli model.

[Fig. 15: Uniform prior, Beta(1, 1).]

For this prior the posterior π(θ|x0) and the likelihood function coincide:
  π(θ|x0) = L(θ; x0) ∝ θ^y(1−θ)^{n−y}, θ∈[0, 1].   (30)
Note that in this case the posterior distribution coincides with the (normalized) likelihood function!

B. Jeffreys prior for the Negative Binomial: α=0, β=1/2, π(θ) ~ Beta(0, .5).

[Fig. 17: Jeffreys prior for the Negative Binomial, π(θ) ~ Beta(0, .5). Fig. 11: Jeffreys prior for the Binomial, π(θ) ~ Beta(.5, .5).]

The Negative Binomial case arises as a re-interpretation of the simple Bernoulli model where the likelihood function:
  L(θ; x0) ∝ θ^y(1−θ)^{n−y}, θ∈[0, 1],   (31)
is interpreted, not as arising from a sequence of n IID Bernoulli trials, but as a sequence of trials continued until a pre-specified number of successes y is achieved. In this case the distribution of the sample is Negative Binomial of the form:
  f(k; θ) = C(y+k−1, k) θ^y(1−θ)^k, k = 0, 1, 2, ...,
where the random variable K denotes the number of failures before the y-th success. Assuming that it took n trials to achieve y successes, then k = n − y, and the above density function becomes C(n−1, n−y) θ^y(1−θ)^{n−y}, giving rise to the same likelihood function (31) as the Binomial density; they differ only in the combinatorial term, which is absorbed in the proportionality constant. However, when one proceeds to derive Fisher's information by taking expectations over the random variable of interest, the answer is different, because for the Negative Binomial E(K) = y(1−θ)/θ. Hence, the derivation of Fisher's information yields:
  E[−∂²ln L(θ; x)/∂θ²] = y/θ² + E(K)/(1−θ)² = y/θ² + y/(θ(1−θ)) = y/(θ²(1−θ)).
This gives rise to the Jeffreys invariant prior:
  π(θ) ∝ √(1/(θ²(1−θ))) = θ^{−1}(1−θ)^{−1/2} ~ Beta(0, 1/2).
This derivation shows most clearly that when one takes expectations of the derivatives of the log-likelihood function to derive Fisher's information, by definition one returns to the distribution of the sample, which is a function of the random variables comprising the sample X given θ; the likelihood function treats x as fixed at x0. As shown above, the Jeffreys invariance prior for the Binomial distribution is π(θ) ~ Beta(1/2, 1/2).
In order to get some idea of the relative weights the two prior distributions attach to different values of θ, let us evaluate the (unnormalized) prior weight for θ over three different intervals, θ∈[.1, .15], θ∈[.5, .55], θ∈[.9, .95], in the two cases (these numbers are reproduced numerically in the sketch below).
Prior for the Binomial, π(θ) ∝ θ^{−1/2}(1−θ)^{−1/2}:
  ∫_{.1}^{.15} θ^{−1/2}(1−θ)^{−1/2}dθ = .152,  ∫_{.5}^{.55} θ^{−1/2}(1−θ)^{−1/2}dθ = .1,  ∫_{.9}^{.95} θ^{−1/2}(1−θ)^{−1/2}dθ = .192.
Prior for the Negative Binomial, π(θ) ∝ θ^{−1}(1−θ)^{−1/2}:
  ∫_{.1}^{.15} θ^{−1}(1−θ)^{−1/2}dθ = .433,  ∫_{.5}^{.55} θ^{−1}(1−θ)^{−1/2}dθ = .138,  ∫_{.9}^{.95} θ^{−1}(1−θ)^{−1/2}dθ = .2.

[Fig. 18: The posterior distributions based on the Jeffreys prior for the Binomial vs. the Negative Binomial, Beta(4.5, 16.5) vs. Beta(4, 16.5).]

The effect on the posterior can also be seen in fig. 18 for the case where y=4, n=20.
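The interval weights quoted above are straightforward to reproduce by numerical integration of the two unnormalized priors (a sketch assuming SciPy's quad):

```python
from scipy.integrate import quad

# Relative weight the two (unnormalized) Jeffreys priors place on the intervals in the text.
binom_prior = lambda t: t**(-0.5) * (1 - t)**(-0.5)    # Beta(1/2, 1/2) kernel (Binomial)
negbin_prior = lambda t: t**(-1.0) * (1 - t)**(-0.5)   # Beta(0, 1/2) kernel (Neg. Binomial)

for lo, hi in [(0.10, 0.15), (0.50, 0.55), (0.90, 0.95)]:
    wb, _ = quad(binom_prior, lo, hi)
    wn, _ = quad(negbin_prior, lo, hi)
    print(f"[{lo:.2f}, {hi:.2f}]: Binomial prior weight={wb:.3f}, Neg. Binomial weight={wn:.3f}")
```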
C. Haldane's (improper) prior: α=β=0, π(θ) ~ Beta(0, 0).

[Fig. 19: Haldane's prior for the Binomial, Beta(0, 0).]

This is an improper prior because ∫_0^1 θ^{−1}(1−θ)^{−1}dθ = ∞.

4.4 Choosing Prior distributions

The prior distribution π(θ) is the key to Bayesian inference, and its determination is therefore the most important step in drawing inferences. In practice there is usually insufficient (not precise enough) information to help one specify π(θ) uniquely.

4.4.1 Subjective determination and approximations for priors

A. Axiomatic approach. This approach is similar to the way utility functions are constructed from preferences. The basic argument is that people who make consistent choices in uncertain situations behave as if they had subjective prior probability distributions over the different states of nature.
B. Approximating the prior in the case where the parameter space Θ is finite. When Θ is uncountable, say Θ := [0, 1], the approach is vulnerable to the partition paradoxes.
C. Maximum entropy priors. This involves maximizing:
  H(π) = −Σ_{k=1}^m π(θ_k) ln π(θ_k),   (32)
subject to some side conditions, E_π(g_j(θ)) = η_j, j = 1, 2, ..., J, giving rise to:
  π*(θ_k) = exp{Σ_{j=1}^J λ_j g_j(θ_k)} / Σ_{i=1}^m exp{Σ_{j=1}^J λ_j g_j(θ_i)}.   (33)
Note that these priors belong to the Exponential family of distributions.
D. Parametric approximations. One chooses a prior, say π(θ) ~ Beta(α, β), and then uses sample information, such as additional data, to choose α and β.
E. Empirical Bayes. An extension of D with a fully frequentist estimation of the parameters of the prior distribution.
F. Hierarchical Bayes. A variation on D with an a priori specification of hyperprior parameters over a certain narrower range of plausible values.
Example. In the case where the prior distribution is π(θ) ~ Beta(α, β), one could assume that:
  α ~ U(0, 2),  β ~ U(0, 2),   (34)
giving rise to a new prior distribution:
  π†(θ) = (1/4)∫_0^2∫_0^2 [1/B(α, β)] θ^{α−1}(1−θ)^{β−1} dα dβ.   (35)
This prior distribution can then be approximated using numerical integration, say Simpson's rule (a numerical sketch is given at the end of this section). As shown by Welsh (1996), this gives rise to a prior with α < 1, β < 1, which looks like the Jeffreys invariance prior with a flatter middle section.

4.4.2 'Objective' prior distributions

A. Laplace's prior: Uniform over the relevant parameter space.
B. Data-invariant priors: the parameters obey the same transformations as the data, e.g. translation- and scaling-invariant priors.
C. Jeffreys invariant priors: the prior is invariant to reparameterizations φ = h(θ).
D. Reference priors: an extension of Jeffreys priors to more than one parameter by adopting a priority list for the parameters in order to define their conditional prior distributions sequentially. Consider the case where the distribution of the random variable Y depends on two unknown parameters, say f(y; θ1, θ2), where θ1 is the parameter of interest. The reference prior π_R(θ1) is obtained in three steps. Step one: derive π(θ2|θ1) as the Jeffreys prior associated with f(y|θ1, θ2), assuming that θ1 is fixed. Step two: derive the marginal distribution:
  f(y|θ1) = ∫ f(y|θ1, θ2)·π(θ2|θ1)dθ2,
by integrating out the nuisance parameter θ2. Step three: derive the Jeffreys prior π(θ1) associated with f(y|θ1); see Bernardo and Smith (1994). Note that by reversing the order of the parameters (θ1, θ2) both reference priors change!
E. Matching priors: the choice of priors that achieve approximate (asymptotic) matching between posterior and frequentist 'error probabilities', such as the coverage probabilities of Bayesian credible intervals with the corresponding frequentist confidence-interval coverage probabilities.
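As an illustration of the hierarchical strategy F in section 4.4.1, the prior π†(θ) in (35) can be approximated on a grid with Simpson's rule, as the text suggests. The sketch below (assuming NumPy and SciPy; the grid size and the small lower cutoff 1e-6 are our choices) evaluates it at a few values of θ.

```python
import numpy as np
from scipy.integrate import simpson
from scipy.special import beta as beta_fn

# Hierarchical prior (35): alpha ~ U(0,2), beta ~ U(0,2) over a Beta(alpha, beta) prior
# for theta, approximated on a grid with Simpson's rule.
a_grid = np.linspace(1e-6, 2, 201)            # avoid the boundary where B(0, .) degenerates
b_grid = np.linspace(1e-6, 2, 201)
A, B = np.meshgrid(a_grid, b_grid, indexing="ij")

def hier_prior(theta):
    integrand = theta**(A - 1) * (1 - theta)**(B - 1) / beta_fn(A, B)
    inner = simpson(integrand, x=b_grid, axis=1)      # integrate over beta
    return 0.25 * simpson(inner, x=a_grid)            # then over alpha; 1/4 = joint U(0,2)xU(0,2) density

for theta in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(f"theta={theta}: hierarchical prior density ~ {hier_prior(theta):.3f}")
```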
5 Appendix: Examples based on Jeffreys prior

For the simple Bernoulli model, consider selecting the Jeffreys invariant prior:
  π(θ) = [1/B(.5, .5)] θ^{−.5}(1−θ)^{−.5}, θ∈[0, 1].
This gives rise to a posterior distribution of the form:
  π(θ|x0) ~ Beta(y + .5, (n−y) + .5), θ∈[0, 1].

(a) For y=2, n=20 the likelihood function is L(θ; x0) ∝ θ²(1−θ)^{18}, θ∈[0, 1], and the posterior density is π(θ|x0) ~ Beta(2.5, 18.5), θ∈[0, 1]. The Bayesian point estimates are θ̃ = 1.5/19 = .0789 and θ̂_B = 2.5/21 = .119. A .95 credible interval for θ is (.0214 ≤ θ < .3803), since:
  [1/B(2.5, 18.5)] ∫_{.0214}^1 θ^{1.5}(1−θ)^{17.5}dθ = .975,  [1/B(2.5, 18.5)] ∫_{.3803}^1 θ^{1.5}(1−θ)^{17.5}dθ = .025.

(b) For y=18, n=20 the likelihood function is L(θ; x0) ∝ θ^{18}(1−θ)², θ∈[0, 1], and the posterior density is π(θ|x0) ~ Beta(18.5, 2.5), θ∈[0, 1]. The Bayesian point estimates are θ̃ = 17.5/19 = .921 and θ̂_B = 18.5/21 = .881. A .95 credible interval for θ is (.716 ≤ θ < .97862), since:
  [1/B(18.5, 2.5)] ∫_{.716}^1 θ^{17.5}(1−θ)^{1.5}dθ = .975,  [1/B(18.5, 2.5)] ∫_{.979}^1 θ^{17.5}(1−θ)^{1.5}dθ = .025.

(c) For y=72, n=80 the likelihood function is L(θ; x0) ∝ θ^{72}(1−θ)^8, θ∈[0, 1], and the posterior density is π(θ|x0) ~ Beta(72.5, 8.5), θ∈[0, 1]. The Bayesian point estimates are θ̃ = 71.5/79 = .905 and θ̂_B = 72.5/81 = .895. A .95 credible interval for θ is (.82 ≤ θ < .9515), since:
  [1/B(72.5, 8.5)] ∫_{.82}^1 θ^{71.5}(1−θ)^{7.5}dθ = .975,  [1/B(72.5, 8.5)] ∫_{.9515}^1 θ^{71.5}(1−θ)^{7.5}dθ = .025.

(d) For y=40, n=80 the likelihood function is L(θ; x0) ∝ θ^{40}(1−θ)^{40}, θ∈[0, 1], and the posterior density is π(θ|x0) ~ Beta(40.5, 40.5), θ∈[0, 1]. The Bayesian point estimates are θ̃ = 39.5/79 = .5 and θ̂_B = 40.5/81 = .5. A .95 credible interval for θ is (.3923 ≤ θ < .6525), since:
  [1/B(40.5, 40.5)] ∫_{.392}^1 θ^{39.5}(1−θ)^{39.5}dθ = .975,  [1/B(40.5, 40.5)] ∫_{.6525}^1 θ^{39.5}(1−θ)^{39.5}dθ = .025.
In view of the symmetry of the posterior distribution, even the asymptotic Normal credible interval (11) should give a good approximation. Given that θ̂_B = (y+α)/(n+α+β) = .5, the approximate credible interval is:
  P(.390 = [.5 − 1.96(√(.5(1−.5))/√80)] ≤ θ < [.5 + 1.96(√(.5(1−.5))/√80)] = .610) ≈ .95,
which provides a reasonably good approximation to the exact one.

5.1 A litany of misleading Bayesian claims concerning frequentist inference

"Broadly speaking, some of the arguments in favour of the Bayesian approach are that it is fundamentally sound, very flexible, produces clear and direct inferences and makes use of all the available information. In contrast, the classical approach suffers from some philosophical flaws, has restrictive range of inferences with rather indirect meaning and ignores prior information." (O'Hagan, 1994, p. 16)
[1] Bayesian inference is fundamentally sound because it can be given an axiomatic foundation based on coherent (rational) decision making, but frequentist inference suffers from several philosophical flaws.
[2] Frequentist inference is not very flexible and has a restrictive range of applicability. According to Koop, Poirier and Tobias (2007): "Non-Bayesians, who we hereafter refer to as frequentists, argue that situations not admitting repetition under essentially identical conditions are not within the realm of statistical enquiry, and hence 'probability' should not be used in such situations.
Frequentists define the probability of an event as its long-run relative frequency. ... that definition is nonoperational since only a finite number of trials can ever be conducted." (p. 2)
[3] Bayesian inference produces clear and direct inferences, in contrast to frequentist inference, which produces unclear and indirect inferences, e.g. credible intervals vs. confidence intervals.
[4] Bayesian inference makes use of all the available a priori information, but frequentist inference does not.
[5] A number of counter-examples, introduced by Bayesians, show that frequentist inference is fundamentally flawed.
[6] The subjectivity charge against Bayesians is misplaced because: "All statistical methods that use probability are subjective in the sense of relying on mathematical idealizations of the world. Bayesian methods are sometimes said to be especially subjective because of their reliance on a prior distribution, but in most problems, scientific judgement is necessary to specify both the 'likelihood' and the 'prior' parts of the model." (Gelman et al., 2004, p. 14) "... likelihoods are just as subjective as priors." (Kadane, 2011, p. 445)
[7] For inference purposes, the only relevant point in the sample space R_X^n is the data x0, as summarized by the likelihood function L(θ|x0), θ∈Θ. This feature of Bayesian inference is formalized by the Likelihood Principle.
Likelihood Principle. For inference purposes, the only relevant sample information pertaining to θ is contained in the likelihood function L(x0|θ), ∀θ∈Θ. Moreover, two sample realizations x0 and y0 contain the same information about θ if their likelihood functions are proportional to one another (Berger and Wolpert, 1988, p. 19).
Frequentist inference procedures, such as estimation (point and interval), hypothesis testing and prediction, invoke other realizations x∈R_X^n beyond the observed data x0, contravening the LP. Indeed, two generations of Bayesian statisticians have taken delight in poking fun at frequentist testing by quoting Jeffreys's (1939) remark about the 'absurdity' of invoking the quantifier 'for all x∈R_X^n': "What the use of P [p-value] implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems a remarkable procedure." (p. 385) [ha, ha, ha!]