BEYOND
SIGNIFICANCE
TESTING
Statistics Reform in
the Behavioral Sciences
SECOND EDITION
Rex B. Kline
Published by
American Psychological Association
750 First Street, NE
Washington, DC 20002
www.apa.org

To order
APA Order Department
P.O. Box 92984
Washington, DC 20090-2984
Tel: (800) 374-2721; Direct: (202) 336-5510
Fax: (202) 336-5502; TDD/TTY: (202) 336-6123
Online: www.apa.org/books/
E-mail: order@apa.org

In the U.K., Europe, Africa, and the Middle East, copies may be ordered from
American Psychological Association
3 Henrietta Street
Covent Garden, London
WC2E 8LU England
The opinions and statements published are the responsibility of the authors, and such
opinions and statements do not necessarily represent the policies of the American
Psychological Association.
Kline, Rex B.
Beyond significance testing : statistics reform in the behavioral sciences / Rex B. Kline.
p. cm.
“Second edition”—Introduction.
Rev. ed. of: Beyond significance testing : reforming data analysis methods in behavioral
research. c2004.
Includes bibliographical references and index.
ISBN-13: 978-1-4338-1278-1
ISBN-10: 1-4338-1278-9
1. Psychometrics. I. Title.
BF39.K59 2013
150.72'4—dc23
2012035086
DOI: 10.1037/14136-000
For my family,
Joanna, Julia Anne, and Luke Christopher,
and
my brother,
Don Neil Justin Dwayne Foxworth (1961–2011),
fellow author
And so it is with us: we face change, much of it hard,
whether we like it or not.
But it is in the hard times especially that we grow,
that we become transformed.
—Patrick Doyle
CONTENTS
Acknowledgments....................................................................................... xi
Introduction.................................................................................................. 3
III. Alternatives to Significance Testing.............................................. 263
Chapter 9. Replication and Meta-Analysis..................................... 265
Chapter 10. Bayesian Estimation and Best Practices Summary........ 289
References................................................................................................. 313
Index......................................................................................................... 335
About the Author..................................................................................... 349
Acknowledgments
It was a privilege to work once again with the APA Books staff,
including Linda Malnasi McCarter, who helped to plan the project; Beth
Hatch, who worked with the initial draft and offered helpful suggestions;
Dan Brachtesende, who shepherded the book through the various pro-
duction stages; Ron Teeter, who oversaw copyediting and organized the
book’s design; and Robin Easson, who copyedited the technically complex
manuscript while helping to improve the presentation. Bruce Thompson
reviewed the complete first draft and gave many helpful suggestions. Any
remaining shortcomings in the presentation are solely my own. My loving
family was again at my side the whole time. Thanks Joanna, Julia, and Luke.
Introduction
The goals of this second edition are basically the same as those of the
original. This book introduces readers to the principles and practice of sta-
tistics reform in the behavioral sciences. It (a) reviews the now even larger
literature about shortcomings of significance testing; (b) explains why these
criticisms have sufficient merit to justify major changes in the ways research-
ers analyze their data and report the results; (c) helps readers acquire new
skills concerning interval estimation and effect size estimation; and (d) reviews
alternative ways to test hypotheses, including Bayesian estimation. I aim to
change how readers think about data analysis, especially among those with
traditional backgrounds in statistics where significance testing was presented
as basically the only way to test hypotheses. I want all readers to know that
there is a bigger picture concerning the analysis that blind reliance on signifi-
cance testing misses.
I wrote this book for researchers and students in psychology and other
behavioral sciences who do not have strong quantitative backgrounds. I
assume that the reader has had undergraduate courses in statistics that cov-
ered at least the basics of regression and factorial analysis of variance. Each
substantive chapter emphasizes fundamental statistical concepts but does
not get into the minutiae of statistical theory. Works that do so are cited
throughout the text, and readers can consult such works when they are ready.
I emphasize instead both the sense and the nonsense of common data analysis
practices while pointing out alternatives that I believe are more scientifi-
cally strong. I do not shield readers from complex topics, but I try to describe
such topics using clear, accessible language backed up by numerous examples.
This book is suitable as a textbook for an introductory course in behavioral
science statistics at the graduate level. It can also be used in undergraduate-
level courses for advanced students, such as honors program students, about
modern methods of data analysis. Especially useful for all readers are Chapters
3 and 4, which respectively consider the logic and illogic of significance test-
ing and misinterpretations about the outcomes of statistical tests. These mis-
interpretations are so widespread among researchers and students alike that
one can argue that data analysis practices in the behavioral sciences are based
more on myth than fact.
That the first edition of this book was so well reviewed and widely cited
was very satisfying. I also had the chance to correspond with hundreds of read-
ers from many different backgrounds where statistics reform is increasingly
important. We share a common sense that the behavioral sciences should
be doing better than they really are concerning the impact and relevance of
research. Oh, yes, the research literature is very large, but quantity does not
in this case indicate quality, and many of us know that most published studies
in the behavioral sciences have very little impact. Indeed, most publications
are never cited again by authors other than those of the original works, and
part of the problem has been our collective failure to modernize our methods
of data analysis and describe our findings in ways relevant to target audiences.
New to this edition is coverage of robust statistical methods for param-
eter estimation, effect size estimation, and interval estimation. Most data sets
in real studies do not respect the distributional assumptions of parametric
statistical tests, so the use of robust statistics can lend a more realistic tenor
to the analysis. Robust methods are described over three chapters (2, 3, and
5), but such methods do not remedy the major shortcomings of significance
testing. There is a new chapter (3) about the logic and illogic of significance
testing that deals with issues students rarely encounter in traditional statistics
courses. There is expanded coverage of interval estimation in all chapters
and also of Bayesian estimation as an increasingly viable alternative to tradi-
tional significance testing. Exercises are included for chapters that deal with
fundamental topics (2–8). A new section in the last chapter summarizes best
practice recommendations.
1
Changing Times
This chapter explains the basic rationale of the movement for statis-
tics reform in the behavioral sciences. It also identifies critical limitations
of traditional significance testing that are elaborated throughout the book
and reviews the controversy about significance testing in psychology and
other disciplines. I argue that overreliance on significance testing as basi-
cally the sole way to evaluate hypotheses has damaged the research literature
and impeded the development of psychology and other areas as empirical
sciences. Alternatives are introduced that include using interval estimation
of effect sizes, taking replication seriously, and focusing on the substantive
significance of research results instead of just on whether or not they are
statistically significant. Prospects for further reform of data analysis methods
are also considered.
Précis of Statistics Reform
Cognitive Errors
cheaper method has the same chance of making correct decisions in the long
run (F. L. Schmidt & Hunter, 1997).
A consequence of low power is that the research literature is often dif-
ficult to interpret. Specifically, if there is a real effect but power is only .50,
about half the studies will yield statistically significant results and the rest
will yield no statistically significant findings. If all these studies were some-
how published, the number of positive and negative results would be roughly
equal. In an old-fashioned, narrative review, the research literature would
appear to be ambiguous, given this balance. It may be concluded that “more
research is needed,” but any new results will just reinforce the original ambi-
guity, if power remains low.
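A small Monte Carlo sketch can make this concrete. The following Python code (NumPy and SciPy assumed; the normal populations, effect size, and group size are hypothetical values chosen so that power is near .50) simulates many replications of a two-group study with a real effect:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d, n, alpha = 0.5, 32, .05      # hypothetical effect size and group size; power is near .50
trials, significant = 10_000, 0
for _ in range(trials):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(d, 1.0, n)     # a real effect of d standard deviations
    _, p = stats.ttest_ind(treated, control)
    significant += p < alpha
print(significant / trials)             # roughly .50: about half the replications "succeed"

Under these assumptions, about half of the simulated studies reject the null hypothesis and half do not, even though the population effect never changes.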
Confusing statistical significance with scientific relevance unwittingly
legitimizes fad topics that clutter the literature but have low substantive
value. With little thought about a broader rationale, one can collect data
and then apply statistical tests. Even if the numbers are random, some of the
results are expected to be statistically significant, especially in large samples.
The objective appearance of significance testing can lend an air of credibility
to studies with otherwise weak conceptual foundations. This is especially true
in “soft” research areas where theories are neither convincingly supported nor
discredited but simply fade away as researchers lose interest (Meehl, 1990).
This lack of cumulativeness led Lykken (1991) to declare that psychology
researchers mainly build castles in the sand.
Statistical tests of a treatment effect that is actually clinically signifi-
cant may fail to reject the null hypothesis of no difference when power is
low. If the researcher in this case ignored whether the observed effect size is
clinically significant, a potentially beneficial treatment may be overlooked.
This is exactly what was found by Freiman, Chalmers, Smith, and Kuebler
(1978), who reviewed 71 randomized clinical trials of mainly heart- and
cancer-related treatments with “negative” results (i.e., not statistically sig-
nificant). They found that if the authors of 50 of the 71 trials had considered
the power of their tests along with the observed effect sizes, those authors
should have concluded just the opposite, or that the treatments resulted in
clinically meaningful improvements.
If researchers become too preoccupied with statistical significance, they
may lose sight of other, more important aspects of their data, such as whether
the variables are properly defined and measured and whether the data respect
test assumptions. There are clear problems in both of these areas. One is
the measurement crisis, which refers to a substantial decline in the quality
of instruction about measurement in psychology over the last 30 years or
so. Psychometrics courses have disappeared from many psychology under-
graduate programs, and about one third of psychology doctoral programs in
North America offer no formal training in this area at all (Aiken et al., 1990;
but its accuracy is dubious, given the issues just raised. If p values
are generally wrong, so too are decisions based on them.
3. Probabilities from statistical tests (p values) generally assume
that all other sources of error besides sampling error are nil. This
includes measurement error; that is, it is assumed that rXX = 1.00,
where rXX is a score reliability coefficient. Other sources of error
arise from failure to control for extraneous sources of variance
or from flawed operational definitions of hypothetical constructs.
It is absurd to assume in most studies that there is no error vari-
ance besides sampling error. Instead it is more practical to expect
that sampling error makes up the small part of all possible kinds
of error when the number of cases is reasonably large (Ziliak &
McCloskey, 2008).
The p values from statistical tests do not tell researchers what they want
to know, which often concerns whether the data support a particular hypoth-
esis. This is because p values merely estimate the conditional probability of
the data under a statistical hypothesis—the null hypothesis—that in most
studies is an implausible, straw man argument. In fact, p values do not directly
“test” any hypothesis at all, but they are often misinterpreted as though they
describe hypotheses instead of data.
Although p values ultimately provide a yes-or-no answer (i.e., reject
or fail to reject the null hypothesis), the question—p < α?, where α is the
criterion level of statistical significance, usually .05 or .01—is typically unin-
teresting. The yes-or-no answer to this question says nothing about scientific
relevance, clinical significance, or effect size. This is why Armstrong (2007)
remarked that significance tests do not aid scientific progress even when they
are properly done and interpreted.
(1997a) referred to the law of the diffusion of idiocy, which says that every
foolish practice of significance testing will beget a corresponding misstep with
confidence intervals. This law applies to effect sizes, too. But misinterpretation
of the new statistics is less likely to occur if researchers can refrain from apply-
ing the same old, dichotomous thinking from significance testing. Thinking
meta-analytically can also help to prevent misunderstanding.
You should know that measuring effect size in treatment outcome stud-
ies is insufficient to determine clinical significance, especially when outcomes
have arbitrary (uncalibrated) metrics with no obvious connection to real-
world status. An example is a 7-point Likert scale for an item on a self-report
measure. This scale is arbitrary because its points could be represented with
different sets of numbers, such as 1 through 7 versus -3 through 3 in whole-
number increments, among other possibilities. The total score over a set of such
items is arbitrary, too. It is generally unknown for arbitrary metrics (a) how a
1-point difference reflects the magnitude of change on the underlying con-
struct and (b) exactly at what absolute points along the latent dimension
observed scores fall. As Andersen (2007) noted, “Reporting effect sizes on
arbitrary metrics alone with no reference to real-world behaviors, however,
is no more meaningful or interpretable than reporting p values” (p. 669).
So, determining clinical significance is not just a matter of statistics; it also
requires strong knowledge about the subject matter.
These points highlight the idea that the evaluation of the clinical, prac-
tical, theoretical, or, more generally, substantive significance of observed
effect sizes is a qualitative judgment. This judgment should be informed and
open to scrutiny, but it will also reflect personal values and societal concerns.
This is not unscientific because the assessment of all results in science involves
judgment (Kirk, 1996). It is better to be open about this fact than to base deci-
sions solely on “objective,” mechanically applied statistical rituals that do not
address substantive significance. Ritual is no substitute for critical thinking.
Retrospective
normality). An older term for the standard error—actually two times the
square root of the standard error—is the modulus, described in 1885 by the
economist Francis Ysidro Edgeworth (Stigler, 1978) to whom the term statis-
tical significance is attributed. From about 1940–1960, during what Gigerenzer
and Murray (1987) called the inference revolution, the Intro Stats method
was widely adopted in psychology textbooks and journal editorial practice
as the method to test hypotheses. The move away from the study of single
cases (e.g., operant conditioning studies) to the study of groups over roughly
1920–1950 contributed to this shift. Another factor is what Gigerenzer
(1993) called the probabilistic revolution, which introduced indeterminism
as a major theoretical concept in areas such as quantum mechanics in order
to better understand the subject matter. In psychology, though, it was used to
mechanize the inference process, a critical difference, as it turns out.
After the widespread adoption of the Intro Stats method, there was
an increase in the reporting of statistical tests in journal articles in psychol-
ogy. This trend is obvious in Figure 1.1, reproduced from Hubbard and Ryan
(2000). They sampled about 8,000 articles published during 1911–1998 in
randomly selected issues of 12 different APA journals. Summarized in the fig-
ure are percentages of articles in which results of statistical tests were reported.
[Figure 1.1. Percentage of articles in 12 APA journals, 1911–1998, in which results of statistical tests were reported (x-axis: Year; y-axis: Percentage).]
interactions, Instructions × Incentive, F(1, 72) = 11.95, p < .001; Instruc-
tions × Goals, F(1, 72) = 25.40, p < .01; Incentive × Goals, F(1, 72) = 9.25,
p < .01, and two of three of the main effects, Instructions, F(1, 72) = 11.60,
p < .01; Goals, F(1, 72) = 6.25, p < .05.
This text chockablock with numbers—which is poor writing style—says
nothing about the magnitudes of all those “significant” effects. If later in
the hypothetical article the reader is still given no information about effect
sizes, that is sizeless science. Getting excited about “significant” results while
knowing nothing about their magnitudes is like ordering from a restaurant
menu with no prices: You may get a surprise (good or bad) when the bill
(statement of effect size) comes.
There has been controversy about statistical tests for more than 80 years,
or as long as they have been around. Boring (1919), Berkson (1942), and
Rozeboom (1960) are among earlier works critical of significance testing.
Numbers of published articles critical of significance testing have increased
exponentially since the 1940s. For example, Anderson, Burnham, and
Thompson (2000) found fewer than 100 such works published during the
1940s–1970s in ecology, medicine, business, economics, or the behavioral
sciences, but about 200 critical articles were published in the 1990s. W. L.
Thompson (2001) listed a total of 401 references for works critical of sig-
nificance testing, and Ziliak and McCloskey (2008, pp. 57–58) cited 125
such works in psychology, education, business, epidemiology, and medicine,
among other areas.
The fifth edition of the Publication Manual (APA, 2001) took a stand
similar to that of the TFSI regarding significance testing. That is, it acknowl-
edged the controversy about statistical tests but stated that resolving this
issue was not a proper role of the Publication Manual. The fifth edition went
on to recommend the following:
1. Report adequate descriptive statistics, such as means, variances,
and sizes of each group and a pooled within-groups variance–
covariance matrix in a comparative study. This information is
necessary for later meta-analyses or secondary analyses by others.
2. Effect sizes should “almost always” be reported, and the absence
of effect sizes was cited as an example of a study defect.
3. The use of confidence intervals was “strongly recommended”
but not required.
The sixth edition of the Publication Manual (APA, 2010) used similar
language when recommending the reporting of effect sizes and confidence
intervals. Predictably, not everyone is happy with the report of the TFSI or
the wording of the Publication Manual. B. Thompson (1999) noted that only
encouraging the reporting of effect sizes or confidence intervals presents a
self-canceling mixed message. Ziliak and McCloskey (2008, p. 125) chastised
the Publication Manual for “retaining the magical incantations of p < .05 and
p < .01.” S. Finch, Cumming, and Thomason (2001) contrasted the rec-
ommendations about statistical analyses in the Publication Manual with the
more straightforward guidelines in the Uniform Requirements for Manuscripts
Submitted to Biomedical Journals, recently revised (International Committee
of Medical Journal Editors, 2010). Kirk (2001) urged that the then-future
sixth edition of the Publication Manual should give more detail than the fifth
edition about the TFSI’s recommendations. Alas, the sixth edition does not
contain such information, but I aim to provide you with specific skills of this
type as you read this book.
Journal editors and reviewers are the gatekeepers of the research litera-
ture, so editorial policies can affect the quality of what is published. Described
next are three examples of efforts to change policies in reform-oriented direc-
tions with evaluations of their impact; see Fidler, Thomason, Cumming,
Finch, and Leeman (2004) and Fidler et al. (2005) for more examples.
Kenneth J. Rothman was the assistant editor of the American Journal
of Public Health (AJPH) from 1984 to 1987. In his revise-and-submit letters,
Rothman urged authors to remove from their manuscripts all references to p
values (e.g., Fidler et al., 2004, p. 120). He founded the journal Epidemiology
in 1990 and served as its first editor until 2000. Rothman’s (1998) editorial
letter to potential authors was frank:
When writing for Epidemiology, you can . . . enhance your prospects if
you omit tests of statistical significance. . . . In Epidemiology, we do not
publish them at all. . . . We discourage the use of this type of thinking
in the data analysis. . . . We also would like to see the interpretation of
a study based not on statistical significance, or lack of it . . . but rather
on careful quantitative consideration of the data in light of competing
explanations for the findings. (p. 334)
Fidler et al. (2004) examined 594 AJPH articles published from 1982
to 2000 and 100 articles published in Epidemiology between 1990 and 2000.
Reporting based solely on statistical significance dropped from about 63%
of the AJPH articles in 1982 to about 5% of articles in 1986–1989. But in
many AJPH articles there was evidence that interpretation was based mainly
on undisclosed significance test results. The percentages of articles in which
confidence intervals were reported increased from about 10% to 54% over
the same period. But these changes in reporting practices in AJPH articles
did not generally persist past Rothman’s tenure.
Miller (2006) reviewed 736 articles published over 2002–2005 in five dif-
ferent applied, experimental, or personnel psychology journals. The overall
rate of effect size reporting was about 62.5%. Among studies where no effect
sizes were reported, use of the techniques of analysis of variance (ANOVA)
and the t test were prevalent. Later I will show you that effect sizes are actu-
ally easy to calculate in such analyses, so there is no excuse for not report-
ing them. Andersen (2007) found that in a total of 54 articles published in
2005 in three different sport psychology journals, effect sizes were reported in
44 articles, or 81%. But the authors of only seven of these articles interpreted
effect sizes in terms of substantive significance. Sun, Pan, and Wang (2010)
reviewed a total of 1,243 works published in 14 different psychology and
education journals during 2005–2007. The percentage of articles reporting
effect sizes was 49%, and 57% of these authors interpreted their effect sizes.
Evidence for progress in statistics reform is thus mixed. Researchers
seem to report effect sizes more often, but improvement in reporting confi-
dence intervals may lag behind. Too many authors do not interpret the effect
sizes they report, which avoids the question of why an effect of this size
matters. It is poor practice to compute effect sizes only for statisti-
cally significant results. Doing so amounts to business as usual where the
significance test is still at center stage (Sohn, 2000). Real reform means that
effect sizes are interpreted for their substantive significance, not just reported.
Obstacles to Reform
There are two great obstacles to continued reform. The first is inertia:
It is human nature to resist change, and it is hard to give up familiar routines.
Belasco and Stayer (1993) put it like this: “Most of us overestimate the value
of what we currently have, and have to give up, and underestimate the value
of what we may gain” (p. 312). But science demands that researchers train
the lens of skepticism on their own assumptions and methods. Such self-
criticism and intellectual honesty do not come easy, and not all researchers
are up for the task. Defense attorney Gerry Spence (1995) wrote, “I would
rather have a mind opened by wonder than one closed by belief” (p. 98). This
conviction identifies a scientist’s special burden.
The other big obstacle is vested interest, which is in part economic.
I am speaking mainly about applying for research grants. Most of us know
that grant monies are allocated in part on the assurance of statistical signifi-
cance. Many of us also know how to play the significance game, which goes
like this: Write application. Promise significance. Get money. Collect data
until significance is found, which is virtually guaranteed because any effect
that is not zero needs only a large enough sample in order to be significant.
Prospective
I have no crystal ball, but I believe that I can reasonably speculate about
three anticipated developments in light of the events just described:
1. The role of significance testing will continue to get smaller
and smaller to the point where researchers must defend its use.
This justification should involve explanation of why the narrow
assumptions about sampling and score characteristics in signifi-
cance testing are not unreasonable in a particular study. Estima-
tion of a priori power will also be required whenever statistical
tests are used. I and others (e.g., Kirk, 1996) envision that the
behavioral sciences will become more like the natural sciences.
That is, we will report the directions, magnitudes, and preci-
sions of our effects; determine whether they replicate; and eval-
uate them for their substantive significance, not simply their
statistical significance.
2. I expect that the best behavioral science journals will require
evidence for replication. This requirement would send the
strong message that replication is standard procedure. It would
also reduce the number of published studies, which may actu-
ally improve quality by reducing noise (one-shot studies, unsub-
stantiated claims) while boosting signal (replicated results).
3. I concur with Rodgers (2010) that a “quiet methodological rev-
olution” is happening that is also part of statistics reform. This
revolution concerns the shift from testing individual hypotheses
for statistical significance to the evaluation of entire mathe-
matical and statistical models. There is a limited role for signifi-
cance tests in statistical modeling techniques such as structural
equation modeling (e.g., Kline, 2010, Chapter 8), but it requires
that researchers avoid making the kinds of decision errors often
associated with such tests.
Conclusion
Learn More
Listed next are three works about the significance testing controversy
from fields other than psychology, including Armstrong (2007) in forecasting;
2
Sampling and Estimation
In times of change, learners inherit the Earth, while the learned find
themselves beautifully equipped to deal with a world that no longer exists.
—Eric Hoffer (1973, p. 22)
The cost of the 2010 Census in the United States was about $13 billion, and
some 635,000 temporary workers were hired for it (U.S. Census Bureau, 2010). It may be
practically impossible to study even much smaller populations. The base rate
of schizophrenia, for example, is about 1%. But if persons with schizophrenia
are dispersed over a large geographic area, studying all of them is probably
impracticable.
Types of Samples
Sampling Error
This discussion assumes a population size that is very large and assumes
that the size of each sample is a relatively small proportion of the total popu-
lation size. There are some special corrections if the population size is small,
such as less than 5,000 cases, or if the sample size exceeds 20% or so of the
population size that are not covered here (see S. K. Thompson [2012] for
more information).
Values of population parameters, such as means (µ) or variances (σ2),
are usually unknown. They are instead estimated with sample statistics, such
as M (means) or s2 (variances). Statistics are subject to sampling error, which
refers to the difference between an estimator and the corresponding param-
eter (e.g., µ − M). These differences arise because the values of statistics vary
over random samples drawn from the same population. The total sum of
squared deviations of scores from the sample mean,

SS = Σ (X − M)2 (2.1)

takes on the lowest value possible in a particular sample. Due to these proper-
ties, sample means are described as least squares estimators. The statistic M is
also an unbiased estimator because its expected value across random samples
of the same size is the population mean µ.
The sample variance

s2 = SS/df (2.2)

is an unbiased estimator of the population variance σ2, but the alternative

S2 = SS/N (2.3)

is a negatively biased estimator because its values are on average less than
σ2. The reason is that squared deviations are taken from M (Equation 2.1),
which is not likely to equal µ. Therefore, sample sums of squares are generally
too small compared with taking squared deviations from µ. The division of SS
by df instead of N, which makes the whole ratio larger (s2 > S2), is sufficient
to render s2 an unbiased estimator. In larger samples, though, the values of
s2 and S2 converge, and in very large samples they are asymptotically equal.
Expected values of positively biased estimators exceed those of the corre-
sponding parameter.
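A brief simulation can illustrate the bias just described. The sketch below (Python with NumPy; the population values are hypothetical) draws many random samples and compares the average of S2 = SS/N with that of s2 = SS/df:

import numpy as np

rng = np.random.default_rng(2)
sigma2, n = 100.0, 10                  # hypothetical population variance and sample size
S2_vals, s2_vals = [], []
for _ in range(50_000):
    x = rng.normal(50.0, np.sqrt(sigma2), n)
    S2_vals.append(np.var(x, ddof=0))  # SS/N, negatively biased
    s2_vals.append(np.var(x, ddof=1))  # SS/df, unbiased
print(np.mean(S2_vals))                # about 90.0, i.e., (df/N) * sigma2: too small
print(np.mean(s2_vals))                # about 100.0, matching the parameter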
There are ways to correct other statistics for bias. For example, although
s2 is an unbiased estimator of σ2, the sample standard deviation s is a negatively
biased estimator of σ. An approximate correction is

σ̂ = [1 + 1/(4 df)] s (2.4)
The standard error of the mean and its sample estimator are, respectively,

σM = σ/√N (2.5)

and

sM = s/√N (2.6)
As either sample variability decreases or the sample size increases, the value
of sM decreases. For example, given s = 10.00, sM equals 10.00/√25, or 2.00,
for N = 25, but for N = 100 the value of sM is 10.00/√100, or 1.00. That is,
the standard error is twice as large for N = 25 as it is for N = 100. A graphi-
cal illustration is presented in Figure 2.1. An original normal distribution is
shown along with three different sampling distributions of M based on N = 4,
16, or 64 cases. Variability of the sampling distributions in the figure decreases
as the sample size increases.
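The relation sM = s/√N is easy to verify numerically. This Python sketch (NumPy assumed) reproduces the calculations just given and then checks, in the spirit of Figure 2.1, that the empirical spread of sample means shrinks as N increases:

import numpy as np

s = 10.00
for N in (25, 100):
    print(N, s / np.sqrt(N))           # s_M = 2.00 for N = 25 and 1.00 for N = 100

rng = np.random.default_rng(3)
for N in (4, 16, 64):                  # the sample sizes from Figure 2.1
    means = rng.normal(0.0, 10.0, (20_000, N)).mean(axis=1)
    print(N, means.std(ddof=1))        # close to 10/sqrt(N): about 5.0, 2.5, 1.25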
The standard error sM, which estimates variability of the group statistic
M, is often confused with the standard deviation s, which measures vari-
ability at the case level. This confusion is a source of misinterpretation of
both statistical tests and confidence intervals (Streiner, 1996).
Figure 2.1. An original distribution of scores and three distributions of random sample
means each based on different sample sizes, N = 4, N = 16, or N = 64.
The standard error of the sample variance s2 and its estimator are, respectively,

σs2 = σ2 √(2/df) (2.7)

and

ss2 = s2 √(2/df) (2.8)
1 This central t distribution calculator accepts either integer or noninteger df values: http://www.usablestats.com/calcs/tinv
2 It can also refer to including irrelevant predictors, estimating linear relations only when the true relation
Interval Estimation

The general form of a 100 (1 − α)% confidence interval for µ is

M ± sM [t2-tail, α (N − 1)] (2.9)

where the term in brackets is the positive two-tailed critical value in a cen-
tral t distribution with N − 1 degrees of freedom at the α level of statistical
significance. Suppose that

sM = 9.00/√25 = 1.800

and t2-tail, .05 (24) = 2.064. The 95% confidence interval for µ is thus
M ± 1.800 (2.064), or M ± 3.72.
The standard error of the difference between two independent means is

σM1−M2 = √(σ12/n1 + σ22/n2) (2.10)

where σ12 and σ22 are the population variances and n1 and n2 are the sizes of
each group. If we assume homogeneity of population variance or homosce-
dasticity (i.e., σ12 = σ22), the expression for the standard error reduces to

σM1−M2 = √[σ2 (1/n1 + 1/n2)] (2.11)

which is estimated in samples as

sM1−M2 = √[s2pool (1/n1 + 1/n2)] (2.12)
where s2pool is the weighted average of the within-groups variances. Its equa-
tion is

s2pool = (df1 s12 + df2 s22)/(df1 + df2) = SSW/dfW (2.13)

where s12 and s22 are the group variances, df1 = n1 − 1, df2 = n2 − 1, and SSW and
dfW are, respectively, the pooled within-groups sum of squares and the degrees
of freedom. The latter can also be expressed as dfW = N − 2. Only when the
group sizes are equal can s2pool also be calculated as the simple average of the
two group variances, or (s12 + s22)/2.
The general form of a 100 (1 − α)% confidence interval for µ1 − µ2 based
on the difference between two independent means is

(M1 − M2) ± sM1−M2 [t2-tail, α (N − 2)] (2.14)

Suppose that M1 − M2 = 2.00, s2pool = 6.25, and n1 = n2 = 10. The estimated
standard error is

sM1−M2 = √[6.25 (1/10 + 1/10)] = 1.118

and t2-tail, .05 (18) = 2.101. The 95% confidence interval for µ1 − µ2 is
2.00 ± 1.118 (2.101), or 2.00 ± 2.35,
which defines the interval [−.35, 4.35]. On the basis of these results, we can
say that µ1 − µ2 could be as low as −.35 or as high as 4.35, with 95% confidence.
The specific interval [-.35, 4.35] includes zero as an estimate of µ1 – µ2.
This fact is subject to misinterpretation. For example, it may be incorrectly
concluded that µ1 = µ2 because zero falls within the interval. But zero is only
one value within a range of estimates of µ1 – µ2, so it has no special status
in interval estimation. Confidence intervals are subject to sampling error, so
zero may not be included within the 95% confidence interval in a replication.
Confidence intervals also assume that other sources of error are nil. All these
caveats should reduce the temptation to fixate on a particular value (here,
zero) in a confidence interval.
There is a special relation between a confidence interval for µ1 – µ2 and
the outcome of the independent samples t test based on the same data:
Whether a 100 (1 – a)% confidence interval for µ1 – µ2 includes zero yields
an outcome equivalent to either rejecting or not rejecting the corresponding
null hypothesis at the a level of statistical significance for a two-tailed test.
For example, the specific 95% confidence interval [-.35, 4.35] includes zero;
thus, the outcome of the t test for these data of H0: µ1 – µ2 = 0 is not statisti-
cally significant at the .05 level, or
t (18) = 2.00/1.118 = 1.789, p = .091
But if zero is not contained within a particular 95% confidence interval for
µ1 – µ2, the outcome of the independent samples t test will be statistically
significant at the .05 level.
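Both the interval and the equivalent t test can be reproduced from the summary statistics for this example. The following Python sketch (NumPy and SciPy assumed) uses the group statistics given in the caption of Figure 2.3:

import numpy as np
from scipy import stats

M1, M2, s2_1, s2_2, n1, n2 = 13.00, 11.00, 7.50, 5.00, 10, 10
df = n1 + n2 - 2                                    # 18
s2_pool = ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / df  # 6.25 (Equation 2.13)
se = np.sqrt(s2_pool * (1 / n1 + 1 / n2))           # 1.118 (Equation 2.12)
tcrit = stats.t.ppf(.975, df)                       # 2.101
diff = M1 - M2                                      # 2.00
print(diff - tcrit * se, diff + tcrit * se)         # about [-.35, 4.35]

t = diff / se                                       # 1.789
print(2 * stats.t.sf(abs(t), df))                   # p about .091; zero is inside the
                                                    # interval, so H0 is not rejected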
Be careful not to falsely believe that confidence intervals are just statis-
tical tests in disguise (B. Thompson, 2006a). One reason is that null hypoth-
eses are required for statistical tests but not for confidence intervals. Another
is that many null hypotheses have little if any scientific value. For example,
Anderson et al. (2000) reviewed null hypotheses tested in several hundred
empirical studies published from 1978 to 1998 in two environmental sciences
journals. They found many implausible null hypotheses that specified things
such as equal survival probabilities for juvenile and adult members of a spe-
cies or that growth rates did not differ across species, among other assump-
tions known to be false before collecting data. I am unaware of a similar
survey of null hypotheses in the behavioral sciences, but I would be surprised
if the results were very different.
Confidence intervals over replications may be less susceptible to mis-
interpretation than results of statistical tests. Summarized in Table 2.1 are
outcomes of six hypothetical replications where the same two conditions
are compared on the same outcome variable. Results of the independent
samples t test lead to rejection of the null hypothesis at p < .05 in three out
of six studies, a “tie” concerning statistical significance (3 yeas, 3 nays). More
informative than the number of null hypothesis replications is the average
of M1 – M2 across all six studies, 3.54. This average is from a meta-analysis
of all results in the table for a fixed effects model, where a single population
effect size is presumed to underlie the observed contrasts. (I show you how
to calculate this average in Chapter 9.) The overall average of 3.54 may be a
better estimate of µ1 - µ2 than M1 – M2 in any individual study because it is
based on all available data.
The 95% confidence intervals for µ1 – µ2 in Table 2.1 are shown in
Figure 2.2 as error bars in a forest plot, which displays results from replications
and a meta-analytic weighted average with confidence intervals (Cumming,
2012). The 95% confidence interval based on the overall average of 3.54, or
[2.53, 4.54] (see Table 2.1), is narrower than any of the intervals from the six
replications (see Figure 2.2). This is because more information contributes
to the confidence interval based on results averaged over all replications. For
these data, µ1 – µ2 may be as low as 2.53 or as high as 4.54, with 95% confi-
dence based on all available data.
Figure 2.2. A forest plot of 95% confidence intervals for µ1 – µ2 based on mean
differences from the six replications in Table 2.1 and the meta-analytic 95%
confidence interval for µ1 – µ2 across all replications for a fixed effects model.
You should verify for these data the results presented next:

95% CI for µ1: 13.00 ± .866 (2.262), or [11.04, 14.96]
95% CI for µ2: 11.00 ± .707 (2.262), or [9.40, 12.60]
These confidence intervals for µ are plotted in Figure 2.3 along with the 95%
confidence interval for µ1 – µ2 for these data [-.35, 4.35]. Group means are
represented on the y-axis, and the mean contrast (2.00) is represented on
the floating difference axis (Cumming, 2012) centered at the grand mean
across both groups (12.00). The error bars of the 95% confidence intervals
for µ overlap by clearly more than one half of their lengths. According to
the overlap rule, this amount of overlap is more than moderate. So the mean
difference should not be statistically significant at the .05 level, which is true
for these data.
Figure 2.3. Plot of the 95% confidence interval for µ1, 95% confidence interval for µ2,
and 95% confidence interval for µ1 – µ2, given M1 = 13.00, s12 = 7.50, M2 = 11.00, s22 =
5.00, and n1 = n2 = 10. Results for the mean difference are shown on a floating differ-
ence axis where zero is aligned at the grand mean across both samples (12.00).
When homoscedasticity is not assumed, the standard error of the mean dif-
ference is estimated in the Welch procedure as

sWel = √(s12/n1 + s22/n2) (2.15)

where s12 estimates σ12 and s22 estimates σ22 (i.e., heteroscedasticity is allowed).
The degrees of freedom for the critical value of central t in the Welch proce-
dure are estimated empirically as

dfWel = (s12/n1 + s22/n2)2 / [(s12)2/(n12 (n1 − 1)) + (s22)2/(n22 (n2 − 1))] (2.16)
Variability among cases in the first group is obviously greater than that in the
second group. A pooled within-groups variance would mask this discrepancy.
The researcher elects to use the Welch procedure. The estimated standard
error is
sWel = √(75.25/25 + 15.00/20) = 1.939
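A sketch of the same calculation in Python (NumPy and SciPy assumed), including the empirical degrees of freedom of Equation 2.16, which are generally not integers:

import numpy as np
from scipy import stats

s2_1, n1 = 75.25, 25
s2_2, n2 = 15.00, 20
v1, v2 = s2_1 / n1, s2_2 / n2
se_wel = np.sqrt(v1 + v2)                                   # 1.939 (Equation 2.15)
df_wel = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
print(se_wel, df_wel)                                       # df about 34.7
print(stats.t.ppf(.975, df_wel))                            # noninteger df is accepted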
In designs with two dependent samples, the standard error of the mean
difference MD is

σMD = σD/√n (2.18)

where n is the group size and σD is the population standard deviation of the
difference scores, which can be expressed as

σD2 = 2σ2 (1 − ρ12) (2.19)

where ρ12 is the population cross-conditions correlation. The sample estimator
of the standard error is

sMD = sD/√n (2.20)

where sD is the standard deviation of the observed difference scores, or

sD2 = s12 + s22 − 2cov12 (2.21)

where cov12 is the cross-conditions covariance of the original scores. The latter is

cov12 = r12 s1 s2 (2.22)

The general form of a 100 (1 − α)% confidence interval for µD is

MD ± sMD [t2-tail, α (n − 1)] (2.23)

Presented in Table 2.2 are raw scores and descriptive statistics for a small data
set where the mean contrast is 2.00. In a dependent samples analysis of these
data, n = 5 and r12 = .735. The cross-conditions covariance is thus
cov12 = .735 (2.739) (2.236) = 4.50
Table 2.2

Condition 1    Condition 2
 9              8
12             12
13             11
15             10
16             14

M      13.00    11.00
s2      7.50     5.00
s       2.739    2.236

Note. In a dependent samples analysis, r12 = .735.
so that sD2 = 7.50 + 5.00 − 2 (4.50) = 3.50 and sD = 1.871. The estimated
standard error of the mean difference is

sMD = 1.871/√5 = .837
The value of t2-tail, .05 (4) is 2.776, so the 95% confidence interval for µD is
2.00 ± .837 (2.776), or 2.00 ± 2.32,
which defines the interval [−.32, 4.32]. Exercise 4 asks you to verify that the
95% confidence interval for µD assuming a correlated design is narrower than
the 95% confidence interval for µ1 – µ2 assuming unrelated samples for the
same data (see Table 2.2), which is [-1.65, 5.65].
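The dependent samples interval can be computed directly from the difference scores in Table 2.2. A Python sketch (NumPy and SciPy assumed):

import numpy as np
from scipy import stats

c1 = np.array([9, 12, 13, 15, 16])       # condition 1, Table 2.2
c2 = np.array([8, 12, 11, 10, 14])       # condition 2
d = c1 - c2                              # difference scores; M_D = 2.00
n = len(d)
se_md = d.std(ddof=1) / np.sqrt(n)       # s_D / sqrt(n) = 1.871/sqrt(5) = .837
tcrit = stats.t.ppf(.975, n - 1)         # 2.776
print(d.mean() - tcrit * se_md,
      d.mean() + tcrit * se_md)          # about [-.32, 4.32]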
Many statistics other than means have complex distributions. For exam-
ple, distributions of the Pearson correlation r are symmetrical only if the pop-
ulation correlation is ρ = 0, but they are negatively skewed when ρ > 0 and
positively skewed when ρ < 0. Other statistics have complex distributions,
including some widely used effect sizes introduced in Chapter 5, because they
estimate more than one parameter.
Until recently, confidence intervals for statistics with complex distri-
butions were estimated with approximate methods. One method involves
confidence interval transformation (Steiger & Fouladi, 1997), where the
statistic is mathematically transformed into normally distributed units. The
confidence interval is built by adding and subtracting from the transformed
statistic the product of the standard error in the transformed metric and the
appropriate critical value of the normal deviate z. The lower and upper bounds
are then transformed back into the original metric. An example is Fisher's
transformation of the Pearson correlation r:

Zr = ½ ln [(1 + r)/(1 − r)] (2.24)
where ln is the natural log function to base e, which is about 2.7183. The
sampling distribution of Zr is approximately normal with the standard error
sZr = 1/√(N − 3) (2.25)
The lower and upper bounds of the 100 (1 – a)% confidence interval based
on Zr are defined by
Zr ± sZr (z2-tail, α) (2.26)
where z2-tail, a is the positive two-tailed critical value of the normal deviate,
which is 1.96 for a = .05 and the 95% confidence level. Next, transform both
the lower and upper bounds of the confidence interval in Zr units back to
r units by applying the inverse transformation
r = (e2Zr − 1)/(e2Zr + 1) (2.27)
Suppose that r = .6803 and N = 20. Then

Zr = ½ ln [(1 + .6803)/(1 − .6803)] = .8297 and sZr = 1/√(20 − 3) = .2425
3 http://faculty.vassar.edu/lowry/rho.html
The bounds of the 95% confidence interval in Zr units are .8297 ± .2425 (1.96),
which defines the interval [.3544, 1.3051]. Applying Equation 2.27 to each
bound gives

(e2(.3544) − 1)/(e2(.3544) + 1) = .3403 and (e2(1.3051) − 1)/(e2(1.3051) + 1) = .8630
In r units, the approximate 95% confidence interval for r is [.34, .86] at two-
place accuracy.
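The whole transformation round trip takes a few lines of Python (NumPy and SciPy assumed):

import numpy as np
from scipy import stats

r, N = .6803, 20
Zr = .5 * np.log((1 + r) / (1 - r))      # .8297 (Equation 2.24)
se = 1 / np.sqrt(N - 3)                  # .2425 (Equation 2.25)
z = stats.norm.ppf(.975)                 # 1.96
lo, hi = Zr - z * se, Zr + z * se        # [.3544, 1.3051] in Zr units

def back(Z):                             # inverse transformation (Equation 2.27)
    return (np.exp(2 * Z) - 1) / (np.exp(2 * Z) + 1)

print(back(lo), back(hi))                # about [.34, .86] in r units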
Another approximate method builds confidence intervals directly
around the sample statistic; thus, they are symmetrical about it. The width
of the interval on either side is a product of the two-tailed critical value of a
central test statistic and an estimate of the asymptotic standard error, which
estimates what the standard error would be in a large sample (e.g., > 500). If
the researcher’s sample is not large, though, this estimate may not be accu-
rate. Another drawback is that some statistics, such as R2 in multiple regres-
sion, have distributions so complex that a computer is needed to estimate
standard error. Fortunately, there are increasing numbers of computer tools
for calculating confidence intervals, some of which are mentioned later.
A more precise method is noncentrality interval estimation (Steiger &
Fouladi, 1997). It also deals with situations that cannot be handled by approx-
imate methods. This approach is based on noncentral test distributions that
do not assume a true null hypothesis. Some perspective is in order. Families
of central distributions of t, F, and χ2 (in which H0 is assumed to be true) are
special cases of noncentral distributions of each test statistic just mentioned.
Compared to central distributions, noncentral distributions have an extra
parameter called the noncentrality parameter that indicates the degree to
which the null hypothesis is false.
Central t distributions are defined by a single parameter, the degrees of
freedom (df), but noncentral t distributions are described by both df and the
noncentrality parameter D (Greek uppercase delta). In two-group designs,
the value of D for noncentral t is related to (but not exactly equal to) the true
difference between the population means µ1 and µ2. The larger that differ-
ence, the more the noncentral t distribution is skewed. That is, if µ1 > µ2, then
D > 0 and the resulting noncentral t distributions are positively skewed, and
if µ1 < µ2, then D < 0 and the corresponding resulting noncentral t distribu-
tions are negatively skewed. But if µ1 = µ2 (i.e., there is no difference), then
D = 0 and the resulting distributions are the familiar and symmetrical central
t distributions. Presented in Figure 2.4 are two t distributions where df = 10.
For the central t distribution in the left part of the figure, D = 0, but for the
noncentral t distribution in the right side of the figure, D = 4.00. (The mean-
ing of a particular value for D is defined in Chapter 5.) Note in the figure that
the distribution for noncentral t (10, 4.00) is positively skewed.
Figure 2.4. Distributions of central t and noncentral t where the degrees of freedom
are df = 10 and where the noncentrality parameter is D = 4.00 for noncentral t.
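Noncentral distributions are available in standard software; for example, SciPy implements noncentral t as stats.nct. The sketch below (a minimal illustration, not tied to any particular package named in the text) evaluates the two densities plotted in Figure 2.4 at a few points to show the shift and positive skew when Δ = 4.00:

from scipy import stats

df, delta = 10, 4.00
for x in (0.0, 2.5, 5.0, 7.5):
    central = stats.t.pdf(x, df)              # delta = 0, symmetrical about zero
    noncentral = stats.nct.pdf(x, df, delta)  # shifted right and positively skewed
    print(x, central, noncentral)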
4 http://www.thenewstatistics.com/
5 http://www.statsoft.com/#
6 http://www.provalisresearch.com/
A 12 16 K 16 37
B 19 46 L 13 30
C 21 66 M 18 32
D 16 70 N 18 53
E 18 27 O 22 52
F 16 27 P 17 34
G 16 44 Q 22 54
H 20 69 R 12 5
I 16 22 S 14 38
J 18 61 T 14 38
[Figure: frequency distribution of the bootstrapped values of r; y-axis: Frequency (0–80); x-axis: r (0–.90).]
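A nonparametric bootstrap for r resamples whole cases (pairs of scores) with replacement and recomputes r in each generated sample. The Python sketch below (NumPy assumed; the seed and number of resamples are arbitrary choices) uses the 20 cases listed above, with the percentile method taking the middle 95% of the bootstrapped distribution as the interval:

import numpy as np

x = np.array([12, 19, 21, 16, 18, 16, 16, 20, 16, 18,
              16, 13, 18, 18, 22, 17, 22, 12, 14, 14])
y = np.array([16, 46, 66, 70, 27, 27, 44, 69, 22, 61,
              37, 30, 32, 53, 52, 34, 54,  5, 38, 38])
print(np.corrcoef(x, y)[0, 1])              # observed r, about .68

rng = np.random.default_rng(9)
boots = []
for _ in range(5_000):
    idx = rng.integers(0, len(x), len(x))   # resample cases with replacement
    boots.append(np.corrcoef(x[idx], y[idx])[0, 1])
print(np.percentile(boots, [2.5, 97.5]))    # percentile bootstrap 95% interval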
7 http://woodm.myweb.port.ac.uk/nms/resample.xls
8 http://woodm.myweb.port.ac.uk/nms/resample.exe
9 Resampling Stats is available for a 10-day trial from http://www.resample.com/
Robust Estimation
The least squares estimators M and s2 are not robust against the effects
of extreme scores. This is because their values can be severely distorted by
even a single outlier in a smaller sample or by just a handful of outliers in a
larger sample. Conventional methods to construct confidence intervals rely
on sample standard deviations to estimate standard errors. These methods
also rely on critical values in central test distributions, such as t and z, that
assume normality or homoscedasticity (e.g., Equation 2.13).
Such distributional assumptions are not always plausible. For example,
skew characterizes the distributions of certain variables such as reaction times.
Many if not most distributions in actual studies are not even symmetrical,
much less normal, and departures from normality are often strikingly large
(Micceri, 1989). Geary (1947) suggested that this disclaimer should appear
in all introductory statistics textbooks: “Normality is a myth; there never was,
and never will be, a normal distribution” (p. 214). Keselman et al. (1998)
reported that ratios of largest to smallest variances across groups as large as
8:1 were not uncommon in educational and psychological studies, so perhaps
homoscedasticity is a myth, too.
One option to deal with outliers is to apply transformations, which con-
vert original scores with a mathematical operation to new ones that may be
more normally distributed. The effect of applying a monotonic transforma-
tion is to compress one part of the distribution more than another, thereby
changing its shape but not the rank order of the scores. Examples of transfor-
mations that may remedy positive skew include X^(1/2), log10 X, and odd-root
functions (e.g., X^(1/3)). There are many other kinds, and this is one of their
potential problems: It can be difficult to find a transformation that works in
a particular data set. Some distributions can be so severely nonnormal that
basically no transformation will work. The scale of the original scores is lost
when scores are transformed. If that scale is meaningful, the loss of the scal-
ing metric creates no advantage but exacts the cost that the results may be
difficult (or impossible) to interpret.
An alternative that also deals with departures from distributional
assumptions is robust estimation. Robust (resistant) estimators are gener-
ally less affected than least squares estimators by outliers or nonnormality.
One robust decision rule flags the score X as a potential outlier if

|X − Mdn| / [1.483 (MAD)] > 2.24 (2.28)

where Mdn is the sample median and MAD is the median absolute deviation
of the scores from Mdn.
The value of the ratio in Equation 2.28 is the distance between a score and
the median expressed in robust standard deviation units. The constant 2.24
in the equation is the square root of the approximate 97.5th percentile in a
central c2 distribution with a single degree of freedom. A potential outlier
thus has a score on the ratio in Equation 2.28 that exceeds 2.24. Wilcox
(2012) described additional robust detection methods.
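Applied to the 10 scores analyzed in the trimming example later in this section, the decision rule of Equation 2.28 flags only the two extreme scores. A Python sketch (NumPy assumed):

import numpy as np

scores = np.array([15, 16, 19, 20, 22, 24, 24, 29, 90, 95])
mdn = np.median(scores)                    # 23.0
mad = np.median(np.abs(scores - mdn))      # median absolute deviation, 5.0
ratio = np.abs(scores - mdn) / (1.483 * mad)
print(scores[ratio > 2.24])                # [90 95] are flagged as outliers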
A robust variance estimator is the Winsorized variance s2Win. (The
terms Winsorized and Winsorization are named after biostatistician Charles
P. Winsor.) When scores are Winsorized, they are (a) ranked from lowest
to highest. Next, (b) the ptr most extreme scores in the lower tail of the
distribution are all replaced by the next highest original score that was not
replaced, and (c) the ptr most extreme scores in the upper tail are all replaced
by the next lowest original score that was not replaced. Finally, (d) s2Win is
calculated among the Winsorized scores using the standard formula for s2
(Equation 2.3) except that squared deviations are taken from the Winsorized
mean MWin, the average of the Winsorized scores, which may not equal Mtr
in the same sample. The statistic s2Win estimates the Winsorized population
variance σ2Win, which may not equal σ2 if the population distribution is
nonnormal.
Suppose that a sample of n = 10 scores is

15 16 19 20 22 24 24 29 90 95
The mean and variance of these scores are M = 35.40 and s2 = 923.60, both
of which are affected by the extreme scores 90 and 95. The 20% trimmed
mean is calculated by first deleting the lower and upper .20 (10) = 2 most
extreme scores from each end of the distribution, represented next by the
strikethrough characters:
15 16 19 20 22 24 24 29 90 95
Next, calculate the average based on the remaining 6 scores (i.e., 19–29).
The result is Mtr = 23.00, which as expected is less than the sample mean,
M = 35.40.
When one Winsorizes the scores for the same trimming proportion
(.20), the two lowest scores in the original distribution (15, 16) are each
replaced by the next highest score (19), and the two highest scores (90, 95)
are each replaced by the next lowest score (29). The 20% Winsorized scores
are listed next:
19 19 19 20 22 24 24 29 29 29
The Winsorized mean is MWin = 23.40. The total sum of squared deviations
of the Winsorized scores from the Winsorized mean is SSWin = 166.40, and
the degrees of freedom are 10 – 1, or 9. These results imply that the 20%
Winsorized variance for this example is s2Win = 166.40/9, or 18.49. The vari-
ance of the original scores is greater (923.60), again as expected.
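Both robust estimates are easy to compute. In Python (NumPy and SciPy assumed), scipy.stats.trim_mean gives Mtr directly, and the Winsorized variance can be formed by following the replacement steps just described:

import numpy as np
from scipy import stats

scores = np.array([15, 16, 19, 20, 22, 24, 24, 29, 90, 95])
print(stats.trim_mean(scores, .20))    # M_tr = 23.00

srt = np.sort(scores)
g = int(.20 * len(srt))                # scores replaced in each tail (2 here)
win = srt.copy()
win[:g] = srt[g]                       # lowest scores -> next highest kept score
win[-g:] = srt[-g - 1]                 # highest scores -> next lowest kept score
print(win.mean())                      # M_Win = 23.40
print(win.var(ddof=1))                 # s2_Win = 166.40/9 = 18.49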
In the Tukey–McLaughlin method, the standard error of the trimmed mean
is estimated as

stm = sWin / [(1 − 2ptr) √n] (2.29)

where sWin is the Winsorized standard deviation. For the example, sWin =
√18.489, or 4.30, so

stm = 4.30 / {[1 − 2 (.20)] √10} = 2.266
The general form of a robust 100 (1 − α)% confidence interval for µtr in this
method is

Mtr ± stm [t2-tail, α (ntr − 1)] (2.30)

where ntr is the number of scores that remain after trimming. For the example
where n = 10 and ptr = .20, the number of deleted scores is 4, so ntr = 6. The
degrees of freedom are thus 6 − 1 = 5. The value of t2-tail, .05 (5) is 2.571, so the
robust 95% confidence interval for µtr is 23.00 ± 2.266 (2.571), or 23.00 ± 5.83,
which defines the interval [17.17, 28.83]. It is not surprising that this robust
interval is narrower than the conventional 95% confidence interval for µ
calculated with the original scores, which is [13.66, 57.14]. (You should verify
this result.)
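Continuing in Python with the values just computed, the Tukey–McLaughlin interval is reproduced as follows (a sketch; the t critical value comes from SciPy):

import numpy as np
from scipy import stats

n, p_tr, M_tr = 10, .20, 23.00
s_win = np.sqrt(18.489)                        # Winsorized SD, about 4.30
se_tm = s_win / ((1 - 2 * p_tr) * np.sqrt(n))  # 2.266 (Equation 2.29)
n_tr = n - 2 * int(p_tr * n)                   # 6 scores remain after trimming
tcrit = stats.t.ppf(.975, n_tr - 1)            # t(5) = 2.571
print(M_tr - tcrit * se_tm, M_tr + tcrit * se_tm)   # about [17.17, 28.83]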
A robust estimator of the standard error for the difference between inde-
pendent trimmed means when not assuming homoscedasticity is part of the
Yuen–Welch procedure (e.g., Yuen, 1974). Error variance of each trimmed
mean is estimated as
s2Win ( n i − 1)
wi = i
(2.31)
n tr ( n tr − 1)
i i
where s2Wini, ni, and ntri are, respectively, the Winsorized variance, original group
size, and effective group size after trimming in the ith group. The Yuen–Welch
estimate for the standard error of Mtr may be somewhat more accurate than
the estimate in the Tukey–McLaughlin method (Equation 2.29), but the two
methods usually give similar values (Wilcox, 2012).
The Yuen–Welch standard error of Mtr1 – Mtr2 is
sYW = √(w1 + w2) (2.32)

and the degrees of freedom are estimated empirically as

dfYW = (w1 + w2)2 / [w12/(ntr1 − 1) + w22/(ntr2 − 1)] (2.33)
Table 2.4

        Group 1    Group 2
        15          3
        16          2
        19         21
        20         18
        22         16
        24         16
        24         13
        29         19
        90         20
        95         82
M       35.40      21.00
Mtr     23.00      17.00
MWin    23.40      16.80
s2     923.600    503.778
s2Win   18.489      9.067

Note. The trimming proportion is ptr = .20.
The general form of a 100 (1 − α)% confidence interval for µtr1 − µtr2 in this
method is

(Mtr1 − Mtr2) ± sYW [t2-tail, α (dfYW)] (2.34)
Listed in Table 2.4 are raw scores with outliers and descriptive statis-
tics for two groups where n = 10. The trimming proportion is ptr = .20, so
ntr = 6 in each group. Outliers in both groups inflate variances relative to
their robust counterparts (e.g., s22 = 503.78, s2Win2 = 9.07). Extreme scores in
group 2 (2, 3, 82) fall in both tails of the distribution, so nonrobust versus
robust estimates of central tendency are more similar (M2 = 21.00, Mtr2 =
17.00) than in group 1. Exercise 5 asks you to verify the robust estimators
for group 2 in Table 2.4.
Summarized next are robust descriptive statistics for the data in Table 2.4:

w1 = 18.489 (9)/[6 (5)] = 5.547 and w2 = 9.067 (9)/[6 (5)] = 2.720

The standard error of the trimmed mean contrast is estimated in the Yuen–
Welch method as

sYW = √(5.547 + 2.720) = 2.875

and the empirical degrees of freedom are

dfYW = (5.547 + 2.720)2 / (5.5472/5 + 2.7202/5) = 8.953
The value of t2-tail, .05 (8.953) is 2.264. The robust 95% confidence interval for
µtr1 − µtr2 is 6.00 ± 2.875 (2.264), or 6.00 ± 6.51,
which defines the interval [-.51, 12.51]. Thus, µtr1 – µtr2 could be as low as
-.51 or it could be as high as 12.51, with 95% confidence and not assuming
homoscedasticity. Wilcox (2012) described a robust version of the Welch
procedure that is an alternative to the Yuen–Welch method, and Keselman,
Algina, Lix, Wilcox, and Deering (2008) outlined robust methods for depen-
dent samples.
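The entire Yuen–Welch analysis for Table 2.4 takes only a few lines. A Python sketch (NumPy and SciPy assumed):

import numpy as np
from scipy import stats

s2_win = (18.489, 9.067)     # Winsorized variances, Table 2.4
n, n_tr = (10, 10), (6, 6)   # original and effective group sizes
w = [v * (ni - 1) / (nt * (nt - 1))              # Equation 2.31
     for v, ni, nt in zip(s2_win, n, n_tr)]      # w1 = 5.547, w2 = 2.720
se_yw = np.sqrt(w[0] + w[1])                     # 2.875 (Equation 2.32)
df_yw = (w[0] + w[1]) ** 2 / (w[0] ** 2 / (n_tr[0] - 1)
                              + w[1] ** 2 / (n_tr[1] - 1))   # 8.953
tcrit = stats.t.ppf(.975, df_yw)                 # 2.264
diff = 23.00 - 17.00                             # M_tr1 - M_tr2
print(diff - tcrit * se_yw, diff + tcrit * se_yw)   # about [-.51, 12.51]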
A modern alternative in robust estimation to relying on formulas to esti-
mate standard errors and degrees of freedom in central test distributions that
assume normality is bootstrapping. There are methods to construct robust non-
parametric bootstrapped confidence intervals that protect against repeated
selection of outliers in the same generated sample (Salibián-Barrera & Zamar,
2002). Otherwise, bootstrapping is applied in basically the same way as described
in the previous section but to generate empirical sampling distributions for
robust estimators.
Standard computer programs for general statistical analyses, such as SPSS
and SAS/STAT, have limited capabilities for robust estimation. Wilcox (2012)
described add-on modules (packages) for conducting robust estimation in R, a
free, open source computing environment for statistical analyses, data mining,
and graphics.10 It runs on Unix, Microsoft Windows, and Apple Macintosh fam-
ilies of operating systems. A basic R installation has about the same capabilities
as some commercial statistical programs, but there are now over 2,000 packages
that further extend its capabilities. Wilcox's (2012) WRS package has routines
for robust estimation, outlier detection, comparisons, and confidence interval
estimation.

10 http://www.r-project.org/
Conclusion
The basic logic of sampling and estimation was described in this chap-
ter. Confidence intervals based on statistics with simple distributions rely on
central test statistics, but statistics with complex distributions may follow
noncentral distributions. Special software tools are typically needed for non-
centrality interval estimation. The lower and upper bounds of a confidence
interval set reasonable limits for the value of the corresponding parameter,
but there is no guarantee that a specific confidence interval contains the
parameter. Literal interpretation of the percentages associated with a confi-
dence interval assumes random sampling and that all other sources of impre-
cision besides sampling error are nil. Interval estimates are better than point
estimates because they are, as the astronomer Carl Sagan (1996, pp. 27–28)
described them, “a quiet but insistent reminder that no knowledge is com-
plete or perfect.” Methods for robust interval estimation based on trimmed
means and Winsorized variances were introduced. The next chapter deals
with the logic and illogic of significance testing.
Learn More
11 http://dornsife.usc.edu/labs/rwilcox/software/
12 http://www.iumsp.ch/Unites/us/Alfio/msp_programmes.htm
Two Schools
Summarized in Table 3.1 are the basic steps of the Fisher and Neyman–
Pearson approaches to statistical inference. In Fisher’s method, p from a sta-
tistical test measures the strength of the evidence against the null hypothesis
H0. If p is sufficiently small, H0 can be rejected. Fisher advocated p < .05 as a pragmatic criterion for "small enough" (i.e., to reject H0 due to a sufficiently low value of p) but not as a golden rule. There was no alternative hypothesis H1 in Fisher's method. Specification of a fixed level of α (e.g., .05 or .01) and an explicit H1 typifies the Neyman–Pearson model. These steps imply the distinction between Type I error (false rejection of H0) and Type II error (false retention of H0). The probability of a Type I error is α, and the likelihood of a Type II error is represented by β. Power is the complement of β, or 1 − β, defined as the probability of correctly rejecting H0 when H1 is true. A power analysis estimates the probability of a Type II error as β = 1 − power. Recall
that power is the probability of getting a statistically significant result over
random replications (in the long run) when H1 is true.
Power analysis concerns a loss function for Type II error. A loss func-
tion estimates with a single number the cost of a specific decision error. Cost
can be measured in monetary terms or in a different metric that represents
loss of utility, or relative satisfaction, in some area. A loss function theoreti-
cally enables the researcher to weigh the consequences of low power (high β) against the risk of Type I error (α). This mental balancing act could facilitate a better understanding of the implications of specifying α = .05 versus α = .01 (or some other value).
Fisher vehemently opposed loss functions because he believed that statistical inference must respect the pure aim of science, the accumulation and dissemination of knowledge; entertaining any other consideration would corrupt that aim.

Table 3.1
Steps in Fisher Significance Testing and Neyman–Pearson Hypothesis Testing

Fisher: (a) State H0. (b) Compute p from a statistical test. (c) Reject H0 if p is sufficiently small (e.g., p < .05).
Neyman–Pearson: (a) State H0 and H1. (b) Specify a fixed level of α (e.g., .05). (c) Reject H0 in favor of H1 if p < α; otherwise, retain H0.
Emphasized next are aspects of significance testing that are not well
understood by many students and researchers.
Null Hypotheses

Examples of nil hypotheses for, respectively, a contrast between two independent means, a dependent mean change, and a correlation are

H0: µ1 − µ2 = 0,  H0: µD = 0, and H0: ρ = 0
Alpha (α) is the probability of making a Type I error over random replications. It is also the conditional prior probability of rejecting H0 when it is actually true, or

α = p(Reject H0 | H0 true)

One way to balance the two kinds of error is to estimate a desired level of α given p(H1), the probability that H1 is true; β, the probability of a Type II error; and dRS, the desired relative seriousness of a Type I error versus a Type II error:

αdes = [p(H1) β / (1 − p(H1))] (1/dRS)    (3.2)

Suppose that p(H1) = .60, β = .10, and dRS = .50. Then

αdes = [.60 (.10) / (1 − .60)] (1/.50) = .30
which says that α = .30 reflects the desired balance of Type I versus Type II
error. The main point is that researchers should not rely on a mechanical
ritual (i.e., automatically specify .05 or .01) to control risk for Type I error
that ignores the consequences of Type II error. Note that the estimate of
p (H1) could come from a Bayesian analysis based on results of prior stud-
ies. In this case, the form of the probability that H1 is true would be that of
the conditional probability p (H1 | Data), where “Data” reflects extant results
and the whole conditional probability is estimated with Bayesian methods
(Chapter 10).
The level of α sets the risk of Type I error for a single test. There is also experimentwise (familywise) error rate, or the likelihood of making at least one Type I error across a set of tests. If each individual test is conducted at the same level of α, then

αew = 1 − (1 − α)^c    (3.3)

where c is the number of tests. Suppose that 20 statistical tests are conducted, each at α = .05. The experimentwise error rate is

αew = 1 − (1 − .05)^20 = .64
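As a quick check of Equation 3.3, this short Python sketch computes the experimentwise error rate for 20 independent tests and, for comparison, the rate under a Bonferroni-adjusted per-test alpha (the adjustment is a standard technique added here for illustration, not one described above):

```python
# Experimentwise Type I error rate per Equation 3.3 for c independent tests.
alpha, c = .05, 20
print(round(1 - (1 - alpha) ** c, 2))       # 0.64, as in the text
print(round(1 - (1 - alpha / c) ** c, 3))   # about .049 with Bonferroni-adjusted alpha
```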
p Values
All statistical tests do basically the same thing: The difference between
a sample result and the value of the corresponding parameter(s) specified in
H0 is divided by the estimated sampling error, and this ratio is then summa-
rized as a test statistic (e.g., t, F, χ²). That ratio is converted by the computer
to a probability based on a theoretical sampling distribution (i.e., random
sampling is assumed). Test probabilities are often printed in computer output
under the column heading p, which is the same abbreviation used in jour-
nal articles. You should not forget that p actually stands for the conditional probability p(D+ | H0), that is, the likelihood of the data or results even more extreme under a true null hypothesis, given all other assumptions about sampling and scores.
Test statistics generally have the structure

TS = ES × f(N)    (3.4)

where ES is an effect size and f(N) is a function of sample size. This equation explains how it is possible that (a) trivial effects can be statistically significant in large samples or (b) large effects may not be statistically significant in small samples. So p is a confounded measure of effect size and sample size.
Statistics that directly measure effect size are introduced in Chapter 5.
Power
The t tests reviewed next compare means from either two independent
or two dependent samples. Both are special cases of the F test for means such
that t2 = F for the same contrast and a nil hypothesis. The general form of t
for independent samples is
t(N − 2) = [(M1 − M2) − (µ1 − µ2)] / sM1−M2    (3.5)
Suppose that M1 − M2 = 15.00 with an estimated standard error of 4.50 in a design with df = 48. Consider a non-nil hypothesis that specifies µ1 − µ2 = 10.00 versus the nil hypothesis µ1 − µ2 = 0. For a two-tailed H1, results of the t test for the two null hypotheses are

tnon-nil (48) = (15.00 − 10.00)/4.50 = 1.11, p = .272

tnil (48) = 15.00/4.50 = 3.33, p = .002

This example illustrates the principle that the relative rareness of data under implausible null hypotheses is exaggerated compared with that under more plausible null hypotheses (respectively, .002 vs. .272). This is why Rouder, Speckman,
Sun, and Morey (2009) wrote, “As a rule of thumb, hypothesis testing should
be reserved for those cases in which the researcher will entertain the null as
theoretically interesting and plausible, at least approximately” (p. 235).
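The sketch below reproduces these two tests in Python with SciPy; the only inputs are the observed contrast, its standard error, and the degrees of freedom from the example:

```python
from scipy.stats import t

contrast, se, df = 15.00, 4.50, 48
for mu0 in (10.00, 0.00):                 # non-nil vs. nil H0
    tval = (contrast - mu0) / se
    p = 2 * t.sf(abs(tval), df)           # two-tailed p
    print(f"H0 diff = {mu0}: t({df}) = {tval:.2f}, p = {p:.3f}")
# H0 diff = 10.0: t(48) = 1.11, p = 0.272
# H0 diff = 0.0:  t(48) = 3.33, p = 0.002
```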
The general form of t for dependent samples is

t(n − 1) = (MD − µD) / sMD    (3.6)
where the degrees of freedom are the group size (n) minus 1, MD and sMD
are, respectively, the observed mean difference score and its standard error
(Equation 2.20), and µD is the population dependent mean contrast specified
in H0. The latter is zero for a nil hypothesis.
Assuming a nil hypothesis, both forms of the t test defined express a
contrast as the proportion of its standard error. If t = 1.50, for example, the
first mean is 1½ standard errors higher than the second, but the sign of t is
arbitrary because it depends on the direction of subtraction. You should know
that the standard error metric of t is affected by sample size. Suppose descriptive statistics for two groups in a balanced design are

M1 = 13.00, s²1 = 7.50   M2 = 11.00, s²2 = 5.00

which imply M1 − M2 = 2.00. Reported in Table 3.2 are results of the independent samples t test for these data at three different group sizes, n = 5, 15, and 30. Note that the pooled within-groups variance, s²pool = 6.25 (Equation 2.13), is unaffected by group size. This is not true for the denominator of t, sM1−M2,
which gets smaller as n increases. This causes the value of t to go up and its
p value to go down for the larger group sizes. Consequently, the test for n = 5 is
not statistically significant at p < .05, but it is for the larger group sizes. Results
for the latter indicate less sampling error but not a larger effect size. Exercise 1
asks you to verify the results in Table 3.2 for n = 15.
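This pattern is easy to verify in Python with SciPy; the descriptive statistics are those given above, and only n changes:

```python
from math import sqrt
from scipy.stats import t

M1, v1, M2, v2 = 13.00, 7.50, 11.00, 5.00
for n in (5, 15, 30):
    df = 2 * n - 2
    v_pool = ((n - 1) * v1 + (n - 1) * v2) / df   # stays at 6.25 for every n
    se = sqrt(v_pool * (2 / n))                   # standard error of M1 - M2
    tval = (M1 - M2) / se
    print(n, round(tval, 2), round(2 * t.sf(tval, df), 3))
# 5: t = 1.26, p = .242 | 15: t = 2.19, p = .037 | 30: t = 3.10, p = .003
```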
The standard error metric of t is also affected by whether the means are
independent or dependent. Look back at Table 2.2, which lists raw scores and
descriptive statistics for two samples where M1 − M2 = 2.00, s²1 = 7.50, and s²2 = 5.00. Summarized next for these data are results of the t test and confidence intervals, first assuming n = 5 cases in each of two independent samples and then assuming n = 5 pairs of scores across dependent samples.
Note the smaller standard error, higher value of t and its lower p value, and the
narrower 95% confidence interval in the dependent samples analysis relative
to the independent samples analysis of the same raw scores. The assumptions
of the t tests are the same as those of the independent samples F test, which
are considered in the next section.
The Welch t test, also called the Welch–James t test (e.g., James, 1951),
for independent samples assumes normality but not homoscedasticity. Its
equation is
tWel (dfWel) = [(M1 − M2) − (µ1 − µ2)] / sWel    (3.7)
Rejecting H0 says only that the observed differences among M1, M2, and M3 are unlikely under a true null hypothesis.
This result alone is not very informative. A researcher may be more inter-
ested in focused comparisons, such as whether each of two treatment condi-
tions differs from control, which break down the omnibus effect into specific
directional effects. Thus, it is common practice either to follow an omni-
bus comparison with contrasts or to forgo the omnibus test and analyze only
contrasts. The logic of the F test in single-factor designs with a ≥ 3 levels is
considered next. Chapter 7 addresses contrast analysis in such designs, and
Chapter 8 covers designs with multiple factors.
Independent Samples
The omnibus comparison in a single-factor design is tested with the statistic

F(dfA, dfW) = MSA / MSW    (3.8)

where dfA = a − 1 and dfW are the pooled within-groups degrees of freedom, or

dfW = Σ dfi = Σ (ni − 1) = N − a    (3.9)

summing over the a groups. The numerator of F is

MSA = SSA/dfA = [Σ ni (Mi − MT)²] / (a − 1)    (3.10)

where MT is the grand mean, and its denominator is

MSW = SSW/dfW = [Σ dfi s²i] / [Σ dfi]    (3.11)
where s²i is the variance of the ith group. If there are only two groups, MSW = s²pool, and only in a balanced design can MSW also be computed as the aver-
age of the within-groups variances. The total sum of squares SST is the sum
of SSA and SSW; it can also be computed as the sum of squared deviations of
individual scores from the grand mean.
Presented next are descriptive statistics for three groups:

M1 = 13.00, s²1 = 7.50   M2 = 11.00, s²2 = 5.00   M3 = 12.00, s²3 = 4.00
Reported in Table 3.3 are the results of F tests for these data at group sizes n = 5,
15, and 30. Note in the table that MSW = 5.50 regardless of group size. But both
MSA and F increase along with the group size, which also progressively lowers
p values from .429 for n = 5 to .006 for n = 30. Exercise 2 asks you to verify
results in Table 3.3 for n = 30.
Table 3.3
Results of the Independent Samples F Test at Three Group Sizes

Source             SS      df     MS       F

n = 5
Between (A)       10.00     2     5.00     .91a
Within (error)    66.00    12     5.50
Total             76.00    14

n = 15
Between (A)       30.00     2    15.00    2.73b
Within (error)   231.00    42     5.50
Total            261.00    44

n = 30
Between (A)       60.00     2    30.00    5.45c
Within (error)   478.50    87     5.50
Total            538.50    89

Note. For all analyses, M1 = 13.00, s²1 = 7.50, M2 = 11.00, s²2 = 5.00, M3 = 12.00, and s²3 = 4.00.
ap = .429. bp = .077. cp = .006.

Equation 3.10 for MSA defines a weighted means analysis where squared deviations of group means from the grand mean are weighted by group size. If the design is unbalanced, means from bigger groups get more weight. This may not be a problem if unequal group sizes reflect unequal population base rates. Otherwise, an unweighted means analysis may be preferred where all means are given the same weight by (a) computing the grand mean as the simple arithmetic average of the group means and (b) substituting the harmonic mean nh for the actual group sizes in Equation 3.10:

nh = a / Σ (1/ni)    (3.12)

Results of weighted versus unweighted analysis for the same data tend to diverge as group sizes are increasingly unbalanced.
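These results are easy to verify from the summary statistics alone; the sketch below implements Equations 3.10 and 3.11 for a balanced design in Python with SciPy:

```python
from scipy.stats import f

means, variances = [13.00, 11.00, 12.00], [7.50, 5.00, 4.00]
a = len(means)
for n in (5, 15, 30):
    MT = sum(means) / a                       # grand mean (balanced design)
    SSA = n * sum((M - MT) ** 2 for M in means)
    dfA, dfW = a - 1, a * (n - 1)
    MSA = SSA / dfA                           # Equation 3.10
    MSW = sum(variances) / a                  # Equation 3.11 with equal n
    F = MSA / MSW
    print(n, round(F, 2), round(f.sf(F, dfA, dfW), 3))
# n = 5: F = .91, p = .429 | n = 15: F = 2.73, p = .077 | n = 30: F = 5.45, p = .006
```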
The assumptions of the t tests are the same as for the independent sam-
ples F test. They are stated in many introductory books as independence,
normality, and homoscedasticity, but there are actually more. Two are that
(a) the factor is fixed and (b) all its levels are represented in the study. Levels
of fixed effects factors are intentionally selected for investigation, such as
the equally spaced drug dosages 0 (control), 3, 6, 9, and 12 mg kg⁻¹. Because these levels are not randomly selected, the results may not generalize to other dosages not studied, such as 15 mg kg⁻¹. Levels of random effects factors
are randomly selected, which yields over replications a representative sample
from all possible levels. A control factor is a special kind of random factor
that is not itself of interest but is included for the sake of generality (Keppel
& Wickens, 2004). An example is when participants are randomly assigned
to receive different versions of a vocabulary test. Using different word lists across versions makes the results less tied to any single list.
Dependent Samples
The variances MSA and MSW are calculated the same way regardless of
whether the samples are independent or dependent (Equations 3.10–3.11),
but the latter no longer reflects only error variance in correlated designs.
This is due to the subjects effect. It is estimated for factors with ≥ 3 levels
as Mcov, the average covariance over all pairs of conditions. The subtraction
MSW – Mcov literally removes the subjects effect from the pooled within-
conditions variance and also defines the error term for the dependent sam-
ples F test. A similar subtraction removes the subjects effect from the error
term of the dependent samples t test (Equation 2.21).
An additive model assumes that the quantity MSW – Mcov reflects only
sampling error. In some sources, this error term is designated as MSres, where
the subscript refers to residual variance after removal of the subjects effect.
A nonadditive model assumes that the error term reflects both random error and a true person × treatment interaction, one in which some unmeasured characteristic leads cases to respond differently across the levels of the factor. The dependent samples F test is

F(dfA, dfA×S) = MSA / MSA×S    (3.13)
where dfA×S = (a − 1)(n − 1) and MSA×S = MSW − Mcov. The latter can also be expressed as

MSA×S = (SSW − SSS) / dfA×S    (3.14)
where SSS is the sum of squares for the subjects effect with dfS = n – 1 degrees
of freedom. Equation 3.14 shows the decomposition of the total within-
conditions sum of squares into two parts, one due to the subjects effect and
the other related to error, or SSW = SSS + SSA × S.
The potential power advantage of the dependent samples F test over
the independent samples F test is demonstrated next. Data for three samples
are presented in Table 3.4. Results of two different F tests with these data are
reported in Table 3.5. The first analysis assumes n = 5 cases in each of three
independent samples, and the second analysis assumes n = 5 triads of scores
across three dependent samples. Only the second analysis takes account of
the positive correlations between each pair of conditions (see Table 3.4).
Observe the higher F and the lower p values for the dependent sample analy-
sis (Table 3.5). Exercise 3 asks you to verify the results of the dependent
samples F test in Table 3.5.
Table 3.4
Raw Scores and Descriptive Statistics for Three Samples

Condition
        1       2       3
        9       8      13
       12      12      14
       13      11      16
       15      10      14
       16      14      18
M    13.00   11.00   15.00
s²    7.50    5.00    4.00

Note. In a dependent samples analysis, r12 = .7348, r13 = .7303, and r23 = .8385.

The dependent samples F test assumes normality. Expected dependency among scores due to the subjects effect is removed from the error term (Equation 3.14), so the assumptions of homoscedasticity and independence concern error variances across the levels of the factor. The latter implies that error variance in the first condition has nothing to do with error variance in
the second condition, and so on. This is a strong assumption and probably
often implausible, too. This is because error variance from measurements
taken close in time, such as adjacent trials in a learning task, may well over-
lap. This autocorrelation of the errors may be less with longer measurement
intervals, but autocorrelated error occurs in many within-subjects designs.
Another assumption for factors with ≥ 3 levels is sphericity (circular-
ity), or the requirement for equal population variances of difference scores
between every pair of conditions, such as

σ²D12 = σ²D13 = σ²D23
Table 3.5
Results of the Independent Samples F Test and the Dependent
Samples F Test for the Data in Table 3.4

Source              SS      df     MS       F

Independent samples analysis
Between (A)        40.00     2    20.00    3.64a
Within (error)     66.00    12     5.50
Total             106.00    14

Dependent samples analysis
Between (A)        40.00     2    20.00   14.12b
Within             66.00    12     5.50
  Subjects (S)     54.67     4    13.67
  A × S (error)    11.33     8     1.42
Total             106.00    14

ap = .058. bp = .002.
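A short sketch with NumPy and SciPy reproduces both analyses in Table 3.5 from the Table 3.4 scores; the dependent samples error term is formed by subtracting the average cross-condition covariance (Mcov) from MSW, per Equation 3.13:

```python
import numpy as np
from scipy.stats import f

data = np.array([[9, 8, 13], [12, 12, 14], [13, 11, 16],
                 [15, 10, 14], [16, 14, 18]], dtype=float)  # n = 5 rows, a = 3 conditions
n, a = data.shape
MSA = n * data.mean(axis=0).var(ddof=0) * a / (a - 1)   # = SSA / (a - 1) = 20.00
MSW = data.var(axis=0, ddof=1).mean()                   # balanced: mean of variances = 5.50
Mcov = np.cov(data, rowvar=False)[np.triu_indices(a, k=1)].mean()  # 4.0833
F_ind = MSA / MSW                                       # independent samples F
F_dep = MSA / (MSW - Mcov)                              # dependent samples F
print(round(F_ind, 2), round(f.sf(F_ind, a - 1, a * (n - 1)), 3))        # 3.64, .058
print(round(F_dep, 2), round(f.sf(F_dep, a - 1, (a - 1) * (n - 1)), 3))  # 14.12, .002
```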
All forms of ANOVA are nothing more than special cases of multiple
regression. In the latter, predictors can be either continuous or categorical
(Cohen, 1968). It is also possible in multiple regression to estimate interaction
or curvilinear effects. In theory, one needs just a regression computer procedure
to conduct any kind of ANOVA. The advantage of doing so is that regression
output routinely contains effect sizes in the form of regression coefficients and
the overall multiple correlation (or R2). Unfortunately, some researchers do not
recognize these statistics as effect sizes and emphasize only patterns of statistical
significance. Some ANOVA computer procedures print source tables with no
effect sizes, but it is easy to calculate some of the same effect sizes seen in regres-
sion output from values in source tables (see Chapter 5).
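To make the regression equivalence concrete, here is a small sketch (using NumPy, and treating the Table 3.4 scores as three independent groups) in which R² from a dummy-coded regression equals η̂² = SSA/SST from the ANOVA:

```python
import numpy as np

y = np.array([9, 12, 13, 15, 16, 8, 12, 11, 10, 14, 13, 14, 16, 14, 18], dtype=float)
g = np.repeat([0, 1, 2], 5)                    # group membership
X = np.column_stack([np.ones_like(y),          # intercept
                     (g == 1).astype(float),   # dummy code: group 2 vs. group 1
                     (g == 2).astype(float)])  # dummy code: group 3 vs. group 1
b, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ b
R2 = 1 - ((y - yhat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(b, round(R2, 3))   # coefficients [13, -2, 2]; R^2 = 40/106, about .377
```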
χ² Test of Association

The test statistic for a two-way contingency table with r rows and c columns is

χ²[(r − 1)(c − 1)] = Σi Σj (foij − feij)² / feij    (3.15)
where the degrees of freedom are the product of the number of rows (r)
minus one and the number of columns (c) minus one; foij is the observed fre-
quency for the cell in the ith row and jth column; and feij is the expected fre-
quency for the same cell under the nil hypothesis that the two variables are
unrelated. There is a quick way to compute by hand the expected frequency
for any cell: Divide the product of the row and column (marginal) totals for
that cell by the total number of cases, N. It is that simple. Assumptions of the χ² test include independence of the observations and sufficiently large expected cell frequencies.
Table 3.6
Results of the χ² Test of Association at Two Group Sizes

                       Observed frequencies
Group         n    Recovered   Not recovered   Rate     χ²

n = 40
Treatment    40       24            16          .60    3.20a
Control      40       16            24          .40
Total        80       40            40

n = 80
Treatment    80       48            32          .60    6.40b
Control      80       32            48          .40
Total       160       80            80

ap = .074. bp = .011.
For the analysis with n = 40 in each group, the expected frequency for every cell is

fe = (40 × 40)/80 = 20

which shows the pattern under H0 where the recovery rate is identical in the two groups (20/40, or .50). The test statistic for n = 40 is

χ²(1) = (24 − 20)²/20 + (16 − 20)²/20 + (16 − 20)²/20 + (24 − 20)²/20 = 3.20, p = .074

so H0 is not rejected at the .05 level. (You should verify this result.) The effect of increasing the group size on χ² while keeping all else constant is demonstrated in the lower part of Table 3.6. For example, H0 is rejected at the .05 level for n = 80 because

χ²(1) = 6.40, p = .011

even though the difference in the recovery rate is still .20. Exercise 5 asks you to verify the results of the χ² test for the group size n = 80 in Table 3.6.
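As a check on Table 3.6, this sketch uses SciPy's chi2_contingency with the continuity correction turned off, since Equation 3.15 is uncorrected:

```python
from scipy.stats import chi2_contingency

for k in (1, 2):                       # k = 1 gives n = 40 per group, k = 2 gives n = 80
    table = [[24 * k, 16 * k],         # treatment: recovered, not recovered
             [16 * k, 24 * k]]         # control: recovered, not recovered
    chi2, p, df, expected = chi2_contingency(table, correction=False)
    print(round(chi2, 2), round(p, 3))
# n = 40: chi-square = 3.20, p = .074 | n = 80: chi-square = 6.40, p = .011
```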
The robust Yuen–Welch t test for independent samples based on trimmed means is

tYW (dfYW) = [(Mtr1 − Mtr2) − (µtr1 − µtr2)] / sYW

where the robust standard error sYW and degrees of freedom dfYW adjusted for heteroscedasticity are defined by, respectively, Equations 2.32 and 2.33. There is also a robust Welch–James t test, but the robust Yuen–Welch t test may yield effective levels of Type I error that are slightly closer to stated levels of α over random samples (Wilcox, 2012). For the data from the two groups in Table 2.4, the robust test of the nil hypothesis is

tYW (8.953) = 6.00/2.875 = 2.09

which does not exceed t2-tail, .05 (8.953) = 2.264, so the nil hypothesis is not rejected at the .05 level.
Conclusion
Learn More
Exercises

1. For the results reported in Table 3.2, conduct the t test for inde-
pendent samples for n = 15 and construct the 95% confidence
interval for µ1 – µ2.
2. For the results listed in Table 3.3, conduct the F test for inde-
pendent samples for n = 30.
3. For the data in Table 3.4, verify the results of the dependent
samples F test in Table 3.5. Calculate the source table by hand
using four-decimal accuracy for the error term.
4. Explain why the dependent samples t test does not assume
sphericity.
5. For the data in Table 3.6, verify the results of the χ² test for
n = 80.
Many false beliefs are associated with significance testing. Most involve
exaggerating what can be inferred from either rejecting or failing to reject a
null hypothesis. Described next are the “Big Five” misinterpretations with
estimates of their base rates among psychology professors and students. Also
considered in this chapter are variations on the Intro Stats method that may
be helpful in some situations. Reject-support testing is assumed instead of
accept-support testing, but many of the arguments can be reframed for the
latter. I assume also that α = .05, but the issues dealt with next apply to any
other criterion level of statistical significance.
Table 4.1
The Big Five Misinterpretations of p < .05 and Base Rates
Among Psychology Professors and Students

                                                            Base rate (%)
Misinterpretation                                        Professorsa  Studentsb

Misinterpretations of 1 − p
Validity: Likelihood that H1 is true is > 95%               33–66        59
Replicability: Likelihood that result will be
  replicated is > 95%                                       37–60        41
Note. Table adapted from R. B. Kline, 2009, Becoming a Behavioral Science Researcher: A Guide to
Producing Research That Matters, p. 125, New York, Guilford Press. Copyright 2009 by Guilford Press.
Adapted with permission. Dashes (—) indicate the absence of estimated base rates.
aHaller and Krauss (2002), Oakes (1986). bHaller and Krauss (2002).
A correct definition is that the likelihood of the data, or of results even more extreme, under a true null hypothesis is < .05, assuming that all distributional requirements of the test statistic are satisfied and there are no other sources of error variance. Let us refer to any correct definition as p(D+ | H0), which emphasizes p as the conditional probability of the data under H0 given all the other assumptions just mentioned.
Listed in Table 4.1 are the Big Five false beliefs about statistical signifi-
cance. Three concern p values, but two others involve their complements,
or 1 - p. Also reported in the table are base rates in samples of psychol-
ogy professors or students (Haller & Krauss, 2002; Oakes, 1986). Overall,
psychology students are no worse than their professors regarding erroneous
beliefs. These poor results are not specific to psychology (e.g., forecasting;
Armstrong, 2007). It is also easy to find similar misunderstandings in jour-
nal articles and statistics textbooks (e.g., Cohen, 1994; Gigerenzer, 2004).
These results indicate that myths about significance testing are passed on
from teachers and published works to students.
Odds-Against-Chance Fallacy
Most psychology students and professors may endorse the local Type I
error fallacy (Table 4.1). It is the mistaken belief that p < .05 given α = .05
means that the likelihood that the decision just taken to reject H0 is a Type I
error is less than 5%. Pollard (1993) described this fallacy as confusing the
conditional probability of a Type I error, or

α = p(Reject H0 | H0 true)

with the conditional posterior probability of a Type I error given that H0 has been rejected, or

p(H0 true | Reject H0)
Validity Fallacy
Replicability Fallacy
Results by Oakes (1986) and Haller and Krauss (2002) indicated that
virtually all psychology students and about 80 to 90% of psychology profes-
sors endorsed at least one of the Big Five false beliefs. So it seems that most
researchers believe for the case α = .01 and p < .01 that the result is very
unlikely to be due to sampling error and that the probability a Type I error was
just committed is just as unlikely (< .01 for both). Most researchers might also
conclude that H1 is very likely to be true, and many would also believe that
the result is very likely to replicate (> .99 for both). These (misperceived)
odds in favor of the researcher’s hypothesis are so good that it must be true,
right? The next (il)logical step would be to conclude that the result must
also be important. Why? Because it is significant! Of course, none of these
things are true, but the Big Five are hardly the end of cognitive distortions in
significance testing.
Magnitude Fallacy

The magnitude fallacy is the false belief that low p values indicate large
effects. Cumming (2012) described a related error called the slippery slope
of significance that happens when a researcher ambiguously describes a result
for which p < a as “significant” without the qualifier “statistically” and then
later discusses the effect as if it were automatically “important” or “large.”
These conclusions are unwarranted because p values are confounded mea-
sures of effect size and sample size (see Equation 3.4). Thus, effects of trivial
magnitude need only a large enough sample to be statistically significant.
If the sample size is actually large, low p values just confirm a large sample,
which is circular logic (B. Thompson, 1992).
Failure Fallacy
The failure fallacy is the mistaken belief that lack of statistical sig-
nificance brands the study as a failure. Gigerenzer (2004) recited this older
incantation about doctoral dissertations and the critical ratio, the predeces-
sor of p values: “A critical ratio of three [i.e., p < .01], or no PhD” (p. 589).
Although improper methods or low power can cause Type II errors, the failure
to reject H0 can be an informative result. Researchers tend to attribute fail-
ure to reject H0 to poor design rather than to the validity of the substantive
hypothesis behind H1 (Cartwright, 1973).
Objectivity Fallacy
Sanctification Fallacy
Robustness Fallacy
Figure 4.1. Visual map of words with meanings similar to that of “significant.” Image
and text from the Visual Thesaurus (http://www.visualthesaurus.com). Copyright
1998–2011 by Thinkmap, Inc. All rights reserved. Reprinted with permission.
Synonyms for significant, illustrated in Figure 4.1 created with the Thinkmap Visual Thesaurus,1 include
“important,” “noteworthy,” and “monumental,” but none of them automati-
cally apply to H0 rejections. One way to guard against overinterpretation
is to drop the word significant from our data analysis vocabulary altogether.
Hurlbert and Lombardi (2009) reminded us that there is no obligation to use the word significant at all in research. Another is to always use the phrase
statistically significant (see B. Thompson, 1996), which signals that we are not
talking about significance in the everyday sense (e.g., Figure 4.1). Using just
the word statistical should also suffice. For example, rejection of H0: µ1 = µ2
could be described as evidence for a statistical mean difference (Tryon, 2001).
Calling an effect statistical implies that it was observed but not also necessar-
ily important or real.
Another suggested reform is to report exact p values (e.g., p = .03) rather than inequalities such as p < .05.
1http://www.visualthesaurus.com/
Additional Problems
Customer-Centric Science
Equivalence Testing
says that the population means cannot be considered equivalent if the absolute value of their difference is greater than 10.00. The complementary interval for this example is

−10.00 ≤ (µ1 − µ2) ≤ 10.00

which, stated as a range null hypothesis, is

H0: −10.00 ≤ (µ1 − µ2) ≤ 10.00
Outlined next are recommendations that call for varying degrees of use
of statistical tests—from none at all to somewhat more pivotal depending on
the context—but with strict requirements for their use. These suggestions are
intended as a constructive framework for reform and renewal. I assume that
reasonable people will disagree with some of the specifics put forward. Indeed,
a lack of consensus has characterized the whole debate about significance
testing. Even if you do not endorse all the points elaborated next, you may
at least learn new ways of looking at the controversy over statistical tests or, even better, at your data, which is the ultimate goal of this discussion.
A theme underlying these recommendations can be summarized like
this: Significance testing may have helped us in psychology and other behav-
ioral sciences through a difficult adolescence during which we struggled to establish ourselves as empirical sciences, but it is now time to move on.
Recommendations
Replicate, Replicate
The rationale for this recommendation is obvious. A replication
requirement would help to filter out some of the fad research topics that
bloom for a short time but then disappear. Such a requirement could be
relaxed for original results with the potential for a large impact in their
field, but the need to replicate studies with unexpected or surprising results
is even greater (Robinson & Levin, 1997). Chapter 9 deals with replication
in more detail.
Conclusion
Significance testing has been like a collective Rorschach inkblot test for
the behavioral sciences: What we see in it has more to do with wish fulfill-
ment than reality. This magical thinking has impeded the development of
psychology and other disciplines as cumulative sciences.
Learn More
Aguinis et al. (2010) and Hurlbert and Lombardi (2009) give spirited
defenses of modified forms of significance testing. Ziliak and McCloskey (2008)
deliver an eloquent but hard-hitting critique of significance testing, and Lambdin
(2012) takes psychology to task for its failure to abandon statistical witchcraft.
Aguinis, H., Werner, S., Abbott, J. L., Angert, C., Park, J. H., & Kohlhausen, D.
(2010). Customer-centric science: Reporting significant research results with
rigor, relevance, and practical impact in mind. Organizational Research Methods,
13, 515–539. doi:10.1177/1094428109333339
Hurlbert, S. H., & Lombardi, C. M. (2009). Final collapse of the Neyman–Pearson
decision theory framework and rise of the neoFisherian. Annales Zoologici Fennici,
46, 311–349. Retrieved from http://www.sekj.org/AnnZool.html
Lambdin, C. (2012). Significance tests as sorcery: Science is empirical—significance
tests are not. Theory & Psychology, 22, 67–90. doi:10.1177/0959354311429854
Ziliak, S., & McCloskey, D. N. (2008). The cult of statistical significance: How the stan-
dard error costs us jobs, justice, and lives. Ann Arbor: University of Michigan Press.
Exercises
Statistical significance is the least interesting thing about the results. You
should describe the results in terms of measures of magnitude—not just,
does a treatment affect people, but how much does it affect them.
—Gene Glass (quoted in M. Hunt, 1997, pp. 29–30)
Definitions of Effect Size
Major contexts for effect size estimation and the difference between
unstandardized and standardized effect sizes are outlined next.
Meta-Analysis
It is standardized effect sizes from sets of related studies that are analyzed
in most meta-analyses. Consulting a meta-analytic study provides a way for
researchers to gauge whether their own effects are smaller or larger than those
from other studies. If no meta-analytic study yet exists, researchers can calculate,
using equations presented later, effect sizes based on descriptive or test statistics
reported by others. Doing so permits direct comparison of results across different
studies of the same phenomenon, which is part of meta-analytic thinking.
Levels of Analysis
Effect sizes for analysis at the group or variable level are based on aggre-
gated scores. Consequently, they do not directly reflect the status of indi-
vidual cases, and there are times when group- or variable-level effects do not
tell the whole story. Knowledge of descriptive statistics including correla-
tion coefficients is required in order to understand group- or variable-level
effect sizes. Not so for case-level effect sizes, which are usually proportions of
scores that fall above or below certain reference points. These proportions
may be observed or predicted, and the reference points may be relative, such
as the median of one group, or more absolute, such as a minimum score on an
admissions test. Huberty (2002) referred to such effect sizes as group overlap
indexes, and they are suitable for communication with general audiences.
There is an old saying that goes, "The more you know, the more simply you can speak."
There are two broad classes of standardized effect sizes for analysis at the
group or variable level, the d family, also known as group difference indexes,
and the r family, or relationship indexes (Huberty, 2002; Rosenthal et al.,
2000). Both families are metric- (unit-) free effect sizes that can compare
results across studies or variables measured in different original metrics. Effect
sizes in the d family are standardized mean differences that describe mean
contrasts in standard deviation units, which can exceed 1.0 in absolute value.
Standardized mean differences are signed effect sizes, where the sign of the
statistic indicates the direction of the corresponding contrast.
Effect sizes in the r family are scaled in correlation units that generally
range from -1.0 to +1.0, where the sign indicates the direction of the relation
between two variables. For example, the point-biserial correlation rpb is an
effect size for designs with two unrelated samples, such as treatment versus
control, and a continuous outcome. It is a form of the Pearson correlation r
in which one of the two variables is dichotomous. If rpb = .30, the correlation
between group membership and outcome is .30, and the former explains
.302 = .09, or 9%, of the total variance in the latter. A squared correlation
is a measure of association, which is generally a proportion of variance
explained effect size. Measures of association are unsigned effect sizes and
thus do not indicate directionality.
Because squared correlations can make some effects look smaller than
they really are in terms of their substantive significance, some researchers
prefer unsquared correlations. If r = .30, for example, it may not seem very
impressive to explain .302 = .09, or <10%, of the total variance. McCloskey
and Ziliak (2009) described examples in medicine, education, and other
areas where potentially valuable findings may have been overlooked due to
misinterpretation of squared correlations. Rutledge and Loh (2004) calcu-
lated correlation effect sizes for 15 widely cited studies in behavioral health
(e.g., heart disease, smoking, depression). They found that proportions of
explained variance were typically <.10, yet these studies are considered to be
landmark investigations that demonstrated clinically meaningful results. For
example, the Steering Committee of the Physicians’ Health Study Research
Group (1988) found that the clinical value of small doses of aspirin in pre-
venting heart attack was so apparent that it terminated a randomized clinical
trial early so that the results could be reported. The correlation effect size
was .034. This means that taking aspirin versus placebo explained about .1% of the variance in the outcome.
For the same data, values of the main measures of association in ANOVA follow the order

η̂² > ε̂² > ω̂²
but their values converge in large samples. The effect size ω̂² is reported more often than ε̂², so the latter is not covered further; see Olejnik and Algina (2000) for more information about ε̂². Kirk (1996) described a category of
miscellaneous effect size indexes that includes some statistics not described in
this book, including the binomial effect size display and the counternull value
of an effect size; see also Rosenthal et al. (2000), Ellis (2010), and Grissom
and Kim (2011).
The parameter estimated by sample d statistics is

δ = (µ1 − µ2) / σ*    (5.1)

where the numerator is the population mean contrast and the denominator is a population standard deviation on the outcome variable. The general form of sample d statistics is

d = (M1 − M2) / σ̂*    (5.2)

where the standardizer σ̂* estimates σ*. The choice of standardizer matters. Suppose, for example, that the mean contrast is 75.00 in each of two studies (studies 3 and 4 of a set). Because the standard deviation in the third study (500.00) is greater than that in the fourth (50.00), we conclude unequal effect sizes across studies 3 and 4 because d3 = 75.00/500.00 = .15 and d4 = 75.00/50.00 = 1.50.
Specific types of d statistics seen most often in the literature and their
corresponding parameters are listed in Table 5.2 and are discussed next (see
also Keselman et al., 2008). From this point, the subscript for d designates its
standardizer.
dpool

One way to compute dpool is from the independent samples t statistic for a nil hypothesis and the group sizes:

dpool = tind √(1/n1 + 1/n2)    (5.3)

This equation is handy when working with secondary sources that do not report sufficient group descriptive statistics to calculate dpool as (M1 − M2)/spool. It is also possible to transform the correlation rpb to dpool for the same data:

dpool = rpb √[ (dfW / (1 − r²pb)) (1/n1 + 1/n2) ]    (5.4)
Table 5.2
Types of d Statistics and Their Parameters

Statistic    Equation                   Parameter

Nonrobust
dpool        (M1 − M2)/spool            (µ1 − µ2)/σ
ds1          (M1 − M2)/s1               (µ1 − µ2)/σ1
ds2          (M1 − M2)/s2               (µ1 − µ2)/σ2
dtotal       (M1 − M2)/sT               (µ1 − µ2)/σtotal
ddiff        MD/sD                      µD/(σ √[2(1 − ρ12)])

Robust
dWin p       (Mtr1 − Mtr2)/sWin p       (µtr1 − µtr2)/σWin
dWin1        (Mtr1 − Mtr2)/sWin1        (µtr1 − µtr2)/σWin1
dWin2        (Mtr1 − Mtr2)/sWin2        (µtr1 − µtr2)/σWin2
Equation 5.4 shows that dpool and rpb describe the same contrast but in dif-
ferent standardized units. An equation that converts d pool to rpb is presented
later.
In correlated designs, dpool is calculated as MD/spool, where MD is the
dependent mean contrast. The standardizer spool in this case assumes that the
cross-conditions correlation r12 is zero (i.e., any subjects effect is ignored).
The parameter estimated when the samples are dependent is µD/σ. The value of dpool can also be computed from tdep, the dependent samples t with n − 1 degrees of freedom for a nil hypothesis, and group size, the within-condition variances, and the variance of the difference scores (Equation 2.21) as

dpool = tdep √[ 2s²D / (n (s²1 + s²2)) ]    (5.5)

ds1 and ds2

Suppose that M1 − M2 = 5.00, s²1 = 400.00, and s²2 = 25.00. The two possible values of d based on a single group standard deviation are

ds1 = 5.00/√400.00 = .25 or ds2 = 5.00/√25.00 = 1.00
The statistic ds2 indicates a contrast four times larger in standard deviation
units than ds1. The two results are equally correct if there are no conceptual
grounds to select one group standard deviation or the other as the standardizer.
It would be best in this case to report values of both ds1 and ds2, not just the one
that gives the most favorable result. When the group variances are similar, dpool
is preferred because spool is based on larger sample sizes (yielding presumably
more precise statistical estimates) than s1 or s2. But if the ratio of the largest
over the smallest variance exceeds, say, 4.0, then ds1 or ds2 would be better.
dtotal
Olejnik and Algina (2000) noted that spool, s1, and s2 estimate the full
range of variation for experimental factors but perhaps not for individual
difference (nonexperimental) factors. Suppose there is a substantial gender
difference on a continuous variable. In this case, spool, s1, and s2 all reflect a
partial range of individual differences. The unbiased variance estimator for
the whole data set is s²T = SST/dfT, where the numerator and denominator are,
respectively, the total sum of squares and total degrees of freedom, or N – 1.
Gender contrasts standardized against sT would be smaller in absolute value
than when the standardizer is spool, s1, or s2, assuming a group difference.
Whether standardizers reflect partial or full ranges of variability is a crucial
problem in factorial designs and is considered in Chapter 8.
Absolute values of dpool, ds1, ds2, and dtotal are positively biased, but
the degree of bias is slight unless the group sizes are small, such as n < 20.
Multiplication of any of these statistics by the correction factor

c(df) = 1 − 3/(4df − 1)    (5.6)

where df are the degrees of freedom of the standardizer, yields an approximately unbiased estimator. Presented next are descriptive statistics for two groups:

M1 = 13.00, s²1 = 7.50   M2 = 11.00, s²2 = 5.00

which imply M1 − M2 = 2.00 and s²pool = 6.25. Reported in Table 5.3 are
results of the independent samples t test and values of d statistics for n = 5,
15, and 30. The t test shows the influence of group size. In contrast, dpool = .80
for all three analyses and in general is invariant to group size, keeping all else
constant. The approximate unbiased estimator c (dfW) dpool is generally less
than dpool, but their values converge as n increases. The two possible values
of d for these data when the standardizer is a group standard deviation are
ds1 = .73 and ds2 = .89. Values of dtotal are generally similar to those of other
d statistics for these data (see Table 5.3), but, in general, dtotal is increasingly
dissimilar to dpool, ds1, and ds2 for progressively larger contrasts on nonexperi-
mental factors. Exercise 1 asks you to verify some of the results in Table 5.3.
Table 5.3
Results of the Independent Samples t Test and Effect Sizes
at Three Group Sizes

                                 Group size (n)
Statistic                       5       15      30

t test
t                             1.26     2.19    3.10
dfW                              8       28      58
p                             .242     .037    .003
Standardized mean differences
dpool                          .80      .80     .80
c (dfW) dpool                  .72      .78     .79
ds1                            .73      .73     .73
ds2                            .89      .89     .89
dtotal                         .77      .75     .75
Point-biserial correlation
rpb                            .41      .38     .38

Note. For all analyses, M1 = 13.00, s²1 = 7.50, M2 = 11.00, s²2 = 5.00, and s²pool = 6.25; p values are two-tailed and for a nil hypothesis.
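The following sketch recomputes several Table 5.3 entries from the summary statistics alone; Equations 5.6 and 5.13 are implemented directly:

```python
from math import sqrt

M1, v1, M2, v2 = 13.00, 7.50, 11.00, 5.00
for n in (5, 15, 30):
    dfW = 2 * n - 2
    s_pool = sqrt(((n - 1) * v1 + (n - 1) * v2) / dfW)
    d_pool = (M1 - M2) / s_pool
    c = 1 - 3 / (4 * dfW - 1)                                    # Equation 5.6
    r_pb = d_pool / sqrt(d_pool ** 2 + dfW * (1 / n + 1 / n))    # Equation 5.13
    print(n, round(d_pool, 2), round(c * d_pool, 2), round(r_pb, 2))
# n = 5: .80, .72, .41 | n = 15: .80, .78, .38 | n = 30: .80, .79, .38
```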
The parameter estimated by ddiff is

δ = µD / (σ √[2(1 − ρ12)])    (5.7)

where σ and ρ12 are, respectively, the common population standard deviation and cross-conditions correlation. Note in this equation that the denominator is less than σ only if ρ12 > .50.
The robust analogue of dpool is

dWin p = (Mtr1 − Mtr2) / sWin p    (5.8)

where Mtr1 − Mtr2 is the contrast between 20% trimmed means and sWin p is the 20% pooled Winsorized standard deviation that assumes homoscedasticity. The latter in a squared metric is

s²Win p = (df1 s²Win1 + df2 s²Win2) / dfW    (5.9)

where s²Win1 and s²Win2 are the 20% Winsorized group variances and df1 = n1 − 1, df2 = n2 − 1, and dfW = N − 2. The parameter estimated by dWin p is

δrob = (µtr1 − µtr2) / σWin    (5.10)
For the data in Table 2.4, Mtr1 = 23.00, Mtr2 = 17.00, s²Win1 = 18.49, and s²Win2 = 9.07, which implies s²Win p = 13.778. (You should verify this result using Equation 5.9.) The robust d statistic based on the pooled standardizer is

dWin p = (23.00 − 17.00)/√13.778 = 1.62

In contrast, the standardizer of a nonrobust d statistic can be grossly inflated by extreme scores. For example, given the same contrast of 5.00 but pooled variances of, respectively, 625.00 and 6.25 in two analyses,

dpool1 = 5.00/√625.00 = .20 and dpool2 = 5.00/√6.25 = 2.00
The point-biserial correlation can be computed directly from descriptive statistics as

rpb = ((M1 − M2)/ST) √(pq)    (5.12)

where ST is the standard deviation in the total data set computed as (SST/N)^1/2 and p and q are the proportions of cases in each group (p + q = 1.0). The expression in parentheses in Equation 5.12 is a d statistic with the standardizer ST. It is the multiplication of this quantity by the standard deviation of the dichotomous factor, (pq)^1/2, that transforms the whole expression into correlation units. It is also possible to convert dpool to rpb for the same data:

rpb = dpool / √[ d²pool + dfW (1/n1 + 1/n2) ]    (5.13)

It may be easier to compute rpb from tind with dfW = N − 2 for a nil hypothesis:

rpb = tind / √(t²ind + dfW)    (5.14)

The absolute value of rpb can also be derived from the independent samples F statistic with 1, dfW degrees of freedom for the contrast:

|rpb| = √[ Find / (Find + dfW) ] = √(SSA/SST) = η̂    (5.15)
This equation also shows that rpb is a special case of η̂, where η̂² = SSA/SST and SSA is the between-groups sum of squares for the dichotomous factor A. In particular, r²pb = η̂² in a two-group design. Note that η̂ is an unsigned correlation, but rpb is signed and thus indicates directionality.
The correlation rpb is for designs with two unrelated samples. For depen-
dent samples, we can instead calculate the correlation of which rpb is a special
case, η̂. It is derived as (SSA /SST)1/2 whether the design is between-subjects
or within-subjects. A complication is that η̂ may not be directly comparable
when the same factor is studied with independent versus dependent samples.
This is because SST is the sum of SSA and SSW when the samples are unrelated,
but it comprises SSA, SSS, and SSA × S for dependent samples. Thus, SST reflects
only one systematic effect (A) when the means are independent but two
systematic effects (A, S) when the means are dependent.
A partial correlation that controls for the subjects effect in correlated designs assuming a nonadditive model is

partial η̂ = √[ SSA / (SSA + SSA×S) ]    (5.16)
where the denominator under the radical represents just one systematic effect
(A). The square of Equation 5.16 is partial η̂2, a measure of association that
refers to a residualized total variance, not total observed variance. Given
partial η̂2 = .25, for example, we can say that factor A explains 25% of the
variance controlling for the subjects effect.
If the subjects effect is relatively large, partial η̂2 can be substantially
higher than η̂2 for the same contrast. This is not contradictory because only
η̂2 is in the metric of the original scores. This fact suggests that partial η̂2 from
a correlated design and η̂2 from a between-subjects design with the same fac-
tor and outcome may not be directly comparable. But η̂2 = partial η̂2 when the
means are unrelated because there is no subjects effect. Exercise 5 asks you to
verify that η̂2 = .167 and partial η̂2 = .588 in the dependent samples analysis
of the data in Table 2.2.
The correlation rpb (and η̂2, too) is affected by base rate, or the pro-
portion of cases in one group versus the other, p and q. It tends to be high-
est in balanced designs. As the design becomes more unbalanced holding
all else constant, rpb approaches zero. Suppose that M1 – M2 = 5.00 and
ST = 10.00 in each of two different studies. The first study has equal group
sizes, or p1 = q1 = .50. The second study has 90% of its cases in the first
group and 10% of them in the second group, or p2 = .90 and q2 = .10. Using
Equation 5.12, we get
rpb1 = (5.00/10.00) √(.50 (.50)) = .25 and rpb2 = (5.00/10.00) √(.90 (.10)) = .15
The values of these correlations are different even though the mean contrast
and standard deviation are the same. Thus, rpb is not directly comparable
across studies with dissimilar relative group sizes (dpool is affected by base rates,
too, but ds1 or ds2 is not). The correlation rpb is also affected by the total vari-
ability (i.e., ST). If this variation is not constant over samples, values of rpb
may not be directly comparable. Assuming normality and homoscedasticity,
d- and r-type effect sizes are related in predictable ways; otherwise, it can
happen that d and r appear to say different things about the same contrast
(McGrath & Meyer, 2006).
The correction for attenuation due to unreliable scores on both variables X and Y is

r̂XY = rXY / √(rXX rYY)    (5.17)

where rXX and rYY are the score reliabilities for the two variables. For example, given rXY = .30, rXX = .80, and rYY = .70,

r̂XY = .30 / √(.80 (.70)) = .40
which says that the estimated correlation between X and Y is .40 controlling
for measurement error. Because disattenuated correlations are only estimates,
it can happen that r̂ XY > 1.0.
In comparative studies where the factor is presumably measured with
nearly perfect reliability, effect sizes are usually corrected for measurement
error in only the outcome variable, designated next as Y. Forms of this cor-
rection for d- and r-type effect sizes are, respectively,
d̂ = d/√rYY and r̂pb = rpb/√rYY    (5.18)

Suppose that d = .75 and rYY = .90. The disattenuated effect size is

d̂ = .75/√.90 = .79

which says that the contrast is predicted to be .79 standard deviations in magnitude controlling for measurement error. The analogous correction for the correlation ratio is η̂²/rYY.
Appropriate reliability coefficients are needed to apply the correction
for attenuation, and best practice is to estimate these coefficients in your
own samples. The correction works best when reliabilities are good, such
as rXX > .80, but otherwise it is less accurate. The capability to correct effect
sizes for unreliability is no substitute for good measures. Suppose that d = .15
and rYY = .10. A reliability coefficient so low says that the scores are basically
random numbers and random numbers measure nothing. The disattenuated
effect size is d̂ = .15/.101/2, or .47, an adjusted result over three times larger than
the observed effect size. But this estimate is not credible because the scores
should not have been analyzed in the first place. Correction for measurement
error increases sampling error compared with the original effect sizes, but
this increase is less when reliabilities are higher. Hunter and Schmidt (2004)
described other kinds of corrections, such as for range restriction.
Links to computer tools, described next, are also available on this book’s
web page.
Distributions of d- and r-type effect sizes are complex and generally fol-
low, respectively, noncentral t distributions and noncentral F distributions.
Noncentral interval estimation requires specialized computer tools. An alter-
native for d is to construct approximate confidence intervals based on hand-
calculable estimates of standard error in large samples. An approach outlined
by Viechtbauer (2007) is described next.
The general form of an approximate 100 (1 − α)% confidence interval for δ is

d ± sd (z2-tail, α)    (5.19)

where sd estimates the standard error of d. For two independent samples,

sd ind = √[ d²/(2df) + N/(n1n2) ]    (5.20)

where df are the degrees of freedom for the standardizer of the corresponding d statistic. An estimated standard error when treating the means based on n pairs of scores as dependent is

sd dep = √[ d²/(2(n − 1)) + 2(1 − r12)/n ]    (5.21)

and the corresponding estimate for d statistics based on difference scores is

sd diff = √[ d²/(2(n − 1)) + 1/n ]    (5.22)
There are versions of these standard error equations for very large samples
where the sample size replaces the degrees of freedom, such as N instead of
dfW = N – 2 in Equation 5.20 for the effect size dpool (e.g., Borenstein, 2009). In
very large samples, these two sets of equations (N, df) give similar results, but
I recommend the versions presented here if the sample size is not very large.
Suppose that n = 30 in a balanced design and dpool = .80. The estimated standard error is

sd pool = √[ .80²/(2(58)) + 60/(30(30)) ] = .2687

Because z2-tail, .05 = 1.96, the approximate 95% confidence interval for δ is

.80 ± .2687 (1.96)

which defines the interval [.27, 1.33]. This wide range of imprecision is due to
the small group size (n = 30). Exercise 6 asks you to construct the approximate
95% confidence interval based on the same data but for a dependent contrast
where r12 = .75.
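A minimal sketch of Equations 5.19 and 5.20 for this worked example follows; only SciPy's normal quantile function is needed:

```python
from math import sqrt
from scipy.stats import norm

d, n = .80, 30
df, N = 2 * n - 2, 2 * n
se = sqrt(d ** 2 / (2 * df) + N / (n * n))   # Equation 5.20
z = norm.ppf(.975)                           # 1.96 for a 95% interval
print(round(se, 4), [round(d - z * se, 2), round(d + z * se, 2)])
# se = .2687, interval [.27, 1.33]
```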
The noncentrality parameter Δ of noncentral t distributions for two independent samples is related to the population effect size δ as follows:

Δ = δ √(n1n2/N)    (5.23)

When the nil hypothesis is true, δ = 0 and Δ = 0; otherwise, Δ has the same sign as δ. Equation 5.23 can be rearranged to express δ as a function of Δ and group sizes:

δ = Δ √(N/(n1n2))    (5.24)
You can verify these results with an online noncentral t percentile cal-
culator1 or J. H. Steiger’s Noncentral Distribution Calculator (NDC), a freely
available Windows application for noncentrality interval estimation.2 Thus,
the observed effect size of dpool = .80 is just as consistent with a population effect size as low as δ = .27 as it is with a population effect size as high as δ = 1.32, with 95% confidence. The approximate 95% confidence interval for δ for the same data is [.27, 1.33], which is similar to the noncentral interval just described.
Smithson (2003) described a set of freely available SPSS scripts that
calculate noncentral confidence intervals based on dpool in two-sample designs
when the means are treated as independent.3 Corresponding scripts for SAS/
STAT and R are also available.4 Kelley (2007) described the Methods for the
Behavioral, Educational, and Social Sciences (MBESS) package for R, which
calculates noncentral confidence intervals for many standardized effect sizes.5
The Power Analysis module in STATISTICA Advanced also calculates non-
central confidence intervals based on standardized effect sizes (see footnote 5,
Chapter 2). In correlated designs, distributions of dpool follow neither central
nor noncentral t distributions (Cumming & Finch, 2001). For dependent
mean contrasts, ESCI uses Algina and Keselman’s (2003) method for finding
approximate noncentral confidence intervals for µD/s.
1http://keisan.casio.com/
2http://www.statpower.net/Software.html
3http://core.ecu.edu/psyc/wuenschk/SPSS/SPSS-Programs.htm
4http://dl.dropbox.com/u/1857674/CIstuff/CI.html
5http://cran.r-project.org/web/packages/MBESS/index.html
The population proportion of explained variance η² is related to the noncentrality parameter λ of noncentral F distributions as follows:

η² = λ/(λ + N)    (5.25)

Thus, the observed effect size of η̂² = .142 is just as consistent with a population effect size as low as η² = .018 as it is with a population effect size as high as η² = .305, with 95% confidence. The range of imprecision is wide due to the small sample size. You should verify with Equation 5.25 that the bounds of the confidence interval in λ units convert to the corresponding bounds of the confidence interval in η² units for these data.
Other computer tools that generate noncentral confidence intervals for η² include the aforementioned MBESS package for R and the Power Analysis module in STATISTICA Advanced. There is a paucity of programs that calculate noncentral confidence intervals for η² in correlated designs. This is because the distributions of η̂² in this case may follow neither central nor noncentral test distributions. An alternative is bootstrapped confidence
intervals.
The methods outlined next describe effect size at the case level.
Measures of Overlap
Figure 5.1. Distributions for two groups with means M2 < M1, illustrating the overlap indexes U1 (a) and U3 (b).
Figure 5.2. A graphical display of means only (a) versus one that shows box plots
(b) for the scores from two groups with outliers in Table 2.4.
Tail Ratios
Figure 5.3. The right tail ratio p1/p2 relative to cutting point when d > 0 for M1 > M2 (a)
and d = 0 for M1 = M2 (b).
Suppose that job applicants will be considered only if their scores exceed 130,
or two standard deviations above the mean. Normal deviate equivalents of this
threshold in the separate distributions for women and men are, respectively, 1.82 and 2.11, and the corresponding proportions of scores beyond the threshold are .0344 and .0174. The right tail ratio is

RTR = .0344/.0174 = 1.98
Thus, women are about twice as likely as men to have scores that exceed the
cutting point. This method may not give accurate results if the distributions
are not approximately normal. One should instead analyze the frequency
distributions to find the exact proportions of scores beyond the cutting point.
Tail ratios are often reported when gender differences at the extremes of dis-
tributions are studied.
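Under the normal model, the computation reduces to two survival-function calls in Python with SciPy; the deviates 1.82 and 2.11 follow from the proportions given above (note the small rounding difference):

```python
from scipy.stats import norm

p_women = norm.sf(1.82)          # proportion of women beyond the cut, about .0344
p_men = norm.sf(2.11)            # proportion of men beyond the cut, about .0174
print(round(p_women / p_men, 2)) # about 1.97 (1.98 from the rounded proportions)
```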
Tail ratios generally increase as the threshold moves further to the
right (RTR) or further to the left (LTR) when M1 ≠ M2, assuming sym-
metrical distributions with equal variances. But it can happen that tail ratios do not equal 1.00 even though M1 = M2 and d = rpb = 0 when there is heteroscedasticity. For example, the two distributions in Figure 5.3(b) have the same means, but the tail ratios are generally not 1.00 because group 1 is
more variable than group 2. Thus, scores from group 1 are overrepresented
at both extremes of the distributions. If the researcher wants only to com-
pare central tendencies, this “disagreement” between the tail ratios and
d may not matter. In a selection context, though, the tail ratios would be
of critical interest.
McGraw and Wong’s (1992) common language effect size (CL) is the
predicted probability that a random score on a continuous outcome selected
from the group with the higher mean exceeds a random score from the group
with the lower mean. If two frequency distributions are identical, CL = .50,
which says that it is just as likely that a random score from one group exceeds
a random score from the other group. As the two frequency distributions
become more distinct, the value of CL increases up to its theoretical maxi-
mum of 1.00. Vargha and Delaney (2000) described the probability of (sto-
chastic) superiority, which can be applied to ordinal outcome variables.
Huberty and Lowman’s (2000) improvement over chance classification, or
I, is for the classification phase of logistic regression or discriminant func-
tion analysis. The I statistic measures the proportionate reduction in the
error rate compared with random classification. If I = .35, for example, the
observed classification error rate is 35% less than that expected in random
classification.
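Assuming normality and independent groups, McGraw and Wong's CL can be computed as sketched below; the descriptive statistics reused here (M1 = 13.00, s²1 = 7.50, M2 = 11.00, s²2 = 5.00) come from the running two-group example and are used purely for illustration:

```python
from math import sqrt
from scipy.stats import norm

def common_language(M1, v1, M2, v2):
    """CL for two independent normal distributions (McGraw & Wong, 1992)."""
    return norm.cdf((M1 - M2) / sqrt(v1 + v2))

print(round(common_language(13.00, 7.50, 11.00, 5.00), 2))  # about .71
```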
Substantive Significance
This section provides interpretive guidelines for effect sizes. I also sug-
gest how to avoid fooling yourself when estimating effect sizes.
Questions
As researchers learn about effect size, they often ask, what is a large
effect? a small effect? a substantive (important) effect? Cohen (1962) devised
what were probably the earliest guidelines for describing qualitative effect
size magnitudes that seemed to address the first two questions. The descriptor
medium corresponded to a subjective average effect size in nonexperimental
studies. The other two descriptors were intended for situations where neither
theory nor prior empirical findings distinguish between small and large effects.
In particular, he suggested that d = .50 indicated a medium effect size, d = .25
corresponded to a small effect size, and d = 1.00 signified a large effect size.
Cohen (1969) later revised his guidelines to d = .20, .50, and .80 as, respectively,
small, medium, and large, and Cohen (1988) described similar benchmarks for
correlation effect sizes.
Cohen never intended the descriptors small, medium, and large—T-shirt
effect sizes—to be applied rigidly across different research areas. He also
acknowledged that his conventions were an educated guess. This is why he
encouraged researchers to look first to the empirical literature in their areas
before using these descriptors. Unfortunately, too many researchers blindly apply these generic benchmarks without considering what effect sizes are typical in their own research areas.
Clinical Significance
Research Example
This example illustrates effect size estimation at both the group and
case levels in an actual data set. You can download the raw data file in SPSS
format for this example from the web page for this book. At the beginning of
courses in introductory statistics, 667 psychology students (M age = 23.3 years,
s = 6.63; 77.1% women) were administered the original 15-item test of basic
math skills reproduced in Table 5.5. The items are dichotomously scored as either 0 (wrong) or 1 (correct), so total scores range from 0 to 15. These
Table 5.5
Items of a Basic Math Skills Test

(The 15 items involve operations with fractions, decimals, and percentages; solving for x; summation notation; and graphing linear equations.)
Outcome           n      M      s²               Effect sizes
Satisfactory     511   10.96    9.45    .46a   .54   .19b   .22   .05   .67
Unsatisfactory   156    9.52   10.45

Note. For a nil hypothesis, t(665) = 5.08. Corrections for measurement error are based on rXX = .72.
aNoncentral 95% confidence interval for δ: [.28, .65]. bNoncentral 95% confidence interval for η²: [.014, .069].
The risk for a negative outcome in statistics increases from about
12% among students who correctly solved at least 80% of the math test items
to the point where nearly half (45.3%) of the students who correctly solved
fewer than 40% of test items had unsatisfactory outcomes.
Conclusion
This chapter introduced basic principles of effect size estimation and two
families of group- or variable-level standardized effect sizes for continuous out-
comes, standardized mean differences and correlation effect sizes. Denominators
of standardized mean differences for contrasts between independent means are
standard deviations in the metric of the original scores, but it is critical to report
which particular standard deviation was specified as the standardizer. There are
robust standardized mean differences that may be less affected by nonnormality,
heteroscedasticity, or outliers. Descriptive correlation effect sizes considered are
all forms of η̂, or the square root of the sum of squares for the contrast over the
total sum of squares. Case-level analysis of proportions of scores from one group
versus another group that fall above or below certain reference points can illu-
minate practical implications of group-level differences. Estimating effect size is
part of determining substantive significance, but the two are not synonymous.
The next chapter deals with effect sizes for categorical outcomes.
Learn More
Breaugh (2003) reviews mistakes to avoid, and books about effect size
estimation by Ellis (2010) and Grissom and Kim (2011) are good resources
for applied researchers. Kelley and Preacher (2012) give an excellent over-
view of the concept of effect size.
Exercises
Types of Categorical Outcomes
Effect sizes, considered next, estimate the degree of relative risk for an
undesirable outcome, such as relapsed–not relapsed, across different popula-
tions, such as treatment versus control. The same estimators and their cor-
responding parameters can also be defined when neither level of the outcome
dichotomy corresponds to something undesirable, such as agree–disagree. In
this case, the idea of risk is replaced by that of comparing relative proportions
for binary outcomes.
Presented in Table 6.1 is a fourfold table for comparing treatment and
control groups on the outcome relapsed–not relapsed. The letters in the table
represent observed frequencies in each cell. For example, the size of the con-
trol group is nC = A + B, where A and B, respectively, stand for the number
of untreated cases that relapsed or did not relapse. The size of the treatment
group is nT = C + D, where C and D, respectively, symbolize the number of
treated cases that relapsed or did not relapse. The total sample size is the sum
of A, B, C, and D. Listed in Table 6.2 are the effect sizes, the equation for
each effect size based on the cell frequencies represented in Table 6.1, and
the corresponding parameter.
Risk Rates
Table 6.1
A Fourfold Table for a Contrast on a Dichotomy
Relapsed Not relapsed
Control A B
Treatment C D
Table 6.2
Effect Sizes for a Fourfold Table, Their Equations, and Corresponding Parameters

Statistic          Equation                                                   Parameter
Risk rates
  pC               A/(A + B)                                                  πC
  pT               C/(C + D)                                                  πT
Comparative risk
  RD               pC - pT                                                    πC - πT
  RR               pC /pT                                                     πC /πT
  OR               [pC /(1 - pC)]/[pT /(1 - pT)] = AD/BC                      ω
Correlation
  φ̂                (AD - BC)/√[(A + B)(C + D)(A + C)(B + D)] = √(χ²2×2 /N)    φ

Note. The letters A–D represent observed cell frequencies in Table 6.1. If A, B, C, or D = 0 in computation
of OR, add .5 to the observed frequencies in all cells. RD = risk difference; RR = risk ratio; OR = odds ratio;
χ²2×2 = contingency table chi-square with a single degree of freedom.
or 1 - pC and 1 - pT, are the proportions of cases in each group that did not
relapse. The statistic pC estimates πC, the proportion of cases in the control
population that relapsed, and pT estimates the corresponding parameter πT in
the treatment population.
Comparative Risk

$$\text{odds}_C = \frac{p_C}{1 - p_C} \quad\text{and}\quad \text{odds}_T = \frac{p_T}{1 - p_T} \qquad (6.1)$$
Suppose pC = .60 and pT = .40 are, respectively, the relapse rates among
control and treated cases. The relapse odds in the control group are .60/.40
= 1.50, so the odds of relapse are 3:2. In the treatment group, the odds for
relapse are lower, .40/.60 = .67; that is, the odds of relapse are 2:3. The odds
ratio is OR = 1.50/.67 = 2.25, which says that the relapse odds are 2¼ times
higher among control cases than treated cases. Likewise, OR = .75 would say
that the relapse odds in the control group are only 75% as high as the odds
in the treatment group. In fourfold tables where all margin totals are equal,
OR = RR². The parameter for OR is ω = ΩC /ΩT, the ratio of the within-populations odds, where

$$\Omega_C = \frac{\pi_C}{1 - \pi_C} \quad\text{and}\quad \Omega_T = \frac{\pi_T}{1 - \pi_T} \qquad (6.2)$$
Correlation

$$\varphi = \frac{\pi_{Cr}\,\pi_{TNr} - \pi_{CNr}\,\pi_{Tr}}{\sqrt{\pi_{C\bullet}\,\pi_{T\bullet}\,\pi_{\bullet r}\,\pi_{\bullet Nr}}} \qquad (6.4)$$

where the subscripts r and Nr designate, respectively, relapsed and not relapsed; C and T designate the control and treatment populations; and • designates a marginal proportion.
Evaluation
The risk difference RD is easy to interpret but has a drawback: Its range
depends on the values of the population proportions πC and πT. That is, the
range of RD is greater when both πC and πT are closer to .50 than when they
are closer to either 0 or 1.00. The implication is that RD values may not be
comparable across different studies when the corresponding parameters πC
and πT are quite different. The risk ratio RR is also easy to interpret, but it has
the shortcoming that only the finite interval from 0 to < 1.0 indicates lower
risk in the group represented in the numerator, whereas the whole interval
from just above 1.0 to infinity indicates higher risk. Large values of relative
risk or odds can also be deceiving when they refer to a rare outcome: Ten
times the likelihood of a rare event still makes for a low base rate. Only the
risk difference makes it clear when the absolute increase in risk is slight (e.g.,
RD = .0009, or .09%). King and Zeng (2001) discussed challenges in estimating
rare events in logistic regression.
The correlation φ̂ can reach its maximum absolute value (1.0) only if the
marginal proportions for rows and columns in a fourfold table are equal. As the
row and column marginal proportions diverge, the maximum absolute value
of φ̂ approaches zero. This implies that the value of φ̂ will change if the cell
frequencies in any row or column are multiplied by an arbitrary constant. This
makes φ̂ a margin-bound effect size; the correlation rpb is also margin bound
because it is affected by group base rates (see Equation 5.12). Exercise 1 asks
you to demonstrate this property of φ̂. Grissom and Kim (2011, Chapters 8–9)
described additional effect sizes for categorical outcomes.
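These group-level effect sizes are simple to compute directly from the cell frequencies. The following minimal Python sketch implements the equations in Table 6.2; the function name is mine, and the example frequencies correspond to relapse rates of .60 versus .40 with 100 cases per group:

```python
import math

def fourfold_effect_sizes(A, B, C, D):
    """Effect sizes for a fourfold table laid out as in Table 6.1:
    control row (A relapsed, B not relapsed),
    treatment row (C relapsed, D not relapsed)."""
    pC, pT = A / (A + B), C / (C + D)
    rd = pC - pT                                  # risk difference
    rr = pC / pT                                  # risk ratio
    odds_ratio = (pC / (1 - pC)) / (pT / (1 - pT))
    phi = (A * D - B * C) / math.sqrt(
        (A + B) * (C + D) * (A + C) * (B + D))    # margin-bound correlation
    return rd, rr, odds_ratio, phi

print(fourfold_effect_sizes(60, 40, 40, 60))
# RD = .20, RR = 1.50, OR = 2.25, phi-hat = .20
```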
Table 6.3
Asymptotic Standard Errors of Risk Effect Sizes

Statistic    Standard error
pC           $\sqrt{\dfrac{p_C\,(1 - p_C)}{n_C}}$
pT           $\sqrt{\dfrac{p_T\,(1 - p_T)}{n_T}}$
RD           $\sqrt{\dfrac{p_C\,(1 - p_C)}{n_C} + \dfrac{p_T\,(1 - p_T)}{n_T}}$
ln (RR)      $\sqrt{\dfrac{1 - p_C}{n_C\,p_C} + \dfrac{1 - p_T}{n_T\,p_T}}$
ln (OR)      $\sqrt{\dfrac{1}{n_C\,p_C\,(1 - p_C)} + \dfrac{1}{n_T\,p_T\,(1 - p_T)}}$
The value of z2-tail, .05 is 1.96, so the approximate 95% confidence interval for
πC - πT is

$$.20 \pm 1.96\sqrt{\frac{.60\,(1 - .60)}{100} + \frac{.40\,(1 - .40)}{100}} = .20 \pm 1.96\,(.0693)$$

which defines the interval [.06, .34]. Thus, the sample result RD = .20 is just
as consistent with a population risk difference as low as .06 as it is with a
population risk difference as high as .34, with 95% confidence.
This time, I construct the approximate 95% confidence interval for the
population odds ratio ω based on OR = 2.25:

$$\ln(2.25) = .8109$$

$$s_{\ln(\mathrm{OR})} = \sqrt{\frac{1}{100\,(.40)\,(1 - .40)} + \frac{1}{100\,(.60)\,(1 - .60)}} = .2887$$

The product 1.96 (.2887) = .5659 is added to and subtracted from .8109,
which defines the interval [.2450, 1.3768]. To convert the lower and upper
bounds of this interval back to OR units, I take their antilogs, e^.2450 = 1.2776
and e^1.3768 = 3.9621, which define the interval [1.28, 3.96] for ω at two-decimal
accuracy.
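The same interval is easy to reproduce in a few lines of Python; this is a sketch of the Wald method just described, with variable names of my choosing and 100 cases per group:

```python
import math

nC = nT = 100
pC, pT = .60, .40
odds_ratio = (pC / (1 - pC)) / (pT / (1 - pT))   # 2.25
# Asymptotic standard error of ln(OR) from Table 6.3:
se = math.sqrt(1 / (nC * pC * (1 - pC)) + 1 / (nT * pT * (1 - pT)))
lo = math.log(odds_ratio) - 1.96 * se
hi = math.log(odds_ratio) + 1.96 * se
print(math.exp(lo), math.exp(hi))  # about 1.28 and 3.96
```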
If the categorical outcome has more than two levels or there are more
than two groups, the contingency table is larger than 2 × 2. Measures of
comparative risk (RD, RR, OR) can be computed for such a table only if it is
reduced to a 2 × 2 table by collapsing or excluding rows or columns. What is
probably the best known measure of association for contingency tables with
more than two rows or columns is Cramer’s V, an extension of the ϕ̂ coef-
ficient. Its equation is
$$V = \sqrt{\frac{\chi^2_{r \times c}}{\min(r - 1,\ c - 1) \times N}} \qquad (6.5)$$
where the numerator under the radical is the contingency table chi-square
with degrees of freedom equal to the number of rows (r) minus one times the
number of columns (c) minus one (see Equation 3.15). The denominator
under the radical is the product of the sample size and the smallest dimension
of the table minus one. For example, if the table is 3 × 4 in size, then
min (3 – 1, 4 – 1) = 2
For a 2 × 2 table, the equation for Cramer’s V reduces to that for |ϕ̂|. For
larger tables, Cramer’s V is not a correlation, although its range is 0 to 1.00.
Thus, one cannot generally interpret the square of Cramer’s V as a proportion
of explained variance. Exercise 3 asks you to calculate Cramer’s V for the
4 × 2 cross-tabulation in Table 5.7.
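A short Python sketch of Equation 6.5 follows; the function name is mine, and the 3 × 4 example values are hypothetical:

```python
import math

def cramers_v(chi_square, n, r, c):
    """Cramer's V for an r x c contingency table (Equation 6.5)."""
    return math.sqrt(chi_square / (min(r - 1, c - 1) * n))

# Hypothetical 3 x 4 table with chi-square = 12.0 and N = 300:
# V = sqrt(12 / (2 * 300)) = .14
print(cramers_v(12.0, 300, 3, 4))
```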
1http://www.hutchon.net/ConfidOR.htm
2http://www.pedro.org.au/english/downloads/confidence-interval-calculator/
3http://cran.r-project.org/web/packages/epitools/index.html
Figure 6.1. Distributions of normal and clinical cases on a screening test divided by a cutting point; the region of negative results among normal cases corresponds to specificity, and the region of positive results among clinical cases corresponds to sensitivity.
Table 6.4
Definitions of Sensitivity, Specificity, Base Rate, and Predictive Values

                              True status
Screening test result    Clinical    Normal
Positive (clinical)      A           B
Negative (normal)        C           D

Statistic      Equation
Sensitivity    A/(A + C)
Specificity    D/(B + D)
BR             (A + C)/(A + B + C + D)
PPV            A/(A + B)
NPV            D/(C + D)

Note. The letters A–D represent observed cell frequencies. BR = base rate; PPV = positive predictive value;
NPV = negative predictive value. The total number of cases is N = A + B + C + D.
Sensitivity, specificity, base rate, and predictive value are defined in the
bottom part of Table 6.4 based on cell frequencies in the top part of the table.
Sensitivity is the proportion of screening test results from clinical cases that
are correct, or A/(A + C). If sensitivity is .80, then 80% of test results in
the clinical group are valid positives and the rest, 20%, are false negatives.
Specificity is the proportion of results from normal cases that are correct, or
D/(B + D). If specificity is .70, then 70% of the results in the normal group are
valid negatives and the rest, 30%, are false positives. The ideal screening test
is 100% sensitive and 100% specific. Given overlap of distributions such as
that illustrated in Figure 6.1, this ideal is not within reach.
Sensitivity and specificity are determined by the threshold on a screen-
ing test. This means that different thresholds on the same test will generate
different sets of sensitivity and specificity values in the same sample. But both
sensitivity and specificity are independent of population base rate and sample
size. For example, a test that is 80% sensitive for a disorder should correctly
detect 80% of cases that actually have the disorder whether the disorder is
rare or common.
Table 6.5
Positive and Negative Predictive Values at Two Different Base Rates
for a Screening Test 80% Sensitive and 70% Specific

                                        True status              Predictive value
Screening test result    Prediction    Clinical   Normal   Total    Positive   Negative

Base rate = .10
Positive                 Clinical      80         270      350      .23        .97
Negative                 Normal        20         630      650
Total                                  100        900      1,000

Base rate = .75
Positive                 Clinical      600        75       675      .89        .54
Negative                 Normal        150        175      325
Total                                  750        250      1,000
Figure 6.2. Expected predictive values as functions of base rate for a screening test
that is 80% sensitive and 70% specific.
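The predictive values in Table 6.5 and Figure 6.2 follow directly from Bayes's theorem. Here is a minimal Python sketch that reproduces them; the function name is mine:

```python
def predictive_values(sensitivity, specificity, base_rate):
    """Positive and negative predictive values from sensitivity,
    specificity, and base rate via Bayes's theorem."""
    valid_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    valid_neg = specificity * (1 - base_rate)
    false_neg = (1 - sensitivity) * base_rate
    ppv = valid_pos / (valid_pos + false_pos)
    npv = valid_neg / (valid_neg + false_neg)
    return ppv, npv

print(predictive_values(.80, .70, .10))  # (.23, .97)
print(predictive_values(.80, .70, .75))  # (.89, .54)
```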
The positive likelihood ratio (PLR) is

$$\mathrm{PLR} = \frac{\text{sensitivity}}{1 - \text{specificity}} \qquad (6.6)$$
which indicates the number of times more likely that a positive result comes
from clinical cases (numerator) than from normal cases (denominator). The
negative likelihood ratio (NLR) is
$$\mathrm{NLR} = \frac{1 - \text{sensitivity}}{\text{specificity}} \qquad (6.7)$$
and it measures the degree to which a negative result is more likely to come
from clinical cases than from normal cases.
Using Bayesian methods, diagnosticians can estimate how much the
odds of having a disorder will change given a positive versus negative test
result and the disorder base rate expressed as odds. The relation is

posttest odds = pretest odds × likelihood ratio

For a screening test where sensitivity = .80 and specificity = .70, the value of PLR is

$$\mathrm{PLR} = \frac{.80}{1 - .70} = 2.667$$

so positive test results are about 2⅔ times more likely among clinical cases
than among normal cases. The value of NLR is

$$\mathrm{NLR} = \frac{.20}{.70} = .286$$

Suppose the base rate is BR = .15, so the pretest odds are .15/.85 = .176. The
posttest odds of the disorder after a positive result are .176 × 2.667 = .471,
which, as expected, are higher than the pretest odds (.176). To convert odds
to probability, we calculate p = odds/(1 + odds). So the probability of the dis-
order increases from BR = .15 before testing to .471/(1 + .471), or about .32,
after a positive test result. The posttest odds of the disorder after a negative
result are .176 × .286 = .050,
which are lower than the pretest odds (.176). Thus, the probability of the
disorder decreases from BR = .15 to .050/(1 + .050), or about .048, after
observing a negative result.
A negative test result in this example has a greater relative impact than
a positive result on the odds of having the disorder. The factor by which the
pretest odds are increased, given a positive result, is PLR = 2.667. But the
factor by which the pretest odds are reduced, given a negative result, is NLR =
.286, which is the same as dividing the pretest odds by a factor of 1/.286, or about
3.50. This pattern is consistent with this screening test, where sensitivity =
.80, specificity = .70, and BR = .15. That is, the test will be better at ruling out
a disorder than at detecting it under this base rate (see Figure 6.2). In general,
test results have greater impact on changing the pretest odds when the base
rate is moderate, neither extremely low (close to 0) nor extremely high (close
to 1.0). But if the target disorder is either very rare or very common, only a
result from a highly accurate screening test will change things much. There
are web pages that calculate likelihood ratios.4
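The odds-conversion steps just illustrated can also be scripted. This sketch, with variable names of my choosing, reproduces the worked example with sensitivity = .80, specificity = .70, and BR = .15:

```python
sensitivity, specificity, base_rate = .80, .70, .15
plr = sensitivity / (1 - specificity)       # 2.667
nlr = (1 - sensitivity) / specificity       # .286
pretest_odds = base_rate / (1 - base_rate)  # .176

for label, lr in (("positive", plr), ("negative", nlr)):
    posttest_odds = pretest_odds * lr       # posttest odds = pretest odds x LR
    probability = posttest_odds / (1 + posttest_odds)
    print(label, round(posttest_odds, 3), round(probability, 3))
# positive result: odds .471, probability .32
# negative result: odds .050, probability .048
```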
The method just described to estimate posttest odds can be applied
when base rate or test characteristics vary over populations. Moons, van Es,
Deckers, Habbema, and Grobbee (1997) found that sensitivity, specificity,
and likelihood ratios for the exercise (stress) test for coronary disease varied
by gender and systolic blood pressure at baseline. In this case, no single set
of estimates was adequate for all groups. Exercise 5 concerns the capability
to tailor estimates for different groups, in this case the disorder base rate. It
is also possible to combine results from multiple screening tests, which may
further improve prediction accuracy.
4http://www.medcalc.org/calc/diagnostic_test.php
Estimates of disorder base rates are not always readily available. Without
large-scale epidemiological studies, other sources of information, such as case
records or tabulations of the frequencies of certain diagnoses, may provide
reasonable approximations. The possibility of estimating base rates from such
sources prompted Meehl and Rosen (1955) to say that “our ignorance of base
rates is nothing more subtle than our failure to compute them” (p. 213). One
can also calculate predictive values for a range of base rates. The use of impre-
cise (but not grossly inaccurate) estimates may not have a large impact on
predictive values, especially for tests with high sensitivities and specificities.
But lower sensitivity and specificity values magnify errors due to imprecise
base rate estimates.
Sensitivity, specificity, and predictive values are proportions that are typi-
cally calculated in samples, so they are subject to sampling error. Base rates are
subject to sampling error, too, if they are empirically estimated. It is possible to
construct confidence intervals based on any of these proportions using the Wald
method. Just use the equation for either pC or pT in Table 6.3 to estimate the
standard error for the proportion of interest. Another option is to use one of the
other methods described by Newcombe (1998) for sample proportions.
Likelihood ratios are affected by sampling error in estimates of sensitivity
and specificity. Simel, Samsa, and Matchar (1991) described a method to con-
struct approximate confidence intervals based on observed likelihood ratios.
Their method analyzes natural log transformations of likelihood ratios, and it is
implemented in some web calculating pages that also derive confidence inter-
vals based on sample sensitivity, specificity, and predictive values.5 Herbert’s
(2011) Confidence Interval Calculator spreadsheet for Excel also uses the
Simel et al. (1991) method to calculate confidence intervals based on observed
likelihood ratios and values of sensitivity and specificity (see footnote 2).
Posttest odds of the disorder are affected by sampling error in estimates
of specificity, sensitivity, likelihood ratios, and base rates. Mossman and Berger
(2001) described five different methods of interval estimation for the posttest
odds following a positive test result. Two of these methods are calculable by
hand and based on natural log transformations, but other methods, such as
Bayesian interval estimation, require computer support. Results of computer
simulations indicated that results across the five methods were generally
comparable for group sizes > 80 and sample proportions not very close to
either 0 or 1.00.
Crawford, Garthwaite, and Betkowska (2009) extended the Bayesian
method described by Mossman and Berger (2001) for constructing confidence
intervals based on posttest probabilities following a positive result. Their
method accepts either empirical estimates of base rates or subjective estimates
(guesses). It also constructs either two-sided or one-sided confidence intervals.
One-sided intervals may be of interest if, for example, the diagnostician is inter-
ested in whether the posttest probability following a positive result is lower
than the point estimate suggests but not in whether it is higher. A computer
program that implements the Crawford et al. (2009) method can be freely
downloaded.6 It requires input of observed frequencies, not proportions, and
it does not calculate intervals for posttest probabilities after negative results.
Suppose that results from an epidemiological study indicate that a total
of 150 people in a sample of 1,000 have the target disorder (BR = .15).
5http://ktclearinghouse.ca/cebm/practise/ca/calculators/statscalc
6http://www.abdn.ac.uk/~psy086/dept/BayesPTP.htm
Research Examples
The data for this example are described in Chapter 5. Briefly, a total of
667 students in introductory statistics courses completed a basic math skills
test at the beginning of the semester. Analysis of scores at the case level
indicated that students with test scores < 40% correct had higher rates of
unsatisfactory course outcomes (see Table 5.7). These data are summarized
in the fourfold table on the left side of Table 6.6. Among the 75 students
with math test scores < 40%, a total of 34 had unsatisfactory outcomes, so
p< 40% = 34/75, or .453, 95% CI [.340, .566]. A total of 122 of 592 students
with math test scores ≥ 40% had unsatisfactory outcomes, so their risk rate
is p≥ 40% = 122/592, or .206, 95% CI [.173, .239]. The observed risk difference is
RD = .453 - .206, or .247, 95% CI [.129, .364], so students with the lowest
math test scores have about a 25% greater risk for poor outcomes in introduc-
tory statistics.
The risk ratio is RR = .453/.206, or 2.199, 95% CI [1.638, 2.953], which
says that the rate of unsatisfactory course outcomes is about 2.2 times higher
among the students with the lowest math scores (Table 6.6). The odds ratio is
$$\mathrm{OR} = \frac{.453/(1 - .453)}{.206/(1 - .206)} = 3.192$$
with 95% CI [1.943, 5.244], so the odds of doing poorly in statistics are about
3.2 times higher among students with math scores < 40% correct than among
their classmates with higher scores. The correlation between the dichotomies
of < 40% versus ≥ 40% correct on the math test and unsatisfactory–satisfactory
course outcome is ϕ̂ = .185, so the former explains about 3.4% of the variance
in the latter.
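These values can be verified from the cell frequencies implied by the text (34 of 75 students with scores < 40% and 122 of 592 with scores ≥ 40% had unsatisfactory outcomes); the following sketch redoes the arithmetic:

```python
import math

A, B = 34, 41     # scores < 40%: unsatisfactory, satisfactory
C, D = 122, 470   # scores >= 40%: unsatisfactory, satisfactory
p_low, p_high = A / (A + B), C / (C + D)
print(p_low - p_high)    # RD = .247
print(p_low / p_high)    # RR = 2.20
print((p_low / (1 - p_low)) / (p_high / (1 - p_high)))
# OR = 3.19 (3.192 in the text, which uses rounded risk rates)
print((A * D - B * C) / math.sqrt((A + B) * (C + D) * (A + C) * (B + D)))
# phi-hat = .185
```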
Conclusion
Learn More
Exercises
1. Show that φ̂ is margin bound using the following fourfold table:
Not
Relapsed relapsed
Control 60 40
Treatment 40 60
Contrast Specification and Tests
where (c1, c2, . . . , ca) is the set of contrast weights (coefficients) that
specify the comparison. Application of the same weights to sample means
estimates ψ:

$$\hat{\psi} = \sum_{i=1}^{a} c_i M_i \qquad (7.2)$$
Contrast weights should respect a few rules: They must sum to zero,
and weights for at least two different means should not equal zero. Means
assigned a weight of zero are excluded, and means with positive weights are
contrasted with means given negative weights. Suppose factor A has a = 3
levels. The weights (1, 0, -1) meet the requirements just stated and specify
a pairwise comparison of the first and third means. Multiples of this set, such
as (2, 0, -2), among innumerable others with the same pattern of coefficients,
all specify the same pairwise comparison as the set (1, 0, -1). The scale of ψ̂1
depends on which set of weights is applied to the means. This does not affect
statistical tests or measures of association for contrasts because their equations
correct for the scale of the weights.
But the scale of contrast weights is critical if a comparison should be
interpreted as the difference between the averages of two subsets of means. If
so, the weights should make up a standard set and satisfy what Bird (2002)
called mean difference scaling: The sum of the absolute values of the coeffi-
cients in a standard set is 2.0. This implies for a pairwise comparison that one
weight must be +1, another must be -1, and the rest are all zero. For example,
$$\hat{\psi}_2 = \left(\tfrac12\right) M_1 + (-1)\,M_2 + \left(\tfrac12\right) M_3 = \frac{M_1 + M_3}{2} - M_2$$
But the weights (1, -2, 1) for the same pattern are not a standard set because
the sum of their absolute values is 4.0, not 2.0.
Two contrasts are orthogonal if they each reflect an independent aspect
of the omnibus effect; that is, the result in one comparison says nothing about
what may be found in the other. In balanced designs (i.e., equal group sizes),
a pair of contrasts is orthogonal if the sum of the products of their corresponding
weights is zero, or
$$\sum_{i=1}^{a} c_{1i}\,c_{2i} = 0 \qquad (7.3)$$
ψ̂1: (1, 0, -1)
ψ̂2: (½, -1, ½)
Intuitively, these contrasts are unrelated because the two means compared
in ψ̂1, M1 and M3, are combined in ψ̂2 and contrasted against the third
mean, M2. The weights for a second pair of contrasts in the same design are
listed next:
ψ̂2: (½, -1, ½)
ψ̂3: (1, -1, 0)
This second pair is not orthogonal because the sum of the weight cross-products
is not zero:

$$\sum_{i=1}^{a} c_{2i}\,c_{3i} = \left(\tfrac12\right)(1) + (-1)(-1) + \left(\tfrac12\right)(0) = 1.5$$

Contrasts ψ̂2 and ψ̂3 are correlated because M2 is one of the two means compared in both.
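Orthogonality is easy to check mechanically. Here is a minimal Python sketch of Equation 7.3; the function name is mine:

```python
def weight_cross_product_sum(weights_1, weights_2):
    """Sum of products of corresponding contrast weights (Equation 7.3);
    zero means the contrasts are orthogonal in a balanced design."""
    return sum(c1 * c2 for c1, c2 in zip(weights_1, weights_2))

print(weight_cross_product_sum((1, 0, -1), (.5, -1, .5)))  # 0.0, orthogonal
print(weight_cross_product_sum((.5, -1, .5), (1, -1, 0)))  # 1.5, correlated
```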
If every pair in a set of contrasts is orthogonal, the entire set is
orthogonal. The maximum number of orthogonal contrasts is limited by
the degrees of freedom for the omnibus effect, dfA. Thus, the omnibus
effect can theoretically be broken down into a – 1 independent directional
effects, where a is the number of groups. Expressed in terms of sums of
squares, this is
$$SS_A = \sum_{i=1}^{a-1} SS_{\hat{\psi}_i} \qquad (7.5)$$
where SSA and SSψ̂i are, respectively, the sums of squares for the omnibus effect
and the ith contrast in a set of a - 1 orthogonal comparisons. The same idea
can be expressed in terms of the correlation ratio

$$\hat{\eta}^2_A = \sum_{i=1}^{a-1} \hat{\eta}^2_{\psi_i} \qquad (7.6)$$
where η̂²A and η̂²ψi are, respectively, estimated eta-squared for the omnibus
effect and the ith contrast in a set of all possible orthogonal comparisons.
ψ̂1: (1, 0, -1)
ψ̂2: (½, -1, ½)
ψ̂3: (1, -1, 0)
ψ̂4: (½, ½, -1)
$$t_{\hat{\psi}}(df) = \frac{\hat{\psi} - \psi_0}{s_{\hat{\psi}}} \qquad (7.7)$$

where ψ0 is the value of the contrast specified in H0 and sψ̂ is the standard error.
For a nil hypothesis, ψ0 = 0 and this term drops out of the equation. If the
means are independent, df = dfW = N - a, the pooled within-groups degrees of
freedom, and the standard error is

$$s_{\hat{\psi}_{\text{ind}}} = \sqrt{MS_W \sum_{i=1}^{a} \frac{c_i^2}{n_i}} \qquad (7.8)$$
If the means are dependent, df = n - 1 and the standard error is

$$s_{\hat{\psi}_{\text{dep}}} = \sqrt{\frac{s_{D_{\hat{\psi}}}^2}{n}} \qquad (7.9)$$

where the term in the numerator under the radical is the variance of the
contrast difference scores. Suppose the weights (1, 0, -1) define ψ̂1 in a cor-
related design with three conditions. If Y1, Y2, and Y3 are scores from these
conditions, the difference score is computed for each case as Dψ̂1 = Y1 - Y3.
The sum of squares for a contrast is

$$SS_{\hat{\psi}} = \frac{\hat{\psi}^2}{\displaystyle\sum_{i=1}^{a} \frac{c_i^2}{n_i}} \qquad (7.10)$$
Table 7.1
Methods for Controlling Type I Error Over Multiple Comparisons

Method               Nature of protection against αEW
Planned comparisons
  Unprotected        None; uses standard critical values for tψ̂ or Fψ̂
  Dunnett            Across pairwise comparisons of a single control group
                     with each of a - 1 treatment groups
  Bechhofer–Dunnett  Across a maximum of a - 1 orthogonal a priori contrasts
  Bonferroni–Dunn    Bonferroni correction applied across total number of
                     either orthogonal or correlated contrasts
Unplanned comparisons
  Newman–Keuls       Across pairwise comparisons within sets of means
                     ordered by differences in rank order
  Tukey HSD          Across all possible pairwise comparisons
  Scheffé            Across all possible pairwise or complex comparisons

Note. αEW = experimentwise Type I error; HSD = honestly significant difference. Tukey HSD is also called Tukey A.
where the standard error is defined by Equation 7.8 for independent samples
and by Equation 7.9 for dependent samples. The degrees of freedom for the
critical value of tψ̂ equal those of the corresponding error term. Simultaneous
(joint) confidence intervals are based on sets of contrasts, and they are gen-
erally wider than confidence intervals for individual contrasts defined by
Equation 7.11. This is because the former control for multiple comparisons.
Suppose in the Bonferroni–Dunn method that αEW = .05 and c = 10, which
implies αBon = .05/10, or .005, for each comparison. The resulting 10 simulta-
neous 99.5% confidence intervals for ψ are each based on tψ̂ 2-tail, .005, and these
intervals are wider than the corresponding 95% confidence interval based
on tψ̂ 2-tail, .05 for any individual contrast; see Bird (2002) for more information.
Standardized Contrasts
Contrast weights that are standard sets are assumed next. A standard-
ized mean difference for a contrast is a standardized contrast. It estimates the
parameter δψ = ψ/σ*, where the numerator is the unstandardized popula-
tion contrast and the denominator is a population standard deviation. The
general form of the sample estimator is dψ̂ = ψ̂/σ̂*, where the denominator
(standardizer) is an estimator of σ* that is not the same in all kinds of stan-
dardized contrasts.
The statistic dwith, which standardizes the contrast against the square root of
MSW, can be computed from the test statistic as

$$d_{\text{with}} = t_{\hat{\psi}} \sqrt{\sum_{i=1}^{a} \frac{c_i^2}{n_i}} \qquad (7.12)$$
1http://supp.apa.org/psycarticles/supplemental/met_13_2_110/met_13_2_110_supp.html
For the data in Table 3.4, ψ̂1 = 13.00 - 15.00 = -2.00 and

$$\hat{\psi}_2 = \frac{M_1 + M_3}{2} - M_2 = \frac{13.00 + 15.00}{2} - 11.00 = 3.00$$

so that

$$d_{\text{with}_1} = \frac{-2.00}{\sqrt{5.50}} = -.85 \quad\text{and}\quad d_{\text{with}_2} = \frac{3.00}{\sqrt{5.50}} = 1.28$$
In words, M1 is .85 standard deviations lower than M3, and the average of M1
and M3 is 1.28 standard deviations higher than M2.
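For readers who want to reproduce these values, here is a sketch in Python using the cell means (13.00, 11.00, 15.00) and MSW = 5.50 from the example; the function name is mine:

```python
import math

means, msw = (13.00, 11.00, 15.00), 5.50

def d_with(weights):
    """Standardized contrast with the square root of MSW as standardizer."""
    psi_hat = sum(c * m for c, m in zip(weights, means))
    return psi_hat / math.sqrt(msw)

print(d_with((1, 0, -1)))    # -2.00 / 2.345 = -.85
print(d_with((.5, -1, .5)))  #  3.00 / 2.345 = 1.28
```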
Table 7.2
Independent Samples Analysis of the Data in Table 3.4

Source           SS       df   MS      F        dwith    η̂²      Partial η̂²
Between (A)      40.00    2    20.00   3.64c    —        .377h    .377
  ψ̂1 = -2.00a    10.00    1    10.00   1.82d    -.85f    .094     .132i
  ψ̂2 = 3.00b     30.00    1    30.00   5.45e    1.28g    .283     .313j
Within (error)   66.00    12   5.50
Total            106.00   14

Note. The contrast weights for ψ̂1 are (1, 0, -1) and those for ψ̂2 are (½, -1, ½). A dash (—) indicates that it
is not possible to calculate the statistic indicated in the column heading for the effect listed in that row of the
table. CI = confidence interval.
a95% CI for ψ1 [-5.23, 1.23]. b95% CI for ψ2 [.20, 5.80]. cp = .058. dp = .202. ep = .038. fApproximate 95% CI for
δψ1 [-2.23, .53]. gApproximate 95% CI for δψ2 [.09, 2.47]. hNoncentral 95% CI for η²A [0, .601]. iNoncentral 95%
CI for partial η²ψ1 [0, .446]. jNoncentral 95% CI for partial η²ψ2 [0, .587].
There are two basic ways to standardize mean changes when the sam-
ples are dependent:
1. With one exception, use any of the methods described in the
previous section for contrasts between unrelated means. These
methods estimate population standard deviations in the metric
of the original scores, but they ignore the subjects effect in cor-
related designs. The exception is Equation 7.12, which requires
tψ̂ for independent samples to compute dwith.
2. Standardize the mean change against the standard deviation of
the difference scores for that particular contrast. This option
takes account of the cross-conditions correlation, but it does
not describe change in the metric of the original scores.
Reported in Table 7.3 are the results of a dependent samples analysis
for an additive model of the data in Table 3.4 for the omnibus effect and the
same two contrasts analyzed in Table 7.2. The F and p values differ for all
effects across the independent samples analysis in Table 7.2 and the depen-
dent samples analysis in Table 7.3. But dwith1 = -.85 and dwith2 = 1.28 in both
analyses, because each is calculated the same way regardless of the design.
If the samples are independent and the effect size is dwith, an approximate
confidence interval for δψ can be obtained by dividing the endpoints of the
Table 7.3
Dependent Samples Analysis of the Data in Table 3.4

Source            SS       df   MS      F        dwith    η̂²      Partial η̂²
Between (A)       40.00    2    20.00   14.12c   —        .377     .779
  ψ̂1 = -2.00a     10.00    1    10.00   5.71d    -.85f    .094     .588
  ψ̂2 = 3.00b      30.00    1    30.00   27.69e   1.28g    .283     .874
Within            66.00    12   5.50
  Subjects        54.67    4    13.67
  Residual (A)    11.33    8    1.42
  Residual (ψ̂1)   7.00     4    1.75
  Residual (ψ̂2)   4.33     4    1.08
Total             106.00   14

Note. The contrast weights for ψ̂1 are (1, 0, -1) and those for ψ̂2 are (½, -1, ½). A dash (—) indicates that it is not
possible to calculate the statistic indicated in the column heading for the effect listed in that row of the table.
a95% CI for ψ1 [-4.32, .32]. b95% CI for ψ2 [1.42, 4.58]. cp = .002. dp = .075. ep = .006. fApproximate 95% CI for δψ1
[-1.84, .14]. gApproximate 95% CI for δψ2 [.60, 1.95].
The standard error of dwith for independent samples is

$$s_{d_{\text{with}}} = \sqrt{\sum_{i=1}^{a} \frac{c_i^2}{n_i}} \qquad (7.14)$$
For ψ̂1 in the independent samples analysis in Table 7.2,

$$s_{\hat{\psi}_{1(\text{ind})}} = \sqrt{5.50\left(\frac{1^2}{5} + \frac{0^2}{5} + \frac{(-1)^2}{5}\right)} = 1.4832$$

Given tψ̂ 2-tail, .05 (12) = 2.179, the 95% confidence interval for ψ1 is
-2.00 ± 1.4832 (2.179), which defines the interval [-5.23, 1.23]. In the
dependent samples analysis in Table 7.3, the variance of the contrast
difference scores is s²D = 3.50, so

$$s_{\hat{\psi}_{1(\text{dep})}} = \sqrt{\frac{3.50}{5}} = .8367$$

Given tψ̂ 2-tail, .05 (4) = 2.776, the 95% confidence interval for ψ1 is
-2.00 ± .8367 (2.776),
which defines the interval [-4.3226, .3226]. Dividing the endpoints of this
interval by the square root of MSW = 5.50 gives the lower and upper bounds
of the approximate 95% confidence interval for δψ1 based on dwith1 = -.85 in
the dependent samples analysis. The resulting interval is [-1.8432, .1376], or
[-1.84, .14] at two-decimal accuracy. As expected, this interval for δψ1 in the
dependent samples analysis is narrower than the corresponding interval in
the independent samples analysis of the same scores, or [-2.23, .53]. Exercise
2 asks you to construct the approximate 95% confidence interval for δψ2 for
the dependent samples analysis in Table 7.3.
The PSY computer program (Bird, Hadzi-Pavlovic, & Isaac, 2000) for
Microsoft Windows calculates individual or simultaneous approximate confi-
dence intervals for δψ when the effect size is dwith in designs with one or more
between-subjects or within-subjects factors.2 It accepts only integer contrast
weights, but it can automatically convert the weights to a standard set so that
all contrasts are scaled as mean differences.
Results of computer simulations by Algina and Keselman (2003) indi-
cated that Bird's (2002) approximate confidence intervals for δψ were reason-
ably accurate in between-subjects designs except for larger population effect
sizes, such as δψ > 1.50. But in correlated designs, their accuracies decreased as
either the population effect size increased or the cross-conditions correlation
increased. In both designs under the conditions just stated, approximate con-
fidence intervals for δψ were generally too narrow, which makes the results
look falsely precise. Algina and Keselman (2003) also found that noncentral
confidence intervals in between-subjects designs and approximate noncen-
tral confidence intervals in within-subjects designs were generally more accu-
rate than Bird's (2002) approximate intervals.
When the means are independent, dwith follows noncentral tψ̂ (dfW, Δ) dis-
tributions. A computer tool calculates a noncentral 100 (1 - α)% confidence
interval for δψ.
2http://www.psy.unsw.edu.au/research/research-tools/psy-statistical-program
The noncentrality parameter Δ is related to the population effect size as

$$\delta_\psi = \Delta \sqrt{\sum_{i=1}^{a} \frac{c_i^2}{n_i}} \qquad (7.15)$$
$$F_{\hat{\psi}_1}(1, 12) = 1.82 \quad\text{and}\quad t_{\hat{\psi}_1}(12) = -1.35 \;\;(\text{i.e., } t^2_{\hat{\psi}_1} = F_{\hat{\psi}_1})$$

$$\sqrt{\frac{1^2}{5} + \frac{0^2}{5} + \frac{(-1)^2}{5}} = .6325$$
Reviewed next are descriptive and inferential r-type effect sizes for con-
trasts and the omnibus effect. The descriptive effect sizes assume a fixed factor
(as do standardized contrasts). The most general descriptive effect size is the
correlation ratio. For contrasts it takes the form η̂²ψ = SSψ̂/SST, and it measures
the proportion of total observed variance explained by that contrast. The correspond-
ing effect size for the omnibus effect is η̂²A = SSA/SST. In balanced designs with
a fixed factor, the inferential measures of association ω̂²ψ for contrasts and ω̂²A
for omnibus effects control for positive bias in, respectively, η̂²ψ and η̂²A. But for
random factors, contrast analysis is typically uninformative. This is because
levels of such factors are randomly selected, so they wind up in a particular
study by chance. In this case, the appropriate inferential measure of associa-
tion is the intraclass correlation ρ̂1, which is already in a squared metric, for
the omnibus effect in balanced designs.
$$r_{\hat{\psi}} = t_{\hat{\psi}} \sqrt{\frac{1}{F_{\hat{\psi}} + df_{\text{non-}\hat{\psi}}\,(F_{\text{non-}\hat{\psi}}) + df_W}} \qquad (7.16)$$

where dfnon-ψ̂ and Fnon-ψ̂ are, respectively, the degrees of freedom and F statistic
for all noncontrast sources of between-groups variability. The statistic Fnon-ψ̂ =
MSnon-ψ̂ /MSW, where MSnon-ψ̂ = (SSA - SSψ̂)/(dfA - 1).
For the results in Table 7.2, where the weights (1, 0, -1) specify ψ̂1,
MSnon-ψ̂ = (40.00 - 10.00)/1 = 30.00 and Fnon-ψ̂ = 30.00/5.50 = 5.45. Now we
calculate

$$r_{\hat{\psi}_1} = -1.35 \sqrt{\frac{1}{1.82 + (1)\,5.45 + 12}} = -.307$$
So we can say that the correlation between the dependent variable and the con-
trast between the first and third groups is -.307 and that this contrast explains
(-.307)², or about .094 (9.4%), of the total observed variance in outcome.
The partial correlation effect size

$$\text{partial } r_{\hat{\psi}} = t_{\hat{\psi}} \sqrt{\frac{1}{F_{\hat{\psi}} + df_W}} \qquad (7.17)$$

removes the effects of all other contrasts from total variance. For ψ̂1 in Table 7.2,

$$\text{partial } r_{\hat{\psi}_1} = -1.35 \sqrt{\frac{1}{1.82 + 12}} = -.363$$
which says that the correlation between ψ̂1 and outcome is -.363 controlling for
ψ̂2 and that ψ̂1 explains (-.363)², or about .132 (13.2%), of the residual variance.
The absolute value of partial rψ̂ is usually greater than that of rψ̂ for the same
contrast, which is here true for ψ̂1 (respectively, .363 vs. .307). Also, partial
r²ψ̂ values are not generally additive over sets of contrasts, orthogonal or not.
Exercise 3 asks you to calculate rψ̂2 and partial rψ̂2 for the results in Table 7.2.
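Equations 7.16 and 7.17 are easily scripted. This sketch, with variable names of my choosing, reproduces the two correlations for ψ̂1 from the quantities in Table 7.2:

```python
import math

t_psi, F_psi, dfW = -1.35, 1.82, 12
df_non, F_non = 1, 5.45   # the noncontrast between-groups source

r_psi = t_psi * math.sqrt(1 / (F_psi + df_non * F_non + dfW))  # Equation 7.16
partial_r_psi = t_psi * math.sqrt(1 / (F_psi + dfW))           # Equation 7.17
print(round(r_psi, 3), round(partial_r_psi, 3))  # -.307 and -.363
```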
The correlation rψ̂ assumes independent samples. For dependent sam-
ples, we can compute instead the unsigned correlation η̂ψ, of which rψ̂ is a
special case. For the dependent contrast ψ̂1 in Table 7.3, where SSψ̂1 = 10.00 and
SST = 106.00, η̂ψ1 = (10.00/106.00)1/2, or .307, which is also the absolute value
of rψ̂1 for the same contrast in the independent samples analysis of the same
data (see Table 7.2). The proportion of total variance explained is also the
same in both analyses, or η̂²ψ1 = r²ψ̂1 = .094 (i.e., 9.4% of total variance).
The general form of partial η̂²ψ is SSψ̂/(SSψ̂ + SSerror), where SSerror is the
error sum of squares for the contrast. If the samples are independent, partial
η̂²ψ controls for all noncontrast sources of between-conditions variability; if the
samples are dependent, it also controls for the subjects effect. For example, par-
tial η̂²ψ1 = .132 and η̂²ψ1 = .094 for the independent samples analysis in Table 7.2.
When dfA ≥ 2, η̂²A is the squared multiple correlation (R²) between the
omnibus effect and outcome. If the samples are independent, η̂²A can also be
computed as

$$\hat{\eta}^2_A = \frac{F_A}{F_A + \dfrac{df_W}{df_A}} \qquad (7.18)$$
where FA is the test statistic for the omnibus effect with dfA, dfW degrees
of freedom (see Equation 3.8). For example, SSA = 40.00 and SST = 106.00
for the omnibus effect in Tables 7.2 and 7.3, so the omnibus effect explains
40.00/106.00 = .377, or about 37.7%, of the total variance in both analyses.
But only for the independent samples analysis in Table 7.2 where FA (2, 12)
= 3.64 can we also calculate (using Equation 7.18) for the omnibus effect
$$\hat{\eta}^2_A = \frac{3.64}{3.64 + \dfrac{12}{2}} = .377$$
The inferential measures ω̂ 2 for fixed factors and ρ̂1 for random factors in
balanced designs are based on ratios of variance components, which involve
the expression of expected sample mean squares as functions of population
sources of systematic versus error variation. Extensive sets of equations for
variance component estimators in sources such as Dodd and Schultz (1973),
Kirk (2012), Vaughan and Corballis (1969), and Winer et al. (1991) provide
the basis for computing inferential measures of association (see also Schuster
& von Eye, 2001).
The generic variance-component estimator for a fixed effect is

$$\hat{\sigma}^2_{\text{effect}} = \frac{df_{\text{effect}}}{an}\,(MS_{\text{effect}} - MS_{\text{error}}) \qquad (7.19)$$

where MSeffect and dfeffect are, respectively, the effect mean square and its
degrees of freedom, and MSerror is its error term. When Equation 7.19 is applied
to the omnibus effect of a fixed factor, the estimator is

$$\hat{\sigma}^2_{A_{\text{fix}}} = \frac{a - 1}{an}\,(MS_A - MS_{\text{error}}) \qquad (7.20)$$

and for a contrast it is

$$\hat{\sigma}^2_{\psi} = \frac{1}{an}\,(MS_{\hat{\psi}} - MS_{\text{error}}) \qquad (7.21)$$
(Recall that MSψ̂ = SSψ̂ because dfψ̂ = 1.) But for a random factor the estimator
for the omnibus effect is

$$\hat{\sigma}^2_{A_{\text{ran}}} = \frac{1}{n}\,(MS_A - MS_{\text{error}}) \qquad (7.22)$$
Estimation of σ̂ 2total also depends on the design. The sole estimate of error
variance when the samples are independent is MSW, regardless of whether the
factor is fixed or random. This means that total variance in either case is
estimated as
$$\hat{\sigma}^2_{\text{total}} = \hat{\sigma}^2_A + MS_W \qquad (7.23)$$
where σ̂2A is defined by Equation 7.20 for a fixed factor but by Equation 7.22
for a random factor. The composition of total variance for fixed versus ran-
dom factors is thus not the same.
Table 7.4
Numerators and Denominators for Direct Calculation of Inferential Measures
of Association Based on Total Variance for Single-Factor Designs

Sample         Model         Denominator
Fixed factora
  Independent  —             SST + MSW
  Dependent    Additive      SST + MSS
  Dependent    Nonadditive   SST + MSS + n MSA×S
Random factorb
  Independent  —             SST + MSA
  Dependent    Additive      SST + MSA + MSS - MSres
  Dependent    Nonadditive   a MSA + n MSS + (an - a - n) MSA×S

Note. The cell size is n, factor A has a levels, and MSerror is the ANOVA error term for the corresponding
effect.
aEffect size is ω̂²effect; numerator = dfeffect (MSeffect - MSerror). bEffect size is ρ̂I; numerator = a (MSA - MSerror).
For the independent samples analysis in Table 7.2,

MSA = 20.00, MSψ̂1 = 10.00, MSψ̂2 = 30.00, MSW = 5.50, SST = 106.00

and

$$\hat{\eta}^2_A = \hat{\eta}^2_{\psi_1} + \hat{\eta}^2_{\psi_2} = .094 + .283 = .377$$

$$\hat{\omega}^2_A = \frac{2\,(20.00 - 5.50)}{106.00 + 5.50} = .260$$

As expected, values of the inferential measures just calculated are each smaller
than the corresponding descriptive measures (e.g., η̂²A = .377, ω̂²A = .260). Because
ψ̂1 and ψ̂2 are orthogonal and dfA = 2, it is also true that ω̂²A = ω̂²ψ1 + ω̂²ψ2.
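As a check on the arithmetic, here is a short sketch of ω̂²A for this analysis, using the fixed-factor, independent-samples entries from Table 7.4; variable names are mine:

```python
dfA, MSA, MSW, SST = 2, 20.00, 5.50, 106.00
omega_sq_A = dfA * (MSA - MSW) / (SST + MSW)   # numerator / denominator
print(round(omega_sq_A, 3))  # .260
```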
Interval Estimation
analysis explains about 4.5% of total variance on the reading task (η̂² = .045).
Other key ANOVA results are
Table 7.6
Analysis of Variance (ANOVA) and Analysis of Covariance (ANCOVA)
Results for the Data in Table 7.5

Source                  SS       df   MS      F        η̂²
ANOVA
  Between (incentive)   20.91    1    20.91   4.21b    .045
  Within (error)        446.77   90   4.96
  Total                 467.68   91
Traditional ANCOVAa
  Total effects         117.05   2    58.52   14.86c   .250
  Covariate (grades)    96.14    1    96.14   24.40c   .206
  Between (incentive)   18.25    1    18.25   4.63d    .039
  Within (error)        350.63   89   3.94
  Total                 467.68   91

ANCOVA-as-regression
Step   Predictors           R²     R² change   F change   df1   df2
1      Grades               .211   .211        24.11c     1     90
2      Grades, incentive    .250   .039        4.63d      1     89

Note. Entries in boldface emphasize common results across traditional ANCOVA and ANCOVA-as-
regression analyses of the same data and are discussed in the text.
aType III sums of squares. bp = .043. cp < .001. dp = .034.
The ANCOVA error term is related to the ANOVA error term as follows:

$$MS_W' = MS_W\left[\frac{df_W\,(1 - r_{\text{pool}}^2)}{df_W - 1}\right] \qquad (7.28)$$

where r²pool is the squared pooled within-groups correlation between the covari-
ate and the dependent variable. For this example, where dfW = 90 and rpool = .464,

$$MS_W' = 4.96\left[\frac{90\,(1 - .464^2)}{90 - 1}\right] = 3.94$$
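A two-line check of Equation 7.28 in Python:

```python
MSW, dfW, r_pool = 4.96, 90, .464
print(MSW * dfW * (1 - r_pool ** 2) / (dfW - 1))  # about 3.94
```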
differs statistically from zero. The F statistic and R² change values just stated
are each identical to their counterparts in the traditional ANCOVA results
(e.g., η̂² = .039 for incentive condition; see Table 7.6). Thus, ANCOVA from
a regression perspective is nothing more than a test of the incremental validity
of the factor over the covariate in predicting outcome.
Let us now estimate the magnitude of the standardized contrast between
the extrinsic and intrinsic conditions on the reading task. There are two
possibilities for the numerator: the contrast of the two unadjusted means,
M1 - M2, or means adjusted for the covariate, M1′ - M2′. There are also at least
two choices for the standardizer. Selecting the square root of the ANOVA
error term gives

$$d_{\text{pool}} = \frac{3.05 - 2.08}{\sqrt{4.96}} = .44$$
That is, the students in the extrinsic reward condition outperformed their
peers in the intrinsic reward condition on the reading task by about .44 stan-
dard deviations. See Colliver and Markwell (2006) and Rutherford (2011,
Chapter 4) for more information.
Research Examples
Raw data files for the two examples considered next are not available,
but you can download from this book’s web page SPSS syntax that analyzes
the summary statistics for each.
Table 7.7
Cognitive Test Scores for Ecstasy (MDMA) Users, Cannabis Users, and Nonusers

                          User groupa                                  dwith for pairwise contrasts
Task                 1 Ecstasy       2 Cannabis      3 Nonuser       F (2, 81)   1 vs. 2   1 vs. 3   2 vs. 3
Attentionb
  Simple             218.9e (28.2)   221.1 (26.3)    218.7 (27.5)    .07f        -.08      .01       .09
  Selective          532.0 (65.4)    484.4 (57.9)    478.6 (48.4)    7.23g       .83       .93       .10
Learning and abstract thinking
  Verbalc            4.46 (.79)      3.71 (1.15)     3.29 (1.12)     9.22h       .73       1.13      .41
  Visualc            4.61 (.96)      4.00 (1.41)     4.11 (1.13)     2.12i       .52       .42       -.09
  Abstract
  thinkingd          25.96 (4.10)    29.46 (4.19)    29.50 (3.64)    7.29g       -.88      -.89      -.01

Note. From "Impaired Cognitive Performance in Drug Free Users of Recreational Ecstasy (MDMA)," by E.
Gouzoulis-Mayfrank, J. Daumann, F. Tuchtenhagen, S. Pelz, S. Becker, H.-J. Kunert, B. Fimm, and H. Sass,
2000, Journal of Neurology, Neurosurgery & Psychiatry, 68, p. 723. Copyright 2000 by BMJ Publishing Group
Limited. Adapted with permission.
an = 28 for all groups. bScores are in milliseconds; higher scores indicate worse performance. cNumber of tri-
als; higher scores indicate worse performance. dHigher scores indicate better performance. eMean (standard
deviation). fp = .933. gp = .001. hp < .001. ip = .127.
Kanfer and Ackerman (1989) administered to 137 U.S. Air Force per-
sonnel a computerized air traffic controller task, presented over six 10-minute
trials, where the outcome variable was the number of successful landings.
Summarized in Table 7.8 are the means, standard deviations, and correla-
tions across all trials. The last show a typical pattern for learning data in
that correlations between adjacent trials are higher than those between
nonadjacent trials. This pattern may violate the sphericity assumption of
statistical tests for comparing dependent means for equality. Accordingly,
p values for effects with more than a single degree of freedom are based on the
Geisser–Greenhouse conservative test. Task means over trials exhibit both
linear and quadratic trends, which are apparent in Figure 7.1.
Reported in Table 7.9 are the results of repeated measures analyses of
variance of the omnibus trials effect, the linear and quadratic trends, and all
other higher order trends combined (cubic, quartic, quintic; i.e., respectively,
2, 3, or 4 bends in the curve). Effect size is estimated with ω̂² for an additive
model. All effects have p values <.001, but their magnitudes are clearly differ-
ent. The omnibus trials effect explains about 43% of the total variance cor-
rected for capitalization on chance. Because the linear trend itself accounts
for about 38% of the total variance, it is plain to see that this polynomial is
the most important aspect of the learning curve. The quadratic trend explains
an additional 5% of the total variance, and all higher order trends together
explain <.1% of the total variance. The orthogonal linear and quadratic trends
together thus account for virtually all of the explained variance.
In their analysis of the same learning curve data, Kanfer and Ackerman
(1989) reduced unexplained variance even more by incorporating a cognitive
ability measure into the model.
Table 7.8
Descriptive Statistics for the Learning Trial Data
(Trial means and standard deviations are not reproduced here; the correlations follow.)

r
1.00
.77    1.00
.59    .81    1.00
.50    .72    .89    1.00
.48    .69    .84    .91    1.00
.46    .68    .80    .88    .93    1.00

Note. The group size is n = 137, and the correlation matrix is in lower-diagonal form. Adapted from
"Models for Learning Data," by M. W. Browne and S. H. C. Du Toit, 1991, in L. M. Collins and J. L. Horn
(Eds.), Best Methods for the Analysis of Change, p. 49, Washington, DC: American Psychological Association.
Copyright 1991 by the American Psychological Association.
Figure 7.1. Means and 95% confidence intervals for µ for the learning trial data in
Table 7.8.
Conclusion
Learn More
Exercises
Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses
remove it.
—Alan J. Perlis (1982, p. 10)
a = 2 levels and factor B has b = 3 levels, for instance, a full A × B factorial
design would have a total of 2 × 3, or 6, conditions, including
which are not all possible permutations (8) due to nesting of B under A.
Nested factors are typically considered random. Levels of factors in fractional
(partial, incomplete) factorial designs may not be studied in every combina-
tion with levels of other factors. This reduces the number of conditions, but
certain main and interaction effects may be confounded. A Latin square
design, which also counterbalances order effects for repeated measures fac-
tors, is perhaps the best known example; see Kirk (2012, Chapters 11–16) for
more information.
Basic Distinctions

      B1    B2    B3              B1    B2    B3
A1    5     10    20        A1    5     20    10
A2    10    20    40        A2    10    10    50
$$MS_W = \frac{SS_W}{df_W} = \frac{\displaystyle\sum_{i=1}^{a}\sum_{j=1}^{b} df_{ij}\,(s_{ij}^2)}{\displaystyle\sum_{i=1}^{a}\sum_{j=1}^{b} df_{ij}} \qquad (8.1)$$
where dfij and s2ij are, respectively, the degrees of freedom (nij - 1) and vari-
ance of the cell at the ith level of A and the jth level of B. Only in a balanced
design can MSW also be computed as the simple arithmetic average of the
cell variances.
The between-conditions variance in a single-factor design, MSA, reflects
the effects of factor A, sampling error, and cell size (Equation 3.10). In a fac-
torial design, the overall between-conditions variance reflects the main and
interactive effects of all factors, sampling error, and cell size. For example, the
between-cells variance in a two-way design is designated below as
$$MS_{A,B,AB} = \frac{SS_{A,B,AB}}{df_{A,B,AB}} \qquad (8.2)$$
where the subscript indicates the main and interaction effects analyzed
together (total effects), and the degrees of freedom equal the number of cells
minus one, or ab - 1. It is only in balanced two-way designs that the sum of
squares for the total effects can be computed directly as
$$SS_{A,B,AB} = \sum_{i=1}^{a}\sum_{j=1}^{b} n\,(M_{ij} - M_T)^2 \qquad (8.3)$$

where n is the size of all cells, Mij is the mean for the cell at the ith level of A
and the jth level of B, and MT is the grand mean for the whole design. That is,

$$SS_{A,B,AB} = SS_A + SS_B + SS_{AB} \qquad (8.4)$$
This relation can also be expressed in terms of the correlation ratio in bal-
anced designs:

$$\hat{\eta}^2_{A,B,AB} = \hat{\eta}^2_A + \hat{\eta}^2_B + \hat{\eta}^2_{AB} \qquad (8.5)$$
(Recall that the general form of η̂²effect is SSeffect /SST.) Equations 8.4 and 8.5
define effect orthogonality in two-way designs. Orthogonality in factorial
designs of any size means that the main and interaction effects can appear in
any combination. This means that observing one type of effect, such as a
main effect of factor A, says nothing about whether any other effect will be
found, such as a main effect of B or the interaction effect AB.
Table 8.1
General Descriptive Statistics for a Balanced 2 × 3 Factorial Design

      B1            B2            B3            Row means
A1    M11 (s²11)    M12 (s²12)    M13 (s²13)    MA1
A2    M21 (s²21)    M22 (s²22)    M23 (s²23)    MA2
Table 8.2
Equations for Main and Interaction Effect Sums of Squares
in Balanced Two-Way Factorial Designs

Source   SS                                                                          df

A        $\sum_{i=1}^{a} bn\,(M_{A_i} - M_T)^2$                                       a - 1

B        $\sum_{j=1}^{b} an\,(M_{B_j} - M_T)^2$                                       b - 1

AB       $\sum_{i=1}^{a}\sum_{j=1}^{b} n\,[M_{ij} - (M_{A_i} - M_T) - (M_{B_j} - M_T) - M_T]^2$   (a - 1)(b - 1)
         $= \sum_{i=1}^{a}\sum_{j=1}^{b} n\,(M_{ij} - M_{A_i} - M_{B_j} + M_T)^2$
1A common but incorrect description of interaction is that “the factors affect each other.” Factors may
affect the dependent variable individually (main effects) or jointly (interaction), but factors do not affect
each other in factorial designs.
In words, the total sum of squares for all simple effects of each factor equals
the total sum of squares for the main effect of that factor and the interaction.
When all simple effects of a factor are analyzed, it is actually the main and
interactive effects of that factor that are analyzed. Given their overlap in sums
of squares, it is usually not necessary to analyze both sets of simple effects, A at
B and B at A. The choice between them should be made on a rational basis,
depending on the perspective from which the researcher wishes to describe
interaction.
An ordinal interaction occurs when simple effects vary in magnitude but
not in direction. Look at the cell means for the 2 (drug) × 2 (gender) layout
in the left side of Table 8.3, where higher scores indicate a better result. The
interaction is ordinal because (a) women respond better to both drugs, but
the size of this effect is greater for drug 2 than drug 1 (gender at drug simple
effects). Also, (b) mean response is always better for drug 2 than drug 1, but
this is even more true for women than for men (drug at gender simple effects).
Both sets of simple effects just mentioned vary in magnitude but do not change
direction.
The cell means in the right side of Table 8.3 indicate a disordinal (cross-
over) interaction where at least one set of simple effects reverses direction.
These results indicate that drug 2 is better for women, but just the opposite
is true for men. That is, simple effects of drug change direction for women
Table 8.3
Cell Means and Marginal Means for Two-Way Designs
With Ordinal Versus Disordinal Interaction

             Ordinal interaction       Disordinal interaction
             Drug 1      Drug 2        Drug 1      Drug 2
(Cell means are not reproduced here.)
Figure 8.1. Cell mean plots for the data in Table 8.3 for (a) ordinal interaction and
(b) disordinal interaction.
      B1    B2               B1    B2
A1    1     -1         A1    M11   M12
A2    -1    1          A2    M21   M22
Rearranging the terms shows that ψ̂AB equals (a) the difference between
the two simple effects of A and (b) the difference between the two simple
effects of B:
      B1    B2    B3    (I)
A1    1     0     -1
A2    -1    0     1

      B1    B2    B3    (II)
A1    ½     -1    ½
A2    -½    1     -½
$$SS_{\hat{\psi}_{AB}} = \frac{n\,(\hat{\psi}_{AB})^2}{\left(\displaystyle\sum_{i=1}^{a} c_i^2\right)\left(\displaystyle\sum_{j=1}^{b} c_j^2\right)} \qquad (8.9)$$
      B1    B2    B3
A1    1     -2    1
A2    -1    2     -1

These weights compare the quadratic effect of the drug across the groups. That
the sum of the absolute values of the weights is not 4.0 is not a problem because
magnitudes of differential trends are usually estimated with measures of association.
Unlike simple effects, interaction contrasts and main effects are not
confounded. For this reason, some researchers prefer to analyze interaction
contrasts instead of simple effects when the main effects are relatively large. It
is also possible to test a priori hypotheses about specific facets of an omnibus
interaction through the specification of interaction contrasts. It is not usu-
ally necessary to analyze both simple effects and interaction contrasts in the
same design, so either one or the other should be chosen as a way to describe
interaction.
Presented in Table 8.4 are raw scores and descriptive statistics for bal-
anced 2 × 3 designs, where n = 3. The data in the top part of the table are
arranged in a layout consistent with a completely between-subjects design,
Table 8.4
Raw Scores and Descriptive Statistics for Balanced 2 × 3 Designsa
(Raw scores are not reproduced here; cell descriptive statistics follow.)

      B1             B2             B3
A1    9.00 (7.00)    12.00 (7.00)   9.00 (4.00)
A2    5.00 (4.00)    6.00 (3.00)    13.00 (7.00)

Note. aAssumes A is the between-subjects factor and B is the repeated measures factor. Cell entries are mean (variance).
where each score comes from a different case. The same layout is also con-
sistent with a split-plot design, where the three scores in each row are from
the same case (e.g., A is a group factor, B is a repeated measures factor). The
same basic data are presented in the bottom part of the table in a completely
within-subjects layout, where the six scores in each row are from the same
case. You should verify the following results by applying Equations 8.1 and 8.3
and those in Table 8.2 to the data in Table 8.4 in either layout:
SSW = 64.00
SSA,B,AB = SSA + SSB + SSAB = 18.00 + 48.00 + 84.00 = 150.00
SST = 64.00 + 150.00 = 214.00
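You can verify these sums of squares with a short script. This sketch computes them from the cell means and variances in Table 8.4, using Equation 8.1 and the equations in Table 8.2; variable names are mine:

```python
# Cell means and variances for the balanced 2 x 3 data, n = 3 per cell.
n, a, b = 3, 2, 3
M = [[9.00, 12.00, 9.00],     # A1 cell means
     [5.00, 6.00, 13.00]]     # A2 cell means
V = [[7.00, 7.00, 4.00],      # cell variances
     [4.00, 3.00, 7.00]]
MT = sum(sum(row) for row in M) / (a * b)                   # grand mean
MA = [sum(row) / b for row in M]                            # row means
MB = [sum(M[i][j] for i in range(a)) / a for j in range(b)]  # column means
SSA = sum(b * n * (m - MT) ** 2 for m in MA)                       # 18.00
SSB = sum(a * n * (m - MT) ** 2 for m in MB)                       # 48.00
SSAB = sum(n * (M[i][j] - MA[i] - MB[j] + MT) ** 2
           for i in range(a) for j in range(b))                    # 84.00
SSW = sum((n - 1) * V[i][j] for i in range(a) for j in range(b))   # 64.00
print(SSA, SSB, SSAB, SSW)
```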
The results of three different factorial analyses of variance for the data
in Table 8.4 assuming fixed factors are reported in Table 8.5. Results in the
top of Table 8.5 are from a completely between-subjects analysis, results in
the middle of the table are from a split-plot analysis, and results in the bottom
of the table are from a completely within-subjects analysis. Note that only the
error terms, F ratios, and p values depend on the design. The sole error term in
the completely between-subjects analysis is MSW, and the statistical assump-
tions for tests with it are described in Chapter 3. In the split-plot analysis, SSW
and dfW are partitioned to form two different error terms, one for between-
subjects effects (A) and another for repeated measures effects (B, AB). Tests
with the former error term, designated in the table as S/A for “subjects within
groups under A,” assume homogeneity of variance for case average scores
across the levels of the repeated measures factor. The within-subjects error
$$\hat{\psi}_{ABC} = \hat{\psi}_{AB\ \text{at}\ C_1} - \hat{\psi}_{AB\ \text{at}\ C_2} = \hat{\psi}_{AC\ \text{at}\ B_1} - \hat{\psi}_{AC\ \text{at}\ B_2} = \hat{\psi}_{BC\ \text{at}\ A_1} - \hat{\psi}_{BC\ \text{at}\ A_2} \qquad (8.10)$$
Analysis Strategy
Model Testing
where Yijk is the kth score in the cell at the ith level of factor A and the jth
level of factor B; µ is the population grand mean; ai, bj, and abij, respectively,
represent the population main and interaction effects as deviations from
the grand mean; and eijk is a random error component. This model under-
lies the derivation of the sums of squares for the source table in the top of
Table 8.5. The complete structural models that underlie the other two source
tables in Table 8.5 are somewhat different because either one or both factors
are within-subjects, but the idea is the same. A structural model generates
predicted marginal and cell means, but these predicted means equal their
observed counterparts for a complete model. That is, the observed marginal
means estimate population main effects, and the observed cell means esti-
mate the population interaction effect.
A reduced structural model does not include parameters for all effects.
Parameters in the complete model are typically considered for exclusion in a
sequential order beginning with the highest order interaction. If the param-
eters for this interaction are retained, the complete model cannot be simpli-
fied. But if the parameters that correspond to abij are dropped, the complete
model reduces to the main effects model
Nonorthogonal Designs
If all factorial designs were balanced—or at least had unequal but pro-
portional cell sizes—there would be no need to deal with the technical prob-
lem raised next. Only two-way nonorthogonal designs are discussed, but the
basic principles extend to larger nonorthogonal designs. One problem is that
the factors are correlated, which means that there is no single, unambigu-
ous way to apportion the total effects sum of squares to individual effects. A
second concerns ambiguity in estimates for means that correspond to main
effects. This happens because there are two different ways to compute mar-
ginal means in unbalanced designs: as arithmetic or as weighted averages of
the corresponding row or column cell means. Consider the data in Table 8.6
for a nonorthogonal 2 × 2 design. The two ways to calculate the marginal
mean for A2 just mentioned are summarized respectively as

$$\frac{2.00 + 5.33}{2} = 3.67 \quad\text{and}\quad \frac{2\,(2.00) + 6\,(5.33)}{8} = 4.50$$

The value to the right is the same as that you would find if working with the
eight raw scores in the second row of the 2 × 2 matrix in Table 8.6. There is
no such ambiguity in balanced designs.
Table 8.6
Data for a Nonorthogonal 2 × 2 Design

      B1               B2                   Marginal meansb
A1    2, 3, 4          1, 3                 2.50/2.60
      3.00 (1.00)a     2.00 (2.00)
A2    1, 3             4, 5, 5, 6, 6, 6     3.67/4.50
      2.00 (2.00)      5.33 (.67)

Note. aCell mean (variance). bArithmetic average/weighted average of the row cell means.
There are several methods for analyzing data from nonorthogonal designs
(e.g., Maxwell & Delaney, 2004, pp. 320–343). Statisticians do not agree about
optimal methods for different situations, so it is not possible to give defini-
tive recommendations. Most of these methods attempt to correct effect sums
of squares for overlap. They give the same results only in balanced designs,
and estimates from different methods tend to diverge as the cell sizes become
increasingly disproportional. Computer procedures for factorial ANOVA typi-
cally use by default one of the methods described next. If the default is not
suitable in a particular study, the researcher must specify a better method.
An older method for nonorthogonal designs amenable to hand calcula-
tion is unweighted means analysis. Effect sums of squares are computed in this
method using the equations for balanced designs, such as those in Table 8.2
for two-way designs, except that the design cell size is taken as the harmonic
mean of the actual cell sizes, or
$$n_h = \frac{ab}{\displaystyle\sum_{i=1}^{a}\sum_{j=1}^{b} \frac{1}{n_{ij}}} \qquad (8.13)$$
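For the cell sizes in Table 8.6, a minimal sketch of Equation 8.13:

```python
cell_sizes = [3, 2, 2, 6]   # n11, n12, n21, n22 from Table 8.6
nh = len(cell_sizes) / sum(1 / n for n in cell_sizes)
print(nh)  # 4 / 1.5 = 2.67
```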
η̂²A = η̂²B = .095, are also the lowest in this method. The main effects have
p values < .05 and greater explanatory power in Method 2/Type II sums of
squares and in Method 3/Type I sums of squares, both of which give the
main effects higher priority than does Method 1. Only in Method 3, which
analyzes the A, B, and AB effects sequentially in this order, are the sums of
squares and η̂² values additive but not unique.
Which of the three sets of results in Table 8.7 is correct? From a purely
statistical view, all are because there is no definitive way to estimate effect
sums of squares in nonorthogonal designs. There may be a preference for
one set of results given a clear rationale about effect priority. But without
such a justification, there is no basis for choosing among these results.
Standardized Contrasts
Designs with fixed factors are assumed next. Methods for standardizing
contrasts in factorial designs are not as well developed as they are for one-way
designs. There is also not complete agreement across works such as those by
Glass, McGaw, and Smith (1981), Morris and DeShon (1997), and Cortina and
Nouri (2000).
Because MS_W does not reflect variability due to effects of the intrinsic off-
factor B, its square root may underestimate σ. This implies that a contrast
between levels of A standardized against (MS_W)^{1/2} may overestimate the
absolute population effect size. A way to calculate an alternative standardizer
that reflects the total variation on off-factor B is described below.
Now suppose that the off-factor B is extrinsic (it does not vary natu-
rally in the population). Such factors are more likely to be manipulated or
repeated measures variables than individual difference variables. For exam-
ple, the theoretical population for the study of a new treatment can be viewed
as follows: It is true either that every case in the population is given the treat-
ment or that none of them are given the treatment. In either event, there is
no variability because of treatment versus no treatment (Cortina & Nouri,
2000). Because extrinsic off-factors are not of theoretical interest for the sake
of variance estimation, their effects should not contribute to the standard-
izer. In this case, the square root of MSW from the two-way ANOVA would
be a suitable denominator for standardized contrasts on factor A when the
off-factor B does not vary naturally.
Described next are two methods to standardize main or simple compari-
sons that estimate the full range of variability on an intrinsic off-factor that
varies naturally in the population. Both methods pool the variances across
all levels of the factor of interest, so they also generate standardized contrasts
for single-factor comparisons in factorial designs that are directly comparable
with dwith in single-factor designs. These two methods yield the same result in
balanced designs. The first is the orthogonal sums of squares method (Glass
et al., 1981). It requires a complete source table with additive sums of squares.
Assuming that A is the factor of interest, the following term estimates the full
range of variability on the intrinsic off-factor B:
MS_{W,B,AB} = Σ_{i=1}^{a} Σ_{j=1}^{b} [df_ij s²_ij + n_ij (M_ij − M_{A_i})²] / (N − a)    (8.16)
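A small Python sketch of Equation 8.16 follows; it assumes that the cell sizes, means, and variances plus the marginal means for the factor of interest are already in hand, and all names are illustrative:

# Pooled standardizer over an intrinsic off-factor B (Equation 8.16):
# within-cell variation plus variation of cell means around the
# marginal means of the factor of interest (A), divided by N - a.
def ms_w_b_ab(cells, row_means):
    # cells: list of (i, n_ij, M_ij, var_ij) tuples, where i indexes
    # the level of factor A; row_means: dict {i: marginal mean M_Ai}
    num = sum((n - 1) * var + n * (m - row_means[i]) ** 2
              for i, n, m, var in cells)
    big_n = sum(n for _, n, _, _ in cells)
    return num / (big_n - len(row_means))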
d_{A at B3} = (9.00 − 13.00)/3.50 = −1.14
d_{ψ̂ AB} = d_{A at B1} − d_{A at B2} = d_{B at A1} − d_{B at A2}    (8.18)
But this relation may not hold if either factor varies naturally in the popula-
tion. This is because different sets of simple comparisons can have different
standardizers in this case. Because interaction is a joint effect, however, there
are no off-factors.
There is relatively little in the statistical literature about exactly how
to standardize an interaction contrast when only some factors vary naturally.
Suppose in a balanced 2 × 2 design that factor B varies naturally, but factor A
does not. In a three-way design, the analogous relation for the three-way
interaction is

d_{ψ̂ ABC} = d_{AB at C1} − d_{AB at C2} = d_{AC at B1} − d_{AC at B2} = d_{BC at A1} − d_{BC at A2}    (8.19)
Interval Estimation
Measures of Association
Descriptive Measures
The effect size η̂2 = SSeffect /SST in factorial designs with fixed factors is the
proportion of total variance explained by an effect. The proportion of residual
variance explained after removing all systematic effects from total variance
other than that due to the effect of interest is partial η̂2 = SSeffect /(SSeffect + SSerror),
where SSerror is the sum of squares for the effect error term. Some researchers
report η̂2 for total effects and partial η̂2 for individual effects, such as
η̂²_{A,B,AB}, partial η̂²_A, partial η̂²_B, and partial η̂²_AB
A related descriptive measure is generalized η̂², defined as

generalized η̂²_effect = SS_effect / [(m × SS_effect) + Σ SS_meas + Σ SS_sub,cov]    (8.20)
η̂²_AB = .393 and partial η̂²_AB = .568, but neither effect size controls for the status
of the factors as manipulated versus measured. Because factor B is measured,
m = 0 in Equation 8.20 and

generalized η̂²_AB = 84.00/196.00 = .429

Thus, the interaction explains about 42.9% of the residual variance, controlling
only for the main effect of extrinsic factor A, which does not vary naturally in
the population.
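The computation of Equation 8.20 is easily scripted. In the Python sketch below, the function and argument names are mine, and the denominator terms other than m × SS_effect are passed as a single sum (196.00 in the example just described):

# Generalized eta-squared (Equation 8.20). Here m = 0 because the
# effect of interest (AB) involves the measured factor B.
def generalized_eta2(ss_effect, m, ss_other):
    # ss_other: sum of the remaining denominator terms
    return ss_effect / (m * ss_effect + ss_other)

print(round(generalized_eta2(84.00, 0, 196.00), 3))  # 0.429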
Inferential Measures
The effect sizes described next assume balanced designs. The inferen-
tial measures of association ω̂2 or partial ω̂2 for effects of fixed factors and ρ̂ I
or partial ρ̂ I for effects of random factors are estimated as ratios of variance
components that depend on the design (see Chapter 7). I use the symbol ρ̂ I
only if all factors are random. The statistic ω̂² for any effect in a completely
between-subjects factorial design can be computed directly from the F statistics
and degrees of freedom for all effects. An equation for partial ω̂² for any effect
in the same kind of design is
partial ω̂²_effect = df_effect (F_effect − 1) / [df_effect (F_effect − 1) + N]    (8.22)
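Equation 8.22 translates directly into code. The following Python sketch uses a hypothetical result, F(2, 42) = 5.00 with N = 45, that is not from the text:

# Partial omega-squared from an F statistic (Equation 8.22).
def partial_omega2(df_effect, f_effect, n):
    num = df_effect * (f_effect - 1.0)
    return num / (num + n)

print(round(partial_omega2(2, 5.00, 45), 3))  # 0.151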
Table 8.8 summarizes the variance component estimators; for example, σ̂²_AB = (MS_AB − MS_W)/n. In all cases, σ̂²_error = MS_W, and σ̂²_total is the sum of σ̂²_A, σ̂²_B, σ̂²_AB, and σ̂²_error.
When the equations in Table 8.8 are used, the variance component estimates are

σ̂²_A = (48.00 − 6.00)/[6 (5)] = 1.400 and σ̂²_B = (40.00 − 6.00)/[3 (5)] = 2.267

σ̂²_AB = (6.00 − 4.00)/5 = .400 and σ̂²_error = 4.000

The intraclass correlations for the interaction are then

ρ̂_{I,AB} = .40/8.067 = .050 and partial ρ̂_{I,AB} = .40/(.40 + 4.00) = .091
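The whole chain of computations can be reproduced in a few lines of Python. The design dimensions a = 3, b = 6, and n = 5 are my reading of the divisors shown above:

# Variance components and intraclass correlations for the AB effect.
a, b, n = 3, 6, 5
ms_a, ms_b, ms_ab, ms_w = 48.00, 40.00, 6.00, 4.00

var_a = (ms_a - ms_ab) / (b * n)        # 1.400
var_b = (ms_b - ms_ab) / (a * n)        # 2.267
var_ab = (ms_ab - ms_w) / n             # 0.400
var_error = ms_w                        # 4.000
total = var_a + var_b + var_ab + var_error  # 8.067

rho_ab = var_ab / total                          # .050
partial_rho_ab = var_ab / (var_ab + var_error)   # .091
print(round(rho_ab, 3), round(partial_rho_ab, 3))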
Interval Estimation
You can download the raw data for this example in SPSS format from
the web page for this book. T. G. Brown, Seraganian, Tremblay, and Annis
(2002) randomly assigned 87 men and 42 women who had just been dis-
charged from residential treatment centers for substance abuse to one of two
different 10-week aftercare programs, structured relapse prevention (SRP)
and 12-step facilitation (TSF). The former stressed rehearsal of skills to
avoid relapse, and the latter emphasized traditional methods of Alcoholics
Anonymous. Reported in the top part of Table 8.9 for this 2 × 2 randomized
blocks design are descriptive statistics for a measure of the severity of alcohol-
related problems administered 6 months later where higher scores indicate
more problems. The interaction is disordinal and is illustrated in Figure 8.2:
Women who completed the SRP program have relatively worse outcomes
than women who completed the TSF program, but men had similar outcomes
regardless of aftercare program type.
Presented in the bottom part of Table 8.9 are the source table and values
of standardized contrasts for single-factor effects, including the simple effects
of aftercare program for each gender. The sums of squares are Type I, and the
rationale for their selection in this nonorthogonal design is as follows: Men
have more problems with alcohol than women, so the gender main effect (G)
was not adjusted for other effects. It was less certain whether the aftercare
program (P) would make any difference, so its effect was adjusted for gender.
The GLM procedure of SPSS controlled through its graphical user interface
does not offer an option for calculating sums of squares for user-defined simple
effects, but there is an alternative.
In brief, it is possible to control SPSS by writing text-based syntax
that specifies the data and analysis options. One uses the syntax editor in
SPSS to write and edit the commands, and the resulting syntax file is saved
with the extension .sps (for SPSS syntax). The syntax is executed by high-
lighting (selecting) it with the mouse cursor and then clicking on the “run”
icon, which resembles the icon for “play” in a media player application.
Knowing something about SPSS syntax gives the user access to capabilities
that are not available through the graphical user interface of the program.
For this example, the SPSS syntax requests sums of squares for the simple effects of program
type. The cell sizes in Table 8.9 are 23 and 19 for the two groups of women and 48 and 39 for the two groups of men.
Figure 8.2. Cell means and 95% confidence intervals for µ for the data in Table 8.9.
SRP = structured relapse prevention; TSF = 12-step facilitation.
Given the marginal means in Table 8.9, women reported more alcohol-
related problems at follow-up than men did by about .05 standard deviations
regardless of program type.
Because gender varies naturally, the standardizer for main or simple
effects of program type is the square root of
MS_{W,G,GP} = (SS_W + SS_G + SS_GP)/(df_W + df_G + df_GP) = 52,744.288/127 = 415.31
or 20.38. Based on the marginal means in Table 8.9, the average difference
between the two aftercare programs is -.25 standard deviations in favor of the
TSF program. But this result is uninformative due to the presence of disordinal
interaction. Standardized contrasts for the simple effects of program type are
d_{P at women} = (10.54 − 27.91)/20.38 = −.85 and d_{P at men} = (17.90 − 16.95)/20.38 = .05
These results say that women in the SRP program reported more alcohol-
related problems than women in the TSF program did by about 85% of a
standard deviation. The magnitude of the corresponding difference for men
was only 5% in standard deviation units, but men did somewhat better in
the TSF program than in the SRP program. The difference between the two
standardized simple effects is a standardized interaction contrast, or
d_{P at women} − d_{P at men} = −.85 − .05 = −.90
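These contrasts are simple to verify by script. A minimal Python sketch based on the sums of squares and means reported above:

from math import sqrt

# Pooled standardizer over the intrinsic off-factor gender (square
# root of MS_W,G,GP) and the standardized simple effects of program.
s = sqrt(52744.288 / 127)               # 20.38

d_p_women = (10.54 - 27.91) / s         # -.85
d_p_men = (17.90 - 16.95) / s           #  .05
d_interaction = d_p_women - d_p_men     # -.90
print(round(d_p_women, 2), round(d_p_men, 2), round(d_interaction, 2))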
We can see in the matrix of cell means for this example (voice repeated 1× vs. 3×)
that the size of the FOE is greater when the voice is heard just once instead of
three times. The unstandardized interaction contrast based on these cell means
is 2.80, so the standardized interaction contrast is

d_{ψ̂ FR} = 2.80/√46.42 = .41
Conclusion
Learn More
For life is not a tournament. Its race is not always to the swift nor its
battle to the strong. What counts is enduring to the end.
—Gilbert Meilaender (2011, p. 20)
Concepts About Replication
Types of Replication
There is evidence that only small proportions—in some cases < 1%—
of all published studies in the behavioral sciences are specifically described
as replications (e.g., Easley et al., 2000; Kmetz, 2002). Some possible reasons
are listed next:
1. Misinterpretation of statistical significance. Many widespread false
beliefs about the meaning of statistical significance undoubt-
edly discourage replication. Among the obvious suspects are
the replicability, odds-against-chance, inverse probability, and
valid research hypothesis fallacies. The combined effect of cog-
nitive distortions about p values could lead researchers to be
so overconfident about their results that replication is seen as
unnecessary.
2. Editorial preference for novelty. It is easy to see the clear preference
among journal editors and reviewers for work characterized as original,
that is, work providing new theoretical, methodological, or empirical
contributions.
Meta-Analysis
Steps
Summarized next are the basic iterative phases of effect size synthesis
in meta-analysis:
1. Decide whether to combine results across studies and what to
combine.
2. Estimate a common (average) effect size.
3. Estimate the heterogeneity in effect sizes across studies, and
attempt to explain it—that is, find an appropriate statistical
model for the data.
4. Assess the potential for bias.
The first step is often the computation of a weighted average effect size.
If it can be assumed that the observed effect sizes estimate a single population
effect size—that is, a fixed effects model—their average takes the form
M_{ES} = Σ_{i=1}^{k} w_i ES_i / Σ_{i=1}^{k} w_i    (9.1)
where ESi is the effect size (e.g., d) for the ith result in a set of k effect sizes and
wi is the weight for that result. A weight for each effect size that minimizes
the variance of MES is
w_i = 1/s²_{ES_i}    (9.2)
The square root of Equation 9.3 is the standard error of the average weighted
effect size. The general form of a 100(1 − α)% confidence interval for the
population effect size µ_{ES} is

M_{ES} ± s_{M_{ES}} (z_{2-tail, α})    (9.4)
If a confidence interval for µ_{ES} computed with z_{2-tail, .05} = 1.96 includes zero,
the nil hypothesis that the population effect size is zero cannot be rejected at the .05 level. This
is an example of a statistical test in meta-analysis. The power of this test will
be low if the number of study effect sizes is relatively small, but even trivial
average effect sizes will be statistically significant given sufficiently many pri-
mary studies. These tests also assume that the found studies were randomly
sampled from the population of all studies, but this is not how primary studies
wind up being included in most meta-analyses (i.e., this is another instance
of the design–analysis gap). Thus, statistical tests in meta-analysis are subject
to the same basic limitations as in primary studies.
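A minimal Python sketch of Equations 9.1 through 9.4 follows. The three effect sizes and error variances are hypothetical placeholders, not the Table 9.1 data, and the function name is mine:

from math import sqrt

# Fixed effects synthesis: inverse-variance weights (Equation 9.2),
# weighted average effect size (Equation 9.1), its standard error,
# and a 95% confidence interval (Equation 9.4).
def fixed_effects_summary(es, var, z=1.96):
    w = [1.0 / v for v in var]
    m = sum(wi * ei for wi, ei in zip(w, es)) / sum(w)
    se = sqrt(1.0 / sum(w))
    return m, se, (m - z * se, m + z * se)

m, se, ci = fixed_effects_summary([.30, .50, .45], [.04, .02, .05])
print(round(m, 3), round(se, 3), [round(x, 3) for x in ci])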
Weighting of effect sizes as just described assumes a fixed effects model,
or a conditional model. It assumes that (a) there is one population of studies
with a single true effect size and (b) study effect size departs from true effect size
due to within-studies variance only. Thus, effect sizes in conditional models
are weighted solely by functions of their conditional variances (Equation 9.2).
Other variation in observed effect sizes is viewed as systematic and as a result
of identifiable differences due to meta-analytic predictors (study factors).
Generalizations in a fixed effects model are limited to studies such as those
actually found.
An alternative model for a meta-analysis is a random effects model, also
called an unconditional model. There is no single population of studies or a
constant population effect size presumed to underlie all studies in a random
effects model. It assumes instead that (a) there is a distribution of population
effect sizes (i.e., there is a different true effect size for each study) and (b) there
are two sources of error variance. One is within-studies variation, which in an
unconditional model is conceptualized as the difference between an observed
effect size and the population effect size estimated by that particular study, just as
in a fixed effects model. The second source is between-studies variance, which
concerns the distribution of all population effect sizes around the population
grand mean effect size.
Heterogeneity among the observed effect sizes can be tested with the Q statistic

Q = Σ_{i=1}^{k} w_i (ES_i − M_{ES})² = Σ_{i=1}^{k} w_i ES_i² − (Σ_{i=1}^{k} w_i ES_i)² / Σ_{i=1}^{k} w_i    (9.5)

where the mean effect size M_{ES} and the weights w_i for each of the k effect
sizes are computed assuming a fixed effects model (Equations 9.1–9.2). The
expression on the right side of Equation 9.5 is a computational version more
amenable to hand calculation.
Under the null hypothesis of a single population effect size, the Q sta-
tistic is distributed as a central chi-square with k – 1 degrees of freedom.
The latter is the expected value in a central chi-square distribution. If the
null hypothesis about a fixed effects model is false, the sample χ² (Q) will
increasingly exceed its expected value, the degrees of freedom.
The between-studies variance can then be estimated as

T² = (Q − df)/C    (9.6)
where C is a scaling factor that controls for the fact that Q is a sum of squares,
not a variance. In Equation 9.6, it is the division of Q - df by C that estimates
the between-studies variance in the same metric as the within-studies variance
(Equation 9.2). A computational formula for C is
C = Σ_{i=1}^{k} w_i − (Σ_{i=1}^{k} w_i²)/(Σ_{i=1}^{k} w_i)    (9.7)
Under a random effects model, each effect size is weighted as w*_i = 1/(s²_{ES_i} + T²),
and the error variance of the weighted grand mean effect size M*_{ES} is

s²_{M*_{ES}} = 1/Σ_{i=1}^{k} w*_i    (9.10)
A 100(1 − α)% confidence interval for the population grand mean effect size
µ*_{ES} has the general form

M*_{ES} ± s_{M*_{ES}} (z_{2-tail, α})    (9.11)
Confidence intervals for µ*ES assuming a random effects model are generally
wider than those for µES assuming a fixed effects model for the same data and
level of α. Thus, the choice between the two models in meta-analysis affects
the relative contribution of individual effect sizes and the estimation of both
the weighted average effect size and its precision.
Listed in the left side of Table 9.1 are the results of eight hypothet-
ical studies each based on a two-sample design. The observed effect sizes
are dpool (Equations 5.3–5.4). I used ESCI (Cumming, 2012; see footnote 4,
Chapter 2) to calculate the within-studies variances and weights for a fixed
effects model that are reported in the table.1 Also reported in Table 9.1 are per-
centages that indicate the relative contribution of each result to the weighted
average. These percentages are derived as the ratio of the weight for each study
over the total of all the weights, which is 154.131 for a fixed effects model. For
example, the weight for study 1 in Table 9.1 is 10.667, so the relative contribu-
tion of this effect size is 10.667/154.131 = .069, or about 6.9%.
Given these results from Table 9.1, the average weighted effect size and its
estimated standard error are computed as

M_d = 165.999/154.131 = 1.077 and s_{M_d} = √(1/154.131) = .0805
1
The ESCI program calculates the error variance of dpool using the square of Equation 5.20 except that
the overall sample size N replaces the expression df = N - 2 in this equation.
For these data, the value of the homogeneity statistic is

Q = 203.872 − 165.999²/154.131 = 25.091
With a total of k = 8 studies, the degrees of freedom are 7, and the p value for
χ²(7) = 25.091 is .001. If using a conventional significance test, we would reject
at the .05 level the homogeneity hypothesis that there is a common population
effect size.
Now assuming a random effects model for the data in Table 9.1, we
estimate the between-studies variance as
Σ_{i=1}^{k} w_i² = 3,787.470 and Q − df = 25.091 − 7 = 18.091, so

C = 154.131 − 3,787.470/154.131 = 129.558 and T² = 18.091/129.558 = .140
As expected, the 95% confidence interval for the random effects model, [.77, 1.39],
is wider than that for the fixed effects model, [.92, 1.23].
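The heterogeneity statistics just reported can be checked with a short Python sketch that works from the weighted sums for Table 9.1:

# Q (Equation 9.5), C (Equation 9.7), and T^2 (Equation 9.6) from the
# weighted sums reported in the text for the k = 8 studies.
k = 8
sum_w, sum_w_es, sum_w_es2, sum_w2 = 154.131, 165.999, 203.872, 3787.470

q = sum_w_es2 - sum_w_es ** 2 / sum_w   # 25.091
c = sum_w - sum_w2 / sum_w              # 129.558
t2 = (q - (k - 1)) / c                  # .140
print(round(q, 3), round(c, 3), round(t2, 3))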
The results just described for the data in Table 9.1 are summarized with
the forest plots in Figure 9.1, which show the noncentral 95% confidence
intervals for d for the individual studies.
Figure 9.1. Forest plot for a fixed effects model and a random effects model for the
data in Table 9.1.
Statistical Techniques
Conclusion
Learn More
Contexts for Bayesian Estimation
Bayes’s Theorem
p(H | data) = p(H) p(data | H) / p(data)    (10.2)
p(H | data) = .10 (.80)/.20 = .40
which says that there is a 40% chance that it will snow on days when it is
cold in the morning.
For a set of k mutually exclusive and exhaustive hypotheses, the prior
probabilities sum to unity, or

Σ_{i=1}^{k} p(H_i) = 1.0    (10.3)
i =1
The prior probability of the data in Equation 10.2 for Bayes’s theorem can
now be expressed as
p(data) = Σ_{i=1}^{k} p(H_i) p(data | H_i)    (10.4)
which is the sum of the products of the prior probabilities for each of the k
discrete hypotheses and the likelihood of the data under it.
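A Python sketch of Equations 10.2 and 10.4 for the snow example follows. The likelihood of a cold morning given no snow (about .133) is implied by p(data) = .20 rather than stated in the text:

# Bayes's theorem with two discrete hypotheses (snow vs. no snow).
p_snow = .10                 # prior probability of snow
p_cold_given_snow = .80      # likelihood of the data under H
p_cold = .20                 # prior probability of the data

# Total probability (Equation 10.4) recovers the implied likelihood
# of cold mornings on days without snow:
p_cold_given_no_snow = (p_cold - p_snow * p_cold_given_snow) / (1 - p_snow)

posterior = p_snow * p_cold_given_snow / p_cold   # Equation 10.2
print(round(p_cold_given_no_snow, 3), round(posterior, 2))  # 0.133 0.4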
Suppose that the distribution on a continuous variable in a population
is normal, the variance is known (i.e., it is a constant, not a variable, and
we assume σ² = 144.00), but the mean is not known (i.e., it is a random
variable, µ). There are two competing hypotheses, H1: µ = 100.00 versus
H2: µ = 110.00, which are considered equally likely a priori, or

p(H1) = p(H2) = .50
The likelihood of an observed mean under each hypothesis can be computed
from its normal deviate z with the normal density function

ndf(z) = e^{−z²/2}/√(2π)    (10.5)
The results of the ndf function are divided by two because there are two
hypotheses. These results say that the probabilities of the data under H1
and H2 are, respectively, .0270 and .0820. The prior probability of the data
(M1 = 106.00) is

p(data1) = .50 (.0270) + .50 (.0820) = .0545
and applying Bayes’s theorem tells us that the posterior probabilities for each
hypothesis are
p(H1 | data1) = .50 (.0270)/.0545 = .2477

p(H2 | data1) = .50 (.0820)/.0545 = .7523
1 In Microsoft Excel, the function NORMDIST(z, 0, 1, FALSE) returns the likelihood (density height) of z.
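The posterior probabilities just listed can be reproduced with the Python sketch below. The standard error s_M = 3.00 is implied by the likelihoods quoted in the text (e.g., ndf(2.0)/2 = .0270), and the division by two mirrors the author's convention for two hypotheses:

from math import exp, pi, sqrt

def ndf(z):  # normal density function (Equation 10.5)
    return exp(-z ** 2 / 2.0) / sqrt(2.0 * pi)

m1, s_m = 106.00, 3.00
like_h1 = ndf((m1 - 100.00) / s_m) / 2.0   # .0270 under H1: mu = 100.00
like_h2 = ndf((m1 - 110.00) / s_m) / 2.0   # .0820 under H2: mu = 110.00

p_data1 = .50 * like_h1 + .50 * like_h2    # .0545 (Equation 10.4)
print(round(.50 * like_h1 / p_data1, 4))   # p(H1 | data1) = .2477
print(round(.50 * like_h2 / p_data1, 4))   # p(H2 | data1) = .7523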
The posterior odds are the ratio of the conditional probabilities of two
competing hypotheses for the same data. For the example
Posterior odds1 = p(H2 | data1)/p(H1 | data1) = .7523/.2477 = 3.04
which says that the odds are about 3:1 in favor of H2 that the population
mean is 110.00 over H1 that this mean is 100.00 after observing M1 = 106.00.
Which of the two hypotheses is represented in the numerator is arbitrary. For
this example, the ratio .2477/.7523, or .329, is the posterior odds for H1 rela-
tive to H2 (i.e., about 1:3 against H1).
With Equation 10.2 used for Bayes's theorem, it can be demonstrated
that the posterior odds equal the product of the prior odds and the
likelihood ratio, or the Bayes factor (BF). That is,

Posterior odds = Prior odds × BF

where the prior odds are p(H2)/p(H1) and the Bayes factor is
BF = p(data | H2)/p(data | H1)    (10.6)
which summarizes the relative likelihood of the same data under the two
hypotheses. (Compare Equations 6.8 and 10.5.) The Bayes factor also sum-
marizes the results of the study that allow the update of the odds of the two
hypotheses from what they were before collection of the data (prior odds)
to what they should be given the data. If the prior odds do not favor one
hypothesis over the other (i.e., they equal 1.0), the value of BF directly equals
that of the posterior odds. For the example where the prior odds are 1.0, the value
of the Bayes factor is
BF1 = p(data1 | H2)/p(data1 | H1) = .0820/.0270 = 3.04
which equals the posterior odds for this example calculated earlier (3.04) as
the ratio of the likelihood of the two hypotheses, given M1 = 106.00.
The value of the normal deviate for M2 = 107.50 is 2.500, given s_M = 3.00
and assuming µ = 100.00 under H1, so the likelihood of the second mean under
this hypothesis is p(data2 | H1) = .0088. Under H2, which assumes µ = 110.00,
the normal deviate for the second mean is −.833, so the likelihood under this
hypothesis is p(data2 | H2) = .1410. The updated Bayes factor is

BF2 = p(data2 | H2)/p(data2 | H1) = .1410/.0088 = 16.02
Multiplying the earlier posterior odds by BF2 yields updated odds that favor
H2, that µ = 110.00, over H1, that µ = 100.00, even more strongly than when
only the first result (posterior odds = 3.04) was available.
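Sequential updating is just repeated multiplication of odds by Bayes factors. In the sketch below, the combined odds of about 49:1 are my own product of the two reported factors; the text reports the Bayes factors separately:

# Posterior odds = prior odds x accumulated Bayes factors.
prior_odds = 1.0          # p(H2)/p(H1) = .50/.50
bf1, bf2 = 3.04, 16.02    # from M1 = 106.00 and M2 = 107.50

odds_after_first = prior_odds * bf1          # 3.04
odds_after_second = odds_after_first * bf2   # about 48.7
print(round(odds_after_second, 1))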
which predicts µ1 > µ2 but also limits the upper bound of the expected popula-
tion mean difference to 5.0. The distribution in Figure 10.1(c) also represents
every result within the range 0–5.0 as equally plausible. Specification of the
lower and upper bounds of a rectangular distribution is sometimes justified by
the scale on which means are calculated. If scores on that scale range from 0
to 5, the difference between two means cannot exceed 5.0.
Figure 10.1. Prior distributions for the population mean difference, µ1 − µ2, each plotted as probability (0–1.0) over the range −5.0 to 5.0.
2 http://www.lifesci.sussex.ac.uk/home/Zoltan_Dienes/inference/bayes_factor.swf
3 http://pcl.missouri.edu/bayesfactor
m_1 = [prc_0/(prc_0 + prc_1)] m_0 + [prc_1/(prc_0 + prc_1)] M_1    (10.8)

s_1² = 1/(prc_0 + prc_1)    (10.9)
Note in Equation 10.8 that the relative contribution of new knowledge, the
observed mean M1, depends on its precision, prc1, and the precision of all
prior knowledge taken together, prc0.
An example demonstrates the iterative estimation of the posterior
distribution for a random population mean as new data are collected. The
distributional characteristics stated earlier are assumed. Suppose that the
researcher has no basis to make a prior prediction about the value of m, so a
flat prior distribution with infinite variance is specified as the prior distribu-
tion. A sample of 100 cases is selected, and the results are M = 106.00 and
s_M = 2.50 (see Table 10.1). The traditional 95% confidence interval for the
population mean computed with z_{2-tail, .05} = 1.96 instead of t_{2-tail, .05}(99) = 1.98
is 106.00 ± 2.50 (1.96), or [101.10, 110.90]. Because the prior distribution is
flat, the mean and standard deviation of the posterior distribution equal the
observed mean and standard error, respectively. The Bayesian 95% credible
interval for the random population mean µ calculated in the posterior
distribution thus defines exactly the same interval as the traditional 95%
confidence interval calculated earlier. We can say, based on the data, that the
probability is .95 that the interval [101.10, 110.90] includes the true value of µ.
But after something is known about the parameter (i.e., there are data),
traditional confidence intervals are no longer interpreted this way.
All of the information just described is summarized in the first row of
Table 10.1. The remaining rows in the table give the characteristics of the
prior and posterior distributions and results in three subsequent samples,
each based on 100 cases. For each new result, the posterior distribution from
the previous study is taken as the prior distribution for that result. For exam-
ple, the posterior distribution, given just the results of the first sample, with
the characteristics
becomes the prior distribution for the results in the second sample, which are
Table 10.1
Means and Standard Deviations of Prior Distributions and Posterior
Distributions Given Data From Four Different Studies

          Prior distribution     Data               Posterior distribution
Study     m         s            M        s_M       m         s       95% CI
1         —         ∞            106.00   2.50      106.00    2.50    [101.10, 110.90]
2         106.00    2.50         107.50   3.00      106.61    1.92    [102.85, 110.37]
3         106.61    1.92         112.00   2.80      108.33    1.58    [105.23, 111.43]
4         108.33    1.58         109.00   2.50      108.52    1.34    [105.89, 111.15]

Note. The sample size for all studies is N = 100. The prior distribution for Study 1 is a flat distribution with infinite variance in which no prediction is made about the population mean. CI = Bayesian credible interval.
Figure 10.2. Plots of the prior distribution before collecting the second sample,
the distribution in the second study, and the posterior distribution for the data in
Table 10.1: Prior 1 (m_1 = 106.00, s_1 = 2.50), Study 2 (M = 107.50, s_M = 3.00),
Posterior 2 (m_2 = 106.61, s_2 = 1.92).
The mean and standard deviation in the posterior distribution, given the
results in the first and second samples, are
m_2 = [.16/(.16 + .11)] 106.00 + [.11/(.16 + .11)] 107.50 = 106.61

s_2 = √[1/(.16 + .11)] = 1.92
That is, our best single guess for the true population mean has shifted slightly
from 106.00 to 106.61 after the second result, and the standard deviation in
the posterior distribution is reduced from 2.50 before collecting the second
sample to 1.92 after observing the second sample. Our new Bayesian 95%
credible interval is [102.85, 110.37], which is slightly narrower than the pre-
vious 95% credible interval, [101.10, 110.90].
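Equations 10.8 and 10.9 make this updating easy to script. The Python sketch below reproduces the second row of Table 10.1; the function name is mine:

from math import sqrt

# Precision-weighted update of a normal mean (Equations 10.8-10.9).
def update(prior_m, prior_s, data_m, data_se):
    prc0, prc1 = 1.0 / prior_s ** 2, 1.0 / data_se ** 2
    m = (prc0 * prior_m + prc1 * data_m) / (prc0 + prc1)
    s = sqrt(1.0 / (prc0 + prc1))
    return m, s

m2, s2 = update(106.00, 2.50, 107.50, 3.00)
print(round(m2, 2), round(s2, 2))   # 106.61 1.92
# 95% credible interval: m2 +/- 1.96 * s2, close to the tabled
# [102.85, 110.37]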
I used an online plotter by Dienes (2008) to display the prior, sample,
and posterior distributions shown in Figure 10.2 for the results just described.4
This graphic shows the change from the prior to posterior distributions after
observing the results in the second sample. The last two rows in Table 10.1
show changes in the prior and posterior distributions as results from two
additional samples are synthesized. Note in the table that the widths of the 95% credible intervals become progressively narrower as additional results are synthesized.
4http://www.lifesci.sussex.ac.uk/home/Zoltan_Dienes/inference/bayes_normalposterior.swf
Evaluation
Bayesian methods are flexible and can evaluate the kinds of questions
that researchers would really like answered. An obstacle to their wider use
in the behavioral sciences was that many older reference works for Bayesian
statistics were quite technical. They often required familiarity with inte-
gral notation for probability distributions and estimation techniques for the
parameters of different kinds of probability distributions. Such presentations
are not accessible for applied researchers without strong quantitative back-
grounds. But this situation is changing, and there are now some books that
introduce Bayesian methods to a wider audience in the behavioral sciences
(e.g., Dienes, 2008).
A second obstacle was the relative paucity of Bayesian software tools
for behavioral scientists, but things have improved in this area, too. A freely
available software tool for Bayesian analysis is WinBUGS (Bayesian Inference
Using Gibbs Sampling; Lunn, Thomas, Best, & Spiegelhalter, 2000) for
personal computers.5 There is also an open-source version of WinBUGS, OpenBUGS.6
5http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml
6http://www.openbugs.info/w/
Conclusion
Learn More
Abelson, R. P. (1997a). A retrospective on the significance test ban of 1999 (If there
were no significance tests, they would be invented). In L. L. Harlow, S. A. Mulaik,
& J. H. Steiger (Eds.), What if there were no significance tests? (pp. 117–141).
Mahwah, NJ: Erlbaum.
Abelson, R. P. (1997b). On the surprising longevity of flogged horses: Why there is a case for
the significance test. Psychological Science, 8, 12–15. doi:10.1111/j.1467-9280.1997.
tb00536.x
Agresti, A. (2007). An introduction to categorical data analysis (2nd ed.). Hoboken, NJ:
Wiley. doi:10.1002/0470114754
Aguinis, H., Werner, S., Abbott, J. L., Angert, C., Park, J. H., & Kohlhausen, D.
(2010). Customer-centric science: Reporting significant research results with
rigor, relevance, and practical impact in mind. Organizational Research Methods,
13, 515–539. doi:10.1177/1094428109333339
Aiken, L. S., West, S. G., Sechrest, L., Reno, R. R., Roediger, H. L., III, Scarr,
S., . . . Sherman, S. J. (1990). Measurement in psychology: A survey of PhD pro-
grams in North America. American Psychologist, 45, 721–734. doi:10.1037/0003-
066X.45.6.721
Algina, J., & Keselman, H. J. (2003). Approximate confidence intervals for effect
sizes. Educational and Psychological Measurement, 63, 537–553. doi:10.1177/
0013164403256358
Algina, J., Keselman, H. J., & Penfield, R. D. (2005a). An alternative to Cohen’s stan-
dardized mean difference effect size: A robust parameter and confidence inter-
vals in the two independent groups case. Psychological Methods, 10, 317–328.
doi:10.1037/1082-989X.10.3.317
Algina, J., Keselman, H. J., & Penfield, R. (2005b). Effect sizes and their intervals:
The two-level repeated measures case. Educational and Psychological Measure-
ment, 65, 241–258. doi:10.1177/0013164404268675
Algina, J., Keselman, H. J., & Penfield, R. D. (2006). Confidence intervals for an
effect size when variances are not equal. Journal of Modern Applied Statistical
Methods, 5, 2–13. doi:10.1177/0013164406288161
American Educational Research Association, American Psychological Associa-
tion, & National Council on Measurement in Education. (1999). Standards for
educational and psychological testing. Washington, DC: American Psychological
Association.
American Psychological Association. (2001). Publication manual of the American
Psychological Association (5th ed.). Washington, DC: Author.
American Psychological Association. (2010). Publication manual of the American
Psychological Association (6th ed.). Washington, DC: Author.
Andersen, M. B. (2007). But what do the numbers really tell us? Arbitrary metrics
and effect size reporting in sport psychology research. Journal of Sport & Exercise
Psychology, 29, 664–672. Retrieved from http://journals.humankinetics.com/jsep
Anderson, D. R., Burnham, K. P., & Thompson, W. L. (2000). Null hypothesis test-
ing: Problems, prevalence, and an alternative. Journal of Wildlife Management, 64,
912–923. doi:10.2307/3803199
Armstrong, J. S. (2007). Significance tests harm progress in forecasting. International
Journal of Forecasting, 23, 321–327. doi:10.1016/j.ijforecast.2007.03.004
Aron, A., & Aron, E. N. (2002). Statistics for the behavioral and social sciences (2nd ed.).
Upper Saddle River, NJ: Prentice Hall.
Austin, P. C., Mamdani, M. M., Juurlink, D. N., & Hux, J. E. (2006). Testing multiple
statistical hypotheses resulted in spurious associations: A study of astrological
signs and health. Journal of Clinical Epidemiology, 59, 964–969. doi:10.1016/j.
jclinepi.2006.01.012
Baguley, T. (2004). An introduction to sphericity. Retrieved from http://homepages.
gold.ac.uk/aphome/spheric.html
Baguley, T. (2009). Standardized or simple effect size: What should be reported?
British Journal of Psychology, 100, 601–617. doi:10.1348/000712608X377117
Bakan, D. (1966). The test of significance in psychological research. Psychological
Bulletin, 66, 423–437. doi:10.1037/h0020412
Bayes, T. (1763). A letter to John Canton. Philosophical Transactions of the Royal Society
of London, 53, 293–295.
Beck, A. T., Rush, A. J., Shaw, B. F., & Emory, G. (1979). Cognitive therapy of depression.
New York, NY: Guilford Press.
Belasco, J., & Stayer, R. (1993). Flight of the buffalo: Soaring to excellence, learning to let
employees lead. New York, NY: Warner.
Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand
confidence intervals and standard error bars. Psychological Methods, 10, 389–396.
doi:10.1037/1082-989X.10.4.389
Bellinger, D. C. (2007). Interpretation of small effect sizes in occupational and envi-
ronmental neurotoxicology: Individual versus population risk. Neurotoxicology,
28, 245–251. doi:10.1016/j.neuro.2006.05.009
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive
influences on cognition and affect. Journal of Personality and Social Psychology, 100,
407–425. doi:10.1037/a0021524
Berkson, J. (1942). Tests of significance considered as evidence. Journal of the American
Statistical Association, 37, 325–335. doi:10.1080/01621459.1942.10501760
Bester, A. (1979). 5,271,009. In M. H. Greenberg & J. Olander (Eds.), Science fiction of
the fifties (pp. 187–221). New York, NY: Avon Books. (Original work published
1954)
Bird, K. D. (2002). Confidence intervals for effect sizes in analysis of variance. Educational
and Psychological Measurement, 62, 197–226. doi:10.1177/0013164402062002001
Card, N. A. (2012). Applied meta-analysis for social science research. New York, NY:
Guilford Press.
Cartwright, D. (1973). Determinants of scientific progress: The case of research on
the risky shift. American Psychologist, 28, 222–231. doi:10.1037/h0034445
Carver, R. P. (1978). The case against significance testing. Harvard Educational Review,
48, 378–399. Retrieved from http://www.hepg.org/main/her/Index.html
Casscells, W., Schoenberger, A., & Graboys, T. (1978). Interpretation by physicians
of clinical laboratory results. New England Journal of Medicine, 299, 999–1001.
doi:10.1056/NEJM197811022991808
Chater, N., Tenenbaum, J. B., & Yuille, A. (2006). Probabilistic models of cog-
nition: Conceptual foundations. Trends in Cognitive Sciences, 10, 287–291.
doi:10.1016/j.tics.2006.05.007
Chernick, M. R. (2008). Bootstrap methods: A guide for practitioners and researchers
(2nd ed.). Hoboken, NJ: Wiley.
Chinn, S. (2000). A simple method for converting an odds ratio to effect size for
use in meta-analysis. Statistics in Medicine, 19, 3127–3131. doi:10.1002/1097-
0258(20001130)19:22<3127::AID-SIM784>3.0.CO;2-M
Christina, R. (2010). Extreme risk management: Revolutionary approaches to evaluating
and measuring risk. New York, NY: McGraw Hill.
Cohen, J. (1962). The statistical power of abnormal–social psychological research:
A review. Journal of Abnormal and Social Psychology, 65, 145–153. doi:10.1037/
h0045186
Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological
Bulletin, 70, 426–443. doi:10.1037/h0026714
Cohen, J. (1969). Statistical power analyses for the behavioral sciences. New York, NY:
Academic Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York,
NY: Academic Press.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
doi:10.1037/0003-066X.49.12.997
Colliver, J. A., & Markwell, S. J. (2006). ANCOVA, selection bias, statistical equat-
ing, and effect size: Recommendations for publication. Teaching and Learning in
Medicine, 18, 284–286. doi:10.1207/s15328015tlm1804_1
Conn, V. S., & Rantz, M. J. (2003). Research methods: Managing primary study qual-
ity in meta-analyses. Research in Nursing & Health, 26, 322–333. doi:10.1002/
nur.10092
Cook, S., & Wilding, J. (2001). Earwitness testimony: Effects of exposure and atten-
tion on the face overshadowing effect. British Journal of Psychology, 92, 617–629.
doi:10.1348/000712601162374
Cortina, J. M., & Nouri, H. (2000). Effect size for ANOVA designs. Thousand Oaks,
CA: Sage.
Dunleavy, E. M., Barr, C. D., Glenn, D. M., & Miller, K. R. (2006). Effect size report-
ing in applied psychology: How are we doing? The Industrial-Organizational Psy-
chologist, 43(4), 29–37. Retrieved from http://www.siop.org/tip/tip.aspx
Easley, R. W., Madden, C. S., & Dunn, M. G. (2000). Conducting marketing science:
The role of replication in the research process. Journal of Business Research, 48,
83–92. doi:10.1016/S0148-2963(98)00079-4
Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for
psychological research. Psychological Review, 70, 193–242. doi:10.1037/h0044139
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statis-
tics, 7, 1–26. doi:10.1214/aos/1176344552
Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and
the interpretation of research results. New York, NY: Cambridge University Press.
doi:10.1017/CBO9780511761676
Erceg-Hurn, D. M., & Mirosevich, V. M. (2008). Modern robust statistical methods:
An easy way to maximize the accuracy and power of your research. American
Psychologist, 63, 591–601. doi:10.1037/0003-066X.63.7.591
Eysenck, H. J. (1995). Meta-analysis squared—Does it make sense? American Psy-
chologist, 50, 110–111. doi:10.1037/0003-066X.50.2.110
Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard: The amazing
persistence of a probabilistic misconception. Theory & Psychology, 5, 75–98.
doi:10.1177/0959354395051004
Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible
statistical power analysis program for the social, behavioral, and biomedical
sciences. Behavior Research Methods, 39, 175–191. doi:10.3758/BF03193146
Ferguson, C. J. (2009). Is psychology research really as good as medical research?
Effect size comparisons between psychology and medicine. Review of General
Psychology, 13, 130–136. doi:10.1037/a0015103
Fern, E. F., & Monroe, K. B. (1996). Effect-size estimates: Issues and problems. Journal
of Consumer Research, 23, 89–105. doi:10.1086/209469
Fidler, F., Burgman, M. A., Cumming, G., Buttrose, R., & Thomason, N. (2006).
Impact of criticism of null-hypothesis significance testing on statistical report-
ing practices in conservation biology. Conservation Biology, 20, 1539–1544.
doi:10.1111/j.1523-1739.2006.00525.x
Fidler, F., Cumming, G., Thomason, N., Pannuzzo, D., Smith, J., Fyffe, P., . . . Schmitt,
R. (2005). Toward improved statistical reporting in the Journal of Consulting and
Clinical Psychology. Journal of Consulting and Clinical Psychology, 73, 136–143.
doi:10.1037/0022-006X.73.1.136
Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004). Editors can
lead researchers to confidence intervals, but can’t make them think: Statistical
reform lessons from medicine. Psychological Science, 15, 119–126. doi:10.1111/
j.0963-7214.2004.01502008.x
Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33, 587–606.
doi:10.1016/j.socec.2004.09.033
Gigerenzer, G., & Murray, D. (1987). Cognition as intuitive statistics. Hillsdale, NJ:
Erlbaum.
Gilbody, S. M., Song, F., Eastwood, A. J., & Sutton, A. (2000). The causes, conse-
quences and detection of publication bias in psychiatry. Acta Psychiatrica Scan-
dinavica, 102, 241–249. doi:10.1034/j.1600-0447.2000.102004241.x
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research.
Newbury Park, CA: Sage.
Glass, G. V., Peckham, P. D., & Sanders, J. R. (1972). Consequences of failure to
meet assumptions underlying the fixed analysis of variance and covariance.
Review of Educational Research, 42, 237–288. doi:10.2307/1169991
Gleick, J. (1987). Chaos: Making a new science. New York, NY: Viking Penguin.
Gliner, J. A., Leech, N. L., & Morgan, G. A. (2002). Problems with null hypothesis
significance testing (NHST): What do the textbooks say? Journal of Experimental
Education, 71, 83–92. doi:10.1080/00220970209602058
Gorard, S. (2006). Towards a judgment-based statistical analysis. British Journal of
Sociology of Education, 27, 67–80. doi:10.1080/01425690500376663
Gouzoulis-Mayfrank, E., Daumann, J., Tuchtenhagen, F., Pelz, S., Becker, S., Kunert,
H.-J., . . . Sass, H. (2000). Impaired cognitive performance in drug free users of
recreational ecstasy (MDMA). Journal of Neurology, Neurosurgery & Psychiatry,
68, 719–725. doi:10.1136/jnnp.68.6.719
Gray, P. O. (2002). Psychology (4th ed.). New York, NY: Worth.
Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996). Effect sizes
and p values: What should be reported and what should be replicated? Psycho-
physiology, 33, 175–183. doi:10.1111/j.1469-8986.1996.tb02121.x
Grimes, D. A., & Schulz, K. F. (2002). Uses and abuses of screening tests. Lancet,
359, 881–884. doi:10.1016/S0140-6736(02)07948-5
Grissom, R. J., & Kim, J. J. (2011). Effect sizes for research: Univariate and multivariate
applications (2nd ed.). New York, NY: Routledge.
Guthery, F. S., Lusk, J. J., & Peterson, M. J. (2001). The fall of the null hypoth-
esis: Liabilities and opportunities. Journal of Wildlife Management, 65, 379–384.
doi:10.2307/3803089
Halkin, A., Reichman, J., Schwaber, M., Paltiel, O., & Brezis, M. (1998). Likelihood
ratios: Getting diagnostic testing into perspective. Quarterly Journal of Medicine,
91, 247–258. doi:10.1093/qjmed/91.4.247
Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem stu-
dents share with their teachers? Methods of Psychological Research Online, 7(1),
1–17. Retrieved from http://www.dgps.de/fachgruppen/methoden/mpr-online/
Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds.). (1997). What if there were no
significance tests? Mahwah, NJ: Erlbaum.
Hunt, M. (1997). How science takes stock. New York, NY: Russell Sage Foundation.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and
bias in research findings (2nd ed.). Thousand Oaks, CA: Sage.
Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis: Cumulating
research findings across studies. Beverly Hills, CA: Sage.
Hurlbert, S. H., & Lombardi, C. M. (2009). Final collapse of the Neyman–Pearson
decision theory framework and rise of the neoFisherian. Annales Zoologici Fen-
nici, 46, 311–349. Retrieved from http://www.sekj.org/AnnZool.html
Hyde, J. S. (2001). Reporting effect sizes: The role of editors, textbook authors, and
publication manuals. Educational and Psychological Measurement, 61, 225–228.
doi:10.1177/0013164401612005
Hyde, J. S. (2005). The gender similarities hypothesis. American Psychologist, 60,
581–592. doi:10.1037/0003-066X.60.6.581
International Committee of Medical Journal Editors. (2010). Uniform requirements
for manuscripts submitted to biomedical journals: Writing and editing for bio-
medical publication. Retrieved from http://www.icmje.org/urm_full.pdf
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS
Medicine, 2(8), e124. doi:10.1371/journal.pmed.0020124
Iverson, G. J., & Lee, M. D. (2009). prep misestimates the probability of replication.
Psychonomic Bulletin & Review, 16, 424–429. doi:10.3758/PBR.16.2.424
James, G. S. (1951). The comparison of several groups of observations when the ratios
of the population variances are unknown. Biometrika, 38, 324–329. doi:10.1093/
biomet/38.3-4.324
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford, England: Oxford Univer-
sity Press.
Johnson, D. H. (1999). The insignificance of statistical significance testing. Journal
of Wildlife Management, 63, 763–772. doi:10.2307/3802789
Johnson, M. K., & Liebert, R. M. (1977). Statistics: Tool of the behavioral sciences.
Englewood Cliffs, NJ: Prentice Hall.
Kanfer, R., & Ackerman, P. L. (1989). Motivation and cognitive abilities: An
integrative/aptitude–treatment interaction approach to skill acquisition. Jour-
nal of Applied Psychology, 74, 657–690. doi:10.1037/0021-9010.74.4.657
Kazdin, A. (2006). Arbitrary metrics: Implications for identifying evidence-based treatments. American Psychologist, 61, 42–49. doi:10.1037/0003-066X.61.1.42
Kelley, K. (2007). Confidence intervals for standardized effect sizes: Theory, applica-
tion, and implementation. Journal of Statistical Software, 20(8). Retrieved from
http://www.jstatsoft.org/v20/i08
Kelley, K., & Preacher, K. J. (2012). On effect size. Psychological Methods, 17, 137–152.
doi:10.1037/a0028086
Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estima-
tion and model comparison. Perspectives on Psychological Science, 6, 299–312.
doi:10.1177/1745691611406925
Kuhn, T. S. (1996). The structure of scientific revolutions (3rd ed.). Chicago, IL: Uni-
versity of Chicago Press.
Kupfersmid, J., & Fiala, M. (1991). A survey of attitudes and behaviors of authors
who publish in psychology and education journals. American Psychologist, 46,
249–250. doi:10.1037/0003-066X.46.3.249
Lambdin, C. (2012). Significance tests as sorcery: Science is empirical—significance
tests are not. Theory & Psychology, 22, 67–90. doi:10.1177/0959354311429854
Lenth, R. V. (2006–2009). Java applets for power and sample size. Retrieved from
http://www.stat.uiowa.edu/~rlenth/Power
Lindberg, S. M., Hyde, J. S., Petersen, J. L., & Linn, M. C. (2010). New trends in gender and mathematics performance: A meta-analysis. Psychological Bulletin, 136, 1123–1135. doi:10.1037/a0021276
Little, T. D., Lindenberger, U., & Nesselroade, J. R. (1999). On selecting indica-
tors for multivariate measurement and modeling with latent variables: When
“good” indicators are bad and “bad” indicators are good. Psychological Methods,
4, 192–211. doi:10.1037/1082-989X.4.2.192
Lix, L. M., Keselman, J. C., & Keselman, H. J. (1996). Consequences of assump-
tions violations revisited: A quantitative review of alternatives to the one-
way analysis of variance F test. Review of Educational Research, 66, 579–619.
doi:10.3102/00346543066004579
Loftus, G. R. (1993). Editorial comment. Memory & Cognition, 21, 1–3. doi:10.3758/
BF03211158
Longford, N. T. (2005). Editorial: Model selection and efficiency: Is “which
model . . . ?” the right question? Journal of the Royal Statistical Society: Series A,
168, 469–472. doi:10.1111/j.1467-985X.2005.00366.x
Lunn, D. J., Spiegelhalter, D., Thomas, A., & Best, N. (2009). The BUGS project:
Evolution, critique and future directions. Statistics in Medicine, 28, 3049–3082.
doi:10.1002/sim.3680
Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS—A Bayes-
ian modelling framework: Concepts, structure, and extensibility. Statistics and
Computing, 10, 325–337. doi:10.1023/A:1008929526011
Lunneborg, C. (2000). Modeling experimental and observational data. Belmont, CA:
Duxbury Press.
Lunneborg, C. E. (2001). Random assignment of available cases: Bootstrap standard
errors and confidence intervals. Psychological Methods, 6, 402–412. doi:10.1037/
1082-989X.6.4.402
Lykken, D. T. (1991). What’s wrong with psychology, anyway? In D. Cicchetti & W.
Grove (Eds.), Thinking clearly about psychology (Vol. 1, pp. 3–39). Minneapolis:
University of Minnesota Press.
Meilaender, G. (2011). Playing the long season. First Things, 214, 19–20. Retrieved
from http://www.firstthings.com/
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures.
Psychological Bulletin, 105, 156–166. doi:10.1037/0033-2909.105.1.156
Miller, G. A., & Chapman, J. P. (2001). Misunderstanding analysis of covari-
ance. Journal of Abnormal Psychology, 110, 40–48. doi:10.1037/0021-843X.
110.1.40
Miller, J. (2009). What is the probability of replicating a statistically significant
effect? Psychonomic Bulletin & Review, 16, 617–640. doi:10.3758/PBR.16.4.617
Montgomery, A. A., Peters, T. J., & Little, P. (2003). Design, analysis and presentation
of factorial randomised controlled trials. BMC Medical Research Methodology, 3,
Article 26. doi:10.1186/1471-2288-3-26
Moons, K. G. M., van Es, G.-A., Deckers, J. W., Habbema, J. D. F., & Grobbee, D. E.
(1997). Limitations of sensitivity, specificity, likelihood ratio, and Bayes’ theo-
rem in assessing diagnostic probabilities: A clinical example. Epidemiology, 8,
12–17. doi:10.1097/00001648-199701000-00002
Morey, R. D., & Rouder, J. N. (2011). Bayes factor approaches for testing interval
null hypotheses. Psychological Methods, 16, 406–419. doi:10.1037/a0024377
Morris, S. B., & DeShon, R. P. (1997). Correcting effect sizes computed with fac-
torial analyses of variance for use in meta-analysis. Psychological Methods, 2,
192–199. doi:10.1037/1082-989X.2.2.192
Mossman, D., & Berger, J. O. (2001). Intervals for posttest probabilities: A com-
parison of 5 methods. Medical Decision Making, 21, 498–507. doi:10.1177/
0272989X0102100608
Myers, J. L., Well, A. D., & Lorch, R. F., Jr. (2010). Research design and statistical
analysis (3rd ed.). New York, NY: Routledge Academic.
Neal, D. E., Donovan, J. L., Martin, R. M., & Hamdy, F. C. (2009). Screening for
prostate cancer remains controversial. Lancet, 374, 1482–1483. doi:10.1016/
S0140-6736(09)61085-0
Nelson, N., Rosenthal, R., & Rosnow, R. L. (1986). Interpretation of significance
levels and effect sizes by psychological researchers. American Psychologist, 41,
1299–1301. doi:10.1037/0003-066X.41.11.1299
Nestoriuc, Y., Kriston, L., & Rief, W. (2010). Meta-analysis as the core of evidence-
based behavioral medicine: Tools and pitfalls of a statistical approach. Current
Opinion in Psychiatry, 23, 145–150. doi:10.1097/YCO.0b013e328336666b
Neuliep, J. W., & Crandall, R. (1990). Editorial bias against replication research.
Journal of Social Behavior and Personality, 5, 85–90. Retrieved from http://www.
rickcrandall.com/services/jsbp/#posts
Neuliep, J. W., & Crandall, R. (1993). Reviewer bias against replication research.
Journal of Social Behavior and Personality, 8, 21–29. Retrieved from http://www.
rickcrandall.com/services/jsbp/#posts
Penfield, R. D., Algina, J., & Keselman, H. J. (2006). ES Bootstrap 2 [Computer
software]. Available from http://plaza.ufl.edu/algina/index.programs.html
Penner, A. M. (2008). Gender differences in extreme mathematical achievement:
An international perspective on biological and social factors. American Journal
of Sociology, 114(Suppl. 1), S138–S170. doi:10.1086/589252
Perkins, N. J., & Schisterman, E. F. (2006). The inconsistency of “optimal” cutpoints using two criteria based on the receiver operating characteristic curve. American Journal of Epidemiology, 163, 670–675. doi:10.1093/aje/kwj063
Perlis, A. J. (1982). Epigrams on programming. ACM SIGPLAN Notices, 17(9),
7–13. doi:10.1145/947955.1083808
Pierce, C. A., Block, R. A., & Aguinis, H. (2004). Cautionary note on reporting eta-
squared values from multifactor ANOVA designs. Educational and Psychological
Measurement, 64, 916–924. doi:10.1177/0013164404264848
Platt, J. R. (1964, October 16). Strong inference: Certain systematic methods of
scientific thinking may produce much more rapid progress than others. Science,
146, 347–353. doi:10.1126/science.146.3642.347
Poitevineau, J., & Lecoutre, B. (2001). The .05 cliff effect may be overstated. Psycho-
nomic Bulletin & Review, 8, 847–850. doi:10.3758/BF03196227
Pollard, P. (1993). How significant is “significance”? In G. Keren & C. Lewis (Eds.),
A handbook for data analysis in the behavioral sciences: Vol. 1. Methodological issues
(pp. 449–460). Hillsdale, NJ: Erlbaum.
Pourret, O., Naïm, P., & Marcot, B. (Eds.). (2008). Bayesian networks: A practical
guide to applications. New York, NY: Wiley.
Pratt, T. C. (2010). Meta-analysis in criminal justice and criminology: What it is,
when it’s useful, and what to watch out for. Journal of Criminal Justice Education,
21, 152–168. doi:10.1080/10511251003693678
Prentice, D. A., & Miller, D. T. (1992). When small effects are impressive. Psychologi-
cal Bulletin, 112, 160–164. doi:10.1037/0033-2909.112.1.160
Provalis Research. (1994–2004). SimStat (Version 2.5.8) [Computer software].
Montréal, Québec, Canada: Author.
Reichardt, C. S., & Gollob, H. F. (1997). When confidence intervals should be used
instead of statistical tests, and vice versa. In L. L. Harlow, S. A. Mulaik, & J. H.
Steiger (Eds.), What if there were no significance tests? (pp. 259–284). Mahwah,
NJ: Erlbaum.
Robinson, D. H., & Levin, J. R. (1997). Reflections on statistical and substantive
significance, with a slice of replication. Educational Researcher, 26(5), 21–26.
doi:10.3102/0013189X026005021
Robinson, D. H., & Wainer, H. (2002). On the past and future of null hypothesis signifi-
cance testing. Journal of Wildlife Management, 66, 263–271. doi:10.2307/3803158
Rodgers, J. L. (2009). The bootstrap, the jackknife, and the randomization test: A
sampling taxonomy. Multivariate Behavioral Research, 34, 441–456. doi:10.1207/
S15327906MBR3404_2
Searle, S. R., Casella, G., & McCulloch, C. E. (1992). Variance components. New York,
NY: Wiley.
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect
on the power of studies? Psychological Bulletin, 105, 309–316. doi:10.1037/0033-
2909.105.2.309
Seggar, L. B., Lambert, M. J., & Hansen, N. B. (2002). Assessing clinical significance:
Application to the Beck Depression Inventory. Behavior Therapy, 33, 253–269.
doi:10.1016/S0005-7894(02)80028-4
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2001). Experimental and quasi-
experimental designs for generalized causal inference. New York, NY: Houghton
Mifflin.
Shrout, P. E. (1997). Should significance tests be banned? Introduction to a special
section exploring the pros and cons. Psychological Science, 8, 1–2. doi:10.1111/
j.1467-9280.1997.tb00533.x
Simel, D. L., Samsa, G. P., & Matchar, D. B. (1991). Likelihood ratios with con-
fidence: Sample size estimation for diagnostic test studies. Journal of Clinical
Epidemiology, 44, 763–770. doi:10.1016/0895-4356(91)90128-V
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. doi:10.1177/0956797611417632
Smith, M. L., & Glass, G. V. (1977). Meta-analysis of psychotherapy outcome stud-
ies. American Psychologist, 32, 752–760. doi:10.1037/0003-066X.32.9.752
Smithson, M. (2003). Confidence intervals. Thousand Oaks, CA: Sage.
Snyder, P., & Lawson, S. (1993). Evaluating results using corrected and uncorrected
effect size estimates. Journal of Experimental Education, 61, 334–349. Retrieved
from http://www.tandfonline.com/toc/vjxe20/current
Sohn, D. (2000). Significance testing and the science. American Psychologist, 55,
964–965. doi:10.1037/0003-066X.55.8.964
Spence, G. (1995). How to argue and win every time: At home, at work, in court, every-
where, everyday. New York, NY: St. Martin’s Press.
Statistics.com. (2009). Resampling Stats (Version 4) [Computer software]. Arlington,
VA: Author.
Steering Committee of the Physicians’ Health Study Research Group. (1988). Pre-
liminary report: Findings from the aspirin component of the ongoing Physicians’
Health Study. New England Journal of Medicine, 318, 262–264. doi:10.1056/
NEJM198801283180431
Steiger, J. H. (2004). Beyond the F test: Effect size confidence intervals and tests of
close fit in the analysis of variance and contrast analysis. Psychological Methods,
9, 164–182. doi:10.1037/1082-989X.9.2.164
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the
evaluation of statistical models. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger
Thompson, B. (1999). Journal editorial policies regarding statistical significance tests:
Heat is to fire as p is to importance. Educational Psychology Review, 11, 157–169.
doi:10.1023/A:1022028509820
Thompson, B. (2001). Significance, effect sizes, stepwise methods, and other issues:
Strong arguments move the field. Journal of Experimental Education, 70, 80–93.
doi:10.1080/00220970109599499
Thompson, B. (Ed.). (2003). Score reliability: Contemporary thinking on reliability issues.
Thousand Oaks, CA: Sage.
Thompson, B. (2006a). Foundations of behavioral statistics: An insight-based approach.
New York, NY: Guilford Press.
Thompson, B. (2006b). Research synthesis: Effect sizes. In J. Green, G. Camilli, &
P. B. Elmore (Eds.), Handbook of complementary methods in education research
(pp. 583–603). Washington, DC: American Educational Research Association.
Thompson, S. K. (2012). Sampling (3rd ed.). Hoboken, NJ: Wiley.
Thompson, W. L. (2001). 402 citations questioning the indiscriminate use of null
hypothesis significance tests in observational studies. Retrieved from http://
warnercnr.colostate.edu/~anderson/thompson1.html
Toffler, A. (1970). Future shock. New York, NY: Random House.
Tryon, W. W. (2001). Evaluating statistical difference, equivalence, and indetermi-
nacy using inferential confidence intervals: An integrated alternative method of
conducting null hypothesis statistical tests. Psychological Methods, 6, 371–386.
doi:10.1037/1082-989X.6.4.371
Tryon, W. W., & Lewis, C. (2008). An inferential confidence interval method of
establishing statistical equivalence that corrects Tryon’s (2001) reduction fac-
tor. Psychological Methods, 13, 272–277. doi:10.1037/a0013158
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance
procedures for location based on a single sample: Trimming/Winsorization 1.
Sankhya, Series A, 25, 331–352. Retrieved from http://sankhya.isical.ac.in/
index.html
Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychologi-
cal Bulletin, 76, 105–110. doi:10.1037/h0031322
U.S. Census Bureau. (2010). Door-to-door visits begin for 2010 census [Press release].
Retrieved from http://2010.census.gov/news/releases/operations/door-to-door-
visits-begin.html
Vacha-Haase, T., & Ness, C. N. (1999). Statistical significance testing as it relates to
practice: Use within Professional Psychology: Research and Practice. Professional
Psychology: Research and Practice, 30, 104–105. doi:10.1037/0735-7028.30.1.104
Vacha-Haase, T., & Thompson, B. (2011). Score reliability: A retrospective look
back at 12 years of reliability generalization. Measurement and Evaluation in
Counseling and Development, 44, 159–168. doi:10.1177/0748175611409845
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods
in psychology journals: Guidelines and explanations. American Psychologist, 54,
594–604. doi:10.1037/0003-066X.54.8.594
Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles in experimen-
tal design (3rd ed.). Boston, MA: McGraw-Hill.
Wood, M. (2005). Bootstrapped confidence intervals as an approach to statistical infer-
ence. Organizational Research Methods, 8, 454–470. doi:10.1177/1094428105280059
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances.
Biometrika, 61, 165–170. doi:10.1093/biomet/61.1.165
Zientek, L. R., & Thompson, B. (2009). Matrix summaries improve research reports:
Secondary analyses using published literature. Educational Researcher, 38, 343–352.
doi:10.3102/0013189X09339056
Ziliak, S. T., & McCloskey, D. N. (2008). The cult of statistical significance: How the
standard error costs us jobs, justice, and lives. Ann Arbor: University of Michigan
Press.
Index
Bem, D. J., 290
Berger, J. O., 181
Berkson, J., 20
Bester, Alfred, 312
Betkowska, K., 181
Between-studies variance, in meta-analysis, 277–278
BF (Bayes factor), 296–297
Bias, for statistical significance, in meta-analyses, 274–275
“Big Five” misinterpretations, 95–103
Binary logistic regression, 164
Bird, K. D., 190, 201
Blanton, H., 157
Block, R. A., 251
Board of Scientific Affairs (APA), 21
Bonferroni correction, 73
Bonferroni–Dunn method, 196
Bootstrapped confidence intervals, and standardized contrasts, 202–203
Bootstrapping, 54, 63
  nonparametric, 54–56
  parametric, 56–57
Borenstein, M., 284, 285
Boring, E. G., 20
Bowers, J. S., 308
Box correction, 87
Box plots (box-and-whisker plots), 149
Box-score (vote counting) method, 271
BP (finite-sample breakdown point), 58
BR. See Base rate
Bradbury, R. B., 108
Brown, J. S., 184, 185
Brown, T. G., 255
Browne, M. W., 218, 219
Bruce, C. R., 125
Brunner–Dette–Munk test, 91
Burgman, M. A., 19
Burnham, K. P., 20
Buttrose, R., 19

Campbell, D. T., 38
Canadian Task Force on Preventive Health Care (CTFPHC), 180
Capture percentage, 41
Casella, G., 209
Casscells, W., 176–177
Categorical outcomes, 163–186
  and effect sizes for 2 × 2 tables, 165–172
  and effect sizes for 3 × 4 tables, 172
  research example, 182–185
  screening tests of, 172–182
  types of, 164–165
Causal efficacy, 124
Causality fallacy, 100
Cause size, 124
Central chi-square distribution, 36
Central limit theorem, 33
Central t distribution, 36, 52, 53
Chalmers, T., 12
Change, mean, 48
Chapman, J. P., 212
Chi-square test, 88–89
Circularity, 86, 87
Cliff effect, 102
Clinical significance, 10, 157–158
Cluster sampling
  single-stage, 30
  two-stage, 30
Cognitive distortions, 95–119
Cognitive errors, 10–11
Cohen, J., 103, 130, 154, 271
Cohen’s d, 130
Collins, L. M., 218
Colliver, J. A., 215
Common language effect size (CL), 152
Comparative risk, in categorical outcomes, 166–168
Completely between-subjects designs, 222
  interaction contrasts in, 248–249
  single-factor contrasts in, 244–248
Complex comparison, 191
Complex interaction contrast, 232–233
Conditional model, in meta-analysis, 277
Confidence Interval Calculator, 181
Confidence intervals, 40, 41
  Bayesian, 303–307
  for dψ, 199–201
  and effect sizes, 142–147
  for μ, 39–41
  for μ1 − μ2, 42–48
  for μD, 48–50
  noncentral, for d, 144–145
  noncentral, for η², 146
  in Publication Manual 6 ed., 21
  reporting of, 117
  Wald method and, 170
Confidence interval transformation, 50–51
Confidence-level misconception, 41
Conjugate distributions, 301
Conjugate prior, 301
Conjunction fallacy, 292
Construct replication (conceptual), 268–269
Continuous outcomes, 128–161
  case-level analysis of, 148–154
  correlation of effect sizes for, 138–140
  families of effect sizes for, 128–129
  interval estimation for, 142–147
  measurement error in, 140–142
  misinterpretations with, 158–159
  research example, 159–161
  and standardized mean differences, 129–138
  substantive significance of, 154–158
Contrast specification, in single-factor designs, 190–196
Contrast weights (coefficients), 190, 191
Control factor, use of, 83–84
Convenience samples, 32
Conversation analysis, 156
Cook, S., 258, 260
Cook, T. D., 38
Corballis, M. C., 205, 254
Correction for attenuation, 141
Correlated effect sizes, 275–276
Correlation(s)
  autocorrelation of the errors, 86
  of effect sizes, 138–140
  illusory, 104
  and measures of association, in single-factor designs, 203–211
  Pearson, 168
  in single-factor designs, 203–211
Cortina, J. M., 136, 215, 243, 247, 248, 250
Covariate, 211
Covariate analyses, effect sizes in, 211–215
Cramer’s V, 172
Crandall, R., 270
Crawford, J. R., 181, 182
Credible intervals (Bayesian analysis), 303–307
Criterion contrasts, 157
Critical ratio, 17–18
Cross-validation sample, 268
Crud factor, 70
Cumming, G., 14, 19, 22, 39–41, 75, 100, 250, 280, 286
Customer-centric science, 110–111

Daumann, J., 216
Davis, C. J., 308
Decision theory, 109
Deckers, J. W., 179
Deering, K. N., 63
Degeneracy, 170
Degrees of freedom (df), 48, 52, 53
Delaney, H. D., 152
d, noncentral confidence intervals for, 144–145
Dependent samples
  F test for, 84–88
  p values, 85
  and standardized contrasts, 199
Derivation sample, 268
DeShon, R. P., 243, 248
Desired relative seriousness (DRS), 71–72
df (degrees of freedom), 48, 52, 53
d family (group difference indexes), 128
Dichotomization, of p values, 109, 110
Dienes, Z., 299, 301, 302, 306
Diffusion of idiocy, law of, 16
Dismantling research, in treatment efficacy, 268
Disordinal (crossover) interaction, 229–231
Distributional assumptions, 57
Dixon, P., 17, 293
Dodd, D. H., 205
DRS (desired relative seriousness), 71–72
Dunleavy, E. M., 23–24
Du Toit, S. H. C., 218, 219

Earwitness testimony, 258–260
Ecstasy (MDMA) use, 215–217
Edgeworth, F. Y., 18
Editorial policies, 22
Educational and Psychological Measurement (Hubbard), 18
Edwards, W., 290, 297
Effect size(s), 111, 123–129, 142–159
  for 2 × 2 tables, 165–172
  for 3 × 4 tables, 172
  case-level analysis of, 148–154
  cause size vs., 124
  common language, 152
  correlated, 275–276
  correlation, for contrasts, 203–205
  correlation of, 138–140
  in covariate analyses, 211–215
  definitions of, 124–125
  editorial policies about, 23–24
  estimates of, 77, 110, 116–117, 125–127
  families of, 128–129
  group- or variable-level, and case-level proportions, 153–154
  interpretive guidelines for, 154–158
  interval estimation with, 142–147
  levels of analysis, 127–128
  margin-bound, 169
  in meta-analyses, 271
  metric-free, 128
  misinterpretations of, 158–159
  population, 38
  for power analysis, 209–210
  proportion of variance explained, 128
  in Publication Manual, 14, 21
  and relative risk for undesirable outcomes, 164–166
  reporting of, 10
  sensitivity analysis and, 284
  signed, 128
  standardized, 126
  standardized criterion contrast, 157
  unsigned, 128
  unstandardized, 125
  weighted, 276–277, 280, 284
Effect size measures, 124–127
Effect size synthesis (meta-analysis), 276–284
Effect size value, 124
Efficacy, causal, 124
Efron, B., 54
Ellis, P. D., 11, 129, 189
Empirical cumulativeness, 266–267
Empirical sampling distribution, 54, 55
Empirical studies, best practices for reporting results from, 308–312
Epidemiology, 22
EpiTools, 172
Equivalence fallacy, 101
Equivalence testing, 111–112
Erceg-Hurn, D. M., 64, 91, 107
Error(s)
  Bayesian Id’s wishful thinking error, 98
  construct definition, 37
  inverse probability error, 19, 75
  margin of, 39
  measurement, 37
  real, 38
  sampling, 38
  specification, 37
  standard, 34–37, 57
  standard, of Fisher’s transformation, 51–52
  standard, of M, 35
  standard, of MD, 49–50
  standard metric, of t, 79, 80
  treatment implementation, 37
  Type I, 11, 68, 101, 112, 308
  Type II, 11, 68, 71, 76, 101, 308
Error bars, 39
ES Bootstrap: Correlated Groups, 147
ES Bootstrap: Independent Groups, 147
ES Bootstrap 2 (software), 147
ESCI (Exploratory Software for Confidence Intervals), 53, 77, 145, 281
Estimated epsilon-squared, 129
Estimated eta-squared, 129
Estimated omega-squared, 129
Estimation, Bayesian analysis and, 290–292
Estimation thinking, 15
Estimators
  least squares, 33
  negatively biased, 34
  positively biased, 34
  resistant, 57–64
η², noncentral confidence intervals for, 146
Ethnographic techniques, 156–157
Exact level of significance, 75
Exact replication (direct, literal, or precise), 268
Experimentwise error rate, 72, 73
Exploratory Software for Confidence Intervals (ESCI), 53, 77
External replication, 268
Extrinsic factors, intrinsic factors vs., 244

f² parameter, 209–210
Face overshadowing effect (FOE), 258–260
Factorial analysis of variance, 223–226
Factorial designs, standardized contrasts in, 244
Factor of interest (targeted factor), 244
Fad topics, 12
Fail-safe N, in meta-analysis, 273
Failure fallacy, 101
Fallacy(-ies)
  in significance testing, 103–106
  of the transposed conditional, 98
“False-positive psychology,” 11
Familywise error rate, 72, 73
Ferguson, C. J., 129
Fern, E. F., 155
Feynman, R., 73
Feynman’s conjecture, 73
Fidell, L. S., 236
Fidler, F., 19, 22, 39, 41, 146, 210
Figuerdo, A. J., 224
File-drawer problem, in meta-analysis, 273
Filter myth, 97
Fimm, B., 216
Finch, S., 19, 22, 23
Finch, W. H., 210, 254
Finite-sample breakdown point (BP), 58
Fisher, R., 17, 51
Fisher approach, 102
Fisher model, 17
Fisher’s transformation, 51–52
Fixed effects factors, 83
Fixed effects model, 44, 279–284
Focused comparisons, between two means, 81
FOE (face overshadowing effect), 258–260
Follow-up studies, replication and, 270–271
Forest plot, 44
Fouladi, R. T., 38, 144
Fourfold table, 164
Fractional (partial, incomplete) factorial designs, 222
Freckleton, R. P., 108
Freiman, J. A., 12
French, B. F., 210, 254
Frequentist perspective, 40–41
Friedman, G., 9
F test(s)
  for dependent samples, 84–88
  for independent samples, 81–84
  for significance testing, 81–88

Gain, mean, 48
Garbage in, garbage out problem, in meta-analyses, 275
Garthwaite, P. H., 181
Geisser–Greenhouse conservative test, 87
Geisser–Greenhouse epsilon, 87
Generalized estimated eta-squared, 251–252
Gigerenzer, G., 17–19, 75, 95, 101
Glass, G. V., 123, 243, 244, 271
Glass’s delta, 133
Glenn, D. M., 23–24
Gollob, H. F., 41
Gonzalez, R., 98
Gosset, W., 17, 38
Gouzoulis-Mayfrank, E., 216, 217
Graboys, T., 176–177
Great p value blank-out, 107
Greenwald, A. G., 98
Grissom, R. J., 48, 129, 169, 254
Grobbee, D. E., 179
Group overlap
  indexes, 127–128
  measures of, 148–150
Guthery, F. S., 114
Guthrie, D., 98

Habbema, J. D. F., 179
Haller, H., 96, 99
Hansen, N. B., 158
HARKing, 73, 310
Harlow, L. L., 20
Harris, R. J., 98
Health Psychology, 23
Hedges, L. V., 267
Hedges’s g, 134
Herbert, R., 172, 181, 182
Heteroscedasticity, 90, 137
Hierarchical design, 222
High-inference characteristics, in meta-analysis, 272
Hoekstra, R., 19
Hoffer, E., 29
Homogeneity of regression, 212
Homoscedasticity, 42, 47, 48, 57
Horn, J. L., 218
Hubbard, R., 18
Huberty, C. J., 127, 152
Hunt, K., 270
Hunter, J. E., 15, 141, 144, 168, 275
Hurlbert, S. H., 105, 109, 110
Hux, J. E., 73
Huynh–Feldt epsilon, 87
Hypothesis(-es)
  alternatives to, in significance testing, 70
  Bayesian methods, 291
  nil, 69, 70
  nondirectional, 71
  non-nil, 69
  null, 69–70, 91
  one-tailed, 71
  point, 69
  range, 71
  testing of, 73
  two-tailed, 71

Illegitimate uses, of significance testing, 107–108
Illusory correlation, 104
Improvement over chance classification (I), 152
Independent samples
  F test for, 81–84
  and standardized contrasts, 197–198
Indexes
  group difference (d family), 128
  group overlap, 127–128
  relationship (r family), 128
Inertia, 24
Inference revolution, 18
Inferential confidence intervals, 112–113
Inferential measures of association, 252–254
Informative priors, 294
Institut universitaire de médecine sociale et préventive (IUMSP), 64
Interaction contrasts
  in completely between-subjects design, 248–249
  in factorial analysis of variance, 231–233
Interaction effect, 228, 238
Interaction trends, 231–233
Internal replication, 268
Interquartile range, 58–59
Interval estimation, 15, 38–64, 110, 181–182
  approximate methods for, 50–52
  with bootstrapping, 54–57
  in categorical outcomes, 170–172
  in correlations and measures of association, 210–211
  with effect sizes, 142–147
  misinterpretations in, 43
  for μ, 39–41
  for μ1 − μ2, 42–48
  for μD, 48–50
  noncentrality, 64
  noncentrality interval estimation, 52–54
  robust estimators for, 57–64
Intrinsic factors, extrinsic factors vs., 244
Intro Stats Method, 17–19, 69, 102, 109–113
Inverse chi-square distribution, 301
Inverse gamma distribution, 301
Inverse probability error, 19, 75
Inverse probability fallacy, 98
IUMSP (Institut universitaire de médecine sociale et préventive), 64
Iverson, G. J., 99

Jaccard, J., 157
Jackknife technique, 268
Jacklin, C. N., 151
Jackson, G. B., 15
Jakab, E., 308
Johnson, A., 19
Journal of Applied Psychology, 23
Journal of Educational Psychology, 23
Journal of Experimental Education, 20
Journal of Experimental Psychology: Applied, 23
Journal of Management, 290–291
Juurlink, D. N., 73
Kahneman, D., 41
Kanfer, R., 217
Kelley, K., 124, 145, 146
Keppel, G., 87, 227, 237
Keselman, H. J., 59, 63, 84, 87, 91, 136, 145, 147, 197, 201–203, 250
Khurshid, A., 254
Kiers, H., 19
Killeen, P. R., 99
Kim, J. J., 48, 129, 169, 254
King, G., 169
Kirk, R. E., 22, 129, 205, 222, 223, 236
Kline, R. B., 96, 228
Kowalchuk, R. K., 87
Krauss, S., 96, 99
Kruschke, John K., 289, 299, 303
Kuebler, R. R., 12
Kuhn, Thomas S., 266
Kunert, H.-J., 216

Lambdin, C., 106
Lambert, M. J., 158
Large numbers, law of, 33
Latin square design, 222
Law of diffusion of idiocy, 16
Law of large numbers, 33
Law of small numbers, 41
Learning curve data, analysis of, 217–219
Least squares estimators, 33
Lecoutre, B., 102
Lee, M. D., 99
Leeman, J., 22
Left-tail ratio (LTR), 150, 152
Lewis, C., 113
Likelihood, 293
Likelihood ratio, in screening tests, 178–179
Likert scale, 164
Lindman, H., 290
Locally available samples, 32
Local Type I error fallacy, 97
Loftus, G. R., 23
Logit d, 167
Loh, C., 128
Lombardi, C. M., 105, 109, 110
Longford, N. T., 11
Long-run relative frequency, 40–41
Lorch, R. F., Jr., 223
Loss function, 68
Lower confidence limit, 38
Low-inference characteristics, in meta-analysis, 272
Lowman, L. L., 152
LTR (left-tail ratio), 150, 152
Lunneborg, C., 13, 239
Lykken, D. T., 12, 106
Lytton, H., 285, 286

Maccoby, E. E., 151
MacDonald, George, 103
MAD (median absolute deviation), 59
Magnitude fallacy, 100
Maillardet, R., 41
Main comparisons, 227
Main effect, 227, 233, 238
Main effects model, 239–240
Mamdani, M. M., 73
MANOVA (multivariate analysis of variance), 87
Marginal probability, 293
Margin-bound effect, 169
Margin of error, 39
Markwell, S. J., 215
Matchar, D. B., 181
Mathematical models, evaluation of, 26
Mauchly’s test, 87
Maximum likelihood estimation, 209
Maximum probable difference, 113
MBESS. See Methods for the Behavioral, Educational, and Social Sciences
McBride, G. B., 112
McCloskey, D. N., 10, 20–22, 25, 67, 114, 128
McCulloch, C. E., 209
McGraw, B., 243
McGraw, K. O., 152
McKnight, K. M., 224
McKnight, P. E., 224
McWhaw, K., 212
Mean
  trimmed, 58, 60, 61
  Winsorized, 59–60
Mean change, 48
Mean difference, 48, 190, 244. See also Standardized mean difference(s)
Mean gain, 48
Meaningfulness fallacy, 100
Means, 33, 35
  F tests for, 81–88
  t tests for, 78–81
Means analysis
  unweighted, 82–83
  weighted, 82
Measurement crisis, 12
Measurement error, correcting for, in continuous outcomes, 140–142
Measures of association, 128
  descriptive, 250–252
  inferential, 205–209, 252–254
  in multifactor design, 250–254
  in single-factor designs, 203–211
Median absolute deviation (MAD), 59
Mediational meta-analysis, 273
Mediator effect, 228
Meehl, P. E., 70, 180
Memory & Cognition, 23
Meta-analysis, 271–286
  effect size synthesis in, 276–284
  estimation thinking and, 15–16
  limitations to, 285
  predictors in, 272–273
  statistical techniques in, 284
  and statistics reform, 285–286
  steps in, 273–276
  validity of, 284–285
Meta-regression, and variability of results, 272–273
Method 1 regression-based technique, 242, 243
Method 2 regression-based technique, 242, 243
Methods for the Behavioral, Educational, and Social Sciences (MBESS), 145, 202, 211
Metric-free effect sizes, 128
Metrics, arbitrary, 16
Microsoft Excel, 48
Miller, D. T., 155
Miller, G. A., 212
Miller, J., 99
Miller, K. R., 23–24
Mirosevich, V. M., 64, 91, 107
Mixed within-subjects factorial design (split-plot design), 222
Model-driven meta-analysis, 273
Model testing, in multifactor design, 239–240
Moderator effects, 228
Moderator variables, 228, 272
Modulus, 18
Monotonic transformation, 57
Monroe, K. B., 155
Moons, K. G. M., 179
Morey, R. D., 78
Morris, S. B., 243, 248
Mossman, D., 181
Mulaik, S. A., 20
Multifactor designs, 221–260
  analysis strategy in, 237–240
  effects in balanced two-way designs, 226–233
  extensions to multivariate analyses, 254
  factorial analysis of variance in, 223–226
  measures of association, 250–254
  nonorthogonal designs, 240–243
  research examples, 255–260
  standardized contrasts in, 243–250
  tests in balanced two-way designs, 233–237
  types of, 221–222
Multiple regression, ANOVA and, 88–89
Multivariate analyses, extensions to, in multifactor design, 254
Multivariate analysis of variance (MANOVA), 87
Murray, D., 18
Myers, J. L., 223

Narrative analysis, 156
National Council on Measurement in Education, 269
NDC (Noncentral Distribution Calculator), 145
Negative consequences, of significance testing, 106–107
Negative likelihood ratio (NLR), 178, 182
Negatively biased estimator, 34
Negative predictive value (NPV), 175–177
Nelson, L. D., 11, 102
Neo-Fisherian significance assessments, 110
Neuliep, J. W., 270
New statistics, 14–16
Neyman, J., 17
Neyman–Pearson model, 17, 68–69, 102, 110
Nil hypothesis, 69, 78, 79, 100–101, 109
NLR (negative likelihood ratio), 178, 182
Nonadditive model, 84–85
Noncentral confidence intervals for dψ, 201–202
Noncentral Distribution Calculator (NDC), 145
Noncentrality parameter, 52
Noncentral t, 52, 53
Noncentral test distributions, 52
Nondirectional hypothesis, 71
Non-nil hypothesis, 69, 78, 79
Nonorthogonal contrasts, 191, 192
Nonorthogonal designs, 224, 240–243
Nonparametric bootstrapping, 87
Nonparametric percentile bootstrapped confidence levels, 54, 63
Nonparametric testing, 90
Normal science, 266
Nouri, H., 136, 215, 243, 247, 248, 250
NPV (negative predictive value), 175–177
Null hypothesis, 69–70, 91

Oakes, M., 96, 99
Objectivity fallacy, 102
Odds
  posterior, 296–298
  prior, 294
  in screening tests, 178–179
Odds-against-chance fallacy, 96–97
Odds ratio, 167, 169
Off-factors (peripheral factors), 244–245
Ojeda, M. M., 254
Olejnik, S., 129, 133, 244, 249, 251, 254
Omnibus comparisons, 81
Omnibus effects, in correlation and measures of association, 205
One-tailed hypothesis, 71
OpenBUGS, 308
Operational replication, 268
Ordered categories (multilevel ordinal categories), 164
Ordinal categories, 164
Ordinal interaction, 229–231
O’Reilly, T., 17, 293
Orthogonal contrasts, 191, 192
Orthogonal designs, 223
Orthogonal polynomials, 193
Orthogonal sums of squares method, 245–246
Outliers, 59, 62, 84
Overall, J. E., 242
Overlap rule for two independent means, 45–46

Pairwise comparison, 190
Pairwise interaction contrast, 232
Paleo-Fisherian approach, 110
Pan, W., 24
Paradigm, 266
Parametric bootstrapping, 56–57
Park, R. L., 101
Partial replication (improvisational), 268
Pearson, E. S., 17
Pearson, K., 17
Pearson correlation, 168
Pelz, S., 216
Penfield, R. D., 136, 147
Perlis, Alan J., 221
Person × treatment interaction, 84, 85, 207
Philosophical Transactions of the Royal Society, 292
Pierce, C. A., 251
Planned comparisons, 195
PLR (positive likelihood ratio), 178, 182
Point estimates, 15
Point hypothesis, 69, 294–298
Poitevineau, J., 102
Pollard, P., 97
Polynomials, 193
Population inference model, 31
Positive bias, 87, 134
Positive likelihood ratio (PLR), 178, 182
Positively biased estimators, 34
Positive predictive value (PPV), 175–177
Posterior odds, 296–298
Posterior probability, 293
Post hoc, observed (power analysis), 77
Power analysis, 53–54, 68, 145
  and effect size, 127
  effect sizes for, 209–210
  retrospective, 77
  in significance testing, 76–77
Power curves, 76
PPV (positive predictive value), 175–177
Preacher, K. J., 124
Prediction intervals for p, 75
Predictive value, in screening tests, 175–177
Predictors, in meta-analysis, 272–273
Prentice, D. A., 155
Presumed interactions, 238
Principle of indifference, 41
Prior odds, 294
Prior probability, 72, 293
Probabilistic revolution, 18
Probability
  Bayesian methods and, 291
  long-run relative frequency and, 40–41
  posterior, 293
  prior, 293
  subjective degree-of-belief and, 40–41
Probability of (stochastic) superiority, 152
Probability samples, 30
Professional Psychology: Research and Practice, 23
Propensity score analysis (PSA), 212
Proportion of variance explained effect size, 128
Prospective power analysis, 76
Prostate-specific antigen (PSA) screening, 180
Pseudo-orthogonal design, 224
PSY, 201, 250
Psychological Bulletin, 141
Psychological Science, 20
Psychonomic Bulletin & Review, 19
Publication bias, 11
Publication Manual, 5 ed. (APA), 21, 22
Publication Manual, 6 ed. (APA), 11, 14, 15, 21, 38
Purposive sample, 32
Puzzle solving, 266
p value(s), 11, 97–103, 105–108, 110
  for dependent sample analysis, 85
  dichotomization of, 109, 110
  Fisher model and, 17
  incorrect, 13–14
  misinterpretation of, 19
  in significance testing, 74–76

Quality fallacy, 101
Quasi-F ratios, 237

Raaijmakers, J. G. W., 308
Random effects factors, 83
Random effects model, 99, 279–284
Randomization model, 31–32
Randomized blocks design, 222
Randomized groups factorial design, 222
Random sampling, 30–31
Range hypotheses, 71, 298–303
Range of practical equivalence, 112
RD (risk difference), 166, 168
Real error, 38
Realization variance, 99
Receiver operating characteristic (ROC) model, 164–165, 173
Reduced cross-classification method, 246
Reichardt, C. S., 41
Reification fallacy, 102
Reject-support testing, 70
Reliability induction, 13
Replicability fallacy, 98–99
Replication, 265–271
  in behavioral sciences, 269–271
  cultural bias vs., 270
  defined, 265
  and follow-up studies, 270–271
  requirement of, 117
  as standard procedure, 26
  and theoretical/empirical cumulativeness, 266–267
  types of, 267–269
Reporting crisis, 13
Reporting results, from empirical studies, 308–312
Resampling, 54
Resampling Stats, 56
Research
  communication in, 111
  enthusiasm for, 107
Researcher degrees of freedom, 106
Research in the Schools, 20
Resistant estimators, 57–64
Retrospective power analysis, 77
r family (relationship indexes), 128
Rief, W., 271
Right-tail ratio (RTR), 150–152
Riordan, C. M., 244
Risk difference (RD), 166, 168
Risk rates, in categorical outcomes, 165–166
Risk ratio, 166, 168–169
Robinson, D. H., 109
Robust estimation, 57–64
Robust interval estimation, 60–64
Robust method for outlier detection, 59
Robustness fallacy, 103
Robust statistical tests, significance testing and, 90–92
ROC model, 164–165, 173
Rodgers, J. L., 107
Romney, D. M., 285, 286
Rosen, A., 180
Rosenthal, R., 79, 129, 205
Rosnow, R. L., 79
Rothman, K. J., 22, 23
Rouder, J. N., 78, 299, 303
Rozeboom, W. W., 20
R script, 254
RTR (right-tail ratio), 150–152
Rubin, D. B., 79
Rutherford, A., 215, 223
Rutledge, T., 128
Ryan, P. A., 18

Sagan, C., 64
Sahai, A., 254
Samples
  accidental, 32
  ad hoc, 32
  convenience, 32
  cross-validation, 268
  derivation, 268
  locally available, 32
  probability, 30
  purposive, 32
  systematic, 32
Sampling, 29–38
  errors in, 32–38
  random, 30–31
  stratified, 30
  types of, 30–32
Sampling distribution, 34, 35
Sampling error, 32–34
Samsa, G. P., 181
Sanctification fallacy, 102–103
SAS/IML, 147, 202, 250
Sass, H., 216
SAS/STAT, 254
Savage, L. J., 290
Scaling, mean difference, 190
Schmidt, F. L., 15, 141, 144, 168, 275
Schmidt, S., 270
Schoenberger, A., 176–177
Schultz, R. F., 205
Schuster, C., 205–206
Screening tests, 172–185
  base rate in, 175–177
  defined, 173
  estimating base rates in, 180
  interval estimation in, 181–182
  likelihood ratio in, 178–179
  negative predictive value in, 175–177
  and odds, 178–179
  positive predictive value in, 175–177
  predictive value in, 175–177
  sensitivity in, 174
  specificity in, 174–175
  for urinary incontinence, 184–185
Searle, S. R., 209
Seggar, L. B., 158
Sensitivity, in screening tests, 174
Sensitivity, specificity, and predictive value model, 164
Sensitivity analysis, effect size and, 284
Seraganian, P., 255
Shadish, W. R., 38, 212
Sidani, S., 224
Signal detection theory, 164
Signed effect sizes, 128
Significance game, 24
Significance testing, 67–92, 95–119
  alternative hypotheses in, 70
  “Big Five” misinterpretations in, 95–103
  chi-square test, 88–89
  cognitive distortions in, 95–119
  cognitive errors in, 10–11
  costs of, 11–14
  defenses of, 108–109
  and effect size, 127
Significance testing, continued
  F tests for, 81–88
  illegitimate uses of, 107–108
  limitations of, 290
  negative consequences of, 106–107
  Neyman–Pearson approaches vs., 68–69
  null hypotheses in, 69–70
  overreliance on, 10
  power analysis in, 76–77
  p values, 74–76
  reasons for fallacies in, 103–106
  recommendations for use of, 113–118
  and robust statistical tests, 90–92
  role of, 25–26
  t tests for, 78–81
  Type I error, 71–74
  variations on, 109–113
Simel, D. L., 181
Simmons, J. P., 11, 106
Simonsohn, U., 11
Simple comparisons, 228
Simple effects (simple main effects), 228
Simple interactions, 236
SimStat, 54, 55
Simultaneous (joint) confidence intervals, 196
Single-factor contrasts, in completely between-subjects designs, 244–248
Single-factor designs, 90, 189–220
  contrast specification in, 190–196
  correlations and measures of association in, 203–211
  effect sizes in covariate analyses, 211–215
  research examples, 215–219
  standardized contrasts in, 196–203
Single-stage cluster sampling, 30
Sizeless science, 20
Slippery slope of nonsignificance, 100–101
Slippery slope of significance, 100
Small numbers, law of, 41
Smith, H., 12, 271
Smith, M. L., 243
Smithson, M., 145, 146, 210, 254
Specificity, in screening tests, 174–175
Specific probability inference, 41
Speckman, P. L., 78
Spence, G., 24
Sphericity, 86, 87, 107
Spiegel, D. K., 242
SPSS, 145, 194, 212, 215, 254, 255
SRP (structured relapse prevention), 255–258
SS (sum of squares), 33, 34
Standard deviation bars, 39
Standard error, 34
  estimation of, 57
  of Fisher’s transformation, 51–52
  in risk effect sizes, 170–171
Standard error bars, 39
Standardized contrasts
  and bootstrapped confidence intervals, 202–203
  and confidence intervals for dψ, 199–201
  defined, 196
  dependent samples and, 199
  independent samples and, 197–198
  in multifactor design, 243–250
  and noncentral confidence intervals for dψ, 201–202
  in single-factor designs, 196–203
Standardized criterion contrast effect sizes, 157
Standardized effect sizes, 126, 275
Standardized mean changes (standardized mean gains), 134, 250
Standardized mean difference(s), 128, 129–138
  and correction for positive bias, 134
  ddiff for dependent samples, 134–136
  dpool, 131–133
  dtotal, 133
  dwith
  general form for, 130
  limitations of, 137–138
  robust, 136–137
Standardizers, 130
Standard set, 190
Standards for Educational and Psychological Testing, 269
STATISTICA 11 Advanced, 53–54
STATISTICA Advanced, 145
Statistical analysis, 11–12
Statistical equivalence, testing for, over two or more populations, 113
Statistical hypotheses, substantive vs., 100
Statistical inference, Fisher vs. Neyman–Pearson approaches to, 68–69
Statistical models, 26
Statistical significance, 117
Statistical software, 118
Statistical tests (statistical testing)
  history of, 16–24
  justified use of, 116
Statistician’s two-step, 31
Statistics education, 118
Statistics reform, 9–26
  and cognitive errors in significance testing, 10–11
  and costs of significance testing, 11–14
  future directions, 25–26
  and history of statistical testing, 16–24
  meta-analysis and, 285–286
  “new” statistics in, 14–16
  obstacles to, 24–25
Stayer, R., 24
Steering Committee of the Physicians’ Health Study Research Group, 128
Steiger, J. H., 20, 38, 53, 144, 145, 202, 210, 211
Stephens, P. A., 108
Stepwise method, 108
Stevens, J. J., 231
Stopping rule, 74
Stratified sampling, 30
Strong inference, 100
Structured relapse prevention (SRP), 255–258
Student’s t distribution, 36
Subjective degree-of-belief, 40–41
Subjectivist perspective, 40–41
Subjects effect, 49
Substantive effects, 154–157
Substantive hypotheses, statistical vs., 100
Substantive significance, 16
Success fallacy, 101
Sum of squares (SS), 33, 34
Sun, D., 78
Sun, S., 24
Systematic samples, 32

Tabachnick, B. G., 236
Tail ratios, 150–152
Task Force on Statistical Inference (TFSI), 21, 22, 103
Testimation, 19–20
Testing to a foregone conclusion, 73
Test statistic (TS), 76
Test statistics (contrast specification), 194–195
TFSI (Task Force on Statistical Inference), 21, 22, 103
Theoretical cumulativeness, 266–267
Thinkmap Visual Thesaurus, 105
Thomas, K. M., 244
Thomason, N., 19, 22
Thompson, B., 13, 21, 126, 146, 155, 210, 254, 268, 311
Thompson, W. L., 20
3 Incontinence Questions (3IQ) scale, 184–185
Three-valued logic, 110
Toffler, Alvin, 163
Tolstoy, Leo, 104
Total variance, estimation of, 207
Trained incapacity, 10–11
Tremblay, J., 255
Trends, 193
Trends in Cognitive Science, 290
Trimmed mean, 58, 60, 61, 91
Tryon, W. W., 112, 113
TS (test statistic), 76
TSF (12-step facilitation), 255–258
T-shirt effect, 126–127, 311
t test(s), 78, 79
  Bayesian version of, 301–302
  for significance testing, 78–81
  Welch (Welch–James), 80–81, 90–92
  Yuen–Welch, 90–92
Tuchtenhagen, F., 216
Tukey, J. W., 149
Tukey–McLaughlin method, 60–61
Tversky, A., 41
12-step facilitation (TSF), 255–258
Two-stage cluster sampling, 30
Two-tailed hypothesis, 71
Type I error, 68, 101, 112
  in Bayesian estimation, 308
  controlling, 195–196
  defined, 11
  in significance testing, 71–74
Type II error, 11, 68, 71, 101, 308

Unbiased estimator, 33
Uncalibrated metrics, 16
Unconditional probability, 293
Uniform Requirements for Manuscripts Submitted to Biomedical Journals, 22
Uninformative priors, 294
Unit-free effect sizes, 128
Unordered categories, 164
Unplanned comparisons, 195
Unsigned effect sizes, 128
Unstandardized effect sizes, 125
Unweighted means analysis, 82–83
Upper confidence limit, 38
Urinary incontinence, 184–185
U.S. Air Force, 217

Vacha-Haase, T., 13, 23
Validity, of meta-analysis, 284–285
Validity fallacy, 98
van Es, G.-A., 179
Vargha, A., 152
Variance, realization, 99
Variance components, 205
Vaughn, G. M., 205
Vaux, D. L., 39
Velasco, F., 254
Vested interest, 24–25
Viechtbauer, W., 142
von Eye, A., 205–206

Wagenmakers, E.-J., 308
Wainer, H., 109
Wald method, 170, 182
Wang, L. L., 24
Wayne, J. H., 244
Weighted means analysis, 82
Welch procedure, 47–48
Welch t test (Welch–James t test), 80–81, 90–92
Well, A. D., 223
Wellek, S., 112
Wetzels, R., 297, 308
Whittingham, M. J., 108
Wickens, T. D., 87, 227, 237
Wilcox, R. R., 59, 63, 84, 91, 147, 203, 250
Wilding, J., 258, 260
Wilkinson, L., 103
Williams, J., 41
WinBUGS, 307–308
Winer, B. J., 205, 210, 223, 236, 237
Winsorized mean, 59–60
Winsorized variance, 59, 61–62
Within-studies variance, 276–277
Within-subjects factors, 249–250
Women, social conditions of, 267
Women, urinary incontinence in, 184–185
Wong, S. P., 152
Wood, M., 56
WRS, 147
WRS package for R, 250

Yuen–Welch procedure, 61–63
Yuen–Welch t test, 90–92

Zeng, L., 169
Zero fallacy, 100–101
Zientek, L. R., 311
Ziliak, S. T., 10, 20–22, 25, 67, 114, 128
About the Author