Basic Statistics: An Introduction With R
TENKO RAYKOV
Michigan State University
and
GEORGE A. MARCOULIDES
University of California at Riverside
All rights reserved. No part of this book may be reproduced in any form or by any electronic or
mechanical means, including information storage and retrieval systems, without written permission
from the publisher, except by a reviewer who may quote passages in a review.
∞ The paper used in this publication meets the minimum requirements of American National
Standard for Information Sciences—Permanence of Paper for Printed Library Materials, ANSI/
NISO Z39.48-1992.
Preface
4 Probability
4.1. The importance of probability
4.2. Definition of probability
4.2.1. Classical definition
4.2.2. Relative frequency definition
4.2.3. Subjective definition
4.3. Evaluation of event probability
4.4. Basic relations between events and their probabilities
4.5. Conditional probability and independence
11 Correlation
11.1. Relationship between a pair of random variables
11.2. Graphical trend of variable association
11.3. The covariance coefficient
11.4. The correlation coefficient
11.5. Linear transformation invariance of the correlation coefficient
11.6. Is there a discernible linear relationship pattern between two variables in a studied population?
11.7. Cautions when interpreting a correlation coefficient
Epilogue
References
Index
About the Authors
Preface
The past few decades have witnessed a tremendous increase in the amount of
data being collected in our modern society. For example, data about individ-
ual spending habits and patterns, school and college achievement, aptitude or
intelligence are collected frequently by various persons and organizations—
e.g., banks, teachers, schools, instructors, colleges, clinicians, administrators.
Accompanying this data buildup is also a great deal of interest in analyzing
these data to address specific questions of interest using statistics.
Statistics can be defined as a science that helps design empirical studies to
collect data, as well as to organize, classify, analyze, and interpret these data,
in order to make decisions in situations that are often characterized by uncer-
tainty. This uncertainty generally results from the at times very limited infor-
mation contained in the data obtained through these empirical studies.
Additionally, this information can be potentially highly variable across the
examined individuals, cases, subjects, or units of analysis considered. Based
upon these features, statistics can be regarded as a scientific discipline used in
almost all aspects of modern society to greatly facilitate the process of learning
from data.
In this book, we will be mainly interested in the application of basic statis-
tics to a variety of empirical studies in the behavioral, biological, educational,
medical, management, and social sciences. One can distinguish thereby
between the following four stages of using statistics to conduct research in
these sciences (e.g., Ott & Longnecker, 2010): (i) specifying the research questions and designing the study; (ii) collecting the data; (iii) summarizing and analyzing the data; and (iv) interpreting the results and communicating the conclusions reached.
Each of these stages is essential for the process of applying statistics to address
questions within the above-mentioned sciences. If any of these stages is
bypassed or carried out in an inappropriate way, the outcome of the subse-
quent stage(s) may well be significantly affected if not totally compromised.
Additionally, the ultimate results and interpretation of the study in question
may be seriously misleading. As it turns out, this four-stage process closely
resembles the scientific method that underlies scientific inquiry (e.g., Grazi-
ano & Raulin, 2009). The scientific method is essentially a set of principles
and procedures used to advance any empirical science. It consists of the fol-
lowing phases: (a) making an observation about a phenomenon of interest and formulating the research question; (b) identifying the variables to be studied and how to measure them; (c) deciding how to select the units of analysis (subjects) for the study; and (d) administering the measure(s) and scoring the resulting data.
We note that the phases (a) through (d) listed above closely resemble the
statistical application stages (i) through (iv).
As an example, let us consider the following study, which we will return to
frequently in this book. In this investigation, our main interest lies with exam-
ining the presence of possible gender differences in depression in older adults
in the United States. To begin with, the detailed description of the study’s
goals accomplishes much of the above stage (i) of using statistics in the proc-
ess of empirical research. We need to note also that since we cannot realisti-
cally examine each and every elderly person in the entire country, due to
resource- and time-related limitations, we will have to rely on information
obtained from just a subset of the population. This is a generally relevant issue
that we will often make reference to as we advance through this book. Then
in accordance with stage (ii), we collect data by using a well-established
instrument (e.g., a scale) for measuring depression on a randomly selected
group of say n = 500 elderly adults that represents well the population of
aged persons in the United States. Section 1.2 of this chapter deals with gen-
eral issues pertaining to this stage and with the selection of the group to be
studied, which is called a sample. In stage (iii), we obtain summary indexes
signifying for instance average depression score and degree of variability
within each of the two genders from the sample studied (as well as conduct
further activities, in general, which depend on the specific research questions
being asked). Chapter 2 discusses matters pertaining to this stage in a more
general context. Finally, in stage (iv), we may examine for instance the gender
differences on these summary indexes (and/or possibly carry further activities,
again depending on the research questions pursued). In addition, we may
draw conclusions about those differences and communicate our results in an appropriate form (e.g., a written research report).
To illustrate further this process, let us revisit again the earlier depression
example in which the objective is the examination of gender differences in
depression of older adults (defined as elderly persons aged 65 or older). This
observation represents step (a) mentioned above. In step (b), we identify the
score on a depression scale (measure) as the variable of interest. We then
choose a measuring device or instrument(s), such as the Beck Depression
Inventory (Beck et al., 1977), to collect subsequently these scores from the
subjects (the units of analysis) in the drawn sample. In step (c), we decide
how to select the subjects to be included into the actual study. Let us suppose
for now that we were able to obtain a random sample of 250 men and 250
women from the population of elderly in the United States. In step (d), we
administer the measure(s) (scale, test, instrument), and then score the results
we obtain on it from each person in the sample.
Frequently, step (b) of the discussed process requires a detailed consider-
ation of measures that can be of interest in the context of the study to be
conducted. Experts in the substantive (subject-matter) field of application
may be asked for their opinion on this issue during this step. Similarly, step
(c) may involve time- and effort-consuming activities to come up with sam-
ples that are representative of the studied population(s). Due to the complexi-
ties involved in this activity, a special branch of statistics referred to as
sampling theory has actually been developed to meet the demands of obtaining
a representative sample (as well as carrying out subsequently appropriate
related data analyses). We will touch upon some of these complexities and
activities in a later chapter of the book. Finally, in step (d), one needs to
ensure that specific and detailed instructions are precisely followed when
administering the measure(s) of concern, as well as that their objective and
accurate scoring is accomplished at the end of the data collection process. We
note that if the study is indeed experimental, then further activities are also
involved in steps (b) through (d). Although elaborating on these activities is
beyond the scope of this introductory book, they are discussed in greater
detail in more specialized experimental design sources and books (e.g., Kirk,
1989).
There are a number of reasons that the study of statistics is very beneficial to
advancing knowledge in the empirical sciences. For one, statistics can be used
to summarize and interpret correctly published numerical data (e.g., data
from surveys or various other forms of reports). Further, statistics can be used
to help develop a critical and informed way of interpreting data and evaluat-
ing possible conclusions drawn from examined data sets. For instance, the
media continually expose us to large bodies of information through news and
An Introduction to Descriptive Statistics: Data Description and Graphical Representation
Descriptive statistics comprise (a) graphical representations of data and (b) numerical data descriptions via summary indexes that can contain important
information about research questions of interest. In this chapter we discuss
methods that fall under category (a), and we will be concerned with methods
in category (b) in the next chapter. Because graphical displays differ depend-
ing on the number of variables involved, we cover first the case of just a single
variable. Situations involving several variables will be attended to in later sec-
tions of the book. As far as numerical indexes are concerned, we note that we
will familiarize ourselves with several of them in Chapter 3, where we will also
see in detail how they summarize important features of sets of scores in given
samples (or populations).
Example 2.1. Suppose we have obtained data on the number of faculty that
are employed in a college of education at a particular state university. Let us
also assume that the observations given in Table 2.1 represent the obtained
faculty data. As indicated earlier, we need to make sure that each single mea-
surement falls within only one possible category. In other words, we need to
ensure either that there are no faculty members with a joint appointment in
another department, or that any faculty member with such an appointment is
only counted in one of the departments he or she is affiliated with. If we ignore
such a situation (joint appointment), then the actual total number of faculty
counted will be incorrect.
* ASCII stands for ‘‘American Standard Code for Information Interchange’’ and is a
character-encoding scheme for representing text to computers and digital devices that use
text.
To read these data into R, we first type them into a plain text (ASCII) file that is similar to the one presented in the main body of Table 2.2, using also
for simplicity the abbreviation ‘‘dept’’ for ‘‘department’’ (cf. Table 2.1). We
note that the first column of Table 2.2 contains the abbreviation of the name
of the department (initials), and the second column is the number of faculty
in each department. We also notice that the first row of this file contains the
names of the two variables—‘department’ and ‘faculty’. We save this data file
using an easy-to-remember file name, such as CH2_EX21.dat.
Now in order to read this small data set into R, we need to use a particular
command. Before we give it, however, we note that for ease of presentation
we adopt the convention of using the Courier font to represent both the com-
mand submitted to the R software and the output sections produced by the
statistical analysis of the program itself. This same convention for commands
and output will be followed throughout the book. We also begin the represen-
tation of each used command with the conventional R prompt, which is the
sign ‘‘>’’ (we emphasize that when one is actually using a computer to type a
command, there is no need to precede any R command with the prompt, as
it is generated automatically). With this in mind, the command to read the
data from Example 2.1 into R (as provided in Table 2.2) is ‘read.table’, which
we use here as follows:
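A plausible form of this command (the assignment to the object ‘d’ follows from the discussion below; the exact file path is an assumption) is:
> d = read.table("C:/data/CH2_EX21.dat", header = T)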
We note that one needs to provide with this command the path leading to
the specific subdirectory where the ASCII/plain text file resides on the com-
puter used, placed in inverted commas. (In this case, the file is located in the
computer subdirectory ‘‘data’’ in drive C; see the Note at the end of this
chapter.) The subcommand ‘header = T’ instructs R to read the first line not
as numbers but as variable names (i.e., as ‘‘character strings’’—sequences of
characters or letters; see Note to this chapter).
When the R command ‘read.table’ is executed by the software, it creates an
object with the name ‘d’. The object named ‘d’ represents the data that have just been read into R from the file. To make the variables in ‘d’ directly accessible by their names, we next attach this object:
> attach(d)
Example 2.2 (MTA study): A mathematics test–taking anxiety study was carried
out with n = 36 elementary school students, using an established measuring
instrument referred to as the MTA scale. The resulting data are provided in
Table 2.3 and saved under the name CH2_EX22.dat. In Table 2.3, ‘id’ denotes a
subject identifier, and ‘y’ the score for each student obtained using the mathe-
matics test–taking anxiety measuring instrument (i.e., the student MTA score).
For ease of presentation, we adopt here the convention that when representing
subject data in this book, we will typically use the first column as the person
(case) identifier.
Next we use the two previously discussed examples (see Tables 2.2 and
2.3) to pictorially represent the data on both the qualitative and quantitative
variables involved. This is done in order to obtain a first idea of the relation-
ships between the values (scores or categories) that the subjects (the units of
analysis) take on them. Two very popular yet distinct methods for graphically
representing data on studied variables are discussed next. First, some methods
that can be used to graphically represent qualitative variables are presented.
These are followed by an illustration of methods for graphically representing
quantitative variables.
> pie(faculty)
This command yields the pie chart displayed in Figure 2.1—this pie chart is
shown by R in a separate graphics device window opened by the software after
one completes the command (i.e., after one submits the command to R).
FIGURE 2.1.
Pie chart of the number of faculty per department.
While Figure 2.1 readily presents a rough idea about the relations between
the number of faculty members in each department, the labeling of the dis-
played slices is not immediately informative. In particular, the specific num-
bers attached to the slices here by R are simply those of the consecutive rows
of Table 2.1. Such a graphical display does not provide a clear enough picture
of the data and would require one to refer back to the raw data in order to
interpret the relative sizes of the slices presented. To deal with this limitation
of the displayed figure, it would be best to add the names of the departments
(in the abbreviation notation used in Table 2.2). This is possible in R by
using the subcommand ‘labels’, and we attach a title to the figure using the
subcommand ‘main’, leading to the following extended command:
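A plausible form of this extended command (the exact title string is an assumption, taken here from the caption of Figure 2.2) is:
> pie(faculty, labels = department, main = "Pie chart of faculty by department")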
As can be seen from this command line, we set the subcommand ‘labels’
equal to the name of the variable containing those of the departments, and
set the subcommand ‘main’ equal to the title we want to use, which we place
in quotation marks. To simplify matters throughout the remainder of the
book, we will always make a reference to any R commands and subcommands
within the main body of the text by using quotation marks. Using the above
pie-chart command produces the output displayed in Figure 2.2.
FIGURE 2.2. Pie chart of faculty by department, using slice labeling and figure title.
The pie chart as a graphical device is very useful when one is interested in
displaying qualitative data in the form of slices in a circle. Another very popu-
lar alternative and equivalent data presentation method is the so-called bar
chart (barplot). In its simplest form, in a barplot the levels of a variable under
consideration are presented successively (following the order from top to bot-
tom in the data file), and the frequencies (relative frequencies) with which the
variable takes its values are represented by the height of bars—vertical rectan-
gles positioned above these levels. For our currently considered example, we
can obtain the barplot of the ‘faculty’ variable in R with the command ‘bar-
plot’ as follows:
> barplot(faculty)
This command yields the graph displayed in Figure 2.3. Along its horizontal
axis, the departments are represented from left to right as they follow from
top to bottom in the original data file. Above them are the bars, whose heights
reflect the number of faculty members per department.
While we can easily obtain from Figure 2.3 an idea about the relations
between the number of faculty across each department—which numbers are
represented along the vertical axis—it is unclear from the figure alone which
department is associated with which of the bars. To assign the names of the
departments to the horizontal axis, we use the subcommand ‘names.arg’, and
as above use the subcommand ‘main’ to attach a title to the figure. The result-
ing, somewhat extended command we need for these aims, is now as follows:
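A plausible form of this extended command (the exact title string is an assumption) is:
> barplot(faculty, names.arg = department, main = "Barplot of faculty by department")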
This extended command furnishes the bar chart displayed in Figure 2.4.
FIGURE 2.3. Barplot of faculty by department.
This bar chart is now far more informative than the one displayed earlier
in Figure 2.3, and in addition may be seen as more informative than the pie
chart presented in Figure 2.2. The reason is in particular the fact that by
referring to the vertical axis one can see the values that the ‘faculty’ variable
takes across departments. These are the above-mentioned raw frequencies that
equal the number of faculty per department.
FIGURE 2.4. Barplot of faculty by department.
If we wish to obtain a bar chart not of the raw frequencies as in Figure 2.4, but of the relative frequencies, we
can divide the raw frequencies by the total number of cases in the data set—
here the total number of faculty at the college in question. In general, as
mentioned before, the relative frequency for a given variable category
(‘department’ in this example) is defined as the ratio of observed frequency,
denoted for clarity as f, to the total number of cases studied—denoted
n—which is the sample size in a study under consideration:
(2.1)   r = f / n,
where r is used to denote relative frequency. That is, to obtain the relative
frequency of faculty per department in our example, for each department we
need to divide their number by the total sample size. In most studies one
would know the sample size already at the beginning of the analysis, but at
times it may be necessary to obtain it alternatively (e.g., by having the R
software do it for us). We can simply obtain sample size by summing the
values of the variable ‘faculty’ across all departments, which is accomplished
with the R command ‘sum’:
> sum(faculty)
[1] 53
In this output, the first number presented in brackets, [1], indicates that
immediately following is the first element of the output produced by R. Since
in this case all we need from R is a single number (i.e., the studied group size,
or sample size), it is presented right after the symbol ‘‘[1].’’ Thus, 53 is the
total number of faculty in the college under consideration. We will often
encounter this simple pattern of output presentation in the remainder of the
book.
Once we know the group size, we can request from R the barplot of relative
frequencies per department using the above command ‘barplot’, where we
now divide the variable in question by this size. We achieve this by using the
division sign ‘/’ as follows:
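A plausible form of this command (dividing by the group size of 53 obtained above; the title string is an assumption) is:
> barplot(faculty/53, names.arg = department, main = "Relative frequencies bar chart for faculty by department")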
We stress that the first argument in this command (the first entry in its paren-
theses, from left to right) is formally the ratio of the variable in question to
studied group size (sample size). The last presented, extended ‘barplot’ com-
mand produces the relative frequencies bar chart displayed in Figure 2.5.
From this barplot, it can now be readily observed that the largest percentage
of faculty are employed in the Educational Administration Department, about
a quarter of all faculty are in the Teacher Education Department, less than
10% are in the Measurement, Evaluation, and Statistics Department, and
about 17% of all faculty are employed in the School Psychology Department.
FIGURE 2.5.
Relative frequencies bar chart for faculty by department.
Pie charts and bar charts thus provide convenient graphical summaries of qualitative variables, in terms of the frequencies or relative frequencies with which these variables take their values or levels.
Although qualitative variables are often encountered in empirical research,
data arising from quantitative variables are just as common (or at least from
variables that could be treated as quantitative). As mentioned earlier, the val-
ues of a quantitative variable differ from one another in quantity rather than
in quality as is the case for a qualitative variable. For quantitative variables, it
is equally necessary to have methods that can be used to graphically represent
their values in a way that provides an informative summary of them.
A very popular method that can be used to graphically represent data from
a quantitative variable is the so-called histogram. A histogram is basically a
series of vertical rectangles that represent the frequency with which scores fall
in the interval that is at the bottom of that rectangle. Fortunately, these inter-
vals, including their length and position, are automatically chosen for us by
the R software through some built-in, reasonable, and widely applicable
defaults. In fact, with R the intervals are defined by default as including the
number positioned at their right end, but not at their left end. With this
feature, the histogram gives an impression of what the distribution of a vari-
able under study actually looks like. We mention in passing that the distribu-
tion is the way in which scores on the variable relate to one another, and we
will have more to say about this concept in a later section of the book.
To illustrate the construction of a histogram, let us return to Example 2.2
where we were interested in the mathematics test–taking anxiety (MTA) vari-
able in a study with a sample of n = 36 students. Suppose we wish to construct
the histogram of the variable ‘anxiety score’ (i.e., the MTA score obtained with
an established measuring instrument used in that study); we recall that this
variable was denoted as ‘y’ in Table 2.3. Now using this data file, R can readily
produce a histogram for the variable with the command ‘hist’:
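A plausible form of this command (the exact title string is an assumption) is:
> hist(y, main = "Histogram of mathematics test-taking anxiety scores")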
We note that in this command the name of the variable for which a histogram
is to be obtained is given first (which here is denoted as ‘y’). In addition, the
subcommand ‘main’ is used to attach a title to the top of the generated histo-
gram. Figure 2.6 provides the histogram for the MTA scores generated using
this command statement.
As can be readily seen from this histogram, there are for example in total
seven scores that are either 23 or 24, since the bar corresponding to the inter-
val (22, 24] has a height of seven units—i.e., the frequency of scores in this
interval is seven. (We note that the symbol ‘‘(.,.]’’ is used to denote an interval
including the number at its right end but not the number at its left end.)
Similarly, there are three scores in the interval (16, 18], as can also be verified
by inspecting the data in Table 2.3.
FIGURE 2.6. Histogram of mathematics test–taking anxiety scores (Example 2.2).
(Below we also illustrate an alternative way of achieving the same graphical representation aim using the so-called
stem-and-leaf plot.)
When informally presenting a histogram, often one may wish to connect
the middle points of the top sides of these rectangles. If one uses thereby
corresponding segments of straight lines, the resulting curve is commonly
referred to as a frequency polygon. If one connects these middle points by a
smooth curve rather than segments, the resulting curve is informally also
referred to as variable ‘‘distribution’’ curve (or just distribution). We will dis-
cuss this alternative notion for graphical presentation of data in more detail
in a later section of the book.
Histograms are very useful devices to present data, but depending on how
their class intervals are built, they can sometimes conceal some important infor-
mation (e.g., Verzani, 2005). For example, such a situation may arise when
grouping together fairly different scores (as may happen with variables that have
wide ranges of scores). For this reason, it can be particularly informative to
examine the actual frequency with which a particular score appears in a data
set. This is readily accomplished with the stem-and-leaf plot. In it, the stem is
composed of all numbers apart from their last digit, and the leaves are their last
digits.
Using the same anxiety score data for Example 2.2 considered above, we
can obtain their stem-and-leaf plot (graph) with R using the command ‘stem’.
Thereby, the subcommand ‘scale’ used requests that the last digit of the scores
be presented as leaves, and all the preceding digits as stem:
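A plausible form of this command (the specific value passed to ‘scale’ is an assumption) is:
> stem(y, scale = 1)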
1 | 577899999
2 | 1111223333344
2 | 555557778999
3 | 12
FIGURE 2.7.
Stem-and-leaf plot of the anxiety scores (Example 2.2).
NOTE
To find the mode of a variable, one counts how often each of its distinct values occurs in the available data set; the score(s) with the highest frequency(-ies) is then the mode of the examined
variable in the studied group. To illustrate, let us take a look at the stem-and-
leaf plot (Figure 2.7) of the mathematics test–taking anxiety (MTA) scores in
Example 2.2 considered in the last chapter. For ease of presentation we repeat
the stem-and-leaf plot below as Figure 3.1. As can be readily seen from the
graph in Figure 3.1, there are three modes in the MTA example data set. These
are the scores 19, 23, and 25. The reason is that each of them occurs five times
among the 36 anxiety scores, and all remaining scores occur less frequently.
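R has no built-in command that directly returns the statistical mode; a minimal sketch (not part of the original text) that finds the most frequent score(s) using the command ‘table’ is the following, which for the MTA data returns the three modes just mentioned:
> tab = table(y)
> as.numeric(names(tab))[tab == max(tab)]
[1] 19 23 25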
We note in passing that in a multiple-variable data set, each variable has its
own mode that need not be the score of the same person across the variables,
nor for that matter the same score(s). For example, suppose that in a study of
mathematics ability three tests are administered to high school seniors: (i) an
algebra test, (ii) a geometry test, and (iii) a trigonometry test. Then the mode
on the algebra test could be 25, the mode on the geometry test 27, and the
mode on the trigonometry test 22. The modes can also be scores obtained by
different subjects across these three tests.
We emphasize that the mode need not be uniquely defined. This is because
there may be more than just a single number that occurs with the highest
observed frequency. In fact, it all depends on the data of the variable under
consideration, and some data sets may have more than one mode. A way in
which this can happen is when we consider several groups of subjects but
disregard group membership (i.e., consider them as a single data set), e.g.,
males and females on a variable of interest. Data sets (variables) with two
modes are often referred to as bi-modal, and similarly defined in terms of the
number of modes are tri-modal or multi-modal data sets or variables in gen-
eral. For instance, the MTA data set presented in Figure 3.1 (Example 2.2
from Chapter 2) is tri-modal as we saw above. Also, in a data set where each
separate score appears the same number of times, either score can be viewed
as a mode.
An especially useful feature of the mode is that it is in general not affected
by extreme observations (e.g., even in the problematic cases when these are
the results of entry errors). This feature is often referred to as ‘‘resistance’’ or
‘‘robustness’’ of the mode to extreme values. To illustrate, consider the follow-
ing example. Suppose that when recording or entering the data in the first
1 | 577899999
2 | 1111223333344
2 | 555557778999
3 | 12
FIGURE 3.1.
Stem-and-leaf plot of the anxiety scores (Example 2.2).
considered example in this chapter with the IQ test scores, a researcher mis-
takenly entered 995 rather than 95. That is, in his/her version of the above
data set, the following were the actually entered (recorded) scores
995, 97, 100, 101, 103, 101, 102, 105, 101, 95, 97, 101.
However, despite this flagrant error, the mode is still 101 as it occurs the
most frequently. That is, in this case the mode is unaffected by the data entry/
data recording mistake.
At the same time, it is worthwhile noting that it is also possible that an
entry or recording error might be made when entering the score that most
frequently occurs in the data set, in which case the mode may well be affected.
For example, when the second most frequent score appears just once less often
than the actual mode in the original data set, then due to such a recording
error both these scores may appear equally often. In such a case, two modes
will be proclaimed for a data set that actually has only a single mode. Obvi-
ously a variety of other examples of similar errors can be easily constructed as
well, where the actual mode of a data set is misrepresented due to a data entry
or recording error.
A useful feature of the mode is that while it is easily defined for quantitative
variables, it is the only measure of ‘‘central tendency’’ that is meaningful for
a qualitative variable. (The concept of ‘‘central tendency’’ is not well defined
for such variables, and we use it here in a loose sense, viz., in the sense of
most ‘‘fashionable’’ category—i.e., one with the highest observed frequency.)
Finally, we note that the mode has the unique feature of being a score that is
actually contained in the data set under consideration. That is, the mode is in
fact a score taken by/measured on a studied subject (unit of analysis). This
need not be the case with the two other measures of central tendency that we
turn to next.
> sort(y)
This command produces the following output, whereby the prefixes [1] and
[23] signal that the first and 23rd consecutive MTA scores follow immediately
after them (which prefixes we just ignore most of the time):
[1] 15 17 17 18 19 19 19 19 19 21 21 21 21 22 22 23 23 23 23 23 24 24
[23] 25 25 25 25 25 27 27 27 28 29 29 29 31 32
To find the median of this variable, we need to take the middle score, if sample
size is an uneven number, or the average of the middle two scores, if sample
size is even. In this particular example, we happen to know that the sample
size is 36, but at times the sample size may not be known beforehand. In such
cases, in order to work out the sample size for a given study or variable, we
can use the command ‘length’. This command determines the length of the
array (row, set) of scores that comprise the available observations on a vari-
able in question. For our MTA example, the command produces the following
result (given immediately beneath the command):
> length(y)
[1] 36
That is, the sample size is 36 here—something we knew beforehand for this
MTA example but may not know in another data set of interest.
Since the number of scores on the variable ‘y’ of interest in this data set is
even, in order to work out its median we take the average of the 18th and
19th scores from left (or from right) in their above-sorted sequence. As it
happens, both these middle scores are 23, and thus they are each equal to
their average, which is declared to be the median value of the MTA variable,
viz., 23.
We would like to note here that all of these same computational activities are carried out automatically by R when using the command ‘median’:
> median(y)
This command yields for the mathematics test–taking anxiety (MTA) example
the median value of
[1] 23
Similarly, the mean of the MTA scores is obtained with the command ‘mean’:
> mean(y)
yields
[1] 23.25
Since the mean is so widely used, we will spend next some time on its more
formal discussion (e.g., Raykov & Marcoulides, 2011, ch. 2). Let us first
denote a given variable of interest by the letter y—a notation that will be used
quite often in the rest of this book. The mean of y in a population of N
subjects—N being typically large and finite, as is usually the case in current
behavioral and social research and assumed throughout the rest of this
book—is defined as:
(3.1)   $\mu_y = \frac{1}{N}(y_1 + y_2 + \ldots + y_N) = \frac{1}{N}\sum_{i=1}^{N} y_i$,
where y_1 through y_N denote the values of the variable y for the members of
the population, beginning with the first and up to the Nth member, and μ_y
designates the mean of y in the population of interest. (We may eventually
dispense with using the sub-index y attached to the symbol μ for population
mean values in later discussions, when no confusion may arise.) We note also
the presence of the summation symbol, Σ, in Equation (3.1). This symbol is
utilized to denote the process of adding together all y scores with sub-indexes
ranging from i = 1 to i = N, i.e., the sum y_1 + y_2 + ... + y_N. We will also
frequently use this short summation index, Σ (with appropriate ranges of
associated sub-indexes), in the rest of the book.
We rarely have access, however, to an entire population of interest. Rather,
we typically only have available a sample from the population of interest. How
can we then use this sample to extract information about the population
mean, i.e., obtain a good ‘‘guess’’ of the population mean? To this end, we
wish to combine in an appropriate way the studied variable values obtained
from the sample, so as to render such a good ‘‘guess’’ of the population mean.
This process of combining appropriately the sample values to furnish infor-
mation about an unknown quantity, like the mean, is called in statistical ter-
minology estimation. Unknown population quantities, like the mean (or
median), which characterize a population distribution on a variable of inter-
est, are called parameters. We wish to estimate these parameters using data
obtained from the sample.
For any given parameter, we accomplish this estimation process utilizing
an appropriate combination, or function, of the sample values or scores on
the studied variable. This function is typically referred to as statistic. That is,
we estimate unknown parameters using statistics. A major property of ‘‘good’’
statistics is that they appropriately combine sample values (observations) in
order to extract as much information as possible about the values of unknown
parameters in a population(s) under investigation. When using statistics to
estimate parameters, the statistics are often referred to as estimators of these
parameters. That is, in a sense an estimator is a statistic, or a formula, that is
generally applicable for the purpose of estimating a given parameter. In a
given sample, the value that the statistic takes represents the estimate of this
specific parameter.
Returning to the mean considered earlier in this section, as we indicated
before it is estimated by the arithmetic average of the scores on y obtained in
the studied sample, i.e., by
(3.2)   $\hat{\mu}_y = \bar{y} = \frac{1}{n}(y_1 + y_2 + \ldots + y_n) = \frac{1}{n}\sum_{i=1}^{n} y_i$,
Specifically for the data set in Example 2.2 (see the MTA example in Chap-
ter 2), the 5%-trimmed mean is obtained with the following R command:
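A plausible form of this command, using the ‘trim’ argument of ‘mean’ (the exact call and the precision of the printed result are not preserved here), is:
> mean(y, trim = .05)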
[1] 23
To examine how individual scores differ from the mean, we can obtain the individual deviation scores—the differences between each score and the mean—by creating a new object ‘u’ in R:
> u = y - mean(y)
To see the result of this action, we ask R to print to the screen the elements
of the newly obtained vector u, by simply stating its symbol at the R prompt:
> u
This prints to the computer screen the individual deviation scores as follows
(the prefixes ‘‘[13]’’ and ‘‘[25]’’ signal that the 13th and 25th individual devia-
tion score follows immediately after them, in this listing, and are ignored as
usual):
[1] -8.25 -6.25 -4.25 -2.25 -0.25 8.75 -5.25 -4.25 -1.25 -0.25 1.75 -2.25
[13] -1.25 1.75 1.75 0.75 -0.25 3.75 -4.25 -6.25 -2.25 5.75 3.75 7.75
[25] -4.25 -4.25 -0.25 1.75 1.75 3.75 5.75 5.75 -2.25 0.75 -0.25 4.75
While the individual deviation scores represent the degree to which indi-
viduals (individual units of analysis) deviate from the mean, they have the
property that they always sum up to zero, no matter how large any one of
them is. In other words, the sum of the deviations of individual scores around
the mean will always be equal to zero. Indeed, if we sum them up using R (for
which we can use the command ‘sum’ as indicated in the preceding chapter),
we readily observe that the result is zero:
> sum(u)
[1] 0
Thus, any data set—no matter how many scores it consists of or how different
the individual scores are from their mean—has the same overall sum of the
individual deviation scores, viz., zero. Hence, it would seem that this informa-
tion does not appear to help much in differentiating between different sets of
scores. In addition, the individual mean deviations are scores that are charac-
teristic for each person studied. Therefore, when one simply examines these
deviations, no data reduction (summarization) is actually being achieved. Yet
this type of reduction or summarization is what is frequently sought when
using statistics in most empirical research.
For these reasons, we need a summary measure of individual differences,
which measure does not share the limitations just mentioned. At the popula-
tion level, such a measure is the variance of the studied variable, which is
defined as follows (later in the book, for ease of presentation we may dispense
with the subindex y attached to σ², when no confusion may arise):
(3.4)   $\sigma_y^2 = \frac{1}{N}\sum_{i=1}^{N} u_i^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \mu_y)^2.$
Based on this equation, it should be evident that the variance is the average
squared mean deviation. (We assume throughout the rest of the book that the
variance, as well as the mean, of any variable considered is finite, which is a
fairly mild if at all restrictive assumption in empirical research.) We also
observe from this definition in Equation (3.4) that the variance has as units
of measurement the squared units underlying the individual scores (measure-
ments). Because of these squared units, the variance is somewhat difficult to
directly interpret in an empirical setting. To avoid this problem of interpreting
squared units, the standard deviation is also considered, which is defined as
the square root of the variance:
(3.5)   $\sigma_y = \sqrt{\frac{1}{N}\sum_{i=1}^{N} u_i^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \mu_y)^2}$,
where a positive square root is taken.
Equations (3.4) and (3.5) allow us to determine the variance and standard
deviation of a studied random variable y, if an entire (finite) population of
concern were available. As already mentioned, this will rarely be the case in
empirical social and behavioral research that typically works, due to a number
of reasons, with samples from populations of interest. As a consequence, the
variance and standard deviation for a variable of concern are estimated in an
available sample correspondingly by using the following equations:
(3.6)   $\hat{\sigma}_y^2 = s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$
and
(3.7)   $\hat{\sigma}_y = s_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2}$,
with a positive square root taken in the last equation. We emphasize that we
divide in Equation (3.6) the sum of squared mean deviations by (n − 1)
rather than by n. We do this in order to obtain an ‘‘unbiased’’ estimate/
estimator of variance. This means that the resulting estimate on average equals
the population variance, across possible samples taken from the population
(all of them being with the same size, n). This unbiasedness feature is a desir-
able property for any estimator of any parameter. Alternatively, if we divide
just by n in Equation (3.6), then on average—that is, across repeated sampling
from the same population—the variance estimate will underestimate the true
variance (i.e., will be associated with a negative bias).
The variance and standard deviation sample estimates are also readily
obtained with R using the commands
> var(y)
and
> sd(y)
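As an illustrative check of Equation (3.6) (a sketch, not part of the original text), one can also compute the sample variance directly and compare it with the value returned by ‘var’, which likewise divides by n − 1:
> n = length(y)
> sum((y - mean(y))^2)/(n - 1)
The two calculations return the same value.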
The range is defined as the difference between the largest and smallest scores in a given
data set (on a studied variable). Denoting by y this variable, the range r is
defined as
(3.8)   r = max(y) − min(y),
where max(.) and min(.) are the largest and smallest score on y, respectively.
With R, we obtain the range by simply using its definitional equation:
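A plausible form of this command, following the definitional Equation (3.8), is:
> r = max(y) - min(y)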
The result of this activity is that the range of the variable y is stored or created
as the object r. As before, in order to see its contents—e.g., to see what the
range of the MTA score is—we need to state next the symbol of the range at
the R prompt:
> r
which yields
[1] 17
That is, the distance between the highest and lowest mathematics test–taking
anxiety score in the example studied sample of 36 students is equal to 17.
Alternatively, we can also use the command ‘range’, to obtain the smallest
and largest number in the data set:
> range(y)
In the above MTA example (Example 2.2 in Chapter 2), this command
returns
[1] 15 32
Another widely used measure of variability is the inter-quartile range (IQR), the difference between the upper and lower quartiles of a variable (see the discussion of quartiles below), which is obtained with the command ‘IQR’:
> IQR(y)
[1] 5
That is, the middle half (in the rank ordering) of the scores of the n = 36
students examined in this anxiety study differ from each other by up to five
points.
Unfortunately, any of the summary indexes discussed in this section has
the limitation that sets of scores with very different mean values could still
have the same variability measures. In order to relate variability to the posi-
tion of the actual range within which scores vary from one another, the coeffi-
cient of variation (CV) can be used. The CV index provides by definition the
variability per unit mean and is determined as
(3.10)   $c_y = \sigma_y / \mu_y$
for a given variable y in a population of interest. In an available considered
sample, the CV can obviously be estimated as
(3.11)   $\hat{c}_y = \hat{\sigma}_y / \hat{\mu}_y = s_y / \bar{y}$.
For our earlier MTA example, this coefficient is obtained with R as follows:
> c = sd(y)/mean(y)
which yields
[1] 0.178785
The CV becomes quite relevant when one considers different populations that
may have similar variability but different mean values. Under such circum-
stances, their CV indexes will also differ, as a reflection of the population
differences.
3.3.1. Quartiles
The term quartile is commonly used to describe the division of observations
into defined intervals based on the values of the data. (We note in passing
that a quartile can also be thought of as a particular quantile; see below.)
There are two main quartiles for a given variable, a lower and an upper quart-
ile. The lower quartile cuts out at the left (i.e., the lowest) 25% of the scores
on the distribution, i.e., the lowest quarter of the distribution. Conversely, the
upper quartile cuts out at the right (i.e., the highest) 25% of the scores, i.e., the
upper quarter of the distribution. That is, the upper quartile can be thought of
as being the median of all scores that are positioned above the median of the
original data set; similarly, the lower quartile would be the median of the
scores below the median of the original data. In other words, the earlier dis-
cussed inter-quartile range (IQR) is just the difference between these two
quartiles, which thus enclose the middle 50% of the scores on a given variable.
The lower and upper quartiles are readily obtained with R using the com-
mand ‘quantile’. To obtain the lower quartile, we use
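Plausible forms of this command and of the corresponding one for the upper quartile (the exact calls are assumptions) are:
> quantile(y, 1/4)
and
> quantile(y, 3/4)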
We stress the use of the numbers 1/4 and 3/4 at the end of these commands
(and after a comma separating them from the name of the variable in ques-
tion), since 1/4 and 3/4 of all scores lie to the left of the lower and of the upper quartile, respectively.
To illustrate, for our earlier MTA example (Example 2.2 of Chapter 2), we
obtain the following values of these quartiles (with the results again given
beneath the command used):
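A plausible reconstruction of these commands and their results (the quartile values 20.5 and 25.5 are confirmed in the discussion below) is:
> quantile(y, 1/4)
 25% 
20.5 
> quantile(y, 3/4)
 75% 
25.5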
That is, the smallest 25% of the anxiety scores in this example are no higher
than 20 (since they are all whole numbers, as seen from Table 2.3 in Chapter
2), while the highest 25% of the anxiety scores are at least 26.
These two quartiles can be readily obtained for a quantitative variable in a
given data set, along with its mean, median, maximal, and minimal value,
using alternatively the command ‘summary’. This R command produces the
six-point summary of the variable. For instance, considering again the MTA
example, we obtain the following results (given beneath the used R com-
mand):
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
15.00 20.50 23.00 23.25 25.50 32.00
As before, we observe from this output the minimum and maximum values
being 15 and 32 respectively (see earlier discussion of the ‘range’ command
in this chapter). We also notice the values of the mean and median being
correspondingly 23.25 and 23, and finally the lower and upper quartiles as
20.5 and 25.5, respectively. With these features, the command ‘summary’ pro-
vides a quick and convenient summarization of the data on a quantitative
variable.
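Another very useful graphical device for a quantitative variable is the boxplot (box-and-whisker plot). A plausible form of the boxplot command discussed next (the exact axis title is an assumption) is:
> boxplot(y, ylab = "Anxiety score")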
where we now add in the subcommand ‘ylab’ a title for the vertical axis—the
one of anxiety scores. This command yields Figure 3.2 for the earlier MTA
example.
In this displayed boxplot, the thick horizontal line in its middle represents
the median of the anxiety scores of interest, which is equal to 23. As men-
tioned before, the two thinner lines (hinges) enclose the box at the lower and
upper quartiles of 20.5 and 25.5, respectively. That is, the distance between the latter
two statistics is the IQR, viz., 5 in this example. In other words, this box
encloses the ‘‘middle’’ half of the scores, as one moves along the vertical axis representing the range of the MTA scores under consideration.
FIGURE 3.2. Boxplot of anxiety scores.
In the boxplot,
the whiskers are the dashed lines that stretch away from the hinges and until
the smallest and largest value (15 and 32, respectively) in the sample, which
fall within 1.5 IQR units from the closest box end. As mentioned before, when
a data set contains possible ‘‘outliers’’ (i.e., extremely small or large values),
they are visually located outside of the whiskers. We do not have such ‘‘out-
liers’’ in the present data set, since none of its scores extends further than 1.5
IQR = 7.5 points from the lower and upper quartiles (20.5 and 25.5,
respectively).
There are a number of important pieces of information that can be
extracted by examining a boxplot for a variable under consideration. First, the
median can be easily located by examining where the thick horizontal line is
located within the box—the height of this line is the median. (See the vertical
axis in Figure 3.2 and the score on it pertaining to the line; that score is the
median, 23, in the presently considered example.) Second, the length of the
box (i.e., the IQR) is a measure of variability, as discussed earlier in this chap-
ter. This measure is visually represented by the distance between the lower
and upper hinges of the box. In case the median is positioned in the center of
the box and the two whiskers are of the same length, with no points further
away from the lower and upper ends of the whiskers, the distribution of the
variable is fairly symmetric. If the median is not positioned in the center of
the box, however, there is evidence of some asymmetry in this distribution.
In that case, the longer tail of the distribution is to be found in the direction of
the longer whisker (and further extreme observations if any). This asymmetry
usually is opposite to the direction in which one finds the hinge that is closer
to the median. Points outside of the ends of the whiskers are also indicative
of possible outliers as mentioned before.
When a distribution is not symmetric, it is called skewed. In a skewed distri-
bution, the scores above (below) the median are spread more and further
away from it, than the scores below (above) the median. More specifically,
there are two types of skewed distributions—positively and negatively skewed
distributions. A positively skewed distribution has the median usually posi-
tioned lower than the center of the box and closer to the lower whisker. In
such a distribution, the scores above the median are spread more and further
away from it, than the scores below the median. That is, the right tail of the
distribution is longer. Conversely, a negatively skewed distribution has the
median usually positioned closer to the upper hinge, and the lower whisker
being longer than the upper whisker. In such a distribution, the scores below
the median are spread more and further away from it, than the scores above
the median; that is, the left tail of the distribution is longer. We will discuss
further the notion of asymmetry and quantify it in a later chapter of the book.
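The next illustration concerns a group-comparison example—side-by-side boxplots of a college aspiration score, denoted ‘y’, for boys and girls, whose gender is recorded in a variable ‘g’ (this data set is not shown here). A plausible form of the command discussed next (the axis titles are assumptions) is:
> boxplot(y ~ g, xlab = "Gender", ylab = "College aspiration score")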
We note the use of the symbol ‘~’, which effectively requests comparison of
the variable ‘y’ within the values of the variable ‘g’, i.e., for each of the genders
here. Furthermore, we have also added the ‘xlab’ subcommand to provide a
title for the horizontal axis. The last stated R command yields the graphical
representation provided in Figure 3.3.
FIGURE 3.3.
Simultaneous boxplots for boys and girls on a college aspiration measure.
As can be seen by examining Figure 3.3, boys have a slightly higher median
than girls on the college aspiration scale; they also have less pronounced inter-
individual differences. Additional group comparisons can also be readily
made using this data set. For example, in this group comparison context, a
question of actual interest could be whether these observed group differences
are ‘‘real,’’ i.e., if they exist also in the two examined populations (of boys and
of girls) from which these two samples came. After all, the research question
was concerned with population differences to begin with—whether such
existed—rather than with differences in randomly drawn samples from them.
Answering this important question will be the concern of later chapters of the
book. Before we turn our attention to them, however, we need to discuss
in the next chapter a concept of fundamental relevance for statistics and its
applications, that of probability and its related notions.
Probability
The concept of probability has been the subject of intense study for more
than 400 years, dating back to some of the early writings by the mathemati-
cians Pierre de Fermat and Blaise Pascal. This process has led to a variety of
different probability definitions. We will mention here three such definitions.
As an example, suppose we are given a coin (with two sides, heads and
tails), for which we do not know whether it is fair or not. The probability of
getting a tail then would be approximated, in a long series of independent
tosses of the coin, by the ratio of the number of tails occurring to the total
number of tosses—according to Equation (4.2). Thereby, the larger the num-
ber of trials (assumed as mentioned above to be independent of one another),
the better the resulting estimate of probability.
When the outcomes of an observation (experiment) are equally likely, the
relative frequency definition of probability may be viewed informally as an
empirical implementation of the classical interpretation (definition) of proba-
bility. This can be seen by considering the set of all conducted trial results as
that of all possible outcomes, while those when the event in question has
occurred are viewed as favorable for it.
In this section, as in the rest of the chapter, we will use the relative frequency
definition of probability to illustrate the evaluation or estimation of the prob-
ability of an outcome or event. To demonstrate the concepts involved, we use
the R command ‘sample’ to simulate a series of independent tosses of the coin, recording a ‘‘tail’’ as 1 and a ‘‘head’’ as 0. The resulting sequence of 0’s and 1’s is stored in an object named ‘s’ and is also called an array (a plausible form of the command is sketched further below). We emphasize that we sample with replacement, since
we wish to make sure that the two possible outcomes—0 and 1—are ‘‘avail-
able’’ for each experiment repetition. (The default arrangement in the pro-
gram R for the ‘sample’ command is sampling without replacement, which
the presently used subcommand ‘replace = T’ overrides.)
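A plausible form of the simulation command just described (the exact call is an assumption; 1 codes a ‘‘tail’’ and 0 a ‘‘head’’) is:
> s = sample(c(0, 1), size = 100, replace = T)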
In order to obtain now an estimate of the probability of a ‘‘tail,’’ we first
obtain the number of 1’s sampled. This number is actually the sum of all
elements in the array ‘s’ defined above. According to the relative frequency
definition of probability, the probability of ‘‘tails’’ is then furnished by divid-
ing the number of 1’s by the total number of experiments carried out, i.e., by
n = 100:
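A plausible form of this calculation is:
> sum(s)/100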
which returns
[1] 0.47
Hence, our estimate of the probability for ‘‘tails’’ with the coin used, based
on the sample of 100 observations we utilized, is .47. (Note the use of the
command ‘sum’ to add together all elements of the array s of numbers.) We
observe that the obtained estimate of .47 is fairly close to the well-known
probability of .5 for getting a ‘‘tail’’—a probability value that would have to
be valid based on the assumption that the coin is fair.
We reiterate that this estimate of the probability in question is obtained
using the relative frequency definition of probability. As indicated earlier, with
the relative frequency definition of probability an assumption was made that
the experiment was repeated a large number of times. One can perhaps argue
that 100 is hardly a large number, so let us repeat this experiment now with
n = 1,000, i.e., ‘‘run’’ our experiment 1,000 times. We achieve this in R as
follows:
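A plausible form of the corresponding commands is given next; being based on random sampling, each run will of course yield a slightly different estimate:
> s = sample(c(0, 1), size = 1000, replace = T)
> sum(s)/1000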
This leads—in the same way as explained above—to the estimate .496 of the
probability of a ‘‘tails’’ outcome. As can be seen this value is now much closer
to the probability of .5. If we were to conduct 10,000 repetitions of the same
experiment, the estimate would be even closer to .5 (in this case the obtained
value would be .503). Indeed, one could potentially continue increasing the
number of repetitions of the same experiment, whereupon the estimate would
tend to get closer and closer to .5.
This simple simulation of an experiment, carried out with R, demonstrates
the relative frequency definition of probability and its utility in practical set-
tings. Of course, if no statistical software were to be available, one could
potentially conceive of tossing the coin n = 100 times, then n = 1,000 times,
and finally n = 10,000 times, and proceeding as above—dividing the number
of times ‘‘tails’’ has occurred by the total number n of tosses.
Having defined probability and illustrated it empirically, we are now ready
for a discussion of basic event and probability relationships.
(4.3)   0 ≤ Pr(E) ≤ 1,
for any event E. Thereby, the impossible event—at times also called the
‘‘empty event’’ and denoted as Ø—is associated with a probability of 0, and
the certain event is associated with a probability of 1. (As we will find out
later in the book, these are however not the only events with a probability of
0 and 1, respectively.)
Let us consider now two events, denoted by A and B. For instance, when
rolling a dice, let A be the event that ‘‘The number one turns up’’ and B the
event that ‘‘An even number turns up.’’ Obviously, A and B cannot occur simul-
taneously, since when rolling a dice only one number turns up, which cannot
be both one and at the same time an even number. Such events are called
mutually exclusive (denoted simply as ‘‘m.e.’’). Consider next the union of these
two events, denoted as C = A ∪ B, or simply ‘‘A or B.’’ The event C will occur
if A or B occurs, i.e., if one or an even number turns up when rolling the dice.
An important relationship that can be shown for mutually exclusive events
is the following:
(4.4)   Pr(A ∪ B) = Pr(A) + Pr(B).
For a given event A, its complement is defined as the event that A does not
occur, sometimes denoted ‘‘not A.’’ For instance, if when rolling a dice the
event of interest is A = ‘‘An even number turns up,’’ then the complement
of A is the event D = ‘‘not A’’ = ‘‘An uneven number turns up.’’ Obviously,
if D = ‘‘not A’’ is the complement of A, then
(4.5)   Pr(D) = 1 − Pr(A).
This is because A and D are evidently m.e., and in addition their union is the
certain event, i.e., Pr(A or ‘‘not A’’) = 1 (see also Equation (4.4)).
In many empirical situations, one may be interested in two events A and B
that are not necessarily m.e., and in particular in their occurrence simultane-
ously. The event consisting of this simultaneous occurrence is called the inter-
section of the two events and is denoted by F = AB, or alternatively as A ∩ B.
We stress that the occurrence of F is tantamount to that of both A and B
occurring in a given experiment. We note also that if A and B are m.e.,
then F is the empty event that never occurs, and then Pr(AB) = Pr(F) =
Pr(Ø) = 0.
The probability of the intersection of two events, A ∩ B, is related to that
of the union of two general events A and B (i.e., events that need not be m.e.)
in the following way: Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).
FIGURE 4.1.
Venn diagram for the probability of the intersection F of the events A and B (F = AB).
In the remainder of this chapter, we will use on several occasions the proba-
bility relationships discussed in this section.
(i) its sensitivity, that is, its true positive rate, i.e., Pr(test is positive | child
has a learning disability); as well as
(ii) its specificity, that is, its true negative rate, i.e., Pr(test is negative |
child has no learning disability).
These probabilities can be worked out with Bayes’ formula, which can be
obtained using earlier developments in this chapter. To this end, first let us
consider two events A and B with positive probability each. Then, from Equa-
tions (4.9) and (4.10) we obtain,
(4.12) Pr(A | B) = Pr(AB) / Pr(B) = Pr(A)·Pr(B | A) / Pr(B).
However, since the event BA and the intersection of B with the event ‘‘not A’’ are m.e. and their union is B,
(4.13) Pr(B) = Pr(BA) + Pr(B and ‘‘not A’’).
difficulties, then in case a child indeed has such difficulties, Pr(B | A) = .72. Similarly, from the specificity of this test being .65, it follows that Pr(not B | not A) = .65 = 1 − Pr(B | not A). Therefore, Pr(B | not A) = 1 − Pr(not B | not A) = 1 − .65 = .35.
If we have now the result obtained from testing a certain child with this
instrument, and the result suggests the child has learning difficulties, then the
probability the child indeed is having such difficulties is according to Bayes’
formula (‘×’ denotes multiplication next):
Pr(A | B) = .72 × .05 / (.72 × .05 + .35 × .95) = .098.
This probability result can be readily obtained with R as follows (note the
use of ‘*’ for multiplication):
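For instance, entering the following arithmetic expression at the R prompt accomplishes this (the exact line is a plausible form consistent with the calculation just carried out):
> .72*.05/(.72*.05 + .35*.95)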
which renders a value of .098 as the probability of the child having learning
difficulties.
We note in passing that we used R here in the form of a ‘‘calculator,’’ a role in which any other statistical software could also be used; needless to say, the calculations performed here by R can alternatively be obtained with any handheld calculator.
The formula provided in Equation (4.15) is actually a special case of the
general Bayes’ formula, which is also often called Bayes’ theorem. This special
case is obtained when only two events are considered, viz., A and ‘‘not A.’’
The general Bayes’ formula for a set of m.e. events A1, . . . , Ak is as follows
(note the analogy of the following expression to the right-hand side of Equa-
tion (4.15)):
(4.16) Pr(Aj | B) = Pr(B | Aj)·Pr(Aj) / [Pr(B | A1)·Pr(A1) + . . . + Pr(B | Ak)·Pr(Ak)].
That is, upon obtaining knowledge that event B has occurred, the prior proba-
bility for the jth event Aj is modified to the expression in the right-hand side
of Equation (4.16) (j = 1, . . . , k). As indicated earlier, Bayes’ theorem is of
fundamental relevance for a whole branch of statistics, called Bayesian statis-
tics. Given the introductory nature of the present book, this topic is not con-
sidered further—for more specialized treatments on this branch of statistics
the reader is referred to Smith (2010) or Swinburne (2005).
Based on the concepts presented in this chapter, we move next to a discus-
sion of various distributions of variables studied in the behavioral and social
sciences.
Probability Distributions of
Random Variables
on the variables and studied subjects. Variables of this type are called random
variables (RVs). In other words, a RV is such a variable whose individual
scores do not exist prior to the actual conducting of the study for which it is
of relevance, but become available only after that study has been carried out.
There are two main types of RVs in social and behavioral research—discrete
and continuous. A discrete random variable can take on only a countable
number of particular distinct values, whereas a continuous variable takes an
infinite number of possible values between any two lying in its range. In other
words, a continuous random variable is not defined at specific values but is
instead defined over an interval of values. In empirical research, it is custom-
ary to consider as approximately continuous a RV that can take on a relatively
large number of values (e.g., 15 or more), while one considers as discrete a
RV that can take only a limited number of values—e.g., two or just a few
more. For instance, the answer on an algebra problem—true or false—is an
example of a discrete RV. On the other hand, an IQ score obtained from an
intelligence test is an example of a variable that for most practical purposes
could be treated as a continuous (approximately continuous) RV. The distinc-
tion between discrete and continuous RVs becomes especially important in
applications of statistics for the purposes of modeling the relationships
between studied variables, as we will be doing later in this book. Under such
circumstances, different methods are best applicable according to whether we
are dealing with discrete response variables or with continuous (response,
outcome, dependent) variables.
> y = c(0, 1, 2)
> Pr.y = c(.25, .50, .25)
With these two commands, we create two arrays—the first containing the
values that the RV y of interest can take, and the second containing the proba-
bilities with which y takes these values. We can now provide a graph with
these probabilities by using the R command ‘plot’:
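A command of the following form, matching the way ‘plot’ is used later in this chapter, would produce this graph:
> plot(y, Pr.y, type = "h")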
FIGURE 5.1.
Graph of the probabilities of the random variable y, the number of tails when tossing
two fair coins (independently of one another).
work independently and all have the same probability p of solving correctly
the particular algebra task.
As seen from this informal definition of a binomial RV, in order to speak
of such a variable we need to be given two numbers—often called the parame-
ters of this distribution. These are the number of trials n and the probability
p of success. Once we know these two parameters n and p, we can speak of an
RV having the binomial distribution if the RV can take on the values 0, 1,..., n
(and no other) with specific probabilities; the latter are the probabilities for 0,
1,..., n successes in n trials that are independent of one another and with the
same probability of success. More formally, if y is a RV following the binomial
distribution with parameters n and p, which can be symbolically represented
as
y ~ Bi(n, p),
then the probability associated with each of them can be shown mathemati-
cally to be (e.g., Roussas, 1997):
(5.1) Pr(y) = [n! / (y!(n − y)!)] p^y (1 − p)^(n−y), (y = 0, 1, . . . , n).
In Equation (5.1), for any integer k we denote by k! = 1·2· . . . ·(k − 1)·k the product of all integer numbers up to k, including k (whereby k! is referred to as ‘‘k factorial,’’ and by definition we set 0! = 1). For example, 3! would simply be 3 × 2 × 1 = 6.
Although one could hand-calculate these probabilities, we can instead eas-
ily obtain them and graph them using R. This software implements the for-
mula provided in (5.1) by using the command ‘dbinom’. There are three
‘‘arguments’’ to this command, i.e., three entities that we need to provide R
with, in order for R to be in a position to execute this command. Specifically,
these are (i) the values of y that are of interest, (ii) the number of trials n (which is one less than the total number of values y could obtain, since 0 is also among them), and (iii) the probability p of success at each of
the independent trials involved.
To illustrate the definition of the binomial distribution and use of the cor-
responding R command ‘dbinom’, let us return to the context of the algebra
task example given above. Suppose that we have a class of 10 students who
work independently and that the probability of each one of them solving the
task is .7. Then we can obtain with R a graph of the probabilities of the RV
defined as the number of students with correct solutions of the task, denoted
y, as follows. First, we need to generate the set of possible numbers of students
with correct answers—viz., 0, 1,..., 10. This is easily done with the command
> y = 0:10
Second, in the command ‘dbinom’ that we use next, we need to enter the number n (in our example, 10) and the probability p (in our example, .7). (As an aside at this point, the name of the command,
‘dbinom’, could be in a sense decoded as ‘‘distribution of a binomial vari-
able.’’ Technically, ‘dbinom’ could be seen also as standing for ‘‘density of a
binomial variable,’’ keeping in mind that the set of its probabilities could be
considered the counterpart of the density of a continuous random variable, a
concept which is discussed later in this chapter.)
For the currently considered algebra task example, the following command
is thus needed:
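In line with the description of ‘dbinom’ given above, it would take the form:
> Pr.y = dbinom(y, 10, .7)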
We can view the probabilities computed in this way by just entering at the R
prompt ‘Pr.y’:
> Pr.y
The last command produces the following series of probabilities (note the
prefix in brackets at the beginning of each output row, indicating the serial
number of the probability following next in this set of altogether 11 probabili-
ties—for each of the numbers 0 through 10):
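The exact figures are not reproduced here; rounded to three decimal places, they can be displayed compactly as follows (the ‘round’ command is an added convenience, not part of the original sequence of commands):
> round(Pr.y, 3)
[1] 0.000 0.000 0.001 0.009 0.037 0.103 0.200 0.267 0.233 0.121 0.028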
From the provided output, we easily see that the probability of no students
correctly solving the task is .000 (rounded off to the third decimal place), for
six students correctly solving the task is .200 (rounded off), while that for just
three students solving it correctly is .009 (rounded off)—these are the first,
seventh, and fourth numbers, respectively, from left to right in this output.
Next, to plot these probabilities against the possible values of the RV y, we use the ‘plot’ command as before:
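A suitable form of this command, consistent with the description that follows, is:
> plot(y, Pr.y, type = "h")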
where we employ the last subcommand ‘type’ to request vertical lines con-
necting the probability point on the graph with the horizontal line (abscissa).
This yields the graph displayed in Figure 5.2.
An interesting property of the binomial distribution is that it becomes
more and more symmetric, for a given probability p, with increasing number
of trials, n. For instance, the graph of the binomial probabilities for n = 100 with p = .7 is much more symmetric, and is displayed in Figure 5.3. In order
to obtain this graph, we first create an array with the 101 numbers from 0 to
100, for instance using the command
> y = 0:100
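The corresponding probabilities would then be obtained in the same way as before, for instance with:
> Pr.y = dbinom(y, 100, .7)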
Finally we plot these probabilities against their corresponding y’s (as done
above), which leads to the graph in Figure 5.3 that as mentioned looks much
more symmetric than that in Figure 5.2.
Like any distribution used in this book, a RV y that is binomially distrib-
FIGURE 5.2.
Graph of the binomial probabilities (5.1) for n = 10 and p = .7.
FIGURE 5.3.
Graph of the binomial probabilities for n = 100 and p = .7.
uted has a mean and variance or standard deviation, which are its respective
measures of central tendency and variability (see Chapter 3). They can be
shown to equal correspondingly (e.g., Agresti & Finlay, 2009)
(5.2) μy = np, and
σ²y = np(1 − p).
We note that we can determine the mean and variance (standard deviation)
in Equations (5.2) if we know only the probability of success in each of the
trials, p, and their number, n. If we do not know this probability, but only the
number n of trials in a given sample (viz., its size), we may estimate p from
the sample as the relative frequency
(5.3) p̂ = (number of successes)/(number of trials),
and then substitute it into Equation (5.2). In this way, we will render estimates
of the mean and variance:
(5.4) μ̂ = np̂, σ̂² = np̂(1 − p̂).
(e.g., Agresti & Finlay, 2009). These estimates provide empirical information
about the population mean and variance, and may be of relevance in a partic-
ular study.
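As a small numerical illustration of Equations (5.3) and (5.4), with figures chosen purely for this sketch, suppose 7 successes were observed in n = 10 trials; the estimates could then be obtained in R as follows:
> n = 10; successes = 7                    # hypothetical sample values
> p.hat = successes/n                      # estimated success probability, Equation (5.3)
> c(p.hat, n*p.hat, n*p.hat*(1 - p.hat))   # estimates of p, of the mean, and of the variance
[1] 0.7 7.0 2.1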
> y = 0:10
> Pr.y = dpois(y, 3)
> plot(y, Pr.y, type = "h")
FIGURE 5.4.
Graph of the Poisson probabilities (5.5) (for y = 0, 1, . . . , 10, and mean 3).
year. We wish to find out the probability that in a given year there will be two
cases of this disease in the district. To accomplish this, we can use the formula
(5.5) via the simple command:
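With a yearly mean of .5, which is the value implied by the result quoted next, the command would be:
> dpois(2, .5)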
which returns
[1] .07581633
That is, the sought probability is .076 (rounded off to third decimal).
The Poisson distribution becomes even more useful as a means of approximation of the binomial distribution with large n and small probability p of success (specifically with n ≥ 100 and p ≤ .01). This approximation is quite good if, in addition, np ≤ 20. In that case, rather than compute the binomial distribution probabilities, one can simply work out the Poisson probabilities—for 0, 1, . . . , n—with a mean of np (Ott & Longnecker, 2010).
To illustrate, consider the following example. A new medicine aimed at
reducing blood pressure in elderly adults has been found to be associated with
a probability of .005 for a particular side effect, such as dizziness, say. In a
clinical trial, 200 aged persons receive the medicine. What is the probability
that none of them will experience dizziness? This probability is readily worked out with R by observing that since 200 > 100, .005 < .01, and 200 × .005 = 1 < 20, the Poisson distribution is a good approximation of the binomial probability of relevance here (with ‘×’ denoting multiplication). Then the sought probability is found out as follows (see preceding example):
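Following the preceding example, and with the Poisson mean here being np = 1, the command would take the form:
> dpois(0, 1)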
which yields .368 (rounded off to third decimal place). Thus, one can say that
the probability that none of the persons studied will experience dizziness is
just over a third. Conversely, it could be said that with a probability of nearly
two-thirds at least one elderly person will experience dizziness during the trial
of this medicine.
A continuous RV has the property that its possible values form whole inter-
vals, ranges, or continua, as opposed to discrete RVs that can take only distinct
values. In empirical behavioral and social research, one may argue that most
variables could be seen as discrete. However, when they take a relatively large
number of values (say, 15 or more), they can be considered for many empiri-
cal purposes as (approximately) continuous.
For continuous RVs, it is not possible to assign probabilities to single values they can take. This is because, at least theoretically, there are infinitely many such possible values (in fact, uncountably many).
Therefore, instead of the tables or graphs representing probabilities for dis-
crete RV values as earlier in the chapter for the discussed binomial and Pois-
son distributions, we now use a different concept to refer to the distribution
of a continuous RV. This is the notion of a ‘‘probability density function’’
(pdf, for short, also called at times ‘‘density curve’’ or ‘‘pdf curve’’). The pdf
is typically used to represent the probabilities with which a continuous RV
can take on values within certain intervals. Such a function for a continuous
RV is presented in Figure 5.5, and denoted f in the rest of this book, when no
confusion can arise.
Continuous RVs need not have symmetric pdf’s like the one displayed in Figure 5.5. Rather, they can also have nonsymmetric density
curves. Nonsymmetric pdf ’s are commonly called ‘‘skewed’’—displaying
asymmetry either to the right or to the left (see also Chapter 3). Specifically,
when their right tail is longer than the left, they are called ‘‘positively skewed’’;
alternatively, if their left tail is longer than the right, they are called ‘‘negatively
skewed.’’ Many random variables in empirical research have nearly symmetric
pdf’s, however. As we will see in the next chapter, this phenomenon can be
FIGURE 5.5.
Probability density function (pdf, density curve) for a continuous random variable.
explained with an important result in statistics, called the central limit the-
orem.
Each pdf can be seen as enclosing a certain area between its curve and the
horizontal axis of its graph. How large is this area? An important and widely
followed convention in statistics and its applications is that the area under the
pdf (density curve) of a continuous RV is assumed to equal one. This convention
is quite reasonable, as the counterpart of a related property for a discrete RV,
specifically that the sum of all probabilities for the values it can take is equal
to one (see Section 5.2).
Probability density functions resemble in form histograms (see Chapter 2).
In particular, the probability of a considered continuous RV obtaining a value
within a given interval equals the area under its pdf curve and between the ends
of the interval. Specifically, if y is a continuous RV and (a, b) a given interval
(in the range of values that y can take, with a ⬍ b), then Pr(a ⬍ y ⬍ b) is the
area under the pdf of y that is enclosed between the numbers a and b.
Hence, if one were interested in finding out what the probability is for a
continuous RV y to obtain a value in a fairly small interval surrounding
(closely) a given number, x say, this probability is approximated by f(x)·Δ, where Δ is the length of the considered interval. By letting Δ go to zero, we
readily observe that the probability of a RV y to obtain a prespecified value in
its range is zero. Nonetheless, it is possible that y indeed takes on that particu-
lar value x0, say. This is an instructive example of an event that is not the
empty one, which has, however, also a probability of zero (see Chapter 4).
Conversely, the probability of this RV taking on a value other than x0 is one.
Similarly, this is an example of an event other than the certain event, which
has a probability of one (since one cannot be really certain that the RV y will
not take the value x0).
We mention in passing that while it might be tempting to view f(y) as the probability of the continuous RV y taking the value y, this would be an incorrect interpretation! The value f(y), for a given number y, cannot be interpreted
meaningfully in any different manner than saying that it is the height of the
pdf of the RV y at the value y—since the probability of the event E: ‘‘The RV
y takes the value of y’’ is zero.
With the preceding discussion in mind, we deal in the next section with a
fundamental distribution for continuous RVs, the so-called normal distribu-
tion.
this graph of the normal distribution with μ = 100 and σ = 15, which is
approximately the distribution of the scores resulting from many intelligence
tests, we proceed as follows: (i) create a ‘‘grid’’ (‘‘net,’’ or array) of scores for
y, at which we (ii) compute the right-hand side of Equation (5.8), and then
(iii) plot the grid and this set of corresponding scores. These three activities
are accomplished with the following three consecutive R commands:
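In a form consistent with the description that follows (the object names and the use of ‘dnorm’, which evaluates the right-hand side of Equation (5.8), are reconstructed), these commands would read:
> y = seq(55, 145, .05)
> f.y = dnorm(y, 100, 15)
> plot(y, f.y)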
The first of these commands, as mentioned, makes available for the next two
commands or activities the sequence of all numbers between 55 and 145 that
are .05 units apart—starting with the smallest, 55, and ending with the largest,
145. (Later in this chapter, we elaborate on why it is meaningful to choose 55
and 145 here.) The second command calculates, using the right-hand side of
Equation (5.8) with μ = 100 and σ = 15, the values of the pdf of the normal
distribution N(100, 225) at all these numbers in the array (sequence of num-
bers) produced by the first command and denoted y. The last command plots
each of the values in the array y and its corresponding value of the pdf, f(y),
for the normal distribution under consideration. The triple of these com-
mands produce the graph presented in Figure 5.6.
We note in passing that if we wanted to display the pdf of this distribution
also for scores lower than 55 and larger than 145, we could reflect that in the
FIGURE 5.6.
Graph of the density curve (pdf) for the normal distribution N(100, 225).
first command above that created the grid for the graph. However, as we will
soon see, the probability at lower scores than 55 is practically zero, as it is for
scores larger than 145 (for this particular distribution, N(100, 225)).
Using the above three command lines, one can similarly create the graph
of the pdf of any normal distribution N(μ, σ²) of interest—all we need to do is just substitute in these three lines its parameters, μ and σ² (and modify
correspondingly the two end points of the underlying graph grid; see below).
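For the standard normal distribution N(0,1), commands of the following form could be used (the grid end points and step size are illustrative choices, not necessarily those underlying Figure 5.7):
> y = seq(-3.5, 3.5, .01)
> f.y = dnorm(y, 0, 1)
> plot(y, f.y)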
We thus create with the first command a thinner grid than the one used
earlier, due to the much smaller variance of the distribution of interest here,
N(0,1), and our desire to produce a smooth curve (i.e., a curve appearing
smooth to the naked eye). The remaining two commands are the same as
those used above in Section 5.3, with the only difference that the one produc-
ing the pdf values f.y needs to use arguments that are valid for the present
case, viz., mean of zero and standard deviation of one (as the second and
third of its arguments). These three commands produce the graph displayed
in Figure 5.7, for completeness of the current discussion. As we see from it,
most scores assumed by an RV following a standard normal distribution can
be expected to lie within the interval (−3, 3). The reason is an important
fact about the probability of a score from such a distribution to be at least a
certain distance away from the mean, which we will discuss in detail later in
the chapter.
The standard normal distribution leads us naturally to a very useful concept
in empirical behavioral and social research, which we next turn to.
FIGURE 5.7.
Probability density function of the standard normal distribution, N(0,1).
5.3.2. z-scores
The reason the standard normal distribution plays such a central role
among the normal distributions is the fact that one can ‘‘move’’ to it and back
from any other normal distribution, using a simple standardization formula.
In particular, it can be shown that if the RV y is normally distributed with
any mean and variance μ and σ², i.e., y ~ N(μ, σ²), then the so-called z-score,
(5.9) z = (y − μ)/σ,
follows the standard normal distribution, N(0,1). Conversely, if x ~ N(0,1) is a standard normal RV, then the RV y = μ + σx will have the distribution N(μ, σ²) (e.g., King & Minium, 2003).
More generally, the z-score is defined for any random variable, regardless
of its distribution (as long as its variance is positive, as assumed throughout
this book). Further, we stress that Equation (5.9) produces a z-score for any
given variable and any subject (unit of analysis). That is, any studied subject
(or aggregate of such, if being the unit of analysis) has a z-score on any vari-
able under consideration. This score is computed as in Equation (5.9). We
also notice that the z-score equals the number of standard deviations (σ) that the original y score is away from (i.e., below or above) the mean, μ. Also, we will keep in mind that unlike raw scores (y) often obtained on measures in the behavioral and social sciences, the z-scores can be negative (viz., when y < μ)
and will usually not be whole numbers.
Equation (5.9) would be applicable if we knew the population mean and standard deviation. Since these are typically unknown, in practice they are replaced by their sample counterparts, leading to the estimated z-score for the ith observation,
(5.10) ẑy,i = (yi − μ̂y)/σ̂y = (yi − ȳ)/sy,
which can be obtained in R for all observations at once with the command
> z = (y - mean(y))/sd(y)
To illustrate, consider the earlier MTA example (see also Example 2.2 in
Chapter 2), with n = 36 students being examined for their mathematics test–
taking anxiety. (As mentioned earlier, the complete data set is available in the
file CH2_EX22.dat.) For the anxiety score in that data set—the variable
denoted ‘y’ in it—the last indicated R command yields the z-scores of all 36 students, which are printed on the screen after entering at the R prompt just ‘z’ (these 36 scores are not reproduced here). Their mean, which can be obtained by entering ‘mean(z)’, is
[1] -2.878858e-17
This is a very small number, which for all practical purposes can be treated as
zero. Specifically, it is formally equal to −2.878858 × 10^(−17) (where ‘×’ denotes multiplication). We note in passing that the software R—like many
other available statistical software—uses the so-called scientific notation
to represent large or small numbers. In this notation, powers of 10 are
denoted formally by ‘e’ followed by the actual power. That is, the number
-2.878858e-17 is the same as the number −2.878858 × 10^(−17). (The symbol e displayed here should not be confused with the constant e ≈ 2.718...
mentioned earlier in the book; the latter constant plays a fundamental role
for the pdf of the normal distribution, as can be seen from the above Equation
(5.8).)
To render the variance of the z-scores, after just entering ‘var(z)’ at the R
prompt, one obtains
> var(z)
[1] 1
which is what it should be, according to our earlier discussion in this chapter.
That is, just over 84% of the entire area under the normal curve for this
distribution, N(1,5), is to the left of the point one plus one standard deviation
(i.e., to the left of μ + σ = 1 + √5 in this example). We note in passing that in R, like in
other software, square root is accomplished with the command ‘sqrt(.)’, with
the number to be square-rooted given within parentheses.
The last calculated probability implies that the area above μ + σ and under the curve of the distribution N(1, 5) is the complement of this probability to the entire area under that normal curve, i.e., 1 − .8413447 = .1586553. Due
to the symmetry of the normal distribution (see, e.g., Figure 5.6 or the for-
mula of the normal pdf in Equation (5.8), and our earlier discussion in this
chapter), it follows that the area under the normal curve and between μ − σ and μ + σ is .8413447 − .1586553 = 0.6826894, or just over 68% as mentioned above.
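For the N(1,5) distribution, this computation can be reproduced with commands of the following form (an illustrative sketch; ‘pnorm’ returns the area under the curve to the left of its first argument):
> pnorm(1 + sqrt(5), 1, sqrt(5))
[1] 0.8413447
> 0.8413447 - 0.1586553
[1] 0.6826894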
This percentage is actually correct in general for any normal distribution,
and we just illustrated it here in a particular example using the distribution
N(1,5). A simple illustration of this general result, which is quite useful in
empirical work, can be given as follows. Suppose one has administered a mea-
sure, such as an intelligence test, which is known to have (approximately) a
normal distribution in a large group of students. What is the probability of
picking at random a student with an IQ score being positioned no more than
one standard deviation away from the mean? In other words, what is the
probability of picking at random a student with a z-score smaller than one in
absolute value?
This is the same as asking the earlier question in this section, viz., what is
the area under the normal curve that is enclosed within one standard devia-
tion from the mean. And the answer we first gave above is .68 (as shown for
the general case in more advanced treatments, and demonstrated above for
the distribution N(1,5)). That is, if one were to keep picking at random stu-
dents from that large group, in the long run about two out of every three
students will not have an IQ score that is more than one standard deviation
higher or lower than the mean on this measure. (Usually IQ scores have a
mean of 100 and a standard deviation of 15, as we indicated above.)
The discussion so far in this section dealt only with the question of what
the area was under the pdf curve of a normal distribution and within one
standard deviation from the mean. It can be similarly shown that the area
under the normal curve and within two standard deviations from the mean is
approximately 95%. We can illustrate this general result on our earlier exam-
ple with the N(1,5) distribution by using R as follows (output provided
beneath command):
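A command of this form gives the needed area (a reconstruction; it returns the area under the N(1,5) curve up to two standard deviations above the mean):
> pnorm(1 + 2*sqrt(5), 1, sqrt(5))
[1] 0.9772499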
Once we have this area under the curve (up to two standard deviations above the mean), the area under the normal curve and within two standard deviations from the mean is calculated in this example with R simply as
> 0.9772499-(1-0.9772499)
[1] 0.9544998
Note that here we used R to conduct simple computations, such as the sub-
traction of the two obtained numbers. (We can of course use any of a number
of other statistical software for the same goal.) That is, the area under the pdf
curve of the normal distribution of interest and within two standard devia-
tions from the mean is just above 95%. This percentage is valid not only in
the presently considered example, but also in the general case of any normal
distribution, as mentioned above.
Further, it can be generally shown that the area under the normal curve
and within three standard deviations is just above 99%. In our current exam-
ple with the distribution N(1,5), we again obtain this area with R using the
following two computational steps:
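These two steps could, for instance, read as follows (reconstructed along the same lines as before):
> pnorm(1 + 3*sqrt(5), 1, sqrt(5))
[1] 0.9986501
> 0.9986501 - (1 - 0.9986501)
[1] 0.9973002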
For our earlier IQ example with repeated sampling from a large group of
student IQ scores, these findings imply that in the long run about one out of
20 students would have an IQ score further away from the mean than two
standard deviations, i.e., would have a z-score larger than 2 or smaller than −2. Further, about five out of 2,000 students would have a score further away than three standard deviations from the mean, i.e., a z-score larger than 3 or smaller than −3.
We stress that the general interpretations given in this section do not
depend on a particular normal distribution but are valid for any considered
normal distribution. This is because we did not refer in these interpretations
to the mean and standard deviation of a particular normal distribution being
assumed. Also, the last general finding shows that in a normal distribution
nearly all scores lie within three standard deviations from the mean, i.e., have
a z-score that is not larger than three in absolute value. Hence, unless the
group studied is very large, one would be really surprised to find a score that
far or even further away from the mean, i.e., with such a large or an even
larger z-score in absolute value terms.
Example 5.1. Suppose an IQ test score has a distribution N(100, 225) in a large
group of third graders (population of interest). What is the score that cuts off the rightmost 5% of the area under the pdf of this distribution?
This question asks about the 95th percentile of the given normal distribution,
since to the left of the requested number are 100% − 5% = 95% of the scores
in the population of concern. We obtain this percentile with R as follows (the
result is again given underneath the used command ‘qnorm’):
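A command of this form furnishes the percentile (in this plausible reconstruction, its arguments are the proportion .95 and the distribution’s mean and standard deviation):
> qnorm(.95, 100, 15)
[1] 124.6728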
That is, if a student has a score of at least 125, he or she is among the top 5%
of his or her peers in terms of IQ score on the test in question.
Here we are interested in finding the 10th percentile of the distribution N(40,
25). Using R with the command ‘qnorm’ yields (result given below com-
mand):
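One form this command can take, with mean 40 and standard deviation √25 = 5 as its second and third arguments, is:
> qnorm(.10, 40, 5)
[1] 33.59224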
That is, the sought score is about 33.6. In other words, any anxiety score lower than
34 would have the property that at least 90% of the elderly population scores
are higher than it.
Empirical studies conducted in the social and behavioral sciences are generally
interested in examining phenomena in very large populations that cannot be
completely accessed and exhaustively studied. For example, if a new teaching
method showing fourth-grade children how to perform number division were
developed and we wanted to compare it to an already established method, we
cannot easily study the entire population of fourth graders taught by the new
method. While somewhat conceivable, any attempt to study the entire popu-
lation would be impractical if not impossible, excessively resource- and time-
demanding, and for a number of reasons (to be expanded on later) just not
worthwhile attempting—at least not currently. To resolve this problem, we
usually resort to (i) taking a sample from a population of interest and
(ii) studying each member of the sample. Thereby, it is necessary that the
sample resemble (i.e., be representative of) the population of concern as well
as possible, since otherwise our conclusions may well be incorrect if not mis-
leading.
poses of this introductory book, we will only be dealing with simple random
samples. In the remainder of the book, therefore, whenever mention is made
of a sample, we will presume that it is a simple random sample (see below for
a more precise definition).
In certain circumstances in social and behavioral research, all subjects
(units of analysis) from a population may indeed be exhaustively studied, as
for instance when a census is conducted. Nevertheless, these are usually very
expensive investigations that are infrequently or relatively rarely carried out.
In this book, we will not be discussing them, but as mentioned will only be
dealing with the case of studies that are based on simple random samples
from large populations. As was done in some previous discussions, we will
denote the number of subjects (units of analysis) in the population as N, and
those in the sample as n. Usually, n is much smaller than N, which is assumed
to be a finite number, however large it may be. In the special case that n = N, we are dealing with a census. Most of the time in contemporary behavioral and social science research, n < N, or in fact more likely n << N (a symbol
used to denote that n is much smaller than N).
Before we move on, we give a more precise definition of the concept of
random sample, which is essential for the remainder of the book. A sample of
n measurements (subjects, units of analysis) selected from a population under
study is said to be a random sample if it has the same probability of being
selected from the population as any other sample of the same size. That is, a
random sample is a ‘‘draw’’ (selection) in a random fashion from a popula-
tion under consideration. This implies that every subject in a population of
interest is given an equal chance of being selected into the sample. Hence, a
random sample is not characterized by any systematic feature that would
make it different from any other possible draw from the studied population,
which draw would consist of the same number of subjects (measurements,
units of analysis). We mention in passing that in sampling theory, a random
sample defined in this way is usually referred to as ‘‘simple random sample.’’
(We will use the reference ‘‘random sample’’ throughout the remainder of the
book.)
A random sample may be obtained using tables with random numbers that
are commonly provided in many statistics texts (e.g., King & Minium, 2003).
Alternatively, one may wish to use tabled values obtained by using a compu-
terized random number generator. Either approach would be feasible if both
N and n were not very large (especially as far as N is concerned), and there is
a listing available of all population members from which the sample is selected
using such a table. A table with random numbers has the property that any
number in it is random—i.e., any one of its numbers can be chosen as a
starting point in the search for elements from the population to be drawn
into the sample. Thereby, one can move in any direction any number of steps
without violating the property of the next chosen number in this way being
random as well. For much larger studies this process is usually automated,
and its details are beyond the confines of this introductory book (e.g., Kish,
1997). Thus, as mentioned above, for the remainder of the book we assume
that a random sample from a studied population is available to us, which we
can proceed with in our analytic and modeling activities.
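When a computerized random number generator is used, as mentioned above, drawing such a sample can be sketched in R roughly as follows (the population listing and the seed are purely hypothetical here):
> set.seed(101)                       # illustrative seed, for reproducibility
> population = 1:100000               # hypothetical listing of all N population members
> selected = sample(population, 50)   # a simple random sample of n = 50 members
Each member of the listing thereby has the same chance of ending up among the 50 selected units.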
est. This set of values depends obviously on the size n of the samples that we
take thereby. Therefore, any statistic has a sampling distribution for any given
sample size that may be of interest.
In order to look at the concept of a sampling distribution from a slightly
different point of view, let us suppose that we are given a population we are
interested in—e.g., all second graders in a particular state. Let us also assume
that we are concerned with a test of writing ability for all second graders. We
could think of each student in the population as having a score on this test,
which score however we unfortunately do not yet know. In a sense, this score
is hidden from us, unless of course we actually studied with the test the entire
population—which as indicated before, is something that would most of the
time not be done in real-life empirical research. Next suppose we decide to
draw a sample of n = 50 students from this population—i.e., we fix the size
of the sample. We observe that there are very many possible samples of this
size that we could obtain. For example, if the population contained 100,000
children, it can be shown mathematically that there are in total (‘×’ is used next to denote multiplication)
(6.1) 100000!/(50! × 99950!)
possible samples. Although this is clearly an extremely large number, we could
of course readily compute it, out of curiosity, using R with the following
command (result is given beneath it):
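One way to perform this computation (not necessarily the command used originally) is with R’s ‘choose’ function, which evaluates the ratio in (6.1) without numerical overflow:
> choose(100000, 50)
The value returned is on the order of 3 × 10^185, an astronomically large number of possible samples.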
another. This set of data, ⌼, like any set of numbers, has a distribution, which
as previously illustrated we know how to obtain—e.g., using the software R as
discussed in Chapter 2. This distribution is referred to as the random sampling
distribution (RSD) of the RV under consideration here, viz., the RV defined
as the mean ȳ at the sample size of 50. More generally, the RSD is defined as
the distribution of a RV under consideration—e.g., a statistic—across all pos-
sible random samples of a given size from the studied population. If we chose
a different number as a sample size—for example, if we chose n = 100—then
along exactly the same lines we see that there is also a random sampling distri-
bution of the resulting set of sample averages across all possible samples of
size 100. This would be the RSD of the mean ȳ at a sample size 100. And so
on—for any given sample size n (n < N), there is a RSD of the mean ȳ at that
sample size.
We chose in the preceding discussion the mean as the object of our consid-
erations mostly because of its popularity in empirical behavioral and social
research. In fact, we could just as easily have chosen any other statistic (i.e.,
function of the sample data) to consider—e.g., the variance, standard devia-
tion, range, and so on. Hence, there is a random sampling distribution, at a
given sample size, for any statistic that we may be willing to consider in an
empirical setting.
A major gain in adopting the concept of RSD, as we have done in this
chapter, is the following realization: since the values of a statistic of interest,
such as the mean, across all possible samples with a fixed size need not be the
same, there is inherent variability in essentially any statistic that we could
consider in a given sample. (Of course, there would be no such variability if a
particular constant were chosen as a special case of a statistic, but we exclude
this trivial case from consideration.) This emphasizes the fact that the value
of any statistic that we could obtain in a given sample from a population
under consideration, need not repeat itself in any other sample from that popu-
lation, which we might have as well taken (drawn or selected). That is, the
specific score we obtain on a statistic of concern in a given sample from a
studied population is only one of potentially very many possible scores that
one could obtain with that statistic had another sample of the same size from
that population been drawn.
this theorem in Section 6.3.) For now, let us focus again on the previous
observation that any statistic we might be interested in, computed in a given sample from a studied population, can be seen as a random variable (RV). As a RV, that
statistic has a probability distribution and it depends on the size of the sample
in an interesting way. Yet what does this dependency actually look like?
Before we address this question in more detail, let us consider some of its
aspects, specifically as far as the RSD of the mean is concerned. In particular,
let us determine what is the actual central tendency and variability for this
RSD. That is, what are the mean and variance for the RSD of the sample
average?
Using this result, it can be shown that if for a given random variable of interest
y we denote its population mean as μy, then the mean of the statistic defined as the sample average (at a given selected sample size), ȳ, is in fact equal to μy, that is,
(6.3) μȳ = μy.
In other words, the average of the sample averages is the same as the popula-
tion mean (e.g., Roussas, 1997). Similarly, it can be shown that the variance
of this statistic is
(6.4) σ²ȳ = σ²y / n.
That is, while the mean of the RV defined as the sample average (at a given
sample size) is the same as the population mean, the variance of this RV is n
times smaller than the population variance of the original variable y of inter-
est. In other words, taking the average of a sample reduces the original vari-
ance n times, where n is the size of the sample in question. Therefore, the
sample average as a random variable is much less varied—in particular with
large samples—than any individual observation or score on the variable of
interest. Specifically, the larger the sample, the less varied is the sample aver-
age relative to any individual observation (score).
Since the mean of the random variable defined as the sample average is the
same as that of the original variable of interest y, from the preceding discus-
sion it follows that the higher the sample size is, the closer (probabilistically)
the sample average to the population mean of y. Hence, whenever one is
interested in the population mean of a RV—e.g., the score on a test of arith-
metic ability—taking a large sample from the population of concern will ren-
der in its average a score that has a distribution with an important property.
Specifically, this distribution will be quite compactly positioned around the
population mean of that variable.
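A small simulation can make Equations (6.3) and (6.4) concrete; the population, sample size, and number of replications below are hypothetical choices made only for this illustration:
> set.seed(123)
> pop = rnorm(100000, mean = 50, sd = 10)          # a hypothetical population
> means = replicate(2000, mean(sample(pop, 25)))   # 2,000 sample averages, each at n = 25
> mean(means)                                      # close to the population mean, 50
> var(means)                                       # close to 10^2/25 = 4, as per Equation (6.4)
The last two numbers come out close to 50 and 4, respectively, in line with the preceding discussion.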
This result, which is based on Equations (6.3) and (6.4), is the reason why
there is so much interest in the sample mean in statistics and its applications.
As this result suggests, one can have quite some confidence in the mean of a
large sample from a studied population, as an index informing about the
population mean on a variable of interest.
We refer to the sample average in a given sample as an estimate of the
population mean, and to the statistic sample average (outside of the context of
a particular sample) as an estimator of the population mean. We reiterate the
earlier noted difference—the former is a number, the latter is effectively a
formula or a statistic in general, as we indicated earlier in the book. We obtain
an estimate when we substitute empirical data into that estimator formula.
Section 6.2 emphasized the importance of the mean and variance of the ran-
dom sampling distribution of the mean. While these are two key aspects of
this distribution, they do not inform us about what its shape actually is. In
particular, they do not tell us whether that distribution is symmetric or not,
or whether it is particularly peaked and/or flat. In other words, these mean
and variance quantities provide us with no practically useful information
about the skewness and kurtosis of the distribution (i.e., the third and fourth
moments of the distribution). These questions are addressed by an essential result in statistics known as the central limit theorem (CLT).
To discuss the CLT, we will use again the concept of random sampling
distribution (RSD). To this end, let us consider the RSD of the mean (average)
of the scores on a variable of concern in a given sample of size n from a
studied population. Next, suppose we begin to change n—initially consider a
fairly small n, then a larger n, then an even larger n, and so on, each time
increasing n. Would there be any particular effect of this successive increase
in sample size upon the RSD of the mean? In particular, as n becomes larger
and larger, how does the shape of this distribution change, if it does at all? We
consider these issues in greater detail below.
Example 6.1: Suppose that a patient’s diastolic blood pressure is normally dis-
tributed with a mean μ = 85 mm and variance σ² = 16 mm². Assuming we
took four measurements of the patient’s blood pressure within a several-hour
time period on a given day, what is the probability that the average of these
measurements will be under 90 mm, thus indicating the patient does not have
high blood pressure? (This example assumes high blood pressure is indicated
when diastolic blood pressure is above 90 mm.)
Here we are interested in the RSD of the sample average (mean). Let us denote
by y the diastolic blood pressure at a given point in time—within the few
hours that this patient was measured—which is obviously a RV. Then ȳ would
be the mean of the four blood pressure measurements taken on the patient.
According to the earlier discussion in this section, since the initial variable of
concern, y, is normally distributed, so is also its mean, ȳ. That is, the RSD of
the mean is normal even though we have in this case only a sample of n = 4
observations. Equations (6.3) and (6.5) give the mean of this RSD as μȳ = μy = 85 and standard deviation of
σȳ = σy / √4 = 4 / 2 = 2.
We now need to work out the probability of the event E = { ȳ < 90} (using curly brackets to denote the event defined within them). To this end, we can use R as in the preceding chapter, since we actually know the mean and standard deviation of the normal distribution of relevance here, viz., correspondingly 85 and 2:
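The command would have the form (a reconstruction, with the RSD’s mean and standard deviation as its second and third arguments):
> pnorm(90, 85, 2)
[1] 0.9937903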
That is, the probability of the mean being under 90 is .994 (rounded off to
third decimal place). Hence, even four repeated measurements under the cur-
rently considered circumstances give a fairly high probability of concluding
someone does not suffer from high blood pressure, when they in fact do not
have this condition. (This is of course based on the assumption that the pres-
sure measurements are normally distributed to begin with, an assumption
that we made at the beginning of our example discussion.) We thus conclude
that the specificity of these diastolic blood pressure measurements is fairly
high for the current example settings.
Denote the sample scores as usual y1, y2, . . . , yn, and their sum y1 + y2 + . . . + yn as Σy. We stress that y1, y2, . . . , yn as well as their sum denoted for simplicity Σy
are all RVs.
Under these circumstances, the CLT for sums of random variables makes
the following statement (e.g., Ott & Longnecker, 2010). If y1, y2, . . . , yn represent observations on a studied variable from a population with a mean μy and finite variance σ²y, then the mean and standard deviation of their sum are correspondingly μΣy = nμy and σΣy = √n σy.
In addition, when the sample size n is large, the RSD of Σy will be approximately normal (with the approximation improving as n increases). If the population distribution is normal, then the RSD of Σy is exactly normal for any
finite sample size n.
We emphasize that here we are concerned with a different RSD than the
one introduced earlier in this chapter. Specifically, in this subsection we are
dealing with the RSD of the sum of random variables, rather than with the
RSD of the sample average.
for sums of random variables. In particular, we stated before that with large
samples Σy is approximately normal with a mean nμy and standard deviation √n σy. As indicated in Chapter 5, however, μy = π and σy = √(π(1 − π)) for the binomial variable in question. Hence, for a large sample, the number of successes (in a series of independent trials with the same probability of success in each) is approximately normal with a mean equal to nπ, and standard deviation √(nπ(1 − π)). In Chapter 5, we also mentioned that this normal approximation is quite good when nπ and n(1 − π) exceed 20 (see also
Agresti & Finlay, 2009). This will be the case with moderately large samples if
the probability of success is neither very small nor very large. This approxima-
tion can be used to work out probabilities related to the binomial distribution
(see Chapter 5).
The normal distribution is very convenient for many purposes in the behav-
ioral and social sciences. In addition, this distribution has been also exten-
sively studied. Its properties are well understood and examined, and are thus
available to researchers in cases where they are dealing with distributions that
are normal, or approximately so, in studied populations. For this reason, it is
important to have easily applicable ways to ascertain whether a distribution
of a random variable under consideration is normal.
How could one find out, however, whether the distribution of a RV (e.g., a
statistic) in question is normal in a population of concern? There is a relatively
straightforward graphical procedure that allows one to assess whether a distri-
bution can be relatively well approximated by a normal distribution, which
we describe in this section. To this end, suppose the random sample consists
of the n observations y1, y2,... , yn. In order to apply this graphical procedure,
we need to first rank order them (this can be done easily with R or using any
other available statistical software). Denote by y(1) ≤ y(2) ≤ . . . ≤ y(n) the rank-
ordered sequence of original scores in ascending order. For instance, if we
have the following sample of obtained anxiety scores,
(6.6) 22, 23, 28, 29, 25, 17,
First we need to read into R this sequence of scores and create thereby an
array (set) of scores, denoted x below, which is to be used subsequently with
the command ‘qqnorm’ in order to produce the NPP. As mentioned earlier in
the book, the creation of this array is easily accomplished with the command
> x = c(89, 92, 121, 100, 91, 94, 128, 84, 84, 101, 93, 105, 110, 67, 111, 104, 102, 108, 116, 87)
Once the data are read in this way, we obtain the NPP as follows:
> qqnorm(x)
FIGURE 6.1.
Normal probability plot for 20 IQ scores.
FIGURE 6.2.
Normal probability plot with superimposed line for the above 20 IQ scores.
From Figure 6.1, it would appear that the scores approximately fall along a
diagonal line in this NPP. Such a judgment can be facilitated using the com-
mand ‘qqline’, which will superimpose a line upon this set of points:
> qqline(x)
which produces for the above set of 20 IQ scores the plot in Figure 6.2. From
Figure 6.2, it seems clear that the deviations from the line are not large enough
to warrant a claim of nonnormality, given the relatively limited sample size of
n = 20. (It would be reasonable to expect some nonnormal appearance with
such a relatively small sample—e.g., Raykov & Marcoulides, 2008.) Therefore,
it may be suggested that the normality assumption is plausible for the popula-
tion of test scores from which the sample of the 20 IQ scores in this example
came.
While we realize that there is a certain degree of subjectivity involved in
assessing normality using the outlined simple graphical procedure in this sec-
tion, a number of more advanced methods have also been proposed in the
literature (e.g., see Marcoulides & Hershberger, 1997; Roussas, 1997). These
methods provide ways to examine more objectively a claim that a given sam-
ple of scores came from a population following a normal distribution.
The previous chapters in the book focused on various concepts dealing mainly
with descriptive statistics. As we mentioned earlier, however, there is another
branch of statistics, called inferential statistics (IS), that is of particular rele-
vance when one is interested in drawing conclusions about populations but
typically cannot study each member of them. In such cases, all one has access
to are samples obtained from the populations of interest. It is under such
circumstances that IS becomes especially helpful for making particular gener-
alizations. This is because the main objective of inferential statistics is to make
inferences about studied populations using information from the available
samples. Specifically, populations of concern in behavioral and social research
are usually characterized by numerical indexes that are generally unknown,
called parameters. For example, as noted before, the mean, mode, proportion,
variance, and standard deviation are all parameters. Most problems that IS
deals with are at least related to if not focused on inferences about one or
more parameters of specifically studied population(s). As an illustration of
this introductory discussion, let us consider the following two examples.
Example 7.1: Consider a test of reading ability for second graders that is admin-
istered at the beginning of the academic year in a given US state. Its population
mean, denoted μ, is unknown, and we are interested in making inferences about
this parameter, i.e., about the average reading ability score in this state’s popula-
tion of second graders.
Example 7.2: Let π denote the proportion of scores on a reading ability test for
elementary school students in a particular state, which are below a standard that
is used to designate need of remedial assistance. This proportion, π, is also a
parameter, and one may well be interested in drawing conclusions about it.
There are two general types of methods for making inferences about popula-
tion parameters, namely, estimation and hypothesis testing. Estimation is con-
cerned with answering questions about the value of a population parameter.
In contrast to estimation, hypothesis testing is mainly concerned with deter-
mining whether a population parameter fulfills a certain condition. For exam-
ple, if we are primarily interested in evaluating the population mean of
reading ability in the population of second graders in the state of California
in a given year, then we are concerned with estimation of the parameter μ.
Alternatively, if we are attempting to determine whether the proportion of
students in the state of Michigan who are in need of remedial assistance is
below 10%, then we can be interested in hypothesis testing. Thus, it should
be evident from these simple examples that estimation and hypothesis testing
have different concerns. Additionally, they also use largely different proce-
dures to accomplish their aims. Their unifying theme, however, is the use of
information in available samples for making conclusions (inferences) about
population parameters that are unknown.
(7.1) μ̂ = (y1 + y2 + . . . + yn)/n = ȳ.
Equation (7.1) in fact defines a point estimator of the population mean , as
we mentioned previously. In other words, Equation (7.1) provides a generally
valid algorithm, or procedure, of how to obtain a point estimate of the mean
based on a given sample. We furnish this estimate by substituting into its
right-hand side the individual observations obtained in the available sample.
The sample average in Equation (7.1) seems to be a very reasonable estimate
in many situations, as long as there are no ‘‘outliers’’ like data entry errors or
extremely large/small observations that do not belong to the population of
interest. However, the right-hand side of Equation (7.1) does not indicate
how good the resulting mean estimate really is. In particular, we do not know
how close the value of μ̂ in a given sample is to the population parameter μ that is the value of actual interest.
More generally, we can certainly say that since any selected sample is not
going to be identical to the population, a sample average is in general not
going to be exactly equal (other than by sheer coincidence) to the population
mean. Since we can in effect assume that an estimate is not going to be equal
to the population mean (no matter how large the sample—but presumed
smaller than the population, of course), the critical question is how far away μ̂ is from μ. The quantity μ is the one of real concern in this whole develop-
ment, and we unmistakably observe that the last question is a fundamental
query about the quality of estimation. In fact, this question applies to any
estimate (estimator) of any parameter, and not just that of the mean as con-
sidered here. There is no doubt that a sample estimate does provide informa-
tion about the population parameter. However, since we are also pretty
certain that they will not equal each other, a critical question about estimation
quality in empirical research pertains to how far the estimate is from that
parameter.
specifically its function ‘qnorm’ (see Chapter 4) to work out its specific
value as follows (with result furnished by R given immediately under this
command):
> qnorm(.025, 0, 1)
[1] -1.959964
That is, c = 1.96 (rounded off to second decimal place). Hence, from
Equation (7.2) it follows that
(7.5) Pr(μ − 1.96σ/√n < ȳ < μ + 1.96σ/√n) = .95.
We can now alternatively modify the double inequality contained in the left-
hand side of Equation (7.5), in order to obtain an expression in terms of the
unknown mean, μ, which is of actual concern to us. To this end, we subtract from all three sides of this double inequality first μ, then ȳ, and finally multiply the result by (−1) (thereby changing the direction of the inequalities involved). These three steps lead to the following result
(7.6) Pr(ȳ + 1.96σ/√n > μ > ȳ − 1.96σ/√n) = .95,
to a different CI, unless for two samples under consideration their sample
averages happen to be identical (which in general would be quite unusual).
Thus, if we draw r samples (with r ⬎ 0), we will have in general r different
resulting confidence intervals of the same population parameter—in this case
the population mean of the studied variable.
Second, we cannot be certain that the CI will cover the population mean
for a given sample from a population under consideration—and in empirical
research we are typically dealing with only a single sample from a studied
population. In fact, Equation (7.7), on which the interval estimate (7.8) is
based, only says that in a series of many samples drawn from that population,
95% of the resulting intervals (7.8) will cover the population mean (for
further details see Chapter 4 and the pertinent probability definition in terms
of relative frequency). Thereby, we do not know for which 95% of all these
samples their associated CI will in fact cover μ.
Third, it is essential to also keep in mind the right-hand side of Equation
(7.7), viz., the number .95. This number, based on the relative frequency
interpretation of probability (see Chapter 4), reflects the degree of confidence
we have in covering the population mean across a large number of repeated
samples of the same size from the studied population. For this reason, this
probability number is commonly called a ‘‘confidence level.’’ As a result, the
interval presented in (7.8) is typically referred to as a confidence interval (CI)
at the 95%-confidence level (abbreviated as 95%-CI), or alternatively as an
interval estimate of the mean at the confidence level .95.
In this connection, we readily observe that had we instead been interested
in finding a confidence interval for the mean at the .99 confidence level, we
would need to seek another number c* using the corresponding equation
(7.9) Pr(μ − c*σ/√n < ȳ < μ + c*σ/√n) = .99.
Following exactly the same developments as those given above in connection
with Equation (7.2), we could then easily use R as follows in order to find this
number c*:
> qnorm(.005, 0, 1)
[1] -2.575829
That is, c* = 2.58 (rounded off to the second decimal place) would have resulted.
Hence, the 99%-confidence interval (99%-CI) for the population mean
would be
(7.10) (ȳ − 2.58σ/√n, ȳ + 2.58σ/√n).
Similarly, if we wanted to obtain a 90%-CI of the population mean in the
first instance, the needed number c** would be obtained with R as follows:
> qnorm(.05, 0, 1)
[1] -1.644854
This number, zα/2, is often referred to as the (1 − α/2)th quantile of the standard
normal distribution. (Obviously, this is the 100(1 − α/2)th percentile of
this distribution, in our earlier terminology, but it seems to be more easily
referred to as its (1 − α/2)th quantile—this is also the reason for the frequent
use in the literature of the symbol zα/2.)
With this notation, we now obtain the following general form of the
(1 − α)100%-confidence interval for the population mean, for any α
(0 < α < 1):
(7.12) (ȳ − zα/2σ/√n, ȳ + zα/2σ/√n).
We can construct this CI in (7.12) for the population mean on a given variable
y of interest, any time we know sample size, confidence level, and the standard
deviation of the RV y under consideration. Although we will make frequent
reference to the CI in (7.12) throughout the remainder of the book, for now
we simply emphasize that it provides the CI for the population mean at any
confidence level (as determined by this number ␣). We also note in passing
that the critical number ␣ has yet another, related and very important inter-
pretation that we will deal with in the next section when discussing the topic
of hypothesis testing.
To illustrate the above developments up to this point of the chapter, let us
consider the following example involving a writing ability test.
Example 7.3 (writing ability testing): A writing ability test is known to have a
population variance of 225 and a mean value that is unknown in a studied
population of third graders from the state of Hawaii. In a sample of n = 120
students representative of that state’s third graders, the sample average score on
the test is found to be 26. What range of scores would provide 95% confidence
in covering the mean of this writing test score in repeated sampling (at this size)
from this population?
This question asks about a 95%-CI for the mean of the writing ability test
score. Denoting this test’s score by y here, we observe that at the used sample
size of n = 120 the CLT is probably going to hold (or at least provide a good
approximation to the associated normal distribution). Hence, it would be rea-
sonable to use the normal approximation for the sampling distribution of the
average ȳ. With this in mind, the earlier Equation (7.8) provides us with the
sought 95%-CI for the mean of the writing ability test score as
(ȳ − 1.96σ/√n, ȳ + 1.96σ/√n).
Based on the above given values, we can readily use the calculator capacity of
R in order to determine numerically as follows the lower and upper endpoints
(limits) of this CI:
> 26-1.96*15/sqrt(120)
[1] 23.31616
> 26+1.96*15/sqrt(120)
[1] 28.68384
That is, the 95%-CI for the unknown population mean is found to be (23.32,
28.68). In other words, we are 95% confident that the population mean could
be a number somewhere between 23.32 and 28.68 (see also discussion earlier
in this section for a stricter interpretation of this interval in the context of
repeated sampling). Of course, since the left and right endpoints of this CI
are not whole numbers, yet the writing test is likely to produce scores that are
whole numbers only, one can round the left endpoint down and the right endpoint up to the nearest whole number, respectively. In this way, we obtain the interval
(23, 29),
which is associated with a confidence level of at least 95%. We note that any-
time we make a CI wider, we actually enhance the associated confidence level.
This is because we can then be even more confident that we will enclose the
mean within the wider interval (i.e., we will enclose it more often across
repeated sampling from the studied population).
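One way to get a rough sense of the confidence level associated with the widened interval (23, 29) is to use R's calculator capacity once more. Under the normal approximation employed above, the half-width of 3 points corresponds to a z-multiplier of 3/(15/√120), and the implied confidence level is then
> 2*pnorm(3/(15/sqrt(120)), 0, 1) - 1
which returns approximately .97. That is, the rounded interval (23, 29) is associated with roughly a 97%-confidence level, in line with the statement that widening a CI enhances the associated confidence level.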
In conclusion of this section, we stress that unlike ordinary estimation pro-
cedures (viz., those often used in everyday life or in informal discussions),
inferential statistics provides us with ways to evaluate the quality of the indi-
vidual (point) estimates obtained with its methods. In particular, the merit of
a point estimation procedure is reflected in the width of the CI that can be
obtained using appropriate methods of inferential statistics. We elaborate fur-
ther on this issue next.
ter. That is, any number u < a, or any v > b, is not a good candidate then
for a population value of the parameter.
We also mentioned earlier that the quality of a parameter estimate is
reflected in the width of the resulting confidence interval. Specifically, in gen-
eral terms, it is obvious that the narrower the CI is, the higher the precision
is with which the unknown parameter has been estimated. Hence, if there
may be more than one interval estimation procedure for a parameter of con-
cern, the method producing the smallest CI at a given confidence level (and
sample size) has the highest quality.
Let us now take a second look at Equation (7.8) and notice again that the
width of the CI depends on three quantities. The following discussion in this
subsection assumes that when we consider the effect of any one of these three
quantities, the other two remain fixed.
We mentioned previously the effect of one of these quantities, the confi-
dence level. Indeed, as indicated, a higher (smaller) confidence level leads to
wider (narrower) CIs. In this connection, it is worthwhile emphasizing that a
general notation for the confidence level is as a (1 − α)100%-CI, where 0 < α
< 1. Hence, the smaller α is, the wider the CI. The reason is that smaller α's
go together with a higher percentage (1 − α)100%, and thus with a higher
confidence level. This is because a higher confidence level leads to a larger zα/2
value, in the role of the critical value c in our earlier developments of the CI
construction process. Indeed, as will be recalled, we subtracted and added c
times the standard error of the mean, to the sample average ȳ in order to
obtain the pertinent CI. Hence, we see that the smaller α is—and hence the
higher the confidence level—the further ''out'' (i.e., away from the mean of
the distribution N(0,1)) we need to go in order to find this quantile zα/2 = c.
Therefore, the smaller α is, the larger the value c is that we need to multiply
with the standard error of the mean and subtract/add to the sample mean, in
order to render the pertinent CI; this leads obviously to a wider CI.
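For instance, the quantiles zα/2 that correspond to the confidence levels 90%, 95%, and 99% (i.e., to α = .10, .05, and .01) can be compared directly with the R command ‘qnorm’ used earlier:
> qnorm(.95, 0, 1)
[1] 1.644854
> qnorm(.975, 0, 1)
[1] 1.959964
> qnorm(.995, 0, 1)
[1] 2.575829
The higher the confidence level (i.e., the smaller α), the larger the resulting quantile, and hence the wider the CI in (7.12).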
A second factor that affects the width of the CI is the sample size, n. Spe-
cifically, the larger the sample size, the smaller the standard error of the mean
and thus the shorter the CI (since then the smaller the multiplier of c is, viz.,
σ/√n—see earlier Equation (7.8)). This statement is valid more generally,
i.e., also when we conduct interval estimation of other population parameters.
That is, the larger the sample, the more precise our inference will be, since we
obtain a shorter interval of plausible population values. This observation is the
motivation of the frequently found recommendation in statistical and applied
literature to work with samples that are as large as possible. Large samples are
tantamount statistically to large amounts of information about population
parameters. Intuitively, the reason is that then the sample is closer to the
studied population, and hence there is higher potential for better inferences
about the parameters of concern. We will revisit the effect of sample size from
a slightly different perspective in the next subsection.
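As a simple illustration of this sample size effect, consider again the writing ability test of Example 7.3 (with σ = 15 and a 95%-confidence level) and compare the half-width 1.96σ/√n of the CI in (7.8) for three sample sizes:
> round(1.96*15/sqrt(30), 2)
[1] 5.37
> round(1.96*15/sqrt(120), 2)
[1] 2.68
> round(1.96*15/sqrt(480), 2)
[1] 1.34
That is, quadrupling the sample size halves the width of the CI, since the standard error of the mean is inversely proportional to the square root of n.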
The last quantity that affects the width of a CI is the magnitude of the
variance of the original studied variable, y. Looking at any of the CIs provided
in Equations (7.8), (7.10), or (7.11), the larger this variance is, the wider
will be the associated CI. As it turns out, in some experimental studies it is
possible to control the magnitude of σ, in fact even reduce it. In contrast, for
most observational or nonexperimental studies (also sometimes referred to as
correlational studies), there are limited, if any, possibilities to accomplish this
aim. A detailed discussion of such possibilities is beyond the scope of this
introductory book, and we direct the reader to more advanced treatments on
the matter (e.g., Shadish, Cook, & Campbell, 2002).
very wide CIs. On the other hand, choosing a very small confidence level,
such as 30%, leads to CIs that would cover the population parameter a very
limited proportion of times in repeated sampling. For instance, choosing 30%
as a confidence level effectively means that the population parameter (e.g., the
mean) is covered only about a third of the time in repeated sampling. This is
a very poor coverage for most practical goals. Primarily due to an often-
followed convention, confidence levels of 95% and 90%, and in some situa-
tions 99%, have been widely considered acceptable. (We stress, however,
that a CI at any level can be constructed using the procedure outlined in
Section 7.3, assuming of course that the corresponding α is between zero
and one.)
Keeping all the above issues in mind, let us return to our earlier concern
with interval estimation of the population mean on a variable of interest. For
ease of presentation, let us denote the selected tolerable error as E. In other
words, E is the width of interest of the CI at a chosen confidence level
(1 − α)100% (0 < α < 1). As before, let us assume we know the standard
deviation σ of the measure y of concern (e.g., the score on an ancient history
test). Then looking at the general form (7.12) of the CI for the mean shows
that its width is
(7.13) E = 2zα/2σ/√n.
Hence, if we also know the confidence level, we can determine the smallest
sample size n at which the tolerable error would be equal to E. This can be
accomplished by algebraically solving Equation (7.13) in terms of n, which
leads to
(7.14) n = [2zα/2σ/E]².
Looking at Equation (7.14), one can readily see that it critically depends on
knowledge of the standard deviation σ of the variable of interest, y. Of course,
in most empirical behavioral and social research studies one will not know σ
with a reasonable degree of certainty. If any knowledge about σ is available
from prior research, and that knowledge can be considered trustworthy, it
may be used in Equation (7.14). Alternatively, when no such prior knowledge
about σ is available, one could use an earlier-discussed relationship between
the range and the standard deviation to approximate the population value of
. Specifically, we mentioned in Chapter 3 that the range is generally about
four times the standard deviation for a given random variable (recall that the
range is the difference between the largest and the smallest observation in an
available sample). Thus, when there are no extreme observations (i.e., abnor-
mal or outlying observations), one can substitute the ratio r/4 for σ in Equa-
tion (7.14) and proceed with determining the sought (approximate) sample
size n. We illustrate this discussion with the following example.
Example 7.4: Suppose it is known that scores on a studied depression scale are
normally distributed in an adult population of interest, and that in a given
sample its recorded observations fall between 2 and 26. How many persons does
one need to measure with this scale, in order to achieve a tolerance error of two
units (points) at a confidence level of 95% when evaluating the population
depression mean?
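Here the recorded observations fall between 2 and 26, so the range is r = 26 − 2 = 24, and using the approximation discussed above we can take σ ≈ r/4 = 6. With a tolerable error of E = 2 and zα/2 = 1.96 for the 95%-confidence level, Equation (7.14) yields the needed sample size, which we can work out with R as follows: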
> (2*1.96*6/2)^2
[1] 138.2976
That is, we would like to use a sample size of n = 139 persons. We note in
passing that we round off to the next higher integer rather than the next
lower, in order to accomplish better precision of estimation. As indicated
above, the reason is that the larger the sample size, the higher the quality of
estimation that we thereby achieve.
distance between null hypothesis and data has some important problems.
First, it fails to account for the instability of the sample average ȳ that would
be obtained across repeated sampling from the studied population. Conse-
quently, this simple distance measure is certain to lead to a different value in
another selected sample. Additionally, and just as importantly, we cannot
judge when its values are sufficiently large in order to warrant rejection of the
null hypothesis. Hence, more sophisticated versions of a distance measure are
needed to account for the sampling variability in the test statistic used, in
order to allow more accurate judgments to be made. This is the goal of our
discussion in the next subsections.
is standard normal, i.e., follows the distribution N(0,1). Given the standard
normal distribution of z, the question now is which possible scores on it could
be considered sufficiently different (larger or smaller) from its mean, so as to
warrant rejecting the null hypothesis about its value.
To answer this question we must capitalize on our knowledge of what the
value of μ is under the null hypothesis being tested. In our earlier empirical
example, the null hypothesis tested stated that H0: μ = 18. Obviously, if this
hypothesis were true, the z-score in Equation (7.15) with μ = 18 substituted
in it would follow a normal distribution N(0,1). However, as mentioned in
Chapter 6, 95% of the scores on a standard normal variable can be expected
to fall within 1.96 standard deviations of the mean of zero (i.e., to fall within
the interval (−1.96, 1.96)). That is, scores on z that fall outside this interval
are in a sense unusual, since they occur only at a rate of about one in 20. This is
because they collectively make up the remaining 5% of the area under the
standard normal curve.
This reasoning suggests that z scores falling outside the specified interval
are associated with a probability of at most .05 (5%) of occurring. This is a rather
small probability. Indeed, if the null hypothesis were correct, then we
would have high expectation that the z-score in Equation (7.15) would fall
between −1.96 and 1.96. If this z-score then turns out to be larger than 1.96
or smaller than −1.96, we would be dealing with an event that has very small
probability or likelihood under the null hypothesis. Obtaining such a finding
would constitute evidence against the tested null hypothesis. In that case, we
would be willing to reject this hypothesis, i.e., consider H0 disconfirmed.
Returning again to our empirical example with the college aspiration scale,
we first note that at the used sample size of n = 36 the sample average may
be considered (approximately) normal due to the central limit theorem. Mak-
ing this normality assumption, and recalling from Chapter 6 that the sample
mean ȳ has variance that equals the ratio of the variance of the studied ran-
dom variable y to sample size n, it follows that the z-score of the sample
average 22.5 is equal to
(22.5 − 18) / (12/6) = 2.25
(see Equation (7.15) under H0). Since this score of 2.25 is well outside the
specified interval (−1.96, 1.96), we consider the available data to contain
evidence that sufficiently strongly points against the null hypothesis of the
mean being 18. Hence, we can view this evidence as warranting rejection of
the null hypothesis H0: μ = 18. That is, we can suggest—using statistical
hypothesis testing reasoning—that the mean of the college aspiration mea-
sures is different from 18 in the studied high school senior population.
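The value of this z-score can of course also be obtained with R's calculator capacity, using the quantities given in the example (sample average 22.5, hypothetical mean 18, σ = 12, and n = 36):
> (22.5-18)/(12/sqrt(36))
[1] 2.25
Since 2.25 > 1.96, the test statistic falls in the rejection region, confirming the decision reached above.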
If we take another look at Equation (7.15), we notice that it is determining
the distance between a particular score y on the studied random variable y
and its mean μ, which is free of the original units of measurement. The reason
is that this difference—which is the numerator in Equation (7.15)—is divided
there by the standard deviation σ. By conducting this division, we in a sense
account for the variability of the score y in the population of interest. With
this in mind, the right-hand side of Equation (7.15) with ȳ in place of y in
fact measures the distance between the data (ȳ) and the null hypothesis H0:
μ = 18. Therefore, the z-score in Equation (7.15) does not really share the
earlier discussed limitations of the simple version of a test statistic, i.e., the one
being the simple difference between sample average and hypothetical value 18.
With this in mind, the z-score provided by Equation (7.15) with ȳ in place of
y can be considered a desirable test statistic.
As it turns out, we also obtain in this z-score a ratio whose distribution is
known if the null hypothesis is true. Specifically, this ratio follows the stan-
dard normal distribution. Knowledge of this distribution allows us to identify
its areas that are associated with very small probability. This permits us to
consider a score falling in those areas as sufficiently inconsistent with the
null hypothesis so as to warrant its rejection. The set of all these scores—the
corresponding areas with limited (small) probability under the null hypothe-
sis—constitute the rejection region.
for testing the null hypothesis of the population mean on a studied variable y
being equal to a given number, μ0. We stress that this z-test is applicable when
knowledge about the standard deviation σ of y is available.
Looking at Equation (7.16), we see the test statistic in its right-hand side as
resulting from specific calculations carried out on the sample observations
y1,y2,... ,yn. That is, this z-test statistic is a function of the sample data. More
generally, test statistics are such functions of the sample data, which evaluate
its distance from the null hypothesis, in the direction of the alternative
hypothesis considered. Thereby, they do this evaluation in a way that accounts
for the variability in the collected data.
An important general feature of a test statistic, in order for it to be useful
for the purpose of statistical hypothesis testing, is that it is associated with a
completely known distribution if the null hypothesis were true. This distribu-
tion is referred to as test statistic distribution under H0, at times also called
simply a ‘‘null distribution.’’ We use the latter to find out which scores of
the test statistic are associated with very limited probability should the null
hypothesis be true. As indicated earlier, the collection of these scores repre-
sents the rejection region of relevance when testing a null hypothesis and a
rival alternative hypothesis. In our above college aspiration scale example, the
rejection region consists of the areas under the standard normal curve that
contain z-scores larger than 1.96 or smaller than −1.96, if we are willing to
consider scores that are more than two standard deviations away from the
mean as sufficiently rare under H0 so as to warrant rejection of this null
hypothesis.
cal setting we obtain a test statistic value with the property that the probability
is less than α to get a score on this statistic that is at least as inconsistent with
the null hypothesis and falls in the direction of the alternative, then we will
consider the null hypothesis as rejected.
In most applications, α = .05 is chosen. There are also empirical
cases, however, where α = .01 can be a reasonable choice, or alternatively
where α = .1 is sensible to select. Strictly speaking, there is no restriction on
the particular choice of α as a number between zero and one, and its selection
typically depends on the research question as well as the circumstances under
which the inference is to be made. Another principle to follow when choosing
the significance level α in empirical research is that its selection should not
depend on the data that are to be subsequently analyzed. That is, if one looks
first at the data, and in particular some test statistics of possible interest, and
then chooses the significance level, incorrect conclusions may well be drawn
from the statistical test results that can be seriously misleading when interpreted
substantively. For this reason, the choice of α should be made prior to
examining the analyzed data.
specifically the case whenever the sample data happens to yield a test statistic
falling in the associated rejection region, whereas H0 is actually true in the
population. Since sample data are random, in principle there is no restriction
on the value of the test statistic in a sample—and so there is nothing that
precludes it from falling in the rejection region also when H0 is in fact true in
the studied population (but we are not aware of this).
Alternatively, a Type II error is committed when the alternative hypothesis
Ha is correct in the population, but based on the sample we fail to sense this
and decide not to reject H0. That is, a Type II error is committed whenever
the alternative hypothesis is true but the empirical data happens to yield a test
statistic value not falling in the rejection region, thus leading us to the decision
of retaining the null hypothesis.
Table 7.1 presents all four possible outcomes, indicating these two error
types. The four possible outcomes result by crossing the two decisions possi-
ble—reject or do not reject (i.e., retain) the null hypothesis—with the two
possibilities regarding the status of the null hypothesis in the population, viz.,
H0 being actually true in reality or alternatively being false in the population
under investigation.
We emphasize that we do not know the true state of affairs (otherwise we
are done and do not need statistical hypothesis testing) and that we have to
make our decision to reject the null hypothesis H0 or retain it using random
data. Therefore, we can attach probabilities to each one of the four possible
outcomes in Table 7.1. Let us denote by α the probability, or rate, for committing a Type I error. (We will soon see why this is a correct choice of notation.)
Further, denote by β the probability/rate of committing a Type II error. One
would naturally want to minimize (i) the probability/rate α of making a Type
I error and (ii) the probability/rate β of committing a Type II error. If we
could only find a test statistic that accomplishes requirements (i) and (ii), it
would be a perfect choice.
However, this simultaneous minimization of both probabilities for making
a Type I error and a Type II error is not possible. In fact, anything we do to
minimize the Type I error α actually leads to an increase of the Type II error
Table 7.1 Four possible outcomes in statistical hypothesis testing and the
two errors that can be committed.

                              True State of Affairs (Population) with
                              Regard to Null Hypothesis H0
Decision                      True                  False
Reject H0                     Type I error          Correct decision
Do Not Reject H0              Correct decision      Type II error
β, and vice versa. This is because making the Type I error small in fact
decreases the probability of the null hypothesis being rejected, and this will
also have to hold for the cases when it is incorrect. That is, suppressing the
Type I error leads to an increase of the Type II error. One similarly sees the
reverse as well—suppressing the Type II error leads in effect to increasing the
Type I error.
To resolve this dilemma, the methodology of statistical hypothesis testing
provides us with the following procedure. First, one controls the Type I error
by not allowing it to exceed a certain relatively small number, such as .05.
Once having ensured this, one chooses a test statistic (statistical test), which
has the lowest possible probability for a Type II error. A special branch of
statistics called test theory has been developed, with one of its goals being the
development of formal means allowing finding of such optimal test statistics.
Although a detailed discussion of test theory is beyond the confines of this
book, we mention that all statistical tests used in this book possess the prop-
erty of being optimal in this sense.
We reiterate that while there is no objective rule to determine α, a widely
followed and commonly accepted convention is to use α = .05. However, in
certain circumstances, different values for α can be used, as indicated above.
Most often in those cases one seems to use either α = .01 or α = .1 (or some
other small value that is in principle equally plausible to use instead).
already laid the groundwork needed for developing such a test earlier in this
chapter, and hence we can capitalize on it when responding to this question
next.
As illustrated previously in the chapter, in order to develop a statistical test
we need a test statistic to measure the distance, in the direction of the alterna-
tive hypothesis, between the data and the null hypothesis. In analogy to the
preceding developments in the two-tailed alternative hypothesis case, as such
a test statistic we can again choose the z-ratio in Equation (7.16). We restate
that equation next (associated with the same number) for completeness of
this discussion:
(7.16) z = (ȳ − μ0)/(σ/√n).
This test statistic plays an instrumental role for the developments in the
remainder of this section.
the normal curve that is cut out (to the right) by the 95th percentile. Hence,
the area under the normal curve and above this percentile, which we denoted
z.05 earlier, is the rejection region we are searching here.
We can easily determine this percentile with R using again the command
‘qnorm’:
> qnorm(.95, 0, 1)
[1] 1.644854
That is, if in a given empirical setting the z-ratio (7.16) turns out to be larger
than this cutoff value 1.645, we reject the null hypothesis H0: μ = μ0 and
consider the one-sided alternative Ha: μ > μ0 as retainable. In other words,
we do not reject H0 unless z > 1.645. We illustrate this discussion next with
an empirical example.
Since this value is smaller than the critical value of 1.645, we conclude that
we do not have sufficient evidence in the sample to warrant rejection of the
null hypothesis that the state depression level equals 25. We can thus consider
as retainable the null hypothesis of this level being 25 in the researcher’s state.
the mean is lower than this hypothetical value μ0, viz., Ha: μ < μ0. Specifically, we easily realize that the rejection region of concern would consist of all
scores of the z-ratio (7.16) that are sufficiently smaller than 0. They comprise
collectively the area of 5% under the standard normal curve that is farthest away
to the left of 0. Given the symmetry of the standard normal distribution
about zero, it follows that these are all those z-test statistic values from Equation (7.16) that are smaller than −1.645.
That is, if in a given empirical setting the standard deviation σ of a random
variable y under consideration is known (and it can be assumed that its sample
average is normally distributed), and one is interested in testing the null
hypothesis H0: μ = μ0 versus the alternative Ha: μ < μ0, we reject the null
hypothesis if
(7.16) z = (ȳ − μ0)/(σ/√n)
is smaller than −1.645. Otherwise, we consider H0 as retainable. We illustrate
this discussion with the following example.
Example 7.6 (general mental ability testing): A general mental ability test is
known to have a mean of 95 and variance of 185 for the population of tenth
graders in a given US state. A study in a neighboring state enrolls n = 450 tenth
graders, for whom the average performance on the test is 93.45. Based on the
empirical evidence, can it be suggested that high school sophomores in the latter
state have a lower mean on this test than their neighbor state sophomores,
knowing that the test variance is also 185 in this state?
Here we are interested in testing the null hypothesis H0: μ = 95 about the
population mean versus the one-tailed alternative that it is smaller, i.e., Ha:
μ < 95. To this end, according to the preceding discussion in this section, we
examine the z-ratio (7.16) and check if it is lower than −1.645. We can readily
achieve this with R as follows:
> z = (93.45-95)/(sqrt(185)/sqrt(450))
> z
[1] -2.41742
Since this test statistic result is lower than the upper boundary of the rejection
region, viz., −1.645, its value falls within the rejection region. For this reason,
we reject the null hypothesis H0: μ = 95 in favor of the one-tailed alternative
Ha: μ < 95. We consider this result as empirical evidence suggesting that in
the state in question the mean of the general mental ability test is lower than
in its neighboring state.
ested in simple (point) null hypotheses to begin with, as these may not be
particularly informative. In those circumstances, one may instead be inter-
ested in examining one-tailed null hypotheses, e.g., H0: μ ≤ μ0 or conversely
H0: μ ≥ μ0, with the alternative hypothesis being its corresponding negation,
i.e., Ha: μ > μ0 or Ha: μ < μ0, respectively. (We note that since we are
considering in this chapter continuous random variables, the probability of
the variable y taking a particular value—such as μ0—is zero; hence, it is
immaterial whether we include the equality sign in the null or in the alternative hypothesis.) We refer to a hypothesis like H0: μ ≤ μ0 or H0: μ ≥ μ0 as a
composite hypothesis (and similarly for an alternative hypothesis), since it
encompasses a whole range of values rather than a single one.
The preceding developments have also paved the way for developing a test
for composite null hypotheses, on the assumption again of known variance of
the underlying random variable y and normality of the associated sample
mean ȳ. Specifically, also here we can use the z-ratio (7.16) as a test statistic.
With it in mind, for each of the above pair of null and alternative hypothesis
to be tested, we need to find next the pertinent rejection region. As illustrated
on a few occasions earlier in this chapter, one can readily see here that the
rejection region would be the set of scores on the test-statistic z in Equation
(7.16), which is sufficiently away from the null hypothesis and in the direction
of the alternative tested. In actual fact, as can be easily argued for, we can
use the same rejection regions as developed above for testing composite null
hypotheses against one-sided alternatives. Specifically, if the null hypothesis
tested is H0: μ ≤ μ0 against the alternative Ha: μ > μ0, the rejection region
consists of the uppermost 5% of the area under the standard normal curve,
i.e., above 1.645. Conversely, if the null hypothesis tested is H0: μ ≥ μ0 against
the alternative Ha: μ < μ0, the rejection region consists of the lowest 5% of
the area under the standard normal curve, i.e., below −1.645. We illustrate
this activity next with an example.
Since the value of the test statistic is lower than 1.645, it falls outside of the
rejection region for testing the null hypothesis H0: μ ≤ 40 versus the alternative Ha: μ > 40. We thus conclude that the evidence in the available sample
is not sufficient to warrant rejection of the null hypothesis H0: μ ≤ 40, which
we thus can consider retainable.
Therefore, if we are testing the null hypothesis H0: μ = μ0 against the alternative Ha: μ ≠ μ0 at a significance level of α = .01, the rejection region consists
of all z-scores from Equation (7.16) that are below −2.58 and all z-scores
above 2.58 (rounded off to the second decimal place). Similarly, if we are using a
significance level of α = .1, then the cutoff for the case of a two-tailed alternative is
1.645 (in absolute value). That is, if we are testing the null hypothesis H0: μ = μ0 against the alternative
Ha: μ ≠ μ0 at a significance level of α = .1, the rejection region consists of all
z-scores from Equation (7.16) that are below −1.645 and all z-scores above
1.645.
For one-tailed alternatives at these two significance levels, the corresponding cutoffs can be obtained in the same way:
> qnorm(.99, 0, 1)
[1] 2.326348
> qnorm(.1,0,1)
[1] -1.281552
as indicated before in this chapter). That is, if the p-value is smaller than the
significance level (e.g., .05), we reject the null hypothesis. If the p-value is not
smaller than the significance level, we consider the null hypothesis as retain-
able. Therefore, the only remaining question is how to work out the p-value
in a given empirical setting where we have a data set available and are inter-
ested in testing a given null hypothesis versus an alternative hypothesis.
Working out the p-value using specifically designated command proce-
dures, based on the reasoning earlier in this subsection (see also below), is
implemented in widely circulated statistical analysis software, such as R, for
nearly all statistical tests used in this book. The only exception is the z-test we
have been dealing with in this chapter, which makes the assumption of known
variance of the underlying random variable, y. (This assumption is most of
the time considered not fulfilled in empirical research, and for this reason
there are no automatic procedures of obtaining the associated p-values with
these programs.) Beginning with the next chapter, we will routinely use the
software-implemented procedures for determining p-values associated with
analyzed sample data, but in the rest of this chapter we will instruct R to work
them out for us. We illustrate this discussion next by revisiting the earlier
college aspiration example in Section 7.5 (for completeness, we include its
text below and refer later to it as Example 7.8).
that is smaller than −2.25, and therefore just as inconsistent with the null
hypothesis.) That is, we need to find out the area under the normal curve that
is above 2.25 and below −2.25. To this end, due to symmetry reasons, we can
work out the area under the normal curve above 2.25 and multiply it by two,
in order to take into account the two-tailed nature of the alternative hypothe-
sis. We achieve this easily with R as follows:
> 1-pnorm(2.25, 0, 1)
[1] 0.01222447
> 2*(1-pnorm(2.25,0,1))
[1] 0.02444895
Therefore, the sought p-value is .024. Since this is smaller than the conventionally used significance level of α = .05, we conclude that if the null
hypothesis were to be true, the sample we have observed from the studied
population would be associated with an event that would be very rare. For
this reason, faced with the necessity to make a choice between the null
hypothesis H0: μ = 18 and the alternative hypothesis Ha: μ ≠ 18 here, we
decide for the latter. This is because we interpret the empirical finding associ-
ated with a p-value of only .024 as strong evidence against the null hypothesis.
We thus reject the claim that in the population the mean on the used college
aspiration scale is 18. (We note that this is the same decision we arrived at
via use of the pertinent rejection region earlier in the chapter, as one would
expect.)
The discussed testing approach is appropriate when the alternative hypoth-
esis is two-tailed, as in the last example considered. If the alternative were
instead to be determined as one-tailed before looking at the data, a minor
modification would be needed in the presently outlined procedure. The rea-
son is that with one-tailed alternatives, we are considering deviations from
the null hypothesis only in the direction of the tail of the specified alternative
hypothesis, since only such deviations count as evidence against the null
hypothesis. For instance, suppose we had settled in the last example, before
looking at the data, on the one-sided alternative Ha: μ > μ0, with the same
null hypothesis as above. Then, in order to work out the p-value associated
with the test statistic value observed in the sample, we need to find only the
area beyond, i.e., to the right, of the value 2.25 found in the above example.
This area was determined easily earlier with R to be
> 1-pnorm(2.25, 0, 1)
[1] 0.01222447
i.e., half the p-value when testing the two-tailed hypothesis. Since in the sam-
ple we observed an average that was in compliance with the alternative tail, as
22.5 > 18, given this small p-value we would decide to reject the null hypothe-
sis. However, if in the sample the average were to be less than 18 (e.g., 16.85),
given that this result is not in agreement with the alternative we would decide
for retaining the null hypothesis without even having to look at the p-value.
This is because none of the evidence we have in the sample would point in
favor of the alternative, and so we cannot decide for the alternative hypothe-
sis. (Note that, as indicated earlier, we would decide to retain the null hypoth-
esis no matter how much smaller than 18 the sample average is.)
The described testing approach, as mentioned before, is often referred to
for short as a one-tailed test and the p-value associated with it as a one-tailed
p-value. We emphasize that to obtain this value we halved the p-value that
pertains to the testing problem when dealing with a two-tailed alternative.
However, we use this halved p-value to make a decision in favor of the null
or alternative only if the test statistic value in the given sample is in agreement
with the alternative hypothesis. By analogy, the process of testing a null
hypothesis versus a two-tailed alternative is called a two-tailed test, and the
resulting p-value is also referred to as a two-tailed p-value.
The discussed procedure for working out the p-value is directly applicable
also in the case when the null is a more general hypothesis, such as H0:
μ < μ0. (At times, this hypothesis is also called a one-tailed null hypothesis.
Note that then the alternative hypothesis has to be one-tailed as well.) The
reason is that in the preceding developments leading up to this procedure we
never used the fact that the null was a simple hypothesis. Hence, the same
reasoning is applicable when the null is a composite (one-tailed) hypothesis.
To illustrate this discussion, consider the last college aspiration example,
but suppose we were interested to begin with in testing the null hypothesis
that the population mean on the used scale is no larger than 18, versus the
alternative that it is larger than 18. That is, we wish to test here the null
hypothesis H0: μ ≤ 18 against the alternative Ha: μ > 18. (We reiterate that
we need to have settled on the null and alternative hypotheses, as well as on the
significance level, before looking at the data or the results of any analyses
carried out on it.) Proceeding as above, given the sample finding of a z-value
of 2.25 associated with the observed sample average of 22.5, we need to work
out the area under the standard normal curve and beyond 2.25. As elaborated
earlier in this subsection, the associated p-value is readily found with R as
follows:
> 1-pnorm(2.25, 0, 1)
[1] 0.01222447
Given this p-value of .01 (rounded off), which is smaller than the conventionally used significance level of .05, we can reject the null hypothesis that the
population mean on the used scale is no larger than 18, in favor of the alternative that it is larger than 18.
In the preceding Sections 7.6 and 7.7, we used test statistics and associated
p-values to examine research hypotheses. A criticism that can be raised against
this testing approach when the null is a simple hypothesis, such as H0: μ = μ0,
is that this type of very precise hypothesis can be argued to be most of the
time incorrect in studied populations to begin with. In addition, as indicated
earlier, the statistical power increases toward 1 (being a probability) with
increasing sample size as well. Hence, one may argue, whether this null
hypothesis is rejected or not is a matter of how large a sample is used.
An alternative approach to statistical hypothesis testing when simple null
hypotheses are of concern is based on the use of confidence intervals intro-
duced in Section 7.3 of this chapter. As we clarified there, a confidence inter-
val (CI) represents a range of plausible values for a population parameter
(e.g., the mean). Suppose one were interested in testing the simple null
hypothesis H0: μ = μ0 against the alternative Ha: μ ≠ μ0. Denote the confi-
dence interval at 95% confidence level as (l, u), where l and u are its lower
and upper endpoints. (We recall that their values are sample-dependent.)
Our preceding discussion in this chapter suggests that when testing a null
hypothesis like H0, we are in a sense asking the following question: ‘‘Given
the sample, is the hypothetical value for the parameter of interest—i.e., μ0—a
plausible value for this parameter in the population?'' For example, given the
data, does μ0 look like a possible value for the mean in the studied
population? Yet the CI is a range of plausible values by definition, i.e., in fact
the CI contains all plausible values, at a confidence level chosen prior to look-
ing at the data, for the population parameter of concern. Hence, by checking
if the CI contains the hypothetical value of the parameter, we are in fact
performing statistical hypothesis testing. That is, if the CI contains the value
of the parameter as stated in the null hypothesis, we can consider it retainable;
if it does not contain it, we can reject the null hypothesis.
This reasoning underlies a more rigorous result (e.g., Roussas, 1997) stating
that a 100(1 − α)%-CI can be used in this way to test at a significance level
α the null hypothesis stipulating a particular value for an unknown parame-
ter, versus the alternative that the latter does not equal that value. In case of
interest in the mean, this result implies that if one were concerned with testing
the null hypothesis H0: μ = μ0 against the alternative Ha: μ ≠ μ0 at significance level α, one can check if μ0 is contained in the 100(1 − α)%-CI, i.e.,
whether l < μ0 < u holds. If this CI contains μ0, then H0 is retained, but if
the CI does not contain μ0, i.e., if l > μ0 or u < μ0, then H0 is rejected.
We illustrate this approach to hypothesis testing using the earlier college
aspiration scale example (Example 7.8). Assuming the conventional significance level of α = .05, as discussed in Section 7.3, the general form of the
100(1 − α)%-CI in the current setting is presented in (7.12). (We reiterate
that here assumed is normality of the sample mean and knowledge of the
variance of the underlying random variable y, the score on the college aspi-
ration scale.) The general CI is as follows:
(ȳ − zα/2σ/√n, ȳ + zα/2σ/√n).
Thus, for our example, all we need is zα/2, which we found in Section 7.3 to
be 1.96. Using this value, the 95%-CI for the population mean of the college
aspiration scale of concern here is ('·' is used next to denote multiplication)
(22.5 − 1.96·12/6, 22.5 + 1.96·12/6),
i.e., the sought CI is (18.58, 26.42). Since this 95%-CI does not include the
hypothetical value 18 for the mean, this value is not plausible for the popula-
tion mean. For this reason, we conclude that the sample contains sufficient
evidence warranting rejection of the null hypothesis H0: μ = μ0 = 18, in
favor of the alternative Ha: μ ≠ 18. This is the same decision we arrived at
before in the chapter using the earlier discussed means of hypothesis testing,
as one could expect.
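For completeness, the two endpoints of this 95%-CI can also be obtained with R's calculator capacity, using the values from the example (ȳ = 22.5, σ = 12, n = 36):
> 22.5-1.96*12/sqrt(36)
[1] 18.58
> 22.5+1.96*12/sqrt(36)
[1] 26.42
Since the hypothetical value 18 lies below the lower endpoint 18.58, it is not contained in the interval, which is precisely the condition for rejecting H0 stated above.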
In conclusion, we emphasize that the discussions in this section are valid
for testing statistical hypotheses about any population parameter, not just for
the mean—the parameter we were mostly concerned with here. Thereby, we
can formally treat the symbol μ as denoting generally any population parameter of interest, and correspondingly μ0 as a particular value for it. If we then
proceed exactly as in this last section, we will be in a position to test any
simple hypothesis about a population parameter by using a corresponding
level confidence interval for that parameter, against a two-tailed alternative.
All discussions regarding statistical inference in the last chapter assumed that
the variance of the underlying random variable y of interest was known and
that its sample mean ȳ was normal. Nevertheless, it is rarely the case in empir-
ical behavioral and social research that one can easily assume the variance σ²
of y as known or possible to evaluate very precisely. For instance, a newly
developed measure of neuroticism cannot be typically assumed to have a
known population variance. In addition, the population of subjects for which
the measure has been developed may be rather heterogeneous. Hence, unless
the sample is very large—which may well be hard to achieve with limited
resources—it cannot be reasonably assumed that its population variance
could be known or evaluated with high precision.
In such situations, it may nonetheless be still of particular interest to carry
out statistical inference about population means on studied variables. For
these cases, the general idea underlying the z-test discussed in Chapter 7 in
fact remains applicable. Specifically, after estimation of the population stan-
dard deviation of the studied variable y by the sample standard deviation s,
and substitution of the latter for σ into the test statistic in Equation (7.16) of
that chapter, one can carry out testing and confidence interval evaluation with
the same general procedure after paying attention to one important point,
which we turn to next. This point has to do with the appropriate consider-
ation of the so-called t-ratio and t-distribution.
Equation (8.1) below can be used for testing the null hypothesis H0: μ = μ0
versus the alternative Ha: μ ≠ μ0 or even one-tailed alternative hypotheses:
(8.1) t = (ȳ − μ0)/(s/√n).
The ratio in the right-hand side of Equation (8.1), which is frequently referred
to as the t-ratio, is not identical to the z-test statistic presented in Equation
(7.16) and used throughout the preceding Chapter 7 (frequently also referred
to as z-ratio). For this reason, we use a different notation—viz., t—to desig-
nate the left-hand side of Equation (8.1). Specifically, with this notation we
emphasize the fact that in Equation (8.1) we substitute the sample standard
deviation s for the population standard deviation σ in the earlier Equation
(7.16), which standard deviation we do not assume here as known anymore.
While this may seem to be a straightforward substitution leading from
Equation (7.16) in Chapter 7 to Equation (8.1) here, it actually entails some
important consequences. In particular, due to using the estimate s of the stan-
dard deviation of y in Equation (8.1), the resulting test statistic t in it does
not follow a standard normal distribution under the null hypothesis of the
mean being equal to μ0, but instead follows a different distribution. The distribution of the right-hand side of Equation (8.1) under the assumption
μ = μ0, that is, under the null hypothesis, is referred to as the t-distribution. A
more complete reference to it, which is used at times in the literature, is as
Student’s t-distribution—bearing the pseudonym ‘‘Student’’ used by the Brit-
ish statistician William S. Gossett in the early 20th century.
As shown in more advanced treatments (e.g., Roussas, 1997), the test statistic
t in Equation (8.1) possesses this t-distribution when y is normally distributed,
and follows a fairly close distribution when y comes from a symmetric distribution.
As is the case with the normal distribution, a concept that we used throughout
the last two chapters, there are infinitely many t-distributions. Each one of these
distributions differs from any other of the class of t-distributions by an associ-
ated quantity that is called degrees of freedom, commonly denoted as df. This
quantity, df, is always a positive integer, i.e., a whole number larger than zero. The
concept of degrees of freedom plays a special role for many test statistics used
in empirical research, and for this reason we discuss it next in greater detail.
population mean μ, which we do with the sample average ȳ. That is, we obtain
in this way information about the population mean, or in other words we
extract this information from the sample—viz., from the initial n independent
sources of information about the random variable y. Hence, estimation of the
population mean is tantamount to utilizing one of the initial n pieces of
information available in a given sample. As a result, n − 1 independent pieces
of information remain about y. For this reason, the degrees of freedom (df)
associated with the t-ratio in Equation (8.1) are df = n − 1.
This procedure of obtaining the degrees of freedom associated with a statis-
tic in question remains valid also when many other unknown parameters or
a set of such are of interest. Specifically, to obtain then the degrees of freedom
we subtract the number of estimated parameters from the initial number of
independent pieces of information about a variable under consideration, as
available in a given study. (As an aside at this point, a study may involve more
than one sample, i.e., a sample from each of several distinct populations, and
all samples are then taken into account when determining the degrees of free-
dom associated with a particular statistic.) We will utilize and discuss this
degree of freedom calculation procedure on a few particular occasions in the
remainder of the book.
which the two test statistics differ from one another. In particular, this added
variability leads to the fact that the t-ratio in Equation (8.1) follows a distribu-
tion different from the standard normal, which as pointed out in the preced-
ing chapter is the distribution of the z-ratio (7.16) (see Chapter 7).
It has been shown in the literature that the variance of a random variable x
following a t-distribution with df degrees of freedom is (for df > 2; e.g., Rao,
1973)
(8.2) Var(x) = df/(df − 2).
From Equation (8.2), it is seen that the variance of the t-distribution is always
larger than one, which is the variance of the standard normal distribution
N(0,1). Therefore, the probability density function (pdf) of the t-distribution
is somewhat flatter and ''shorter'' than that of the standard normal—due to
the fact that the area under the curve of each of the two distributions is one.
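For instance, for df = 3 the variance in Equation (8.2) equals 3/(3 − 2) = 3, whereas for df = 30 it is already quite close to one, as a quick check with R's calculator capacity shows:
> 3/(3-2)
[1] 3
> 30/(30-2)
[1] 1.071429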
We illustrate this flatness feature using graphs of the standard normal and
several t-distributions, which we can obtain readily using R. As before, first
we create a ‘‘net’’ (or ‘‘grid,’’ array) of very closely positioned points on the
horizontal axis, then compute both the density function values of the standard
normal and the t-distribution with degrees of freedom df = 3, say, and super-
impose these two pdf’s. We note thereby that the pdf, or density curve, of the
t-distribution is obtained with the command ‘dt’, whereas that of the standard
normal is obtained with the command ‘dnorm’. The overlay of the two
graphed distributions is accomplished with the command ‘lines’, placing in
gray color the pdf of the t-distribution (by using the subcommand col =
"gray"). All these activities are achieved consecutively with the following
five R commands:
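A minimal version of these five commands, assuming a grid running from −4 to 4 in small steps and using the object names x, d.n, and d.t (the latter is the object referred to below for Figures 8.2 and 8.3), is the following:
> x = seq(-4, 4, by = .01)       # grid of closely positioned points on the horizontal axis
> d.n = dnorm(x)                 # pdf values of the standard normal distribution
> d.t = dt(x, 3)                 # pdf values of the t-distribution with df = 3
> plot(x, d.n, type = "l")       # plot the standard normal pdf as a curve
> lines(x, d.t, col = "gray")    # overlay the t-distribution pdf in gray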
These commands lead to the graph displayed in Figure 8.1 (the gray curve
appears thereby as thinner than the default black curve).
We notice from Figure 8.1 a characteristic property of the t-distribution.
Specifically, its tails are thicker than those of the standard normal distribution,
and thus it rises to a lower level. As pointed out before, the reason for this is
the fact that the area under the pdf of either distribution is one, due to the
earlier-mentioned pertinent convention for any pdf (see Chapter 5). This tail-
thickness property of the t-distribution relative to the standard normal distri-
bution results from the substitution of the imprecise sample estimate s for σ
FIGURE 8.1.
Graphs of the densities of the standard normal distribution (thick line) and the
t-distribution with df = 3 degrees of freedom (thin line).
into the formula for the z-ratio (7.16) (see Chapter 7), in order to obtain the
t-ratio in (8.1) as mentioned earlier in this chapter.
The difference between the t-distribution and the standard normal distri-
bution diminishes, however, with increasing sample size n. Then also the
degrees of freedom df increase, since df = n − 1, as pointed out before. In
Figure 8.2, the pdf of the t-distribution with df = 8 degrees of freedom is
overlaid with the standard normal. As seen from that figure, there are much
smaller differences now between the two distributions relative to their differ-
ences in Figure 8.1. (Figure 8.2 is obtained via the same sequence of
commands leading to Figure 8.1, with the only difference that the line
d.t = dt(x, 8) is now used instead of the earlier line d.t = dt(x, 3).
Similarly, in the following Figure 8.3, the line d.t = dt(x, 20) is used
instead in order to obtain the pdf of the t-distribution with df = 20; see
below.)
Continuing in the same way, Figure 8.3 shows the pdf of the t-distribution
with df = 20 degrees of freedom, where for most practical purposes the differences between the standard normal and that t-distribution are minimal, if of
any relevance at all. More generally, these differences become less and less pronounced
as sample size n increases (and thus degrees of freedom df = n − 1 do so as
well; practically, there is little difference between the two considered distributions for df > 30.)
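This convergence can also be seen numerically by comparing, say, the 97.5th percentiles of several t-distributions with that of the standard normal distribution (the command ‘qt’, which furnishes quantiles of the t-distribution in analogy to ‘qnorm’, is discussed further below):
> round(qt(.975, 3), 3)
[1] 3.182
> round(qt(.975, 8), 3)
[1] 2.306
> round(qt(.975, 30), 3)
[1] 2.042
> round(qnorm(.975, 0, 1), 3)
[1] 1.96
With increasing degrees of freedom, the t-quantile approaches the corresponding quantile of N(0,1).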
The three Figures 8.1 through 8.3 enable us also to recognize another
important property of the t-distribution. While having fatter tails than the
FIGURE 8.2.
Graphs of the densities of the standard normal distribution (thick line) and the
t-distribution with df = 8 degrees of freedom (thin line; the size of the figure frame is
the same as that of Figure 8.1).
FIGURE 8.3.
Graphs of the densities of the standard normal distribution (thick line) and the
t-distribution with df = 20 degrees of freedom (thin line; the size of the figure frame
is the same as that of Figures 8.1 and 8.2).
standard normal N(0,1) for any finite sample, the t-distribution with df
degrees of freedom (df > 2) is also symmetric around zero, like any normal
distribution. It will be useful to keep this symmetry feature in mind for the
remainder of this chapter.
Having highlighted the differences between the t-distribution and the stan-
dard normal distribution in the last section, we now return to our main con-
cern in this chapter. Specifically, we would like to carry out statistical
inference about a population mean from a normally distributed random vari-
able y of interest, whose population standard deviation σ is unknown. We
stress that such a scenario is quite common in empirical research, since we
typically do not know the population variance (and hence the standard devia-
tion) of most random variables of concern.
Under those circumstances, as pointed out above we can use the t-ratio
defined in Equation (8.1). Since we now know the distribution of this t-ratio,
viz., a t-distribution with df = n − 1 degrees of freedom (n denoting sample
size as usual), in all our inferential activities we can proceed as in the case
with a known standard deviation σ but referring eventually to this
t-distribution. That is, whenever a percentile (quantile) is needed, we have to
use those of the pertinent t-distribution rather than standard normal distribu-
tion as we did throughout Chapter 7, where we assumed that σ was known.
By analogy to the quantiles of N(0,1), we denote those of the t-distribution as
tα,df, for given significance level α and degrees of freedom df. We stress that
we now use two subindexes attached to the symbol of the t-distribution for
designating its quantiles, viz., α and df. This is because the t-distribution itself
depends on df (and thus on sample size), a property that is not shared with
the distribution N(0,1) of the z-ratio in Equation (7.16) (see details provided
in Chapter 7) or with any other normal distribution.
> qt(.95, 3)
[1] 2.353363
That is, the number 2.353 (rounded off to the third decimal place) cuts off
to the right 5% of the area under the density curve of the t-distribution with
three degrees of freedom. We note that this percentile of the t-distribution is
larger than the corresponding 95th percentile of the standard normal distribu-
tion, N(0,1), which we know from before is 1.645. The reason is, as pointed
out earlier, that the tails of the t-distribution are thicker than those of N(0,1).
Thus, given the small sample size, we need to go further out—that is, away to
the right from zero—in order to find the same percentiles of the t-distribution
relative to the corresponding percentiles of the standard normal distribution.
To carry out a t-test for a given mean value μ0, i.e., to test the null hypothesis H0: μ = μ0 versus the alternative Ha: μ ≠ μ0, we employ with R the
command ‘t.test’:
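> t.test(y, mu = mu0)     # y: variable analyzed; mu: hypothesized mean value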
As seen from the last line, this command has two ‘‘arguments,’’ i.e., two
entities that need to be supplied to the software in order for it to carry out
the t-test in question. The first argument is the data, i.e., the variable y for
which we are testing a hypothesis with regard to its mean (as usual, μ denotes
the mean of y in the studied population). We discussed earlier in the book
how we can read data into R. Once we do so, any variable (variable name) in
a given data set can be used for ‘y’ in the ‘t.test’ command, as long as a
research question can be addressed by a t-test carried out with regard to its
mean.
The second argument, denoted mu = mu0 in the last stated R command,
is the hypothetical value 0 of concern in the null hypothesis tested. Thereby,
the particular hypothetical value (number) is entered in place of ‘mu0’. For
instance, if we wish to test the null hypothesis H0: 22 versus the alterna-
tive Ha: ⬆ 22, for a variable denoted ‘y’ in a data file that has been read
into R, the above command is as follows (note that the name of the variable
appears as the first argument, while ‘mu’ is spelled out this way in the left-
hand side of the second argument of the following command):
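That is, assuming the variable is indeed named ‘y’ in the data set read into R, the call reads:

> t.test(y, mu = 22)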
Once the data are read in, a ‘‘data frame’’ is created where the variables are
not immediately available for our purposes to R. To make them so, we attach
the data to the R engine:
> attach(mta)
We are now ready to test the above null hypothesis H0: μ = 27.5 against the alternative Ha: μ ≠ 27.5. We achieve this, as indicated earlier, with the R
command ‘t.test’, where we now substitute 27.5 for ‘mu0’ in its second argu-
ment:
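Assuming, as before, that the variable of interest is named ‘y’ in the attached data set, the command is:

> t.test(y, mu = 27.5)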
By default, the command ‘t.test’ employs the 95% confidence level. For example, if we wanted to carry out statistical inference using the significance level α = .01 to begin with (i.e., we chose α = .01 before looking at the data), then we need to add this information as a third argument to the above R command ‘t.test’. To this end, we can use its subcommand ‘conf.level = .99’, since the confidence level corresponding to a significance level of .01 is (1 − .01)·100% = 99%. The command ‘t.test’ then looks as follows for the
last-considered example:
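With the same variable name as before, the command becomes:

> t.test(y, mu = 27.5, conf.level = .99)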
We notice that the p-value is not changed, since our data and test statistic are not changed either. If we were to test the simple null hypothesis H0: μ = 27.5 versus the alternative Ha: μ ≠ 27.5 to begin with, we need to compare the p-value reported by R to the chosen different significance level, which is .01. Since the reported p-value of 5.135 × 10⁻⁷ is smaller than .01, we reject the
null hypothesis and decide for the alternative.
Similarly, if we were interested in the first instance in one-tailed tests, we proceed in full analogy to the case of α = .05 discussed in detail in the previous subsection 8.5.3. We note in passing that the CI produced thereby will be different in case the significance level α = .01 is preset. In fact, the 99%-CI
for the population mean will be wider than before, since we will be using here
a higher confidence level (viz., 99% rather than 95% as earlier).
To conclude this section, we emphasize that although we assumed normality of the studied random variable y throughout, the t-tests and confidence intervals discussed here yield trustworthy results also under minor to mild deviations from normality. More serious deviations from normality can be handled with alternative methods that go beyond the confines of this introductory book (e.g., Ott & Longnecker, 2010; Roussas, 1997). Further, we stress that the significance level for any statistical test, and hence the implied confidence level associated with a confidence interval of interest, needs to be decided upon (i.e., chosen) before looking at the data. If this choice is made only after the data are inspected, seriously misleading results may be obtained.
So far in this chapter, we have been concerned with inferences about a single
population mean. We have emphasized thereby that in order to carry out the
associated statistical tests, we must know a priori the specific hypothetical
value of the mean that we test about with them. Many times in empirical
behavioral and social research, however, this knowledge is not available, or
there is no particular value for the mean that would be of substantive interest.
In addition, very often one is interested in inferences involving not one but
more than a single population and their means. For instance, one may well
be interested in finding out how the means on a studied variable differ in an
experimental and a control group (populations), or for male and female sub-
jects, or for different socioeconomic groups, and so on.
In this section, we deal with settings where we are concerned specifically
with two populations (groups) under consideration. We consider two cases:
(a) when the populations of interest are independent of one another, and
(b) when they are related to one another. In case (a), we will assume that we have obtained a random sample from each of the two distinct populations, with corresponding sizes n₁ and n₂, and we will use the fact that these
samples are also independent of one another. Our following developments
will be based on the next important result that will permit us to obtain a
measure of the distance between null hypothesis and data from the available
samples from these populations, and thus a test statistic for testing population
mean equality. (For a detailed demonstration of this result, see Roussas,
1997.) The result states that for independent normal random variables the
normal distribution is preserved when they are subtracted from one another
or added to each other.
Result 8.1: If y₁ ~ N(μ₁, σ₁²) and y₂ ~ N(μ₂, σ₂²) are two independent normally distributed random variables, then their difference and their sum are also normally distributed, each with the following means and variances:

(a) y₁ − y₂ ~ N(μ₁ − μ₂, σ₁² + σ₂²), and
(b) y₁ + y₂ ~ N(μ₁ + μ₂, σ₁² + σ₂²).
We point out that unlike the means, the variance of both the sum and the
difference of the two random variables is the same, viz., the sum of their
variances. This is because, irrespective of whether one takes their sum or difference, the random variability of each of the two variables, y₁ and y₂, is superimposed on that of the other in the resulting variable (viz., y₁ + y₂ or y₁ − y₂, respectively).
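This property can also be illustrated with a brief simulation in R; the particular means, standard deviations, and number of draws below are chosen arbitrarily for the illustration:

> set.seed(13)
> y1 = rnorm(100000, mean = 5, sd = 2)   # y1 ~ N(5, 4)
> y2 = rnorm(100000, mean = 3, sd = 1)   # y2 ~ N(3, 1), generated independently of y1
> mean(y1 - y2); var(y1 - y2)            # close to 5 - 3 = 2 and to 4 + 1 = 5
> mean(y1 + y2); var(y1 + y2)            # close to 5 + 3 = 8 and to 4 + 1 = 5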
Result 8.1 can be used together with the central limit theorem (CLT), if need be, to find the distribution of the difference of two sample means in the setting of interest in this section. Here we are dealing with two random samples from two unrelated populations, and ȳ₁ and ȳ₂ denote their sample averages (means). As we know from Chapter 6, their means and variances are correspondingly μ₁ and σ₁²/n₁ (for ȳ₁) as well as μ₂ and σ₂²/n₂ (for ȳ₂). Since the two samples are independent, however, so are their sample averages ȳ₁ and ȳ₂. Thus, using Result 8.1, we see that the mean and variance of their difference are correspondingly

(8.4) μ_{ȳ₁−ȳ₂} = μ_{ȳ₁} − μ_{ȳ₂} = μ₁ − μ₂, and
σ²_{ȳ₁−ȳ₂} = σ²_{ȳ₁} + σ²_{ȳ₂} = σ₁²/n₁ + σ₂²/n₂,

where μ_{ȳ₁−ȳ₂} and σ²_{ȳ₁−ȳ₂} denote the mean and variance, respectively, of the difference in sample means, ȳ₁ − ȳ₂. Hence, for the standard deviation σ_{ȳ₁−ȳ₂} of this mean difference, ȳ₁ − ȳ₂, it follows that

(8.5) σ_{ȳ₁−ȳ₂} = √(σ²_{ȳ₁} + σ²_{ȳ₂}) = √(σ₁²/n₁ + σ₂²/n₂).
We summarize the last developments in the following result, which will be of
particular importance in the remainder of this Section 8.4.
Result 8.2: The random sampling distribution of the difference of two sample
averages (means), ȳ1 ȳ2, has the following three properties:
(i) it is approximately normal for large samples (and is normal for any
sample size when each of the random variables y1 and y2 is so),
(ii) its mean is the difference between the two population means (see
first of Equations (8.4)), and
(iii) the standard error of the difference of the two sample averages is σ_{ȳ₁−ȳ₂} = √(σ₁²/n₁ + σ₂²/n₂).
This result allows us to carry out statistical inference with regard to the differ-
ence in two independent population means, following the same principles underlying the inferences about a single population mean that were discussed
earlier in this chapter.
(8.8) s_p = √[((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2)].
(8.10) (ȳ₁ − ȳ₂ − t_{α/2,n₁+n₂−2}·s_p·√(1/n₁ + 1/n₂), ȳ₁ − ȳ₂ + t_{α/2,n₁+n₂−2}·s_p·√(1/n₁ + 1/n₂)).
We can also use this confidence interval for the purpose of statistical testing
about two population means, as discussed in the next section.
(8.11) t = (ȳ₁ − ȳ₂) / (s_p·√(1/n₁ + 1/n₂)).
Analogous to our discussion in Section 8.1, the test statistic in Equation (8.11)
follows under the null hypothesis the t-distribution with n₁ + n₂ − 2 degrees
of freedom. Hence, the pertinent p-value (with a two-tailed alternative, as
currently considered) is twice the area under the density curve for this
t-distribution, which is positioned above the absolute value of the t-ratio
(8.11) when the corresponding statistics from the two available samples are
substituted into it. When a one-tailed alternative is considered to begin with
(i.e., is decided for before one looks at the data), the p-value of relevance then
is half the two-tailed p-value. This halved p-value is to be used for decision
making with respect to the tested hypotheses only if the sample averages differ
in a way complying with the alternative hypothesis; if they do not, one consid-
ers the null hypothesis as retainable.
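To illustrate how these quantities can be computed directly, the following sketch uses purely hypothetical summary statistics (the sample sizes, means, and standard deviations below are made up for the illustration and are not from any of the examples discussed):

> n1 = 15; n2 = 15                                           # hypothetical sample sizes
> ybar1 = 24.3; ybar2 = 21.9                                 # hypothetical sample means
> s1 = 3.1; s2 = 3.4                                         # hypothetical sample standard deviations
> sp = sqrt(((n1 - 1)*s1^2 + (n2 - 1)*s2^2)/(n1 + n2 - 2))   # pooled standard deviation, Equation (8.8)
> t = (ybar1 - ybar2)/(sp*sqrt(1/n1 + 1/n2))                 # test statistic, Equation (8.11)
> 2*(1 - pt(abs(t), n1 + n2 - 2))                            # two-tailed p-value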
The outlined approach is also directly applicable when one is interested in testing for a prespecified difference other than zero of the two population means. Suppose this difference is δ, a value that a researcher comes up with before looking at the data (δ ≠ 0). That is, we are interested in testing the null hypothesis H0: μ₁ − μ₂ = δ versus the alternative that they do not differ by δ, viz., Ha: μ₁ − μ₂ ≠ δ. To this end, we can still use the (1 − α)100%-CI in (8.10) for inferences about the population mean difference, at a prespecified confidence level (1 − α)100% (0 < α < 1). Specifically, if δ is covered by the CI (8.10), we retain the null hypothesis; if δ is not covered by this CI, we reject the null hypothesis, thus deciding for the alternative.
If of interest is to obtain a p-value associated with the sample mean differ-
ence, we use the correspondingly modified test statistic from Equation (8.11):
(8.12) t = (ȳ₁ − ȳ₂ − δ) / (s_p·√(1/n₁ + 1/n₂)).
With respect to one-tailed tests, as indicated before, we halve the resulting p-value and use it further only if the sample mean difference is in the direction of the alternative tested. Otherwise, we consider the null hypothesis as retain-
able.
Before we illustrate the statistical inferences discussed in this section, we
note that they yield trustworthy results also with some deviations from the
assumption of normality and that of equal variances, especially when the stud-
ied variable distributions in the two populations are symmetric and sample
sizes are fairly similar. We demonstrate next the above developments using
the following examples.
> t.test(y ~ g)
For now, we only say that the ‘~’ sign can be interpreted in this context as relating the variable before it to that after it. That is, we wish to use the t-test with regard to the variable ‘y’, once it is placed in relation to the variable ‘g’,
i.e., group membership. The last R command yields the following output.
We check whether the 95%-CI obtained for this data set covers the hypothetical value of four. Since this CI is (2.01, 2.77), as found from the last presented R output, we see that four is not covered by the CI. For this reason, we would reject the null hypothesis that the two means differ by four units on the used test, and conclude that they differ by some other number of units on the test.
where

(8.14) c = (s₁²/n₁) / (s₁²/n₁ + s₂²/n₂).
Thereby, if the degrees of freedom following from Equation (8.13) are not an
integer number, their estimate is rounded down to the nearest integer.
One may reason that it might well be difficult to argue in a given empirical setting in favor of the equality of two population variances on a random variable of interest. We could examine the two population variances for equality using an appropriate statistical test (discussed in the next chapter), in which case the null hypothesis would be H0: σ₁² = σ₂², or alternatively H0: σ₁²/σ₂² = 1 (assuming of course σ₂² ≠ 0, as would typically be the case in empirical research). However, it could also be argued that such a point hypothesis would be unlikely to be strictly correct and would be rejected with a large enough sample (as statistical power then approaches one, as we mentioned earlier in the book). Thus, it may seem unreasonable in general to make the assumption of equal variances, since according to this reasoning that assumption is almost always incorrect.
Accepting this view, one could set out in practice to use always the last
discussed approximate method by Welch and Satterthwaite when interested
in examining population mean differences. This is in fact the approach imple-
mented in R, and the software does not assume equality of the variances but
always proceeds as if they were different—the latter being clearly the more
general case. An alternative would be first to test the two population variances for equality and then proceed with either method—depending on the outcome of that variance test (see next chapter). However, such a two-step
procedure has been criticized in the literature as unnecessarily including two
tests on the same data set—a potential problem with controlling the Type I
error rate. Instead, one could always proceed with the Welch/Satterthwaite
approximate method for testing two independent mean differences. This is a
widely followed view, which we accept in the remainder of this book. As
pointed out, it is also implemented as the default option in the command
‘t.test’ in R, and the last use of this command demonstrated its application on
an empirical example.
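In terms of the R syntax introduced above (with ‘y’ denoting the studied variable and ‘g’ the group membership variable), the two versions of the test are distinguished by the ‘var.equal’ argument of ‘t.test’:

> t.test(y ~ g)                     # Welch/Satterthwaite approximate method (the default)
> t.test(y ~ g, var.equal = TRUE)   # pooled-variance t-test, if equal variances are assumed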
The methods discussed in the last section are appropriate only for the case of
two independent populations. In many empirical settings, however, one often
obtains data from two related samples—for example, when repeated measure-
ments are taken twice on the same units of analysis, or when subjects are
examined that are to some degree related to one another (e.g., siblings, cou-
ples, husbands and wives, etc.). Conducting inferences about mean differences
for two related groups (populations) is the subject of this section.
We begin by observing that the methods employed in Section 8.3 made an
essential use of the fact that the samples were drawn from two independent
populations. This is because they appropriately accounted for the fact that the
variance of the difference of the sample means was the sum of the variances
of the two sample means. When the samples are not independent, however,
this relation does not hold anymore, but instead the variance of the mean
difference depends also on the degree to which the two samples are related to
one another. Hence, using methods from Section 8.3 in the case of related
samples would waste important information about their relationship and lead
in general to incorrect results as well as possibly misleading substantive inter-
pretations.
Empirical settings where related samples are used, such as two repeated
assessments on the same subjects (units of analysis), have the potentially
important advantage that they can better evaluate, i.e., more precisely esti-
mate, their mean difference. This is because in a sense one of the measure-
ments can be taken as a control or benchmark against which the other
measurement can be compared with higher precision. This often leads to
higher power to detect mean differences, at a given sample size, in studies with related samples than in comparable studies based on independent samples.
If we denote the mean depression score before and after the trial by μ₁ and μ₂, respectively, this research question is concerned with testing the null hypothesis H0: μ₂ = μ₁ versus the alternative hypothesis Ha: μ₂ < μ₁. In
order to test these hypotheses, we first read in the data with R and focus on
the pretrial and posttrial depression score variables y1 and y2, which we can
use subsequently for our testing purposes (note the path):
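The data can be read in and attached as follows (the file name, path, and the object name ‘depr’ used here are only placeholders for the actual location and name of the data file):

> depr = read.table("C:/data/depression.dat", header = TRUE)   # hypothetical path and file name
> attach(depr)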
> t.test(y2-y1, mu = 0)
One Sample t-test
data: y2 - y1
t = -5.4771, df = 29, p-value = 6.741e-06
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-2.838387 -1.294947
sample estimates:
mean of x
-2.066667
These results include in particular the 95%-CI for the mean of the difference score D of interest here. This interval is (−2.838, −1.295) and is thus entirely below the hypothetical mean of zero—as one would expect under the alternative hypothesis Ha*. We can thus reject the null hypothesis H0* in favor of the alternative Ha*. Hence, we can interpret this finding as evidence against the claim of no mean difference in depression before and after the trial, and in favor of the statement that the depression level after the drug trial is lower.
The last few chapters have been entirely concerned with inferences about pop-
ulation means. In many empirical situations in behavioral and social science
research, however, variability in a studied variable is just as relevant as is its
mean or central tendency. An important result in statistics states that when a
random variable is normally distributed, the sample mean and variance esti-
mator are unrelated (e.g., Roussas, 1997). As a consequence, all methods deal-
ing with inferences about population means discussed up to this point in the
book do not provide any information in relation to drawing inferences about
population variability on a variable of concern.
The variability of studied variables represents the degree of individual differ-
ences on them, which differences are often of special substantive interest in their
own right. For example, variability in a given intelligence measure can be of
particular concern to educational researchers and psychologists when examining
intellectual development (and possible delays in it) in early childhood. As
another example, variance on a mathematics ability test informs us about the
potency of the resulting score (test score) to discriminate among students. This
is an important characteristic of the ability measure in its own right.
When samples are available from examined variables in populations under
investigation, their data can be utilized in order to also make inferences about
the degree of individual differences in the populations. To this end, we can
use statistical methods that have been specifically developed to address this
concern. These statistical methods will be discussed in the present chapter.
(9.1) s² = [1/(n − 1)]·[(y₁ − ȳ)² + (y₂ − ȳ)² + ... + (yₙ − ȳ)²] = [1/(n − 1)]·Σᵢ₌₁ⁿ (yᵢ − ȳ)².
We also mentioned in Chapter 2 that the sample variance s² is an unbiased estimator of the population variance σ². This implies that across repeated sampling, at a given sample size, the average of the resulting sample variances s² will equal the population variance σ². This is an important property that we
would like a considered parameter estimator to possess.
FIGURE 9.1.
Probability density functions of the chi-square distribution with d = 1, 2, 3, 5, 10, 15, 25, and 40 degrees of freedom.
A chi-square distributed random variable can take only positive values (for any positive degrees of freedom). This is due to the ratio in (9.2) being positive for any available sample. Furthermore, the mean and variance of this distribution are μ = df and σ² = 2df—i.e., they equal the degrees of freedom and twice the degrees of freedom, respectively (e.g., Roussas, 1997).
We note in passing that each of the chi-square pdf curves presented in
Figure 9.1 can be readily obtained with the following three combined R com-
mands. We notice that these are similarly obtained as other pdf’s considered
earlier in the book—observe the use now of the command ‘dchisq(x,d)’ to
obtain the value of the chi-square distribution with d degrees of freedom at
the value x:
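For instance, the curve for df = 1 results from the following commands (the plotting grid and the exact axis label are chosen here merely for convenience):

> x = seq(.01, 80, by = .01)                               # grid of positive values to plot over
> y = dchisq(x, 1)                                         # chi-square pdf values at x, for df = 1
> plot(x, y, type = "l", ylab = "Chi-square pdf, df = 1")  # plot the pdf as a curve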
whereby one consecutively updates the inputted value of the degrees of free-
dom from 1, to 2, 3, 5, 10, 15, 25, and finally to 40. We use thereby the
corresponding labeling for the vertical axis, using the subcommand ‘ylab’ of
the command ‘plot’.
> qchisq(.95, 5)
[1] 11.07050
i.e., χ²_{.95,5} = 11.07. In other words, the area under the pdf of the chi-square
distribution with 5 degrees of freedom, which is above (beyond) the number
11.07, represents 5% of the entire area under the pdf of this distribution. (We
recall from our earlier discussion in the book that by convention this entire
area is assumed to be 1, as is the area under the pdf of any distribution having
a pdf.)
The two endpoints of this interval can be obtained with R as follows (notice that the symbol ‘l’ used next is meant to be the letter l rather than the number 1):
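A sketch of this computation is as follows, assuming the pretrial depression scores are available in the variable ‘y1’ read in earlier, and using the usual chi-square–based endpoints for the confidence interval of a normal variance:

> alpha = .05
> n = length(y1)                                # number of pretrial depression scores
> s2 = var(y1)                                  # their sample variance
> l = (n - 1)*s2/qchisq(1 - alpha/2, n - 1)     # lower endpoint of the CI
> u = (n - 1)*s2/qchisq(alpha/2, n - 1)         # upper endpoint of the CI
> l; u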
Therefore, the sought 95%-CI for the pretrial depression score variance is (5.34, 15.22). We interpret this finding as suggesting, with a high degree of confidence, that the population pretrial depression variance may be between 5.34 and 15.22.
Here we check if the CI (9.5) covers the hypothetical value σ₀². If it does, we retain the null hypothesis; otherwise we reject it.
If the CI (9.5) is entirely within the null hypothesis tail, we retain H0; if the CI is entirely within the alternative hypothesis tail, we reject H0. If the CI, however, covers the hypothetical value σ₀², decision is suspended; in another
study with a larger sample, it may be the case that the CI will be entirely
within one of the tails, in which case we proceed as just outlined. If no deci-
sion is reached, we interpret the CI as suggesting with high confidence that
the population variance may be between the numbers representing its lower
and upper endpoints.
To illustrate this discussion, let us return to the last depression example (pretrial depression score; Example 8.2 from Chapter 8). If before looking at the data our hypotheses to be tested were H0: σ² = 12 versus Ha: σ² ≠ 12, then at the significance level α = .05 we can retain H0. This is because the above-found 95%-CI, (5.34, 15.22), contains this hypothetical value. However, if our hypotheses (settled on before looking at the data) were instead H0: σ² = 18 versus Ha: σ² > 18, we would retain H0. In case they were H0: σ² = 10 and Ha: σ² > 10 (again, settled on before looking at the data), then we would
suspend judgment, suggesting that the available data did not contain sufficient
evidence allowing us to decide for either of the hypotheses. We would then
only interpret the CI as suggesting with high confidence that the population
depression variance may be between 5.34 and 15.22.
Like all types of distributions discussed so far, there are infinitely many
F-distributions, and each one of them has specific values for two quantities
that are referred to as its degrees of freedom and usually denoted df1 and df2.
Thereby, df1 is the degrees of freedom associated with the sample variance estimate in the first sample, s₁², i.e., n₁ − 1; and similarly df2 is the degrees of freedom associated with the sample variance estimate in the second sample, s₂², i.e., n₂ − 1.
To illustrate graphically this discussion, in Figure 9.2 the graphs of the pdf’s
of the F-distributions are displayed, with degrees of freedom ranging across
several choices (see label of vertical axis for pertinent pdf).
FIGURE 9.2.
The probability density functions of several F-distributions (see vertical axis label for pertinent degrees of freedom).
The graphs displayed in Figure 9.2 suggest the following two properties of the F-distribution, which are more generally valid and resemble those of the chi-square distribution discussed earlier in the chapter. (A third property is discussed further below.)
We note in passing that the graphs in Figure 9.2 can be readily furnished with
R using the following commands (see earlier discussion on obtaining the
graphs in Figure 9.1 with respect to explanation of commands and subcom-
mands, and note next the use of the command ‘df(x,d1,d2)’ for obtaining the
value of the pdf of the distribution F(df1, df2) where d1 and d2 stand for df1
and df2):
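For instance, one such curve results from the following commands (the particular degrees of freedom, df1 = 10 and df2 = 20, and the plotting range are chosen here only for illustration):

> x = seq(.01, 4, by = .01)                                                # grid of positive values
> y = df(x, 10, 20)                                                        # F-distribution pdf values at x
> plot(x, y, type = "l", ylab = "F-distribution pdf, df1 = 10, df2 = 20")  # plot the pdf as a curve

Percentiles of an F-distribution are furnished by the command ‘qf’; for example, the 95th percentile of the F-distribution with df1 = 55 and df2 = 62, referred to next, is obtained with:

> qf(.95, 55, 62)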
That is, the area under the pdf curve of the F-distribution with df1 = 55 and df2 = 62 that is to the left of 1.54 is 95% of the entire area under that
curve—which is conventionally set at one.
That is, the product of the 95th and 5th percentiles in question here—with
reversed degrees of freedom—is precisely 1, which means that they are
‘‘inverse symmetric’’ in the above sense (see Equation (9.7)).
(9.10) ((s₁²/s₂²)·F_{α/2,df₂,df₁} , (s₁²/s₂²)/F_{α/2,df₁,df₂})

is a 100(1 − α)%-confidence interval for the ratio σ₁²/σ₂² of the two population
variances of concern. The lower and upper endpoints of this CI can be readily
obtained with R in full analogy to the way we obtained these limits of the CI
in expression (9.5) for a single population variance in subsection 9.1.4 of this
chapter.
The CI in expression (9.10) provides a range of plausible values for the
ratio of the variances of a studied random variable, y, in two independent
populations. As such, it could also be used to test various hypotheses about
this ratio if need be, as we have demonstrated earlier in the book in cases with
CIs for other parameters. This can also be readily accomplished with R, using
its command ‘var.test’. To this end, we need to specify the variable ‘y’ from a
given data file’s variables, whose variances are of interest, and the variable ‘g’
containing information about group membership. Then the full command is
analogous to that for testing equality in two independent population means,
‘t.test’, which we discussed in the preceding chapter:
> var.test(y ~ g)
We see that the F-ratio (9.6) equals here .472, and that the 95%-CI for the
ratio of the variances is (.20, 1.12). Since it includes one, the hypothetical
value of interest that represents population variance equality, we do not reject
the null hypothesis that the variances of the number division test scores are
the same under both teaching methods, viz., the new and old methods. That
is, discrepancies in effectiveness between the new and old methods, if any, do
not seem to be affecting the individual differences in number division test
scores.
We may note in passing that although the experimental group has less than
half the variance of the control group, since we have relatively limited samples
from either group (viz., 15 students in each) the 95%-CI is relatively wide and
covers the value of one, which represents the equality of the two group vari-
ances. Thus, this is just an illustrative example where possibly the lack of
power due to relatively small sample sizes leads to a finding of the null
hypothesis not being rejected. We will return to and elaborate on this and
several related issues in a later section of the book.
the command ‘prop.test(y, n, π₀)’, where as mentioned before ‘y’ stands for the number of successes in a series of n independent trials with probability π₀ of success in each. We illustrate next with an example.
Here we are concerned with testing the null hypothesis H0: π = π₀ against the alternative Ha: π > π₀, where π₀ = .65. We use the above-mentioned R command ‘prop.test’ as follows:
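For instance, with hypothetical counts of 35 examinees passing out of n = 50 (these two numbers are chosen here purely for illustration), the call would be:

> y = 35; n = 50              # hypothetical number of passes and number of examinees
> prop.test(y, n, p = .65)    # test H0: pi = .65 against the two-sided alternative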
We see that the sample estimate of the probability of passing the exam, i.e., of ‘‘success’’ in this setting, is .7 and complies with the alternative hypothesis Ha of interest. Hence, we halve the reported p-value of .55 and obtain the relevant one-tailed p-value here as p = .28. Assuming we use the conventional significance level α = .05, since .28 > .05 we do not reject the null hypothesis. We stress that R also provides a confidence interval for the proportion π, which represents a range of plausible values for this proportion that can be of impor-
tance in empirical research.
In conclusion of this section, we note that as discussed in Chapter 4 the normality approximation—on which the developments in the present section were based—would be satisfactory for most practical purposes when the number n of trials (sample size) is so large that both nπ and n(1 − π) are at least five. When this is not the case, for instance when n is not sufficiently large,
more advanced methods can be used within the framework of the so-called
theory of exact tests. For further details on these methods, the reader is
referred, for example, to Agresti (2002).
> c1 = c(28,26)
> c2 = c(40,40)
We are now ready to carry out the test of interest here via the command
‘prop.test(c1,c2)’ as mentioned above; the output produced by it is provided
beneath this command:
> prop.test(c1,c2)
data: c1 out of c2
X-squared = 0.057, df = 1, p-value = 0.8113
alternative hypothesis: two.sided
95 percent confidence interval:
-0.1799779 0.2799779
sample estimates:
prop 1 prop 2
0.70 0.65
Since the estimate of the probability of passing the exam, i.e., the empirical frequency of success, in the new method group is larger than that in the old method group, the data comply with the alternative Ha of concern in this example. We can thus proceed by halving the reported p-value of .81, obtaining a one-tailed p-value of .42 (rounded off). At the conventional significance level of .05, this p-value is larger, and therefore we retain the null
hypothesis H0. Hence, the available data does not contain sufficient evidence
warranting rejection of the hypothesis that both groups have the same proba-
bility of success on the exam. This may be due to both methods being about
equally effective in teaching multiplication to second-grade students.
comes for each type of outcome, i.e., for each outcome denoted 1, 2,..., k, for
notational simplicity. (In the above licensure example, one could denote the
outcome pass by 1, fail by 2, and no decision by 3, say. We do not impart any
numerical features to these numbers, however—i.e., we will not treat them as
‘‘real’’ numbers in this section.) This concept of expected number of out-
comes addresses the question of how many outcomes of a particular type one would anticipate observing under H0, if one were to make a given num-
ber of repetitions (observations) of the experiment in question. For example,
if there were 100 examinees, and we hypothesize that the probabilities for the outcomes 1, 2, and 3 are .25, .50, and .25, respectively, we would expect to observe 25 passes, 50 failures, and 25 with no decision. More generally, if n denotes the number of trials, that is, the number of repetitions of the considered experiment with k outcomes, or sample size, then nπᵢ is the expected number of outcomes of the ith type (i = 1,..., k). We denote these expected numbers by Eᵢ; that is, Eᵢ = nπᵢ is the number of expected outcomes of type i in a series of n independent repetitions of the multinomial experiment under consideration (i = 1,..., k).
More than 100 years ago, the British statistician Karl Pearson developed
the following so-called chi-square statistic for testing the null hypothesis H0: π₁ = π₀₁,..., πₖ = π₀ₖ:

(10.11) χ² = Σᵢ₌₁ᵏ (nᵢ − Eᵢ)²/Eᵢ.
The null hypothesis is rejected if χ² > χ²_{1−α,k−1}, with the last symbol denoting the 100(1 − α)th percentile of the chi-square distribution with k − 1 degrees of freedom.
This testing procedure can be carried out with the software R as follows.
First we enter the data, using the ‘c’ command (for ‘concatenate’) as earlier in
the chapter. We submit to R thereby one row containing the numbers n1, n2,...,
nk of observed outcomes of each of the k types of concern. Then we enter a
second row, using the same command, containing the hypothetical probabili-
ties π₀₁, π₀₂,..., π₀ₖ. Having done this, we use next the R command ‘sum’, in
order to obtain the right-hand side of Equation (10.11) defining the test statis-
tic of concern here. Once knowing the resulting test statistic value corre-
sponding to our data, we use the command ‘pchisq(chi-square, df)’ to work
out the probability to obtain a value smaller than the test statistic value.
Finally, we subtract this probability from 1, in order to furnish the p-value
associated with the null hypothesis H0 being tested.
We illustrate this testing procedure by revisiting and extending the earlier
licensure example in this section. Suppose that in a professional licensure
exam administered to 98 persons, 28 failed, 48 passed, and for 22 a decision
of pass or fail could not be made. Is there sufficient evidence in the data to
warrant rejection of the hypothesis that the probability of passing the exam is
.5, for failing it .25, and for a no-decision outcome .25? Here we have n
= 98, n₁ = 48, n₂ = 28, n₃ = 22, and π₀₁ = .50, π₀₂ = .25, π₀₃ = .25. The null
hypothesis is H0: π₁ = .50, π₂ = .25, π₃ = .25. As outlined above, to test this
hypothesis we first create two rows—denoted next ‘y’ and ‘p’—with our data
and hypothetical probabilities, respectively, by ‘concatenating’ their corre-
sponding values:
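In line with the observed counts and hypothesized probabilities just stated, these two lines are:

> y = c(48, 28, 22)       # observed numbers of pass, fail, and no-decision outcomes
> p = c(.50, .25, .25)    # probabilities hypothesized under H0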
With these two lines, we effectively communicate to R our data and hypothe-
sis to be tested. Next we use this software to work out the test statistic value
(10.11) and print it to the screen—the result is given beneath the last pre-
sented command next:
We need next the p-value associated with this test-statistic value of 6.102, which we obtain as mentioned above, noting that the pertinent degrees of freedom are df = 3 − 1 = 2:
> 1-pchisq(6.102, 2)
[1] 0.04731159
We next symbolize by nᵢ. the number of outcomes with the ith value of the first variable, x (i = 1,..., r). Similarly, let n.ⱼ designate the number of outcomes with the jth value on the second categorical variable, y (j = 1,..., c).
Obviously, the ni.’s give the frequencies with which the random variable x
takes its values. At the same time, the n.j’s are the frequencies with which the
random variable y takes its values. That is, while the nij’s represent the
observed cell frequencies corresponding to the joint distribution of both vari-
ables, x and y (i.e., when they are both considered together), the ni.’s are the
observed frequencies that correspond to the marginal distribution of x and
the n.ⱼ’s represent the observed frequencies for the marginal distribution of y.
For this reason, the ni.’s and n.j’s are called observed marginal frequencies, cor-
respondingly for the variables x and y. We use the data in Table 10.1 to illus-
trate these frequencies, where we attach their symbols to the entries of the
cells of the table, as presented in Table 10.2.
Using all these observed frequencies—i.e., the observed cell and marginal
frequencies nij, ni., and n.j—we can estimate the population probabilities for
each of the two categorical variables to take their values (outcomes). To this end, let us denote these probabilities by πᵢ. and π.ⱼ, respectively (i = 1,..., r, and j = 1,..., c). That is, πᵢ. is the population probability of x taking a value in its ith category, and π.ⱼ is the population probability of y taking a value in its jth category. Further, denote by πᵢⱼ the population probability of x taking a value in its ith category and of y taking a value in its jth category, i.e., for an outcome falling in the (i,j)th cell of the contingency table generated by the simultaneous consideration of the variables x and y.
Based on the earlier developments in this chapter, in particular in Section
10.2, we can estimate these probabilities in the following way (see Equation
(10.1); e.g., Agresti, 2002):
(10.12) π̂ᵢ. = nᵢ./n and π̂.ⱼ = n.ⱼ/n,

i = 1,..., r and j = 1,..., c. We next note that if the null hypothesis of independence of x and y is correct, then this would imply from the probability-related discussion in Chapter 4 that πᵢⱼ = πᵢ.·π.ⱼ. In fact, the latter equation is equiva-
Table 10.2 Contingency table for licensure exam outcome by gender, with the generic notation nᵢⱼ, nᵢ., and n.ⱼ for cell frequencies and marginal frequencies (i = 1, 2; j = 1, 2, 3).

            Pass         Failed       No Decision    Total
Males       n₁₁ = 45     n₁₂ = 23     n₁₃ = 12       n₁. = 80
Females     n₂₁ = 55     n₂₂ = 22     n₂₃ = 15       n₂. = 92
Total       n.₁ = 100    n.₂ = 45     n.₃ = 27       n = 172
lent to the null hypothesis of independence, and we can formally use the
notation H0 for it in the remainder of this chapter. That is, the hypothesis of
independence of x and y is tantamount to the null hypothesis
(10.13) H0: πᵢⱼ = πᵢ.·π.ⱼ (for all possible pairs of i = 1,..., r and j = 1,..., c).
Since in order to proceed with hypothesis testing we need to work out the
distribution of a selected test statistic under the null hypothesis, we have to
find next the expected cell frequencies on the assumption of H0 being true.
With this assumption, based on the discussion in Chapter 4, it follows that we
can estimate the population cell probabilities by the product of the pertinent
population marginal probabilities for the two variables involved, x and y. That
is, the population cell probabilities can be estimated as π̂ᵢⱼ = π̂ᵢ.·π̂.ⱼ, so that the expected cell frequencies under independence are estimated as Eᵢⱼ = n·π̂ᵢ.·π̂.ⱼ = nᵢ.·n.ⱼ/n (i = 1,..., r; j = 1,..., c).
Having obtained the expected cell frequencies, it follows from the discussion
in the preceding Section 10.3 (see Equation (10.11) and immediately preced-
ing and succeeding discussion) that one can test the null hypothesis (10.13)
of independence using the test statistic
(10.16) χ² = Σᵢ,ⱼ (nᵢⱼ − Eᵢⱼ)²/Eᵢⱼ,
where the sum is over all r·c cells of the CT generated by the variables x and y. This test statistic (10.16), as mentioned earlier, follows for large n the chi-square distribution with df = (r − 1)(c − 1) degrees of freedom when the
null hypothesis of independence is true (e.g., Rao, 1973). Testing this null
hypothesis in an empirical setting is thus completed by checking if the value
of (10.16) then is higher than the 100(1 − α)th percentile, for a given significance level α, of the chi-square distribution with df = (r − 1)(c − 1) degrees of freedom.
The outlined testing procedure is readily carried out with the software R.
To this end, first we need to communicate the data to it. We accomplish this
using a new command, ‘matrix’, where we concatenate all rows of the sample
contingency table. (A ‘matrix’ is here a reference to a rectangular array of
numbers, or a table of numbers.) Subsequently, we use the R command
> lis.exam = matrix(c(45, 23, 12, 55, 22, 15), nrow = 2, byrow = T)
This command initially enters the six cell observed frequencies in Table 10.1
into a single row with six elements, using the ‘concatenate’ command, or ‘c’.
To inform R, however, that these six numbers come from a contingency table
with two rows and three columns, we use the subcommand ‘nrow = 2’—
stating that we have two rows in the resulting table. We then request from R
to treat the first provided three cell frequencies as coming from row 1 and
then the remaining as coming from the second row of the contingency table
of concern here. That is, these six numbers are presented above row-wise.
This is signaled to the software using the subcommand ‘byrow = T’. This
subcommand effectively states ‘‘it is true that the numbers are given row by
row.’’ We can always print the resulting matrix to the screen to assure our-
selves that R has correctly represented internally the data to be analyzed subse-
quently:
> lis.exam
[,1] [,2] [,3]
[1,] 45 23 12
[2,] 55 22 15
Hence, the internal representation of the data, achieved with the last preced-
ing R command, is indeed correct (cf. Table 10.1). Now that we have the data
read into R, we test the null hypothesis of interest as indicated above, viz.,
with the ‘chisq.test’ command (its result is provided beneath the command):
> chisq.test(lis.exam)
Pearson’s Chi-squared test
data: lis.exam
X-squared = 0.5209, df = 2, p-value = 0.7707
These results indicate a fairly high p-value, definitely higher than any rea-
sonable significance level α that could have been preset here. For this reason,
we do not reject the null hypothesis of licensure exam outcome being unre-
lated to gender. We conclude that there is not sufficient evidence in the data
to warrant a suggestion that the male and female distributions of the numbers
of pass, fail, and no-decision outcomes are different.
Correlation
The previous chapters were primarily concerned with issues related to analy-
ses of single variables. Specifically, we considered one variable at a time and
discussed a number of issues related to the distribution of this variable. Many
research questions in the behavioral and social sciences, however, involve at
least two studied variables. In particular, the vast majority of inquiries in these
sciences pertain to examining the potential relationships between two or more variables measured on a sample of subjects (or units of analysis) or in a population, as in census studies. The remainder of this book provides an intro-
duction to several methods that have been developed to address such research
questions and how their application is conducted using the software R. In this
chapter, we lay the basic foundations of these methods, which will be particu-
larly helpful when dealing with more general statistical methods of analysis
and modeling.
One of the simplest ways to address the question of whether two studied
variables are related to one another or not is by using the concept of correla-
tion. As its name suggests, this concept is concerned with the degree to which
two random variables, say x and y, co-relate or co-vary with one another.
That is, this question focuses on the extent to which variability in one of the
variables is associated with variability in the other. In other words, an impor-
tant aspect of this query is whether considered subjects with relatively large
values on one of the variables (say x) tend to be among the subjects also with
relatively large values on the other variable (y). Conversely, the query might
be whether subjects with relatively small values on x tend to be among the
subjects with relatively large values on y.
For example, a question that is frequently of interest in the field of educa-
tional research asks whether there is any relationship between Scholastic Apti-
tude Test (SAT) scores and college-freshman-year success. That is, are
189
students with high SAT scores among those who also have high grade point
averages (GPA scores) at the end of their college freshman year? This is a
typical question about what may be termed a positive relationship between
two random variables—SAT and GPA scores here. Another question of inter-
est may be whether the number of hours an elementary school child watches
television in a week is related to his or her grades in school. That is, do stu-
dents who watch television for many hours a week tend to be among those
with overall lower grades in school? This is a typical question about what may
be termed a negative, or inverse, relationship among two variables.
These and many other queries are concerned with whether there is a rela-
tionship between two random variables under consideration, x and y. These
questions specifically ask if above average (or below average) scores on x tend
to be associated with above average (or below average) realizations on y, as in
a positive relationship case; or perhaps conversely, whether below (or above)
average scores on x tend to go together with above (or below) average realiza-
tions on y for the same subjects or units of analysis, as in a negative relation-
ship case.
All of the above-posited questions are quite different from most of the
questions we have been concerned with so far in the book. Specifically, in
the previous chapters we addressed many queries about individual random
variables, considered separately from other random variables—i.e., we simply
looked at them one at a time. For example, in earlier chapters we asked vari-
ous questions about the graphical representations of scores on a given vari-
able, about central tendency and variability on a specified measure, about
probability distributions, or about mean and variance differences in popula-
tions for a given variable that was considered separately from other variables
of interest in a study. A common feature underlying all these questions was
our focus on one variable at a time and various features of its distribution.
Those previous questions do differ from the ones we just posed above. Spe-
cifically, the current questions are intrinsically concerned with two random
variables rather than just one. Their essential feature is that they consider the
pair of variables simultaneously. Indeed, a question about the relationship
between two variables cannot be meaningfully raised unless one considers a
pair of variables at the same time, rather than one at a time.
is a qualitative definition of a new concept that we have not dealt with earlier
in the book. For this reason, it is fitting to illustrate here its underlying idea
with some empirical data.
Example 11.1 (examining the relationship between SAT score and GPA dur-
ing freshman year of college): In a small study, n = 14 freshmen in a university
reported their SAT scores given below (denoted x) and their GPAs (denoted y)
at the end of their first year of college. We are concerned with the question of
whether there is a discernible linear relationship between the variables SAT and
GPA. The students’ data are as follows (with ‘id’ = subject identifier, ‘x’ =
SAT score, and ‘y’ = freshman year GPA score) and are contained in the file
CH11_EX1.dat.
id x y
1 1520 3.8
2 1330 2.9
3 1460 3.6
4 1250 2.7
5 1270 2.7
6 1310 2.9
7 1450 3.5
8 1530 3.7
9 1560 3.8
10 1470 3.3
11 1510 3.5
12 1370 2.9
13 1550 3.9
14 1260 2.6
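Assuming the file CH11_EX1.dat resides in R’s current working directory, the data can be read in and made available as follows (the object name ‘ex11.1’ is chosen here merely for convenience):

> ex11.1 = read.table("CH11_EX1.dat", header = TRUE)   # read the data; adjust the path if needed
> attach(ex11.1)                                       # make the variables id, x, and y available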
> plot(x,y)
FIGURE 11.1.
Plot of SAT scores (x) and GPA scores (y) for n = 14 college freshmen.
holds also for smaller scores—students with lower SAT scores tend to be
among those with lower GPA scores. We emphasize that this is a tendency,
i.e., a discernible linear trend, rather than a perfect linear relationship. In
actual fact, the observation (of an imperfect relationship) that we just made
from Figure 11.1 is characteristic of the vast majority of empirical research in the behavioral and social sciences. In these disciplines, one should rarely, if ever, expect to observe perfect (or even close to perfect) linear relationships between studied variables, for at least two main reasons. First, there is usually
a considerable amount of measurement error involved when evaluating studied
variables in these disciplines. This error may contain ‘‘pure’’ measurement
error, or error that results from measuring in fact a variable that is not really
identical to the one of actual interest (e.g., Raykov & Marcoulides, 2011).
Second, relationships between variables of concern in empirical research can-
not be realistically expected to be so simple as to be described well by straight-
line (linear) functions.
Nonetheless, the knowledge of an approximate linear relationship—i.e., of
a discernible linear relationship as in Figure 11.1—can be very useful in these
and many other disciplines. Such a relationship indicates potentially a very
important association between two variables of interest. This association is
particularly helpful when trying to predict one of the variables based on
knowledge of the other, as we will be doing later in the book. In fact, the
stronger this association is, the better these predictions are going to be. More-
over, knowledge of an existing association between two or more variables—by
considering them pairwise, say—allows us to deepen our knowledge about
these and other variables. This may well help us answer further and more involved scientific questions, e.g., those pursued with more advanced statistical methods.
> cov(x,y)
[1] 52.37363
That is, the estimate provided in Equation (11.2) for the data set of Example
11.1 is 52.37 (rounded off). Hence, the covariance between SAT and GPA
scores is positive here. Thus, in this study, students with a higher-than-average
SAT score—which is 1417.14—tend to be among those with a higher-than-
average GPA score, which is 3.27. (Using the command ‘mean’, as in Chapter
3, one can readily obtain these sample averages from the available data.) Con-
versely, freshmen with below-average SAT scores tend to be among those with
below-average GPA scores. However, we underscore that we cannot say
whether the covariance of 52.37 found in this sample is large (strong),
medium, or small (weak). This, as mentioned, is a main limitation of the covar-
iance coefficient. For this reason, we now move on to a discussion of another,
closely related index of linear relationship between a pair of variables. That
index has become very popular in the social and behavioral sciences, in part
because it is possible also to interpret its magnitude.
(11.3) ρ_{X,Y} = Cov(X,Y) / √(Var(X)·Var(Y)),

where Var(.) denotes variance (see Chapter 3, and below in this section). In a given sample of size n, the correlation coefficient can be estimated by

(11.4) ρ̂_{X,Y} = (1/n)·Σᵢ₌₁ⁿ ẑ_{X,i}·ẑ_{Y,i},
that is, by the average product of the corresponding z-scores associated with
the variables involved. This estimate is frequently denoted alternatively by rX,Y.
Equations (11.3) and (11.4) present what is frequently referred to as the
‘‘Pearson product moment correlation coefficient,’’ bearing the name of its
originator, Karl Pearson (for continuous random variables; see, e.g., Ray-
kov & Marcoulides, 2011, for a nontechnical discussion of possible correla-
tion coefficients between discrete variables). These two equations also show
that since the covariance is a symmetric coefficient, so is also the correlation
coefficient. Furthermore, from these equations it is readily observed that the
correlation and covariance coefficients are zero at the same time. That is, if
there is no linear relationship between the variables x and y, their correla-
tion is zero as is their covariance. Then, as pointed out earlier, there is no
discernible linear pattern in the scatterplot of the associated data points
(xᵢ, yᵢ) (i = 1,..., n, or N as in a census). Conversely, if their covariance is
zero, so is also their correlation and vice versa, and there is no linear rela-
tionship between x and y. In addition, Equations (11.3) and (11.4) indicate
that the correlation coefficient is not defined if any of the variables involved
has zero variance. That is, if at least one of the variables is a constant—
which as pointed out earlier in the book is equivalent to its variance and
standard deviation being zero—then their correlation does not exist, while
their covariance is zero.
Another important feature of the correlation coefficient also follows from
its definition. Specifically, as shown in more advanced sources, the covariance
between two variables never exceeds the product of their standard deviations
(e.g., Roussas, 1997). Since the covariance is divided by this product in the
definition of the correlation coefficient (see Equation (11.3)), it follows that
any correlation lies within the closed interval [−1, 1]. That is, for any two random variables x and y (with positive variances),

(11.5) −1 ≤ ρ_{x,y} ≤ 1.
> cor(x,y)
[1] 0.9713206
That is, the correlation between SAT scores and freshman GPA scores is fairly strong, estimated at .97. Hence, there is a strong positive linear relationship in this data set between success on the SAT and in the freshman year of college.
As a result, the pertinent scatterplot (see Figure 11.1) shows a clearly discern-
ible linear pattern. Examining Equations (11.3) and (11.4), this correlation
effectively suggests that the rank ordering of the SAT scores is nearly the same
as that of the GPA scores—as could be seen using the ‘rank(x)’ and ‘rank(y)’
procedures with R.
We illustrate further the concept of correlation by considering several addi-
tional examples.
Example 11.2 (SAT and college junior year GPA): The following scores stem
from a small study of n = 12 college juniors.
id x y
1 1460 3.1
2 1250 2.8
3 1270 2.8
4 1310 3.2
5 1450 3.1
6 1530 3.6
7 1560 3.7
8 1470 3.1
9 1510 3.6
10 1370 3.0
11 1550 3.7
12 1260 2.9
The correlation coefficient for these two variables is estimated with R as fol-
lows, and their scatterplot is provided in Figure 11.2:
> cor(x,y)
[1] 0.8749617
> plot(x,y)
FIGURE 11.2.
Scatterplot of SAT scores (x) and GPA scores (y) in the junior year of college.
Example 11.3 (hours watching television and score on a test of reading abil-
ity): The following data stem from a study with n = 16 elementary school students, where the relationship between hours watching television per week and their score on a reading ability test is of interest (‘x’ = weekly hours of television watching, ‘y’ = reading ability test score):
id x y
1 6 32
2 8 32
3 7 28
4 6 32
5 5 30
6 9 24
7 5 32
8 4 30
9 5 36
10 3 29
11 5 37
12 7 32
13 4 29
14 5 32
15 8 29
16 9 28
For this data set, the estimated correlation coefficient is obtained with R as
follows, and the associated scatterplot is displayed in Figure 11.3:
> cor(x,y)
[1] -0.3965841
> plot(x,y)
FIGURE 11.3.
Scatterplot for data on weekly hours TV watching (x) and a reading ability test (y).
is far less clustered along an (imaginary) line. We stress that here we are deal-
ing with a negative correlation—children watching more television tend to
have lower test scores. Similarly, due to the negative correlation, there is a
negative slope for a possible line through the scatter of points.
(11.7) t = r_{x,y}·√(n − 2) / √(1 − r²_{x,y}).
We can respond to this question using R. To this end, we use the command
‘cor.test’, as follows (after first reading in the above data; output of testing
procedure given after this command):
> cor.test(x,y)
data: x and y
t = -15.8708, df = 54, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9449202 -0.8464944
sample estimates:
cor
-0.907448
We see from this output that the correlation coefficient between reaction
time and GPA score is negative, estimated at −.91, which is a strong correla-
tion. (Recall that what determines whether a correlation is strong, medium,
or weak is the absolute value of this coefficient.) That is, there is a clearly
discernible linear relationship between reaction time and academic success in
freshman year at college. Specifically, there is a marked tendency of students
with short reaction times to have high GPA scores and for students with long
reaction times to have low GPA scores.
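As a check, the t-ratio in Equation (11.7) can be recovered directly from the reported correlation and degrees of freedom (n = 56 follows from df = 54, since df = n − 2):

> r = -0.907448                  # correlation estimate reported by 'cor.test'
> n = 56                         # sample size implied by df = 54
> r*sqrt(n - 2)/sqrt(1 - r^2)    # reproduces, up to rounding, the reported t-value of about -15.87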
Further, the associated p-value with the t-test statistic (11.7) is very small—
less than .001—and in fact smaller than any reasonable preset significance
level. Thus, if we were to test the null hypothesis H0 of no correlation against
the two-tailed alternative Ha: ρ_{x,y} ≠ 0, we would reject H0 and conclude that
there is evidence in the data indicating a discernible linear relationship
between reaction time and GPA score. If we were to have settled, before look-
ing at the data, on the alternative Ha: ρ_{x,y} < 0—as one could expect on subject-
matter grounds in this example—then given that the correlation estimate is
in its direction we would halve the reported p-value. Having thus obtained a
p-value that is even smaller than the reported one above, i.e., smaller than
.001, we would conclude that we could reject H0 in favor of the hypothesis
that there is a negative correlation between reaction time on the used cogni-
tive test and GPA score for college freshmen.
We similarly note that with this command ‘cor.test’ the software R also provides a confidence interval (CI) at the conventional 95% confidence level. This CI stretches here from −.94 through −.85. In general, the CI of the correlation coefficient
is nonsymmetric, due to the fact that the random sampling distribution of the
correlation coefficient is itself nonsymmetric in general (unless its population
value is zero; e.g., King & Minium, 2003). What underlies then the procedure
of obtaining the CI for the correlation coefficient is the so-called Fisher trans-
formation, which is a nonlinear function (e.g., Agresti & Finlay, 2009). This
nonlinearity leads to the general lack of symmetry in the resulting CI. The
details of how one could use R for obtaining such a CI, along with the under-
pinnings of the procedure used for this purpose, go beyond the confines of
this book. A nontechnical discussion of them can be found, for example, in
Raykov & Marcoulides (2011).
We would like to point out here that the CI of the correlation coefficient
can also be used to test hypotheses about a correlation coefficient if need be,
where the hypothetical value is not zero. For example, if the null hypothesis
> cor(x,y)
[1] 1.89315e-16
> plot(x,y)
FIGURE 11.4.
Plot of the variables x and y in Example 11.5.
The variable that is of focal interest in a given study is often called a dependent
variable (DV) or alternatively a response or outcome variable. This is the vari-
able for which we may also wish to make particular predictions. We will
assume in this and the following two chapters that the DV is (approximately)
continuous, or can be treated as a continuous random variable. Alternatively,
the variable that we use as a means for predicting the DV is called an indepen-
dent variable (IV) or a predictor. At times it is also referred to as an explanatory
variable or covariate in the literature—to emphasize that it typically covaries
with the DV.
In the remainder of this chapter, we will be concerned with the case when
there is just a single predictor, i.e., a single IV variable that is used to predict
a DV of interest. The more general case, with two or more predictors (IVs),
will be covered in the next chapter. As before, we denote the two variables
under consideration by x and y, but we now need to pay attention to which
variable is a DV and which an IV. We use throughout the symbol y to denote
the DV or response variable and x to denote the IV. That is, we wish to discuss
methods for studying the relationship between x and y in such a way that will
FIGURE 12.1.
Graphical representation of the intercept and slope in a simple linear regression
model.
(12.1) that we have not yet mentioned. That parameter plays an important
role when evaluating the intercept and slope based on sample data from a
studied population, as well as when assessing model adequacy.
As we can see from Equation (12.3), S is the sum of the squared deviations between the individual observed scores yi on the dependent variable y, on the one hand, and what their predictions would be according to the model (12.1), viz., a + bxi, on the other hand (i = 1, ..., n).
Estimating the model parameters a and b by minimizing the sum of
squared individual observations between data and model predictions on the
dependent variable is a main part of a major parameter estimation method in
statistics. This method is the least-squares (LS) method, which has a long his-
tory stretching over two centuries. The resulting estimates of the SLR model
parameters a and b are referred to as LS estimates of the intercept and slope,
respectively. It can be shown that these LS estimates result as follows (note the
use of caret to denote parameter estimate and bar to denote sample mean):

(12.4)  â = ȳ − b̂x̄,

(12.5)  b̂ = rx,y (sy / sx),

where rx,y denotes the sample correlation between x and y, whereas sx and sy designate their standard deviations, respectively.
Equation (12.5) provides a very useful relationship between the concepts of
regression and correlation, which is worthwhile emphasizing here. Accord-
ingly, the slope of the SLR model (12.1) is directly related to the correlation
between the two variables involved. Specifically, the population slope b equals the population correlation coefficient ρx,y multiplied by the ratio σy/σx of the standard deviations of the two variables involved. From this it follows that the correlation and slope are identical only when the two variables are equally varied (in the population or sample).
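For instance, assuming two variables are held in R vectors x and y, the estimates in Equations (12.4) and (12.5) can be computed directly and compared with the coefficients later reported by 'lm':

> b.hat = cor(x, y)*sd(y)/sd(x)     # slope estimate, Equation (12.5)
> a.hat = mean(y) - b.hat*mean(x)   # intercept estimate, Equation (12.4)
> c(a.hat, b.hat)                   # should agree with the output of lm(y ~ x)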
Let us illustrate these initial developments on linear regression by revis-
iting the data from Example 11.4 provided in Chapter 11. As will be recalled,
in that example we were interested in describing the relationship between
GPA score and reaction time on a cognitive test. Next, we will be interested
in predicting the GPA score for a subject with a known reaction time. (Data
are available in the file CH11_EX4.dat.) To estimate the intercept and slope
of the SLR model (12.1), which process is frequently referred to as SLR
model estimation or model fitting, one can readily use the software R. To
achieve this aim, we can employ the command ‘lm’—for ‘linear model-
ing’—in the following way (one could interpret the sign ‘⬃’ next as saying
that the variable on the left is the DV and that on the right as the IV, or
alternatively that the former is regressed upon the latter, for the SLR model
in Equation (12.1)):
> lm(y⬃x)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
5.606733 -0.005162
This sum of squares is often referred to as the residual sum of squares (RSS) and at times
denoted as SSres (where the SS typically stands for ‘‘sum of squares’’).
The average RSS is obtained by dividing the sum of residual squares by the
associated degrees of freedom (df):
(12.8)  s²e = RSS/df = SSres/df = Σ(yi − ŷi)² / (n − 2),
where ŷi = â + b̂xi denotes for convenience the model-based prediction of the dependent variable value for the ith subject (case, or unit of analysis; i = 1, ..., n). In Equation (12.8), we divide the RSS by n − 2, since here df = n − 2. The reason is the fact that in order to obtain the right-hand side of Equation (12.8), we need to estimate two parameters, a and b—since otherwise the predicted values ŷi cannot be computed. That is, we need to subtract 2 from the initial number of independent observations, n, as discussed earlier in the book (Chapter 5), in order to obtain the associated degrees of freedom. Thus, the degrees of freedom are n − 2. We emphasize that the degrees of
freedom are representative here of the number of independent observations
remaining after estimating the intercept and slope. Since we want to obtain
an average measure of the individual residuals’ magnitude, the most natural
measure would be their average square. Given that we have in effect n − 2
individual contributions to the RSS, this average measure of error magnitude
is provided by the ratio RSS/df, as defined in Equation (12.8).
The square root of s2e, i.e., se, is called the standard error of estimate (SEE)
associated with the SLR model (12.1). (As throughout this book, we assume
positive square roots being taken or considered.) The SEE represents the sam-
ple standard deviation around the regression line, or residual standard devia-
tion of the dependent variable y. The SEE can be interpreted as the standard
deviation of the response variable scores yi, which are associated with a given
value of the explanatory variable, xi (under the assumptions of the SLR model;
i = 1, ..., n).
We conclude this section by stressing that there are in fact three estimates, â, b̂, and se, which we obtain when fitting the SLR model (12.1). This is
because the SLR model is associated with three parameters, namely, the inter-
cept, slope, and SEE—the standard deviation of the population outcome
scores at a given value of the explanatory variable. Their estimates are
obtained when this model is fitted to sample data.
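For a fitted model object in R (the name 'fit' is used here only for illustration), these three estimates can be extracted as follows:

> fit = lm(y ~ x)
> coef(fit)             # estimated intercept and slope
> summary(fit)$sigma    # estimated standard error of estimate (SEE), se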
The SEE has two limitations, however, when used as a measure of how well the model fits the data. First, the units of the SEE—being the original units of measurement of the DV y—are generally difficult to interpret in subject-matter terms, if at all possible in the social and behavioral sciences. Second, there is no upper limit on the magnitude of the SEE (while it is obviously bounded from below by zero). As a consequence, it is hard to know in an empirical setting whether the SEE is large or small. Only when it is zero do we know that there is no error in the model, but this will rarely if ever be the case in applied research.
For this reason, it would be desirable to have another measure of model fit,
which is scale-free and normed, preferably between zero and one, say. Such a
measure is the so-called coefficient of determination, usually referred to as the
R-squared index and denoted R2. This measure is commonly defined as the
proportion of variance in the dependent variable that is explained in terms of
the independent variable via the SLR model (12.1). That is, the R2 definition
is as follows:
(12.9)  R² = (Syy − RSS) / Syy.
In Equation (12.9), Syy is the variance of the response variable y multiplied by (n − 1), i.e., Syy = Σ(yi − ȳ)², with the sum taken over i = 1, ..., n. At times, Syy is referred to as total sum of squares and denoted by SStotal. We emphasize that in the numerator of Equa-
tion (12.9) we have the difference between the DV variance Syy on the one
hand and the residual (or error) sum of squares RSS, on the other hand. That
is, the numerator—being this difference—represents the part of the DV variability that is actually accounted for by the SLR model (12.1) when the goal is to explain variance in the DV y, as it usually is when this model is considered in empirical research.
From Equation (12.9), it obviously follows that: (a) 0 ≤ R² ≤ 1 holds always;
(b) scores of R2 close to 1 indicate that most variability in the DV y is
explained in terms of the IV x via the fitted SLR model; and (c) scores of R2
close to 0 or even 0 indicate that very little variability in the DV y, if any, is
explained in terms of the IV x via the SLR model.
Furthermore, the R2 index defined in Equation (12.9) can be shown to equal the square of the maximum possible correlation between the DV y and a linear function of the IV x that is of the form a + bx (where a and b can be chosen without any constraint in order to attain this maximal correlation). This correlation coefficient, which equals R, is in general referred to as multiple correlation—in particular when there are multiple predictors (see next chapter). In the special case of a single predictor—i.e., an SLR model—the R2 index can be shown to equal the square of the correlation coefficient between x and y, i.e., ρ²x,y (in the population, or its sample counterpart in the sample; e.g., King & Minium, 2003). That is, in the SLR model (12.1) of concern in this chapter, the R2 is

(12.10)  R² = ρ²x,y.
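In a sample, this identity is easy to verify directly; for instance, with the data again held in vectors x and y:

> cor(x, y)^2                     # squared sample correlation
> summary(lm(y ~ x))$r.squared    # R-squared of the fitted SLR model; the two values coincide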
It can also be demonstrated (e.g., Pedhazur, 1997) that the R2 index is an
‘‘optimistic’’—i.e., positively biased—estimator of the population percentage
of variance in the DV y that is explained in terms of the IV x via the SLR
model (12.1). This positive bias results from the fact that as mentioned the R2
is associated with the result of an optimization procedure. This is the minimi-
zation of the critical sum of squared deviations S in Equation (12.3), which is
carried out on the sample data in order to estimate the intercept and slope in
the SLR model. However, since a sample is a proper subset of the studied
population (unless obtained in a census study), it is expected to differ from
the population and it is this difference that is typically referred to as sampling
error. As a result of the presence of sampling error, the R2 index obtained in the sample (which, as mentioned, differs from the population) is in general an overestimate of the population proportion of variance in the DV that is explainable in terms
of variability in the IV via the SLR model. (This argument also holds in cases
with more than one predictor, that is, in multiple regression analysis models,
a topic we take up in the next chapter.)
To counteract this bias, the so-called adjusted R2 index, denoted R2a, aims
at correcting for the bias and is generally a better estimate of this population
proportion of explained variance. For the SLR model, the adjusted R2 is
defined as follows (Pedhazur, 1997):
(12.11)  R²a = 1 − (1 − R²)(n − 1)/(n − 2).
The multiple correlation coefficient R, the R2, and the adjusted R2 can be
readily obtained with the software R when the SLR model is fitted to data. To
this end, we need to request a ‘‘summary’’ of the output associated with the
SLR model (12.1), which we obtain as earlier with the command ‘lm’. For the
previous reaction time example (Chapter 11, Example 11.4), we obtain these
goodness-of-fit indexes using the command ‘summary’, after creating first an
object that is defined as equal to the output produced by the command ‘lm’.
This object is named next ‘slr.ex.11.4’ (output presented beneath command):
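That is, commands of the following form are used:

> slr.ex.11.4 = lm(y ~ x)
> summary(slr.ex.11.4)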
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-4.741e-01 -1.225e-01 -5.682e-05 1.323e-01 3.678e-01
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.6067330 0.1452832 38.59 <2e-16 ***
x -0.0051619 0.0003252 -15.87 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This output initially shows that the individual residuals range in value from
−.47 to .37. (It can be shown that in the population their mean is zero; e.g.,
Pedhazur, 1997.) For now, we do not pay attention to the ‘standard error’,
‘t-value’, and ‘Pr’ columns in this output, as well as to its last line—we will
revisit these output parts again in the next section. The last presented (sum-
mary) output shows that the SEE is estimated at 0.2063 and is associated with
54 degrees of freedom (since n − 2 = 56 − 2 = 54 here). We cannot really
say, however, what this number actually means; that is, by only looking at it
we cannot really say whether it is large or small. We could of course compare
the magnitude of the SEE to the magnitude (standard deviation) of the DV y, which is the GPA score, and note that a residual scatter of about .21 GPA points around the regression line remains unexplained by the SLR model (12.1) that is fitted.
However, even such an observation is not a particularly insightful interpreta-
tion when it comes to evaluating how well a used model fits a data set.
For this reason, we examine the R2 index associated with the SLR model
and its adjusted version. The R2 index is .82 here (rounded off to second
decimal place), whereas its adjusted value only marginally differs from it (viz.,
by less than .01). That is, about 82% of the individual differences (variability)
in the GPA score were possible to explain in this data set in terms of reaction
time using the SLR model
GPA = a + b·(Reaction time) + e.
This percentage, 82%, is a fairly sizable proportion of explained variance in the
used sample, at least relative to what may be considered the majority of cur-
rent behavioral and social research studies.
In conclusion of this section dealing with overall model fit, we stress that
from the definition of the R2 index it also follows that its numerical value is
specific to a fitted model. An SLR model is a rather simple model, and it is
possible that if its R2 is not considered to be high, then an extended model—
e.g., a model using additional predictors or nonlinear (e.g., higher) powers of
the explanatory variable x—may have substantially higher R2 values (see Sec-
tion 12.6 later in this chapter for an example). Before we address aspects of
this issue in more detail later in this and in the next chapter, however, we
discuss how we can make conclusions about the population values of model
parameters and coefficients of determination of SLR models under consider-
ation.
When we fit the SLR model (12.1) to a given data set, we use only data from
an available sample rather than from the entire population of actual interest.
Hence, the question that naturally arises now is what could be said about the
intercept and slope of this model in the studied population itself. A related
question asks whether the model is capable of explaining any population vari-
ance in the DV in terms of population variance in the IV. These questions
request inferences to be made about the population parameters and the R2
index in the population of concern, based on the available sample and the
performance of the SLR model in it. To this end, we need to know what the
sampling distributions of the intercept and slope estimators, â and b̂, are. As
shown in more advanced treatments (e.g., Roussas, 1997), when at each value
of the IV the scores of the DV follow a normal distribution, these estimators
are normally distributed with means being their population values and vari-
ances obtainable via appropriate formulas given below.
The slope b of the SLR model (12.1) is of particular interest thereby, since
its value pertains to the answer to the question of whether there is a discern-
ible linear relationship between the DV and IV used. It can be shown (e.g.,
Ott & Longnecker, 2010) that the ratio
(12.12)  t = (b̂ − b0) / (se√(1/Sxx))

follows a t-distribution with df = n − 2 (since we estimate two parameters here—a and b), where b0 is a hypothetical population slope and Sxx = Σ(xi − x̄)². Equation (12.12) allows one to readily carry out hypothesis testing about the slope. In particular, when the null hypothesis of no linear relationship between the DV and IV is to be tested, then the test statistic is (note, b0 = 0 holds then)

(12.13)  t = b̂ / (se√(1/Sxx)),

which follows the t-distribution with df = n − 2 under the corresponding null hypothesis H0: b = 0. The associated p-value with this hypothesis is
provided in the output produced by the software R in a column titled
''Pr(>|t|)''. Also, a p-value is then provided that is associated with the null
hypothesis of the intercept being zero, in the column with the same title (and
in the row of the intercept estimate).
For our above example dealing with the relationship between GPA score
and reaction time, we see from the R output that the p-values associated with
these two null hypotheses of vanishing intercept and slope are very small,
practically zero. Hence, we can reject the null hypothesis that the slope is zero,
as well as the null hypothesis that the intercept is zero, and conclude that
there is evidence in the data for a discernible linear relationship between GPA
score and reaction time.
Using the distribution of the test statistic given in Equation (12.13), as
earlier in the book (see details in Chapter 6) it can be shown that for a given
α (0 < α < 1), a (1 − α)100%-confidence interval for the slope is

(12.14)  (b̂ − tα/2,n−2·se√(1/Sxx),  b̂ + tα/2,n−2·se√(1/Sxx)),

where tα/2,n−2 denotes the quantile of the t-distribution with df = n − 2 that cuts off probability α/2 in its upper tail. As discussed earlier, this confidence interval can also be used to test a null hypothesis of the slope being equal to a prespecified number.
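One convenient way to obtain such an interval in R for a fitted model—here the object 'slr.ex.11.4' created above—is the command 'confint':

> confint(slr.ex.11.4, level = 0.95)    # 95%-CIs for the intercept and slope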
A closely related question that is often posed in empirical behavioral and
social research is if the SLR model (12.1) ‘‘matters.’’ This would be the case
when its R2 index is not zero in the studied population—as otherwise the
model will not matter at all since it won’t explain any DV variance in terms
of IV variance. The corresponding null hypothesis that one needs to test in
order to address this question is H0: R² = 0, versus the alternative Ha: R² > 0. If this null hypothesis is to be retained, then it can be suggested that the
fitted SLR model does not explain any part of the variance in the DV in terms
of the variance in the IV—i.e., there is no linear relationship between the two
variables in the studied population. If this null hypothesis is rejected, however,
the statistical conclusion would be that there is a discernible linear relation-
ship between the two variables. As developed in more advanced treatments
(e.g., Roussas, 1997), this null hypothesis H0: R² = 0 can be tested for the
SLR model (12.1) using the test statistic
(12.15)  F = (Syy − RSS) / s²e.
The F-statistic in Equation (12.15) follows under the null hypothesis H0 an F-
distribution with degrees of freedom 1 and n − 2 (recall, the F-distribution
is characterized by two rather than a single number as degrees of freedom).
The outcome of this test is provided in the last line of the SLR model
output obtained with R, when the software is used as outlined earlier in this
section. For the above reaction time example (Example 11.4), this test statistic
is 251.9 on one and 54 degrees of freedom, and the associated p-value is less
than 2.2 × 10⁻¹⁶, i.e., practically zero. Thus, we reject the null hypothesis H0: R² = 0, and suggest that in the population the SLR model (12.1) does matter,
i.e., it explains a positive proportion of variance in GPA scores in terms of
variability in reaction time.
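As a check, the F-ratio in Equation (12.15) can also be computed directly from the fitted model, assuming the data vectors x and y and the object 'slr.ex.11.4' from above:

> Syy = sum((y - mean(y))^2)           # total sum of squares
> RSS = sum(resid(slr.ex.11.4)^2)      # residual sum of squares
> s2e = RSS/slr.ex.11.4$df.residual    # squared SEE, Equation (12.8)
> (Syy - RSS)/s2e                      # F-statistic of Equation (12.15), about 251.9 here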
The inferences discussed so far rest on several assumptions underlying the SLR model (12.1), which can be stated as follows:
1. There exists a linear relationship between the variables x and y—i.e., the SLR model (12.1) is correct in the studied population.
2. The error term e is normally distributed, and more specifically the individual errors are independent of one another as well as normally distributed (with zero mean).
3. The variance of the y scores at any given x score is constant—i.e., it does not depend on the x score.
When these assumptions are correct, the SLR model (12.1) is a useful
means of describing the relationship between a predictor and a dependent
variable. When the assumptions cannot be considered plausible for a given
data set, however, then doubt is cast upon the validity of the SLR model
(12.1) in the population under consideration. In that case, a conclusion can
be reached that the model does not adequately describe the relationship
between the two studied variables, x and y. The set of assumptions 1 through
3 are often referred to as ordinary least squares (OLS) assumptions (cf. Wool-
dridge, 2008, for a more detailed discussion on this topic). This reference
results from the fact that when they hold, the classical—i.e., unmodified, or
ordinary—least squares principle we employed above can be used to fit the
SLR model (12.1) (as well as more general models containing two or more
predictors). We note that the OLS parameter estimates do not depend on the
normality assumption. In fact, this assumption is only needed when infer-
ences are to be made about model parameters and coefficients of determina-
tion in studied populations.
Since the SLR model is based on several assumptions as mentioned, its
application is warranted only when they are plausible for an empirical data set under consideration. The plausibility of these assumptions is commonly examined by inspecting the model residuals. In particular, if the assumptions hold, then:
(i) a plot of the above residuals êi in Equation (12.7) against the predictor values xi (i = 1, ..., n) will produce a band of points that is symmetric around the horizontal line y = 0, which band in addition will be of (about) the same width regardless of the value of x;
(ii) a plot of the residuals êi against the response variable values yi (i = 1, ..., n) will produce a band of points that is symmetric around the horizontal line y = 0, which band in addition will be of (about) the same width regardless of the value of y;
(iii) a plot of the residuals êi against the predicted values ŷi (i = 1, ..., n) will produce a band of points that is symmetric around the horizontal line y = 0, which band in addition will be of (about) the same width regardless of the value of x;
(iv) the residuals êi should be normally distributed (i = 1, ..., n).
If these plots do not exhibit the pertinent pattern indicated in (i) through
(iv) but instead reveal some other clearly discernible trend or pattern, then it
can be argued that at least one of the above three assumptions does not hold.
In such cases, the fitted SLR model does not represent a plausible means of
data description. Alternatively, if only a very limited number of points violates
these assumptions, as judged by the plots, then it could be argued that the
corresponding individuals may be ‘‘outliers,’’ i.e., possibly result from data
entry errors or stem from studied units belonging to a different population
from the targeted one (for a nontechnical discussion of the concept of ‘‘out-
lier,’’ see e.g., Raykov & Macoulides, 2008).
The plots described in (i) through (iv) can be readily obtained using the
following respective R commands:
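One set of such commands, using base R graphics functions, is of the following form:

> plot(x, resid(name.of.model.object))
> plot(y, resid(name.of.model.object))
> plot(fitted(name.of.model.object), resid(name.of.model.object))
> qqnorm(resid(name.of.model.object))
> qqline(resid(name.of.model.object))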
where ‘name of model object’ is the name assigned to the object repre-
senting the fitted model (see next). Thereby, we use the last two listed com-
mands to obtain first a normal probability plot for the residuals that is then
overlaid by a line, thus allowing us to visually assess the extent of their devia-
tion from a straight line. We illustrate the residuals plots (i) through (iv)
and their utility in assessing the SLR model assumptions with the following
example.
To this end, we begin by assigning the name ‘mod.1’ to the object defined as
the output associated with the SLR model (12.1) for these two variables:
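Assuming the two variables of this example are again held in the vectors x and y in R, a command of the following form accomplishes this:

> mod.1 = lm(y ~ x)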
The residual plots of interest are then furnished with the following specific
commands:
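For example, applying the commands listed above to the object 'mod.1':

> plot(x, resid(mod.1))
> plot(y, resid(mod.1))
> plot(fitted(mod.1), resid(mod.1))
> qqnorm(resid(mod.1))
> qqline(resid(mod.1))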
The resulting plots, for the data of Example 12.1, are presented in Figures 12.2
through 12.5.
As can be seen by examining Figures 12.2 through 12.5, the residual plots
depicted on them exhibit the patterns indicated in (i) through (iv), which
suggest that the SLR model assumptions are not inconsistent with the data.
Specifically, the plots displayed in Figures 12.2 through 12.5 do not indicate
serious violations of the above OLS assumptions 1 through 3 with regard to
the analyzed data. Therefore, it could be suggested that the SLR model (12.1)
represents a plausible method of description of that example data set under
consideration.
We note that the assessment of model fit using the residual plots (i)
through (iv) contains some subjective element that in general cannot be
avoided when using these graphical evaluation approaches. In particular, with
small samples (e.g., only up to a few dozen, say) some deviations from the
straight line in the normal probability graph (iv) could be expected even if
the residuals were drawn from a normal distribution. Therefore, it would be unwise to overinterpret minor deviations of this kind in small samples.
FIGURE 12.2.
Plot of residuals versus predictor (x) values for the simple regression model relating
algebra knowledge to motivation (Example 12.1).
FIGURE 12.3.
Plot of residuals versus response (y) values for the simple regression model relating
algebra knowledge to motivation (Example 12.1).
FIGURE 12.4.
Plot of residuals versus fitted values (predicted response values) for the simple
regression model relating algebra knowledge to motivation (Example 12.1).
FIGURE 12.5.
Normal probability plot (with overlaid straight line) for the residuals associated with
the simple regression model relating algebra knowledge to motivation (Example
12.1).
Since the first of the above OLS assumptions 1 through 3 (see subsection
12.6.1) stipulates that a linear relationship between the variables is involved,
we can first look at the plot of the data to examine if there may be any particu-
lar deviations from linearity, to begin with, in the relationship between aspira-
tion and mathematics ability score. We use the R command ‘plot’ for this aim,
as before, and the result is presented in Figure 12.6 following it:
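With the two variables held in the vectors x and y, as before, this amounts to:

> plot(x, y)    # mathematics ability score (x) versus college aspiration (y)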
FIGURE 12.6.
Plot of the data from Example 12.2.
We fit next the SLR model (12.1) to these data and subsequently create the above plots (i) through (iii), shown in Figures 12.7 through 12.9 (see output beneath the following two commands):
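The two commands can be of the following form, where 'mod.ex2' is an object name used here merely for later reference:

> mod.ex2 = lm(y ~ x)
> summary(mod.ex2)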
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-5.5756 -2.7389 -0.9553 1.5878 14.5017
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.83282 1.41261 -9.085 2.7e-13 ***
x 1.21634 0.04366 27.858 ⬍ 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the fitted linear regression model is associated with a fairly high
(adjusted) R2 index, suggesting this model explains a great deal of the variabil-
ity in the DV y. Similarly, the null hypotheses of vanishing slope and such R2
are rejected. Indeed, the associated F-statistic in Equation (12.15) is here 776.1
and based on one and 67 degrees of freedom its pertinent p-value is practically
zero. This suggests that the linear model matters—i.e., the hypothesis H0:
R2 0 in the studied population of high school students can be rejected.
Similarly, the slope estimate is 1.22 and its associated p-value is practically
zero as well (with its t-value being 27.86), suggesting that the null hypothesis
of no linear relationship between college aspiration and math ability test score
in the population can be rejected. (As can be seen from the preceding discus-
sion, these two null hypotheses are in fact identical, which is the reason why
the p-values associated with their F- and t-statistics cited are identical as well.)
Despite these apparent indications of relatively good overall fit of the SLR
(12.1) model, we recall that the R2 coefficient is by definition an omnibus
(overall) index of goodness of fit, as explicated in its definition in Equation
(12.9). The reason is that the R2 is defined as the ratio of two variances, and since the latter are themselves overall indexes, their ratio is also an overall goodness-of-fit measure. Hence, by its very definition the R2 index does
not pay attention to (i.e., does not reflect) the extent to which individual
observations are fitted (explained) by the model. Therefore, the R2 may be
large even if there may be a few individual residuals that are not well approxi-
mated (fitted) by the model and thus invalidate it as a possibly adequate
means of description of an entire given data set. In fact, the present is an
example of precisely this phenomenon occurring, as we will see shortly.
We therefore examine next the earlier mentioned residual plots (i) through (iv) for this data set, which are presented in Figures 12.7 through 12.10 and are obtained by using the following R commands:
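For example, with the fitted object 'mod.ex2' from above:

> plot(x, resid(mod.ex2))
> plot(y, resid(mod.ex2))
> plot(fitted(mod.ex2), resid(mod.ex2))
> qqnorm(resid(mod.ex2))
> qqline(resid(mod.ex2))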
The plot in Figure 12.7 exhibits distinctive nonlinearity, with notably large
residuals (relative to the magnitude of the response variable y) at smaller-than-average and larger-than-average values of x.
FIGURE 12.7.
Plot of residuals against predictor values (Example 12.2).
FIGURE 12.8.
Plot of residuals against response variable values (Example 12.2).
FIGURE 12.9.
Plot (iii) of residuals against predicted (fitted) response values (Example 12.2).
Similarly, the normal probability plot in Figure 12.10 indicates that the normality assumption is clearly violated for this model when fitted to the data from Example 12.2.
Hence, all residual plots (i) through (iv) in Figures 12.7 through 12.10 for
the numerical example under consideration (Example 12.2) indicate serious
violations of the SLR model (12.1), and specifically of its assumptions. In
FIGURE 12.10.
Normal probability plot for the residuals of the simple linear regression model
(Example 12.2).
particular, there is evidence suggesting that the linearity and residual normal-
ity assumptions do not hold. Thus, despite the relatively high (adjusted) R2
index of the fitted linear model and the significance of all test results obtained
with it—in particular for the slope and this R2 index—we conclude that the
SLR model does not provide a plausible means of data description.
The present Example 12.2 at the same time demonstrates clearly the fact
that a high overall fit index, such as the adjusted R2, is no guarantee for a
well-fitting model. In particular, even though a model may be associated with
a high adjusted R2 index, there may be a considerable number of subjects for
which it provides less than satisfactory fit. This fact underscores the relevance
of also evaluating the model residuals as a matter of routine, when fitting
simple regression models. (This conclusion remains valid also for more com-
plex regression models, such as those containing nonlinear powers of used
predictors or multiple predictors, which are discussed later in this chapter and
the next.)
The residual plots in Figures 12.7 through 12.10 show marked deviations from the expected patterns for smaller-than-average x scores as well as for larger-than-average x scores. This finding can be interpreted as suggesting that the
relationship between aspiration and mathematics ability, as captured in their
scale/test scores x and y, may in fact be quadratic rather than linear. That is,
the SLR model (12.1) may actually have an omitted term, the square of the
predictor x. (Such models with omitted terms are often referred to as ''misspecified.'') To include this term, we first compute the squared predictor in R:
> x2 = x^2
Next, in order to fit the data from Example 12.2 better, we extend the SLR model to include the squared mathematics test score—i.e., consider the extended model

(12.16)  y = a + bx + cx² + e*,

where for convenience the symbol e* denotes the error term of this model (thus, not to be assumed equal to the error term e in the SLR model (12.1)) and the same notation is used for the intercept and coefficient in front of x.
In agreement with our earlier discussion in Section 12.1 of this chapter, we
emphasize here that model (12.16) is also a linear regression model, just like
(12.1) is. The reason is the fact that its parameters a, b, and c appear in the
right-hand side of its defining equation (12.16) only linearly. That is, despite
the inclusion of the square of the explanatory variable x into the latter model,
it remains a linear regression model.
We fit this model to the data using R in the following way, whereby to
emphasize its feature of including the square of the mathematics test score
(x) we formally refer to the resulting R-object as ‘model.qt’ (for ‘model with
quadratic term’) in the remainder of this chapter. To invoke fitting this model
to the data, we add the variable denoted ‘x2’—the squared explanatory vari-
able—after the symbol '~' used in the model fitting command 'lm' next, with
the ensuing output provided beneath commands used:
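The commands are of the following form:

> model.qt = lm(y ~ x + x2)
> summary(model.qt)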
Call:
lm(formula = y ~ x + x2)
Residuals:
Min 1Q Median 3Q Max
-3.62493 -1.43378 0.08254 1.08254 5.51978
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.102148 0.997868 2.107 0.0390 *
x -0.050921 0.071370 -0.713 0.4781
x2 0.022105 0.001206 18.337 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We notice that the residuals associated with the fitted model (12.16) are
markedly smaller than those pertaining to the SLR model (12.1) fitted earlier.
In particular, here they range between −3.62 and 5.52, while for the former model they ranged between −5.58 and 14.50. While it is not immediately
seen as an easy task to interpret the magnitude of the individual residuals of
a given model (except of course to relate it to that of the original dependent
variable scores), when we have more than one model fitted to the same set of
observed variables (data set) we can informally compare the models by com-
paring the magnitude of their residuals with one another. Such a comparison
does not give us, however, an overall idea about the relative fit of the models,
which can be readily obtained by comparing their overall indexes.
Returning to the example study of the relationship between college aspira-
tion and mathematics ability, we see that the R2 index of the fitted model
(12.16) is significant (see last line of output) since the p-value associated with
the F-statistic pertaining to the null hypothesis of this index being zero in the
population is very small—practically zero. This means that model (12.16)
matters—i.e., it explains a significant proportion of response variance. (This
is not an unexpected finding, since we know from the earlier discussion in
this chapter that the linear model (12.1) matters, and hence the model that
adds one more explanatory term to it—viz., the variable ‘x2’—would have to
matter as well.)
At the same time, we notice the markedly higher R2 index of the model
(12.16) than that index of the earlier fitted linear model, with the same rela-
tionship holding for their adjusted R2 indexes. This suggests that model
(12.16) is an improvement over the earlier model (12.1) as far as overall fit to
the analyzed data is concerned. In order to examine if this is also the case in
the studied population that is of actual interest, we check in the last presented
output if the coefficient of the quadratic term of the model (12.16) is signifi-
cant, i.e., test the hypothesis H0: c = 0. Since the p-value associated with the quadratic parameter estimate ĉ = .022 is very small—practically zero—this
FIGURE 12.11.
Residual plots (i) through (iii) and normal probability plot for the quadratic model
(12.16) fitted to the data of Example 12.2 (see labels on axes to differentiate between
plots).
null hypothesis can be rejected. This suggests that as far as overall fit goes,
model (12.16) is indeed an improvement upon the SLR model (12.1) as a
means of description of the relationship between college aspiration and math-
ematics ability in the high school student population in question.
While similarly to the SLR model the last fitted model (12.16) is associated
with fairly high overall fit indexes, it is unclear whether the latter model is
associated with good local fit, i.e., whether it fits the analyzed individual sub-
jects’ data well. As we saw earlier in this section, a high R2 index is not a
guarantee for good local fit (i.e., individual data fit), and so we would now
like to see how model (12.16) fares at the level of the subject data. To this
end, we examine the above residual plots (i) through (iv) for model (12.16),
using the same commands as before (but now applied on the R-object ‘mod-
el.qt’; see Figure 12.11 for all four plots).
In the first three of these residual plots, there are no noticeable patterns—
unlike those in the earlier Figures 12.7 through 12.10 for the previously fitted
SLR model (12.1). This finding of no noticeable patterns in the residual plots
(i) through (iii) would be expected with well-fitting models that represent
plausible means of data description and explanation. In addition, the normal
probability plot for the residuals of model (12.16) does not suggest any
marked deviations from the normality assumption. Specifically, this model’s
residuals are relatively close to the fitted line, and their limited deviations
from it do not exhibit any particular pattern (compare with the normal prob-
ability plot of the residuals associated with the SLR model (12.1) when fitted
to this data set; e.g., Figure 12.10).
The example used throughout this section provided a clear demonstration
of the relevance of carefully examining residual plots, such as the above graphs
(i) through (iv), whenever the SLR model (12.1) is fitted to data. The example
also illustrated the fact that these and related residual plots can be used both to help a researcher assess a model's fit to the analyzed data and to guide his or her search for better-fitting models if need be—like the extended model presented in (12.16).
In conclusion of this section, we also saw with this example and our perti-
nent discussion how these aims of (i) assessment of model fit at the level of
individual data through residual plots, and (ii) model improvement based on
information about possible misfit obtained from these plots can be readily
accomplished with the software R.
chapter, the goal of MRA is to find weights b1, b2, ..., and bk such that the following linear combination of the predictors

(13.1)  ŷ = b0 + b1x1 + b2x2 + ... + bkxk

is as close as possible to y, for all studied observations. As a result, since the right-hand side of Equation (13.1) represents the predicted-by-the-model response score, the correlation between model predictions and observed outcome scores, r = Corr(y, ŷ), will be maximal (with Corr(.,.) denoting correlation; e.g., Agresti & Finlay, 2009).
A frequently used reference to this maximal correlation in the context of
MRA is as multiple correlation coefficient, or multiple R. We already encoun-
tered this reference in Chapter 12, specifically on a few occasions in the output
produced by the software R. In particular, the square of the multiple R, viz.,
R2, is called the coefficient of determination, as in the context of the SLR model
(12.1). (More specifically, as mentioned in the previous chapter, for the SLR
model the squared correlation of independent and response variable equals
R2.) In the present chapter, we will use the extended notation R2y.12...k to sym-
bolize the coefficient of determination associated with a MRA model under
consideration, when it will be helpful to emphasize that we are using more
than a single independent variable. (In other cases, where no confusion may
arise, we will use the simpler notation R2 for this coefficient.) Also in the MRA
case, 100·R²y.12...k provides the percentage of explained variance in the
response variable in terms of the used predictors—viz., via their assumed lin-
ear relationship to the outcome variable. In addition, it can be shown that if
all explanatory variables are unrelated (uncorrelated), then R2y.12...k equals the
sum of squared correlations of each one of them with the response variable y
(e.g., Pedhazur, 1997). However, if at least two independent variables are cor-
related with one another, which is the most frequent case in empirical be-
havioral and social research, then R2y.12...k is smaller than this sum of squared
correlations of predictors with the outcome variable.
As one might expect by analogy to the SLR model (12.1) discussed in Chap-
ter 12, the MRA model is
(13.2)  y = b0 + b1x1 + b2x2 + ... + bkxk + e,
where e is the model error (residual). The model error is typically assumed
normal when we want to carry out statistical inference, with a mean of zero,
and uncorrelated with any of the explanatory variables. (Similar to the SLR
case, we do not need to make this normality assumption if we are only inter-
ested in parameter estimation; cf. Wooldridge, 2008.) In the MRA model
(13.2), the residual e contains the effects of all other variables (or higher pow-
ers and/or interactions of the predictors used), which are possibly related to
the response y and not explicitly included in the model, i.e., do not appear in
the right-hand side of Equation (13.2). We note that the term e in (13.2) is
not identical to the error term in the SLR model in Equation (12.1) but is
conceptually similar to it. For this reason, we use formally the same notation
e for the model residual in Equation (13.2) as in the last chapter dealing with
the SLR model.
In the MRA model (13.2), a number of independent variables are included,
and this fact modifies the interpretation of the parameters appearing in its
right-hand side. Like the SLR model (12.1) in the previous chapter, the inter-
pretation of the intercept b0 is as the average response score for observations
(units of analysis) having the value of zero on all predictors x1 through xk.
Unlike the SLR model, however, the meaning of the jth regression coefficient bj is as the average change in the dependent variable y, which is associated with a unit change in the jth predictor xj while holding all remaining predictors x1, ..., xj−1, xj+1, ..., xk constant (j = 1, ..., k). We stress that the interpretation of any of the regression coefficients b1 through bk is conditional on holding constant the other k − 1 predictors in the model. This is the reason why these coefficients are referred to as partial regression coefficients (weights or slopes). As an implication, the partial regression coefficient bj may be small and unimportant (nonsignificant) in the presence of other predictors, but if xj is used as a single predictor of the dependent variable y then the pertinent slope b in the SLR model (12.1) may be substantially larger and significant (j = 1, ..., k).
Similarly to the single predictor case of concern in Chapter 12, the
unknown parameters b0, b1, b2,..., bk are estimated in such a way that the
following sum Sm of the squared discrepancies between observations and
model predictions on the response is minimal (where as usual n denotes sam-
ple size):
Sm = Σ(yi − ŷi)²  (summation over i = 1, ..., n).
Example 13.1: In a study of general mental ability of high school seniors, five
intelligence tests were administered to n = 160 students. The first three tests
measured the fluid intelligence subabilities of inductive reasoning with letters,
figural relations, and inductive reasoning with symbols; the fourth test tapped
into the ability to rotate three-dimensional figures; and the fifth was a conven-
tional general mental ability test. (The data are found in the file CH13_EX1.dat,
where individual scores represent percentage correct answers on the tests.) We
wish to evaluate the (linear) relationship between the test of general mental
ability, on the one hand, and the other four intelligence tests on the other hand.
Accordingly, we denote the general mental ability measure as y, and the four
remaining tests as x1 through x4 (in the above order; these four variables will be
correspondingly denoted ‘x1’ through ‘x4’ in the following R commands).
To evaluate the relationship between the general mental ability test score
and the four intelligence tests, we fit the following multiple regression model
(13.2) with k = 4 predictors:

(13.3)  y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + e.
To this end, as in Chapter 12 we use the R command ‘lm’, but we now list
formally the sum of all used independent variables after the '~' sign in it:
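That is, the command is of the form:

> mrmod.1 = lm(y ~ x1 + x2 + x3 + x4)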
We stress that in this way we assign to the produced output the status of an
R-object with the name ‘mrmod.1’. To see the results of fitting model (13.3),
we request the associated output as in the last chapter with the command
‘summary’:
> summary(mrmod.1)
Since we are going to refer often below to the output associated with model
(13.3) when fitted to the data from the currently considered example, we
present this summary output in Table 13.1.
As we can see from Table 13.1, the MRA model (13.3) is associated with an
R² index of .8018 here. That is, 80% of the variability in the general mental ability score was explained in terms of individual differences on the four intel-
ligence measures x1 through x4, viz., via their assumed linear relationship to
the former score. At this stage, we focus only on the output column contain-
ing the estimates b̂0 through b̂4 of the intercept and partial regression coeffi-
cients b0 through b4, respectively (see first column after that with the names
Table 13.1 Summary of results associated with model (13.3) (fitted to data
from Example 13.1).
Call:
lm(formula = y ~ x1 + x2 + x3 + x4)
Residuals:
Min 1Q Median 3Q Max
-0.199329 -0.059242 -0.003189 0.061958 0.240692
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.06075 0.02933 2.071 0.039997 *
x1 0.31664 0.08463 3.741 0.000257 ***
x2 0.16173 0.06729 2.404 0.017416 *
x3 0.43964 0.07601 5.784 3.9e-08 ***
x4 0.16241 0.08797 1.846 0.066765 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
That is, the predicted value is 60.953 for the score of the subject in question
on the general mental ability test. (Recall from the preceding discussion in
this section that the data on each of the five tests used were available in terms
of percentage correct answers.) As pointed out in the last chapter, we empha-
size here that the credibility of predicted values depends on the tenability of
the model used to obtain them, and specifically on whether it is a plausible
means of description and explanation for the analyzed data. We attend to this
topic later in the chapter (see Section 13.6), after discussing next various
related matters that pertain to drawing inferences from model results to stud-
ied populations.
The estimates of the MRA model parameters discussed in Section 13.1 only
represent single numerical guesses about their values in a studied population
of actual interest. A researcher will typically have, however, many more ques-
tions that he or she would be interested in answering from the results of a
multiple regression analysis. We attend to a number of them in this section.
One such question is whether a given partial regression coefficient, say bj, is zero in the studied population, i.e., whether the null hypothesis *H0: bj = 0 can be rejected (j = 1, ..., k). This hypothesis can be tested with the statistic

(13.5)  t = b̂j / sbj.12...k.

In Equation (13.5),

(13.6)  s²bj.12...k = s²y.12...k / [(1 − R²j.12...k) Σ(Xji − X̄j)²]

is the squared standard error for the jth partial regression coefficient, and

s²y.12...k = Σ(Yi − b̂0 − b̂1X1i − ... − b̂kXki)² / (n − k − 1) = Σ(Yi − Ŷi)² / (n − k − 1)

is the squared standard error of estimate (SEE) associated with the MRA model fitted, while R²j.12...k is the R²-index for the model regressing the jth predictor upon the remaining ones, x1, ..., xj−1, xj+1, ..., xk (all sums above run over i = 1, ..., n). When the null hypothesis *H0 is true, the t-statistic in Equation (13.5) follows a t-distribution with df = n − k − 1 degrees of freedom (e.g., Agresti & Finlay, 2009).
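As an illustration of Equation (13.6), the standard error of the first partial regression coefficient in Example 13.1 could be recomputed from the fitted model and an auxiliary regression of x1 on the remaining predictors (a sketch only, assuming the example data are loaded as y and x1 through x4 and 'mrmod.1' has been fitted as above):

> aux = lm(x1 ~ x2 + x3 + x4)                        # auxiliary regression, yields R-squared of x1 on the rest
> R2.aux = summary(aux)$r.squared
> s2y = sum(resid(mrmod.1)^2)/(160 - 4 - 1)          # squared SEE of the full model
> sqrt(s2y/((1 - R2.aux)*sum((x1 - mean(x1))^2)))    # should match the 0.08463 reported in Table 13.1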
Equation (13.6) is very informative with regard to the stability of the esti-
mate of the jth partial regression weight bj across repeated sampling (at the
same sample size, n, from the studied population). In particular, the right-
hand side of Equation (13.6) reveals that the stronger the relationship between
the jth predictor and the remaining ones, the larger is the standard error
associated with its slope estimate, i.e., the more unstable this estimate is across
repeated sampling (other things being the same). In the extreme case when
the jth predictor is perfectly predictable from the remaining independent vari-
ables x1, ..., xj−1, xj+1, ..., xk (i.e., when R²j.12...k = 1), the standard error of the jth
slope estimate becomes infinity. This extreme case is called multicollinearity
(sometimes referred to as perfect multicollinearity).
When predictors are carefully chosen in empirical research, the multicollin-
earity phenomenon is less likely to occur, but near-perfect multicollinearity
may hold among a used set of predictors. In those cases, some of the indepen-
dent variables nearly completely predict the jth explanatory measure xj, and
the standard error of the pertinent weight bj is very large. This tends to lead
to retention of the null hypothesis *H0: bj = 0. Such a finding may seem to be inconsistent with a rejection of the overall null hypothesis H0': b1 = b2 = ... = bk = 0 in the same data set. The finding is explained by the lack of
sufficient information then in the available sample, which would allow differ-
entiation of the contributions of individual predictors to the explanation of
response variance. In these cases, one may (a) enhance sample size—by study-
ing additional individuals or units of analysis sampled from the same popula-
tion, (b) combine two or more of the independent variables into a single (or
fewer) new measures in a substantively meaningful way, or (c) drop from the model one or more of the nearly collinear predictors.
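A related question concerns interval estimation of the partial regression coefficients. A (1 − α)100%-CI for such a coefficient is obtained, as in Chapter 12, from its estimate, its standard error, and the pertinent t-quantile, here with n − k − 1 = 155 degrees of freedom. Using the rounded values from Table 13.1 for the first partial regression coefficient, for instance, one could compute:

> left.limit = 0.31664 - qt(0.975, 155)*0.08463
> right.limit = 0.31664 + qt(0.975, 155)*0.08463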
> left.limit
[1] 0.149463
> right.limit
[1] 0.483817
Thus, the 95%-CI for the first partial regression coefficient is (.149, .484).
That is, the weight of the inductive reasoning with symbols score in the best
linear combination of the four intelligence tests for the purpose of predicting
general mental ability could be expected with high confidence to lie between
.149 and .484 in the studied high school senior population.
contains sampling error (the difference between population and sample); and
relatedly (ii) the fact that the multiple correlation coefficient R derived from
a given sample is biased upward if considered an estimate of the population
multiple correlation coefficient associated with the fitted model (e.g., Pedha-
zur, 1997; see also Chapter 12).
As it turns out, it is not possible to determine exactly the extent of shrink-
age in the multiple R, but this shrinkage can be estimated. Then the adjusted
R2 index, denoted R2a, can be used as an alternative to R2 in predictive research
as well as in other regression analysis applications. This adjusted index is
defined as follows
(13.8)  R²a = 1 − (1 − R²)(n − 1)/(n − k − 1),
and aims at providing a better estimate of the explanatory power of the fitted
model in the studied population than the conventional (unadjusted) R2 index.
From Equation (13.8), since (n − 1)/(n − k − 1) > 1, it follows that R²a < R², i.e., the adjusted index is lower than the conventional R² index for a given model. (This property also holds in the SLR model.)
As a pair of examples demonstrating this shrinkage, consider a model with k = 4 predictors fitted to a given sample with n = 30 subjects and an associated R² = .64 value. From Equation (13.8) we obtain ('·' denotes next multiplication)

R²a = 1 − (1 − R²)(n − 1)/(n − k − 1) = 1 − (1 − .64)·29/25 = .58,
which is notably lower than the unadjusted R2 for the considered model; in
this case, the shrinkage is found to be .64 − .58 = .06. Alternatively, in the above Example 13.1 (see lower part of Table 13.1 in Section 13.1), the adjusted R2 was estimated at .797 and was associated with a much smaller shrinkage value, viz., .802 − .797 = .005. The limited magnitude of the shrinkage in
Example 13.1 is due to the large sample size as compared to the number of
predictors, as well as to the relatively high unadjusted R2 of .802 associated
with the fitted model (13.3).
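Equation (13.8) is simple enough to compute directly; for instance, as a small R function:

> adj.r2 = function(R2, n, k) 1 - (1 - R2)*(n - 1)/(n - k - 1)
> adj.r2(.64, 30, 4)      # about .58, the first example above
> adj.r2(.802, 160, 4)    # about .797, as in Example 13.1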
Upon investigation, the definition of the adjusted R2 index in Equation
(13.8) also reveals that for a given model and sample, other things being equal:
(a) the smaller the unadjusted R2 (i.e., the lower the percentage explained
response variance), the larger the shrinkage is; and
(b) the larger the ratio of number of predictors to sample size, k/n, the
greater the shrinkage is.
It can also be shown that even if in the population R² = 0 holds (that is, a model explains no response variance at all), the expected unadjusted R² in a sample is k/(n − 1) (e.g., Pedhazur, 1997). Hence, for a given sample size, with a large enough number k of predictors used the unadjusted R² index can be expected to be pretty close to one even if in the population it is in reality equal to zero, i.e., all partial regression coefficients are zero there (viz., b1 = b2 = ... = bk = 0). Thus, a small R² for a model fitted to an available sample
should make a researcher suspicious that there may not be even a weak linear
relationship in the studied population between the response and independent
variables considered. In contrast, it can be shown that when all slopes equal
zero in the population, the expected R2a is still zero. Moreover, Equation
(13.8) applies to the situation when all predictors are retained in the equation.
If a predictor selection procedure is used to arrive at a model in question (e.g.,
see further details offered in Section 13.3), then the capitalization on chance
is greater and results in even larger shrinkage. This discussion shows that there
are serious limitations underlying the unadjusted R2 index as a measure of
overall fit. In this sense, R2a is a much better index of overall fit for a consid-
ered regression model.
possibly be dropped are numbered (ordered) as last. We stress that this null
hypothesis H0 asserts we can drop the last k − p predictors without losing
explanatory power of the MRA model (13.2) in the population. In other
words, this hypothesis stipulates that the MRA model (13.2) with all k predict-
ors, on the one hand, and its version only with the first p predictors on the
other hand, have the same explanatory power in the studied population. In
that case, the two models will be associated with the same population propor-
tion explained variance in a given response variable.
To test the null hypothesis H0, we follow a generally accepted procedure in
applications of statistics, which consists of comparing the fit of two rival mod-
els. In one of the models, referred to as the restricted (or reduced) model, the
null hypothesis is implemented—i.e., its stipulated parameter restrictions are
introduced. In the other model, referred to as the full model, this hypothesis
is not implemented. The difference in fit between these models provides infor-
mation that bears upon our decision whether to retain the null hypothesis or
to reject it. This general procedure is utilized frequently in applications of
statistics in the social and behavioral sciences, and it is also sometimes referred
to as nested model testing.
Specifically in the context of MRA, and for the purpose of examining the
above null hypothesis H0 of dropping the last k − p predictors, we consider
the following two models in an application of this testing procedure:
As indicated above, we can also call such models nested and specifically refer
to the reduced model as being nested in the full model. (Notice the different
notation used for the error term in the restricted model, owing to this term
not being formally identical to that in the full model, since the latter has more
predictors. Although the intercept and partial regression coefficients are not
the same in both models, we use the same notation for them in order to
emphasize that Model 2 is nested in Model 1; see below.) The motivation for
referring to Model 2 as being nested in Model 1 is the fact that Model 2 is
obtained from Model 1 by imposing additional parameter restrictions, viz.,
those of the null hypothesis H0. Nested models are very widely used in various
fields of applied statistics, such as categorical data analysis, multilevel/hierar-
chical models, latent variable and structural equation modeling, and others.
This is due to the fact that they allow examining substantively meaningful
hypotheses by testing the validity of pertinent parameter restrictions in stud-
ied populations. In many regards, one may often consider nested models as
means of theory development and validation in the behavioral and social sci-
ences.
To test the above null hypothesis H0, we can use the ‘‘multiple F-test.’’ This
test is based on the statistic
(13.9)  F = [(R²y.12...k − R²y.12...p)/(1 − R²y.12...k)]·[(n − k − 1)/(k − p)],

which follows under H0 an F-distribution with k − p and n − k − 1 degrees of
freedom (Agresti & Finlay, 2009). When the outcome of this test is not sig-
nificant, i.e., the F-ratio in Equation (13.9) is not significant, one can conclude
that the full and reduced models have the same explanatory power with regard
to the response y in the studied population—that is, they have the same R2
index. As a consequence, one can drop the last k − p predictors, since the
more parsimonious model that does not include them would be preferable on
grounds of having fewer parameters (see earlier discussion of the parsimony
principle in this section).
Conversely, when the test based on the F-ratio (13.9) is found to be sig-
nificant, it is concluded that dropping the last k − p predictors leads to a significant loss of explanatory power. The implication then is that one should keep the last k − p predictors, i.e., one will prefer the full model as a better
means of data description and explanation. We note that we invoke the parsi-
mony principle in this model choice activity only when the multiple F-test is
not significant, since only then are we dealing with two equally well-fitting
models that differ in their number of parameters. However, when the test
statistic (13.9) is significant, we are no longer facing such a situation, since
the full model fits the analyzed data better than the reduced model. In such a
case, we obviously would prefer the full model as a means to describe and
explain the data.
We emphasize that in the multiple F-test (13.9) there is no restriction on
the relationship between the two numbers k and p of predictors involved,
other than both being integers with the property 0 ≤ p < k. Thereby, k − p = 1 is possible as a special case. Then one would be interested in testing whether a given predictor can be dropped, viz., the last in the above predictor ordering. In that case, the question that is addressed by testing the resulting null hypothesis H0: bk = 0, is whether the kth predictor is significant. This question asks whether the kth predictor enhances significantly the explanatory power of the model over and above what the remaining k − 1 predictors furnish with respect to explaining variance in the outcome y. That is, testing the null hypothesis H0 when k − p = 1 amounts to testing the unique predictive power of the kth predictor, in the context of the other k − 1
predictors, i.e., after including them first in the model and adding then the
predictor in question as last in it.
Conversely, when p = 0 the test of the null hypothesis is H0: b1 = b2 = ... = bk = 0, and it amounts to testing whether the fitted MRA model (13.2) has any explanatory power at all, i.e., is tantamount to testing the null hypothesis that its R² index is zero in the studied population.
Strictly speaking, the intercept, slopes, and error term of Model 2 (the reduced model that retains only the predictors x2 and x4) are not the same quantities as those in Model 1, but we purposely use the same notation for them here in order to emphasize that Model 2 is nested in Model 1.
To fit Model 2 with R, we use the same command ‘lm’, yet now only for-
mally state the sum of the included predictors x2 and x4 after the sign ‘~’,
and assign the name ‘mrmod.2’ to the R-object resulting as the output pro-
duced thereby (output summarized beneath fitting model command):
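In sketch form, assuming the variables y, x2, and x4 are available as before, these commands would be:
> mrmod.2 = lm(y ~ x2 + x4)
> summary(mrmod.2)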
Call:
lm(formula = y ~ x2 + x4)
Residuals:
Min 1Q Median 3Q Max
-0.262032 -0.077504 -0.001397 0.081131 0.431446
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04353 0.03800 -1.145 0.254
x2 0.42613 0.08715 4.889 2.48e-06 ***
x4 0.65506 0.10565 6.200 4.78e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see from this output that the R2 index of Model 2 is markedly lower, as is its adjusted R2, than the corresponding indexes of Model 1. Correspondingly, the summary measures for the individual residuals associated with Model 2 (see top part of output) are notably larger in magnitude than the corresponding measures for Model 1 (see top part of output for Model 1 presented in Table 13.1). This
suggests that Model 2 does not fit the data as well as Model 1 in the available
sample of 160 students. However, the actual question to address is whether this is indeed the case in the studied population. In statistical terms, this question asks whether the drop in the R2 index when moving from Model 1 to Model 2, i.e., when removing the induction test scores as predictors of the general mental ability score, is significant in the studied population. This comparison of the two nested models is readily carried out in R with the ‘anova’ command.
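A minimal sketch of the pertinent call, assuming the full model of Section 13.1 was stored in the R-object ‘mrmod.1’, is as follows; an excerpt of the resulting output is given beneath it:
> anova(mrmod.2, mrmod.1)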
Model 1: y ~ x2 + x4
Model 2: y ~ x1 + x2 + x3 + x4
According to these results, Model 2 fits the data significantly worse than
Model 1—see the F-statistic of 70.557, which is significant since its associated
p-value is very small, practically zero. Hence, we can conclude that the drop
in the R2, i.e., in explanatory power, when deleting the two predictors in ques-
tion is significant in the studied population. In other words, the removal of
the two induction test scores as predictors of general mental ability is associ-
ated with a significant loss in explanatory power of the MRA model initially
fitted (Model 1). For this reason, we prefer Model 1, i.e., the full model, to
Model 2.
Conversely, if one were interested in the first instance in examining
whether one could drop the figural relations test scores from the original set
of four predictors of the general mental ability score y, one would proceed as
follows. Again, Model 1 (see Section 13.1) would be the full model. However,
the reduced model would now be the following one, referred to as Model 3,
where as earlier in this section we use the same parameter and error term
notation to emphasize that it is nested in Model 1:
(13.11) y = b0 + b1x1 + b3x3 + e.
We fit this model with the same R command ‘lm’ as above, but attach to it
the object name ‘mrmod.3’ (results summarized subsequently):
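A sketch of these commands, using the same variable names as before, is:
> mrmod.3 = lm(y ~ x1 + x3)
> summary(mrmod.3)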
Residuals:
Min 1Q Median 3Q Max
-0.193680 -0.060374 -0.002124 0.071413 0.259955
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.15095 0.01761 8.570 9.15e-15 ***
x1 0.35528 0.08806 4.035 8.52e-05 ***
x3 0.59523 0.06845 8.695 4.37e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the R2 index associated with Model 3 is notably lower in the
analyzed sample than that index for Model 1. To test whether this loss in predictive power holds also in the studied population, we apply the multiple F-test to compare Models 1 and 3 (result given beneath the command):
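Assuming again that the full model is stored in the R-object ‘mrmod.1’, the comparison can be requested as follows (a sketch):
> anova(mrmod.3, mrmod.1)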
Model 1: y ~ x1 + x3
Model 2: y ~ x1 + x2 + x3 + x4
Res.Df RSS Df Sum of Sq F Pr(>F)
1 157 1.33893
2 155 1.20608 2 0.13285 8.5367 0.000304 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The multiple F-test is associated here with a fairly small p-value that is
lower than a prespecified significance level (usually α = .05). Thus, the loss
in predictive power is significant when moving from the original set of four
predictors to only the two inductive reasoning tests. We thus prefer Model 1,
rather than Model 3, as a means of data description.
Of course, in order to have more trust in Model 1 for data description
purposes, as was illustrated in Chapter 12, we need to examine the individual
residuals associated with it. We will attend to this issue in a later section of
the chapter.
The F-test used throughout this section plays also a prominent role in other
predictor selection strategies, which we turn to next.
In the previous section concerning the choice between models involving dif-
ferent subsets of an initial set of predictors, we had a preconceived idea which
particular subsets to use in the different model versions considered. Fre-
quently in empirical behavioral and social research, however, a researcher
does not have such an idea and is interested in finding the minimum number
of predictors, from an original set of independent variables, that account for
almost as much (as much as possible) variance in the response variable as
that original set of explanatory variables. The resulting model with a minimal
number of predictors will be associated with highest stability of regression
coefficient estimates, smallest standard errors, and shortest confidence inter-
vals, if it is a plausible means of data description and explanation. There are
several possible predictor selection approaches available that aim at accomp-
lishing this goal and are discussed next.
considered at a later stage, as far as fit to the analyzed data is concerned. Also,
given that multiple analyses are carried out on the same data set, capitalization
on chance may occur. Hence, it can be recommended to carry out a replica-
tion study before a final decision is made for selection of a particular model
version from the series of models considered during the process of backward
elimination. Despite these drawbacks, backward predictor selection may be
recommended for initial sets consisting of not too many independent vari-
ables that are not involved in a nearly perfect (or perfect) linear relationship.
(Such a relationship may be signaled, as pointed out earlier in this chapter, by the initial set of predictors explaining a significant proportion of response variance while none of them is individually significant; see below.)
We illustrate the backward model selection procedure with the following
example:
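The outputs summarized next were produced by successively refitting the model after removing, at each step, the predictor with the largest nonsignificant p-value. A minimal sketch of this sequence of commands follows; the object names are ours and hypothetical, with the final name ‘mod4’ chosen to match its use later in the chapter:
> mod1 = lm(y ~ x1 + x2 + x3 + x4 + gender + ses)   # start with all six predictors
> summary(mod1)                                     # 'gender' has the largest p-value
> mod2 = update(mod1, . ~ . - gender)               # drop 'gender' and refit
> summary(mod2)                                     # 'ses' is not significant
> mod3 = update(mod2, . ~ . - ses)                  # drop 'ses' and refit
> summary(mod3)                                     # 'x4' is not significant at the .05 level
> mod4 = update(mod3, . ~ . - x4)                   # drop 'x4'; this is Model 4
> summary(mod4)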
Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + gender + ses)
Residuals:
Min 1Q Median 3Q Max
-0.210946 -0.064542 -0.004346 0.062526 0.229199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.038246 0.036231 1.056 0.292981
x1 0.307247 0.088250 3.482 0.000668 ***
x2 0.169371 0.069205 2.447 0.015646 *
x3 0.436553 0.079024 5.524 1.59e-07 ***
x4 0.172747 0.090668 1.905 0.058824 .
gender -0.008562 0.021672 -0.395 0.693391
ses -0.028573 0.021630 -1.321 0.188689
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + ses)
Residuals:
Min 1Q Median 3Q Max
-0.212776 -0.065415 -0.003973 0.060831 0.227550
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.03070 0.03070 1.000 0.318955
x1 0.30718 0.08798 3.491 0.000644 ***
x2 0.16723 0.06878 2.431 0.016318 *
x3 0.43599 0.07877 5.535 1.50e-07 ***
Call:
lm(formula = y ~ x1 + x2 + x3 + x4)
Residuals:
Min 1Q Median 3Q Max
-0.201879 -0.060719 -0.001018 0.062009 0.237407
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02148 0.03022 0.711 0.47848
x1 0.30793 0.08838 3.484 0.00066 ***
x2 0.17255 0.06901 2.500 0.01356 *
x3 0.43206 0.07909 5.463 2.08e-07 ***
x4 0.17189 0.09025 1.905 0.05887 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08881 on 140 degrees of freedom
Multiple R-squared: 0.7979, Adjusted R-squared: 0.7921
F-statistic: 138.2 on 4 and 140 DF, p-value: < 2.2e-16
Call:
lm(formula = y ~ x1 + x2 + x3)
Residuals:
Min 1Q Median 3Q Max
-0.1964412 -0.0644397 -0.0008611 0.0625979 0.2445131
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.05389 0.02521 2.138 0.034267 *
x1 0.31187 0.08918 3.497 0.000630 ***
x2 0.23390 0.06160 3.797 0.000217 ***
x3 0.48146 0.07541 6.385 2.32e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The last fitted model, Model 4, shows that when any two of the student scores on the mathematics tests are held fixed, knowledge of the third still improves significantly the proportion of explained variance in the response variable, college aspiration. This is at the same time the most parsimonious model version consid-
ered, and we observe that in it the standard errors associated with the
intercept and partial regression slopes are smallest overall from the four mod-
els fitted in this example. In addition, Model 4 is associated with essentially
the same explanatory power that all six predictors afforded, when we compare
its R2 (and adjusted R2) index with that of the starting model containing all
initially considered independent variables. This example demonstrates how
use of the backward selection procedure can lead to parsimonious models that
fit relatively well a given data set.
We discussed at length in Chapter 12 (Section 12.6) how one can examine the
individual case residuals for fitted SLR models, in order to assess whether
their underlying assumptions are plausible and if the models fit the analyzed
data well. We emphasized thereby that while the R2 and adjusted R2 indexes
provide useful overall goodness-of-fit measures, they might be relatively high
also when there are several subjects whose data are not well fit by a model in
question, which therefore need not necessarily be viewed as a satisfactory
means of data description and explanation. Similarly, individual residuals
associated with MRA models are of particular relevance for assumption
assessment as well as exploration of the degree to which the models fit the
individual data. This is another important criterion of model adequacy, some-
times referred to as ‘‘local fit’’ as opposed to the ‘‘overall fit’’ evaluated by the
aforementioned R2 indexes.
When the MRA model (13.2) is fit to a given data set, as in the SLR model
(12.1) the residual ê i associated with the ith individual is the discrepancy
between his or her observation on the response variable, yi, and its corre-
sponding model prediction:
(13.12) êi = yi − ŷi = yi − (b̂0 + b̂1x1i + b̂2x2i + ... + b̂kxki),
where xji denotes his or her value on the jth predictor (i = 1,..., n; j = 1,..., k).
We stress that the residual êi depends on (i) that individual’s data for the used
predictors, (ii) the fitted model, and (iii) the response variable in a given
modeling session. (As indicated earlier in the book, response variables may
change within an empirical study, depending on research questions pursued.)
For a MRA model under consideration, plots of the individual residuals
êi (i = 1,..., n) against various other variables provide a researcher with
insightful information pertaining to the plausibility of its assumptions and
goodness of fit to the analyzed data, and they are readily obtained with R
using the same commands discussed in Chapter 12 (Section 12.6). When a
MRA model is of concern, it is useful to examine the plots of the residuals
against each predictor included in it, in order to assess whether higher powers
of that predictor may be also profitably included in the model. Furthermore,
with a MRA model it is also informative to plot residuals against predictors
not included in it. The presence of a discernible pattern in the resulting plots,
in particular a linear relationship to a variable not used as a predictor in the
model, may be interpreted as suggesting that inclusion of this predictor can
improve markedly the model. These residual plots are also useful to inspect when an initial SLR model is fitted to data but one is interested in examining whether additional predictors should be included in it.
FIGURE 13.1.
Plots of Model 4 residuals against predictors, response, and model predictions (fitted
values; see horizontal axis notation to differentiate between plots).
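The plots in Figure 13.1 can be produced with the graphical commands discussed in Chapter 12 (Section 12.6); a minimal sketch, assuming Model 4 is stored in the R-object ‘mod4’ and the data set is attached, is:
> par(mfrow = c(2, 3))              # arrange several plots in one display
> plot(x1, resid(mod4))             # residuals against each predictor in the model
> plot(x2, resid(mod4))
> plot(x3, resid(mod4))
> plot(y, resid(mod4))              # residuals against the response
> plot(fitted(mod4), resid(mod4))   # residuals against the model predictions
> par(mfrow = c(1, 1))              # restore the default single-plot layout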
After inspecting the plots displayed in Figure 13.1, we see that none of these
plots shows a clear pattern indicating model misfit. This suggests that Model
4 does not exhibit serious lack of fit at the individual data level. We next
evaluate the normality assumption with regard to the model residuals, using
the R commands ‘qqnorm’ and ‘qqline’ (see Chapter 12, Section 12.6, and
Figure 13.2):
FIGURE 13.2
Normal probability plot for the residuals of Model 4.
> qqnorm(resid(mod4))
> qqline(resid(mod4))
The plot displayed in Figure 13.2 does not indicate serious violations of
normality, which as pointed out earlier in the chapter is an assumption
needed for inferences that can be made from the model results.
Based on the residual plots discussed in this section (see Figures 13.1 and
13.2), as well as the earlier findings of its relatively high R2, adjusted R2, and
parsimony, we conclude that Model 4 represents a plausible means of data
description and explanation for the empirical study in question, which we
prefer to the other considered models.
Analysis of Variance
and Covariance
Researchers in the behavioral and social sciences are often interested in com-
paring various response variables across several distinct populations, such as
different socioeconomic (SES) groups, states, countries, cities, or districts. For
example, a researcher studying depression and aging might want to determine
whether there are differences in the levels of depression exhibited by elderly
persons living in three different states. Similarly, examination of possible SES
group differences in student motivation may well be of concern in an educa-
tional setting. In such cases, a very useful statistical method for conducting
such comparisons is the analysis of variance (ANOVA). When, in addition to
the group differentiation, continuous independent variables are also consid-
ered as playing an explanatory role for a response measure, the extension of
this technique to accommodate such variables is referred to as analysis of
covariance (ANCOVA). These two modeling approaches are the topics of the
present chapter. As we will point out in a later section, both techniques are
closely related to the method of regression analysis we introduced in the two
previous chapters.
hypotheses across the g(g − 1)/2 t-tests that one would need to carry out in
order to examine all pairs of groups with regard to their mean differences. In
particular, even with only three groups, the overall Type I error will be above
.14 rather than controlled at α, and this error grows essentially exponentially
with increasing number of groups g in the study (e.g., Howell, 2009). This is
a highly undesirable feature of the series of t-tests contemplated, which repre-
sents its serious inferential problem. This problem can be resolved by using
the ANOVA test of the overall null hypothesis (14.1), which is a single test
that is in addition carried out at a given and controlled significance level—
usually at α = .05 (in general, this level depends on the choice of the
researcher). The remainder of the present section will be concerned with this
ANOVA test.
In order to be in a position to use the ANOVA approach for testing the
null hypothesis (14.1), several assumptions are typically made. As was done
on a few occasions earlier in the book, it is assumed that the response variable
y is normally distributed in each of the g populations under consideration.
(This implies that the outcome variable is assumed throughout this chapter
to be continuous, as in the preceding two chapters dealing with simple and
multiple regression analysis.) In addition, the variance of y is assumed to be
the same in all populations—we denote this common variance by σ². Finally,
the available random samples from the g populations in question are assumed
to be independent of one another. Under these assumptions, since the mean
and variance are all we need to know in order to reproduce a normal distribu-
tion (as they are sufficient statistics, e.g., see Chapter 5), ANOVA can be seen
as a methodology for testing whether the distribution of the outcome measure
y is independent from the categorical variable defined as group membership.
We stress that the latter variable is not a quantitative measure but rather a
nominal (ordinal) one, and thus the analysis needs to account for this fact. In
the context of ANOVA, such a variable is usually referred to as factor, and the
values it takes as its corresponding levels. We emphasize that there is typically
no meaningful numerical relationship between the levels of a factor. Despite
the fact that frequently they are denoted by real numbers initially in a behav-
ioral study, the factor level values are typically only labels and cannot be
related to one another like real numbers.
To illustrate this discussion, consider the following example:
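The five commands described next could look as follows in sketch form (the file name used here is hypothetical):
> d = read.table("CH14_EX1.dat", header = TRUE)              # read in the data
> attach(d)                                                  # make its variables available
> d$state = factor(d$state, labels = c("A", "B", "C", "D"))  # declare 'state' a factor
> detach(d)                                                  # drop the original copy
> attach(d)                                                  # attach the modified data set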
After reading the data into the software and making it available to R with
the second command, we define the variable ‘state’ as a factor. (Unless we
do so, R will treat it as a continuous variable and each of its scores will be
handled as if it were a real number, which is obviously not the case in this
example.) To this end, we use the third-listed command ‘d$state = factor(.,.)’, which assigns the status of a factor to the component (variable) of
the data set ‘d’ whose name is ‘state’, i.e., the variable ‘state’. Thereby, we
assign the labels A through D correspondingly to 1 through 4 on the original
variable ‘state’, as values on the new factor variable that for convenience we
also name ‘state’. We do not need then the original copy of the data set, and
so we detach it with the fourth command. In order to make the last created
variable available for further analyses, however, we attach the data set with
the so-modified ‘state’ variable (and the variable ‘y’ being unchanged) using
the last command. To see the result of these five commands, and in particu-
lar the number of subjects from each state in the overall sample of 156
elderly persons in the sample, we summarize the data on the variable ‘state’
in the newly created data set ‘d’ using the ‘summary’ command (output
given immediately beneath it):
> summary(state)
A B C D
39 38 40 39
This short output shows that we have about equally many elderly adults sam-
pled from each state, who participate in this investigation. To view the state
estimates of mean depression, we readily calculate the sample average of the depression scores within each state, as described next.
For each of the four states, we sum the depression scores only for the
elderly persons in its sample, which we accomplish by using a logical com-
mand assigning a 1 to each person from that state and 0 otherwise. For
instance, for state A this logical expression is ‘state == "A"’, resulting in 1
for subjects from state A only and 0 for everyone else. We then multiply the
result of this logical operation with the depression score, sum across the entire
sample of 156 persons (but in effect only across the subjects from a given
state, say A), and divide by the state-specific sample size (see output of preced-
ing command ‘summary’). In this way, the following state sample means
result:
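In sketch form, following the logic just described (the divisors are the state sample sizes shown by ‘summary(state)’ above), these means could be obtained as:
> mean.state.A = sum((state == "A")*y)/39
> mean.state.B = sum((state == "B")*y)/38
> mean.state.C = sum((state == "C")*y)/40
> mean.state.D = sum((state == "D")*y)/39
The same four averages could of course be obtained more compactly with, for example, ‘tapply(y, state, mean)’.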
> mean.state.A
[1] 19.58974
> mean.state.B
[1] 21.71053
> mean.state.C
[1] 23.02500
> mean.state.D
[1] 20.53846
These four means differ somewhat, but we emphasize that they are sample
averages and hence their differences are not unexpected. For this reason, even
if in the elderly population of concern the state depression means were to be
the same, one cannot expect them to be so also in the samples from the four
states under consideration. Specifically, such observed mean differences may simply reflect sampling fluctuations (sampling error). How much variability of the sample means could one expect to observe, however, should the null hypothesis be true, and how much variability in them would be sufficient in order to no longer consider this hypothesis credible or retainable? In other words, how much variability in the sample means could be explained by chance fluctuations alone? To answer this query, we begin with the within-group sum of squares,
(14.2) SSW = Σ(yi1 − ȳ1.)² + Σ(yi2 − ȳ2.)² + ... + Σ(yig − ȳg.)²,
where the jth sum is taken over the nj observations in the jth group (j = 1,..., g). Since the sample variance in the jth group is sj² = Σ(yij − ȳj.)²/(nj − 1), it follows that
(14.3) Σ(yij − ȳj.)² = (nj − 1)sj².
Combining the within-group sum of squares SSW with the between-group sum of squares SSB, which is based on the deviations of the group means ȳj. from the overall mean ȳ.., the total sum of squares decomposes as
(14.7) SST = Σ Σ (yij − ȳ..)² = SSB + SSW,
where the double sum runs over all groups and all observations within them.
In Equation (14.7), SST is the sum of squared deviations from the overall mean ȳ.. of each observation yij in the entire sample of n subjects (i = 1,..., nj; j = 1,..., g). This equation in fact follows with some algebra from the following trivial decomposition of each response variable score yij into the sum of the overall mean ȳ.., the group mean's deviation from it, ȳj. − ȳ.., and that individual score's deviation from the group mean, yij − ȳj.:
(14.8) yij = ȳ.. + (ȳj. − ȳ..) + (yij − ȳj.).
Equation (14.8) obviously always holds, because the overall and group means appearing in it, ȳ.. and ȳj., cancel out after the brackets are opened. That is, Equation (14.8) needs no assumptions in order to be true for any individual response score, yij, in an empirical study. Squaring both sides of Equation (14.8) and summing over i and j (i.e., over the individual observations within the groups and over the groups) leads after some algebra to Equation (14.7).
Equation (14.7) represents a fundamental decomposition of observed vari-
ance into such stemming from individual differences on the response variable
y within the groups (SSW) and variability of its means across the groups (SSB).
That is, (14.7) is a break-up of observed variance into between- and within-
group components. Using this variance decomposition and the earlier made
assumptions, it can be shown (e.g., King & Minium, 2003) that when the
studied population means are the same—i.e., the null hypothesis H0 is true—
the two possible estimates (14.5) and (14.6) of the common variance σ²
should be close in magnitude, i.e., their ratio should be close to 1. When this
hypothesis H0 is incorrect, however, the estimate resulting from the variability
among the sample means will be expected to be larger than that resulting
from within the groups. The extent to which the former estimate will be larger
than the latter, depends on the degree to which the null hypothesis is violated,
i.e., the extent to which the population means differ.
This fact is the basis for using the ratio of these two variance estimates,
(14.9) F = [SSB/(g − 1)] / [SSW/(n − g)],
as a test statistic of the null hypothesis H0 in Equations (14.1) stipulating
population mean equality. The ratio in (14.9) is frequently referred to as
F-ratio, or ANOVA F-ratio. Specifically, when this F-ratio is close to one in an
empirical study, one may conclude that there is no strong evidence in it
against the tested null hypothesis H0. Conversely, when the F-ratio (14.9) is
sufficiently larger than one, then this null hypothesis cannot be considered
credible and may be rejected. It has been shown that when H0 is true, the F-ratio (14.9) follows an F-distribution with g − 1 and n − g degrees of freedom, from which its associated p-value can be obtained. In R, the response variance decomposition underlying this test is readily produced with the command ‘aov’; for the present example:
> aov(y~state)
We note that within parentheses we state first the response variable—in this
example y—and then the factor variable, here ‘state’. Thereby, we connect
these two variables with the ‘~’ sign, the same sign we used in the last two
chapters when carrying out regression analysis. We will see later in this chap-
ter that this is not a coincidence, as we explore the relationship between
ANOVA and regression analysis and find out that there is a close connection
between these two methodologies. The result of applying the response vari-
ance decomposition in the current example is as follows:
Call:
aov(formula = y ~ state)
Terms:
state Residuals
Sum of Squares 261.228 15429.919
Deg. of Freedom 3 152
Residual standard error: 10.07535
Estimated effects may be unbalanced
Accordingly, just over 261 units of variance are due to the factor ‘state’, out of the total 15429.919 + 261.228 = 15691.147 units of variability in the depression scores across the 156 elderly persons in this study. The estimate of the (assumed) common variance σ², the square of the residual standard error, is here 10.07535² ≈ 101.51.
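As an aside, the F-ratio (14.9) can also be computed by hand from these sums of squares and their degrees of freedom (a sketch using the values reported by ‘aov’ above):
> F.ratio = (261.228/3)/(15429.919/152)   # between- vs. within-group mean squares
> F.ratio                                 # approximately 0.86, i.e., below 1
> pf(F.ratio, df1 = 3, df2 = 152, lower.tail = FALSE)   # associated p-value, well above .05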
While this is an informative decomposition of observed variance, we are
interested here primarily in seeing whether there is sufficient evidence in the
data that would warrant rejection of the null hypothesis H0 in Equations
(14.10) that is of concern to test here. This question, too, is readily addressed with R, using the command ‘anova’, whereby we can use as an argument
the linear model relating the response with the factor variable (output pro-
vided immediately after command):
> anova(lm(y~state))
This output contains some of the information we obtained with the command
‘aov’, but in addition we have now the value of the F-ratio (14.9) for this
study and in particular its associated p-value. Since the p-value is larger than
a reasonable significance level, we conclude that the null hypothesis in Equa-
tions (14.10) may be considered retainable. Before we place more trust in
this result, however, we need to address an important assumption made in
developing the ANOVA test, that of variance homogeneity. As will be recalled
from the preceding Section 14.1, this assumption stipulates that the depres-
sion score variances are the same in the four groups involved—with this com-
mon variance denoted σ² throughout.
To address the plausibility of this assumption, we can test the hypothesis
that the variances of the depression scores are the same in the four studied
groups. (Note that this is not the group mean equality hypothesis (14.1) of
focal interest in ANOVA, but simply one of its underlying assumptions.) We
accomplish this with R using the command ‘bartlett.test’ (result given imme-
diately beneath it):
> bartlett.test(y~state)
Bartlett test of homogeneity of variances
data: y by state
Bartlett's K-squared = 0.558, df = 3, p-value = 0.906
As we can see from this short output, the variance homogeneity test—named
after a prominent statistician of the past century, M. Bartlett—is associated
with a nonsignificant p-value. This result shows that the assumption of equal
variances of depression scores across the four groups under consideration may
be viewed as plausible. We note in this regard that if in a given study this
variance homogeneity assumption is found not to be plausible, one can use
an approximation to the ANOVA F-test in (14.9) that is due to Welch, in
order to test the mean equality hypothesis H0. This approximation is available
in R with the command ‘oneway.test(y~factor)’, where one provides for ‘fac-
tor’ the name of the group membership variable (e.g., Crawley, 2005). Return-
ing to the example of concern here, given the last presented results we can
now place more trust in the earlier ANOVA finding of a nonsignificant F-ratio
that suggested we may retain the null hypothesis of equal means. In particular,
we can interpret that finding as indicative of no mean differences in older
adults’ depression levels across the four studied states.
As an aside at this point, we note in passing that the nonsignificant ANOVA
F-ratio and variance homogeneity test do not allow us to conclude that there
are no differences in aged adults’ depression across all states in the country.
Rather, our results are valid only for the four states included in this example,
i.e., for these four fixed states from which depression data were available for
the above analysis. For this reason, the ANOVA modeling approach we just
conducted is also commonly referred to as a ‘‘fixed effect’’ analysis of variance,
a reference we will make on a few occasions again later in the book (cf. Rau-
denbush & Bryk, 2002, for an alternative approach and assumptions allowing
possible conclusions beyond the particular states included in an empirical
study).
Whenever we carry out an ANOVA, it is useful to also graphically represent
the relationships between the studied groups with regard to the response vari-
able. As we discussed in Chapter 3, this is conveniently achieved using the
graphical tool of the boxplot, and in particular comparing the boxplots of the
four groups in question in the presently considered study. Such a comparison
is made feasible using the command ‘boxplot’, discussed in detail in that
chapter, which becomes particularly helpful when employed in conjunction
with ANOVA. For our Example 14.1, we obtain a simultaneous display of the
four group boxplots with the following command (see Chapter 3):
> boxplot(y~state)
Its results are presented in Figure 14.1. As we can see from Figure 14.1, a
particularly useful feature of this simultaneous display of the four groups’
boxplots is that it allows us to judge their differences in central tendency in
relation to their variability on the response variable. We readily observe from
Figure 14.1 that the central tendency measures for the four groups involved
differ somewhat (cf. Chapter 3). However, as mentioned earlier in this chap-
FIGURE 14.1.
Box plots of the four states’ depression scores (Example 14.1).
ter, some group differences should be expected in the analyzed samples since
they are not identical to the populations studied.
The simultaneous boxplot representation for all examined groups in an
ANOVA setting is especially helpful in assessing the extent to which these
observed differences ‘‘matter,’’ i.e., what their relationship is to the variability
of the scores in the studied groups. When in Figure 14.1 the group differences in central tendency are compared to the within-group variability on the response (as reflected in the vertical distances between the ends of the two whiskers of each boxplot), the central tendency differences across the four groups no longer appear particularly impressive. Loosely speaking, the F-test (14.9) evaluates
the mean group differences in relation to the within-group variability of the
scores. When these differences are sufficiently smaller than that variability, a
nonsignificant result ensues. This is precisely the situation we are dealing with
in the currently considered Example 14.1.
When the result of an ANOVA application is that the underlying null hypoth-
esis is not rejected (as in Example 14.1), the interpretation as mentioned is
that there are no mean group differences. The implication then is that no two
groups differ in their means on the response variable of concern. In that case,
there may be limited remaining interest in the study, if any, as far as mean
relationships across groups are concerned. However, when the overall null
hypothesis (14.1) is rejected, the interpretation is that at least two groups have
different means. A natural question that arises, then, is what are the reasons
for this hypothesis rejection? This query asks which groups are actually differ-
ent in their means.
This question is addressed by what is frequently referred to as ANOVA
follow-up analyses. These are post hoc analyses since they are performed after
the overall (global, or omnibus) null hypothesis (14.1) has been tested and
rejected. There are a number of post hoc procedures, also at times referred to
as multiple comparison or multiple testing procedures, that have been proposed
over the past several decades. In fact, there are more than a dozen such tests
available in the literature. Given the introductory nature of this book, we will
only be concerned with a particular post hoc procedure called Tukey’s hon-
estly significant difference (HSD), which is often of interest to use as a follow-
up of a significant ANOVA F-test, and we refer the reader to alternative treat-
ments for other multiple-testing procedures (e.g., Howell, 2009; Kirk, 2007).
Tukey’s HSD method has been specifically devised to enable testing for
pair-wise group differences upon rejection of the ANOVA null hypothesis H0
in Equations (14.1). This method responds to the frequently raised question
in empirical research, concerning which particular groups differ from one
another. The rationale behind Tukey’s procedure is the realization that testing
the difference between any two groups, after knowledge is available that the
null hypothesis H0 has been rejected, should be carried out in such a way
that only more pronounced pair-wise group differences are proclaimed as
significant than what would be the case if it was not known that H0 was
rejected in the first instance. This goal will be accomplished if one proclaims
as significantly differing, at a prespecified significance level, only such pairs of
groups with the property that the absolute value of their mean difference
exceeds the following quantity:
(14.12) q √[(s²/2)(1/ni + 1/nj)],
where q is the relevant quantile of the studentized range distribution, s² is the pooled within-group variance estimate, and ni and nj are the sample sizes of the two compared groups.
As pointed out earlier in this chapter, we first need to read in the data and
declare the ‘ses’ variable as a factor with 4 levels—‘low’, ‘middle’, ‘high’, and
‘upper’, corresponding to the original scores of 1 through 4 on this variable
(see Section 14.2 for the needed R commands). Next we test the null hypothe-
sis of no SES differences in motivation (output presented immediately follow-
ing command):
> anova(lm(y~ses))
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
ses 3 1836.2 612.1 7.1106 0.0001744 ***
We see that the F-test is associated with a significant result here. That is, we
can reject the null hypothesis of no SES differences in motivation and con-
clude that at least two SES groups have different mean level of motivation.
(The Bartlett’s test of the variance homogeneity assumption is nonsignificant
here, as can be readily found using the earlier mentioned command ‘bartlett.
test’ on this four-group data set.) We then graphically display the boxplots of
the four groups:
> boxplot(y~ses)
The results of this command are presented in Figure 14.2. We see from the
simultaneous display of the four boxplots in Figure 14.2 that the high SES
group has the largest central tendency measure and is also most compact
in terms of variability of the motivation scores. Given the significant mean
differences finding above, we would like to see now which pairs of groups
have different means. We wish to ascertain statistically these differences, and
for this aim we apply Tukey's HSD method (note the use of the ‘aov’
command as argument of this post hoc analysis command):
> TukeyHSD(aov(lm(y~ses)))
$ses
diff lwr upr p adj
middle-low 2.684685 -2.9618758 8.331245 0.6050505
high-low 9.090090 3.4435296 14.736651 0.0002879
upper-low 1.000000 -4.6851040 6.685104 0.9681069
high-middle 6.405405 0.7976534 12.013157 0.0181971
upper-middle -1.684685 -7.3312451 3.961876 0.8653605
upper-high -8.090090 -13.7366505 -2.443530 0.0015874
This output presents the pair-wise group mean differences in the column
titled ‘diff ’. These differences are followed by their confidence intervals—
adjusted as mentioned for the multiplicity of these pair-wise comparisons—in
FIGURE 14.2.
Box plots of the four SES group educational motivation scores (Example 14.2).
the column titled ‘lwr’ and ‘upr’ (for their lower and upper limits, respec-
tively). The last column, ‘p adj’, presents the p-values associated with testing
the set of 4·3/2 = 6 pair-wise differences. These p-values are also adjusted for
the fact that six tests rather than a single test are carried out on the same data
set. We see from the last column that the high SES group differs significantly
from each other group, being higher on mean motivation than any of them
in the study. At the same time, the remaining three SES groups do not show
any significant differences among themselves as far as their mean motivation
level is concerned. Hence, it may be suggested that the rejection of the overall
null hypothesis (14.1) of equal educational motivation means across the four
SES groups considered, may have been primarily the result of the high SES
group having a motivation mean higher than any of the remaining three SES
groups. The other three SES groups do not differ considerably from one
another in their mean motivation levels (see also Figure 14.2).
The previous sections examined settings in which observed subjects were clas-
sified into groups (populations) only according to a single categorical variable.
That is, the groups in question represented the levels of a single factor under
consideration. For instance, in Example 14.1 this factor was state, while in
Example 14.2 it was socioeconomic status (SES). As a consequence, no further
differentiation between subjects was taken into account in these considered
settings. For this reason, the ANOVA we dealt with in Sections 14.1 and 14.2
is also referred to as one-way analysis of variance (one-way ANOVA layout or
design), to emphasize that only a single variable is assumed to contribute to
individual differences on an outcome variable of interest.
Behavioral and social research, however, is typically concerned with com-
plex phenomena that cannot be understood well by considering single cate-
gorical variables (group membership) as possibly contributing to observed
subject differences. Given that these phenomena are usually multifactorially
determined, their modeling will be more realistic when additional variables
are included into the analyses. When two factors (categorical variables) are
assumed to contribute to individual differences on a response variable, the
corresponding analytic approach is referred to as two-way ANOVA (two-way
ANOVA layout or design), and as higher-order ANOVA if more than two
categorical variables are considered simultaneously—e.g., a three-way
ANOVA. For example, when studying educational motivation and student
differences in it, gender differences may as well be seen as a potential contrib-
utor to individual differences on motivation in addition to socioeconomic
status. In this case, by considering also gender one would be dealing with two
rather than a single factor—viz., SES and gender. The corresponding analysis
of response variability will then account for the possible contributions of these
two factors. We turn next to a description of such analytic approaches.
In order to be in a position to discuss two-way and higher-order ANOVA,
it is instructive to return for a moment to Section 14.2, where the test of the
null hypothesis of no mean differences across groups with regard to a single
factor was of interest. In that section, of fundamental relevance was the
decomposition (14.7) of the observed variability, i.e., sum of squares, on the
response variable. Accordingly, the observed sum of squares SST was broken
down into a between-group part, denoted SSB, and a within-group part,
denoted SSW. This decomposition can also be interpreted as stating that SSB is
the part of observed variance (sum of squares) on the response variable, which
is due to the factor under consideration whose levels represent the groups in
the study. For instance, in Example 14.1, SSB is the part of older adults’ differ-
ences in depression that is accounted for by the factor ‘state’ (state residency),
with its four levels representing the four states included in the study. Similarly,
in Example 14.2, SSB is the part of student differences in educational motiva-
tion scores that is explained by the factor socioeconomic status, with its four
levels representing the four groups examined.
The decomposition of response variance is also of fundamental relevance in
two-way and higher-order ANOVAs. In contrast to the preceding two sections
dealing with one-way ANOVAs, however, in a two-way ANOVA we have two
factors contributing to individual differences—e.g., gender and SES in the
above-mentioned motivation study. Accordingly, there are two parts of the
overall sum of squares SST on the outcome measure that they account for. In
addition, there is a new part of the total sum of squares that has no counter-
part in our earlier ANOVA discussions. Specifically, in order to properly
model the sources of individual differences in a setting with two factors (and
similarly with more than two factors), we need to account for the fact that the
effect of one of the factors upon the response variable may depend on the
level of the other factor. For example, when we include both gender and SES
as factors in the motivation study, we need to keep in mind that the effect of
SES on motivation may be different for boys relative to its effect for girls. This
is tantamount to saying that the gender differences in educational motivation
may not be the same in each of the four SES groups of the study. This possibil-
ity of differential effect upon the response of one of the factors, depending on
the levels of the other factor, is captured by the concept of factor interaction.
An interaction between two factors is present when the effect of one of them
on the outcome measure is not the same across the levels of the other factor.
In the motivation study, an interaction between the factors gender and SES
would imply that SES differences in motivation are not the same for boys and
for girls, and that conversely the boy-girl differences in motivation depend on
SES (SES group).
The possibility that there is an interaction does not mean that it will always
exist—or even be of a marked magnitude—in an empirical study with two or
more factors. If in an empirical setting the effect of one of the factors is the
same regardless of the level of the other factor considered, then there is no
interaction. In the motivation study, lack of interaction between gender and
SES would imply that the boy-girl differences in motivation are the same
regardless of SES, and that the SES impact upon motivation is the same for
both genders. Whether an interaction is present or not in a study depends
obviously on the particular phenomenon researched and can be examined
using the available data from that study.
To evaluate the empirical evidence for an interaction, or lack thereof, in a
two-way ANOVA setting, instrumental use is made of the following decompo-
sition of the response variance into sources attributable to each of the two
factors—denoted, say, A and B—and their possible interaction (subindexed
by ‘‘AB’’; e.g., Howell, 2009):
(14.13) SST = SSA + SSB + SSAB + SSR.
In Equation (14.13), the symbol SS is used to designate the sum of squares
attributable to the factor in its subindex, the interaction, or the residual—that
is, to all other factors not explicitly included in the right-hand side of (14.13).
(We assume in the remainder of this chapter that there is more than a single
observation in any factor level combination, as will frequently be the case in
an observational behavioral or social study.) To test the significance of the
interaction, we can reason in the following way: If we assume the null hypoth-
esis of no interaction to be true, then the interaction and residual sum of
squares—SSAB and SSR—should each estimate the residual variance once averaged by dividing by its respective degrees of freedom, leading to the corresponding mean sums of squares, denoted MSSAB and MSSR. There-
fore, under the null hypothesis of no interaction of the factors A and B, the
F-ratio
(14.14) F = MSSAB/MSSR = (SSAB/dfAB)/(SSR/dfR)
can be expected to be close to one. In Equation (14.14), dfAB are the degrees
of freedom for the interaction that can be shown to equal the product of the
degrees of freedom dfA and dfB associated with the two factors A and B, and
dfR are the degrees of freedom for the residual. More specifically, if k and p
are the numbers of levels of factors A and B, respectively, then dfA = k − 1, dfB = p − 1, dfAB = (k − 1)(p − 1), and dfR = n − kp, where n denotes the total sample size (Agresti & Finlay, 2009).
Furthermore, under the null hypothesis of no interaction, it can be shown
that the F-ratio in (14.14) follows an F-distribution with dfAB and dfR degrees
of freedom (Howell, 2009). In an empirical setting, statistical software—in
particular R—provides readily the value of this F-ratio and its associated
p-value. A small enough p-value for the F-ratio—viz., smaller than a preset
significance level that can as usual be chosen to be α = .05—can be interpre-
ted as indicative of sufficient evidence in the analyzed data to warrant rejec-
tion of the null hypothesis of no interaction. When this p-value is higher than
the preset significance level, the interpretation is that there is not enough
evidence in the analyzed data that would allow us to reject the null hypothesis
of no interaction, and then we can retain this hypothesis. (For the case of a
single observation per factor level combination, which can be viewed as rare
in much of empirical social and behavioral research, see Howell, 2009.)
The concept of interaction reflects the possibility of the effect of one of the
factors to depend on the level of the other factor, but it does not respond to
the query whether a factor itself has an overall effect, i.e., disregarding the
other factor. Such an effect is referred to as main effect and would be present
when the group means on this factor (pooling over the levels of the other
factor, i.e., disregarding them) are sufficiently distinct. For a given factor, say
A, the null hypothesis pertaining to its main effect is that the factor ‘‘does not
matter,’’ i.e., its main effect is zero. In other words, this factor-specific null
hypothesis asserts that the means of the groups that represent the levels of
factor A—irrespective of those of the other factor—are not distinct in the
population. Similarly to (14.14), this null hypothesis can be tested using the
F-ratio
(14.15) F = MSSA/MSSR = (SSA/dfA)/(SSR/dfR).
This ratio follows under the null hypothesis an F-distribution with dfA and dfR
degrees of freedom (e.g., Agresti & Finlay, 2009). The value of the F-ratio
(14.15) is also routinely provided by statistical software, in particular R, as is
its associated p-value that allows in the usual manner a decision whether to
reject or consider retainable the pertinent null hypothesis.
The interpretation of the main effect for either factor considered in a two-
way ANOVA is substantially easier when there is no interaction of the two
factors involved in it. In case they interact, however, it is not as straightfor-
ward to interpret the main effects of the factors. One may argue then that
each of these main effects represents an average effect of the corresponding
factor across the levels of the other factor involved in the interaction. When-
ever that average is deemed by a researcher to be substantively relevant, this
interpretation can be used in an empirical setting. In some cases, however,
this average effect may not be of interest, and then it would be appropriate to
consider the effect of factor A, say, within each of the levels of factor B (or
conversely). This study of level-specific effects of the former factor is often
> anova(lm(y~ses+gender+ses*gender))
Before we discuss its output, we note that in the argument of the command
‘lm’ we list both factors in an additive fashion as well as their interaction
denoted by an asterisk. (A shorter version of this argument is ‘y~ses*gender’,
but we use here the above in order to emphasize the inclusion of the interac-
tion into the analysis.) The ANOVA variance table that results is provided
next.
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
ses 3 1836.2 612.1 7.3844 0.0001263 ***
gender 1 275.5 275.5 3.3236 0.0704578 .
ses:gender 3 509.3 169.8 2.0484 0.1099760
Residuals 138 11438.4 82.9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we see from this output, the interaction of the two factors (formally
denoted here ‘‘ses:gender’’) is associated with a p-value of .11, which is
larger than a typical significance level. For this reason, the interaction is not
significant, which we interpret as suggesting that the effect (if any) of SES
upon motivation is the same for boys and girls. Alternatively, we can interpret
this finding as suggesting that the gender differences in motivation (if any)
are the same in each SES group.
Since there is no interaction of SES and gender, the interpretation of the
main effect of either factor is straightforward, as mentioned earlier in this
section. Due to the fact that the SES factor is associated with a small p-value
(smaller than .05 usually employed as a significance level), we conclude that
there are SES differences in motivation pooling over gender, i.e., disregarding
gender. That is, according to the analyzed data, different SES groups have on
average different educational motivation, irrespective of gender. Further, due
to the fact that the p-value associated with the factor ‘gender’ is not signifi-
cant, it is suggested that there is no gender effect. That is, boys and girls do
not seem to differ in educational motivation, disregarding their SES.
Just as was done in a one-way ANOVA, it is helpful to examine also the
boxplots of the eight groups involved in this study, which is achieved with the
command ‘boxplot’ (note the notation used for the term appearing after the
‘~’ sign, in order to ensure that all eight group boxplots are obtained):
> boxplot(y~ses:gender)
The resulting boxplots are presented in Figure 14.3. These eight boxplots
reveal that the boys and girls in the high SES group have the highest level of
motivation, while those in the upper SES group tend to be the lowest. The
lack of interaction between gender and SES in this example is seen in the
graphical representation in Figure 14.3 by observing that the same pattern of
increase and then drop is found for boys and for girls, when one considers
also their SES in increasing order (from ‘low’ to ‘upper’, within each of the
genders—represented by the first four and last four boxes, respectively).
The discussion in this section of two-way ANOVA is directly extended to
the case of higher-order ANOVA, e.g., three-way ANOVA. An increase in the
number of factors simultaneously considered in a study beyond three is, how-
ever, associated with considerably larger difficulties in the substantive interpretation of the results obtained.
FIGURE 14.3.
Box plots of motivation scores for eight SES and gender groups (Example 14.3).
As pointed out above, the factor ‘gender’ is a binary variable, so we do not need to do any recoding of its values. The factor ‘ses’ is, however, a polytomous variable with k = 4 categories. Hence, to recode group membership with regard to socioeconomic status (SES), we introduce k − 1 = 3 dummy variables, denoted x1 through x3. Choosing the SES category
‘low’ as a reference group—which here is an essentially arbitrary decision—all
36 students in this group receive a 0 on the variables x1 through x3. (One can
use the command ‘summary(ses)’ to readily find out the SES group sample
sizes in this study.) Then all 37 students in the ‘middle’ SES group receive a 1
on the dummy variable x1 but 0’s on x2 and x3. Similarly, all 37 students in
the ‘high’ SES group receive a 1 on the dummy variable x2 but 0’s on x1 and
x3. Finally, all 36 students in the ‘upper’ SES group receive a 1 on the dummy
variable x3 but 0’s on x1 and x2. In this way, all studied subjects obtain a score
of 0 or 1 on each of the three newly defined variables x1 through x3.
These three dummy variables have the property that by looking at them
one can determine uniquely which SES group a given student belongs to.
That is, the three dummy variables x1 through x3 equivalently represent all
information available initially about group membership for each of the 146
subjects participating in this study. Thereby, to reflect correctly this informa-
tion, we need the three dummy variables—no single one or a pair of them
contains the complete group information for all subjects in the available sam-
ple. Table 14.1 contains the original data and the added three dummy vari-
ables for the first 20 students in Example 14.2 from Section 14.4. (The data
for all students, with the added three dummy variables, can be found in the
file CH14_EX4.dat.)
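Purely for illustration, the three dummy variables could be created by hand as follows; this is a sketch assuming ‘ses’ still holds the original numeric codes 1 through 4 shown in Table 14.1 (as noted below, R performs this coding internally once ‘ses’ is declared a factor):
> x1 = as.numeric(ses == 2)   # 1 for the 'middle' SES group, 0 otherwise
> x2 = as.numeric(ses == 3)   # 1 for the 'high' SES group, 0 otherwise
> x3 = as.numeric(ses == 4)   # 1 for the 'upper' SES group, 0 otherwise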
As indicated earlier, the recoded data set contains all original data of Exam-
ple 14.2 and no additional information. In fact, the dummy variables x1
through x3 reflect the same information as that pertaining to group member-
ship with regard to SES, which is contained in the original variable ‘ses’. The
gain in constructing the three dummy variables x1 through x3 is that when we
use them along with the gender variable, we can carry out ANOVA by employ-
ing regression analysis rather than the traditional analysis of variance frame-
work discussed in the preceding sections of this chapter. This will be possible
if we include the product of the dummy variables across factors as variables
that represent in their entirety the interaction in the associated ANOVA
model. This inclusion is automatically carried out by the statistical software
used, in particular R, which also performs internally the dummy coding we
discussed in this Section 14.5.
The earlier mentioned equivalence of the modeling results when studying
group mean differences using the conventional ANOVA framework on the
one hand, and corresponding regression analysis approach on the other hand,
is readily demonstrated on the recoded data from Example 14.2. To this end,
given that we already have the pertinent ANOVA results presented in Section
Table 14.1 Original variables and added dummy variables for the first 20
subjects in Example 14.2 (variable names in top row; see Section 14.2 and
file CH14_EX4.dat for the entire recoded data set).
y ses gender x1 x2 x3
20 2 0 1 0 0
37 3 0 0 1 0
29 3 0 0 1 0
28 3 1 0 1 0
24 4 0 0 0 1
14 4 1 0 0 1
12 1 1 0 0 0
45 2 1 1 0 0
6 2 0 1 0 0
10 4 1 0 0 1
16 1 0 0 0 0
42 3 1 0 1 0
20 2 1 1 0 0
1 4 1 0 0 1
13 1 0 0 0 0
13 1 0 0 0 0
15 2 0 1 0 0
32 3 1 0 1 0
27 3 0 0 1 0
37 3 1 0 1 0
14.4, all we need to do here is carry out a regression analysis using as predict-
ors the factors ‘gender’ and ‘ses’ as well as their interaction. As indicated
above, this interaction will be represented here by the three products of each
dummy variable x1 through x3 with gender.
We begin our analyses by creating as an R-object the output of a regression
analysis relating the response variable ‘y’ to SES, gender, and their interaction.
We achieve this aim as follows (output provided immediately beneath com-
mand):
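The pertinent R-object can be created as follows (a sketch; the model formula matches the ‘Call’ reproduced in the output below):
> ra1 = lm(y ~ ses + gender + ses*gender)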
> summary(ra1)
Call:
lm(formula = y ~ ses + gender + ses * gender)
Residuals:
Min 1Q Median 3Q Max
-20.4000 -5.6146 -0.8333 5.6900 25.0000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.583 1.858 12.152 <2e-16 ***
sesmed 1.042 2.628 0.396 0.6925
seshigh 7.167 3.219 2.226 0.0276 *
sesupper -3.583 3.315 -1.081 0.2816
gendergirl -8.750 3.219 -2.718 0.0074 **
sesmed:gendergirl 5.125 4.493 1.141 0.2560
seshigh:gendergirl 7.280 4.537 1.605 0.1109
sesupper:gendergirl 11.150 4.606 2.421 0.0168 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This regression output is, however, more detailed than needed for our goals in a typical ANOVA setting. In such a setting, we usually
wish to know whether the interaction effect is significant, in its entirety, as
well as whether there are main effects of the factors involved. To obtain this
combined information for each of these three effects (two main effects and an
interaction), we need the corresponding analysis of variance table. We furnish
this table with the R command ‘anova’ applied on the R-object being the
discussed R output for the last regression analysis:
> anova(ra1)
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
ses 3 1836.2 612.1 7.3844 0.0001263 ***
gender 1 275.5 275.5 3.3236 0.0704578 .
ses:gender 3 509.3 169.8 2.0484 0.1099760
Residuals 138 11438.4 82.9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As can be seen, the resulting analysis of variance table is identical to the one obtained with the conventional ANOVA approach in Section 14.4, demonstrating the equivalence of the two modeling frameworks. In this regression-based approach, the interaction of two factors is represented by the set of products of the dummy variables pertaining to the factors involved in the interaction. All these products then are included in a single
step in the ensuing regression analysis. Similarly, the main effect of any factor
is represented by the set of dummy variables pertaining to it, which too are
entered as a block into the regression model when examining that effect. This
modeling approach equivalence is the reason why there was no need to discuss
in detail higher-order ANOVA in the preceding Section 14.4.
We indicated on a few occasions earlier in the book that behavioral and social
research is typically concerned with exceedingly complex phenomena that are
usually multifactorially determined. In order to realistically study them, there-
fore, as mentioned before it is often desirable that we utilize several explana-
tory variables that need not all be of the same type. Following this realization,
we extended in Section 14.4 the number of factors included in an ANOVA.
At times, however, substantive considerations may require including one
or more continuous predictors, in addition to a set of factors in an initial
ANOVA setting. This inclusion may be desirable in order to account for
important explanatory variable(s) with respect to the response measure. The
latter measure is then modeled in terms of several predictors that differ in
their scale—some binary or polytomous (factors), others continuous. Since
these variables usually show appreciable correlation with the outcome mea-
sure, they are frequently referred to as covariates. For this reason, the model-
ing of a response variable using categorical (factors) as well as continuous
explanatory variables is called analysis of covariance (ANCOVA). This is the
modeling approach of concern in the remainder of the present section.
As can be expected from our ANOVA and regression result equivalence
discussion in the preceding Section 14.5, ANCOVA presents no new problems
as a modeling approach (see also Raykov & Marcoulides, 2008). Indeed, in
the last section we demonstrated how one can account for categorical predict-
ors of a given outcome variable. To this end, one first appropriately recodes
them (a process carried out internally by statistical software, like R, after their
definition as factors by the user), and then includes the resulting group mem-
bership dummy variables as predictors in the corresponding regression analy-
sis model. When, in addition to these factors, continuous predictors are also of interest, one merely includes them along with the factors in that
multiple regression model. In this way, one carries out ANCOVA in an empir-
ical setting.
We emphasize that just as a researcher does not need to perform the factor
recoding in ANOVA, there is no need for such recoding to be done by him or
her for the purpose of an ANCOVA—of course as long as the factors are
initially declared to the software as categorical variables. Hence, all that is needed is to include the covariate(s) along with the factors as predictors in the pertinent regression model. We illustrate this approach with the following example.
Example 14.3: In this investigation, n = 138 middle school students from five
states were administered a mathematics ability test at the end of a week-long
training in relevant mathematics concepts and their relationships. In addition,
at the beginning of the training an intelligence test was administered to all
students involved in the study. (The data are contained in the file CH14_
EX5.dat, where the mathematics test results are denoted ‘y’, state residence by
‘state’, and the intelligence test score is designated ‘iq’.) A researcher wishes to
examine possible state differences in mathematics achievement at the end of the
training, after accounting for potential differences in intelligence at the begin-
ning of the study (cf. Raykov & Marcoulides, 2008). That is, her intention is to
see whether there are differences across the five states involved in mathematics
achievement after controlling for possible prior intelligence differences (IQ
score differences) across the groups. In this way, she wishes to account for
possible initial intelligence differences that may be related to those observed in
the final mathematics test scores.
To respond to this question, first we read in the data and declare the state
variable as a factor (see Section 14.1):
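A minimal way to carry out these two steps is sketched below; it assumes the data file is whitespace-delimited, has a header row with the variable names ‘y’, ‘state’, and ‘iq’, and is located in R's current working directory:
> d = read.table("CH14_EX5.dat", header = TRUE)  # read in the data
> d$state = factor(d$state)                      # declare state as a factor
> attach(d)                                      # make y, state, and iq directly accessible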
We begin our modeling process with the ANCOVA model explaining individ-
ual differences in the mathematics test scores in terms of state and IQ score
differences. To this end, we regress the mathematics test scores on the factor
‘state’ and the continuous IQ score. Thereby, we include the interaction
between the factor ‘state’ and the quantitative variable ‘iq’, in addition to
their own main effects, in order to examine whether IQ scores contribute to
differences in the mathematics test scores in a way that is not the same across
states. We accomplish this using again the R command ‘lm’, followed by a
request for the associated analysis of variance table (results provided beneath
commands):
> m1 = lm(y~state*iq)
> anova(m1)
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
state 4 1627.2 406.8 4.6620 0.001516 **
iq 1 28.3 28.3 0.3245 0.569933
state:iq 4 198.8 49.7 0.5697 0.685105
Residuals 128 11168.7 87.3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As can be seen from this output, the interaction between ‘state’ and ‘iq’ is not
significant. That is, different states have the same regression slope of the (lin-
ear) relationship between mathematics and intelligence test scores. This find-
ing suggests that a main assumption of the classical ANCOVA approach,
frequently referred to as homogeneity of regression (e.g., Raykov & Marcou-
lides, 2008), is fulfilled in this data set. The implication is that we can next
drop this interaction from the model and proceed with the more parsimoni-
ous model assuming no such interaction, i.e., homogeneity in the regression
slopes of mathematics test score upon IQ, across the five states involved in the
study. We fit this model using the same R command, with a correspondingly
simplified argument that reflects the lack of this interaction:
> m2 = lm(y~state+iq)
> anova(m2)
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
state 4 1627.2 406.8 4.7236 0.001356 **
iq 1 28.3 28.3 0.3288 0.567368
Residuals 132 11367.5 86.1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We observe here the lack of significance of the IQ score, controlling for state.
We can interpret this finding as suggesting that within each of the states there
is no discernible (linear) relationship between mathematics and intelligence
test score. Further, due to the finding that the factor ‘state’ is significant,
controlling for IQ score, it is suggested that for students with the same IQ
score, state residency (schooling) does matter as far as their mathematics test
scores are concerned. Hence, there is evidence in the analyzed data set for
state differences in mathematics achievement at the end of the training period
even for students with the same initial IQ score.
We can examine how the last fitted, simpler ANCOVA model compares with the starting full model including the interaction of state residency and IQ score, by using the ‘anova’ command to compare their fit to the analyzed data (output provided beneath command):
> anova(m2, m1)
Model 1: y ~ state + iq
Model 2: y ~ state * iq
Res.Df RSS Df Sum of Sq F Pr(>F)
1 132 11367.5
2 128 11168.7 4 198.8 0.5697 0.6851
We see from this output that the test of the null hypothesis of no fit differ-
ences for these two models is not significant—the p-value of the pertinent
F-ratio, being .69, is above a reasonable prespecified significance level. Hence,
we can conclude that the two models do not differ in their degree of reproduc-
ing the data, i.e., in their goodness of fit to the latter. For this reason, we
prefer the more parsimonious, conventional ANCOVA model with the
assumption of identical regression slope of mathematics test score upon IQ
score across all five states.
As indicated earlier, this ANCOVA model suggests—based on the analyzed
data set—that there are state differences in mathematics achievement even
after controlling for possible initial state differences in intelligence. To inspect
graphically these differences, we take a look at the mathematics test score
boxplots across the five states, which we obtain by using as before the R com-
mand ‘boxplot’ (see Figure 14.4 for the resulting boxplots):
> boxplot(y~state)
Figure 14.4 suggests that students in state C perform best on the mathematics test while the remaining four states are at about the same level of performance.
FIGURE 14.4.
Box plots of mathematics test scores for the five states in Example 14.3.
To examine which pairs of states actually differ in mean mathematics achievement, we carry out Tukey's HSD multiple comparison procedure (output provided beneath command):
> TukeyHSD(aov(lm(y~state)))
$state
diff lwr upr p adj
B-A 4.1681034 -2.395180 10.731387 0.4035059
C-A 9.2416667 2.736020 15.747313 0.0012712
D-A 0.8266129 -5.624649 7.277875 0.9965853
E-A 2.5625000 -5.275692 10.400692 0.8948743
C-B 5.0735632 -1.592918 11.740044 0.2241862
D-B -3.3414905 -9.954909 3.271928 0.6304613
E-B -1.6056034 -9.577791 6.366584 0.9808965
D-C -8.4150538 -14.971276 -1.858831 0.0047763
E-C -6.6791667 -14.603971 1.245638 0.1416725
E-D 1.7358871 -6.144333 9.616107 0.9734265
The pair-wise group comparisons presented in this output section suggest that the only significant differences in mean mathematics achievement are found between states C and A and between states C and D. States A, B, D, and E do not differ significantly from one another with regard to average performance on the mathematics test, and while state C exhibits the highest mean performance, its differences from states B and E do not reach significance. We stress that these
pair-wise comparisons do not take into account the IQ test scores, which we
found earlier not to be an important predictor of mathematics achievement
beyond state residency.
To conclude this chapter, we have seen that ANOVA is a very helpful
statistical method that permits examining group differences with regard to
one or more categorical variables (factors). Thereby, we have observed that
ANOVA can be carried out within a traditional variance decomposition
framework, or alternatively within a modern regression analysis approach. A
conceptual advantage of the latter is that it permits one also to include contin-
uous predictors (covariates) of outcome measures. The resulting ANCOVA
modeling approach, under its assumptions (e.g., Raykov & Marcoulides,
2008), makes it possible to study group differences on a response measure
after removing group differences that can be accounted for by the relationship
between that measure and one or more continuous covariates considered in
addition to categorical factors.
NOTE
Modeling Discrete Response Variables
In the previous three chapters it was assumed that the examined response
variables were continuous measures (or approximately continuous). Such an
assumption about the response variables will be fulfilled in many but not all
empirical settings in the social and behavioral sciences. Often, researchers in
these disciplines will also be interested in explaining individual differences in
binary or highly discrete outcome variables, in terms of one or more indepen-
dent measures. In such cases, the continuous characteristics of the response
variables cannot be reasonably assumed. Applying the regression analysis methods that we dealt with in the last three chapters can then yield incor-
rect results and misleading substantive interpretations. Instead, use of special
methods developed for discrete response variables is necessary in settings with
discrete outcome measures. A number of modeling approaches that are
appropriate in these empirical cases are available within a very general analytic
framework, commonly referred to as the generalized linear model. This
framework is the topic of the present chapter.
Suppose the values of the predictors x1, x2,..., xk were fixed (for simplicity,
below we use for them the same notation; cf. Raykov & Marcoulides, 2011).
Since the mean of the residual e is zero, taking mean (expectation) from both
sides of Equation (15.1) we obtain
(15.2) M(y) = b0 + b1x1 + b2x2 + ... + bkxk,
where M(y) denotes the mean response (at the given set of predictors, x1, x2,...,
xk). An equivalent version of Equation (15.2) is often presented as
(15.3) M(y | x1, x2, ..., xk) = b0 + b1x1 + b2x2 + ... + bkxk,
where the left-hand side emphasizes now that the mean of y is taken for
subjects with the given set of predictor values x1 through xk. For this reason,
M(y | x1, x2,..., xk) is commonly referred to as conditional expectation of y given
the predictors.
Equation (15.3) states that this conditional expectation of the response
variable is assumed to be a linear function of the predictors. This means that
all unknown coefficients (parameters) b0 through bk are involved in that equa-
tion only as multipliers of the predictor values, and no other operation is
performed on them. This feature of the conditional expectation of a response
measure is what is typically meant when referring to a model as linear, in
particular when referring to the regression model (15.1) as a linear model. If
this model is postulated simultaneously for a number of outcome variables—
which, as usual in empirical research, may be interrelated among them-
selves—the resulting set of equations is often referred to as a general linear
model (GLM; e.g., Raykov & Marcoulides, 2008; Timm, 2002).
When employing the regression model defined in Equation (15.1), as men-
tioned earlier in the book a researcher can choose the explanatory variables
any way he or she decides to (as long as they have positive variance and
correspond to the research question asked). In particular, the researcher need
not be concerned with their distribution (e.g., Agresti & Finlay, 2009). That
is, the predictors can be binary (dichotomous), nominal, categorical (ordinal),
or continuous (interval or ratio scaled). At the same time, however, it is
required that the response variable(s) be continuous. For such outcome vari-
ables the GLM postulates a linear relationship(s) between the dependent and
explanatory variables. This linear relationship(s) is the hallmark of the GLM
and a reason why it has been very widely applied throughout the behavioral
and social sciences.
While the GLM provides a highly useful methodology for studying variable
relationships, it does not cover the cases when response variables take a lim-
ited number of possible values in studied populations. For example, when the
answer of a student on a given item in an ability test is of interest, the resulting
random variable representing his or her response is usually recorded as true or false, i.e., it is binary rather than continuous, so the GLM is not appropriate for it. To handle such response variables, the generalized linear model (GLIM) extends the GLM by postulating that a suitably chosen function g(.) of the response mean, rather than that mean itself, is linearly related to the explanatory variables;
that is,
(15.4) g(M(y)) = b0 + b1x1 + b2x2 + ... + bkxk.
We stress that Equation (15.4) differs from Equation (15.2) only by the fact
that it is not the response mean itself, but rather a function of it, viz., g(.),
which is linearly related to the given set of explanatory variables x1 through
xk. That is, by postulating Equation (15.4) the GLIM preserves the linear rela-
tionship idea in the right-hand side of its modeling equation. At the same
time, the GLIM framework provides a number of options with respect to the
quantity appearing in the left-hand side of Equation (15.4), g(.), which makes it very comprehensive. One such possibility, for instance, is realized when the function g(.) is chosen to be the identity—that is, if g(M(y)) = M(y)—which represents perhaps the simplest example of a GLIM. With this choice,
a GLIM is in fact identical to a corresponding GLM, as we will see in more
detail later. Since within the GLIM framework one relates a transformation of
the response mean to a set of explanatory variables, as in Equation (15.3), the
function g(.) is often referred to as a link—a reference we will frequently use
in the rest of this chapter.
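As a small illustration of the identity-link case mentioned above, a GLIM with the identity link and a normal random component reproduces an ordinary linear regression fit; the data below are simulated purely for demonstration:
> set.seed(1)
> x = rnorm(50); y = 2 + 3*x + rnorm(50)                   # simulated illustration data
> coef(lm(y ~ x))                                          # ordinary linear model
> coef(glm(y ~ x, family = gaussian(link = "identity")))   # same estimates via a GLIM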
The choice of the link g(.) does not complete the specification of a particu-
lar GLIM, however. In fact, any GLIM consists of three main elements (e.g.,
Dobson, 2002; Raykov & Marcoulides, 2011): (i) a random component; (ii) a systematic component, or linear predictor, given by the linear combination of the explanatory variables in the right-hand side of Equation (15.4); and (iii) a link function g(.) connecting the mean of the response variable to the linear predictor.
The random component meant in (i) is the distribution of the response vari-
able. For example, if the answer (score) on a single item in an achievement test
is of interest to model, the associated random variable representing student
responses is often binary. In that case, this variable follows what is called a
Bernoulli distribution with a probability p of taking the value 1 (‘‘correct’’ response, say). For such a binary response, a particularly useful link is the logit function,
(15.5) f(p) = ln[p/(1 - p)],
where ln denotes the natural logarithm. This link turns out to be very convenient in an empirical setting. The reason is that using the logit function (15.5) one obtains a
value that is unrestricted. Indeed, this function f(p) is readily seen as being
actually continuous, and it thus can be now considered a continuous counter-
part of the original probability, p.
Returning for a moment to the binary response variable case, we recall
from Chapter 5 that the mean of a binary random variable, say y, is the proba-
bility of correct response, denoted p, say. Looking now at Equation (15.5), we
realize that since the logit function furnishes a continuous counterpart of the
probability p, i.e., of the mean of y, we can make the following observation.
When we wish to relate a binary response y to a set of explanatory variables,
x1, x2,..., xk, we can consider our effort as analogous to a regression analysis
on the logit of the mean of y, i.e., of p, upon the predictors. This is the essence,
at an informal level, of the popular statistical technique of logistic regression.
(For further and formal details, we refer the reader to more advanced treat-
ments, such as Hosmer & Lemeshow, 2002).
The GLIM with a binary response has been in fact long known in the
behavioral and social sciences under an alternative name, logistic regression.
Specifically, keeping also in mind the discussion in the preceding section
15.3.1, the model of logistic regression is as follows:
(15.6) ln[p/(1 - p)] = b0 + b1x1 + b2x2 + ... + bkxk,
where p = Pr(y = 1) is the probability of the response variable taking the
value of 1 (e.g., a ‘‘yes’’ response on a given question in a survey, or a ‘‘cor-
rect’’ answer on a test item), and x1 through xk are the explanatory variables
included in the ‘‘linear predictor’’ of this GLIM (see Section 15.1). That is,
according to the logistic regression model (15.6), it is the logit of the probabil-
ity of response 1, which is assumed to be a linear function of a set of explana-
tory variables, rather than this probability itself.
An alternative representation of the logistic regression model—i.e., of the
GLIM with a binary response—is readily obtained from Equations (15.5) and
(15.6); this alternative highlights further interesting properties of this model
that can be very useful in an empirical setting. Specifically, through some
direct algebra we obtain the probability of the event in question (e.g., a ‘‘cor-
rect’’ response on a test item) as follows:
(15.7) p = Pr(y = 1) = e^(b0 + b1x1 + ... + bkxk) / [1 + e^(b0 + b1x1 + ... + bkxk)].
This shows that this probability is a nonlinear function of the used predictors
x1 through xk. Therefore, Equation (15.7) demonstrates that the probability p
for the event considered depends on the predictor variables in a relatively
complicated way, rather than in a linear fashion.
The expression given in the right-hand side of Equation (15.7) is especially
useful when one is interested in predicting the probability of an event using a
set of known predictor values. Assuming that the logistic model (15.6) is a
plausible means of data description and explanation, the predicted probability
follows as
(15.8) p̂ = e^(b̂0 + b̂1x1 + ... + b̂kxk) / [1 + e^(b̂0 + b̂1x1 + ... + b̂kxk)],
where b̂0 through b̂k are the estimated parameters of the model. When this
predicted probability is higher (smaller) than .5, one may conclude that given
the model and the set of predictor values, the event under consideration is
predicted to occur with higher (smaller) probability than by chance alone.
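As a brief numerical sketch of Equation (15.8), with purely hypothetical parameter estimates and predictor values (no particular data set is assumed), the predicted probability can be computed in R as follows:
> b.hat = c(-2.0, 0.8, 1.5)           # hypothetical estimates of b0, b1, b2
> x.new = c(1, 0.5, 1.0)              # 1 for the intercept, then the values of x1 and x2
> eta = sum(b.hat * x.new)            # estimated linear predictor
> p.hat = exp(eta) / (1 + exp(eta))   # Equation (15.8)
> p.hat                               # a value above .5 predicts the event to occur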
The logit, however, is not the only link; there are other link functions that are appropriate within the GLIM approach with other types of data (cf.
Raykov & Marcoulides, 2011). One important such link is provided by the
logarithm function, which turns out to be very useful when the outcome vari-
able data are in terms of counts. This can be the case, for instance, when of
interest is to study the number of times a particular event occurs within a
given interval, such as disease outbreak (e.g., Chapter 5). The logarithm link
has been shown to possess important properties in such settings, which make
it the choice for a corresponding GLIM application, usually associated with a
Poisson distribution of the response variable. Additional links are discussed
in detail, e.g., in Dobson (2002). More generally speaking, the GLIM frame-
work is highly comprehensive and covers many different types of data collec-
tion designs. Within this framework, optimal statistical procedures for
parameter estimation and testing purposes are available and discussed in
alternative and more advanced sources (e.g., Skrondal & Rabe-Hesketh,
2004). In the rest of this chapter, we return to the topic of logistic regression,
which is a very popular analytic framework in the social and behavioral disci-
plines.
Example 15.1: Suppose we have access to data from a study of college admission
with n = 132 students from five states. In this study, data on the dichotomous variable college admission were collected, along with data on student gender and state residency, as well as their scores on an intelligence test and a scholastic aptitude test. (The data are available in the file CH15_EX1.dat, where ‘college’
is the admission variable—with a value of 1 for students admitted and 0 other-
wise—‘aptitude’ is the pertinent test score, ‘state’ and ‘gender’ their state resi-
dency and gender, respectively, and ‘iq’ the intelligence test score, referred to
below as ‘‘IQ score’’ or ‘‘IQ’’ for short.)
We are interested here in explaining individual differences in college admis-
sion in terms of student differences in state residency, gender, and IQ, as well
as aptitude test scores. Given the binary response variable college admission, we cannot directly apply the multiple regression model of the preceding chapters and instead need an appropriately specified GLIM. To this end, we employ a GLIM with (cf. Raykov & Marcoulides, 2011): (i) a binary (Bernoulli) distribution as the random component for the response ‘college’; (ii) a linear predictor comprising the variables ‘state’, ‘gender’, ‘aptitude’, and ‘iq’; and (iii) the logit link function.
Hence, the GLIM model (i.e., logistic regression model) to be fitted is:
(15.9) ln[p/(1 - p)] = b0 + b1 state + b2 gender + b3 aptitude + b4 IQ.
As indicated earlier in the chapter, this model is alternatively reexpressed in
terms of the probability of college admission as follows:
(15.10) p = e^(b0 + b1 state + b2 gender + b3 aptitude + b4 IQ) / [1 + e^(b0 + b1 state + b2 gender + b3 aptitude + b4 IQ)]
= 1 / [1 + e^(-(b0 + b1 state + b2 gender + b3 aptitude + b4 IQ))].
As we mentioned before, Equation (15.10) can be used after fitting the model,
to work out predicted probability of college admission for a given set of stu-
dent predictor values (e.g., for an individual with available data on the used
predictors). Also, we emphasize that the right-hand side of Equation (15.12)
is never larger than one or smaller than zero, just like any probability.
To fit the logistic regression model in Equation (15.9) that addresses the
research question of concern here, we need to indicate to R the GLIM ele-
ments described in points (i) through (iii). To accomplish this aim, after we
declare the variable ‘state’ as a factor (see also Note to Chapter 14), we fit the model with the R command ‘glm’, specifying its binomial random component via the ‘family’ argument, and then request a summary of the results (output provided beneath commands):
> m1 = glm(college ~ state + gender + aptitude + iq, family = binomial)
> summary(m1)
Call:
glm(formula = college ~ state + gender + aptitude + iq, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.9317 -0.6753 0.2370 0.6816 1.9986
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -25.93721 6.05502 -4.284 1.84e-05 ***
stateB -0.36291 0.73193 -0.496 0.620
stateC -0.11570 0.80071 -0.144 0.885
stateD -0.37403 0.69679 -0.537 0.591
stateE 0.16550 0.86090 0.192 0.848
gender 0.30155 0.49305 0.612 0.541
aptitude 0.16683 0.03746 4.454 8.44e-06 ***
iq 0.22820 0.05787 3.943 8.04e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This output differs in an important way from the outputs we obtained when
carrying out regression analyses earlier in the book. Specifically, a new quan-
tity is included here, which is called deviance. A complete and formal defini-
tion of the deviance is beyond the scope of this introductory book (see e.g.,
Dobson, 2002, for such a discussion). We merely mention here that the devi-
ance in a GLIM corresponds to the sum of squared residuals in a linear mod-
eling context as discussed in Chapters 12 through 14. That is, with large
samples, a large deviance is indicative of a model that is not well fitting the
data, while a smaller deviance is consistent with a model fitting the data better.
When all cells of the underlying study design with regard to the factors involved have expected frequencies under the model that are in excess of five, the deviance follows approximately a chi-square distribution, and differences in deviance between nested models can then be used for model comparison purposes.
Returning to the output for model ‘m1’, we notice that none of the coefficients associated with the factor ‘state’ is significant. We therefore fit next the model dropping this factor and request a summary of it (output provided beneath commands):
> m2 = glm(college ~ gender + aptitude + iq, family = binomial)
> summary(m2)
Call:
glm(formula = college ~ gender + aptitude + iq, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7961 -0.6737 0.2643 0.6681 1.9111
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -25.81204 5.92330 -4.358 1.31e-05 ***
gender 0.28482 0.48151 0.592 0.554
aptitude 0.16384 0.03449 4.751 2.03e-06 ***
iq 0.22594 0.05672 3.983 6.80e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is possible to test formally whether dropping the factor ‘state’ leads to a loss of fit by comparing the deviances of the models ‘m1’ and ‘m2’. To this end, we use the R command ‘anova’, which we already employed earlier in the book for model fit comparison purposes (output provided beneath command):
> anova(m2, m1, test = "Chisq")
Model 1: college ~ gender + aptitude + iq
Model 2: college ~ state + gender + aptitude + iq
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       128    110.651
2       124    109.973  4    0.677    0.954
Since the p-value associated with the deviance difference in question is non-
significant, we conclude that the two models involved fit the data equally well.
That is, dropping the factor ‘state’ does not lead to a significant loss in predict-
ive power of the logistic regression model ‘m2’ relative to the first fitted model
‘m1’. An alternative interpretation suggests that the factor ‘state’ has no
unique predictive power for college admission, once accounting for individual
differences in gender, IQ, and aptitude test score.
We also notice that the output provided for each of the two fitted models,
‘m1’ and ‘m2’, contains one more fit-related quantity, AIC (short for
‘‘Akaike’s information criterion’’). This index evaluates model fit and
includes a penalty for lack of parsimony. Specifically, when two or more
models are compared in terms of fit to the same set of observed variables
(data set), the one with lower AIC can be considered as better fitting and
preferable. From the two models in question here, ‘m1’ and ‘m2’, the latter
has a smaller AIC—viz., 118.65 vs. 125.97. Hence, model ‘m2’ is preferable
to ‘m1’ as a means of data description and explanation also, as far as this
model fit index is concerned. (The formal definition and in particular ratio-
nale for this index are beyond the confines of this chapter and can be found,
e.g., in Dobson, 2002.)
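As a side remark, the AIC values reported in these outputs can also be requested directly with R's ‘AIC’ function, e.g.:
> AIC(m1, m2)   # returns, for each model, its number of estimated parameters (df) and its AIC value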
Returning now to the output associated with the preferred model ‘m2’, we
notice from its ‘coefficients’ section that the variable ‘gender’ is not signifi-
cant. We interpret this finding as suggesting that gender is unrelated to college
admission for students with the same IQ and aptitude test scores. To examine
this suggestion also from the above model comparison perspective, we fit next
model ‘m2’ without the ‘gender’ variable (output provided beneath commands):
> m3 = glm(college ~ aptitude + iq, family = binomial)
> summary(m3)
Call:
glm(formula = college ~ aptitude + iq, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8505 -0.7015 0.2721 0.6960 1.9486
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -25.16930 5.77697 -4.357 1.32e-05 ***
aptitude 0.16492 0.03451 4.779 1.76e-06 ***
iq 0.22067 0.05561 3.968 7.23e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We note from this R output that both predictors in model ‘m3’, aptitude and
IQ test score, are significant. That is, the aptitude test score does contribute
to better prediction of the probability of college admission for students with
the same IQ score; alternatively, the IQ test score contributes to a better pre-
diction of this probability for students with the same aptitude test score.
To compare the fit of the last two fitted models, ‘m2’ and ‘m3’, we use the
‘anova’ command again (with them as its first two arguments; output provided beneath command):
> anova(m3, m2, test = "Chisq")
Model 1: college ~ aptitude + iq
Model 2: college ~ gender + aptitude + iq
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       129    111.002
2       128    110.651  1    0.352    0.553
The difference in fit in these two models is thus nonsignificant, and we con-
clude that gender does not have unique predictive power with regard to col-
lege admission once controlling for aptitude and IQ score.
We also note that the last fitted model, ‘m3’, has the smallest AIC—viz.,
117—of all three models discussed in this section. Hence, we can treat it as a
preferable means of data description and explanation, when compared to the
previous two models fitted. In addition, since its deviance is even smaller than
the associated degrees of freedom, there seems to be no indication in the
output presented that would suggest that this most parsimonious model—
where aptitude and IQ test scores are the only predictors—is associated with
a deficient fit to the analyzed data. We thus consider model ‘m3’ as a plausible
means of data description and evaluation.
If interested in predicting the probability of college admission, assuming
knowledge of aptitude and IQ test score, we can use this tenable model ‘m3’.
Accordingly, it furnishes this predicted probability as (see Equation (15.12)
and last presented output):
(15.13) p = e^(-25.17 + .16 aptitude + .22 IQ) / [1 + e^(-25.17 + .16 aptitude + .22 IQ)].
From this prediction equation, we can see that increase in aptitude score leads
to enhanced college admission probability for students with the same IQ. Sim-
ilarly, an increase in the IQ test score is associated with an enhanced admis-
sion probability for students with the same aptitude test score.
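For a single student with known aptitude and IQ scores, the predicted admission probability can be obtained either directly from Equation (15.13) or via R's ‘predict’ function; in the sketch below the scores used (aptitude of 60, IQ of 75) are hypothetical values chosen for illustration only:
> eta = -25.17 + .16*60 + .22*75      # exponent in Equation (15.13), using the rounded estimates
> exp(eta) / (1 + exp(eta))           # predicted probability from the rounded estimates
> # essentially the same value, using the unrounded estimates stored in model 'm3':
> predict(m3, newdata = data.frame(aptitude = 60, iq = 75), type = "response")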
The last interpretation can be viewed as conditional—we fix one of the
predictors in order to interpret the effect of the other upon the probability of
college admission. In some cases, one may as well be interested in uncondi-
tional interpretations of the relationships between a response variable and any
of a set of predictors of interest, pooling over the values of the other predict-
ors. For the preferred model ‘m3’, we can graphically represent this uncondi-
tional relationship by plotting either of the two predictors involved—aptitude
and IQ test score—against the predicted probability for college admission.
This is achieved with the R command ‘plot’, where the predictor and the
predicted probabilities (according to the model) are plotted against one
another:
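A minimal sketch of such plotting commands, using the fitted (predicted) probabilities from model ‘m3’, is given below; the axis label is an added assumption:
> plot(aptitude, fitted(m3), ylab = "predicted probability of admission")
> plot(iq, fitted(m3), ylab = "predicted probability of admission")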
The resulting graphs are displayed in Figure 15.1 and clearly show the pattern
of increase in the predicted probability of college admission with increasing
aptitude and increasing IQ test score.
FIGURE 15.1.
Plot of predicted probabilities against aptitude and intelligence test scores.
To conclude, in this chapter we discussed the comprehensive framework of the generalized linear model (GLIM), which has proved to be highly useful for modeling discrete response variables, such as binary outcomes, in the behavioral and social sciences.
This book provided an introduction to basic statistics using the popular soft-
ware R for its applications. Given the intended introductory level of the book
and space limitations, further statistical methods and implementations of R
for their application could not be covered, and those that we did cover could not be discussed in greater detail. More advanced statistical methods as well as uses
of R are the subject of a number of other excellent treatments available (e.g.,
Crawley, 2007; Dalgaard, 2002; Everitt & Hothorn, 2010; Rao, 1973; Rice,
2006; Rizzo, 2009). Our aim with this book was to provide a stepping-stone
for undergraduate and graduate students as well as researchers within the
behavioral, biological, educational, medical, management, and social sciences
in their pursuit of those more advanced methods and R applications.
Throughout the book, we made the simplifying assumption that we were
not dealing with missing data or with data that were hierarchical or nested in
nature (e.g., students nested in classrooms, nested within schools, nested
within school districts). Although missing data pervade behavioral, biological,
educational, medical, management, and social science research, they also sub-
stantially complicate the discussion of statistical methods and applications.
How to handle missing data is the subject of more advanced treatments that
lie outside of the goals of this introductory text (e.g., Little & Rubin, 2002; for
a less technical discussion, see, e.g., Raykov & Marcoulides, 2008, ch. 12).
Similarly, hierarchical or nested data need a considerably more sophisticated
approach than that covered in the book (e.g., Heck & Thomas, 2009; Heck,
Thomas, & Tabata, 2010; Hox, 2002).
We conclude with the hope that this introductory book on the basics of
statistics and its applications with the highly comprehensive, state-of-the-art,
and freely available software R will be helpful to students and researchers in
various disciplines in their subsequent journey into more advanced topics of
the fascinating science of statistics and its applications.
rejection region. See inferential statistics
relative frequency distribution, 43–45
repeated-measures designs, 137
research: process of 1–6; purpose of 2–4; role of, 1–5; role of statistics in, 4–6
research hypothesis, 112–16
residuals, 216–17
residual deviance. See logistic regression
response variable. See dependent variable
restricted model, 251–52
robustness, 24–25
R project, 6
RSD. See random sampling distribution
R-squared index, 218
R2. See R-squared index
sample, 2–4
sample size, 109–11
sample statistics and population parameters, 2–6
sample variance. See variance
sampling distribution, 83–85
sampling error. See inferential statistics
sampling theory, 5
sampling with replacement, 46–47
scatterplot. See correlation
scientific method, 2–3
SEE. See standard error of estimate
sensitivity. See Bayes’ theorem
shrinkage, 248–50
significance level, 117–18, 121–22
simple effects, 293
simple hypothesis, 127
simple linear regression (SLR), 209–37
single parameter distribution. See Poisson distribution
skewed, 39–40, 66–67
skewed distribution. See skewed
slope, 210–13
SLR. See simple linear regression
source table. See analysis of variance
specificity. See Bayes’ theorem
spread. See variability
squared correlation coefficient. See correlation coefficient
standard deviation, 30–33
standard error of estimate (SEE), 216–17
standard error of the mean, 87–88
standard normal distribution. See normal distribution
standard score. See z-score
statistic, 28
statistical hypothesis testing. See inferential statistics
statistical inference. See inferential statistics
statistical power, 120
statistics: descriptive, 7–23, 24–42; inferential, 99–136
stem-and-leaf plot, 20–21
stepwise regression, 262–63
stepwise selection, 262–63
straight line equation. See simple linear regression
strength of association. See correlation coefficient
Student’s t-distribution, 138
subjective definition of probability, 45
summation notation. See mean
summation operation and sign. See mean
sums of squares. See analysis of variance
symmetric, 38–42
symmetry. See symmetric
Tabu search algorithm, 263
t-distribution, 137–38
TE. See tolerable error
test statistics (TS), 112–13
test theory, 120
tolerable error, 109–11
t-ratio, 137–38
tri-modal, 24
trinomial distribution, 179
TS. See test statistics
t-tests: dependent samples, 137; difference between means (see inferential statistics)
Tukey’s HSD test, 285–86
two-tailed p-value. See p-value
two-tailed tests, 129–36
two-way analysis of variance. See analysis of variance
Type I error, 118–21
Type II error, 118–20
unconditional probability, 51
unequal variances, 156
unimodality. See mode