Introductory Statistics for the
Life and Biomedical Sciences
First Edition
Julie Vu
Preceptor in Statistics
Harvard University
David Harrington
Professor of Biostatistics (Emeritus)
Harvard T.H. Chan School of Public Health
Dana-Farber Cancer Institute
This textbook and its supplements, including slides and labs, may be downloaded for free at
openintro.org/book/biostat.
This textbook is a derivative of OpenIntro Statistics 3rd Edition by Diez, Barr, and Çetinkaya-Rundel, and it is available under a Creative Commons Attribution-ShareAlike 3.0 Unported license. License details are available at the Creative Commons website: creativecommons.org.
Table of Contents
1 Introduction to data 10
1.1 Case study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Data basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3 Data collection principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Numerical data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.5 Categorical data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.6 Relationships between two variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.7 Exploratory data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
1.8 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
1.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2 Probability 88
2.1 Defining probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.2 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
2.3 Extended example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
2.4 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
2.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Index 469
Foreword
The past year has been challenging for the health sciences in ways that we could not have imagined
when we started writing 5 years ago. The rapid spread of the SARS coronavirus (SARS-CoV-2)
worldwide has upended the scientific research process and highlighted the need for maintaining a
balance between speed and reliability. Major medical journals have dramatically increased the pace
of publication; the urgency of the situation necessitates that data and research findings be made
available as quickly as possible to inform public policy and clinical practice. Yet it remains essential
that studies undergo rigorous review; the retraction of two high-profile coronavirus studies 1, 2
sparked widespread concerns about data integrity, reproducibility, and the editorial process.
In parallel, deepening public awareness of structural racism has caused a re-examination of
the role of race in published studies in health and medicine. A recent review of algorithms used to
direct treatment in areas such as cardiology, obstetrics and oncology uncovered examples of race
used in ways that may lead to substandard care for people of color. 3 The SARS-CoV-2 pandemic
has reminded us once again that marginalized populations are disproportionately at risk for bad
health outcomes. Data on 17 million patients in England 4 suggest that Blacks and South Asians
have a death rate that is approximately 50% higher than white members of the population.
Understanding the SARS coronavirus and tackling racial disparities in health outcomes are
but two of the many areas in which Biostatistics will play an important role in the coming decades.
Much of that work will be done by those now beginning their study of Biostatistics. We hope this
book provides an accessible point of entry for students planning to begin work in biology, medicine,
or public health. While the material presented in this book is essential for understanding the
foundations of the discipline, we advise readers to remember that a mastery of technical details is
secondary to choosing important scientific questions, examining data without bias, and reporting
results that transparently display the strengths and weaknesses of a study.
1 Mandeep R. Mehra et al. "Retraction: Cardiovascular Disease, Drug Therapy, and Mortality in Covid-19. N Engl J Med. DOI: 10.1056/NEJMoa2007621." In: New England Journal of Medicine 382.26 (2020), pp. 2582–2582. doi: 10.1056/NEJMc2021225.
2 Mandeep R. Mehra et al. "RETRACTED: Hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19: a multinational registry analysis". In: The Lancet (2020). doi: 10.1016/S0140-6736(20)31180-6.
3 Darshali A. Vyas et al. “Hidden in Plain Sight — Reconsidering the Use of Race Correction in Clinical Algorithms”.
In: New England Journal of Medicine (2020). doi: 10.1056/NEJMms2004740.
4 Elizabeth J. Williamson et al. “OpenSAFELY: factors associated with COVID-19 death in 17 million patients”. In:
Nature (2020). issn: 1476-4687.
Preface
This text introduces statistics and its applications in the life sciences and biomedical research. It is
based on the freely available OpenIntro Statistics, and, like OpenIntro, it may be downloaded at no
cost. 5 In writing Introductory Statistics for the Life and Biomedical Sciences, we have added substantial new material, but also retained some examples and exercises from OpenIntro that illustrate
important ideas even if they do not relate directly to medicine or the life sciences. Because of its
link to the original OpenIntro project, this text is often referred to as OpenIntro Biostatistics in the
supplementary materials.
This text is intended for undergraduate and graduate students interested in careers in biology or medicine, and may also be profitably read by students of public health. It covers many of the traditional introductory topics in statistics, in addition to discussing some newer methods being used in molecular biology.
Statistics has become an integral part of research in medicine and biology, and the tools for
summarizing data and drawing inferences from data are essential both for understanding the out-
comes of studies and for incorporating measures of uncertainty into that understanding. An intro-
ductory text in statistics for students who will work in medicine, public health, or the life sciences
should be more than simply the usual introduction, supplemented with an occasional example
from biology or medical science. By drawing the majority of examples and exercises in this text
from published data, we hope to convey the value of statistics in medical and biological research. In
cases where examples draw on important material in biology or medicine, the problem statement
contains the necessary background information.
Computing is an essential part of the practice of statistics. Nearly everyone entering the
biomedical sciences will need to interpret the results of analyses conducted in software; many
will also need to be capable of conducting such analyses. The text and associated materials sepa-
rate those two activities to allow students and instructors to emphasize either or both skills. The
text discusses the important features of figures and tables used to support an interpretation, rather
than the process of generating such material from data. This allows students whose main focus
is understanding statistical concepts not to be distracted by the details of a particular software
package. In our experience, however, we have found that many students enter a research setting
after only a single course in statistics. These students benefit from a practical introduction to data
analysis that incorporates the use of a statistical computing language. The self-paced learning labs
associated with the text provide such an introduction; these are described in more detail later in
this preface. The datasets used in this book are available via the R openintro package available on
CRAN 6 and the R oibiostat package available via GitHub.
5 PDF available at https://www.openintro.org/book/biostat/ and source available at https://github.com/OI-Biostat/oi_biostat_text.
6 Diez DM, Barr CD, Çetinkaya-Rundel M. 2012. openintro: OpenIntro data sets and supplement functions. http://cran.r-project.org/web/packages/openintro.
Textbook overview
The chapters of this book are as follows:
1. Introduction to data. Data structures, basic data collection principles, numerical and graphical
summaries, and exploratory data analysis.
2. Probability. The basic principles of probability.
3. Distributions of random variables. Introduction to random variables, distributions of discrete
and continuous random variables, and distributions for pairs of random variables.
4. Foundations for inference. General ideas for statistical inference in the context of estimating a
population mean.
5. Inference for numerical data. Inference for one-sample and two-sample means with the t-distribution,
power calculations for a difference of means, and ANOVA.
6. Simple linear regression. An introduction to linear regression with a single explanatory vari-
able, evaluating model assumptions, and inference in a regression context.
7. Multiple linear regression. General multiple regression model, categorical predictors with more
than two values, interaction, and model selection.
8. Inference for categorical data. Inference for single proportions, inference for two or more groups,
and outcome-based sampling.
EXAMPLE 0.1
This is an example. When a question is asked here, where can the answer be found?
The answer can be found here, in the solution section of the example.
When we think the reader would benefit from working out the solution to an example, we frame it
as Guided Practice.
There are exercises at the end of each chapter that are useful for practice or homework as-
signments. Solutions to odd numbered problems can be found in Appendix A. Readers will notice
that there are fewer end of chapter exercises in the last three chapters. The more complicated
methods, such as multiple regression, do not always lend themselves to hand calculation, and
computing is increasingly important both to gain practical experience with these methods and to
explore complex datasets. For students more interested in concepts than computing, however, we
have included useful end of chapter exercises that emphasize the interpretation of output from
statistical software.
Probability tables for the normal, t, and chi-square distributions are in Appendix B, and PDF
copies of these tables are also available from openintro.org for anyone to download, print, share, or
modify. The labs and the text also illustrate the use of simple R commands to calculate probabilities
from common distributions.
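For readers following along outside of R, the same kind of probability lookup can be done in a few lines. The sketch below uses only the Python standard library and is an illustrative aside, not part of the book's labs; in R, the equivalent call would be pnorm.

```python
from math import erf, sqrt

def normal_cdf(x, mean=0.0, sd=1.0):
    """P(X <= x) for a normal distribution, via the error function."""
    z = (x - mean) / sd
    return 0.5 * (1 + erf(z / sqrt(2)))

# P(Z <= 1.96) for a standard normal variable, analogous to R's pnorm(1.96)
print(round(normal_cdf(1.96), 3))  # 0.975
```

Standardizing with the mean and standard deviation first means one function covers any normal distribution, just as the printed tables cover only the standard normal.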
7 Guided Practice problems are intended to stretch your thinking, and you can check yourself by reviewing the footnote
solution for any Guided Practice.
Acknowledgements
The OpenIntro project would not have been possible without the dedication of many people, in-
cluding the authors of OpenIntro Statistics, the OpenIntro team and the many faculty, students,
and readers who commented on all the editions of OpenIntro Statistics.
This text has benefited from feedback from Andrea Foulkes, Raji Balasubramanian, Curry
Hilton, Michael Parzen, Kevin Rader, and the many excellent teaching fellows at Harvard College
who assisted in courses using the book. The cover design was provided by Pierre Baduel.
Chapter 1
Introduction to data
Making observations and recording data form the backbone of empirical research,
and represent the beginning of a systematic approach to investigating scientific
questions. As a discipline, statistics focuses on addressing the following three
questions in a rigorous and efficient manner: How can data best be collected? How
should data be analyzed? What can be inferred from data?
This chapter provides a brief discussion on the principles of data collection, and
introduces basic methods for summarizing and exploring data.
The proportion of young children in Western countries with peanut allergies has doubled in
the last 10 years. Previous research suggests that exposing infants to peanut-based foods, rather
than excluding such foods from their diets, may be an effective strategy for preventing the develop-
ment of peanut allergies. The "Learning Early about Peanut Allergy" (LEAP) study was conducted
to investigate whether early exposure to peanut products reduces the probability that a child will
develop peanut allergies. 1
The study team enrolled children in the United Kingdom between 2006 and 2009, selecting
640 infants with eczema, egg allergy, or both. Each child was randomly assigned to either the
peanut consumption (treatment) group or the peanut avoidance (control) group. Children in the
treatment group were fed at least 6 grams of peanut protein daily until 5 years of age, while chil-
dren in the control group avoided consuming peanut protein until 5 years of age.
At 5 years of age, each child was tested for peanut allergy using an oral food challenge (OFC): 5
grams of peanut protein in a single dose. A child was recorded as passing the oral food challenge if
no allergic reaction was detected, and failing the oral food challenge if an allergic reaction occurred.
These children had previously been tested for peanut allergy through a skin test, conducted at the
time of study entry; the main analysis presented in the paper was based on data from 530 children
with an earlier negative skin test. 2
Individual-level data from the study are shown in Figure 1.1 for 5 of the 530 children—each
row represents a participant and shows the participant’s study ID number, treatment group assign-
ment, and OFC outcome. 3
The data can be organized in the form of a two-way summary table; Figure 1.2 shows the
results categorized by treatment group and OFC outcome.
1 Du Toit, George, et al. Randomized trial of peanut consumption in infants at risk for peanut allergy. New England
Journal of Medicine 372.9 (2015): 803-813.
2 Although a total of 542 children had an earlier negative skin test, data collection did not occur for 12 children.
3 The data are available as LEAP in the R package oibiostat.
The summary table makes it easier to identify patterns in the data. Recall that the question
of interest is whether children in the peanut consumption group are more or less likely to develop
peanut allergies than those in the peanut avoidance group. In the avoidance group, the proportion
of children failing the OFC is 36/263 = 0.137 (13.7%); in the consumption group, the proportion
of children failing the OFC is 5/267 = 0.019 (1.9%). Figure 1.3 shows a graphical method of dis-
playing the study results, using either the number of individuals per category from Figure 1.2 or
the proportion of individuals with a specific OFC outcome in a group.
[Figure 1.3 graphic: side-by-side bar plots for the Peanut Avoidance and Peanut Consumption groups; panel (a) has a count axis from 0 to 250 and panel (b) a proportion axis from 0.0 to 1.0, with bars split into FAIL OFC and PASS OFC.]
Figure 1.3: (a) A bar plot displaying the number of individuals who failed or
passed the OFC in each treatment group. (b) A bar plot displaying the proportions
of individuals in each group that failed or passed the OFC.
The proportion of participants failing the OFC is 11.8% higher in the peanut avoidance group than in the peanut consumption group. Another way to summarize the data is to compute the ratio of the two proportions ((36/263)/(5/267) = 7.31), and conclude that the proportion of participants failing the OFC in the avoidance group is more than 7 times as large as in the consumption group; i.e.,
the risk of failing the OFC was more than 7 times as great for participants in the avoidance group
relative to the consumption group.
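The arithmetic in this case study is simple enough to reproduce directly. The sketch below (in Python, though the book's labs use R) recomputes the two proportions, their difference, and their ratio from the counts in Figure 1.2.

```python
# Counts from Figure 1.2: OFC failures and group sizes in the LEAP study
fail_avoid, n_avoid = 36, 263
fail_consume, n_consume = 5, 267

p_avoid = fail_avoid / n_avoid        # proportion failing OFC, avoidance group
p_consume = fail_consume / n_consume  # proportion failing OFC, consumption group

diff = p_avoid - p_consume            # difference in proportions
ratio = p_avoid / p_consume           # relative risk of failing the OFC

print(round(p_avoid, 3), round(p_consume, 3))  # 0.137 0.019
print(round(diff, 3))                          # 0.118
print(round(ratio, 2))                         # 7.31
```

Note that the ratio is computed from the unrounded proportions; dividing the rounded values 0.137/0.019 would give a slightly different answer.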
Based on the results of the study, it seems that early exposure to peanut products may be
an effective strategy for reducing the chances of developing peanut allergies later in life. It is
important to note that this study was conducted in the United Kingdom at a single site of pediatric
care; it is not clear that these results can be generalized to other countries or cultures.
The results also raise an important statistical issue: does the study provide definitive evidence
that peanut consumption is beneficial? In other words, is the 11.8% difference between the two
groups larger than one would expect by chance variation alone? The material on inference in later
chapters will provide the statistical tools to evaluate this question.
Effective organization and description of data is a first step in most analyses. This section
introduces a structure for organizing data and basic terminology used to describe data.
In evolutionary biology, parental investment refers to the amount of time, energy, or other
resources devoted towards raising offspring. This section introduces the frog dataset, which orig-
inates from a 2013 study about maternal investment in a frog species. 4 Reproduction is a costly
process for female frogs, necessitating a trade-off between individual egg size and total number of
eggs produced. Researchers were interested in investigating how maternal investment varies with
altitude and collected measurements on egg clutches found at breeding ponds across 11 study sites;
for 5 sites, the body size of individual female frogs was also recorded.
Figure 1.4 displays rows 1, 2, 3, and 150 of the data from the 431 clutches observed as part
of the study. 5 Each row in the table corresponds to a single clutch, indicating where the clutch
was collected (altitude and latitude), egg.size, clutch.size, clutch.volume, and body.size of
the mother when available. "NA" corresponds to a missing value, indicating that information on
an individual female was not collected for that particular clutch. The recorded characteristics are
referred to as variables; in this table, each column represents a variable.
variable description
altitude Altitude of the study site in meters above sea level
latitude Latitude of the study site measured in degrees
egg.size Average diameter of an individual egg, measured to the nearest 0.01 mm
clutch.size Estimated number of eggs in clutch
clutch.volume Volume of egg clutch in mm³
body.size Length of mother frog in cm
Figure 1.5: Variables and their descriptions for the frog dataset.
It is important to check the definitions of variables, as they are not always obvious. For example, why are the values of clutch.size not whole numbers? For a given clutch, researchers counted approximately 5 grams' worth of eggs and then estimated the total number of eggs from the mass of the entire clutch. Definitions of the variables are given in Figure 1.5. 6
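The mass-based estimation can be sketched as a simple proportion; the counts and masses below are hypothetical, chosen only to illustrate why the estimates are not whole numbers.

```python
def estimate_clutch_size(eggs_in_subsample, subsample_mass_g, clutch_mass_g):
    """Scale an egg count from a weighed subsample up to the whole clutch."""
    return eggs_in_subsample * (clutch_mass_g / subsample_mass_g)

# Hypothetical clutch: 120 eggs counted in a 5 g subsample, 18 g total mass
print(estimate_clutch_size(120, 5, 18))  # 432.0
```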
4 Chen, W., et al. Maternal investment increases with altitude in a frog on the Tibetan Plateau. Journal of evolutionary
biology 26.12 (2013): 2710-2715.
5 The frog dataset is available in the R package oibiostat.
6 The data discussed here are in the original scale; in the published paper, some values have undergone a natural log
transformation.
The data in Figure 1.4 are organized as a data matrix. Each row of a data matrix corresponds
to an observational unit, and each column corresponds to a variable. A piece of the data matrix for
the LEAP study introduced in Section 1.1 is shown in Figure 1.1; the rows are study participants
and three variables are shown for each participant. Data matrices are a convenient way to record
and store data. If the data are collected for another individual, another row can easily be added;
similarly, another column can be added for a new variable.
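A minimal sketch of this row-and-column bookkeeping, with made-up values rather than rows from the actual frog dataset:

```python
# A data matrix: each row is one observational unit, each column a variable.
# Values here are illustrative, not taken from the frog dataset.
columns = ["altitude", "latitude", "clutch.volume"]
data = [
    [3462.0, 34.82, 177.8],
    [3462.0, 34.82, 257.0],
]

# Collecting data on another individual adds a row...
data.append([2597.0, 34.05, 151.1])

# ...and recording a new variable adds a column to every row
# (None plays the role of "NA" for a missing value).
columns.append("body.size")
for row, value in zip(data, [3.8, None, 4.1]):
    row.append(value)

print(len(data), len(columns))  # 3 4
```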
The Functional polymorphisms Associated with human Muscle Size and Strength study (FA-
MuSS) measured a variety of demographic, phenotypic, and genetic characteristics for about 1,300
participants. 7 Data from the study have been used in a number of subsequent studies,8 such as
one examining the relationship between muscle strength and genotype at a location on the ACTN3
gene. 9
The famuss dataset is a subset of the data for 595 participants. 10 Four rows of the famuss
dataset are shown in Figure 1.6, and the variables are described in Figure 1.7.
variable description
sex Sex of the participant
age Age in years
race Race, recorded as African Am (African American), Caucasian, Asian,
Hispanic or Other
height Height in inches
weight Weight in pounds
actn3.r577x Genotype at the location r577x in the ACTN3 gene.
ndrm.ch Percent change in strength in the non-dominant arm, comparing strength
after to before training
Figure 1.7: Variables and their descriptions for the famuss dataset.
The variables age, height, weight, and ndrm.ch are numerical variables. They take on numer-
ical values, and it is reasonable to add, subtract, or take averages with these values. In contrast,
a variable reporting telephone numbers would not be classified as numerical, since sums, differ-
ences, and averages in this context have no meaning. Age measured in years is said to be discrete,
since it can only take on numerical values with jumps; i.e., positive integer values. Percent change
in strength in the non-dominant arm (ndrm.ch) is continuous, and can take on any value within a
specified range.
7 Thompson PD, Moyna M, Seip, R, et al., 2004. Functional Polymorphisms Associated with Human Muscle Size and
Strength. Medicine and Science in Sports and Exercise 36:1132 - 1139.
8 Pescatello L, et al. Highlights from the functional single nucleotide polymorphisms associated with human muscle
size and strength or FAMuSS study, BioMed Research International 2013.
9 Clarkson P, et al., Journal of Applied Physiology 99: 154-163, 2005.
10 The subset is from Foulkes, Andrea S. Applied statistical genetics with R: for population-based association studies.
Springer Science & Business Media, 2009. The full version of the data is available at http://people.umass.edu/foulkes/
asg/data.html.
The variables sex, race, and actn3.r577x are categorical variables, which take on values that
are names or labels. The possible values of a categorical variable are called the variable’s levels. 11
For example, the levels of actn3.r577x are the three possible genotypes at this particular locus:
CC, CT, or TT. Categorical variables without a natural ordering are called nominal categorical
variables; sex, race, and actn3.r577x are all nominal categorical variables. Categorical variables
with levels that have a natural ordering are referred to as ordinal categorical variables. For exam-
ple, age of the participants grouped into 5-year intervals (15-20, 21-25, 26-30, etc.) is an ordinal
categorical variable.
EXAMPLE 1.1
Classify the variables in the frog dataset: altitude, latitude, egg.size, clutch.size,
clutch.volume, and body.size.
The variables egg.size, clutch.size, clutch.volume, and body.size are continuous numerical
variables, and can take on all positive values.
In the context of this study, the variables altitude and latitude are best described as categorical
variables, since the numerical values of the variables correspond to the 11 specific study sites where
data were collected. Researchers were interested in exploring the relationship between altitude and
maternal investment; it would be reasonable to consider altitude an ordinal categorical variable.
Many studies are motivated by a researcher examining how two or more variables are related.
For example, do the values of one variable increase as the values of another decrease? Do the values
of one variable tend to differ by the levels of another variable?
One study used the famuss data to investigate whether ACTN3 genotype at a particular lo-
cation (residue 577) is associated with change in muscle strength. The ACTN3 gene codes for a
protein involved in muscle function. A common mutation in the gene at a specific location changes
the cytosine (C) nucleotide to a thymine (T) nucleotide; individuals with the TT genotype are un-
able to produce any ACTN3 protein.
Researchers hypothesized that genotype at this location might influence muscle function. As
a measure of muscle function, they recorded the percent change in non-dominant arm strength
after strength training; this variable, ndrm.ch, is the response variable in the study. A response
variable is defined by the particular research question a study seeks to address, and measures the
outcome of interest in the study. A study will typically examine whether the values of a response
variable differ as values of an explanatory variable change, and if so, how the two variables are
related. A given study may examine several explanatory variables for a single response variable. 14
The explanatory variable examined in relation to ndrm.ch in the study is actn3.r577x, ACTN3
genotype at location 577.
EXAMPLE 1.4
In the maternal investment study conducted on frogs, researchers collected measurements on egg
clutches and female frogs at 11 study sites, located at differing altitudes, in order to investigate
how maternal investment varies with altitude. Identify the response and explanatory variables in
the study.
The variables egg.size, clutch.size, and clutch.volume are response variables indicative of ma-
ternal investment.
The explanatory variable examined in the study is altitude.
While latitude is an environmental factor that might potentially influence features of the egg
clutches, it is not a variable of interest in this particular study.
Female body size (body.size) is neither an explanatory nor response variable.
14 Response variables are sometimes called dependent variables and explanatory variables are often called independent
variables or predictors.
15 Two sample questions: (1) Does change in participant arm strength after training seem associated with race? The
response variable is ndrm.ch and the explanatory variable is race. (2) Do male participants appear to respond differently to
strength training than females? The response variable is ndrm.ch and the explanatory variable is sex.
The first step in research is to identify questions to investigate. A clearly articulated research
question is essential for selecting subjects to be studied, identifying relevant variables, and deter-
mining how data should be collected.
1. Do bluefin tuna from the Atlantic Ocean have particularly high levels of mercury, such that
they are unsafe for human consumption?
2. For infants predisposed to developing a peanut allergy, is there evidence that introducing
peanut products early in life is an effective strategy for reducing the risk of developing a
peanut allergy?
3. Does a recently developed drug designed to treat glioblastoma, a form of brain cancer, appear
more effective at inducing tumor shrinkage than the drug currently on the market?
Each of these questions refers to a specific target population. For example, in the first ques-
tion, the target population consists of all bluefin tuna from the Atlantic Ocean; each individual
bluefin tuna represents a case. It is almost always either too expensive or logistically impossible to
collect data for every case in a population. As a result, nearly all research is based on information
obtained about a sample from the population. A sample represents a small fraction of the popu-
lation. Researchers interested in evaluating the mercury content of bluefin tuna from the Atlantic
Ocean could collect a sample of 500 bluefin tuna (or some other quantity), measure the mercury
content, and use the observed information to formulate an answer to the research question.
16 In Question 2, the target population consists of infants predisposed to developing a peanut allergy. In Question 3, the
target population consists of patients with glioblastoma.
Anecdotal evidence typically refers to unusual observations that are easily recalled because of
their striking characteristics. Physicians may be more likely to remember the characteristics of a
single patient with an unusually good response to a drug instead of the many patients who did not
respond. The dangers of drawing general conclusions from anecdotal information are obvious; no
single observation should be used to draw conclusions about a population.
While it is incorrect to generalize from individual observations, unusual observations can
sometimes be valuable. E.C. Heyde was a general practitioner from Vancouver who noticed that a
few of his elderly patients with aortic-valve stenosis (an abnormal narrowing) caused by an accu-
mulation of calcium had also suffered massive gastrointestinal bleeding. In 1958, he published his
observation. 17 Further research led to the identification of the underlying cause of the association,
now called Heyde’s Syndrome. 18
An anecdotal observation can never be the basis for a conclusion, but may well inspire the
design of a more systematic study that could be definitive.
Sampling from a population, when done correctly, provides reliable information about the
characteristics of a large population. The US Centers for Disease Control and Prevention (CDC) conducts several surveys to obtain information about the US population, including the Behavioral Risk Factor Surveillance System (BRFSS). 19 The BRFSS was established in 1984 to collect data about health-related risk behaviors, and now collects data from more than 400,000 telephone interviews conducted each year. Data from a recent BRFSS survey are used in Chapter 4. The CDC conducts
similar surveys for diabetes, health care access, and immunization. Likewise, the World Health Or-
ganization (WHO) conducts the World Health Survey in partnership with approximately 70 coun-
tries to learn about the health of adult populations and the health systems in those countries. 20
The general principle of sampling is straightforward: a sample from a population is useful for
learning about a population only when the sample is representative of the population. In other
words, the characteristics of the sample should correspond to the characteristics of the population.
Suppose that the quality improvement team at an integrated health care system, such as Har-
vard Pilgrim Health Care, is interested in learning about how members of the health plan perceive
the quality of the services offered under the plan. A common pitfall in conducting a survey is to
use a convenience sample, in which individuals who are easily accessible are more likely to be
included in the sample than other individuals. If a sample were collected by approaching plan
members visiting an outpatient clinic during a particular week, the sample would fail to enroll
generally healthy members who typically do not use outpatient services or schedule routine phys-
ical examinations; this method would produce an unrepresentative sample (Figure 1.9).
Figure 1.9: Instead of sampling from all members equally, approaching members
visiting a clinic during a particular week disproportionately selects members who
frequently use outpatient services.
Random sampling is the best way to ensure that a sample reflects a population. In a simple
random sample, each member of a population has the same chance of being sampled. One way to
achieve a simple random sample of the health plan members is to randomly select a certain number
of names from the complete membership roster, and contact those individuals for an interview
(Figure 1.10).
19 https://www.cdc.gov/brfss/index.html
20 http://www.who.int/healthinfo/survey/en/
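The roster-based selection described above can be sketched in a few lines of Python. This is an illustrative sketch only: the roster names and sample size are made up, and a real survey would draw from the actual membership list.

```python
import random

# Hypothetical membership roster; the names are placeholders.
roster = [f"member_{i:04d}" for i in range(1, 10001)]  # 10,000 plan members

random.seed(1)
# random.sample draws without replacement, so every member has the
# same chance of selection: a simple random sample.
interviewees = random.sample(roster, k=100)

print(len(interviewees))       # 100 members selected
print(len(set(interviewees)))  # no member chosen twice
```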
1.3. DATA COLLECTION PRINCIPLES 21
Figure 1.10: Five members are randomly selected from the population to be in-
terviewed.
Even when a simple random sample is taken, it is not guaranteed that the sample is represen-
tative of the population. If the non-response rate for a survey is high, that may be indicative of
a biased sample. Perhaps a majority of participants did not respond to the survey because only a
certain group within the population is being reached; for example, if questions assume that par-
ticipants are fluent in English, then a high non-response rate would be expected if the population
largely consists of individuals who are not fluent in English (Figure 1.11). Such non-response
bias can skew results; generalizing from an unrepresentative sample is likely to lead to incorrect
conclusions about a population.
Figure 1.11: Surveys may only reach a certain group within the population, which
leads to non-response bias. For example, a survey written in English may only
result in responses from health plan members fluent in English.
22 CHAPTER 1. INTRODUCTION TO DATA
Almost all statistical methods are based on the notion of implied randomness. If data are not
sampled from a population at random, these statistical methods – calculating estimates and errors
associated with estimates – are not reliable. Four random sampling methods are discussed in this
section: simple, stratified, cluster, and multistage sampling.
In a simple random sample, each case in the population has an equal chance of being included
in the sample (Figure 1.12). Under simple random sampling, each case is sampled independently of
the other cases; i.e., knowing that a certain case is included in the sample provides no information
about which other cases have also been sampled.
In stratified sampling, the population is first divided into groups called strata before cases
are selected within each stratum (typically through simple random sampling) (Figure 1.12). The
strata are chosen such that similar cases are grouped together. Stratified sampling is especially
useful when the cases in each stratum are very similar with respect to the outcome of interest, but
cases between strata might be quite different.
Suppose that the health care provider has facilities in different cities. If the range of services
offered differs by city, but all locations in a given city offer similar services, it would be effective
for the quality improvement team to use stratified sampling to identify participants for their study,
where each city represents a stratum and plan members are randomly sampled from each city.
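The city-stratified scheme just described can be sketched as follows. The cities, member lists, and sample size per stratum are all hypothetical; the point is that a separate simple random sample is drawn within each stratum.

```python
import random

# Hypothetical plan membership grouped by city (the strata).
members_by_city = {
    "Boston": [f"bos_{i}" for i in range(500)],
    "Worcester": [f"wor_{i}" for i in range(300)],
    "Springfield": [f"spr_{i}" for i in range(200)],
}

random.seed(2)
# Stratified sampling: a simple random sample is drawn within
# each stratum (city) separately.
sample = {city: random.sample(members, k=20)
          for city, members in members_by_city.items()}

for city, chosen in sample.items():
    print(city, len(chosen))  # 20 members sampled from each city
```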
Figure 1.12: Examples of simple random and stratified sampling. In the top
panel, simple random sampling is used to randomly select 18 cases (circled or-
ange dots) out of the total population (all dots). The bottom panel illustrates
stratified sampling: cases are grouped into six strata, then simple random sam-
pling is employed within each stratum.
In a cluster sample, the population is first divided into many groups, called clusters. Then,
a fixed number of clusters is sampled and all observations from each of those clusters are included
in the sample (Figure 1.13). A multistage sample is similar to a cluster sample, but rather than
keeping all observations in each cluster, a random sample is collected within each selected cluster
(Figure 1.13).
Unlike with stratified sampling, cluster and multistage sampling are most helpful when there
is high case-to-case variability within a cluster, but the clusters themselves are similar to one an-
other. For example, if neighborhoods in a city represent clusters, cluster and multistage sampling
work best when the population within each neighborhood is very diverse, but neighborhoods are
relatively similar.
Applying stratified, cluster, or multistage sampling can often be more economical than only
drawing random samples. However, analysis of data collected using such methods is more com-
plicated than when using data from a simple random sample; this text will only discuss analysis
methods for simple random samples.
EXAMPLE 1.8
Suppose researchers are interested in estimating the malaria rate in a densely tropical portion of
rural Indonesia. There are 30 villages in the area, each more or less similar to the others. The goal
is to test 150 individuals for malaria. Evaluate which sampling method should be employed.
A simple random sample would likely draw individuals from all 30 villages, which could make
data collection extremely expensive. Stratified sampling is not advisable, since there is not enough
information to determine how strata of similar individuals could be built. However, cluster sam-
pling or multistage sampling are both reasonable options. For example, with multistage sampling,
half of the villages could be randomly selected, and then 10 people selected from each village. This
strategy is more efficient than a simple random sample, and can still provide a sample representa-
tive of the population of interest.
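The multistage strategy from Example 1.8 can be sketched directly: select villages at random, then sample individuals within each selected village. The village and resident identifiers below are invented for illustration.

```python
import random

random.seed(3)

# Hypothetical data: 30 villages, each with a list of residents.
villages = {v: [f"v{v}_resident{i}" for i in range(200)] for v in range(30)}

# Stage 1: randomly select half of the villages (the clusters).
chosen_villages = random.sample(sorted(villages), k=15)

# Stage 2: within each selected village, take a simple random sample
# of 10 residents, for 150 tested individuals in total.
tested = [person
          for v in chosen_villages
          for person in random.sample(villages[v], k=10)]

print(len(tested))  # 150 individuals, as in Example 1.8
```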
The two primary types of study designs used to collect data are experiments and observational
studies.
In an experiment, researchers directly influence how data arise, such as by assigning groups of
individuals to different treatments and assessing how the outcome varies across treatment groups.
The LEAP study is an example of an experiment with two groups, an experimental group that
received the intervention (peanut consumption) and a control group that received a standard ap-
proach (peanut avoidance). In studies assessing effectiveness of a new drug, individuals in the
control group typically receive a placebo, an inert substance with the appearance of the experi-
mental intervention. The study is designed such that on average, the only difference between the
individuals in the treatment groups is whether or not they consumed peanut protein. This allows
for observed differences in experimental outcome to be directly attributed to the intervention and
constitute evidence of a causal relationship between intervention and outcome.
In an observational study, researchers merely observe and record data, without interfering
with how the data arise. For example, to investigate why certain diseases develop, researchers
might collect data by conducting surveys, reviewing medical records, or following a cohort of
many similar individuals. Observational studies can provide evidence of an association between
variables, but cannot by themselves show a causal connection. However, there are many instances
where randomized experiments are unethical, such as to explore whether lead exposure in young
children is associated with cognitive impairment.
Figure 1.13: Examples of cluster and multistage sampling. The top panel illus-
trates cluster sampling: data are binned into nine clusters, three of which are sam-
pled, and all observations within these clusters are sampled. The bottom panel
illustrates multistage sampling, which differs from cluster sampling in that only
a subset from each of the three selected clusters are sampled.
1.3.6 Experiments
Control. When selecting participants for a study, researchers work to control for extraneous vari-
ables and choose a sample of participants that is representative of the population of interest.
For example, participation in a study might be restricted to individuals who have a condition
that suggests they may benefit from the intervention being tested. Infants enrolled in the
LEAP study were required to be between 4 and 11 months of age, with severe eczema and/or
allergies to eggs.
Randomization. Randomly assigning patients to treatment groups ensures that groups are bal-
anced with respect to both variables that can and cannot be controlled. For example, random-
ization in the LEAP study ensures that the proportion of males to females is approximately
the same in both groups. Additionally, perhaps some infants were more susceptible to peanut
allergy because of an undetected genetic condition; under randomization, it is reasonable to
assume that such infants were present in equal numbers in both groups. Randomization al-
lows differences in outcome between the groups to be reasonably attributed to the treatment
rather than inherent variability in patient characteristics, since the treatment represents the
only systematic difference between the two groups.
In situations where researchers suspect that variables other than the intervention may in-
fluence the response, individuals can be first grouped into blocks according to a certain at-
tribute and then randomized to treatment group within each block; this technique is referred
to as blocking or stratification. The team behind the LEAP study stratified infants into two
cohorts based on whether or not the child developed a red, swollen mark (a wheal) after
a skin test at the time of enrollment; afterwards, infants were randomized between peanut
consumption and avoidance groups. Figure 1.14 illustrates the blocking scheme used in the
study.
Replication. The results of a study conducted on a larger number of cases are generally more
reliable than those of smaller studies; observations made from a large sample are more likely to be
representative of the population of interest. In a single study, replication is accomplished by
collecting a sufficiently large sample. The LEAP study randomized a total of 640 infants.
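The blocking-then-randomizing scheme described above can be sketched in code. This is a simplified illustration of the general technique, not the LEAP study's actual randomization procedure; the cohort size matches the study (640 infants), but the skin-test results below are made up.

```python
import random
from collections import Counter

def blocked_randomization(participants, block_of,
                          arms=("avoidance", "consumption"), seed=0):
    """Randomize within blocks: group participants by block_of(p),
    shuffle each block, then split it as evenly as possible between arms."""
    rng = random.Random(seed)
    blocks = {}
    for p in participants:
        blocks.setdefault(block_of(p), []).append(p)
    assignment = {}
    for block in blocks.values():
        rng.shuffle(block)
        for i, p in enumerate(block):
            assignment[p] = arms[i % len(arms)]  # alternate arms down the shuffled list
    return assignment

# Hypothetical cohort of 640 infants with an invented skin-test result each.
skin_test = {i: "positive" if i % 6 == 0 else "negative" for i in range(640)}
groups = blocked_randomization(range(640), block_of=lambda i: skin_test[i])

# Each skin-test block is split almost evenly between the two arms.
print(Counter((skin_test[i], groups[i]) for i in range(640)))
```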
Randomized experiments are an essential tool in research. The US Food and Drug Adminis-
tration typically requires that a new drug can only be marketed after two independently conducted
randomized trials confirm its safety and efficacy; the European Medicines Agency has a similar pol-
icy. Large randomized experiments in medicine have provided the basis for major public health
initiatives. In 1954, approximately 750,000 children participated in a randomized study compar-
ing polio vaccine with a placebo. 22 In the United States, the results of the study quickly led to the
widespread and successful use of the vaccine for polio prevention.
22 Meier, Paul. "The biggest public health experiment ever: the 1954 field trial of the Salk poliomyelitis vaccine." Statistics:
a guide to the unknown. San Francisco: Holden-Day (1972): 2-13.
Figure 1.14: A simplified schematic of the blocking scheme used in the LEAP
study, depicting 640 patients that underwent randomization. Patients are first
divided into blocks based on response to the initial skin test, then each block
is randomized between the avoidance and consumption groups. This strategy
ensures an even representation of patients in each group who had positive and
negative skin tests.
In observational studies, researchers simply observe selected potential explanatory and re-
sponse variables. Participants who differ in important explanatory variables may also differ in other
ways that influence response; as a result, it is not advisable to make causal conclusions about the re-
lationship between explanatory and response variables based on observational data. For example,
while observational studies of obesity have shown that obese individuals tend to die sooner than
individuals with normal weight, it would be misleading to conclude that obesity causes shorter life
expectancy. Instead, underlying factors are probably involved; obese individuals typically exhibit
other health behaviors that influence life expectancy, such as reduced exercise or unhealthy diet.
Suppose that an observational study tracked sunscreen use and incidence of skin cancer, and
found that the more sunscreen a person uses, the more likely they are to have skin cancer. These
results do not mean that sunscreen causes skin cancer. One important piece of missing information
is sun exposure – if someone is often exposed to sun, they are both more likely to use sunscreen and
to contract skin cancer. Sun exposure is a confounding variable: a variable associated with both
the explanatory and response variables. 23 There is no guarantee that all confounding variables
can be examined or measured; as a result, it is not advisable to draw causal conclusions from
observational studies.
Observational studies may reveal interesting patterns or associations that can be further in-
vestigated with follow-up experiments. Several observational studies based on dietary data from
different countries showed a strong association between dietary fat and breast cancer in women.
These observations led to the launch of the Women’s Health Initiative (WHI), a large randomized
trial sponsored by the US National Institutes of Health (NIH). In the WHI, women were random-
ized to standard versus low fat diets, and the previously observed association was not confirmed.
Observational studies can be either prospective or retrospective. A prospective study identi-
fies participants and collects information at scheduled times or as events unfold. For example, in
the Nurses’ Health Study, researchers recruited registered nurses beginning in 1976 and collected
data through administering biennial surveys; data from the study have been used to investigate risk
factors for major chronic diseases in women. 27 Retrospective studies collect data after events have
taken place, such as from medical records. Some datasets may contain both retrospectively- and
prospectively-collected variables. The Cancer Care Outcomes Research and Surveillance Consor-
tium (CanCORS) enrolled participants with lung or colorectal cancer, collected information about
diagnosis, treatment, and previous health behavior, but also maintained contact with participants
to gather data about long-term outcomes. 28
27 www.channing.harvard.edu/nhs
28 Ayanian, John Z., et al. "Understanding cancer treatment and outcomes: the cancer care outcomes research and
surveillance consortium." Journal of Clinical Oncology 22.15 (2004): 2992-2996
This section discusses techniques for exploring and summarizing numerical variables, using
the frog data from the parental investment study introduced in Section 1.2.
The mean, sometimes called the average, is a measure of center for a distribution of data. To
find the average clutch volume for the observed egg clutches, add all the clutch volumes and divide
by the total number of clutches. 29
x̄ = (x1 + x2 + · · · + xn) / n,        (1.10)

where x1, x2, . . . , xn represent the n observed values.
The median is another measure of center; it is the middle number in a distribution after the
values have been ordered from smallest to largest. If the distribution contains an even number of
observations, the median is the average of the middle two observations. There are 431 clutches
in the dataset, so the median is the clutch volume of the 216th observation in the sorted values of
clutch.volume: 831.8 mm3 .
29 For computational convenience, the volumes are rounded to the first decimal.
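The mean and median calculations can be carried out with Python's standard library. The clutch volumes below are illustrative values only, not the actual frog data.

```python
from statistics import mean, median

# Illustrative clutch volumes (mm^3); made-up values, not the frog dataset.
volumes = [831.8, 520.4, 1096.0, 609.6, 1819.7, 905.3, 742.1]

# The mean is the sum of the values divided by n (Equation 1.10).
print(round(mean(volumes), 1))

# The median is the middle value of the sorted list; with 7 values
# it is the 4th smallest: 831.8.
print(median(volumes))
```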
1.4. NUMERICAL DATA 31
The spread of a distribution refers to how similar or varied the values in the distribution are
to each other; i.e., whether the values are tightly clustered or spread over a wide range.
The standard deviation for a set of data describes the typical distance between an observation
and the mean. The distance of a single observation from the mean is its deviation. The deviations
for the 1st, 2nd, 3rd, and 431st observations in the clutch.volume variable are −704.7, −625.5,
−731.1, and 50.7 mm³, respectively.
The sample variance, essentially the average of the squares of these deviations, is denoted by s²:

s² = [(−704.7)² + (−625.5)² + (−731.1)² + · · · + (50.7)²] / (431 − 1)
   = [496,602.09 + 391,250.25 + 534,507.21 + · · · + 2,570.49] / 430
   = 143,680.9.

The denominator is n − 1 rather than n; this mathematical nuance accounts for the fact that the
sample mean has been used to estimate the population mean in the calculation. Details on the
statistical theory can be found in more advanced texts.
The sample standard deviation s is the square root of the variance:

s = √143,680.9 = 379.05 mm³.

Like the mean, the population values for variance and standard deviation are denoted by
Greek letters: σ² for the variance and σ for the standard deviation.

STANDARD DEVIATION
The sample standard deviation of a numerical variable is computed as the square root of the
variance, which is the sum of squared deviations divided by the number of observations minus 1:

s = √[ ((x1 − x̄)² + (x2 − x̄)² + · · · + (xn − x̄)²) / (n − 1) ].        (1.11)
Variability can also be measured using the interquartile range (IQR). The IQR for a distri-
bution is the difference between the first and third quartiles: Q3 − Q1 . The first quartile (Q1 ) is
equivalent to the 25th percentile; i.e., 25% of the data fall below this value. The third quartile
(Q3 ) is equivalent to the 75th percentile. By definition, the median represents the second quar-
tile, with half the values falling below it and half falling above. The IQR for clutch.volume is
1096.0 − 609.6 = 486.4 mm3 .
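The standard deviation and IQR can be computed from first principles with a short sketch. The data below are made-up values; note also that the quartile function uses one simple interpolation convention, while statistical software packages differ slightly in how they interpolate percentiles.

```python
from math import sqrt

def sample_sd(xs):
    """Sample standard deviation with the n - 1 denominator of Equation (1.11)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))

def quartiles(xs):
    """Q1 and Q3 as the 25th and 75th percentiles, using linear
    interpolation between order statistics (one common convention)."""
    s = sorted(xs)
    def pct(p):
        i = p * (len(s) - 1)
        lo, hi = int(i), min(int(i) + 1, len(s) - 1)
        return s[lo] + (i - lo) * (s[hi] - s[lo])
    return pct(0.25), pct(0.75)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # small illustrative sample
print(round(sample_sd(data), 3))  # typical distance from the mean
q1, q3 = quartiles(data)
print(q3 - q1)                    # the IQR: spread of the middle 50%
```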
Measures of center and spread are ways to summarize a distribution numerically. Using nu-
merical summaries allows for a distribution to be efficiently described with only a few numbers. 30
For example, the calculations for clutch.volume indicate that the typical egg clutch has volume
of about 880 mm3 , while the middle 50% of egg clutches have volumes between approximately
600 mm3 and 1100.0 mm3 .
Figure 1.15 shows the values of clutch.volume as points on a single axis. There are a few
values that seem extreme relative to the other observations: the four largest values, which appear
distinct from the rest of the distribution. How do these extreme values affect the value of the
numerical summaries?
Figure 1.15: Dot plot of clutch volumes from the frog data.
Figure 1.16 shows the summary statistics calculated under two scenarios, one with and one
without the four largest observations. For these data, the median does not change, while the IQR
differs by only about 6 mm3 . In contrast, the mean and standard deviation are much more affected,
particularly the standard deviation.
Figure 1.16: A comparison of how the median, IQR, mean (x), and standard devi-
ation (s) change when extreme observations are present.
The median and IQR are referred to as robust estimates because extreme observations have
little effect on their values. For distributions that contain extreme values, the median and IQR will
provide a more accurate sense of the center and spread than the mean and standard deviation.
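The robustness of the median can be demonstrated directly: append one extreme observation to a small dataset and compare the summaries. The numbers below are invented for illustration.

```python
from statistics import mean, median, stdev

# Illustrative data; one extreme observation is then appended.
base = [10, 12, 13, 14, 15, 15, 16, 18]
with_outlier = base + [100]

# The mean and standard deviation shift noticeably...
print(round(mean(base), 2), round(mean(with_outlier), 2))
print(round(stdev(base), 2), round(stdev(with_outlier), 2))

# ...while the median barely moves: it is a robust estimate.
print(median(base), median(with_outlier))
```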
Graphs show important features of a distribution that are not evident from numerical sum-
maries, such as asymmetry or extreme values. While dot plots show the exact value of each obser-
vation, histograms and boxplots graphically summarize distributions.
In a histogram, observations are grouped into bins and plotted as bars. Figure 1.17 shows the
number of clutches with volume between 0 and 200 mm3 , 200 and 400 mm3 , etc. up until 2,600
and 2,800 mm3 . 31 These binned counts are plotted in Figure 1.18.
Clutch volume (mm³)   0–200   200–400   400–600   600–800   · · ·   2,400–2,600   2,600–2,800
Count                   4        29        69        99      · · ·        2             1

Figure 1.17: Binned counts for the clutch volume data.

[Figure 1.18: histogram of clutch volume (mm³); the y-axis shows frequency.]
Histograms provide a view of the data density. Higher bars indicate more frequent obser-
vations, while lower bars represent relatively rare observations. Figure 1.18 shows that most of
the egg clutches have volumes between 500-1,000 mm3 , and there are many more clutches with
volumes smaller than 1,000 mm3 than clutches with larger volumes.
Histograms show the shape of a distribution. The tails of a symmetric distribution are
roughly equal, with data trailing off from the center roughly equally in both directions. Asym-
metry arises when one tail of the distribution is longer than the other. A distribution is said to be
right skewed when data trail off to the right, and left skewed when data trail off to the left. 32 Fig-
ure 1.18 shows that the distribution of clutch volume is right skewed; most clutches have relatively
small volumes, and only a few clutches have high volumes.
31 By default in R, the bins are left-open and right-closed; i.e., the intervals are of the form (a, b]. Thus, an observation
with value 200 would fall into the 0-200 bin instead of the 200-400 bin.
32 Other ways to describe data that are skewed to the right/left: skewed to the high/low end, or skewed to the
positive/negative end.
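The left-open, right-closed binning convention described in the footnote can be reproduced by hand: for bins of width 200, the bin index of a value v is ceil(v / 200), so a value of exactly 200 falls in the (0, 200] bin. The values below are illustrative.

```python
from collections import Counter
from math import ceil

def bin_counts(values, width=200):
    """Count observations into left-open, right-closed bins (a, b],
    matching R's default histogram convention described above."""
    # ceil(v / width) maps v = 200 to bin 1, i.e. the (0, 200] bin.
    return Counter(ceil(v / width) for v in values)

volumes = [180.0, 200.0, 200.1, 450.0, 831.8]  # illustrative values
counts = bin_counts(volumes)
print(counts[1])  # 2 observations in (0, 200]: 180.0 and 200.0
print(counts[2])  # 1 observation in (200, 400]: 200.1
```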
A mode is represented by a prominent peak in the distribution. 33 Figure 1.19 shows his-
tograms that have one, two, or three major peaks. Such distributions are called unimodal, bi-
modal, and multimodal, respectively. Any distribution with more than two prominent peaks is
called multimodal. Note that the less prominent peak in the unimodal distribution was not counted
since it only differs from its neighboring bins by a few observations. Prominent is a subjective term,
but it is usually clear in a histogram where the major peaks are.
Figure 1.19: From left to right: unimodal, bimodal, and multimodal distributions.
A boxplot indicates the positions of the first, second, and third quartiles of a distribution
in addition to extreme observations. 34 Figure 1.20 shows a boxplot of clutch.volume alongside a
vertical dot plot.
[Figure 1.20: boxplot of clutch volume (mm³), annotated from top to bottom with: outliers,
upper whisker, Q3 (third quartile), median (second quartile), Q1 (first quartile), and lower
whisker.]
Figure 1.20: A boxplot and dot plot of clutch.volume. The horizontal dashes
indicate the bottom 50% of the data and the open circles represent the top 50%.
33 Another definition of mode, which is not typically used in statistics, is the value with the most occurrences. It is
common that a dataset contains no observations with the same value, which makes this other definition impractical for
many datasets.
34 Boxplots are also known as box-and-whisker plots.
In a boxplot, the interquartile range is represented by a rectangle extending from the first
quartile to the third quartile, and the rectangle is split by the median (second quartile). Extending
outwards from the box, the whiskers capture the data that fall between Q1 −1.5×IQR and Q3 +1.5×
IQR. The whiskers must end at data points; the values given by adding or subtracting 1.5 × IQR
define the maximum reach of the whiskers. For example, with the clutch.volume variable, Q3 +
1.5×IQR = 1, 096.5+1.5×486.4 = 1, 826.1 mm3 . However, there was no clutch with volume 1,826.1
mm3 ; thus, the upper whisker extends to 1,819.7 mm3 , the largest observation that is smaller than
Q3 + 1.5 × IQR.
Any observation that lies beyond the whiskers is shown with a dot; these observations are
called outliers. An outlier is a value that appears extreme relative to the rest of the data. For
the clutch.volume variable, there are several large outliers and no small outliers, indicating the
presence of some unusually large egg clutches.
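The whisker-and-outlier rule can be sketched as a small function: compute the 1.5 × IQR fences, end the whiskers at the most extreme observations inside the fences, and flag everything beyond them. The data and quartiles below are toy values chosen to show the mechanics.

```python
def whisker_ends(values, q1, q3):
    """Whisker endpoints: the most extreme observations that still lie
    within 1.5 x IQR of the quartiles; points beyond are outliers."""
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if lo_fence <= v <= hi_fence]
    outliers = [v for v in values if v < lo_fence or v > hi_fence]
    # Whiskers must end at actual data points, not at the fences themselves.
    return min(inside), max(inside), outliers

# Toy data with made-up quartiles, just to illustrate the rule.
data = [1, 4, 5, 6, 7, 8, 9, 30]
low, high, out = whisker_ends(data, q1=4.5, q3=8.5)
print(low, high, out)  # whiskers end at 1 and 9; 30 is an outlier
```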
The high outliers in Figure 1.20 reflect the right-skewed nature of the data. The right skew is
also observable from the position of the median relative to the first and third quartiles; the median
is slightly closer to the first quartile. In a symmetric distribution, the median will be halfway
between the first and third quartiles.
When working with strongly skewed data, it can be useful to apply a transformation, and
rescale the data using a function. A natural log transformation is commonly used to clarify the
features of a variable when there are many values clustered near zero and all observations are
positive.
Figure 1.22: (a) Histogram of per capita income. (b) Histogram of the log-
transformed per capita income.
For example, income data are often skewed right; there are typically large clusters of low to
moderate income, with a few large incomes that are outliers. Figure 1.22(a) shows a histogram of
average yearly per capita income measured in US dollars for 165 countries in 2011. 36 The data are
heavily right skewed, with the majority of countries having average yearly per capita income lower
than $10,000. Once the data are log-transformed, the distribution becomes roughly symmetric
(Figure 1.22(b)). 37
For symmetric distributions, the mean and standard deviation are particularly informative
summaries. If a distribution is symmetric, approximately 70% of the data are within one standard
deviation of the mean and 95% of the data are within two standard deviations of the mean; this
guideline is known as the empirical rule.
EXAMPLE 1.13
On the log-transformed scale, mean log income is 8.50, with standard deviation 1.54. Apply the
empirical rule to describe the distribution of average yearly per capita income among the 165
countries.
According to the empirical rule, the middle 70% of the data are within one standard deviation of
the mean, in the range (8.50 - 1.54, 8.50 + 1.54) = (6.96, 10.04) log(USD). 95% of the data are within
two standard deviations of the mean, in the range (8.50 - 2(1.54), 8.50 + 2(1.54)) = (5.42, 11.58)
log(USD).
Undo the log transformation. The middle 70% of the data are within the range (e6.96 , e10.04 )
= ($1,054, $22,925). The middle 95% of the data are within the range (e5.42 , e11.58 ) = ($226,
$106,937).
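The back-transformation in Example 1.13 is a direct application of the exponential function to the interval endpoints computed on the log scale:

```python
from math import exp

# Back-transform the empirical-rule intervals from Example 1.13,
# using the mean (8.50) and standard deviation (1.54) of log income.
mean_log, sd_log = 8.50, 1.54

one_sd = (exp(mean_log - sd_log), exp(mean_log + sd_log))
two_sd = (exp(mean_log - 2 * sd_log), exp(mean_log + 2 * sd_log))

print(tuple(round(v) for v in one_sd))  # ≈ (1054, 22925) USD
print(tuple(round(v) for v in two_sd))  # ≈ (226, 106937) USD
```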
Functions other than the natural log can also be used to transform data, such as the square
root and inverse.
This section introduces tables and plots for summarizing categorical data, using the famuss
dataset introduced in Section 1.2.2.
A table for a single variable is called a frequency table. Figure 1.23 is a frequency table for
the actn3.r577x variable, showing the distribution of genotype at location r577x on the ACTN3
gene for the FAMuSS study participants.
In a relative frequency table like Figure 1.24, the proportion of observations in each category is
shown instead of the count.
Figure 1.23: Frequency table of actn3.r577x.

                 CC     CT     TT    Sum
  Counts        173    261    161    595

Figure 1.24: Relative frequency table of actn3.r577x.

                 CC     CT     TT    Sum
  Proportions  0.291  0.439  0.271  1.000
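A relative frequency table is obtained from a frequency table by dividing each count by the total, as can be checked with the genotype counts above:

```python
# Convert the genotype counts (Figure 1.23) into the relative
# frequencies shown in Figure 1.24.
counts = {"CC": 173, "CT": 261, "TT": 161}
total = sum(counts.values())  # 595 study participants

proportions = {g: round(n / total, 3) for g, n in counts.items()}
print(proportions)  # {'CC': 0.291, 'CT': 0.439, 'TT': 0.271}
```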
A bar plot is a common way to display a single categorical variable. The left panel of Fig-
ure 1.25 shows a bar plot of the counts per genotype for the actn3.r577x variable. The plot in the
right panel shows the proportion of observations that are in each level (i.e. in each genotype).
Figure 1.25: Two bar plots of actn3.r577x. The left panel shows the counts, and
the right panel shows the proportions for each genotype.
This section introduces numerical and graphical methods for exploring and summarizing re-
lationships between two variables. Approaches vary depending on whether the two variables are
both numerical, both categorical, or whether one is numerical and one is categorical.
Scatterplots
In the frog parental investment study, researchers used clutch volume as a primary variable of
interest rather than egg size because clutch volume represents both the eggs and the protective
gelatinous matrix surrounding the eggs. The larger the clutch volume, the higher the energy re-
quired to produce it; thus, higher clutch volume is indicative of increased maternal investment.
Previous research has reported that larger body size allows females to produce larger clutches; is
this idea supported by the frog data?
A scatterplot provides a case-by-case view of the relationship between two numerical vari-
ables. Figure 1.26 shows clutch volume plotted against body size, with clutch volume on the y-axis
and body size on the x-axis. Each point represents a single case. For this example, each case is one
egg clutch for which both volume and body size (of the female that produced the clutch) have been
recorded.
[Figure 1.26: scatterplot of clutch volume (mm³) on the y-axis against female body size on the
x-axis; each point represents one egg clutch.]
The plot shows a discernible pattern, which suggests an association, or relationship, between
clutch volume and body size; the points tend to lie in a straight line, which is indicative of a linear
association. Two variables are positively associated if increasing values of one tend to occur with increasing values of the other; two variables are negatively associated if increasing values of one variable tend to occur with decreasing values of the other. If there is no evident relationship between two variables, they are said to be uncorrelated or independent.
As expected, clutch volume and body size are positively associated; larger frogs tend to produce egg clutches with larger volumes. These observations suggest that larger females are capable of investing more energy into offspring production than smaller females.
The National Health and Nutrition Examination Survey (NHANES) consists of a set of surveys
and measurements conducted by the US CDC to assess the health and nutritional status of adults
and children in the United States. The following example uses data from a sample of 500 adults
(individuals ages 21 and older) from the NHANES dataset. 38
EXAMPLE 1.14
Body mass index (BMI) is a measure of weight commonly used by health agencies to assess whether
someone is overweight, and is calculated from height and weight. 39 Describe the relationships
shown in Figure 1.27. Why is it helpful to use BMI as a measure of obesity, rather than weight?
Figure 1.27(a) shows a positive association between height and weight; taller individuals tend to
be heavier. Figure 1.27(b) shows that height and BMI do not seem to be associated; the range of
BMI values observed is roughly consistent across height.
Weight itself is not a good measure of whether someone is overweight; instead, it is more reasonable
to consider whether someone’s weight is unusual relative to other individuals of a comparable
height. An individual weighing 200 pounds who is 6 ft tall is not necessarily an unhealthy weight;
however, someone who weighs 200 pounds and is 5 ft tall is likely overweight. It is not reasonable
to classify individuals as overweight or obese based only on weight.
BMI acts as a relative measure of weight that accounts for height. Specifically, BMI is used as an
estimate of body fat. According to the US National Institutes of Health (NIH) and the World Health Organization (WHO), a BMI between 25.0 and 29.9 is considered overweight, and a BMI over 30 is considered obese. 40
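Since BMI is calculated from height and weight, the standard formula (weight in kilograms divided by the square of height in meters) can be sketched in a few lines; the highlighted NHANES participant from Figure 1.27 serves as a check. The function name is ours, not the book's.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight (kg) divided by height (m) squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2

# The participant highlighted in Figure 1.27: 163.9 cm tall, 144.6 kg.
print(round(bmi(144.6, 163.9), 2))  # → 53.83, matching the figure caption
```

A BMI this far above 30 places the highlighted participant well into the obese range under the NIH/WHO cutoffs quoted above.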
Figure 1.27: (a) A scatterplot showing height versus weight from the 500 individuals in the sample from NHANES. One participant, 163.9 cm tall (about 5 ft, 4 in) and weighing 144.6 kg (about 319 lb), is highlighted. (b) A scatterplot showing height versus BMI from the 500 individuals in the sample from NHANES. The same individual highlighted in (a) is marked here, with BMI 53.83.
EXAMPLE 1.15
Figure 1.28 is a scatterplot of life expectancy versus annual per capita income for 165 countries in
2011. Life expectancy is measured as the expected lifespan for children born in 2011 and income
is adjusted for purchasing power in a country. Describe the relationship between life expectancy
and annual per capita income; do they seem to be linearly associated?
Life expectancy and annual per capita income are positively associated; higher per capita income
is associated with longer life expectancy. However, the two variables are not linearly associated.
When income is low, small increases in per capita income are associated with relatively large in-
creases in life expectancy. However, once per capita income exceeds approximately $20,000 per
year, increases in income are associated with smaller gains in life expectancy.
In a linear association, the change in the y-variable per unit change in the x-variable is consistent across the range of the x-variable; for example, a linear association would be present if an increase in income of $10,000 corresponded to an increase in life expectancy of 5 years across the entire range of income.
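The definition above can be made concrete with a short check: under a linear association, the slope between consecutive points is constant. The toy data below are hypothetical and simply follow the text's example of 5 additional years of life expectancy per $10,000 of income.

```python
def pairwise_slopes(xs, ys):
    """Slope between each pair of consecutive (x, y) points."""
    return [(y2 - y1) / (x2 - x1)
            for (x1, y1), (x2, y2) in zip(zip(xs, ys), zip(xs[1:], ys[1:]))]

# Hypothetical linear relationship: +5 years per +$10,000 of income.
income = [10_000, 20_000, 30_000, 40_000]
life_exp = [55, 60, 65, 70]
print(pairwise_slopes(income, life_exp))  # every slope is 0.0005
```

For the actual wdi.2011 data the slopes would shrink as income grows, which is exactly the nonlinearity described in Example 1.15.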
Figure 1.28: A scatterplot of life expectancy (years) versus annual per capita in-
come (US dollars) in the wdi.2011 dataset.
Correlation
Correlation is a numerical summary statistic that measures the strength of a linear relationship between two variables. It is denoted by r, the correlation coefficient, which takes on values between -1 and 1.
If the paired values of two variables lie exactly on a line, r = ±1; the closer the correlation
coefficient is to ±1, the stronger the linear association. When two variables are positively associated,
with paired values that tend to lie on a line with positive slope, r > 0. If two variables are negatively
associated, r < 0. A value of r that is 0 or approximately 0 indicates no apparent association between
two variables. 41
41 If paired values lie perfectly on either a horizontal or vertical line, there is no association and r is mathematically
undefined.
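The correlation coefficient described above can be computed directly from its definition: the sum of products of paired deviations from the means, divided by the product of the root sums of squared deviations. A minimal sketch (the function name and data are ours, not the book's):

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r between paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Root sums of squared deviations for each variable.
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Paired values lying exactly on a line with positive slope give r = 1.
print(round(correlation([1, 2, 3, 4], [2, 4, 6, 8]), 10))  # → 1.0
```

As the text notes, values on a line with negative slope give r = -1, and r is undefined when either variable is constant (sd_x or sd_y would be zero).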