Introductory Statistics for the
Life and Biomedical Sciences
First Edition

Julie Vu
Preceptor in Statistics
Harvard University

David Harrington
Professor of Biostatistics (Emeritus)
Harvard T.H. Chan School of Public Health
Dana-Farber Cancer Institute

A tablet-friendly version of this PDF where margins have been minimized
can be found in the book's extra files.
Copyright © 2020. First Edition.
Version date: July 26th, 2020.

This textbook and its supplements, including slides and labs, may be downloaded for free at
openintro.org/book/biostat.

This textbook is a derivative of OpenIntro Statistics 3rd Edition by Diez, Barr, and Çetinkaya-Rundel,
and it is available under a Creative Commons Attribution-ShareAlike 3.0 Unported United
States license. License details are available at the Creative Commons website:
creativecommons.org.

Source files for this book may be found on GitHub at
github.com/OI-Biostat/oi_biostat_text.

Table of Contents

1 Introduction to data
1.1 Case study
1.2 Data basics
1.3 Data collection principles
1.4 Numerical data
1.5 Categorical data
1.6 Relationships between two variables
1.7 Exploratory data analysis
1.8 Notes
1.9 Exercises

2 Probability
2.1 Defining probability
2.2 Conditional probability
2.3 Extended example
2.4 Notes
2.5 Exercises

3 Distributions of random variables
3.1 Random variables
3.2 Binomial distribution
3.3 Normal distribution
3.4 Poisson distribution
3.5 Distributions related to Bernoulli trials
3.6 Distributions for pairs of random variables
3.7 Notes
3.8 Exercises

4 Foundations for inference
4.1 Variability in estimates
4.2 Confidence intervals
4.3 Hypothesis testing
4.4 Notes
4.5 Exercises

5 Inference for numerical data
5.1 Single-sample inference with the t-distribution
5.2 Two-sample test for paired data
5.3 Two-sample test for independent data
5.4 Power calculations for a difference of means
5.5 Comparing means with ANOVA
5.6 Notes
5.7 Exercises

6 Simple linear regression
6.1 Examining scatterplots
6.2 Estimating a regression line using least squares
6.3 Interpreting a linear model
6.4 Statistical inference with regression
6.5 Interval estimates with regression
6.6 Notes
6.7 Exercises

7 Multiple linear regression
7.1 Introduction to multiple linear regression
7.2 Simple versus multiple regression
7.3 Evaluating the fit of a multiple regression model
7.4 The general multiple linear regression model
7.5 Categorical predictors with several levels
7.6 Reanalyzing the PREVEND data
7.7 Interaction in regression
7.8 Model selection for explanatory models
7.9 The connection between ANOVA and regression
7.10 Notes
7.11 Exercises

8 Inference for categorical data
8.1 Inference for a single proportion
8.2 Inference for the difference of two proportions
8.3 Inference for two or more groups
8.4 Chi-square tests for the fit of a distribution
8.5 Outcome-based sampling: case-control studies
8.6 Notes
8.7 Exercises

A End of chapter exercise solutions

B Distribution tables

Index

Foreword

The past year has been challenging for the health sciences in ways that we could not have imagined
when we started writing 5 years ago. The rapid spread of the SARS coronavirus (SARS-CoV-2)
worldwide has upended the scientific research process and highlighted the need for maintaining a
balance between speed and reliability. Major medical journals have dramatically increased the pace
of publication; the urgency of the situation necessitates that data and research findings be made
available as quickly as possible to inform public policy and clinical practice. Yet it remains essential
that studies undergo rigorous review; the retraction of two high-profile coronavirus studies 1, 2
sparked widespread concerns about data integrity, reproducibility, and the editorial process.
In parallel, deepening public awareness of structural racism has caused a re-examination of
the role of race in published studies in health and medicine. A recent review of algorithms used to
direct treatment in areas such as cardiology, obstetrics and oncology uncovered examples of race
used in ways that may lead to substandard care for people of color. 3 The SARS-CoV-2 pandemic
has reminded us once again that marginalized populations are disproportionately at risk for bad
health outcomes. Data on 17 million patients in England 4 suggest that Blacks and South Asians
have a death rate that is approximately 50% higher than white members of the population.
Understanding the SARS coronavirus and tackling racial disparities in health outcomes are
but two of the many areas in which Biostatistics will play an important role in the coming decades.
Much of that work will be done by those now beginning their study of Biostatistics. We hope this
book provides an accessible point of entry for students planning to begin work in biology, medicine,
or public health. While the material presented in this book is essential for understanding the
foundations of the discipline, we advise readers to remember that a mastery of technical details is
secondary to choosing important scientific questions, examining data without bias, and reporting
results that transparently display the strengths and weaknesses of a study.

1 Mandeep R. Mehra et al. “Retraction: Cardiovascular Disease, Drug Therapy, and Mortality in Covid-19. N Engl J
Med. DOI: 10.1056/NEJMoa2007621.” In: New England Journal of Medicine 382.26 (2020), pp. 2582–2582.
doi: 10.1056/NEJMc2021225.
2 Mandeep R. Mehra et al. “RETRACTED: Hydroxychloroquine or chloroquine with or without a macrolide for treatment
of COVID-19: a multinational registry analysis”. In: The Lancet (2020).
doi: 10.1016/S0140-6736(20)31180-6.
3 Darshali A. Vyas et al. “Hidden in Plain Sight — Reconsidering the Use of Race Correction in Clinical Algorithms”.
In: New England Journal of Medicine (2020). doi: 10.1056/NEJMms2004740.
4 Elizabeth J. Williamson et al. “OpenSAFELY: factors associated with COVID-19 death in 17 million patients”. In:
Nature (2020). issn: 1476-4687.

Preface

This text introduces statistics and its applications in the life sciences and biomedical research. It is
based on the freely available OpenIntro Statistics, and, like OpenIntro, it may be downloaded at no
cost. 5 In writing Introductory Statistics for the Life and Biomedical Sciences, we have added substantial
new material, but also retained some examples and exercises from OpenIntro that illustrate
important ideas even if they do not relate directly to medicine or the life sciences. Because of its
link to the original OpenIntro project, this text is often referred to as OpenIntro Biostatistics in the
supplementary materials.
This text is intended for undergraduate and graduate students interested in careers in biology
or medicine, and may also be profitably read by students of public health. It covers
many of the traditional introductory topics in statistics, in addition to discussing some newer
methods being used in molecular biology.
Statistics has become an integral part of research in medicine and biology, and the tools for
summarizing data and drawing inferences from data are essential both for understanding the outcomes
of studies and for incorporating measures of uncertainty into that understanding. An introductory
text in statistics for students who will work in medicine, public health, or the life sciences
should be more than simply the usual introduction, supplemented with an occasional example
from biology or medical science. By drawing the majority of examples and exercises in this text
from published data, we hope to convey the value of statistics in medical and biological research. In
cases where examples draw on important material in biology or medicine, the problem statement
contains the necessary background information.
Computing is an essential part of the practice of statistics. Nearly everyone entering the
biomedical sciences will need to interpret the results of analyses conducted in software; many
will also need to be capable of conducting such analyses. The text and associated materials separate
those two activities to allow students and instructors to emphasize either or both skills. The
text discusses the important features of figures and tables used to support an interpretation, rather
than the process of generating such material from data. This allows students whose main focus
is understanding statistical concepts not to be distracted by the details of a particular software
package. In our experience, however, we have found that many students enter a research setting
after only a single course in statistics. These students benefit from a practical introduction to data
analysis that incorporates the use of a statistical computing language. The self-paced learning labs
associated with the text provide such an introduction; these are described in more detail later in
this preface. The datasets used in this book are available via the R openintro package available on
CRAN 6 and the R oibiostat package available via GitHub.
5 PDF available at https://www.openintro.org/book/biostat/ and source available at
https://github.com/OI-Biostat/oi_biostat_text.
6 Diez DM, Barr CD, Çetinkaya-Rundel M. 2012. openintro: OpenIntro data sets and supplement functions.
http://cran.r-project.org/web/packages/openintro.

Textbook overview
The chapters of this book are as follows:

1. Introduction to data. Data structures, basic data collection principles, numerical and graphical
summaries, and exploratory data analysis.
2. Probability. The basic principles of probability.
3. Distributions of random variables. Introduction to random variables, distributions of discrete
and continuous random variables, and distributions for pairs of random variables.
4. Foundations for inference. General ideas for statistical inference in the context of estimating a
population mean.
5. Inference for numerical data. Inference for one-sample and two-sample means with the t-distribution,
power calculations for a difference of means, and ANOVA.
6. Simple linear regression. An introduction to linear regression with a single explanatory variable,
evaluating model assumptions, and inference in a regression context.
7. Multiple linear regression. General multiple regression model, categorical predictors with more
than two values, interaction, and model selection.
8. Inference for categorical data. Inference for single proportions, inference for two or more groups,
and outcome-based sampling.

Examples, exercises, and appendices


Examples in the text help with an understanding of how to apply methods:

EXAMPLE 0.1
This is an example. When a question is asked here, where can the answer be found?

The answer can be found here, in the solution section of the example.

When we think the reader would benefit from working out the solution to an example, we frame it
as Guided Practice.

GUIDED PRACTICE 0.2


The reader may check or learn the answer to any Guided Practice problem by reviewing the full
solution in a footnote. 7

There are exercises at the end of each chapter that are useful for practice or homework assignments.
Solutions to odd-numbered problems can be found in Appendix A. Readers will notice
that there are fewer end-of-chapter exercises in the last three chapters. The more complicated
methods, such as multiple regression, do not always lend themselves to hand calculation, and
computing is increasingly important both to gain practical experience with these methods and to
explore complex datasets. For students more interested in concepts than computing, however, we
have included useful end of chapter exercises that emphasize the interpretation of output from
statistical software.
Probability tables for the normal, t, and chi-square distributions are in Appendix B, and PDF
copies of these tables are also available from openintro.org for anyone to download, print, share, or
modify. The labs and the text also illustrate the use of simple R commands to calculate probabilities
from common distributions.
7 Guided Practice problems are intended to stretch your thinking, and you can check yourself by reviewing the footnote
solution for any Guided Practice.

Self-paced learning labs


The labs associated with the text can be downloaded from github.com/OI-Biostat/oi_biostat_labs.
They provide guidance on conducting data analysis and visualization with the R statistical
language and the computing environment RStudio, while building understanding of statistical
concepts. The labs begin from first principles and require no previous experience with statistical
software. Both R and RStudio are freely available for all major computing operating systems, and
the Unit 0 labs (00_getting_started) provide information on downloading and installing them.
Information on downloading and installing the packages may also be found at openintro.org.
The labs for each chapter all have the same structure. Each lab consists of a set of three
documents: a handout with the problem statements, a template to be used for working through
the lab, and a solution set with the problem solutions. The handout and solution set are most
easily read in PDF format (although Rmd files are also provided), while the template is an Rmd
file that can be loaded into RStudio. Each chapter of labs is accompanied by a set of "Lab Notes",
which provides a reference guide of all new R functions discussed in the labs.
Learning is best done, of course, if a student attempts the lab exercises before reading the
solutions. The "Lab Notes" may be a useful resource to refer to while working through problems.

OpenIntro, online resources, and getting involved


OpenIntro is an organization focused on developing free and affordable education materials. The
first project, OpenIntro Statistics, is intended for introductory statistics courses at the high school
through university levels. Other projects examine the use of randomization methods for learning
about statistics and conducting analyses (Introductory Statistics with Randomization and Simulation)
and advanced statistics that may be taught at the high school level (Advanced High School Statistics).
We encourage anyone learning or teaching statistics to visit openintro.org and get involved by
using the many online resources, which are all free, or by creating new material. Students can test
their knowledge with practice quizzes, or try an application of concepts learned in each chapter
using real data and the free statistical software R. Teachers can download the source for course
materials, labs, slides, datasets, and R figures, or create their own custom quizzes and problem sets for
students to take on the website. Everyone is also welcome to download the book’s source files to
create a custom version of this textbook or to simply share a PDF copy with a friend or on a website.
All of these products are free, and anyone is welcome to use these online tools and resources with
or without this textbook as a companion.

Acknowledgements
The OpenIntro project would not have been possible without the dedication of many people, including
the authors of OpenIntro Statistics, the OpenIntro team and the many faculty, students,
and readers who commented on all the editions of OpenIntro Statistics.
This text has benefited from feedback from Andrea Foulkes, Raji Balasubramanian, Curry
Hilton, Michael Parzen, Kevin Rader, and the many excellent teaching fellows at Harvard College
who assisted in courses using the book. The cover design was provided by Pierre Baduel.

Chapter 1
Introduction to data

1.1 Case study

1.2 Data basics

1.3 Data collection principles

1.4 Numerical data

1.5 Categorical data

1.6 Relationships between two variables

1.7 Exploratory data analysis

1.8 Notes

1.9 Exercises

Making observations and recording data form the backbone of empirical research,
and represent the beginning of a systematic approach to investigating scientific
questions. As a discipline, statistics focuses on addressing the following three
questions in a rigorous and efficient manner: How can data best be collected? How
should data be analyzed? What can be inferred from data?

This chapter provides a brief discussion on the principles of data collection, and
introduces basic methods for summarizing and exploring data.

For labs, slides, and other resources, please visit
www.openintro.org/book/biostat

1.1 Case study: preventing peanut allergies

The proportion of young children in Western countries with peanut allergies has doubled in
the last 10 years. Previous research suggests that exposing infants to peanut-based foods, rather
than excluding such foods from their diets, may be an effective strategy for preventing the development
of peanut allergies. The "Learning Early about Peanut Allergy" (LEAP) study was conducted
to investigate whether early exposure to peanut products reduces the probability that a child will
develop peanut allergies. 1
The study team enrolled children in the United Kingdom between 2006 and 2009, selecting
640 infants with eczema, egg allergy, or both. Each child was randomly assigned to either the
peanut consumption (treatment) group or the peanut avoidance (control) group. Children in the
treatment group were fed at least 6 grams of peanut protein daily until 5 years of age, while children
in the control group avoided consuming peanut protein until 5 years of age.
At 5 years of age, each child was tested for peanut allergy using an oral food challenge (OFC): 5
grams of peanut protein in a single dose. A child was recorded as passing the oral food challenge if
no allergic reaction was detected, and failing the oral food challenge if an allergic reaction occurred.
These children had previously been tested for peanut allergy through a skin test, conducted at the
time of study entry; the main analysis presented in the paper was based on data from 530 children
with an earlier negative skin test. 2
Individual-level data from the study are shown in Figure 1.1 for 5 of the 530 children—each
row represents a participant and shows the participant’s study ID number, treatment group assignment,
and OFC outcome. 3

participant.ID   treatment.group      overall.V60.outcome
LEAP_100522      Peanut Consumption   PASS OFC
LEAP_103358      Peanut Consumption   PASS OFC
LEAP_105069      Peanut Avoidance     PASS OFC
LEAP_994047      Peanut Avoidance     PASS OFC
LEAP_997608      Peanut Consumption   PASS OFC

Figure 1.1: Individual-level LEAP results, for five children.

The data can be organized in the form of a two-way summary table; Figure 1.2 shows the
results categorized by treatment group and OFC outcome.

                     FAIL OFC   PASS OFC   Sum
Peanut Avoidance           36        227   263
Peanut Consumption          5        262   267
Sum                        41        489   530

Figure 1.2: Summary of LEAP results, organized by treatment group (either
peanut avoidance or consumption) and result of the oral food challenge at 5 years
of age (either pass or fail).

1 Du Toit, George, et al. Randomized trial of peanut consumption in infants at risk for peanut allergy. New England
Journal of Medicine 372.9 (2015): 803-813.
2 Although a total of 542 children had an earlier negative skin test, data collection did not occur for 12 children.
3 The data are available as LEAP in the R package oibiostat.

The summary table makes it easier to identify patterns in the data. Recall that the question
of interest is whether children in the peanut consumption group are more or less likely to develop
peanut allergies than those in the peanut avoidance group. In the avoidance group, the proportion
of children failing the OFC is 36/263 = 0.137 (13.7%); in the consumption group, the proportion
of children failing the OFC is 5/267 = 0.019 (1.9%). Figure 1.3 shows a graphical method of displaying
the study results, using either the number of individuals per category from Figure 1.2 or
the proportion of individuals with a specific OFC outcome in a group.
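
The summary table and the within-group proportions can be reproduced directly from the counts in Figure 1.2. The following is a minimal Python sketch rather than the book's own code (the labs associated with the text use R); the names counts, row_totals, column_totals, and row_proportions are chosen here for illustration only.

```python
# Counts from Figure 1.2, keyed by treatment group and then by OFC outcome.
counts = {
    "Peanut Avoidance": {"FAIL OFC": 36, "PASS OFC": 227},
    "Peanut Consumption": {"FAIL OFC": 5, "PASS OFC": 262},
}


def row_totals(table):
    """Number of children in each treatment group (the 'Sum' column)."""
    return {group: sum(row.values()) for group, row in table.items()}


def column_totals(table):
    """Number of children with each OFC outcome (the 'Sum' row)."""
    totals = {}
    for row in table.values():
        for outcome, n in row.items():
            totals[outcome] = totals.get(outcome, 0) + n
    return totals


def row_proportions(table):
    """Proportion of each group that falls in each outcome category."""
    totals = row_totals(table)
    return {
        group: {outcome: n / totals[group] for outcome, n in row.items()}
        for group, row in table.items()
    }


# Proportion failing the OFC in each group, rounded as in the text.
fail_props = {
    group: round(props["FAIL OFC"], 3)
    for group, props in row_proportions(counts).items()
}

print(row_totals(counts))     # group sizes: 263 and 267
print(column_totals(counts))  # outcome totals: 41 and 489
print(fail_props)             # 0.137 for avoidance, 0.019 for consumption
```

Dividing each count by its row total, rather than by the overall total of 530, is what makes the two groups comparable even though they are not exactly the same size.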

[Figure 1.3: two side-by-side bar plots of OFC outcome (FAIL OFC, PASS OFC) by treatment group
(Peanut Avoidance, Peanut Consumption). Panel (a) shows counts on an axis from 0 to 250;
panel (b) shows proportions on an axis from 0.0 to 1.0.]

Figure 1.3: (a) A bar plot displaying the number of individuals who failed or
passed the OFC in each treatment group. (b) A bar plot displaying the proportions
of individuals in each group that failed or passed the OFC.

The proportion of participants failing the OFC is 11.8 percentage points higher in the peanut avoidance group
than the peanut consumption group (13.7% versus 1.9%). Another way to summarize the data is to compute the ratio of
the two proportions, (36/263)/(5/267) = 7.31, and conclude that the proportion of participants failing
the OFC in the avoidance group is more than 7 times as large as in the consumption group; i.e.,
the risk of failing the OFC was more than 7 times as great for participants in the avoidance group
relative to the consumption group.
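
The two summaries in this paragraph, the difference and the ratio of the failure proportions, can be checked with a short calculation. This is an illustrative Python sketch using the counts from Figure 1.2; the function and variable names are assumptions made for this example. It also shows that the ratio of 7.31 comes from the unrounded proportions, since dividing the rounded values 0.137 and 0.019 gives about 7.21.

```python
def risk(events, total):
    """Proportion of a group experiencing the event (here, failing the OFC)."""
    return events / total


# Counts from Figure 1.2.
risk_avoidance = risk(36, 263)
risk_consumption = risk(5, 267)

# Absolute comparison: difference in proportions.
risk_difference = risk_avoidance - risk_consumption

# Relative comparison: ratio of proportions.
relative_risk = risk_avoidance / risk_consumption

# The same ratio computed from proportions rounded to three decimals.
ratio_of_rounded = round(risk_avoidance, 3) / round(risk_consumption, 3)

print(f"difference: {risk_difference:.3f}")               # 0.118
print(f"ratio: {relative_risk:.2f}")                      # 7.31
print(f"ratio of rounded values: {ratio_of_rounded:.2f}")  # 7.21
```

The difference and the ratio answer different questions: the difference describes how many more children per hundred failed the OFC in the avoidance group, while the ratio describes how many times greater the risk was.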
Based on the results of the study, it seems that early exposure to peanut products may be
an effective strategy for reducing the chances of developing peanut allergies later in life. It is
important to note that this study was conducted in the United Kingdom at a single site of pediatric
care; it is not clear that these results can be generalized to other countries or cultures.
The results also raise an important statistical issue: does the study provide definitive evidence
that peanut consumption is beneficial? In other words, is the 11.8 percentage point difference between the two
groups larger than one would expect by chance variation alone? The material on inference in later
chapters will provide the statistical tools to evaluate this question.

1.2 Data basics

Effective organization and description of data is a first step in most analyses. This section
introduces a structure for organizing data and basic terminology used to describe data.

1.2.1 Observations, variables, and data matrices

In evolutionary biology, parental investment refers to the amount of time, energy, or other
resources devoted towards raising offspring. This section introduces the frog dataset, which originates
from a 2013 study about maternal investment in a frog species. 4 Reproduction is a costly
process for female frogs, necessitating a trade-off between individual egg size and total number of
eggs produced. Researchers were interested in investigating how maternal investment varies with
altitude and collected measurements on egg clutches found at breeding ponds across 11 study sites;
for 5 sites, the body size of individual female frogs was also recorded.

       altitude   latitude   egg.size   clutch.size   clutch.volume   body.size
1      3,462.00      34.82       1.95        181.97          177.83        3.63
2      3,462.00      34.82       1.95        269.15          257.04        3.63
3      3,462.00      34.82       1.95        158.49          151.36        3.72
150    2,597.00      34.05       2.24        537.03          776.25          NA

Figure 1.4: Data matrix for the frog dataset.

Figure 1.4 displays rows 1, 2, 3, and 150 of the data from the 431 clutches observed as part
of the study. 5 Each row in the table corresponds to a single clutch, indicating where the clutch
was collected (altitude and latitude), egg.size, clutch.size, clutch.volume, and body.size of
the mother when available. "NA" corresponds to a missing value, indicating that information on
an individual female was not collected for that particular clutch. The recorded characteristics are
referred to as variables; in this table, each column represents a variable.

variable        description
altitude        Altitude of the study site in meters above sea level
latitude        Latitude of the study site measured in degrees
egg.size        Average diameter of an individual egg to the 0.01 mm
clutch.size     Estimated number of eggs in clutch
clutch.volume   Volume of egg clutch in mm³
body.size       Length of mother frog in cm

Figure 1.5: Variables and their descriptions for the frog dataset.

It is important to check the definitions of variables, as they are not always obvious. For example,
why has clutch.size not been recorded as whole numbers? For a given clutch, researchers
counted approximately 5 grams’ worth of eggs and then estimated the total number of eggs based
on the mass of the entire clutch. Definitions of the variables are given in Figure 1.5. 6

4 Chen, W., et al. Maternal investment increases with altitude in a frog on the Tibetan Plateau. Journal of evolutionary
biology 26.12 (2013): 2710-2715.
5 The frog dataset is available in the R package oibiostat.
6 The data discussed here are in the original scale; in the published paper, some values have undergone a natural log
transformation.

The data in Figure 1.4 are organized as a data matrix. Each row of a data matrix corresponds
to an observational unit, and each column corresponds to a variable. A piece of the data matrix for
the LEAP study introduced in Section 1.1 is shown in Figure 1.1; the rows are study participants
and three variables are shown for each participant. Data matrices are a convenient way to record
and store data. If the data are collected for another individual, another row can easily be added;
similarly, another column can be added for a new variable.
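
The structure of a data matrix can be illustrated with the four rows shown in Figure 1.4. The sketch below is one possible Python representation, not the format used by the oibiostat package: each row is a dictionary, the missing value NA is stored as None, and the added column has.body.size is a hypothetical variable introduced only to show how a column is appended.

```python
# The four rows of the frog data shown in Figure 1.4; NA is stored as None.
frog = [
    {"altitude": 3462.0, "latitude": 34.82, "egg.size": 1.95,
     "clutch.size": 181.97, "clutch.volume": 177.83, "body.size": 3.63},
    {"altitude": 3462.0, "latitude": 34.82, "egg.size": 1.95,
     "clutch.size": 269.15, "clutch.volume": 257.04, "body.size": 3.63},
    {"altitude": 3462.0, "latitude": 34.82, "egg.size": 1.95,
     "clutch.size": 158.49, "clutch.volume": 151.36, "body.size": 3.72},
    {"altitude": 2597.0, "latitude": 34.05, "egg.size": 2.24,
     "clutch.size": 537.03, "clutch.volume": 776.25, "body.size": None},
]

# Each row is an observational unit (a clutch); each key is a variable.
variables = list(frog[0].keys())
n_observations = len(frog)

# Count the clutches for which the mother's body size was not recorded.
n_missing_body_size = sum(row["body.size"] is None for row in frog)


def add_variable(matrix, name, compute):
    """Add a column by computing a value for every row of the data matrix."""
    for row in matrix:
        row[name] = compute(row)


# Hypothetical new variable: was body size recorded for this clutch?
add_variable(frog, "has.body.size", lambda row: row["body.size"] is not None)

print(n_observations, variables)
print(n_missing_body_size)
print([row["has.body.size"] for row in frog])
```

A new observation would be added in the same spirit, by appending one more dictionary with the same keys to the list.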

1.2.2 Types of variables

The Functional polymorphisms Associated with human Muscle Size and Strength study (FAMuSS)
measured a variety of demographic, phenotypic, and genetic characteristics for about 1,300
participants. 7 Data from the study have been used in a number of subsequent studies, 8 such as
one examining the relationship between muscle strength and genotype at a location on the ACTN3
gene. 9
The famuss dataset is a subset of the data for 595 participants. 10 Four rows of the famuss
dataset are shown in Figure 1.6, and the variables are described in Figure 1.7.

      sex     age  race       height  weight  actn3.r577x  ndrm.ch
1     Female  27   Caucasian  65.0    199.0   CC           40.0
2     Male    36   Caucasian  71.7    189.0   CT           25.0
3     Female  24   Caucasian  65.0    134.0   CT           40.0
595   Female  30   Caucasian  64.0    134.0   CC           43.8

Figure 1.6: Four rows from the famuss data matrix.

variable      description
sex           Sex of the participant
age           Age in years
race          Race, recorded as African Am (African American), Caucasian, Asian,
              Hispanic, or Other
height        Height in inches
weight        Weight in pounds
actn3.r577x   Genotype at the location r577x in the ACTN3 gene
ndrm.ch       Percent change in strength in the non-dominant arm, comparing strength
              after to before training

Figure 1.7: Variables and their descriptions for the famuss dataset.

The variables age, height, weight, and ndrm.ch are numerical variables. They take on numer-
ical values, and it is reasonable to add, subtract, or take averages with these values. In contrast,
a variable reporting telephone numbers would not be classified as numerical, since sums, differ-
ences, and averages in this context have no meaning. Age measured in years is said to be discrete,
since it can only take on numerical values with jumps; i.e., positive integer values. Percent change
in strength in the non-dominant arm (ndrm.ch) is continuous, and can take on any value within a
specified range.

7 Thompson PD, Moyna M, Seip, R, et al., 2004. Functional Polymorphisms Associated with Human Muscle Size and
Strength. Medicine and Science in Sports and Exercise 36:1132 - 1139.
8 Pescatello L, et al. Highlights from the functional single nucleotide polymorphisms associated with human muscle
size and strength or FAMuSS study, BioMed Research International 2013.
9 Clarkson P, et al., Journal of Applied Physiology 99: 154-163, 2005.
10 The subset is from Foulkes, Andrea S. Applied statistical genetics with R: for population-based association studies.
Springer Science & Business Media, 2009. The full version of the data is available at http://people.umass.edu/foulkes/
asg/data.html.

Figure 1.8: Breakdown of variables into their respective types.

The variables sex, race, and actn3.r577x are categorical variables, which take on values that
are names or labels. The possible values of a categorical variable are called the variable’s levels. 11
For example, the levels of actn3.r577x are the three possible genotypes at this particular locus:
CC, CT, or TT. Categorical variables without a natural ordering are called nominal categorical
variables; sex, race, and actn3.r577x are all nominal categorical variables. Categorical variables
with levels that have a natural ordering are referred to as ordinal categorical variables. For exam-
ple, age of the participants grouped into 5-year intervals (15-20, 21-25, 26-30, etc.) is an ordinal
categorical variable.
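
As a rough sketch of these distinctions, the classification of the famuss variables can be written down as a lookup table, and a numerical variable such as age can be converted into an ordinal categorical one by grouping. The helper `age_group` and the dictionary `variable_types` are hypothetical names introduced for this example; the interval labels follow the grouping mentioned in the text, with later intervals assumed to continue in 5-year steps.

```python
# Variable types for the famuss dataset, following the text's classification.
variable_types = {
    "sex": "categorical (nominal)",
    "race": "categorical (nominal)",
    "actn3.r577x": "categorical (nominal)",
    "age": "numerical (discrete)",
    "height": "numerical",
    "weight": "numerical",
    "ndrm.ch": "numerical (continuous)",
}

def age_group(age):
    """Map an age in years to an ordered interval label (15-20, 21-25, 26-30, ...)."""
    if 15 <= age <= 20:
        return "15-20"
    if age > 20:
        # Assumption: intervals after the first continue as 21-25, 26-30, and so on.
        lower = 21 + 5 * ((age - 21) // 5)
        return f"{lower}-{lower + 4}"
    raise ValueError("age is below the first interval")

ages = [27, 36, 24, 30]  # ages from the rows shown in Figure 1.6
groups = [age_group(a) for a in ages]
print(groups)
```

The grouped labels have a natural ordering but no longer support arithmetic such as averaging, which is what distinguishes an ordinal categorical variable from a numerical one.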

EXAMPLE 1.1
Classify the variables in the frog dataset: altitude, latitude, egg.size, clutch.size,
clutch.volume, and body.size.

The variables egg.size, clutch.size, clutch.volume, and body.size are continuous numerical
variables, and can take on all positive values.
In the context of this study, the variables altitude and latitude are best described as categorical
variables, since the numerical values of the variables correspond to the 11 specific study sites where
data were collected. Researchers were interested in exploring the relationship between altitude and
maternal investment; it would be reasonable to consider altitude an ordinal categorical variable.

GUIDED PRACTICE 1.2


Characterize the variables treatment.group and overall.V60.outcome from the LEAP study (dis-
cussed in Section 1.1). 12

GUIDED PRACTICE 1.3


Suppose that on a given day, a research assistant collected data on the first 20 individuals visiting a
walk-in clinic: age (measured as less than 21, 21 - 65, and greater than 65 years of age), sex, height,
weight, and reason for the visit. Classify each of the variables. 13

11 Categorical variables are sometimes called factor variables.


12 These variables measure non-numerical quantities, and thus are categorical variables with two levels.
13 Height and weight are continuous numerical variables. Age as measured by the research assistant is ordinal categorical.
Sex and the reason for the visit are nominal categorical variables.

1.2.3 Relationships between variables

Many studies are motivated by a researcher examining how two or more variables are related.
For example, do the values of one variable increase as the values of another decrease? Do the values
of one variable tend to differ by the levels of another variable?
One study used the famuss data to investigate whether ACTN3 genotype at a particular lo-
cation (residue 577) is associated with change in muscle strength. The ACTN3 gene codes for a
protein involved in muscle function. A common mutation in the gene at a specific location changes
the cytosine (C) nucleotide to a thymine (T) nucleotide; individuals with the TT genotype are un-
able to produce any ACTN3 protein.
Researchers hypothesized that genotype at this location might influence muscle function. As
a measure of muscle function, they recorded the percent change in non-dominant arm strength
after strength training; this variable, ndrm.ch, is the response variable in the study. A response
variable is defined by the particular research question a study seeks to address, and measures the
outcome of interest in the study. A study will typically examine whether the values of a response
variable differ as values of an explanatory variable change, and if so, how the two variables are
related. A given study may examine several explanatory variables for a single response variable. 14
The explanatory variable examined in relation to ndrm.ch in the study is actn3.r577x, ACTN3
genotype at location 577.

EXAMPLE 1.4
In the maternal investment study conducted on frogs, researchers collected measurements on egg
clutches and female frogs at 11 study sites, located at differing altitudes, in order to investigate
how maternal investment varies with altitude. Identify the response and explanatory variables in
the study.

The variables egg.size, clutch.size, and clutch.volume are response variables indicative of ma-
ternal investment.
The explanatory variable examined in the study is altitude.
While latitude is an environmental factor that might potentially influence features of the egg
clutches, it is not a variable of interest in this particular study.
Female body size (body.size) is neither an explanatory nor response variable.

GUIDED PRACTICE 1.5


Refer to the variables from the famuss dataset described in Figure 1.7 to formulate a question about
the relationships between these variables, and identify the response and explanatory variables in
the context of the question. 15

14 Response variables are sometimes called dependent variables and explanatory variables are often called independent
variables or predictors.
15 Two sample questions: (1) Does change in participant arm strength after training seem associated with race? The
response variable is ndrm.ch and the explanatory variable is race. (2) Do male participants appear to respond differently to
strength training than females? The response variable is ndrm.ch and the explanatory variable is sex.

1.3 Data collection principles

The first step in research is to identify questions to investigate. A clearly articulated research
question is essential for selecting subjects to be studied, identifying relevant variables, and deter-
mining how data should be collected.

1.3.1 Populations and samples

Consider the following research questions:

1. Do bluefin tuna from the Atlantic Ocean have particularly high levels of mercury, such that
they are unsafe for human consumption?
2. For infants predisposed to developing a peanut allergy, is there evidence that introducing
peanut products early in life is an effective strategy for reducing the risk of developing a
peanut allergy?
3. Does a recently developed drug designed to treat glioblastoma, a form of brain cancer, appear
more effective at inducing tumor shrinkage than the drug currently on the market?

Each of these questions refers to a specific target population. For example, in the first ques-
tion, the target population consists of all bluefin tuna from the Atlantic Ocean; each individual
bluefin tuna represents a case. It is almost always either too expensive or logistically impossible to
collect data for every case in a population. As a result, nearly all research is based on information
obtained about a sample from the population. A sample represents a small fraction of the popu-
lation. Researchers interested in evaluating the mercury content of bluefin tuna from the Atlantic
Ocean could collect a sample of 500 bluefin tuna (or some other quantity), measure the mercury
content, and use the observed information to formulate an answer to the research question.

GUIDED PRACTICE 1.6


Identify the target populations for the remaining two research questions. 16

16 In Question 2, the target population consists of infants predisposed to developing a peanut allergy. In Question 3, the
target population consists of patients with glioblastoma.

1.3.2 Anecdotal evidence

Anecdotal evidence typically refers to unusual observations that are easily recalled because of
their striking characteristics. Physicians may be more likely to remember the characteristics of a
single patient with an unusually good response to a drug instead of the many patients who did not
respond. The dangers of drawing general conclusions from anecdotal information are obvious; no
single observation should be used to draw conclusions about a population.
While it is incorrect to generalize from individual observations, unusual observations can
sometimes be valuable. E.C. Heyde was a general practitioner from Vancouver who noticed that a
few of his elderly patients with aortic-valve stenosis (an abnormal narrowing) caused by an accu-
mulation of calcium had also suffered massive gastrointestinal bleeding. In 1958, he published his
observation. 17 Further research led to the identification of the underlying cause of the association,
now called Heyde’s Syndrome. 18
An anecdotal observation can never be the basis for a conclusion, but may well inspire the
design of a more systematic study that could be definitive.

17 Heyde EC. Gastrointestinal bleeding in aortic stenosis. N Engl J Med 1958;259:196.


18 Greenstein RJ, McElhinney AJ, Reuben D, Greenstein AJ. Colonic vascular ectasias and aortic stenosis: coincidence or
causal relationship? Am J Surg 1986;151:347-51.

1.3.3 Sampling from a population

Sampling from a population, when done correctly, provides reliable information about the
characteristics of a large population. The US Centers for Disease Control and Prevention (US CDC)
conducts several surveys to obtain information about the US population, including the Behavioral
Risk Factor Surveillance System (BRFSS). 19 The BRFSS was established in 1984 to collect data about
health-related risk behaviors, and now collects data from more than 400,000 telephone interviews
conducted each year. Data from a recent BRFSS survey are used in Chapter 4. The CDC conducts
similar surveys for diabetes, health care access, and immunization. Likewise, the World Health
Organization (WHO) conducts the World Health Survey in partnership with approximately 70
countries to learn about the health of adult populations and the health systems in those countries. 20
The general principle of sampling is straightforward: a sample from a population is useful for
learning about a population only when the sample is representative of the population. In other
words, the characteristics of the sample should correspond to the characteristics of the population.
Suppose that the quality improvement team at an integrated health care system, such as Har-
vard Pilgrim Health Care, is interested in learning about how members of the health plan perceive
the quality of the services offered under the plan. A common pitfall in conducting a survey is to
use a convenience sample, in which individuals who are easily accessible are more likely to be
included in the sample than other individuals. If a sample were collected by approaching plan
members visiting an outpatient clinic during a particular week, the sample would fail to enroll
generally healthy members who typically do not use outpatient services or schedule routine phys-
ical examinations; this method would produce an unrepresentative sample (Figure 1.9).

Figure 1.9: Instead of sampling from all members equally, approaching members
visiting a clinic during a particular week disproportionately selects members who
frequently use outpatient services.

Random sampling is the best way to ensure that a sample reflects a population. In a simple
random sample, each member of a population has the same chance of being sampled. One way to
achieve a simple random sample of the health plan members is to randomly select a certain number
of names from the complete membership roster, and contact those individuals for an interview
(Figure 1.10).
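
A minimal sketch of drawing a simple random sample from a roster, assuming a hypothetical roster of 1,000 member IDs; the roster size, the seed, and the variable names are illustrative choices, and the sample size of five mirrors Figure 1.10.

```python
import random

# Hypothetical roster: member IDs 1..1000 stand in for the complete membership list.
roster = list(range(1, 1001))

rng = random.Random(2024)   # fixed seed so the draw can be reproduced
sample_size = 5

# random.sample draws without replacement, and every member of the roster
# has the same chance of being selected.
srs = rng.sample(roster, sample_size)
print(sorted(srs))
```

The essential requirement is a complete roster to draw from; without one, some members have no chance of selection and the sample is no longer a simple random sample.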

19 https://www.cdc.gov/brfss/index.html
20 http://www.who.int/healthinfo/survey/en/

Figure 1.10: Five members are randomly selected from the population to be in-
terviewed.

Even when a simple random sample is taken, it is not guaranteed that the sample is represen-
tative of the population. If the non-response rate for a survey is high, that may be indicative of
a biased sample. Perhaps a majority of participants did not respond to the survey because only a
certain group within the population is being reached; for example, if questions assume that par-
ticipants are fluent in English, then a high non-response rate would be expected if the population
largely consists of individuals who are not fluent in English (Figure 1.11). Such non-response
bias can skew results; generalizing from an unrepresentative sample is likely to lead to incorrect
conclusions about a population.

Figure 1.11: Surveys may only reach a certain group within the population, which
leads to non-response bias. For example, a survey written in English may only
result in responses from health plan members fluent in English.

GUIDED PRACTICE 1.7


It is increasingly common for health care facilities to follow up a patient visit with an email providing
a link to a website where patients can rate their experience. Typically, fewer than 50% of
patients visit the website. If half of those who respond indicate a negative experience, do you think
that this implies that at least 25% of patient visits are unsatisfactory? 21

21 It is unlikely that the patients who respond constitute a representative sample from the larger population of patients.
This is not a random sample, because individuals are selecting themselves into a group, and it is unclear that each person
has an equal chance of answering the survey. If our experience is any guide, dissatisfied people are more likely to respond
to these informal surveys than satisfied patients.

1.3.4 Sampling methods

Almost all statistical methods are based on the notion of implied randomness. If data are not
sampled from a population at random, these statistical methods – calculating estimates and errors
associated with estimates – are not reliable. Four random sampling methods are discussed in this
section: simple, stratified, cluster, and multistage sampling.
In a simple random sample, each case in the population has an equal chance of being included
in the sample (Figure 1.12). Under simple random sampling, each case is sampled independently of
the other cases; i.e., knowing that a certain case is included in the sample provides no information
about which other cases have also been sampled.
In stratified sampling, the population is first divided into groups called strata before cases
are selected within each stratum (typically through simple random sampling) (Figure 1.12). The
strata are chosen such that similar cases are grouped together. Stratified sampling is especially
useful when the cases in each stratum are very similar with respect to the outcome of interest, but
cases between strata might be quite different.
Suppose that the health care provider has facilities in different cities. If the range of services
offered differs by city, but all locations in a given city offer similar services, it would be effective
for the quality improvement team to use stratified sampling to identify participants for their study,
where each city represents a stratum and plan members are randomly sampled from each city.
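
The city-based design just described might be sketched as follows. The city names, membership sizes, and the choice of ten members per stratum are all hypothetical, and `stratified_sample` is a helper written for this illustration.

```python
import random

# Hypothetical plan members grouped by city; each city is one stratum.
members_by_city = {
    "City A": [f"A{i}" for i in range(1, 201)],
    "City B": [f"B{i}" for i in range(1, 101)],
    "City C": [f"C{i}" for i in range(1, 51)],
}

def stratified_sample(strata, per_stratum, seed=0):
    """Draw a simple random sample of size per_stratum within every stratum."""
    rng = random.Random(seed)
    return {
        name: rng.sample(members, per_stratum)
        for name, members in strata.items()
    }

stratified = stratified_sample(members_by_city, per_stratum=10, seed=1)
for city, chosen in stratified.items():
    print(city, len(chosen))
```

Every city is guaranteed to appear in the sample, which a simple random sample of the same total size would not ensure.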
[Figure 1.12 image: two dot-plot panels. Top panel: simple random sample. Bottom panel: cases grouped into Stratum 1 through Stratum 6.]

Figure 1.12: Examples of simple random and stratified sampling. In the top
panel, simple random sampling is used to randomly select 18 cases (circled or-
ange dots) out of the total population (all dots). The bottom panel illustrates
stratified sampling: cases are grouped into six strata, then simple random sam-
pling is employed within each stratum.

In a cluster sample, the population is first divided into many groups, called clusters. Then,
a fixed number of clusters is sampled and all observations from each of those clusters are included
in the sample (Figure 1.13). A multistage sample is similar to a cluster sample, but rather than
keeping all observations in each cluster, a random sample is collected within each selected cluster
(Figure 1.13).
Unlike with stratified sampling, cluster and multistage sampling are most helpful when there
is high case-to-case variability within a cluster, but the clusters themselves are similar to one an-
other. For example, if neighborhoods in a city represent clusters, cluster and multistage sampling
work best when the population within each neighborhood is very diverse, but neighborhoods are
relatively similar.
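
A small sketch of cluster sampling, assuming a hypothetical population of nine clusters with 20 people each; the cluster sizes, names, and the helper `cluster_sample` are illustrative. Three clusters are selected, matching the illustration in Figure 1.13.

```python
import random

# Hypothetical population: nine clusters (for example, neighborhoods) of 20 people each.
clusters = {
    f"cluster_{k}": [f"c{k}_person{i}" for i in range(1, 21)]
    for k in range(1, 10)
}

def cluster_sample(clusters, n_clusters, seed=0):
    """Randomly select whole clusters and keep every observation in each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    cases = [person for name in chosen for person in clusters[name]]
    return chosen, cases

chosen_clusters, cluster_cases = cluster_sample(clusters, n_clusters=3, seed=7)
print(chosen_clusters, len(cluster_cases))
```

Randomness enters only at the cluster level here; a multistage sample would add a second random draw within each selected cluster.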
Applying stratified, cluster, or multistage sampling can often be more economical than only
drawing random samples. However, analysis of data collected using such methods is more com-
plicated than when using data from a simple random sample; this text will only discuss analysis
methods for simple random samples.

EXAMPLE 1.8
Suppose researchers are interested in estimating the malaria rate in a densely tropical portion of
rural Indonesia. There are 30 villages in the area, each more or less similar to the others. The goal
is to test 150 individuals for malaria. Evaluate which sampling method should be employed.

A simple random sample would likely draw individuals from all 30 villages, which could make
data collection extremely expensive. Stratified sampling is not advisable, since there is not enough
information to determine how strata of similar individuals could be built. However, cluster sam-
pling or multistage sampling are both reasonable options. For example, with multistage sampling,
half of the villages could be randomly selected, and then 10 people selected from each village. This
strategy is more efficient than a simple random sample, and can still provide a sample representa-
tive of the population of interest.
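
The multistage plan in this example (half of the 30 villages, then 10 people per selected village) can be sketched as below. The village populations of 200 residents each are a hypothetical assumption, as are the identifiers and the seed.

```python
import random

N_VILLAGES = 30
VILLAGES_TO_SELECT = N_VILLAGES // 2   # half of the villages
PEOPLE_PER_VILLAGE = 10

# Hypothetical: each village has 200 residents.
villages = {
    v: [f"village{v}_resident{i}" for i in range(1, 201)]
    for v in range(1, N_VILLAGES + 1)
}

rng = random.Random(42)

# Stage 1: randomly select villages.
selected_villages = rng.sample(sorted(villages), VILLAGES_TO_SELECT)

# Stage 2: randomly select residents within each selected village.
multistage_sample = [
    person
    for v in selected_villages
    for person in rng.sample(villages[v], PEOPLE_PER_VILLAGE)
]

total_tested = len(multistage_sample)
print(len(selected_villages), total_tested)
```

Fifteen villages with ten people each gives the target of 150 individuals while requiring travel to only half of the villages.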

1.3.5 Introducing experiments and observational studies

The two primary types of study designs used to collect data are experiments and observational
studies.
In an experiment, researchers directly influence how data arise, such as by assigning groups of
individuals to different treatments and assessing how the outcome varies across treatment groups.
The LEAP study is an example of an experiment with two groups, an experimental group that
received the intervention (peanut consumption) and a control group that received a standard ap-
proach (peanut avoidance). In studies assessing effectiveness of a new drug, individuals in the
control group typically receive a placebo, an inert substance with the appearance of the experimental
intervention. The LEAP study was designed such that, on average, the only difference between the
individuals in the treatment groups is whether or not they consumed peanut protein. This allows
for observed differences in experimental outcome to be directly attributed to the intervention and
constitute evidence of a causal relationship between intervention and outcome.
In an observational study, researchers merely observe and record data, without interfering
with how the data arise. For example, to investigate why certain diseases develop, researchers
might collect data by conducting surveys, reviewing medical records, or following a cohort of
many similar individuals. Observational studies can provide evidence of an association between
variables, but cannot by themselves show a causal connection. However, there are many instances
where randomized experiments are unethical, such as to explore whether lead exposure in young
children is associated with cognitive impairment.
[Figure 1.13 image: two dot-plot panels, each showing Cluster 1 through Cluster 9. Top panel: cluster sampling. Bottom panel: multistage sampling.]

Figure 1.13: Examples of cluster and multistage sampling. The top panel illus-
trates cluster sampling: data are binned into nine clusters, three of which are sam-
pled, and all observations within these clusters are sampled. The bottom panel
illustrates multistage sampling, which differs from cluster sampling in that only
a subset of observations from each of the three selected clusters is sampled.

1.3.6 Experiments

Experimental design is based on three principles: control, randomization, and replication.

Control. When selecting participants for a study, researchers work to control for extraneous vari-
ables and choose a sample of participants that is representative of the population of interest.
For example, participation in a study might be restricted to individuals who have a condition
that suggests they may benefit from the intervention being tested. Infants enrolled in the
LEAP study were required to be between 4 and 11 months of age, with severe eczema and/or
allergies to eggs.

Randomization. Randomly assigning patients to treatment groups ensures that groups are bal-
anced with respect to both variables that can and cannot be controlled. For example, random-
ization in the LEAP study ensures that the proportion of males to females is approximately
the same in both groups. Additionally, perhaps some infants were more susceptible to peanut
allergy because of an undetected genetic condition; under randomization, it is reasonable to
assume that such infants were present in equal numbers in both groups. Randomization al-
lows differences in outcome between the groups to be reasonably attributed to the treatment
rather than inherent variability in patient characteristics, since the treatment represents the
only systematic difference between the two groups.
In situations where researchers suspect that variables other than the intervention may in-
fluence the response, individuals can be first grouped into blocks according to a certain at-
tribute and then randomized to treatment group within each block; this technique is referred
to as blocking or stratification. The team behind the LEAP study stratified infants into two
cohorts based on whether or not the child developed a red, swollen mark (a wheal) after
a skin test at the time of enrollment; afterwards, infants were randomized between peanut
consumption and avoidance groups. Figure 1.14 illustrates the blocking scheme used in the
study.

Replication. The results of a study conducted on a larger number of cases are generally more
reliable than smaller studies; observations made from a large sample are more likely to be
representative of the population of interest. In a single study, replication is accomplished by
collecting a sufficiently large sample. The LEAP study randomized a total of 640 infants.
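
The blocking idea described under Randomization can be sketched in code. This is an illustration only, not the LEAP investigators' actual procedure: the enrollment of 8 infants with a negative skin test and 4 with a positive one is hypothetical, and `block_randomize` is a helper written for this example.

```python
import random

def block_randomize(participants, arms=("avoidance", "consumption"), seed=0):
    """Randomize participants to arms separately within each block.

    participants is a list of (participant_id, block_label) pairs.
    Returns a dict mapping participant_id to the assigned arm.
    """
    rng = random.Random(seed)
    blocks = {}
    for participant_id, block_label in participants:
        blocks.setdefault(block_label, []).append(participant_id)

    assignment = {}
    for block_label in sorted(blocks):
        members = list(blocks[block_label])
        rng.shuffle(members)  # random order within the block
        for position, participant_id in enumerate(members):
            # Alternate arms down the shuffled list so each block is split evenly.
            assignment[participant_id] = arms[position % len(arms)]
    return assignment

# Hypothetical enrollment: IDs 1-8 negative skin test, IDs 9-12 positive.
infants = [(i, "negative") for i in range(1, 9)] + [(i, "positive") for i in range(9, 13)]
assignment = block_randomize(infants, seed=3)

# Count how many infants from each block ended up in each arm.
arm_counts = {}
for participant_id, block_label in infants:
    key = (block_label, assignment[participant_id])
    arm_counts[key] = arm_counts.get(key, 0) + 1
print(arm_counts)
```

Because randomization happens inside each block, both arms receive the same number of infants from each skin-test group, whichever infants the shuffle happens to pick.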

Randomized experiments are an essential tool in research. The US Food and Drug Administration
typically allows a new drug to be marketed only after two independently conducted
randomized trials confirm its safety and efficacy; the European Medicines Agency has a similar policy.
Large randomized experiments in medicine have provided the basis for major public health
initiatives. In 1954, approximately 750,000 children participated in a randomized study compar-
ing polio vaccine with a placebo. 22 In the United States, the results of the study quickly led to the
widespread and successful use of the vaccine for polio prevention.

22 Meier, Paul. "The biggest public health experiment ever: the 1954 field trial of the Salk poliomyelitis vaccine." Statistics:
a guide to the unknown. San Francisco: Holden-Day (1972): 2-13.

Figure 1.14: A simplified schematic of the blocking scheme used in the LEAP
study, depicting 640 patients that underwent randomization. Patients are first
divided into blocks based on response to the initial skin test, then each block
is randomized between the avoidance and consumption groups. This strategy
ensures an even representation of patients in each group who had positive and
negative skin tests.

1.3.7 Observational studies

In observational studies, researchers simply observe selected potential explanatory and re-
sponse variables. Participants who differ in important explanatory variables may also differ in other
ways that influence response; as a result, it is not advisable to make causal conclusions about the re-
lationship between explanatory and response variables based on observational data. For example,
while observational studies of obesity have shown that obese individuals tend to die sooner than
individuals with normal weight, it would be misleading to conclude that obesity causes shorter life
expectancy. Instead, underlying factors are probably involved; obese individuals typically exhibit
other health behaviors that influence life expectancy, such as reduced exercise or unhealthy diet.
Suppose that an observational study tracked sunscreen use and incidence of skin cancer, and
found that the more sunscreen a person uses, the more likely they are to have skin cancer. These
results do not mean that sunscreen causes skin cancer. One important piece of missing information
is sun exposure – if someone is often exposed to sun, they are both more likely to use sunscreen and
to contract skin cancer. Sun exposure is a confounding variable: a variable associated with both
the explanatory and response variables. 23 There is no guarantee that all confounding variables
can be examined or measured; as a result, it is not advisable to draw causal conclusions from
observational studies.
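
A numerical sketch may help show how a confounder can create an association on its own. The counts below are entirely hypothetical and were chosen so that, within each level of sun exposure, sunscreen users and non-users have the same skin cancer rate; the names `counts` and `cancer_rate` are introduced only for this example.

```python
# Hypothetical counts: (number of people, number of skin cancer cases).
counts = {
    ("high sun", "sunscreen"):    (800, 80),
    ("high sun", "no sunscreen"): (200, 20),
    ("low sun", "sunscreen"):     (200, 4),
    ("low sun", "no sunscreen"):  (800, 16),
}

def cancer_rate(cells):
    """Proportion of people with skin cancer across the given cells."""
    people = sum(n for n, _ in cells)
    cases = sum(c for _, c in cells)
    return cases / people

# Rates that ignore sun exposure (the comparison a naive analysis would make).
crude_rate = {
    use: cancer_rate([cell for (sun, u), cell in counts.items() if u == use])
    for use in ("sunscreen", "no sunscreen")
}

# Rates within each combination of sun exposure and sunscreen use.
stratum_rate = {key: cancer_rate([cell]) for key, cell in counts.items()}

print(crude_rate)
print(stratum_rate)
```

In these made-up data, sunscreen users have the higher overall cancer rate only because most of them are in the high-exposure group; comparing users and non-users at the same level of sun exposure shows no difference.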

Confounding is not limited to observational studies. For example, consider a randomized
study comparing two treatments (varenicline and bupropion) against a placebo as therapies for
aiding smoking cessation. 24 At the beginning of the study, participants were randomized into
groups: 352 to varenicline, 329 to bupropion, and 344 to placebo. Not all participants successfully
completed the assigned therapy: 259, 225, and 215 patients in each group did so, respectively.
If an analysis were based only on the participants who completed therapy, this could introduce
confounding; it is possible that there are underlying differences between individuals who complete
the therapy and those who do not. Including all randomized participants in the final analysis
maintains the original randomization scheme and controls for differences between the groups. 25
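
The arithmetic behind this concern can be laid out explicitly. The sketch below uses only the group sizes quoted above; the variable names are illustrative.

```python
# Participants randomized to each group and those who completed the assigned therapy.
randomized = {"varenicline": 352, "bupropion": 329, "placebo": 344}
completed = {"varenicline": 259, "bupropion": 225, "placebo": 215}

# Proportion of each group that completed therapy.
completion_rate = {g: completed[g] / randomized[g] for g in randomized}

# Participants a completers-only analysis would leave out of each group.
excluded = {g: randomized[g] - completed[g] for g in randomized}

total_randomized = sum(randomized.values())
total_excluded = sum(excluded.values())

for g in randomized:
    print(g, round(completion_rate[g], 3), excluded[g])
print(total_randomized, total_excluded)
```

Completion rates differ across the three groups, so the participants remaining in a completers-only analysis are no longer the groups that randomization created.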

GUIDED PRACTICE 1.9


As stated in Example 1.4, female body size (body.size) in the maternal investment study is neither
an explanatory nor a response variable. Previous research has shown that larger females tend to
produce larger eggs and egg clutches; however, large body size can be costly at high altitudes.
Discuss a possible reason for why the study team chose to measure female body size when it is not
directly related to their main research question. 26

23 Also called a lurking variable, confounding factor, or a confounder.


24 Jorenby, Douglas E., et al. "Efficacy of varenicline, an α4β2 nicotinic acetylcholine receptor partial agonist, vs placebo
or sustained-release bupropion for smoking cessation: a randomized controlled trial." JAMA 296.1 (2006): 56-63.
25 This strategy, commonly used for analyzing clinical trial data, is referred to as an intention-to-treat analysis.
26 Female body size is a potential confounding variable, since it may be associated with both the explanatory variable
(altitude) and response variables (measures of maternal investment). If the study team observes, for example, that clutch
size tends to decrease at higher altitudes, they should check whether the apparent association is not simply due to frogs at
higher altitudes having smaller body size and thus, laying smaller clutches.

Observational studies may reveal interesting patterns or associations that can be further in-
vestigated with follow-up experiments. Several observational studies based on dietary data from
different countries showed a strong association between dietary fat and breast cancer in women.
These observations led to the launch of the Women’s Health Initiative (WHI), a large randomized
trial sponsored by the US National Institutes of Health (NIH). In the WHI, women were random-
ized to standard versus low fat diets, and the previously observed association was not confirmed.
Observational studies can be either prospective or retrospective. A prospective study identifies
participants and collects information at scheduled times or as events unfold. For example, in
the Nurses’ Health Study, researchers recruited registered nurses beginning in 1976 and collected
data by administering biennial surveys; data from the study have been used to investigate risk
factors for major chronic diseases in women. 27 Retrospective studies collect data after events have
taken place, such as from medical records. Some datasets may contain both retrospectively- and
prospectively-collected variables. The Cancer Care Outcomes Research and Surveillance Consortium
(CanCORS) enrolled participants with lung or colorectal cancer and collected information about
diagnosis, treatment, and previous health behavior, but also maintained contact with participants
to gather data about long-term outcomes. 28

27 www.channing.harvard.edu/nhs
28 Ayanian, John Z., et al. "Understanding cancer treatment and outcomes: the cancer care outcomes research and
surveillance consortium." Journal of Clinical Oncology 22.15 (2004): 2992-2996
30 CHAPTER 1. INTRODUCTION TO DATA

1.4 Numerical data

This section discusses techniques for exploring and summarizing numerical variables, using
the frog data from the parental investment study introduced in Section 1.2.

1.4.1 Measures of center: mean and median

The mean, sometimes called the average, is a measure of center for a distribution of data. To
find the average clutch volume for the observed egg clutches, add all the clutch volumes and divide
by the total number of clutches. 29

\[ \bar{x} = \frac{177.8 + 257.0 + \cdots + 933.3}{431} = 882.5 \text{ mm}^3. \]

The sample mean is often labeled $\bar{x}$, to distinguish it from µ, the mean of the entire population
from which the sample is drawn. The letter x is being used as a generic placeholder for the variable
of interest, clutch.volume.

MEAN
The sample mean of a numerical variable is the sum of the values of all observations divided
by the number of observations:

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}, \tag{1.10} \]

where $x_1, x_2, \dots, x_n$ represent the n observed values.

The median is another measure of center; it is the middle number in a distribution after the
values have been ordered from smallest to largest. If the distribution contains an even number of
observations, the median is the average of the middle two observations. There are 431 clutches
in the dataset, an odd number, so the median is the clutch volume of the (431 + 1)/2 = 216th
observation in the sorted values of clutch.volume: 831.8 mm³.
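
The two definitions above can be checked with a few lines of code. The following is a minimal sketch in Python, using a short list of hypothetical clutch volumes (made up for illustration, not taken from the frog data); the helper names sample_mean and sample_median are likewise only illustrative.

```python
from statistics import mean, median

# Hypothetical clutch volumes in mm^3 (illustrative values, not the frog data).
volumes = [150.0, 200.0, 250.0, 900.0, 1000.0]


def sample_mean(values):
    """Sum of the values divided by the number of observations (Equation 1.10)."""
    return sum(values) / len(values)


def sample_median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


x_bar = sample_mean(volumes)
med = sample_median(volumes)

# The hand-written versions agree with Python's standard library.
assert x_bar == mean(volumes)
assert med == median(volumes)

print(f"mean = {x_bar:.1f} mm^3, median = {med:.1f} mm^3")
```

In this made-up sample the two large values pull the mean well above the median, a pattern that becomes relevant when skewed distributions are discussed later in the section.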

29 For computational convenience, the volumes are rounded to the first decimal.
1.4. NUMERICAL DATA 31

1.4.2 Measures of spread: standard deviation and interquartile range

The spread of a distribution refers to how similar or varied the values in the distribution are
to each other; i.e., whether the values are tightly clustered or spread over a wide range.
The standard deviation for a set of data describes the typical distance between an observation
and the mean. The distance of a single observation from the mean is its deviation. Below are the
deviations for the 1st, 2nd, 3rd, and 431st observations in the clutch.volume variable.

\begin{align*}
x_1 - \bar{x} &= 177.8 - 882.5 = -704.7 \\
x_2 - \bar{x} &= 257.0 - 882.5 = -625.5 \\
x_3 - \bar{x} &= 151.4 - 882.5 = -731.1 \\
&\;\;\vdots \\
x_{431} - \bar{x} &= 933.2 - 882.5 = 50.7
\end{align*}

The sample variance, the average of the squares of these deviations, is denoted by $s^2$:

\begin{align*}
s^2 &= \frac{(-704.7)^2 + (-625.5)^2 + (-731.1)^2 + \cdots + (50.7)^2}{431 - 1} \\
    &= \frac{496{,}602.09 + 391{,}250.25 + 534{,}507.21 + \cdots + 2{,}570.49}{430} \\
    &= 143{,}680.9.
\end{align*}

The denominator is n − 1 rather than n; this mathematical nuance accounts for the fact that the
sample mean has been used to estimate the population mean in the calculation. Details on the
statistical theory can be found in more advanced texts.
The sample standard deviation s is the square root of the variance:

\[ s = \sqrt{143{,}680.9} = 379.05 \text{ mm}^3. \]

Like the mean, the population values for variance and standard deviation are denoted by
Greek letters: σ² for the variance and σ for the standard deviation.

STANDARD DEVIATION
The sample standard deviation of a numerical variable is computed as the square root of the
variance, which is the sum of squared deviations divided by the number of observations minus 1.

\[ s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}, \tag{1.11} \]

where $x_1, x_2, \dots, x_n$ represent the n observed values.
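
Equation 1.11 translates directly into code. The sketch below, in Python, computes the sample variance and standard deviation for a small hypothetical dataset (the values are invented for illustration) and compares the result with the standard library's variance and stdev functions, which also divide by n − 1.

```python
import math
from statistics import stdev, variance

# Hypothetical observations (illustrative values only).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def sample_sd(xs):
    """Square root of the sample variance (Equation 1.11)."""
    return math.sqrt(sample_variance(xs))


s2 = sample_variance(values)
s = sample_sd(values)

# Python's statistics module uses the same n - 1 denominator.
assert math.isclose(s2, variance(values))
assert math.isclose(s, stdev(values))

print(f"variance = {s2:.3f}, standard deviation = {s:.3f}")
```

The same square root step reproduces the value reported for clutch.volume: the square root of 143,680.9 rounds to 379.05.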



Variability can also be measured using the interquartile range (IQR). The IQR for a distribution
is the difference between the first and third quartiles: Q3 − Q1. The first quartile (Q1) is
equivalent to the 25th percentile; i.e., 25% of the data fall below this value. The third quartile
(Q3) is equivalent to the 75th percentile. By definition, the median represents the second quartile,
with half the values falling below it and half falling above. The IQR for clutch.volume is
1096.0 − 609.6 = 486.4 mm³.
Measures of center and spread are ways to summarize a distribution numerically. Using numerical
summaries allows for a distribution to be efficiently described with only a few numbers. 30
For example, the calculations for clutch.volume indicate that the typical egg clutch has a volume
of about 880 mm³, while the middle 50% of egg clutches have volumes between approximately
600 mm³ and 1,100 mm³.
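
Quartiles and the IQR can be computed as in the following Python sketch. The data are hypothetical, and one assumption should be flagged: software packages use several slightly different conventions for locating quartiles, so the "inclusive" method chosen here is only one reasonable option and may not match the values a given textbook or program reports.

```python
from statistics import quantiles

# Hypothetical observations (illustrative values only).
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

# n=4 splits the data into four equal parts and returns the three cut points:
# Q1 (25th percentile), Q2 (median), and Q3 (75th percentile).
# method="inclusive" is an assumed convention; other conventions exist.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

# The IQR is the distance between the third and first quartiles.
iqr = q3 - q1

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```

The subtraction in the last step is the same one used for clutch.volume above, where 1096.0 − 609.6 gives 486.4.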

1.4.3 Robust estimates

Figure 1.15 shows the values of clutch.volume as points on a single axis. There are a few
values that seem extreme relative to the other observations: the four largest values, which appear
distinct from the rest of the distribution. How do these extreme values affect the value of the
numerical summaries?

[Dot plot; horizontal axis: Clutch Volumes, 100 to 2700]

Figure 1.15: Dot plot of clutch volumes from the frog data.

Figure 1.16 shows the summary statistics calculated under two scenarios, one with and one
without the four largest observations. For these data, the median does not change, while the IQR
differs by only about 7 mm³. In contrast, the mean and standard deviation are much more affected,
particularly the standard deviation.

                                              robust              not robust
scenario                                      median    IQR       $\bar{x}$    s
original data (with extreme observations)     831.8     486.9     882.5        379.1
data without four largest observations        831.8     493.9     867.9        349.2

Figure 1.16: A comparison of how the median, IQR, mean ($\bar{x}$), and standard deviation (s)
change when extreme observations are present.

The median and IQR are referred to as robust estimates because extreme observations have
little effect on their values. For distributions that contain extreme values, the median and IQR will
provide a more accurate sense of the center and spread than the mean and standard deviation.
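
The kind of comparison shown in Figure 1.16 can be reproduced on a toy example. The Python sketch below uses hypothetical data (nine evenly spaced values, plus one extreme value) rather than the frog data, and it assumes the "inclusive" quartile convention from Python's statistics module; the function name summarize is illustrative.

```python
from statistics import mean, median, quantiles, stdev


def summarize(xs):
    """Return the four summaries compared in Figure 1.16."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    return {
        "median": median(xs),
        "IQR": q3 - q1,
        "mean": mean(xs),
        "sd": stdev(xs),
    }


# Hypothetical data: the second list adds a single extreme observation.
without_extreme = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
with_extreme = without_extreme + [90.0]

summary_without = summarize(without_extreme)
summary_with = summarize(with_extreme)

for name in ("median", "IQR", "mean", "sd"):
    print(f"{name:>6}: without = {summary_without[name]:7.2f}, "
          f"with = {summary_with[name]:7.2f}")
```

In this toy case the median and IQR each move by half a unit, while the mean more than doubles and the standard deviation grows roughly tenfold, which is the sense in which the first two summaries are robust.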

30 Numerical summaries are also known as summary statistics.



1.4.4 Visualizing distributions of data: histograms and boxplots

Graphs show important features of a distribution that are not evident from numerical sum-
maries, such as asymmetry or extreme values. While dot plots show the exact value of each obser-
vation, histograms and boxplots graphically summarize distributions.
In a histogram, observations are grouped into bins and plotted as bars. Figure 1.17 shows the
number of clutches with volume between 0 and 200 mm³, 200 and 400 mm³, and so on, up to the bin
between 2,600 and 2,800 mm³. 31 These binned counts are plotted in Figure 1.18.

Clutch volumes   0-200   200-400   400-600   600-800   ···   2400-2600   2600-2800
Count            4       29        69        99        ···   2           1

Figure 1.17: The counts for the binned clutch.volume data.
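
Binned counts like those in Figure 1.17 can be produced with a short function. The following Python sketch uses hypothetical volumes and follows the left-open, right-closed convention (a, b], so that a value exactly on a boundary, such as 200, is counted in the lower bin. The function name bin_counts and the choice to place a value of 0 in the first bin are assumptions made for this illustration.

```python
import math
from collections import Counter


def bin_counts(values, width):
    """Count observations in left-open, right-closed bins (a, b]."""
    counts = Counter()
    for v in values:
        # ceil(v / width) - 1 sends a value equal to a bin's right edge
        # into that bin, e.g. 200 -> (0, 200] when width is 200.
        # max(..., 0) keeps a value of exactly 0 in the first bin (an assumption).
        index = max(math.ceil(v / width) - 1, 0)
        lower = index * width
        counts[(lower, lower + width)] += 1
    return dict(sorted(counts.items()))


# Hypothetical clutch volumes in mm^3 (illustrative values, not the frog data).
volumes = [150.0, 200.0, 200.5, 399.9, 400.0, 650.0]

for (lower, upper), count in bin_counts(volumes, 200).items():
    print(f"({lower}, {upper}]: {count}")
```

Here the observations 200.0 and 400.0 fall in the bins ending at 200 and 400, respectively, rather than in the bins that begin there.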

[Histogram; horizontal axis: Clutch Volume, 0 to 2500; vertical axis: Frequency, 0 to 100]

Figure 1.18: A histogram of clutch.volume.

Histograms provide a view of the data density. Higher bars indicate more frequent observations,
while lower bars represent relatively rare observations. Figure 1.18 shows that most of
the egg clutches have volumes between 500 and 1,000 mm³, and there are many more clutches with
volumes smaller than 1,000 mm³ than clutches with larger volumes.
Histograms show the shape of a distribution. The tails of a symmetric distribution are
roughly equal, with data trailing off from the center roughly equally in both directions.
Asymmetry arises when one tail of the distribution is longer than the other. A distribution is said
to be right skewed when data trail off to the right, and left skewed when data trail off to the
left. 32 Figure 1.18 shows that the distribution of clutch volume is right skewed; most clutches
have relatively small volumes, and only a few clutches have high volumes.

31 By default in R, the bins are left-open and right-closed; i.e., the intervals are of the form (a, b]. Thus, an observation
with value 200 would fall into the 0-200 bin instead of the 200-400 bin.
32 Other ways to describe data that are right/left skewed: skewed to the right/left, or skewed to
the positive/negative end.

A mode is represented by a prominent peak in the distribution. 33 Figure 1.19 shows his-
tograms that have one, two, or three major peaks. Such distributions are called unimodal, bi-
modal, and multimodal, respectively. Any distribution with more than two prominent peaks is
called multimodal. Note that the less prominent peak in the unimodal distribution was not counted
since it only differs from its neighboring bins by a few observations. Prominent is a subjective term,
but it is usually clear in a histogram where the major peaks are.

[Three histograms, shown left to right]

Figure 1.19: From left to right: unimodal, bimodal, and multimodal distributions.

A boxplot indicates the positions of the first, second, and third quartiles of a distribution
in addition to extreme observations. 34 Figure 1.20 shows a boxplot of clutch.volume alongside a
vertical dot plot.

[Boxplot alongside a vertical dot plot; vertical axis: Clutch Volume, 500 to 2500. Labeled from
top to bottom: outliers, upper whisker, Q3 (third quartile), median (second quartile),
Q1 (first quartile), lower whisker]

Figure 1.20: A boxplot and dot plot of clutch.volume. The horizontal dashes
indicate the bottom 50% of the data and the open circles represent the top 50%.

33 Another definition of mode, which is not typically used in statistics, is the value with the most occurrences. It is
common that a dataset contains no observations with the same value, which makes this other definition impractical for
many datasets.
34 Boxplots are also known as box-and-whisker plots.

In a boxplot, the interquartile range is represented by a rectangle extending from the first
quartile to the third quartile, and the rectangle is split by the median (second quartile). Extending
outwards from the box, the whiskers capture the data that fall between Q1 − 1.5 × IQR and
Q3 + 1.5 × IQR. The whiskers must end at data points; the values given by adding or subtracting
1.5 × IQR define the maximum reach of the whiskers. For example, with the clutch.volume variable,
Q3 + 1.5 × IQR = 1,096.5 + 1.5 × 486.4 = 1,826.1 mm³. However, there was no clutch with volume
1,826.1 mm³; thus, the upper whisker extends to 1,819.7 mm³, the largest observation that is
smaller than Q3 + 1.5 × IQR.
Any observation that lies beyond the whiskers is shown with a dot; these observations are
called outliers. An outlier is a value that appears extreme relative to the rest of the data. For
the clutch.volume variable, there are several large outliers and no small outliers, indicating the
presence of some unusually large egg clutches.
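
The whisker and outlier rules described above can be sketched in Python as follows. The data are hypothetical, the function name boxplot_stats is illustrative, and the quartiles again rely on the assumed "inclusive" convention, so plotting software that uses a different quartile rule may place the box edges slightly differently.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Compute the quantities a boxplot displays, using the 1.5 x IQR rule."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1

    # Maximum reach of the whiskers.
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr

    # Whiskers end at the most extreme data points within that reach.
    inside = [v for v in ordered if lower_limit <= v <= upper_limit]
    outliers = [v for v in ordered if v < lower_limit or v > upper_limit]

    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "lower_whisker": inside[0],
        "upper_whisker": inside[-1],
        "outliers": outliers,
    }


# Hypothetical data with one unusually large value.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
stats = boxplot_stats(values)
print(stats)
```

In this example the upper limit is 7.75 + 1.5 × 4.5 = 14.5, so the upper whisker stops at 9.0, the largest observation below that limit, and 30.0 is drawn as an outlier.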
The high outliers in Figure 1.20 reflect the right-skewed nature of the data. The right skew is
also observable from the position of the median relative to the first and third quartiles; the median
is slightly closer to the first quartile. In a symmetric distribution, the median will be halfway
between the first and third quartiles.

GUIDED PRACTICE 1.12


Use the histogram and boxplot in Figure 1.21 to describe the distribution of height in the famuss
data, where height is measured in inches. 35

[Panel (a): histogram; horizontal axis: Height, 60 to 75; vertical axis: Frequency, 0 to 150.
Panel (b): boxplot; vertical axis: Height, 55 to 80]

Figure 1.21: A histogram and boxplot of height in the famuss data.

35 The data are roughly symmetric (the left tail is slightly longer than the right tail), and the distribution is unimodal
with one prominent peak at about 67 inches. The middle 50% of individuals are between 5.5 feet and just under 6 feet tall.
There is one low outlier and one high outlier, representing individuals that are unusually short/tall relative to the other
individuals.

1.4.5 Transforming data

When working with strongly skewed data, it can be useful to apply a transformation, which
rescales the data using a function. A natural log transformation is commonly used to clarify the
features of a variable when there are many values clustered near zero and all observations are
positive.

[Panel (a): histogram; horizontal axis: Income (USD), $0 to $120k; vertical axis: Frequency.
Panel (b): histogram; horizontal axis: Income (log USD), 5 to 12; vertical axis: Frequency]

Figure 1.22: (a) Histogram of per capita income. (b) Histogram of the log-transformed
per capita income.

For example, income data are often skewed right; there are typically large clusters of low to
moderate income, with a few large incomes that are outliers. Figure 1.22(a) shows a histogram of
average yearly per capita income measured in US dollars for 165 countries in 2011. 36 The data are
heavily right skewed, with the majority of countries having average yearly per capita income lower
than $10,000. Once the data are log-transformed, the distribution becomes roughly symmetric
(Figure 1.22(b)). 37
For symmetric distributions, the mean and standard deviation are particularly informative
summaries. If a distribution is symmetric, approximately 70% of the data are within one standard
deviation of the mean and 95% of the data are within two standard deviations of the mean; this
guideline is known as the empirical rule.

EXAMPLE 1.13
On the log-transformed scale, mean log income is 8.50, with standard deviation 1.54. Apply the
empirical rule to describe the distribution of average yearly per capita income among the 165
countries.

According to the empirical rule, the middle 70% of the data are within one standard deviation of
the mean, in the range (8.50 − 1.54, 8.50 + 1.54) = (6.96, 10.04) log(USD). 95% of the data are
within two standard deviations of the mean, in the range (8.50 − 2(1.54), 8.50 + 2(1.54)) =
(5.42, 11.58) log(USD).
Undo the log transformation. The middle 70% of the data are within the range (e^6.96, e^10.04)
= ($1,054, $22,925). The middle 95% of the data are within the range (e^5.42, e^11.58) = ($226,
$106,937).
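
The arithmetic in Example 1.13 can be reproduced with a few lines of Python. The sketch below takes the mean and standard deviation of log income from the example; the function names are illustrative. Undoing a natural log transformation means applying the exponential function to each endpoint.

```python
import math

MEAN_LOG = 8.50  # mean of log income, from Example 1.13
SD_LOG = 1.54    # standard deviation of log income, from Example 1.13


def empirical_interval(k):
    """Return (low, high) on the log scale, k standard deviations from the mean."""
    return MEAN_LOG - k * SD_LOG, MEAN_LOG + k * SD_LOG


def back_transform(interval):
    """Undo the natural log by exponentiating both endpoints."""
    low, high = interval
    return math.exp(low), math.exp(high)


for k, share in [(1, "70%"), (2, "95%")]:
    log_low, log_high = empirical_interval(k)
    usd_low, usd_high = back_transform((log_low, log_high))
    print(f"middle {share}: ({log_low:.2f}, {log_high:.2f}) log(USD) "
          f"-> (${usd_low:,.0f}, ${usd_high:,.0f})")
```

Note that the back-transformed intervals are not symmetric around e^8.50; the upper endpoint lies much farther from the center than the lower one, reflecting the right skew of income on the original scale.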

Functions other than the natural log can also be used to transform data, such as the square
root and inverse.

36 The data are available as wdi.2011 in the R package oibiostat.


37 In statistics, the natural logarithm is usually written log. In other settings it is sometimes written as ln.
1.5. CATEGORICAL DATA 37

1.5 Categorical data

This section introduces tables and plots for summarizing categorical data, using the famuss
dataset introduced in Section 1.2.2.
A table for a single variable is called a frequency table. Figure 1.23 is a frequency table for
the actn3.r577x variable, showing the distribution of genotype at location r577x on the ACTN3
gene for the FAMuSS study participants.
In a relative frequency table like Figure 1.24, the proportion in each category is shown
instead of the count.

CC CT TT Sum
Counts 173 261 161 595

Figure 1.23: A frequency table for the actn3.r577x variable.

CC CT TT Sum
Proportions 0.291 0.439 0.271 1.000

Figure 1.24: A relative frequency table for the actn3.r577x variable.
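
The proportions in Figure 1.24 follow from the counts in Figure 1.23 by dividing each count by the total. A minimal Python sketch, using the counts reported above (the variable names are illustrative):

```python
# Genotype counts for actn3.r577x, as reported in Figure 1.23.
counts = {"CC": 173, "CT": 261, "TT": 161}

# Total number of participants with a recorded genotype.
total = sum(counts.values())

# Relative frequency: each count divided by the total.
proportions = {genotype: count / total for genotype, count in counts.items()}

for genotype, proportion in proportions.items():
    print(f"{genotype}: {counts[genotype]:>3}  {proportion:.3f}")
print(f"Sum: {total}  {sum(proportions.values()):.3f}")
```

Because each proportion is rounded to three decimal places for display, the rounded entries in a relative frequency table may not add to exactly 1.000, even though the unrounded proportions do.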

A bar plot is a common way to display a single categorical variable. The left panel of Fig-
ure 1.25 shows a bar plot of the counts per genotype for the actn3.r577x variable. The plot in the
right panel shows the proportion of observations that are in each level (i.e. in each genotype).

[Two bar plots; horizontal axes: genotype (CC, CT, TT); vertical axes: count, 0 to 300 (left panel)
and proportion, 0.00 to 0.50 (right panel)]

Figure 1.25: Two bar plots of actn3.r577x. The left panel shows the counts, and
the right panel shows the proportions for each genotype.

1.6 Relationships between two variables

This section introduces numerical and graphical methods for exploring and summarizing re-
lationships between two variables. Approaches vary depending on whether the two variables are
both numerical, both categorical, or whether one is numerical and one is categorical.

1.6.1 Two numerical variables

Scatterplots

In the frog parental investment study, researchers used clutch volume as a primary variable of
interest rather than egg size because clutch volume represents both the eggs and the protective
gelatinous matrix surrounding the eggs. The larger the clutch volume, the higher the energy re-
quired to produce it; thus, higher clutch volume is indicative of increased maternal investment.
Previous research has reported that larger body size allows females to produce larger clutches; is
this idea supported by the frog data?
A scatterplot provides a case-by-case view of the relationship between two numerical vari-
ables. Figure 1.26 shows clutch volume plotted against body size, with clutch volume on the y-axis
and body size on the x-axis. Each point represents a single case. For this example, each case is one
egg clutch for which both volume and body size (of the female that produced the clutch) have been
recorded.

[Scatterplot; horizontal axis: Female Body Size (cm), 4.0 to 6.0; vertical axis: Clutch Volume
(mm³), 500 to 2500]

Figure 1.26: A scatterplot showing clutch.volume (vertical axis) vs. body.size
(horizontal axis).

The plot shows a discernible pattern, which suggests an association, or relationship, between
clutch volume and body size; the points tend to lie in a straight line, which is indicative of a linear
association. Two variables are positively associated if increasing values of one tend to occur with
increasing values of the other; two variables are negatively associated if increasing values of one
tend to occur with decreasing values of the other. If there is no evident relationship between two
variables, they are said to be uncorrelated or independent.
As expected, clutch volume and body size are positively associated; larger frogs tend to pro-
duce egg clutches with larger volumes. These observations suggest that larger females are capable
of investing more energy into offspring production relative to smaller females.
1.6. RELATIONSHIPS BETWEEN TWO VARIABLES 39

The National Health and Nutrition Examination Survey (NHANES) consists of a set of surveys
and measurements conducted by the US CDC to assess the health and nutritional status of adults
and children in the United States. The following example uses data from a sample of 500 adults
(individuals ages 21 and older) from the NHANES dataset. 38

EXAMPLE 1.14
Body mass index (BMI) is a measure of weight commonly used by health agencies to assess whether
someone is overweight, and is calculated from height and weight. 39 Describe the relationships
shown in Figure 1.27. Why is it helpful to use BMI as a measure of obesity, rather than weight?

Figure 1.27(a) shows a positive association between height and weight; taller individuals tend to
be heavier. Figure 1.27(b) shows that height and BMI do not seem to be associated; the range of
BMI values observed is roughly consistent across height.
Weight itself is not a good measure of whether someone is overweight; instead, it is more reasonable
to consider whether someone’s weight is unusual relative to other individuals of a comparable
height. An individual weighing 200 pounds who is 6 ft tall is not necessarily at an unhealthy weight;
however, someone who weighs 200 pounds and is 5 ft tall is likely overweight. It is not reasonable
to classify individuals as overweight or obese based only on weight.
BMI acts as a relative measure of weight that accounts for height. Specifically, BMI is used as an
estimate of body fat. According to the US National Institutes of Health (US NIH) and the World Health
Organization (WHO), a BMI between 25.0 and 29.9 is considered overweight and a BMI of 30 or higher
is considered obese. 40

[Panel (a): scatterplot; horizontal axis: Height (cm), 150 to 190; vertical axis: Weight (kg),
50 to 200. Panel (b): scatterplot; horizontal axis: Height (cm), 150 to 190; vertical axis: BMI,
20 to 70]

Figure 1.27: (a) A scatterplot showing height versus weight from the 500 individ-
uals in the sample from NHANES. One participant 163.9 cm tall (about 5 ft, 4 in)
and weighing 144.6 kg (about 319 lb) is highlighted. (b) A scatterplot showing
height versus BMI from the 500 individuals in the sample from NHANES. The same
individual highlighted in (a) is marked here, with BMI 53.83.
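
The BMI of the highlighted participant can be checked directly from the height and weight given in the caption. The Python sketch below implements both forms of the BMI formula from footnote 39; the function names are illustrative, and the unit conversions (1 in = 2.54 cm, 1 lb = 0.45359237 kg) are standard definitions that do not appear in the text.

```python
def bmi_metric(weight_kg, height_m):
    """BMI from weight in kilograms and height in meters."""
    return weight_kg / height_m ** 2


def bmi_imperial(weight_lb, height_in):
    """BMI from weight in pounds and height in inches, using the factor 703."""
    return weight_lb / height_in ** 2 * 703


# Highlighted participant from Figure 1.27: 163.9 cm tall, weighing 144.6 kg.
height_cm = 163.9
weight_kg = 144.6

bmi = bmi_metric(weight_kg, height_cm / 100)

# The same participant in imperial units (conversion factors are assumptions
# added for this illustration; they are standard unit definitions).
weight_lb = weight_kg / 0.45359237
height_in = height_cm / 2.54
bmi_from_imperial = bmi_imperial(weight_lb, height_in)

print(f"BMI (metric) = {bmi:.2f}, BMI (imperial) = {bmi_from_imperial:.2f}")
```

The two versions differ slightly in the second decimal place because 703 is a rounded conversion factor.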

38 The sample is available as nhanes.samp.adult.500 in the R oibiostat package.


39 $\text{BMI} = \frac{\text{weight}_{\text{kg}}}{\text{height}_{\text{m}}^2} = \frac{\text{weight}_{\text{lb}}}{\text{height}_{\text{in}}^2} \times 703$.
40 https://www.nhlbi.nih.gov/health/educational/lose_wt/risk.htm

EXAMPLE 1.15
Figure 1.28 is a scatterplot of life expectancy versus annual per capita income for 165 countries in
2011. Life expectancy is measured as the expected lifespan for children born in 2011 and income
is adjusted for purchasing power in a country. Describe the relationship between life expectancy
and annual per capita income; do they seem to be linearly associated?

Life expectancy and annual per capita income are positively associated; higher per capita income
is associated with longer life expectancy. However, the two variables are not linearly associated.
When income is low, small increases in per capita income are associated with relatively large in-
creases in life expectancy. However, once per capita income exceeds approximately $20,000 per
year, increases in income are associated with smaller gains in life expectancy.
In a linear association, change in the y-variable for every unit of the x-variable is consistent across
the range of the x-variable; for example, a linear association would be present if an increase in
income of $10,000 corresponded to an increase in life expectancy of 5 years, across the range of
income.

[Scatterplot; horizontal axis: annual per capita income, $0 to $100k; vertical axis: Life Expectancy
(years), 50 to 80]

Figure 1.28: A scatterplot of life expectancy (years) versus annual per capita income
(US dollars) in the wdi.2011 dataset.

Correlation

Correlation is a numerical summary statistic that measures the strength of a linear relationship
between two variables. It is denoted by r, the correlation coefficient, which takes on values
between -1 and 1.
If the paired values of two variables lie exactly on a line, r = ±1; the closer the correlation
coefficient is to ±1, the stronger the linear association. When two variables are positively associated,
with paired values that tend to lie on a line with positive slope, r > 0. If two variables are negatively
associated, r < 0. A value of r that is 0 or approximately 0 indicates no apparent association between
two variables. 41
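
This section does not state a formula for r. As an illustration only, the Python sketch below uses the standard sample correlation formula (the sum of products of deviations, divided by the square root of the product of the two sums of squared deviations), which is an assumption about the definition intended here. The data are hypothetical and the function name correlation is illustrative.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r for paired values (standard formula)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must contain the same number of values")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # Points on a horizontal or vertical line: r is undefined.
        raise ValueError("r is undefined when one variable does not vary")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]    # exactly on a line with positive slope
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]  # exactly on a line with negative slope
y_noisy = [2.1, 3.8, 6.5, 7.6, 10.4]  # roughly linear, with some scatter

print(correlation(x, y_up), correlation(x, y_down), correlation(x, y_noisy))
```

Points lying exactly on a line give r = 1 or r = −1 depending on the sign of the slope, while the scattered values give an r that is close to, but below, 1.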

41 If paired values lie perfectly on either a horizontal or vertical line, there is no association and r is mathematically
undefined.
Quintana (Manuel Josef). Poesías selectas castellanas desde el tiempo de Juan de Mena hasta
nuestros días [new edition, corrected and enlarged]. Madrid, 1830-1833. 2 vols.

Valera (Juan). Florilegio de poesías castellanas del siglo XIX. Madrid, 1901-1904. 5 vols.

THE BIRTH OF THE ROMANCE LANGUAGE
AND OF POPULAR LITERATURE

1. Scholars of prehistory tell us that Spain was once inhabited by people of the sturdy race
known as Cro-Magnon; the theosophists add that those people were Atlanteans, broken off from
Atlantis, which sank into the sea between Europe and America. Facts and history assure us only
that the lands of Spain still preserve, as the oldest names that men gave them, words that are
plainly Basque, and that consequently the Basque race or, by its own name, the Euskaldun race,
is the oldest known in Spain; had another been older, traces of it would surely have remained in
the place names. The Greeks gave the name Iberians to the Euskalduns living along the banks of
the Iber or Ebro, a Basque name, and later extended the term to the rest of the Spaniards. The
name España is a Basque word meaning extremity, the non plus ultra, which later came to
symbolize the country as a rendering of that name, Spain being the limit of Europe. Phoenicians
and Greeks came here and traded along the South and the Levant, as did some streams of Celts in
the Northwest who, spreading through the Tagus basin to the central plateau and on toward the
Northeast, formed together with the Iberians the people known in that region as Celtiberians.
Very few traces of those peoples remain in place names and inscriptions. The Romans brought us
their civilization and gave its structure to the speech of the Spaniards. Bands of Goths, Suevi,
and Vandals, already Romanized, arrived here and left us some words and surnames; at other
times, and settling more permanently, came some Arabs from Syria and many more Africans, half
Arabized. The Iberian character, which can be touched and felt in the Basque provinces and
Navarre, lies at the core of all Spaniards, despite the differences that distinguish them, owing
to the qualities of their lands and climates and to the tints with which foreign peoples colored
them. For centuries upon centuries various regions lived apart, and the unity undertaken by the
Habsburgs and continued by the Bourbons has never gone beyond the political, being in fact more
nominal than real.

Galicia, in close contact with Portugal and with French influence during the Middle Ages;
Andalusia, breathing African and Moorish air until the seventeenth century: how were they to
form a true whole through the absolutist unity of a few kings? Still less were the regions of
Catalonia and Valencia, set apart by language and by history for so long, to form one with
Castile. The elements that most helped the unity of Spain were the geographical shape of the
Peninsula, Roman culture, and the Catholic religion. But the true bond was the language, which
binds Castile to Andalusia and Aragon more tightly than to the regions where speech departs from
the Castilian mold. For all that, the qualities, good and bad, that are so plainly found in the
Basques show through and even stand out in the generality of all the inhabitants of Spain,
Portugal included; if any differ from the rest, it is the Catalans. In the Spaniard the spirit
overcomes matter; he has more brain than body, better moral qualities than physical ones. His
loftiness of feeling leads him to burst with gentlemanly pride rather than stoop to manual labor,
which he holds to be servile, so that he falls into roguery and petty swindling, blinding himself
to any baseness in it and seeing instead a certain grandeur of the dashing master and no less
mastery of wit. Such is the cause of his hatred of work, his fondness for the roguish and
adventurous life, and his taste for gallantry in bearing, swagger in manner, and bullying toward
others. The Spaniard's wit is brilliant, quick, and alert, more a matter of intuition and fancy
than of abstract intelligence, more that of a poet and dreamer than of a scholar and man of
learning; hence his worth as an artist and his slight inclination to systematize facts
scientifically, the plastic and realistic quality of his instinctive creations, and his turning
away from the symbolic, the ideal, and the abstract. His clear knowledge of justice makes the
Spaniard live continually in the moral world, judging everything ethically rather than by
interest and convenience, always moralizing in literature and scrutinizing the actions of others,
above all those of his own people and even his own; hence the gravity of all his conduct, to the
point of becoming ponderous and slow, losing opportunities through indecision.

The Spaniard has a will of iron; he is tenacious to the point of stubbornness, constant and
attached to his traditions to the point of backwardness in civilization, religious by tradition,
and a lover of independence like no one else. He is not warlike by nature and prefers peace; but
for independence, or for any great ideal of justice, he takes to the field and is a steadfast,
long-suffering, and brave warrior who cares nothing for losing his life. Thanks to his clear wit
and strength of will, the Spaniard is extraordinarily frank and sincere, not at all superstitious
or given to occultism; he loves the light and abhors half-tones. In short, with great common
sense in spiritual matters and very little in material ones, he is a robust, original, and lofty
thinker, a realistic and sincere artist, great-hearted, compassionate and brave, and a bold
defender of justice and of every noble cause; but he does not want to work, hates saving,
disdains his own interest, does not pine for material comforts, and was great only when spiritual
ideals ruled public opinion among the peoples, being left annihilated and prostrate, not knowing
what to do, when the material ideals of labor and gold surpassed all the others. The Catalan,
more European and French, is hardworking and thrifty, commonly out of interest; so, no less, is
the Basque, out of honesty and uprightness. The Spaniard will be so, and will thereby be great,
on the day he does as the Basque does; and he will do so some day, because he carries in his soul
the same ideals, now dormant from the blow he took on falling from his gentlemanly estate when
the ideals of society were overturned. That day will come when he is persuaded that work, though
it can be a vile thing fit for slaves, can also be a virtuous and noble thing, proper to every
honest and independent person.

The climate of Spain is extreme: African for some months in the valleys and plateaus, Siberian
for others on the plateaus and mountain heights. The heat of summer is felt throughout the
Peninsula, and the cold of winter reaches everywhere. These rigors are tempered somewhat on the
coasts, and throughout the territory spring, and autumn even more, are like paradise. The most
violent tones color the literature and speech of the Spaniards. Theirs are not the literature and
speech of the Russian fireside, of London fogs, of Parisian gray; nor are they those of sandy,
stifling Egypt or of the tropical, unhealthy Ganges. They belong to a temperate setting, but with
the greatest rigors that a temperate setting can bring. There is more violence and harshness in
the passage between climates in Spain than in Italy and Greece: the speech and the literature say
so more clearly than the isothermal and isobaric lines.

2. The Castilian language, as a work of popular art, is worth infinitely more than all its
literature. In its idioms, metaphors, set phrases, and proverbs there is far more depth of
thought, greater subtlety of wit, more brilliant color, and more delicate humor than in all our
literary works put together. Our common tongue, stripped of half or more of the words that the
dictionaries carry and that we the educated use, which serve only to soil it, water it down, and
dull its vivid color, is the masterpiece of national popular art, unconscious if you like, but in
fact the child of reflection. Someone was the first to hit upon the wit of an expression, to
paint a saying with singular charm or dress it in an unexpected metaphor; the people saw at once
that this was the fitting expression, in keeping with the genius of the race, one that no one
else had hit upon, and they embraced it as their own and appropriated it; and, its author
forgotten by the next day, it passed into circulation as common currency, as an unconscious
outgrowth of everyone's speech.

Neither the Castilian language nor the romances, that is, Castilian
popular poetry, was born at the moment when it occurred to certain
writers to set them down on paper or parchment, writers fonder of what
was national and less enamored of the dead Latin tongue and the
foreign-leaning literature that the general run of writers assumed to
be the only things worth writing. Indeed, a language and a poetic
genre are not born in a day, nor do they spring up in a people at the
dawn of a today that follows a yesterday of many centuries during
which that people lived without literature and without language. By
the end of the fourth century all Latin had vanished from the lips of
the people; Vulgar Latin had turned into other vernaculars that could
no longer be called Latin. Before that outcome was reached, those
other popular tongues had already been spoken for many long years,
since changes of language, the evolution of one into another as from
father to son, are not events that take less than several centuries.
When Christ came into the world, therefore, Latin and Castilian were
both spoken in Spain: Latin by the Roman colonists and by educated
people, and perhaps also, more or less mangled and more or less
understood, by the originally Spanish inhabitants of the conventus
iuridici, the Roman colonies and the half-Latinized towns; Castilian
by the people of the countryside and the villages, who were the
majority. To believe that the villagers of Spain ever came to speak
Latin, entirely forgetting the pre-Roman national language, is a dream
from which anyone may gently awaken who considers that more than half
of the Castilian lexicon, inexplicable to the Romanists, is not Latin
but of Iberian origin; that Castilian pronunciation is Iberian and not
Latin; and that not a few derivational suffixes and some constructions
belong to the pre-Roman speech of the Spaniards. Had Latin been spoken
throughout Spain, with that national speech wholly forgotten, and had
Latin then evolved into Romance, the latter would not be steeped in
Iberian elements as substantial as the pronunciation, half the lexicon
and not a few suffixes and constructions; for no vernacular goes
borrowing words, suffixes and phonetics from another language already
dead, least of all from one that was neither learned nor written, as
was the pre-Roman speech of the Spaniards.

That Castilian was born of Latin was never open to doubt; that it was
born of Vulgar Latin, not of literary Latin, was for modern philology
to establish; but when and how it was born are thornier points, known
to few scholars, and even those few dispute them. What was the Vulgar
Latin from which our Romance was born? To delimit it, the history of
the Latin language must be summed up in a few words. The spoken
language must be distinguished from the written or literary one: the
first existed as long as there were Romans in the world; the second
was born later, it may be said with Livius Andronicus (year 514 of
Rome), the earliest author of Latin literature. We know with full
certainty that, besides the literary Latin of books, there was a
language which Latin authors call sermo vulgaris, plebeius, usualis,
cottidianus, inconditus, proletarius, prisca latinitas; or perhaps
there were several languages, of the patricians and of the plebeians,
varying with the times but always with some distinction between the
two social classes, since the authors set against it the sermo
urbanus, eruditus, perpolitus of the patricians, a spoken variety that
has everywhere and always differed from purely literary language. The
first thing noticed by anyone with a comparative knowledge of the
Indo-European languages is that the old Vulgar Latin, the prisca
latinitas, as it shows through in the oldest inscriptions, in the
Saturnian or national verses and even in those classical authors who
affect archaisms, stands closer in its phonetics to the other
languages of the family, and to the other Italic ones in particular,
than to the classical literary Latin of the age of Cicero and
Augustus. It is enough to recall that the e and o of Old Latin, of
several Italic dialects and of the other Indo-European languages take
on in literary Latin the more stable timbre of i and u; and that the
old diphthongs, due to strengthening or guna, by which deico resembles
deic-numi, and so on, contract to i and u on reaching literary Latin.
It is no less evident that the tendencies of the literary language
gradually work with greater force, giving this semi-official language
its particular stamp as time goes on, for they are seen emerging in
the oldest writers and becoming general in the classical period. Thus
at its beginnings literary Latin scarcely differs from the vulgar; but
little by little the two languages, each evolving according to its own
tendencies, which are, broadly speaking, toward the Umbrian dialect
for the vulgar and toward Oscan for the literary, drift further and
further apart. At first, the small difference presumably arose from
the differing pronunciation and taste of the plebs and the patrician
class, the latter more Latin, the former more of the mountains and
continually swelled by the Sabines and others who joined them. The
patricians themselves, when writing began, presumably wrote in their
own speech and not in that of the plebs; so that always, in the
classical period and before and after it, the language spoken by
persons of standing in Rome resembled the literary language more than
the plebeius, vulgaris, proletarius. Of these three varieties, spoken
vulgar, spoken urban and literary, only the first passed to the
provinces, after taking on the shades of the Italic dialects in its
wanderings across Italy, and it was this one that gave birth to the
Romance languages.
We have, then, a prisca rusticitas, closer to Indo-European and to the
Umbrian dialect, which held in germ the tendencies that later unfolded
and gave the Romance languages their analytic and phonetic character;
and alongside it a more cultivated Latin, resembling Oscan, which,
carried into literature, yields another variety, the literary
language. This, taking a path opposite to the vulgar and always
closely accompanied by the speech of the most distinguished persons,
develops and, backed by the force of politics and culture and marked,
or better said made not a little foreign, by the Greek language and
literature, from which it adopts words and constructions, withdraws
ever further from the people to live in books. In the classical period
the vulgar speech is scarcely heard of at all; it runs deep below the
surface, noiselessly, evolving throughout the empire. The literary
language alone appears and dominates; it grows in power as the
official tongue and imposes itself through the central administration,
the founding of schools and the very splendor of the literature. From
Augustus to the Antonines it struggles with the vulgar speech and even
seems to sweep it away everywhere; but as imperial power declined, or
rather as true literary art was lost faster than it had taken to
develop, it may be said that by the end of the second century
classical literature died, and with it its language, the urban speech
of Rome. Literature is reborn in the fourth century, but it is now a
different one: Christian literature. From that time on most authors
are Christians and write in a dead language, a kind of jargon that is
neither classical literary Latin nor spoken Vulgar Latin but an
artificial mixture of the two; the few pagan writers who remain write
no better, and indeed Lactantius and other Christians surpass them
all. With the coming of the barbarians in the fifth century all spoken
Latin disappears, for the vulgar tongue itself had long since become
true Romance on the lips of the natives in the provinces; and even in
the most cultured provincial cities, and in Rome itself, cases were
being confused, endings were being lost and replaced by prepositions,
participles were being used with auxiliaries, and so on. Christian
literary Latin, in fact a dead and purely learned language, is barely
and badly known by a few educated persons, even though it continues to
be taught in the schools still standing and is even used in the
pulpit, where it is understood by select society. The few who write it
corrupt it more and more by Latinizing the foreign words that the
barbarians bring, or that the various nations of the empire use in
their Romance tongues and that travel with the legions in their
constant movement from place to place. Such is the so-called Low
Latin: the learned Latinization, at the hands of writers, of the whole
vulgar lexicon, whatever the origin of its words.
3. Anyone acquainted with the spirit of the ancients knows full well
that for the educated people of those times there was no Latin but the
literary one. It never occurred to anyone to write in that vulgar
jargon, which was regarded as a degeneration of cultured Latin,
clumsily disfigured and mangled on plebeian lips. This is why the only
information we have about Vulgar Latin we owe to scholarly research,
which has managed to trace a few facts by indirect means: hence the
difficulty of the problem. And here a critical observation of the
greatest importance arises. That contempt for the vulgar speech, and
that eccentric way of regarding it, persisted even after the Empire
had ended. Until well into the Middle Ages educated people did not
take to writing in Romance, believing it an unworthy instrument for
literature; moreover, before the twelfth century everyone believed
that what they spoke was Latin, though a corrupted one. Only thus can
it be explained that authors altered vulgar Romance, bringing its
spelling as close to Latin as they could, and used every Latin term
that came to mind, giving it merely a light Castilian tinge. Hence
that linguistic duality within a single author, who uses not only
terms unknown to the common people but even the vulgar ones
themselves, with a spelling that is half Latin or etymological and
half phonetic. It is impossible that in Berceo's time the same verb
was pronounced in three ways: dannar, danpnar, damnar. These spelling
variants all stood for dañar, which was the only way it was said then,
just as now. But writers would have thought they were spoiling Latin
had they written it as they pronounced it. They had a language for
writing and believed they were ruining it when they spoke their roman
paladino. And here not a few have stumbled, adducing these spelling
variants as forms that were really pronounced as written and that were
therefore the intermediate forms attesting the evolution, in which we
see Latin turn into Castilian and our Romance being born.

This critical observation applies equally to the Latin and to the
Castilian writings of those times, and it matters so much for
investigating the etymology and origin of Castilian that I shall go
into particular cases.

It is so far from true that Castilian can be seen being born in
medieval writings that, on the contrary, what is seen being born in
them is Latin. Castilian appears, the first time it is found written,
as a robust and finished language, and the isolated words that appear
in the oldest Latin documents are as Castilian as they are today.
Indeed, the forms that appear earliest are the most Castilian, and
little by little they draw closer to the Latin ones. The reason is
that writers knew Latin better as time went on. For example, linde is
found in the Fuero de Évora of the year 1166 (M. P. Leges, p. 392):
"Qui linde alieno crebantaverit, pectet quinque solidos, et septem ad
Palacio". In the second recension, the Fuero de Abrantes of 1179 and
that of Corucha of 1182 (ibid., pp. 419 and 427): "Qui limde alienum
quebrantaverit". In the third, the F. de Palmella of 1185 (ibid., p.
430): "Qui limede (al. limide) alieno crebantar...". In the fourth,
the F. de Covilhan of 1186 and that of Centocellas of 1194 (ibid., pp.
457 and 487): "Qui limitem alienum fregerit...". In the fifth, the F.
de San Vicente de Beira of 1195 (ibid., p. 495): "Qui limidem alienum
fregerit". In truth, what is seen being born here is not Castilian
but, one would say, Latin: linde, limde, limede, limitem, limidem. The
same happens with the terms azor and azorera, which appear before
acetore and aceptore. Of the forms arroyo, arroio and arrogio, the
first is the oldest, from the year 841, in the donation of Alfonso el
Casto to the cathedral of Lugo. In the era 916 we find quoto: "factum
est in supradicto quoto 8 idibus junias"; and later, in the eras 937,
940 and 983, cautum; and in that of 984, cautamus. It looks for all
the world as if Castilian were turning back into Latin; the fact is
that learning was advancing, and all that writers aimed at was to
write in Latin, which they did better and better. Since for them the
vulgar speech was a corrupted Latin, they plundered it, Latinizing it
in their writings: abatire from abatir, abadagium, acampanare,
acannizare, alcanzare, advescit = consuevit (Glos. gót. Card.) from
avezar, "dña Thereysia mea ama", from Castilian ama, attondus (era
1100, Arch. Arlam.) or atuendo in the ablative (ch. Ferdin. I, Sota),
from Basque atondo, "terras cultas vel barbatas" from vervactum =
barbecho (ch. Adeph. imper., era 1117, Arch. Naj.), campidator from
campeador, campear (ch. Adeph., 1111, Sota), cargas de feno, carnerus,
cavalcator, cerrus from cerro, collacius from collazo, collata,
ganare, ganatus, autero from otero, heretarius from heredero, ingamno
from engaño, quadrare, quitare, sacare, spolas. It would be folly to
imagine that such Latin forms ever belonged to speech: they are
Castilian words, many of them without Latin origin, Latinized by the
scribes of those times. Anyone who wishes to pile up uncritically the
intermediate terms between the Castilian and the Latin forms will find
them all in the documents; yet they are not middle terms in the
natural evolution of Latin into Castilian but often, on the contrary,
the ever more perfect Latinization of the vulgar speech. For example,
in Berceo we find miraculo (Mil., 46), miraclo (id., 869) and miraglo
(S. Dom., 315). "Berceo preserves for us three of the five forms
through which miraculum passed before settling as milagro," says
Lanchetas. If this were true, in Berceo's time Castilian would not yet
have been born, nor even Vulgar Latin, since the miraclo of Vulgar
Latin is later than Berceo's miraculo. What actually happened is that,
vulgar Romance being despised at the time, writers believed they
should write it as much like Latin as possible, Latin being for them
the only literary language; so that instead of always writing miraglo,
which is how the people said it, they sometimes wrote miraclo to come
closer to Latin, and even miraculo, taken from Classical Latin, from
which miraglo had not derived, having come instead from the vulgar
miraclo. It is always the literary reaction correcting the vulgar
speech.

The forms we find written in the authors cannot all be taken without
discernment: the most vulgar form is the only trustworthy one; the
others are learned borrowings from Latin and do not reflect spoken
Castilian. Mixtura for mezcla in Berceo (Duel., 40) is of much later
origin than mesturar for mezclar and mesta for a mixed thing, and so
is misto. The x of mixtura betrays a borrowing from Latin; today misto
has passed into popular use, but it has lost the x, which not even the
Romans pronounced, much less the Riojans of the time of their poet
Berceo. Modrar (S. Mill., 27, 1), though learned in origin, has
already lost the e; the later reaction produced moderar, modelled on
moderare. Since modrar was not used among the people, it disappeared
before moderar. Here one sees how the learned language lives in part
wholly divorced from the vulgar speech, since in each period it has
taken Latin words and modified them, not according to Castilian
phonetics, but according to the practice the learned followed in
adapting them, to a greater or lesser degree depending on the period,
to those same phonetics. Today the Latin reaction is stronger, and has
grown steadily so since the Renaissance. Today we do not think it
right to drop the medial e of moderare, and we say moderar, removing
only the final e so that it fits the mould of the infinitives. The
clerics of the thirteenth century did not venture so far, and said
modrar; but both forms have floated and floated upon the vulgar speech
without penetrating it, like learned dross that comes and goes and
changes at the whim of those who use it in their writings and even in
conversation. Berceo himself already uses modulado: "Odi sonos de aves
dulces e modulados" (Mil., 7); but that borrowing is later than the
one that turned modulus into molde, which is also learned but of an
earlier period, from mod(u)lus, with loss of the u, which was never
sounded in Vulgar Latin, and with the metathesis that the earliest
scholars commonly affected when transcribing similar words, such as
tilde, if it comes from titulus, and espalda from spat(u)la. Today we
would not dare derive words with such metatheses, because we pride
ourselves on being better Latinists and have less affection for the
national phonetics. Who would dare today to say motral alongside
mortal, as Berceo does? Muebda for movida is a learned formation of
that time (S. Dom., 119), like debda from debita; mover and movido,
had they been vulgar, would have lost the v. There is also mueda =
motive cause (S. Mill., 387), already more Castilianized, like muedo
for modo (Mil., 29), which no one would dare say today, although it
conforms to the exceptionless change of stressed ŏ to ue, as does
muesso for mordisco (Loor., 77), from morsus, with the r regularly
lost. Multo (Mil., 259), by contrast, is a concession to multum that
no one would make today, just as no one would say nodicia, as Berceo
does (S. Mill., 164), legitimately softening the t of notitia, or
nudrir or nodrir for nutrire (S. Dom., 59, 528). I do not believe that
odir or udir were said in Berceo's time alongside oir, although he
writes it in these three ways (Sacr., 56; S. Dom., 312; Duel., 209);
the d is due to learned reaction, as in odiendo for oyendo. Nor do I
believe that palomba, which he writes alongside paloma (S. Or., 40,
46), was ever pronounced; the b was another writer's concession to
Latin. No caution is too great when we try to deduce from the written
texts what should really be attributed to spoken Romance, separating
it from what writers added of their own. They did so in the belief
that Latin alone was a language worthy of being written; that Romance,
being nothing but bad Latin, had to be purified as far as possible to
make it fit for use in writing; and that one could, and indeed should,
draw on the whole Latin vocabulary, since what was being written was
Latin and the spoken and written languages were one and the same. The
same happened in Italy. Dante thought that Italian and Latin were one
and the same thing; he called Italian the vulgar speech and Latin
gramática, as if to say: Italian is bad Latin, and only Latin deserves
study; or, put another way, Latin is the literary language (gramática
means nothing else) and Italian is badly pronounced Latin. Petrarch
held the same view and looked down on Tuscan, which in his own
writings he was raising to a literary language. Such is the power of a
literary language when it has belonged to a great empire and a great
civilization. Those very beliefs indicate that Romance was not born at
a stroke, but was, without any break in continuity, the same Latin
that, spoken well or badly in Spain in Roman times, had gone on
evolving so imperceptibly that it never even changed its name.

4. In the last period of the Empire the fusion of races was complete.
The provinces, having acquired all the rights of the old citizens of
Rome through the edict of Caracalla (212), regarded themselves as no
less Roman than the city of Romulus itself, and the patriotic spirit
of Roman nationality awoke in the face of the barbarian or foreign
peoples prowling the frontiers on every side. The adjective romanus,
previously applied only to the inhabitants and things of Rome, then
had to be extended to the whole Empire, in opposition to barbarus.
Orosius gave the name Romania to the whole body of races and countries
contained within the Empire, just as each of them was called Hispania,
Britannia, Graecia or Gallia. What was most distinctive of Romania,
its language, was accordingly called the Roman tongue; to speak in
roman, romanice, in romance, was to speak the language of Romania, of
the Roman Empire, and was the same as speaking Latin. The standard of
that speech was, naturally, the official Latin of the administration,
which was the variety closest to the literary language; but the vulgar
speech of the provinces was believed to be nothing other than that
same Latin, albeit somewhat corrupted. That same Latin went on being
spoken for several centuries; but what differences ceaseless evolution
had brought about! Virgilio Cordobés, cited by Sarmiento[1], wrote in
the ninth century: "Ille est vituperandus qui loquitur latinum circa
romancium, maxime coram laicis, ita quod ipsimet intelligunt totum...
Et ita debent omnes clerici loqui latinum suum obscure in quantum
possunt et non circa romancium". This notable passage reveals several
historical facts of the greatest importance. In that same century
(842) the pact between Charles the Bald and Louis the German was drawn
up in French, the Romance of northern Gaul, the first monument we
possess in a vernacular[2], of which Sarmiento says that Galicians
could understand it without translation. The clerics spoke their own
Latin, says the Cordoban author, that is, a kitchen Latin that stood
quite far from Classical Latin on the one hand and from the vulgar
speech on the other, since he advises them to use it among themselves
in front of lay people when it is best that the latter not understand
them. From this one can see the gross error of Martínez Marina in
maintaining that only at the beginning of the twelfth century could
speech have been such that Romance was regarded as distinct from the
Latin language.

For the same reason, when Álvaro Cordobés complained[3] that Latin,
the speech of the Christians, had been forgotten by the Spaniards
living among the Moors, who held the Arabic language in higher esteem,
he was speaking, since he refers to the Spanish people, of vulgar
Spanish Romance, which he calls Latin for the reasons noted above. He
was not speaking of Classical Latin, which beyond any doubt had for
centuries been known only to a privileged few scholars, nor even of
Vulgar Latin, which by the ninth century had already disappeared.
Virgilio, then, tells the clerics to speak their bad Latin, latinum
suum, in a manner as unlike the vulgar speech as possible, obscure et
non circa romancium. That romancium or Romance was no longer the Roman
tongue, the Roman and Latin speech of Romania, and yet it keeps the
name. What was the speech of Romania, that is, what was the so-called
Vulgar Latin? Because of the beliefs just described, no one wrote in
that Latin; we have not the slightest document genuinely composed in
this language: hence the difficulty of the problem. It has to be
reconstructed through the study
