Making Sense of Data Statistics Course
Author:
Gábor Bernát
May 1, 2013
Contents

1 Introduction
  1 Making sense of data
    1.1 Data categorization
    1.2 Quantitative variables
    1.3 Categorical variables
  2 Relationships and data collections
    2.1 Relationship between quantitative and categorical variables
    2.2 Relationship between two categorical variables
    2.3 Relationship between two quantitative variables
    2.4 Sampling
    2.5 Observational studies
    2.6 Experiments
  3 Introduction to Probability
    3.1 The need for probability
    3.2 Probability basics
    3.3 Probability distributions
    3.4 Long running averages
    3.5 Sampling distribution
  4 Confidence interval
    4.1 Confidence intervals with proportions
    4.2 Sample size for estimating a proportion
    4.3 Confidence intervals for means
    4.4 Robustness for confidence intervals
  5 Statistical Tests of Significance
    5.1 The structure of the statistical test
    5.2 Hypothesis Testing for Proportions
    5.3 Hypothesis Testing for Means
    5.4 Power and Type I and Type II Errors
    5.5 Potential pitfalls to look out for
Chapter 1
Introduction
quantitative variable takes numerical values for which arithmetic operations make
sense. The height of people is such a variable,
categorical variable records which one of several categories an observation falls
into. For example, the countries of the world may be classified into one of
the five great continents: Europe, America, Africa, Asia and the Pacific,
ordinal variable has a natural order; however, the difference between two instances
of the variable does not always make sense. A good example is grades given
by a teacher: A, B, C, D, E, F.
Median is the center observation, the middle point. To find it, sort the
observations and take the observation in the middle position.
First quartile is the observation value at the 1/4 position in the sorted
observation array.
Third quartile is the observation value at the 3/4 position in the sorted
observation array.
A graphical representation of these five values is possible via the boxplot, as
shown in figure 1. On the boxplot the whiskers show the minimum and the maximum
values.
Figure 1: Boxplot – whiskers at the minimum and maximum, box spanning the 1st
to the 3rd quartile (the inter–quartile range), with the mean marked inside.
Note that the median, first or third quartile may fall at a non–integer posi-
tion. In this case these values are calculated by interpolating them from the
nearby observations, with the given percentage; therefore, it may happen that
these values are not part of the variable instances.
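These five values can be computed directly in R; a minimal sketch on a made-up vector of observations:

```r
x <- c(50, 55, 61, 64, 68, 70, 72, 75, 78, 80)  # made-up observations

median(x)          # middle observation: 69
quantile(x, 0.25)  # first quartile, interpolated between positions: 61.75
quantile(x, 0.75)  # third quartile: 74.25
fivenum(x)         # min, 1st quartile, median, 3rd quartile, max
boxplot(x)         # the boxplot of these five values
```

Note that `fivenum` uses Tukey's hinges, which can differ slightly from `quantile`'s interpolated quartiles.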
Modified boxplots
Outliers (also known as extreme values, or unusual observations) are hard to study
on a classical boxplot, so for them we use the modified boxplot. In this case let
us first define the inter-quartile range (noted as IQR) as the difference between
the 3rd and the 1st quartile. Then we can define the inner fences as:

lower fence = Q1 − 1.5 · IQR,  upper fence = Q3 + 1.5 · IQR

Now the lower whisker is drawn at the lower fence and the upper whisker at the
upper fence. Observations smaller than the lower fence, or larger than the
upper fence, are drawn with their own circle on the plot, as shown in figure 2.
Figure 2: A modified boxplot – whiskers at the lower and upper fences;
observations beyond the fences drawn as individual circles.
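The fences can be computed by hand from the quartiles; a sketch on made-up data (note that R's `boxplot` already applies the 1.5 · IQR rule by default via its `range` parameter):

```r
x <- c(-70, -20, -5, 0, 2, 5, 8, 10, 15, 25)  # made up; -70 is an outlier

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1                        # same as IQR(x)

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
x[x < lower_fence | x > upper_fence]  # observations drawn as circles

boxplot(x, range = 1.5)               # modified boxplot with these fences
```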
Mean
Given a list of observations (x1 , x2 , . . . xn ), the mean of the variable is noted as x̄
(or µ) and is calculated as:
Mean = x̄ = (∑ data values)/(number of data points) = (∑_{i=1}^{n} x_i)/n
However, this definition of the mean is not robust, as it's easily influenced by
outlier points. Note that, in contrast, the median is robust. To alleviate this we
can introduce the concept of the trimmed mean, which excludes some percentage of
the lowest and highest values from the observations before performing the same
operation to calculate the mean. The input of the trimmed mean is the
percentage of outliers to remove.
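In R this is the `trim` argument of `mean`; a sketch with a made-up vector containing one outlier:

```r
x <- c(2, 3, 4, 5, 6, 7, 8, 100)  # 100 is an outlier

mean(x)                # 16.875, pulled up by the outlier
median(x)              # 5.5, robust
mean(x, trim = 0.125)  # 5.5: drops the lowest and highest 12.5% first
```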
variance = ∑_{i=1}^{n} (x_i − x̄)² / (n − 1)
Note that we divide by one less than the count of observation points. An
intuitive explanation for this is that the first observation does not tell us
anything about deviation. The standard deviation (also noted as σ) is the square
root of this (√variance), and shows the dispersion of a set of data from its mean.
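R's `var` and `sd` use exactly this n − 1 divisor; a quick check on made-up data:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up observations

v <- sum((x - mean(x))^2) / (length(x) - 1)  # variance by hand
v        # matches var(x)
sqrt(v)  # matches sd(x)
```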
Figure 3: Histogram
The distribution is the pattern of values in the data, showing their frequency
of occurrence relative to each other. The histogram is a good way to show this
graphically; you can see an example of this on figure 3.
Its key part is the number of bins used, as observations must be separated into
mutually exclusive and exhaustive bins. Cutpoints define where the bins start and
where they end. Each bin has its own frequency, the number of observations in it.
The largest bins define the peaks or modes. If a variable has a single peak we
call it unimodal, bimodal for two peaks, and multimodal for more than that.
A uniform distribution is the case when all the data values occur about equally
often, so we have no peaks, and as such the variable has no mode. The tails of the
histogram are on its left and right sides, where its extreme values are. A
histogram is left skewed if its left tail is larger than the right, and right
skewed if the right tail is larger than the left.
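A histogram like figure 3 can be produced with `hist`; a sketch on simulated data (the distribution parameters are made up to roughly match the figure's 50–80 range):

```r
set.seed(42)                        # for reproducibility
x <- rnorm(200, mean = 65, sd = 7)  # simulated observations

h <- hist(x, main = 'Histogram')
h$breaks  # the cutpoints of the bins
h$counts  # the frequency of each bin
```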
Empirical rule
The empirical rule (also known as the three σ rule) states that for a normal
distribution 68% of the data is within one standard deviation of the mean, 95% is
within two standard deviations, and 99.7% is within three standard deviations.
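The three percentages can be checked against the normal cumulative distribution function:

```r
# area within k standard deviations of the mean of a standard normal
within_k <- function(k) pnorm(k) - pnorm(-k)

round(within_k(1), 3)  # 0.683
round(within_k(2), 3)  # 0.954
round(within_k(3), 3)  # 0.997
```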
[Figure: distribution of the variable within each region – E.Asia&Pc, Americas,
Eur&C.As, S–S.Africa, M.E&N.Afr, S.Asia]
So create a boxplot (or a summary) for each category and compare them.
In the R language
In the R language we can draw a boxplot per category to make the com-
parison. To separate categories we can use the split function, and finally use
non-modified boxplots (range set to 0) to draw them, as seen on figure 6:
lifedata = read.table('LifeExpRegion.txt')
colnames(lifedata) = c('Country', 'LifeExp', 'Region')
attach(lifedata)
lifedata[Region == 'EAP', ]
lifesplit = split(lifedata, Region)
lifeEAP = lifedata[Region == 'EAP', ]
lifeSSA = lifedata[Region == 'SSA', ]
boxplot(lifeEAP[, 2], lifeSSA[, 2], range = 0, border = rainbow(2),
        names = c('EAP', 'SSA'), main = "Life Expectancies: Box Plot")
Distributions
Categorical values in R
Categorical variables read into R are always sorted alphabetically, and therefore
any statistics about them will be displayed in that order. However, sometimes
there is a better order for these variables. In this case we can use the factor
function and its levels parameter to set a different order for the categories:
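The factor call itself did not survive here; a sketch of what it presumably looked like for the BMI variable used below (the sample vector is made up):

```r
# Made-up BMI sample; in the notes BMI comes from the course data set
BMI <- c('normal', 'obese', 'underweight', 'normal', 'overweight')
BMI <- factor(BMI,
              levels = c('underweight', 'normal', 'overweight', 'obese'))
table(BMI)  # counts now printed in the natural order, not alphabetical
```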
To find the number of items per category use the table command. You can
divide this by the number of observations to get the relative frequencies:
relfreqBMI = table(BMI)/length(BMI)
BMI
underweight normal overweight obese
0.1850 0.5625 0.2025 0.0500
We can even combine the absolute and relative frequencies in a single table:
cbind(freqBMI, relfreqBMI)
To get the joint and the conditional distributions for two categorical variables
we need to use the CrossTable function from the gmodels library.
library(gmodels)
joint = CrossTable(BMI, Sex, prop.chisq=FALSE)
Cell Contents # legend for the table below
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 400
| Sex
BMI | Male | Female | Row Total |
-------------|-----------|-----------|-----------|
underweight | 46 | 28 | 74 |
| 0.622 | 0.378 | 0.185 |
| 0.164 | 0.235 | |
| 0.115 | 0.070 | |
-------------|-----------|-----------|-----------|
normal | 166 | 59 | 225 |
| 0.738 | 0.262 | 0.562 |
| 0.591 | 0.496 | |
| 0.415 | 0.147 | |
-------------|-----------|-----------|-----------|
overweight | 59 | 22 | 81 |
| 0.728 | 0.272 | 0.203 |
| 0.210 | 0.185 | |
| 0.147 | 0.055 | |
-------------|-----------|-----------|-----------|
obese | 10 | 10 | 20 |
| 0.500 | 0.500 | 0.050 |
| 0.036 | 0.084 | |
| 0.025 | 0.025 | |
-------------|-----------|-----------|-----------|
Column Total | 281 | 119 | 400 |
| 0.703 | 0.297 | |
-------------|-----------|-----------|-----------|
At this point joint contains four tables: a contingency table (frequencies –
joint$t), two conditional distributions (one per point of view – sex joint$prop.col
or BMI joint$prop.row), and one displaying relative frequencies (the joint distri-
bution – joint$prop.tbl). We can use barplots to visualize these:
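The barplot call did not survive extraction; a sketch of one way to draw a conditional distribution, with the column proportions copied from the table above standing in for joint$prop.col:

```r
# BMI distribution conditioned on Sex (values from the CrossTable output)
prop_col <- matrix(c(0.164, 0.591, 0.210, 0.036,
                     0.235, 0.496, 0.185, 0.084),
                   nrow = 4,
                   dimnames = list(c('underweight', 'normal',
                                     'overweight', 'obese'),
                                   c('Male', 'Female')))

barplot(prop_col, beside = TRUE, legend.text = TRUE,
        main = 'BMI conditioned on Sex')
```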
In the R language
For calculating the correlation we can use the cor function.
Countries = read.table('LifeGDPhiv.txt')
colnames(Countries) = c('Country', 'LifeExp', 'GDP', 'HIV')
attach(Countries)
plot(GDP, LifeExp, xlab = 'GDP (2000 USD)', ylab = 'Life Expectancy (years)',
     main = 'Scatterplot: Life Expectancy versus GDP per capita')
cor(GDP, LifeExp)
[1] 0.6350906
cor(LifeExp, GDP)
[Figure: scatterplot of Life Expectancy (years) versus GDP per capita]
2.4 Sampling
The goal of statistics is to make rational decisions or conclusions based on the
incomplete information that we have in our data. This process is known as
statistical inference. The question is: if we see something in our data (like a
relationship between two variables), is it due to chance or is it a real
relationship? If it's not due to chance, then what broader conclusions can we
make – can we generalize them to a larger group, do they support a theoretical
model? In this process the data collection has a major significance.
We collect data from the real world; however, our scientific and statistical
models are part of a theoretical world.
population the group we are interested in making conclusions about.
census a collection of data on the entire population. This would be the best;
however, it's impractical due to time and cost, or straight up impossible
if by observing an item we would destroy it. Therefore, in practice we
sample the population and infer conclusions from the sample group.
statistic is a value calculated from our observed data. It estimates a feature of
the theoretical world.
parameter is a feature of the theoretical world. Statistics are used to estimate
their values. In order to get a good estimation our sampling needs to be
representative.
randomisation is the key to selecting representative samples. It ensures that we
do not over– or under–sample any part of the population.
Methods for random sampling:
Simple Random Sampling – SRS Each possible sample of size n (the sample
size) from the population is equally likely to be the sample that is chosen.
A practical example of this is taking balls out of a hat.
Stratified sampling Divide the population into non–overlapping subgroups called
strata and choose an SRS within each subgroup. Provinces and states are
practical instances of strata. This performs better when we want to
compare strata, or lets us see traits that are characteristic of only some of
the strata (which would otherwise be hidden in the whole sample space).
Cluster sampling Divide the population into non–overlapping subgroups called
clusters, select clusters at random, and include all individuals inside the
selected clusters in the sample. It's good when it's easier to select groups
than members; for example, to study students we may choose random schools and
use all students inside those as samples. This does require each cluster to be
representative of the whole population.
Systematic sampling Select every k–th individual from a list of the population,
where the position of the first person chosen is randomly selected from the
first k individuals. This will give a non–representative sample if there is a
structure to the list; it is fine if the ordering of the population has no
meaning.
Convenience or volunteer sampling Use the first n individuals that are avail-
able, or the individuals who offer to participate. This is almost sure to give
a non–representative sample, which cannot be generalized to the population.
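The first few schemes are easy to sketch in R with sample (the population frame here is made up):

```r
set.seed(1)  # for reproducibility
population <- data.frame(id = 1:1000,
                         stratum = rep(c('north', 'south'), each = 500))

# Simple random sampling: every sample of size 50 is equally likely
srs <- population[sample(nrow(population), 50), ]

# Stratified sampling: an SRS of 25 within each stratum
strat <- do.call(rbind, lapply(split(population, population$stratum),
                               function(s) s[sample(nrow(s), 25), ]))

# Systematic sampling: every 20th individual, random start in the first 20
start <- sample(20, 1)
sys <- population[seq(start, nrow(population), by = 20), ]
```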
If the sample is not representative it can induce bias into our results; that is,
the sample differs from its corresponding population in a systematic way. Bias
types are:
Selection bias occurs when the sample is selected in such a way that it system-
atically excludes or under–represents part of the population. For instance,
a poll using only land line phones misses the cellular population.
Measurement or response bias occurs when the data are collected in such
a way that they tend to result in observed values that differ from the
actual values in some systematic way. In the case of a poll this shows up as
ill–formed questions.
Nonresponse bias occurs when responses are not obtained from all individuals
selected for inclusion in the sample. For example, in a poll working
parents tend not to respond, so they will be under–represented in the sample.
Data collection methods include anecdotes (these are not representative), obser-
vational studies, and experiments. Experiments differ from observational studies
in the strength of the conclusions we can make, which is higher for the experiment.
In observational studies we just observe existing characteristics of a subset
of individuals inside the population. The goal is to draw conclusions about the
population based on the samples, or about the relationship between groups
or variables in the sample.
In this scenario the investigator has no control over which individual belongs
to which group, or over any of their characteristics, as opposed to the experiment,
where he can add some kind of intervention.
The relationship between the outcome and the explanatory variable may be due to:
common cause both of them are affected by another variable (people who have
diabetes drink less coffee; however, due to their sickness they also have
shorter lives),
confounding variable a variable that varies with the explanatory variable – if
one changes, the other changes with it (smokers tend to drink more coffee,
and smoking also affects the expected life outcome).
Lurking variables are variables that are not considered in the analysis, but may
affect the nature of the relationship between the explanatory variable and the
outcome. This may be a confounding variable, the source of the common response,
or another variable that, when considered, changes the nature of the relationship.
2.6 Experiments
Experiments are the gold standard: they allow for causal conclusions. Again the
response variable (also known as the dependent variable – it depends on other
variables) is the outcome of interest, measured on each subject or entity
participating in the study (this may be quantitative or categorical). The
explanatory variable (predictor or independent variable) is a variable that we
think might help to explain the value of the response variable (it can also be
quantitative or categorical).
Compared to the observational study, the researcher now manipulates the ex-
planatory variables to see their effect on the outcome. Typically a researcher
has finite time, and therefore can study only a finite number of variable values,
so the explanatory variable tends to be a categorical one, which we can
also refer to as a factor. The values of the factor studied in the experiment are
its levels.
A particular combination of values for the factors is called a treatment. An
experimental unit is the smallest unit to which the treatment is applied. A
treatment may not be applied to a single entity: trying out a new study method
on a class results in a single experimental unit (the class), not as many units
as there are students inside the class.
extraneous factors are not of interest in the current study, but are thought to
affect the response. They need to be controlled to avoid them affecting the
outcome. For controlling we can:
However, this still does not solve the problem of extraneous or unknown vari-
ables. To bypass this we need to use randomisation to assign experimental
units to treatment groups.
Once we've eliminated other differences between the treatment groups, if the
response variable is different among the groups, the only explanation is the treat-
ment, and causal conclusions can be made.
Fundamentals of experimental design:
1. Control – control the extraneous factors, so that they affect all treatment
groups equally.
2. Randomisation – randomly assign experimental units to the treatment groups.
3. Replication – induce it. Not repeating the experiment, but applying each treat-
ment to more than one experimental unit. This allows measuring the vari-
ability of the response (which in turn also ensures that treatment groups are
more comparable with respect to extraneous factors, by giving these the
opportunity to differ between groups).
Experiments also have a control group. This is used to make comparisons with
a treatment of interest, and either does not receive a treatment (what if the study
itself causes the change to occur?) or receives the current standard treatment. It's
also referred to as the comparison group.
In conclusion, we can say that randomised controlled experiments are needed to
establish causal conclusions. Another technique to reduce the potential for bias is
blinding:
1. the experimental units are blinded, so they do not know which treatment
they have received;
2. the researcher is blinded if s/he does not know which treatment was given.
3 Introduction to Probability
3.1 The need for probability
Up to this point we've focused on how to handle the data available to us.
Now it's time to see what the data of the real world corresponds to in the
theoretical world. The data of the real world, that we usually end up having, can
be viewed as just a sampling of the theoretical world (which has an infinite
number of data points). Even if there really aren't any more data points in the
real world (so we have collected all the existing data points) we can still pretend
that there are and construct a theoretical model around it.
We usually try to draw inferences about our theoretical world by using the data
that we have, which represents the real world. Of course, the theoretical world
may differ from the real world, and we'll be interested in studying the relationship
between the two worlds. For instance, let us consider a coin toss. In the theoretical
world we expect to get 5 heads out of ten tosses, yet if we were to conduct a little
experiment we might end up getting 6 heads out of ten tosses.
In some cases (like tossing a bottle cap) we may not even have a theoretical
model, so the question arises: what can we conclude in this case?
Probability – each event has its own probability of occurring, and for an event A:

0 ≤ P(A) ≤ 1
For instance, for a random variable taking the value 1 with probability 1/2 and
the value 0 with probability 1/2 (so with mean 1/2), the variance is:

variance = (1 − 1/2)² · 1/2 + (0 − 1/2)² · 1/2 = 1/4

The standard deviation is:

SD = √variance = √(1/4) = 1/2
The mean is linear, so the mean of any linear combination of random variables may
be expressed with their means: E(aX + b) = a · E(X) + b and E(X + Y) = E(X) + E(Y).
The variance and standard deviation instead behave as:

Var(aX) = a² · Var(X)
Var(aX + b) = a² · Var(X)
SD(aX) = |a| · SD(X)
Discrete random variables have a finite number of possible outcomes, and thus
their values may be enumerated in the form of a list. Continuous random variables
can take any value inside an interval. An instance of this is the uniform variable,
for instance on the interval from zero to one, meaning it's equally likely to be
any number inside this interval.
So for example P(0 ≤ X ≤ 1) = 1, and P(0 ≤ X ≤ 1/3) = 1/3; generally speaking
P(a ≤ X ≤ b) = b − a, if 0 ≤ a ≤ b ≤ 1. This does mean that if a = b the probability
is zero, so it's easier to think of continuous probability as the area under the
graph of the density function. It's important that for any density function the
total area under the entire graph be equal to 1.
The uniform density has the shape of a rectangle; however, other density functions
exist too. For example, the Exponential(1) density has the form:

f(x) = e^(−x) if x > 0,  f(x) = 0 if x ≤ 0
The standard normal (Gaussian) distribution (bell–curve) has the density:

f(x) = (1/√(2π)) · e^(−x²/2)
New bell–curves may be constructed by shifting the one above by µ (making it the
new center point) and stretching it by a factor of σ; the result is noted
Normal(µ, σ²). If we have a random variable X from this newly constructed normal
distribution we may transform it into a standard normal distribution by:

Z = (X − µ)/σ, where Z ∼ Normal(0, 1).
For expected values and standard deviations we now use integrals instead of
sums. For example, the expected value of the uniform distribution between 0 and 1
is:

E(X) = ∫₀¹ x dx = 1/2

with its variance:

Var(X) = ∫₀¹ (x − 1/2)² dx = 1/12
For the exponential distribution the expected value is:

E(X) = ∫₀^∞ x · e^(−x) dx = 1

with its variance:

Var(X) = ∫₀^∞ (x − 1)² · e^(−x) dx = 1
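These integrals can be sanity-checked numerically with R's integrate:

```r
# uniform(0, 1): mean 1/2 and variance 1/12
integrate(function(x) x, 0, 1)$value                      # 0.5
integrate(function(x) (x - 0.5)^2, 0, 1)$value            # about 1/12

# exponential(1): mean 1 and variance 1
integrate(function(x) x * exp(-x), 0, Inf)$value          # 1
integrate(function(x) (x - 1)^2 * exp(-x), 0, Inf)$value  # 1
```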
The central limit theorem is also responsible for the empirical rule: the per-
centages are exactly those under the graph of the normal distribution. In
conclusion, we can say that anything that is some kind of average, or is made up
of lots and lots of small contributions, usually has a normal distribution behind it.
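This can be seen numerically; a sketch that averages many small uniform contributions (the counts are made up):

```r
set.seed(7)  # for reproducibility

# 10000 averages, each of 100 uniform(0, 1) draws
averages <- replicate(10000, mean(runif(100)))

mean(averages)  # close to the uniform mean 1/2
sd(averages)    # close to sqrt(1/12) / sqrt(100)
hist(averages)  # approximately bell-shaped
```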
4 Confidence interval
We observe the real world in order to understand the theoretical world; for this
we'll use the scientific and statistical models devised in the theoretical world and
use data from the real world to estimate the parameters of the model. We
do this in the hope that the conclusions we draw from our models will also hold
in the real world.
Now let us imagine the experiment of tossing a fair coin ten times. This exper-
iment has a binomial distribution with probability 1/2; however, when amassing
data from the real world we will not always get this proportion, due to the
sampling having its own distribution: extreme events (like getting ten heads)
are very unlikely, but getting half or close to half of them heads is likely to
happen. The question arises: where do we draw the line, what are the values we're
likely to get most of the time?
Sometimes we will not have a model for our events, like in the case of flipping
a beer cap. If we were to perform a single experiment and get one side m times out
of n flips, we can ask: was this a likely outcome? m may be a sample from some
sampling distribution (with its own parameters). For instance, for the beer cap
let n = 1000 and m = 576. Our statistical model is binomial with 1000 trials;
however, we have no idea what the probability parameter of the model is.
An estimate is p̂ = m/n = 576/1000 = 0.576. This may be a good estimate; however,
the truth may be some other number, and so we ask: what else could p be? That is,
what is an interval in which, based on our existing experiment, the probability of
getting this side of the beer cap could plausibly lie? We refer to this as the
confidence interval.
The following methods suppose that our data was taken in the form of a simple
random sample, and that our variables are independent; statistical inference
requires this, otherwise we may need more complicated models.
So, in conclusion, we can say that the goal of statistical inference is to draw
conclusions about a population parameter based on the data in a sample (and
statistics calculated from the data); one such goal is to identify a range of
plausible values for a population parameter. Inferential procedures can be used on
data that are collected from a random sample from a population.
Since p̂ is the addition of lots and lots of small flips, it approximately follows
the normal distribution; that is, p̂ ≈ Normal(p, p(1 − p)/1000). Now, by performing
a reorganization:

(p̂ − p)/√(p · (1 − p)/n) ≈ Normal(0, 1)
Now, for a normal distribution (and according to the empirical rule), the area
between [−1.96, 1.96] covers 95% of the total area, so most likely our sample is
from this interval. As a mathematical formula:

P(|(p̂ − p)/√(p · (1 − p)/n)| > 1.96) = 0.05 = 5%
By writing up the reverse and expanding, we can conclude that, with z_(α/2) = 1.96:

P(p̂ − z_(α/2) · √(p · (1 − p)/n) ≤ p ≤ p̂ + z_(α/2) · √(p · (1 − p)/n)) = 95%

where z_(α/2) · √(p · (1 − p)/n) is the margin of error, and the two endpoints are
the lower and upper limits of the interval.
This means that we are 95% confident that the value of p is between the lower and
upper limits. Now, the true value of p is not random; however, p̂ is, as we took a
random sample. The problem with this formula is that while we do know p̂, p
is unknown. One solution is to substitute p̂ in place of p; another is to take
p = 1/2, because that's the worst case, giving the widest interval we can get.
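For the beer cap numbers above (m = 576 out of n = 1000), the 95% interval can be computed directly; a sketch (note that R's built-in prop.test uses a slightly different, score-based interval):

```r
n <- 1000
p_hat <- 576 / n
z <- 1.96

margin <- z * sqrt(p_hat * (1 - p_hat) / n)  # substituting p_hat for p
c(p_hat - margin, p_hat + margin)            # roughly (0.545, 0.607)

prop.test(576, 1000, correct = FALSE)$conf.int  # similar score-based CI
```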
For a given confidence interval [a, b] the margin of error may be recovered as:

a = p̂ − margin of error
b = p̂ + margin of error
margin of error = (b − a)/2
Now, by modifying the area that we take under the normal distribution, we can
get different confidence intervals for different probabilities. If you specify a
bigger confidence level, like 99%, you'll get a wider confidence interval; it's up
to you what trade–off you are willing to accept.
Now assume you want to allow an α probability of being wrong. In this
instance, taking the graph of the normal distribution, you want to find the point
z_(α/2) (on the x axis) such that the area remaining at each end is only α/2; the
area between −z_(α/2) and z_(α/2) is then the desired 1 − α.
But we do not know what p will be. To bypass this we plan for the worst case
scenario: the expression p · (1 − p) has its maximum at p = 1/2, which also gives
our maximal margin of error for a given n. For a margin of error of 0.03 we
resolve the equation:

n = (1.96 · (1/2) / 0.03)² ≈ 1067
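The same computation wrapped as a small helper (the function name is mine; ceiling rounds up to the next whole subject, hence 1068 rather than 1067):

```r
# Sample size needed for a given margin of error at 95% confidence,
# planning for the worst case p = 1/2
sample_size <- function(margin, z = 1.96) {
  ceiling((z * 0.5 / margin)^2)
}

sample_size(0.03)  # 1068
```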
Confidence intervals are about our confidence in the procedure to give us correct
results – 95% confidence intervals should contain the population parameter p 95%
of the time. If simple random samples are repeatedly taken from the population, due
to the randomness of the sampling process the observed proportion p̂ will change
from sample to sample, and so will the confidence interval constructed around p̂.
Of these confidence intervals, 95% will include the true population proportion p.
Note that p is a population parameter and is therefore not a random variable.
For any interval, p is either contained in the interval or not. Different margins
of error will result in different required sample sizes; note that the sample size
increases quadratically as the margin of error decreases linearly.
For the sample mean the analogous standardized statistic is:

(X̄ − µ)/√(σ²/n) ≈ Normal(0, 1)
The problem now is that we do not know the value of σ. What would
be a good estimate for it? One solution is to use the standard deviation s of
X (calculated by dividing by n − 1). However, while E(s²) = σ²,
substituting this into the formula above does not give a normal distribution;
instead we get a t distribution with n − 1 degrees of freedom:

(X̄ − µ)/√(s²/n) ≈ t_(n−1)
The t distribution is similar to the normal one, but not quite the same. In-
creasing the degrees of freedom reduces the difference between the two. With this
the z_(α/2) value changes as well, so you'll need to use a table to get the
correct number for a given number of degrees of freedom (which is n − 1, where n
is the sample size). The margin of error may be calculated as:

margin of error = z_(α/2) · √(s²/n)

We can use this to estimate the true mean from a sample mean. That is, with
a given confidence we can say that the true mean is somewhere inside the calcu-
lated confidence interval, which extends from the sample mean by the calculated
margin of error in both directions.
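In R this whole procedure is one call; a sketch on made-up data:

```r
x <- c(61, 64, 68, 70, 72, 75, 78, 80)  # made-up sample
n <- length(x)

t_crit <- qt(0.975, df = n - 1)       # replaces 1.96 for n - 1 df
margin <- t_crit * sqrt(var(x) / n)   # the margin of error
ci <- c(mean(x) - margin, mean(x) + margin)
ci                                    # 95% CI for the true mean

t.test(x)$conf.int                    # built-in equivalent
```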
2. n needs to be large enough, so that the central limit theorem may kick in
and p̂ have a normal distribution.
For extreme theoretical p values, larger sample counts are required to achieve
the same confidence. For instance, in the case of a coin flip, if p is 1/2, a
hundred samples may be enough for the central limit theorem to kick in and achieve
a 95% confidence. However, if p = 0.01 we may need 1000 samples for the same
confidence. An explanation for this is that the normal distribution is a
continuous model, and we are using it to estimate discrete values.
In order for the confidence interval procedure for the true proportion to provide
reliable results, the total number of subjects surveyed should be large. If the true
population proportion is close to 0.5 a sample of 100 may be adequate. However,
if the true population proportion is closer to 0 or 1 a larger sample is required.
Increasing the sample size will not eliminate non-response bias. In the presence
of non-response bias, the confidence interval may not cover the true population
parameter at the specified level of confidence, regardless of the sample size. An
assumption for the construction of confidence intervals is that respondents are
selected independently.
In the case of means the conditions are:
The t distribution works extremely well even with a low number of samples
(n = 10) if the theoretical model is a normal or skewed normal one. For this not
to hold we need some really extreme distribution, for example one that stays at
one end most of the time but has some chance of producing an outlier value.
However, even in these cases, increasing the sample size (to around 40) may
already bring the actual coverage to around 90%.
Nevertheless, we also need to consider whether estimating the mean is a meaningful
thing to do. Remember that the mean is not a robust measurement: it's
affected by outliers, something that is true for the distribution too.
A method for constructing a confidence interval is robust if the resulting con-
fidence intervals include the theoretical parameter approximately the percentage
of time claimed by the confidence level, even if the necessary conditions for the
confidence interval aren't satisfied.
The evidence
In court, to prove our claim we'll need evidence. In statistics the evidence is
provided by the data we have. In order to make it useful we have to summarize
our data into a test statistic, which is a numeric representation of our data. This
is always formulated assuming that the null hypothesis holds. In the case of
the plastic surgery example we summarize the data assuming that the perceived age
difference is zero, as our null hypothesis states.
Deliberation
Once the evidence is presented, the judge/jury deliberates whether beyond a
reasonable doubt the null hypothesis holds or not. In statistics the tool of
deliberation is the p–value. This transforms a test statistic into a probability
scale: a number between zero and one that quantifies how strong the evidence is
against the null hypothesis.
At this point we ask: if H0 holds, how likely would it be to observe a test
statistic of this magnitude or larger just by chance? The numerical answer to this
question is the p–value. The smaller this is, the stronger the evidence against the
null hypothesis. However, it is not a measurement of how likely it is
that H0 is true.
H0 either is or is not true; it is not a random variable, because it is either
true or not true, with no randomness involved. The p–value just tells you how unlikely
the observed test statistic is if H0 were true.
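The question above translates directly into a tail probability. As a minimal sketch (using only `NormalDist` from Python's standard library, for tests whose statistic is approximately standard normal; the helper name is mine):

```python
from statistics import NormalDist

def p_value(z, alternative="less"):
    """One- or two-sided p-value for a standard-normal test statistic z."""
    std = NormalDist()            # standard normal, mean 0, sd 1
    if alternative == "less":     # HA: parameter < null value
        return std.cdf(z)
    if alternative == "greater":  # HA: parameter > null value
        return 1 - std.cdf(z)
    # two-sided: probability of a statistic at least this extreme in magnitude
    return 2 * (1 - std.cdf(abs(z)))

print(p_value(-5.17))   # very small: strong evidence against H0
print(p_value(-1.30))   # ≈ 0.097: weak evidence against H0
```

Both example values reappear in the mayor's-support calculation later in this chapter.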
The verdict
The fourth and final step is the verdict. Evidence that is not strong enough corresponds to
a high p–value. With this we conclude that the data is consistent with the null
hypothesis. However, we have not thereby proved that it is true, although we cannot
reject it either. A small p–value constitutes sufficient evidence against H0 , and we
may reject it in favour of HA . In this case the result is statistically significant.
• H0 ∶ p = 0.5
• HA ∶ p < 0.5
We want to compute the p–value. We start from our theoretical model:

test statistic = (p̂ − p) / √(p(1 − p)/n) ≈ Normal(0, 1)
Now by doing the substitution for the known values:
(p̂ − p) / √((1/2)(1 − 1/2)/1046) ≈ Normal(0, 1)

P(p̂ − p < −0.08) = P( (p̂ − p) / √((1/2)(1 − 1/2)/1046) ≤ −0.08 / √((1/2)(1 − 1/2)/1046) ) ≈ P(Normal(0, 1) ≤ −5.17) ≈ 1/9,000,000
This is an extremely small value, which means that our evidence against the
null hypothesis is strong; therefore we may reject it, and conclude that the mayor's
support is less than 50%. If we were to put a lower number into the equation
we could check just how small the mayor's support is. For example, in the case of 44%
we'll get a number of 0.0968, which means that under the null hypothesis we have
a ≈ 10% chance to sample this, and that is not strong evidence against H0 .
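The arithmetic of this example can be reproduced in a few lines; the values n = 1046 and p̂ = 0.42 come from the example itself, while the helper name is mine:

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p(p_hat, p0, n):
    """p-value for HA: p < p0, using the normal approximation."""
    se = sqrt(p0 * (1 - p0) / n)     # standard error under H0: p = p0
    z = (p_hat - p0) / se
    return z, NormalDist().cdf(z)    # P(Normal(0,1) <= z)

# Observed sample proportion p_hat = 0.42 out of n = 1046 respondents.
print(one_sided_p(0.42, 0.50, 1046))  # z ≈ -5.17, p-value ≈ 1/9,000,000
print(one_sided_p(0.42, 0.44, 1046))  # z ≈ -1.30, p ≈ 0.096 (0.0968 up to rounding)
```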
If we have a two–sided hypothesis, then instead of P(Z ≥ test statistic) we use
P(∣Z∣ ≥ test statistic) = P(Z ≥ test statistic) + P(Z ≤ −test statistic),
which in the case of symmetric distributions translates to 2 ⋅ P(Z ≥ test statistic).
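This identity is easy to verify numerically; the value z = 1.96 below is just an arbitrary illustration:

```python
from statistics import NormalDist

std = NormalDist()
z = 1.96                                     # arbitrary illustrative statistic
one_sided = 1 - std.cdf(z)                   # P(Z >= z)
two_sided = 2 * one_sided                    # symmetric-distribution shortcut
both_tails = (1 - std.cdf(z)) + std.cdf(-z)  # P(Z >= z) + P(Z <= -z)
print(two_sided, both_tails)                 # the two computations agree
```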
(X̄ − µ) / √(s²/n) ≈ t_{n−1}
We also know from our observation that X̄ − µ = 7.177 − 0 = 7.177. The p–value
is the probability of observing such an extreme value given that H0 is true.
P(X̄ − µ ≥ 7.177) = P( (X̄ − µ) / √(s²/n) ≥ 7.177 / √(s²/n) )
Now we have something whose distribution we know:
P( t_{59} ≥ 7.177 / √(2.948²/60) ) = P(t_{59} ≥ 18.86) ≈ 1/10²⁶
This is really small, and thus we can reject the null hypothesis.
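The t statistic itself can be recomputed in a couple of lines; Python's standard library has no t-distribution CDF, so the tiny tail probability is left to a table or a statistics package. The numbers are those of the example (observed mean difference 7.177, s = 2.948, n = 60):

```python
from math import sqrt

# Values from the example: observed mean difference, sample sd, sample size.
x_bar_minus_mu = 7.177
s, n = 2.948, 60

t_stat = x_bar_minus_mu / sqrt(s ** 2 / n)
print(f"t = {t_stat:.2f} with {n - 1} degrees of freedom")
# → t = 18.86 with 59 degrees of freedom
```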
significance level of a test gives a cut–off value for how small is small for a p–
value. We denote it by α. This gives a definition for the "reasonable doubt"
part of the method. It shows how the testing would perform in repeated
sampling: if we collect test statistics again and again, sometimes we will get
weird data just by chance. So at a significance level of 1%, in 100 repeated
samples we will draw the wrong conclusion (reject a true H0 ) in about one
case. Setting it too small, however, would mean that we essentially never
reject H0 .
power of the test is the probability of making a correct decision (by rejecting the
null hypothesis) when the null hypothesis is false. Higher power tests are
better at detecting a false H0 . The power is affected as follows:
• power increases the further our alternative hypothesis is from the null
hypothesis (however, in practice this is rarely in our control);
• if α is increased then less evidence is required from the data in order to
reject the null hypothesis; therefore for increased α, even if the null
hypothesis is the correct model for the data, it is more likely that the
null hypothesis will be rejected;
• less variability (a smaller σ) increases the power;
• increasing the sample size also helps.
To determine the sample size needed for a study whose goal is to get a
significant result from a test: set α and the desired power, decide on
an alternative value that is practically interesting, estimate σ, and calculate
the sample size required to give the desired power.
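That recipe can be sketched for a one-sided z test with known σ, using the standard formula n = ((z₁₋α + z_power) · σ / δ)²; the function name and the numbers in the example are illustrative assumptions, not from the text:

```python
from math import ceil
from statistics import NormalDist

def sample_size(alpha, power, sigma, delta):
    """Sample size for a one-sided z test to detect a mean shift of delta.
    Standard formula: n = ((z_{1-alpha} + z_power) * sigma / delta) ** 2."""
    std = NormalDist()
    z_alpha = std.inv_cdf(1 - alpha)  # critical value for the significance level
    z_power = std.inv_cdf(power)      # quantile corresponding to the power
    return ceil(((z_alpha + z_power) * sigma / delta) ** 2)

# e.g. alpha = 0.05, power = 0.80, sigma = 10, practically interesting shift 2
print(sample_size(0.05, 0.80, 10, 2))
```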
type I error is incorrectly rejecting the null hypothesis, because by chance we
got data that looked unlikely under it. For a fixed significance level α the
probability of this happening is α.
type II error is incorrectly failing to reject the null hypothesis; the probability of
this is β = 1 − power. This usually happens because we do not have enough
power, due to a small sample size.
Generally we would like to decrease both types of errors, which can be done
by increasing the sample size (like having more evidence in a trial). It is context
dependent which of the errors is more important.
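Both error rates can be estimated by simulation. The sketch below uses illustrative numbers of my choosing: a one-sided z test at α = 0.05 with n = 25, a true mean of 0 when H0 holds and 0.5 when it does not, and σ = 1:

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(1)
z_crit = NormalDist().inv_cdf(0.95)   # one-sided z test at significance 0.05

def reject(true_mean, n, sigma=1.0):
    """Draw a sample of size n and test H0: mu = 0 against HA: mu > 0."""
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    z = mean(sample) / (sigma / sqrt(n))
    return z > z_crit

trials = 10_000
type1 = sum(reject(0.0, 25) for _ in range(trials)) / trials  # H0 is true
power = sum(reject(0.5, 25) for _ in range(trials)) / trials  # H0 is false
print(f"type I error rate ≈ {type1:.3f} (should be near α = 0.05)")
print(f"type II error rate β ≈ {1 - power:.3f}, power ≈ {power:.3f}")
```

Re-running with a larger n shows both the type I rate holding at α and β shrinking, matching the remark above about increasing the sample size.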
• chance,
If multiple tests are carried out, some are likely to be significant by chance alone.
So we need to be suspicious when we see a few significant results among many tests
that have been carried out, such as significant results on a few subgroups of the data.
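A quick simulation illustrates this: run 100 tests on pure noise, where H0 is true in every single one, and count how many come out "significant" at α = 0.05 anyway:

```python
import random
from statistics import NormalDist

random.seed(2)
std = NormalDist()

# 100 one-sided tests on pure noise: the null hypothesis is true every time,
# so each p-value is uniformly distributed on (0, 1).
alpha = 0.05
p_values = [1 - std.cdf(random.gauss(0, 1)) for _ in range(100)]
significant = sum(p < alpha for p in p_values)
print(f"{significant} of 100 true-null tests came out 'significant' anyway")
```

On average about 5 of the 100 tests cross the threshold by chance alone, which is exactly the pattern to be suspicious of.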
Test results are not reliable if the statement of the hypothesis was suggested by
the data. This is called data snooping. Generally the primary hypothesis should
be stated before any data is collected. Seeing some relation in the data and then
testing for it is not good, as in this case the p–value does not mean anything.
The tests presented here require:
• independent observations