
Assumptions I


Assumptions for Parametric Tests - I

Fall '20, PSY 201, Ali İ. Tekcan


Assumptions

• The tests we covered (z-test, t-tests) are examples of parametric tests, because they involve procedures to test hypotheses about population parameters.

• Each parametric test has certain assumptions that need to be satisfied before the actual statistical test can be carried out and meaningfully interpreted.

• Violation of each assumption may bring different problems with it.



Outline

• There are three issues we will discuss with regard to the assumptions of parametric tests:

– What are the assumptions for z- and t-tests?

– How do you know if the assumptions are satisfied or not?

– What to do when the assumptions are violated?


What are the assumptions?

Measurement Scale for DV


The DV should be measured on an interval or a ratio scale.

Independence of Observations
Observations should be independent from each other. There
should be no effect of an individual’s scores on others.
What are the assumptions?

Normality
The population(s) from which the samples come are normally distributed on the DV.

Homogeneity of Variance
Populations from which the samples come should have equal variances on the DV.
Assumptions across tests

ASSUMPTION                  single z-t   between-t   within-t

METHODOLOGICAL
Measurement Scale               +            +           +
Indep. of observations          +            +           +

STATISTICAL
Normality                       +            +           +
Homogeneity of Variance         -            +           -
Why should assumptions be satisfied?

Measurement Scale
When a variable is measured on scales lower than interval,
mathematical operations such as taking square roots are not
meaningful. This makes calculation of essential information
such as variability problematic.

Independence of Observations
Observations should be independent from each other so that
the measures such as mean and standard deviation are not
biased in estimating population parameters.
Normality of the Distribution(s) for the DV

If the normality assumption is not satisfied, the main problem is that the probabilities corresponding to different values of the statistic (z or t) are less applicable and less accurate.

For instance, in a z distribution the probability of obtaining a z value of 2.27 or larger is .0116. However, if normality is not satisfied, then we cannot guarantee that this value applies to our data.

Depending on the exact nature of our (obtained) distribution, the probability corresponding to the z value might be more or less. Therefore, we might have a greater chance of Type I or Type II error.
Homogeneity of Variance

In an independent-measures t-test, the variances of the two groups (conditions) should not be different (heterogeneous). They should be homogeneous.

When the variances are not homogeneous, the pooled variance is not representative of the combined variance in the population, and therefore may bias the obtained t-value and the resulting statistical decision.

This might especially be a problem when sample sizes are not equal. If the large variance comes from the smaller sample, that increases the Type I error rate. If the large variance comes from the larger sample, then that increases the Type II error rate.
The End


TESTING ASSUMPTIONS FOR PARAMETRIC TESTS

This is prepared for students registered in PSY 202, and should not be
distributed to others.

WHAT ARE THE ASSUMPTIONS OF THE PARAMETRIC TESTS?


There are typically four assumptions associated with parametric tests. Not all
are applicable to all the tests. The table below gives a summary of which
assumptions are applicable to which tests.

ASSUMPTION                       Single z & t   Between t   Within t   Correlation
Measurement Scale                     ✔             ✔           ✔           ✔
Independence of Observations          ✔             ✔           ✔           ✔
Normality                             ✔             ✔           ✔           ✔
Homogeneity of Variance               ✗             ✔           ✗           ✔

Measurement Scale
The dependent variable (DV) or the outcome variable (OV) should be
measured on an interval or a ratio scale.

Independence of Observations
Observations (participant responses) should be independent from each other.
There should be no effect of one’s scores on others.

Normality
The population(s) from which the data come should be normally
distributed on the DV.

Homogeneity of Variance
Populations from which the data come should have equal variances on the
DV.

WHY SHOULD THE ASSUMPTIONS BE SATISFIED?
The assumptions need to be satisfied so that the statistical conclusions you
reach would be valid. These assumptions contribute to the validity of the test in
different ways, as described below.

Measurement Scale
With a DV measured on an interval/ratio scale, it is possible to perform operations such as multiplication, division, and squaring, which are essential for the calculation of important measures such as standard deviations. Ordinal or nominal scales are not appropriate for that; you cannot take the square of ranks, for instance.

Independence of Observations
This makes the samples more representative of the population and is one way of
making your sample resemble a randomly selected sample.

Normality
If this assumption is not satisfied, the main problem would be that the critical values (z, t, p) of the theoretical distributions would be less applicable to your data. For instance, in a z distribution the probability of obtaining a z value of 2.27 or larger is .0116. However, if normality is not satisfied, we cannot guarantee that this value applies to our data. Depending on the exact nature of our (obtained) distribution, the probability corresponding to the z value of 2.27 might be more or less. Therefore, we might have an increased chance of Type I or Type II error, depending on the specific characteristics of the non-normality.

Homogeneity of Variance
First, remember that this is an issue only when you have two or more groups (two or
more levels of the IV). The problems mentioned for normality are applicable to
homogeneity of variance as well: when the variances are not homogeneous, the
main problem is that the p values based on the theoretical t distributions may not be
applicable. This results in increased Type I or Type II error. These problems are
more likely when sample sizes are not equal. If the large variance comes from the
smaller sample, that increases Type I error rate. If the large variance comes from the
larger sample, then that increases Type II error rate.
Keep in mind that, in general, sample sizes are considered unequal when one sample is 1.5 times larger than the other one.

HOW DO WE KNOW THAT THE ASSUMPTIONS ARE VIOLATED?

Measurement Scale
The measurement scale assumption should be taken care of before data collection.
You select your DV so that it is measured on a ratio or an interval scale.

Independence of Observations
The best way to achieve independence would be random assignment in an
experimental situation. If two close friends signed up for an experiment, random
assignment would increase the chance that they will go into separate conditions
(compared to if people who signed up for the same time slot are assigned to the
same condition). It’s more difficult to achieve if it is a non-experimental study, such
as when a group of people are asked to fill out questionnaires. In such cases, you
should try to increase the variety of people that you invite/take into your samples, so
that the problems of dependence would have less of an effect.

Normality
You use a variety of tools to determine whether normality is satisfied. In single
sample z and t tests this means that the population for the DV is normally distributed.
When you have a between- or within-subjects t-test, this means that the DV is
normally distributed in the two populations (corresponding to the two conditions). In other
words, you should look at two distributions separately to see if they are both normally
distributed. There are a number of tools that you can use in conjunction to test
normality. They are explained below:

I. Descriptive Statistics
Four of the descriptive statistics are especially useful for determining normality:
mean, median, skew, and kurtosis. A simple tool is to look at the difference between
the Mean and the Mdn. If they are close to each other, then this would typically tell
you that there are no extreme values pulling the mean to one side,
indicating a normal distribution. If, on the other hand, there is a noticeable/large
difference between the Mean and the Mdn, this indicates that there are (more)
extreme values on one side of the distribution, leading to skew and a tendency toward
a non-normal distribution.
More direct information comes from the skew statistic. If the skew value is
between -.5 and +.5, then you can safely assume that there is no skew. Values
around -1 or +1 indicate moderate skew. Values beyond -1.5 or +1.5 indicate
substantial skew. There is a more formal test of whether the skew (and kurtosis) is
large enough to create problems. This involves hypothesis testing where you test
H0: skew is zero against H1: skew is non-zero (it is the same idea for kurtosis).
The null hypothesis is tested with a z-test such that

    z_skew = skew / standard error of skew
    z_kurtosis = kurtosis / standard error of kurtosis

If |z_obt| > z_critical (1.96), then you reject H0 and conclude that there is significant
skew/kurtosis.
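
As a rough illustration, here is a minimal Python sketch of this z-test. The data are hypothetical, and the sqrt(6/n) and sqrt(24/n) standard errors are the common large-sample approximations; SPSS and Jamovi use slightly more exact formulas, so their results may differ somewhat.

    import numpy as np
    from scipy import stats

    scores = np.array([12, 16, 11, 13, 7, 12, 18, 36, 14, 15, 13, 12])  # hypothetical data

    n = len(scores)
    skew = stats.skew(scores)        # sample skewness
    kurt = stats.kurtosis(scores)    # excess kurtosis (0 for a normal distribution)

    se_skew = np.sqrt(6 / n)         # approximate standard error of skew
    se_kurt = np.sqrt(24 / n)        # approximate standard error of kurtosis

    z_skew = skew / se_skew
    z_kurt = kurt / se_kurt

    # Reject H0 ("skew/kurtosis is zero") if |z| exceeds the critical value of 1.96
    print(f"z_skew = {z_skew:.2f}, significant: {abs(z_skew) > 1.96}")
    print(f"z_kurt = {z_kurt:.2f}, significant: {abs(z_kurt) > 1.96}")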

II. Visual Information


In evaluating normality, you should always look at the visual information to
see whether it leads to the same conclusion as the other pieces of information. The basic
visuals used are the histogram, the stem-and-leaf plot, and the box-plot.
There are no clear criteria by which you can judge visually whether a
distribution is normal or not; this is a skill that will develop as you see more data. But
keep in mind that the perfect normal shape that you see in theoretical distributions can never be
obtained with real data. What you have to determine is whether the distribution looks “normal
enough”.
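
If you wanted to produce these visuals in Python rather than in Jamovi/SPSS, a minimal sketch (with made-up data) might look like the following; this is only an illustration, not a required procedure.

    import numpy as np
    import matplotlib.pyplot as plt

    scores = np.random.default_rng(1).normal(loc=100, scale=15, size=60)  # hypothetical data

    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    axes[0].hist(scores, bins=12)        # histogram: look for rough symmetry / bell shape
    axes[0].set_title("Histogram")
    axes[1].boxplot(scores)              # box-plot: look for symmetry and outlying points
    axes[1].set_title("Box-plot")
    plt.tight_layout()
    plt.show()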

III. Tests of Normality


There are several formal statistical tests used for judging normality. SPSS
reports two of these tests: Kolmogorov-Smirnov and Shapiro-Wilk. Both tests involve
significance (hypothesis) tests where the Ho states that the distribution is normal and
H1 states that the distribution is non-normal. In these cases, if Ho is rejected, it means
that the distribution is non-normal.
Although both tests test normality, they do so in different ways and pay more
attention to different components of the distributions. Therefore, they do not always
lead to the same conclusion regarding normality.
Kolmogorov-Smirnov (K-S): It compares the theoretical normal distribution
with the distribution of data at hand, by comparing the cumulative frequency
distributions of the theoretical vs the actual. It is more sensitive to deviations from
normality in the mid-sections of the distribution and not that sensitive to differences in
tails. It is considered to be more appropriate for samples of ≥ 50, and not very
reliable for n < 50.
Shapiro-Wilk (S-W): S-W works well with both large and small samples, and is
therefore generally preferred over K-S. The weakness of S-W is that it is overly
sensitive (liberal) with large samples, so even minor deviations from normality might
lead to a significant result. Moreover, it is not very reliable when there are too many
repeating values.
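
As a rough illustration, both tests can be run in Python with SciPy. The data below are made up; also note that SPSS/Jamovi report a Lilliefors-corrected K-S test, and the plain K-S test here (with the mean and SD estimated from the sample) only approximates it.

    import numpy as np
    from scipy import stats

    scores = np.random.default_rng(2).normal(loc=50, scale=10, size=60)  # hypothetical data

    # Shapiro-Wilk: H0 = the distribution is normal
    sw_stat, sw_p = stats.shapiro(scores)

    # Kolmogorov-Smirnov against a normal with the sample mean and SD
    ks_stat, ks_p = stats.kstest(scores, "norm", args=(scores.mean(), scores.std(ddof=1)))

    print(f"Shapiro-Wilk: W = {sw_stat:.3f}, p = {sw_p:.3f}")
    print(f"K-S:          D = {ks_stat:.3f}, p = {ks_p:.3f}")
    # p > .05 on both tests -> no evidence against normality (fail to reject H0)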

How to reach a conclusion about normality


The answer is to use all of the above information in conjunction: look at all these
pieces of information and then try to come up with a summary of what they tell you.
This is obviously easier said than done. As you look at more examples and more
data, you’ll develop a sense for this kind of summary. Here are some hints:

• With regard to the tests, if both K-S and S-W are giving you non-significant
results (telling you that the distribution is normally distributed), then you can
be quite confident that the data is actually normally distributed. But still check
the descriptives and the visuals to make sure they are converging.

• If one or both of the tests are telling you that the data is non-normal, first ask
yourself whether potential problems associated with the tests mentioned
above (sample size etc.) might be playing a role. For instance, the sample
might be very large, so that may be why both tests turn out to be significant.
Or, the S-W might be significant when K-S is not. If that is the case, this could
possibly come from the fact that there are many repeating values in the data
(e.g., the DV is the number of children in a sample of 60 families. There will
be a lot of 1s and 2s and very few of other possible values.). These will affect
the degree to which you’ll pay attention to results of each test.

• Use the visuals to see if the distribution agrees with the descriptive and/or the
statistical tests above. Visuals are very important because sometimes you’ll
see that your evaluation based on them will disagree with the descriptive
(e.g., the skew value) or inferential statistical information (e.g., K-S test). In
that case, remember that the visual information may be as important and
powerful as statistical information.

• When the statistical information (especially the significance tests K-S or S-W
for skew/kurtosis) disagrees with the visual, it will typically be the case that
statistics will tell you that there is significant skew/kurtosis and the visual will
tell you otherwise. This will be especially true when you have large samples.
(Remember, large samples are more likely to lead to significant results in
general, despite the fact that the effect might not be large). It is rarer that the
tests tell you there is no skew/kurtosis, whereas the visuals say there is.

• Finally, remember from our discussions in PSY 201 the basic rule regarding
normality. We almost never know whether the population is distributed
normally, so the only way to assume that is to have large samples (which
means that the sampling distribution is normally distributed). So, having a
large sample should achieve a lot in terms of having a normally distributed
sample. (interesting irony: when the sample is large, K-S & S-W will turn out
to be significant, telling you that the distribution is not normal ! That is why
you should always consider information other than the tests)

• THEREFORE, NEVER BASE YOUR DECISION OF NORMALITY ON


STATISTICAL TESTS ALONE. ALWAYS CONSIDER THE VISUALS AND
THE DESCRIPTIVES ALONG WITH THE TEST.

Homogeneity of Variance

This is an assumption that applies when you have at least two levels of the
independent/predictor variable. So it is applicable to independent measures t-test
and independent-measures ANOVA.

I. A Rule of Thumb
The first thing to do would be to look at the standard deviations of the groups that
you are comparing. Look at each one individually, and ask the question: “is this too
large a standard deviation for this variable?”. For instance, if, again, the DV is the
number of children a family has, an SD of 5.3 would seem too large, given that the
mean would be around 2. If that is the case, check your data file to make sure
there is no entry error, and then try to see if there are outliers (more on that later).
This gives you a general sense of the magnitude of the variability, but does not say
much about the homogeneity of variance assumption.

Then, the important thing is whether the two standard deviations are different. This is difficult
to determine by just comparing the numbers. For instance, if SD1 = 18.21 and SD2 =
24.6, can we say that the variances are not homogeneous? There is a rule of thumb
used in these cases: if the variance of one group is less than 3 times the
variance of the other group, we can assume that the variances are homogeneous. If you
are carrying out an ANOVA, the largest variance should not be more than 3 times
the smallest variance (because you have more than two groups). Some suggest
using 2 times rather than 3 times as the cut-off point. This is a matter of personal
choice to some degree, but you can use the following principle: if the ratio is less than 2,
assume homogeneity; if it is more than 3, homogeneity is violated. If it is in between,
look at where it falls between 2 and 3, and use judgment. This is for a quick decision on
homogeneity and cannot really be used when you officially report results.
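
Here is a minimal sketch of this rule-of-thumb check in Python; the two groups are hypothetical.

    import numpy as np

    group1 = np.array([18.2, 21.5, 19.8, 25.1, 17.3, 22.9])   # hypothetical scores, condition 1
    group2 = np.array([30.4, 12.7, 28.8, 15.2, 35.9, 10.1])   # hypothetical scores, condition 2

    var1 = group1.var(ddof=1)          # sample variances (n - 1 in the denominator)
    var2 = group2.var(ddof=1)
    ratio = max(var1, var2) / min(var1, var2)

    if ratio < 2:
        print(f"ratio = {ratio:.2f}: assume homogeneity of variance")
    elif ratio > 3:
        print(f"ratio = {ratio:.2f}: homogeneity of variance is violated")
    else:
        print(f"ratio = {ratio:.2f}: borderline; use judgment")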

II. Statistical Tests.


The typical and most preferred way of testing the variance assumption is through
statistical tests. We’ll look at the two most frequently used tests: Levene’s test and
Hartley’s Fmax test (a short code sketch of both follows after the list below). Both test
homogeneity of variance by testing the null hypothesis H0: σ1² = σ2² (the alternative hypothesis is H1: σ1² ≠ σ2²).

• Levene’s test: This is the test Jamovi and SPSS use to determine
homogeneity of variance. It uses an ANOVA (F-test) to determine whether
variances of two or more groups are homogeneous. If the null hypothesis is
rejected, then we conclude that the variances are heterogeneous (different).

• Hartley’s F-Max test: This test can be considered a more formal version of
the rule of thumb described above. The obtained Fmax value is calculated and
then compared to the critical Fmax value. The formula for the obtained Fmax is:
Fmax = s²largest / s²smallest. Fmax critical values are based on a modified F
distribution. I provide an Fmax critical table as a separate file, but you can find
many versions of it on the web as well. If Fmax > Fmax-critical, H0 is rejected,
which means that the variances are heterogeneous.
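
Below is a minimal Python sketch of both tests with hypothetical data: Levene’s test via SciPy, and the obtained Fmax computed by hand (the critical Fmax still has to be looked up in an Fmax table).

    import numpy as np
    from scipy import stats

    group1 = np.array([18.2, 21.5, 19.8, 25.1, 17.3, 22.9])   # hypothetical data
    group2 = np.array([30.4, 12.7, 28.8, 15.2, 35.9, 10.1])

    # Levene's test: H0 = variances are equal. SPSS uses the mean-centered version,
    # so center="mean" is set here (SciPy's default is the median-centered version).
    lev_stat, lev_p = stats.levene(group1, group2, center="mean")
    print(f"Levene: F = {lev_stat:.3f}, p = {lev_p:.3f}")     # p < .05 -> heterogeneous

    # Hartley's F-max: obtained value = largest variance / smallest variance
    variances = [group1.var(ddof=1), group2.var(ddof=1)]
    f_max = max(variances) / min(variances)
    print(f"F-max (obtained) = {f_max:.2f}")                  # compare to the critical F-max table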

WHAT TO DO WHEN THE ASSUMPTIONS ARE VIOLATED

I. “Do Nothing” Approach or the “Robustness” Argument


One possibility is that when the assumptions are violated there is no need for any
further action, because parametric tests, especially the t-test and ANOVA, are robust
against violations of normality and homogeneity of variance. This means that even
when the assumptions are violated, the null distributions used to determine critical
values and the p-values (areas under the curve) associated with those statistics overlap
substantially with the distribution from which the sample comes. In other words,
there are negligible differences between the hypothetical null distribution and the
actual data distribution. When that is the case, Type I or Type II error rates do not
change much (the changes are minimal and acceptable).

Keep in mind that this argument of robustness holds true when the samples are large
and approximately equal in size, which typically means one sample is no more than 1.5
times larger than the other one.

II. Nonparametric Tests


When assumptions for parametric tests are violated, an alternative is to use
nonparametric tests. These are also known as assumption-free tests, because they can
be carried out regardless of these assumptions.

There is at least one non-parametric test that corresponds to each parametric test.
For instance, the Mann-Whitney U test is the nonparametric counterpart to the
independent-measures t-test, Spearman’s r is the nonparametric version of Pearson’s r,
and Friedman’s Analysis of Variance is the nonparametric counterpart to the
repeated-measures ANOVA.

The general approach of these tests is that the dependent/outcome variables that are
measured on a ratio/interval scale are changed into a lower measurement scale
(ordinal or nominal), and a test statistic is calculated and then compared with the
critical value for that test statistic.

In the Mann-Whitney U test, for example, you calculate a U value, compare it with
the critical value, and then make your statistical decision based on that comparison.
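
A minimal sketch of that comparison in Python with SciPy (which reports a p-value rather than requiring a critical-value table); the two groups below are hypothetical ratings.

    import numpy as np
    from scipy import stats

    group1 = np.array([3, 5, 4, 6, 2, 5, 4])      # hypothetical ratings, condition 1
    group2 = np.array([6, 7, 5, 8, 7, 6, 9])      # hypothetical ratings, condition 2

    u_stat, p_value = stats.mannwhitneyu(group1, group2, alternative="two-sided")
    print(f"U = {u_stat:.1f}, p = {p_value:.3f}")  # p < .05 -> the groups differ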

Although nonparametric tests are useful when assumptions are violated, they are
typically less powerful than parametric tests.

III. Data Transformation


Once you identify the DV/OV for which the normality or homogeneity of variance
assumptions are violated, this method involves replacing the data (variable) with a
mathematical function of that variable, so that a new variable is created and used as
the DV/OV. The function could be one of many alternatives, such as taking the
square root, logarithm, or arcsine of the original variable. Such a transformation keeps the
relationship between the variables in the original scale while diminishing the variability of
the scores.
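
A minimal sketch of these transformations in Python, applied to a made-up, positively skewed DV; the +1 inside the log is a common guard against zeros, and the arcsine version assumes the scores are proportions.

    import numpy as np

    scores = np.array([1.2, 0.8, 1.5, 2.1, 0.9, 7.4, 1.1, 9.8])   # hypothetical, skewed DV

    sqrt_scores = np.sqrt(scores)        # milder transformation, for mild violations
    log_scores = np.log(scores + 1)      # stronger transformation, for more extreme violations

    proportions = np.array([0.10, 0.25, 0.40, 0.85, 0.95])        # hypothetical proportions
    arcsine_scores = np.arcsin(np.sqrt(proportions))               # arcsine transformation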

A single transformation may actually help both with non-normality and homogeneity
of variance problems; so it may be very useful. However, one potential issue is that it
may not always be clear what the new (transformed) variable really is. Assume that
we are interested in whether two groups are different in self-esteem, as measured on
a 10-pt scale. When you transform that variable so that the new one is the “log of
self-esteem score”, what exactly is this new variable conceptually? Statistically, it
might lead to a better distribution, but conceptually it is difficult to understand what
you are comparing between the groups (you might end up saying that the two groups
are different in log of self-esteem). There is no guarantee that this variable is the
same thing as the original one. So, when you plan to use transformation, you should
consider this interpretation issue.

When a choice needs to be made among alternative transformations: typically, when
the violations are mild, the square root is preferred, and when they are more extreme,
the log may be preferred. One can try different transformations on the same data to see
which one leads to a distribution that is closer to a normal distribution with
reasonable variability. It is not OK, however, to try to find and use the transformation
that leads to a significant effect.

Transformation is rarely used to deal with violations. My view is that it should be used
only under extreme circumstances, and only for variables that make (more) sense when
the transformation is applied. One such example would be experiments where the DV
is reaction time. In such studies, the reaction time is mostly measured in milliseconds
and several substantial outliers may be seen.

IV. Outliers

Identifying Outliers
An outlier can be described as a data point that is far away from the bulk of the data,
but not every far-away (small or large) data point is an outlier.

First, dealing with outliers is not an assumption of the tests, but it is an important
element in data analysis. Outliers show their effects indirectly through their effect on
normality and homogeneity of variance. You should always look at the descriptive
statistics and the shape of the data before doing any kind of analyses. Examining
outliers is a part of that.

If there are “real” outliers, you should “deal with” them. An outlier could occur for a
number of different reasons. These include calculation, coding or data entry errors; in
those cases very large or very small values might be seen. On the other hand, the
outlier might be one of the actual accurate values from the sample. It might be a very
low rating a participant gave or a very tall person. These two types of outliers are
different in nature and should be dealt with differently. If they come from errors and
cannot be corrected, they should be trimmed (removed) from the data. If they are of the
second type, then you should carefully evaluate whether this can be considered an
outlier, by looking at the statistics below as well as the shape of the distribution, and
the remaining values in the data set.

Not every large/small value that seems far away from the rest of the data is an
outlier. Below are two ways in which you can identify outliers numerically. However,
you should also evaluate these outliers in terms of the hypothesis you are testing and
determine whether to treat them as outliers or a natural part of your data.

There are two ways you can spot an outlier numerically: 1) any value in the data set
that is more than 3 × IQR away from the rest of the data (in either direction) can be
considered an outlier, 2) any value in the data set that has a z-value beyond ±2.5.
(There are some who suggest z = ±3, which may also be used.)*
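
A minimal Python sketch of both numerical checks, using made-up height data (cf. the 197 cm example below). The 3 × IQR and z = 2.5 cut-offs follow the text; the IQR rule is implemented here as "more than 3 × IQR beyond the quartiles," which is one common reading of the rule, and the two rules will not always flag the same values.

    import numpy as np

    heights = np.array([168, 172, 165, 170, 169, 171, 167, 166, 197, 170, 168, 169])  # hypothetical

    # Rule 1: values more than 3 x IQR beyond the quartiles
    q1, q3 = np.percentile(heights, [25, 75])
    iqr = q3 - q1
    iqr_outliers = heights[(heights < q1 - 3 * iqr) | (heights > q3 + 3 * iqr)]

    # Rule 2: values with |z| > 2.5
    z_scores = (heights - heights.mean()) / heights.std(ddof=1)
    z_outliers = heights[np.abs(z_scores) > 2.5]

    print("IQR rule flags:", iqr_outliers)       # flags 197 here
    print("z-score rule flags:", z_outliers)     # also flags 197 here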

As I noted above, once you identify a value as an outlier, evaluate it within the
context of the actual values in the distribution. Somebody who is 197 cm tall in a
class where the mean is 168 cm could turn out to be an outlier, technically. Or,
somebody who gives a rating of 5 on a self-esteem scale where the mean is 2.12
could technically be identified as an outlier. But, because this person gave a rating
within the normal range of values, one should be careful about calling this an outlier,
because it seems that it is a natural part of the data.

* An important note about using the z-value to determine outliers. Remember that in a typical
normal distribution it is possible to find values that have a z-value of 2.5 or 3. So such
extreme values are part of the distribution, and if you identify a score as an outlier just
because it has a specific z value, you might be labeling a natural part of a normal distribution
as an outlier. That is why you should look at the distribution as well when you make a decision
about an outlier.

Dealing with Outliers
There are a number of ways of minimizing or eliminating the effects of outliers on
your data analysis.
Trimming (Removing). Trimming refers to removal of outliers from the data set
so that they are not included in the analyses. (Do not confuse this with “trimmed
mean”, where a certain percentage of the top and the bottom of the data are
removed and the mean is calculated). When you trim data for outliers, you just
remove those outliers.
This is a valid strategy when you are sure that the outlier is a “real” outlier (see
above). It is also important that in a typical study there should not be many
outliers. There is another rule of thumb that says that you can remove up to 5% of the
data if they are outliers. If you have more outliers, you should look at your design,
manipulation, the DV etc.
Trimming can be done at two levels: you either remove the participant with an
outlier completely from the data set, so that participant never goes into any of
the analyses, or you trim that participant only for the variables
where he/she was the outlier.
The potential disadvantages of trimming are a) you decrease the sample size
(and power), and b) you may be changing the nature of the data because you are
throwing out a legitimate data point that says something about the variable you are
investigating.
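
A minimal sketch of trimming in Python: outliers that were identified earlier are simply dropped before the analysis (again, hypothetical data; this is not the same as computing a trimmed mean).

    import numpy as np

    heights = np.array([168, 172, 165, 170, 169, 171, 167, 166, 197, 170, 168, 169])  # hypothetical
    outliers = np.array([197])                       # value(s) already judged to be real outliers

    trimmed = heights[~np.isin(heights, outliers)]   # the analysis then uses `trimmed`
    print(heights.mean(), trimmed.mean())            # mean with and without the outlier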

Winsorizing*. This is a technique where an outlier is not removed from the
analysis, but rather replaced by another value. More specifically, an outlier is
replaced by the closest value that is not considered to be an outlier. An outlier at the
high end of the distribution is replaced by the largest value that is not an outlier,
and an outlier at the low end is replaced by the smallest value that is not an outlier.
For a data set such as 12, 16, 11, 13, 7, 12, 18, 36: if 36 is found to be a real outlier,
winsorizing means that you replace 36 with 18, the largest value that is not an outlier.
On the lower end of the distribution, if 7 is an outlier, winsorizing means that 7 is
replaced with 11, the smallest value that is not an outlier. An advantage of winsorizing
is that you do not lose participants (and therefore power), while the data still reflect
that the score was at the high or the low end of the distribution.
* In case you are interested, “Winsorizing” has no inherent meaning. John Tukey,
who invented the stem-and-leaf plots, came up with this procedure and named it in
honor of Charles Winsor, who initially suggested a similar procedure for dealing with
outliers.
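
A minimal sketch of winsorizing the worked example above by hand in Python. (scipy.stats.mstats.winsorize does something related, but it replaces a fixed percentage of each tail rather than specific, identified outliers, so the manual version is shown here.)

    import numpy as np

    scores = np.array([12, 16, 11, 13, 7, 12, 18, 36])

    non_outliers = scores[scores != 36]                         # everything except the identified outlier
    winsorized = np.where(scores == 36, non_outliers.max(), scores)
    print(winsorized)                                           # [12 16 11 13  7 12 18 18]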
