Data Transformation

Data transformation can correct deviation from normality and uneven variance (heteroscedasticity).
The right transformation can allow you to use parametric statistics.
The need for transformations

We should always check the assumptions that the data follow a Normal distribution with uniform variance.
- If the data meet the assumptions, we can analyze the raw data as described.
- If they do not, we have two possible strategies:
1- We can use a method which does not require these assumptions, such as a rank-based method.
2- We can transform the data mathematically to make them fit the assumptions more closely before analysis.
There are three commonly used transformations for quantitative data: the logarithm, the square root, and the reciprocal. We call these transformations variance-stabilizing because their purpose is to make variances the same.

If we have several groups of subjects and calculate the mean and variance for each group, we can plot variability against mean. We might have one of these situations:
- Variability and mean are unrelated. We do not usually have a problem and can treat the variances as uniform. We do not need a transformation.
- Variance is proportional to mean. A square root transformation should remove the relationship between variability and mean.
- Standard deviation is proportional to mean. A logarithmic transformation should remove the relationship between variability and mean.
- Standard deviation is proportional to the square of the mean. A reciprocal transformation should remove the relationship between variability and mean.
For most data encountered in healthcare research, the first or third situation applies.
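The four situations above can be told apart numerically as well as by eye. The sketch below is only a rough heuristic, not a method from the source: it assumes all group means are positive, fits the slope of log(SD) against log(mean) across groups, and picks the nearest of the four cases (slope 0, 0.5, 1 or 2). The function names and the example groups are hypothetical.

```python
import math
from statistics import mean, stdev

def sd_mean_slope(groups):
    """Least-squares slope of log(SD) against log(mean) across groups."""
    xs = [math.log(mean(g)) for g in groups]
    ys = [math.log(stdev(g)) for g in groups]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def suggest_transformation(slope):
    """Map the slope to the nearest of the four situations listed above."""
    # SD unrelated to mean -> 0; variance ~ mean -> 0.5; SD ~ mean -> 1; SD ~ mean^2 -> 2
    options = {0.0: "none", 0.5: "square root", 1.0: "logarithm", 2.0: "reciprocal"}
    nearest = min(options, key=lambda target: abs(slope - target))
    return options[nearest]

# Hypothetical groups, constructed so that the SD is proportional to the mean.
base = [8.0, 9.0, 10.0, 11.0, 12.0]
groups = [[scale * x for x in base] for scale in (1, 3, 10)]
slope = sd_mean_slope(groups)
print(round(slope, 2), suggest_transformation(slope))
```

With these constructed groups the slope should come out close to 1, which corresponds to the logarithmic case. With real data the plot of variability against mean remains the primary check.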
Variance-stabilizing transformations also tend to make distributions Normal. There is a mathematical reason for this, as for so much in statistics.
It can be shown that if we take several samples from the same population, the means and variances of these samples will be independent if and only if the distribution is Normal. This means that uniform variance tends to go with a Normal distribution. A transformation which makes variance uniform will often also make data follow a Normal distribution, and vice versa.
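A small simulation can make this property visible. The sketch below draws many samples from a Normal distribution and from a skew (exponential) distribution and reports the correlation between sample means and sample variances. Correlation is only a rough proxy for independence, and the sample sizes, seed and distributions are arbitrary choices made for illustration.

```python
import random
from statistics import mean, variance

def mean_variance_correlation(draw, n_samples=2000, sample_size=10):
    """Pearson correlation between sample means and sample variances."""
    means, variances = [], []
    for _ in range(n_samples):
        sample = [draw() for _ in range(sample_size)]
        means.append(mean(sample))
        variances.append(variance(sample))
    m_bar, v_bar = mean(means), mean(variances)
    cov = sum((m - m_bar) * (v - v_bar) for m, v in zip(means, variances))
    ss_m = sum((m - m_bar) ** 2 for m in means)
    ss_v = sum((v - v_bar) ** 2 for v in variances)
    return cov / (ss_m * ss_v) ** 0.5

rng = random.Random(1)  # fixed seed so the run is reproducible
r_normal = mean_variance_correlation(lambda: rng.gauss(0.0, 1.0))
r_skew = mean_variance_correlation(lambda: rng.expovariate(1.0))
print(f"Normal: r = {r_normal:.2f}; exponential (skew): r = {r_skew:.2f}")
```

For the Normal samples the correlation should be near zero, while for the skew distribution samples with larger means also tend to have larger variances.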
Logarithmic transformation
- The most frequently used transformation is the logarithm.
- It is particularly useful for concentrations of substances in blood.
- The reason for this is that blood is very dynamic, with reactions happening continuously. Many of the substances we measure are part of a metabolic chain, both being synthesized and metabolized to something else.
- The rates at which these reactions happen depend on the amounts of other substances in the blood, and the consequence is that the various factors which determine the concentration of the substance are multiplied together.
- Multiplying and dividing tend to produce skew distributions.
- If we take the logarithm of several numbers multiplied together, we get the sum of their logarithms.
- So a log transformation produces something where the various influences are added together, and addition tends to produce a Normal distribution.
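The argument above can be checked with a simulation. The sketch below builds a hypothetical "concentration" as the product of ten positive random factors (the uniform range, the number of factors and the seed are all assumptions chosen for illustration) and compares the skewness before and after taking logs.

```python
import math
import random
from statistics import mean, pstdev

def skewness(values):
    """Moment coefficient of skewness (0 for a symmetric distribution)."""
    m, s = mean(values), pstdev(values)
    return mean([((v - m) / s) ** 3 for v in values])

rng = random.Random(7)  # fixed seed for reproducibility

def concentration(n_factors=10):
    """Hypothetical concentration: the product of several positive factors."""
    return math.prod(rng.uniform(0.5, 1.5) for _ in range(n_factors))

raw = [concentration() for _ in range(5000)]
logged = [math.log(v) for v in raw]
print(f"skewness raw = {skewness(raw):.2f}, log = {skewness(logged):.2f}")
```

The raw products should be strongly positively skew, while the logged values, being sums of the logged factors, should be much closer to symmetric.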
For example, the following figure shows serum cholesterol in stroke patients.
As we have seen, for the serum cholesterol in stroke patients data, the log transformation gives a good fit to the Normal. What happens if we analyze the logarithm of serum cholesterol then try to transform back to the natural scale?
For the raw data, serum cholesterol: mean = 6.34, SD = 1.40.
For log (base e) serum cholesterol: mean = 1.82, SD = 0.22.

If we take the mean on the transformed scale and backtransform by taking the antilog, we get exp(1.82) = 6.17. This is less than the mean for the raw data. The antilog of the mean log is not the same as the untransformed arithmetic mean.
Instead we get the geometric mean, which is found by multiplying all the n observations together and then taking the nth root. For example, the geometric mean of 4 and 9 is 6, found by multiplying 4 by 9 to give 36 and taking the square (or second) root. The geometric mean is never larger than the arithmetic mean, and is smaller unless all the observations are equal. For 4 and 9 the arithmetic mean is (4 + 9)/2 = 6.5. The mean of the logs is the log of the geometric mean, so the antilog of the mean log is the geometric mean.
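The arithmetic above can be reproduced in a few lines. This is a minimal sketch; the function names are illustrative, and natural logs are used to match the base e logs in the cholesterol example.

```python
import math

def geometric_mean(values):
    """Antilog of the mean of the natural logs; values must be > 0."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def arithmetic_mean(values):
    return sum(values) / len(values)

print(geometric_mean([4, 9]))     # approximately 6
print(arithmetic_mean([4, 9]))    # 6.5
print(round(math.exp(1.82), 2))   # 6.17, the back-transformed mean log cholesterol
```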
What about the units for the geometric mean?

If cholesterol is measured in mmol/L, the log of a single observation is the log of a measurement in mmol/L, and the antilog is back in the original units, mmol/L.
Even if a transformation does not produce a really good fit to the Normal distribution, it may still make the data much more amenable to analysis.
The following figure shows a histogram and Normal plot for the area of venous ulcer at recruitment.
The raw data have a very skew distribution, and the small number of very large ulcers might lead to problems in analysis. Although the log-transformed data are still skew, the skewness is much less and the data are much easier to analyze.
Making a distribution more like the Normal is not the only reason for using a transformation.
The following figure shows prostate specific antigen (PSA) for three groups of prostate patients: with benign conditions, with prostatitis, and with prostate cancer.
A log transformation of the PSA gives a much clearer picture. The variability is now much more similar in the three groups.
The square root

The square root is best for fairly weak relationships between variability and magnitude, i.e. variance proportional to mean, or standard deviation proportional to the square root of the mean. The square root can be used for variables which are greater than or equal to zero. The log and the reciprocal can only be used for variables which are strictly greater than zero, because neither the logarithm nor the reciprocal of zero is defined.
Example: arm lymphatic flow in rheumatoid arthritis with oedema.
The distribution is positively skew and the variability is clearly greater in the groups with greater lymphatic activity.
A square root transformation has the effect of making the data less skew and making the variation more uniform. In these data, a log transformation proved to have too great an effect, making the distribution negatively skew, and so the square root of the data was used in the analysis.
Reciprocal transformation
The reciprocal is best for very strong relationships between variability and mean, where the standard deviation is proportional to the square of the mean.
The reciprocal can only be used for variables which are strictly greater than zero. Of the three transformations, the square root removes the least skewness and the reciprocal removes the most.
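The domain restrictions for the three transformations can be enforced before any values are transformed. The helper below is a hypothetical sketch, not part of any library; the name transform and the example values are assumptions.

```python
import math

def transform(values, kind):
    """Apply a variance-stabilizing transformation after checking its domain."""
    if kind not in ("sqrt", "log", "reciprocal"):
        raise ValueError(f"unknown transformation: {kind}")
    if kind == "sqrt":
        if min(values) < 0:
            raise ValueError("square root needs values >= 0")
        return [math.sqrt(v) for v in values]
    # The log and the reciprocal are undefined at zero.
    if min(values) <= 0:
        raise ValueError(f"{kind} needs values strictly > 0")
    if kind == "log":
        return [math.log(v) for v in values]
    return [1.0 / v for v in values]

data = [0.0, 1.0, 4.0, 9.0]   # illustrative values that include a zero
print(transform(data, "sqrt"))          # [0.0, 1.0, 2.0, 3.0]
try:
    transform(data, "log")
except ValueError as err:
    print("log rejected:", err)
```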
Can all data be transformed?

Not all data can be transformed successfully.
1- Sometimes we have very long tails at both ends of the distribution, which makes transformation by log, square root or reciprocal ineffective. For example, the distribution of blood sodium in ITU patients is fairly symmetrical, but has longer tails than a Normal distribution. The shape of the Normal plot is first convex then concave.
2- Sometimes we have a bimodal distribution, which makes transformation by log, square root or reciprocal ineffective. For example, systolic blood pressure in a sample of ITU patients.
3- Sometimes we have a large number of identical observations, which will all transform to the same value whatever transformation we use. These are often at one extreme of the distribution, usually at zero. For example, in the distribution of coronary artery calcium in a large group of patients, more than half of the observations were equal to zero. Any transformation would leave half the observations with the same value, at the extreme of the distribution. It is impossible to transform these data to a Normal distribution.
4- Sometimes the p-value varies with the transformation used.
So what can we do if we cannot transform the data? It is usually safer to use methods that do not require these assumptions. These include the non-parametric methods.
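The third case in the list above, many identical observations at zero, can be illustrated with made-up numbers. The scores below are hypothetical and are not the coronary artery calcium data. The square root is used because it is defined at zero.

```python
import math

# Hypothetical calcium scores: more than half of the observations are zero.
scores = [0, 0, 0, 0, 0, 0, 12, 85, 340, 1200]

def share_at_minimum(values):
    """Fraction of observations tied at the smallest value."""
    lowest = min(values)
    return sum(v == lowest for v in values) / len(values)

rooted = [math.sqrt(v) for v in scores]
print(share_at_minimum(scores), share_at_minimum(rooted))
```

The share of observations tied at the lower extreme is the same before and after the transformation, so the spike at zero remains.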
Hypothesis Testing Procedures

- Parametric: Z test, t test, one-way ANOVA
- Nonparametric: Wilcoxon rank sum test, Kruskal-Wallis H-test
Many more tests exist!
Types of data and analysis

[Figure: types of data (nominal, ordinal, discrete, continuous) and types of analysis (parametric, non-parametric)]
Types of Data
- Nominal: no numerical value
- Ordinal: order or rank
- Discrete: counts
- Continuous: interval, ratio
Parametric Test Procedures

1- Involve population parameters. Example: population mean.
2- Require interval scale or ratio scale (whole numbers or fractions). Example: height in inches (72, 60.5, 54.7).
3- Have stringent assumptions. Example: Normal distribution.
Nonparametric Test Procedures

A nonparametric test is a hypothesis test that does not require any specific conditions about the shape of the populations or the value of any population parameters. Such tests are often called distribution-free tests.
Why non-parametric statistics?

- Need to analyse crude data (nominal, ordinal)
- Data derived from small samples
- Data that do not follow a Normal distribution
- Data of unknown distribution
Advantages of Nonparametric Tests

1- Can be used with all scales.
2- Easier to compute.
3- Make fewer assumptions.
4- Suitable for small sample sizes.
5- Can be used when the data contain outliers.
6- No need for population parameters.
7- Results may be as exact as parametric procedures.
Disadvantages of Nonparametric Tests

1- May waste information if the data permit using parametric procedures. Example: converting data from ratio to ordinal scale.
2- Difficult to compute by hand for large samples.
3- Tables not widely available.
What is a parameter and why should I care?

- Most statistical tests, like the t test, assume some kind of underlying distribution, like the Normal distribution.
- If you know the mean and the standard deviation of a Normal distribution, then you know how to calculate probabilities.
- Means and standard deviations are called parameters; all theoretical distributions have parameters.
- Statistical tests that assume a distribution and use parameters are called parametric tests.
- Statistical tests that don't assume a distribution or use parameters are called nonparametric tests.
Ranks
Many nonparametric procedures are based on ranks. Data are ranked by ordering them from lowest to highest and assigning them, in order, the integer values from 1 to the sample size. Ties are resolved by assigning tied values the mean of the ranks they would have received if there were no ties.
Example: 117, 119, 119, 125, 128 becomes 1, 2.5, 2.5, 4, 5. If the two 119s were not tied, they would have been assigned the ranks 2 and 3. The mean of 2 and 3 is 2.5.
Procedure: replace the original data with their ranks across subjects and then perform the parametric test on the ranks.
For large samples, many nonparametric techniques can be viewed as the usual normal-theory-based procedures applied to ranks.
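The ranking rule, and the idea of applying a normal-theory procedure to ranks, can be sketched as follows. The function names are illustrative. The Spearman coefficient is computed here as Pearson's r on the mid-ranks, which is one standard way to define it.

```python
def midranks(values):
    """Rank from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + 1 + end + 1) / 2   # mean of ranks start+1 .. end+1
        for i in order[start:end + 1]:
            ranks[i] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson's r computed on the mid-ranks."""
    return pearson(midranks(x), midranks(y))

print(midranks([117, 119, 119, 125, 128]))   # [1.0, 2.5, 2.5, 4.0, 5.0]
```

For a curved but strictly increasing relationship, such as x against x squared, spearman returns 1 while pearson on the raw values is below 1.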
Normal theory based test               Corresponding nonparametric test              Purpose of test
t test for independent samples         Mann-Whitney U test; Wilcoxon rank-sum test   Compares two independent samples
Paired t test                          Wilcoxon matched pairs signed-rank test       Examines a set of differences
Pearson correlation coefficient        Spearman rank correlation coefficient         Assesses the linear association between two variables
One way analysis of variance (F test)  Kruskal-Wallis analysis of variance by ranks  Compares three or more groups
Two way analysis of variance           Friedman two way analysis of variance         Compares groups classified by two different factors
Wilcoxon signed rank test

To test the difference between paired data
STEP 1
- Exclude any differences which are zero
- Ignore the signs of the remaining differences and put them in ascending order of size
- Assign them ranks
- If any differences are equal, average their ranks
STEP 2
- Add up the ranks of the positive differences as T+
- Add up the ranks of the negative differences as T-
STEP 3
- If there is no difference between drug and placebo, then T+ and T- would be similar
- If there were a difference, one sum would be much smaller and the other much larger than expected
- The smaller sum is denoted as T: T = smaller of T+ and T-
STEP 4
- Compare the value obtained with the critical values (5%, 2% and 1%) in the table
- N is the number of differences that were ranked (not the total number of differences), so the zero differences are excluded
Wilcoxon Signed Rank Test
- Assume the distribution is continuous and symmetric
- Discard any observation(s) that equal M0, and adjust n
- Again look at the differences between the observations and the null value, M0 (for paired data, look at the differences within pairs)
- Rank the absolute values of the differences, from low to high; ties receive the average rank
- T+ = sum of the ranks of the positive differences
- T- = sum of the ranks of the negative differences
- p-values for one-sided tests are in the table, and apply only if the results are in the correct direction
- Double the table value to get the p-value for a two-sided test
Example for a single sample: test, at α = .05, whether the median age of students finishing a Masters degree in biostatistics is greater than 25.
H0: M = 25
H1: M > 25

Age   Age - 25   Rank
26       1       1
30       5       5
37      12       7
23      -2       2
42      17       8
25       0       (discarded)
28       3       3.5
33       8       6
28       3       3.5

T+ = 34
T- = 2
T = 2, p-value = .0118
- Because the p-value for the calculated T, from the tables, is less than 0.05, the difference is significant.
- We can reject H0.
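The single-sample example can be reproduced as a sketch. The p-value here comes from enumerating every equally likely assignment of signs to the ranks, which is an assumption about how such tables are built, not the source's table itself. It gives about 0.0117, close to the tabulated .0118. Function names are illustrative.

```python
from itertools import product

def signed_rank_sums(observations, null_median):
    """Return (T+, T-, ranks) for the differences from the null median."""
    # Observations equal to the null median are discarded.
    diffs = [x - null_median for x in observations if x != null_median]
    sizes = [abs(d) for d in diffs]
    # Mid-rank: tied absolute differences share the mean of their ranks.
    ranks = [sum(s < a for s in sizes) + (sum(s == a for s in sizes) + 1) / 2
             for a in sizes]
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return t_plus, t_minus, ranks

def exact_lower_tail(ranks, observed):
    """P(rank sum <= observed) when every pattern of signs is equally likely."""
    hits = sum(
        sum(r for r, keep in zip(ranks, signs) if keep) <= observed
        for signs in product((0, 1), repeat=len(ranks))
    )
    return hits / 2 ** len(ranks)

ages = [26, 30, 37, 23, 42, 25, 28, 33, 28]
t_plus, t_minus, ranks = signed_rank_sums(ages, 25)
p_one_sided = exact_lower_tail(ranks, t_minus)
print(t_plus, t_minus, round(p_one_sided, 4))
```

Full enumeration is only practical for small n, since the number of sign patterns doubles with each extra observation.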
Signed Rank Test Computation Table for paired data

X1i    X2i    Di = X1i - X2i    |Di|    Ri    Sign    Sign Ri
X11    X21    D1 = X11 - X21    |D1|    R1    ±       ±R1
X12    X22    D2 = X12 - X22    |D2|    R2    ±       ±R2
X13    X23    D3 = X13 - X23    |D3|    R3    ±       ±R3
:      :      :                 :       :     :       :
X1n    X2n    Dn = X1n - X2n    |Dn|    Rn    ±       ±Rn
Total                                                 T+ & T-
Hours of sleep

Patient   Drug   Placebo   Difference   Rank ignoring sign
1         6.1    5.2        0.9          3.5*
2         7.0    7.9       -0.9          3.5*
3         8.2    3.9        4.3         10
4         7.6    4.7        2.9          7
5         6.5    5.3        1.2          5
6         8.4    5.4        3.0          8
7         6.9    4.2        2.7          6
8         6.7    6.1        0.6          2
9         7.4    3.8        3.6          9
10        5.8    6.3       -0.5          1

* The 3rd and 4th ranks are tied, hence averaged.
T = smaller of T+ (50.5) and T- (4.5). Here T = 4.5, which is significant at the 2% level, indicating that the drug (hypnotic) is more effective than placebo.
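Steps 1 to 3 applied to the hours-of-sleep data can be sketched as below. One assumption is labeled in the comments: the differences are rounded to one decimal place, because floating-point subtraction can leave 0.9 and -0.9 unequal in size in the last digits, so the tie would be missed.

```python
drug    = [6.1, 7.0, 8.2, 7.6, 6.5, 8.4, 6.9, 6.7, 7.4, 5.8]
placebo = [5.2, 7.9, 3.9, 4.7, 5.3, 5.4, 4.2, 6.1, 3.8, 6.3]

# Assumption: data are recorded to one decimal, so round the differences
# to one decimal to make tied sizes compare as equal.
diffs = [round(d - p, 1) for d, p in zip(drug, placebo)]

# STEP 1: drop zero differences, ignore signs, rank, average tied ranks.
nonzero = [d for d in diffs if d != 0]
sizes = [abs(d) for d in nonzero]
ranks = [sum(s < a for s in sizes) + (sum(s == a for s in sizes) + 1) / 2
         for a in sizes]

# STEP 2: sum the ranks of the positive and of the negative differences.
t_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
t_minus = sum(r for r, d in zip(ranks, nonzero) if d < 0)

# STEP 3: T is the smaller of the two sums.
t = min(t_plus, t_minus)
print(diffs)
print(ranks)
print(t_plus, t_minus, t)
```

The printed sums should agree with the values quoted above, T+ = 50.5 and T- = 4.5. Step 4, the comparison with the table, is left to the reader.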
Signed Rank Test Computation Table

X1i     X2i     Di       |Di|    Ri          Sign    Sign Ri
9.98    9.88    +0.10    0.10    4           +       +4
9.88    9.86    +0.02    0.02    1           +       +1
9.90    9.83    +0.07    0.07    2 -> 2.5    +       +2.5
9.99    9.80    +0.19    0.19    5           +       +5
9.94    9.87    +0.07    0.07    3 -> 2.5    +       +2.5
9.84    9.84     0.00    0.00    ...         ...     Discard

Total: T+ = 15, T- = 0
Wilcoxon Signed Rank Table (Portion)

One-tailed    α = .05    α = .025    α = .01    α = .005
Two-tailed    α = .10    α = .05     α = .02    α = .01
n = 5         1
n = 6         2          1
n = 7         4          2           0
:             :          :           :          :
n = 11        ...
n = 12        ...
n = 13        ...
:             :          :           :          :
There are two ways of using the tables for the Wilcoxon signed rank test:
1- Looking at critical values: the calculated T value (the smaller one) is compared with the tabulated value at the specific N and p. The difference is significant (the null hypothesis is rejected) if calculated T ≤ tabulated T.
2- Comparing the p-values: find the p-value at the given N that matches the calculated T. If this p-value is greater than the specified one (0.05, for example), H0 cannot be rejected, i.e. the result is not significant. It is significant only if that p-value is less than or equal to the specified one.
Note: for a 2-tailed test, p = 2 x p for the 1-tailed test.
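The critical-value rule can be written as a small lookup. The dictionary below holds only the one-tailed entries visible in the portion of the table shown earlier (n = 5, 6 and 7). Every other sample size or level is deliberately left out, and the function returns None for them. The names CRITICAL_T and reject_h0 are hypothetical.

```python
# One-tailed critical values copied from the table portion shown earlier.
# Entries not visible in that portion are deliberately absent.
CRITICAL_T = {
    5: {0.05: 1},
    6: {0.05: 2, 0.025: 1},
    7: {0.05: 4, 0.025: 2, 0.01: 0},
}

def reject_h0(t, n, alpha, two_tailed=False):
    """Apply 'reject if T <= tabulated T'; return None if not tabulated."""
    # A two-tailed level corresponds to the one-tailed column at half the level.
    one_tailed_alpha = alpha / 2 if two_tailed else alpha
    critical = CRITICAL_T.get(n, {}).get(one_tailed_alpha)
    if critical is None:
        return None
    return t <= critical

print(reject_h0(0, 5, 0.05))                    # True
print(reject_h0(0, 5, 0.05, two_tailed=True))   # None
print(reject_h0(3, 7, 0.10, two_tailed=True))   # True
```

The first call matches the second computation table (T = 0 with five ranked differences), which is significant at the one-tailed 5% level.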
