Data Transformation
Data transformations
When data do not meet the assumptions of a statistical method, there are two things we can do:
1- We can use a method which does not require these assumptions, such as a rank-based method.
2- We can transform the data mathematically to make them fit the assumptions more closely before analysis.
There are three commonly used transformations for quantitative data: the logarithm, the square root, and the reciprocal. We call these transformations variance-stabilizing because they are intended to make the variance uniform across groups.
If we have several groups of subjects and calculate the mean and variance for each group, we can plot variability against mean. We might have one of these situations:
- Variability and mean are unrelated. We do not usually have a problem and can treat the variances as uniform. We do not need a transformation.
- Variance is proportional to mean. A square root transformation should remove the relationship between variability and mean.
- Standard deviation is proportional to mean. A logarithmic transformation should remove the relationship between variability and mean.
- Standard deviation is proportional to the square of the mean. A reciprocal transformation should remove the relationship between variability and mean.
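The case where standard deviation is proportional to mean can be illustrated with a small Python sketch. The three groups are made-up numbers (not data from these slides), built by multiplying the same base values by different scale factors, so that the standard deviation grows in step with the mean. After a log transformation the standard deviation should be the same in every group.

```python
import math
import statistics

# Hypothetical data: each group is the same base pattern multiplied by a
# different scale factor, so the SD is proportional to the mean.
base = [1.0, 2.0, 3.0, 4.0, 5.0]
groups = {scale: [scale * x for x in base] for scale in (1, 10, 100)}


def summarize(values):
    """Return (mean, sample standard deviation) of a list of numbers."""
    return statistics.mean(values), statistics.stdev(values)


# Summaries on the raw scale and on the natural-log scale.
raw_summary = {s: summarize(g) for s, g in groups.items()}
log_summary = {s: summarize([math.log(x) for x in g]) for s, g in groups.items()}

for s in groups:
    raw_mean, raw_sd = raw_summary[s]
    log_mean, log_sd = log_summary[s]
    print(f"scale {s:>3}: raw mean={raw_mean:.1f} sd={raw_sd:.2f} | "
          f"log mean={log_mean:.2f} sd={log_sd:.3f}")
```

In real data the relationship is never this exact, so the plot of variability against mean remains the practical guide to which transformation to try.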
Variance-stabilizing transformations also tend to make distributions Normal. There is a mathematical reason for this, as for so much in statistics.
It can be shown that if we take several samples from the same population, the means and variances of these samples will be independent if and only if the distribution is Normal. This means that uniform variance tends to go with a Normal distribution. A transformation which makes variance uniform will often also make data follow a Normal distribution, and vice versa.
Logarithmic transformation
*The most frequently used is the logarithm.
*The rates at which these reactions happen depend on the amounts of other substances in the blood, and the consequence is that the various factors which determine the concentration of the substance are multiplied together.
*Multiplying and dividing tend to produce skew distributions.
*If we take the logarithm of several numbers multiplied together we get the sum of their logarithms.
**So a log transformation produces something where the various influences are added together, and addition tends to produce a Normal distribution.
As we have seen, for the serum cholesterol in stroke patients data, the log transformation gives a good fit to the Normal. What happens if we analyze the logarithm of serum cholesterol then try to transform back to the natural scale?
For the raw data, serum cholesterol: mean = 6.34, SD = 1.40.
If we take the mean of the logs and transform it back to the natural scale (the antilog), we get the geometric mean rather than the arithmetic mean: the mean of the logs is the log of the geometric mean. The geometric mean is found by multiplying all the n observations together and then taking the nth root. For example, the geometric mean of 4 and 9 is 6, found by multiplying 4 by 9 to give 36 and taking the square (or second) root. The geometric mean is never larger than the arithmetic mean, and is smaller unless all the observations are equal. For 4 and 9 the arithmetic mean is (4 + 9)/2 = 6.5.
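The link between the mean of the logs and the geometric mean can be checked with a short Python sketch. The function names are illustrative only; the data are the 4 and 9 from the example above.

```python
import math


def geometric_mean(values):
    """Geometric mean computed as the antilog of the mean of the natural logs."""
    logs = [math.log(v) for v in values]  # requires strictly positive values
    return math.exp(sum(logs) / len(logs))


def arithmetic_mean(values):
    """Ordinary mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)


data = [4, 9]
gm = geometric_mean(data)   # approximately 6, up to floating-point rounding
am = arithmetic_mean(data)  # (4 + 9) / 2
print(round(gm, 6), am)
```

Working through the logs avoids multiplying many observations together, which can overflow for large samples.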
Even if a transformation does not produce a really good fit to the Normal distribution, it may still make the data much more amenable to analysis.
The following figure shows a histogram and Normal plot for the area of venous ulcer at recruitment.
The raw data have a very skew distribution and the small number of very large ulcers might lead to problems in analysis. Although the log transformed data are still skew, the skewness is much less and the data are much easier to analyze.
The following figure shows prostate specific antigen (PSA) for three groups of prostate patients: with benign conditions, with prostatitis, and with prostate cancer.
A log transformation of the PSA gives a much clearer picture. The variability is now much more similar in the three groups.
Square root transformation
The distribution is positively skew and the variability is clearly greater in the groups with greater lymphatic activity.
A square root transformation has the effect of making the data less skew and making the variation more uniform. In these data, a log transformation proved to have too great an effect, making the distribution negatively skew, and so the square root of the data was used in the analysis.
Reciprocal transformation
The reciprocal transformation removes the relationship between variability and mean. It is best for very strong relationships, where the standard deviation is proportional to the square of the mean.
The reciprocal can only be used for variables which are strictly greater than zero. Of the three transformations, the square root removes the least skewness and the reciprocal removes the most.
3- Sometimes we have a large number of identical observations, which will all transform to the same value whatever transformation we use. These are often at one extreme of the distribution, usually at zero.
An example is the distribution of coronary artery calcium in a large group of patients.
More than half of these observations were equal to zero. Any transformation would leave more than half the observations with the same value, at the extreme of the distribution. It is impossible to transform these data to a Normal distribution.
4- Sometimes transformations lead to variation in the p-value.
So what can we do if we cannot transform the data? It is usually safer to use methods that do not require such assumptions. These include the non-parametric methods.
Parametric
Nonparametric
Kruskal-Wallis H-Test
Types of Data
Nominal - no numerical value
Ordinal - order or rank
Discrete - counts
Continuous - interval, ratio
Ranks
Many nonparametric procedures are based on ranks. Data are ranked by ordering them from lowest to highest and assigning them, in order, the integer values from 1 to the sample size. Ties are resolved by assigning tied values the mean of the ranks they would have received if there were no ties.
Example: 117, 119, 119, 125, 128 becomes 1, 2.5, 2.5, 4, 5.
If the two 119s were not tied, they would have been assigned the ranks 2 and 3. The mean of 2 and 3 is 2.5.
Procedure: replace the original data with the ranks across subjects and then perform the parametric test.
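The ranking rule can be written out as a short Python sketch. The function name `average_ranks` is illustrative; the data are the five values from the example above.

```python
def average_ranks(values):
    """Rank values from lowest (1) to highest (n), giving ties the mean rank."""
    # Indices of the observations in ascending order of their values.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Sorted positions pos..end (0-based) would have had ranks pos+1..end+1.
        mean_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


print(average_ranks([117, 119, 119, 125, 128]))  # [1.0, 2.5, 2.5, 4.0, 5.0]
```

The ranks are returned in the original order of the observations, so they can replace the data directly before a parametric test is applied.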
For large samples, many nonparametric techniques can be viewed as the usual normal-theory-based procedures applied to ranks.

Normal theory based test             Purpose of test
t test for independent samples       Compares two independent samples
Paired t test                        Examines a set of differences
Pearson correlation coefficient      Assesses the linear association between two variables
One way analysis of variance (F test)  Compares three or more groups
Two way analysis of variance         Compares groups classified by two different factors
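As one illustration of a normal-theory procedure applied to ranks, the sketch below computes the Pearson correlation coefficient on ranked data, which gives Spearman's rank correlation (a method not named in the table above). The data and function names are made up for the example.

```python
import math


def mean_ranks(values):
    """Tie-aware ranks: count of smaller values plus the mean position among ties."""
    return [sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
            for x in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: y increases with x, but not along a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

r_raw = pearson(x, y)                             # linear association, below 1
r_ranks = pearson(mean_ranks(x), mean_ranks(y))   # correlation of the ranks
print(round(r_raw, 4), r_ranks)
```

Because the ranks of x and y agree exactly here, the rank-based coefficient is 1 even though the raw relationship is curved.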
STEP 1
- Exclude any differences which are zero
- Put the rest of the differences in ascending order
- Ignore their signs
- Assign them ranks
STEP 2
STEP 3
If there is no difference between drug (T+) and placebo (T-), then T+ and T- would be similar. If there were a difference, one sum would be much smaller and the other much larger than expected.
STEP 4
Compare the value obtained with the critical values (5%, 2% and 1%) in the table.
N is the number of differences that were ranked (not the total number of differences), so the zero differences are excluded.
Wilcoxon Signed Rank Test
- Assume the distribution is continuous and symmetric.
- Discard any observation(s) that equal M0 and adjust n.
- Again look at the differences between the observations and the null value, M0. (For paired data, look at the differences within pairs.)
- Rank the absolute values of the differences, from low to high. Ties receive the average rank.
- p-values for one-sided tests are in the table, and apply only if the results are in the correct direction.
- Double the table value to get the p-value for a two-sided test.
One-sample example. Test, at α = .05, whether the median age of students finishing a Masters degree in biostatistics is greater than 25.
H0: M = 25
H1: M > 25

Age    Age - 25    Rank
26        1        1
30        5        5
37       12        7
23       -2        2
42       17        8
25        0        (excluded)
28        3        3.5
33        8        6
28        3        3.5

T+ = 34
T- = 2
T = 2, p-value = .0118
*Because the p-value for the calculated T, from the tables, is less than 0.05, the difference is significant.
*We can reject H0.
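The calculation for the age example can be reproduced with a short Python sketch. The function names are illustrative, and the exact p-value is obtained here by enumerating every possible pattern of signs rather than by looking it up in the table, so it may differ from the tabulated value in the last digit.

```python
from itertools import product


def signed_rank_sums(values, m0):
    """Return (T_plus, T_minus, ranks) for the Wilcoxon signed rank test."""
    diffs = [v - m0 for v in values if v != m0]  # discard observations equal to M0
    abs_d = [abs(d) for d in diffs]
    # Tie-aware ranks of the absolute differences (ties get the mean rank).
    ranks = [sum(a < x for a in abs_d) + (sum(a == x for a in abs_d) + 1) / 2
             for x in abs_d]
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return t_plus, t_minus, ranks


def exact_lower_tail(ranks, t_obs):
    """P(T <= t_obs) when every pattern of signs is equally likely under H0."""
    count = 0
    for signs in product((0, 1), repeat=len(ranks)):
        if sum(r for r, s in zip(ranks, signs) if s) <= t_obs:
            count += 1
    return count / 2 ** len(ranks)


ages = [26, 30, 37, 23, 42, 25, 28, 33, 28]
t_plus, t_minus, ranks = signed_rank_sums(ages, 25)
p_one_sided = exact_lower_tail(ranks, t_minus)
print(t_plus, t_minus, round(p_one_sided, 4))
```

With eight ranked differences only three of the 256 sign patterns give a rank sum of 2 or less, so the enumerated one-sided p-value is 3/256, about 0.0117, in line with the tabulated .0118.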
X11    X21    D1 = X11 - X21    |D1|    R1
X12    X22    D2 = X12 - X22    |D2|    R2
X13    X23    D3 = X13 - X23    |D3|    R3
:      :      :                 :       :
X1n    X2n    Dn = X1n - X2n    |Dn|    Rn
Total
Hours of sleep

Patient    Drug    Placebo    Difference    Rank
1                                           3.5*
2                                           3.5*
3                                           10
4                                           7
5                                           5
6                                           8
7                                           6
8          6.7     6.1         0.6          2
9          7.4     3.8         3.6          9
10         5.8     6.3        -0.5          1
The 3rd and 4th ranks are tied and hence averaged (marked *).
T = smaller of T+ (50.5) and T- (4.5). Here T = 4.5, which is significant at the 2% level, indicating that the drug (hypnotic) is more effective than placebo.
T+ = 15, T- = 0
n = 5     1
n = 6     2    1
n = 7     4    2    0
:
n = 11
n = 12
n = 13
:
There are two types of comparison using tables for the Wilcoxon signed rank test:
1- Looking at critical values (Z): the calculated T value (the smaller of T+ and T-) is compared with the tabulated value at the specific N and p. The difference is significant (the null hypothesis is rejected) if calculated T ≤ tabulated T.