Statistical Analysis Using SAS
                                          Cumulative    Cumulative
female    Frequency    Percent     Frequency       Percent
-----------------------------------------------------------
0               91       45.50            91         45.50
1              109       54.50           200        100.00
Exact Test
One-sided Pr <= P 0.1146
Two-sided = 2 * One-sided 0.2292
Chi-Square Test
for Specified Proportions
-------------------------
Chi-Square 5.0286
DF 3
Pr > ChiSq 0.1697
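The t-test output below, and the interpretation that follows it, would be produced by a proc ttest step in the same style as the other examples on this page; a sketch:

```sas
proc ttest data = "c:\mydata\hsb2";
  class female;  /* grouping variable: 0 = male, 1 = female */
  var write;     /* dependent variable: writing score */
run;
```

The class statement names the two-level grouping variable; proc ttest then reports both the pooled and Satterthwaite t-tests along with the folded F test for equality of variances shown below.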
T-Tests
Equality of Variances
Variable Method Num DF Den DF F Value Pr > F
write Folded F 90 108 1.61 0.0187
The results indicate that there is a statistically significant difference between the mean writing
score for males and females (t = -3.73, p = .0002). In other words, females have a statistically
significantly higher mean score on writing (54.991) than males (50.121).
Wilcoxon-Mann-Whitney test
The Wilcoxon-Mann-Whitney test is a non-parametric analog to the independent samples t-test
and can be used when you do not assume that the dependent variable is a normally distributed
interval variable (you need only assume that the variable is at least ordinal). We will use the
same data file (the hsb2 data file) and the same variables in this example as we did in the
independent t-test example above and will not assume that write, our dependent variable, is
normally distributed.
proc npar1way data = "c:\mydata\hsb2" wilcoxon;
class female;
var write;
run;
The NPAR1WAY Procedure

Wilcoxon Two-Sample Test

Statistic             7792.0000

Normal Approximation
Z                       -3.3279
One-Sided Pr <  Z        0.0004
Two-Sided Pr > |Z|       0.0009

t Approximation
One-Sided Pr <  Z        0.0005
Two-Sided Pr > |Z|       0.0010
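The crosstabulations that follow (schtyp by female, then female by ses) would be produced by proc freq steps like the following sketch; the chisq option is an assumption, since the chi-square statistics themselves are not shown in this excerpt:

```sas
proc freq data = "c:\mydata\hsb2";
  tables schtyp*female / chisq;  /* 2 x 2 table: school type by gender */
  tables female*ses / chisq;     /* 2 x 3 table: gender by socioeconomic status */
run;
```

Each tables statement produces a cell-by-cell table showing frequency, overall percent, row percent, and column percent, matching the legend blocks below.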
Table of schtyp by female

schtyp(type of school)    female
Frequency|
Percent |
Row Pct |
Col Pct | 0| 1| Total
---------+--------+--------+
1 | 77 | 91 | 168
| 38.50 | 45.50 | 84.00
| 45.83 | 54.17 |
| 84.62 | 83.49 |
---------+--------+--------+
2 | 14 | 18 | 32
| 7.00 | 9.00 | 16.00
| 43.75 | 56.25 |
| 15.38 | 16.51 |
---------+--------+--------+
Total 91 109 200
45.50 54.50 100.00
Table of female by ses

female    ses
Frequency|
Percent |
Row Pct |
Col Pct | 1| 2| 3| Total
---------+--------+--------+--------+
0 | 15 | 47 | 29 | 91
| 7.50 | 23.50 | 14.50 | 45.50
| 16.48 | 51.65 | 31.87 |
| 31.91 | 49.47 | 50.00 |
---------+--------+--------+--------+
1 | 32 | 48 | 29 | 109
| 16.00 | 24.00 | 14.50 | 54.50
| 29.36 | 44.04 | 26.61 |
| 68.09 | 50.53 | 50.00 |
---------+--------+--------+--------+
Total 47 95 58 200
23.50 47.50 29.00 100.00
Frequency|
Percent |
Row Pct |
Col Pct | 1| 2| 3| 4| Total
---------+--------+--------+--------+--------+
1 | 22 | 10 | 18 | 118 | 168
| 11.00 | 5.00 | 9.00 | 59.00 | 84.00
| 13.10 | 5.95 | 10.71 | 70.24 |
| 91.67 | 90.91 | 90.00 | 81.38 |
---------+--------+--------+--------+--------+
2 | 2 | 1 | 2 | 27 | 32
| 1.00 | 0.50 | 1.00 | 13.50 | 16.00
| 6.25 | 3.13 | 6.25 | 84.38 |
| 8.33 | 9.09 | 10.00 | 18.62 |
---------+--------+--------+--------+--------+
Total 24 11 20 145 200
12.00 5.50 10.00 72.50 100.00
One-way ANOVA
A one-way analysis of variance (ANOVA) is used when you have a categorical independent
variable (with two or more categories) and a normally distributed interval dependent variable
and you wish to test for differences in the means of the dependent variable broken down by the
levels of the independent variable. For example, using the hsb2 data file, say we wish to test
whether the mean of write differs between the three program types (prog). We will also use the
means statement to output the mean of write for each level of program type. Note that this will
not tell you if there is a statistically significant difference between any two sets of means.
proc glm data = "c:\mydata\hsb2";
class prog;
model write = prog;
means prog;
run;
quit;
The GLM Procedure

Level of            --------------write--------------
prog         N          Mean              Std Dev
1           45       51.3333333        9.39777537
2          105       56.2571429        7.94334333
3           50       46.7600000        9.31875441
The mean of the dependent variable differs significantly among the levels of program type.
However, we do not know if the difference is between only two of the levels or all three of the
levels. (The F test for the model is the same as the F test for prog because prog was the only
variable entered into the model. If other variables had also been entered, the F test for the
Model would have been different from prog.) We can also see that the students in the academic
program have the highest mean writing score, while students in the vocational program have the
lowest.
Kruskal Wallis test
The Kruskal Wallis test is used when you have one independent variable with two or more
levels and an ordinal dependent variable. In other words, it is the non-parametric version of
ANOVA and a generalization of the Wilcoxon-Mann-Whitney test to more than two groups. We
will use the same data file as the one way ANOVA example above (the hsb2 data file) and the
same variables as in the example above, but we will not assume that write is a normally
distributed interval variable.
proc npar1way data = "c:\mydata\hsb2";
class prog;
var write;
run;
The NPAR1WAY Procedure
Kruskal-Wallis Test
Chi-Square 34.0452
DF 2
Pr > Chi-Square <.0001
The results indicate that there is a statistically significant difference among the three types of
programs (chi-square with two degrees of freedom = 34.0452, p < .0001).
Paired t-test
A paired (samples) t-test is used when you have two related observations (i.e., two observations
per subject) and you want to see if the means on these two normally distributed interval
variables differ from one another. For example, using the hsb2 data file we will test whether the
mean of read is equal to the mean of write.
proc ttest data = "c:\mydata\hsb2";
paired write*read;
run;
The TTEST Procedure
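The McNemar output below compares two paired binary items, Q1correct and Q2correct. The proc freq step that generates it is not shown in this excerpt; a sketch, with a hypothetical dataset name, would be:

```sas
proc freq data = "c:\mydata\exact";      /* hypothetical file containing the paired items */
  tables Q1correct*Q2correct / agree;    /* agree requests McNemar's test and the kappa coefficient */
run;
```

The agree option on the tables statement is what produces both McNemar's test and the simple kappa coefficient shown below.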
Table of Q1correct by Q2correct

Q1correct    Q2correct
Frequency|
Percent |
Row Pct |
Col Pct | 0| 1| Total
---------+--------+--------+
0 | 15 | 6 | 21
| 7.50 | 3.00 | 10.50
| 71.43 | 28.57 |
| 68.18 | 3.37 |
---------+--------+--------+
1 | 7 | 172 | 179
| 3.50 | 86.00 | 89.50
| 3.91 | 96.09 |
| 31.82 | 96.63 |
---------+--------+--------+
Total 22 178 200
11.00 89.00 100.00
McNemar's Test
----------------------------
Statistic (S) 0.0769
DF 1
Asymptotic Pr > S 0.7815
Exact Pr >= S 1.0000
Simple Kappa Coefficient
--------------------------------
Kappa 0.6613
ASE 0.0873
95% Lower Conf Limit 0.4901
95% Upper Conf Limit 0.8324
Repeated Measures Level Information

Dependent Variable    Y1    Y2    Y3    Y4
Level of a             1     2     3     4

                                                                        Adj Pr > F
Source      DF    Type III SS    Mean Square    F Value    Pr > F    G - G     H - F
a            3    49.00000000    16.33333333      11.63    0.0001    0.0015    0.0003
Error(a)    21    29.50000000     1.40476190
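The repeated-measures table above (a within-subject factor a with four levels measured as Y1 through Y4) would come from a proc glm step along these lines; the dataset name here is hypothetical, since it is not shown in this excerpt:

```sas
proc glm data = "c:\mydata\rb4wide";  /* hypothetical file: one row per subject, columns y1-y4 */
  model y1 y2 y3 y4 = ;               /* no between-subjects predictors */
  repeated a 4;                       /* four levels of the within-subject factor a */
run;
quit;
```

The repeated statement is what produces the Greenhouse-Geisser (G - G) and Huynh-Feldt (H - F) adjusted p-values shown in the table.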
Algorithm converged.
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5       2278.24419      455.64884        5.67    <.0001
Error             194      15600.63081       80.41562
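The factorial logistic regression output that follows (model fit statistics, Type 3 tests, and parameter estimates) would come from a proc logistic step along these lines; the descending and expb options are assumptions, inferred from female = 1 being the modeled event and from the Exp(Est) column:

```sas
proc logistic data = "c:\mydata\hsb2" descending;  /* descending: model Pr(female = 1) */
  class prog schtyp;                               /* categorical predictors, default effect coding */
  model female = prog schtyp prog*schtyp / expb;   /* expb prints exp(estimate) */
run;
```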
Model Fit Statistics

                    Intercept      Intercept and
Criterion           Only           Covariates
AIC                 277.637        284.490
SC                  280.935        304.280
-2 Log L            275.637        272.490
Type 3 Analysis of Effects

                           Wald
Effect             DF      Chi-Square     Pr > ChiSq
prog                2      1.1232         0.5703
schtyp              1      0.4132         0.5203
prog*schtyp         2      2.4740         0.2903
Analysis of Maximum Likelihood Estimates

                                        Standard      Wald
Parameter            DF    Estimate     Error         Chi-Square    Pr > ChiSq    Exp(Est)
Intercept             1     0.3331      0.3164        1.1082        0.2925        1.395
prog        1         1     0.4459      0.4568        0.9532        0.3289        1.562
prog        2         1    -0.1964      0.3438        0.3264        0.5678        0.822
schtyp      1         1    -0.2034      0.3164        0.4132        0.5203        0.816
prog*schtyp 1 1       1    -0.6269      0.4568        1.8838        0.1699        0.534
prog*schtyp 2 1       1     0.3400      0.3438        0.9783        0.3226        1.405
The results indicate that the overall model is not statistically significant (LR chi-square =
3.1467, p = 0.6774). Likewise, none of the individual effects is statistically significant:
program (p = 0.5703), school type (p = 0.5203) and the interaction (p = 0.2903).
Correlation
A correlation is useful when you want to see the linear relationship between two (or more)
normally distributed interval variables. For example, using the hsb2 data file we can run a
correlation between two continuous variables, read and write.
proc corr data = "c:\mydata\hsb2";
var read write;
run;
The CORR Procedure
read write
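The simple regression output below (regressing write on read) would be produced by a proc reg step like this sketch; the stb option is an assumption, inferred from the Standardized Estimate column in the output:

```sas
proc reg data = "c:\mydata\hsb2";
  model write = read / stb;  /* stb prints standardized coefficients */
run;
quit;
```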
Parameter Estimates

                                       Parameter    Standard                           Standardized
Variable     Label            DF       Estimate     Error       t Value    Pr > |t|    Estimate
Intercept    Intercept         1       23.95944     2.80574        8.54    <.0001      0
read         reading score     1        0.55171     0.05272       10.47    <.0001      0.59678
We see that the relationship between write and read is positive (coefficient = .55171) and,
based on the t-value (10.47) and p-value (<.0001), we conclude that this relationship is
statistically significant. Hence, there is a statistically significant positive linear relationship
between reading and writing.
Non-parametric correlation
A Spearman correlation is used when one or both of the variables are not assumed to be
normally distributed and interval (but are assumed to be ordinal). The values of the variables are
converted into ranks and then correlated. In our example, we will look for a relationship between
read and write. We will not assume that both of these variables are normal and interval. The
spearman option on the proc corr statement is used to tell SAS to perform a Spearman rank
correlation instead of a Pearson correlation.
proc corr data = "c:\mydata\hsb2" spearman;
var read write;
run;
The CORR Procedure
read write
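The logistic regression estimates below (predicting female from read) would come from a proc logistic step along these lines; the descending and expb options are assumptions, inferred from the Exp(Est) column:

```sas
proc logistic data = "c:\mydata\hsb2" descending;  /* descending: model Pr(female = 1) */
  model female = read / expb;                      /* expb prints exp(estimate) */
run;
```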
Analysis of Maximum Likelihood Estimates

                          Standard      Wald
Parameter    DF    Estimate    Error         Chi-Square    Pr > ChiSq    Exp(Est)
Intercept     1     0.7261     0.7420        0.9577        0.3278        2.067
read          1    -0.0104     0.0139        0.5623        0.4533        0.990
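The multiple regression output that follows (write on female, read, math, science and socst) would be produced by a proc reg step like this sketch; stb is again an assumption, inferred from the Standardized Estimate column:

```sas
proc reg data = "c:\mydata\hsb2";
  model write = female read math science socst / stb;  /* stb prints standardized coefficients */
run;
quit;
```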
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5        10757          2151.38488      58.60    <.0001
Error             194     7121.95060          36.71109
Corrected Total   199        17879
Parameter Estimates

                                              Parameter    Standard                           Standardized
Variable     Label                   DF       Estimate     Error       t Value    Pr > |t|    Estimate
Intercept    Intercept                1       6.13876      2.80842        2.19    0.0300      0
female                                1       5.49250      0.87542        6.27    <.0001      0.28928
read         reading score            1       0.12541      0.06496        1.93    0.0550      0.13566
math         math score               1       0.23807      0.06713        3.55    0.0005      0.23531
science      science score            1       0.24194      0.06070        3.99    <.0001      0.25272
socst        social studies score     1       0.22926      0.05284        4.34    <.0001      0.25967
The results indicate that the overall model is statistically significant (F = 58.60, p < .0001).
Furthermore, all of the predictor variables are statistically significant except for read (p = 0.0550).
Analysis of covariance
Analysis of covariance is like ANOVA, except in addition to the categorical predictors you have
continuous predictors as well. For example, the one way ANOVA example used write as the
dependent variable and prog as the independent variable. Let's add read as a continuous
variable to this model.
proc glm data = "c:\mydata\hsb2";
class prog;
model write = prog read;
run;
quit;
The GLM Procedure
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3     7017.68123       2339.22708       42.21    <.0001
Error             196    10861.19377         55.41425
Corrected Total   199    17878.87500
Model Information
Response Profile

Ordered                    Total
Value         female       Frequency
1             1            109
2             0            91

Model Fit Statistics

                    Intercept      Intercept and
Criterion           Only           Covariates
AIC                 277.637        253.818
SC                  280.935        263.713
-2 Log L            275.637        247.818
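The discriminant analysis output that follows (class levels of prog, pairwise distances, discriminant functions, and the classification table) would come from a proc discrim step like this sketch; the variable list is inferred from the coefficient table below:

```sas
proc discrim data = "c:\mydata\hsb2" can;  /* can requests canonical discriminant analysis */
  class prog;                              /* grouping variable: program type */
  var read write math;                     /* predictors, per the coefficient table */
run;
```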
Class Level Information

            Variable                                             Prior
prog        Name         Frequency    Weight       Proportion    Probability
1           _1                  45     45.0000     0.225000      0.333333
2           _2                 105    105.0000     0.525000      0.333333
3           _3                  50     50.0000     0.250000      0.333333
Pooled Covariance Matrix Information

Covariance       Natural Log of the Determinant
Matrix Rank      of the Covariance Matrix
3                12.18440
Pairwise Generalized Squared Distances Between Groups

D^2(i|j) = (Xbar_i - Xbar_j)' COV^-1 (Xbar_i - Xbar_j)
From
prog 1 2 3
1 0 0.73810 0.31771
2 0.73810 0 1.90746
3 0.31771 1.90746 0
Linear Discriminant Function

Constant_j = -0.5 Xbar_j' COV^-1 Xbar_j        Coefficient Vector_j = COV^-1 Xbar_j
Variable Label 1 2 3
Constant -24.47383 -30.60364 -20.77468
read reading score 0.18195 0.21279 0.17451
write writing score 0.38572 0.39921 0.33999
math math score 0.40171 0.47236 0.37891
Generalized Squared Distance Function

D_j^2(X) = (X - Xbar_j)' COV^-1 (X - Xbar_j)

Posterior Probability of Membership in Each prog

Pr(j|X) = exp(-0.5 D_j^2(X)) / SUM_k exp(-0.5 D_k^2(X))
From
prog 1 2 3 Total
1 11 17 17 45
24.44 37.78 37.78 100.00
2 18 68 19 105
17.14 64.76 18.10 100.00
3 14 7 29 50
28.00 14.00 58.00 100.00
Total 43 92 65 200
21.50 46.00 32.50 100.00
Error Count Estimates for prog

                 1           2           3         Total
Rate        0.7556      0.3524      0.4200      0.5093
Priors      0.3333      0.3333      0.3333
Clearly, the SAS output for this procedure is quite lengthy, and it is beyond the scope of this
page to explain all of it. However, the main point is that two canonical variables are identified
by the analysis, the first of which seems to be more related to program type than the second.
See also
· discriminant function analysis
One-way MANOVA
MANOVA (multivariate analysis of variance) is like ANOVA, except that there are two or more
dependent variables. In a one-way MANOVA, there is one categorical independent variable and
two or more dependent variables. For example, using the hsb2 data file, say we wish to examine
the differences in read, write and math broken down by program type (prog). The manova
statement is necessary in the proc glm to tell SAS to conduct a MANOVA. The h= on the
manova statement is used to specify the hypothesized effect.
proc glm data = "c:\mydata\hsb2";
class prog;
model read write math = prog;
manova h=prog;
run;
quit;
The GLM Procedure
Dependent Variable: read reading score

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2     3716.86127       1858.43063       21.28    <.0001
Error             197    17202.55873         87.32263
Corrected Total   199    20919.42000

Dependent Variable: write writing score

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2     3175.69786       1587.84893       21.27    <.0001
Error             197    14703.17714         74.63542
Corrected Total   199    17878.87500
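The two univariate models below (one for write, one for read) and the "Multivariate Test" sections after them would be produced by a multivariate regression in proc reg with mtest statements, along these lines (a sketch):

```sas
proc reg data = "c:\mydata\hsb2";
  model write read = female math science socst;  /* two dependent variables at once */
  mtest female;   /* multivariate test of each predictor across both outcomes */
  mtest math;
  mtest science;
  mtest socst;
run;
quit;
```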
Dependent Variable: write writing score

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               4        10620          2655.02312      71.32    <.0001
Error             195     7258.78251          37.22453
Corrected Total   199        17879
Parameter Estimates

                                              Parameter    Standard
Variable     Label                   DF       Estimate     Error       t Value    Pr > |t|
Intercept    Intercept                1       6.56892      2.81908        2.33    0.0208
female                                1       5.42822      0.88089        6.16    <.0001
math         math score               1       0.28016      0.06393        4.38    <.0001
science      science score            1       0.27865      0.05805        4.80    <.0001
socst        social studies score     1       0.26811      0.04919        5.45    <.0001
Model: MODEL1
Dependent Variable: read reading score
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               4        12220          3054.91459      68.47    <.0001
Error             195     8699.76166          44.61416
Corrected Total   199        20919
Parameter Estimates

                                              Parameter    Standard
Variable     Label                   DF       Estimate     Error       t Value    Pr > |t|
Intercept    Intercept                1       3.43000      3.08624        1.11    0.2678
female                                1      -0.51261      0.96436       -0.53    0.5956
math         math score               1       0.33558      0.06999        4.79    <.0001
science      science score            1       0.29276      0.06355        4.61    <.0001
socst        social studies score     1       0.30976      0.05386        5.75    <.0001
Model: MODEL1
Multivariate Test: female
Model: MODEL1
Multivariate Test: math
Model: MODEL1
Multivariate Test: science
Model: MODEL1
Multivariate Test: socst
Canonical correlation
Canonical correlation is a multivariate technique used to examine the relationship between two
groups of variables. For each set of variables, it creates latent variables and looks at the
relationships among the latent variables. It assumes that all variables in the model are interval
and normally distributed. In SAS, one group of variables is placed on the var statement and the
other group on the with statement. There need not be an equal number of variables in the two
groups. The all option on the proc cancorr statement provides additional output that many
researchers might find useful.
proc cancorr data = "c:\mydata\hsb2" all;
var read write;
with math science;
run;
The CANCORR Procedure
VAR Variables 2
WITH Variables 2
Observations 200
Standard
Variable Mean Deviation Label
read 52.230000 10.252937 reading score
write 52.775000 9.478586 writing score
math 52.645000 9.368448 math score
science 51.850000 9.900891 science score
Correlations Among the Original Variables

Correlations Among the VAR Variables

             read        write
read         1.0000      0.5968
write        0.5968      1.0000

Correlations Among the WITH Variables

             math        science
math         1.0000      0.6307
science      0.6307      1.0000

Correlations Between the VAR Variables and the WITH Variables

             math        science
read         0.6623      0.6302
write        0.6174      0.5704
Canonical Correlation Analysis

Raw Canonical Coefficients for the VAR Variables

                               V1                V2
read     reading score         0.063261313       0.1037907932
write    writing score         0.0492491834     -0.12190836

Raw Canonical Coefficients for the WITH Variables

                               W1                W2
math     math score            0.0669826768     -0.120142451
science  science score         0.0482406314      0.1208859811
Standardized Canonical Coefficients for the VAR Variables
V1 V2
read reading score 0.6486 1.0642
write writing score 0.4668 -1.1555
Standardized Canonical Coefficients for the WITH Variables
W1 W2
math math score 0.6275 -1.1255
science science score 0.4776 1.1969
Canonical Structure
V1 V2
read reading score 0.9272 0.3746
write writing score 0.8539 -0.5205
W1 W2
math math score 0.9288 -0.3706
science science score 0.8734 0.4870
Correlations Between the VAR Variables and the Canonical Variables of the
WITH Variables
W1 W2
read reading score 0.7166 0.0088
write writing score 0.6599 -0.0122
Correlations Between the WITH Variables and the Canonical Variables of the
VAR Variables
V1 V2
math math score 0.7178 -0.0087
science science score 0.6750 0.0114
Canonical Redundancy Analysis

Squared Multiple Correlations Between the VAR Variables and the
First M Canonical Variables of the WITH Variables

M                              1           2
read     reading score         0.5135      0.5136
write    writing score         0.4355      0.4356

Squared Multiple Correlations Between the WITH Variables and the
First M Canonical Variables of the VAR Variables

M                              1           2
math     math score            0.5152      0.5153
science  science score         0.4557      0.4558
The output above shows the linear combinations corresponding to the first canonical
correlation. At the bottom of the output are the two canonical correlations. These results
indicate that the first canonical correlation is .772841. The F-test in this output tests the
hypothesis that the first canonical correlation is equal to zero. Clearly, F = 56.47 is statistically
significant. However, the second canonical correlation of .0235 is not statistically significantly
different from zero (F = 0.11, p = 0.7420).
Factor analysis
Factor analysis is a form of exploratory multivariate analysis that is used to either reduce the
number of variables in a model or to detect relationships among variables. All variables
involved in the factor analysis need to be continuous and are assumed to be normally
distributed. The goal of the analysis is to try to identify factors which underlie the variables.
There may be fewer factors than variables, but there may not be more factors than variables. For
our example, let's suppose that we think that there are some common factors underlying the
various test scores. We will use the principal components method of extraction, use a varimax
rotation, extract two factors and obtain a scree plot of the eigenvalues. All of these options are
listed on the proc factor statement.
proc factor data = "c:\mydata\hsb2" method=principal rotate=varimax
nfactors=2 scree;
var read write math science socst;
run;
The FACTOR Procedure
Initial Factor Method: Principal Components
Factor Pattern
Factor1 Factor2
READ reading score 0.85760 -0.02037
WRITE writing score 0.82445 0.15495
MATH math score 0.84355 -0.19478
SCIENCE science score 0.80091 -0.45608
SOCST social studies score 0.78268 0.53573
Variance Explained by Each Factor

Factor1         Factor2
3.3808198       0.5573783

Rotation Method: Varimax

Orthogonal Transformation Matrix

              1            2
1       0.74236      0.67000
2      -0.67000      0.74236

Variance Explained by Each Factor

Factor1         Factor2
2.1133589       1.8248392