Christianna S. Williams, Abt Associates Inc, Durham, NC: PROC COMPARE - Worth Another Look!
Christianna S. Williams, Abt Associates Inc, Durham, NC: PROC COMPARE - Worth Another Look!
Christianna S. Williams, Abt Associates Inc, Durham, NC: PROC COMPARE - Worth Another Look!
Paper 149-2010
INTRODUCTION
Im not sure if PROC COMPARE existed in the very first version of SAS, but it has certainly been around at least
since version 5 many moons ago. In my opinion, it is one of those oldies but goodies that well-rounded SAS
programmers need to know about. It is invaluable for the many situations in which one needs to compare two data
sets. For example, consider PROC COMPARE the next time you need
to examine whether two algorithms for computing certain variables produce comparable results;
to prepare for a merge/joint of two large data sets with many variables, so that one knows what variables
may need to be renamed.
Of course, much of the functionality of PROC COMPARE could be achieved in other ways, such as joining or
merging two data sets, followed by identifying mis-matches and analyzing similarities and differences among
variables on matching observations. Indeed, because we are talking about data set stuff, rather than statistical
analyses, there is the tendency to head first to the DATA step for a solution, rather than considering the sometimes
daunting stable of PROCs. However, why not make use of the SAS machinery that is specifically designed for the
purpose of comparing of data sets or at least be familiar enough with its capabilities so that you can consider
whether it suits your needs. The purpose of this paper, then, is to introduce (or re-introduce) many of the features of
PROC COMPARE through a series of examples, in the hopes that you can add this to your toolkit or deepen your
understanding of what can be accomplished with just a few statements.
THE DATA
The data for all the examples in this paper comes from a project I and others at Abt have been working on with the
Centers for Medicare and Medicaid (CMS) to generate quality ratings for all US Nursing Homes 1 . This includes
nearly 16,000 nursing homes, and the ratings are updated every month. To simplify things for the purposes of the
examples, Ive limited the data to just nursing homes in Washington state, and, of course, just a small subset of the
variables that are used in generating the ratings. To make the comparisons shown in the examples in this paper at
least a little bit relevant to real life, Im using one data set that has information about all certified Washington nursing
homes from July 2009 (named WA_jul2009) and another with information about the nursing homes in Washington
that were certified in February 2010 (named WA_feb2010). For reference, PROC CONTENTS output for each of
these data sets is included in an appendix.
Created
IN.WA_JUL2009
IN.WA_FEB2010
Modified
20FEB10:13:23:39
20FEB10:13:23:39
NVar
NObs
22
22
234
231
20FEB10:13:23:39
20FEB10:13:23:39
Label
WA: July 2009
WA: February 2010
Variables Summary
Number
Number
Number
Number
Number
of
of
of
of
of
Variables
Variables
Variables
Variables
Variables
in Common: 17.
in IN.WA_JUL2009 but not in IN.WA_FEB2010: 5.
in IN.WA_FEB2010 but not in IN.WA_JUL2009: 5.
with Conflicting Types: 1.
with Differing Attributes: 4.
Dataset
Type
Length
hospbased
IN.WA_JUL2009
IN.WA_FEB2010
Char
Num
3
8
Label
If provider is in a hospital, (YES/NO)
Provider in Hospital (0,1)
Dataset
Type
Length
SurveyDate
IN.WA_JUL2009
IN.WA_FEB2010
IN.WA_JUL2009
IN.WA_FEB2010
IN.WA_JUL2009
IN.WA_FEB2010
IN.WA_JUL2009
IN.WA_FEB2010
Num
Num
Num
Num
Char
Char
Num
Num
8
8
4
8
50
55
8
8
bedcert
PROVNAME
OCCUPY
Format
Label
MMDDYY10.
DATE9.
Observation Summary
Observation
First
First
Last
Last
Last
Obs
Unequal
Unequal
Match
Obs
Base
Compare
1
1
231
231
234
1
1
231
231
.
Type
PROVNUM
SurveyDate
numcycles
defscore
def5star
MDS5star
rnstf5
staff5star
PROVNAME
CITY
BEDCERT
RESTOT
OCCUPY
Comp5star
CHAR
NUM
NUM
NUM
NUM
NUM
NUM
NUM
CHAR
CHAR
NUM
NUM
NUM
NUM
Len1 Len2
10
8
8
8
8
4
8
8
50
28
8
8
8
8
10
8
8
8
8
4
8
8
55
28
8
8
8
8
Ndif
120
187
2
224
125
154
129
126
123
116
127
183
185
141
MaxDif
529
1.000
447
4.000
4.000
4.000
4.000
158
168
42.000
4.000
MissDif
0
0
0
0
0
1
7
7
0
0
0
0
0
0
% Diff
_________
1.6957
1.7332
1.7804
In the Data Set Summary section we see that both data sets have 22 variables, but one has 234
observations and the other has 231. We also see that they have different labels. Basically, we get a
comparison of their metadata.
2)
In the Variables Summary section, we see that although the two data sets have the same number of
variables, each one has 5 variables that are not in the other data set. By default, we do NOT get a listing of
the names of the variables that are unique to one data set or the other. Well see how to get that in Example
3 below.
3)
The Variables Summary section also tells us that there is one variable with a TYPE conflict, and there are 4
variables with differing attributes. These variables ARE listed in the two sections Listing of Common
Variables with Conflicting Types and Listing of Common Variables with Differing Attributes, and we can see
what the differences are. With respect to variable attributes, COMPARE checks for differences in LENGTH,
FORMATs, INFORMATs and LABELs. One important note, however, which is something I learned just by
experimentation, has to do with differences in variable LABELs. If you are using the listing destination (as I
have in the is example), and your LINESIZE option is too narrow for the LABELs to print in a single line, SAS
does NOT print the LABELS in the Listing of Common Variables with Differing Attributes. In such case, the
variable will show up in the table as having differing attributes, but there will be no column for the LABEL, and
you will not be able to see why the variable is listed there!
4)
The remainder of the output is not really of interest in this, since SAS is just comparing the data sets record
by record without any type of key or ID field. Well come back to these pieces later also. You do get a listing
of the variables that have value differences and some summary information about those differences. Note
that SAS is not going to even attempt to make value comparisons for variables with conflicting TYPEs.
5)
Well also come back to the listing of Value Comparison Results for Variables, (and Ive snipped a lot of it
out), but Ill note a couple of things here. You see from the NOTEs, that not all the value differences are
being printed, and also an indication that there is an option MAXPRINT that controls this. The MAXPRINT
option, which goes on the PROC COMPARE statement has the following syntax:
MAXPRINT = (x,y)
Where X is the maximum number of value differences to be printed for any one variable, and Y is the
maximum total number of differences to be shown. As the NOTEs in Figure 1 indicate, the default is
MAXPRINT = (50, 500).
Figure 2. SAS Listing for Example 2: Comparison of two identical data sets
Example 2: PROC COMPARE for two identical data sets
The COMPARE Procedure
Comparison of IN.WA_JUL2009 with IN.WA_JUL2009
(Method=EXACT)
Data Set Summary
Dataset
IN.WA_JUL2009
IN.WA_JUL2009
Created
Modified
NVar
NObs
06FEB10:17:32:51
06FEB10:17:32:51
06FEB10:17:32:51
06FEB10:17:32:51
22
22
234
234
Variables Summary
Number of Variables in Common: 22.
Observation Summary
Observation
First Obs
Last Obs
Base
Compare
1
234
1
234
EXAMPLE 3 WHEN ALL YOU REALLY CARE ABOUT IS THE CONTENTS: PROC
COMPARE WITH NOVALUES, LISTVAR AND BRIEF OPTIONS
We often get data sets from a client that have updated data, but are otherwise supposed to be the same (that is, they
have the same variables, and those variables have the same attributes and meaning). Now, of course, you could do
a PROC CONTENTS of each data set, and match them up and see if they are the same and this may very well be
worth doing, but PROC COMPARE also allows for a quick and dirty comparison of this metadata. So, when you do
not want to compare the values of variables, you can use the NOVALUES option on the PROC COMPARE
statement. Im also including the LISTVAR option, which Ill explain below. Otherwise, this code is identical to that
shown in Example 1.
PROC COMPARE BASE=in.WA_jul2009 COMPARE=in.WA_feb2010 NOVALUES LISTVAR;
TITLE1 'Example : PROC COMPARE with NOVALUES & LISTVAR options';
RUN;
If I did not include the LISTVAR option, the output that will be produced will be identical to that shown in Figure 1,
except it will stop after the table titled Variables with Unequal Values. So, you will still get a listing of the variables
that have unequal values, but you wont get listings of the specific value differences. And, without LISTVAR you
would also not get a listing of the names of the variables that are unique to one data set or the other and likely this
information is of interest. Figure 3 shows the additional information that LISTVAR produces; this section of output will
come after the Variables Summary and before the Listing of Variables with Conflicting Types.
Type
Length
revis_c1
revis_c2
revis_c3
defscore_rank
zipcode
Num
Num
Num
Num
Char
8
8
8
8
5
Label
Cycle 1 - Number of Revisits
Cycle 2 - Number of Revisits
Cycle 3 - Number of Revisits
Rank for Variable defscore
Zip code of Provider
Type
Length
defscore0104
def5star0104
ZIP
COUNTY
CONTROL
Num
Num
Char
Char
Char
8
8
5
3
2
Label
Total Weighted Survey Deficiency Score (3 cycles)
5-star ranking system for total score
Zip code of Provider
SSA GEOGRAPHIC CODE of provider
Nature of organization that operates a provider
Note that if you just want a list of variables in the BASE data set, but not in the COMPARE data set, use
LISTBASEVAR instead of LISTVAR. Conversely, if you just want a list of the variables in the COMPARE data set
and not the BASE data set, use the LISTCOMPVAR option.
An even more concise output will be produced with the BRIEF (or BRIEFSUMMARY) option. Compared to the
default output (Figure 1), the BRIEF option will suppress the Data Set Summary Report, the Variables Summary
Report, the Observation Summary Report and the Values Comparison Summary Report.
PROC COMPARE BASE=in.WA_jul2009 COMPARE=in.WA_feb2010 NOVALUES BRIEF;
TITLE1 'Example 3A: PROC COMPARE with NOVALUES & BRIEF options';
RUN;
In conjunction with the NOVALUES option the OUTPUT is very brief indeed just two NOTES are included. The
entire output is shown in Figure 4.
Figure 4. Complete SAS Listing for COMPARE with NOVALUES and BRIEF options
Example 3A PROC COMPARE with NOVALUES & BRIEF options
The COMPARE Procedure
Comparison of IN.WA_JUL2009 with IN.WA_FEB2010
(Method=EXACT)
NOTE: Data set IN.WA_JUL2009 contains 3 observations not in IN.WA_FEB2010.
NOTE: Values of the following 14 variables compare unequal: PROVNUM SurveyDate
numcycles defscore def5star bedcert MDS5star rnstf5 staff5star PROVNAME CITY RESTOT
OCCUPY Comp5star
So, BRIEF and NOVALUES take most of the output away. If you add LISTVAR, then what results might be the most
useful combination when it is a variable comparison that is of most interest.
These three options together provide the following items (refer back to Figures 1, 3 and 4), which give you the key
comparisons with respect to data set contents:
Listing of variables in the BASE data set and not the COMPARE data set (Figure 3)
Listing of variables in the COMPARE data set and not the BASE data set (Figure 3)
Two notes indicating number of observations found in one data set and not the other; and listing of variables
with unequal values (Figure 4).
Finally, there is another alternative for putting these variable differences to the output. If what I care about is knowing
if there are differences in what variables are on the two data sets, and if they have any differing attributes, and nothing
about the values, then I might choose to direct the variable information to the SAS log. You can do this with the
WARNING option. In the code below, Im also using NOPRINT so that no printed output is produced; all the info is
directed to the log.
PROC COMPARE BASE=in.WA_jul2009 COMPARE=in.WA_feb2010 NOVALUES WARNING NOPRINT;
TITLE1 'Example 3C: PROC COMPARE with NOVALUES, WARNING and NOPRINT options';
RUN;
In the LOG, shown in Figure 5, you get a little bit less information than we had seen in the printed output (e.g. you do
not get the names of the variables with differing attributes) but enough to let you know that there are some important
differences.
Figure 5. SAS Log for Example 3C: COMPARE with NOVALUES, WARNING and NOPRINT options
21
22
23
WARNING:
WARNING:
WARNING:
WARNING:
WARNING:
WARNING:
Note that including the LISTVAR option would have no effect on the information provided in the log; that is, you will
not get a listing of the unique variables in the log.
Instead of the WARNING option, there is also an ERROR option, which will produce exactly the same information in
the log as shown in Figure 5, but the statements will be proceeded with ERROR: instead of WARNING:. This
could be useful if you want to stop processing in the event of data set differences or you are scanning the log
specifically for ERROR messages. I often have to deliver a flat text file to a client, and I want to be sure that the
information it contains is identical to the SAS data set on which it is based. What we typically do in this case is use a
DATA step with FILE and PUT statements to write the text file, followed by another DATA step in which we use
INFILE and INPUT statements to read it back in to a SAS data set. We then use a PROC COMPARE step with the
ERROR option comparing the original SAS data set with the one just read in so that processing stops if ANY
differences are identified.
Observation Summary
Base Compare ID
1
1 PROVNUM=505004
1
1 PROVNUM=505004
234
231 PROVNUM=50A263
234
231 PROVNUM=50A263
Type
SurveyDate
defscore
def5star
bedcert
MDS5star
rnstf5
staff5star
PROVNAME
RESTOT
OCCUPY
Comp5star
NUM
NUM
NUM
NUM
NUM
NUM
NUM
CHAR
NUM
NUM
NUM
Len1 Len2
8
8
8
4
4
8
8
50
8
8
8
8
8
8
8
4
8
8
55
8
8
8
Ndif
MaxDif
MissDif
131
214
64
15
123
69
73
4
118
122
94
467
238
2.000
40.000
3.000
2.000
3.000
0
0
0
0
0
4
4
0
0
0
0
24.000
30.000
3.000
___________________________________________________________
|| Total certified beds
|| Total medicare and or medicaid certified
|| .. beds
||
Base
Compare
PROVNUM
||
bedcert
BEDCERT
Diff.
% Diff
__________ || _________ _________ _________ _________
||
505004
||
180.0000
140.0000
-40.0000
-22.2222
505010
||
95.0000
97.0000
2.0000
2.1053
NOTE: The MAXPRINT=2 printing limit has been reached for the variable bedcert. No more
values will be printed for this comparison.
____________________________________________________________
|| Provider Name
|| Base Value
Compare Value
PROVNUM
|| PROVNAME
PROVNAME
__________ || ___________________+ ___________________+
||
505033
|| ROCKWOOD SOUTH HILL
ROCKWOOD RETIREMENT
505069
|| CRISTA SENIOR COMMUN CRISTWOOD NURSING AN
<snip>
____________________________________________________________
|| % of beds occupied
|| restot/bedcert x 100
||
Base
Compare
PROVNUM
||
OCCUPY
OCCUPY
Diff.
__________ || _________ _________ _________
||
505004
||
51.0000
65.0000
14.0000
505010
||
39.0000
38.0000
-1.0000
% Diff
_________
27.4510
-2.5641
NOTE: The MAXPRINT=(2,20) printing limit has been reached. No more values will be
printed.
The output is fairly self-explanatory, but Ill just point out a couple of items.
Without the ID statement, all we knew was that one data set had 3 more observations than the other; now
from the Observation Summary we see that there are 5 PROVNUMs in the BASE data set (July 2009) that
are not in the COMPARE data set (Feb 2010), and conversely, there are 2 PROVNUMs in the COMPARE
data set that are not in the BASE data set. (In the next example, well see how to find out what the mismatched PROVNUMs are.)
The Variables with Unequal Values table provides some summary information about value differences.
Specifically,
o
Ndif is the number of observations that have different values on the variable
For numeric variables, MaxDiff gives the maximum difference between any two matching
observations on the variable (not counting missing values).
MissDiff gives the number of observations for which the variable is missing on one data set and not
the other.
It is handy that SAS maintains the formatting of dates in value comparisons for SAS date variables (e.g.
SurveyDate variable). Note, however, that SAS compares unformatted values (dates or otherwise).
If the compared variables have different labels, both are printed in the values comparisons section (e.g.
BEDCERT).
For the difference and percent difference shown in the Values Comparison results, these are computed with
the value in the BASE data set as the reference value. Difference and percent difference will be positive if
the value in the comparison data set is greater, and negative if the value in the base data set is greater.
Percent difference is computed relative to the value in the BASE data set, and will be missing if the value in
the BASE data set is 0.
With respect to values comparisons, whether or not two values are judged unequal depends on the
CRITERION option and the METHOD option. The default value of CRITERION is 0.00001, but it can be
modified to give greater or less precision in the values comparisons. (An example is provided later
Example 7). If a CRITERION value is not specified, the default METHOD is EXACT, which tests for exact
equality; if a CRITERION value is specified then the default method is RELATIVE. In fact, if METHOD is
EXACT then any specification of a CRITERION value is ignored. The header info (near the top of Figure 6)
shows which METHOD option is in force. Other options for METHOD include:
ABSOLUTE, which compares the absolute value of the difference in values between the variable on
the two data sets and judges them not equal if the absolute value exceeds the value of
CRITERION;
PERCENT, which compares the absolute percent difference to the CRITERION value; and
RELATIVE, which compares the absolute relative difference to the CRITERION value.
The values comparisons are often the part of the report that are of most interest to the analyst. Remember
that by changing the parameters of the MAXPRINT option you can show all of the differences. Well come
back to some other ways to modify and enhance this output in some later examples.
10
The output generated by this code is shown in Figure 7. For brevity, I have removed the Variables Summary, the
Listing of Common Variables with Conflicting Types, the Listing of Common Variables with Differing Attributes and the
Variables with Unequal Values Summary table, but I am showing the tables produced by LISTEQUALVAR and
LISTOBS.
Figure 7. Partial output for PROC COMPARE with ID Statement with LISTOBS, and LISTEQUALVAR options
Example 5: PROC COMPARE with ID statement, LISTOBS and LISTEQUALVAR options
The COMPARE Procedure
Comparison of IN.WA_JUL2009 with IN.WA_FEB2010
(Method=EXACT)
Dataset
IN.WA_JUL2009
IN.WA_FEB2010
Created
20FEB10:13:23:39
20FEB10:13:23:39
20FEB10:13:23:39
20FEB10:13:23:39
22
22
NObs
234
231
Label
WA: July 2009
WA: February 2010
11
Type
STATE
numcycles
CITY
statename
CHAR
NUM
CHAR
CHAR
Len1 Len2
2
8
28
20
2
8
28
20
Label
State abbreviation of Provider
# of cycles of deficiency data (0,2 or 3)
City of Provider
State Name
Thanks to the LISTOBS option, the Comparison Results for Observations section gives us an ordered listing of
observations in one data set and not the other. Analogous to the LISTBASEVAR and LISTCOMPVAR options, there
are LISTBASEOBS and LISTCOMPOBS options, which would give listings of just observations that are in the Base
data set and not the Compare data set (LISTBASEOBS) or just the observations that are in the Comparison data set
and not the Base data set (LISTCOMPOBS). Specifying LISTOBS option is equivalent to specifying both
LISTBASEOBS and LISTCOMPOBS. There is also a LISTALL option, which is the equivalent of specifying both
LISTVAR and LISTOBS so it generates lists of both observations and variables that are unique to one data setI
tend to just use LISTVAR and LISTOBS though, as their names sound more like their functions.
The LISTEQUALVAR option does not produce a listing of all common variables rather it produces a list of any
variables that are judged to be equal for all matching observations on the two files. This is a little different from most
of the other LISTxxx options.
Table 1. PROC COMPARE LIST options, along with their functions and equivalencies
Option
Function
LISTALL
LISTVAR
LISTOBS
LISTBASE
LISTCOMP
LISTBASEVAR
LISTCOMPVAR
LISTBASEOBS
LISTCOMPOBS
LISTEQUALVAR
12
Partial output is shown in Figure 8. I have snipped out the Data Set Summary.
Figure 8. Partial output for PROC COMPARE comparison of selected variables, with ALLSTATS option
Example 6: Comparing values only for selected variables
Comparison of IN.WA_JUL2009 with IN.WA_FEB2010
(Method=EXACT)
Variables Summary
Number
Number
Number
Number
Number
Number
Number
of
of
of
of
of
of
of
Base
Compare
1
2
234
234
1
2
231
231
Obs
Unequal
Unequal
Obs
ID
PROVNUM=505004
PROVNUM=505009
PROVNUM=50A263
PROVNUM=50A263
13
Variable
def5star
staff5star
Type Len
NUM
NUM
Label
8
8
Ndif
64
73
MaxDif
MissDif
2.000
3.000
0
4
Note that the Variables Summary now has a row telling us that there are two VAR statement variables but is
otherwise unchanged (we are still getting summary information about variables not listed on the VAR statement). The
Observation summary, however, is more affected. The First Unequal and Last Unequal refer to observations that are
different on one or more of the VAR variables. Also, the Number of Observations with some Compared Variables
Unequal and the Number of Observations with all Compared variables Equal refer only to the VAR variables (i.e.
the compared variables). All of the information in the Values Comparison Summary pertains specifically to the
compared variables.
14
Lets look a little more carefully at the summary information produced by the ALLSTATS option. The N row for each
variable gives the number of non-missing values; then N value for the Diff. And % Diff columns gives the number
of matching values that can be compared (i.e. non-missing in both data sets) the fact that (for the variable
STAFF5STAR) this number is less than the 224 shown for each data set is an indication that the variable is
occasionally missing for one data set and not for the other . The Mean, Std, Max, Min StdErr, t and Prob > |t| rows
give, respectively, the mean, standard deviation, maximum, minimum, the standard error of the mean, the t statistic
(mean/stderr), and the probability that t would be greater if the population mean was 0. Under the Diff and %Diff
columns these statistics refer to the paired differences, which may provide some useful information. For example, for
the STAFF5STAR variable, the fact that the mean of the differences is positive (0.1081) indicates that these ratings
have, on average, risen between July 2009 and February 2010. Further, the fact that the Prob>|t| value is 0.0453
indicates that this improvement is likely greater than that expected by chance alone this is a paired t-test. So, weve
accomplished a little longitudinal analysis!!
What about the 3 rows at the bottom Ndif, DifMeans and r, rsq? Ndif gives the number (and percent) of values for
the variable that are different on the two data sets. The DifMeans row gives the difference between the mean values
on the two datasets, first expressed as a percentage of the base mean, second as a percentage of the comparison
mean and finally simply the difference in the two means. Analytically, it is important to see why this value (0.1250 for
STAFF5STAR) is different from the mean of the differences (0.1081). The DifMeans value (0.1250) just takes the
difference in the two sample means, which will differ from the mean of the paired differences (0.1081) because the
latter is based on the only those where the value is non-missing in both cases and this is what one wants when the
interest is examining how individual facilities change rather than the overall sample of facilities. Finally, the r and rsq
values are the Pearson correlation and its square (i.e. R-square) of the matching non-missing values for the variable
between the two data sets it can be viewed as a crude measure of the autocorrelation over the six month period
between the two sets of ratings.
NOMISSBASE judges a missing value in the base data set equal to any value in the comparison data set.
NOMISSCOMP judges a missing value in the comparison data set equal to any value in the base data set.
NOMISSING judges missing values in either data set equal to any value in the other data set.
The NOMISSBASE and NOMISSCOMP options can be used to evaluate what changes would be made to a master
data set in an UPDATE step since missing values on the transaction data set do not overwrite non-missing values on
the master dataset. Hence, if you run a PROC COMPARE with the NOMISSCOMP option, you will get a listing only
of non-missing values in the comparison data set that differ from the values on the matching observations on the base
data set, giving you insight into the expected results of an UPDATE step in which the base data set was the master
and the comparison data set was the transaction file.
Here we just tweak the previous example a little by implementing the NOMISS option. Im also just comparing the
STAFF5STAR variable for simplicity and suppressing any listing of value differences with the NOVALUES option. I
also added NOSUMMARY to eliminate the printing of the SUMMARY tables (Data set summary, variable summary,
observations summary and values comparison summary) to shorten the output.
PROC COMPARE BASE=in.WA_jul2009 COMPARE=in.WA_feb2010
NOSUMMARY ALLSTATS NOVALUES NOMISS;
TITLE1 'Example 6A: Comparing values only for selected variables NOMISS Option';
ID provnum;
VAR staff5star;
RUN;
The complete output is shown in Figure 9, and the results are somewhat unexpected.
15
Figure 9. Output for PROC COMPARE specifying comparison of selected variables, with NOMISS option
Example 6A: Comparing values only for selected variables - NOMISS Option
NOTE: Data set IN.WA_JUL2009 contains 5 observations not in IN.WA_FEB2010.
NOTE: Data set IN.WA_FEB2010 contains 2 observations not in IN.WA_JUL2009.
NOTE: Values of the following 1 variables compare unequal: staff5star
Value Comparison Results for Variables
__________________________________________________________
|| 5-star Rating for Staffing
||
Base
Compare
|| staff5star staff5star
Diff.
% Diff
________ || _________ _________ _________ _________
N
||
224
224
222
222
Mean
||
3.2455
3.3705
0.1081
11.3889
Std
||
1.1704
1.0968
0.7999
45.7594
Max
||
5.0000
5.0000
3.0000
300.0000
Min
||
1.0000
1.0000
-3.0000
-75.0000
StdErr
||
0.0782
0.0733
0.0537
3.0712
t
||
41.5012
45.9948
2.0137
3.7083
Prob>|t| ||
<.0001
<.0001
0.0453
0.0003
||
Ndif
DifMeans
r, rsq
||
||
||
69
3.851%
0.753
30.131%
3.709%
0.566
0.1250
The only thing that differs in the statistical output is the Ndif it is now 69, where it was 73 without the NOMISS
option. All the other statistics are the same as previously and this is a reminder of what the NOMISS option is doing
(and what it is not doing). It is just ignoring missing values in determining if the two data sets have the same value for
the compared variable (i.e. missing compares equal to any value), NOT excluding missing values so the N for the
two data sets has not changed, nor has the DifMeans.
16
Figure 10. Partial Output for PROC COMPARE using VAR and WITH Statements
Example 7: Comparing values for selected variables - VAR & WITH statements
Variables Summary
<snip>
Number of ID variables: 1.
Number of VAR Statement Variables: 2.
Number of WITH Statement Variables: 2.
<snip>
Values Comparison Summary
Number of Variables Compared with All Observations Equal: 0.
Number of Variables Compared with Some Observations Unequal: 2.
Total Number of Values which Compare Unequal: 19.
Maximum Difference: 40.
All Variables Compared have Unequal Values
Variable
Type
Len
Compare
Len
zipcode
bedcert
CHAR
NUM
5
4
ZIP
BEDCERT
5
8
Label
Zip code of Provider
Total certified beds
Ndif
MaxDif
4
15
40.000
||
||
||
||
||
||
||
||
Ndif
DifMeans
r, rsq
||
||
||
229
95.4236
40.5094
215.0000
20.0000
2.6769
35.6466
<.0001
15
-0.279%
0.993
17
229
95.1572
40.5153
215.0000
20.0000
2.6773
35.5418
<.0001
6.550%
-0.280%
0.986
229
-0.2664
4.7308
20.0000
-40.0000
0.3126
-0.8521
0.3951
-0.2664
229
-0.2104
3.9771
18.1818
-29.4737
0.2628
-0.8006
0.4242
In the Variables Summary, we see the number of VAR variables and WITH variables.
In the Values Comparison Summary, we see which variable in the base data set is being compared to
which variable in the comparison data set. (Compare this back to what was seen with just the VAR
statement Figure 8.)
Since ZIP (& ZIPCODE) are character variables, the listing of value comparisons does not include a difference
calculation. Similarly, the ALLSTATS option does not produce any output for character variables.
For the comparison of BEDCERT, there is the same N on both data sets (229) and for the comparison (no
missing values on either) so the mean of the differences is identical to the difference in means, namely 0.2664. Recall that this is different than what we saw in Examples 6 and 6A.
Variables on the VAR and WITH statements are matched up one-to-one; that is, the first variable on the
VAR statement in the base data set is compared with the first variable on the WITH statement in the
comparison data set, the second with the second, and so on. If the two statements have different numbers
of variables the behavior is as follows:
o
If there are more variables on the VAR statement than the WITH statement, SAS will try to
compare the extra VAR variables in the base data with variables of the same name on the
comparison data set.
If such a variable exists, the comparison will occur as if the variable had been included on
the WITH statement, though a warning will go into the log stating WARNING: The WITH
statement list is shorter than the VAR statement list. For example, if
BEDCERT had been left off the WITH statement in this example, the output would be
identical to that shown in Figure 10.
If such a variable does not exist, then the extra variable(s) on the VAR statement is
ignored, but the comparison for the matching variables would proceed. Two warnings are
produced. For example, if the variable REVIS_C1 (which is on the July 2009 data set but
not the Feb 2010 dataset; Figure xx) was included on the VAR statement, the following
warnings would be generated:
WARNING: The WITH statement list is shorter than the VAR statement list.
WARNING: The following 1 variables are not in IN.WA_FEB2010: revis_c1.
But the comparison of ZIP and ZIPCODE and BEDCERT with BEDCERT would proceed,
producing the same output as in Figure 10.
o
If there are more variables on the WITH statement than the VAR statement, SAS will put a
WARNING in the log stating WARNING: The WITH statement list is longer than the VAR
statement list but will then essentially ignore the extra variables on the WITH statement,
whether or not these variables exist on the base data set. For example, if BEDCERT was left off
the VAR statement, only the comparison of ZIP and ZIPCODE would be generated. And if another
variable that existed on just the comparison data set was added on the WITH statement, only the
one warning about more variables on the WITH statement would be produced in the log, and the
comparison of the matching WITH and VAR variables (here ZIP with ZIPCODE and BEDCERT
with BEDCERT) would occur.
If you want to compare one variable on the base data set with two different variables on the comparison
data set, you would just include the variable name twice on the VAR statement. For example, to compare
BEDCERT on the base data set with BEDCERT and RESTOT on the comparison data set, the code would
be as follows (output not shown).
18
Note that the ID statement is not needed (though it would have no effect if included). By including the NOVALUES
option, we do not get a list of differences. And, since I did not include ALLSTATS, I am not getting summary statistics
on the variables and their differences, so the output is quite short, and is in its entirety shown in Figure 11.
Figure 11. Output for Comparing Values of Variables on the Same Data Set
Example 8:
Dataset
IN.WA_FEB2010
Created
Modified
NVar
NObs
07FEB10:16:07:35
07FEB10:16:07:35
22
231
Type
def5star
defscore
def5star
NUM
NUM
NUM
Len
8
8
8
Compare
def5star0104
defscore0104
staff5star
Len
Ndif
MaxDif
MissDif
8
8
8
9
92
168
1.000
126
4.000
0
0
5
19
A portion of the output is shown in Figure 12. Up to the limits imposed by the prevailing values of the MAXPRINT
option, there will be a row in the output for every observation where the values of one or more of the VAR variables
differs between the two data sets. If VAR and WITH statements are both included, they generate the same
comparisons as previously, but again the results are provided by observation rather than by variable. Without a VAR
statement, all matching variables will be compared and differences printed. If the ID variable were not included, the
listing would be by observation number, but in most cases where TRANSPOSE is employed, youll be wanting to use
an ID variable as well.
I have found this particular way of using PROC COMPARE to be very helpful to check that expected updates to a
data set have been made, for example, after data cleaning and editing.
Compare
3.000000
Diff.
1.000000
% Diff
50.000000
PROVNUM=505016:
Variable
Base Value
staff5star
3.000000
Compare
4.000000
Diff.
1.000000
% Diff
33.333333
PROVNUM=505017:
Variable
Base Value
def5star
4.000000
Compare
3.000000
Diff.
-1.000000
% Diff
-25.000000
PROVNUM=505027:
Variable
Base Value
def5star
3.000000
staff5star
5.000000
Compare
4.000000
4.000000
Diff.
1.000000
-1.000000
% Diff
33.333333
-20.000000
PROVNUM=505350:
Variable
Base Value
staff5star
.
Compare
3.000000
Diff.
.
% Diff
.
PROVNUM=505434:
Variable
Base Value
def5star
1.000000
Compare
2.000000
Diff.
1.000000
% Diff
100.000000
NOTE: The MAXPRINT=50 printing limit has been reached for the variable def5star.
No more values will be printed for this comparison.
NOTE: Data set IN.WA_JUL2009 contains 5 observations not in IN.WA_FEB2010.
NOTE: Data set IN.WA_FEB2010 contains 2 observations not in IN.WA_JUL2009.
NOTE: Values of the following 2 variables compare unequal: def5star staff5star
20
OUT=outcomp OUTNOEQUAL
As with other examples, a subset of the output is shown (Figure 13), and then described below.
Figure 12. Partial Listing of an Output Data Set
Example 10:
PROVNUM _TYPE_
505009
BASE
COMPARE
DIF
2
2
2
2
3
1
2
2
E
505016
BASE
COMPARE
DIF
4
4
4
3
3
E
3
4
1
505027
BASE
COMPARE
DIF
8
8
8
3
4
1
5
4
-1
HEARTHSTONE, THE
HEARTHSTONE, THE
...........................................
505033
BASE
COMPARE
DIF
9
9
9
4
4
E
4
4
E
505069
BASE
COMPARE
DIF
13
13
13
3
3
E
5
4
-1
505333
BASE
112
505350
BASE
COMPARE
DIF
123
121
121
5
5
E
.
3
.
505426
BASE
171
HERITAGE GROVE
505430
BASE
COMPARE
DIF
173
170
170
5
5
E
2
4
2
505431
BASE
COMPARE
DIF
174
171
171
4
2
-2
4
2
-2
21
505518
COMPARE
225
50A263
BASE
COMPARE
DIF
234
231
229
4
3
-1
5
5
E
The OUT= option produces an output data set. By default this would have one observation for each match
on the ID variable, showing differences (if any) between the two data sets on the VAR variables.
Note that each observation on the output data set is identified by the combination of the ID variable, and the
automatic variable _TYPE_. The _TYPE_ variable has one of four possible values BASE, COMPARE,
DIF and PERCENT.
o
BASE shows the values of the compared variables on the BASE data set.
observation is written to the output data set if the OUTBASE option is specified.
COMPARE shows the values of the compared variables on the BASE data set. This _TYPE_ of
observation is written to the output data set if the OUTCOMPARE option is specified.
DIF shows the differences between the values on the base and comparison data sets.
This _TYPE_ of
For numeric variables, E (actually .E, the special missing value) indicates the values are
equal on that variable and that observation.
For character variables, a period (.) is included for each position that is the same between
the two data sets and an X is used to designate unequal characters.
PERCENT would give the differences for numeric variables in percentage terms.
variable differences are shown in the same was as the DIF observations.
Character
The _OBS_ variable gives the number of the observation from the data set indicated by the _TYPE_ variable
for BASE and COMP observations. For observations with the _TYPE_ of DIF or PERCENT, _OBS_ is a
counter for the matching observations between the two data sets.
By specifying OUTNOEQUAL the output data set will contain observations only for cases where there is a
match on the ID variables and one or more of the compared variables differ between the two data sets.
The OUTBASE and OUTCOMP options also ensure that non-matching observations (i.e. the ID value is in
one data set and not the other) will be included in the output data set. In this way they are comparable to the
LISTBASEOBS and LISTCOMPOBS options discussed earlier.
CONCLUSION
I hope that the examples presented in this paper have convinced you the PROC COMPARE is a utility procedure that
it is worth getting to know. While Ive presented a lot of examples and different ways of using PROC COMPARE,
there are several features that I have not even addressed, and Ill mention a few of these here:
Though PROC COMPARE was a part of Base SAS for years before the Output Delivery System (ODS)
existed, as with all other procedures, the various pieces of the PROC COMPARE output are available as
ODS objects. I encourage you to explore this if you need to customize your PROC COMPARE output, but I
have to say that many of the available objects are not as easy to manipulate as one might hope.
In addition to the OUT= data set features that I introduced in Example 10, there is an OUTSTATS data set
available, which contains more of the summary statistical information about differences (such as the info
shown in Example 6 and 7). If this type of information is useful in your application and you need to work with
it in further analyses or report it in a different way, experiment with the OUTSTATS= option.
22
PROC COMPARE has a BY statement, which allows you to stratify your comparison on variables of interest.
This works pretty much the way the BY statement works in other procedures, and would provide a way of
examining whether data set differences vary in systematic ways based on stratification (BY) variables of
interest.
While I discussed the CRITERION and METHOD options briefly, if your COMPARE needs are such that you
need to control what is meant by equality, youll need to work with these options further to fine tune your
comparisons. Additionally, there is a FUZZ= option that can be used to control what differences are printed.
When PROC COMPARE runs, a return code is stored in the automatic macro variable SYSINFO, and the
value of this code provides information about the comparison results. For example, there are different
values for minor differences such as the data sets have differing labels vs. potentially more critical
differences such as conflicting variable types or value differences. The values are scaled so that these types
of differences could be parsed from the SYSINFO variable, potentially to direct further processing. There is
a table of these values in the PROC COMPARE chapter of the Base SAS Procedures guide, referenced
below.
Ive found this PROC to be really useful in a lot of my work with big government data sets just another way to get to
know your data. If youve never used it or it has been a while, take another look. And, happy COMPARE-ing!
REFERENCES
SAS Institute Inc. 2009. The COMPARE Procedure. Base SAS 9.2 Procedures Guide. Cary, NC: SAS Institute Inc.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Name: Christianna Williams
Enterprise: Abt Associates Inc.
E-mail: Christianna_Williams@abtassoc.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. indicates USA registration.
Other brand and product names are trademarks of their respective companies.
23