SPSS for Beginners
Copyright 1999 Vijay Gupta
Published by VJBooks Inc.
All rights reserved. No part of this book may be used or reproduced in any form or by any
means, or stored in a database or retrieval system, without prior written permission of the
publisher except in the case of brief quotations embodied in reviews, articles, and research
papers. Making copies of any part of this book for any purpose other than personal use is a
violation of United States and international copyright laws.
This book is sold as is, without warranty of any kind, either express or implied, respecting the
contents of this book, including but not limited to implied warranties for the book's quality,
performance, merchantability, or fitness for any particular purpose. Neither the author, the
publisher, nor its dealers or distributors shall be liable to the purchaser or any other person or
entity with respect to any liability, loss, or damage caused or alleged to be caused directly or
indirectly by the book.
This book is based on SPSS versions 7.x through 10.0. SPSS is a registered trademark of SPSS
Inc.
Publisher: VJBooks Inc.
Editor: Vijay Gupta
Author: Vijay Gupta
About the Author
Vijay Gupta has taught statistics, econometrics, SPSS, LIMDEP, STATA, Excel, Word,
Access, and SAS to graduate students at Georgetown University. A Georgetown University
graduate with a Masters degree in economics, he has a vision of making the tools of
econometrics and statistics easily accessible to professionals and graduate students. At the
Georgetown Public Policy Institute he received rave reviews for making statistics and SPSS so
easy and "non-mathematical." He has also taught statistics to institutions in the US and abroad.
In addition, he has assisted the World Bank and other organizations with econometric analysis,
survey design, design of international investments, cost-benefit and sensitivity analysis,
development of risk management strategies, database development, information system design
and implementation, and training and troubleshooting in several areas. Vijay has worked on
capital markets, labor policy design, oil research, trade, currency markets, transportation policy,
market research, and other topics relating to the Middle East, Africa, East Asia, Latin America, and the
Caribbean. He has worked in Lebanon, Oman, Egypt, India, Zambia, and the U.S.
Acknowledgments
To SPSS Inc., for permission to use screen shots of SPSS.
Dedication
To my Grandmother, the late Mrs. Indubala Sukhadia, member of India's Parliament. The
greatest person I will ever know. A lady with more fierce courage, radiant dignity, and
leadership and mentoring abilities than any other.
TABLE OF CONTENTS

INTRODUCTION
Merits of the Book
Organization of the Chapters
Conventions Used in this Book
Quick Reference and Index: Relation Between SPSS Menu Options and the Sections in the Book

4.2 Boxplots
4.3 Comparing Means and Distributions
6. Tables
6.1 Tables for Statistical Attributes
6.2 Tables of Frequencies

Detailed Contents

Merits of the Book
1.7.D Complex Filter: Choosing a Sub-set of Data Based On Criterion from More than One Variable
3.4 Testing if the Mean is Equal to a Hypothesized Number (the T-Test and Error Bar)
3.4.C Error Bar (Graphically Showing the Confidence Intervals of Means)
3.4.A A Formal Test: The T-Test
6. TABLES
7.4 Diagnostics
7.4.A Collinearity
7.4.B Misspecification
7.4.C Incorrect Functional Form
7.4.D Omitted Variable
7.4.E Inclusion of an Irrelevant Variable
7.4.F Measurement Error
7.4.G Heteroskedasticity
9.1 Logit
12.3 Reading Data Stored in ASCII Delimited (Freefield) Format other than Tab
14.3 The Runs Test - Checking Whether a Variable is Randomly Distributed
15.2 Choosing the Default View of the Data and Screen
17.7 Co-integration
Introduction
Chapter 1, Data Handling," teaches the user how to work with data in SPSS. The chapter
teaches how to insert data into SPSS, define missing values, label variables, sort data, filter the
file (work on sub-sets of the file) and other data steps. Some advanced data procedures, such as
reading ASCII text files and merging files, are covered at the end of the book (chapters 12 and
13).
Chapter 2, Creating New Variables, shows the user how to create new categorical and
continuous variables. The new variables are created from transformations applied to the
existing variables in the data file and by using standard mathematical, statistical, and logical
operators and functions on these variables.
Chapter 4, Comparing Variables, explains how to compare two or more similar variables.
The methods used include comparison of means and graphical evaluations.
Chapter 5, Patterns Across Variables (Multivariate Statistics), shows how to conduct basic
analysis of patterns across variables. The procedures taught include Bivariate and partial
correlations, scatter plots, and the use of stem and leaf graphs, boxplots, extreme value tables,
and bar/line/area graphs.
Chapter 6, Custom Tables, explains how to explore the details of the data using custom tables
of statistics and frequencies.
In Chapter 7, Linear Regression, users will learn linear regression analysis (OLS). This
chapter includes checking for the breakdown of classical assumptions and the implications of
each breakdown (heteroskedasticity, mis-specification, measurement errors, collinearity, etc.) in
the interpretation of the linear regression. A major drawback of SPSS is its inability to test
directly for the breakdown of classical conditions. Each test must be performed step-by-step.
For illustration, details are provided for conducting one such test - White's Test for
heteroskedasticity.
Chapter 9, Maximum Likelihood Estimation: Logit, and Non-Linear Estimation, teaches non-
linear estimation methods, including non-linear regression and the Logit. This chapter also
suggests briefly how to interpret the output.
Chapter 10 teaches "comparative analysis," a term not found in any SPSS, statistics, or
econometrics textbook. In this context, this term means "analyzing and comparing the results of
procedures by sub-samples of the data set." Using this method of analysis, regression and
statistical analysis can be explained in greater detail. One can compare results across categories
of certain variables, e.g. - gender, race, etc. In our experience, we have found such an analysis
to be extremely useful. Moreover, the procedures taught in this chapter will enable users to
work more efficiently.
Chapter 11, "Formatting Output," teaches how to format output. This is an SPSS feature ignored
by most users. Reviewers of reports will often equate good formatting with thorough analysis.
It is therefore recommended that users learn how to properly format output.
Chapters 1-11 form the sequence of most statistics projects. Usually, they will be sufficient
for projects/classes of the typical user. Some users may need more advanced data
handling and statistical procedures. Chapters 12-18 explore several of these procedures.
The ordering of the chapters is based on the relative usage of these procedures in
advanced statistical projects and econometric analysis.
Chapter 12, "Reading ASCII Text Data," and Chapter 13, "Adding Data," deal specifically with
reading ASCII text files and merging files. The task of reading ASCII text data has become
easier in SPSS 9.0 (as compared to all earlier versions). This text teaches the procedure from
versions 7.x forward.
Chapter 14, "Non-Parametric Testing," shows the use of some non-parametric methods. The
exploration of various non-parametric methods, beyond the topic-specific methods included in
chapters 3, 4, and 5, are discussed herein.
Chapter 15, "Setting System Options," shows how to set some default settings. Users may
wish to quickly browse through this brief section before reading Chapter 1.
Chapter 16 shows how to read data from any ODBC source database application/format. SPSS
9.0 also has some more database-specific features. Such features are beyond the scope of this
book and are therefore not included in this section that deals specifically with ODBC source
databases.
Chapter 17 shows time series analysis. The chapter includes a simple explanation of the non-
stationarity problem and cointegration. It also shows how to correct for non-stationarity,
determine the specifications for an ARIMA model, and conduct an ARIMA estimation.
Correction for first-order autocorrelation is also demonstrated.
Chapter 18 teaches how to use the two programming languages of SPSS (without having to do
any code-writing yourself).
This may be the first interactive book in academic history! Depending on your
comments/feedback/requests, we will be making regular changes to the book and the free
material on the web site.
The index is in two parts - part 1 is a menu-to-chapter (and section) mapping, whereas part 2 is
a regular index.
[1] A menu is a list of options available from the list on the top of the computer screen. Most software applications have these standard menus: FILE, EDIT, WINDOW, and HELP.
Written instructions are linked to highlighted portions of the picture they describe. The
highlighted portions are denoted either by a rectangle or ellipse around the relevant picture-
component or by a thick arrow, which should prompt the user to click on the image.
Some terms the user will need to know: a dialog box is the box that opens up in any Windows
software program when a menu option is chosen. A menu is the list of procedures that the user
will find at the top of the computer screen; a menu option is one choice within a menu.
Quick reference and index: Relation between SPSS menu options and
the sections in the book
Menu        Sub-Menu                   Section that teaches the menu option
FILE        NEW                        -
FILE        OPEN                       1.1
FILE        DATABASE CAPTURE           16
FILE        READ ASCII DATA            12
FILE        SAVE                       -
FILE        SAVE AS                    -
FILE        DISPLAY DATA INFO          -
FILE        APPLY DATA DICTIONARY      -
FILE        STOP SPSS PROCESSOR        -
EDIT        OPTIONS                    15.1
EDIT        ALL OTHER SUB-MENUS        -
VIEW        STATUS BAR                 15.2
VIEW        TOOLBARS                   15.2
VIEW        FONTS                      15.2
VIEW        GRID LINES                 15.2
VIEW        VALUE LABELS               15.2
DATA        DEFINE VARIABLE            1.2
DATA        DEFINE DATES               -
DATA        TEMPLATES                  -
DATA        INSERT VARIABLE            -
DATA        INSERT CASE, GO TO CASE    -
DATA        SORT CASES                 1.5
DATA        TRANSPOSE                  -
DATA        MERGE FILES                13
DATA        AGGREGATE                  1.4
DATA        ORTHOGONAL DESIGN          -
DATA        SPLIT FILE                 10
DATA        SELECT CASES               1.7
DATA        WEIGHT CASES               1.3
TRANSFORM   COMPUTE                    2.2
TRANSFORM   RANDOM NUMBER SEED         -
Ch 1. DATA HANDLING
Before conducting any statistical or graphical analysis, one must have the data in a form
amenable to a reliable and organised analysis. In this book, the procedures used to achieve this
are termed "Data Handling."[2][3] SPSS terms them "Data Mining." We desist from using their
term because "Data Mining" typically involves more complex data management than that
presented in this book and than would be practical for most users.
The most important procedures are in sections 1.1, 1.2, and 1.7.
In section 1.1, we describe the steps required to read data from three popular formats:
spreadsheet (Excel, Lotus and Quattropro), database (Paradox, Dbase, SYLK, DIF), and SPSS
and other statistical programs (SAS, STATA, E-VIEWS). See chapter 12 for more information
on reading ASCII text data.
Section 1.2 shows the relevance and importance of defining the attributes of each variable in the
data. It then shows the method for defining these attributes. You need to perform these steps
only once - the first time you read a data set into SPSS (and, as you will learn later in chapters 2
and 13, whenever you merge files or create a new variable). The procedures taught here are
necessary for obtaining well-labeled output and avoiding mistakes from the use of incorrect data
values or the misreading of a series by SPSS. The usefulness will become clear when you read
section 1.2.
Section 1.3 succinctly shows why and how to weight a data set if the providers of the data or
another reliable and respectable authority on the data set recommend such weighting.
Sometimes, you may want to analyze the data at a more aggregate level than the data set
permits. For example, let's assume you have a data set that includes data on the 50 states for 30
years (1,500 observations in total). You want to do an analysis of national means over the
years. For this, a data set with only 30 observations, each representing an "aggregate" (the
national total) for one year, would be ideal. Section 1.4 shows how to create such an
"aggregated" data set.
In section 1.5, we describe the steps involved in sorting the data file by numeric and/or
alphabetical variables. Sorting is often required prior to conducting other procedures.
[2] We can roughly divide these procedures into three sub-categories:
- Data handling procedures essential for any analysis. These include the reading of the data and the defining of each variable's attributes (sections 1.1, 1.2, and chapters 12 and 16).
- Data handling procedures deemed essential or important because of the nature of the data set or analysis. These include weighting of the variables, reducing the size of the data set, adding new data to an existing data set, creating data sets aggregated at higher levels, etc. (sections 1.3, 1.4, 1.6, and chapter 13).
- Data handling procedures for enhancing/enabling other statistical and graphical procedures. These include the sorting of data, filtering of a sub-set of the data, and replacing of missing values (sections 1.5-1.8).
[3] The "Data Handling" procedures can be found in the menus FILE and DATA. From the perspective of a beginner or teacher, the biggest drawback of SPSS is the inefficient organisation of menus and sub-menus. Finding the correct menu to conduct a procedure can be highly vexing.
If your data set is too large for ease of calculation, then the size can be reduced in a reliable
manner as shown in section 1.6.
Section 1.7 teaches the ways in which the data set can be filtered so that analysis can be
restricted to a desired sub-set of the data set. This procedure is frequently used. For example,
you may want to analyze only that portion of the data that is relevant for "Males over 25 years
in age."
Creating new variables (e.g. - the square of an existing variable) is addressed in chapter 2.
The most complex data handling technique is "Merging" files. It is discussed in chapter 13.
Click on "Open.
Click on Save."
In SPSS, go to FILE/OPEN.
The data within the defined range will be read. Save the opened file as a SPSS file by going to
the menu option FILE/ SAVE AS and saving with the extension ".sav."
A similar procedure applies for other spreadsheet formats. Lotus files have the extensions "wk."
Note: the newer versions of SPSS can read files from Excel 5 and higher using methods shown
in chapter 16. SPSS will request the name of the spreadsheet that includes the data you wish to
use. We advise you to use Excel 4 as the transport format. In Excel, save the file as an Excel 4
file (as shown on the previous page) with a different name than the original Excel file's name (to
preclude the possibility of over-writing the original file). Then follow the instructions given on
the previous page.
Rather, while still in the statistical program that contains your data, you must save the file in a
format that SPSS can read. Usually these formats are Excel 4.0 (.xls) or Dbase 3 (.dbf). Then
follow the instructions given earlier (sections 1.1.b and 1.1.c) for reading data from
spreadsheet/database formats.
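For readers who prefer typed commands to the menus, the spreadsheet import above can also be run from SPSS syntax. This is a minimal sketch only - the file names and paths are hypothetical, and the options available vary by SPSS version:

    * Read an Excel 4 worksheet; /FIELDNAMES treats row 1 as variable names.
    GET TRANSLATE FILE='c:\data\survey.xls'
      /TYPE=XLS
      /FIELDNAMES.
    * Save the result as an SPSS data file with the extension .sav.
    SAVE OUTFILE='c:\data\survey.sav'.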
In effect, using variable labels indicates to SPSS that: "When I am using the variable
fam_id, in any and all output tables and charts produced, use the label "Family
Identification Number" rather than the variable name fam_id."
In order to make SPSS display the labels, go to EDIT / OPTIONS. Click on the tab
OUTPUT/NAVIGATOR LABELS. Choose the option "Label" for both "Variables" and
"Values." This must be done only once for one computer. See chapter for more.
3. Missing value declaration. This is essential for an accurate analysis. Failing to define the
missing values will lead to SPSS using invalid values of a variable in procedures, thereby
biasing results of statistical procedures. (See section 1.2.c.)
4. Column format can assist in improving the on-screen viewing of data by using appropriate
column sizes (width) and displaying appropriate decimal places (See section 1.2.d.). It does
not affect or change the actual stored values.
5. Value labels are similar to variable labels. Whereas "variable" labels define the label to use
instead of the name of the variable in output, "value" labels enable the use of labels instead
of values for specific values of a variable, thereby improving the quality of output. For
example, for the variable gender, the labels "Male" and "Female" are easier to understand
than "0" or "1. (See section 1.2.e.)
In effect, using value labels indicates to SPSS that: "When I am using the variable gender,
in any and all output tables and charts produced, use the label "Male" instead of the value
"0" and the label "Female" instead of the value "1"."
[6] If you create a new variable using compute or recode (see chapter 2) or add variables using merge (see chapter 13), you must define the attributes of the variables after the variables have been created/added.
TYPE EXAMPLE
Numeric 1000.05
Comma 1,000.005
Scientific 1 * e3
Dollar $1,000.00
String Alabama
SPSS usually picks up the format automatically. As a result, you typically need not worry about
setting or changing the data type. However, you may wish to change the data type if:
1. Too many or too few decimal points are displayed.
2. The number is too large. If the number is 12323786592, for example, it is difficult to
immediately determine its size. Instead, if the data type were made comma, then the
number would read as 12,323,786,592. If the data type was made scientific, then the
number would read as 12.32*E9, which can be quickly read as 12 billion. ("E3" is
thousands, "E6" is millions, "E9" is billions.)
3. Currency formats are to be displayed.
4. Error messages about variable types are produced when you request that SPSS conduct a
procedure.[7] Such a message indicates that the variable may be incorrectly defined.
[7] For example, "Variable not numeric, cannot perform requested procedure."
[8] We knew this from information provided by the supplier of the data.
[9] A width of 1 would also suffice.
97 for No Response
98 for Not Applicable
99 for Illegible Answer
By defining these values as missing, we ensure that SPSS does not use these observations in any
procedure involving work_ex.[10]
Note: We are instructing SPSS: "Consider 97-99 as blanks for the purpose of any calculation or
procedure done using that variable." The numbers 97 through 99 will still be seen on the data
sheet but will not be used in any calculations and procedures.
[10] You will still see these numbers on the screen. The values you define as missing are called User (defined) Missing, in contrast to the System Missing value of null, seen as a period (dot) on the screen.
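The same declaration can be made in one line of syntax - a minimal sketch, assuming the codes 97-99 shown above:

    * Treat the codes 97, 98, and 99 as user-missing for work_ex.
    MISSING VALUES work_ex (97 THRU 99).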
For the dummy variables gender and pub_sec, the column width can be much smaller than the
default, which is usually 8.
For the variable fam_id, you can define the label "Family Identification Number." SPSS displays
the label (and not the variable name) in output charts and tables. Using a variable label will
therefore improve the lucidity of the output.
Note: In order to make SPSS display the labels in output tables and charts, go to EDIT /
OPTIONS. Click on the tab OUTPUT/NAVIGATOR LABELS. Choose the option "Label" for
both "Variables" and "Values." This must be done only once for one computer. See also:
Chapter 15.
We show an example for one variable - pub_sec. The variable has two possible values: 0 (if the
respondent is a private sector employee) or 1 (if the respondent is a public sector employee).
We want to use text labels to replace the values 0 and 1 in any output tables featuring this
variable.
Note: Defining value labels does not change the original data. The data sheet still contains the
values 0 and 1.
Repeat the above for the value 1, then click on the "Continue" button.
WAGE         WAGE                                                       1
             Print Format: F9.2
             Write Format: F9.2

WORK_EX[11]  WORK EXPERIENCE[12]                                        2[13]
             Print Format: F9[14]
             Write Format: F9
             Missing Values[15]: 97 thru 99, -1

EDUC         EDUCATION                                                  3
             Print Format: F9
             Write Format: F9

FAM_MEM      FAMILY MEMBERSHIP NUMBER (IF MORE THAN ONE
             RESPONDENT FROM THE FAMILY)                                5
             Print Format: F8
             Write Format: F8

GENDER                                                                  6
             Print Format: F2
             Write Format: F2
             Value Label[16]
             0  MALE
             1  FEMALE

PUB_SEC                                                                 7
             Print Format: F8
             Write Format: F8
             Value Label
             0  PUBLIC SECTOR EMPLOYEE
             1  PRIVATE SECTOR EMPLOYEE

AGE                                                                     8
             Print Format: F8
             Write Format: F8

[11] This is the name of the variable.
[12] This is the "Variable Label."
[13] This is the column number of the variable.
[14] This is the "Data Type." A type "F9.2" means: Numeric ("F" is for Numeric, "A" for String), width of 9, 2 decimal points.
[15] This is the "Missing Value" definition.
[16] This list gives the "Value Labels" for the variable gender.
The agency that conducted the survey will usually provide a "Weighting Variable" that is
designed to correct the bias in the sample. By using this variable, you can transform the
variables in the data set into Weighted Variables. The transformation is presumed to have
lowered the bias, thereby rendering the sample more "random."[17]

Let's assume that the variable fam_mem is to be used as the weighting variable.

You can turn weighting off at any time, even after the file has been saved in weighted form.

[17] Our explanation is simplistic, but we believe it captures the essence of the rationale for weighting.
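In syntax form, weighting is a single command - a sketch assuming fam_mem is the weighting variable, as above:

    * Weight all subsequent procedures by fam_mem.
    WEIGHT BY fam_mem.
    * Turn weighting off again.
    WEIGHT OFF.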
Note: If this topic seems irrelevant, feel free to skip it. Most projects do not make use of this
procedure.
[18] In the next example, we show how to choose other statistics as the criterion for aggregation.
Click on Open.
Click on Labels.
[19] In our data set, females have a value of 1 in the variable gender.
Click on Function.
You can create several such aggregated files using different break variables (e.g. - age alone, or
age and gender together). In the former, there will be as many observations as there are age
levels. In the latter, the new data set will be aggregated to a further level (so that male and 12
years is one observation, female and 12 years is another).
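A hedged syntax sketch of the aggregation: the break variable, the statistics, and the output file name below are illustrative only, not the book's exact example.

    * Create a new file with one observation per gender, holding the
    * mean wage and mean education of each group.
    AGGREGATE
      /OUTFILE='aggr.sav'
      /BREAK=gender
      /wage_mean=MEAN(wage)
      /educ_mean=MEAN(educ).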
Ch 1. Section 5 Sorting
Sorting defines the order in which data are arranged in the data file and displayed on your
screen. When you sort by a variable, X, then you are arranging all observations in the file by
the values of X, in either increasing or decreasing values of X. If X is a text variable, then the
order is alphabetical. If it is numerical, then the order is by magnitude of the value.
Sorting a data set is a prerequisite for several procedures, including split file, replacing missing
values, etc.
Example of how the sorted data will look (ascending in gender, then descending in educ)
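The equivalent syntax for that example is a one-line sketch, where (A) means ascending and (D) descending:

    * Sort ascending by gender, then descending by educ within gender.
    SORT CASES BY gender (A) educ (D).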
[20] If you choose "Filtered," then the cases that were not selected will not be used in any analysis, but will also not be deleted. Rather, they will be hidden. In the event that you wish to use those cases again, go to DATA/ SELECT CASES and choose the first option, "All Cases."
Similarly, you can study the statistical attributes of females only, adult females only, adult
females with high school or greater education only, etc.[21] If your analysis, experience, research
or knowledge indicates the need to study such sub-sets separately, then use DATA/ SELECT
CASE to create such sub-sets.
[21] Apart from allowing you to concentrate on a sub-set of the sample, SELECT CASE (or filtering, as it is often called) creates dummy variables on the basis of the subgroup you filter. Let's assume you have used DATA/ SELECT CASE to filter in adult females. SPSS will create a new variable that takes the value of 1 for the filtered-in observations (i.e. - for adult females) and a value of 0 for all other observations. This dummy variable may be used in regression and other analysis and, even more importantly, in running a comparative analysis to compare the differences in statistical and graphical results for adult females versus the rest (see chapter 10). Ignore this footnote if it is too complex or seems irrelevant. We will get back to the use of filtering later in the book. Within the proper context, the usefulness will become apparent.
The filtered-out data have a diagonal line across the observation number. These observations
are not used by SPSS in any analysis you conduct with the filter on.
[22] SPSS creates a filter variable each time you run a filter. This variable takes the value 1 if the case satisfies the filter criterion and 0 if it does not. This variable can be used as a dummy variable (see chapters 3-9). Also, if you want to compare more deeply between the filtered and unfiltered groups, use the filter variable as a criterion for comparative analysis (see chapter 10).
Do not forget this step. Reason: You may conduct other procedures (in the current or next
SPSS session) forgetting that SPSS is using only a sub-set of the full data set. If you do so,
your interpretation of output would be incorrect.
Operator   Symbol   Meaning
Equal to   =        Equal to.
And        &        Satisfies BOTH criteria. For example, if you want to isolate public sector
                    (pub_sec=1) females (gender=1), your condition would be pub_sec=1 & gender=1.
Example 1
Let's assume you want to select only those male employees (gender=0) who work in the public
sector (pub_sec = 1).
Choose the menu option DATA/ SELECT
CASE
[23] The symbol ~ can be obtained by pressing down on the shift button and clicking on the tilde/backquote key on the upper left portion of the keyboard.
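When you run Example 1 through the menus, SPSS generates (and lets you paste) syntax along the following lines - a sketch only; the filter variable name is chosen by SPSS:

    USE ALL.
    * filter_$ is 1 for public sector males, 0 otherwise.
    COMPUTE filter_$ = (gender = 0 & pub_sec = 1).
    FILTER BY filter_$.
    EXECUTE.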
Now let us assume that you want to select cases where the respondent is a female (gender=1)
and her wage is above twenty (wage > 20).
To do so, choose DATA / SELECT CASES, and If Condition is Satisfied. (See section 1.7.a
for details on the process involved.) In the large white window, you want to specify female
(gender =1) and wages above twenty (wage>20). Select gender = 1 & wage > 20.
Now you can conduct analysis on "Adult Females only." (See sections 1.7.b and 1.7.c.)
Let's assume you want to choose the lowest or highest levels of education (education < 6 or
education > 13). Under the DATA menu, choose SELECT CASES and If Condition is
Satisfied (see section 1.7.a for details on the process involved). In the large white window, you
must specify your conditions. Remember that the operator for "or" is |, which is the symbol
that results from pressing the keyboard combination SHIFT and "\." Type in educ < 6 | educ
> 13 in the large white window.
Now you can conduct analysis on "Respondents with Low or High Education only." (See
sections 1.7.b and 1.7.c.)
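The corresponding pasted syntax would look something like this sketch:

    USE ALL.
    * filter_$ is 1 when education is very low OR very high.
    COMPUTE filter_$ = (educ < 6 | educ > 13).
    FILTER BY filter_$.
    EXECUTE.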
If you have some idea about the patterns and trends in your data, then you can replace missing
values with extrapolations from the other non-missing values in the proximity of the missing
value. Such extrapolations make more sense for, and are therefore used with, time series data.
If you can arrange the data in an appropriate order (using DATA/ SORT) and have some
sources to back your attempts to replace missing values, you can even replace missing values in
cross-sectional data sets - but only if you are certain.
Let's assume work_ex has several missing values that you would like to fill in. The variable age
has no missing values. Because age and work_ex can be expected to have similar trends (older
people have more work experience), you can arrange the data file by age (using DATA/ SORT
and choosing age as the sorting variable - see section 1.5) and then replace the missing values
of work_ex with neighboring values of work_ex itself.
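The menu TRANSFORM/ REPLACE MISSING VALUES pastes syntax roughly like the sketch below. The new variable name and the two-point span are assumptions; check the options your SPSS version offers:

    * Create work_ex_1: work_ex with missing values replaced by the
    * mean of the nearby (sorted-by-age) data points.
    RMV /work_ex_1 = MEAN(work_ex, 2).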
[24] This is a safe strategy - many statisticians are sceptical of analysis in which the analyst has let the computer fill in missing values. We therefore suggest that you not let the original variable change, as you may need it again for future analysis.
Example: for a certain project (let's assume it is "Analysis of Gender Bias in Earnings") you
may need to use a certain sub-set of variables. For a different project (let's assume it is
"Sectoral Returns to Education"), you may need to use a different set of variables.
Ch 2. CREATING NEW VARIABLES
Your project will probably require the creation of variables that are imputed/computed from the
existing variables. Two examples illustrate this:
1. Let's assume you have data on the economic performance of the 50 United States. You
want to compare the performance of the following regions: Mid-west, South, East Coast,
West Coast, and other. The variable state has no indicator for region. You will need to
create the variable region using the existing variable state (and your knowledge of
geography).
2. You want to run a regression in which you can obtain the % effect on wages of a one year
increase in education attainment. The variable wage does not lend itself to such an
analysis. You therefore must create and use a new variable that is the natural log
transformation of wage. In section 2.1, after explaining the concept of dummy and
categorical variables, we describe how to create such variables using various procedures.
We first describe recode, the most used procedure for creating such variables. Then we
briefly describe other procedures that create dummy or categorical variables - automatic
recode and filtering[25] (the variables are created as a by-product in filtering).
In section 2.2, we show how to create new variables by using numeric expressions that include
existing variables, mathematical operators, and mathematical functions (like square root, logs,
etc).
Section 2.3 explains the use of "Multiple Selection Sets." You may want to skip this section and
come back to it after you have read chapters 3 and 6.
Section 2.4 describes the use of the count procedure. This procedure is used when one wishes
to count the number of responses of a certain value across several variables. The most frequent
use is to count the number of "yeses" or the "number of ratings equal to value X."
Let's assume that you wish to create a variable with the categories "High, Mid, and Low income
groups" from a continuous variable wage. If you can define the exact criteria for deciding the
range of values that define each income range, then you can create the new variable using the
procedures shown in section 2.1. If you do not know these criteria, but instead want to ask
SPSS to create the three "clusters" of values ("High," "Mid," and "Low"), then you should use
"Cluster Analysis" as shown in section 2.5.
You may want to use variables that are at a higher level of aggregation than in the data set you
have. See section 1.4 to learn how to create a new "aggregated" data set from the existing file.
[25] See section 1.7 for a detailed discussion on filtering.
Once the dummy or categorical variables have been created, they can be used to enhance most
procedures. In this book, any example that uses gender or pub_sec as a variable provides an
illustration of such an enhancement. Such variables are used in many procedures:
Value Category
0 Male
1 Female
Categorical variables can take several values, with each value indicating a specific category.
For example, a categorical variable Race may have six values, with the values-to-category
mapping being the following:
Value   Category
0       White-American
1       African-American
2       Asian-American
3       Hispanic-American
4       Native-American
5       Other

[26] Using graphs, boxplots, custom tables, etc. (see chapters 5 and 6).
[27] If you use the dummy variable in regression, Logit, and some other procedures, the coding must be 0 and 1. If the original coding is 1 and 2, you should change it to 0 and 1 using the procedure shown in section 2.1.c.
Dummy and categorical variables can be computed on a more complex basis. For example:
Value Category
0 wage between 0 and 20
1 wage above 20
[28] It is a good practice to write down the mapping in tabular format before you start creating the new variable.
Area 3 ("Old → New") contains the mapping of the old variable (educ) to the new one (basiced).
Click on Add.
Click on Add.
[29] User-missing are the values defined by us (the "user") as missing in DATA/ DEFINE VARIABLE (see section 1.2.c). System-missing are the blank or empty data cells, defined by the system (i.e. - the data set) as missing. In the data sheet in SPSS, these cells have periods only. (In contrast, in Excel, the blank or empty cells have nothing in them.) Please be clear on the distinction between user-missing and system-missing. In the new variable, the user-missing values from the original variable will be mapped into empty cells (with periods in them).
Click on Add.
[30] This is a safety measure because we do not want any surprises from incorrect values in the new variable generated from incorrect data in the original variable. For example, the last mapping rule (see the entry ELSE→SYSMIS) ensures that any nonsensical or invalid values in the old variable (like, for example, an education level of "-1" in the variable educ) do not carry over into the new variable.
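Pulling the pieces above together, the recode can be written in syntax. This is a sketch only - the cut-off of 10 years for "basic" education is an assumption, since the book's exact mapping table is not reproduced here:

    * Map educ into the dummy basiced; anything else becomes system-missing.
    RECODE educ (0 THRU 10 = 0) (11 THRU HIGHEST = 1) (ELSE = SYSMIS) INTO basiced.
    EXECUTE.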
Let's assume we want to create a new variable, educ2, in which Master's or higher level
education (17 or above) is recoded with one value, i.e. - 17. The other values remain as they
are. The mapping is:

17 thru highest → 17
All other values → unchanged
Click on Add.
[31] A common use is recoding dummies from the codes 1 and 2 into the codes 0 and 1.
Let's assume you want to look at cases according to different age groups to test the hypothesis
that workers in their forties are more likely to earn higher wages. To do this, you must recode
age into a variable with 5 categories: workers whose age is between 20-29 years, 30-39 years,
40-49 years, 50-59 years, and all other workers (i.e. - those who are 20 years old or younger and
those who are 60 years old and over).
[32] Note that this is a different sub-sub-menu compared to that used in the previous section (that menu option was TRANSFORM/ RECODE/ INTO NEW VARIABLES).
[33] You can choose a different way to achieve the same result. For example, you may prefer using the following three mapping items to replace the one item we use ("ELSE→4"): "lowest thru 19→4," "60 thru highest→4," and "ELSE→System-Missing." The more finely you define the items, especially the values to be considered as missing, the lower the chances of creating a variable with incorrect values.
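A sketch of the age-group recode in syntax. The codes 0-4 are an assumption (the book's screenshots with the exact codes are not reproduced); only the grouping logic is taken from the text:

    * Recode age in place into 5 groups; "all others" get the code 4.
    RECODE age (20 THRU 29 = 0) (30 THRU 39 = 1) (40 THRU 49 = 2)
      (50 THRU 59 = 3) (ELSE = 4).
    EXECUTE.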
Value   Category
0       Females of age 20 or under, and all males
1       Females above the age of 20
This dummy variable can be used as any other dummy variable. To use it, you must first turn
the above filter off by going to DATA/ SELECT CASE and choosing All cases as shown in
section 1.7.c.
Tip: This procedure is not often used. If you think this topic is irrelevant for you, you may
simply skip to the next section.
Let's assume that you have the names of countries as a variable cty. (See picture below.)
You want to create a new variable, cty_code, in which the countries listed in the variable cty
are recoded numerically as 1, 2, ... The recoding must be done in alphabetical order, with
Afghanistan being recoded into 1, Argentina into 2, etc.
To do so, go to TRANSFORM/
AUTORECODE.
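The equivalent syntax is short (a sketch; /PRINT lists the value-to-code mapping in the output window):

    * Recode the string cty into numeric cty_code, assigning codes alphabetically.
    AUTORECODE VARIABLES=cty
      /INTO cty_code
      /PRINT.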
Now you can use the variable cty_code in other data manipulation, graphical procedures, and
statistical procedures.
Don't worry if these terms/procedures are alien to you. You will learn about them in later
chapters and/or in your class.
In the next table, we provide a summary of basic mathematical operators and the corresponding
keyboard symbols.
Mathematical Operators
Operation Symbol
Addition +
Subtraction -
Multiplication *
Division /
Power ** or ^
Note: Choose the menu option DATA/ DEFINE VARIABLE and define the attributes of the new
variable. See section 1.2 for examples of this process. In particular, you should create variable
labels and define the missing values.
The next table shows examples of the types of mathematical/statistical functions provided by
SPSS.
Important/Representative Functions
Function Explanation
EXP(X) Exponent of X
(1) Using multiple variables: the difference between age and work experience.
[34] Refer to your textbook for details on the usefulness of an interactive term.
(3) Using multiple functions: you may want to find the square root of the log of the interaction
between gender and education. This can be done in one step. The following equation
is combining three mathematical functions - multiplication of gender and education,
calculating their natural log and, finally, obtaining the square root of the first two steps.
(4) Using multi-variable mathematical functions: you may want to find the maximum of
three variables (the wages in three months) in an observation. The function MAX
requires multi-variable input. (In the example below, wage1, wage2, and wage3 are
three separate variables.)
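As a sketch, the examples above translate into COMPUTE statements like these. The target variable names are invented for illustration, and example (2), the interactive term, is omitted here:

    * (1) Difference between age and work experience.
    COMPUTE agework = age - work_ex.
    * (3) Square root of the log of the gender-education interaction
    * (observations where gender * educ = 0 become system-missing).
    COMPUTE srlgened = SQRT(LN(gender * educ)).
    * (4) Maximum of the three monthly wage variables.
    COMPUTE maxwage = MAX(wage1, wage2, wage3).
    EXECUTE.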
In section 2.1, you learned how to use RECODE to create dummy and categorical variables.
The RECODE procedure usually narrows down the values from a variable with (let's assume
M) more possible values into a new variable with fewer possible values, e.g. - the education
to basic education recode mapped from the range 0-23 into the range 0-1.
What if you would like to do the opposite and take a few dummy variables and create one
categorical variable from them? To some extent, Multiple Response Sets help you do that. If
you have five dummy variables on race (African-American or not, Asian-American or not,
etc.) but want to run frequency tabulations on race as a whole, then doing the frequencies on the
five dummy variables will not be so informative. It would be better if you could capture all the
categories (5 plus 1, the or not reference category) in one table. To do that, you must define
the five dummy variables as one Multiple Response Set.
Let us take a slightly more complex example. Continuing the data set example we follow in
most of this book, assume that the respondents were asked seven more yes/no questions of the
form -
1. Ad: Did the following resource help in obtaining current job - response to
newspaper ad
2. Agency: Did the following resource help in obtaining current job - employment
agency
3. Compense: Did the following resource help in obtaining current job - veteran or
other compensation and benefits agency
4. Exam: Did the following resource help in obtaining current job - job entry
examination
5. Family: Did the following resource help in obtaining current job - family
members
6. Fed_gov: Did the following resource help in obtaining current job - federal
government job search facility
7. Loc_gov: Did the following resource help in obtaining current job - local
government job search facility
All the variables are linked. Basically, they are the Multiple Responses to the question "What
resource helped in obtaining your current job?"
Let's assume you want to obtain a frequency table and conduct cross tabulations on this set of
variables. Note that a respondent could have answered yes to more than one of the questions.
Click on Add.
Note: you can also use category variables with more than two possible values in a multiple
response set. Use the same steps as above with one exception: choose the option Categories in
the area Variables are Coded As and enter the range of values of the categories.
To do frequencies, go to
STATISTICS / MULTIPLE
RESPONSE / MULTIPLE
RESPONSE FREQUENCIES.
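In syntax, the set definition and the frequency table can be produced together - a sketch assuming the seven yes/no variables are coded 1 for "yes":

    * Define the set jobres from the seven dichotomous variables (counted
    * value 1) and request a multiple response frequency table for it.
    MULT RESPONSE GROUPS=jobres 'Resource that helped obtain current job'
        (ad agency compense exam family fed_gov loc_gov (1))
      /FREQUENCIES=jobres.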
In the next dialog box, define the sets as you did above.
Let's assume a wholesaler has conducted a simple survey to determine the ratings given by five
retailers (firm 1, firm 2, ..., firm 5) to product quality on products supplied by this wholesaler
to these retailers. The retailers were asked to rate the products on a scale from 0-10, with a
higher rating implying a higher quality rating. The data was entered by product, with one
variable for each retailer.
The wholesaler wants to determine the distribution of products that got a positive rating,
defined by the wholesaler to be ratings in the range 7-10. To do this, a new variable must be
created. This variable should count the number of firms that gave a positive rating (that is,
a rating in the range 7-10) for a product.
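A sketch of the count in syntax - the variable names firm1 to firm5 are assumed stand-ins for the five retailer rating variables:

    * posrate counts, for each product, how many of the five retailers
    * gave a rating in the range 7-10.
    COUNT posrate = firm1 firm2 firm3 firm4 firm5 (7 THRU 10).
    EXECUTE.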
Using cluster analysis, a continuous variable can be grouped into qualitative categories based on
the distribution of the values in that variable. For example, the variable wage can be used to
create a categorical variable with three values by making three groups of wage earnings - high
income, mid income, and low income - with SPSS making the three groups.
Value Category
1 High income
2 Low income
3 Mid income
Let's assume you want to use "income-group membership" as the variable for defining the
groups in a comparative analysis. But let's also assume that your data have a continuous
variable for income, but no categorical variable for "income-group membership." You therefore
must use a method that can create the latter from the former. If you do not have pre-defined
cut-off values for demarcating the three levels, then you will have to obtain them using methods
like frequencies (e.g. - using the 33rd and 66th percentile to classify income into three groups),
expert opinion, or by using the classification procedure. We show an example of the
classification procedure in this section.
Note: The Classification procedure has many uses. We are using it in a form that is probably
too simplistic to adequately represent an actual analysis, but is acceptable for the purposes of
illustrating this point.
We show you how to make SPSS create groups from a continuous variable and then use those
groups for comparative analysis.
Click on Save.
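A minimal syntax sketch of this use of cluster analysis (the saved cluster-membership variable qcl_2 matches the name used later in this section):

    * K-means clustering of wage into 3 clusters; save membership as qcl_2.
    QUICK CLUSTER wage
      /CRITERIA=CLUSTER(3)
      /SAVE=CLUSTER(qcl_2).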
Number of Cases in each Cluster (WAGE)

Cluster 1      66.0
Cluster 2    1417.0
Cluster 3     510.0
[35] SPSS does not label them as Low, Medium, or High. To do so, go to DATA/ DEFINE VARIABLE and, following the steps shown in section 1.2, assign these labels to the values 1, 2, and 3.
Choose the menu option DATA/ DEFINE VARIABLE and define a variable label and value
labels for the three values of the newly created variable qcl_2 (see section 1.2 for instructions).
On the data sheet, the new variable will be located in the last column. We use this variable to
conduct an interesting analysis in section 10.1.a.
Ch 3. UNIVARIATE ANALYSIS
A proper analysis of data must begin with an analysis of the statistical attributes of each variable
in isolation - univariate analysis. From such an analysis we can learn:
how the values of a variable are distributed - normal, binomial, etc.[36]
the central tendency of the values of a variable (mean, median, and mode)
dispersion of the values (standard deviation, variance, range, and quartiles)
presence of outliers (extreme values)
if a statistical attribute (e.g. - mean) of a variable equals a hypothesized value
The answer to these questions illuminates and motivates further, more complex, analysis.
Moreover, failure to conduct univariate analysis may restrict the usefulness of further
procedures (like correlation and regression). Reason: even though improper/incomplete
univariate analysis may not directly hinder the conducting of more complex procedures, the
interpretation of output from the latter will become difficult (because you will not have an
adequate understanding of how each variable behaves).
This chapter explains different methods used for univariate analysis. Most of the methods
shown are basic - obtaining descriptive statistics (mean, median, etc.) and making graphs.
(Sections 3.2.e and 3.4.b use more complex statistical concepts of tests of significance.)
In section 3.1, you will learn how to use bar, line, and area graphs to depict attributes of a
variable.
In section 3.2, we describe the most important univariate procedures - frequencies and
distribution analysis. The results provide a graphical depiction of the distribution of a variable
and provide statistics that measure the statistical attributes of the distribution. We also do the
Q-Q and P-P tests and non-parametric testing to test the type of distribution that the variable
exhibits. In particular, we test if the variable is normally distributed, an assumption underlying
most hypotheses testing (the Z, T, and F tests).
Section 3.3 explains how to get the descriptive statistics and the boxplot (also called "Box and
Whiskers plot") for each numeric variable. The boxplot assists in identifying outliers and
extreme values.
Section 3.4 describes the method of determining whether the mean of a variable is statistically
equal to a hypothesized or expected value. Usefulness: we can test to discover whether our
sample is similar to other samples from the same population.
Also see chapter 14 for non-parametric univariate methods like the Runs test to determine if a
variable is randomly distributed.
[36] Check your textbook for descriptions of different types of distributions.
Select GRAPHS/BAR.
[Bar graph: the count of respondents at each age, from 15.0 to 65.0 (X axis: AGE; Y axis: Count).]
[Bar graph: the number of people with each specific education level (X axis: EDUCATION, 0 to 23; Y axis: Count).]
[37] If you prefer to use line or area graphs, use similar steps.
[Line graph: cumulative frequency of AGE (X axis: AGE, 15.00 to 63.00; Y axis: Cumulative Frequency).]
Click on Define.
[38] A pie chart with too many "slices" of the pie is difficult to read.
[Pie chart of gender: 0 = Male, 1 = Female.]
[39] Using histograms and frequency statistics, we can answer several questions about the distribution of individual variables. What is the nature of the distribution of a variable: normal, lognormal, exponential, uniform, etc.? Is the variable distributed normally? Is it skewed, and if so, to the left or right? Is there a range in which many observations occur? Are there outliers, and if there are, where do they lie? What is the mode? Note: check your statistics text for definitions/descriptions of the terms we use. We do not go into the details of statistical descriptions.
[40] The latter will depict a normal distribution superimposed on the histogram. If the histogram and the normal curve are similar, then the variable is normally distributed. If they are not, then you must conduct Q-Q or P-P graphs to test for the type of distribution of each variable (see section 3.2.b).
Click on OK.
The output will have one frequency table for all the
variables and statistics chosen and one histogram
for each variable.
[41] The median and interquartile range (75th - 25th percentile, or 3rd - 1st quartile) have a useful property - they are not affected by some outliers or extreme values. Another measure of dispersion is the Semi-Interquartile Range, defined as [(3rd - 1st quartile) divided by 2].
[42] Skewness measures the degree of symmetry in the distribution. A symmetrical distribution includes left and right halves that appear as mirror images. A positive skew occurs if skewness is greater than zero; a negative skew occurs if skewness is less than zero. A positive skewness indicates that the distribution is left heavy. You can consider values between 0 and 0.5 as indicating a symmetrical distribution. Kurtosis measures the degree to which the frequencies are distributed close to the mean or closer to the extremes. A bell-shaped distribution has a kurtosis estimate of around 3. A center-heavy (i.e. - close to the mean) distribution has an estimated kurtosis greater than 3. An extreme-heavy (or flat) distribution has a kurtosis estimate of less than 3. (All in absolute terms.)
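The frequencies procedure just described can also be requested in syntax - a sketch using the book's variables:

    * Frequency statistics plus a histogram with the normal curve superimposed.
    FREQUENCIES VARIABLES=age educ wage
      /STATISTICS=MEAN MEDIAN MODE STDDEV VARIANCE RANGE SKEWNESS KURTOSIS
      /HISTOGRAM=NORMAL.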
[Table: frequency statistics for the variables, produced by the procedure above.]
In the next three graphs, the heights of the bars give the relative frequencies of the values of
variables. Compare the bars (as a group) with the normal curve (drawn as a bell-shaped line
curve). All three variables seem to be left heavy relative to the relevant normal curves, i.e. -
lower values are observed more often than higher values for each of the variables.
We advise you to adopt a broad approach to interpretation: consult the frequency statistics result
(shown in the table above), the histograms (see next page), and your textbook.
[43] See how the bars follow a similar pattern to the "idealised normal curve."
[Histogram of EDUCATION with the normal curve superimposed (Std. Dev = 5.60, Mean = 6.1, N = 2016.00; Y axis: Frequency).]
[Histogram of WAGE with the normal curve superimposed (Y axis: Frequency).] The P-P or Q-Q tests and formal tests are used to make a more confident determination of the distribution type.
The analysis in section 3.2.a showed that education, age, and wage might not be distributed
normally. But the histograms provide only a rough visual idea regarding the distribution of a
variable. Either the P-P or Q-Q procedure is necessary to provide more formal evidence.[44] The
P-P tests whether the Percentiles (quartiles in the case of the Q-Q) of the variables' distribution
match the percentiles (quartiles in the case of the Q-Q) that would indicate that the distribution
is of the type being tested against.
44
The two methods are roughly equivalent, so we use only one of them here - the Q-Q test. Many statisticians
consider these methods to be insufficient. Section 3.2.e shows the use of a more stringent/formal test.
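The Q-Q procedure described on the next pages can also be run from syntax. The sketch below is our own, assuming the variable is named wage; the subcommands mirror the dialog choices discussed in the footnotes that follow.

* Q-Q plot of wage against the normal distribution; tied values get the mean rank.
PPLOT /VARIABLES=wage
  /TYPE=Q-Q
  /DIST=NORMAL
  /FRACTION=BLOM
  /TIES=MEAN
  /NOLOG.
* To test the log of wage instead (the lognormal check shown later in this
* section), replace /NOLOG with /LN .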
45. The next section shows the use of the "Transformation" feature.
46. A detailed explanation of this is beyond the scope of this book.
47. If two values have the same rank (e.g. - if both are "18th largest"), what rank should be given to them in the mathematical procedure underlying the P-P or Q-Q? The choice "Mean" implies that the mean rank would be used (continuing the example, this number would be 18.5).
[Q-Q plots of the variables (axes labeled "Observed Value"), comparing observed values with those expected under the normal distribution.]
48. A "formal" testing method typically involves the use of a hypothesis test that, in turn, uses a test like the T, F, Z, etc. An "informal" testing method is typically a graphical depiction.
[Q-Q plot of WAGE (Observed Value axis from -100 to 200), used to check whether wage is normally distributed.]
49. If the term "log" or the concept of variable "transformation" is not familiar (and confusing), you can skip over to section 3.2.e.
50. In effect, SPSS is not testing the variable wage for normality but, instead, the variable log of wage.
[Q-Q plot of WAGE after a natural log transformation (Transforms: natural log).]
Note: Check your statistics book for descriptions of different distributions. For understanding
this chapter all you need to know is that the lognormal is like a normal distribution but with a
slight tilt toward the left side (lower values occur more frequently than in a normal distribution).
Move the variables whose normality you wish to test into the box "Test Variable List." Choose the option "Normal." Execute the procedure by clicking on the button "OK." The result is in the next table. The test statistic used is the Kolmogorov-Smirnov (or simply, K-S) Z. It is based upon the Z distribution.
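In syntax form, this K-S test is a one-line command (our sketch; the variable names are assumed):

* One-sample Kolmogorov-Smirnov test against the normal distribution.
NPAR TESTS /K-S(NORMAL)= age wage.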
                                        Age in complete years   Hourly Net Income
N                                       1444                    1446
Normal Parameters (a,b)  Mean           34.27                   1.6137
                         Std. Deviation 11.14                   1.7733
Most Extreme             Absolute       .111                    .193
Differences              Positive       .111                    .171
                         Negative       -.065                   -.193
Kolmogorov-Smirnov Z                    4.229                   7.355
In class, you may have been taught to compare this estimated Z to the appropriate51 value in the Z-distribution/test (look in the back of your book - the table will be there along with tables for the F, T, Chi-Square, and other distributions). SPSS makes this process very simple! It implicitly conducts the step of "looking" at the appropriate table entry and calculates the "Significance" value. ALL YOU MUST DO IS LOOK AT THIS "SIGNIFICANCE" VALUE. The interpretation is then based upon where that value stands in the decision criterion provided after the next table.
If sig is less than 0.10, then the test is significant at 90% confidence (equivalently, the
hypothesis that the distribution is normal can be rejected at the 90% level of confidence). This
criterion is considered too "loose" by some statisticians.
If sig is less than 0.05, then the test is significant at 95% confidence (equivalently, the
hypothesis that the distribution is normal can be rejected at the 95% level of confidence). This
is the standard criterion used.
If sig is less than 0.01, then the test is significant at 99% confidence (equivalently, the hypothesis that the distribution is normal can be rejected at the 99% level of confidence). This is the strictest criterion used.
You should memorize these criteria, as nothing is more helpful in interpreting the output
from hypothesis tests (including all the tests intrinsic to every regression and ANOVA
analysis). You will encounter these concepts throughout sections or chapters 3.4, 4.3, 5, 7, 8, 9,
and 10.
In the tests above, the K-S Z values are so large that the sig value falls below .05 for both variables, implying that the test indicated that both variables are not normally distributed. (The null hypothesis that the distributions are normal can be rejected.)
51. The aptness is based upon the degrees of freedom, level of significance, etc.
Boxplots are plots that depict the cut-off points for the quartiles: the 25th percentile, the 50th percentile, and the 75th percentile. Essentially, a boxplot allows us to immediately read off the values that correspond to each quarter of the population (if the variable used is age, then the "25% youngest," the "50% youngest," and so on). Section 3.3.b has an example of boxplots and their interpretation.
Section 3.2.a showed you how to obtain most of the descriptive statistics (and also histograms)
using the "frequencies" procedure (so you may skip section 3.3.a).
52. Note that almost all of these statistics can be obtained from STATISTICS / SUMMARIZE / FREQUENCIES (see section 3.1).
The spread of the values can be depicted using boxplots. A boxplot chart provides the medians,
quartiles, and ranges. It also provides information on outliers.
Click on Define.
[Boxplot of AGE and WORK_EX (N = 2016 each). Interpretation: a-b marks the lowermost quartile [0-25%]; d-e marks the highest quartile [75-100%]; the individual cases above the highest quartile are the outliers.]
For example, say that mean education in a national survey of 100 million people was 6.2. In
your sample, the mean is 6.09. Is this statistically similar to the mean from the national survey?
If not, then your sample of education may not be an accurate representation of the actual
distribution of education in the population.
There are two methods in SPSS to find if our estimated mean is statistically indistinct from the
hypothesized mean - the formal T-Test and the Error Bar. The number we are testing our mean
against is called the hypothesized value. In this example that value is 6.2.
The Error Bar is a graph that shows the 95% range within which the mean lies (statistically). If the hypothesized mean is within this range, then we have to conclude that "Our mean is statistically indistinct from the hypothesized number."
www.vgupta.com
Chapter 3: Univariate Analysis 3-24
Choose the menu option GRAPHS / ERROR BAR. Choose "Simple" type. Select the option
"Summaries of separate variables."
Click on "Define."
In the box "Error Bars," place the variables whose "Confidence interval for mean" you wish to
determine (we are using the variable wage)
Choose the confidence level (the default is 95%. You can type in 99% or 90%).
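The equivalent syntax is roughly the following (our sketch; the CI 95 keyword corresponds to the confidence-level choice above):

* Error bar chart: 95% confidence interval for the mean of wage.
GRAPH /ERRORBAR(CI 95)=wage.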
[Error bar chart for WAGE (N = 299): the 95% confidence interval for the mean, with bounds annotated at 6.04 and 6.55.]
The Error Bar gives the 95% confidence interval for the mean53. After looking at the above graph you can conclude that we cannot say with 95% confidence that 6.4 is not the mean (because the number 6.4 lies within the 95% confidence interval).
53. The Error Bar can also be used to depict the 95% confidence interval for the standard deviation (see section 4.3).
One-Sample Test (Test Value = 6.2)
                                                95% Confidence Interval
                       Sig.         Mean        of the Difference
            t     df   (2-tailed)   Difference  Lower      Upper
EDUCATION   -.875 2015 .382         -.11        -.35       .14
The test for the difference between the sample mean and the hypothesized mean is statistically insignificant (the sig value is greater than .1) even at the 90% level. We fail to reject the hypothesis that the sample mean does not differ significantly from the hypothesized number54.
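A syntax sketch that reproduces the table above (our own; we assume the education variable is named educatio):

* One-sample T-Test of mean education against the hypothesized value 6.2.
T-TEST /TESTVAL=6.2 /VARIABLES=educatio.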
Note: If sig is less than 0.10, then the test is significant at 90% confidence (equivalently,
the hypothesis that the means are equal can be rejected at the 90% level of confidence).
This criterion is considered too "loose" by some.
If sig is less than 0.05, then the test is significant at 95% confidence (equivalently, the
hypothesis that the means are equal can be rejected at the 95% level of confidence). This
is the standard criterion used.
If sig is less than 0.01, then the test is significant at 99% confidence (equivalently, the
hypothesis that the means are equal can be rejected at the 99% level of confidence). This
is the strictest criterion used.
You should memorize these criteria, as nothing is more helpful in interpreting the output from
hypothesis tests (including all the tests intrinsic to every regression, ANOVA and other
analysis).
Your professors may like to see this stated differently. For example: "Failed to reject null
hypothesis at an alpha level of .05." Use the terminology that the boss prefers!
Referring back to the output table above, the last two columns are saying that "with 95%
confidence, we can say that the mean is different from the test value of 6.2 by -.35 to .14 - that
is, the mean lies in the range '6.2-.35' to '6.2+.14' and we can say this with 95% confidence."
54. The sample mean of education is statistically close to the hypothesized value.
Ch 4. COMPARING SIMILAR
VARIABLES
Sometimes a data set may have variables that are similar in several respects - the variables
measure similar entities, the units of measurement are the same, and the scale of the ranges is
similar55.
We debated the justification for a separate chapter on methods that are not used in a typical
analysis. For the sake of completeness, and because the topic did not fit seamlessly into any
other chapter, we decided to stick with this chapter. The chapter also reinforces some of the
skills learned in chapter 3 and introduces some you will learn more about in chapter 5.
If you feel that your project/class does not require the skills taught in this section, you can
simply skip to chapter 5.
In section 4.3, we describe how the means (or other statistical attributes) of user-chosen pairs of
these variables are compared. For non-normal variables, a non-parametric method is shown.
In the remaining portion of the chapter we show how graphs are used to depict the differences
between the attributes of variables. In section 4.2, we describe the use of box plots in comparing
several attributes of the variables - mean, interquartile ranges, and outliers.
Note: You could compare two variables by conducting, on each variable, any of the Univariate
procedures shown in chapter 3. Chapter four shows procedures that allow for more direct
comparison.
55. Two examples: (1) twelve variables, one for each month, that have spending in a particular month; (2) six variables, each of which captures the percentage increase in a stock index at a specific stock exchange in a different city (New York, London, Tokyo, Paris, Frankfurt, and Hong Kong). An interesting analysis would be a comparison of these variables. In the first example, such an analysis can indicate the differences in spending across the months. In the second example, such an analysis can tell us about differences in average price movement in the major stock market indices. This chapter discusses such comparisons.
Click on Define.
[Chart comparing the variables OLD_WAGE and WAGE.]
Ch 4. Section 2 Boxplots
The spread of the values of two similar variables can be compared using boxplots. Let's assume
that you want to compare age and work experience. A boxplot chart compares the medians,
quartiles, and ranges of the two variables56. It also provides information on outliers.
56. In separate boxplots on the same chart. As a reminder: the first quartile defines the value below which 25% of the variables' values lie, the second quartile (also called the median or mid-point) defines the value below which 50% of the variables' values lie, the third quartile defines the value below which 75% of the variables' values lie, and the interquartile range is defined by the values of the third and first quartiles.
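If you prefer syntax, a chart like the one below can be requested through the EXAMINE procedure (the syntax name for EXPLORE). A sketch of our own, with assumed variable names:

* Side-by-side boxplots of age and work experience on one chart.
EXAMINE VARIABLES=age work_ex
  /PLOT=BOXPLOT
  /COMPARE=VARIABLES
  /STATISTICS=NONE.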
Click on Define.
[Boxplot comparing AGE and WORK_EX; the interpretation scheme is the same as in section 3.3.b (a-b: lowermost quartile (0-25%), and so on).]
[Error bar chart: 95% confidence intervals for the means of WAGE and OLD_WAGE (N = 2016 each), measured in dollars.]
[Error bar chart: mean +- 2 standard errors for WAGE and OLD_WAGE (N = 2016 each), measured in dollars.]
[Error bar chart: mean +- 2 standard deviations for WAGE and OLD_WAGE (N = 2016 each), measured in dollars.]
Let's assume you have three variables with which to work - education (the respondent's education), moth_ed (mother's education), and fath_ed (father's education). You want to check if:
The mean of the respondent's education is the same as that of the respondent's mother's.
The mean of the respondent's education is the same as that of the respondent's father's.
Using methods shown in sections 3.2.a and 3.3, you could obtain the means for all the above
variables. A straightforward comparison could then be made. Or, can it? "Is it possible that our
estimates are not really perfectly accurate?"
The answer is that our estimates are definitely not perfectly accurate. We must use methods for
comparing means that incorporate the use of the mean's dispersion. The T-Test is such a
method.
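The dialog steps shown on the next pages set up two paired T-Tests. In syntax they would look roughly like this (our sketch; the variable names educatio, fath_ed, and moth_ed follow the footnotes below):

* Paired T-Tests: respondent's education versus father's and versus mother's.
T-TEST PAIRS=educatio WITH fath_ed moth_ed (PAIRED).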
57. Note that the first Paired Variable is defined as the difference between educatio and fath_ed, i.e. - Respondent's education level MINUS Father's education level.
58. We are choosing the variable educatio first so that the two pairs are symmetric.
The next table gives the results of the tests that determine whether the difference between the
means of the variables (in each pair) equals zero.
[Paired Samples Test table (excerpt): Paired Differences - Std. Error of the Mean = .2452 and .1367 for the two pairs.]
Both the pairs are significant (as the sig value is below 0.05)60. This is telling us:
59. Note that the two new variables are the differences between the variables in the pair. SPSS creates (only in its memory - no new variable is created on the data sheet) and uses two new variables: educatio minus fath_ed, and educatio minus moth_ed. The procedure is determining whether the means of these variables equal zero. If they do, then the paired variables have the same mean.
The mean of the variable father's education is significantly different from that of the respondents. The negative Mean (-4.7) signifies that the mean education of fathers is higher.
The mean of the variable mother's education is significantly different from that of the respondents. The positive Mean (3.5) signifies that the mean education of mothers is lower.
As we mentioned in section 3.2.e, the use of the T and F tests hinges on the assumption of
normality of underlying distributions of the variables. Strictly speaking, one should not use
those testing methods if a variable has been shown not to be normally distributed (see section
3.2). Instead, non-parametric methods should be used-- these methods do not make any
assumptions about the underlying distribution types.
Let's assume you want to compare two variables: old_wage and new_wage. You want to know
if the distribution of the new_wage differs appreciably from that of the old wage. You want to
use the non-parametric method Two Related Samples Tests.
60. The basic rules for interpreting the significance values should be firmly implanted in your memory. The rules, which are common for the interpretation of any significance test irrespective of test type (the most frequently used types are the T, F, Chi-Square, and Z tests) and context (as you will see in later chapters, the context may be regression, correlation, ANOVA, etc.), are:
If the value in the significance table is less than .01, then the estimated coefficient can be believed with 99% confidence.
If the value in the significance table is less than .05, then the estimated coefficient can be believed with 95% confidence.
If the value in the significance table is less than .1, then the estimated coefficient can be believed with 90% confidence.
If the value in the significance table is greater than .1, then the estimated coefficient is not statistically significant, implying that the estimate should not be relied upon as reliable or accurate.
Ranks
                                 N      Mean Rank   Sum of Ranks
Wage_new     Negative Ranks (a)  671    788.45      529047.50
- Wage_old   Positive Ranks (b)  1272   1068.83     1359549
             Ties (c)            73
             Total               2016
a. Wage_new < Wage_old
b. Wage_new > Wage_old
c. Wage_new = Wage_old
If you want to compare more than two variables simultaneously, then use the option
STATISTICS / NONPARAMETRIC / K RELATED SAMPLES TESTS. Follow the same
procedures as shown above but with one exception:
Choose the "Friedman" test in the area "Test Type." If all the variables being tested are
dichotomous variables, then choose the "Cochran's Q" test.
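Syntax sketches of both tests follow (our own; wage_old and wage_new are assumed names, and wage_proj is a purely hypothetical third variable added only to illustrate the K-sample form):

* Two related samples: Wilcoxon signed-ranks test.
NPAR TESTS /WILCOXON=wage_old WITH wage_new (PAIRED).
* K related samples: Friedman test across several similar variables.
NPAR TESTS /FRIEDMAN=wage_old wage_new wage_proj.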
We cannot make the more powerful statement that the means are equal/unequal (as we
could with the T Test). You may see this as a trade-off: The non-parametric test is more
appropriate when the normality assumption does not hold, but the test does not produce
output as rich as a parametric T test.
Ch 5. MULTIVARIATE STATISTICS
After performing univariate analysis (chapter 3), the next essential step is to understand the basic relationship between/across variables. For example, to find whether education levels are different for categories of the variable gender (i.e. - "male" and "female") and for levels of the categorical variable age.
Section 5.1 uses graphical procedures to analyze the statistical attributes of one variable
categorized by the values/categories of another (or more than one) categorical or dummy
variable. The power of these graphical procedures is the flexibility they offer: you can compare
a wide variety of statistical attributes, some of which you can custom design. Section 5.1.c
shows some examples of such graphs.
In section 5.3, we explain the meaning of correlations and then describe how to conduct and
interpret two types of correlation analysis: Bivariate and partial. Correlations give one number
(on a uniform and comparable scale of -1 to 1) that captures the relationship between two
variables.
In section 5.3, you will be introduced to the term "coefficient." A very rough intuitive
definition of this term is "an estimated parameter that captures the relationship between two
variables." Most econometrics projects are ultimately concerned with obtaining the estimates of
these coefficients. But please be careful not to become "coefficient-obsessed." The reasoning
will become clear when you read chapters 7 and 8. Whatever estimates you obtain must be
placed within the context of the reliability of the estimation process (captured by the "Sig" or
"Significance" value of an appropriate "reliability-testing" distribution like the T or F61).
SPSS has an extremely powerful procedure (EXPLORE) that can perform most of the above
procedures together, thereby saving time and effort. Section 5.4 describes how to use this
procedure and illustrates the exhaustive output produced.
Section 5.5 teaches comparison of means/distributions using error bars, T-Tests, Analysis of
Variance, and nonparametric testing.
61. If you think of the hypothesis testing distributions as "reliability-testing," then you will obtain a very clear idea of the rationales behind the use of these distributions and their significance values.
Ch 5. Section 1 Graphs
Note: Aside from the visual/graphical indicators used to plot the graph, the bar, line, area, and
(for univariate graphs) pie graphs are very similar. The graph type you choose must be capable
of showing the point for which you are using the graph (in your report/thesis). A bar graph
typically is better when the X-axis variable takes on a few values only, whereas a line graph is
better when the X-axis variable can take on one of several values and/or the graph has a third
dimension (that is, multiple lines). An area graph is used instead of a line graph when the value
on the Y-axis is of an aggregate nature (or if you feel that area graphs look better than line
graphs), and a pie graph is preferable when the number of "slices" of the pie is small. The
dialog boxes for these graph types (especially bar, line, and area) are very similar. Any
example we show with one graph type can also be applied using any of the other graph types.
Select GRAPHS/BAR.
62. If you have a category variable with numerous categories and/or if you want to compare the cases of two or more variables, then a line or area graph is better. This section includes examples of area and line graphs.
Press OK
[Bar graph: mean education for each value of AGE (15.00 to 63.00).]
In the above bar graph, each bar gives the mean of the education level for each age (from 15 to
65). The mean education is highest in the age group 25-50.
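A bar chart of this kind can also be produced directly from syntax - a sketch of our own, with assumed variable names:

* Simple bar chart: mean of education for each value of age.
GRAPH /BAR(SIMPLE)=MEAN(educatio) BY age.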
Let's assume that you want to know whether the deviations of the education levels around the mean are different across age levels. Do the lower educational levels for 15- and 64-year-olds imply a similar dispersion of individual education levels for people of those age groups? To answer this, we must see a graph of the standard deviations of the education variable, separated by age.
Select GRAPHS/BAR.
[Bar graph: standard deviation of education for each value of AGE (15.00 to 63.00).]
63. The variable in Category Axis defines the X-axis.
64. Each line in the graph depicts the line for one category of the variable placed in Define Lines by.
[Line graph: median EDUCATION by AGE (15 to 63), with separate lines for categories such as "Male" and "Public Sector".]
Click on Define.
[Graph: sum of EDUCATION by AGE (15 to 63), with separate lines/areas for the GENDER categories Male and Female.]
All the examples above used a standard statistic like the mean, median, sum, or standard
deviation. In section 5.1.c we explore the capacity of SPSS to use customized statistics (like
"Percent Below 81").
Apart from summary measures like mean, median, and standard deviation, SPSS permits some
customization of the function/information that can be depicted in a chart.
[Bar graph titled "Number in age group with above primary": for each AGE value (15 to 63), the number of cases with EDUCATION greater than 6 (the custom statistic "N>6 for EDUCATION").]
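In syntax, custom statistics of this sort map onto the summary functions of the GRAPH command. The sketch below is our guess at the chart above; the NGT function ("number greater than") and its argument form are our assumption, so verify them against the Options sub-dialog described in this section:

* Bar chart: number of cases with education above 6, for each value of age.
GRAPH /BAR(SIMPLE)=NGT(educatio 6) BY age.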
[Boxplot of EDUCATION by GENDER (N = 1613 males, 403 females).]
Ch 5. Section 2 Scatters
[Scatter plot of WAGE (in $ per hour) against EDUCATION (0 to 25).]
Let's assume you want to compare the relationship between age and wage with the relationship
between work experience and wage.
Repeat the previous two steps for the pair wage and
work_ex.
[Overlay scatter plot: WAGE/HR against WORK_EX ("Wage - Work Experience") and WAGE/HR against AGE ("Wage - Age") on the same chart.]
Ch 5. Section 3 Correlations
The correlation coefficient depicts the basic relationship across two variables65: Do two
variables have a tendency to increase together or to change in opposite directions and, if so, by
how much66?
Bivariate correlations estimate the correlation coefficients between two variables at a time,
ignoring the effect of all other variables. Sections 5.3.a and 5.3.b describe this procedure.
Section 5.3.a shows the use of the Pearson correlation coefficient. The Pearson method should be used only when each variable is quantitative in nature. Do not use it for ordinal or unranked qualitative variables67. For ordinal variables (ranked variables), use the Spearman correlation coefficient. An example is shown in section 5.3.b.
The base SPSS system does not include any of the methods used to estimate the correlation
coefficient if one of the variables involved is unranked qualitative.
There is another type of correlation analysis referred to as Partial Correlations. It controls for
the effect of selected variables while determining the correlation between two variables68.
Section 5.3.c shows an example of obtaining and interpreting partial correlations.
Note: See section 5.2 to learn how to make scatter plots in SPSS. These plots provide a good
visual image of the correlation between the variables. The correlation coefficients measure the
linear correlation, so look for such linear patterns in the scatter plot. These will provide a rough
idea about the expected correlation and will show this correlation visually.
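Syntax sketches of both correlation types (our own, with assumed variable names):

* Pearson bivariate correlations, two-tailed, flagging significant coefficients.
CORRELATIONS /VARIABLES=age educatio wage work_ex
  /PRINT=TWOTAIL NOSIG.
* Spearman correlations for ranked (ordinal) variables.
NONPAR CORR /VARIABLES=age educatio wage work_ex
  /PRINT=SPEARMAN TWOTAIL.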
65. Do not confuse correlation with regression. While the former does not presume any causal link between X and Y, the latter does.
66. The term "correlation" means "Co (together)" + "Relation." If variable X is higher (lower) when variable Z is higher, then the two variables have a positive (negative) correlation. A correlation captures the linear co-relation, if any, shown in a scatter plot of the two variables (see section 5.2).
67. Example of ranked variables: GPA (taking on the ranked category values A+, A, ..., C-). Example of an unranked qualitative variable (sometimes referred to as a nominal variable): gender (there is no ranking between the categories male and female).
68. In our data set, a bivariate correlation of wage and work experience will ignore all other variables. But is that realistic and/or intuitive? We know (from previous research and after doing other analysis) that gender and education play a major role in wage determination and age plays a major role in the determination of work experience. So, ideally, a pure correlation between wage and work experience should account for the effect of the variables gender and education. Partial correlation does this.
69. Check your statistics book for a description of "one-tailed" and "two-tailed."
70. If even one of the variables is ordinal (dummy or categorical) or not normally distributed, you cannot use the Pearson method. Our approach is simplistic. Your textbook may explain several different types of correlations and the guidelines regarding the use of each. Nevertheless, we follow an approach that is professionally accepted practice. Note that Pearson's method requires all variables to be distributed normally. Most researchers don't even bother to check if this is true! If the sample size is large enough (above 30), then most distributions start behaving like the normal distribution - the oft-quoted Central Limit Theorem!
71. The mean and standard deviation are usually obtained in descriptives (see sections 3.2 and 3.3).
Note: A high level of correlation is implied by a correlation coefficient that is greater than 0.5 in absolute terms (i.e., greater than +0.5 or less than -0.5).
The output gives the value of the correlation (between -1 and 1) and its level of significance,
indicating significant correlations with one or two * signs. First, check whether the correlation
is significant (look for the asterisk). You will then want to read its value to determine the
magnitude of the correlation.
Make this a habit. Be it correlation, regression (chapter 7 and 8), Logit (chapter 9), comparison
of means (sections 4.4 and 5.5), or the White's test (section 7.5), you should always follow this
simple rule - first look at the significance. If, and only if, the coefficient is significant, then
rely on the estimated coefficient and interpret its value.
The first panel of rows contains the correlation coefficients between all the variables.
Correlations
                          AGE      EDUCATION   WAGE     WORK_EX
Pearson       AGE         1.000    -.051*      .274**   .674**
Correlation   EDUCATION   -.051*   1.000       .616**   -.055*
              WAGE        .274**   .616**      1.000    .254**
              WORK_EX     .674**   -.055*      .254**   1.000
Sig.          AGE         .        .021        .000     .000
(2-tailed)    EDUCATION   .021     .           .000     .014
              WAGE        .000     .000        .        .000
              WORK_EX     .000     .014        .000     .
N             AGE         2016     2016        1993     2016
              EDUCATION   2016     2016        1993     2016
              WAGE        1993     1993        1993     1993
              WORK_EX     2016     2016        1993     2016
*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
72. In practice, many researchers ignore this rule! They therefore ignore all we have in chapters 3-6, going straight from descriptives (section 3.3) to regression (chapters 7 and 8). Completing all the steps (chapters 3-11) will engender a thorough command over the project and an understanding of the implications of each result.
Deselect all.
Note: Partial Correlation is an extremely powerful procedure that, unfortunately, is not taught in
most schools. In a sense, as you shall see on the next few pages, it provides a truer picture of
the correlation than the "Bivariate" correlation discussed in section 5.3.a.
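A minimal syntax sketch of the partial correlation run on the next pages (our own; the list after BY holds the control variables):

* Partial correlation of age and wage, controlling for gender, sector, and education.
PARTIAL CORR /VARIABLES=age wage BY gender pub_sec educatio
  /SIGNIFICANCE=TWOTAIL.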
- - - PARTIAL CORRELATION COEFFICIENTS - - -
Controlling for GENDER, PUB_SEC, EDUC

          AGE        WAGE
AGE       1.0000     .3404
          P= .       P= .000
WAGE      .3404      1.0000
          P= .000    P= .

The correlation is significant at the 0.01 level (as P < .01). The interesting fact is that the partial correlation is higher than the bivariate (see section 5.3.a), implying that once one has removed the impact of gender, sector, and education, then age and wage have an even stronger relationship.
Let's assume we want to find the differences in the attributes of the variables education and
wage across the categories of gender and sector.
On the next few pages you will see a great deal of output. We apologize if it breaks from the
narrative, but by showing you the exhaustive output produced by EXPLORE, we hope to
impress upon you the great power of this procedure. The descriptives tables are excellent in
that they provide the confidence intervals for the mean, the range, interquartile range (75th - 25th
percentile), etc. The tables located two pages ahead show the extreme values.
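For reference, a sketch of the EXPLORE run in syntax (EXAMINE is the syntax name for EXPLORE; variable names are our assumption):

* Descriptives, extreme values, histograms, and boxplots of education and wage,
* broken down by the categories of gender and sector.
EXAMINE VARIABLES=educatio wage BY gender pub_sec
  /PLOT=BOXPLOT HISTOGRAM
  /STATISTICS=DESCRIPTIVES EXTREME.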
Tip: Some of the tables in this book are poorly formatted. Think it looks unprofessional and
sloppy? Read chapter 11 to learn how not to make the same mistakes that we did!
Descriptives of education and wage by gender, and by sector (excerpt; first rows of each table):
By GENDER: Mean = 6.00 (Std. Error = .14); 95% Confidence Interval for Mean: 5.73 (lower bound) to 6.27 (upper bound).
By SECTOR: Mean = 4.06 (Std. Error = .12); 95% Confidence Interval for Mean: 3.82 (lower bound) to 4.29 (upper bound).
Extreme values (outliers included) of education and wage across categories of sector and gender

Extreme Values
EDUCATION - Private Sector
  Highest: 1) case 4 (20), 2) case 993 (20), 3) case 1641 (20), 4) case 1614 (19), 5) case 1629 (a)
  Lowest: 1) case 222 (0), 2) case 688 (0), 3) case 551 (0), 4) case 709 (0), 5) case 75 (b)
EDUCATION - Public Sector
  Highest: 1) case 1033 (23), 2) case 1034 (23), 3) case 1037 (22), 4) case 1503 (22), 5) case 1035 (c)
  Lowest: 1) case 1260 (0), 2) case 1582 (0), 3) case 1268 (0), 4) case 1273 (0), 5) case 1278 (b)
WAGE - Private Sector
  Highest: 1) case 5 (153.88), 2) case 3 (125.00), 3) case 2 (119.32), 4) case 1616 (101.13), 5) case 4 (75.76)
  Lowest: 1) case 731 (.00), 2) case 776 (.01), 3) case 87 (.13), 4) case 1759 (.23), 5) case 237 (b)
WAGE - Public Sector
  Highest: 1) case 1876 (189.39), 2) case 1037 (119.32), 3) case 1036 (85.23), 4) case 1039 (63.13), 5) case 1038 (60.42)
  Lowest: 1) case 1230 (.01), 2) case 1119 (.11), 3) case 2016 (.25), 4) case 2015 (.28), 5) case 1987 (.33)
a. Only a partial list of cases with the value 19 are shown in the table of upper extremes.
b. Only a partial list of cases with the value 0 are shown in the table of lower extremes.
c. Only a partial list of cases with the value 22 are shown in the table of upper extremes.
73. Choosing "Dependants together" will change the order in which the plots are displayed in the output.
[Histograms of EDUCATION for GENDER = Male and GENDER = Female (panel legends: Std. Dev = 5.48, Mean = 6.0, N = 1613.00).]
[Histograms of WAGE for GENDER = Male and GENDER = Female.]
[Boxplots of EDUCATION and WAGE by GENDER (N = 1613 males, 403 females), with the outlier and extreme cases labeled by case number.]
[Histograms of EDUCATION for PUB_SEC = Public Sector (Std. Dev = 5.87, Mean = 9.7, N = 724.00) and PUB_SEC = Private Sector (Std. Dev = 4.27, Mean = 4.1, N = 1292.00).]
[Histograms of WAGE for PUB_SEC = Public Sector and PUB_SEC = Private Sector.]
[Boxplots of EDUCATION and WAGE by sector (N = 1292 private, 724 public), with the outlier and extreme cases labeled by case number.]
Note: in the boxplot on the upper right and the histogram above it, the depictive power of the graph can
be increased significantly by restricting the range of X-values (for the histogram) and Y-values for the
boxplot. See section 11.2 to learn how to change the formatting.
Click on "Define."
In addition to the mean (the small box in the middle of each error bar) being higher for males,
the entire 95% confidence interval is higher for males. This adds great support to any statement
on differentials in wages.
[Error bar chart: 95% confidence intervals for mean WAGE across the GENDER categories (N = 1626 and 390).]
We want to test the hypothesis that the mean wage for males is the same as that for females.
The simplest test is the Independent-Samples T Test.
See the option Cut Point. Let's assume you wanted to compare two groups, one defined by
education levels above 8 and the other by education levels below 8. One way to do this would be
to create a new dummy variable that captures this situation (using methods shown in sections 2.1
and 1.7). An easier way would be to simply define 8 as the cut point. To do this, click on the
button to the left of Cut Point and enter the number 8 into the text box provided.
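Both variants have compact syntax forms (our sketch, with assumed names). Note how a single number inside GROUPS() acts as the cut point:

* Independent-samples T-Test of wage across the two gender categories.
T-TEST GROUPS=gender(0 1) /VARIABLES=wage.
* The same test with the two groups defined by a cut point of 8 on education.
T-TEST GROUPS=educatio(8) /VARIABLES=wage.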
1. The first three columns test the hypothesis that the two groups of wage observations have
the same (homogenous) variances. Because the Sig value for the F is greater than 0.1, we
fail to reject the hypothesis (at the 90% confidence level) that the variances are equal.
2. The F showed us that we should use the row Equal variances assumed. Therefore, when
looking at values in the 4th to last columns (the T, Sig, etc.), use the values in the 1st row
(i.e. - the row that has a T of 4.04. In the next table we have blanked out the other row).
3. Find whether the T is significant. Because the Sig (2-tailed) value is below .05, the
coefficient is significant at 95% confidence.
4. The coefficient in this procedure is the difference in mean wage across the two groups. Or stated differently, Mean (wage for gender=1, or female) MINUS Mean (wage for gender=0, or male). The mean difference of -2.54 implies that we can say, with 95% confidence, that the mean wage for males is 2.54 higher than that for females.
5. The last two columns provide the 95% confidence interval for this difference in mean. The
interval is (-3.78, -1.31).
Let's assume you have a variable with three values - 0, 1, and 2 (representing the concepts
conservative, moderate, and liberal). Can you use this variable as the grouping variable,
i.e. - first compare across conservative and moderate by using the values 0 and 1 in the
Define Groups dialog box, then compare conservative to "liberal" by using the values 0 and
2 in the same dialog box? The answer is no, one cannot break up a categorical variable into
pairs of groups and then use the Independent Samples T Test. Certain biases are introduced
into the procedure if such an approach is employed. We will not get into the details of these
biases, for they are beyond the scope of this book. However, the question remains - If the
Independent Samples T Test cannot be used, what should be used? The answer is the
ANOVA. In the next section we show an example of a simple One-Way ANOVA.
One can argue, correctly, that the T or F tests cannot be used for testing a hypothesis about the
variable wage because the variable is not distributed normally - see section 3.2. Instead, non-
parametric methods should be used - see section 5.5.d. Researchers typically ignore this fact
and proceed with the T-Test. If you would like to hold your analysis to a higher standard, use
the relevant non-parametric test shown in section 5.5.d.
ANOVA is a major topic in itself, so we will show you only how to conduct and interpret a
basic ANOVA analysis.
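A syntax sketch of the one-way ANOVA (our own, with assumed variable names; the post-hoc keywords correspond to the Tukey and Tamhane's T2 options discussed below):

* One-way ANOVA of wage across education levels, with the homogeneity-of-variance
* test and two post-hoc means-comparison methods.
ONEWAY wage BY educatio
  /STATISTICS=HOMOGENEITY
  /POSTHOC=TUKEY T2 ALPHA(0.05).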
Test of Homogeneity of Variances
        Levene Statistic   df1    df2     Sig.
WAGE    8.677              22     1993    .000
The ANOVA table below tests whether the difference between groups (i.e. - the deviations in
wages explained by differences in education level)74 is significantly higher than the deviations
within each education group. The Sig value indicates that the "Between Groups" variation can
explain a relatively large portion of the variation in wages. As such, it makes sense to go
further and compare the difference in mean wage across education levels (this point is more
clear when the opposite scenario is encountered). If the "Between Groups" deviations' relative
importance is not so large, i.e. - the F is not significant, then we can conclude that differences in
education levels do not play a major role in explaining deviations in wages.
Note: The "analysis" of variance is a key concept in multivariate statistics and in econometrics.
A brief explanation: the sum of squares is the sum of all the squared deviations from the mean.
So for the variable wage, the sum of squares is obtained by:
[a] obtaining the mean for each group.
[b] re-basing every value in a group by subtracting the mean from this value. This difference is
the "deviation."
[c] Squaring each deviation calculated in "b" above.
[d] Summing all the squared values from "c" above. By using the "squares" instead of the "deviations," we gain two important properties: when summing, the negative and positive deviations do not cancel each other out (as the squared values are all positive), and more importance is given to larger deviations than would be given if the non-squared deviations were used
(e.g. - let's assume you have two deviation values 4 and 6. The second one is 1.5 times greater
than the first. Now square them. 4 and 6 become 16 and 36. The second one is 2.25 times
greater than the first).
74. The between groups sum of squares is, computationally, the sum of squares obtained if each group were seen as one "observation," with this "observation" taking on the value of the mean of the group.
ANOVA
WAGE               Sum of Squares   df     Mean Square   F        Sig.
Between Groups     78239.146        22     3556.325      40.299   .000
Within Groups      175880.9         1993   88.249
Total              254120.0         2015
This shows that the sub-groups of wage (each sub-group is defined by an education level) have unequal (i.e. - heterogeneous) variances and, thus, we should only interpret the means-comparison table that uses a method (here "Tamhane's T2") that does not assume equal variances.
SPSS will produce tables that compare the means. One table uses Tukey's method; the other will use Tamhane's method. We do not reproduce the tables here because of size constraints. Rarely will you have to use a method that assumes homogenous variances. In our experience, real world data typically have heterogeneous variances across sub-groups.
Test Statistics (a)
                          WAGE
Mann-Whitney U            211656.5
Wilcoxon W                1534408
Z                         -10.213
Asymp. Sig. (2-tailed)    .000
a. Grouping Variable: GENDER
Note: if you have several groups, then use STATISTICS / NONPARAMETRIC TESTS / K
[SEVERAL] INDEPENDENT SAMPLES TESTS. In effect, you are conducting the non-
parametric equivalent of the ANOVA. Conduct the analysis in a similar fashion here, but
with two exceptions:
1. Enter the range of values that define the group into the box that is analogous to that on the
right. For example:
75. An explanation of the differences between these test types is beyond the scope of this book.
2. Choose the "Kruskal-Wallis H test" as the "Test type" unless the categories in the
grouping variable are ordered (i.e. - category 4 is better/higher than category 1, which is
better/higher than category 0).
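Sketches of both non-parametric tests in syntax (our own; the parenthesized values give the grouping variable's categories or range):

* Mann-Whitney U test of wage across the two gender categories.
NPAR TESTS /M-W= wage BY gender(0 1).
* Kruskal-Wallis H test of wage across education levels 0 through 23.
NPAR TESTS /K-W= wage BY educatio(0 23).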
Ch 6. TABLES
In this chapter, you will learn how to extend your analysis to a disaggregated level by making
tables (called "Custom Tables"). SPSS can make excellent, well-formatted tables with ease.
Tables go one step further than charts76: they enable the production of numeric output at levels
of detail chosen by the user. Section 6.1 describes how to use custom tables to examine the
patterns and values of statistics (i.e. - mean, median, standard deviation, etc.) of a variable
across categories/values of other variables.
Section 6.2 describes how to examine the frequencies of the data at a disaggregated level. Such
an analysis complements and completes analysis done in section 6.1.
For understanding Multiple Response Sets and using them in tables, refer to section 2.3 after
reading this chapter.
Note: the SPSS system on your computer may not include the Custom Tables procedures.
If you are using Excel to make tables, you will find the speed and convenience of SPSS to be a
comfort. If you are using SAS or STATA to make tables, the formatting of the output will be
welcome.
76. The power of graphs is that the patterns are easily viewed. However, once you want to delve deeper into the data, you want numeric information in addition to simple visual depictions.
77. Are the patterns the same for higher education levels? Does the pattern reverse itself for certain gender-age combinations? Questions like these can be answered using custom tables. Interpretation of these tables also strengthens one's understanding of the forces driving all the results in the analysis.
78. Note: the base SPSS installed in your computer system may not include the Custom Tables procedures.
79. For example, "For all males," "For everyone with education level X," etc.
80. The three levels of aggregation will become apparent when you look at the output table.
81. Mean for all females and for all males irrespective of their education levels - in the last row, and means for each education level irrespective of gender - in the last column.
Click on OK.
Click on Statistics
82. Compare the value in row education = 12 and column gender = 0 (male) to the value in cell education = 14 and column gender = 1 (female). The double-tailed arrow points to these values.
Mean and Median of Wage for Each Gender-Education Combination
Total for each gender category ("column total"); total for each education category ("row total").
Inspect the table carefully. Look at the patterns in means and medians and compare the two.
For almost all the education-gender combinations, the medians are lower than the means,
implying that a few high earners are pushing the mean up in each unique education-gender
entry.
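If your SPSS system lacks the Custom Tables procedures (see the note earlier in this chapter), the MEANS procedure is a rough substitute for a table like the one above - a sketch of our own, with assumed variable names:

* Mean and median of wage for each education-gender combination.
MEANS TABLES=wage BY educatio BY gender
  /CELLS=MEAN MEDIAN COUNT.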
The first table will be for private sector employees (pub_sec=0) and will be displayed in the
output window. The second table, for public sector employees, will not be displayed.
You need to view and print the second table (for pub_sec=1). To view it, first double click on
the table above in the output window. Click on the right mouse. You will see several options.
Select Change Layers. Select the option Next. The custom table for pub_sec=1 will be
shown.
Custom table for pub_sec = 1: mean wage for each education-gender combination.

                      GENDER
EDUCATION    0 (Male)    1 (Female)   Group Total
0            7.62        2.29         7.17
1            6.51        3.53         6.00
2            5.66        8.52         6.23
3            9.17        7.14         8.88
4            7.39        7.12         7.26
5            8.08        5.40         7.90
6            8.97        4.56         8.76
8            11.58       .            11.58
9            13.53       10.72        13.08
10           13.33       9.00         12.69
11           15.25       11.16        14.26
12           15.19       11.67        14.35
13           14.77       11.51        13.72
14           15.96       11.26        14.74
15           15.29       14.89        15.14
16           17.11       12.87        15.55
17           22.53       16.27        20.44
18           24.16       16.13        22.73
19           37.97       15.33        28.92
20           36.42       .            36.42
21           23.63       26.14        24.13
22           37.50       .            37.50
23           .           .            .
The observations are pretty well spread out with some clumping in the range 25-40, as expected. You can read interesting pieces of information from the table: "The number of young females (< 19) is greater than males; females seem to have a younger age profile, with many of the observations in the 30-38 age range," etc. Compare these facts with known facts about the distribution of the population. Do the cells in this table conform to reality? Also note that at this stage you have been able to look at a very micro-level aggregation.

Distribution of Age for Males and Females (% within each GENDER category)
AGE    GENDER 0   GENDER 1
15     1.2%       2.5%
16     1.5%       2.5%
17     1.5%       3.0%
18     2.3%       2.5%
19     2.4%       2.5%
20     2.9%       2.0%
21     2.7%       1.5%
22     2.9%       3.0%
23     1.9%       2.7%
24     2.4%       2.7%
25     2.7%       3.2%
26     3.3%       2.7%
27     2.5%       3.5%
28     2.5%       2.7%
29     2.7%       2.5%
30     3.7%       4.0%
31     2.7%       4.0%
32     3.3%       4.7%
33     3.2%       4.2%
34     2.7%       2.7%
35     3.6%       3.5%
36     3.3%       2.7%
37     3.0%       4.2%
38     2.9%       2.2%
39     2.6%       2.7%
40     2.5%       1.5%
41     2.4%       1.5%
42     2.3%       1.5%
43     1.8%       2.2%
44     1.1%       1.5%
45     2.4%       1.0%
46     2.2%       1.0%
47     1.7%       1.0%
48     1.1%       .5%
49     1.8%       1.7%
50     1.7%       1.5%
51     2.5%       2.0%
52     1.4%
53     .9%        .5%
54     1.2%       1.2%
55     1.2%       .7%
56     .6%        1.2%
57     .6%        .5%
58     .8%        1.0%
59     .7%        .2%
60     1.1%
61     .6%        1.2%
62     .1%        .7%
63     .3%        .5%
64     .1%        .2%
65     .3%        .5%
Ch 7. LINEAR REGRESSION
Regression procedures are used to obtain statistically established causal relationships between
variables. Regression analysis is a multi-step technique. The process of conducting "Ordinary
Least Squares" estimation is shown in section 7.1.
Several options must be carefully selected while running a regression, because the all-important
process of interpretation and diagnostics depends on the output (tables and charts produced
from the regression procedure) of the regression and this output, in turn, depends upon the
options you choose.
Interpretation of regression output is discussed in section 7.2 (see footnote 83). Our approach might conflict
with practices you have employed in the past, such as always looking at the R-square first. As a
result of our vast experience in using and teaching econometrics, we are firm believers in our
approach. You will find the presentation to be quite simple - everything is in one place and
displayed in an orderly manner.
The acceptance (as being reliable/true) of regression results hinges on diagnostic checking for the breakdown of classical assumptions84. If there is a breakdown, then the estimation is unreliable, and thus the interpretation from section 7.2 is unreliable. Section 7.3 lists the various possible breakdowns and their implications for the reliability of the regression results85.
Why is the result not acceptable unless the assumptions are met? The reason is that the strong
statements inferred from a regression (i.e. - "an increase in one unit of the value of variable X
causes an increase in the value of variable Y by 0.21 units") depend on the presumption that the
variables used in a regression, and the residuals from the regression, satisfy certain statistical
properties. These are expressed in the properties of the distribution of the residuals (that
explains why so many of the diagnostic tests shown in sections 7.4-7.5 and the corrective
methods shown in chapter 8 are based on the use of the residuals). If these properties are
satisfied, then we can be confident in our interpretation of the results.
The above statements are based on complex formal mathematical proofs. Please check your
textbook if you are curious about the formal foundations of the statements.
Section 7.4 provides a schema for checking for the breakdown of classical assumptions. The
testing usually involves informal (graphical) and formal (distribution-based hypothesis tests like
the F and T) testing, with the latter involving the running of other regressions and computing of
variables.
83. Even though interpretation precedes checking for the breakdown of classical assumptions, it is good practice to first check for the breakdown of classical assumptions (sections 7.4-7.5), then to correct for the breakdowns (chapter 8), and then, finally, to interpret the results of a regression analysis.
84. We will use the phrase "Classical Assumptions" often. Check your textbook for details about these assumptions. In simple terms, regression is a statistical method. The fact that this generic method can be used for so many different types of models and in so many different fields of study hinges on one area of commonality - the model rests on the bedrock of the solid foundations of well-established and proven statistical properties/theorems. If the specific regression model is in concordance with the certain assumptions required for the use of these properties/theorems, then the generic regression results can be inferred. The classical assumptions constitute these requirements.
85. If you find any breakdown(s) of the classical assumptions, then you must correct for it by taking appropriate measures. Chapter 8 looks into these measures. After running the "corrected" model, you again must perform the full range of diagnostic checks for the breakdown of classical assumptions. This process will continue until you no longer have a serious breakdown problem, or the limitations of data compel you to stop.
Section 7.5 explores in detail the many steps required to run one such formal test: White's test
for heteroskedasticity.
Similarly, formal tests are typically required for other breakdowns. Refer to a standard
econometrics textbook to review the necessary steps.
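To give a sense of where the dialog choices of section 7.1 end up, here is a rough syntax sketch of the regression run in this chapter (the variable names are our assumption; the /SAVE subcommand stores the predicted values and residuals that the diagnostics use):

* OLS regression of wage, with confidence intervals, the residuals-versus-
* predicted plot, a residual histogram, and saved diagnostic variables.
REGRESSION
  /STATISTICS=COEFF OUTS CI ANOVA
  /DEPENDENT=wage
  /METHOD=ENTER age educatio gender pub_sec work_ex
  /SCATTERPLOT=(*ZRESID,*ZPRED)
  /RESIDUALS=HISTOGRAM(ZRESID)
  /SAVE=PRED RESID.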
86. For example, the residuals are used in the White's test while the predicted dependent variable is used in the RESET test. (See section 7.5.)
87. "Distance Measurement" (and use) will be dealt with in a follow-up book and/or the next edition of this book in January, 2000. The concept is useful for many procedures apart from Regressions.
88. These provide the estimates for the coefficients on the independent variables, their standard errors and T-statistics, and the range of values within which we can say, with 95% confidence, that the coefficient lies.
89. If the model fit indicates an unacceptable F-statistic, then analyzing the remaining output is redundant - if a model does not fit, then none of the results can be trusted. Surprisingly, we have heard a professor working for Springer-Verlag dispute this basic tenet. We suggest that you ascertain your professor's view on this issue.
It is typically unnecessary to change any option here. Click on "Plots."
A digression:
90 If Sig < .01, then the model is significant at 99%; if Sig < .05, then the model is significant at 95%; and if Sig < .1, the model is significant at 90%. Significance implies that we can accept the model. If Sig > .1, then the model was not significant (a relationship could not be found) or "R-square is not significantly different from zero."
In your textbook you will encounter the terms TSS, ESS, and RSS (Total, Explained, and Residual Sum
of Squares, respectively). The TSS is the total deviations in the dependent variable. The ESS is the
amount of this total that could be explained by the model. The R-square, shown in the next table, is the
ratio ESS/TSS. It captures the percent of deviation from the mean in the dependent variable that could be
explained by the model. The RSS is the amount that could not be explained (TSS minus ESS). In the
previous table, the column "Sum of Squares" holds the values for TSS, ESS, and RSS. The row "Total" is
TSS (106809.9 in the example), the row "Regression" is ESS (54514.39 in the example), and the row
"Residual" contains the RSS (52295.48 in the example).
91 Look in the column "Variables Entered."
92 The Adjusted R-Square shows that 50.9% of the variance was explained.
93 The "R-Square" tells us that 51% of the variation was explained.
94 Compare this to the mean of the variable you asked SPSS to create - "Unstandardized Predicted." If the Std. Error is more than 10% of the mean, it is high.
If the value in Sig. is less than 0.05, then the estimate in column B can be asserted as true with a
95% level of confidence95. Always interpret the "Sig" value first. If this value is more than .1,
then the coefficient estimate is not reliable because it has "too" much dispersion/variance.
Coefficients a
(Unstandardized Coefficients; 95% Confidence Interval for B)
Model 1        B        Std. Error   t        Sig.   Lower Bound   Upper Bound
(Constant)     -1.820   .420         -4.339   .000   -2.643        -.997
AGE            .118     .014         8.635    .000   .091          .145
EDUCATION      .777     .025         31.622   .000   .729          .825
GENDER         -2.030   .289         -7.023   .000   -2.597        -1.463
PUB_SEC        1.741    .292         5.957    .000   1.168         2.314
WORK_EX        .100     .017         5.854    .000   .067          .134
a. Dependent Variable: WAGE
A pattern in this plot (of the standardized predicted values against the standardized residuals) indicates the existence of mis-specification96 and/or heteroskedasticity.
[Figure: scatter of the regression standardized predicted value against the regression standardized residual.]
This test requires the running of a new regression using the variables you saved in this regression - both the predicted values and the residuals. You will be required to create other transformations of these variables (see section 2.2 to learn how). Review your textbook for the step-by-step description of the RESET test.
95 If the value is greater than 0.05 but less than 0.1, we can only assert the veracity of the value in B with a 90% level of confidence. If Sig is above 0.1, then the estimate in B is unreliable and is said to not be statistically significant. The confidence intervals provide a range of values within which we can assert with a 95% level of confidence that the estimated coefficient in B lies. For example, "The coefficient for age lies in the range .091 to .145 with a 95% level of confidence, while the coefficient for gender lies in the range -2.597 to -1.463 at a 95% level of confidence."
96 Incorrect functional form, an omitted variable, or a mis-measured independent variable.
A formal test like White's test is necessary to conclusively prove the existence of heteroskedasticity. We will run the test in section 7.5.
[Figure: partial scatter plots of WAGE against EDUCATION and of WAGE against WORK_EX.]
Note: Sometimes these plots may not show a pattern.
[Figure: normal P-P plot of the regression residuals - the observed cumulative probability plotted against the expected cumulative probability.]
97 See chapter 3 for interpretation of the P-P. The residuals should be distributed normally. If not, then some classical assumption has been violated.
Sig.-F (in the ANOVA table)
  What it indicates: Whether the model as a whole is significant. It tests whether the R-square is significantly different from zero.
  Critical values: below .01 for 99% confidence in the ability of the model to explain the dependent variable; below .05 for 95% confidence; below .1 for 90% confidence.
  Comment: The first statistic to look for in SPSS output. If Sig.-F is insignificant, then the regression as a whole has failed. No more interpretation is necessary (although some statisticians disagree on this point). You must conclude that the "Dependent variable cannot be explained by the independent/explanatory variables." The next steps could be rebuilding the model, using more data points, etc.

RSS, ESS & TSS (in the ANOVA table)
  What they indicate: The main function of these values lies in calculating test statistics like the F-test, etc.
  Critical values: The ESS should be high compared to the TSS (the ratio equals the R-square). Note, for interpreting the SPSS table, column "Sum of Squares": "Total" = TSS, "Regression" = ESS, "Residual" = RSS.
  Comment: If the R-squares of two models are very similar or rounded off to zero or one, then you might prefer to use the F-test formula that uses RSS and ESS.

SE of Regression (in the Model Summary table)
  What it indicates: The standard error of the estimate/predicted dependent variable.
  Critical values: There is no critical value. Just compare the std. error to the mean of the predicted dependent variable. The former should be small (<10%) compared to the latter.
  Comment: You may wish to comment on the SE, especially if it is too large or small relative to the mean of the predicted/estimated values of the dependent variable.

R-Square (in the Model Summary table)
  What it indicates: Proportion of variation in the dependent variable that can be explained by the independent variables.
  Critical values: Between 0 and 1. A higher value is better.
  Comment: This often mis-used value should serve only as a summary measure of Goodness of Fit. Do not use it blindly as a criterion for model selection.

Adjusted R-square (in the Model Summary table)
  What it indicates: Proportion of variance in the dependent variable that can be explained by the independent variables, or R-square adjusted for the number of independent variables.
  Critical values: Below 1. A higher value is better.
  Comment: Another summary measure of Goodness of Fit. Superior to R-square because it is sensitive to the addition of irrelevant variables.

T-Ratios (in the Coefficients table)
  What they indicate: The reliability of our estimate of the individual beta.
  Critical values: Look at the p-value (in the column Sig.); it must be low - below .01 for 99% confidence in the value of the estimated coefficient.
  Comment: For a one-tailed test (at the 95% confidence level), the critical value is (approximately) 1.65 for testing if the coefficient is greater than zero and (approximately) -1.65 for testing if it is below zero.

Confidence Interval for beta (in the Coefficients table)
  What it indicates: The 95% confidence band for each beta estimate.
  Critical values: The upper and lower values give the 95% confidence limits for the coefficient.
  Comment: Any value within the confidence interval cannot be rejected (as the true value) at the 95% degree of confidence.

Charts: Histograms of residuals
  What they indicate: Provide an idea about the distribution of the residuals.
  Critical values: The distribution should look like a normal distribution.
  Comment: A good way to observe the actual behavior of our residuals and to observe any severe problem in the residuals (which would indicate a breakdown of the classical assumptions).
When using the table below, remember the ordering of the severity of an impact:
The worst impact is a bias in the F (then the model can't be trusted).
A second disastrous impact is a bias in the betas (the coefficient estimates are unreliable).
Compared to the above, biases in the standard errors and T are not so harmful (these biases only affect the reliability of our confidence in the variability of an estimate, not the reliability of the value of the estimate itself).
[Table: the impact of each breakdown (e.g. - an irrelevant variable, an omitted variable) on the F, the betas, the standard errors, and the T-statistics, with upward and downward biases marked.]
Ch 7. Section 4 Diagnostics
This section lists some methods of detecting breakdowns of the classical assumptions.
With experience, you should develop the habit of doing the diagnostics before interpreting the
model's significance, explanatory power, and the significance and estimates of the regression
coefficients. If the diagnostics show the presence of a problem, you must first correct the
problem (using methods such as those shown in chapter 8) and then interpret the model.
Remember that the power of a regression analysis (after all, it is extremely powerful to be able
to say that "data shows that X causes Y by this slope factor") is based upon the fulfillment of
certain conditions that are specified in what have been dubbed the "classical" assumptions.
Refer to your textbook for a comprehensive listing of methods and their detailed descriptions.
98 Also called Multicollinearity.
regression results. If the variables have a close linear relationship, then the estimated regression
coefficients and T-statistics may not be able to properly isolate the unique effect/role of each
variable and the confidence with which we can presume these effects to be true. The close
relationship of the variables makes this isolation difficult. Our explanation may not satisfy a
statistician, but we hope it conveys the fundamental principle of collinearity.
[Table: Collinearity Diagnostics output.]
Note: Mis-specification covers a list of problems discussed in sections 8.3 to 8.5. These
problems can cause moderate or severe damage to the regression analysis. Of graver
importance is the fact that most of these problems are caused not by the nature of the data/issue,
but by the modeling work done by the researcher. It is of the utmost importance that every
researcher realize that the responsibility of correctly specifying an econometric model lies solely
with him or her. A proper specification includes determining curvature (linear or not), functional form
(whether to use logs, exponentials, or squared variables), and the accuracy of measurement of
each variable, etc.
99 Some books advise using 0.8.
If the correct relation between the variables is non-linear but you use a linear model and do not
transform the variables100, then the results will be biased. Listed below are methods of detecting
incorrect functional forms:
Perform a preliminary visual test. To do this, we asked SPSS for the plot ZPRED
and Y-PRED while running the regression (see section 7.1). Any pattern in this
plot implies mis-specification (and/or heteroskedasticity) due to the use of an
incorrect functional form or due to omission of a relevant variable.
If the visual test indicates a problem, perform a formal diagnostic test like the RESET test101 or the DW test102.
Check the mathematical derivation (if any) of the model.
Determine whether any of the scatter plots have a non-linear pattern. If so, is the
pattern log, square, etc?
The nature of the distribution of a variable may provide some indication of the
transformation that should be applied to it. For example, section 3.2 showed that
wage is non-normal but that its log is normal. This suggests re-specifying the
model by using the log of wage instead of wage.
Check your textbook for more methods.
Not including a variable that actually plays a role in explaining the dependent variable can bias
the regression results. Methods of detection103 include:
Perform a preliminary visual test. To do this, we asked SPSS for the plot ZPRED
and Y-PRED while running the regression (see section 7.1). Any pattern in this
plot implies mis-specification (and/or heteroskedasticity) due to the use of an
incorrect functional form or due to the omission of a relevant variable.
If the visual test indicates a problem, perform a formal diagnostic test such as the
RESET test.
Apply your intuition, previous research, hints from preliminary Bivariate analysis,
etc. For example, in the model we ran, we believe that there may be an omitted
variable bias because of the absence of two crucial variables for wage
determination - whether the labor is unionized and the professional sector of work
(medicine, finance, retail, etc.).
Check your textbook for more methods.
100 In section 8.3, you will learn how to use square and log transformations to remove mis-specification.
101 The test requires the variables predicted Y and predicted residual. We obtained these when we asked SPSS to save the "unstandardized" predicted dependent variable and the unstandardized residuals, respectively (see section 7.1).
102 Check your textbook for other formal tests.
103 The first three tests are similar to those for incorrect functional form.
This is not a very severe problem if it only afflicts the dependent variable, but it may bias the T-
statistics. Methods of detecting this problem include:
Knowledge about problems/mistakes in data collection
There may be a measurement error if the variable you are using is a proxy for the
actual variable you intended to use. In our example, the wage variable includes the
monetized values of the benefits received by the respondent. But this is a subjective
monetization by the respondents and is probably undervalued. As such, we can guess that
there is probably some measurement error.
Check your textbook for more methods.
104 By dropping it, we improve the reliability of the T-statistics of the other variables (which are relevant to the model). But, we may be causing a far more serious problem - an omitted variable! An insignificant T is not necessarily a bad thing - it is the result of a "true" model. Trying to remove variables to obtain only significant T-statistics is bad practice.
105 Other tests: Park, Glejser, Goldfeld-Quandt. Refer to your textbook for a comprehensive listing of methods and their detailed descriptions.
The White's test is usually used as a test for heteroskedasticity. In this test, a regression of the
squares of the residuals106 is run on the variables suspected of causing the heteroskedasticity,
their squares, and cross products.
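For readers who want a syntax trail of the steps that follow, a minimal sketch of this auxiliary regression is shown below. The variable names SQ_RES, SQ_EDUC, SQ_WORK, and EDU_WORK match those in this example's output; RES_1 is assumed to be the name under which SPSS saved the unstandardized residual in section 7.1.

    * Square the saved residual; build the squares and the cross product.
    COMPUTE sq_res = res_1 * res_1.
    COMPUTE sq_educ = education * education.
    COMPUTE sq_work = work_ex * work_ex.
    COMPUTE edu_work = education * work_ex.
    EXECUTE.
    * Regress the squared residual on the suspect variables, their squares, and the cross product.
    REGRESSION
      /DEPENDENT sq_res
      /METHOD=ENTER education work_ex sq_educ sq_work edu_work.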
106 The test requires the variable predicted residual. We obtained this when we asked SPSS to save the unstandardized residuals (see section 7.1).
107 If you are unfamiliar with this procedure, please refer to section 2.2.
Model Summary a
Variables Entered: SQ_WORK, SQ_EDUC, EDU_WORK, Work Experience, EDUCATION
R Square: .037
Adjusted R Square: .035
Std. Error of the Estimate: .2102
a. Dependent Variable: SQ_RES
White's Test
Calculate n*R-square: R-square = .037 and n = 2016, so n*R-square = .037*2016 = 74.6.
The critical value chi-square(2016) = 124 is obtained from the chi-square table (for 95% confidence).
As n*R-square < chi-square, heteroskedasticity cannot be confirmed.
Note: Please refer to your textbook for further information regarding the interpretation of the
White's test. If you have not encountered the Chi-Square distribution/test before, there is no
need to panic! The same rules apply for testing using any distribution - the T, F, Z, or Chi-
Square. First, calculate the required value from your results. Here the required value is the
sample size ("n") multiplied by the R-square. You must determine whether this value is higher
than that in the standard table for the relevant distribution (here the Chi-Square) at the
recommended level of confidence (usually 95%) for the appropriate degrees of freedom (for the
White's test, this equals the sample size "n") in the table for the distribution (which you will find
in the back of most econometrics/statistics textbooks). If the former is higher, then the
hypothesis is rejected. Usually the rejection implies that the test could not find a problem108.
108 We use the phraseology "Confidence Level of 95%." Many professors may frown upon this, instead preferring to use "Significance Level of 5%." Also, our explanation is simplistic. Do not use it in an exam! Instead, refer to the chapter on "Hypothesis Testing" or "Confidence Intervals" in your textbook. A clear understanding of these concepts is essential.
Chapter 8: Correcting for Breakdowns of the Classical Assumptions
In the introduction to this chapter, we place some notes containing intuitive explanations
of the reasons why the breakdowns cause a problem. (These notes have light shading.)
Our explanations are too informal for use in an exam and may not satisfy a statistician, but
we hope they get the intuitive picture across. We include them here to help you understand
the problems more clearly.
Why is the result not acceptable unless the assumptions are met? The reason is simple - the
strong statements inferred from a regression (e.g. - "an increase in one unit of the value of
variable X causes an increase of the value of variable Y by 0.21 units") depend on the
presumption that the variables used in a regression, and the residuals from that regression,
satisfy certain statistical properties. These are expressed in the properties of the distribution of
the residuals. That explains why so many of the diagnostic tests shown in sections 7.4-7.5 and
their relevant corrective methods, shown in this chapter, are based on the use of the residuals.
If these properties are satisfied, then we can be confident in our interpretation of the results.
The above statements are based on complex, formal mathematical proofs. Please refer to your
textbook if you are curious about the formal foundations of the statements.
If a formal109 diagnostic test confirms the breakdown of an assumption, then you must attempt
to correct for it. This correction usually involves running another regression on a transformed
version of the original model, with the exact nature of the transformation being a function of the
classical regression assumption that has been violated110.
In section 8.1, you will learn how to correct for collinearity (also called multicollinearity)111.
109 Usually, a "formal" test uses a hypothesis testing approach. This involves the use of testing against distributions like the T, F, or Chi-Square. An "informal" test typically refers to a graphical test.
110 Don't worry if this line confuses you at present - its meaning and relevance will become apparent as you read through this chapter.
111 We have chosen this order of correcting for breakdowns because this is the order in which the breakdowns are usually taught in schools. Ideally, the order you follow should be based upon the degree of harm a particular breakdown causes. First, correct for mis-specification due to incorrect functional form and simultaneity bias. Second, correct for mis-specification due to an omitted variable and measurement error in an independent variable. Third, correct for collinearity. Fourth, correct for heteroskedasticity and measurement error in the dependent variable. Fifth, correct for the inclusion of irrelevant variables. Your professor may have a different opinion.
Note: Heteroskedasticity implies that the variances (i.e. - the dispersion around the expected
mean of zero) of the residuals are not constant - that they are different for different
observations. This causes a problem. If the variances are unequal, then the relative reliability of
each observation (used in the regression analysis) is unequal. The larger the variance, the lower
should be the importance (or weight) attached to that observation. As you will see in section
8.2, the correction for this problem involves the downgrading in relative importance of those
observations with higher variance. The problem is more apparent when the value of the
variance has some relation to one or more of the independent variables. Intuitively, this is a
problem because the distribution of the residuals should have no relation with any of the
variables (a basic assumption of the classical model).
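In symbols (a sketch of the standard result, not SPSS output): if the variance of the residual for an observation is proportional to some quantity z, WLS weights that observation by 1/sqrt(z), so observations with larger variance count for less in the estimation.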
In section 8.3 you will learn how to correct for mis-specification due to incorrect functional
form.
Mis-specification covers a list of problems discussed in sections 8.3 to 8.5. These problems can
cause moderate or severe damage to the regression analysis. Of graver importance is the fact
that most of these problems are caused not by the nature of the data/issue, but by the modeling
work done by the researcher. It is of the utmost importance that every researcher realize that the
responsibility of correctly specifying an econometric model lies solely with him or her. A proper
specification includes determining curvature (linear or not), functional form (whether to use
logs, exponentials, or squared variables), and the measurement accuracy of each variable, etc.
Note: Why should an incorrect functional form lead to severe problems? Regression is based
on finding coefficients that minimize the "sum of squared residuals." Each residual is the
difference between the predicted value (the regression line) of the dependent variable versus the
realized value in the data. If the functional form is incorrect, then each point on the regression
"line" is incorrect because the line is based on an incorrect functional form. A simple example:
assume Y has a log relation with X (a log curve represents their scatter plot) but a linear relation
with "Log X." If we regress Y on X (and not on "Log X"), then the estimated regression line
will have a systemic tendency for a bias because we are fitting a straight line on what should be
a curve. The residuals will be calculated from the incorrect "straight" line and will be wrong. If
they are wrong, then the entire analysis will be biased because everything hinges on the use of
the residuals.
Section 8.4 teaches 2SLS, a procedure that corrects for simultaneity bias.
Note: Simultaneity bias may be seen as a type of mis-specification. This bias occurs if one or
more of the independent variables is actually dependent on other variables in the equation. For
example, we are using a model that claims that income can be explained by investment and
education. However, we might believe that investment, in turn, is explained by income. If we
were to use a simple model in which income (the dependent variable) is regressed on
investment and education (the independent variables), then the specification would be incorrect
because investment would not really be "independent" to the model - it is affected by income.
Intuitively, this is a problem because the simultaneity implies that the residual will have some
relation with the variable that has been incorrectly specified as "independent" - the residual is
capturing (more in a metaphysical than formal mathematical sense) some of the unmodeled
reverse relation between the "dependent" and "independent" variables.
Section 8.5 discusses how to correct for other specification problems: measurement errors,
omitted variable bias, and irrelevant variable bias.
Note: Measurement errors causing problems can be easily understood. Omitted variable bias is
a bit more complex. Think of it this way - the deviations in the dependent variable are in reality
explained by the variable that has been omitted. Because the variable has been omitted, the
algorithm will, mistakenly, apportion what should have been explained by that variable to the
other variables, thus creating the error(s). Remember: our explanations are too informal and
probably incorrect by strict mathematical proof for use in an exam. We include them here to
help you understand the problems a bit better.
Our approach to all these breakdowns may be a bit too simplistic or crude for purists. We
have striven to be lucid and succinct in this book. As such, we may have used the most
common methods for correcting for the breakdowns. Please refer to your textbook for
more methods and for details on the methods we use.
Because we are following the sequence used by most professors and econometrics textbooks,
we first correct for collinearity and heteroskedasticity. Then we correct for mis-specification. It
is, however, considered standard practice to correct for mis-specification first. It may be helpful
to use the table in section 7.3 as your guide.
Also, you may sense that the separate sections in this chapter do not incorporate the corrective
procedures in the other sections. For example, the section on mis-specification (section 8.3)
does not use the WLS for correcting for heteroskedasticity (section 8.2). The reason we have
done this is to make each corrective procedure easier to understand by treating it in isolation. In
practice, you should always incorporate the features of corrective measures.
The variables age and work experience are correlated (see section 7.3). There are several112
ways to correct for this. We show an example of one such method: "Dropping all but one of the
collinear variables from the analysis113."
112 Sometimes adding new data (increasing sample size) and/or combining cross-sectional and time series data can also help reduce collinearity. Check your textbook for more details on the methods mentioned here.
113 Warning: many researchers, finding that two variables are correlated, drop one of them from the analysis. However, the solution is not that simple because this may cause mis-specification due to the omission of a relevant variable (the one that was dropped), which is more harmful than collinearity.
We know the model is significant because the Sig. of the F-statistic is below .05.
ANOVA a
Model 1        Sum of Squares   df     Mean Square   F         Sig.
Regression     52552.19         4      13138.05      481.378   .000 b
Residual       54257.68         1988   27.293
Total          106809.9         1992
a. Dependent Variable: WAGE
b. Independent Variables: (Constant), WORK_EX, EDUCATION, GENDER, PUB_SEC
Because we are following the sequence used by most professors and econometrics textbooks,
we have first corrected for collinearity and heteroskedasticity. We will later correct for mis-
specification. It is, however, considered standard practice to correct for mis-specification first,
as it has the most severe implications for the interpretation of regression results. It may be
helpful to use the table in section 7.3 as your guide.
Education^1.5
We firmly believe that education should be used114, and we further feel that one of the above
three transformations of education would be best. We can let SPSS take over from here115. It
will find the best transformation of the three above, and then run a WLS regression with no
threat of heteroskedasticity.
114
See sections 7.2 and 7.5 for justification of our approach.
115
There exists another approach to solving for heteroskedasticity: White's Heteroskedasticity Consistent Standard
Errors. Using this procedure, no transformations are necessary. The regression uses a formula for standard errors
that automatically corrects for heteroskedasticity. Unfortunately, SPSS does not offer this method/procedure.
116 SPSS will search through .5+0 = .5, .5+.5 = 1, and .5+.5+.5 = 1.5.
Now go back to
STATISTICS/REGRESSION/
WEIGHT ESTIMATION
[Output: Analysis of Variance table.]
Each coefficient can be interpreted directly (compare this to the indirect method shown at the
end of section 8.2.b). The results do not suffer from heteroskedasticity. Unfortunately, the
output is not as rich (there are no plots or output tables produced) as that obtained when using
STATISTICS/REGRESSION/LINEAR (as in the earlier sections of this chapter, chapter 7, and
section 8.2.b).
A new variable wgt_2 is created. This represents the best heteroskedasticity-correcting power
of education.
117 The weight is 1/(education^.5) = education^-.5.
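The same correction can be written directly in syntax. A minimal sketch, assuming (as in the output below) that the best power found was .5, so the weight is education^-.5; the /REGWGT subcommand of REGRESSION performs weighted least squares with a supplied weight variable:

    * Create the weight (1 over the square root of education).
    COMPUTE wgt_2 = 1 / SQRT(education).
    EXECUTE.
    * Weighted least squares using the weight variable.
    REGRESSION
      /REGWGT=wgt_2
      /DEPENDENT wage
      /METHOD=ENTER education gender pub_sec agesq age.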
Press "OK."
The variables have been transformed in WLS. Do not make a direct comparison with the OLS results in
the previous chapter.
To make a comparison, you must map the new coefficients on the "real" coefficients on the original
(unweighted) variables. This is in contrast to the direct interpretation of coefficients in section 8.2.a.
Refer to your econometrics textbook to learn how to do this.
Coefficients a,b
(Unstandardized Coefficients; 95% Confidence Interval for B)
Model 1        B          Std. Error   t        Sig.   Lower Bound   Upper Bound
(Constant)     -3.571     .849         -4.207   .000   -5.235        -1.906
EDUCATION      .694       .026         26.251   .000   .642          .746
GENDER         -1.791     .245         -7.299   .000   -2.272        -1.310
PUB_SEC        1.724      .279         6.176    .000   1.177         2.272
AGESQ          -3.0E-03   .001         -4.631   .000   -.004         -.002
AGE            .328       .049         6.717    .000   .232          .423
a. Dependent Variable: WAGE
b. Weighted Least Squares Regression - Weighted by Weight for WAGE from WLS, MOD_1 EDUC**-.500
Note: other output suppressed and not interpreted. Refer to section 7.2 for detailed interpretation
guidelines.
We begin by creating and including a new variable, the square of work experience118. The logic is
that the incremental effect on wages of a one-year increase in experience should decline as the
experience level increases.
118 Why choose this transformation? Possible reasons for choosing this transformation: a hunch, the scatter plot may have shown a slight concave curvature, or previous research may have established that such a specification of age is appropriate for a wage determination model.
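In syntax, the new variable is a single COMPUTE statement; a sketch, using SQ_WORK, the name that appears in the output that follows:

    COMPUTE sq_work = work_ex ** 2.
    EXECUTE.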
The coefficient on sq_work is negative and significant, suggesting that the increase in wages resulting
from an increase in work_ex decreases as work_ex increases.
Coefficients a
(Unstandardized Coefficients; 95% Confidence Interval for B)
Model 1        B          Std. Error   t        Sig.   Lower Bound   Upper Bound
(Constant)     .220       .278         .791     .429   -.326         .766
EDUCATION      .749       .025         30.555   .000   .701          .797
GENDER         -1.881     .291         -6.451   .000   -2.452        -1.309
PUB_SEC        2.078      .289         7.188    .000   1.511         2.645
WORK_EX        .422       .037         11.321   .000   .349          .495
SQ_WORK        -7.1E-03   .001         -6.496   .000   -.009         -.005
a. Dependent Variable: WAGE
[Figure: plot of the regression standardized predicted values against the standardized residuals for the re-specified model.]
What else may be causing mis-specification? Omitted variable bias may be a cause. Our theory
and intuition tells us that the nature of the wage-setting environment (whether unionized or not)
and area of work (law, administration, engineering, economics, etc.) should be relevant
variables, but we do not have data on them.
Another cause may be the functional form of the model equation. Should any of the variables
(apart from age) enter the model in a non-linear way? To answer this, one must look at:
The models used in previous research on the same topic, possibly with data on the same
region/era, etc.
Intuition based on one's understanding of the relationship between the variables and the
manner in which each variable behaves
Inferences from pre-regression analyses such as scatter-plots
119 We only did a graphical test. For formal tests like the RESET test, see a standard econometrics textbook like Gujarati. The test will require several steps, just as the White's test did in section 7.5.
In our case, all three aspects listed below provide support for using a log transformation of
wages as the dependent variable.
Previous research on earnings functions has successfully used such a transformation and
thus justified its use.
Intuition suggests that the absolute change in wages will be different at different levels of
wages. As such, comparing percentage changes is better than comparing absolute changes.
This is exactly what the use of logs will allow us to do.
The scatters showed that the relations between wage and education and between wage and
work experience are probably non-linear. Further, the scatters indicate that using a log
dependent variable may be justified. We also saw that wage is not distributed normally but
its log is. So, in conformity with the classical assumptions, it is better to use the log of
wages.
Arguably, mis-specification is the most debilitating problem an analysis can incur. As shown in
section 7.3, it can bias all the results. Moreover, unlike measurement errors, the use of an
incorrect functional form is a mistake for which the analyst is to blame.
To run the re-specified model, we first must create the log transformation of wage.
Note: The creation of new variables was shown in section 2.2. We are repeating it here to
reiterate the importance of knowing this procedure.
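As a syntax alternative to the menu route, a one-line sketch using SPSS's natural log function (cases with a wage of zero or below become system-missing):

    COMPUTE lnwage = LN(wage).
    EXECUTE.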
Now that the variable lnwage has been created, we must run the re-specified model.
The plot of predicted versus residual shows that the problem of mis-specification is gone!
[Figure: Scatterplot - Dependent Variable: LNWAGE; the standardized predicted values plotted against the standardized residuals, with no discernible pattern.]
Now the results can be trusted. They have no bias due to any major breakdown of the classical
assumptions.
ANOVA a
Model 1        Sum of Squares   df     Mean Square   F         Sig.
Regression     732.265          5      146.453       306.336   .000 b
Residual       960.463          2009   .478
Total          1692.729         2014
a. Dependent Variable: LNWAGE
b. Independent Variables: (Constant), Work Experience, EDUCATION, GENDER, Whether Public Sector Employee, SQAGE
Model Summary a
Variables Entered: Work Experience, EDUCATION, GENDER, Whether Public Sector Employee, SQAGE
R Square: .433
Adjusted R Square: .431
Std. Error of the Estimate: .6914
a. Dependent Variable: LNWAGE
[Table: Coefficients output for the re-specified model.]
None of the regression results before this (in chapter 7 and sections 8.1-8.2) can be compared to
this, as they were all biased due to mis-specification. This is the most important issue in
regression analysis. Focus your attention on diagnosing and correcting for the
breakdowns in the classical assumptions (and not on the R-square).
But what if the "independent" variable education is actually "dependent" on the variable
gender? Using the equation above would then be incorrect because one of the right-hand-side
variables (education) is not truly independent. If you just ran the equation above, simultaneity
bias would result, severely compromising the reliability of your results.
Instead, using 2SLS, you can run the real model that consists of two equations, one to explain
wage and another to explain education:
wage = function(work experience, education)
education = function(gender)
120 In our example, gender and work experience.
2. Then, in the second stage regression, it will run the regression of interest to us - wage on
work experience and the predicted education from the first regression.
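If your installation includes the 2SLS procedure in syntax form, the two-stage setup can be sketched as below. The instrument list (gender plus the exogenous regressor work experience) follows footnote 120; treat this as an illustration under those assumptions, not a prescription for your own model:

    * Both stages (education on the instruments, then wage on work_ex
    * and the predicted education) are run together by the 2SLS command.
    2SLS wage WITH education work_ex
      /INSTRUMENTS=gender work_ex.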
R Square             .0084
Adjusted R Square    .0074
Standard Error       20.7887
[Output: Analysis of Variance table.]
Name Label
FIT_3 Fit for WAGE from 2SLS, MOD_5 Equation 1
Do not worry if the R-square is too "low." The R-square is a function of the model, the data,
sample size, etc. It is better to have a properly specified model (one conforming to the classical
assumptions) with a low R-square compared to an improperly specified model with a high R-
square. Honesty is a good policy - trying to inflate the R-square is a bad practice that an
incredible number of economists have employed (including so-called experts at Universities and
major research institutes).
Be careful not to cause this problem inadvertently while correcting for the problems of
collinearity121 or the inclusion of an irrelevant variable.
The other option would be to remove the "irrelevant" variables, a distressingly common
practice. Be careful - this approach has two problems:
121 When you use the correction method of dropping "all but one" of the collinear variables from the model.
If, by error, one removes a "relevant" variable, then we may be introducing an omitted
variable bias, a far worse breakdown in comparison to the presence of an irrelevant
variable.
A tendency to remove all variables that have an insignificant T-statistic may result in a
choice to ignore theory and instead use statistics to construct regression models, an
incorrect approach. The aim of regression analysis is to prove/support certain theoretical
and intuitive beliefs. All models should be based upon these beliefs.
The fact that the T is insignificant is itself a result. It shows that the variable does not have a
significant effect. Or, it can be interpreted as "the impact of the variable as measured by the
beta coefficient is not reliable because the estimated probability distribution of this beta has a
standard error that is much too high."
Your professor may scoff at the simplicity of some of our approaches. In cases of conflict,
always listen to the person who is grading your work.
Chapter 9: Non-Linear Estimation (including Logit)
All of these methods use an estimation technique called Maximum Likelihood Estimation
(MLE)124, an advanced algorithm that calculates the coefficients that maximize the
likelihood of observing the data as seen in the data set. MLE is a more powerful
likelihood of viewing the data distributions as seen in the data set. MLE is a more powerful
method than linear regression. More importantly, it is not subject to the same degree to the
classical assumptions (mentioned ad nauseam in chapters 7 and 8) that must be met for a
reliable Linear Regression.
The output from MLE differs from that of Linear Regression. In addition, since these models
are not based on properties of the residuals, as is the case with OLS, there are different
goodness-of-fit tests. We will not delve into the details of MLE and related diagnostics. Those
topics are beyond the scope of this book.
Ch 9. Section 1 Logit
Logit (also called logistic) estimates models in which the dependent variable is a dichotomous
dummy variable - the variable can take only two values, 1 and 0. These models are typically
122 When an independent variable is a dummy, it can be used in a linear regression without a problem as long as it is coded properly (as 0 and 1). What is the problem if the dependent variable is a dummy? If we run such a regression, the predicted values will lie within and in the vicinity of the two values of the original dependent variable, namely the values 0 and 1. What is the best interpretation of the predicted value? Answer: "The probability that the dependent variable takes on the quality captured by the value 1." In a linear regression, such predicted probabilities may be estimated at values less than 0 or greater than 1, both of which are nonsensical. Also, for reasons we will not delve into here, the R-square cannot be used, normality of the residuals is compromised, and a severe case of heteroskedasticity is always present. For all these reasons, linear regression should not be used. A stronger and simpler argument is that imposing a linear regression on what is a non-linear model (as will become apparent with the Logit example later) constitutes serious mis-specification (incorrect functional form).
123 Dummy and categorical variables are also called "qualitative" variables because the values of the variable describe a quality and not a quantity. For example, the dummy variable gender can take on two values - 0 and 1, the former if the respondent is male and the latter if the respondent is female, both of which are qualities.
124 You can estimate a linear model using the procedure "Non-Linear Regression." This may be useful if you want to show that the results are robust in the sense that the estimates from Least Squares Estimation (linear regression) and MLE are the same, or if violations of the classical assumptions are proving particularly difficult to overcome.
used to predict whether or not some event will occur, such as whether a person will vote "yes"
or "no" on a particular referendum, or whether a person will graduate this year (or not) from
high school, etc.
The other model used for estimating models with dichotomous dependent variables is the Probit. The Logit
and Probit techniques are similar and both use Maximum Likelihood Estimation methods. The
Logit is used more frequently because it is easier to interpret. That is why we show only the
Logit in this book.
[Figure: the Logit/Probit curve - an S-shaped curve of Y against X, bounded below by Y = 0 and above by Y = 1.]
If you look at a graph of the Logit or Probit (see graph above), you will notice a few striking
features: as the value on the X-axis increases, the value on the Y-axis gradually tends towards 1
but never reaches it. Conversely, as the value on the X-axis tends towards negative infinity, the
Y-value never drops below zero. The fact that the Y-value remains inside the bounds of 0 and 1
provides the intuitive rationale for using the Logit or Probit. The X-axis represents the
independent variable(s) and the Y represents the probability of the dependent variable taking the
value of 1. Because of the nature of the curve, the probability always remains within the range
of 0 and 1, regardless of the values of the independent variables. This is a requirement for
estimating the predicted value of a dummy variable because the predicted value is interpreted as
a probability. A probability must lie between 0 and 1.
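In syntax, the Logit we run through the menus in this section can be sketched as follows (PUB_SEC is the dependent dummy and education and gender are the explanatory variables, matching the example interpreted later; ITERATE(20) mirrors the maximum number of iterations discussed below):

    LOGISTIC REGRESSION VARIABLES pub_sec
      /METHOD=ENTER education gender
      /CRITERIA=ITERATE(20).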
125 The dependent variable can take only two values, 0 or 1. If you have any other values, then SPSS will generate an error. If that happens, go back to the data editor and remove the other values by using the appropriate procedure(s) from:
DATA/DEFINE VARIABLE/MISSING (see section 1.2)
TRANSFORM/RECODE/INTO SAME VARIABLES (see section 2.1)
126 The option "Logit" in residuals gives the residuals on a scale defined by the logistic distribution. Mathematically, the "Logit" values will be = (residual)/(p*(1-p)), where p is the (predicted) probability.
127 In our example, the probability that the respondent works in the public sector.
The MLE algorithm can be crudely described as "Maximizing the log of the likelihood that
what is observed in the data will occur." The endogenous or choice variables for these
algorithms are the coefficient estimates. The likelihood function is the joint probability of the
distribution of the data as captured by a logistic function (for the joint distribution). For those
who are curious as to why the "log" is used in MLE, we offer this explanation: When you take
128 If you run a Logit and the output informs you that the model did not converge or a solution could not be found, then come back to this dialog box and increase the number of iterations specified in the box Maximum Iterations. MLE runs an algorithm repetitively until the improvement in the result is less than the "Convergence Criterion," which in this case equals .01, or until the maximum number of iterations (20) is reached.
the logs of a joint distribution (which, in essence, is the multiplication of the distribution of each
observation), the algorithm is converted into an additive function. It is much simpler to work
with a function with 20,000 additive components (if your sample size is 20,000) than to work
with 20,000 multiplicative components.
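In symbols: log(p1 * p2 * ... * pn) = log(p1) + log(p2) + ... + log(pn), where each p is the likelihood contribution of one observation.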
Logistic Regression
Dependent Variable: PUB_SEC
This tells you that a solution was found. If not, then go back to the options box and increase the number of iterations.
[Output: chi-square, df, Significance]
Look at the Sig. If it is below 0.1, then the variable is significant at the 90% level. In this
example, education is significant, but gender is not.
Let's interpret the coefficient on education. Look in the column "Exp(B)." The value is
1.2249130. First subtract 1 from this: 1.2249 - 1 = .2249.
Then multiply the answer by 100: 100*(.2249) = 22.49%.
This implies that for a 1 unit increase in education (i.e. - one more year of education), the odds of joining the public sector increase by 22.49%131.
The "odds" interpretation may be less intuitive than an interpretation in terms of probability. To
do that, you will have to go back to column "B" and perform some complex calculations. Note
that the slope is not constant, so a one unit change in an independent variable will have a
different impact on the dependent variable, depending on the starting value of the independent
variables.
Note: Consult your textbook and class notes for further information on interpretation.
To compare between models when Maximum Likelihood Estimation is used (as it is throughout
this chapter), the relevant statistic is the "-2 Log Likelihood." In this example, the number is
2137.962. Consult your textbook for more details on the testing process.
The main advantage of non-linear estimation is that it allows for the use of highly flexible
functional forms.
We first use curve estimation to implement a simple 2 variable, 1 function, non-linear model
(See section 9.2.a). In section 9.2.b., we describe a more flexible method that can be used to
estimate complex multi-variable, multi-function models.
129 The Wald statistic is equivalent to the T-test in linear regression.
130 Remember that the Logit is non-linear. As such, the impact of education on the probability will depend upon the level of education. An increase in education from 11 to 12 years may have a different effect on the probability of joining the public sector than an increase in education from 17 to 18. Do not interpret it in the same manner as a linear regression.
131 The odds of "yes" = Probability("yes")/Probability("no").
You may find that this entire section is beyond the scope of what you must know for your
thesis/exam/project.
Curve estimation is a sub-set of non-linear estimation. The latter (shown in section 9.2.b) is far
more flexible and powerful. Still, we decided to first show an example of curve estimation so
that you do not get thrown headlong into non-linear estimation.
A zero value of the independent variable can disrupt our analysis. To avoid that, go to
DATA/DEFINE VARIABLE and define zero values of the independent variable (work_ex) as
missing values that are not to be included in the analysis (see section 1.2). You are now ready
for curve estimation132.
132 Curve estimation can only estimate 2 variable, 1 function models.
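In syntax, declaring zero as a user-missing value for work_ex is a single statement; a sketch:

    MISSING VALUES work_ex (0).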
Click on "Save."
Click on OK.
As presented in the shaded text below, the model is significant at 95% (as Sig. F < .05), the
intercept equals 5.41 (b0 = 5.41), and the coefficient is 1.52. The R-square is a pseudo
R-square, but can be roughly interpreted as the R-square in linear regression. Note that the
coefficient estimate is not the slope. The slope at any point depends on the coefficient and the
value of the independent variable at the point at which the slope is being measured.
Independent: WORK_EX
In addition, you can place constraints on the possible values that coefficients can take.
In linear regression, the "sum of squared residuals" was minimized. The Non-Linear
Estimation procedure allows you to minimize other functions (e.g. - deviations in predicted
values) as well.
Click on OK.
Addendum
The model had a good fit as shown in the output reproduced in the shaded text below.
Implies that a solution was found. If it was not found, then go back and increase the number
of iterations by pressing the button "Options."
Parameter   Estimate   Asymptotic Std. Error   Asymptotic 95% Confidence Interval (Lower, Upper)
B1          .8125      .021                    .770, .8550
B2          -1.564     .288                    -2.129, -.999
B3          2.252      .292                    1.677, 2.826
B4          1.268      .013                    1.242, 1.294
Note: No T-statistic is produced. The confidence intervals give the range within which we can say (with 95%
confidence) that the coefficient lies. To obtain a "rough" T-estimate, divide the estimate by the "Asymptotic
Std. Error." If the absolute value of the result is greater than 1.96 (1.64), then the coefficient is significant at the
95% (90%) level.
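For example, the rough T for B1 is .8125/.021, which is approximately 38.7, far above 1.96, so B1 is significant at the 95% level.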
The coefficients on the first three variables can be interpreted as they would be in a linear
regression. The coefficient on the last variable cannot be interpreted as a slope. The slope
depends on the value of the estimated coefficient and the value of the variable at that point.
Note: The R-square is a pseudo R-square, but can be roughly interpreted in the same way as in
linear regression.
Ch 10. COMPARATIVE ANALYSIS
Comparative analysis is used to compare the results of any and all statistical and graphical
analysis across sub-groups of the data defined by the categories of dummy or categorical
variable(s). Comparisons always provide insight for social science statistical analysis because
many of the variables that define the units of interest to the analyst (e.g. - gender bias, racial
discrimination, location, and milieu, etc.) are represented by categorical or dummy variables.
For example, you can compare the regression coefficients on age, education, and sector of a
regression of wages for male and female respondents. You can also compare descriptives like
mean wage, median wage, etc. for the same subgroups.
This is an extremely powerful procedure for extending all your previous analysis (see chapters
3-9). You can see it as an additional stage in your project - procedures to be completed after
completing the desired statistical and econometric analysis (discussed in chapters 3-9), but before writing the final report on the results of the analysis133.
In section 10.1, we first show the use of one categorical variable (gender) as the criterion for
comparative analysis. We then show how to conduct comparisons across groups formed by the
interaction of three variables.
133 The note below may help or may confuse you, depending on your level of understanding of statistics, the process of estimation, and this book. Nevertheless, we feel that the note may be of real use for some of you, so read on.
After you have conducted the procedures in chapters 3-9, you should interpret and link the different results. Look at the regression results. What interesting or surprising result have they shown? Does that give rise to new questions? Link these results to those obtained from chapters 3-6. Is a convincing, comprehensive, and logically consistent story emerging? To test this, see if you can verbally, without using numbers, describe the results from the initial statistical analysis (chapters 3-6), link them to the expected regression results (including the breakdown of classical assumptions), and then trace them back to the initial statistics.
After the econometric procedures, you can do several things:
Manipulate the data so that you can filter in (using SELECT CASE as shown in section 1.7) only those cases that are required for the analysis inspired by the above insights and queries.
Create variables from continuous variables. Then use these variables in a re-specified regression model, and in other graphical and statistical procedures.
Make more detailed custom tables (chapter 6) and graphs (chapters 3-5) to better understand the data.
In this chapter, we never use the third option, "Organize output by groups." You should
experiment with it. It is similar to the second option with one notable difference - the output
produced is arranged differently.
Now, any procedure you conduct will be split in two - one for males and one for females.
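The split can also be switched on and off in syntax; a sketch using the gender example (LAYERED BY corresponds to the "Compare groups" choice in the dialog, and SPLIT FILE OFF corresponds to "Analyze all cases"):

    SORT CASES BY gender.
    SPLIT FILE LAYERED BY gender.
    * ... run any procedures here; output is produced separately for males and females.
    SPLIT FILE OFF.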
The fit of the model is better for females - the adjusted R-square is higher for them. We checked
that the F was significant for all models but we did not reproduce the table here in order to save
space.
Model Summary a
GENDER   Model   Variables Entered                                            R Square   Adjusted R Square   Std. Error of the Estimate
Male     1       Whether Public Sector Employee, Work Experience, EDUCATION   .394       .392                .6464
Female   1       Whether Public Sector Employee, Work Experience, EDUCATION   .514       .511                .7037
a. Dependent Variable: LNWAGE
Note: In chapters 7 and 8 we stressed the importance of mis-specification and omitted variable
bias. Is the fact that we are splitting the data (and thus the regression) into 2 categories leaving us
vulnerable to omitted variable bias? Not really.
Reason: Now the population being sampled is "only males" or "only females," so gender is not a
valid explanatory agent in any regression using these samples as the parent population and
thereby the valid model has changed. Of course, some statisticians may frown at such a
simplistic rationale. Usually such comparative analysis is acceptable and often valued in the
workplace.
Coefficients b
(Unstandardized Coefficients)
GENDER   SECTOR    INCOME (QCL_1)   VARIABLE     B           Std. Error   t        Sig.
Male     Private   High             (Constant)   3.53        .20          17.43    .00
Male     Private   High             Work_Ex      .00         .01          -.34     .74
Male     Private   High             EDUCATION    .01         .01          .65      .53
Male     Private   Low              (Constant)   1.33        .04          33.58    .00
Male     Private   Low              Work_Ex      .01         .00          2.43     .02
Male     Private   Low              EDUCATION    .01         .01          1.52     .13
Male     Private   Mid              (Constant)   2.45        .07          33.87    .00
Male     Private   Mid              Work_Ex      .01         .00          2.21     .03
Male     Private   Mid              EDUCATION    .02         .01          3.37     .00
Male     Public    High             (Constant)   3.16        .16          20.17    .00
Male     Public    High             Work_Ex      .00         .00          .92      .36
Male     Public    High             EDUCATION    .02         .01          2.75     .01
Male     Public    Low              (Constant)   1.33        .10          13.27    .00
Male     Public    Low              Work_Ex      .01         .01          2.57     .01
Male     Public    Low              EDUCATION    .04         .01          4.03     .00
Male     Public    Mid              (Constant)   2.42        .05          46.93    .00
Male     Public    Mid              Work_Ex      4.915E-03   .002         3.133    .002
Male     Public    Mid              EDUCATION    1.944E-02   .003         6.433    .000
Female   Private   High             (Constant)   2.050       .068         30.301   .000
Female   Private   High             Work_Ex      2.816E-03   .004         .629     .537
Female   Private   High             EDUCATION    2.587E-02   .005         4.786    .000
Female   Private   Low              (Constant)   2.931       .459         6.392    .000
Female   Private   Low              Work_Ex      1.007E-02   .009         1.073    .319
Female   Private   Low              EDUCATION    8.557E-03   .026         .326     .754
Female   Private   Mid              (Constant)   .704        .080         8.820    .000
Female   Private   Mid              Work_Ex      .01         .01          1.29     .20
Female   Private   Mid              EDUCATION    .05         .01          3.54     .00
Female   Public    High             (Constant)   1.86        .12          15.10    .00
Female   Public    High             Work_Ex      .01         .00          3.09     .00
Female   Public    High             EDUCATION    .04         .01          4.75     .00
Female   Public    Low              (Constant)   1.88        .50          3.74     .00
Female   Public    Low              Work_Ex      .02         .01          2.19     .05
Female   Public    Low              EDUCATION    .06         .02          2.30     .04
Female   Public    Mid              (Constant)   .45         .28          1.57     .13
Female   Public    Mid              Work_Ex      .03         .03          1.18     .24
Female   Public    Mid              EDUCATION    .05         .03          1.90     .07
b. Dependent Variable: LNWAGE
To remove the comparative analysis, go to DATA/SPLIT FILE and choose the option Analyze
all cases. Click on the button OK.
Note: Your boss/professor may never realize the ease and power of using SPLIT FILE. Here is
your chance to really impress them with "your" efficiency and speed.
Chapter 11: Formatting output
The professional world demands well-presented tables and crisp, accurate charts. Apart from
aesthetic appeal, good formatting has one more important function - it ensures that the output
shows the relevant results clearly, with no superfluous text, data, or markers.
The navigator window shows all the output tables and charts (see next picture).
Using the scroll bar on the extreme right, scroll to the table you want to format or click on the
table's name in the left half of the window. If you see a red arrow next to the table, then you
have been successful in choosing the table.
Click on the chosen table with the right mouse button; several options will open. Select the last one - "SPSS Pivot Table Object." Within that, choose the option "Open" (see picture on next page).
A new window will open (see picture below). This window has only one item - the table you
chose. Now you can edit/format this table.
Click on the "maximize" button to fill the screen with this window. See the arrowhead at the
top right of the next picture to locate the "maximize" button.
This window has three menus you did not see in the earlier chapters: INSERT, PIVOT, and
FORMAT. These menus are used for formatting tables.
Model Summary

Model   Variables Entered   Variables Removed   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       EDUCATI             .                   .701   .492       .491                5.2242
        ON,
        GENDER,
        WORK_E
        X,
        PUB_SEC
Notice that the column "Variables Entered" is too narrow, so the variable names break mid-word. This can be easily corrected. Using the mouse, go to the column dividing line. Now you can manually change the width of the column by dragging with the left mouse button. The next table shows the effect of doing this - the column "Entered" has been widened.
Model Summary

Model   Variables Entered                      Variables Removed   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       EDUCATION, GENDER, WORK_EX, PUB_SEC   .                   .701   .492       .491                5.2242
Inside the Table/Chart Editing Window, click on the cell "R." Then choose EDIT/SELECT. Select "Data Cells and Label." Press the keyboard key "Delete."
Model Summary

                       Model
                       1
Variables   Entered    EDUCA
                       TION,
                       GENDE
                       R,
                       WORK_
                       EX,
                       PUB_SE
                       C
            Removed    .
R Square               .492
To do this, choose PIVOT/TURN ROWS INTO COLUMNS. Compare the table above to the
one before it. Notice that the rows in the previous table have become columns in this example
and the columns in the previous table have become rows in this example.
Model Summary

                       Model
                       1
Variables   Entered    EDUCATION,
                       GENDER,
                       WORK_EX,
                       PUB_SEC
            Removed    .
R Square               .492
Autofit is a quick method to ensure that the row heights and column widths are adequate to
display the text or data in each individual cell.
Model Summary

Variables   Entered    EDUCATION,
                       GENDER,
                       WORK_EX,
                       PUB_SEC
            Removed    .
R Square               .492
Ch 11. Section 1.g. Editing (the data or text in) specific cells
You may want to edit specific cells. To edit a cell, select it by clicking on it with the left mouse button, then double-click it. As you can see in the next picture, the cell contents are then highlighted, implying that you are in edit mode. You are not restricted to editing cells - you can use a similar method to edit the title, footnotes, etc.
Now, whatever you type will replace the old text or number in the cell. Type the new text "Explanatory Variables" over the highlighted text "Variables." The table now looks like:
Model Summary

Explanatory   Entered    EDUCATION,
Variables                GENDER,
                         WORK_EX,
                         PUB_SEC
              Removed    .
R Square                 .492
Go to INSERT/FOOTNOTE. With the footnote highlighted, double-click with the left mouse button and type in the desired text.
Model Summary

Explanatory   Entered    EDUCATION,
Variables                GENDER,
                         WORK_EX,
                         PUB_SEC
              Removed    .
R Square                 .492
Scroll through the "TableLooks." A sample of the TableLook you have highlighted will be shown in the area "Sample." Select a look that you prefer. We have chosen the look "AVANT-GARDE." The table will be displayed using that style [134]:

[134] To learn how to change the default look, see section 15.1.
(The table "Model Summary" as displayed in the newly chosen TableLook.)
Notice how so many of the formatting features have changed with this one procedure. Compare
this table with that in the previous example - the fonts, borders, shading, etc. have changed.
If you want to set a default look for all tables produced by SPSS, go to EDIT/OPTIONS within the main data window and click on the tab "Pivot Table." You will see the same list of TableLooks as above. Choose the look you desire. Click on "Apply" and then on "OK." See also: chapter 15.
To do so, go to FORMAT/TABLE PROPERTIES. Click on Apply. The footnote typed earlier now shows under the table: "1. Dependent variable is wage."
[135] Formatting reduces the ever-present possibility of confusion. A common problem is the presence of outliers. These tend to expand massively the scale of an axis, thereby flattening out any trends.
Using the scroll bar on the extreme right, scroll to the chart you want to format or click on the
chart's name in the left half of the window. If you see a red arrow next to the chart, then you
have been successful in choosing the chart.
To edit/format the chart, click on it with the right mouse button and choose the option "SPSS Chart Object/Open," or double-click on the chart with the left mouse button.
A new window called the "Chart Editor" will open. Maximize it by clicking on the maximize
button on the top right corner.
Notice that there are four new menus. The menus are:
1. GALLERY: this allows you to change the chart type. You can make a bar chart into a line,
area, or pie chart. You can change some of the data into lines and some into bars, etc. So,
if you made a bar chart and feel that a line chart would have been better, you can make the
change right here. If you have too many variables in the chart, then you might want to mix
the chart types (bar, line, and area). On the next few pages, we illustrate the use of this
menu.
2. CHART: using this, you can change the broad features that define a chart. These include
the frames around and in the chart and titles, sub-titles, footnotes, legends, etc.
3. SERIES: this allows you to remove certain series (variables) from a chart.
4. FORMAT: using this, you can format the fonts of text in labels, titles, or footnotes, format
an axis, rescale an axis, swap the X and Y axes, and change the colors, patterns, and
markers on data lines/bars/areas/pie slices, etc.
(Bar chart: Mean WAGE, on a scale of 0 to 40, by EDUCATION (0 to 23), with separate bars for Male and Female as shown in the GENDER legend.)
You can use the GALLERY menu to convert the above chart into a different type of chart.
[136] If you chose GALLERY/LINE, a line graph would be the result of the conversion from a bar graph.
Click on "Replace."
[137] This feature is rarely used.
Click on "Replace."
Each slice represents the mean wage for males with a specific level of educational attainment.
For example, the slice 6 shows the mean wage for males who had an educational attainment
of six years (primary schooling).
[138] The other series must therefore be hidden.
(Pie chart: one slice for each level of EDUCATION, labeled 0 through 23.)
Ch 11. Section 2.f. Using the SERIES menu: Changing the series
that are displayed
To show the use of this menu, we go back to our "Mixed" chart (shown below), which we made
in section 11.2.d.
(Mixed chart: Mean WAGE, on a scale of 10 to 70, by EDUCATION, with a GENDER legend.)
Only the series you selected remain displayed.

(Chart: the remaining series plotted against EDUCATION.)
Then go to FORMAT/PATTERNS. Click on "Apply."
(The chart redrawn with the new fill pattern; Mean WAGE on a scale of 10 to 70, with the GENDER legend.)
Ch 11. Section 2.h. Changing the color of bars, lines, areas, etc.
For the series/point whose color you
want to change, click on a
bar/area/line/slice.
Click on "Apply."
(The chart redrawn with the new colors.)
Now we want to change the style and width of the lines and their markers.
Click on "Apply."
(The chart redrawn with the new line styles, widths, and markers.)
Ch 11. Section 2.j. Changing the format of the text in labels, titles,
or legends
(Chart with legend entries such as "Female Mean WAGE"; the legend has been reformatted.)
To do so, go to FORMAT/SWAP AXIS. The above chart flips its axis and changes into the
following chart.
[139] Use this if you made the chart incorrectly, or if you want to turn a horizontal bar chart into a vertical one.
(The bar chart after swapping the axes: EDUCATION now runs down the Y-axis and Mean WAGE, 0 to 70, along the X-axis, with legend entries "Female Mean WAGE" and "Male Mean WAGE." The annotations point out the chart's Outer Frame and Inner Frame.)
The next chart has the two title lines and the subtitle line. Compare this to the previous chart
that contained no titles or subtitles.
(The same chart, now showing two title lines and a subtitle.)
The two footnotes are inserted at the bottom. Because we asked for "Right-Justification," the
footnotes are aligned to the right side.
[140] "Justification" is the same as "Alignment."
The labels of the legend entries have been changed. "WAGE" has been replaced by "wage."
(The bar chart redrawn with the relabeled legend entries "Female Mean Wage" and "Male Mean Wage.")

(Boxplot of WAGE by GENDER; N = 1613 males and 403 females. The many outliers, each labeled with its case number, stretch the Y-axis from -100 to 300 and flatten the boxes.)
(Chart with title "Daily Wage," subtitle "Based on 1990 household survey," and footnote "PPP used.")
We advise you to experiment with axis formatting. It is the most important topic in this chapter.
Using the proper scale, increments, and labels for the axis is essential for accurately interpreting
a graph.
(The chart after formatting the X-axis: increments of 5 from 0 to 60, with the axis title "Daily Wage.")
(The axis labels reformatted as currency, e.g. "$600.")
Chapter 12: Reading ASCII text data
The most frustrating stage of a research project should not be the reading of the data. The
"ASCII Text" format (often called simply "ASCII" or "Text" format) can make this stage
extremely frustrating and time consuming. Though we teach you how to read ASCII text data
into SPSS, please keep the following issues in mind:
If the suppliers of data can provide you with data in a simpler format (for instance, SPSS,
dbase, Excel), then ask them to do so!
The Windows point-and-click method for reading in ASCII text data is tedious and slow
and can be very painful if you make mistakes or have to re-do the process by adding more
variables. Using programming code is a much better method to achieve the same goal. In
fact, programming is also better for defining variables (see section 1.2) and some other
procedures. However, because this book is for beginners, we avoid going into the
intricacies of programming techniques and code.
An excellent option for quickly converting data from different file formats into SPSS format
is through the use of data conversion software like STATTRANSFER (web site:
www.stattransfer.com) or DBMSCOPY (web site: www.dbmscopy.com).
SPSS 9.0 has an easier procedure for reading ASCII text data. We do not discuss the procedure
because once you learn the procedures shown here, the procedures in SPSS 9.0 will be easy to
pick up. If we get feedback on the need to show how to read ASCII text data in SPSS 9 or 10,
we will place the instructions on the web site www.spss.org.
In sections 12.1.a and 12.1.b we explain what ASCII Text data is and the differences that exist
among different ASCII Text data formats.
Then, in sections 12.2-12.4, we describe in detail the steps involved in reading ASCII text data.
When ASCII data are being entered, the data-entry person types the variables next to each other,
separating data from different variables by a standard character [141] called a "delimiter," or by column positions. We now delve deeper into understanding the two broad types of ASCII text formats. These are "fixed field" (also called "fixed column") and "free field" (also called "delimited").

[141] The standard "delimiters" are tab, comma, or space.
Assume there are three variables - ID, First Name, and Last Name. The case to be entered is "ID = 9812348, First Name = VIJAY, Last Name = GUPTA." The code book says that the variable "ID" is to be entered in positions 1 to 9, "First Name" in 10 to 14, and "Last Name" in 15 to 21.
When you read the data into SPSS, you must provide the program with information on the
positions of the variables. That is, reading our sample file would involve, at the minimum, the
provision of the following information: ID=1 to 9, First Name =10 to 14, and Last Name
=15 to 21. This is shown in the next text box.
(Text box: a ruler of column positions 1 through 21, showing ID in columns 1-9, First Name in columns 10-14, and Last Name in columns 15-21.)
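In Syntax, reading such a fixed-column file takes one DATA LIST command. A minimal sketch - the file name is illustrative, and the string variables are given an (A) format:

DATA LIST FILE='C:\data\sample.txt' FIXED
  /id 1-9 fname 10-14 (A) lname 15-21 (A).
EXECUTE.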
In SPSS versions 6.x- 7.5, you need not specify the type of delimiter (space, tab, or comma). In
version 8.0, you are given the option of choosing the delimiter. If this line confuses you, please
re-read sections 12.1.a and 12.1.b.
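For freefield (delimited) data, only the variable names - and, for strings, a format with a width - need be supplied. A sketch, again with an illustrative file name:

DATA LIST FILE='C:\data\sample.txt' FREE
  /id fname (A5) lname (A7).
EXECUTE.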
Note: The above process may not work. If you feel that there are problems, then use the
procedure shown in section 12.3.
[142] So, request that the data provider supply the data in tab-delimited format.

[143] ASCII file extensions can be misleading. ASCII data files typically have the extension ".dat," ".prn," ".csv," ".asc," or ".txt," but an ASCII file may have a different extension. Find out the exact name of the file and its extension from the data supplier.
Click on "Add."
It is your job to provide a name for each variable, its data type, and the exact location (start
location/column and end location/column) where this variable is stored in the ASCII text file [144] (see section 12.1.a).

[144] Remember that with the other kind of ASCII text format (freefield or delimited), all you had to enter were variable names (and, if you chose, the data type). See section 12.3.
From the code book [145], obtain the location of each variable. This may be presented as either (a) "start" and "end" column, or (b) "start" column and length. In the latter case, obtain the end column by using the formula:

End column = Start column + Length - 1

[145] The code book may be another file on the CD-ROM that contains the ASCII text file.
[146] We have skipped the steps for fam_mem. These steps are the same as for the other variables.
Chapter 13: Merging - Adding Cases and Variables
Merging two files is a difficult process and one that is prone to error. These errors can severely
affect your analysis, with an inaccurate merge possibly creating a data set that is radically
different from that which your project requires.
For this you will require a pair of key variables with which to match observations across the files. In our example, the variable fam_id is to be matched with the variable lnfp for the merge. Those were the names given to the social security number by the survey authorities.
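The merge can also be written in Syntax. A sketch, with illustrative file names, assuming both files are SPSS-format files; the key in the external file is renamed and both files are sorted on the key before matching:

* Prepare the external file: rename its key and sort on it.
GET FILE='C:\data\file2.sav'.
RENAME VARIABLES (lnfp = fam_id).
SORT CASES BY fam_id.
SAVE OUTFILE='C:\data\file2srt.sav'.
* Open the working file, sort it, and merge.
GET FILE='C:\data\file1.sav'.
SORT CASES BY fam_id.
MATCH FILES /FILE=* /FILE='C:\data\file2srt.sav' /BY fam_id.
EXECUTE.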
[147] So, if fam_id = 121245 has an observation in the original file only, it will still be included - the observations for the variables from the other file will be empty in the new merged data file.

[148] The data set currently open and displayed in SPSS is the "working data set." The file (not opened) from which you want to add variables is the "external file."
In contrast, by clicking on "Working Data File is keyed table," the opposite can be done: a one-way merge in which only those cases that are in the external data file are picked up from the working data file.
The external file from which data must be added (not open but available on a drive).
fam_id educ gender
999999 18 0
555555 16 1
Note that the respondent "555555" is in both files but "111111" is only in the original file and
"999999" is only in the external file.
1. A two-way merge using fam_id as the key variable will merge all the data (all three
fam_id cases are included):
2. A one-way merge using the external data file as the keyed file will include only those observations that occur in the non-keyed file, which here is the working data file. Those have the fam_id values "111111" and "555555." Fam_id "999999" is excluded.
3. A one-way merge using the working data file as the keyed file will include only those observations that occur in the non-keyed file, which here is the external data file. Those have the fam_id values "999999" and "555555." Fam_id "111111" is excluded.
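In Syntax, the three merges differ only in whether a file is declared with /FILE or /TABLE (the keyed lookup table). A sketch with an illustrative file name, assuming both files are already sorted by fam_id; these are alternatives, so run only one:

* 1. Two-way merge: all cases from both files are kept.
MATCH FILES /FILE=* /FILE='C:\data\external.sav' /BY fam_id.
* 2. One-way merge, external file as the keyed table:
*    only cases of the working data file are kept.
MATCH FILES /FILE=* /TABLE='C:\data\external.sav' /BY fam_id.
* 3. One-way merge, working data file as the keyed table:
*    only cases of the external file are kept.
MATCH FILES /TABLE=* /FILE='C:\data\external.sav' /BY fam_id.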
Chapter 14: Non-Parametric Testing
In chapters 3-10 we used procedures that (for the most part) allowed for powerful hypothesis
testing. We used tests like the Z, T, F, and Chi-square. The T and F were used repeatedly. In
essence, the F was used to determine whether the entire "model" (e.g. - a regression as a whole)
was statistically significant and therefore trustworthy. The T was used to test whether specific
coefficients/parameters could be said to be equal to a hypothesized number (usually the number
zero) in a manner that was statistically reliable or significant. For maximum likelihood methods
(like the Logit), the Chi-Square, Wald, and other statistics were used. The use of these tests allowed for the drawing of conclusions from statistical results.
What is important to remember is that these tests all assume that the underlying distribution of
variables (and/or estimated variables like the residuals in a regression) follow some
"parametric" distribution - the usual assumption is that the variables are distributed as a
"normal" distribution. We placed a great emphasis on checking whether a variable was
distributed normally (see section 3.2). Unfortunately, most researchers fail to acknowledge the
need to check for this assumption.
We leave the decision of how much importance to give the assumption of a "parametric"
distribution (whether normal or some other distribution) to you and your professor/boss.
However, if you feel that the assumption is not being met and you want to be honest in your research [149], you should avoid using "parametric" methodologies and use "non-parametric" methodologies instead. The latter do not assume that the variables have any specific distributional properties. Unfortunately, non-parametric tests are usually less powerful than parametric tests.
We have already shown the use of non-parametric tests. These have been placed in the sections
appropriate for them:
3.2.e (Kolmogorov-Smirnov),
4.3.c (Related Samples Test for differences in the distributions of two or more variables),
5.3.b (Spearman's Correlation), and
5.5.d (Independent Samples Test for independence of the distributions of sub-sets of a
continuous variable defined by categories of another variable)
In this chapter we show some more non-parametric tests. Section 14.1 teaches the Binomial test and section 14.2 teaches the Chi-Square test. These test whether the distribution of the proportions of the values in a variable conforms to a hypothesized distribution of proportions for these values. Section 14.3 teaches the Runs test, which checks whether a variable is distributed randomly.
Let's assume we have a variable whose distribution is binomial. That is, the variable can take
on only one of two possible values, X and Z.
[149] Almost all of the methods used in this book are parametric - the T-tests, ANOVA, regression, Logit, etc. Note that the Logit presumes that the model fits a Logistic distribution, not a Normal distribution.
The standard example is a coin toss - the outcomes are distributed as binomial. There are two and only two possible outcomes (heads or tails), and if one occurs on a toss then the other cannot also occur on the same toss. The probability of a "tails" outcome and the probability of a "heads" outcome are the relevant parameters of the distribution [150]. Once these are known, you can calculate the mean, standard deviation, etc. Check your textbook for details.
A variable like gender is distributed binomially [151]. We want to test the parameters of the distribution - the probability of the variable gender taking on the value 0 (female) versus the probability of it taking on the value 1 (male).
Look at the box "Test Proportion." We have chosen the default of 0.50. We are asking for a test that checks whether the "Test Proportion" of .5 equals the probability of gender being equal to 0 (female) for any one observation. As the probabilities have to add up to 1, it follows that we are testing whether the probability of gender being equal to 1 (male) for any one observation = 1 - 0.50 = 0.50.
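The same test can be run from Syntax. A minimal sketch, assuming gender is coded 0/1:

NPAR TESTS /BINOMIAL (.50) = gender.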
Let's assume you have a variable that is categorical or ranked ordinal. You want to test whether
the relative frequencies of the values of the variable are similar to a hypothesized distribution of
relative frequencies (or you can imagine you are testing observed proportions versus
hypothesized proportions). You do not know what the distribution type is, nor do you care.
All you are interested in is testing whether the measured relative frequencies/proportions are
similar to the expected relative frequencies/proportions.
For example, assume you want to check whether the proportions of all values of education
(measured in terms of years of schooling) are the same. A histogram of this hypothesized
distribution would be a perfect rectangle.
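In Syntax, the test of equal proportions is a single command - a sketch, with an illustrative variable name:

NPAR TESTS /CHISQUARE = educ /EXPECTED = EQUAL.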
The previous examples tested a very simple hypothesis - that all the frequencies are equal. This example shows the use of a more complex and realistic hypothesis.
Note: This is the way you will most often be using the test.
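In Syntax, the hypothesized relative frequencies are listed on /EXPECTED, one value per category in ascending order of the values of the variable. A sketch - the numbers below are purely illustrative and must match the number of categories:

NPAR TESTS /CHISQUARE = educ
  /EXPECTED = 5 10 20 30 20 10 5.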
An excellent application of this test is to determine whether the residuals from a regression are
distributed randomly or not. If not, then a classical assumption of linear regression has been
violated. See section 7.2 also.
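A sketch of the corresponding Syntax, using each of the three cut points in turn:

NPAR TESTS /RUNS (MEDIAN) = wage.
NPAR TESTS /RUNS (MEAN) = wage.
NPAR TESTS /RUNS (MODE) = wage.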
Interpretation: The "Test Value" in each output table corresponds to the statistic/value used as the Cut Point. The median = 5.95, mean = 9.04, and mode = 3.75.
Look at the row "Asymp. Sig. (2-tailed)." All the tests show that the null hypothesis of randomness can be rejected. We can therefore say that Runs Tests using all three measures of central tendency (median, mean, and mode) indicate that wage does not come from a random sample.
Chapter 15: Setting system defaults
In section 15.1 we show how to set general system options. The most important settings are
those for the default format of output tables (called "Pivot Tables" in SPSS) and the labels on
output.
Section 15.2 shows how to change the manner in which data/text is shown on screen.
We would suggest choosing the options as shown above. You may want to change:
The Recently Used Files List to a number such as 8 or 10. When you open the menu
FILE, the files you used recently are shown at the bottom. When you choose to see 8 files,
then the last 8 files will be shown. You can go to any of those files by clicking on them.
Special Workspace Memory Limit may be increased by a factor of 1.5 to 2 if you find
that SPSS is crashing often. It's always a good idea to ask your system administrator before
making a change.
Click on the tab "Pivot Tables." (See next picture.) The box on the left shows table formatting styles called "TableLooks." Each item on the list corresponds to one look. The look defines several formatting features:
Font type, size, style (bold, italic, color, etc.)
Cell shadings
Border width and type
Other features
When you click on the name of a look on the left side, a sample appears on the right side.
Choose the look you prefer and press Apply and OK. See section 11.1 for more on table
formatting and changing the "look" of individual tables.
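The default look can also be set with the SET command in Syntax. A sketch - the path to the TableLook (.tlo) file is illustrative and may differ on your installation and SPSS version:

SET TLOOK = 'C:\Program Files\SPSS\Looks\avantgarde.tlo'.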
Click on the tab Charts and choose the settings you like. (See next picture). Experiment until
you get the right combination of font, frames, grid lines, etc. When you are finished, press
Apply and OK.
The most important option is the choice of labels to depict variables and values of categorical
variables in output tables and charts. Click on Output Labels and choose Labels for all the
options. Press Apply and OK. (See next picture).
Finally, click on the tab Navigator. This is the Output Window, the window that includes all
the output tables and charts. Click on Item. You will see 7 items. (See next picture).
If they are not all accompanied by the option "Shown," then simply:
Choose the relevant item (e.g. - "Warnings") from the item list.
Choose the option "Shown" in the area "Contents are initially." (See next picture.)
Ch 15. Section 2 Choosing the default view of the data and screen
Choose the menu option VIEW.
You can also choose the font in which data are shown on screen. The font you choose does not
affect the font in output tables and charts.
Chapter 16: Reading data from database formats
The data are stored in the same structure. This structure essentially consists of three parts -
the database, individual tables within the database, and individual fields within each table.
The best intuitive analogy is an Excel workbook - the entire file (also called workbook) is
analogous to the database file, each sheet within the workbook is analogous to a table, and
each column to a field. For this reason, Excel can be treated as a database if the data are
stored strictly in columns.
Database1
    Table1 - Table1.Field1, Table1.Field2, Table1.Field3, Table1.Field4
    Table2
    Table3
A common programming language (called SQL) can be used to manipulate data and run
procedures in all these programs. For example, in Excel, look at the option DATA/GET
EXTERNAL DATA.
For the purpose of learning how to read data into SPSS, you need not learn the details about
database structures or language. The important inference from the two points above is that,
irrespective of the source application, the commonality of data storage features permits one
process to be used for accessing data from any of the applications. If you learn the process for
one application, you can do it for the others. We provide an example using Access.
Note: In SPSS versions 9 and 10 you will see some more features for reading data. You can
ignore them; the procedures shown in this book should be sufficient.
[153] What about Oracle, SQL Server, etc.? The reason why these five options are shown is that the system on the computer we worked on had drivers for these five formats. You can buy (and maybe even download for free from web sites like cnet.com) and install drivers for other applications. If you are curious, look at the option "ODBC Drivers" under your computer's Control Panel. The easier way would be to ask your IT guru to do the install.
Chapter 17: Time Series Analysis
A typical Time Series is US Income, Investment, and Consumption from 1970-98. The data are
usually arranged by ascending time sequence. Year, quarter (four months), month, etc. may
define a time period. The reasons for using ARIMA (and not a simple OLS Linear Regression
model) when any of the regression variables is a time series are:
The fact that the value of a variable in a period (e.g. - 1985) is typically related to lagged (or previous) values of the same variable [154]. In such a scenario, the lagged value(s) of the dependent variable can function as independent variable(s) [155]. Omitting them may cause an Omitted Variable Bias. The AR in ARIMA refers to the specification of this Auto-Regressive component. Section 17.2 shows an example, which is reproduced below.
As the graph below depicts, the value for any single year is a function of the value for the
previous year(s) and some increment thereof.
(Sequence chart: Gross Domestic Product (in real 1995 prices, $), on a scale of 0 to 300, rising over YEAR, 1970-1998.)
The value at any period in a time series is related to its values in previous time periods. From general knowledge, you would know that the value of a variable such as national income (GDP), even when adjusted for inflation, has been increasing over time. This means that for any sub-set defined by a period (e.g. - 1970-84, 1985-96), the attributes of the variable (such as its mean) are different - that is, the variable is non-stationary [156].

[154] For example, US Income in 1985 is related to the levels in 1980, 81, 82, 83. Income in 1956 is related to the incomes in previous years.

[155] For example: GDPt = a + b*GDPt-1 + p*GDPt-2 + r*GDPt-3 + c*Invt + more.
(Sequence chart of GDP, reproduced from section 17.2, with YEAR labels 1970 through 1998 along the X-axis.)
The question of lagged influence also arises across variables. For example, investment in 1980, 81, 82, 83, 84, and 85 may influence the level of income (GDP) in 1985. The cross-correlation function, shown in section 17.3, helps to determine whether any lagged values of an independent variable must be used in the regression. Hence, we may end up using three variables for investment - this period's investment, last period's investment, and investment from two periods prior [157].
The presence of a Moving Average relation across the residuals at each time period. Often, the residuals from year T are a function of T-1, T-2, etc. A detailed description of Moving Average processes is beyond the scope of this book.

[156] Stationarity implies random, and non-stationarity the opposite. If a variable were truly random, then its value in 1985 should not be dependent on its own historical values. Essentially, time series data are in conflict with the classical assumptions primarily because each variable that is non-stationary is not obeying a key classical assumption: each variable is distributed randomly.

[157] What about collinearity between these? Would that not cause a problem in the regression? The answers are:
- Rarely is more than one transformation used in the same model.
- Once other transformations have taken place, there may be no such collinearity.
- In any case, collinearity is a lesser problem than mis-specification.
- Lastly, SPSS uses a Maximum Likelihood Estimation method (and not Linear Regression) to estimate the ARIMA.
Autocorrelation between the residuals. Section 17.6 shows a method to correct for first-order autocorrelation. For higher-order autocorrelation, consult your textbooks for methods of detection and correction.
Regarding Unit Roots, Non-Stationarity, Cointegration, the DF Test, the PACF, ARIMA, and other complex tests: could this be much ado about nothing? A cynical view of Time Series analysis would suggest as much. In practice, most macroeconomists don't even test for non-stationarity. They simply transform everything into differenced forms, maybe using logs, and run a simple OLS! From our experience, what you will learn in this chapter should suffice for most non-Ph.D. Time Series analysis.
Graphical analysis is essential for time series. The first graph one should obtain is the pattern of
variables across time. Essentially, this involves a multiple-line graph with time on the X-axis.
Section 17.1 shows how to make Sequence charts and makes simple inferences from the charts
as to the implications for a linear regression model.
Section 17.2 tests for non-stationarity using the Partial Autocorrelation Function (PACF) charts.
SPSS does not conduct formal tests for Unit Roots like the Dickey Fuller test but, in our
experience, the PACF is usually sufficient for testing for non-stationarity and Unit Roots. We
also show the ACF (Autocorrelation function). Together, the PACF and ACF provide an
indication of the integration-order (differencing required to make a variable stationary) and the
Moving Average. If a variable is non-stationary, it cannot be used in a regression. Instead, a stationary transformation of the variable must be used (if this is unclear, wait until the end of section 17.2).
Section 17.3 shows how to determine whether any lagged values of an independent variable must
be used in the regression. The method used is the cross-correlation function (CCF).
After testing for non-stationarity, one may have to create new variables for use in a regression.
The PACF will tell us about the type of transformation required. Section 17.4 shows how to
create these new transformed variables.
After creating the new variables, you are ready to run a regression on the time series data. The generic method for such regressions is called ARIMA, for Auto-Regressive Integrated Moving Average. It allows the incorporation of an autoregressive component (i.e. - the lagged value of the dependent variable as an independent variable), differencing (for overcoming the obstacle of non-stationarity), and a moving average correction. Section 17.5 shows an example of a simple ARIMA model.
Even after correcting for non-stationarity in each variable, the model as a whole may still suffer
from the problem of autocorrelation among residuals. Section 17.6 shows a procedure that allows
for automatic correction of first-order autocorrelation and also allows for incorporation of an
autoregressive component.
www.vgupta.com
Chapter 18: Programming without programming (Syntax and Script) 18-4
This autocorrelation is different from the autocorrelation in the PACF (section 17.2). There, the autocorrelation being measured is for individual variables - the relation of a variable's value at time "T" to its previous values.
(Sequence chart: Gross Domestic Product (in real 1995 prices, $) over YEAR, 1970-98.)

(Sequence chart: Consumption (in real 1995 prices, $), rising from about 60 to 220 over YEAR, 1970-98.)
(Sequence charts over YEAR, 1970-98: one series on a 0-200 scale, and Investment (in real 1995 prices, $) on a scale of roughly 20 to 90.)
www.vgupta.com
Chapter 18: Programming without programming (Syntax and Script) 18-8
Transforming the variables into a log format by calculating the log of each variable.
The logs may reduce the problem because the log scale flattens out the more pronounced
patterns.
Example 1: Differencing
The first method ("differencing") is more effective than the second method ("logs"). We show
examples of each and the results.
[158] Note: Non-stationarity is similar to the Unit Root problem (this admittedly simplistic logic will be used by us in this book).
As the graph below shows, the first-differences have a zigzag pattern. This indicates a higher possibility of their being "random" as compared to the original "level" variables. Has the problem of non-stationarity gone? To answer this, we must use Partial Auto-Correlation Functions (see section 17.2).
(Sequence chart of the first-differenced series, on a scale of roughly 10 to 20, over YEAR, 1972-98. Transforms: difference (1).)
Example 2: Logs
(Sequence chart of the log-transformed series, on a scale of roughly 5.0 to 6.0, over YEAR, 1970-98. Transforms: natural log.)
[159] If the test is required, then SPSS can do it indirectly. You must create some new variables, run a regression, and use some special diagnostic tables to interpret the result. All that is beyond the scope of this book.
All three variables exhibit non-stationarity as at least one of the vertical bars is higher than the
horizontal line(s) that indicate the cut-off points for statistical significance. (See next chart to
see the bars and the line).
Furthermore, the non-stationarity is of the order "1" as only the first-lagged bar is significantly
higher than the cut-off line. So a first-differenced transformation will probably get rid of
the non-stationarity problem (as was hinted at in the previous section on sequence charts).
(PACF chart: partial autocorrelation coefficients for lag numbers 1 through 6.)
The numbers 1, 2, ..., 6 give the partial correlation between the value of the variable today (that is, at time "T") and its value at times "T-1," "T-2," ..., "T-6" respectively.
The first-lag partial auto-correlation is above the critical limit. This indicates the presence of
non-stationarity and suggests first-order differencing as the remedy.
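The PACF (and ACF) charts can also be requested in Syntax with the ACF command. A sketch, assuming the variable is named gdp:

* PACF of the level variable, 6 lags.
ACF VARIABLES = gdp /MXAUTO = 6 /PACF.
* PACF of the first-differenced variable.
ACF VARIABLES = gdp /DIFF = 1 /MXAUTO = 6 /PACF.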
The horizontal lines mark the 95% confidence interval. If a bar lies beyond this range, then the partial correlation coefficient is statistically significant.

(PACF chart with confidence limits, lag numbers 1 through 6.)
The first-lag partial auto-correlation is above the critical limit. This indicates the presence of
non-stationarity and suggests first-order differencing as the remedy.
(PACF chart with confidence limits, lag numbers 1 through 6.)
The first-lag partial auto-correlation is above the critical limit. This indicates the presence of
non-stationarity and suggests first-order differencing as the remedy.
Autoregression
The second interpretation of the GDP PACF is as follows: In running a regression using GDP (or
some transformation of it) as the dependent variable, include the 1-lag of the same transformation
of GDP as an independent variable.
This is called an ARIMA(1,0,0) model. ARIMA (1 lag used for autoregression, differencing
levels required=0, moving average correction=0)
What if the PACF showed that the first three lags were significant? Then the model would become an ARIMA(3,0,0) - the first three lags of the dependent variable would enter as regressors.
The ACF
Two of the autocorrelation function (ACF) charts are shown below. What is the difference
between the PACF and ACF? A simple intuitive explanation: the PACF for lag 3 shows the
correlation between the current period and 3 periods back, disregarding the influence of 1 and 2
www.vgupta.com
Chapter 18: Programming without programming (Syntax and Script) 18-15
periods back. The ACF for lag 3 shows the combined impact of lags 1, 2, and 3. In our experience, the PACF gives a clear indication of the presence of non-stationarity and the level of differencing required. The ACF (along with the PACF) is used to determine the Moving Average process. The Moving Average process is beyond the scope of this book.
(Two ACF charts with confidence limits, lag numbers 1 through 6.)
(PACF chart of the first-differenced series, lag numbers 1 through 6. Transforms: difference (1).)
Interpretation: None of the partial-autocorrelation coefficients are above the critical limit. This
indicates the absence of non-stationarity and strongly indicates the use of first-order differenced
transformations of this variable in any regression analysis.
(PACF chart of another first-differenced series, lag numbers 1 through 6. Transforms: difference (1).)
Interpretation: None of the partial-autocorrelation coefficients are above the critical limit. This
indicates the absence of non-stationarity and strongly indicates the use of first-order differenced
transformations of this variable in any regression analysis.
The problem of non-stationarity is gone, as none of the bars are higher than the critical limits (the horizontal lines).

(PACF chart of the first-differenced series, lag numbers 1 through 6. Transforms: difference (1).)
Interpretation: None of the partial-autocorrelation coefficients are above the critical limit. This
indicates the absence of non-stationarity and strongly indicates the use of first-order differenced
transformations of this variable in any regression analysis.
This is an ARIMA(1,1,0) model: ARIMA (1 lag used for autoregression, differencing levels required = 1, moving average correction = 0).
What if the PACF showed that second-order differencing was required? That would be an ARIMA(1,2,0) model: ARIMA (1 lag used for autoregression, differencing levels required = 2, moving average correction = 0).
Note: Each entity inside a bracket is the first difference as GDP at time t is being
differenced with GDP from time t-1, a 1-period difference.
The ACF
The autocorrelation function (ACF) charts for the first-differenced series are shown below. (Recall the difference between the PACF and the ACF: the PACF for lag 3 shows the correlation between the current period and 3 periods back, disregarding the influence of lags 1 and 2, while the ACF for lag 3 shows their combined impact.)
(ACF charts of the first-differenced series, lag numbers 1 through 16, with confidence limits. Transforms: difference (1).)
(PACF chart of the log-transformed series, lag numbers 1 through 6, with confidence limits. Transforms: natural log.)
The first-lag partial autocorrelation is above the critical limit. This indicates the presence of non-stationarity and argues against the use of logs as the remedy.
The question whose answer we seek is: should the model be
Yt = a + b*Yt-1 + c*Xt, or
Yt = a + b*Yt-1 + c*Xt-1, or
Yt = a + b*Yt-1 + c*Xt-2, or
Yt = a + b*Yt-1 + c1*Xt + c2*Xt-1 + c3*Xt-2?
Click on the button "Options." Choose the maximum number of lags for which you wish to check for cross-correlations. This number will depend on the number of observations (if the sample size is 30, choosing a lag of 29 is not useful) and on research, belief, and experience regarding the lags. For example, if investment flows take up to 4 years to have an impact on GDP, then you may choose a number greater than 4 but not too large. If the impact of consumption is felt in the current and following year, then a smaller number of lags can be chosen. This dialog has been filled completely. Return to the main dialog by clicking on the button "Continue."
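A sketch of the equivalent Syntax, assuming the variables are named gdp, cons, and inv and that 6 lags are checked on the first-differenced series:

CCF VARIABLES = gdp WITH cons inv
  /DIFF = 1 /MXCROSS = 6.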
(Two CCF charts: cross-correlations at lag numbers -6 through +6, with confidence limits.)
The only significant cross-correlation is at lag = 0, that is, at no lag; the bars at the other lags are not significantly different from zero, as they lie well below the confidence interval line.

(CCF chart, "GDP with INV," on the first-differenced series, lag numbers -6 through +6, with confidence limits. Transforms: difference (1).)
Now we are ready to create the new variables deemed important by the ACF, PACF, and CCF.
Once they have been created in 17.4, the ARIMA regression can be run (17.5).
The result says to use first-differenced transformations (from the ACF/PACF) with an autoregressive component of 1 lag (from the PACF), and with no lagged cons or inv.
Because the cross-correlations for the first differenced observations showed a correlation only at
the 0 lag, there is no change to the model.
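A sketch of the ARIMA Syntax for this model - one autoregressive lag, first differencing, no moving average term; the variable names are illustrative:

ARIMA gdp WITH cons inv
  /MODEL = (1,1,0) CONSTANT.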
But, if the cross-correlation between GDP and the first and third lags of investment were
significant, then the model would be:
It is an ARIMA (1,1,0) model. ARIMA (1 lag used for autoregression, differencing levels
required=1, moving average correction=0). The cross-correlation does not figure explicitly into
the ARIMA. You have to create new variables (see next section).
(GDPt-1 - GDPt-2) is a 1-lagged first-difference autoregressive component (because it is the lag of the dependent variable; in a sense, we are regressing GDP on itself - thus the term "auto").
In chapter 2 we learned how to create new variables using COMPUTE, RECODE, and some
other procedures. There is a simpler way to create time series transformed variables. Using
TRANSFORM/CREATE TIME SERIES, several variables can be transformed by a similar
mathematical function. This makes sense in Time series analysis because you will often want to
obtain the differenced transformations on many of the variables in your data.
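TRANSFORM/CREATE TIME SERIES pastes the CREATE command. A sketch that makes first-differenced versions of three variables; the new variable names are illustrative:

* DIFF( ) creates the first difference of each series.
CREATE gdp_1 = DIFF(gdp,1).
CREATE cons_1 = DIFF(cons,1).
CREATE inv_1 = DIFF(inv,1).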
To replace missing values, go to DATA/REPLACE MISSING VALUES. Choose the method for replacing values and move the variable with the missing values into the area "New Variables." Press OK.
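This dialog pastes the RMV command. A sketch using linear interpolation; the variable names are illustrative:

RMV /gdp_fill = LINT(gdp).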
A moving average correction can be included. (Note: Moving Average is beyond the scope
of this book).
MODEL: MOD_23
Number of residuals   28
Standard error        1.4625021
Log likelihood        -48.504027
AIC                   105.00805
SBC                   110.33687
Analysis of Variance:

Use these only if part of your coursework. The Log Likelihood (or, more specifically, the "-2 Log Likelihood") can be used to compare across models. Consult your textbook for details.
Note: The new variables are predictions (and upper and lower confidence bounds) of the original variables and not of their differenced transformations.
Interpretation of the coefficients: we have to reassure ourselves that our interpretation below is
correct
AR1: for every 1 unit increase in the change of GDP between two and 1 periods back (that is,
for example, in GDP of 1984 - 1983) the effect on the change in GDP between the last period
and the current period (that is, for the same example, in GDP of 1985 - 1984) is "-.50." If the
difference in GDP between 1983 and 1984 increases, then the difference between 1984 and
1985 decreases.
CONS: for every 1 unit increase in the change of consumption between the last and current periods (that is, for example, in consumption of 1985 - 1984), the effect on the change in GDP between the last period and the current period (that is, for the same example, in GDP of 1985 - 1984) is "1.05." If the difference in consumption between 1983 and 1984 increases, then the difference in GDP between 1984 and 1985 also increases.
INV: note that the T is barely significant at the 90% level (as the Sig value = .10). For every 1 unit increase in the change of investment between the last and current periods (that is, for example, in investment of 1985 - 1984), the effect on the change in GDP between the last period and the current period (that is, for the same example, in GDP of 1985 - 1984) is "1.05." If the difference in investment between 1983 and 1984 increases, then the difference in GDP between 1984 and 1985 also increases.
CONSTANT: not significant even at 90% as the sig value is well above .1.
ResidualsT = a * ResidualsT-1 + uT
(where uT is truly random and uncorrelated with previous period values)
The coefficient "a" is called Rho. In section 7.1 we showed how to ask for the Durbin-Watson statistic (DW) from a linear regression. The easiest way to estimate Rho is from this statistic:
Rho = (2 - DW) / 2
To correct for this problem, we would advise using the ARIMA process described in 17.5 along
with any transformations that are necessitated to correct for autocorrelation. Consult your
textbook for these transformations.
Luckily, for first-order autocorrelation [160], SPSS offers an automatic procedure - AUTOREGRESSION. Unfortunately, it is a bit restrictive compared to ARIMA because a one-lag autoregressive component is automatically added, higher-lag autoregressive components cannot be added, and a Moving Average correction cannot be incorporated. Still, you may find it useful [161], and we devote the rest of this section to showing the procedure.
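A sketch of the AUTOREGRESSION Syntax (the AREG command), run here on the first-differenced variables created earlier; the estimation method shown is one of the available options:

AREG gdp_1 WITH cons_1 inv_1
  /METHOD = ML.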
[160] The correlation among the residuals may be of a higher order. The Durbin-Watson statistic cannot be used for testing for the presence of such higher-order correlations. Consult your textbook for testing methods and for corrective methodology. Unfortunately, as in other econometric procedures, SPSS does not provide for automatic testing. It is incredible that it still sells so much.

[161] Especially because Moving Average corrections are rarely conducted rigorously.
The result is exactly the same as in the previous section (this is an ARIMA(1,1,0) model). The
only difference is that the algorithm corrected for first order autocorrelation.
With ARIMA, by contrast, you can choose the degree of autoregression (for example, you can use a variable that has two-period lags).
(Sequence chart: GDP and Consumption (both in real 1995 prices, $) plotted together over YEAR, 1970-98.)
Even though both series are non-stationary, can we find a relation between the two that is stationary [162]? That is, can we find a series that is calculated from GDP and consumption but itself exhibits randomness over time?
For example:
GDP = intercept - 0.7*(CONS) + Residual
[162] One condition is that both (or all) the variables should have the same level of integration; that is, the same level of differencing should make them stationary. See section 17.2 - we show that the variables all have a level of integration of 1, as the PACFs of the first-differenced transformations show no non-stationarity. We believe that the term "co-integration" has its roots in this condition: "co" (all variables in the relation) + "integration" (have the same level of integration).
(Sequence chart: the "New Series," fluctuating between about -3 and -1, over YEAR, 1970-98.)
Chapter 18: Programming without programming (Syntax and Script)
Script. This language is used mainly for working on output tables and charts. Section
18.1 teaches how to use Scripts.
Syntax. This language is used for programming SPSS procedures. It is the more
important language. Section 18.2 teaches how to use Syntax.
Most Scripts work on SPSS output. So, to learn how to use Scripts, open an output file. (Or use
the one supplied with this document-- it is shown in the next picture).
Let's open a Script file. Choose the menu option FILE/OPEN. In the area "Files of type" choose
the file type "SPSS Script (.sbs)" as shown below. To locate the Script files, go to the folder
"SPSS/Scripts/" on the path SPSS is installed163. (Or do a search for files ending with the
extension ".sbs.")
Click on "Open." The Script file opens in a new window called the "Script Editor" as shown in
the picture below.
The Script file is basically a word-processing document in which text is written. The code starts
with the line "'Begin Description." The lines of code that start with an apostrophe are comments
that provide information to you on what the Script does and what requirements and actions it
needs from you. In the next picture I show the entire description and purpose of the Script
"Remove Labels."
[163] If you used the default installation, then the folder will be "C:\Program Files\SPSS\Scripts."
Scroll down the page. You will see lines of text that do not have an apostrophe at the start.
These are the lines of functional code.[164] Don't try to learn or worry about understanding the
code.
Actually, you don't need to change anything or do anything to the code. You just "Run" (that is,
execute) it. To run the code, first see if the Script has any "Requirements" of the user. (Look at
the lines that begin with an apostrophe-- one of them may start with the word "Requirements.")
This Script requires that the user first choose one output table (also called "Pivot" table). So, first
go to the "SPSS Output Navigator" window and choose one table. Then go back to the "Script
Editor" window and click on the icon with a rightward-pointing arrow. The icon is shown in the
next picture.
The code will run and perform the operations it is designed to do-- it will delete all row and
column labels on all the tables in the open output window.
Another way to run a script-- go to UTILITIES / RUN SCRIPT as shown in the picture below.
The available scripts will be shown in the "RUN SCRIPT" dialog box as shown in the next
picture. When you click on the name of a script, its description will be shown under the box
[164] It is based on the SAX BASIC language and is very similar to Visual Basic for Applications, the language for
Microsoft Office.
"Description" on the right-hand side. Within the description area, play close attention to the
paragraph "Requirements." In this case, you are required to choose one output table (also called a
"Pivot" table) and then run the script. So, first choose one table. Then, go to the Script window
shown below and click on "RUN." The Script will be executed.
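Incidentally, a script can also be run from a Syntax file with the SCRIPT command. A sketch, assuming the default installation path from the earlier footnote and a script file named "Remove labels.sbs" (check the exact file name in your Scripts folder):

SCRIPT 'C:\Program Files\SPSS\Scripts\Remove labels.sbs'.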
More Scripts: you can look for more Scripts with a web search.
Choose the menu option GRAPHS / LINE GRAPH. Choose the options "Simple" and "Summary
of individual cases." Click on "Define." The following dialog box opens.
Choose the options as shown above. Now, instead of clicking on the button "OK," click on the
button "Paste." Doing this automatically launches the "Syntax Editor" window and automatically
writes and pastes the code for the line graph you are planning to construct.[165] The syntax window
is shown below.
Variable names are in lower case, while all other code is in upper case.
Each line (apart from the first line) starts with a forward slash ("/").
[165] Note that the graph does not get constructed at this stage.
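For the line graph above, the pasted code will look something like this sketch (we assume the variables are named cons and year; yours may differ):

GRAPH
  /LINE(SIMPLE)=VALUE(cons) BY year.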
1. Place the cursor after the end of the code and hit "Enter" a few times. This provides empty
space between bunches of code thereby increasing the ease of reading and using the code.
(This is shown in the next picture.)
2. Save the Syntax file-- go to FILE/SAVE AS and save it as a Syntax file (with the extension
".sps.")
How does one use/execute the code? Using the mouse, highlight the 3 lines of code and go to
RUN / ALL (or RUN / SELECTION) or click on the icon with the right-arrow sign (shown in the
next picture.)
Let's do another example. Make the same line graph as in the previous example, but this time, in
addition, click on the button "Titles" (within the dialog box for line graphs) and enter titles and
footnotes as shown in the next picture.
This dialog has been filled in completely. Return to the main dialog by clicking on the button
"Continue" and then on "Paste." The code is written and pasted onto the Syntax file
automatically by SPSS. The code is shown in the next picture. (It is the second bunch of code.)
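The second bunch of code will resemble the following sketch; the title and footnote strings are whatever you typed (ours are illustrative):

GRAPH
  /LINE(SIMPLE)=VALUE(cons) BY year
  /TITLE='Consumption over time'
  /FOOTNOTE='Source: illustrative data'.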
This time the code includes new lines that capture the titles and footnotes you wrote. To run the
entire code (that is, both the line graphs) choose both bunches of code and go to RUN/
SELECTION. To run only one bunch of code (that is, only one of the line graphs) choose one
bunch of code and go to RUN/SELECTION. (Do you now realize the importance of placing
empty lines after each bunch of code?)
One more good housekeeping strategy-- write some comments before each bunch of code. This
comment may include what the code does, who pasted it, the date, etc. To write a comment, start
the line with an asterisk (*) and end the line with an asterisk and then a period (*.). This is shown
in the next picture.
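For example, following this asterisk convention, a comment placed before a bunch of code might look like this (name and date purely illustrative):

* Line graph of consumption by year - pasted by VG, 1 March 1999 *.
GRAPH
  /LINE(SIMPLE)=VALUE(cons) BY year.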
Continue writing code this way: choose a procedure and click on the button "Paste" instead of
"OK," and so on.
Why use Syntax? The main benefits are:
1. Getting over any phobia/aversion to using (and maybe writing) software code. In general,
becoming more confident with software technology.
2. Documenting all the work done in a project. If you use the simple point-and-click windows
interface for your project then you will not have a record of all the procedures you conducted.
3. Because the Syntax allows you to document all your work, checking for errors becomes
easier. With experience, you will be able to understand what the Syntax code says-- then
checking for errors becomes very easy.
4. The main advantage is the massive saving of time and effort. How does syntax do this?
Several ways, a few of which are listed below.
Replication of code (including using Word to assist in replication) allows you to save
considerable time (compared with using the point-and-click windows interface), as shown in the
example above.
Assume you want to run the same 40 procedures on 25 different files (say, on data for
five countries). If the files have the same variables that you are using in the
procedures, and the same variable names, then considerable time can be saved by
creating the Syntax file using one country's data file and then running the same code
on the data files of the other countries. In a follow-up chapter ("Advanced
Programming in SPSS") I will show more ways to write time-saving Syntax files.
Assume you have several files with similar data but with different variable names.
Create the syntax file for one data file. Then, for the other data files, just replace the
variable names in the original Syntax file!
A frustrating situation arises when you have to redo all your work because of data or
other issues.[166] Syntax can save an incredible amount of time, as well as the boredom
produced by repeating tasks.
After running some procedures, you may want to run them again on only a subset of
the data file (see section 1.7 in my book), or separately for sub-groups of the data
(see ch 10 in my book). Syntax makes this easy. If you have the syntax file for the
procedures you conducted on the entire data file, then the same procedures can be
redone for the sub-group(s) of the data by first making the sub-group(s) and then re-
running the code in the Syntax file, as in the sketch after the footnote below.
[166] This may happen if you are provided a more accurate version of the data than the one you worked on for a few
weeks, or you have to change the choice of the dependent variable in all your regression models, or you used the
incorrect data file, or (and this is a frequent occurrence) you forgot (or were not informed of the need to) perform
some crucial data step like defining value labels or weighting cases.
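A sketch of the sub-group idea, assuming a grouping variable named gender and the illustrative line-graph code from earlier:

* Re-run the same graph separately for each sub-group.
SORT CASES BY gender.
SPLIT FILE BY gender.
GRAPH
  /LINE(SIMPLE)=VALUE(cons) BY year.
SPLIT FILE OFF.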
graph of "Consumption" and "Year." Let's use Word to replicate the code for graphs of
"Investment" by "Year."
Select the two bunches of code you "wrote" earlier. Choose the menu option EDIT / COPY.
Open Microsoft Word (or WordPerfect). Choose the menu option EDIT / PASTE. The SPSS
code is pasted onto the Word document as shown in the next picture.
Choose the menu option EDIT / REPLACE and choose the options for replacing "cons" and
"consumption" by "inv" and "investment," respectively. (See the next two pictures.)
Select the changed text (all of it) and go to EDIT / COPY. Go back to the SPSS syntax file and go
to EDIT / PASTE. To run the two new bunches of code, choose the lines of code and go to RUN
/ SELECTION.
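After the replacement, the first new bunch of code would read roughly as follows (with the same illustrative names):

GRAPH
  /LINE(SIMPLE)=VALUE(inv) BY year.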
INDEX
The index is in two parts:
1. Part 1 has the mapping of SPSS menu options to sections in the book.
2. Part 2 is a regular index.
Part 1: Relation between SPSS menu options and the sections in the book

Menu    Sub-Menu                     Section that teaches the menu option
FILE    NEW                          -
,,      OPEN                         1.1
,,      DATABASE CAPTURE             16
,,      READ ASCII DATA              12
,,      SAVE                         -
,,      SAVE AS                      -
,,      DISPLAY DATA INFO            -
,,      APPLY DATA DICTIONARY        -
,,      STOP SPSS PROCESSOR          -
EDIT    OPTIONS                      15.1
,,      ALL OTHER SUB-MENUS          -
VIEW    STATUS BAR                   15.2
,,      TOOLBARS                     15.2
,,      FONTS                        15.2
,,      GRID LINES                   15.2
,,      VALUE LABELS                 15.2
DATA    DEFINE VARIABLE              1.2
,,      DEFINE DATES                 -
,,      TEMPLATES                    -
,,      INSERT VARIABLE              -
DATA    INSERT CASE, GO TO CASE      -
Based on your feedback, we will create sections on menu options we have ignored in this
book. These sections will be available for download from spss.org and vgupta.com. (We
do not want to add sections to the book because it is already more than 400 pages.)
Part 2: Regular index

Column Format 1-13
Comma-Delimited ASCII data 12-2, 12-4
Format, output table 11-1
Homogeneity Of Variance, Testing For 5-46
If, in SELECT CASE 1-32
Independent Samples T-Test 5-42
Logit, Why And When To Use 9-2
Multiple Response Sets, Creating 2-25
Multiple Response Sets, Using For Custom Tables 2-30