
SPSS Training: Sections 6 & 7


Practical Statistical Training

Using SPSS Software

Trainer – Hailegebriel Yirdaw


(PhD Candidate at AAU and University of Gothenburg)
E-mail: hailaenani@gmail.com

Is there any association?

Trainer: Hailegebriel Yirdaw 2


Correlation Analysis
Correlation measures the strength of the linear
relationship between two variables.
It does not imply a cause-and-effect relationship.
A correlation coefficient takes values ranging
from -1 to 1.
The Pearson coefficient of correlation is defined as

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]



The Pearson coefficient is applicable only
under the condition that the variables are
normally distributed.
If the variables do not come from normally
distributed populations, we can compute
a non-parametric measure of association
that is based on ranks.
The Spearman rank correlation
coefficient is given by:

ρ = 1 − 6 Σdᵢ² / [n(n² − 1)]

where dᵢ is the difference between the ranks of the
i-th pair and n is the number of pairs.
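For readers who want to check these coefficients outside SPSS, here is a minimal Python sketch using scipy. The income and expenditure figures are made up purely for illustration:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical income and expenditure values for illustration only
income = np.array([1200, 1500, 1800, 2100, 2500, 3000, 3600, 4100])
expenditure = np.array([900, 1100, 1400, 1500, 1900, 2100, 2600, 3000])

r, p_pearson = pearsonr(income, expenditure)      # parametric, assumes normality
rho, p_spearman = spearmanr(income, expenditure)  # rank-based, non-parametric

print(f"Pearson r = {r:.3f} (p = {p_pearson:.4f})")
print(f"Spearman rho = {rho:.3f} (p = {p_spearman:.4f})")
```

Because the illustrative data are strictly increasing together, the rank-based Spearman coefficient is exactly 1, while Pearson is slightly below 1.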



There are two main options for
correlating variables: bivariate and partial.
Bivariate correlation is used in
situations where you are interested only
in the relationship between two or more
variables.
To obtain a bivariate correlation:
Analyze → Correlate → Bivariate → select
the variables → activate either Pearson
or Spearman → OK



Activity 1 (15 minutes)

a) Based on "Expenditure data", show and explain
the bivariate correlation between expenditure
and income.

b) Use the "employee.sav" data and examine the
relationship between current salary and
beginning salary. (5 minutes)
Partial Correlation
Partial correlation measures the
correlation between X and Y controlling for Z.
Comparing the bivariate ("zero-order")
correlation to the partial ("first-order")
correlation allows us to determine whether the
relationship between X and Y is direct,
spurious, or intervening.
◦ Interaction cannot be determined with partial
correlations.
Partial Correlation
Note the subscripts in the symbol for a partial correlation
coefficient:
r_XY·Z
which indicates that the correlation coefficient is for X
and Y controlling for Z.
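The standard formula for a first-order partial correlation builds it from the three zero-order correlations: r_XY·Z = (r_XY − r_XZ·r_YZ) / √[(1 − r_XZ²)(1 − r_YZ²)]. A small Python sketch, with hypothetical zero-order correlations for illustration:

```python
from math import sqrt

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation r_XY.Z computed from
    the three zero-order (bivariate) correlations."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Hypothetical zero-order correlations, purely for illustration
print(round(partial_corr(0.50, 0.30, 0.40), 4))
```

Note that when Z is uncorrelated with both X and Y, the partial correlation reduces to the bivariate one, as the formula shows.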



Partial Correlation
Example
The table below lists husbands’ hours of housework per week (Y), number of
children (X), and husbands’ years of education (Z) for a sample of 12 dual-
career households



Partial Correlation
Example
A correlation matrix appears below
The bivariate (zero-order) correlation between husbands’ housework
and number of children is +0.50
This indicates a positive relationship



Partial Correlation
Example
Calculating the partial (first-order) correlation between husbands’
housework and number of children controlling for husbands’ years of
education yields +0.43



Partial Correlation
Example
Comparing the bivariate correlation (+0.50)
to the partial correlation (+0.43) shows little
change.
The relationship between number of children
and husbands' housework changes little when
husbands' education is controlled for.
Therefore, we have evidence of a direct
relationship.



Partial correlation measures
separately the relationship between two
variables in such a way that the effects of
other related variables are eliminated.
Analyze → Correlate → Partial → select the
variables that you want to correlate → select
the variables that you want to control for → OK



Activity 2 (7 minutes)

Based on "Expenditure data", show and
explain the partial correlation between
expenditure and income, controlling for
family size and education.



Is there any causation?



Introduction
Regression is a statistical technique of modeling the
relationship between variables.
Model Specification
A model has both a dependent variable and
independent variables.
The dependent variable is typically an outcome, and
the independent variables may contribute to
predicting that outcome.
Specifying a regression model involves selecting a
dependent variable and related independent
variables.
The type of dependent variable determines the type
of regression (linear, logistic, etc.).
Dependent variables
Notes about dependent variables:
◦ A continuous dependent variable is appropriate
for a linear regression model.
◦ For a yes/no variable, such as whether a youth
has a job, the appropriate procedure is
binary logistic regression.
◦ For a categorical dependent variable with more
than two categories, multinomial logistic
regression is appropriate.
◦ For a categorical dependent variable with more
than two categories measured on an ordinal
scale, ordinal logistic regression is
appropriate.
Independent variables
Notes about independent variables:
◦ Continuous variables can be entered directly
into a regression.
◦ Categorical variables must be represented by a
set of dummy variables, one per response category.
◦ When category dummies are used in a model,
one of the categories is omitted from the
model.



Simple Linear Regression Analysis
Simple linear regression – the model includes
one explanatory (independent) variable and one
explained (dependent) variable:

Yᵢ = α + βXᵢ + εᵢ

where: α = the intercept
       β = the slope
       ε = the error term

In regression we deal with a stochastic relationship,
and that is why we add the error term.
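For comparison with SPSS output, the same model can be fitted outside SPSS; a minimal sketch with simulated data (scipy assumed available), where the true intercept and slope are 2 and 3:

```python
import numpy as np
from scipy.stats import linregress

# Simulated data for illustration: Y = 2 + 3*X plus random noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 1, size=x.size)

res = linregress(x, y)  # fits Y = alpha + beta*X by least squares
print(f"intercept (alpha) = {res.intercept:.3f}")
print(f"slope (beta)      = {res.slope:.3f}")
print(f"R-squared         = {res.rvalue**2:.3f}")
```

The estimates land close to the true values of 2 and 3; they differ from them only because of the error term discussed above.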



There are several justifications for
including the error term in the model.
These are:
1. Effect of variables omitted from the
model
For a number of reasons, some variables
(factors) that might affect the dependent
variable are omitted from the model:
◦ Lack of data (particularly on time-series
variables)
◦ Lack of knowledge of some factors that affect
the dependent variable
◦ Difficulty in measuring factors that affect the
dependent variable
◦ Some factors are random by their very nature
(e.g. earthquakes)
◦ Some factors, taken individually, may have a very
small effect
2. Measurement Error
Variables included in the model may be
measured inaccurately, and some problems
arise from the methods of data collection
and processing. Such problems are
handled by the error term.
3. Wrong Mathematical Specification of
the Model
The equation may be mis-specified in the
sense that the particular functional form
chosen may be incorrect.
For example, we may have linearized a non-linear
model, or estimated a single equation instead of a
system of equations.



4. Errors in Aggregation
There may be errors in aggregation over
time, space, cross-sections, etc., and the
stochastic disturbance term captures these
aggregation errors.
5. The Randomness of Human
Behaviour
There are unpredictable elements in
human behaviour that are taken care of by
the stochastic disturbance term.



In order to take all these sources of error into
account, we introduce the stochastic (random)
disturbance term into our econometric models,
and hence the complete simple econometric
model becomes:

Yᵢ = α + βXᵢ + εᵢ

Variation in Y = Explained Variation + Unexplained Variation



To conduct a regression analysis:
Analyze → Regression → Linear → enter the
dependent variable → enter the independent variable
→ OK



Regression output
R – the correlation between the observed
values of the outcome and the values
predicted by the model.
R² – the proportion of variance accounted
for by the model.
Adjusted R² – an estimate of R² adjusted
for degrees of freedom.
F-test – indicates the overall significance
of the model (the null hypothesis is that the
joint effect of the explanatory variables is
zero).
t-value – indicates the significance of a
given variable (the null hypothesis is that the
effect of that explanatory variable, taken
individually, is zero).
Sig – the significance of a given
variable in explaining the dependent variable.
Betas
◦ Beta values: the change in the outcome associated
with a unit change in the predictor.
◦ Standardised beta values: the same, but
expressed in standard deviations.
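To make these summary statistics concrete, here is a sketch that computes R², adjusted R², and the F statistic by hand from the sums of squares, on simulated data (n, k, and the coefficients are all made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 2                      # n observations, k predictors
X = rng.normal(size=(n, k))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

# OLS fit via least squares on [1, X]
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta
resid = y - y_hat

ss_total = np.sum((y - y.mean()) ** 2)        # total variation in Y
ss_resid = np.sum(resid ** 2)                 # unexplained variation
r2 = 1 - ss_resid / ss_total                  # proportion explained
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))  # overall model significance

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}, F = {f_stat:.1f}")
```

These are the same quantities SPSS reports in the Model Summary and ANOVA tables of a linear regression.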



Activity 1 (10 minutes)
Use the "Employee data":
a) Find out whether or not the previous
experience of workers has any effect on their
beginning salary.
b) Find out whether or not the education level of
workers has any effect on their beginning
salary.
c) Plot the regression line for current salary
and beginning salary using the "scatter plot"
menu.
Multiple Regression Analysis
The model is a linear equation with at least two
explanatory (independent) variables and one
explained (dependent) variable:

Y = a + b₁X₁ + b₂X₂ + ε

Notation
◦ a – intercept, where the regression line crosses the Y axis
◦ b₁ – the partial slope for X₁ on Y;
it indicates the change in Y for a one-unit change in X₁,
controlling for X₂
◦ b₂ – the partial slope for X₂ on Y;
it indicates the change in Y for a one-unit change in X₂,
controlling for X₁
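The partial-slope idea can be seen directly in a sketch: generating noiseless data from Y = 5 + 2X₁ − 3X₂ (values made up for illustration), least squares recovers exactly those partial slopes:

```python
import numpy as np

# Hypothetical data generated from Y = 5 + 2*X1 - 3*X2 (no noise),
# so least squares should recover the partial slopes exactly.
rng = np.random.default_rng(42)
x1 = rng.normal(size=30)
x2 = rng.normal(size=30)
y = 5 + 2 * x1 - 3 * x2

A = np.column_stack([np.ones_like(x1), x1, x2])
(a, b1, b2), *_ = np.linalg.lstsq(A, y, rcond=None)

# b1 is the change in Y for a one-unit change in X1, holding X2 constant
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```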



To conduct a regression analysis:
Analyze → Regression → Linear → enter the
dependent variable → enter the independent variables
→ OK



Activity 1 (10 minutes)

Based on "Expenditure data", run and
explain a multiple regression by
considering expenditure as the dependent
variable and income, family size, age, and
education as independent variables, and
write a very brief interpretation.



Considering categorical independent
variables
Whenever we have a categorical
independent variable whose categories have
been assigned numbers, we need to create
as many dummy variables as there are
categories.
Then include the dummies, excluding one
of the categories (considered the base),
against which the interpretation is made.
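Outside SPSS, this dummy-coding step can be done in one call, e.g. in Python with pandas; the variable name and categories below are made up for illustration:

```python
import pandas as pd

# Hypothetical categorical variable with three categories
df = pd.DataFrame({"educ_level": ["primary", "secondary", "tertiary",
                                  "primary", "tertiary"]})

# drop_first=True omits one category (the base), exactly as described
# above: coefficients are then interpreted relative to that base.
dummies = pd.get_dummies(df["educ_level"], prefix="educ", drop_first=True)
print(dummies.columns.tolist())
```

With three categories, only two dummies enter the model; "primary" (the first category alphabetically) becomes the base here.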



Activity 2 (7 minutes)

Use the "employee" data set and run a
multiple regression by considering current
salary as dependent and beginning salary,
gender, level of education, minority group,
previous experience, and employment
category as independent variables. Briefly
interpret the result.



Activity 3 (10 minutes)

Use the "employee" data set and run a
multiple regression by considering the
logarithm of current salary as dependent
and the logarithm of beginning salary, gender,
level of education, minority group, previous
experience, and employment category as
independent variables. Briefly interpret the
result.



Activity 4 (7 minutes)

Use the "expenditure" data set and run a
multiple regression by considering saving
as dependent and household income,
family size, level of education, age, and sex
as independent variables. Interpret the
result.



Post-estimation Tests
No multicollinearity:
◦ Predictors must not be highly correlated.
Homoscedasticity:
◦ For each value of the predictors, the variance of the
error term should be constant.
Independent errors:
◦ For any pair of observations, the error terms should be
uncorrelated.
Normally distributed errors.



Multicollinearity diagnostics
Multicollinearity exists if predictors
are highly correlated.
This assumption can be checked with
collinearity diagnostics:
Analyze → Regression → Linear →
Statistics → tick 'Collinearity diagnostics'
→ Continue
• Tolerance should be more than 0.2
(Menard, 1995)
• VIF should be less than 10 (Myers, 1990)
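The VIF statistic SPSS reports is defined as VIF_j = 1/(1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors (tolerance is its reciprocal). A sketch that computes it by hand on simulated data, with one predictor deliberately made nearly collinear:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X:
    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns. Tolerance is 1 / VIF."""
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)   # nearly a copy of x1 -> collinear
vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

As expected, x1 and x3 show VIFs far above 10 (they carry almost the same information), while x2 stays near 1.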
Checking Assumptions about Errors

Homoscedasticity / independence of
errors:
◦ Plot ZRESID against ZPRED.
Normality of errors:
◦ Normal probability plot.
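The ZPRED and ZRESID quantities SPSS plots are simply the standardized predicted values and residuals; a sketch reproducing them on simulated data (numpy assumed; the plot itself is omitted):

```python
import numpy as np

# Simulated fit for illustration; ZPRED and ZRESID are the standardized
# predicted values and residuals that SPSS plots against each other.
rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)

A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ beta
resid = y - pred

zpred = (pred - pred.mean()) / pred.std()
zresid = (resid - resid.mean()) / resid.std()

# A fan or funnel shape in the (zpred, zresid) scatter suggests
# heteroscedasticity; here we just confirm the standardization.
print(round(float(zresid.mean()), 6), round(float(zresid.std()), 6))
```

With a plotting library one would then scatter zresid against zpred and look for any systematic pattern.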

Activity 1 (10 minutes)

Based on "Expenditure data", run a
multiple regression by considering
expenditure as the dependent variable and
income, family size, age, and education as
independent variables, and perform the
post-estimation tests.



Binary Logistic Regression
• Binary logistic regression is a type of regression
analysis where the dependent variable is a
dummy variable (coded 0, 1).
To obtain a logistic regression:
Analyze → Regression → Binary Logistic →
enter your variables
• Suppose a researcher is interested in the
likelihood of gun ownership in the US, and what
would predict it:
1. Men are more likely to own guns than women
2. The older persons are, the more likely they are to
own guns
3. White people are more likely to own guns than those
of other races
4. The more educated persons are, the less likely they
are to own guns
Binary Logistic Regression…
Interpreting coefficients:

ln[p/(1−p)] = a + b₁X₁ + b₂X₂ + b₃X₃ + b₄X₄

[Slide table: the intercept a and the coefficients b₁–b₄ for
X₁–X₄, together with the corresponding exp(b) values]

Which b's are significant?
Being male, getting older, and being white have a positive effect on the likelihood of
owning a gun. On the other hand, education does not affect owning a gun.
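The log-odds equation above can be turned back into probabilities with the logistic function, and each exp(b) is the odds ratio for a one-unit change in that predictor. A sketch with entirely made-up coefficient values for the gun-ownership model:

```python
from math import exp

# Hypothetical (made-up) logistic coefficients for illustration:
# ln[p/(1-p)] = a + b1*male + b2*age + b3*white + b4*educ
a, b1, b2, b3, b4 = -2.0, 0.8, 0.02, 0.6, -0.05

def prob(male, age, white, educ):
    """Convert the log-odds back to a probability."""
    logit = a + b1 * male + b2 * age + b3 * white + b4 * educ
    return 1 / (1 + exp(-logit))

# exp(b) is the odds ratio: the odds multiply by this factor
# for each one-unit increase in the predictor
print(f"odds ratio for male: {exp(b1):.2f}")
print(f"P(own gun | male, 40, white, 12 yrs educ): {prob(1, 40, 1, 12):.3f}")
```

With these made-up numbers, being male multiplies the odds of ownership by about 2.2, holding the other predictors constant.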



Activity (15 minutes)

Use the child education.sav data set, drawn from the
2004/05 Ethiopian Demographic and Health Survey (EDHS),
covering 5828 children aged 10 to 18. Use the following
as reference categories:
distance of more than 5 kilometres to primary school
household size of more than 8
female
the response "no" for child engaged in productive
economic activity
the response "no" for household owns cattle
the response "no" for food shortage in the household



Regress the dependent variable on the
independent variables and interpret the odds ratio
(the probability of the child not attending formal
education given his/her personal characteristics)
for:
child engaged in productive economic activity
male
household owns cattle
food shortage in the household
household size
distance to primary school (in kilometres)



Multinomial Logistic Regression
We use it when the dependent variable is
categorical, with more than two categories.
Example: Which program did you choose?
Our interest is to investigate what
determines students' program choice.
◦ The outcome variable is prog, program type.
◦ The predictor variables are socioeconomic
status, ses, a three-level categorical variable,
and writing score, write, a continuous variable.



The multinomial logistic regression model
is defined by the following assumptions:
◦ Observations Yᵢ are statistically independent of
each other.
◦ Observations Yᵢ are a random sample from a
population where Yᵢ has a multinomial
distribution with probability parameters.
Estimating the model:
Analyze → Regression → Multinomial Logistic →
select the dependent variable → choose the
reference category → select the independent
variables (put categorical variables into
Factors and continuous ones into Covariates)
→ OK
Regression output
Model fitting information – indicates whether the
model improves our ability to predict the
outcome.
A statistically significant chi-square statistic
(p < .0005) indicates that the final model
fits better than the intercept-only model.
In linear regression, R² (the coefficient of
determination) summarizes the proportion of
variance in the outcome that can be accounted
for by the explanatory variables.
For logistic and ordinal regression models it is
not possible to compute the same R² statistic as
in linear regression, so three approximations are
computed instead, reported as pseudo R-square
statistics.
Likelihood Ratio Tests
◦ The chi-square statistic is the difference in
-2 log-likelihoods between the final model and a
reduced model. The reduced model is formed
by omitting an effect from the final model. The
null hypothesis is that all parameters of that
effect are 0.
◦ In other words, it tests the overall effect
of each explanatory variable on the
dependent variable.
The coefficients are interpreted just
as in binary logistic regression.
A one-unit increase in the variable write is
associated with a 0.058 decrease in the
relative log odds of being in the general
program versus the academic program.
A one-unit increase in the variable write is
associated with a 0.1136 decrease in the
relative log odds of being in the vocational
program versus the academic program.
The relative log odds of being in the general
program versus the academic program
increase by 1.163 when moving from the
highest level of ses (ses = 3) to the lowest
level of ses (ses = 1).



The ratio of the probability of choosing one
outcome category over the probability of
choosing the baseline category is often
referred to as the relative risk (it is also
sometimes referred to as the odds, as we
just did when describing the regression
parameters above).
Thus, exponentiating the linear equations
above yields relative risks.
Regression coefficients represent the change
in log relative risk (log odds) per unit change
in the predictor.
SPSS includes relative risk ratios in the output,
under the column "Exp(B)".
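The Exp(B) arithmetic is a one-line calculation; here it is checked for the write coefficient quoted from the example output above:

```python
from math import exp

# Coefficient from the example's multinomial output: b_write = -0.0579284
# for the general-vs-academic contrast.
b_write = -0.0579284
rrr = exp(b_write)  # SPSS reports this as Exp(B)
print(f"Exp(B) = {rrr:.4f}")  # relative risk ratio per 1-unit increase in write
```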
◦ The relative risk ratio for a one-unit increase in
the variable write is 0.9437 (exp(-0.0579284)
in the output of the NOMREG procedure)
for being in the general program versus the
academic program.
◦ The relative risk ratio for switching from ses = 3 to
ses = 1 is 3.199 for being in the general program versus
the academic program. In other words, the
expected risk of staying in the general program
is higher for subjects who are low in ses.



Activity (15 minutes)

Use the Voter_Multinomial data to investigate
the likelihood of voting for the candidates,
considering gender, age, and education as
explanatory variables. Clearly explain the
result of the regression.



Ordinal Logistic Regression
A multinomial logistic model can be used
here too, but it has the disadvantage of
ignoring the ordering.
An ordinal logistic model (proportional
odds model) does take the ordering into
account.



Consider the following example.
A study looks at factors that influence the
decision of whether to apply to graduate
school.
◦ College juniors are asked if they are unlikely,
somewhat likely, or very likely to apply to
graduate school.
◦ The following explanatory variables are used:
parental educational status, a 0/1 variable
indicating whether at least one parent has a graduate
degree;
whether the undergraduate institution is public (1) or
private (0); and
current GPA.
Everything here is unchanged from binary
logistic models:
◦ Parameters are estimated using maximum
likelihood estimation.
◦ Hypotheses of interest are typically of the
form βⱼ = 0, for one or more coefficients βⱼ.
◦ Wald tests, likelihood ratio tests, and
confidence intervals are defined and used as
before.



Before we run our ordinal logistic model,
we will check whether any cells are empty or
extremely small.
If any are, we may have difficulty running
our model.
◦ Make simple crosstabs of the categorical
explanatory variables with the dependent variable.
Estimating the model:
Analyze → Regression → Ordinal →
select the dependent variable → select the
independent variables (put categorical
variables into Factors and continuous ones
into Covariates) → OK



The interpretation of the regression outputs
is just like the case of multinomial logistic
regression.
◦ In the Case Processing Summary table, we see the
number and percentage of cases in each level of
our response variable. We would be concerned if
one level had very few cases in it.
The thresholds (cutpoints) are shown at the
top of the parameter estimates output, and
they indicate where the latent variable is cut
to make the three groups that we observe in
our data. These are not used in the
interpretation of the results.



To obtain the proportional odds ratios you can
use a calculator (exponentiate the coefficients).
Exponentiated coefficients are interpreted as
partial odds ratios for being in the higher rather
than the lower half of the dichotomy.
One of the assumptions underlying ordered
logistic regression is that the relationship
between each pair of outcome groups is the
same.
In other words, ordered logistic regression
assumes that the coefficients that describe the
relationship between, say, the lowest versus all
higher categories of the response variable are the
same as those that describe the relationship
between the next lowest category and all higher
categories, etc.



Interpretation
We interpret these much as we would
odds ratios from a binary logistic
regression.
e.g. exp(β̂_pared) = 2.85: controlling for the other
explanatory variables, those who have at least one
parent with a graduate degree have 185% higher
odds than those without of giving a response that
indicates a higher likelihood of applying to
graduate school.
e.g. exp(β̂_gpa) = 1.85: controlling for the other
explanatory variables, one additional GPA point is
associated with an 85% increase in the odds of giving
a response that indicates a higher likelihood of
applying to graduate school.
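The "percent change in odds" wording follows directly from the odds ratio: it is (odds ratio − 1) × 100. Checking the two figures quoted above:

```python
from math import log

# Odds ratios quoted in the example above
or_pared = 2.85
or_gpa = 1.85

# Percent change in odds = (odds ratio - 1) * 100
print(f"pared: {100 * (or_pared - 1):.0f}% higher odds")
print(f"gpa:   {100 * (or_gpa - 1):.0f}% higher odds")

# The underlying log-odds coefficients are the natural logs:
print(f"b_pared = {log(or_pared):.3f}, b_gpa = {log(or_gpa):.3f}")
```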



Activity (15 minutes)

Use the Customer_Ordinal data to
investigate the likelihood of job
satisfaction, considering income, education,
and job category as explanatory
variables. Clearly explain the result
of the regression.



