11/24/2010
MEASURING GOVERNMENT OUTPUT IN THE UK
D NARAYANA
IGIDR Lecture 2
20 October 2010
Why bother about Service Sector?
• Global trend- economies becoming service economies
• Share of service sector in GDP
– Between 60 and 80% in most OECD countries
• In India, over the last 40 years
– Agrl share over 60 to <20?
– Ind share 16 to 20?
– Service share 30 to >60?
• But studies?
A hint of the problem
• Studies have argued
• Service sector output overestimated
– Inadequate data
– faulty methodology
– price deflators inappropriate
• Problem more serious- a hint
• (Recall three approaches to GDP: value of output = income generated = total expenditure)
• Indian health sector- about 5% of GDP
spent on health care, but share 1.87%
Measurement of Output and of
Value Added
• Output at current prices = Sales + Changes in
inventories + work in progress
• Applicable for the market sector
• Some exceptions, like banks, insurance
• Alternative measure needed
• Banks invoice limited portion of their services
– Foreign exchange commissions, check handling
charges, stock market transactions
• Bulk of their services- making loans
• They accept deposits, lend, financial
intermediation
Value Added in Non Market
Sector
• Non-market sector, mainly government
• Provide services, free of charge, or prices not economically significant
– Defence, public education, public health
• Financed through taxation or social contributions
• No direct link between payment and service
• Some services provided on individual basis
– Family sends children to school
• Other services consumed collectively, like
defence, police etc
Conventional Approach
• Government output= total value of the inputs
• In the UK, Input =
– the compensation of employees
– the procurement cost of goods and services
– a charge for the consumption of fixed capital
• In the US, Input is limited to employment
Government Output
• Government – all those agencies that provide
public services
• Examples: NHS and local authority provision of
social services
• Need to distinguish between
– Individual services (those consumed by
individual households)
– Collective services provided to society as a
whole
• Non –market output
– Supplied free , or
– At low prices, not economically significant
Major Problem
• Collective Services – it is hard to identify
the exact nature of the output
• Services supplied to the individuals – it is hard to place a value on these services
• Convention neglects increases in
productivity
• As productivity grows, the growth rate of
government output is understated
• The overall growth rate of GDP is
understated
Limits to Productivity Growth
• Many public services involve essential human input
• Labour is an end in itself
• Quality is judged in terms of amount of labour
• Computerisation may allow efficient allocation of care workers
• But there are limits to replacement of labour

Post-1998 development in ONS measurement of government output
Function | % Govt. spending, 2000 | Date introduced | Main components
Health | 30.3 | Introduced 1998, method updated 2004 | Hospital cost weighted activity index, family health services (number of GP consultations etc)
Education | 17.1 | Introduced 1998, with data from 1986 | Pupil numbers – Quality adjustment of 0.25 percent to primary and secondary schools
Administration of Social Security | 2.7 | Introduced 1998, with data from 1986 | Number of benefit claims for 12 largest benefits. No allowance for collection of contributions
Administration of Justice | 3.0 | Introduced in 2000, with full impact in 2001, data back to 1994 Q1 | Number of prisoners, legal aid cases etc
Fire | 1.1 | Introduced 2001, with data from 1994 Q1 | Number of fires, fire prevention and special services
Personal Social Services | 7.4 | Introduced 2001, with data from 1994 Q1 | Children and adults in care and provision of home helps
Police | 5.8 | Experimental | Cleared-up crimes of different types
Sectors that follow conventional method
This applies to:
• Defence
• General Public Services
• Economic services
• Environmental Protection
• Recreation and Culture
• Housing and Community amenities
Implications for Growth of GDP
• GDP growth rate, 1995-2003
• Direct method – 2.5% per annum
• Input method – 3% per annum
• GDP growth in the US- 3.25%
• Difference – accounts for nearly half the difference in GDP growth rate
Output and Input Volume Measures- 1
01 General Public Services
– Local government: Output A
– Central government: Output A
02 Defence
– Local government: NA
– Central government: Output A
03 Public Order and Safety
– Police (local and central government): output volume measures are volumes of police activity, crime related incidents, patrols, traffic incidents etc. Input A
– Prisons, local government: NA
– Prisons, central government: Output volume measures are measured directly using total numbers of prisoners. Input A
– Probation, local government: NA
– Probation, central government: Output volume measures are measured directly using workload hours of various areas of competence. Input A
– Courts (local and central government): output volume measures for magistrates courts are measured directly using caseloads of courts weighted average hours or average costs. Input A

Output and Input Volume Measures- 2
03 Public Order and Safety (continued)
– Fire, local government: output volume measures for the fire service are measured directly using number of other services. Input A
– Fire, central government: NA
04 Economic Affairs
– Output A, Input A
05 Environmental Protection
– Local government: Output A
– Central government: Output A
06 Housing and Community Amenities
– Local government: Output A
– Central government: Output A
07 Health (local and central government)
– Output volume measures are measured directly using:
a) treatment numbers and reference costs data from DH.
b) In addition, further indicator series are used for dental and ophthalmic services.
– Input A
08 Recreation, Culture and Religion
– Local government: Output A
– Central government: Output A

Output and Input Volume Measures- 3
09 Education (local and central government)
– Output volumes are measured directly using pupil numbers in pre-primary, primary and secondary schools obtained from DfES. Input A
10 Social Protection
– Local government, Personal Social Service: Output volume measures are measured directly using: a) numbers of adults in care and home help contact hours obtained from DH; b) numbers of children in care from DfES. Input A
– Local government, Administration of social security: Output volumes are measured directly for administration of Social Security using numbers of housing benefit cases. Input A
– Central government, Social Security: Output volume measures are measured directly for administration of social security using numbers of new benefit claims. Input A

Note: A = volume measures are deflated UK expenditure figures for pay, procurement of goods and services and capital consumption. NA = Not Applicable
Conclusions
• UK has moved from input approach to direct
measures of output
• Direct measures cover two-thirds of government final consumption
• Design of output measures needs care and
investment of resources.
• Also, continuous monitoring.
• Institutional change poses problems for output
measurement
• Effects of technological change may not be
captured.
Health- An Illustration
• Health is the largest government service.
• 31 percent of government final consumption in
2003.
• Health care services funded from general
taxation.
• The provision of health care services in the United Kingdom is a devolved responsibility.
• It is providing hospital and some community
health services
• Services are free of charge at the point of
delivery.
Improved Methodology
• Uses information about volume and cost weights
for 1,200 Healthcare Resource Groups
• 400 other activity groupings
• 200 categories of general practice prescribing
• Cost ranges from less than £10 to £45,000
• Improvements come from wider coverage, increased level of detail, better cost weights
• Categories used are more homogeneous
Methods of Output Measurement
• The UK Health output measure used before
June 2004
– Reflected movements in 16 different activity series
measuring health care.
– A single series counting total inpatient and day cases
accounted for about half the expenditure covered by
the index;
– Outpatient and community health treatments, GP
prescribing and dental treatments were measured
separately.
– An aggregate index was formed by weighting the
separate series
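The weighting of separate activity series into one aggregate can be sketched as a cost-weighted activity index. This is a minimal illustration only: the three series, the activity counts and the unit costs below are hypothetical, and the function name `cost_weighted_index` is an assumption, not the ONS implementation.

```python
def cost_weighted_index(base_activity, current_activity, base_unit_costs):
    """Aggregate activity series using base-period unit costs as weights.

    Returns an index with the base period equal to 100.
    """
    base_value = sum(base_activity[k] * base_unit_costs[k] for k in base_activity)
    current_value = sum(current_activity[k] * base_unit_costs[k] for k in base_activity)
    return 100.0 * current_value / base_value


# Hypothetical figures, for illustration only.
base_activity = {"inpatient": 1000, "outpatient": 5000, "gp_prescribing": 20000}
current_activity = {"inpatient": 1050, "outpatient": 5100, "gp_prescribing": 21000}
base_unit_costs = {"inpatient": 2000.0, "outpatient": 100.0, "gp_prescribing": 10.0}

index = cost_weighted_index(base_activity, current_activity, base_unit_costs)
print(round(index, 2))
```

Because inpatient cases carry the largest cost weight in this made-up data, their 5% growth dominates the aggregate.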
Improvements
• Come from
– Wider coverage
– Increased level of details
– Better cost weights
• Became possible
– NHS developed robust costs for a
standard list
Future Methods-1
• Recommendation 1 -- Extending the coverage of output volume indicators for each function
• Recommendation 2 -- Improving UK coverage
• Recommendation 3 -- Whole courses of treatment, technical change and substitution
– Linked outpatient attendances, investigations etc
– Units are to be grouped by diagnosis
– Treatments are to be adjusted for quality factors
• Recommendation 4 -- Measuring quality change
– Saving lives and extending life spans; mitigating
effects of disease
– Speed of access to treatment
– Patient experience
Future Methods- 2
• Recommendation 5 – Inputs and Deflators
– More work needed to ensure health deflators meet
quality criteria
– More disaggregated approach to measure skill mix
• Recommendation 6 – Triangulation and
Productivity Measurement
– Productivity measured by dividing national accounts output by inputs
– Account should be taken of the changing skill mix of
staff
– Changing balance between grades of doctors
– Migration of treatments from expensive to cheaper
settings
• Recommendation 7 – Satellite accounts
References
• Atkinson Review: Final report (Measurement of Government Output and Productivity for the National Accounts), http://www.statistics.gov.uk/about/data/methodology/specific/publicSector/atkinson/final_report.asp.
Design of Experiments
Dr. Sharad Varde

Major Problem in Experiments
How to control the extraneous factors (nuisance variables) which play along with the cause factor under investigation in the process of influencing the effect factor
Solution #1: Matching Groups
Example: Study the effect of a newly created herbal compound on weight reduction
Group 1 of volunteers: Administer new treatment.
Group 2 of volunteers: No treatment
Extraneous factor: Gender
Solution: If we have 70 female & 30 male volunteers, place 35 women & 15 men in each group
Thus, the effect of gender is uniformly distributed
Change in weight is attributed only to new product.
Further, if we suspect age & affluence as other extraneous factors, we assign different age brackets & wealth brackets to the two groups of volunteers
Thus, control the two extraneous factors too
Problem: Some more extraneous factors may exist. But, we do not know them all.
Safer solution: Randomization.
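The matching idea can be sketched in a few lines: split each stratum of a known extraneous factor (here gender) evenly across the two groups. The function name `matched_groups` and the volunteer records are hypothetical, chosen only to mirror the 70 women and 30 men of the example.

```python
from collections import defaultdict


def matched_groups(volunteers, key):
    """Split volunteers into two groups so each stratum is divided evenly."""
    strata = defaultdict(list)
    for volunteer in volunteers:
        strata[key(volunteer)].append(volunteer)
    group1, group2 = [], []
    for members in strata.values():
        half = len(members) // 2
        group1.extend(members[:half])   # first half of the stratum
        group2.extend(members[half:])   # remainder of the stratum
    return group1, group2


# 70 female and 30 male volunteers, as in the example; each record is (gender, id).
volunteers = [("F", i) for i in range(70)] + [("M", i) for i in range(30)]
g1, g2 = matched_groups(volunteers, key=lambda v: v[0])

women_g1 = sum(1 for v in g1 if v[0] == "F")
men_g1 = sum(1 for v in g1 if v[0] == "M")
print(women_g1, men_g1, len(g2))
```

Matching on further factors (age bracket, wealth bracket) would only need a richer `key`, but it still balances only the factors that are listed.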
Solution #2: Randomization
• Assign 100 volunteers randomly to 2 groups
• Thus, every volunteer has a known & equal chance of being assigned to any group
• Method: Throw 100 names in a basket
• Pick up one, it goes to Gr 1.
• Pick up next name, it goes to Gr 2.
• Pick up a third name, it goes to Gr 1 & so on.
• Or, use standard table of random numbers.
Since every person has equal chance of getting into any group:
• every known & unknown extraneous variable has equal chance of getting into any group
• and hence, all extraneous variables are distributed equally (on average) to the two groups
• So, Group 1 is comparable to Group 2
• Therefore, change in weight can be safely attributed only to the new product.
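The basket method can be mimicked with a shuffle followed by alternate assignment. This is a sketch under the assumption that a pseudo-random shuffle stands in for drawing names; the volunteer labels and the function name `randomize` are hypothetical.

```python
import random


def randomize(names, seed=None):
    """Shuffle the names, then deal them alternately into two groups."""
    rng = random.Random(seed)      # seed only to make the draw repeatable
    shuffled = list(names)
    rng.shuffle(shuffled)          # "throw the names in a basket"
    return shuffled[0::2], shuffled[1::2]   # 1st, 3rd, ... to Gr 1; 2nd, 4th, ... to Gr 2


names = [f"volunteer_{i}" for i in range(100)]
group1, group2 = randomize(names, seed=42)
print(len(group1), len(group2))
```

No list of extraneous factors is needed here, which is the point of the method: balance holds on average, not exactly in every draw.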
Benefits of Randomization
• Effective method to nullify confounding influence of known & unknown extraneous variables over the factor under study
• Thus it controls nuisance of known & unknown extraneous variables
• No need to enlist all extraneous variables
• We can safely generalize the conclusions.

Generalizability of Experiments
• Lab experiments try to establish a cause-&-effect relationship firmly (beyond all doubts) in artificially contrived lab setting
• Field experiment then checks it in real life
• If it confirms the cause-&-effect relation, real life decisions based on this conclusion can be safely made
• It is the field expt that confers generalizability
Design of Experiments
How to DESIGN Experiments?

Are Intelligent Indians Rich?
• To test whether higher intelligence improves income of adult Indians
• Income is Effect Factor Y
• Intelligence is Cause Factor X
• However, family background may also play a role
• But, it is not formally a part of this study
• Hence, family background is an Extraneous Factor.
Terminology
• Effect Factor: A factor (dependent variable) that is to be explained or predicted by the experiment
• Cause Factor: A factor (independent variable) that is expected to influence the Effect Factor
• Extraneous Factors: Other factors (nuisance variables) that may influence the effect factor
• Control: Steps to reduce effect of extraneous factors

Alternative Terminology
In the context of the use of the term ‘Experiment’ in research studies:
• Cause factor is also termed as ‘Treatment’
• Various values of cause factor are called ‘Levels of Treatment’
• Extraneous factors, being cause factors themselves, too are called ‘Treatment’.
Controlling Extraneous Factors
• Ex.: To test whether athletes’ performance in sports events improves with provision of sports coach
• Provide sports coach to one group of randomly chosen athletes
• Do NOT provide sports coach to another group
• Sports performance is Effect (dependent) factor
• Providing a sports coach is Cause factor
• Observe the performance of both groups.

Terminology
• Experimental Group (EG): Experimental units exposed to experimental treatment. (Athletes provided with a sports coach)
• Control Group (CG): A comparable group of similar units that is not exposed to the experimental treatment. (Athletes who are not provided with a sports coach).
Randomness
• Randomness is the essence of an experiment to obtain reliability
• Controls known/unknown extraneous factors
• It equates Experimental Group with Control Group
• Appropriate DESIGN for the experiment ensures accuracy, generalizability & credibility of conclusions.

Two Types of Experimental Designs
A. Elementary Designs: One Cause Factor
B. Advanced Designs: Many cause factors i.e. several treatments
A. Elementary Designs
A.1: Randomized Two Group Design
A.2: Before & After Two Group Design
A.3: Solomon Four Group Design

A.1: Randomized Two Group Design
R    EG:   X      O1
     CG:   No X   O2
• Randomly assign experimental units to EG & CG
• No pre-test measurements are taken
• Expose EG to cause factor (treatment) X
• Note post-test values of effect factor for EG & CG
• Treatment Effect is O1 – O2 (i.e. Avg. O1 – Avg. O2)
• Applicable when test units are homogeneous.
An Example
• To evaluate efficacy of a protein supplement product
• A random sample of 40 children is selected
• Assigned randomly to EG & CG (Toss a coin & assign)
• Children in EG are asked to take product for one month
• Children in CG are not given any product (or are given a placebo)
• After experiment, health check up performed on both groups and findings are recorded
• Difference indicates effectiveness of the product

A.2: Before & After Two Group Design
R    EG:   O1   X      O2
     CG:   O3   No X   O4
• Describe the design.
• What is the treatment effect?
A.2: Before & After Two Group Design
• Randomly assign test units to EG & CG
• Note pre-test values of effect factor: O1 & O3
• Expose EG to cause factor (treatment) X
• Note post-test values of effect factor: O2, O4
• Treatment Effect is: (O2 – O1) – (O4 – O3)
• This design controls extraneous factors additionally due to pre-test/post-test method.

Application in Community Health
• Select sample of villagers at random
• Talk to each one and note their attitude towards personal hygiene
• Randomly assign half of them to EG. Rest form CG.
• Only for EG, conduct the health education program
• Talk to EG and CG persons again and note their attitude towards personal hygiene
• Effectiveness measure is (O2 – O1) – (O4 – O3).
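The effectiveness measure (O2 – O1) – (O4 – O3) can be computed directly from group averages. The attitude scores below are hypothetical, and the function name `before_after_effect` is an illustrative assumption.

```python
from statistics import mean


def before_after_effect(eg_pre, eg_post, cg_pre, cg_post):
    """Treatment effect for the before-and-after two group design."""
    o1, o2 = mean(eg_pre), mean(eg_post)   # experimental group: pre, post
    o3, o4 = mean(cg_pre), mean(cg_post)   # control group: pre, post
    return (o2 - o1) - (o4 - o3)


# Hypothetical attitude scores on a 1-10 scale.
eg_pre, eg_post = [4, 5, 6, 5], [7, 8, 8, 7]
cg_pre, cg_post = [5, 5, 4, 6], [6, 5, 5, 6]

effect = before_after_effect(eg_pre, eg_post, cg_pre, cg_post)
print(effect)
```

Subtracting the control group's change removes whatever shift would have happened without the treatment, such as the effect of simply being interviewed twice.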
A.3: Solomon Four Group Design
R    EG1:   O1   X      O2
     CG1:   O3   No X   O4
     EG2:        X      O5
     CG2:        No X   O6
• Combination of design A.1 & design A.2
• Addresses all extraneous factors.
Several Effectiveness measures:
• O2 – O1
• O2 – O4
• O5 – O6
• (O2 – O1) – (O4 – O3)
• (O5 – O1) – (O6 – O3) etc.
If all are significantly large, cause-&-effect relationship is firmly established.
Characteristics of Advanced (Statistical) Designs
• One Effect Factor (Dependent Variable): It MUST be Measurable, Cardinal, Metric
• One or more Cause Factors (Independent Variable): They MUST be Non-metric, Nominal, Ordinal, Categorical.

The World of Numbers
A trivial but important reality:
• All numbers are not of the same type
• All numbers can not be subjected to identical treatment during their analysis
• Like in medical field, wrong treatment leads to disastrous consequences
So, let us tour the world of numbers.
Types of Numbers
• Nominal Numbers
• Ordinal Numbers
• Cardinal Numbers

Nominal Numbers
• Purpose: Identification of an Object
• Example: House Number (10 Janpath), Your Cellphone Number, Smart Card PIN Number, Number on Cricket T-Shirt
• Property: Equivalence: Two Different Nominal Numbers Indicate Two Different Objects
• They possess no Quantitative Properties.
Ordinal Numbers
• Purpose: Represent Position or Ranking
• Example: India’s ranking in world trade, Exam grade (1, 2, 3, . . ), Floor number
• Properties: Equivalence & Order: Different Ordinal Numbers Indicate Different Objects in Some Kind of Relationship with Each Other
• No Quantitative Properties
• Nominal & Ordinal numbers are also called Non-Metric or Categorical.

Cardinal Numbers
• Purpose: Represent Quantity
• Example: Sales Turnover (Rs in Crores), Production in Tons, Your Marks in Exams, Earning Per Share (EPS)
• Properties: Equivalence, Order, Quantity.
• They Possess All Mathematical Properties: Order, Equivalence, Addition, Subtraction, Multiplication, Division . . .
• Cardinal Numbers are Truly Quantitative
• They are also called ‘METRIC’.

Cardinal Numbers
You can comfortably and validly:
• Add them,
• Subtract, multiply, divide them
• Take square roots, raise to a power, log
• Develop mathematical models
• Employ statistical techniques
• Analyze, interpret, and make decisions.
Example
Zone | Code No. | Sales (Rs. in Crores) | Rank
Northern | 01 | 483 | 3
Western | 02 | 738 | 1
Eastern | 03 | 265 | 4
Southern | 04 | 567 | 2
Type | Nominal | Cardinal | Ordinal

Example
S/W Version | Shoe Size | PIN Code | Shirt Size
3.0 | 5 | 110001 | 38
4.2 | 6 | 307429 | 40
5.1 | 7 | 400004 | 42
6.3 | 8 | 411002 | 44
Type: Ordinal | Cardinal | Nominal | Ordinal

Quantitative Techniques
• Most of the quantitative techniques are meant ONLY for cardinal numbers.
• Never use them on nominal/ordinal nos.
• A few methods, called Non-Parametric Techniques, are especially developed to analyze ordinal numbers like ranks.
• Use them instead of wrongly using cardinal techniques in such situations.

Handling Numbers
When you master numbers, you will no longer be reading numbers, any more than you read words in a book. You will be, in fact, reading meanings.
- W. E. B. Du Bois, American sociologist, historian & educator
Price Indices in National Accounting
D. Narayana
Lecture 1-IGIDR
19 October 2010

Outline of the Lectures
• At the conceptual level
– The issue of price indices in National Income Accounting
– The problem of measuring services
• If time permits and you are interested
– The issue of advance estimates of National Income and components
– The Statistical Strengthening System Project

Defining GDP
• GDP combines in a single figure, and with no double counting, all the output (or production) carried out by all the firms, non-profit institutions, government bodies and households in a given country during a given period, regardless of the type of goods and services produced, provided that the production takes place within the country's economic territory.
GDP
• GDP = ∑ value added
• Of each firm, govt institution, producing household in a given country
• GDP = ∑ outputs - ∑ intermediate consumption
• GDP independent of pattern of organisation
• Avoids double counting
GDP and Other Aggregates
• Gross means inclusive of consumption of
fixed capital (=> Net domestic product)
• Domestic vs National
• GDP- Output produced within the territory
• GNI- Total income of all economic agents residing within the territory
• Difference- earning of workers living in one
country working elsewhere, interest paid
on investments
Table 2. Reconciliation of GDP and GNI for Germany, Luxembourg and Ireland
Millions of euros, Year 2003
Item | Germany | Luxembourg | Ireland
Gross domestic product | 2 128 200 | 23 956 | 134 786
+ primary income (including earnings) received from the rest of the world | +104 610 | +52 972 | +30 296
– primary income (including earnings) paid to the rest of the world | –118 630 | –55 722 | –52 139
= Gross national income | 2 114 180 | 21 206 | 112 943
Difference between GDP and GNI (%) | –0.7 | –11.5 | –16.2
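The reconciliation in Table 2 is simple arithmetic and can be checked with a short sketch. The figures are those of the table; the function names `gni` and `gap_percent` are illustrative assumptions.

```python
# (GDP, primary income received, primary income paid), millions of euros, 2003.
data = {
    "Germany": (2_128_200, 104_610, 118_630),
    "Luxembourg": (23_956, 52_972, 55_722),
    "Ireland": (134_786, 30_296, 52_139),
}


def gni(gdp, received, paid):
    """GNI = GDP + primary income received from abroad - primary income paid abroad."""
    return gdp + received - paid


def gap_percent(gdp, gni_value):
    """Difference between GNI and GDP as a percentage of GDP."""
    return 100.0 * (gni_value - gdp) / gdp


results = {}
for country, (gdp, received, paid) in data.items():
    value = gni(gdp, received, paid)
    results[country] = (value, round(gap_percent(gdp, value), 1))
    print(country, *results[country])
```

The large negative gaps for Luxembourg and Ireland reflect income paid to non-residents, which counts in GDP but not in GNI.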
Reconciling global output and demand
One of the Fundamental Equations of NIA
• GDP = Sum of final demand aggregates
• GDP + Imports = Household consumption + GCF + Exports
• GDP = Household consumption + GCF + Net Exports

Table 7. GDP: expenditure approach, Germany, 2004 (a)
Codes | Item | Million euros | % of GDP
GDP | Gross domestic product | 2 177 000 |
P3 | Total final consumption | 1 677 450 |
P31-S14 | HH final consumption expenditure | 1 225 870 | 56.3
P31-S15 | Final consumption of NPISHs | 44 900 | 2.1
P31-S13 | General government final consumption expenditure | 406 680 | 18.7
P5 | Gross capital formation | 385 480 |
P51 | Gross fixed capital formation | 378 550 | 17.4
P52 | Changes in inventories | 6 930 |
B11 | External balance of goods and services | 114 070 |
P6 | Exports | 834 820 | 38.3
P7 | Imports | 720 750 | 33.1
(a) This table shows the official SNA codes, which the reader can find on the website accompanying this book. These codes facilitate the understanding and manipulation of the data.
Reconciling global output and income
Fundamental equation
• Output (sum of the values added) = Income (employees’ salaries + company profits) = Final demand (Consumption + GCF + Net exports)
• Three ways to measure GDP- (i) the output approach; (ii) the final demand approach; (iii) the income approach

Table 9. The three approaches to GDP, Germany, billion euros
Codes | Item | 1991 | 2004
GDP | Gross domestic product (output approach) | 1 502.2 | 2 177.0
B1B | Value added at base-year prices | 1 359.5 | 1 965.1
D21 | + taxes net of subsidies on the products | 142.7 | 211.9
GDP | Gross domestic product (demand approach) | 1 502.2 | 2 177.0
P3 | Final consumption expenditure | 1 140.9 | 1 677.5
P5 | + Gross capital formation | 364.9 | 385.5
P6 | + Exports of goods and services | 395.2 | 834.8
P7 | – Imports of goods and services | 398.7 | 720.8
GDP | Gross domestic product (income approach) | 1 502.2 | 2 177.0
D1 | Compensation of employees | 844.0 | 1 133.1
B2 + B3 | + Gross operating surplus and gross mixed income | 515.1 | 811.9
D2 | + Taxes net of subsidies on production and imports | 143.1 | 232.1
These are the official SNA codes
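As a quick check on Table 9, the three approaches can be added up for 2004. The figures are from the table (billion euros); the variable names are illustrative, and the small discrepancy in the income approach comes from rounding in the published components.

```python
# Germany, 2004, billion euros (Table 9).
gdp_published = 2177.0

# Output approach: value added plus taxes net of subsidies on products.
gdp_output = 1965.1 + 211.9

# Demand approach: final consumption + gross capital formation + exports - imports.
gdp_demand = 1677.5 + 385.5 + 834.8 - 720.8

# Income approach: compensation of employees + operating surplus and mixed income
# + taxes net of subsidies on production and imports.
gdp_income = 1133.1 + 811.9 + 232.1

for name, value in [("output", gdp_output), ("demand", gdp_demand), ("income", gdp_income)]:
    print(name, round(value, 1), round(value - gdp_published, 1))
```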
Growth of GDP
Average annual % GDP growth, 1980-2003, current prices
Netherlands | +4.6
Mexico | +37.1
Turkey | +62.3
• GDP at market Prices
• Over 1980-2003 for Netherlands, Mexico, and Turkey
• Formidable growth of Turkey and Mexico cf. Netherlands?
• The trap of inflation, or current prices
• Need to separate out growth in volume from changing prices

Table 1. GDP, volume and price indices
Average annual growth in percentage, 1980-2003
Country | Volume | Prices
Netherlands | +2.3 | +2.3
Mexico | +2.4 | +33.9
Turkey | +4.1 | +60.0
Source: OECD (2006), National Accounts of OECD Countries, Volume I, Main Aggregates, 1993-2004, 2006 Edition, OECD, Paris.
StatLink: http://dx.doi.org/10.1787/508232480000

Table 2. GDP per capita, 2003
Country | In US dollars | Netherlands = 100
Netherlands | 31 602 | 100.0
Mexico | 6 091 | 19.3
Turkey | 3 385 | 10.7
Source: OECD (2006), National Accounts of OECD Countries: Volume I, Main Aggregates, 1993-2004, 2006 Edition, OECD, Paris.
Volume measure Needed
•If only one product, problem solved easily
•Multitude of products- how to aggregate?
•Answer - in prices
•Problem with prices
•Prices change along with volumes.
•How to separate them out?
•E.g. Economy with two cars, small and large .
•Numbers produced Qst Qlt & Qst’Qlt’
•Price at period t, Ps , Pl
•Volume of production: Qst Ps + Qlt Pl , Qst’ Ps + Qlt’ Pl
•Constant price accounting.
Quantities Vs Volume
• Is quantity same as volume?
• Car economy
– 80 small cars 20 large in Y1
– 50 small and 50 large in Y2
• Quantity 100 in both years; no change.
• Is the volume same in the two years?
• Suppose, Ps = 1 Pl = 2
• Volume 1 = 80 x 1+ 20 x 2 = 120
• Volume 2 = 50 x 1 + 50 x 2 = 150
• Volume increased by 25%.
• Volume takes into account quality.
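The car economy can be worked through in a few lines, using the quantities and prices given on the slide. The helper name `volume` is an illustrative assumption.

```python
prices = {"small": 1, "large": 2}      # Ps = 1, Pl = 2
year1 = {"small": 80, "large": 20}     # quantities in Y1
year2 = {"small": 50, "large": 50}     # quantities in Y2


def volume(quantities, prices):
    """Volume = quantities valued at fixed prices."""
    return sum(quantities[k] * prices[k] for k in quantities)


quantity1, quantity2 = sum(year1.values()), sum(year2.values())
volume1, volume2 = volume(year1, prices), volume(year2, prices)
growth = 100.0 * (volume2 - volume1) / volume1

print(quantity1, quantity2)      # number of cars in each year
print(volume1, volume2, growth)  # volumes and percentage growth
```

The count of cars is unchanged, but the shift towards the higher-priced car raises volume, which is how the price weights capture the change in mix.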
Volume Indices Vs Price Indices
• Volume Index = Weighted average of changes in the quantities, weights being Prices
• Price Index = Weighted average of changes in prices, weights being quantities
• Quantity ratio and price ratio- $q_t/q_0$ and $p_t/p_0$
• Most national accounts systems use
– Laspeyres indices to calculate volumes
– Paasche indices to calculate change in prices
• Weighted average of quantity or price ratios
• The period providing base period = reference period
• By convention, reference period = 100

Laspeyres Index
• $v_{ij} = p_{ij} q_{ij}$: value at current prices of product i in period j.
• Laspeyres volume index
$$L_q = \frac{\sum_i v_{i0}\,(q_{it}/q_{i0})}{\sum_i v_{i0}} \qquad (1)$$
• Eq (1) can be rewritten as,
$$L_q = \frac{\sum p_0 q_t}{\sum p_0 q_0} \qquad (2)$$
• Paasche index, harmonic mean of price ratios
$$P_p = \frac{\sum v_t}{\sum v_t\,(p_0/p_t)} = \frac{\sum p_t q_t}{\sum p_0 q_t} \qquad (3)$$
• Multiplying the two,
$$L_q \times P_p = \frac{\sum p_0 q_t}{\sum p_0 q_0} \cdot \frac{\sum p_t q_t}{\sum p_0 q_t} = \frac{\sum p_t q_t}{\sum p_0 q_0} = \frac{\sum v_t}{\sum v_0} \qquad (4)$$
• Eq (4) is the fundamental equation introduced earlier.
• This is generally used to arrive at volume index.
$$L_q = \frac{\sum v_t}{\sum v_0 \, P_p} \qquad (5)$$
• Easier to get price indices (Pp) and GDP at current prices.

Constant prices
• Laspeyres volume index
$$\frac{\sum p_0 q_0}{\sum p_0 q_0},\ \frac{\sum p_0 q_1}{\sum p_0 q_0},\ \frac{\sum p_0 q_2}{\sum p_0 q_0},\ \ldots \qquad (6)$$
• Multiply by $\sum p_0 q_0$
$$\sum p_0 q_0,\ \sum p_0 q_1,\ \ldots,\ \sum p_0 q_t \qquad (7)$$
• Constant price series as price structure of the fixed period used
• One great advantage- additive
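The relation between the Laspeyres volume index, the Paasche price index and the change in current-price value can be checked numerically. This sketch reuses the car economy quantities and base prices; the period-t prices are hypothetical, and the function names are illustrative assumptions.

```python
import math


def laspeyres_volume(p0, q0, qt):
    """Volume index: period-t quantities over base quantities, both at base prices."""
    return sum(p0[k] * qt[k] for k in p0) / sum(p0[k] * q0[k] for k in p0)


def paasche_price(p0, pt, qt):
    """Price index: period-t prices over base prices, both weighted by period-t quantities."""
    return sum(pt[k] * qt[k] for k in p0) / sum(p0[k] * qt[k] for k in p0)


p0 = {"small": 1.0, "large": 2.0}
pt = {"small": 1.1, "large": 2.4}      # hypothetical period-t prices
q0 = {"small": 80, "large": 20}
qt = {"small": 50, "large": 50}

lq = laspeyres_volume(p0, q0, qt)
pp = paasche_price(p0, pt, qt)
value_ratio = sum(pt[k] * qt[k] for k in pt) / sum(p0[k] * q0[k] for k in p0)

print(lq, pp, value_ratio)
print(math.isclose(lq * pp, value_ratio))   # volume index times price index = value ratio
```

In practice the identity is used the other way round: the value ratio and the price index are observed, and the volume index is obtained by division.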
Problems with constant prices
• Constant prices – choice of a fixed year
• Using price structures remote from the current
structure
• The problem of computers and mobiles
• Indian case, 1999-00 prices
• Computers very expensive
– 4 GB hard disk IBM cost me 70,000
– Today, 300GB costs 20,000
• Value at 1999-00 constant prices?
• Volume overstated?
• Price decrease understated?
Chained Accounts
• Three stages:
1. Accounts calculated at prices of previous year
2. Chain these changes
– Multiply each one by subsequent one
– Obtain series of growth rates
3. Multiply by value of the accounts at the reference year price.
• Advantage
– Price structure more relevant
• Called Laspeyres chains
• (Fisher chains- average of previous and current year prices)

Figure 1. Difference between constant 1980 prices and chained prices. France, computers and other materials (series: constant 1980 prices, chained prices; 1980 to 1998)
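The three stages can be sketched for a two-product economy in which one product's price falls while its quantity grows quickly. All figures are hypothetical, chosen only to show how a fixed base year gives a higher volume growth than chaining; the function names are illustrative assumptions.

```python
def fixed_base_index(prices, quantities):
    """Volume index at constant base-year (year 0) prices, year 0 = 100."""
    p0 = prices[0]
    base = sum(p0[k] * quantities[0][k] for k in p0)
    return [100.0 * sum(p0[k] * q[k] for k in p0) / base for q in quantities]


def chained_index(prices, quantities):
    """Laspeyres chain: each year's growth is measured at previous-year prices."""
    index = [100.0]
    for t in range(1, len(quantities)):
        p_prev = prices[t - 1]
        numerator = sum(p_prev[k] * quantities[t][k] for k in p_prev)
        denominator = sum(p_prev[k] * quantities[t - 1][k] for k in p_prev)
        index.append(index[-1] * numerator / denominator)   # chain the annual links
    return index


# Hypothetical data: computer prices fall while computer output doubles each year.
prices = [
    {"computers": 10.0, "other": 1.0},
    {"computers": 5.0, "other": 1.0},
    {"computers": 2.0, "other": 1.0},
]
quantities = [
    {"computers": 10, "other": 100},
    {"computers": 20, "other": 100},
    {"computers": 40, "other": 100},
]

fixed = fixed_base_index(prices, quantities)
chained = chained_index(prices, quantities)
print(fixed)
print(chained)
```

Multiplying the chained index by the reference-year value (divided by 100) gives the chained volume series in currency units, which is stage 3.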
Difference- US
• Difference great as seen for France above
• Similar differences for US
• US GDP growth between 2001-03
• 4.3% at constant prices
• 2.7% at chained prices
• Difference largely from computers
• Whose production increased greatly

Consequences of chaining
• Chain linking adopted by US, OECD
• Advantage- More accurate volume growth rates.
• Drawback
– Loss of additivity - Eq 5 breaks down!
– Accounting identities do not hold- rather their growth rates cannot be decomposed
• Second fundamental equation, GDP = C + GCF + X - M does not hold
• An additional residual term with no economic interpretation
Another major problem
• Volume vs Quantity
• Price used to combine diverse products
• Diverse quality within a product group
• Price differences reflect quality differences is the assumption
• What happens when quality improves but price falls?
• Is there a way out?

References
• Lequiller, F., and Blades, D. 2006. Understanding National Accounts. OECD, Paris, http://www.eastafritac.org/images/uploads/documents_storage/Understanding_National_Accounts_-_OECD.pdf
• CSO. National Accounts Statistics, Sources and Methods 2007, http://mospi.nic.in/rept%20_%20pubn/ftest.asp?rept_id=nad09_2007&type=NSSO (password may be required for this at mospi.nic.in site.)
Design of Experiments
Dr. Sharad Varde

Sharad Varde
M. Sc. (Statistics); Ph. D. (Operations Research)
Planning & Strategy Faculty in NIBM
VP: Swedish Match (International Business)
CEO / MD: Bhor Ind., Kamala Group, Cyber Agro
Sectors: Banking, Engineering, Packaging, Textile, Info-System Security, e-Commerce, Food, Plastics
External Academic Expert: Univ of Warwick, UK
Co-Chairman, Food Prssg & Agri-Business: IMC.
A New Real Life Problem
Is Bt Brinjal a potential health hazard, a soil quality destroyer, and an anti-farmer innovation?

Traditional Real Life Problem
Is the new formulation of fertilizer a decisively superior option for wheat growers in North India in terms of cost, quality and yield?
A Corporate Problem
Newly appointed VP (Finance) strongly believes that staffing his accounts dept only with CAs (instead of commerce & management graduates) will grossly improve its performance.
On what basis can the Company accept or reject his idea?

A Public Utility Problem
• BEST wants to increase occupancy.
• Should it drop fares by 5%, 7.5%, or 10%?
• On all days, only weekends, or weekdays?
• For AC buses, express, or ordinary buses?
• In city, west suburbs, or east suburbs?
A Union Budget Problem
• Would a 10% reduction in excise duty lead to more than 10% increase in demand & production?
• For which product categories?
• For large companies or SMEs or both?

A Public Health Problem
- Does pollution from pesticides increase human chest size?
- Do pesticides create extra oestrogen in the human body which then attempts to disturb hormonal development?
Real Life Problems
These problems are too important to be tackled in a naïve manner.
They need a systematic research study.

Research Process
Identify broad area of research → Gather preliminary data → Define research problem → Identify important factors (variables) → Generate hypotheses → PREPARE RESEARCH DESIGN → Collect data, analyse & interpret → Draw conclusions → Write report → Present report for research-based decision making.
Scientific Research Design

Major Elements of Research Study
A. Purpose of research study
B. Type of research investigation
C. Extent of researcher interference
D. Study setting.
A. Purpose of Research Study
1. Exploratory
2. Conclusive:
a. Descriptive
b. Causal

A.1: Exploratory Research
Example: Manpower planning for a Danish Group’s new plant in Ahmednagar. No prior knowledge of local work ethics. Important to find it first.
When research area is virgin: no past info
Extensive preliminary work needs to be done
Objective: To better comprehend problem
Lead to rigorous design for in-depth study
Tools: informal discussions with people, in-depth interviews, focus groups, case studies, literature review, secondary data.
A.2: Descriptive Research
Example: A bank wants to know profile of credit card payment defaulters
Descriptive research is done when characteristics of the variables of interest are known, but they need to be profiled in detail for better understanding
Objective: To understand magnitude of problem precisely: Find out WHO, WHAT, WHEN, WHERE
It may reveal causal relationships
Inputs: secondary and/or primary data.

A.3: Causal
Example: A firm wants to find out whether doubling of its advertising budget would significantly increase sales & profit
When nature & quantum of relationships among variables must be unearthed
Inputs: specially collected massive data.
Purpose of Research Study
Methodological rigour increases from exploratory to causal research
Hence, more cost and time
But, results are more reliable, precise and generalizable
Resultant decisions are more realistic.

B. Type of Research Investigation
1. Correlational
2. Cause & Effect
B. Type of Research Investigation
Correlational: Are teeth quality and blood sugar levels related?
Is there a relationship between income and intelligence of adult Indians?
Are TV viewing and insomnia related?
Do men who buy denims also buy sunglasses?
Conducted in natural environment
Minimal interference by researcher.

B. Type of Research Investigation
Cause & Effect: Do aerated soft drinks cause digestive disorders?
Does corporate downsizing influence performance of the surviving staff?
Researcher manipulates situation to study effects of changes in cause factors on the variable of interest.
C. Extent of Researcher Interference
1. Minimal
2. Moderate
3. Excessive

Researcher Interference: Example
Minimal: To check whether stress on nurses and emotional support given by doctors to them are correlated
Questionnaire to nurses on both factors
No interference by researcher in normal functioning of hospital beyond administering questionnaire.
Researcher Interference: Example
Moderate: To discover a ‘cause-and-effect relationship’ between support & stress
Three groups of nurses chosen for study
Group1: Those who say they get full support from doctors throughout duty hours
Group2: Cursory support given by doctors
Group3: No support provided at all.

Researcher Interference: Example
Excessive: To firmly establish direct ‘cause-and-effect relationship’
Three groups of sensitive nurses are chosen
Only troublesome patients chosen for study
Doctors asked to Step in to help (Gr1) / Give partial solace (Gr2) / Ignore (Gr3)
After a week, administer questionnaire.
D. Study Setting
Studies involving researcher influence of moderate or excessive nature are called Contrived Studies
They are conducted in two formats:
1. Field Experiments
2. Lab Experiments.

What is an Experiment in Research?
Research method in which we change values of cause factor to measure its exact influence over effect factor.
Contrived ‘Field Experiments’
Cause & effect relationship studies conducted in natural environment with moderate interference by researcher
Example: Study effect of a newly created herbal compound on weight reduction
Group 1 of volunteers: Administer new treatment
Group 2 of volunteers: No treatment
But, their lifestyles could also influence results.

Contrived ‘Lab Experiments’
Cause & effect relationship studies beyond possibility of least doubt
Create artificial contrived environment
Choose other subjects to respond to manipulated stimuli
E.g.: Administer treatment to mice or pigs
Control extraneous factors.
Process of Experiments
Usually Lab Experiments are conducted first so that real life setting is not disturbed
Hypotheses are tested & conclusions drawn
Then a Field Experiment is conducted to confirm (or reject) the tested hypotheses in real life setting with moderate interference
Final conclusions are reached.

Methodology of an Experiment
Study: Does reduction in air fare (X) improve occupancy (Y) in holiday resorts?
Two factors: Cause Factor X, Effect Factor Y
Specific changes are induced in X
Resultant changes in Y are observed
If every change in X causes change in Y, then we conclude that X is causal to Y.
Logic of Experiments
1. X will always occur before Y
2. Changes in X will cause changes in Y
3. To infer that X causes Y, other possible causes (extraneous factors) must NOT exist
Security perception, economic sentiments too influence holiday plans. They are the extraneous factors.

Logic of Experiments
• So, all such extraneous (i.e. nuisance) factors must be held constant and
• Their effects neutralized by controlling the situation somehow
• In Lab Experiments it is controlled artificially
• In Field Experiments it is controlled with help of clever Designs of Experiments.
Design of Experiments
How to DESIGN Experiments?
How to Select Appropriate Sampling Design
Sampling
Dr. Sharad Varde

Choice Points in Sampling Design
Question: Is representativeness of the sample and generalizability of the conclusions critical for research study?
If yes, select a probability design.
If not, we can select an appropriate non-probability sampling design.

For Non-probability Sampling Design
If purpose of research is to get quick but even partially reliable info, choose convenience sampling design
If purpose is to extract info available with only a few of the elements, choose purposive sampling design.

For Non-probability Sampling Design
If researcher needs to use personal judgment about who would be the best respondents to serve the purpose, it is judgment sampling design
If researcher has to resort to asking respondents to suggest further interviewees, it is snowball or referral sampling design.
For Probability Sampling Design
There are two methods of selecting elements of the population for inclusion in the sample:
A simple random sampling design
Or, systematic sampling design if population is serially ordered or elements emerge serially.

For Probability Sampling Design
If population naturally consists of several mutually exclusive groups (strata) that are dissimilar to each other (i.e. they have homogeneity within each stratum & heterogeneity between strata) and if the purpose is to assess these differences, select stratified random sampling design.
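The two selection methods named above can be sketched in a few lines. This is only an illustration: the population of 100 serial numbers, the sample size of 10 and the function names are assumptions made for the example, not part of the lecture.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements, each with the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def systematic_sample(population, n, seed=None):
    """Pick a random start, then take every k-th element of an ordered list."""
    k = len(population) // n      # sampling interval
    rng = random.Random(seed)
    start = rng.randrange(k)      # random start inside the first interval
    return [population[start + i * k] for i in range(n)]


# Hypothetical sampling frame: 100 serially numbered elements.
population = list(range(1, 101))
srs = simple_random_sample(population, 10, seed=1)
sys_sample = systematic_sample(population, 10, seed=1)
print("Simple random:", sorted(srs))
print("Systematic:   ", sys_sample)
```

Systematic sampling needs only one random draw (the start), which is why it suits populations that are already serially ordered.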
In Stratified Random Sampling
If all strata have nearly equal number of elements, choose proportionate stratified random sampling design
If some strata are too large or some too small, choose disproportionate stratified random sampling design.

For Probability Sampling Design
If population naturally consists of several groups (clusters) that are similar to each other (i.e. inter-group homogeneity and intra-group heterogeneity) and if cost & time budget is small, opt for cluster sampling design.
For Probability Sampling Design
If we need preliminary info on some parameters of the population immediately followed by detailed info on some other parameters, and if cost & time budgets do not permit drawing a fresh sample for detailed second study, select double sampling design.

Important Issues in Sampling
1. Sampling Design: Precisely how to draw a sample from the population
2. Sample Size n: How many elements of the population to be selected to form a representative sample
Both depend upon cost & time budget of the study, and desired reliability of conclusions (confidence & precision).
Sampling Error
Whatever be sampling design, sample estimate will inevitably differ from actual parameter of population.
This difference is called ‘Sampling Error’
Larger the sample, smaller the sampling error
We must know the sampling error of our sampling design so as to understand reliability of estimates.

Two Important Concepts:
Precision & Confidence
Concept of Precision
It refers to how close our estimate of a population characteristic (say, average mileage before car battery fails) derived from a sample is to true population characteristic
Rarely in practice, we make ‘point estimates’ (such as 36 months). Usually, we declare a range (36 months ± 2 months i.e. 34 – 38 months). It is called ‘interval estimate’
Narrower this interval, greater the precision.

Precision
It is a function of the range of variability in the probability distribution of sample mean
It is measured by the standard error S = s/√n, where s is standard deviation of sample and n is sample size
Large sample size n means low std error S
Low std error S means high precision.
Concept of Confidence
It denotes how certain we are that the estimate of a population parameter derived from our sample is within the desired range (say, ±5%) of the true (but unknown) value of the population parameter
It is the probability (expressed in % form) that the sample estimate lies within the desired range of population parameter.

Confidence
Widest range: 0 – ∞ gives 100% confidence
Wider the range, higher the confidence
Wider the range, lower the precision
Higher confidence goes with lower precision
We need a trade off between them.
A Numerical Example
Study: Find food bill value per college girl
Sample of 64 college girls at the gate of CCD
Sample Mean x̄ = Rs.105. Std Dev. s = 10
Confidence interval for pop mean μ = x̄ ± ZS
where Z is ‘z score’ for standard normal distribution for the desired confidence.

Standard Normal Distribution
For 90% confidence level, Z = 1.645
For 95% confidence level, Z = 1.96
For 99% confidence level, Z = 2.576

A Numerical Example
So, standard error S = s/√n = 10 / √64 = 1.25
For 90% confidence level, interval estimate is
μ = 105 ± 1.645 (1.25) = 102.944 – 107.056
For 99% confidence level, interval estimate is
μ = 105 ± 2.576 (1.25) = 101.780 – 108.220
Higher confidence goes with wider interval i.e. with lower precision.

Trade Off
Thus, we can use the formula μ = x̄ ± ZS
To increase or decrease original confidence level and determine precision level, or
To increase or decrease original precision level and determine confidence level
This is the trade off between precision level and confidence level.
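The numerical example above can be reproduced with a short script. It is a minimal sketch that uses the sample figures and z scores from the slides; the function name confidence_interval is an assumption introduced for the example.

```python
import math


def confidence_interval(mean, std_dev, n, z):
    """Return the standard error S = s / sqrt(n) and the interval mean ± z * S."""
    se = std_dev / math.sqrt(n)
    margin = z * se
    return se, (mean - margin, mean + margin)


# z scores quoted in the slides for each confidence level
Z_SCORES = {90: 1.645, 95: 1.96, 99: 2.576}

# Sample from the example: n = 64, mean = Rs.105, s = 10
for level, z in Z_SCORES.items():
    se, (low, high) = confidence_interval(105, 10, 64, z)
    print(f"{level}% confidence: S = {se:.2f}, interval = {low:.3f} to {high:.3f}")
```

Moving from 90% to 99% confidence widens the interval, which is the precision and confidence trade off described above.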
Sample Size

Formula for Sample Size ‘n’
n = (Zσ / e)²
Where, Z is z score for standard normal distribution for the desired confidence level
σ is population standard deviation
e is tolerable margin of error (precision level)
Thus, higher confidence level ≡ higher Z ≡ bigger sample size
Higher precision level ≡ narrower margin of error ≡ bigger the sample size.
An Example
Jet Airways wants to be 95% confident of an estimate of average number of customers per weekday within a range of ± 500
A recent sample study of average number of passengers per weekday showed a std dev of 3500
So, Z = 1.96, σ = 3500, e = 500
Sample size n = (1.96 × 3500 / 500)² = 188.

Formula for Sample Size ‘n’
This formula n = (Zσ / e)² does not consider population size N
But, often we do not know the pop size
Or, population is too huge to enumerate
So, this formula is used when N is unknown.
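The Jet Airways calculation can be checked as follows. This is a small sketch with an assumed function name; it also shows the unrounded value, since the slide reports 188 while rounding up (the conservative choice when a margin of error must not be exceeded) gives 189.

```python
import math


def sample_size(z, sigma, e):
    """Sample size n = (z * sigma / e)^2 for an unknown or very large population."""
    return (z * sigma / e) ** 2


# Figures from the example: 95% confidence, sigma = 3500, margin of error = 500
n_raw = sample_size(1.96, 3500, 500)
n_rounded_up = math.ceil(n_raw)
print(f"unrounded n = {n_raw:.4f}, rounded up = {n_rounded_up}")
```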
Formula for Sample Size ‘n’
But, if population size N is small and known, the corrected formula for n is:
n = Nσ² / {N(e/Z)² – (e/Z)² + σ²}
This is the required sample size that incorporates population size N
It is used when pop is not too large.

Real Life Problems
The formula n = (Zσ / e)² depends on population standard deviation σ
Problem: How do we find σ?
Solution: Look for any study conducted in recent past on the same population to estimate σ or do an exploratory study based on a small sample.
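A sketch of the finite population adjustment follows. It uses the standard correction n = n0 / (1 + (n0 - 1) / N) with n0 = (Zσ/e)², which is algebraically the same as Nσ² / {(N - 1)(e/Z)² + σ²}. The population size N = 1000 is a hypothetical value chosen only for illustration; the other figures come from the Jet Airways example.

```python
def sample_size_finite(z, sigma, e, population_size):
    """Sample size corrected for a small, known population of size N."""
    n0 = (z * sigma / e) ** 2                    # size for an unknown population
    return n0 / (1 + (n0 - 1) / population_size)


z, sigma, e = 1.96, 3500, 500
n_small_pop = sample_size_finite(z, sigma, e, population_size=1000)   # hypothetical N
n_huge_pop = sample_size_finite(z, sigma, e, population_size=10**9)
print(f"N = 1000: n = {n_small_pop:.2f}")
print(f"N = 10^9: n = {n_huge_pop:.2f}")
```

With a very large N the corrected size approaches (Zσ/e)², which is why the simpler formula is used when N is unknown or huge.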
Real Life Problems
Problem: Difficult to get permission to conduct two studies. Also, accuracy of estimate of σ from small sample exploratory study is questionable
Solution: Take σ as {range / 6} i.e. (largest element minus smallest element) / 6
Problem: How do we get largest and smallest elements of the population if it is not readily available in ordered format?

Solution in Real Life Situation
1. Derive sample size from cost & time budget
2. Carry out sample study
3. Ask for confidence level desired (We get Z)
4. Use formula μ = x̄ ± ZS to compute precision S
5. If this precision S is not at desired level, revise confidence level (We get a fresh Z)
6. Compute precision level S afresh
7. Carry on iterations till both are satisfactory.
Solution in Real Life Situation
Or, ask for precision level S desired
Use formula to compute confidence Z
If confidence is not at desired level, revise precision level
Compute confidence level afresh
Carry on iterations till both are satisfactory.

Stratified Random Sampling
The formula n = (Zσ / e)² is for Simple Random Sampling & Systematic Sampling
For Stratified Random Sampling it is:
n = (Z / e)² Σ Wi σi²
where σi is standard deviation of ith stratum of population (i = 1, 2, . . . , k) that consists of k strata & Wi is weight attached to ith stratum
Formula for Sample Size ‘n’
In Stratified Proportionate Random Sampling (SPRS) the weights are Wi = (Ni / N), where Ni denotes the number of elements in the ith stratum of the population of total size N
Sample of size n is then divided into k samples of size n1, n2, . . . , nk as follows:
ni = n Wi (i = 1, 2, . . . , k).
Note: In SPRS, σi’s vary significantly, but Ni’s are not too drastically different

Formula for Sample Size ‘n’
In Stratified Disproportionate Random Sampling (SDRS) n = (Z / e)² {Σ Wi σi}²
And samples for individual strata are:
ni = n {Ni σi / Σ Ni σi} (i = 1, 2, . . . , k)
In SDRS, both σi’s and Ni’s vary significantly.

Formula for Sample Size ‘n’
In Cluster Sampling and Double Sampling, it is obvious that the formula for Simple Random Sampling applies.
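The stratified formulas can be tried out numerically. The sketch below is only illustrative: the three strata sizes, their standard deviations, Z = 1.96 and e = 2 are hypothetical, and the weights in the disproportionate case are assumed to be Wi = Ni / N as in the proportionate case.

```python
def stratified_sample_sizes(z, e, strata_sizes, strata_sigmas, proportionate=True):
    """Return total sample size n and its allocation across strata."""
    total = sum(strata_sizes)
    weights = [size / total for size in strata_sizes]          # Wi = Ni / N
    if proportionate:
        # SPRS: n = (Z/e)^2 * sum(Wi * sigma_i^2), allocation ni = n * Wi
        n = (z / e) ** 2 * sum(w * s**2 for w, s in zip(weights, strata_sigmas))
        allocation = [n * w for w in weights]
    else:
        # SDRS: n = (Z/e)^2 * (sum(Wi * sigma_i))^2,
        # allocation ni = n * Ni * sigma_i / sum(Ni * sigma_i)
        n = (z / e) ** 2 * sum(w * s for w, s in zip(weights, strata_sigmas)) ** 2
        denom = sum(size * s for size, s in zip(strata_sizes, strata_sigmas))
        allocation = [n * size * s / denom
                      for size, s in zip(strata_sizes, strata_sigmas)]
    return n, allocation


# Hypothetical population with three strata
sizes, sigmas = [500, 300, 200], [10, 20, 30]
n_sprs, alloc_sprs = stratified_sample_sizes(1.96, 2, sizes, sigmas, proportionate=True)
n_sdrs, alloc_sdrs = stratified_sample_sizes(1.96, 2, sizes, sigmas, proportionate=False)
print(f"SPRS: n = {n_sprs:.2f}, allocation = {[round(a, 1) for a in alloc_sprs]}")
print(f"SDRS: n = {n_sdrs:.2f}, allocation = {[round(a, 1) for a in alloc_sdrs]}")
```

In this example the disproportionate design needs a smaller total sample because it shifts observations toward the strata with larger standard deviations.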
End of Sampling
Multivariate Statistical Analysis
Multivariate Analysis
Statistical Methods for Simultaneous Investigation of Several Variables
Dr. Sharad Varde

Research Studies
Several variables are to be studied
Data are obtained on them from sample
These variables may / may not be mutually independent of each other
Some may hold strong correlation with some other variables
Multi-collinearity may exist among variables
Data analysis methods in this situation are called ‘Inter-dependence Methods’.

Major Inter-dependence Methods
Factor Analysis to reduce several correlated variables into a few uncorrelated meaningful factors
Cluster Analysis to classify individual elements of the population into a few homogeneous groups.
Research Studies
Several variables are to be studied
Purpose is to establish a cause-and-effect relationship
One dependent (effect) variable and several independent (cause) variables
Data are obtained on them from sample
Data analysis methods in such situations are called ‘Dependence Methods’.

Major Dependence Methods
If the dependent variable (effect factor) is Metric and independent variables (cause factors) are non-metric (i.e. categorical), use Design of Experiments to structure the research study and use Analysis of Variance to analyze the data.

Major Dependence Methods
If the dependent variable (effect factor) is Metric and the independent variables (cause factors) are also metric, use Multiple Regression Analysis.

Major Dependence Methods
If the dependent variable (effect factor) is non-Metric (Categorical) and the independent variables (cause factors) are metric, use Multiple Discriminant Analysis.
Major Dependence Methods
Dependent Variable    Independent Variables    Method
Metric                Categorical              ANOVA
Metric                Metric                   Multiple Regression
Categorical           Categorical              Canonical Correlation
Categorical           Metric                   Multiple Discriminant

Major Dependence Methods
                                       Analysis of Variance   DISCRIMINANT ANALYSIS   REGRESSION
Similarities
Number of dependent variables          One                    One                     One
Number of independent variables        Many                   Many                    Many
Differences
Nature of the dependent variables      Metric                 Categorical             Metric
Nature of the independent variables    Categorical            Metric                  Metric

Multivariate Analysis Methods
We will now study major Multivariate methods:
1. Factor Analysis
2. Cluster Analysis
3. Multivariate Discriminant Analysis
4. Multivariate Regression Analysis.

Factor Analysis
Factor Analysis
It examines entire set of inter-dependent relationships without making any distinction between dependent and independent variables
It reduces the total number of variables in the research study to a smaller number of factors by combining a few correlated variables into a factor.

What is a Factor
A factor is a linear combination of the observed original variables V1, V2, . . , Vn:
Fi = Wi1V1 + Wi2V2 + Wi3V3 + . . . + WinVn
where
Fi = The ith factor (i = 1, 2, .., m; m ≤ n)
Wij = Weight (factor score coefficient) of variable Vj in factor Fi
n = Number of original variables
m = Number of factors.
Factor Analysis
Discovers a smaller set of uncorrelated factors (m) to represent the original set of correlated variables (n) significantly (m ≤ n)
These factors do not have multi-collinearity, i.e. they are orthogonal to each other
They can then be used in further multivariate analysis (regression or discriminant analysis).

Case Study # 1
Evaluate credit card usage & behavior of customers
Initial set of variables is large: Age, Gender, Marital Status, Income, Education, Employment Status, Credit History, Family Background: Total 8 variables
Fi = Wi1V1 + Wi2V2 + Wi3V3 + . . . + Wi8V8
Case Study # 1
Reduction of 8 variables into 3 factors (m = 3):
1. Factor 1: Heavy weightage for age, gender, & marital status and low weightages to other variables
2. Factor 2: Heavy weightage for income, education, employment status & low weightages to others
3. Factor 3: Heavy weightage for credit history & family background and low weightages to other variables.

Case Study # 1
These 3 un-correlated factors can be identified by common characteristics of ‘variables with heavy weightages’ & named accordingly as follows:
1. Factor 1: (age, gender, marital status) as Demographic Status
2. Factor 2: (income, education, employment status) as Socio-economic Status
3. Factor 3: (credit history & family background) as Background Status.
Case Study # 2
Evaluate customer motivation for buying a two wheeler
Initial set of variables is large:
1. Affordable
2. Sense of freedom
3. Economical
4. Man’s vehicle
5. Feel powerful
6. Friends jealous
7. Feel good to see ad of this brand
8. Comfortable ride
9. Safe travel
10. Ride for three.

Case Study # 2
Reduction of 10 variables to 3 factors:
Pride: (man’s vehicle, feel powerful, sense of freedom, friends jealous, feel good to see ad of this brand)
Utility: (economical, comfortable ride, safe travel)
Economy: (affordable, ride for three to be allowed)
We will now see how to carry out factor analysis.
Standard Normal Distribution
Standardize the Data
●Enlist all variables that can be important in
resolving the research problem
●Collect metric data on each variable from all
subjects sampled
●Convert all data on each variable into standard
format (Mean: 0 & Std. Dev.: 1) since different
variables may have different units of
measurement
●SPSS / SAS etc. do it automatically.
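As a rough sketch of what the standardization step does (outside SPSS / SAS), each column can be centred on its mean and divided by its standard deviation. The two-variable data set below (age in years, income in thousands) is hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption.

```python
import numpy as np


def standardize(data):
    """Convert each column to mean 0 and standard deviation 1 (z scores)."""
    data = np.asarray(data, dtype=float)
    return (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)


# Hypothetical raw data: columns are age (years) and monthly income (Rs. thousand)
raw = [[25, 30], [32, 45], [41, 80], [53, 120], [38, 60]]
z_scores = standardize(raw)
print(z_scores.round(3))
```

After this step the variables are unit free, so a variable measured in thousands no longer dominates one measured in years.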
Sharad Varde
21
Two Steps in Factor Analysis
Factor Extraction
Factor Rotation

What Factor Extraction does
(a) It determines the minimum number of factors that can comfortably represent all variables in the research study
Obviously, maximum number of factors equals the total number of variables
(b) It converts correlated variables into the desired number of un-correlated factors
Tool: Principal Component Method (PCM).
Principal Component Method
SPSS gives inter-variable correlations
PCM assists checking appropriateness of factor analysis (Bartlett’s test)
Assists checking adequacy of sample size (KMO test)
Gives initial eigen values
They determine the minimum number of factors that can represent all variables.

Case Study # 3
To determine the benefits consumers seek from purchase of a toothpaste
Sample of 30 persons was interviewed
Respondents were asked to indicate their degree of agreement with the following statements using a 7 point scale:
(1=Strongly agree, 7=Strongly disagree)
Six Important Variables
V1: Buy a toothpaste that prevents cavities
V2: Like a toothpaste that gives shiny teeth
V3: Toothpaste should strengthen your gums
V4: Prefer toothpaste that freshens breath
V5: Prevention of tooth decay is not an important benefit
V6: Most important concern is attractive teeth
Data obtained are given in the next slide.

Original Data: 30 persons, 6 variables
RESPONDENT NUMBER   V1     V2     V3     V4     V5     V6
1                   7.00   3.00   6.00   4.00   2.00   4.00
2                   1.00   3.00   2.00   4.00   5.00   4.00
3                   6.00   2.00   7.00   4.00   1.00   3.00
4                   4.00   5.00   4.00   6.00   2.00   5.00
5                   1.00   2.00   2.00   3.00   6.00   2.00
6                   6.00   3.00   6.00   4.00   2.00   4.00
7                   5.00   3.00   6.00   3.00   4.00   3.00
8                   6.00   4.00   7.00   4.00   1.00   4.00
9                   3.00   4.00   2.00   3.00   6.00   3.00
10                  2.00   6.00   2.00   6.00   7.00   6.00
11                  6.00   4.00   7.00   3.00   2.00   3.00
12                  2.00   3.00   1.00   4.00   5.00   4.00
13                  7.00   2.00   6.00   4.00   1.00   3.00
14                  4.00   6.00   4.00   5.00   3.00   6.00
15                  1.00   3.00   2.00   2.00   6.00   4.00
16                  6.00   4.00   6.00   3.00   3.00   4.00
17                  5.00   3.00   6.00   3.00   3.00   4.00
18                  7.00   3.00   7.00   4.00   1.00   4.00
19                  2.00   4.00   3.00   3.00   6.00   3.00
20                  3.00   5.00   3.00   6.00   4.00   6.00
21                  1.00   3.00   2.00   3.00   5.00   3.00
22                  5.00   4.00   5.00   4.00   2.00   4.00
23                  2.00   2.00   1.00   5.00   4.00   4.00
24                  4.00   6.00   4.00   6.00   4.00   7.00
25                  6.00   5.00   4.00   2.00   1.00   4.00
26                  3.00   5.00   4.00   6.00   4.00   7.00
27                  4.00   4.00   7.00   2.00   2.00   5.00
28                  3.00   7.00   2.00   6.00   4.00   3.00
29                  4.00   6.00   3.00   7.00   2.00   7.00
30                  2.00   3.00   2.00   4.00   7.00   2.00
Inter-variable Correlations: Correlation Matrix from SPSS
Variables   V1       V2       V3       V4       V5       V6
V1          1.000
V2          -0.053   1.000
V3          0.873    -0.155   1.000
V4          -0.086   0.572    -0.248   1.000
V5          -0.858   0.020    -0.778   -0.007   1.000
V6          0.004    0.640    -0.018   0.640    -0.136   1.000

Bartlett’s Test
For valid factor analysis, many variables must be correlated with each other
That means, if each original variable is completely independent of each of the remaining n-1 variables, there is no need to perform factor analysis
i.e. if zero correlation among all variables
H0: Correlation matrix is unit matrix.
H0: Correlation matrix is Unit Matrix
       V1    V2    V3    ...   Vn
V1     1     0     0     ...   0
V2     0     1     0     ...   0
V3     0     0     1     ...   0
...    ...   ...   ...   ...   ...
Vn     0     0     0     ...   1
Bartlett’s Test
For valid factor analysis, many variables
must be correlated with each other
H0 : Correlation matrix is unit matrix
Here, SPSS gives p level < 0.05
Reject H0 with 95% level of confidence
So, correlation matrix is not unit matrix
Conclusion: Factor analysis can be
validly done.
KMO Test
SPSS gives Kaiser-Meyer-Olkin measure of sampling adequacy in this case = 0.660
Values of KMO between 0.5 and 1.0 suggest that sample is adequate for carrying out factor analysis. Otherwise, we must draw additional sample.
Here, 0.660 > 0.5
Conclusion: Sample is adequate
Thus, these two tests together confirm appropriateness of factor analysis.

Initial Eigen Values
Factor   Eigen value   % of variance   Cumulat. %
1        2.731         45.520          45.520
2        2.218         36.969          82.488
3        0.442         7.360           89.848
4        0.341         5.688           95.536
5        0.183         3.044           98.580
6        0.085         1.420           100.000
Eigen Value
Variance of each standardized variable is 1
Total variance in study = Number of variables (here 6)
Fi = Wi1V1 + Wi2V2 + Wi3V3 + . . . + Wi6V6
Variance explained by a factor is called Eigen Value of that factor
It depends on (a) weights for different variables and (b) correlations between the factor & each variable (called Factor Loadings)
Higher the eigen value of the factor, bigger is the amount of variance explained by the factor.

Principal Component Method
Each original variable has Eigen value = 1 due to standardization
So, factors with eigen value < 1 are no better than a single variable
Only factors with eigen value ≥ 1 are retained
Principal Component Method determines the least number of factors to explain maximum variance.
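The extraction step can be sketched with NumPy instead of SPSS. The ratings are the 30 toothpaste responses of Case Study # 3; using numpy.corrcoef and numpy.linalg.eigvalsh for the principal component step is a choice made for this illustration, so small numerical differences from the SPSS output are possible.

```python
import numpy as np

# Case Study # 3 ratings: 30 respondents (rows) by variables V1..V6 (columns)
ratings = np.array([
    [7, 3, 6, 4, 2, 4], [1, 3, 2, 4, 5, 4], [6, 2, 7, 4, 1, 3],
    [4, 5, 4, 6, 2, 5], [1, 2, 2, 3, 6, 2], [6, 3, 6, 4, 2, 4],
    [5, 3, 6, 3, 4, 3], [6, 4, 7, 4, 1, 4], [3, 4, 2, 3, 6, 3],
    [2, 6, 2, 6, 7, 6], [6, 4, 7, 3, 2, 3], [2, 3, 1, 4, 5, 4],
    [7, 2, 6, 4, 1, 3], [4, 6, 4, 5, 3, 6], [1, 3, 2, 2, 6, 4],
    [6, 4, 6, 3, 3, 4], [5, 3, 6, 3, 3, 4], [7, 3, 7, 4, 1, 4],
    [2, 4, 3, 3, 6, 3], [3, 5, 3, 6, 4, 6], [1, 3, 2, 3, 5, 3],
    [5, 4, 5, 4, 2, 4], [2, 2, 1, 5, 4, 4], [4, 6, 4, 6, 4, 7],
    [6, 5, 4, 2, 1, 4], [3, 5, 4, 6, 4, 7], [4, 4, 7, 2, 2, 5],
    [3, 7, 2, 6, 4, 3], [4, 6, 3, 7, 2, 7], [2, 3, 2, 4, 7, 2],
], dtype=float)

# Correlation matrix of the six variables (equivalent to standardizing first)
corr = np.corrcoef(ratings, rowvar=False)

# Eigen values of the correlation matrix, largest first
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
pct_variance = 100 * eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(pct_variance)

# Retain only factors with eigen value >= 1
retained = int((eigenvalues >= 1).sum())

for i, (ev, pct, cum) in enumerate(zip(eigenvalues, pct_variance, cumulative), 1):
    print(f"Factor {i}: eigen value {ev:.3f}, {pct:.3f}% of variance, cumulative {cum:.3f}%")
print("Factors retained:", retained)
```

The eigen values always sum to the number of variables (6 here), because each standardized variable contributes a variance of 1.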
PCM is a Sequential Process
Selects weights (i.e. factor score coefficients) in such a manner that the first factor explains the largest portion of the total variance
F1 = W11V1 + W12V2 + W13V3 + . . . + W1nVn
Then selects a second set of weights for
F2 = W21V1 + W22V2 + W23V3 + . . . + W2nVn
so that the second factor accounts for most of the residual variance, subject to being uncorrelated with first factor
Process goes on till cumulative variance explained crosses a desired level, usually 60%.

Case Study # 3: Initial Eigen Values
Factor   Eigen value   % of variance   Cumulat. %
1        2.731         45.520          45.520
2        2.218         36.969          82.488
3        0.442         7.360           89.848
4        0.341         5.688           95.536
5        0.183         3.044           98.580
6        0.085         1.420           100.000
Two Factors Explain > 60% Variation
Factor   Eigen Value   % of Variance   Cumulative %
1        2.731         45.520          45.520
2        2.218         36.969          82.488
Conclusion: Number of factors required to explain >60% variation is 2.

Factor Loadings: Correlation Between Each Factor & Each Variable
Factor Matrix
Variables   Factor 1   Factor 2
V1          0.928      0.253
V2          -0.301     0.795
V3          0.936      0.131
V4          -0.342     0.789
V5          -0.869     -0.351
V6          -0.177     0.871
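The link between factor loadings and eigen values can be checked directly: the eigen value of a factor is the sum of the squared loadings of all variables on that factor. The sketch below uses the unrotated loadings of Case Study # 3; the variable names in the code are illustrative.

```python
# Unrotated factor loadings from Case Study # 3: variable -> (Factor 1, Factor 2)
loadings = {
    "V1": (0.928, 0.253),
    "V2": (-0.301, 0.795),
    "V3": (0.936, 0.131),
    "V4": (-0.342, 0.789),
    "V5": (-0.869, -0.351),
    "V6": (-0.177, 0.871),
}

# Eigen value of a factor = sum of squared loadings on that factor
eigen_f1 = sum(f1**2 for f1, _ in loadings.values())
eigen_f2 = sum(f2**2 for _, f2 in loadings.values())

# Share of each variable's variance explained by the two factors together
explained = {var: f1**2 + f2**2 for var, (f1, f2) in loadings.items()}

print(f"Eigen value of Factor 1: {eigen_f1:.3f}")
print(f"Eigen value of Factor 2: {eigen_f2:.3f}")
for var, share in explained.items():
    print(f"{var}: {share:.3f} of its variance explained")
```

The two sums reproduce, up to rounding of the loadings, the eigen values 2.731 and 2.218 reported for the first two factors.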
Factor Rotation
Initial factor matrix rarely results in factors that can be easily interpreted
Therefore, through a process of rotation, the initial factor matrix is transformed into a simpler matrix that is easier to interpret
It leads to identify which factors are strongly associated with which original variables.

Factor Rotation
In rotating the factors, we would like each factor to have significant loadings or coefficients for some of the variables.
The process of rotation is called orthogonal rotation if the axes are maintained at right angles
Let us see how it is done.
Illustration of Rotation of Axes
Let us take a simpler illustration
Suppose factor loadings of 2 variables on 2 factors:
       Factor 1   Factor 2
V1     0.6        0.7
V2     0.5        -0.5
Variation explained by V1 = (0.6)² + (0.7)² = 0.85
Variation explained by V2 = (0.5)² + (-0.5)² = 0.50
None of the loadings is too large or too small to reach any meaningful conclusion
Let us rotate the two axes & see what happens.
Graph of Original Loadings
[Figure: V1 and V2 plotted against the Factor 1 and Factor 2 axes, each running from -1 to +1]

Graph of Rotated Axes (clockwise)
[Figure: the same two points with the Factor 1 and Factor 2 axes rotated clockwise]

Graph of Rotated Axes
[Figure: V1 and V2 plotted against the rotated Factor 1 and Factor 2 axes]

Factor Loadings After Rotation
Factor loadings of 2 variables on 2 factors:
       Factor 1   Factor 2
V1     -0.2       0.9
V2     0.7        0.1
Variation explained by V1 = (-0.2)² + (0.9)² = 0.85
Variation explained by V2 = (0.7)² + (0.1)² = 0.50
Note that variation explained remains unchanged
Some of the loadings are too large or too small
Now, we can reach meaningful conclusion.
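An orthogonal rotation can be sketched as multiplication of the loading matrix by a rotation matrix. The 45 degree angle and its sign convention below are assumptions chosen for illustration; the rounded values on the slide correspond to a slightly different angle. What the sketch is meant to show is that the variation explained for each variable does not change under rotation.

```python
import numpy as np

# Original loadings from the illustration: rows V1, V2; columns Factor 1, Factor 2
loadings = np.array([[0.6, 0.7],
                     [0.5, -0.5]])


def rotate(loadings, degrees):
    """Apply an orthogonal rotation of the factor axes by the given angle."""
    t = np.radians(degrees)
    rotation = np.array([[np.cos(t), -np.sin(t)],
                         [np.sin(t), np.cos(t)]])
    return loadings @ rotation


rotated = rotate(loadings, -45)   # illustrative angle, not taken from the slides

before = (loadings**2).sum(axis=1)   # variation explained per variable
after = (rotated**2).sum(axis=1)

print("Rotated loadings:\n", rotated.round(3))
print("Variation explained before:", before.round(2))
print("Variation explained after: ", after.round(2))
```

After this rotation V2 loads almost entirely on Factor 1 and V1 almost entirely on Factor 2, which is the kind of simple structure rotation aims for.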
Case Study # 3: Factor Loadings after Rotation
Rotated Factor Matrix
Variables   Factor 1   Factor 2
V1          0.962      -0.027
V2          -0.057     0.848
V3          0.934      -0.146
V4          -0.098     0.845
V5          -0.933     -0.084
V6          0.083      0.885
Weightages to Variables for Each Factor from SPSS
Factor Score Coefficient Matrix
Variables   Factor 1   Factor 2
V1          0.358      0.011
V2          -0.001     0.375
V3          0.345      -0.043
V4          -0.017     0.377
V5          -0.350     -0.059
V6          0.052      0.395

Factors: (6 Variables into 2 factors)
Fi = Wi1V1 + Wi2V2 + Wi3V3 + . . . + Wi6V6
In Case Study # 3:
F1 = 0.358V1 – 0.001V2 + 0.345V3 – 0.017V4 – 0.350V5 + 0.052V6
F2 = 0.011V1 + 0.375V2 – 0.043V3 + 0.377V4 – 0.059V5 + 0.395V6
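Given the factor score coefficients, a respondent's factor scores are weighted sums of that respondent's standardized answers. The sketch below uses the coefficients from Case Study # 3; the standardized responses in z are hypothetical values made up for the example.

```python
# Factor score coefficients from Case Study # 3 (order: V1..V6)
COEFFICIENTS = {
    "F1": [0.358, -0.001, 0.345, -0.017, -0.350, 0.052],
    "F2": [0.011, 0.375, -0.043, 0.377, -0.059, 0.395],
}


def factor_scores(standardized_values):
    """Compute Fi = Wi1*V1 + ... + Wi6*V6 for each factor."""
    return {
        name: sum(w * v for w, v in zip(weights, standardized_values))
        for name, weights in COEFFICIENTS.items()
    }


# Hypothetical standardized responses of one respondent on V1..V6
z = [1.0, -0.5, 1.2, 0.0, -1.1, 0.3]
scores = factor_scores(z)
print({name: round(value, 4) for name, value in scores.items()})
```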
Interpretation of Factors
A factor can then be interpreted in terms of the variables that load high on it from rotated factor matrix
FACTOR 1 has high coefficients for:
V1: Buy a toothpaste that prevents cavities
V3: Toothpaste should strengthen your gums
V5: Prevention of tooth decay is not an important benefit (Note: Coefficient is negative)
FACTOR 1 may be labelled as Health Factor.

Interpretation of Factors
F2 = 0.011V1 + 0.375V2 – 0.043V3 + 0.377V4 – 0.059V5 + 0.395V6
FACTOR 2 has high coefficients on:
V2: Like a toothpaste that gives shiny teeth
V4: Prefer toothpaste that freshens breath
V6: Most important concern is attractive teeth
FACTOR 2 may be labelled as Aesthetic Factor
The factors are jointly called principal components.
Conclusion
From the data gathered from 30 respondents on 6 basic variables, the most important benefits consumers seek from purchase of a toothpaste are HEALTH and AESTHETICS
Health has 45.5 % importance
Aesthetics has 37.0 % importance.

Selecting a Surrogate Variable
●Sometimes, we are not willing to discover new factors but we want to stick to original variables and want to know which ones are important
●By examining the factor matrix, we could select for each factor just one variable with the highest loading for that factor, if possible
●That variable could then be used as a surrogate variable for the associated factor
Factor Loadings After Rotation

Rotated Factor Matrix
Variables   Factor 1   Factor 2
V1           0.962     -0.027
V2          -0.057      0.848
V3           0.934     -0.146
V4          -0.098      0.845
V5          -0.933     -0.084
V6           0.083      0.885

Selecting Surrogate Variables
●V1 has highest loading on F1
●So, V1 is surrogate variable for F1
●Similarly V6 could be surrogate for F2
●So, we concentrate on only 2 variables: V1 (Preventing Cavities) & V6 (Attractive teeth).
End of Factor Analysis
Design of Experiments
Dr. Sharad Varde

B. Advanced Designs
B.1: Completely Randomized Design
B.2: Randomized Block Design
B.3: Latin Square Design
B.4: Factorial Design.

B.1: Completely Randomized Design
●One Effect Factor (Dependent Variable): Measurable, Countable, Cardinal, Metric
●One Cause Factor (Independent Variable): Non-metric, Nominal or Ordinal, Categorical
●No Extraneous Variable.

Experiment: To determine effect of training on job performance
Cause factor: Training. Effect factor: Job performance
No extraneous factor to influence performance
Randomly assign (toss a coin) experimental units (employees) to the experimental group (EG) and the control group (CG) equally
Expose EG to training. No training to CG
Evaluate job performance after a while
If difference is significant, conclude effectiveness of training.
B.1: Completely Randomized Design
This example: Two categories in cause factor
We can have many categories
Called LEVELS of TREATMENT
Example: 3 types of training: conventional lectures, case studies, group discussion
Q: Which one is most effective?
Randomly assign equally to 3 groups.

Statistical Model
$y_{ij} = \mu + t_j + e_{ij}$
where, $y_{ij}$ = ith observation for jth treatment level
i = 1, 2, . . . . , n
j = 1, 2, . . . . , k treatments (3 types of training)
$\mu$ = overall mean
$t_j$ = effect of the jth treatment (type of training)
$e_{ij}$ = experimental error for ith observation subjected to jth treatment.
Statistical Analysis
Overall Mean: $\mu = \frac{1}{nk} \sum_i \sum_j y_{ij}$
Treatment Means: $\mu_j = \frac{1}{n} \sum_i y_{ij}$ for j = 1, 2, . . . , k
Total Sum of Squares: $SST = \sum_i \sum_j (y_{ij} - \mu)^2$
Treatment Sum of Squares: $SSTr = n \sum_j (\mu_j - \mu)^2$
Error Sum of Squares: $SSE = \sum_i \sum_j (y_{ij} - \mu_j)^2$

Analysis of Variance Table for Completely Randomised Design
Source of Variation    Sum of Squares   Degrees of Freedom   Mean Squares               F Ratio
Between Treatments     SSTr             k – 1                MSTr = SSTr / (k – 1)      FTr = MSTr / MSE
Residual Error         SSE              k(n – 1)             MSE = SSE / k(n – 1)
Total                  SST              nk – 1
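
The formulas and the ANOVA table above can be traced in a few lines of code. This is a sketch under stated assumptions: the performance scores for the three training types are hypothetical, and every treatment is assumed to have the same number n of observations, as in the slides.

```python
def crd_anova(groups):
    """One-way ANOVA for a completely randomized design (equal group sizes)."""
    k = len(groups)          # number of treatment levels
    n = len(groups[0])       # observations per treatment
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes n observations per treatment")
    mu = sum(sum(g) for g in groups) / (n * k)            # overall mean
    treat_means = [sum(g) / n for g in groups]            # treatment means
    sst = sum((y - mu) ** 2 for g in groups for y in g)
    sstr = n * sum((m - mu) ** 2 for m in treat_means)
    sse = sum((y - m) ** 2 for g, m in zip(groups, treat_means) for y in g)
    mstr = sstr / (k - 1)
    mse = sse / (k * (n - 1))
    return {"SST": sst, "SSTr": sstr, "SSE": sse,
            "df_tr": k - 1, "df_err": k * (n - 1),
            "MSTr": mstr, "MSE": mse, "F": mstr / mse}


# Hypothetical job-performance scores (not from the lecture).
lectures = [70, 72, 68, 74]
case_studies = [78, 80, 76, 82]
group_discussion = [85, 83, 87, 89]
result = crd_anova([lectures, case_studies, group_discussion])
print(result)
```

The resulting F value would then be compared with the tabulated F for k – 1 and k(n – 1) degrees of freedom; the lookup itself is left out of the sketch.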
Hypothesis Testing
For Treatments: H0: $t_1 = t_2 = \dots = t_k$
H1: H0 is not true
If FTr > F value for k – 1 & k(n – 1) degrees of freedom for the stipulated level of confidence, say 95%, then we reject H0 for treatments
That means: Treatment effects vary significantly from each other.

B.1: Completely Randomized Design
Assumption in this example: No extraneous factor that would influence job performance
But suppose we suspect gender (M / F) can influence performance
Say, in the food processing or garment industry
Then gender is another cause factor
To understand the effect of both on performance, we use Randomized Block Design.
B.2: Randomized Block Design
●One Effect Factor (Dependent Variable): Measurable, Countable, Cardinal, Metric
●Two Cause Factors (Independent Variables), Or, one principal cause factor and one extraneous factor: Both Non-metric, Nominal or Ordinal, Categorical
●No interaction between two Cause Factors.

B.2: Randomized Block Design
To determine impact of price change on sales of a health drink
Cause factor: Price. Effect factor: Sales
Conduct experiment using 4 price levels (Cause factor i.e. Treatment): Rs. 100, 120, 150 & 175 (4 treatment levels)
Store type (chemist, grocery shop, and supermarket) could also affect sales.
B.2: Randomized Block Design
Thus, store type is an extraneous variable (Called Block for historical reasons). Here, 3 blocks.
So, 4 x 3 = 12 retail outlets are selected according to store type: 4 of each type:
Chemist: A, B, C, D (4 chemists);
Grocery Shop: I, II, III, IV (4 grocery shops); and
Supermarket: P, Q, R, S (4 supermarkets)
Price levels assigned randomly to each retail outlet.

Example of Randomised Block Design
(Note only 12 obs instead of 12 x 4 = 48 obs)
Price     Chemist   Grocery Shop   Supermarket
Rs 100    C         II             S
Rs 120    D         III            P
Rs 150    A         IV             Q
Rs 175    B         I              R
Example of Randomised Block Design
(Number of units sold)
Price     Chemist   Grocery Shop   Supermarket   Total
Rs 100    308       867            129           1304
Rs 120    216       669            104            989
Rs 150    163       557             95            815
Rs 175    142       490             86            718
Total     829       2583           414           3826

Statistical Model
$y_{ij} = \mu + \beta_i + t_j + e_{ij}$
where, $y_{ij}$ = observation for jth treatment level in ith block
i = 1, 2, . . . . , n blocks (3 store types)
j = 1, 2, . . . . , k treatments (4 treatments)
$\mu$ = overall mean
$\beta_i$ = effect of the ith block
$t_j$ = effect of the jth treatment
$e_{ij}$ = experimental error in the ith block subjected to jth treatment.
Experimental Error
Error-free real life is utopia
In spite of all precautions, some error can creep in the experiment
Best solution is to measure the experimental error and check if it is within acceptable limits
Say, ± 5%.

Sources of Error
●Unexpected event during experiment
●Subjects getting bored, aging during expt
●Post-test familiar for pre-tested subjects
●Non-uniformity of measurement tools
●Unwillingness of some selected subjects
●Outliers included in the random sample
●Drop out / mortality during experiment.
Statistical Analysis
Overall Mean: $\mu = \frac{1}{nk} \sum_i \sum_j y_{ij}$
Block Means: $\mu_{i.} = \frac{1}{k} \sum_j y_{ij}$ for i = 1, 2, . . . . , n
Treatment Means: $\mu_{.j} = \frac{1}{n} \sum_i y_{ij}$ for j = 1, 2, . . . . , k
Total Sum of Squares: $SST = \sum_i \sum_j (y_{ij} - \mu)^2$
Block Sum of Squares: $SSB = k \sum_i (\mu_{i.} - \mu)^2$
Treatment Sum of Squares: $SSTr = n \sum_j (\mu_{.j} - \mu)^2$
Error Sum of Squares: $SSE = \sum_i \sum_j (y_{ij} - \mu_{i.} - \mu_{.j} + \mu)^2$

Analysis of Variance Table for Randomised Block Design
Source of Variation    Sum of Squares   Degrees of Freedom   Mean Squares                     F Ratio
Between Blocks         SSB              n – 1                MSB = SSB / (n – 1)              FB = MSB / MSE
Between Treatments     SSTr             k – 1                MSTr = SSTr / (k – 1)            FTr = MSTr / MSE
Residual Error         SSE              (n – 1)(k – 1)       MSE = SSE / (n – 1)(k – 1)
Total                  SST              nk – 1
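
As a worked illustration of the table above, the sketch below applies the randomized block formulas to the units-sold figures from the health drink example (blocks are the 3 store types, treatments the 4 price levels). The function name and the dictionary layout are illustrative choices; only the sales figures come from the slides.

```python
# Units sold, one list per block (store type), ordered by price:
# Rs 100, Rs 120, Rs 150, Rs 175.
sales = {
    "Chemist": [308, 216, 163, 142],
    "Grocery Shop": [867, 669, 557, 490],
    "Supermarket": [129, 104, 95, 86],
}


def rbd_anova(y):
    """ANOVA for a randomized block design; y[i][j] = block i, treatment j."""
    n, k = len(y), len(y[0])
    mu = sum(map(sum, y)) / (n * k)
    block_means = [sum(row) / k for row in y]
    treat_means = [sum(y[i][j] for i in range(n)) / n for j in range(k)]
    sst = sum((y[i][j] - mu) ** 2 for i in range(n) for j in range(k))
    ssb = k * sum((m - mu) ** 2 for m in block_means)
    sstr = n * sum((m - mu) ** 2 for m in treat_means)
    sse = sum((y[i][j] - block_means[i] - treat_means[j] + mu) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))
    return {"SST": sst, "SSB": ssb, "SSTr": sstr, "SSE": sse,
            "FB": (ssb / (n - 1)) / mse, "FTr": (sstr / (k - 1)) / mse}


table = rbd_anova(list(sales.values()))
for name, value in table.items():
    print(f"{name}: {value:.3f}")
```

FB and FTr would then be compared with the tabulated F values for (n – 1, (n – 1)(k – 1)) and (k – 1, (n – 1)(k – 1)) degrees of freedom respectively.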
Hypothesis Testing
For Blocks: H0: $\beta_1 = \beta_2 = \dots = \beta_n$
H1: H0 is not true
For Treatments: H0: $t_1 = t_2 = \dots = t_k$
H1: H0 is not true.

Hypothesis Testing
If FB > F value for n – 1 & (n – 1)(k – 1) degrees of freedom for the stipulated level of confidence, say 95%, then we reject H0 for blocks, that means, the block effects vary from each other
If FTr > F value for k – 1 & (n – 1)(k – 1) degrees of freedom for the stipulated level of confidence, say 95%, then we reject H0 for treatments, that means, the treatment effects vary from each other.
B.3: Latin Square Design
●One Effect Factor (Dependent Variable): Measurable, Countable, Cardinal, Metric
●Three Cause Factors (Independent Variables), Or, one principal cause factor and two extraneous factors: All Non-metric, Nominal or Ordinal, Categorical
●No interaction among three Cause Factors.

B.3: Latin Square Design
●Experiment: To find out impact of three different ads on sales of refrigerators
Effect factor: Sales
Cause factor: Ads (3 versions: A, B & C)
Two Extraneous factors:
1. Product Pricing (3 levels: Rs. 20000, 25K, 30K)
2. Consumer Income (3 levels: low, mid, high).
B.3: Latin Square Design
Construct a 3 x 3 table (Total 9 cells):
●Rows show 3 levels of one extraneous factor (Product Pricing)
●Columns show 3 levels of other extraneous factor (Consumer Income)
●Assign 3 ad versions to 9 cells in such a way that each row & each column has all 3 ads
●Note only 9 obs instead of 3x3x3 = 27 obs.

Example of Latin Square Design
Pricing Levels   Low Income   Middle Income   High Income
Rs 20000         Ad-B         Ad-A            Ad-C
Rs 25000         Ad-C         Ad-B            Ad-A
Rs 30000         Ad-A         Ad-C            Ad-B
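
The defining property of the layout above (every ad appears exactly once in each row and each column) is easy to check programmatically. The helper below is a small illustrative sketch; `is_latin_square` is a name chosen here, and the square is the 3 x 3 ad assignment from the example.

```python
# Rows: pricing levels (Rs 20000, 25000, 30000); columns: low, middle, high income.
square = [
    ["B", "A", "C"],
    ["C", "B", "A"],
    ["A", "C", "B"],
]


def is_latin_square(sq):
    """True if every symbol occurs exactly once in each row and each column."""
    n = len(sq)
    symbols = set(sq[0])
    if len(symbols) != n or any(len(row) != n for row in sq):
        return False
    rows_ok = all(set(row) == symbols for row in sq)
    cols_ok = all({sq[i][j] for i in range(n)} == symbols for j in range(n))
    return rows_ok and cols_ok


valid = is_latin_square(square)
# A deliberately broken layout: Ad-B appears twice in the first column.
broken = is_latin_square([["B", "A", "C"], ["B", "C", "A"], ["A", "C", "B"]])
print(valid, broken)
```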
Statistical Model
$y_{ijk} = \mu + r_i + c_j + t_k + e_{ijk}$
$y_{ijk}$ = observation in ith row & jth column subjected to kth treatment
i = 1, 2, . . . . , n
j = 1, 2, . . . . , n
k = 1, 2, . . . . , n
n = number of treatments
$\mu$ = overall mean
$r_i$ = effect of the ith row (ith level of extraneous factor 1)
$c_j$ = effect of the jth column (jth level of extraneous factor 2)
$t_k$ = effect of the kth level of treatment (cause factor)
$e_{ijk}$ = experimental error in ith row & jth column subjected to kth treatment.

Statistical Analysis
Overall Mean: $\mu = \frac{1}{n^2} \sum_i \sum_j y_{ijk}$
Row Means: $\mu_{i..} = \frac{1}{n} \sum_j y_{ijk}$ for i = 1, 2, . . . . , n
Column Means: $\mu_{.j.} = \frac{1}{n} \sum_i y_{ijk}$ for j = 1, 2, . . . . , n
Treatment Means: $\mu_{..k} = \frac{1}{n} \sum y_{ijk}$ (summed over the n cells that receive treatment k) for k = 1, 2, . . . . , n
Total Sum of Squares: $SST = \sum_i \sum_j (y_{ijk} - \mu)^2$
Row Sum of Squares: $SSR = n \sum_i (\mu_{i..} - \mu)^2$
Column Sum of Squares: $SSC = n \sum_j (\mu_{.j.} - \mu)^2$
Treatment Sum of Squares: $SSTr = n \sum_k (\mu_{..k} - \mu)^2$
Error Sum of Squares: $SSE = \sum_i \sum_j (y_{ijk} - \mu_{i..} - \mu_{.j.} - \mu_{..k} + 2\mu)^2$

Analysis of Variance Table for Latin Square Design
Source of Variation    Sum of Squares   Degrees of Freedom   Mean Squares                     F Ratio
Between Rows           SSR              n – 1                MSR = SSR / (n – 1)              FR = MSR / MSE
Between Columns        SSC              n – 1                MSC = SSC / (n – 1)              FC = MSC / MSE
Between Treatments     SSTr             n – 1                MSTr = SSTr / (n – 1)            FTr = MSTr / MSE
Residual Error         SSE              (n – 1)(n – 2)       MSE = SSE / (n – 1)(n – 2)
Total                  SST              n² – 1
Hypothesis Testing
For Rows: H0: $r_1 = r_2 = \dots = r_n$
H1: H0 is not true
For Columns: H0: $c_1 = c_2 = \dots = c_n$
H1: H0 is not true
For Treatments: H0: $t_1 = t_2 = \dots = t_n$
H1: H0 is not true.

Hypothesis Testing
If FR > F value for n – 1 & (n – 1)(n – 2) degrees of freedom for stipulated level of confidence, say 95%, then we reject H0 for rows, that means, the row (extraneous factor 1) effects vary from each other
If FC > F value for n – 1 & (n – 1)(n – 2) degrees of freedom, then we reject H0 for columns, that means, the column (extraneous factor 2) effects vary from each other
If FTr > F value for n – 1 & (n – 1)(n – 2) degrees of freedom, then we reject H0 for treatments, that means, the treatment effects differ from each other.
Latin Square Design
●Each cell is assigned one of the 3 ads randomly
●So, each row has all ads & each column has all ads
●Effect on sales is determined for each cell
●Analysis shows which ad influences sales most irrespective of the two extraneous factors, viz. Product Pricing & Income Levels of Consumers.

Latin Square Design
●Limitation: In Latin square, levels of all the three cause factors must be same (here, 3 each)
●Benefits: Needs substantially fewer test units
●Performs randomization with respect to row and column effects
●Thus, neutralizes effect of extraneous factors
●Applies when the three cause factors do not interact with each other
●If they do, use Factorial Design.
B.4: Factorial Design
Experiment to investigate interaction among all cause factors

B.4: Factorial Design
●One Effect Factor (Dependent Variable): Metric
●Many Cause Factors (Independent Variables), Or, one principal cause factor and several extraneous factors: All Categorical
●Interaction among all cause factors.
Factorial Design
Two effects detected: Main effect & Interaction effect
Main Effect of a cause factor (ads) is its direct influence on the effect factor (sales)
Interaction Effect of two cause factors is the influence of the interaction between the two cause factors (consumer income and ads) on the effect factor (sales).

Example of Interaction Effect
Experiment: To determine believability of two ads on 0-100 scale
Two different ads A & B are to be compared
Effect factor: Believability
Cause factor: Ads
Gender of the reader is extraneous factor.
Example of 2 X 2 Factorial Design
R   Men + Ad A      O1 = 60
R   Men + Ad B      O2 = 70
R   Women + Ad A    O3 = 80
R   Women + Ad B    O4 = 50

Example of Factorial Design
This is a 2X2 factorial experiment
Permits testing of 3 hypotheses:
●Which ad is more believable (Main effect)
●Which gender tends to believe magazine ads more (Main effect)
●Which gender finds which ad more believable (Interaction effect).
Believability Scores
                     Ad A   Ad B   Main Effect of Gender
Men                  60     70     65
Women                80     50     65
Main Effect of Ad    70     60

Main Effects
Which ad is more believable (Main effect)
Ad A: (60 + 80) / 2 = 70
Ad B: (70 + 50) / 2 = 60
Which gender tends to believe magazine ads more (Main effect)
Men: (60 + 70) / 2 = 65
Women: (80 + 50) / 2 = 65

Interaction Effects
Which gender finds which ad more believable (Interaction effect)
Men: Ad B: 70 against 60
Women: Ad A: 80 against 50

Interaction Effect
[Figure: Interaction Between Gender and Advertising Copy. Vertical axis: Believability (0 to 100); horizontal axis: Ad A, Ad B; one line each for Men and Women.]
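
The main effects above are simple averages over the 2 X 2 table of believability scores, and the interaction can be summarised as a difference of differences. The sketch below reproduces the slide arithmetic; the interaction contrast is one common way to quantify what the slide describes in words, not a formula given in the lecture.

```python
# Believability scores from the 2 X 2 example: (gender, ad) -> score.
scores = {
    ("Men", "A"): 60, ("Men", "B"): 70,
    ("Women", "A"): 80, ("Women", "B"): 50,
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Main effect of ad: average over genders.
ad_means = {ad: mean(v for (g, a), v in scores.items() if a == ad)
            for ad in ("A", "B")}
# Main effect of gender: average over ads.
gender_means = {g: mean(v for (gg, a), v in scores.items() if gg == g)
                for g in ("Men", "Women")}
# Interaction contrast: (A minus B for men) minus (A minus B for women).
interaction = ((scores[("Men", "A")] - scores[("Men", "B")])
               - (scores[("Women", "A")] - scores[("Women", "B")]))

print(ad_means, gender_means, interaction)
```

A contrast far from zero means the preferred ad depends on gender, which is the crossing-lines pattern in the interaction plot.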
Factorial Design
Useful when several cause factors are being investigated and when they interact with each other significantly (Multi-collinearity)
Factorial design covers all possible combinations of all factors under study
Obviously, it needs a fat cost & time budget.

Structure of 2 X 2 Factorial Design
R   X1 + X2          O1
R   X1 + No X2       O2
R   No X1 + X2       O3
R   No X1 + No X2    O4

Structure of 2 X 2 Factorial Design
(Two cause factors X1 & X2 each at 2 levels)
X1       X2       O1
X1       No X2    O2
No X1    X2       O3
No X1    No X2    O4

2 x 2 x 2 Factorial Design
Experiment: To determine effect of training on job performance
Effect factor: Job performance
Cause factors:
●Training (2 levels: Training / No Training)
●Gender (2 levels: Male / Female)
●Science or Non-science graduate (2 levels).
Structure of 2 X 2 X 2 Factorial Design
(Three cause factors X1, X2 & X3 each at 2 levels)
X1       X2       X3       O1
X1       X2       No X3    O2
X1       No X2    X3       O3
X1       No X2    No X3    O4
No X1    X2       X3       O5
No X1    X2       No X3    O6
No X1    No X2    X3       O7
No X1    No X2    No X3    O8
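
The eight rows of the structure above are just every combination of the factor levels, which can be generated rather than written out by hand. A minimal sketch using the training example (the factor and level names follow the 2 x 2 x 2 slide; `factorial_cells` is an illustrative helper name):

```python
from itertools import product


def factorial_cells(levels_per_factor):
    """List every treatment combination of a full factorial design."""
    return list(product(*levels_per_factor))


training = ["Training", "No Training"]
gender = ["Male", "Female"]
degree = ["Science", "Non-science"]

cells = factorial_cells([training, gender, degree])
for number, cell in enumerate(cells, start=1):
    print(f"O{number}: {cell}")

# The number of cells is the product of the numbers of levels,
# e.g. factors with 3, 2, 4 and 2 levels need 3 * 2 * 4 * 2 cells.
larger = factorial_cells([range(3), range(2), range(4), range(2)])
```

The rapid growth in the number of cells is the reason a full factorial design needs a large budget.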
Factorial Design
A 3X2 factorial design has one cause factor with 3 levels & second cause factor with 2 levels
A 3X3 factorial design has two cause factors each with 3 levels
A 3X2X4X2 factorial design has four cause factors with 3, 2, 4 & 2 levels respectively.

End of Design of Experiments
Dr. Sharad Varde
Multivariate Analysis
Cluster Analysis
Dr. Sharad Varde
A Clarification of Terminology
A Clarification of Terminology
In sampling, ‘CLUSTER’ is a term used to
denote a group of heterogeneous elements
Population consists of several such clusters
Each cluster offers the entire range of
variation available in the population
Each cluster is similar to other clusters
Inter-cluster homogeneity and intra-cluster
heterogeneity
We can choose any cluster as a sample
representative of the population.
In sampling, ‘STRATUM’ is a term used to
denote a group of homogeneous elements
Population consists of several such strata
Each stratum is different from other strata
Intra-strata homogeneity and inter-strata
heterogeneity
We must select all strata and choose a few
elements from each stratum to obtain a
sample representative of the population.
A Clarification of Terminology
STRATUM in sampling is called CLUSTER in multivariate analysis
So, in multivariate analysis, cluster is a group of homogeneous elements
It is like dictionary meaning of word ‘cluster’
Population consists of several such clusters
Each cluster is different from other clusters
Inter-cluster heterogeneity and intra-cluster homogeneity
Cluster analysis is also called Classification Analysis, or Numerical Taxonomy.

Objective of Cluster Analysis
To divide the heterogeneous population into a number of homogeneous groups (clusters) in such a manner that elements similar to each other in respect of characteristics of our interest are bunched together in a cluster
Population is thus divided into several bunches called clusters
This is called ‘Market Segmentation’
We then study each cluster in detail.

[Figure: Elements of a Population, plotted on axes Variable 1 and Variable 2]

[Figure: Clustered Elements]
Case Study # 4
Indian Railways wanted to map profile of its target audience (potential customers) in terms of lifestyle, attitudes & perceptions
A set of 15 statements was prepared to measure these characteristics
Respondents to tick 1: strongly agree, 2: agree, 3: neutral, 4: disagree, 5: strongly disagree against each statement
Cluster analysis divided respondents into 4 homogeneous clusters.

Cluster 1
They are careful spenders. Feel that quality comes at a price, car is not a necessity, people are not more health-conscious now, women are not active decision makers, foreign firms have increased efficiency of Indian firms, politicians can play active role, don’t like TV, fast food, credit card, movies, weekend outings.
Thus, they exhibit many traditional values.

Other Clusters
Cluster 2: They use credit cards, spend freely, travel, believe in women power, believe in economics more than in politics, feel quality products can cost less
Cluster 3: Health-conscious, spend carefully, brand loyal, outgoing, extrovert nature, like to settle abroad
Cluster 4: Optimistic, love TV, believe in value for money, free spenders on items they like, travel a lot
IR then studied demographic DNAs of each cluster to evolve a communication & marketing strategy
Let us see how this clustering is actually done.

Conducting Cluster Analysis
1. Formulate the Problem
2. Select a Distance Measure
3. Select a Clustering Procedure
4. Decide on the Number of Clusters
5. Interpret and Profile Clusters
6. Assess the Validity of Clustering
Formulating the Problem
Select variables most relevant to our inquiry
Inclusion of even one or two irrelevant variables may distort an otherwise useful clustering solution
In descriptive research, past studies & present hypotheses help selection of variables
In exploratory research, use judgment & intuition to select relevant variables.

Distance Measure
Basis of Cluster Analysis: Concept of distance between two objects (respondents) in terms of the variables of our interest
Most commonly used measure is Euclidean Distance.
Euclidean Distance is the square root of the sum of the squared differences in values for each variable.
Example
Responses of person #1 & #2 to three statements on five point scale:
Statement 1: I prefer to use e-mail rather than write a letter
Statement 2: I feel that good quality products are always priced high
Statement 3: TV is major source of entertainment.

Euclidean Distance between Resp1 and Resp2 is 3.74
              Resp 1   Resp 2   Difference     (Diff)²
Statement 1   1        3        |1 – 3| = 2    4
Statement 2   5        2        |5 – 2| = 3    9
Statement 3   3        4        |3 – 4| = 1    1
√Σ(Diff)² = √14 = 3.74
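
The Euclidean distance of 3.74 in this example can be checked with a few lines of code. The sketch also computes the city-block and Chebychev distances that the lecture lists under Other Measures of Distance, using the same two respondents; the function names are illustrative.

```python
import math

# Responses of the two respondents to the three statements (from the example).
resp1 = [1, 5, 3]
resp2 = [3, 2, 4]


def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(a, b):
    """Sum of absolute differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebychev(a, b):
    """Largest absolute difference on any single variable."""
    return max(abs(x - y) for x, y in zip(a, b))


d_euclid = euclidean(resp1, resp2)
d_city = city_block(resp1, resp2)
d_cheb = chebychev(resp1, resp2)
print(round(d_euclid, 2), d_city, d_cheb)
```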
Other Measures of Distance
The City-block or Manhattan Distance between two objects j and k is the sum of the absolute differences in values for each variable: $\sum_i |d_{ij} - d_{ik}|$ (6 in above example)
The Chebychev Distance between two objects is the maximum absolute difference in values for any variable: $\max_i |d_{ij} - d_{ik}|$ (3 in above example)

Clustering Procedures
Basic Methods are of two types
1. Hierarchical (or Linkage) Methods: A complete range of solutions is provided by computers varying from 1 to n – 1 clusters where n is number of objects being studied (respondents)
2. Non-hierarchical (or Nodal) Methods: Number of clusters to be extracted is specified in advance.
Classification of 8 Clustering Procedures
Clustering Procedures
  Hierarchical
    Agglomerative
      Linkage Methods: Single, Complete, Average
      Variance Methods: Ward’s Method
      Centroid Methods
    Divisive
  Nonhierarchical
    Sequential Threshold
    Parallel Threshold
    Optimizing Partitioning

Hierarchical Clustering
It is development of a hierarchy or tree-like structure.
It can be agglomerative or divisive
Agglomerative Clustering starts with each object in a separate cluster (i.e. c = n). Then objects are grouped into bigger and bigger clusters. This process is continued until all objects are members of a single cluster (i.e. c = 1)
Divisive Clustering is exactly opposite. It starts with all the objects grouped in a single cluster (i.e. c = 1). It is then progressively split until each object is in a separate cluster (i.e. c = n).
Agglomerative Clustering
Is most commonly used in research studies
Agglomerative methods consist of
Linkage methods
Variance methods
Centroid methods
Linkage methods are of further three types:
Single Linkage
Complete Linkage
Average Linkage.

Linkage Methods
Single Linkage method is based on minimum distance, or ‘nearest neighbour rule’. Here, distance between two clusters is distance between their two closest points
Complete Linkage method is based on maximum distance or ‘farthest neighbour rule’. Here, distance between two clusters is calculated as distance between their two farthest points
Average Linkage method defines distance as the average of distances between all pairs of objects, one from each cluster.
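
The three linkage rules differ only in how the pairwise distances between two clusters are summarised. The sketch below makes that concrete for two small clusters; the points are hypothetical and Euclidean distance is assumed (`math.dist` requires Python 3.8 or later).

```python
import math
from itertools import product

# Two hypothetical clusters of objects measured on two variables.
cluster_1 = [(0.0, 0.0), (1.0, 0.0)]
cluster_2 = [(3.0, 0.0), (5.0, 0.0)]


def pairwise_distances(a, b):
    """Euclidean distances between all pairs, one object from each cluster."""
    return [math.dist(p, q) for p, q in product(a, b)]


def single_linkage(a, b):
    return min(pairwise_distances(a, b))      # nearest neighbour rule


def complete_linkage(a, b):
    return max(pairwise_distances(a, b))      # farthest neighbour rule


def average_linkage(a, b):
    d = pairwise_distances(a, b)
    return sum(d) / len(d)                    # mean over all pairs


single = single_linkage(cluster_1, cluster_2)
complete = complete_linkage(cluster_1, cluster_2)
average = average_linkage(cluster_1, cluster_2)
print(single, complete, average)
```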
Pictorial Representation
[Figure: Single Linkage, showing the Minimum Distance between Cluster 1 and Cluster 2]
[Figure: Complete Linkage, showing the Maximum Distance between Cluster 1 and Cluster 2]
[Figure: Average Linkage, showing the Average Distance between Cluster 1 and Cluster 2]

Other Agglomerative Methods
Variance Method generates clusters to minimize within-cluster variation
Ward's Procedure is most popular variance method.
For each cluster, compute means for all variables
Then, for each object, calculate squared Euclidean distance to the cluster means
Sum them up for all objects.
At each stage, combine two clusters with smallest increase in overall sum of squared Euclidean distances.
Ward’s Procedure
[Schematic: data matrix with objects O1, O2, ..., Om as rows and variables V1, V2, ..., Vn as columns, a row of Means, the squared Euclidean distance (E.D.)² of each object from the means, and their total Σ(E.D.)²]

Pictorial Representation
[Figure: Ward’s Procedure and Centroid Method]

Other Agglomerative Methods
Centroid Method computes distance between the centroids (means for all the variables) of clusters. Every time objects are regrouped, a new centroid is computed.
Average Linkage and Ward's Procedure perform better than other hierarchical methods.

Steps in Computerized Procedure
Run the Hierarchical Clustering Programme on the variables
Generate output called Agglomeration Schedule
It shows all possible solutions from 1 to n-1 clusters (n = number of respondents or objects)
Going up from the bottom of the Agglomeration Schedule look at the column called Coefficients to decide on number of clusters
In this column starting from the bottom, calculate difference in the value of coefficient in the neighbouring rows.
If the maximum value of this difference occurs, say, between third & fourth row from the bottom it indicates existence of 3 clusters (the lower row number). This is purely judgmental.
Dendrogram gives essentially same information in graphical form.
Case Study # 5
Problem: Clustering of consumers based on
attitude towards shopping at Wonder Mall
Six attitudinal variables were identified
V1: Shopping is fun for me
V2: Shopping is bad for my budget
V3: I combine shopping with eating out
V4: I get best buys when shopping here
V5: I do not care about shopping
V6: I can save a lot of money by
comparing prices
Consumers were asked to express their
degree of agreement with these
statements on a 7 point scale
(1=Strongly Disagree; 7=Strongly Agree)
Data obtained from 20 respondents are
shown in next slide
In reality, sample size was much larger.
Case Study # 5 Input Data
Cons No.   V1   V2   V3   V4   V5   V6
1          6    4    7    3    2    3
2          2    3    1    4    5    4
3          7    2    6    4    1    3
4          4    6    4    5    3    6
5          1    3    2    2    6    4
6          6    4    6    3    3    4
7          5    3    6    3    3    4
8          7    3    7    4    1    4
9          2    4    3    3    6    3
10         3    5    3    6    4    6
11         1    3    2    3    5    3
12         5    4    5    4    2    4
13         2    2    1    5    4    4
14         4    6    4    6    4    7
15         6    5    4    2    1    4
16         3    5    4    6    4    7
17         4    4    7    2    2    5
18         3    7    2    6    4    3
19         4    6    3    7    2    7
20         2    3    2    4    7    2
Results of Hierarchical Clustering
Agglomeration Schedule Using Ward’s Procedure
        Clusters combined                     Stage cluster first appears
Stage   Cluster 1   Cluster 2   Coefficient   Cluster 1   Cluster 2   Next stage
1       14          16          1.000000      0           0           6
2       6           7           2.000000      0           0           7
3       2           13          3.500000      0           0           15
4       5           11          5.000000      0           0           11
5       3           8           6.500000      0           0           16
6       10          14          8.160000      0           1           9
7       6           12          10.166667     2           0           10
8       9           20          13.000000     0           0           11
9       4           10          15.583000     0           6           12
10      1           6           18.500000     6           7           13
11      5           9           23.000000     4           8           15
12      4           19          27.750000     9           0           17
13      1           17          33.100000     10          0           14
14      1           15          41.333000     13          0           16
15      2           5           51.833000     3           11          18
16      1           3           64.500000     14          5           19
17      4           18          79.667000     12          0           18
18      2           4           172.662000    15          17          19
19      1           2           328.600000    16          18          0
Dendrogram
A dendrogram, or tree graph, is a graphical device for displaying clustering results
Vertical lines represent clusters that are joined together
The position of the line on the scale indicates the distances at which clusters were joined
Dendrogram is read from left to right.

[Figure: Dendrogram Using Ward’s Method]
Results of Hierarchical Clustering
Cluster Membership of Cases Using Ward’s Procedure
             Number of Clusters
Label case   4    3    2
1            1    1    1
2            2    2    2
3            1    1    1
4            3    3    2
5            2    2    2
6            1    1    1
7            1    1    1
8            1    1    1
9            2    2    2
10           3    3    2
11           2    2    2
12           1    1    1
13           2    2    2
14           3    3    2
15           1    1    1
16           3    3    2
17           1    1    1
18           4    3    2
19           3    3    2
20           2    2    2

Interpretation
How many clusters?
Answer: Not too many; Not too few
Sometimes, decision makers may want a particular number of clusters
Common sense considerations rule out 1 or 2 clusters as meaningless
A 3 cluster solution results in clusters with 8, 6 & 6 respondents.
Interpretation
A 4 cluster solution has 8, 6, 5 & 1 respondents
Meaningless to have a cluster with only one case
So a 3 cluster solution is preferable
Interpreting & profiling clusters involves examining cluster centroids.

3 Clusters
Cluster 1: Respondent No.: 1, 3, 6, 7, 8, 12, 15, 17
Cluster 2: Respondent No.: 2, 5, 9, 11, 13, 20
Cluster 3: Respondent No.: 4, 10, 14, 16, 18, 19.
Cluster Centroids
              Means of Variables
Cluster No.   V1      V2      V3      V4      V5      V6
1             5.750   3.625   6.000   3.125   1.750   3.875
2             1.667   3.000   1.833   3.500   5.500   3.333
3             3.500   5.833   3.333   6.000   3.500   6.000

Interpretation
Cluster 1 is
High on V1: Shopping is Fun
High on V3: Combine shopping with eating out
Low on V5: Do not care about shopping, i.e. Care about shopping
In short: Fun loving & concerned shoppers.
Interpretation
Cluster 2 is
High on V5: Do not care about shopping
Low on V1: Shopping is Fun
Low on V3: Combine shopping with eating out
In short: Apathetic shoppers.

Interpretation
Cluster 3 is
High on V2: Shopping upsets budget
High on V4: Try to get best buys
High on V6: Can save a lot of money by comparing prices
In short: Economical shoppers.
Non-hierarchical Clustering
Number of clusters is specified in advance
Also called K-means clustering, it has 3 iterative methods:
Sequential Threshold Method: Select a cluster center and group together objects within a pre-specified threshold value from the center. Then select second cluster center and repeat process for the un-clustered objects. And so on till you configure required number of clusters.
Parallel Threshold Method: Select several cluster centers simultaneously and assign objects within the threshold level to the nearest center.
Optimizing Partitioning Method: Here, objects can later be reassigned to clusters to optimize a criterion, such as, average within-cluster distance for a given number of clusters.

Non-hierarchical Clustering Procedure
Number of clusters is specified by decision maker
Now run non-hierarchical clustering procedure on the input data
Output gives final configuration of each cluster.
Further Work
Further profiling can be done on the basis of variables not used for clustering
Identification factors e.g. demographic, economic variables are used to identify members of each cluster
The variables that significantly differentiate between clusters can be obtained through Discriminant Analysis.

End of Cluster Analysis
Multiple Discriminant Analysis

Major Dependence Methods
Dependent Variable   Independent Variables   Method
Metric               Categorical             Analysis of Variance
Metric               Metric                  Multiple Regression
Categorical          Categorical             Canonical Correlation
Categorical          Metric                  Multiple Discriminant
Difference Between Cluster Analysis & Discriminant Analysis
Both classify population elements into groups
Cluster Analysis classifies them into relatively homogeneous groups called clusters. Elements in each cluster are dissimilar to those in other clusters
Discriminant Analysis develops a classification rule to assign a new element to a particular cluster of the population
In cluster analysis there is no a-priori information about which element belongs to which cluster. Clusters are formed by the data.

Discriminant Analysis
Helps in discriminating between two or more sets of objects or people based on the knowledge of some of their characteristics. For example:
Discriminate between bones of males & females
Classifying people into potential buyers or non-buyers
Classifying individuals as excellent, acceptable or bad credit risk
Classifying companies as A, B or C investment risks
Discriminate between brand loyals & brand switchers
Terminology
Predictor: Independent variable (metric)
Criterion: Dependent variable (categorical)
Discriminant Function: Linear combination of the predictors (independent variables), which will best discriminate between the different categories of the criterion (dependent variable)

What Discriminant Analysis does
1. Analyses past data on predictors & criterion
2. Develops a Discriminant Function to discriminate between the different categories of the criterion
3. Evaluates accuracy of classification
4. Classifies objects or people to one of the categories based on values of predictors.
Discriminant Analysis Model
$D = b_0 + b_1X_1 + b_2X_2 + b_3X_3 + \dots + b_kX_k$
where
D = discriminant score
b's = discriminant coefficients or weights
X's = predictors (independent variables)
Coefficients, or weights (b), are estimated so that the groups differ as much as possible on the values of the discriminant function.

Process of Discriminant Analysis
Identify objectives, criterion & predictors.
Criterion must consist of two or more mutually exclusive and collectively exhaustive categories (Gender: M, F; Investment Risk: A, B, C; People: Buyers, Non-buyers)
Draw a sample of objects from the population.
Collect data from sampled objects on predictor variables for each category of criterion.
Process of Discriminant Analysis
Split the sample into two unequal parts.
Bigger part of the sample is called ‘analysis sample’ or ‘estimation sample’. It is used to estimate coefficients (weights) b’s of the discriminant function.
Other part is called the ‘validation sample’ or ‘holdout sample’. It is reserved to evaluate accuracy of the discriminant function.

Conducting Discriminant Analysis
Formulate the Problem
Estimate the Discriminant Function Coefficients
Determine the Significance of the Discriminant Function
Interpret the Results
Assess Validity of Discriminant Analysis
Case Study # 6
Problem: To discover salient characteristics
of families that visited a vacation resort during
last two years
Data were obtained from a sample of 42
families of which 30 were included in analysis
sample & 12 in validation sample
Families that visited resort were coded as 1 &
those that did not as 2
Both samples were balanced in terms of visits
Predictor variables selected were
V1: Family income
V2: Attitude towards travel measured on a
9-point scale
V3: Importance attached to family vacation
measured on a 9-point scale
V4: Household Size
V5: Age of the head of the family.
Case Study # 6: Input Data
Income = Annual Family Income (Rs0000); Travel = Attitude Toward Travel; Vacation = Importance Attached to Family Vacation; HSize = Household Size; Age = Age of Head of Household

No.  Resort Visit  Income  Travel  Vacation  HSize  Age
 1        1         50.2     5        8        3     43
 2        1         70.3     6        7        4     61
 3        1         62.9     7        5        6     52
 4        1         48.5     7        5        5     36
 5        1         52.7     6        6        4     55
 6        1         75.0     8        7        5     68
 7        1         46.2     5        3        3     62
 8        1         57.0     2        4        6     51
 9        1         64.1     7        5        4     57
10        1         68.1     7        6        5     45
11        1         73.4     6        7        5     44
12        1         71.9     5        8        4     64
13        1         56.2     1        8        6     54
14        1         49.3     4        2        3     56
15        1         62.0     5        6        2     58

Case Study # 6: Input Data
No.  Resort Visit  Income  Travel  Vacation  HSize  Age
16        2         32.1     5        4        3     58
17        2         36.2     4        3        2     55
18        2         43.2     2        5        2     57
19        2         50.4     5        2        4     37
20        2         44.1     6        6        3     42
21        2         38.3     6        6        2     45
22        2         55.0     1        2        2     57
23        2         46.1     3        5        3     51
24        2         35.0     6        4        5     64
25        2         37.3     2        7        4     54
26        2         41.8     5        1        3     56
27        2         57.0     8        3        2     36
28        2         33.4     6        8        2     50
29        2         37.5     3        2        3     48
30        2         41.3     3        3        2     42
Case Study # 6: Validation Sample
Income = Annual Family Income (Rs0000); Travel = Attitude Toward Travel; Vacation = Importance Attached to Family Vacation; HSize = Household Size; Age = Age of Head of Household

No.  Resort Visit  Income  Travel  Vacation  HSize  Age
 1        1         63.6     7        4        7     55
 2        1         50.8     4        7        3     45
 3        1         54.0     6        7        4     58
 4        1         45.0     5        4        3     60
 5        1         68.0     6        6        6     46
 6        1         62.1     5        6        3     56
 7        2         35.0     4        3        4     54
 8        2         49.6     5        3        5     39
 9        2         39.4     6        5        3     44
10        2         37.0     2        6        5     51
11        2         54.5     7        3        3     37
12        2         38.2     2        2        3     49

Computerized Discriminant Analysis
GROUP MEANS
VISIT    INCOME     TRAVEL    VACATION   HSIZE     AGE
1        60.52000   5.40000   5.80000    4.33333   53.73333
2        41.91333   4.33333   4.06667    2.80000   50.13333
Total    51.21667   4.86667   4.9333     3.56667   51.93333

Group Standard Deviations
VISIT    INCOME     TRAVEL    VACATION   HSIZE     AGE
1         9.83065   1.91982   1.82052    1.23443   8.77062
2         7.55115   1.95180   2.05171    0.94112   8.27101
Total    12.79523   1.97804   2.09981    1.33089   8.57395

Pooled Within-Groups Correlation Matrix
           INCOME     TRAVEL     VACATION   HSIZE      AGE
INCOME     1.00000
TRAVEL     0.19745    1.00000
VACATION   0.09148    0.08434    1.00000
HSIZE      0.08887   -0.01681    0.07046    1.00000
AGE       -0.01431   -0.19709    0.01742   -0.04301    1.00000

Computerized Discriminant Analysis
Wilks' λ (U-statistic) and univariate F ratio with 1 and 28 degrees of freedom
Variable   Wilks' λ   F        Significance
INCOME     0.45310    33.800   0.0000
TRAVEL     0.92479     2.277   0.1425
VACATION   0.82377     5.990   0.0209
HSIZE      0.65672    14.640   0.0007
AGE        0.95441     1.338   0.2572
Computerized Discriminant Analysis
CANONICAL DISCRIMINANT FUNCTIONS
Function   Eigenvalue   % of Variance   Cum %    Canonical Correlation
1*         1.7862       100.00          100.00   0.8007

After Function   Wilks' λ   Chi-square   df   Significance
0                0.3589     26.130       5    0.0001
* marks the 1 canonical discriminant functions remaining in the analysis.

Standard Canonical Discriminant Function Coefficients
           FUNC 1
INCOME     0.74301
TRAVEL     0.09611
VACATION   0.23329
HSIZE      0.46911
AGE        0.20922

Structure Matrix:
Pooled within-groups correlations between discriminating variables & canonical discriminant functions
(variables ordered by size of correlation within function)
           FUNC 1
INCOME     0.82202
HSIZE      0.54096
VACATION   0.34607
TRAVEL     0.21337
AGE        0.16354

Unstandardized Canonical Discriminant Function Coefficients
             FUNC 1
INCOME       0.8476710E-01
TRAVEL       0.4964455E-01
VACATION     0.1202813
HSIZE        0.4273893
AGE          0.2454380E-01
(constant)  -7.975476

Canonical discriminant functions evaluated at group means (group centroids)
Group   FUNC 1
1        1.29118
2       -1.29118

Classification results for cases selected for use in analysis
Actual Group   No. of Cases   Predicted Group 1   Predicted Group 2
Group 1        15             12 (80.0%)           3 (20.0%)
Group 2        15              0 (0.0%)           15 (100.0%)
Percent of grouped cases correctly classified: 90.00%
Interpretation
Un-standardized discriminant function is
D = -7.975476
+ 0.8476710E-01 (INCOME)
+ 0.4964455E-01 (TRAVEL)
+ 0.1202813 (VACATION)
+ 0.4273893 (HSIZE)
+ 0.2454380E-01 (AGE)

Classification Using the Model
Group Centroids are values of discriminant function at Group Means
Group Centroids are:
Group   Centroid
1        1.335
2       -1.256
Average of two Group Centroids gives cut-off point
Average of two centroids is 0.0395
Classification Using the Model
Average of two centroids is 0.0395
Therefore:
Any value of discriminant score D > 0.0395 will classify the object as ‘Resort Visit’
Any value of discriminant score D < 0.0395 will classify the object as ‘No Resort Visit’
Now, let us assess the validity of D using the validation sample.

Case Study # 6: Validation Sample
Income = Annual Family Income (Rs0000); Travel = Attitude Toward Travel; Vacation = Importance Attached to Family Vacation; HSize = Household Size; Age = Age of Head of Household

No.  Resort Visit  Income  Travel  Vacation  HSize  Age
 1        1         63.6     7        4        7     55
 2        1         50.8     4        7        3     45
 3        1         54.0     6        7        4     58
 4        1         45.0     5        4        3     60
 5        1         68.0     6        6        6     46
 6        1         62.1     5        6        3     56
 7        2         35.0     4        3        4     54
 8        2         49.6     5        3        5     39
 9        2         39.4     6        5        3     44
10        2         37.0     2        6        5     51
11        2         54.5     7        3        3     37
12        2         38.2     2        2        3     49
Classification Using Model
Value of Discriminant Function for the 1st family in Validation Sample is:
D = -7.975476
+ 0.8476710E-01 (63.6)
+ 0.4964455E-01 (7)
+ 0.1202813 (4)
+ 0.4273893 (7)
+ 0.2454380E-01 (55)
= +2.5860
Thus respondent belongs to group 1 (Traveled: Correct)
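The same calculation can be repeated for every family in the validation sample. The sketch below is only an illustration: it uses the unstandardized coefficients and the 0.0395 cut-off quoted in the slides, and the names `discriminant_score` and `classify` are made up for this example, not taken from any statistics package.

```python
# Unstandardized discriminant coefficients as printed in the Case Study # 6 output.
CONSTANT = -7.975476
WEIGHTS = (0.08476710, 0.04964455, 0.1202813, 0.4273893, 0.02454380)
CUTOFF = 0.0395  # average of the two group centroids quoted on the slides

# Validation sample: (actual group, income, travel, vacation, hsize, age)
VALIDATION = [
    (1, 63.6, 7, 4, 7, 55), (1, 50.8, 4, 7, 3, 45), (1, 54.0, 6, 7, 4, 58),
    (1, 45.0, 5, 4, 3, 60), (1, 68.0, 6, 6, 6, 46), (1, 62.1, 5, 6, 3, 56),
    (2, 35.0, 4, 3, 4, 54), (2, 49.6, 5, 3, 5, 39), (2, 39.4, 6, 5, 3, 44),
    (2, 37.0, 2, 6, 5, 51), (2, 54.5, 7, 3, 3, 37), (2, 38.2, 2, 2, 3, 49),
]


def discriminant_score(predictors):
    """D = constant + sum of (weight x predictor value)."""
    return CONSTANT + sum(w * x for w, x in zip(WEIGHTS, predictors))


def classify(score, cutoff=CUTOFF):
    """Group 1 (Resort Visit) above the cut-off, group 2 (No Resort Visit) below."""
    return 1 if score > cutoff else 2


scores = [discriminant_score(row[1:]) for row in VALIDATION]
predicted = [classify(s) for s in scores]
hits = sum(p == row[0] for p, row in zip(predicted, VALIDATION))
hit_rate = hits / len(VALIDATION)

for number, (row, d, p) in enumerate(zip(VALIDATION, scores, predicted), start=1):
    print(f"Family {number:2d}: actual {row[0]}, D = {d:+.4f}, predicted {p}")
print(f"Correctly classified: {hits} of {len(VALIDATION)} ({hit_rate:.2%})")
```

Families whose score falls on the wrong side of the cut-off are the misclassified cases counted in the validation results.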
Classification Results for cases not selected for use in the analysis (validation sample)
Actual Group   No. of Cases   Predicted Group 1   Predicted Group 2
Group 1        6              4 (66.7%)           2 (33.3%)
Group 2        6              0 (0.0%)            6 (100.0%)
Percent of grouped cases correctly classified: 83.33%.

End of
Multiple Discriminant Analysis
Sampling
Dr. Sharad Varde

Research Process
Identify broad area of research → gather preliminary data → Define research problem → Identify important factors (variables) → Generate hypotheses → Prepare research design → COLLECT DATA, analyse & interpret → Draw conclusions → Write report → Present report for research-based decision making.
Data Collection
Data must be collected from the people, events, or objects that can provide correct answers to the research problem
Process of selecting the right people, events, or objects in right numbers is called SAMPLING

Terminology
Population: Entire group of people, events, or objects of interest in context of research
Element: A single member of the population
Population Frame: List of all elements in the population from which a sample is drawn
Example: List of all students in a college, list of all entertainment events in Mumbai in Oct 2010, list of all songs sung by Lata Mangeshkar
Population Parameters: Population mean & variance.
Terminology
Sample: A subset of population selected for data collection in the research study
Subject: A single member of the sample
Sampling: Process of selecting sufficient number of elements from the population
Sampling saves time & cost of research
Sample Statistics: Sample mean (central tendency) & sample variance (dispersion).

Representativeness of Sample
Should enable generalizing conclusions for entire population
Hence, sample should faithfully represent the population in respect of characteristics under investigation
Representative sample should ensure sample mean ≈ population mean & sample variance ≈ population variance.
Important Issues in Sampling
1. Sampling Design: Precisely how to draw a sample from the population
2. Sample Size n: How many elements of the population to be selected to form a representative sample
Both depend upon cost & time budget of the study, and on the desired reliability of conclusions (confidence & precision).

Important Issues in Sampling
Larger sample → higher chance of accuracy
But, larger sample → higher cost & time
Hence, one must strike a balance
Sampling design serves the purpose
It enables better precision and higher confidence with smaller sample.
Types of Sampling Designs
Two Basic Types
A. Non-probability Sampling: Elements of population do not have a known or predetermined chance of selection. Used when quick results with low generalizability are needed at meager cost (e.g. exit polls)
B. Probability Sampling: Elements of population do have a known chance of selection. This design produces representative samples and wider generalizability.
Their Application Areas
A. Non-probability Sampling: Mostly in exploratory studies
B. Probability Sampling: Mostly in descriptive and causal studies.

Case Studies
Accounts Manager has put in place a new fully computerized accounting system
Before making further improvements, he wants to get accounting staff’s reaction to it without making it seem that he has doubts about its utility & practicality
So, he casually talks to the first five guys that walk into the office.
Case Studies
A TV journalist wants instant reactions of aam janta to the budget proposals just announced in the Loksabha
While she wants responses of the man on the street, she knows that anyone unaware of the budget exercise and exact proposals will not serve her purpose
She moves on to pick up persons who, in her judgment, fit the bill.

Case Studies
GMAC (Graduate Management Admission Council) surveyed 740 university professors across the world who are intensively knowledgeable of GMAT formats over the past years to find out their suggestions for bettering GMAT (Graduate Management Admission Test) before launching the 10th generation of GMAT in 2013.
Case Studies
Galaxy Tours & Travels wants to find out strong & weak points of its competitor Star Travels from customers’ view point
Having come across a lady client of Star Travels to talk with, the interviewer asks her after the interview to introduce him to someone else who in her knowledge recently used the services of Star Travels
Process goes on till he gets 20 such persons

A. Non-probability Sampling
1. Convenience Sampling: Conveniently available elements are chosen (Example: Audience reaction to film: 1st day 1st show)
2. Purposive Sampling: Specific types of elements who have and can provide the desired information are chosen.
A. Non-probability Sampling
3. Judgment Sampling: Researcher uses her judgment about who would be the best respondents to serve the purpose
4. Snowball or Referral Sampling: A respondent is asked to name someone he knows who too can provide valuable info. It sets a chain process.

B. Probability Sampling
Example: A sample of 100 TVs to be drawn from 10,000 TVs produced in Oct 2010
Each TV has 100 ÷ 10,000 = 0.01 i.e. 1% chance of being chosen
Sampling Design tells researcher precisely how to pick up 100 TVs
There are five major designs of this type.
B.1: Simple Random Sampling
●Two lucky numbers to be drawn out of 100 tokens. Put all 100 tokens in a basket. Stir well. Close eyes and pick up two tokens
●For larger population, assign serial numbers to each element. Use a standard table of random numbers. Select the required number of elements one after other
●But, listing large populations is tedious.

A Case Study
HR Director of a software firm with 1926 engineers wants to find out desirability of changing the current 10 – 6 working hours to flexitime along with its benefits & drawbacks perceived by the engineers before the next board meeting
She would pick up a few engineers randomly & ask them appropriate questions.
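A minimal sketch of simple random sampling for the HR case, assuming the 1926 engineers are given serial numbers 1 to 1926. The sample size of 40 and the seed are arbitrary choices for illustration (the slide only says "a few"), and Python's pseudo-random generator stands in for the table of random numbers.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements, each element having the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)


# Serial numbers assigned to the 1926 engineers (the population frame).
engineers = list(range(1, 1927))

# Hypothetical choice: interview 40 engineers.
sample = simple_random_sample(engineers, 40, seed=2010)
print(sorted(sample))
```

Fixing the seed only makes the draw repeatable; leaving it as `None` gives a fresh sample each run.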
B.2: Systematic Sampling
●A sample of 50 cars to be selected from 10,000 cars produced in 2009
●10,000 ÷ 50 = 200. Select every 200th car
●More precisely, select a random number between 1 and 200, say 30. Select 30th car
●Starting from 30th car, select every 200th car: 30, 230, 430, 630, 830, 1030, 1230, 1430…

A Case Study
Maruti Suzuki Ltd. wants to check response of prospective buyers to the new features introduced in its small car segment
From the dealers’ alphabetical list, the Company selects every 50th dealer & sends a senior marketing manager to talk to them.
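The car example can be sketched in a few lines. The function name `systematic_sample` is made up for this illustration, and it assumes the population size is an exact multiple of the sample size, as it is with 10,000 cars and a sample of 50.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Pick a random start within the first interval, then every k-th element."""
    interval = population_size // sample_size  # 10,000 / 50 = 200
    if start is None:
        start = random.Random(seed).randint(1, interval)
    return [start + k * interval for k in range(sample_size)]


# Reproduce the slide's example, where the random start happens to be 30.
cars = systematic_sample(10_000, 50, start=30)
print(cars[:8])  # 30, 230, 430, 630, 830, 1030, 1230, 1430
```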
B.3: Stratified Random Sampling
●If population contains identifiable subgroups of elements, researcher must provide proper representation to each subgroup
●Process: Divide the population into mutually exclusive identifiable subgroups (strata)
●Draw a simple random sample (or systematic sample) from each stratum
●Size of sample from each stratum directly proportional to size of the stratum
●Homogeneity within each stratum
●Heterogeneity between strata.
●Ex.: Population: All students of a college
●Identifiable Subgroups: males / females; arts / science / commerce; brilliant / average / poor
●Lata M. songs: By language, solo / duet etc.
Study of Absenteeism (2% sample)
Category (Stratum)   Total Number   Sample Size
Managers                  250             5
Junior Managers           500            10
Assistants               2000            40
Skilled Workers          4000            80
Unskilled Labour         1000            20
5 Strata                 7750           155

B.3: Stratified Random Sampling
●It is Proportionate stratified random sample
●If all strata are of comparable sizes, it is OK
●But, if some are too large or too small, we need to draw a Disproportionate stratified random sample
●Larger than proportionate representation to smaller strata and vice versa.
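The proportionate allocation in the absenteeism study is simple arithmetic: each stratum contributes the same 2% of its members. A small sketch, with the helper name `proportionate_allocation` chosen only for this example:

```python
# Stratum sizes from the Study of Absenteeism.
STRATA = {
    "Managers": 250,
    "Junior Managers": 500,
    "Assistants": 2000,
    "Skilled Workers": 4000,
    "Unskilled Labour": 1000,
}


def proportionate_allocation(strata, percent):
    """Sample size per stratum when every stratum is sampled at the same rate."""
    return {name: round(size * percent / 100) for name, size in strata.items()}


allocation = proportionate_allocation(STRATA, 2)
for name, n in allocation.items():
    print(f"{name:17s} {STRATA[name]:5d} -> {n:3d}")
print(f"{'Total':17s} {sum(STRATA.values()):5d} -> {sum(allocation.values()):3d}")
```

Within each stratum the allotted number of subjects would then be drawn by simple random or systematic sampling.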
Study of Motivation (2% sample)
Category (Stratum)   Total Number   Proportionate Sample Size   Disproportionate Sample Size
Sr. Managers              100                  2                            7
Middle Mgrs.              300                  6                           15
Jr. Managers              500                 10                           20
Supervisors              1000                 20                           30
Clerks                   5000                100                           60
Secretaries               200                  4                           10
6 Strata                 7100                142                          142

B.3: Stratified Random Sampling
●Rule in drawing Disproportionate stratified random sample:
●Observe the spread (variance) in each stratum
●Low variance: relatively more homogeneous stratum needs smaller sample
●High variance: relatively less homogeneous stratum needs bigger sample.
B.3: Stratified Random Sampling
●Stratified random sampling involves dividing population into strata
●Hence, it needs higher time and cost
●But, it provides desired precision with smaller sample than simple random or systematic sample.

A Case Study
A manufacturing company wants to conduct stress management programs for its employees. The consultant wants to get a first hand feel of the stress levels experienced by employees. She classifies them into 4 categories:
1. Workmen constantly handling dangerous chemicals
2. Foremen responsible for quality & productivity
3. Sales personnel under monthly targets
4. All others

A Case Study
The consultant randomly picks up some employees from each category. Since group 2 & 3 are smaller than 1 and group 4 is largest, she picks up 2% of group 4, 5% of group 1, and 10% of group 2 & 3 persons and talks with them at length.
This is a case of disproportionate stratified random sampling.

B.4: Cluster Sampling
●Used when population consists of several groups of elements in such a manner that:
●Groups are similar to each other and
●Each group (CLUSTER) is heterogeneous
●So, population has inter-group homogeneity and intra-group heterogeneity
●Exactly opposite of stratified population
●Process: Select a few clusters randomly.
B.4: Cluster Sampling Examples
●A truckload of mangoes in 4 dozen boxes. Each box has upper layer of top quality fruits. Quality & size drops layer by layer.
●Thus, homogeneity between boxes & heterogeneity within each box.
●Draw a random or systematic sample of a few boxes, open them and study them.
●No need to open other boxes from the truck.

B.4: Cluster Sampling Examples
●Complex of many identical buildings. We can select 5 out of 50 buildings
●A Mgmt Inst: 2000 students per year. 50 per batch. 40 batches run concurrently. Each has some active, some ordinary & some passive students, and 75% boys, 25% girls. Choose 4 batches and talk to all 200 students without disturbing other 36 batches.
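The management institute example can be sketched as follows. The student labels and the seed are invented for illustration; the point is that whole batches (clusters) are drawn at random and every student in a chosen batch is covered.

```python
import random

rng = random.Random(7)  # arbitrary seed so the draw is repeatable

# 40 batches of 50 students each; labels such as "B03-S17" are hypothetical.
batches = {
    b: [f"B{b:02d}-S{s:02d}" for s in range(1, 51)]
    for b in range(1, 41)
}

# Stage 1: select 4 whole batches at random.
chosen = sorted(rng.sample(sorted(batches), 4))

# Stage 2: talk to ALL students in the chosen batches; the other 36 are untouched.
respondents = [student for b in chosen for student in batches[b]]

print("Chosen batches:", chosen)
print("Number of respondents:", len(respondents))
```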
B.4: Cluster Sampling
●Convenient
●Sample size smaller
●Less time and cost
●But, restrictive in application: You don’t frequently get such populations.

A Case Study
Under a community health program for tribals, it was necessary to discover their current state of nutrition, health & beliefs
Since adivasi padas are located at long distances from each other in tribal areas, a few adivasi padas were selected at random and all residents from infants to old ones were checked.
B.5: Double Sampling
●Used when we need some preliminary and some detailed information about population
●Example: Preliminary Info: Investible surplus with bank depositors
●Detailed Info: Perception about different types of investments available to individuals, their advantages, disadvantages, risks & benefits, and depositors’ preparedness to invest how much % in which scheme.

B.5: Double Sampling
●Process: First draw a random sample (simple, systematic or stratified) of bank depositors. Collect info on their investible surplus funds
●Then draw a random sample from this sample (sub-sample) for administering a detailed questionnaire to find out subsampled subjects’ knowledge & perception of various investment avenues.
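A rough sketch of the two-phase draw for the bank depositor example. The frame of 5,000 depositors and the phase sizes of 500 and 50 are assumptions made up for illustration; the slides give no numbers.

```python
import random

rng = random.Random(42)  # arbitrary seed so the draw is repeatable

# Hypothetical population frame: depositor account serial numbers.
depositors = list(range(1, 5001))

# Phase 1: a larger random sample, asked only about investible surplus.
first_phase = rng.sample(depositors, 500)

# Phase 2: a sub-sample drawn FROM the first sample for the detailed questionnaire.
second_phase = rng.sample(first_phase, 50)

print(len(first_phase), "depositors give preliminary information")
print(len(second_phase), "of them answer the detailed questionnaire")
```

In practice the first-phase answers (for example, the size of the investible surplus) could also be used to stratify the second draw.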
A Case Study
GoI wants to know industry opinion about withdrawal of 1 year old stimulus package
Large sample of companies across the sectors is drawn to seek opinion
A smaller sub-sample is selected to probe deeper into industry psyche and to obtain practical suggestions to maintain industrial growth.

Exercise
A conglomerate deals with appliances, machine tools, furniture, storage solutions, office equipment, processed foods, chicken, agri-products, mosquito repellents, edible oils, chemicals, healthcare, cosmetics, detergents, etc.
Its earnings are under competitive pressure.

Exercise
It wants to surge ahead of competitors through following strategies:
1. Developing new products
2. Enhancing advertising effectiveness
3. Tapping creative ideas within the group
4. Improving employee motivation.

Exercise
Determine sampling designs to gather vital information required to work on each of the above 5 strategies
Time is of the essence.
The company wants to make decisions in the next quarterly board meeting
So, all these inputs are needed in 30 days...
Multivariate Analysis
Dr. Sharad Varde

Major Dependence Methods
Dependent Variable   Independent Variables   Method
Metric               Categorical             Analysis of Variance
Metric               Metric                  Multiple Regression
Categorical          Categorical             Canonical Correlation
Categorical          Metric                  Multiple Discriminant
Two Basic Concepts
1. Scatter Plot
2. Correlation

Scatter Plot
[Figure: scatter plot. Horizontal Axis: Reasoning Scores; Vertical Axis: Creativity Scores]
Basic Patterns of Scatter Plot
[Figure: three scatter plots. Both Move Together; Move In Opposite Way; No Relationship]

Correlation Coefficient
For Cardinal Variables
Data: Actual measurements on both variables
Formula:
r = (Mean of Products of Values - Product of the Two Means) / (Product of the Two Standard Deviations)
Name: Pearson’s Correlation Coefficient
Statisticians call it Pearson’s r.
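Pearson's r translates directly into code. The sketch below follows the slide's wording term by term; the reasoning and creativity scores are invented numbers used only to exercise the function, and population (not sample) standard deviations are assumed so that the formula holds exactly.

```python
from statistics import fmean, pstdev


def pearson_r(x, y):
    """(Mean of products - product of means) / (product of the two std deviations)."""
    mean_of_products = fmean(a * b for a, b in zip(x, y))
    product_of_means = fmean(x) * fmean(y)
    return (mean_of_products - product_of_means) / (pstdev(x) * pstdev(y))


# Hypothetical scores for five people.
reasoning = [2, 4, 6, 8, 10]
creativity = [1, 3, 2, 5, 4]

r = pearson_r(reasoning, creativity)
print(f"Pearson's r = {r:.2f}")
```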
Correlation Coefficient
For Ordinal Variables
Actual Measurements on Both Variables Not Available
Available Data are in the Form of Ranks
Formula: 1 - (6 x ∑ Square of Rank Diff) / (n x (n² - 1))
where n denotes number of observations
Name: Rank Correlation Coefficient.

Simple Regression
Model
Regression
●Dictionary Says: The act of returning or stepping back to a previous stage
●Do quantitative methods force us to regress instead of progress?
●Or, is it Back to the Future?
●Statistics, like any other field, adopts crazy names arising from some important historical events.

Story of Regression
Sir Francis Galton studied the heights of the sons in relation to the heights of their fathers
His Conclusion: Sons of tall fathers were not so tall & sons of short fathers were not so short as their fathers
Path Breaking Finding: Human heights tend to REGRESS back to normalcy
Since then, similar studies on the nature and extent of influence of one or more variables on some other variable acquired the name ‘Regression Analysis’.

Regression Curve
[Figure: regression curve. Horizontal Axis: Cause Variable: Reasoning Scores; Vertical Axis: Effect Variable: Creativity Scores]

Regression Analysis
●A quantitative method which tries to estimate the value of a Cardinal Variable (Effect) by studying its relationship with other Cardinal Variables (Cause)
●This relationship is expressed by a custom-designed statistical formula called the Regression Equation.
Purpose of Regression Analysis
1. To establish exact nature of influence of Cause Variable on Effect Variable
2. To determine the quantum of influence.
3. To estimate the value of Effect Variable from value of Cause Variable and assess error
4. To forecast future values of Effect Variable from information about Cause Variable.

Patterns of Regression Curves
Pattern # 1: Upward Sloping Straight Line
Statistical Model: Y = a + bX + ε (b > 0)
Relationship: Increase in X leads to proportionate increase in Y
[Figure: upward sloping straight line. Horizontal axis: X; vertical axis: Y]
Estimating Regression Parameters a & b
Formulae for regression parameters a & b are worked out by a method that assures Minimum Total Error of Estimation/Forecasting, namely,
∑ ε² = ∑(Actual values of Y - Estimated values of Y)²
It is Least Square Error method
Divide ∑ ε² by the number of observations to get Mean Square Error (MSE)
Minimum Mean Square Error (MMSE) method.

Least Square Method
Formula for Regression Coefficient b:
b = (Mean of Products of Values - Product of the Two Means) / (Variance of Cause Variable)
Formula for Regression Constant a:
a = Mean of Effect Variable Minus b times Mean of Cause Variable
Regression coefficient ‘b’ and regression constant ‘a’ are jointly called ‘Regression Parameters’.
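The two formulae and the Mean Square Error can be checked on a handful of numbers. This is only a sketch: the years-of-experience and salary figures are made up, the function names are chosen for this example, and the population variance is assumed so that the formula for b matches the slide's wording.

```python
from statistics import fmean, pvariance


def fit_line(x, y):
    """Least-squares estimates of a and b for Y = a + bX."""
    mean_x, mean_y = fmean(x), fmean(y)
    mean_of_products = fmean(xi * yi for xi, yi in zip(x, y))
    b = (mean_of_products - mean_x * mean_y) / pvariance(x)
    a = mean_y - b * mean_x
    return a, b


def mean_square_error(x, y, a, b):
    """Average of (actual Y - estimated Y) squared."""
    return fmean((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data: cause variable (years of experience), effect variable (salary).
experience = [1, 2, 3, 4, 5]
salary = [30, 35, 45, 50, 60]

a, b = fit_line(experience, salary)
mse = mean_square_error(experience, salary, a, b)
print(f"Estimated salary = {a:.1f} + {b:.1f} x experience, MSE = {mse:.2f}")
```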
Concept: Error of Estimation
Note the difference between the actual values of Effect Variable (Salary) and the values estimated by the Regression Model
This is called the Error of Estimation
Less the Error, Better the Model. Ideally 0.
Statistical Model: Y = a + bX + ε
If Correlation is Perfect (+1 or -1), ε = 0.

Mean Square Error
Errors must be small for the model to be a good fit and to guide us into future
MSE must be within the range permitted by the sponsor of research
Errors should be erratic / haphazard.
Other Patterns of Regression Curves
Pattern # 2: Downward Sloping Straight Line
Statistical Model: Y = a - bX + ε (b > 0)
Relationship: Increase in X leads to proportionate decrease in Y.
[Figure: downward sloping straight line. Horizontal axis: X; vertical axis: Y]

Pattern # 3: Simple Exponential
Increase in X leads to faster increase in Y
[Figure: exponential curve. Horizontal axis: X; vertical axis: Y]
Statistical Model
Pattern # 3: Simple Exponential
Relationship: Increase in X leads to faster increase in Y
Statistical Model: Y = e^(a + bX) + error
where, e = 2.71828183 (Euler's number or Napier's constant).

Linear Conversion
Statistical Model: Y = e^(a + bX) + ε
Logarithm pulls in the curvature and flattens the curve (Note: Log 1 = 0; Log 10 = 1)
Linear Conversion: Log Y = a + bX + ε
Call Z = Log Y
Now, fit Pattern # 1 to Z and X
Z = α + βX + ε.
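The linear conversion can be tried out on synthetic data. In this sketch the data are generated without error from Y = 2 x e^(0.5 X), so the fitted line should recover α = ln 2 and β = 0.5. The natural logarithm is assumed here because the model uses base e, and `fit_line` is a small helper written for this example using the least-squares formulae.

```python
import math
from statistics import fmean, pvariance


def fit_line(x, z):
    """Least-squares estimates (alpha, beta) for Z = alpha + beta * X."""
    mean_x, mean_z = fmean(x), fmean(z)
    mean_of_products = fmean(xi * zi for xi, zi in zip(x, z))
    beta = (mean_of_products - mean_x * mean_z) / pvariance(x)
    return mean_z - beta * mean_x, beta


# Hypothetical noiseless exponential data.
x = [0, 1, 2, 3, 4]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

# Linear conversion: Z = Log Y, then fit Pattern # 1 to Z and X.
z = [math.log(yi) for yi in y]
alpha, beta = fit_line(x, z)

print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
print(f"Back-transformed model: Y = e^({alpha:.4f} + {beta:.4f} X)")
```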
Simple Exponential
[Figure: straight line after conversion. Horizontal axis: X; vertical axis: Log Y]

Pattern # 4: Upward Curvilinear
Increase in X leads to slower increase in Y
[Figure: flattening upward curve. Horizontal axis: X; vertical axis: Y]

Pattern # 4: Upward Curvilinear
Relationship: Increase in X leads to slower increase in Y
Statistical Model: Y = a + b Log X + ε
Now, fit Pattern # 1 to Y and Log X
A Tip: Try double logarithm if single log fails to flatten the curve satisfactorily. In that case,
Y = a + b Log (Log X) + ε.

Pattern # 5: Downward Curvilinear
Increase in X leads to faster decrease in Y
[Figure: downward curve. Horizontal axis: X; vertical axis: Y]

Pattern # 5: Downward Curvilinear
Relationship: Increase in X leads to faster decrease in Y
Statistical Model: 1/Y = e^(a + bX) + error
Linear Conversion:
Log_e (1/Y) = a + bX + ε
Now, fit Pattern # 1 to Log_e (1/Y) and X.

Pattern # 6: Negative Exponential
Increase in X leads to slower decrease in Y
[Figure: downward curve. Horizontal axis: X; vertical axis: Y]

Pattern # 6: Negative Exponential
Relationship: Increase in X leads to slower decrease in Y
Statistical Model:
Y = a - b Log_e X + ε
Now, fit Pattern # 2 to Y and Log_e X.

Power of Logarithm
Two standard patterns: a straight line
Two standard patterns: Log X converts to a straight line (Patterns # 4 & # 6)
Two standard patterns: Log Y (# 3) or Log 1/Y (# 5) converts to a straight line
Logarithm sucks in the curvature
Double Log can flatten deeper curvature.
Pattern # 7: Logistic or S Curve
Increase in X leads initially to faster increase in Y, then to steady increase, and finally to slower increase in Y
[Figure: S-shaped curve. Horizontal axis: X; vertical axis: Y]

Pattern # 7: Logistic or S Curve
Relationship: Increase in X leads initially to faster increase, then to steady increase, & finally to slower increase in Y
Statistical Model:
1/Y = (1/a) + (b/a) e^(cX) + ε
where, e is the base of the natural logarithm (e = 2.71828...).
Steps for Fitting Regression Model
A. Collect a set of reliable cardinal observations on Effect variable (Y) and corresponding cardinal values of Cause variable (X)
B. Compute correlation. If high, proceed further.
C. Plot Y vs. X & detect presence of a pattern
D. Identify nature of cause-&-effect relationship
E. Compute quantum of the relationship
F. Conduct error analysis: small, haphazard, MSE
G. If OK, use the model for forecasting.

Your Role
Understand the situation in totality
Detect a logical cause-and-effect relationship
Identify relevant cardinal variables X and Y
Obtain reliable data on X and Y
Compute Pearson’s correlation coefficient
If it is high (+ or -), draw a scatter plot, join the points by a free hand, and identify the pattern
Compute regression parameters a & b for the pattern and fit regression model using SPSS.
A Word of Caution
Undertake regression analysis only for cardinal variables (effect and cause)
Select the variables only if you logically suspect influence of one over the other
Carry out regression analysis only after completing correlation analysis AND only if the selected cause and effect variables are in fact highly correlated
If not, choose a better cause variable.

Multiple Regression
Model
Multiple Regression Analysis
Simple Regression: One cause variable influences the effect variable
Some real life phenomena are amenable to simple two-variable regression analysis
BUT, NOT ALL.
Multiple Regression: Several cause variables jointly influence effect variable
Also called Multivariate Regression.

Multiple Regression Model
A technique to analyze the joint effect of many cause variables on effect variable
Multiple Regression Model of pattern # 1:
Y = a + b1X1 + b2X2 + ... + bnXn + ε
Caution: Cause Variables X1, X2, ..., Xn SHOULD NOT BE Inter-Correlated
Otherwise, your model will suffer from a disease called Multi-Collinearity.

Multiple Regression Analysis
Multiple Regression Model:
Y = a + b1X1 + b2X2 + ... + bnXn + ε
It is not necessary that all independent variables (X’s) influence the dependent variable (Y) in above simple fashion
A generalized process detects complex relationships.

Steps in Multiple Regression Analysis
1. Understand the situation in totality
2. Detect the effect variable Y that is crucial for decision making / planning and all possible cause variables X1, X2, ..., Xn
3. Obtain reliable data on all variables
4. Compute Pearson’s correlation coefficient (SPSS / SAS) between Y and each of the n cause variables
5. Drop those X’s which exhibit poor correlation with Y.
Steps in Multiple Regression Analysis
6. For the balance X’s, compute correlation coefficients between each X and other X’s to check orthogonality (lack of multi-collinearity)
7. If a pair of X’s shows high correlation, drop the one that bears weaker correlation with Y
8. Now you are left with Y and a smaller number of X’s which:
- individually bear strong correlation with Y,
- but poor correlation among themselves.

Steps in Multiple Regression Analysis
9. Proceed Step by Step. Start with the Cause variable that shows highest correlation with Y
10. Draw its scatter plot with Y, & identify pattern
11. If it does not resemble a straight line, use logarithms to flatten the curve
12. Fit two-variable regression model:
Y = a + b f(X) + ε
13. If errors are haphazard, small, & MSE is within the set limit, stop
14. If not, select the cause factor that shows next highest correlation with Y. Repeat process.

Steps in Multiple Regression Analysis
15. Fit three-variable regression model:
Y = a + b1 f1(X1) + b2 f2(X2) + ε
16. If errors are haphazard, small, & MSE is within the set limit, stop
17. If not, select the cause variable that shows the third highest correlation with Y. Repeat the process till you reach acceptable errors.
18. You will finally get multiple regression model:
Y = a + b1 f1(X1) + b2 f2(X2) + ... + bn fn(Xn) + ε

End of
Multivariate Analysis
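Steps 4 to 8 of the multiple regression procedure (screening cause variables by correlation) can be sketched in plain Python. Everything specific here is an assumption made for illustration: the data are invented, the function `screen_predictors` is not from SPSS or SAS, and the thresholds of 0.5 for a "poor" correlation with Y and 0.8 for a "high" correlation between two X's are arbitrary cut-offs that the slides do not specify.

```python
from statistics import fmean, pstdev


def pearson_r(x, y):
    """Pearson's r: (mean of products - product of means) / product of std devs."""
    cov = fmean(a * b for a, b in zip(x, y)) - fmean(x) * fmean(y)
    return cov / (pstdev(x) * pstdev(y))


def screen_predictors(y, candidates, min_r_with_y=0.5, max_r_between_x=0.8):
    """Keep X's that correlate well with Y but poorly with each other."""
    r_with_y = {name: pearson_r(x, y) for name, x in candidates.items()}
    # Steps 4-5: drop X's with poor correlation with Y.
    kept = [n for n in candidates if abs(r_with_y[n]) >= min_r_with_y]
    # Consider the strongest X first, so that in a collinear pair
    # the one with the weaker correlation with Y is the one dropped (steps 6-7).
    kept.sort(key=lambda n: abs(r_with_y[n]), reverse=True)
    selected = []
    for name in kept:
        if all(abs(pearson_r(candidates[name], candidates[s])) < max_r_between_x
               for s in selected):
            selected.append(name)
    return selected, r_with_y


# Hypothetical data: Y depends on X1 and X4; X2 nearly duplicates X1; X3 is weak.
y = [3, 8, 7, 8, 11, 8]
candidates = {
    "X1": [1, 2, 3, 4, 5, 6],
    "X2": [2, 4, 6, 8, 10, 13],
    "X3": [4, 1, 3, 5, 2, 3],
    "X4": [2, 6, 4, 4, 6, 2],
}

selected, r_with_y = screen_predictors(y, candidates)
print("Correlation with Y:", {k: round(v, 3) for k, v in r_with_y.items()})
print("Cause variables retained (step 8):", selected)
```

The retained variables are then entered one at a time, strongest first, as described in steps 9 to 18.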
THANK YOU
Dr. Sharad Varde