Research Methodology

11/24/2010

MEASURING GOVERNMENT OUTPUT IN THE UK
D NARAYANA
IGIDR Lecture 2
20 October 2010

Why bother about Service Sector?
• Global trend: economies are becoming service economies
• Share of service sector in GDP
  – Between 60 and 80% in most OECD countries
• In India, over the last 40 years
  – Agricultural share: over 60 to <20?
  – Industry share: 16 to 20?
  – Service share: 30 to >60?
• But studies?

A hint of the problem
• Studies have argued
• Service sector output overestimated
  – Inadequate data
  – Faulty methodology
  – Price deflators inappropriate
• Problem more serious: a hint
• (Recall three approaches to GDP: value of output = income generated = total expenditure)
• Indian health sector: about 5% of GDP spent on health care, but share 1.87%

Measurement of Output and of Value Added
• Output at current prices = Sales + Changes in inventories + work in progress
• Applicable for the market sector
• Some exceptions, like banks, insurance
• Alternative measure needed
• Banks invoice limited portion of their services
  – Foreign exchange commissions, check handling charges, stock market transactions
• Bulk of their services: making loans
• They accept deposits, lend, financial intermediation

Value Added in Non Market Sector
• Non-market sector, mainly government
• Provide services free of charge, or at prices not economically significant
  – Defence, public education, public health
• Financed through taxation or social contributions
• No direct link between payment and service
• Some services provided on individual basis
  – Family sends children to school
• Other services consumed collectively, like defence, police etc.

Conventional Approach
• Government output = total value of the inputs
• In the UK, Input = the compensation of employees + the procurement cost of goods and services + a charge for the consumption of fixed capital
• In the US, Input is limited to employment

Government Output
• Government: all those agencies that provide public services
• Examples: NHS and local authority provision of social services
• Need to distinguish between
  – Individual services (those consumed by individual households)
  – Collective services provided to society as a whole
• Non-market output
  – Supplied free, or
  – At low prices, not economically significant

Major Problem
• Collective services: it is hard to identify the exact nature of the output
• Services supplied to individuals: it is hard to place a value on these services
• Convention neglects increases in productivity
• As productivity grows, the growth rate of government output is understated
• The overall growth rate of GDP is understated

Limits to Productivity Growth
• Many public services involve essential human input
• Labour is an end in itself
• Quality is judged in terms of amount of labour
• Computerisation may allow efficient allocation of care workers
• But there are limits to replacing labour

Post-1998 development in ONS measurement of government output
Function | % Govt. spending, 2000 | Date introduced | Main components
Health | 30.3 | Introduced 1998, method updated 2004 | Hospital cost-weighted activity index, family health services (number of GP consultations etc.)
Education | 17.1 | Introduced 1998, with data from 1986 | Pupil numbers; quality adjustment of 0.25 percent to primary and secondary schools
Administration of Social Security | 2.7 | Introduced 1998, with data from 1986 | Number of benefit claims for 12 largest benefits; no allowance for collection of contributions
Administration of Justice | 3.0 | Introduced in 2000, with full impact in 2001, data back to 1994 Q1 | Number of prisoners, legal aid cases etc.
Fire | 1.1 | Introduced 2001, with data from 1994 Q1 | Number of fires, fire prevention and special services
Personal Social Services | 7.4 | Introduced 2001, with data from 1994 Q1 | Children and adults in care and provision of home helps
Police | 5.8 | Experimental | Cleared-up crimes of different types

Sectors that follow conventional method
This applies to:
• Defence
• General Public Services
• Economic Services
• Environmental Protection
• Recreation and Culture
• Housing and Community Amenities

Implications for Growth of GDP
• GDP growth rate, 1995-2003
• Direct method: 2.5% per annum
• Input method: 3% per annum
• GDP growth in the US: 3.25%
• Difference accounts for nearly half the difference in GDP growth rate

Output and Input Volume Measures 1
Function | Local government | Central government
01 General Public Services | Output A | Output A
02 Defence | NA | Output A
03 Public Order and Safety: Police | Output volume measures are volumes of police activity, crime-related incidents, patrols, traffic incidents etc. Input A | Output volume measures are volumes of police activity, crime-related incidents, patrols, traffic incidents etc. Input A
03 Public Order and Safety: Prisons | NA | Output volume measures are measured directly using total numbers of prisoners. Input A
03 Public Order and Safety: Probation | NA | Output volume measures are measured directly using workload hours of various areas of competence. Input A
03 Public Order and Safety: Courts | Output volume measures for magistrates courts are measured directly using caseloads of courts weighted by average hours or average costs. Input A | Output volume measures for magistrates courts are measured directly using caseloads of courts weighted by average hours or average costs. Input A

Output and Input Volume Measures 2
Function | Local government | Central government
03 Public Order and Safety: Fire | Output volume measures for the fire service are measured directly using number of other services. Input A | NA
04 Economic Affairs | Output A | Input A
05 Environmental Protection | Output A | Output A
06 Housing and Community Amenities | Output A | Output A
07 Health | Output volume measures are measured directly using: a) treatment numbers and reference costs data from DH; b) in addition, further indicator series for dental and ophthalmic services. Input A | Output volume measures are measured directly using: a) treatment numbers and reference costs data from DH; b) in addition, further indicator series for dental and ophthalmic services. Input A
08 Recreation, Culture and Religion | Output A | Output A

Output and Input Volume Measures 3
Function | Local government | Central government
09 Education | Output volumes are measured directly using pupil numbers in pre-primary, primary and secondary schools obtained from DfES. Input A | Output volumes are measured directly using pupil numbers in pre-primary, primary and secondary schools obtained from DfES. Input A
10 Social Protection | Personal Social Services: output volume measures are measured directly using: a) numbers of adults in care and home help contact hours obtained from DH; b) numbers of children in care from DfES. Input A. Administration of social security: output volumes are measured directly using numbers of housing benefit cases. Input A | Social Security: output volume measures are measured directly for administration of social security using numbers of new benefit claims. Input A

Note: A = volume measures are deflated UK expenditure figures for pay, procurement of goods and services and capital consumption. NA = Not Applicable.

Conclusions
• UK has moved from input approach to direct measures of output
• Direct measures cover two-thirds of government final consumption
• Design of output measures needs care and investment of resources
• Also, continuous monitoring
• Institutional change poses problems for output measurement
• Effects of technological change may not be captured

Health: An Illustration
• Health is the largest government service
• 31 percent of government final consumption in 2003
• Health care services funded from general taxation
• The provision of health care services in the United Kingdom is a devolved responsibility
• It provides hospital and some community health services
• Services are free of charge at the point of delivery

Improved Methodology
• Uses information about volume and cost weights for 1,200 Healthcare Resource Groups
• 400 other activity groupings
• 200 categories of general practice prescribing
• Cost ranges from less than £10 to £45,000
• Improvements come from wider coverage, increased level of detail, better cost weights
• Categories used are more homogeneous

Methods of Output Measurement
• The UK health output measure used before June 2004
  – Reflected movements in 16 different activity series measuring health care.
  – A single series counting total inpatient and day cases accounted for about half the expenditure covered by the index
  – Outpatient and community health treatments, GP prescribing and dental treatments were measured separately
  – An aggregate index was formed by weighting the separate series

Improvements
• Come from
  – Wider coverage
  – Increased level of detail
  – Better cost weights
• Became possible
  – NHS developed robust costs for a standard list

Future Methods 1
• Recommendation 1: Extending the coverage of output volume indicators for each function
• Recommendation 2: Improving UK coverage
• Recommendation 3: Whole courses of treatment, technical change and substitution
  – Linked outpatient attendances, investigations etc.
  – Units are to be grouped by diagnosis
  – Treatments are to be adjusted for quality factors
• Recommendation 4: Measuring quality change
  – Saving lives and extending life spans; mitigating effects of disease
  – Speed of access to treatment
  – Patient experience

Future Methods 2
• Recommendation 5: Inputs and Deflators
  – More work needed to ensure health deflators meet quality criteria
  – More disaggregated approach to measure skill mix
• Recommendation 6: Triangulation and Productivity Measurement
  – Productivity measured by dividing national accounts output by inputs
  – Account should be taken of the changing skill mix of staff
  – Changing balance between grades of doctors
  – Migration of treatments from expensive to cheaper settings
• Recommendation 7: Satellite accounts

References
• Atkinson Review: Final Report (Measurement of Government Output and Productivity for the National Accounts), http://www.statistics.gov.uk/about/data/methodology/specific/publicSector/atkinson/final_report.asp
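The cost-weighted activity index described in the health illustration, and the productivity measure of Recommendation 6, can be sketched as follows. This is a minimal illustration only: the activity names, counts, unit costs and the input index are invented numbers, not ONS or NHS figures, and the fixed base-period cost weights are an assumption about the form of the index.

```python
def cost_weighted_activity_index(base_counts, current_counts, unit_costs):
    """Volume index (base = 100): activity counts weighted by base-period unit cost."""
    base_value = sum(unit_costs[a] * base_counts[a] for a in base_counts)
    current_value = sum(unit_costs[a] * current_counts[a] for a in base_counts)
    return 100.0 * current_value / base_value


def productivity_index(output_index, input_index):
    """Productivity as output volume divided by input volume (base = 100)."""
    return 100.0 * output_index / input_index


# Hypothetical activity groups and unit costs in pounds (illustrative only).
unit_costs = {"inpatient_case": 2000.0, "outpatient_visit": 100.0, "gp_consultation": 20.0}
base_counts = {"inpatient_case": 100, "outpatient_visit": 1000, "gp_consultation": 5000}
current_counts = {"inpatient_case": 110, "outpatient_visit": 1000, "gp_consultation": 5500}

output_index = cost_weighted_activity_index(base_counts, current_counts, unit_costs)
input_index = 105.0  # hypothetical deflated input volume index

print("Output volume index:", output_index)
print("Productivity index:", round(productivity_index(output_index, input_index), 2))
```

Because expensive activities carry larger weights, a 10% rise in inpatient cases moves the index more than a 10% rise in GP consultations, which is the point of using cost weights rather than a simple count of activities.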

Design of Experiments
Dr. Sharad Varde

Major Problem in Experiments
How to control the extraneous factors (nuisance variables) which play along with the cause factor under investigation in the process of influencing the effect factor

Solution #1: Matching Groups
• Example: Study the effect of a newly created herbal compound on weight reduction
• Group 1 of volunteers: Administer new treatment
• Group 2 of volunteers: No treatment
• Extraneous factor: Gender
• Solution: If we have 70 female & 30 male volunteers, place 35 women & 15 men in each group
• Thus, the effect of gender is uniformly distributed
• Change in weight is attributed only to new product
• Further, if we suspect age & affluence as other extraneous factors, we distribute the different age brackets & wealth brackets evenly across the two groups of volunteers
• Thus, control the two extraneous factors too
• Problem: Some more extraneous factors may exist. But we do not know them all.
• Safer solution: Randomization

Solution #2: Randomization
• Assign 100 volunteers randomly to 2 groups
• Thus, every volunteer has a known & equal chance of being assigned to any group
• Method: Throw 100 names in a basket
• Pick up one, it goes to Gr 1
• Pick up next name, it goes to Gr 2
• Pick up a third name, it goes to Gr 1 & so on
• Or, use standard table of random numbers

Since every person has equal chance of getting into any group:
• every known & unknown extraneous variable has equal chance of getting into any group
• and hence, all extraneous variables are distributed equally to the two groups
• So, Group 1 is comparable to Group 2
• Therefore, change in weight can be safely attributed only to the new product.
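The basket method for randomization can be sketched in a few lines of Python. This is an illustrative sketch, not part of the lecture: the volunteer labels, the seed, and the function name assign_randomly are assumptions made for the example. Shuffling the list plays the role of the basket, and alternate draws go to Group 1 and Group 2.

```python
import random


def assign_randomly(volunteers, seed=None):
    """Shuffle the names (the 'basket'), then deal them alternately to two groups."""
    rng = random.Random(seed)
    basket = list(volunteers)
    rng.shuffle(basket)
    group_1 = basket[0::2]  # 1st, 3rd, 5th ... name drawn
    group_2 = basket[1::2]  # 2nd, 4th, 6th ... name drawn
    return group_1, group_2


# Hypothetical volunteers: 70 women and 30 men, as in the matching example.
volunteers = [("F", i) for i in range(70)] + [("M", i) for i in range(30)]
group_1, group_2 = assign_randomly(volunteers, seed=2010)

for name, group in (("Group 1", group_1), ("Group 2", group_2)):
    women = sum(1 for gender, _ in group if gender == "F")
    print(name, "size:", len(group), "women:", women, "men:", len(group) - women)
```

Unlike matching, nothing here forces exactly 35 women into each group; randomization only makes the groups comparable on average, for gender and for every other extraneous variable, known or unknown.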
Benefits of Randomization
• Effective method to nullify confounding influence of known & unknown extraneous variables over the factor under study
• Thus it controls nuisance of known & unknown extraneous variables
• No need to enlist all extraneous variables
• We can safely generalize the conclusions

Generalizability of Experiments
• Lab experiments try to establish a cause-&-effect relationship firmly (beyond all doubts) in artificially contrived lab setting
• Field experiment then checks it in real life
• If it confirms the cause-&-effect relation, real life decisions based on this conclusion can be safely made
• It is the field experiment that confers generalizability

How to DESIGN Experiments?

Are Intelligent Indians Rich?
• To test whether higher intelligence improves income of adult Indians
• Income is Effect Factor Y
• Intelligence is Cause Factor X
• However, family background may also play a role
• But, it is not formally a part of this study
• Hence, family background is an Extraneous Factor

Terminology
• Effect Factor: A factor (dependent variable) that is to be explained or predicted by the experiment
• Cause Factor: A factor (independent variable) that is expected to influence the Effect Factor
• Extraneous Factors: Other factors (nuisance variables) that may influence the effect factor
• Control: Steps to reduce effect of extraneous factors

Alternative Terminology
In the context of the use of the term 'Experiment' in research studies:
• Cause factor is also termed as 'Treatment'
• Various values of cause factor are called 'Levels of Treatment'
• Extraneous factors, being cause factors themselves, too are called 'Treatment'

Controlling Extraneous Factors
• Ex.: To test whether athletes' performance in sports events improves with provision of sports coach
• Provide sports coach to one group of randomly chosen athletes
• Do NOT provide sports coach to another group
• Sports performance is Effect (dependent) factor
• Providing a sports coach is Cause factor
• Observe the performance of both groups

Terminology
• Experimental Group (EG): Experimental units exposed to experimental treatment. (Athletes provided with a sports coach)
• Control Group (CG): A comparable group of similar units that is not exposed to the experimental treatment. (Athletes who are not provided with a sports coach)

Randomness
• Randomness is the essence of an experiment to obtain reliability
• Controls known/unknown extraneous factors
• It equates Experimental Group with Control Group
• Appropriate DESIGN for the experiment ensures accuracy, generalizability & credibility of conclusions

Two Types of Experimental Designs
A. Elementary Designs: One cause factor
B. Advanced Designs: Many cause factors, i.e. several treatments

A. Elementary Designs
A.1: Randomized Two Group Design
A.2: Before & After Two Group Design
A.3: Solomon Four Group Design

A.1: Randomized Two Group Design
R   EG:  X      O1
    CG:  No X   O2
• Randomly assign experimental units to EG & CG
• No pre-test measurements are taken
• Expose EG to cause factor (treatment) X
• Note post-test values of effect factor for EG & CG
• Treatment Effect is O1 – O2 (i.e. Avg. O1 – Avg. O2)
• Applicable when test units are homogeneous

An Example
• To evaluate efficacy of a protein supplement product
• A random sample of 40 children is selected
• Assigned randomly to EG & CG (toss a coin & assign)
• Children in EG are asked to take product for one month
• Children in CG are not given any product (or a placebo)
• After experiment, health check up performed on both groups and findings are recorded
• Difference indicates effectiveness of the product

A.2: Before & After Two Group Design
R   EG:  O1  X      O2
    CG:  O3  No X   O4
• Describe the design.
• What is the treatment effect?

A.2: Before & After Two Group Design
• Randomly assign test units to EG & CG
• Note pre-test values of effect factor: O1 & O3
• Expose EG to cause factor (treatment) X
• Note post-test values of effect factor: O2, O4
• Treatment Effect is: (O2 – O1) – (O4 – O3)
• This design controls extraneous factors additionally due to pre-test/post-test method

Application in Community Health
• Select sample of villagers at random
• Talk to each one and note their attitude towards personal hygiene
• Randomly assign half of them to EG. Rest form CG.
• Only for EG, conduct the health education program
• Talk to EG and CG persons again and note their attitude towards personal hygiene
• Effectiveness measure is (O2 – O1) – (O4 – O3)

A.3: Solomon Four Group Design
R   EG1:  O1  X      O2
    CG1:  O3  No X   O4
    EG2:      X      O5
    CG2:      No X   O6
• Combination of design A.1 & design A.2
• Addresses all extraneous factors

A.3: Solomon Four Group Design
Several effectiveness measures:
• O2 – O1
• O2 – O4
• O5 – O6
• (O2 – O1) – (O4 – O3)
• (O5 – O1) – (O6 – O3) etc.
If all are significantly large, cause-&-effect relationship is firmly established.
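The treatment-effect formulas for designs A.1, A.2 and A.3 can be checked numerically. The sketch below uses invented attitude scores purely for illustration, and the function names are assumptions of this example. Each O is a list of measurements on one group, and every effect is computed on group averages, following the "Avg. O1 – Avg. O2" convention of design A.1.

```python
from statistics import mean


def effect_a1(o1, o2):
    """A.1 Randomized Two Group: post-test EG average minus post-test CG average."""
    return mean(o1) - mean(o2)


def effect_a2(o1, o2, o3, o4):
    """A.2 Before & After Two Group: change in EG minus change in CG."""
    return (mean(o2) - mean(o1)) - (mean(o4) - mean(o3))


def solomon_measures(o1, o2, o3, o4, o5, o6):
    """A.3 Solomon Four Group: the effectiveness measures listed on the slide."""
    m1, m2, m3, m4, m5, m6 = (mean(o) for o in (o1, o2, o3, o4, o5, o6))
    return {
        "O2 - O1": m2 - m1,
        "O2 - O4": m2 - m4,
        "O5 - O6": m5 - m6,
        "(O2 - O1) - (O4 - O3)": (m2 - m1) - (m4 - m3),
        "(O5 - O1) - (O6 - O3)": (m5 - m1) - (m6 - m3),
    }


# Hypothetical hygiene-attitude scores (higher is better).
o1 = [4.0, 5.0, 6.0]  # EG1 before the programme
o2 = [7.0, 8.0, 9.0]  # EG1 after the programme
o3 = [5.0, 5.0, 5.0]  # CG1 before
o4 = [6.0, 6.0, 6.0]  # CG1 after
o5 = [7.0, 8.0, 9.0]  # EG2 after (no pre-test)
o6 = [6.0, 6.0, 6.0]  # CG2 after (no pre-test)

print("A.2 effect:", effect_a2(o1, o2, o3, o4))
for name, value in solomon_measures(o1, o2, o3, o4, o5, o6).items():
    print(name, "=", value)
```

In this made-up data the raw change in the experimental group is 3 points, but the control group also improved by 1 point without any treatment, so the before-and-after design attributes only 2 points to the programme.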
Characteristics of Advanced (Statistical) Designs
• One Effect Factor (Dependent Variable): It MUST be Measurable, Cardinal, Metric
• One or more Cause Factors (Independent Variable): They MUST be Non-metric, Nominal, Ordinal, Categorical

The World of Numbers
• A trivial but important reality: All numbers are not of the same type
• All numbers cannot be subjected to identical treatment during their analysis
• Like in medical field, wrong treatment leads to disastrous consequences
• So, let us tour the world of numbers

Types of Numbers
• Nominal Numbers
• Ordinal Numbers
• Cardinal Numbers

Nominal Numbers
• Purpose: Identification of an object
• Example: House number (10 Janpath), your cellphone number, smart card PIN number, number on cricket T-shirt
• Property: Equivalence: Two different nominal numbers indicate two different objects
• They possess no quantitative properties

Ordinal Numbers
• Purpose: Represent position or ranking
• Example: India's ranking in world trade, exam grade (1, 2, 3, ...), floor number
• Properties: Equivalence & Order: Different ordinal numbers indicate different objects in some kind of relationship with each other
• No quantitative properties
• Nominal & ordinal numbers are also called Non-Metric or Categorical

Cardinal Numbers
• Purpose: Represent quantity
• Example: Sales turnover (Rs in crores), production in tons, your marks in exams, Earning Per Share (EPS)
• Properties: Equivalence, Order, Quantity

Cardinal Numbers
• They possess all mathematical properties: Order, Equivalence, Addition, Subtraction, Multiplication, Division, ...
• Cardinal numbers are truly quantitative
• They are also called 'METRIC'

Cardinal Numbers
You can comfortably and validly:
• Add them
• Subtract, multiply, divide them
• Take square roots, raise to a power, log
• Develop mathematical models
• Employ statistical techniques
• Analyze, interpret, and make decisions

Example
Zone | Code No. | Sales (Rs. in Crores) | Rank | S/W Version | Shoe Size | PIN Code | Shirt Size
Northern | 01 | 483 | 3 | 3.0 | 5 | 110001 | 38
Western | 02 | 738 | 1 | 4.2 | 6 | 307429 | 40
Eastern | 03 | 265 | 4 | 5.1 | 7 | 400004 | 42
Southern | 04 | 567 | 2 | 6.3 | 8 | 411002 | 44
Type | Nominal | Cardinal | Ordinal | Ordinal | Cardinal | Nominal | Ordinal

Quantitative Techniques
• Most of the quantitative techniques are meant ONLY for cardinal numbers
• Never use them on nominal/ordinal numbers
• A few methods, called Non-Parametric Techniques, are especially developed to analyze ordinal numbers like ranks
• Use them instead of wrongly using cardinal techniques in such situations

Handling Numbers
"When you master numbers, you will no longer be reading numbers, any more than you read words in a book. You will be, in fact, reading meanings."
- W. E. B. Du Bois, American sociologist, historian & educator
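The rule that each type of number admits only certain operations can be made concrete with three columns of the zone example. The sketch below is illustrative and not from the lecture: the choice of frequency counts for nominal data, the lower median for ordinal data, and the arithmetic mean for cardinal data is one common convention, and the function name summarise is hypothetical.

```python
from collections import Counter
from statistics import mean, median_low


def summarise(values, number_type):
    """Return a summary that is valid for the given type of number."""
    if number_type == "nominal":
        # Only equivalence is meaningful: count how often each label occurs.
        return dict(Counter(values))
    if number_type == "ordinal":
        # Order is meaningful, distances are not: use a positional measure.
        return median_low(values)
    if number_type == "cardinal":
        # Full arithmetic is meaningful.
        return mean(values)
    raise ValueError(f"unknown number type: {number_type}")


zone_codes = ["01", "02", "03", "04"]   # nominal
sales_rank = [3, 1, 4, 2]               # ordinal
sales_crores = [483, 738, 265, 567]     # cardinal

print("Code counts:", summarise(zone_codes, "nominal"))
print("Median rank:", summarise(sales_rank, "ordinal"))
print("Mean sales (Rs crores):", summarise(sales_crores, "cardinal"))
```

Averaging the zone codes would run without error in any language, which is exactly why the type of number has to be decided by the analyst and not left to the software.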

Price Indices in National Accounting
D. Narayana
Lecture 1, IGIDR
19 October 2010

Outline of the Lectures
• At the conceptual level
  – The issue of price indices in National Income Accounting
  – The problem of measuring services
• If time permits and you are interested
  – The issue of advance estimates of National Income and components
  – The Statistical Strengthening System Project

Defining GDP
• GDP combines in a single figure, and with no double counting, all the output (or production) carried out by all the firms, non-profit institutions, government bodies and households in a given country during a given period, regardless of the type of goods and services produced, provided that the production takes place within the country's economic territory.

GDP
• GDP = ∑ value added
• Of each firm, govt institution, producing household in a given country
• GDP = ∑ outputs − ∑ intermediate consumption
• GDP independent of pattern of organisation
• Avoids double counting

GDP and Other Aggregates
• Gross means inclusive of consumption of fixed capital (=> Net domestic product)
• Domestic vs National
• GDP: Output produced within the territory
• GNI: Total income of all economic agents residing within the territory
• Difference: earnings of workers living in one country and working elsewhere, interest paid on investments

Table 2. Reconciliation of GDP and GNI for Germany, Luxembourg and Ireland
Millions of euros, year 2003
Item | Germany | Luxembourg | Ireland
Gross domestic product | 2 128 200 | 23 956 | 134 786
+ primary income (including earnings) received from the rest of the world | +104 610 | +52 972 | +30 296
– primary income (including earnings) paid to the rest of the world | –118 630 | –55 722 | –52 139
= Gross national income | 2 114 180 | 21 206 | 112 943
Difference between GDP and GNI (%) | –0.7 | –11.5 | –16.2

Reconciling global output and demand
One of the Fundamental Equations of NIA
• GDP = Sum of final demand aggregates
• GDP + Imports = Household consumption + GCF + Exports
• GDP = Household consumption + GCF + Net Exports

Table 7. GDP: expenditure approach, Germany, 2004 (a)
Codes | Item | Million euros | % of GDP
GDP | Gross domestic product | 2 177 000 |
P3 | Total final consumption | 1 677 450 |
P31-S14 | HH final consumption expenditure | 1 225 870 | 56.3
P31-S15 | Final consumption of NPISHs | 44 900 | 2.1
P31-S13 | General government final consumption expenditure | 406 680 | 18.7
P5 | Gross capital formation | 385 480 |
P51 | Gross fixed capital formation | 378 550 | 17.4
P52 | Changes in inventories | 6 930 |
B11 | External balance of goods and services | 114 070 |
P6 | Exports | 834 820 | 38.3
P7 | Imports | 720 750 | 33.1
(a) This table shows the official SNA codes, which the reader can find on the website accompanying this book. These codes facilitate the understanding and manipulation of the data.

Reconciling global output and income
Fundamental equation
• Output (sum of the values added) = Income (employees' salaries + company profits) = Final demand (Consumption + GCF + Net exports)
• Three ways to measure GDP: (i) the output approach; (ii) the final demand approach; (iii) the income approach

Table 9. The three approaches to GDP, Germany, billion euros
Codes | Item | 1991 | 2004
GDP | Gross domestic product (output approach) | 1 502.2 | 2 177.0
B1B | Value added at base-year prices | 1 359.5 | 1 965.1
D21 | + Taxes net of subsidies on the products | 142.7 | 211.9
GDP | Gross domestic product (demand approach) | 1 502.2 | 2 177.0
P3 | Final consumption expenditure | 1 140.9 | 1 677.5
P5 | + Gross capital formation | 364.9 | 385.5
P6 | + Exports of goods and services | 395.2 | 834.8
P7 | – Imports of goods and services | 398.7 | 720.8
GDP | Gross domestic product (income approach) | 1 502.2 | 2 177.0
D1 | Compensation of employees | 844.0 | 1 133.1
B2 + B3 | + Gross operating surplus and gross mixed income | 515.1 | 811.9
D2 | + Taxes net of subsidies on production and imports | 143.1 | 232.1
These are the official SNA codes.

Growth of GDP
Average annual % GDP growth, 1980-2003, current prices
Netherlands | +4.6
Mexico | +37.1
Turkey | +62.3
• GDP at market prices
• Over 1980-2003 for Netherlands, Mexico, and Turkey
• Formidable growth of Turkey and Mexico compared with the Netherlands?
• The trap of inflation, or current prices
• Need to separate out growth in volume from changing prices

Table 1. GDP, volume and price indices
Average annual growth in percentage, 1980-2003
Country | Volume | Prices
Netherlands | +2.3 | +2.3
Mexico | +2.4 | +33.9
Turkey | +4.1 | +60.0
Source: OECD (2006), National Accounts of OECD Countries, Volume I, Main Aggregates, 1993-2004, 2006 Edition, OECD, Paris. StatLink: http://dx.doi.org/10.1787/508232480000

Table 2. GDP per capita, 2003
Country | In US dollars | Netherlands = 100
Netherlands | 31 602 | 100.0
Mexico | 6 091 | 19.3
Turkey | 3 385 | 10.7
Source: OECD (2006), National Accounts of OECD Countries, Volume I, Main Aggregates, 1993-2004, 2006 Edition, OECD, Paris.

Volume measure needed
• If only one product, problem solved easily
• Multitude of products: how to aggregate?
• Answer: in prices
• Problem with prices
• Prices change along with volumes
• How to separate them out?
• E.g. economy with two cars, small and large
• Numbers produced: $Q_s^{t}, Q_l^{t}$ in period $t$ and $Q_s^{t'}, Q_l^{t'}$ in period $t'$
• Prices at period $t$: $P_s, P_l$
• Volume of production: $Q_s^{t} P_s + Q_l^{t} P_l$ and $Q_s^{t'} P_s + Q_l^{t'} P_l$
• Constant price accounting

Quantities vs Volume
• Is quantity same as volume?
• Car economy
  – 80 small cars and 20 large in Y1
  – 50 small and 50 large in Y2
• Quantity 100 in both years; no change
• Is the volume same in the two years?
• Suppose $P_s = 1$, $P_l = 2$
• Volume 1 = 80 x 1 + 20 x 2 = 120
• Volume 2 = 50 x 1 + 50 x 2 = 150
• Volume increased by 25%
• Volume takes into account quality
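The car-economy arithmetic can be reproduced in a few lines. The sketch below uses only the numbers given on the slide; the dictionary layout and the function name volume are choices made for this illustration.

```python
def volume(quantities, prices):
    """Value of the quantities at a fixed set of prices (constant-price volume)."""
    return sum(prices[product] * quantities[product] for product in quantities)


prices = {"small": 1, "large": 2}        # fixed prices Ps and Pl
year_1 = {"small": 80, "large": 20}      # cars produced in Y1
year_2 = {"small": 50, "large": 50}      # cars produced in Y2

quantity_1 = sum(year_1.values())
quantity_2 = sum(year_2.values())
volume_1 = volume(year_1, prices)
volume_2 = volume(year_2, prices)
growth_pct = 100.0 * (volume_2 - volume_1) / volume_1

print("Quantity:", quantity_1, "->", quantity_2)
print("Volume:  ", volume_1, "->", volume_2)
print("Volume growth (%):", growth_pct)
```

The count of cars is unchanged, yet the volume rises because production has shifted towards the product that carries the higher price, which is how the volume measure picks up the change in composition.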
Volume Indices vs Price Indices
• Volume index = weighted average of changes in the quantities, weights being prices
• Price index = weighted average of changes in prices, weights being quantities
• Quantity ratio and price ratio: $q_t/q_0$ and $p_t/p_0$
• Most national accounts systems use
  – Laspeyres indices to calculate volumes
  – Paasche indices to calculate change in prices
• Weighted average of quantity or price ratios
• The period providing the weights: base period = reference period
• By convention, reference period = 100

Laspeyres Index
• $v_{ij} = p_{ij} q_{ij}$: value at current prices of product $i$ in period $j$
• Laspeyres volume index:
  $L_q = \frac{\sum_i v_{i0} \, (q_{it}/q_{i0})}{\sum_i v_{i0}}$   (1)
• Eq (1) can be rewritten as
  $L_q = \frac{\sum p_0 q_t}{\sum p_0 q_0}$   (2)
• Paasche index, harmonic mean of price ratios:
  $P_p = \frac{\sum v_t}{\sum v_t \, (p_0/p_t)} = \frac{\sum p_t q_t}{\sum p_0 q_t}$   (3)
• $L_q \times P_p = \frac{\sum p_0 q_t}{\sum p_0 q_0} \cdot \frac{\sum p_t q_t}{\sum p_0 q_t} = \frac{\sum p_t q_t}{\sum p_0 q_0} = \frac{\sum v_t}{\sum v_0}$   (4)
• Eq (4) is the fundamental equation introduced earlier
• This is generally used to arrive at the volume index:
  $L_q = \frac{\sum v_t / \sum v_0}{P_p}$   (5)
• Easier to get price indices ($P_p$) and GDP at current prices

Constant prices
• Laspeyres volume index series:
  $\frac{\sum p_0 q_0}{\sum p_0 q_0}, \; \frac{\sum p_0 q_1}{\sum p_0 q_0}, \; \frac{\sum p_0 q_2}{\sum p_0 q_0}, \ldots$   (6)
• Multiply by $\sum p_0 q_0$:
  $\sum p_0 q_0, \; \sum p_0 q_1, \; \ldots, \; \sum p_0 q_t$   (7)
• Constant price series, as price structure of the fixed period is used
• One great advantage: additive

Problems with constant prices
• Constant prices: choice of a fixed year
• Using price structures remote from the current structure
• The problem of computers and mobiles
• Indian case, 1999-00 prices
• Computers very expensive
  – 4 GB hard disk IBM cost me 70,000
  – Today, 300 GB costs 20,000
• Value at 1999-00 constant prices?
• Volume overstated?
• Price decrease understated?

Chained Accounts
• Three stages:
  1. Accounts calculated at prices of previous year
  2. Chain these changes
     • Multiply each one by subsequent one
     • Obtain series of growth rates
  3. Multiply by value of the accounts at the reference year price
• Advantage: price structure more relevant
• Called Laspeyres chains
• (Fisher chains: average of previous and current year prices)

Figure 1. Difference between constant 1980 prices and chained prices: France, computers and other materials (series: constant 1980 prices, chained prices; 1980 to 1998)

Difference: US
• Difference great as seen for France above
• Similar differences for US
• US GDP growth between 2001-03
• 4.3% at constant prices
• 2.7% at chained prices
• Difference largely from computers
• Whose production increased greatly

Consequences of chaining
• Chain linking adopted by US, OECD
• Advantage: more accurate volume growth rates
• Drawback
  – Loss of additivity: Eq 5 breaks down!
  – Accounting identities do not hold; rather, their growth rates cannot be decomposed
• Second fundamental equation, GDP = C + GCF + X − M, does not hold
• An additional residual term with no economic interpretation

Another major problem
• Volume vs Quantity
• Price used to combine diverse products
• Diverse quality within a product group
• Price differences reflect quality differences is the assumption
• What happens when quality improves but price falls?
• Is there a way out?

References
• Lequiller, F., and Blades, D. 2006. Understanding National Accounts. OECD, Paris, http://www.eastafritac.org/images/uploads/documents_storage/Understanding_National_Accounts_-_OECD.pdf
• CSO. National Accounts Statistics, Sources and Methods 2007, http://mospi.nic.in/rept%20_%20pubn/ftest.asp?rept_id=nad09_2007&type=NSSO (password may be required for this at mospi.nic.in site.)
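Equations (2) to (4) and the three stages of chaining can be tried on a toy economy. The sketch below is illustrative only: the two products, their prices and quantities are invented, and the function names are assumptions of this example. Computers fall in price while their output grows, which is the situation described under "Problems with constant prices".

```python
def laspeyres_volume(p0, q0, qt):
    """Eq (2): quantities of period t and period 0, both valued at period-0 prices."""
    return sum(p0[i] * qt[i] for i in p0) / sum(p0[i] * q0[i] for i in p0)


def paasche_price(p0, pt, qt):
    """Eq (3): period-t quantities valued at period-t prices over period-0 prices."""
    return sum(pt[i] * qt[i] for i in p0) / sum(p0[i] * qt[i] for i in p0)


def chained_volume(prices, quantities):
    """Stages 1 and 2: year-on-year Laspeyres links at previous-year prices, multiplied up."""
    index = 1.0
    for year in range(1, len(quantities)):
        index *= laspeyres_volume(prices[year - 1], quantities[year - 1], quantities[year])
    return index


# Hypothetical economy, three years: computers get cheaper, output doubles each year.
prices = [
    {"computers": 100.0, "bread": 1.0},
    {"computers": 50.0, "bread": 1.0},
    {"computers": 25.0, "bread": 1.0},
]
quantities = [
    {"computers": 1.0, "bread": 100.0},
    {"computers": 2.0, "bread": 100.0},
    {"computers": 4.0, "bread": 100.0},
]

constant_price_index = laspeyres_volume(prices[0], quantities[0], quantities[2])
chained_index = chained_volume(prices, quantities)
price_index = paasche_price(prices[0], prices[2], quantities[2])
value_ratio = (
    sum(prices[2][i] * quantities[2][i] for i in prices[2])
    / sum(prices[0][i] * quantities[0][i] for i in prices[0])
)

print("Volume index at constant year-0 prices:", constant_price_index)
print("Chained volume index:", chained_index)
print("Laspeyres volume x Paasche price:", constant_price_index * price_index)
print("Ratio of current-price values:", value_ratio)
```

With these made-up numbers the constant-price index exceeds the chained index, because the year-0 price structure keeps giving computers a weight they no longer have. The last two printed figures should agree, which is the identity in Eq (4). Stage 3 of chaining, multiplying by the reference-year value, is left out to keep the sketch short.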

Design of Experiments
Dr. Sharad Varde

Sharad Varde
• M. Sc. (Statistics); Ph. D. (Operations Research)
• Planning & Strategy Faculty in NIBM
• VP: Swedish Match (International Business)
• CEO / MD: Bhor Ind., Kamala Group, Cyber Agro
• Sectors: Banking, Engineering, Packaging, Textile, Info-System Security, e-Commerce, Food, Plastics
• External Academic Expert: Univ of Warwick, UK
• Co-Chairman, Food Prssg & Agri-Business: IMC

A New Real Life Problem
Is Bt Brinjal a potential health hazard, a soil quality destroyer, and an anti-farmer innovation?

Traditional Real Life Problem
Is the new formulation of fertilizer a decisively superior option for wheat growers in North India in terms of cost, quality and yield?

A Corporate Problem
Newly appointed VP (Finance) strongly believes that staffing his accounts dept only with CAs (instead of commerce & management graduates) will grossly improve its performance. On what basis can the Company accept or reject his idea?

A Public Utility Problem
• BEST wants to increase occupancy
• Should it drop fares by 5%, 7.5%, or 10%?
• On all days, only weekends, or weekdays?
• For AC buses, express, or ordinary buses?
• In city, west suburbs, or east suburbs?

A Union Budget Problem
• Would a 10% reduction in excise duty lead to more than 10% increase in demand & production?
• For which product categories?
• For large companies or SMEs or both?

A Public Health Problem
• Does pollution from pesticides increase human chest size?
• Do pesticides create extra oestrogen in the human body which then attempts to disturb hormonal development?
Real Life Problems
• These problems are too important to be tackled in a naïve manner.
• They need a systematic research study.

Research Process
• Identify broad area of research
• Gather preliminary data
• Define research problem
• Identify important factors (variables)
• Generate hypotheses
• PREPARE RESEARCH DESIGN
• Collect data, analyse & interpret
• Draw conclusions
• Write report
• Present report for research-based decision making.

Major Elements of Research Study: Scientific Research Design
A. Purpose of research study
B. Type of research investigation
C. Extent of researcher interference
D. Study setting.

A. Purpose of Research Study
1. Exploratory
2. Conclusive:
   a. Descriptive
   b. Causal

A.1: Exploratory Research
• Example: Manpower planning for a Danish Group's new plant in Ahmednagar. No prior knowledge of local work ethics. Important to find it first.
• When research area is virgin: no past info
• Extensive preliminary work needs to be done
• Objective: To better comprehend problem
• Lead to rigorous design for in-depth study
• Tools: informal discussions with people, in-depth interviews, focus groups, case studies, literature review, secondary data.

A.2: Descriptive Research
• Example: A bank wants to know profile of credit card payment defaulters
• Descriptive research is done when characteristics of the variables of interest are known, but they need to be profiled in detail for better understanding
• Objective: To understand magnitude of problem precisely: Find out WHO, WHAT, WHEN, WHERE
• It may reveal causal relationships
• Inputs: secondary and/or primary data.

A.3: Causal
• Example: A firm wants to find out whether doubling of its advertising budget would significantly increase sales & profit
• When nature & quantum of relationships among variables must be unearthed
• Inputs: specially collected massive data.

Purpose of Research Study
• Methodological rigour increases from exploratory to causal research
• Hence, more cost and time
• But, results are more reliable, precise and generalizable
• Resultant decisions are more realistic.

B. Type of Research Investigation
1. Correlational
2. Cause & Effect

B. Type of Research Investigation
• Correlational: Are teeth quality and blood sugar levels related? Is there a relationship between income and intelligence of adult Indians? Are TV viewing and insomnia related? Do men who buy denims also buy sunglasses?
• Conducted in natural environment
• Minimal interference by researcher.

B. Type of Research Investigation
• Cause & Effect: Do aerated soft drinks cause digestive disorders? Does corporate downsizing influence performance of the surviving staff?
• Researcher manipulates situation to study effects of changes in cause factors on the variable of interest.

C. Extent of Researcher Interference
1. Minimal
2. Moderate
3. Excessive

Researcher Interference: Example
• Minimal: To check whether stress on nurses and emotional support given by doctors to them are correlated
• Questionnaire to nurses on both factors
• No interference by researcher in normal functioning of hospital beyond administering questionnaire.

Researcher Interference: Example
• Moderate: To discover a 'cause-and-effect relationship' between support & stress
• Three groups of nurses chosen for study
• Group 1: Those who say they get full support from doctors throughout duty hours
• Group 2: Cursory support given by doctors
• Group 3: No support provided at all.

Researcher Interference: Example
• Excessive: To firmly establish direct 'cause-and-effect relationship'
• Three groups of sensitive nurses are chosen
• Only troublesome patients chosen for study
• Doctors asked to Step in to help (Gr1) / Give partial solace (Gr2) / Ignore (Gr3)
• After a week, administer questionnaire.

D. Study Setting
• Studies involving researcher influence of moderate or excessive nature are called Contrived Studies
• They are conducted in two formats:
  1. Field Experiments
  2. Lab Experiments.

What is an Experiment in Research?
• Research method in which we change values of cause factor to measure its exact influence over effect factor.
Contrived 'Field Experiments'
• Cause & effect relationship studies conducted in natural environment with moderate interference by researcher
• Example: Study effect of a newly created herbal compound on weight reduction
• Group 1 of volunteers: Administer new treatment
• Group 2 of volunteers: No treatment
• But, their lifestyles could also influence results.

Contrived 'Lab Experiments'
• Cause & effect relationship studies beyond possibility of least doubt
• Create artificial contrived environment
• Choose other subjects to respond to manipulated stimuli
• E.g.: Administer treatment to mice or pigs
• Control extraneous factors.

Process of Experiments
• Usually all Lab Experiments are conducted first so that real life setting is not disturbed
• Hypotheses are tested & conclusions drawn
• Then a Field Experiment is conducted to confirm (or reject) the tested hypotheses in real life setting with moderate interference
• Final conclusions are reached.

Methodology of an Experiment
• Study: Does reduction in air fare (X) improve occupancy (Y) in holiday resorts?
• Two factors: Cause Factor X, Effect Factor Y
• Specific changes are induced in X
• Resultant changes in Y are observed
• If every change in X causes change in Y, then we conclude that X is causal to Y.

Logic of Experiments
1. X will always occur before Y
2. Changes in X will cause changes in Y
3. To infer that X causes Y, other possible causes (extraneous factors) must NOT exist
• Security perception, economic sentiments too influence holiday plans. They are the extraneous factors.

Logic of Experiments
• So, all such extraneous (i.e. nuisance) factors must be held constant and
• Their effects neutralized by controlling the situation somehow
• In Lab Experiments it is controlled artificially
• In Field Experiments it is controlled with help of clever Designs of Experiments.

Designs of Experiments
• How to DESIGN Experiments?

How to Select Appropriate Sampling Design
Sampling
Dr. Sharad Varde

Choice Points in Sampling Design
• Question: Is representativeness of the sample and generalizability of the conclusions critical for research study?
• If yes, select a probability design.
• If not, we can select an appropriate non-probability sampling design.

For Non-probability Sampling Design
• If purpose of research is to get quick, even if only partially reliable, info, choose convenience sampling design
• If purpose is to extract info available with only a few of the elements, choose purposive sampling design.

For Non-probability Sampling Design
• If researcher needs to use personal judgment about who would be the best respondents to serve the purpose, it is judgment sampling design
• If researcher has to resort to asking respondents to suggest further interviewees, it is snowball or referral sampling design.

For Probability Sampling Design
• There are two methods of selecting elements of the population for inclusion in the sample:
  – A simple random sampling design
  – Or, systematic sampling design if population is serially ordered or elements emerge serially.

For Probability Sampling Design
• If population naturally consists of several mutually exclusive groups (strata) that are dissimilar to each other (i.e. they have homogeneity within each stratum & heterogeneity between strata) and if the purpose is to assess these differences, select stratified random sampling design.

In Stratified Random Sampling
• If all strata have nearly equal number of elements, choose proportionate stratified random sampling design
• If some strata are too large or some too small, choose disproportionate stratified random sampling design.

For Probability Sampling Design
• If population naturally consists of several groups (clusters) that are similar to each other (i.e. inter-group homogeneity and intra-group heterogeneity) and if cost & time budget is small, opt for cluster sampling design.

For Probability Sampling Design
• If we need preliminary info on some parameters of the population immediately followed by detailed info on some other parameters, and if cost & time budgets do not permit drawing a fresh sample for detailed second study, select double sampling design.

Important Issues in Sampling
1. Sampling Design: Precisely how to draw a sample from the population
2. Sample Size n: How many elements of the population to be selected to form a representative sample
• Both depend upon cost & time budget of the study, and desired reliability of conclusions (confidence & precision).

Sampling Error
• Whatever be sampling design, sample estimate will inevitably differ from actual parameter of population. This difference is called 'Sampling Error'
• Larger the sample, smaller the sampling error
• We must know the sampling error of our sampling design so as to understand reliability of estimates.

Two Important Concepts: Precision & Confidence

Concept of Precision
• It refers to how close our estimate of a population characteristic (say, average mileage before car battery fails) derived from a sample is to true population characteristic
• In practice, we rarely make 'point estimates' (such as 36 months). Usually, we declare a range (36 months ± 2 months, i.e. 34 – 38 months). It is called 'interval estimate'
• Narrower this interval, greater the precision.

Precision
• It is a function of the range of variability in the sampling distribution of the sample mean
• It is measured by the standard error $S = s/\sqrt{n}$, where s is standard deviation of sample and n is sample size
• Large sample size n means low std error S
• Low std error S means high precision.

Concept of Confidence
• It denotes how certain we are that the estimate of a population parameter derived from our sample is within the desired range (say, ±5%) of the true (but unknown) value of the population parameter

Confidence
• It is the probability (expressed in % form) that the sample estimate lies within the desired range of the population parameter.
• Widest range: 0 – ∞ gives 100% confidence
• Wider the range, higher the confidence
• Wider the range, lower the precision
• Higher confidence goes with lower precision
• We need a trade off between them.

A Numerical Example
• Study: Find food bill value per college girl
• Sample of 64 college girls at the gate of CCD
• Sample mean $\bar{x}$ = Rs. 105; std dev s = 10
• Confidence interval for pop mean: $\mu = \bar{x} \pm ZS$, where Z is 'z score' for standard normal distribution for the desired confidence.

Standard Normal Distribution
• For 90% confidence level, Z = 1.645
• For 95% confidence level, Z = 1.96
• For 99% confidence level, Z = 2.576

A Numerical Example
• So, standard error $S = s/\sqrt{n} = 10/\sqrt{64} = 1.25$
• For 90% confidence level, interval estimate is $\mu = 105 \pm 1.645\,(1.25)$, i.e. 102.944 – 107.056
• For 99% confidence level, interval estimate is $\mu = 105 \pm 2.576\,(1.25)$, i.e. 101.780 – 108.220
• Higher confidence goes with wider interval, i.e. with lower precision.

Trade Off
• Thus, we can use the formula $\mu = \bar{x} \pm ZS$
• To increase or decrease original confidence level and determine precision level, or
• To increase or decrease original precision level and determine confidence level
• This is the trade off between precision level and confidence level.
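The numerical example above (n = 64, mean Rs. 105, s = 10) can be reproduced with a short script. This is a minimal sketch: the function name and the Z-score lookup table are illustrative choices, with the Z values copied from the Standard Normal Distribution slide.

```python
import math

# Z scores quoted in the slides for the standard normal distribution.
Z_SCORES = {90: 1.645, 95: 1.96, 99: 2.576}


def confidence_interval(sample_mean, sample_sd, n, confidence_pct):
    """Interval estimate mean ± Z * S, where S = s / sqrt(n) is the standard error."""
    standard_error = sample_sd / math.sqrt(n)
    margin = Z_SCORES[confidence_pct] * standard_error
    return sample_mean - margin, sample_mean + margin


low90, high90 = confidence_interval(105, 10, 64, 90)
low99, high99 = confidence_interval(105, 10, 64, 99)
print(f"90%: {low90:.3f} - {high90:.3f}")
print(f"99%: {low99:.3f} - {high99:.3f}")
```

The 99% interval is wider than the 90% one, which is the precision versus confidence trade off in numbers.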
Formula for Sample Size 'n'
• Sample size $n = (Z\sigma / e)^2$
• where Z is z score for standard normal distribution for the desired confidence level, σ is population standard deviation, and e is tolerable margin of error (precision level)
• Thus, higher confidence level ≡ higher Z ≡ bigger sample size
• Higher precision level ≡ narrower margin of error ≡ bigger sample size.

An Example
• Jet Airways wants to be 95% confident of an estimate of average number of customers per weekday within a range of ± 500
• A recent sample study of average number of passengers per weekday showed a std dev of 3500
• So, Z = 1.96, σ = 3500, e = 500
• Sample size $n = (1.96 \times 3500 / 500)^2 \approx 188$.

Formula for Sample Size 'n'
• This formula $n = (Z\sigma / e)^2$ does not consider population size N
• But, often we do not know the pop size
• Or, population is too huge to enumerate
• So, this formula is used when N is unknown.

Formula for Sample Size 'n'
• But, if population size N is small and known, the corrected formula for n is:
  $n = N\sigma^2 / \{N(e/Z)^2 - (e/Z)^2 + \sigma^2\}$
• This is the required sample size that incorporates population size N (as N grows, it approaches $(Z\sigma / e)^2$)
• It is used when pop is not too large.

Real Life Problems
• The formula $n = (Z\sigma / e)^2$ depends on population standard deviation σ
• Problem: How do we find σ?
• Solution: Look for any study conducted in recent past on the same population to estimate σ or do an exploratory study based on a small sample.

Real Life Problems
• Problem: Difficult to get permission to conduct two studies. Also, accuracy of estimate of σ from small sample exploratory study is questionable
• Solution: Take σ as {range / 6}, i.e. (largest element minus smallest element) / 6
• Problem: How do we get largest and smallest elements of the population if it is not readily available in ordered format?

Solution in Real Life Situation
1. Derive sample size from cost & time budget
2. Carry out sample study
3. Ask for confidence level desired (We get Z)
4. Use formula $\mu = \bar{x} \pm ZS$ to compute the precision (interval half-width ZS)
5. If this precision is not at desired level, revise confidence level (We get a fresh Z)
6. Compute precision level afresh
7. Carry on iterations till both are satisfactory.

Solution in Real Life Situation
• Or, ask for precision level desired
• Use formula to compute confidence Z
• If confidence is not at desired level, revise precision level
• Compute confidence level afresh
• Carry on iterations till both are satisfactory.

Stratified Random Sampling
• The formula $n = (Z\sigma / e)^2$ is for Simple Random Sampling & Systematic Sampling
• For Stratified Random Sampling it is: $n = (Z/e)^2 \sum_i W_i \sigma_i^2$
• where $\sigma_i$ is standard deviation of ith stratum of population (i = 1, 2, . . . , k) that consists of k strata & $W_i$ is weight attached to ith stratum

Formula for Sample Size 'n'
• In Stratified Proportionate Random Sampling (SPRS) the weights are $W_i = N_i / N$, where $N_i$ denotes the number of elements in the ith stratum of the population of total size N
• Sample of size n is then divided into k samples of size $n_1, n_2, \dots, n_k$ as follows: $n_i = n W_i$ (i = 1, 2, . . . , k).
• Note: In SPRS, $\sigma_i$'s vary significantly, but $N_i$'s are not too drastically different

Formula for Sample Size 'n'
• In Stratified Disproportionate Random Sampling (SDRS): $n = (Z/e)^2 \{\sum_i W_i \sigma_i\}^2$
• And samples for individual strata are: $n_i = n \{N_i \sigma_i / \sum_i N_i \sigma_i\}$ (i = 1, 2, . . . , k)
• In SDRS, both $\sigma_i$'s and $N_i$'s vary significantly.

Formula for Sample Size 'n'
• In Cluster Sampling and Double Sampling, it is obvious that the formula for Simple Random Sampling applies.

End of Sampling
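The sample-size formulas from this lecture can be checked numerically with the Jet Airways figures (Z = 1.96, σ = 3500, e = 500). This is a minimal sketch with the following assumptions: the function names are illustrative, the population size N = 1000 is a hypothetical value not given in the slides, and the finite-population version uses the standard form with numerator Nσ², which is the form that tends to (Zσ/e)² as N grows.

```python
import math


def sample_size(z, sigma, e):
    """n = (Z * sigma / e)^2, used when population size N is unknown or huge."""
    return (z * sigma / e) ** 2


def sample_size_finite(z, sigma, e, population_size):
    """n = N*sigma^2 / {N*(e/Z)^2 - (e/Z)^2 + sigma^2} for a small, known N."""
    d = (e / z) ** 2
    return population_size * sigma ** 2 / (population_size * d - d + sigma ** 2)


n_unknown_N = sample_size(1.96, 3500, 500)
n_small_N = sample_size_finite(1.96, 3500, 500, population_size=1000)  # N is hypothetical
n_huge_N = sample_size_finite(1.96, 3500, 500, population_size=10 ** 9)

print(f"N unknown      : n = {n_unknown_N:.2f} (round up to {math.ceil(n_unknown_N)})")
print(f"N = 1000       : n = {n_small_N:.2f}")
print(f"N = 1e9        : n = {n_huge_N:.2f}")
```

The unrounded value is about 188.24, which the slide reports as 188; rounding up to 189 is the more cautious convention because it guarantees the stated margin of error.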

Multivariate Analysis
Dr. Sharad Varde

Multivariate Statistical Analysis
• Statistical Methods for Simultaneous Investigation of Several Variables

Research Studies
• Several variables are to be studied
• Data are obtained on them from sample
• These variables may / may not be mutually independent of each other
• Some may hold strong correlation with some other variables
• Multi-collinearity may exist among variables
• Data analysis methods in this situation are called 'Inter-dependence Methods'.

Major Inter-dependence Methods
• Factor Analysis to reduce several correlated variables into a few uncorrelated meaningful factors
• Cluster Analysis to classify individual elements of the population into a few homogeneous groups.

Research Studies
• Several variables are to be studied
• Purpose is to establish a cause-and-effect relationship
• One dependent (effect) variable and several independent (cause) variables
• Data are obtained on them from sample
• Data analysis methods in such situations are called 'Dependence Methods'.

Major Dependence Methods
• If the dependent variable (effect factor) is Metric and independent variables (cause factors) are non-metric (i.e. categorical), use Design of Experiments to structure the research study and use Analysis of Variance to analyze the data.

Major Dependence Methods
• If the dependent variable (effect factor) is non-Metric (Categorical) and the independent variables (cause factors) are metric, use Multiple Discriminant Analysis.

Major Dependence Methods
• If the dependent variable (effect factor) is Metric and the independent variables (cause factors) are also metric, use Multiple Regression Analysis.
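The three selection rules for dependence methods stated above can be written as a small lookup. This is only a sketch of those three rules: the function name and its boolean parameters are hypothetical, and combinations the rules do not mention are deliberately left unhandled rather than guessed.

```python
def choose_dependence_method(dependent_is_metric: bool, independents_are_metric: bool) -> str:
    """Map the measurement level of the variables to the method named in the slides."""
    if dependent_is_metric and not independents_are_metric:
        return "Design of Experiments + Analysis of Variance"
    if dependent_is_metric and independents_are_metric:
        return "Multiple Regression Analysis"
    if not dependent_is_metric and independents_are_metric:
        return "Multiple Discriminant Analysis"
    # Categorical dependent with categorical independents is not covered by the three rules.
    raise ValueError("combination not covered by the three rules above")


# Example: sales (metric) explained by advertising spend and price (both metric).
print(choose_dependence_method(dependent_is_metric=True, independents_are_metric=True))
```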
Major Dependence Methods
• Dependent Variable Metric
  – Independent Variables Categorical: ANOVA
  – Independent Variables Metric: Multiple Regression
• Dependent Variable Categorical
  – Independent Variables Categorical: Canonical Correlation
  – Independent Variables Metric: Multiple Discriminant

Major Dependence Methods

                                      Analysis of Variance   REGRESSION   DISCRIMINANT ANALYSIS
  Similarities
    Number of dependent variables     One                    One          One
    Number of independent variables   Many                   Many         Many
  Differences
    Nature of the dependent variables     Metric             Metric       Categorical
    Nature of the independent variables   Categorical        Metric       Metric

Multivariate Analysis Methods
• We will now study major Multivariate methods:
  1. Factor Analysis
  2. Cluster Analysis
  3. Multivariate Discriminant Analysis
  4. Multivariate Regression Analysis.

Factor Analysis

Factor Analysis
• It examines entire set of inter-dependent relationships without making any distinction between dependent and independent variables
• It reduces the total number of variables in the research study to a smaller number of factors by combining a few correlated variables into a factor.

What is a Factor
• A factor is a linear combination of the observed original variables $V_1, V_2, \dots, V_n$:
  $F_i = W_{i1}V_1 + W_{i2}V_2 + W_{i3}V_3 + \dots + W_{in}V_n$
• where
  $F_i$ = the ith factor (i = 1, 2, ..., m; m ≤ n)
  $W_{ij}$ = weight (factor score coefficient) of variable j in factor i
  n = number of original variables
  m = number of factors.

Factor Analysis
• Discovers a smaller set of uncorrelated factors (m) to represent the original set of correlated variables (n) significantly (m ≤ n)
• These factors do not have multi-collinearity, i.e. they are orthogonal to each other
• They can then be used in further multivariate analysis (regression or discriminant analysis).

Case Study # 1
• Evaluate credit card usage & behavior of customers
• Initial set of variables is large: Age, Gender, Marital Status, Income, Education, Employment Status, Credit History, Family Background: Total 8 variables
  $F_i = W_{i1}V_1 + W_{i2}V_2 + W_{i3}V_3 + \dots + W_{i8}V_8$

Case Study # 1
• Reduction of 8 variables into 3 factors (m = 3):
  1. Factor 1: Heavy weightage for age, gender, & marital status and low weightages to other variables
  2. Factor 2: Heavy weightage for income, education, employment status & low weightages to others
  3. Factor 3: Heavy weightage for credit history & family background and low weightages to other variables.

Case Study # 1
• These 3 un-correlated factors can be identified by common characteristics of 'variables with heavy weightages' & named accordingly as follows:
  1. Factor 1: (age, gender, marital status) as Demographic Status
  2. Factor 2: (income, education, employment status) as Socio-economic Status
  3. Factor 3: (credit history & family background) as Background Status.

Case Study # 2
• Evaluate customer motivation for buying a two wheeler
• Initial set of variables is large:
  1. Affordable
  2. Sense of freedom
  3. Economical
  4. Man's vehicle
  5. Feel powerful
  6. Friends jealous
  7. Feel good to see ad of this brand
  8. Comfortable ride
  9. Safe travel
  10. Ride for three.

Case Study # 2
• Reduction of 10 variables to 3 factors:
  – Pride: (man's vehicle, feel powerful, sense of freedom, friends jealous, feel good to see ad of this brand)
  – Utility: (economical, comfortable ride, safe travel)
  – Economy: (affordable, ride for three to be allowed)
• We will now see how to carry out factor analysis.

[Figure: Standard Normal Distribution]

Standardize the Data
• Enlist all variables that can be important in resolving the research problem
• Collect metric data on each variable from all subjects sampled
• Convert all data on each variable into standard format (Mean: 0 & Std. Dev.: 1) since different variables may have different units of measurement
• SPSS / SAS etc. do it automatically.
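The standardization step (mean 0, standard deviation 1) that SPSS or SAS performs automatically can be sketched by hand. This is a minimal illustration under two assumptions: the income figures are hypothetical, and the sample standard deviation (divisor n − 1) is used, which is a convention choice not specified in the slides.

```python
import statistics


def standardize(values):
    """Convert raw values to z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1), an assumption
    return [(v - mean) / sd for v in values]


# Hypothetical monthly incomes in rupees (illustrative only).
incomes = [18000, 25000, 32000, 41000, 60000]
z_scores = standardize(incomes)
print([round(z, 3) for z in z_scores])
```

After this step a variable measured in rupees and one measured in years are on the same unit-free scale, so neither dominates the factor solution merely because of its units.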
Two Steps in Factor Analysis
• Factor Extraction
• Factor Rotation

What Factor Extraction does
(a) It determines the minimum number of factors that can comfortably represent all variables in the research study
    Obviously, maximum number of factors equals the total number of variables
(b) It converts correlated variables into the desired number of un-correlated factors
• Tool: Principal Component Method (PCM).

Principal Component Method
• SPSS gives inter-variable correlations
• PCM assists checking appropriateness of factor analysis (Bartlett's test)
• Assists checking adequacy of sample size (KMO test)
• Gives initial eigen values
• They determine the minimum number of factors that can represent all variables.

Case Study # 3
• To determine the benefits consumers seek from purchase of a toothpaste
• Sample of 30 persons was interviewed
• Respondents were asked to indicate their degree of agreement with the following statements using a 7 point scale: (1 = Strongly agree, 7 = Strongly disagree)

Six Important Variables
• V1: Buy a toothpaste that prevents cavities
• V2: Like a toothpaste that gives shiny teeth
• V3: Toothpaste should strengthen your gums
• V4: Prefer toothpaste that freshens breath
• V5: Prevention of tooth decay is not an important benefit
• V6: Most important concern is attractive teeth
• Data obtained are given in the next slide.

Original Data: 30 persons, 6 variables
  RESPONDENT
  NUMBER        V1     V2     V3     V4     V5     V6
   1           7.00   3.00   6.00   4.00   2.00   4.00
   2           1.00   3.00   2.00   4.00   5.00   4.00
   3           6.00   2.00   7.00   4.00   1.00   3.00
   4           4.00   5.00   4.00   6.00   2.00   5.00
   5           1.00   2.00   2.00   3.00   6.00   2.00
   6           6.00   3.00   6.00   4.00   2.00   4.00
   7           5.00   3.00   6.00   3.00   4.00   3.00
   8           6.00   4.00   7.00   4.00   1.00   4.00
   9           3.00   4.00   2.00   3.00   6.00   3.00
  10           2.00   6.00   2.00   6.00   7.00   6.00
  11           6.00   4.00   7.00   3.00   2.00   3.00
  12           2.00   3.00   1.00   4.00   5.00   4.00
  13           7.00   2.00   6.00   4.00   1.00   3.00
  14           4.00   6.00   4.00   5.00   3.00   6.00
  15           1.00   3.00   2.00   2.00   6.00   4.00
  16           6.00   4.00   6.00   3.00   3.00   4.00
  17           5.00   3.00   6.00   3.00   3.00   4.00
  18           7.00   3.00   7.00   4.00   1.00   4.00
  19           2.00   4.00   3.00   3.00   6.00   3.00
  20           3.00   5.00   3.00   6.00   4.00   6.00
  21           1.00   3.00   2.00   3.00   5.00   3.00
  22           5.00   4.00   5.00   4.00   2.00   4.00
  23           2.00   2.00   1.00   5.00   4.00   4.00
  24           4.00   6.00   4.00   6.00   4.00   7.00
  25           6.00   5.00   4.00   2.00   1.00   4.00
  26           3.00   5.00   4.00   6.00   4.00   7.00
  27           4.00   4.00   7.00   2.00   2.00   5.00
  28           3.00   7.00   2.00   6.00   4.00   3.00
  29           4.00   6.00   3.00   7.00   2.00   7.00
  30           2.00   3.00   2.00   4.00   7.00   2.00

Inter-variable Correlations: Correlation Matrix from SPSS

  Variables     V1       V2       V3       V4       V5       V6
  V1           1.000
  V2          -0.053    1.000
  V3           0.873   -0.155    1.000
  V4          -0.086    0.572   -0.248    1.000
  V5          -0.858    0.020   -0.778   -0.007    1.000
  V6           0.004    0.640   -0.018    0.640   -0.136    1.000

(The V1–V2 entry is −0.053 when computed from the 30 responses above; the slide shows −0.530, an apparent digit transposition.)

Bartlett's Test
• For valid factor analysis, many variables must be correlated with each other
• That means, if each original variable is completely independent of each of the remaining n−1 variables (i.e. zero correlation among all variables), there is no need to perform factor analysis
• H0: Correlation matrix is unit matrix.

H0: Correlation matrix is Unit Matrix

         V1    V2    V3    ...   Vn
  V1      1     0     0    ...    0
  V2      0     1     0    ...    0
  V3      0     0     1    ...    0
  ...    ...   ...   ...   ...   ...
  Vn      0     0     0    ...    1

Bartlett's Test
• For valid factor analysis, many variables must be correlated with each other
• H0: Correlation matrix is unit matrix
• Here, SPSS gives p level < 0.05
• Reject H0 with 95% level of confidence
• So, correlation matrix is not unit matrix
• Conclusion: Factor analysis can be validly done.

KMO Test
• SPSS gives Kaiser-Meyer-Olkin measure of sampling adequacy; in this case it is 0.660
• Values of KMO between 0.5 and 1.0 suggest that sample is adequate for carrying out factor analysis. Otherwise, we must draw additional sample.
• Here, 0.660 > 0.5
• Conclusion: Sample is adequate
• Thus, these two tests together confirm appropriateness of factor analysis.

Initial Eigen Values

  Factor   Eigen value   % of variance   Cumulat. %
  1           2.731         45.520         45.520
  2           2.218         36.969         82.488
  3           0.442          7.360         89.848
  4           0.341          5.688         95.536
  5           0.183          3.044         98.580
  6           0.085          1.420        100.000

Eigen Value
• Variance of each standardized variable is 1
• Total variance in study = Number of variables (here 6)
• Variance explained by a factor is called Eigen Value of that factor
• It depends on (a) weights for different variables and (b) correlations between the factor & each variable (called Factor Loadings)
• Higher the eigen value of the factor, bigger is the amount of variance explained by the factor.

Principal Component Method
• Each original variable has Eigen value = 1 due to standardization
  $F_i = W_{i1}V_1 + W_{i2}V_2 + W_{i3}V_3 + \dots + W_{i6}V_6$
• So, factors with eigen value < 1 are no better than a single variable
• Only factors with eigen value ≥ 1 are retained
• Principal Component Method determines the least number of factors to explain maximum variance.
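The inter-variable correlations for Case Study # 3 can be recomputed directly from the 30 responses, which is a useful check on the correlation matrix. This is a minimal sketch in plain Python (no SPSS): the `pearson` helper is an illustrative implementation of the usual product-moment formula, and the data are transcribed column by column from the data slide.

```python
import math

# Responses of the 30 persons to statements V1..V6 (7-point scale).
data = {
    "V1": [7, 1, 6, 4, 1, 6, 5, 6, 3, 2, 6, 2, 7, 4, 1, 6, 5, 7, 2, 3, 1, 5, 2, 4, 6, 3, 4, 3, 4, 2],
    "V2": [3, 3, 2, 5, 2, 3, 3, 4, 4, 6, 4, 3, 2, 6, 3, 4, 3, 3, 4, 5, 3, 4, 2, 6, 5, 5, 4, 7, 6, 3],
    "V3": [6, 2, 7, 4, 2, 6, 6, 7, 2, 2, 7, 1, 6, 4, 2, 6, 6, 7, 3, 3, 2, 5, 1, 4, 4, 4, 7, 2, 3, 2],
    "V4": [4, 4, 4, 6, 3, 4, 3, 4, 3, 6, 3, 4, 4, 5, 2, 3, 3, 4, 3, 6, 3, 4, 5, 6, 2, 6, 2, 6, 7, 4],
    "V5": [2, 5, 1, 2, 6, 2, 4, 1, 6, 7, 2, 5, 1, 3, 6, 3, 3, 1, 6, 4, 5, 2, 4, 4, 1, 4, 2, 4, 2, 7],
    "V6": [4, 4, 3, 5, 2, 4, 3, 4, 3, 6, 3, 4, 3, 6, 4, 4, 4, 4, 3, 6, 3, 4, 4, 7, 4, 7, 5, 3, 7, 2],
}


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


names = list(data)
corr = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

# Print the full matrix, one row per variable.
for a in names:
    print(a, " ".join(f"{corr[a][b]:7.3f}" for b in names))
```

Computed this way, the correlation between V1 and V2 comes out at about −0.053.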
PCM is a Sequential Process
• Selects weights (i.e. factor score coefficients) in such a manner that the first factor explains the largest portion of the total variance
  $F_1 = W_{11}V_1 + W_{12}V_2 + W_{13}V_3 + \dots + W_{1n}V_n$
• Then selects a second set of weights for
  $F_2 = W_{21}V_1 + W_{22}V_2 + W_{23}V_3 + \dots + W_{2n}V_n$
  so that the second factor accounts for most of the residual variance, subject to being uncorrelated with first factor
• Process goes on till cumulative variance explained crosses a desired level, usually 60%.

Case Study # 3: Initial Eigen Values

  Factor   Eigen value   % of variance   Cumulat. %
  1           2.731         45.520         45.520
  2           2.218         36.969         82.488
  3           0.442          7.360         89.848
  4           0.341          5.688         95.536
  5           0.183          3.044         98.580
  6           0.085          1.420        100.000

Two Factors Explain > 60% Variation

  Factor   Eigen Value   % of Variance   Cumulative %
  1           2.731         45.520          45.520
  2           2.218         36.969          82.488

• Conclusion: Number of factors required to explain > 60% variation is 2.

Factor Loadings: Correlation Between Each Factor & Each Variable

  Factor Matrix
  Variables   Factor 1   Factor 2
  V1            0.928      0.253
  V2           -0.301      0.795
  V3            0.936      0.131
  V4           -0.342      0.789
  V5           -0.869     -0.351
  V6           -0.177      0.871

Factor Rotation
• Initial factor matrix rarely results in factors that can be easily interpreted
• Therefore, through a process of rotation, the initial factor matrix is transformed into a simpler matrix that is easier to interpret
• It helps identify which factors are strongly associated with which original variables.

Factor Rotation
• In rotating the factors, we would like each factor to have significant loadings or coefficients for some of the variables.
• The process of rotation is called orthogonal rotation if the axes are maintained at right angles
• Let us see how it is done.

Illustration of Rotation of Axes
• Let us take a simpler illustration
• Suppose factor loadings of 2 variables on 2 factors:

         Factor 1   Factor 2
  V1       0.6        0.7
  V2       0.5       -0.5

• Variation explained by V1 = $(0.6)^2 + (0.7)^2 = 0.85$
• Variation explained by V2 = $(0.5)^2 + (-0.5)^2 = 0.50$
• None of the loadings is too large or too small to reach any meaningful conclusion
• Let us rotate the two axes & see what happens.

[Figure: Graph of Original Loadings; V1 and V2 plotted against Factor 1 and Factor 2 axes, each from −1 to +1]
[Figure: Graph of Rotated Axes (clockwise)]
[Figure: Graph of Rotated Axes]

Factor Loadings After Rotation
• Factor loadings of 2 variables on 2 factors:

         Factor 1   Factor 2
  V1      -0.2        0.9
  V2       0.7        0.1

• Variation explained by V1 = $(-0.2)^2 + (0.9)^2 = 0.85$
• Variation explained by V2 = $(0.7)^2 + (0.1)^2 = 0.50$
• Note that variation explained remains unchanged
• Now each variable has one large and one small loading
• So, we can reach meaningful conclusion.

Case Study # 3: Factor Loadings after Rotation
Original Factor Loadings (Factor Matrix) and Rotated Factor Matrix: Correlation Between Each Factor & Each Variable

              Factor Matrix            Rotated Factor Matrix
  Variables   Factor 1   Factor 2      Factor 1   Factor 2
  V1            0.928      0.253         0.962     -0.027
  V2           -0.301      0.795        -0.057      0.848
  V3            0.936      0.131         0.934     -0.146
  V4           -0.342      0.789        -0.098      0.845
  V5           -0.869     -0.351        -0.933     -0.084
  V6           -0.177      0.871         0.083      0.885

Weightages to Variables for Each Factor from SPSS

  Factor Score Coefficient Matrix
  Variables   Factor 1   Factor 2
  V1            0.358      0.011
  V2           -0.001      0.375
  V3            0.345     -0.043
  V4           -0.017      0.377
  V5           -0.350     -0.059
  V6            0.052      0.395

Factors: (6 Variables into 2 factors)
  $F_i = W_{i1}V_1 + W_{i2}V_2 + W_{i3}V_3 + \dots + W_{i6}V_6$
• In Case Study # 3:
  $F_1 = 0.358V_1 - 0.001V_2 + 0.345V_3 - 0.017V_4 - 0.350V_5 + 0.052V_6$
  $F_2 = 0.011V_1 + 0.375V_2 - 0.043V_3 + 0.377V_4 - 0.059V_5 + 0.395V_6$

Interpretation of Factors
• A factor can then be interpreted in terms of the variables that load high on it from rotated factor matrix
• FACTOR 1 has high coefficients for:
  – V1: Buy a toothpaste that prevents cavities
  – V3: Toothpaste should strengthen your gums
  – V5: Prevention of tooth decay is not an important benefit (Note: Coefficient is negative)
• FACTOR 1 may be labelled as Health Factor.

Interpretation of Factors
  $F_2 = 0.011V_1 + 0.375V_2 - 0.043V_3 + 0.377V_4 - 0.059V_5 + 0.395V_6$
• FACTOR 2 has high coefficients on:
  – V2: Like a toothpaste that gives shiny teeth
  – V4: Prefer toothpaste that freshens breath
  – V6: Most important concern is attractive teeth
• FACTOR 2 may be labelled as Aesthetic Factor
• The factors are jointly called principal components.

Conclusion
• From the data gathered from 30 respondents on 6 basic variables, the most important benefits consumers seek from purchase of a toothpaste are HEALTH and AESTHETICS
• Health accounts for 45.5% of the total variance
• Aesthetics accounts for 37.0% of the total variance.

Selecting a Surrogate Variable
• Sometimes, we are not willing to discover new factors but we want to stick to original variables and want to know which ones are important
• By examining the factor matrix, we could select for each factor just one variable with the highest loading for that factor, if possible
• That variable could then be used as a surrogate variable for the associated factor
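The two-variable rotation illustration above can be reproduced numerically. The slides do not state the rotation angle; the sketch below assumes the angle with cos θ = 0.6 and sin θ = 0.8 (about 53.13°), which is the rotation that maps the original loadings onto the rotated loadings shown. The function names are illustrative.

```python
import math

# Original loadings of V1 and V2 on (Factor 1, Factor 2), from the illustration.
loadings = {"V1": (0.6, 0.7), "V2": (0.5, -0.5)}

# Assumed rotation: cos = 0.6, sin = 0.8 (angle inferred from the numbers, not stated).
COS_T, SIN_T = 0.6, 0.8


def rotate(point):
    """Coordinates of a point after turning the axes clockwise by the assumed angle."""
    x, y = point
    return (x * COS_T - y * SIN_T, x * SIN_T + y * COS_T)


def variation_explained(point):
    """Sum of squared loadings of one variable across the factors."""
    return sum(coordinate ** 2 for coordinate in point)


rotated = {name: rotate(point) for name, point in loadings.items()}

print(f"assumed angle: {math.degrees(math.atan2(SIN_T, COS_T)):.2f} degrees")
for name in loadings:
    before, after = loadings[name], rotated[name]
    print(name, "before", before, "after", tuple(round(c, 3) for c in after),
          "variation", round(variation_explained(before), 2),
          "->", round(variation_explained(after), 2))
```

Because an orthogonal rotation preserves lengths, the variation explained by each variable stays at 0.85 and 0.50; only the split between the two factors changes.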
Sharad Varde 55 Sharad Varde 56 Factor Loadings After Rotation Selecting Surrogate Variables Rotated Factor Matrix Variables V1 V2 V3 V4 V5 V6 Factor 1 0.962 -0.057 0.934 -0.098 -0.933 0.083 ●V1 has highest loading on F1 Factor 2 -0.027 0.848 -0.146 0.845 -0.084 0.885 ●So, V1 is surrogate variable for F1 ●Similarly V6 could be surrogate for F2 ●So, we concentrate on only 2 variables: V1 (Preventing Cavities) & V6 (Attracive teeth). Sharad Varde End of Factor Analysis 57 … Sharad Varde 58
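The two-variable rotation illustration above can be checked numerically. The following is a minimal sketch, not the method used by SPSS: the rotation with cos = 0.6 and sin = 0.8 is an assumption inferred from the slide numbers (the slides do not state the angle), and the function names are hypothetical.

```python
def rotate_loadings(loadings, cos_t, sin_t):
    """Apply one orthogonal rotation to every (Factor 1, Factor 2) loading pair.

    Rotating the coordinates this way is equivalent to turning the two
    factor axes clockwise by the same angle.
    """
    return [(x * cos_t - y * sin_t, x * sin_t + y * cos_t) for x, y in loadings]


def communality(pair):
    """Variation in a variable explained by the factors: sum of squared loadings."""
    return sum(v * v for v in pair)


# Loadings from the illustration, as (Factor 1, Factor 2) for V1 and V2.
original = [(0.6, 0.7), (0.5, -0.5)]

# Assumed rotation (about 53 degrees); inferred, not stated in the slides.
rotated = rotate_loadings(original, cos_t=0.6, sin_t=0.8)

for name, before, after in zip(("V1", "V2"), original, rotated):
    print(name,
          "rotated loadings:", [round(v, 2) for v in after],
          "explained before:", round(communality(before), 2),
          "explained after:", round(communality(after), 2))
```

Because the rotation is orthogonal, the sum of squared loadings for each variable is preserved, which is why the variation explained stays at 0.85 and 0.50.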

B. Advanced Designs

Design of Experiments
Dr. Sharad Varde

• B.1: Completely Randomized Design
• B.2: Randomized Block Design
• B.3: Latin Square Design
• B.4: Factorial Design.

B.1: Completely Randomized Design

• One Effect Factor (Dependent Variable): Measurable, Countable, Cardinal, Metric
• One Cause Factor (Independent Variable): Non-metric, Nominal or Ordinal, Categorical
• No Extraneous Variable.

B.1: Completely Randomized Design

• Experiment: To determine the effect of training on job performance
• Cause factor: Training. Effect factor: Job performance
• No extraneous factor to influence performance
• Randomly assign (toss a coin) experimental units (employees) to EG and CG equally
• Expose EG to training. No training to CG
• Evaluate job performance after a while
• If the difference is significant, conclude that the training is effective.

B.1: Completely Randomized Design

• This example: two categories in the cause factor
• We can have many categories, called LEVELS of TREATMENT
• Example: 3 types of training: conventional lectures, case studies, group discussion
• Q: Which one is most effective?
• Randomly assign employees equally to 3 groups.

Statistical Model

$y_{ij} = \mu + t_j + e_{ij}$

where
• $y_{ij}$ = ith observation for the jth treatment level, with i = 1, 2, ..., n and j = 1, 2, ..., k treatments (3 types of training)
• $\mu$ = overall mean
• $t_j$ = effect of the jth treatment (type of training)
• $e_{ij}$ = experimental error for the ith observation subjected to the jth treatment.

Statistical Analysis

• Overall Mean: $\mu = \frac{1}{nk} \sum_i \sum_j y_{ij}$
• Treatment Means: $\mu_j = \frac{1}{n} \sum_i y_{ij}$ for j = 1, 2, ..., k
• Total Sum of Squares: $SST = \sum_i \sum_j (y_{ij} - \mu)^2$
• Treatment Sum of Squares: $SSTr = n \sum_j (\mu_j - \mu)^2$
• Error Sum of Squares: $SSE = \sum_i \sum_j (y_{ij} - \mu_j)^2$

Analysis of Variance Table for Completely Randomized Design

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Squares              F Ratio
Between Treatments    SSTr             k - 1                MSTr = SSTr / (k - 1)     FTr = MSTr / MSE
Residual Error        SSE              k(n - 1)             MSE = SSE / [k(n - 1)]
Total                 SST              nk - 1

Hypothesis Testing

For Treatments:
• H0: $t_1 = t_2 = \dots = t_k$
• H1: H0 is not true
• If FTr > the F value for k - 1 and k(n - 1) degrees of freedom at the stipulated level of confidence, say 95%, then we reject H0 for treatments
• That means the treatment effects differ significantly from each other.

B.1: Completely Randomized Design

• Assumption in this example: no extraneous factor that would influence job performance
• But suppose we suspect that gender (M / F) can influence performance, say, in the food processing or garment industry
• Then gender is another cause factor
• To understand the effect of both on performance, we use the Randomized Block Design.

B.2: Randomized Block Design

• One Effect Factor (Dependent Variable): Measurable, Countable, Cardinal, Metric
• Two Cause Factors (Independent Variables), or one principal cause factor and one extraneous factor: both Non-metric, Nominal or Ordinal, Categorical
• No interaction between the two Cause Factors.

B.2: Randomized Block Design

• Experiment: To determine the impact of a price change on sales of a health drink
• Cause factor: Price. Effect factor: Sales
• Conduct the experiment using 4 price levels (cause factor, i.e. treatment): Rs. 100, 120, 150 & 175 (4 treatment levels)
• Store type (chemist, grocery shop, and supermarket) could also affect sales.

B.2: Randomized Block Design (example; note that only 12 observations are needed instead of 12 x 4 = 48)

• Thus, store type is an extraneous variable (called a Block for historical reasons). Here, there are 3 blocks.
• So, 4 x 3 = 12 retail outlets are selected according to store type, 4 of each type:
  – Chemist: A, B, C, D (4 chemists)
  – Grocery Shop: I, II, III, IV (4 grocery shops)
  – Supermarket: P, Q, R, S (4 supermarkets)
• Price levels are assigned randomly to the retail outlets within each store type.

Example of Randomized Block Design (assignment of outlets)

Price     Chemist   Grocery Shop   Supermarket
Rs 100    C         II             S
Rs 120    D         III            P
Rs 150    A         IV             Q
Rs 175    B         I              R

Example of Randomized Block Design (number of units sold)

Price     Chemist   Grocery Shop   Supermarket   Total
Rs 100    308       867            129           1304
Rs 120    216       669            104            989
Rs 150    163       557             95            815
Rs 175    142       490             86            718
Total     829       2583           414           3826

Statistical Model

$y_{ij} = \mu + \beta_i + t_j + e_{ij}$

where
• $y_{ij}$ = observation for the jth treatment level in the ith block
• i = 1, 2, ..., n blocks (3 store types)
• j = 1, 2, ..., k treatments (4 price levels)
• $\mu$ = overall mean
• $\beta_i$ = effect of the ith block
• $t_j$ = effect of the jth treatment
• $e_{ij}$ = experimental error in the ith block subjected to the jth treatment.

Experimental Error

• Error-free real life is a utopia
• In spite of all precautions, some error can creep into the experiment
• The best solution is to measure the experimental error and check whether it is within acceptable limits, say ± 5%.

Sources of Error

• Unexpected event during the experiment
• Subjects getting bored or aging during the experiment
• Post-test being familiar to pre-tested subjects
• Non-uniformity of measurement tools
• Unwillingness of some selected subjects
• Outliers included in the random sample
• Drop-out / mortality during the experiment.

Statistical Analysis

• Overall Mean: $\mu = \frac{1}{nk} \sum_i \sum_j y_{ij}$
• Block Means: $\mu_{i.} = \frac{1}{k} \sum_j y_{ij}$ for i = 1, 2, ..., n
• Treatment Means: $\mu_{.j} = \frac{1}{n} \sum_i y_{ij}$ for j = 1, 2, ..., k
• Total Sum of Squares: $SST = \sum_i \sum_j (y_{ij} - \mu)^2$
• Block Sum of Squares: $SSB = k \sum_i (\mu_{i.} - \mu)^2$
• Treatment Sum of Squares: $SSTr = n \sum_j (\mu_{.j} - \mu)^2$
• Error Sum of Squares: $SSE = \sum_i \sum_j (y_{ij} - \mu_{i.} - \mu_{.j} + \mu)^2$

Analysis of Variance Table for Randomized Block Design

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Squares                      F Ratio
Between Blocks        SSB              n - 1                MSB = SSB / (n - 1)               FB = MSB / MSE
Between Treatments    SSTr             k - 1                MSTr = SSTr / (k - 1)             FTr = MSTr / MSE
Residual Error        SSE              (n - 1)(k - 1)       MSE = SSE / [(n - 1)(k - 1)]
Total                 SST              nk - 1

Hypothesis Testing

For Blocks:
• H0: $\beta_1 = \beta_2 = \dots = \beta_n$
• H1: H0 is not true
• If FB > the F value for n - 1 and (n - 1)(k - 1) degrees of freedom at the stipulated level of confidence, say 95%, then we reject H0 for blocks; that means the block effects differ from each other.

For Treatments:
• H0: $t_1 = t_2 = \dots = t_k$
• H1: H0 is not true
• If FTr > the F value for k - 1 and (n - 1)(k - 1) degrees of freedom at the stipulated level of confidence, say 95%, then we reject H0 for treatments; that means the treatment effects differ from each other.

B.3: Latin Square Design

• One Effect Factor (Dependent Variable): Measurable, Countable, Cardinal, Metric
• Three Cause Factors (Independent Variables), or one principal cause factor and two extraneous factors: all Non-metric, Nominal or Ordinal, Categorical
• No interaction among the three Cause Factors.

B.3: Latin Square Design

• Experiment: To find out the impact of three different ads on sales of refrigerators
• Effect factor: Sales
• Cause factor: Ads (3 versions: A, B & C)
• Two extraneous factors:
  1. Product Pricing (3 levels: Rs. 20000, 25000, 30000)
  2. Consumer Income (3 levels: low, mid, high).
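Returning to the randomized block example (B.2), the sums of squares and F ratios defined there can be computed directly from the units-sold table. This is a minimal sketch under the assumption that rows are treatments (price levels) and columns are blocks (store types); the function name `rbd_anova` and the returned dictionary keys are hypothetical.

```python
def rbd_anova(y):
    """Randomized block ANOVA; y[j][i] is the observation for treatment j in block i."""
    k = len(y)        # number of treatments (price levels)
    n = len(y[0])     # number of blocks (store types)
    grand = sum(sum(row) for row in y) / (n * k)
    treat_means = [sum(row) / n for row in y]
    block_means = [sum(y[j][i] for j in range(k)) / k for i in range(n)]

    sst = sum((y[j][i] - grand) ** 2 for j in range(k) for i in range(n))
    ssb = k * sum((m - grand) ** 2 for m in block_means)
    sstr = n * sum((m - grand) ** 2 for m in treat_means)
    sse = sum((y[j][i] - block_means[i] - treat_means[j] + grand) ** 2
              for j in range(k) for i in range(n))

    mse = sse / ((n - 1) * (k - 1))
    return {
        "SST": sst, "SSB": ssb, "SSTr": sstr, "SSE": sse,
        "df_blocks": n - 1, "df_treatments": k - 1, "df_error": (n - 1) * (k - 1),
        "FB": (ssb / (n - 1)) / mse,
        "FTr": (sstr / (k - 1)) / mse,
    }


# Units sold: rows are Rs 100, 120, 150, 175; columns are chemist, grocery shop, supermarket.
units_sold = [
    [308, 867, 129],
    [216, 669, 104],
    [163, 557, 95],
    [142, 490, 86],
]

result = rbd_anova(units_sold)
for name, value in result.items():
    print(name, round(value, 3))
```

The computed FB and FTr would then be compared with tabulated F values for (2, 6) and (3, 6) degrees of freedom at the chosen confidence level.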
B.3: Latin Square Design

Construct a 3 x 3 table (total 9 cells):
• Rows show the 3 levels of one extraneous factor (Product Pricing)
• Columns show the 3 levels of the other extraneous factor (Consumer Income)
• Assign the 3 ad versions to the 9 cells in such a way that each row and each column has all 3 ads
• Note: only 9 observations instead of 3 x 3 x 3 = 27 observations.

Example of Latin Square Design

Pricing Levels   Low Income   Middle Income   High Income
Rs 20000         Ad-B         Ad-A            Ad-C
Rs 25000         Ad-C         Ad-B            Ad-A
Rs 30000         Ad-A         Ad-C            Ad-B

Statistical Model

$y_{ijk} = \mu + r_i + c_j + t_k + e_{ijk}$

where
• $y_{ijk}$ = observation in the ith row and jth column subjected to the kth treatment
• i = 1, 2, ..., n; j = 1, 2, ..., n; k = 1, 2, ..., n
• n = number of treatments
• $\mu$ = overall mean
• $r_i$ = effect of the ith row (ith level of extraneous factor 1)
• $c_j$ = effect of the jth column (jth level of extraneous factor 2)
• $t_k$ = effect of the kth level of treatment (cause factor)
• $e_{ijk}$ = experimental error in the ith row and jth column subjected to the kth treatment.

Statistical Analysis

• Overall Mean: $\mu = \frac{1}{n^2} \sum_i \sum_j y_{ijk}$
• Row Means: $\mu_{i..} = \frac{1}{n} \sum_j y_{ijk}$ for i = 1, 2, ..., n
• Column Means: $\mu_{.j.} = \frac{1}{n} \sum_i y_{ijk}$ for j = 1, 2, ..., n
• Treatment Means: $\mu_{..k} = \frac{1}{n} \sum y_{ijk}$ for k = 1, 2, ..., n (summed over the n cells that receive treatment k)

Statistical Analysis

• Total Sum of Squares: $SST = \sum_i \sum_j (y_{ijk} - \mu)^2$
• Row Sum of Squares: $SSR = n \sum_i (\mu_{i..} - \mu)^2$
• Column Sum of Squares: $SSC = n \sum_j (\mu_{.j.} - \mu)^2$
• Treatment Sum of Squares: $SSTr = n \sum_k (\mu_{..k} - \mu)^2$
• Error Sum of Squares: $SSE = \sum_i \sum_j (y_{ijk} - \mu_{i..} - \mu_{.j.} - \mu_{..k} + 2\mu)^2$

Analysis of Variance Table for Latin Square Design

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Squares                      F Ratio
Between Rows          SSR              n - 1                MSR = SSR / (n - 1)               FR = MSR / MSE
Between Columns       SSC              n - 1                MSC = SSC / (n - 1)               FC = MSC / MSE
Between Treatments    SSTr             n - 1                MSTr = SSTr / (n - 1)             FTr = MSTr / MSE
Residual Error        SSE              (n - 1)(n - 2)       MSE = SSE / [(n - 1)(n - 2)]
Total                 SST              n^2 - 1

Hypothesis Testing

For Rows:
• H0: $r_1 = r_2 = \dots = r_n$
• H1: H0 is not true
• If FR > the F value for n - 1 and (n - 1)(n - 2) degrees of freedom at the stipulated level of confidence, say 95%, then we reject H0 for rows; that means the row (extraneous factor 1) effects differ from each other.

For Columns:
• H0: $c_1 = c_2 = \dots = c_n$
• H1: H0 is not true
• If FC > the F value for n - 1 and (n - 1)(n - 2) degrees of freedom, then we reject H0 for columns; that means the column (extraneous factor 2) effects differ from each other.

For Treatments:
• H0: $t_1 = t_2 = \dots = t_n$
• H1: H0 is not true
• If FTr > the F value for n - 1 and (n - 1)(n - 2) degrees of freedom, then we reject H0 for treatments; that means the treatment effects differ from each other.

Latin Square Design

• Limitation: In a Latin square, the number of levels of all three cause factors must be the same (here, 3 each)
• The 3 ads are assigned to the cells randomly, subject to each row and each column containing all 3 ads
• The effect on sales is determined for each cell
• The analysis shows which ad influences sales most, irrespective of the two extraneous factors, viz. Product Pricing & Income Levels of Consumers.

Latin Square Design

Benefits:
• Needs substantially fewer test units
• Performs randomization with respect to row and column effects
• Thus, neutralizes the effect of the extraneous factors
• Applies when the three cause factors do not interact with each other
• If they do, use Factorial Design.
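The defining property of the Latin square layout (every ad appears exactly once in each pricing row and each income column) can be verified mechanically. A minimal sketch; the helper name `is_latin_square` is hypothetical, and the layout is the 3 x 3 ad assignment from the example table.

```python
def is_latin_square(square):
    """True if every row and every column contains each treatment exactly once."""
    n = len(square)
    symbols = set(square[0])
    if len(symbols) != n:
        return False
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[i][j] for i in range(n)} == symbols for j in range(n))
    return rows_ok and cols_ok


# Rows: Rs 20000, 25000, 30000. Columns: low, middle, high income.
ad_layout = [
    ["B", "A", "C"],
    ["C", "B", "A"],
    ["A", "C", "B"],
]

n = len(ad_layout)
print("valid Latin square:", is_latin_square(ad_layout))
print("observations needed:", n * n, "instead of", n ** 3)
print("error degrees of freedom:", (n - 1) * (n - 2))
```

With n = 3 the residual error has only (n - 1)(n - 2) = 2 degrees of freedom, which is one reason small Latin squares give weak F tests.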
B.4: Factorial Design

• One Effect Factor (Dependent Variable): Metric
• Many Cause Factors (Independent Variables), or one principal cause factor and several extraneous factors: all Categorical
• Interaction among all cause factors.

B.4: Factorial Design

An experiment to investigate interaction among all cause factors.

Factorial Design

Two effects are detected: Main effect & Interaction effect
• Main Effect of a cause factor (ads) is its direct influence on the effect factor (sales)
• Interaction Effect of two cause factors is the influence of the interaction between the two cause factors (consumer income and ads) on the effect factor (sales).

Example of Interaction Effect

• Experiment: To determine the believability of two ads on a 0-100 scale
• Two different ads, A & B, are to be compared
• Effect factor: Believability
• Cause factor: Ads
• Gender of the reader is the extraneous factor.

Example of Factorial Design

This is a 2 X 2 factorial experiment. It permits testing 3 hypotheses:
• Which ad is more believable (Main effect)
• Which gender tends to believe magazine ads more (Main effect)
• Which gender finds which ad more believable (Interaction effect).

Example of 2 X 2 Factorial Design

R (random assignment):
• Men + Ad A: O1 = 60
• Men + Ad B: O2 = 70
• Women + Ad A: O3 = 80
• Women + Ad B: O4 = 50

Believability Scores

                    Ad A   Ad B   Main Effect of Gender
Men                 60     70     65
Women               80     50     65
Main Effect of Ad   70     60

Main Effects

• Which ad is more believable (Main effect)
  – Ad A: (60 + 80) / 2 = 70
  – Ad B: (70 + 50) / 2 = 60
• Which gender tends to believe magazine ads more (Main effect)
  – Men: (60 + 70) / 2 = 65
  – Women: (80 + 50) / 2 = 65

Interaction Effects

• Which gender finds which ad more believable (Interaction effect)
  – Men: Ad B: 70 against 60
  – Women: Ad A: 80 against 50

[Figure: Interaction Between Gender and Advertising Copy. Believability (vertical axis, 10 to 100) plotted for Ad A and Ad B, with one line for Men and one for Women.]

Factorial Design

• Useful when several cause factors are being investigated and when they interact with each other significantly
• Factorial design covers all possible combinations of all factors under study
• Obviously, it needs a large cost and time budget.

Structure of 2 X 2 Factorial Design (two cause factors X1 & X2, each at 2 levels)

R:
• X1 + X2: O1
• X1 + No X2: O2
• No X1 + X2: O3
• No X1 + No X2: O4

2 x 2 x 2 Factorial Design

• Experiment: To determine the effect of training on job performance
• Effect factor: Job performance
• Cause factors:
  – Training (2 levels: Training / No Training)
  – Gender (2 levels: Male / Female)
  – Science or Non-science graduate (2 levels).

Structure of 2 X 2 X 2 Factorial Design (three cause factors X1, X2 & X3, each at 2 levels)

• X1 + X2 + X3: O1
• X1 + X2 + No X3: O2
• X1 + No X2 + X3: O3
• X1 + No X2 + No X3: O4
• No X1 + X2 + X3: O5
• No X1 + X2 + No X3: O6
• No X1 + No X2 + X3: O7
• No X1 + No X2 + No X3: O8

Factorial Design

• A 3X2 factorial design has one cause factor with 3 levels & a second cause factor with 2 levels
• A 3X3 factorial design has two cause factors, each with 3 levels
• A 3X2X4X2 factorial design has four cause factors with 3, 2, 4 & 2 levels respectively.

End of Design of Experiments
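The main effects and the interaction effect in the 2 X 2 believability example can be reproduced with a few lines of arithmetic. A minimal sketch; the dictionary layout and the helper name `level_means` are illustrative assumptions, and the interaction is expressed here as a simple difference of differences.

```python
# Believability scores keyed by (gender, ad), taken from the 2 X 2 example.
scores = {
    ("Men", "A"): 60, ("Men", "B"): 70,
    ("Women", "A"): 80, ("Women", "B"): 50,
}


def level_means(cells, position):
    """Average score for each level of the factor stored at the given key position."""
    groups = {}
    for key, value in cells.items():
        groups.setdefault(key[position], []).append(value)
    return {level: sum(values) / len(values) for level, values in groups.items()}


gender_means = level_means(scores, 0)   # main effect of gender
ad_means = level_means(scores, 1)       # main effect of ad

# Interaction: does the A-versus-B gap depend on gender?
gap_men = scores[("Men", "A")] - scores[("Men", "B")]
gap_women = scores[("Women", "A")] - scores[("Women", "B")]
interaction = gap_men - gap_women

print("gender means:", gender_means)
print("ad means:", ad_means)
print("A minus B for men:", gap_men, "for women:", gap_women)
print("difference of differences:", interaction)
```

Equal gender means alongside a non-zero difference of differences is the pattern the interaction plot shows as crossing lines.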

Multivariate Analysis

Cluster Analysis
Dr. Sharad Varde

A Clarification of Terminology

• In sampling, ‘CLUSTER’ is a term used to denote a group of heterogeneous elements
• The population consists of several such clusters
• Each cluster offers the entire range of variation available in the population
• Each cluster is similar to other clusters
• Inter-cluster homogeneity and intra-cluster heterogeneity
• We can choose any cluster as a sample representative of the population.

A Clarification of Terminology

• In sampling, ‘STRATUM’ is a term used to denote a group of homogeneous elements
• The population consists of several such strata
• Each stratum is different from other strata
• Intra-stratum homogeneity and inter-stratum heterogeneity
• We must select all strata and choose a few elements from each stratum to obtain a sample representative of the population.

A Clarification of Terminology

• What is called a STRATUM in sampling is called a CLUSTER in multivariate analysis
• So, in multivariate analysis, a cluster is a group of homogeneous elements
• This matches the dictionary meaning of the word ‘cluster’
• The population consists of several such clusters
• Each cluster is different from other clusters
• Inter-cluster heterogeneity and intra-cluster homogeneity
• Cluster analysis is also called Classification Analysis, or Numerical Taxonomy.

Objective of Cluster Analysis

• To divide the heterogeneous population into a number of homogeneous groups (clusters) in such a manner that elements similar to each other in respect of the characteristics of interest are bunched together in a cluster
• The population is thus divided into several bunches called clusters
• This is called ‘Market Segmentation’
• We then study each cluster in detail.
[Figure: Elements of a Population. Elements plotted with Variable 1 on one axis and Variable 2 on the other.]

[Figure: Clustered Elements. The same elements grouped into clusters.]

Case Study # 4

• Indian Railways wanted to map the profile of its target audience (potential customers) in terms of lifestyle, attitudes & perceptions
• A set of 15 statements was prepared to measure these characteristics
• Respondents were asked to tick 1: strongly agree, 2: agree, 3: neutral, 4: disagree, 5: strongly disagree against each statement
• Cluster analysis divided the respondents into 4 homogeneous clusters.

Cluster 1

• They are careful spenders
• They feel that quality comes at a price, a car is not a necessity, people are not more health-conscious now, women are not active decision makers, foreign firms have increased the efficiency of Indian firms, and politicians can play an active role
• They don't like TV, fast food, credit cards, movies, or weekend outings
• Thus, they exhibit many traditional values.

Other Clusters

• Cluster 2: They use credit cards, spend freely, travel, believe in women power, believe in economics more than in politics, and feel quality products can cost less
• Cluster 3: Health-conscious, spend carefully, brand loyal, outgoing, extrovert by nature, would like to settle abroad
• Cluster 4: Optimistic, love TV, believe in value for money, free spenders on items they like, travel a lot
• IR then studied the demographic profile of each cluster to evolve a communication & marketing strategy
• Let us see how this clustering is actually done.

Conducting Cluster Analysis

1. Formulate the Problem
2. Select a Distance Measure
3. Select a Clustering Procedure
4. Decide on the Number of Clusters
5. Interpret and Profile Clusters
6. Assess the Validity of Clustering

Formulating the Problem

• Select the variables most relevant to our inquiry
• Inclusion of even one or two irrelevant variables may distort an otherwise useful clustering solution
• In descriptive research, past studies & present hypotheses help in the selection of variables
• In exploratory research, use judgment & intuition to select relevant variables.

Distance Measure

• Basis of Cluster Analysis: the concept of distance between two objects (respondents) in terms of the variables of interest
• The most commonly used measure is Euclidean Distance
• Euclidean Distance is the square root of the sum of the squared differences in values for each variable.

Example

Responses of person #1 & #2 to three statements on a five-point scale:
• Statement 1: I prefer to use e-mail rather than write a letter
• Statement 2: I feel that good quality products are always priced high
• Statement 3: TV is a major source of entertainment.

Euclidean Distance between Resp 1 and Resp 2 is 3.74

              Resp 1   Resp 2   Difference     (Diff)^2
Statement 1   1        3        |1 - 3| = 2    4
Statement 2   5        2        |5 - 2| = 3    9
Statement 3   3        4        |3 - 4| = 1    1

$\sqrt{\sum (\text{Diff})^2} = \sqrt{14} = 3.74$

Other Measures of Distance

• The City-block or Manhattan Distance between two objects j and k is the sum of the absolute differences in values for each variable: $\sum_i |d_{ij} - d_{ik}|$ (6 in the above example)
• The Chebychev Distance between two objects is the maximum absolute difference in values for any variable: $\max_i |d_{ij} - d_{ik}|$ (3 in the above example)

Clustering Procedures

Basic methods are of two types:
1. Hierarchical (or Linkage) Methods: A complete range of solutions, varying from 1 to n - 1 clusters, is provided by the computer, where n is the number of objects being studied (respondents)
2. Non-hierarchical (or Nodal) Methods: The number of clusters to be extracted is specified in advance.

Classification of Clustering Procedures

• Clustering Procedures
  – Hierarchical
    · Agglomerative
      · Linkage Methods: Single, Complete, Average
      · Variance Methods: Ward’s Method
      · Centroid Methods
    · Divisive
  – Nonhierarchical
    · Sequential Threshold
    · Parallel Threshold
    · Optimizing Partitioning

Hierarchical Clustering

• It is the development of a hierarchy or tree-like structure. It can be agglomerative or divisive
• Agglomerative Clustering starts with each object in a separate cluster (i.e. c = n). Objects are then grouped into bigger and bigger clusters. This process continues until all objects are members of a single cluster (i.e. c = 1)
• Divisive Clustering is exactly the opposite. It starts with all the objects grouped in a single cluster (i.e. c = 1). The cluster is then progressively split until each object is in a separate cluster (i.e. c = n).

Agglomerative Clustering

Agglomerative methods consist of:
• Linkage methods
• Variance methods
• Centroid methods

Linkage methods are further of three types:
• Single Linkage
• Complete Linkage
• Average Linkage.

Linkage Methods

• Single Linkage method is based on minimum distance, or the ‘nearest neighbour rule’. Here, the distance between two clusters is the distance between their two closest points
• Complete Linkage method is based on maximum distance, or the ‘farthest neighbour rule’. Here, the distance between two clusters is the distance between their two farthest points
• Average Linkage method defines the distance between two clusters as the average of the distances between all pairs of objects, one from each cluster. It is the most commonly used in research studies.
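The distance measures and linkage rules described above can be sketched in a few lines. The first part reuses the two-respondent example; the two small clusters in the second part are hypothetical points chosen only for illustration, and the function names are assumptions rather than any package's API.

```python
import math
from itertools import product


def euclidean(a, b):
    """Square root of the sum of squared differences across variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebychev(a, b):
    """Maximum absolute difference on any single variable."""
    return max(abs(x - y) for x, y in zip(a, b))


def linkage_distance(cluster_1, cluster_2, method="single"):
    """Distance between two clusters under single, complete or average linkage."""
    pairs = [euclidean(p, q) for p, q in product(cluster_1, cluster_2)]
    if method == "single":
        return min(pairs)
    if method == "complete":
        return max(pairs)
    if method == "average":
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown method: {method}")


# Responses of the two respondents to the three statements.
resp_1 = (1, 5, 3)
resp_2 = (3, 2, 4)
print(round(euclidean(resp_1, resp_2), 2), manhattan(resp_1, resp_2), chebychev(resp_1, resp_2))

# Hypothetical clusters, each holding two objects measured on two variables.
cluster_a = [(0, 0), (0, 1)]
cluster_b = [(3, 0), (4, 0)]
for method in ("single", "complete", "average"):
    print(method, round(linkage_distance(cluster_a, cluster_b, method), 3))
```

Single linkage can never exceed average linkage, which in turn can never exceed complete linkage, since they are the minimum, mean and maximum of the same set of pairwise distances.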
21 Sharad Varde Pictorial Representation 22 Other Agglomerative Methods Single Linkage Variance Method generates clusters to minimize within-cluster variation Minimum Distance Cluster 1 Sharad Varde Cluster 2 Ward's Procedure is most popular variance method. Complete Linkage For each cluster, compute means for all variables Maximum Distance Then, for each object, calculate squared Euclidean distance to the cluster means Cluster 1 Sum them up for all objects. Cluster 2 Average Linkage At each stage, combine two clusters with smallest increase in overall sum squared Euclidean distance. Average Distance Cluster 1 Sharad Varde Cluster 2 23 Sharad Varde 24 Ward’s Procedure V1 V2 Pictorial Representation --- Vn O1 O2 ------Om Means (E.D.)2 Ward’s Procedure Centroid Method Σ(E.D.)2 Sharad Varde 25 Other Agglomerative Methods Sharad Varde 26 Steps in Computerized Procedure Run the Hierarchical Clustering Programme on the variables Centroid Method computes distance between Generate output called Agglomeration Schedule the centroids (means for all the variables) of It shows all possible solutions from 1 to n-1 clusters (n = number of respondents or objects) clusters. Every time objects are regrouped, a Going up from the bottom of the Agglomeration Schedule look at the column called Coefficients to decide on number of clusters new centroid is computed. In this column starting from the bottom, calculate difference in the value of coefficient in the neighbouring rows. Average Linkage and Ward's Procedure If the maximum value of this difference occurs, say, between third & fourth row from the bottom it indicates existence of 3 clusters (the lower row number). This is purely judgmental. perform better than other hierarchical methods. Dendrogram gives essentially same information in graphical form. 
Case Study # 5

• Problem: Clustering of consumers based on their attitude towards shopping at Wonder Mall
• Six attitudinal variables were identified:
  – V1: Shopping is fun for me
  – V2: Shopping is bad for my budget
  – V3: I combine shopping with eating out
  – V4: I get best buys when shopping here
  – V5: I do not care about shopping
  – V6: I can save a lot of money by comparing prices

Case Study # 5

• Consumers were asked to express their degree of agreement with these statements on a 7-point scale (1 = Strongly Disagree; 7 = Strongly Agree)
• Data obtained from 20 respondents are shown in the next slide
• In reality, the sample size was much larger.

Case Study # 5 Input Data

Cons No.   V1   V2   V3   V4   V5   V6
1          6    4    7    3    2    3
2          2    3    1    4    5    4
3          7    2    6    4    1    3
4          4    6    4    5    3    6
5          1    3    2    2    6    4
6          6    4    6    3    3    4
7          5    3    6    3    3    4
8          7    3    7    4    1    4
9          2    4    3    3    6    3
10         3    5    3    6    4    6
11         1    3    2    3    5    3
12         5    4    5    4    2    4
13         2    2    1    5    4    4
14         4    6    4    6    4    7
15         6    5    4    2    1    4
16         3    5    4    6    4    7
17         4    4    7    2    2    5
18         3    7    2    6    4    3
19         4    6    3    7    2    7
20         2    3    2    4    7    2

Results of Hierarchical Clustering

Agglomeration Schedule Using Ward’s Procedure

         Clusters combined                     Stage cluster first appears
Stage    Cluster 1   Cluster 2   Coefficient   Cluster 1   Cluster 2   Next stage
1        14          16            1.000000    0           0           6
2        6           7             2.000000    0           0           7
3        2           13            3.500000    0           0           15
4        5           11            5.000000    0           0           11
5        3           8             6.500000    0           0           16
6        10          14            8.160000    0           1           9
7        6           12           10.166667    2           0           10
8        9           20           13.000000    0           0           11
9        4           10           15.583000    0           6           12
10       1           6            18.500000    6           7           13
11       5           9            23.000000    4           8           15
12       4           19           27.750000    9           0           17
13       1           17           33.100000    10          0           14
14       1           15           41.333000    13          0           16
15       2           5            51.833000    3           11          18
16       1           3            64.500000    14          5           19
17       4           18           79.667000    12          0           18
18       2           4           172.662000    15          17          19
19       1           2           328.600000    16          18          0

Dendrogram

• A dendrogram, or tree graph, is a graphical device for displaying clustering results
• Vertical lines represent clusters that are joined together
• The position of the line on the scale indicates the distance at which the clusters were joined
• The dendrogram is read from left to right.

[Figure: Dendrogram Using Ward’s Method.]

Results of Hierarchical Clustering

Cluster Membership of Cases Using Ward’s Procedure

              Number of Clusters
Label case    4    3    2
1             1    1    1
2             2    2    2
3             1    1    1
4             3    3    2
5             2    2    2
6             1    1    1
7             1    1    1
8             1    1    1
9             2    2    2
10            3    3    2
11            2    2    2
12            1    1    1
13            2    2    2
14            3    3    2
15            1    1    1
16            3    3    2
17            1    1    1
18            4    3    2
19            3    3    2
20            2    2    2

Interpretation

• How many clusters? Answer: not too many, not too few
• Sometimes, decision makers may want a particular number of clusters
• Common-sense considerations rule out 1 or 2 clusters as meaningless
• A 3-cluster solution results in clusters with 8, 6 & 6 respondents.

Interpretation

• A 4-cluster solution has 8, 6, 5 & 1 respondents
• It is meaningless to have a cluster with only one case
• So a 3-cluster solution is preferable
• Interpreting & profiling clusters involves examining the cluster centroids.

3 Clusters

• Cluster 1: Respondent No.: 1, 3, 6, 7, 8, 12, 15, 17
• Cluster 2: Respondent No.: 2, 5, 9, 11, 13, 20
• Cluster 3: Respondent No.: 4, 10, 14, 16, 18, 19.

Cluster Centroids

              Means of Variables
Cluster No.   V1      V2      V3      V4      V5      V6
1             5.750   3.625   6.000   3.125   1.750   3.875
2             1.667   3.000   1.833   3.500   5.500   3.333
3             3.500   5.833   3.333   6.000   3.500   6.000

Interpretation

Cluster 1 is:
• High on V1: Shopping is fun
• High on V3: Combine shopping with eating out
• Low on V5: Do not care about shopping, i.e. they care about shopping
• In short: Fun-loving & concerned shoppers.

Interpretation

Cluster 2 is:
• High on V5: Do not care about shopping
• Low on V1: Shopping is fun
• Low on V3: Combine shopping with eating out
• In short: Apathetic shoppers.

Interpretation

Cluster 3 is:
• High on V2: Shopping upsets budget
• High on V4: Try to get best buys
• High on V6: Can save a lot of money by comparing prices
• In short: Economical shoppers.
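The cluster sizes quoted in the interpretation, and the way a centroid is obtained from the membership table, can be checked with a short sketch. To keep it brief, only variable V1 is averaged here; the helper names are hypothetical, and the lists are copied from the input data and cluster membership tables of Case Study # 5.

```python
from collections import Counter

# V1 scores for respondents 1..20 (from the input data table).
v1 = [6, 2, 7, 4, 1, 6, 5, 7, 2, 3, 1, 5, 2, 4, 6, 3, 4, 3, 4, 2]

# Cluster membership of respondents 1..20 for the 3- and 4-cluster solutions.
three_clusters = [1, 2, 1, 3, 2, 1, 1, 1, 2, 3, 2, 1, 2, 3, 1, 3, 1, 3, 3, 2]
four_clusters = [1, 2, 1, 3, 2, 1, 1, 1, 2, 3, 2, 1, 2, 3, 1, 3, 1, 4, 3, 2]


def cluster_sizes(membership):
    """Number of respondents in each cluster, ordered by cluster number."""
    return dict(sorted(Counter(membership).items()))


def cluster_means(values, membership):
    """Mean of one variable within each cluster (one coordinate of the centroid)."""
    totals = {}
    for value, cluster in zip(values, membership):
        totals[cluster] = totals.get(cluster, 0) + value
    sizes = Counter(membership)
    return {cluster: totals[cluster] / sizes[cluster] for cluster in sorted(totals)}


print("3-cluster sizes:", cluster_sizes(three_clusters))
print("4-cluster sizes:", cluster_sizes(four_clusters))
print("V1 means by cluster:",
      {c: round(m, 3) for c, m in cluster_means(v1, three_clusters).items()})
```

Applying `cluster_means` to each of V1 to V6 in turn would build up the full centroid table used to profile the clusters.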
Non-hierarchical Clustering

• The number of clusters is specified in advance
• Also called K-means clustering, it has 3 iterative methods:
  – Sequential Threshold Method: Select a cluster center and group together the objects within a pre-specified threshold value from the center. Then select a second cluster center and repeat the process for the un-clustered objects. Continue until the required number of clusters is configured
  – Parallel Threshold Method: Select several cluster centers simultaneously and assign objects within the threshold level to the nearest center
  – Optimizing Partitioning Method: Here, objects can later be reassigned to clusters to optimize a criterion, such as the average within-cluster distance for a given number of clusters.

Non-hierarchical Clustering Procedure

• The number of clusters is specified by the decision maker
• Now run the non-hierarchical clustering procedure on the input data
• The output gives the final configuration of each cluster.

Further Work

• Further profiling can be done on the basis of variables not used for clustering
• Identification factors, e.g. demographic and economic variables, are used to identify the members of each cluster
• The variables that significantly differentiate between clusters can be obtained through Discriminant Analysis.

End of Cluster Analysis

Multiple Discriminant Analysis

Major Dependence Methods

Dependent Variable   Independent Variables   Method
Metric               Categorical             Analysis of Variance
Metric               Metric                  Multiple Regression
Categorical          Categorical             Canonical Correlation
Categorical          Metric                  Multiple Discriminant

Difference Between Cluster Analysis & Discriminant Analysis

• Both classify population elements into groups
• Cluster Analysis classifies them into relatively homogeneous groups called clusters. Elements in each cluster are dissimilar to those in other clusters
• Discriminant Analysis develops a classification rule to assign a new element to a particular group of the population
• In cluster analysis there is no a-priori information about which element belongs to which cluster. Clusters are formed by the data.

Discriminant Analysis

Helps in discriminating between two or more sets of objects or people based on the knowledge of some of their characteristics. For example:
• Discriminating between bones of males & females
• Classifying people into potential buyers or non-buyers
• Classifying individuals as excellent, acceptable or bad credit risks
• Classifying companies as A, B or C investment risks
• Discriminating between brand loyals & brand switchers

Terminology

• Predictor: Independent variable (metric)
• Criterion: Dependent variable (categorical)
• Discriminant Function: Linear combination of the predictors (independent variables) which will best discriminate between the different categories of the criterion (dependent variable)

What Discriminant Analysis does

1. Analyses past data on predictors & criterion
2. Develops a Discriminant Function to discriminate between the different categories of the criterion
3. Evaluates the accuracy of classification
4. Classifies objects or people into one of the categories based on the values of the predictors.

Discriminant Analysis Model

$D = b_0 + b_1X_1 + b_2X_2 + b_3X_3 + \dots + b_kX_k$

where
• D = discriminant score
• b's = discriminant coefficients or weights
• X's = predictors (independent variables)

Coefficients, or weights (b), are estimated so that the groups differ as much as possible on the values of the discriminant function.

Process of Discriminant Analysis

• Identify the objectives, criterion & predictors
• The criterion must consist of two or more mutually exclusive and collectively exhaustive categories (Gender: M, F; Investment Risk: A, B, C; People: Buyers, Non-buyers)
• Draw a sample of objects from the population
• Collect data from the sampled objects on the predictor variables for each category of the criterion.
Process of Discriminant Analysis
Split the sample into two unequal parts.
Bigger part of the sample is called 'analysis sample' or 'estimation sample'. It is used to estimate coefficients (weights) b's of the discriminant function.
Other part is called the 'validation sample' or 'holdout sample'. It is reserved to evaluate accuracy of the discriminant function.

Conducting Discriminant Analysis
1. Formulate the Problem
2. Estimate the Discriminant Function Coefficients
3. Determine the Significance of the Discriminant Function
4. Interpret the Results
5. Assess Validity of Discriminant Analysis

Case Study # 6
Problem: To discover salient characteristics of families that visited a vacation resort during last two years
Data were obtained from a sample of 42 families of which 30 were included in analysis sample & 12 in validation sample
Families that visited resort were coded as 1 & those that did not as 2
Both samples were balanced in terms of visits
Predictor variables selected were
V1: Family income
V2: Attitude towards travel measured on a 9-point scale
V3: Importance attached to family vacation measured on a 9-point scale
V4: Household Size
V5: Age of the head of the family.

Case Study # 6: Input Data (analysis sample)
No.  Resort  Annual Family    Attitude  Importance Attached  Household  Age of Head
     Visit   Income (Rs0000)  Toward    to Family Vacation   Size       of Household
                              Travel
 1   1       50.2             5         8                    3          43
 2   1       70.3             6         7                    4          61
 3   1       62.9             7         5                    6          52
 4   1       48.5             7         5                    5          36
 5   1       52.7             6         6                    4          55
 6   1       75.0             8         7                    5          68
 7   1       46.2             5         3                    3          62
 8   1       57.0             2         4                    6          51
 9   1       64.1             7         5                    4          57
10   1       68.1             7         6                    5          45
11   1       73.4             6         7                    5          44
12   1       71.9             5         8                    4          64
13   1       56.2             1         8                    6          54
14   1       49.3             4         2                    3          56
15   1       62.0             5         6                    2          58
16   2       32.1             5         4                    3          58
17   2       36.2             4         3                    2          55
18   2       43.2             2         5                    2          57
19   2       50.4             5         2                    4          37
20   2       44.1             6         6                    3          42
21   2       38.3             6         6                    2          45
22   2       55.0             1         2                    2          57
23   2       46.1             3         5                    3          51
24   2       35.0             6         4                    5          64
25   2       37.3             2         7                    4          54
26   2       41.8             5         1                    3          56
27   2       57.0             8         3                    2          36
28   2       33.4             6         8                    2          50
29   2       37.5             3         2                    3          48
30   2       41.3             3         3                    2          42

Case Study # 6: Validation Sample
No.  Resort  Annual Family    Attitude  Importance Attached  Household  Age of Head
     Visit   Income (Rs0000)  Toward    to Family Vacation   Size       of Household
                              Travel
 1   1       63.6             7         4                    7          55
 2   1       50.8             4         7                    3          45
 3   1       54.0             6         7                    4          58
 4   1       45.0             5         4                    3          60
 5   1       68.0             6         6                    6          46
 6   1       62.1             5         6                    3          56
 7   2       35.0             4         3                    4          54
 8   2       49.6             5         3                    5          39
 9   2       39.4             6         5                    3          44
10   2       37.0             2         6                    5          51
11   2       54.5             7         3                    3          37
12   2       38.2             2         2                    3          49

Computerized Discriminant Analysis

GROUP MEANS
VISIT   INCOME     TRAVEL    VACATION   HSIZE     AGE
1       60.52000   5.40000   5.80000    4.33333   53.73333
2       41.91333   4.33333   4.06667    2.80000   50.13333
Total   51.21667   4.86667   4.93333    3.56667   51.93333

Group Standard Deviations
VISIT   INCOME     TRAVEL    VACATION   HSIZE     AGE
1        9.83065   1.91982   1.82052    1.23443   8.77062
2        7.55115   1.95180   2.05171    0.94112   8.27101
Total   12.79523   1.97804   2.09981    1.33089   8.57395

Pooled Within-Groups Correlation Matrix
           INCOME     TRAVEL     VACATION   HSIZE      AGE
INCOME     1.00000
TRAVEL     0.19745    1.00000
VACATION   0.09148    0.08434    1.00000
HSIZE      0.08887   -0.01681    0.07046    1.00000
AGE       -0.01431   -0.19709    0.01742   -0.04301    1.00000

Wilks' λ (U-statistic) and univariate F ratio with 1 and 28 degrees of freedom
Variable   Wilks' λ   F        Significance
INCOME     0.45310    33.800   0.0000
TRAVEL     0.92479     2.277   0.1425
VACATION   0.82377     5.990   0.0209
HSIZE      0.65672    14.640   0.0007
AGE        0.95441     1.338   0.2572

CANONICAL DISCRIMINANT FUNCTIONS
Function   Eigenvalue   % of Variance   Cum %    Canonical Correlation
1*         1.7862       100.00          100.00   0.8007

After Function   Wilks' λ   Chi-square   df   Significance
0                0.3589     26.130       5    0.0001
* marks the 1 canonical discriminant function remaining in the analysis.

Standardized Canonical Discriminant Function Coefficients
           FUNC 1
INCOME     0.74301
TRAVEL     0.09611
VACATION   0.23329
HSIZE      0.46911
AGE        0.20922

Structure Matrix: Pooled within-groups correlations between discriminating variables & canonical discriminant functions (variables ordered by size of correlation within function)
           FUNC 1
INCOME     0.82202
HSIZE      0.54096
VACATION   0.34607
TRAVEL     0.21337
AGE        0.16354

Unstandardized Canonical Discriminant Function Coefficients
             FUNC 1
INCOME       0.8476710E-01
TRAVEL       0.4964455E-01
VACATION     0.1202813
HSIZE        0.4273893
AGE          0.2454380E-01
(constant)   -7.975476

Canonical discriminant functions evaluated at group means (group centroids)
Group   FUNC 1
1        1.29118
2       -1.29118

Classification results for cases selected for use in analysis
Actual Group   No. of Cases   Predicted Group 1   Predicted Group 2
Group 1        15             12 (80.0%)           3 (20.0%)
Group 2        15              0 (0.0%)           15 (100.0%)
Percent of grouped cases correctly classified: 90.00%

Interpretation
Un-standardized discriminant function is
D = -7.975476
    + 0.8476710E-01 (INCOME)
    + 0.4964455E-01 (TRAVEL)
    + 0.1202813 (VACATION)
    + 0.4273893 (HSIZE)
    + 0.2454380E-01 (AGE)

Classification Using the Model
Group Centroids are values of discriminant function at Group Means
Group Centroids are:
Group   Centroid
1        1.335
2       -1.256
Average of two Group Centroids gives cut-off point
Average of two centroids is 0.0395
Therefore:
Any value of discriminant score D > 0.0395 will classify the object as 'Resort Visit'
Any value of discriminant score D < 0.0395 will classify the object as 'No Resort Visit'
Now, let us assess the validity of D using the validation sample.

Classification Using Model
Value of Discriminant Function for the 1st family in Validation Sample is:
D = -7.975476
    + 0.8476710E-01 (63.6)
    + 0.4964455E-01 (7)
    + 0.1202813 (4)
    + 0.4273893 (7)
    + 0.2454380E-01 (55)
  = +2.5865
Thus respondent belongs to group 1 (Traveled: Correct)

Classification Results for cases not selected for use in the analysis (validation sample)
Actual Group   No. of Cases   Predicted Group 1   Predicted Group 2
Group 1        6              4 (66.7%)           2 (33.3%)
Group 2        6              0 (0.0%)            6 (100.0%)
Percent of grouped cases correctly classified: 83.33%.

End of Multiple Discriminant Analysis
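The validation step of Case Study # 6 can be replayed in a few lines. The sketch below is only an illustration: it takes the unstandardized coefficients, the 0.0395 cut-off and the 12 validation families exactly as quoted in the slides, scores each family and counts correct classifications. The function and variable names are assumptions made for this example, not part of the original SPSS output.

```python
# Unstandardized discriminant function coefficients quoted in the slides.
CONSTANT = -7.975476
COEFFS = (0.08476710, 0.04964455, 0.1202813, 0.4273893, 0.02454380)
CUTOFF = 0.0395  # average of the two group centroids, as stated in the slides

# Validation sample: (actual group, income, travel, vacation, hsize, age)
VALIDATION = [
    (1, 63.6, 7, 4, 7, 55), (1, 50.8, 4, 7, 3, 45), (1, 54.0, 6, 7, 4, 58),
    (1, 45.0, 5, 4, 3, 60), (1, 68.0, 6, 6, 6, 46), (1, 62.1, 5, 6, 3, 56),
    (2, 35.0, 4, 3, 4, 54), (2, 49.6, 5, 3, 5, 39), (2, 39.4, 6, 5, 3, 44),
    (2, 37.0, 2, 6, 5, 51), (2, 54.5, 7, 3, 3, 37), (2, 38.2, 2, 2, 3, 49),
]


def discriminant_score(predictors):
    """D = b0 + b1*X1 + ... + bk*Xk for one family."""
    return CONSTANT + sum(b * x for b, x in zip(COEFFS, predictors))


def classify(score):
    """Group 1 (resort visit) above the cut-off, group 2 otherwise."""
    return 1 if score > CUTOFF else 2


scores = [discriminant_score(row[1:]) for row in VALIDATION]
predicted = [classify(d) for d in scores]
hits = sum(p == row[0] for p, row in zip(predicted, VALIDATION))

for number, (row, d, p) in enumerate(zip(VALIDATION, scores, predicted), start=1):
    print(f"family {number:2d}: actual={row[0]} D={d:+.4f} predicted={p}")
print(f"correctly classified: {hits} of {len(VALIDATION)}")
```

The score for the first family should land close to the +2.5865 shown in the slides; small differences in the last digits are expected because the printed coefficients are rounded.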

Sampling
Dr. Sharad Varde

Research Process
Identify broad area of research
Gather preliminary data
Define research problem
Identify important factors (variables)
Generate hypotheses
Prepare research design
COLLECT DATA, analyse & interpret
Draw conclusions
Write report
Present report for research-based decision making.

Data Collection
Data must be collected from the people, events, or objects that can provide correct answers to the research problem
Process of selecting the right people, events, or objects in right numbers is called SAMPLING

Terminology
Population: Entire group of people, events, or objects of interest in context of research
Element: A single member of the population
Population Frame: List of all elements in the population from which a sample is drawn
Example: List of all students in a college, list of all ent. events in Mumbai in Oct 2010, list of all songs sung by Lata Mangeshkar
Population Parameters: Pop. mean & variance.

Terminology
Sample: A subset of population selected for data collection in the research study
Subject: A single member of the sample
Sampling: Process of selecting sufficient number of elements from the population
Sampling saves time & cost of research
Sampling Parameters: Sample mean (central tendency) & sample variance (dispersion).

Representativeness of Sample
Should enable generalizing conclusions for entire population
Hence, sample should honestly represent the population in respect of characteristics under investigation
Representative sample should ensure sample mean ≈ population mean & sample variance ≈ population variance.

Important Issues in Sampling
1. Sampling Design: Precisely how to draw a sample from the population
2. Sample Size n: How many elements of the population to be selected to form a representative sample
Both depend upon cost & time budget of the study, and on the desired reliability of conclusions (confidence & precision).

Important Issues in Sampling
Larger sample: higher chance of accuracy
But, larger sample: higher cost & time
Hence, one must strike a balance
Sampling design serves the purpose
It enables better precision and higher confidence with smaller sample.

Types of Sampling Designs: Two Basic Types
A. Non-probability Sampling: Elements of population do not have a known or predetermined chance of selection. Used when quick results with low generalizability are needed at meager cost (e.g. exit polls)
B. Probability Sampling: Elements of population do have a known or predetermined chance of selection. This design produces representative samples and wider generalizability.

Their Application Areas
A. Non-probability Sampling: Mostly in exploratory studies
B. Probability Sampling: Mostly in descriptive and causal studies.

Case Studies
Accounts Manager has put in place a new fully computerized accounting system
Before making further improvements, he wants to get accounting staff's reaction to it without making it seem that he has doubts about its utility & practicality
So, he casually talks to the first five people who walk into the office.

Case Studies
A TV journalist wants instant reactions of aam janta (the general public) to the budget proposals just announced in the Lok Sabha
While she wants responses of the man on the street, she knows that any Tom unaware of the budget exercise and exact proposals will not serve her purpose
She moves on to pick up persons who, in her judgment, fit the bill.
Case Studies
GMAC (Graduate Management Admission Council) surveyed 740 university professors across the world who are intensively knowledgeable of GMAT formats over the past years to find out their suggestions for bettering GMAT (Graduate Management Admission Test) before launching the 10th generation of GMAT in 2013.

Case Studies
Galaxy Tours & Travels wants to find out strong & weak points of its competitor Star Travels from customers' view point
Having come across a lady client of Star Travels to talk with, the interviewer asks her after the interview to introduce him to someone else who, to her knowledge, recently used the services of Star Travels
Process goes on till he gets 20 such persons

A. Non-probability Sampling
1. Convenience Sampling: Conveniently available elements are chosen (Example: Audience reaction to film: 1st day 1st show)
2. Purposive Sampling: Specific types of elements who have and can provide the desired information are chosen.

A. Non-probability Sampling
3. Judgment Sampling: Researcher uses her judgment about who would be the best respondents to serve the purpose
4. Snowball or Referral Sampling: A respondent is asked to name someone he knows who too can provide valuable info. It sets a chain process.

B. Probability Sampling
Example: A sample of 100 TVs to be drawn from 10,000 TVs produced in Oct 2010
Each TV has 100 ÷ 10,000 = 0.01 i.e. 1% chance of being chosen
Sampling Design tells researcher precisely how to pick up 100 TVs
There are five major designs of this type.

B.1: Simple Random Sampling
●Two lucky numbers to be drawn out of 100 tokens. Put all 100 tokens in a basket. Stir well. Close eyes and pick up two tokens
●For larger population, assign serial numbers to each element. Use a standard table of random numbers. Select the required number of elements one after other
●But, listing large populations is tedious.
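As a minimal sketch of simple random sampling, the TV example above (100 sets out of 10,000) can be reproduced with Python's standard random module standing in for the table of random numbers. The function name, the serial numbering from 1 and the fixed seed are assumptions made for illustration.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw serial numbers 1..population_size without replacement.

    Every element has the same chance, sample_size / population_size,
    of ending up in the sample.
    """
    rng = random.Random(seed)  # seeded only so the draw can be repeated
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


tv_sample = simple_random_sample(10_000, 100, seed=2010)
selection_chance = 100 / 10_000

print(f"chance of selection per TV: {selection_chance:.2%}")
print("first ten serial numbers drawn:", tv_sample[:10])
```

The serial numbers play the role of the population frame: the draw is only as good as the list it is taken from.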
A Case Study
HR Director of a software firm with 1926 engineers wants to find out desirability of changing the current 10 - 6 working hours to flexitime along with its benefits & drawbacks perceived by the engineers before the next board meeting
She would pick up a few engineers randomly & ask them appropriate questions.

B.2: Systematic Sampling
●A sample of 50 cars to be selected from 10,000 cars produced in 2009
●10,000 ÷ 50 = 200. Select every 200th car
●More precisely, select a random number between 1 and 200, say 30. Select 30th car
●Starting from 30th car, select every 200th car: 30, 230, 430, 630, 830, 1030, 1230, 1430…

A Case Study
Maruti Suzuki Ltd. wants to check response of prospective buyers to the new features introduced in its small car segment
From the dealers' alphabetical list, the Company selects every 50th dealer & sends a senior marketing manager to talk to them.

B.3: Stratified Random Sampling
●If population contains identifiable subgroups of elements, researcher must provide proper representation to each subgroup
●Ex.: Population: All students of a college
●Identifiable Subgroups: males / females; arts / science / commerce; brilliant / average / poor
●Lata M. songs: By language, solo / duet etc.

B.3: Stratified Random Sampling
●Process: Divide the population into mutually exclusive identifiable subgroups (strata)
●Draw a simple random sample (or systematic sample) from each stratum
●Size of sample from each stratum directly proportional to size of the stratum
●Homogeneity within each stratum
●Heterogeneity between strata.
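The two procedures described above, systematic selection (B.2) and proportionate allocation across strata (B.3), reduce to simple arithmetic. The sketch below is illustrative only: the function names are invented for this example, the car figures come from the B.2 slide, and the strata sizes are those of the absenteeism study (2% sample) tabulated elsewhere in these slides.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Pick every k-th element, k = population_size // sample_size,
    beginning at a random start between 1 and k."""
    interval = population_size // sample_size
    if start is None:
        start = random.Random(seed).randint(1, interval)
    return [start + i * interval for i in range(sample_size)]


def proportionate_allocation(strata_sizes, fraction):
    """Sample size per stratum, directly proportional to stratum size."""
    return {name: round(size * fraction) for name, size in strata_sizes.items()}


# B.2: 50 cars out of 10,000, random start 30 as in the slide.
cars = systematic_sample(10_000, 50, start=30)
print(cars[:8])

# B.3: strata of the absenteeism study, sampled at 2%.
strata = {
    "Managers": 250,
    "Junior Managers": 500,
    "Assistants": 2000,
    "Skilled Workers": 4000,
    "Unskilled Labour": 1000,
}
allocation = proportionate_allocation(strata, 0.02)
print(allocation, "total:", sum(allocation.values()))
```

Rounding each stratum separately works here because every stratum size is a multiple of 50; with awkward sizes the rounded pieces may not add up to the intended total and would need a small adjustment.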
Study of Absenteeism (2% sample)
Category (Stratum)   Total Number   Sample Size
Managers                  250             5
Junior Managers           500            10
Assistants               2000            40
Skilled Workers          4000            80
Unskilled Labour         1000            20
5 Strata                 7750           155

B.3: Stratified Random Sampling
●It is Proportionate stratified random sample
●If all strata are of comparable sizes, it is OK
●But, if some are too large or too small, we need to draw a Disproportionate stratified random sample
●Larger than proportionate representation to smaller strata and vice versa.

Study of Motivation (2% sample)
Category (Stratum)   Total Number   Proportionate Sample Size   Disproportionate Sample Size
Sr. Managers              100             2                           7
Middle Mgrs.              300             6                          15
Jr. Managers              500            10                          20
Supervisors              1000            20                          30
Clerks                   5000           100                          60
Secretaries               200             4                          10
6 Strata                 7100           142                         142

B.3: Stratified Random Sampling
●Rule in drawing Disproportionate stratified random sample:
●Observe the spread (variance) in each stratum
●Low variance: relatively more homogeneous stratum needs smaller sample
●High variance: relatively less homogeneous stratum needs bigger sample.

B.3: Stratified Random Sampling
●Stratified random sampling involves dividing population into strata
●Hence, it needs higher time and cost
●But, it provides desired precision with smaller sample than simple random or systematic sample.

A Case Study
A manufacturing company wants to conduct stress management programs for its employees. The consultant wants to get a first hand feel of the stress levels experienced by employees. The consultant classifies them into 4 categories:
1. Workmen constantly handling dangerous chemicals
2. Foremen responsible for quality & productivity
3. Sales personnel under monthly targets
4. All others
The consultant randomly picks up some employees from each category. Since groups 2 & 3 are smaller than group 1 and group 4 is the largest, the consultant picks up 2% of group 4, 5% of group 1, and 10% of groups 2 & 3 and talks with them at length. This is a case of disproportionate stratified random sampling.

B.4: Cluster Sampling
●Used when population consists of several groups of elements in such a manner that:
●Groups are similar to each other and
●Each group (CLUSTER) is heterogeneous
●So, population has inter-group homogeneity and intra-group heterogeneity
●Exactly opposite of stratified population
●Process: Select a few clusters randomly.

B.4: Cluster Sampling Examples
●A truckload of mangoes in 4 dozen boxes. Each box has upper layer of top quality fruits. Quality & size drops layer by layer.
●Thus, homogeneity between boxes & heterogeneity within each box.
●Draw a random or systematic sample of a few boxes, open them and study them.
●No need to open other boxes from the truck.

B.4: Cluster Sampling Examples
●Complex of many identical buildings. We can select 5 out of 50 buildings
●A Mgmt Inst: 2000 students per year. 50 per batch. 40 batches run concurrently. Each has some active, some ordinary & some passive students, and 75% boys, 25% girls. Choose 4 batches and talk to all 200 students without disturbing other 36 batches.

A Case Study
Under a community health program for tribals, it was necessary to discover their current state of nutrition, health & beliefs
Since adivasi padas (tribal hamlets) are located at long distances from each other in tribal areas, a few adivasi padas were selected at random and all residents from infants to old ones were checked.

B.4: Cluster Sampling
●Convenient
●Sample size smaller
●Less time and cost
●But, restrictive in application: You don't frequently get such populations.
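Cluster sampling differs from the earlier designs in that whole groups are drawn at random and everyone inside a chosen group is studied. A minimal sketch, using the management institute example above (40 batches of 50 students, 4 batches chosen); the batch and student labels are made up for illustration.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters; every member of a picked cluster is kept."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}


# Hypothetical frame: 40 concurrent batches with 50 students each.
batches = {
    f"batch_{b:02d}": [f"b{b:02d}_student_{s:02d}" for s in range(1, 51)]
    for b in range(1, 41)
}

selected = cluster_sample(batches, 4, seed=1)
students = [s for members in selected.values() for s in members]

print("batches chosen:", sorted(selected))
print("students to interview:", len(students))
```

Only the list of batches is needed as a frame, which is why the design saves time and cost when clusters really are alike.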
B.5: Double Sampling
●Used when we need some preliminary and some detailed information about population
●Example: Preliminary Info: Investible surplus with bank depositors
●Detailed Info: Perception about different types of investments available to individuals, their advantages, disadvantages, risks & benefits, and depositors' preparedness to invest how much % in which scheme.

B.5: Double Sampling
●Process: First draw a random sample (simple, systematic or stratified) of bank depositors. Collect info on their investible surplus funds
●Then draw a random sample from this sample (sub-sample) for administering a detailed questionnaire to find out subsampled subjects' knowledge & perception of various investment avenues.

A Case Study
GoI (Government of India) wants to know industry opinion about withdrawal of 1 year old stimulus package
Large sample of companies across the sectors is drawn to seek opinion
A smaller sub-sample is selected to probe deeper into industry psyche and to obtain practical suggestions to maintain industrial growth.

Exercise
A conglomerate deals with appliances, machine tools, furniture, storage solutions, office equipment, processed foods, chicken, agri-products, mosquito repellents, edible oils, chemicals, healthcare, cosmetics, detergents, etc.
Its earnings are under competitive pressure.

Exercise
It wants to surge ahead of competitors through following strategies:
1. Developing new products
2. Enhancing advertising effectiveness
3. Tapping creative ideas within the group
4. Improving employee motivation.

Exercise
Determine sampling designs to gather vital information required to work on each of the above 5 strategies
Time is of the essence.
The company wants to make decisions in the next quarterly board meeting
So, all these inputs are needed in 30 days...

Multivariate Analysis
Dr. Sharad Varde

Major Dependence Methods
Dependent Variable   Independent Variables   Method
Categorical          Categorical             Canonical Correlation
Categorical          Metric                  Multiple Discriminant
Metric               Categorical             Analysis of Variance
Metric               Metric                  Multiple Regression

Two Basic Concepts
1. Scatter Plot
2. Correlation

Scatter Plot
Horizontal Axis: Reasoning Scores
Vertical Axis: Creativity Scores

Basic Patterns of Scatter Plot
Both Move Together
Move In Opposite Way
No Relationship

Correlation Coefficient For Cardinal Variables
Data: Actual measurements on both variables
Formula: r = (Mean of Products of Values - Product of the Two Means) / (Product of the Two Standard Deviations)
Name: Pearson's Correlation Coefficient
Statisticians call it Pearson's r.

Correlation Coefficient For Ordinal Variables
Actual Measurements on Both Variables Not Available
Available Data are in the Form of Ranks
Formula: 1 - (6 x ∑ Square of Rank Diff) / (n x (n² - 1))
where n denotes number of observations
Name: Rank Correlation Coefficient.

Simple Regression Model

Regression
₪Dictionary Says: The act of returning or stepping back to a previous stage
₪Do quantitative methods force us to regress instead of progress?
₪Or, is it Back to the Future?
₪Statistics, like any other field, adopts crazy names arising from some important historical events.

Story of Regression
Sir Francis Galton studied the heights of the sons in relation to the heights of their fathers
His Conclusion: Sons of tall fathers were not so tall & sons of short fathers were not so short as their fathers
Path Breaking Finding: Human heights tend to REGRESS back to normalcy
Since then, similar studies on the nature and extent of influence of one or more variables on some other variable acquired the name 'Regression Analysis'.
Regression Curve
Horizontal Axis: Cause Variable: Reasoning Scores
Vertical Axis: Effect Variable: Creativity Scores

Regression Analysis
₪A quantitative method which tries to estimate the value of a Cardinal Variable (Effect) by studying its relationship with other Cardinal Variables (Cause)
₪This relationship is expressed by a custom-designed statistical formula called the Regression Equation.

Purpose of Regression Analysis
1. To establish exact nature of influence of Cause Variable on Effect Variable
2. To determine the quantum of influence.
3. To estimate the value of Effect Variable from value of Cause Variable and assess error
4. To forecast future values of Effect Variable from information about Cause Variable.

Patterns of Regression Curves
Pattern # 1: Upward Sloping Straight Line
Statistical Model: Y = a + bX + ε (b > 0)
Relationship: Increase in X leads to proportionate increase in Y
(Graph: Y against X)

Estimating Regression Parameters a & b
Formulae for regression parameters a & b are worked out by a method that assures Minimum Total Error of Estimation / Forecasting, namely,
∑ ε² = ∑ (Actual values of Y - Estimated values of Y)²
It is Least Square Error method
Divide ∑ ε² by the number of observations to get Mean Square Error (MSE)
Minimum Mean Square Error (MMSE) method.

Least Square Method
Formula for Regression Coefficient b:
b = (Mean of Products of Values - Product of the Two Means) / (Variance of Cause Variable)
Formula for Regression Constant a:
a = Mean of Effect Variable - b x Mean of Cause Variable
Regression coefficient 'b' and regression constant 'a' are jointly called 'Regression Parameters'.
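The least square formulas above translate almost word for word into code. The sketch below computes b, a, Pearson's r (which shares the same numerator) and the Mean Square Error for a small made-up data set; the numbers and all names are assumptions chosen only to show the arithmetic.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def fit_simple_regression(x, y):
    """Return (a, b, r, mse) for the model Y = a + bX + error."""
    mean_x, mean_y = mean(x), mean(y)
    # Mean of products of values minus product of the two means.
    numerator = mean([xi * yi for xi, yi in zip(x, y)]) - mean_x * mean_y
    var_x = mean([xi * xi for xi in x]) - mean_x ** 2  # variance of cause variable
    var_y = mean([yi * yi for yi in y]) - mean_y ** 2
    b = numerator / var_x
    a = mean_y - b * mean_x
    r = numerator / sqrt(var_x * var_y)  # Pearson's r
    errors = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    mse = mean([e * e for e in errors])
    return a, b, r, mse


# Hypothetical observations on a cause variable X and an effect variable Y.
x_values = [1, 2, 3, 4, 5]
y_values = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b, r, mse = fit_simple_regression(x_values, y_values)
print(f"a = {a:.3f}, b = {b:.3f}, r = {r:.4f}, MSE = {mse:.4f}")
```

Variances here divide by n rather than n - 1, which matches the "mean of products minus product of means" wording of the slides; statistical packages may report slightly different standard deviations.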
Concept: Error of Estimation
Note the difference between the actual values of Effect Variable (Salary) and the values estimated by the Regression Model
This is called the Error of Estimation
Less the Error, Better the Model. Ideally 0.
Statistical Model: Y = a + bX + ε
If Correlation is Perfect (+1 or -1), ε = 0.

Mean Square Error
Errors must be small for the model to be a good fit and to guide us into future
MSE must be within the range permitted by the sponsor of research
Errors should be erratic / haphazard.

Other Patterns of Regression Curves
Pattern # 2: Downward Sloping Straight Line
Statistical Model: Y = a - bX + ε (b > 0)
Relationship: Increase in X leads to proportionate decrease in Y.
(Graph: Y against X)

Pattern # 3: Simple Exponential
Relationship: Increase in X leads to faster increase in Y
Statistical Model: Y = e^(a + bX) + error
where, e = 2.71828183 (Euler's number or Napier's constant).
(Graph: Y against X)

Linear Conversion
Statistical Model: Y = e^(a + bX) + ε
Logarithm pulls in the curvature and flattens the curve (Note: Log 1 = 0; Log 10 = 1)
Linear Conversion: Log Y = a + bX + ε
Call Z = Log Y
Now, fit Pattern # 1 to Z and X
Z = α + βX + ε.

Simple Exponential
(Graphs: Y against X; Log Y against X)

Pattern # 4: Upward Curvilinear
Relationship: Increase in X leads to slower increase in Y
Statistical Model: Y = a + b Log X + ε
Now, fit Pattern # 1 to Y and Log X
A Tip: Try double logarithm if single log fails to flatten the curve satisfactorily. In that case,
Y = a + b Log (Log X) + ε.
(Graph: Y against X)

Pattern # 5: Downward Curvilinear
Relationship: Increase in X leads to faster decrease in Y
Statistical Model: 1/Y = e^(a + bX) + error
Linear Conversion: Loge (1/Y) = a + bX + ε
Now, fit Pattern # 1 to Loge (1/Y) and X.
(Graph: Y against X)

Pattern # 6: Negative Exponential
Relationship: Increase in X leads to slower decrease in Y
Statistical Model: Y = a - b Loge X + ε
Now, fit Pattern # 2 to Y and Loge X.
(Graph: Y against X)

Power of Logarithm
Two standard patterns (# 1 & # 2): a straight line
Two standard patterns: Log X converts to a straight line (Patterns # 4 & # 6)
Two standard patterns: Log Y (# 3) or Log 1/Y (# 5) converts to a straight line
Logarithm sucks in the curvature
Double Log can flatten deeper curvature.

Pattern # 7: Logistic or S Curve
Relationship: Increase in X leads initially to faster increase in Y, then to steady increase, and finally to slower increase in Y
Statistical Model: 1/Y = (1/a) + (b/a) e^(cX) + ε
where, e is the base of the natural logarithm (e = 2.71828...).
(Graph: Y against X)

Steps for Fitting Regression Model
A. Collect a set of reliable cardinal observations on Effect variable (Y) and corresponding cardinal values of Cause variable (X)
B. Compute correlation. If high, proceed further.
C. Plot Y vs. X & detect presence of a pattern
D. Identify nature of cause-&-effect relationship
E. Compute quantum of the relationship
F. Conduct error analysis: small, haphazard, MSE
G. If OK, use the model for forecasting.
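The linear conversion used for Pattern # 3 can be sketched as follows: take Z = Log Y, fit the straight-line Pattern # 1 to Z and X, then undo the logarithm to forecast Y. This is an illustrative sketch only: it uses the natural logarithm, synthetic data generated from an exact exponential curve, and helper names invented for this example.

```python
from math import exp, log


def fit_line(x, z):
    """Least square estimates (alpha, beta) for Z = alpha + beta * X."""
    n = len(x)
    mean_x, mean_z = sum(x) / n, sum(z) / n
    numerator = sum(xi * zi for xi, zi in zip(x, z)) / n - mean_x * mean_z
    var_x = sum(xi * xi for xi in x) / n - mean_x ** 2
    beta = numerator / var_x
    alpha = mean_z - beta * mean_x
    return alpha, beta


def fit_exponential(x, y):
    """Fit Y = e^(a + bX) by regressing Z = Log Y on X (all Y must be positive)."""
    z = [log(yi) for yi in y]
    return fit_line(x, z)


def forecast(alpha, beta, x_new):
    """Convert the straight-line estimate of Z back to the scale of Y."""
    return exp(alpha + beta * x_new)


# Synthetic data lying exactly on Y = e^(0.5 + 0.3 X).
x_values = [1, 2, 3, 4, 5, 6]
y_values = [exp(0.5 + 0.3 * xi) for xi in x_values]

alpha, beta = fit_exponential(x_values, y_values)
print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
print(f"forecast at X = 7: {forecast(alpha, beta, 7):.3f}")
```

The same two helpers cover Patterns # 4 and # 6 if the logarithm is applied to X instead of Y before calling fit_line.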
Your Role
Understand the situation in totality
Detect a logical cause-and-effect relationship
Identify relevant cardinal variables X and Y
Obtain reliable data on X and Y
Compute Pearson's correlation coefficient
If it is high (+ or -), draw a scatter plot, join the points by a free hand, and identify the pattern
Compute regression parameters a & b for the pattern and fit regression model using SPSS.

A Word of Caution
Undertake regression analysis only for cardinal variables (effect and cause)
Select the variables only if you logically suspect influence of one over the other
Carry out regression analysis only after completing correlation analysis AND only if the selected cause and effect variables are in fact highly correlated
If not, choose a better cause variable.

Multiple Regression Model

Multiple Regression Analysis
A technique to analyze the joint effect of many cause variables on effect variable
Simple Regression: One cause variable influences the effect variable
Some real life phenomena are amenable to simple two-variable regression analysis BUT, NOT ALL.
Multiple Regression: Several cause variables jointly influence effect variable
Also called Multivariate Regression.

Multiple Regression Model
Multiple Regression Model of pattern # 1:
Y = a + b1X1 + b2X2 + ... + bnXn + ε
Caution: Cause Variables X1, X2, ..., Xn SHOULD NOT BE Inter-Correlated
Otherwise, your model will suffer from a disease called Multi-Collinearity.

Multiple Regression Analysis
Multiple Regression Model: Y = a + b1X1 + b2X2 + ... + bnXn + ε
It is not necessary that all independent variables (X's) influence the dependent variable (Y) in above simple fashion
A generalized process detects complex relationships.

Steps in Multiple Regression Analysis
1. Understand the situation in totality
2. Detect the effect variable Y that is crucial for decision making / planning and all possible cause variables X1, X2, ..., Xn
3. Obtain reliable data on all variables
4. Compute Pearson's correlation coefficient (SPSS / SAS) between Y and each of the n cause variables
5. Drop those X's which exhibit poor correlation with Y.
6. For the balance X's, compute correlation coefficients between each X and other X's to check orthogonality (lack of multi-collinearity)
7. If a pair of X's shows high correlation, drop the one that bears weaker correlation with Y
8. Now you are left with Y and a smaller number of X's which:
- individually bear strong correlation with Y,
- but poor correlation among themselves.
9. Proceed Step by Step. Start with the Cause variable that shows highest correlation with Y
10. Draw its scatter plot with Y, & identify pattern
11. If it does not resemble a straight line, use logarithms to flatten the curve
12. Fit two-variable regression model: Y = a + b f(X) + ε
13. If errors are haphazard, small, & MSE is within the set limit, stop
14. If not, select the cause variable that shows next highest correlation with Y. Repeat process.
15. Fit three-variable regression model: Y = a + b1 f1(X1) + b2 f2(X2) + ε
16. If errors are haphazard, small, & MSE is within the set limit, stop
17. If not, select the cause variable that shows the third highest correlation with Y. Repeat the process till you reach acceptable errors.
18. You will finally get multiple regression model: Y = a + b1 f1(X1) + b2 f2(X2) + ... + bn fn(Xn) + ε

End of Multivariate Analysis
THANK YOU
Dr. Sharad Varde
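Steps 4 to 8 of the procedure (screening cause variables before any model is fitted) can be sketched as below. This is one possible reading, not the author's implementation: the slides speak only of "poor" and "high" correlation, so the 0.5 and 0.8 thresholds are assumptions, the data are invented, and working from the strongest X downward is a simple stand-in for the pairwise rule in step 7.

```python
from math import sqrt


def pearson_r(u, v):
    """Pearson's r: (mean of products - product of means) / product of std devs."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    numerator = sum(a * b for a, b in zip(u, v)) / n - mean_u * mean_v
    sd_u = sqrt(sum(a * a for a in u) / n - mean_u ** 2)
    sd_v = sqrt(sum(b * b for b in v) / n - mean_v ** 2)
    return numerator / (sd_u * sd_v)


def screen_predictors(y, candidates, min_r_with_y=0.5, max_r_between=0.8):
    """Keep X's strongly correlated with Y but weakly correlated with each other."""
    # Steps 4-5: correlation of each X with Y; drop the poorly correlated ones.
    strength = {name: abs(pearson_r(x, y)) for name, x in candidates.items()}
    ranked = [name for name in sorted(strength, key=strength.get, reverse=True)
              if strength[name] >= min_r_with_y]
    # Steps 6-7: walk from strongest to weakest and skip any X that is highly
    # correlated with an X already retained (the weaker of the pair is dropped).
    selected = []
    for name in ranked:
        if all(abs(pearson_r(candidates[name], candidates[kept])) < max_r_between
               for kept in selected):
            selected.append(name)
    return selected


# Hypothetical effect variable and four candidate cause variables.
y_values = [1, 2, 3, 4, 5, 6]
candidates = {
    "x1": [2, 4, 6, 8, 10, 12],   # very strong link with Y
    "x2": [1, 2, 3, 4, 5, 7],     # strong with Y, but nearly a copy of x1
    "x3": [3, 1, 4, 1, 5, 2],     # poor link with Y
    "x4": [2, 1, 4, 3, 6, 3],     # moderate link with Y, distinct from x1
}

print(screen_predictors(y_values, candidates))
```

The surviving variables would then enter the step-by-step fitting of steps 9 to 18, starting with the one most strongly correlated with Y.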
