Sample Final

Sample Final Exam (SMMD – Term I)
Part A: Each question in this part is worth 1 point.
1. Suppose you are interested in examining the determinants of earnings. You have
information on the age of the individual as well as their level of education: high school
graduate, college graduate or graduate degree. Let Y = earnings, X1 = age, X2 = 1 if the
person has studied up to high school or less and 0 otherwise, X3 = 1 if the person has
studied beyond high school but earned less than a master's degree and 0 otherwise,
X4 = 1 if the person has earned a master's or higher degree and 0 otherwise. Which of
the following model specifications cannot be estimated?
A. Y = b0 + b1X1 + b2X2 + b3X3
B. Y = b0 + b1X1 + b2X2 + b3X3 + b4X4
C. Y = b0 + b1X1
D. None of the above.
2. A manager of a newly opened coffee shop is struggling to manage the waiting time of
customers. Based on preliminary data he has collected, he hypothesizes that there is a
linear relationship between the average waiting time experienced by customers to
receive their drink and number of customers sitting in the coffee shop. He does not have
access to any statistical software but only has the following descriptive statistics.
Variable               Mean       Covariance matrix
                                  Waiting time    Number of customers
Waiting time           23 min     136 min²        86 cust·min
Number of customers    12 cust    86 cust·min     75 cust²
The average waiting time for a customer who walks into an empty coffee shop is
(ignore set-up time)
A. 8.5 min
B. 9.24 min
C. 12.07 min
D. Not enough information
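The arithmetic behind Question 2 can be sketched as follows, assuming the manager's hypothesised simple linear model and using only the summary statistics in the table. The least-squares slope is the covariance divided by the variance of the predictor, and the intercept is the fitted waiting time when the number of customers is zero. The variable names are illustrative only.

```python
# Summary statistics copied from the table in Question 2.
mean_wait = 23.0            # minutes
mean_customers = 12.0       # customers
cov_wait_customers = 86.0   # cust·min
var_customers = 75.0        # cust²

# Least-squares slope and intercept of waiting time regressed on customers.
slope = cov_wait_customers / var_customers
intercept = mean_wait - slope * mean_customers

print(f"slope     = {slope:.4f} min per customer")
print(f"intercept = {intercept:.2f} min")
```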
Questions 3 through 6 are based on the following situation:
A researcher intended to study the relationship between the number of major natural
calamities such as tornadoes, hurricanes, earthquakes, floods that occurred during a year (X)
and the average profit (in millions of dollars) of all insurance companies in the country in
that year (Y). She took a random sample of 10 years in which number of calamities per year
varied from 10 through 23 and found that the estimated least squares regression line is
ŷ = 212.6 − 1.9x.
3. The number 212.6 in the above regression can be interpreted most reasonably as
A. The part of the average profit of the insurance companies that is not associated
with the number of natural calamities
B. Change in profit of insurance companies associated with an additional natural
calamity
C. The average profits for all the insurance companies in the country in a year with
no major natural calamities
D. None of the above
4. For the above regression equation, correlation between X and Y will be
A. -1.9
B. Negative but cannot determine the magnitude.
C. +1.9
D. Positive but cannot determine the magnitude.
5. A randomly selected year had 24 major calamities, and the actual average profits in that
year were $200 million. The residual associated with this year is
A. $200 million
B. $167 million
C. $33 million
D. - $33 million
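A minimal sketch of the residual calculation in Question 5, assuming the usual definition residual = actual − fitted and the regression line given above. The function name is hypothetical.

```python
def predicted_profit(calamities: float) -> float:
    """Fitted line from the question: y-hat = 212.6 - 1.9 x (millions of dollars)."""
    return 212.6 - 1.9 * calamities

actual = 200.0                  # observed average profit, $ million
fitted = predicted_profit(24)   # fitted value for a year with 24 calamities
residual = actual - fitted

print(f"fitted   = {fitted:.1f}")
print(f"residual = {residual:.1f}")
```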
6. The reason for the residual in the previous question is:
A. Sampling variability – the coefficients were estimated from a random sample
B. Insurance company profits are determined by things other than number of
natural calamities
C. Both of the above
7. While comparing two regression models for the same response variable from the same
dataset, it was found that the R² of model A is 0.80 while that of model B is 0.512. Which of
the following is true about the ratio of the RMSE (se) for the two models (Model
A/Model B)?
A. Its exact value cannot be determined based on the information given
B. It will be greater than 1
C. It is equal to 0.64
D. Both A and B
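A sketch of how the RMSE ratio in Question 7 relates to the two R² values, assuming both models are fitted to the same n observations (so they share the same total sum of squares) and that se is computed as sqrt(SSE / (n − k − 1)), with k the number of predictors. The sample size and predictor counts passed in below are hypothetical, since the question does not state them.

```python
import math

def rmse_ratio(r2_a: float, r2_b: float, n: int, k_a: int, k_b: int) -> float:
    """Ratio se(A) / se(B) for two models with the same response and the same n."""
    # SSE = (1 - R^2) * SST, and SST cancels because the response is shared.
    sse_ratio = (1.0 - r2_a) / (1.0 - r2_b)
    # Error degrees of freedom differ when the models have different k.
    df_ratio = (n - k_b - 1) / (n - k_a - 1)
    return math.sqrt(sse_ratio * df_ratio)

# Hypothetical settings: same number of predictors vs. different numbers.
print(rmse_ratio(0.80, 0.512, n=50, k_a=2, k_b=2))
print(rmse_ratio(0.80, 0.512, n=50, k_a=5, k_b=1))
```

The two calls differ only in the assumed number of predictors, which shows how the error degrees of freedom enter the ratio.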
Questions 8 through 10 are based on the following regression output which was obtained
from a study which linked Age and Smoking (1 = Smoker, 0 = Non-smoker) to the risk of
heart disease:
Analysis of Variance, ANOVA
              Degrees of     Sum of         Mean Square,
Source        Freedom, df    Squares, SS    MS              F-Ratio    p-Value
Regression    2              2633.388       1316.694        14.371     0.00022
Error         17             1557.562       91.621
Total         19             4190.950

Regression Equation Results
Dependent Variable, Y: RISK
RISK = -28.086 + 0.689 AGE + 14.396 SMOKER

Indep. X                     Standard                             95% Conf.   95% Conf.
Variables     Coefficient    Error       t Statistic   p-Value    Lower       Upper       VIF
Intercept     -28.086        16.707      -1.6811       0.11103    -63.334     7.163
AGE           0.689          0.25        2.7501        0.01367    0.16        1.217       1.203
SMOKER        14.396         4.695                                4.49        24.302      1.203

R-squared
Multiple R
Adj. R-squared                58.46%
Standard Error of Estimate    9.572
Durbin-Watson                 1.684
Number of Observations        20
8. The missing value of R² in the regression should be:
A. 62.84%
B. 72.97%
C. 59.46%
D. 58.46%
9. The missing value of the t-statistic for SMOKER should be:
A. 3.066
B. 5.863
C. 2.853
D. 2.750
10. Assuming that the OLS assumptions hold, an unbiased estimate for the standard deviation
of the error term (σε) is:
A. – 28.086
B. 91.621
C. 9.572
D. 16.707
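The quantities asked for in Questions 8 through 10 can be recovered from the printed output using standard identities: R² = SS(Regression) / SS(Total), t = coefficient / standard error, and the standard error of estimate = sqrt(MS Error). The sketch below only re-does that arithmetic with the numbers from the ANOVA and coefficient tables; the variable names are illustrative.

```python
import math

# Values copied from the regression output.
ss_regression = 2633.388
ss_total = 4190.950
ms_error = 91.621
coef_smoker = 14.396
se_smoker = 4.695

r_squared = ss_regression / ss_total          # share of variation explained
t_smoker = coef_smoker / se_smoker            # t statistic for SMOKER
std_error_of_estimate = math.sqrt(ms_error)   # estimate of the error s.d.

print(f"R-squared                  = {r_squared:.4f}")
print(f"t statistic (SMOKER)       = {t_smoker:.3f}")
print(f"standard error of estimate = {std_error_of_estimate:.3f}")
```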
11. Which of the following statements are TRUE in reference to a simple linear regression
line of y on x?
I. The regression line will always pass through at least one of the sample
points (xi, yi)
II. The regression line will always pass through the point (x̄, ȳ), where x̄
and ȳ are respectively the sample means of x and y
III. The point (x̄, ȳ), where x̄ and ȳ are as above, is always included as one
of the points in the sample data set that OLS uses to estimate the
parameters of the model
A. I., II. and III.
B. II. only
C. I. and II. only
D. None of the statements is TRUE
12. Consider a multiple regression model with two predictors. If the overall F-ratio is
significant, i.e., if the p-value associated with the F-ratio is less than the α-value, then we can
conclude that
A. b0, b1 and b2 are different from zero
B. b1 and b2 are different from zero
C. b1 or b2 is different from zero, but not both
D. Either b1 or b2 or both are different from zero
13. You did a multiple linear regression with a set of predictor variables and found that the
overall regression was significant, while the individual predictors were all insignificant.
The most likely explanation for this is that:
A. The response has no linear relationship with any of the predictors
B. The predictors are each cancelling out the other predictors’ effects
C. There is multicollinearity among the predictor variables
D. We’ll need to look at the value of R² before trying to explain this
14. The correlation between two variables in a sample equals zero. This implies that:
A. The two variables must be independent in the population
B. A regression with one of these variables as a response, and the other one as a
predictor will be significant
C. The adjusted R² for the regression in alternative B will be negative
15. Consider the plot of residuals versus predictor below for a simple linear regression.
Which of the following statements is true for this regression?
[Figure: Standardised Residuals v Food. Standardised residuals (vertical axis, 0 to -4) plotted against Food (horizontal axis, 0 to 30)]
A. This regression is insignificant
B. The prediction intervals based on this regression will be incorrect
C. RMSE is more than 3
D. The errors are likely to be autocorrelated
16. In a regression model involving 34 observations, the following estimated regression
model was obtained: ŷ = 48 + 2.5x1 + 1.2x2 − 0.7x3. For this model, the following
statistics were given: SST = 960 and SSE = 270. Then, the value of the F statistic for
testing the validity of this model is:
A. 25.56
B. 7.94
C. 28.24
D. 22.26
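A sketch of the overall F statistic for Question 16, assuming the standard formula F = (SSR / k) / (SSE / (n − k − 1)), with SSR = SST − SSE, k = 3 predictors and n = 34 observations as stated in the question. The variable names are illustrative.

```python
n = 34        # observations
k = 3         # predictors x1, x2, x3
sst = 960.0   # total sum of squares
sse = 270.0   # error sum of squares

ssr = sst - sse                 # regression sum of squares
ms_regression = ssr / k
ms_error = sse / (n - k - 1)
f_statistic = ms_regression / ms_error

print(f"SSR = {ssr:.0f}, MSR = {ms_regression:.2f}, MSE = {ms_error:.2f}")
print(f"F   = {f_statistic:.2f}")
```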
17. (IGNORE)
A total of 82 games were played in the 2006-2007 season of the National
Hockey League by every team. A team could win a maximum of two points per
match, or a total of 164 points. For each of the 30 teams, data on the number of
goals scored per game (Goals/G) and the percentage of the 164 possible
points they won (Win%) during the season were collected. The least squares
fit between the two had the following equation:
Win% = 0.932 + 19.022 (Goals/G)
with RMSE = 2.2 and R² = 0.40. If the assumptions of the Simple Regression Model are satisfied,
what is the probability that a team scoring 2.5 goals per game will have a Win% of 54.2
or more?
A. Less than 0.5%
B. Between 0.5% and 1%
C. Between 1% and 2.5%
D. Between 2.5% and 5%
18. The following estimated regression equation compares total compensation among top
executives in a large set of US public corporations in the 1990s. The variables in the
data set are:
Earnings: Total compensation (in $ '000s)
Female: Dummy variable – Equals 1 for females and 0 for males
MarketValue: A measure of firm size (in $ millions)
Return: Stock return (a measure of firm performance in percentage points)
The estimated regression equation is (all predictors were significant):
ln(Earnings) = 3.86 − 0.28 Female + 0.37 ln(MarketValue) + 0.004 Return
We can conclude from the above regression equation that, controlling for return:
A. Females in larger companies suffer a smaller salary discount than females in smaller
companies
B. Females in smaller companies suffer a smaller salary discount than females in larger
companies
C. A 1% increase in size (as measured by MarketValue) decreases the salary
discount of female executives, on average, by 0.37%
Questions 19 – 21 are based on the following problem.
A professor decides to run an experiment to measure the effect of time pressure on final
exam scores. He gives each of the 285 students in his course the same final exam, but some
students have 90 minutes to complete the exam while others have 120 minutes to complete
it. Each student is randomly assigned one of the examination times based on the flip of a
coin. Consider a regression model of the form: Score = b0 + b1X + e.
19. The professor is considering two different choices for X. The first choice would be to
treat X as the time given for the exam (in the sample data set, it would only take
values of 90 and 120 minutes). The second choice would be to make X into a dummy
variable, and code it as 0 for students who are given 90 minutes, and code it as 1
otherwise. Unable to choose between these two alternatives, the professor decided to
include both variables as X1 and X2 respectively in a multiple regression of Score on
these two predictors. Which of the following statements is true:
A. The estimated coefficients of both predictors will be identical
B. The estimated coefficient of the time variable will be 30 times the coefficient of
the dummy variable
C. The estimated coefficient of the dummy variable will be 30 times the coefficient of
the time variable
20. After more deliberation, the professor decided to go with the time variable instead of
the dummy variable. The estimated simple linear equation was E(Score) = 60 + 0.5X.
Based on this, and the accompanying regression output, the professor estimated with
reasonable confidence that the extra 30 minutes resulted in an expected increase in
score of somewhere between 10 and 20 points. The uncertainty in this estimate is
primarily driven by:
A. Sampling variation
B. The difficulty of the exam
C. The choice of 90 and 120 minutes as the test times
21. Which of the following might be the most likely driver of the intercept term 60 in the
above equation in question 20?
A. The random allocation of students to the two groups
B. The variation in intelligence levels among students
C. The variation in susceptibility to time pressure among students
D. The degree of difficulty of the exam
22. Consider a simple regression where the estimated coefficient of the x variable is
greater than 1. This necessarily implies that:
A. The variance of y-variable is higher than the variance of x-variable
B. The variance of y-variable is lower than the variance of x-variable
C. The variance of y-variable is equal to the variance of x-variable
D. We cannot conclude any of the above without further information
23. In the previous problem, suppose the coefficient of the x-variable is lower than 1.
This necessarily implies that
A. The variance of y-variable is higher than the variance of x-variable
B. The variance of y-variable is lower than the variance of x-variable
C. The variance of y-variable is equal to the variance of x-variable
D. We cannot conclude any of the above without further information
24. Let R²_YX be the R² value associated with the regression of Y (response) on X
(predictor). Let R²_XY be the R² value associated with the regression of X (response) on
Y (predictor). Which of the following is true?
A. R²_XY = R²_YX
B. R²_XY > R²_YX
C. R²_XY < R²_YX
D. None of the above have to be true
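A small numerical check of the two R² values in Question 24, sketched in plain Python. The five data points are made up purely for illustration. The helper fits a simple least-squares line and returns 1 − SSE/SST, so that the two directions of regression can be compared on the same data.

```python
def r_squared(response, predictor):
    """R^2 of a simple least-squares regression of `response` on `predictor`."""
    n = len(response)
    mean_r = sum(response) / n
    mean_p = sum(predictor) / n
    sxy = sum((p - mean_p) * (r - mean_r) for p, r in zip(predictor, response))
    sxx = sum((p - mean_p) ** 2 for p in predictor)
    slope = sxy / sxx
    intercept = mean_r - slope * mean_p
    sse = sum((r - (intercept + slope * p)) ** 2 for p, r in zip(predictor, response))
    sst = sum((r - mean_r) ** 2 for r in response)
    return 1.0 - sse / sst

# Hypothetical sample, not taken from the exam.
x = [1.0, 2.0, 4.0, 5.0, 7.0]
y = [2.1, 2.9, 5.2, 5.8, 9.1]

print("R^2 of Y on X:", r_squared(y, x))
print("R^2 of X on Y:", r_squared(x, y))
```

In simple regression both values equal the squared sample correlation between X and Y, which is why the two printed numbers coincide up to rounding.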
25. (IGNORE)
An IT project manager uses a random sample of the projects he has managed in the
past 4 years to estimate a linear relationship between delay in project completion (in
days) and the size of the code (’00s of lines): Estimated delay = 8.30 + 0.125*(code size).
Which of the following can we conclude without any further information on the data
or the rest of the regression output?
A. Minimum delay in project completion is 8.30 days
B. Larger codes result in larger delays
C. There is a positive relationship between size of the code and project delay in the
company
D. None of the above.
Part B: Each question in this part is worth 2 points.
26. A multiple regression model with a person’s weight as response and the person’s
height and the average number of calories consumed per day as predictors was
found to have both slopes positive and significant. Assume further that the height is
positively correlated with calorie consumption. If we consider a simple linear
regression model of weight (response) on height (predictor), the estimated slope
from this regression will be:
A. An unbiased estimate of the true population parameter
B. The estimate will be biased upwards
C. The estimate will be biased downwards
D. Cannot conclude any of the above without further information
27. A famous auction house in London is initiating a data analytics approach to
understand factors associated with the value of antique clocks. A regression of a
random sample of 32 clocks sold in the last 10 years with Price (‘00$) gave the
following output.
The general manager of the auction house claims that this is evidence against the
industry maxim that each additional year of a clock’s age is associated with an average
increase of $1500 in the value of the clock. Do you agree?
A. Yes, at 5% significance level
B. Yes, at 1% significance level
C. Yes, at 0.1% significance level
D. Cannot be determined from the output
Questions 28 – 29 are related to the following description and the JMP reports that follow.
The dataset comes from a set of 420 school districts in California. The response variable is
the test scores of 5th grade students in these districts and is calculated as the average of math
and reading scores for students in each district. The Superintendent of education is
considering whether to decrease class sizes (decrease the student to teacher ratio, labeled
STR hereafter), and wondering whether this would improve student performance on test
scores. Of course, the flip side is that decreasing STR would increase costs and take up much
of the scarce financial resources these school districts have. The regression output of a
simple linear regression of test scores on class sizes (STR) is shown below.
[Figure: scatterplot of TestScr (vertical axis, 600 to 710) against STR (horizontal axis, 13 to 26)]
TestScr = 698.93295 - 2.2798083*STR

Summary of Fit
RSquare                       0.05124
RSquare Adj                   0.04897
Root Mean Square Error        18.58097
Mean of Response              654.1565
Observations (or Sum Wgts)    420

Analysis of Variance
Source      DF     Sum of Squares    Mean Square    F Ratio    Prob > F
Model       1      7794.11           7794.11        22.5751    <.0001*
Error       418    144315.48         345.25
C. Total    419    152109.59

Parameter Estimates
Term         Estimate     Std Error    t Ratio    Prob>|t|
Intercept    698.93295    9.467491     73.82      <.0001*
STR          -2.279808    0.479826     -4.75      <.0001*
28. Based on the regression output, we can say that
A. The regression is not statistically significant because the R² is only 5%
B. The regression is not statistically significant because the RMSE is much bigger
than the absolute value of slope
C. STR isn’t a significant predictor of test scores because the intercept term
dominates the slope term
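As a cross-check on how the figures in the JMP report relate to one another, the sketch below recomputes R², the F ratio and the t ratio for STR from the printed sums of squares and estimates. Statistical significance is read from the F and t ratios and their p-values, which in a simple regression satisfy F = t²; the size of R², the RMSE and the intercept play no role in that test. The variable names are illustrative.

```python
# Values copied from the JMP output.
ss_model = 7794.11
ss_total = 152109.59
ms_error = 345.25
slope_str = -2.279808
se_str = 0.479826

r_squared = ss_model / ss_total   # RSquare
f_ratio = ss_model / ms_error     # model MS equals model SS because DF = 1
t_ratio = slope_str / se_str      # t Ratio for STR

print(f"R-squared = {r_squared:.5f}")
print(f"F ratio   = {f_ratio:.4f}")
print(f"t ratio   = {t_ratio:.2f}  (t squared = {t_ratio ** 2:.4f})")
```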
29. A second predictor, the percentage of students whose native language is not English
(PctEL), is added to the regression. While admitting that this leads to an impressive
increase in R², the school superintendent decides to take a look at the scatterplot of
Test scores on PctEL. The scatterplot, reproduced below, MOST LIKELY indicates
that:
[Figure: Bivariate Fit of TestScr By PctEL. Scatterplot of TestScr (vertical axis, 600 to 710) against PctEL (horizontal axis, 0 to 90)]
A. There are a number of influential observations
B. The R² number must have been misread because test scores seem to decrease
with PctEL
C. The errors in the multiple regression will be heteroskedastic
D. The errors are correlated.
30. Suppose that you fit a simple regression line between response variable Y and
predictor variable X. Further, suppose that you fit a second regression line between
response variable Y and the predicted values of the response variable Ŷ. Which of the
following statements will be true about this second fitted line?
a. Its slope will be the same as the slope between Y and X
b. Its slope will be the reciprocal of the slope between Y and X
c. Its slope will be 1
d. Its slope will be zero
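The setup in Question 30 can be tried numerically. The sketch below, in plain Python with made-up data, fits Y on X, computes the fitted values Ŷ, and then fits Y on Ŷ so that the second slope and intercept can be inspected. The helper name and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical sample, not taken from the exam.
x = [1.0, 2.0, 4.0, 5.0, 7.0]
y = [2.1, 2.9, 5.2, 5.8, 9.1]

b0, b1 = fit_line(x, y)               # first regression: Y on X
y_hat = [b0 + b1 * xi for xi in x]    # fitted values from the first regression
c0, c1 = fit_line(y_hat, y)           # second regression: Y on Y-hat

print(f"Y on X:     intercept = {b0:.4f}, slope = {b1:.4f}")
print(f"Y on Y-hat: intercept = {c0:.4f}, slope = {c1:.4f}")
```

Because Ŷ is a linear function of X, the covariance of Y with Ŷ equals the variance of Ŷ, which fixes the second slope regardless of the data used.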