Sample Final
1. Suppose you are interested in examining the determinants of earnings. You have
information on the age of each individual as well as their level of education: high school
graduate, college graduate or graduate degree. Let Y = earnings, X1 = age, X2 = 1 if the
person has a high school education or less and 0 otherwise, X3 = 1 if the person has
studied beyond high school but earned less than a master's degree and 0 otherwise, and
X4 = 1 if the person has earned a master's or higher degree and 0 otherwise. Which of
the following model specifications cannot be estimated?
2. The manager of a newly opened coffee shop is struggling to manage customers' waiting
time. Based on preliminary data he has collected, he hypothesizes that there is a
linear relationship between the average time customers wait to receive their drink and
the number of customers sitting in the coffee shop. He does not have access to any
statistical software and has only the following descriptive statistics.
The average waiting time for a customer who walks into an empty coffee shop is
(ignore set-up time)
A. 8.5 min
B. 9.24 min
C. 12.07 min
D. Not enough information
A researcher intended to study the relationship between the number of major natural
calamities (such as tornadoes, hurricanes, earthquakes and floods) that occurred during a
year (X) and the average profit (in millions of dollars) of all insurance companies in the
country in that year (Y). She took a random sample of 10 years in which the number of
calamities per year varied from 10 through 23 and found that the estimated least squares
regression line is ŷ = 212.6 − 1.9x.
3. The number 212.6 in the above regression can be interpreted most reasonably as
A. The part of the average profit of the insurance companies that is not associated
with the number of natural calamities
B. Change in profit of insurance companies associated with an additional natural
calamity
C. The average profits for all the insurance companies in the country in a year with
no major natural calamities
D. None of the above
A. -1.9
B. Negative but cannot determine the magnitude.
C. +1.9
D. Positive but cannot determine the magnitude.
5. A randomly selected year had 24 major calamities, and the actual average profits in that
year were $200 million. The residual associated with this year is
A. $200 million
B. $167 million
C. $33 million
D. - $33 million
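As a minimal sketch of the arithmetic behind question 5, a residual is the actual value minus the value predicted by the fitted line ŷ = 212.6 − 1.9x. The function and variable names below are illustrative, not part of the exam.

```python
# Fitted line from the calamities example: y_hat = 212.6 - 1.9 * x
INTERCEPT = 212.6   # estimated intercept (millions of dollars)
SLOPE = -1.9        # estimated change in average profit per additional calamity


def predict(x):
    """Predicted average profit (millions of dollars) for x calamities."""
    return INTERCEPT + SLOPE * x


def residual(x, y_actual):
    """Residual = actual value minus predicted value."""
    return y_actual - predict(x)


x_year = 24        # number of calamities in the selected year
y_year = 200.0     # actual average profit in that year (millions of dollars)

y_hat = predict(x_year)
res = residual(x_year, y_year)
print(f"predicted = {y_hat:.1f}, residual = {res:.1f}")
```

Note that x = 24 lies just outside the sampled range of 10 through 23, so the prediction is a slight extrapolation.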
7. While comparing two regression models for the same response variable from the same
dataset, it was found that the R² of model A is 0.80 while that of model B is 0.512. Which
of the following is true about the ratio of the RMSE (s_e) for the two models (Model
A/Model B)?
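One way to reason about question 7 is sketched below. It assumes both models have the same number of predictors, which the question does not state; under that assumption the two models share the same total sum of squares and the same error degrees of freedom, so the RMSE ratio reduces to sqrt((1 − R²_A) / (1 − R²_B)). With different numbers of predictors the degrees of freedom would not cancel and the ratio could not be pinned down this way.

```python
import math

r2_a = 0.80    # R-squared of model A (from the question)
r2_b = 0.512   # R-squared of model B (from the question)

# ASSUMPTION: same number of predictors in both models, so
# RMSE^2 = SSE / (n - k - 1) with identical n, k and SST, and
# SSE = (1 - R^2) * SST.  SST and the degrees of freedom then cancel.
rmse_ratio = math.sqrt((1 - r2_a) / (1 - r2_b))

print(f"RMSE(A) / RMSE(B) = {rmse_ratio:.3f}")
```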
Questions 8 through 10 are based on the following regression output which was obtained
from a study which linked Age and Smoking (1 = Smoker, 0 = Non-smoker) to the risk of
heart disease:
Analysis of Variance (ANOVA)
             Degrees of Freedom (df)   Sum of Squares (SS)   Mean Square (MS)   F-Ratio   p-Value
Regression    2                        2633.388              1316.694           14.371    0.00022
Error        17                        1557.562                91.621
Total        19                        4190.950
R-squared
Multiple R
Adj. R-squared 58.46%
Standard Error of Estimate 9.572
Durbin-Watson 1.684
Number of Observations 20
A. 62.84%
B. 72.97%
C. 59.46%
D. 58.46%
A. 3.066
B. 5.863
C. 2.853
D. 2.750
10. Assuming that the OLS assumptions hold, an unbiased estimate for the standard deviation
of the error term (σε) is:
A. – 28.086
B. 91.621
C. 9.572
D. 16.707
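The summary measures in the output above can be recomputed from the ANOVA sums of squares. The following is a minimal sketch using the standard textbook formulas; the variable names are illustrative, and k = 2 is the number of predictors (Age and Smoking).

```python
import math

n = 20                     # number of observations
k = 2                      # number of predictors (Age, Smoking)
ss_regression = 2633.388
ss_error = 1557.562
ss_total = 4190.950

df_error = n - k - 1       # 17
df_total = n - 1           # 19

ms_regression = ss_regression / k
ms_error = ss_error / df_error

f_ratio = ms_regression / ms_error
r_squared = ss_regression / ss_total
# Adjusted R-squared compares the error and total mean squares
adj_r_squared = 1 - ms_error / (ss_total / df_total)
# Standard error of estimate: square root of the error mean square
std_error = math.sqrt(ms_error)

print(f"F-ratio            = {f_ratio:.3f}")
print(f"R-squared          = {r_squared:.2%}")
print(f"Adjusted R-squared = {adj_r_squared:.2%}")
print(f"Standard error     = {std_error:.3f}")
```

The recomputed F-ratio, adjusted R-squared and standard error agree with the values printed in the output (14.371, 58.46% and 9.572).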
11. Which of the following statements are TRUE in reference to a simple linear regression
line of y on x?
I. The regression line will always pass through at least one of the sample
points (x_i, y_i)
II. The regression line will always pass through the point (x̄, ȳ), where x̄
and ȳ are respectively the sample means of x and y
III. The point (x̄, ȳ), where x̄ and ȳ are as above, is always included as one
of the points in the sample data set that OLS uses to estimate the
parameters of the model
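The three statements can be checked numerically on a small example. The sketch below uses made-up data (not from the exam) and a hand-written least squares fit; it evaluates the fitted line at the sample means, and checks whether any sample point lies on the line and whether the point of means is itself in the sample.

```python
# Made-up illustrative data, not taken from the exam
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Ordinary least squares estimates for a simple linear regression
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

fitted_at_mean = intercept + slope * x_bar
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Does the line pass through the point of means?
passes_through_means = abs(fitted_at_mean - y_bar) < 1e-9
# Does any sample point lie exactly on the line?
any_point_on_line = any(abs(r) < 1e-9 for r in residuals)
# Is the point of means one of the sample points?
means_in_sample = (x_bar, y_bar) in list(zip(x, y))

print(passes_through_means, any_point_on_line, means_in_sample)
```

A single data set can only provide a counterexample to an "always" claim; the fact that the line passes through the point of means follows from the intercept formula ȳ − slope · x̄ used in the code.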
12. Consider a multiple regression model with two predictors. If the overall F-ratio is
significant, i.e., if the p-value associated with the F-ratio is less than the α-value,
then we can conclude that
13. You did a multiple linear regression with a set of predictor variables and found that the
overall regression was significant, while the individual predictors were all insignificant.
The most likely explanation for this is that:
14. The correlation between two variables in a sample equals zero. This implies that:
15. Consider the plot of residuals versus predictor below for a simple linear regression.
Which of the following statements is true for this regression?
[Figure: Standardised Residuals v Food. Vertical axis: standardised residuals (ticks from 0 down to -4); horizontal axis: Food (0 to 30).]
A. 25.56
B. 7.94
C. 28.24
D. 22.26
17. (IGNORE)
A total of 82 games were played by every team in the 2006-2007 season of the National
Hockey League. A team could win a maximum of two points per game, or 164 points in
total. For each of the 30 teams, data were collected on the number of goals scored per
game (Goals/G) and the percentage of the 164 possible points won (Win%) during the
season. The least squares fit between the two had the following equation:
With RMSE = 2.2 and R² = 0.40. If the assumptions of the simple regression model are
satisfied, what is the probability that a team scoring 2.5 goals per game will have a
Win% of 54.2 or more?
18. The following estimated regression equation compares total compensation among top
executives in a large set of US public corporations in the 1990s. The variables in the
data set are:
ln(Earnings) = 3.86 − 0.28 Female + 0.37 ln(MarketValue) + 0.004 Return
We can conclude from the above regression equation that, controlling for return:
A professor decides to run an experiment to measure the effect of time pressure on final
exam scores. He gives each of the 285 students in his course the same final exam, but some
students have 90 minutes to complete the exam while others have 120 minutes to complete
it. Each student is randomly assigned one of the examination times based on the flip of a
coin. Consider a regression model of the form: Score = b0 + b1X + e.
19. The professor is considering two different choices for X. The first choice would be to
treat X as the time given for the exam (in the sample data set, it would only take
values of 90 and 120 minutes). The second choice would be to make X into a dummy
variable, and code it as 0 for students who are given 90 minutes, and code it as 1
otherwise. Unable to choose between these two alternatives, the professor decided to
include both variables as X1 and X2 respectively in a multiple regression of Score on
these two predictors. Which of the following statements is true:
20. After more deliberation, the professor decided to go with the time variable instead of
the dummy variable. The estimated simple linear equation was E(Score) = 60 + 0.5X.
Based on this, and the accompanying regression output, the professor estimated with
reasonable confidence that the extra 30 minutes resulted in an expected increase in
score of somewhere between 10 and 20 points. The uncertainty in this estimate is
primarily driven by:
A. Sampling variation
B. The difficulty of the exam
C. The choice of 90 and 120 minutes as the test times
D. None of the above
21. Which of the following might be the most likely driver of the intercept term 60 in the
above equation in question 20?
22. Consider a simple regression where the estimated coefficient of the x variable is
greater than 1. This necessarily implies that:
23. In the previous problem, suppose the coefficient of the x-variable is lower than 1.
This necessarily implies that
24. Let R²_YX be the R² value associated with the regression of Y (response) on X
(predictor). Let R²_XY be the R² value associated with the regression of X (response) on
Y (predictor). Which of the following is true?
A. R²_XY = R²_YX
B. R²_XY > R²_YX
C. R²_XY < R²_YX
D. None of the above have to be true
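The relationship between the two R² values can be explored numerically. The sketch below fits both simple regressions on made-up data (not from the exam) with a hand-written least squares routine; in a simple linear regression, R² equals the squared sample correlation, which does not depend on which variable is treated as the response.

```python
def r_squared(response, predictor):
    """R-squared of the simple least squares regression of response on predictor."""
    n = len(response)
    mean_r = sum(response) / n
    mean_p = sum(predictor) / n
    s_pr = sum((p - mean_p) * (r - mean_r) for p, r in zip(predictor, response))
    s_pp = sum((p - mean_p) ** 2 for p in predictor)
    slope = s_pr / s_pp
    intercept = mean_r - slope * mean_p
    sse = sum((r - (intercept + slope * p)) ** 2 for p, r in zip(predictor, response))
    sst = sum((r - mean_r) ** 2 for r in response)
    return 1 - sse / sst


# Made-up illustrative data, not taken from the exam
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]

r2_yx = r_squared(y, x)   # Y as response, X as predictor
r2_xy = r_squared(x, y)   # X as response, Y as predictor

print(f"R2 (Y on X) = {r2_yx:.6f}")
print(f"R2 (X on Y) = {r2_xy:.6f}")
```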
25. (IGNORE)
An IT project manager uses a random sample of the projects he has managed in the
past 4 years to estimate a linear relationship between delay in project completion (in
days) and the size of the code (’00s of lines): Estimated delay = 8.30 + 0.125*(code size).
Which of the following can we conclude without any further information on the data
or the rest of the regression output?
A. Minimum delay in project completion is 8.30 days
B. Larger codes result in larger delays
C. There is a positive relationship between size of the code and project delay in the
company
D. None of the above.
26. A multiple regression model with a person’s weight as response and the person’s
height and the average number of calories consumed per day as predictors was
found to have both slopes positive and significant. Assume further that the height is
positively correlated with calorie consumption. If we consider a simple linear
regression model of weight (response) on height (predictor), the estimated slope
from this regression will be:
The general manager of the auction house claims that this is evidence against the
industry maxim that each additional year of a clock’s age is associated with an average
increase of $1500 in the value of the clock. Do you agree?
Questions 28 – 29 are related to the following description and the JMP reports that follow.
The dataset comes from a set of 420 school districts in California. The response variable is
the test scores of 5th grade students in these districts and is calculated as the average of math
and reading scores for students in each district. The Superintendent of education is
considering whether to decrease class sizes (decrease the student to teacher ratio, labeled
STR hereafter), and wondering whether this would improve student performance on test
scores. Of course, the flip side is that decreasing STR would increase costs and take up much
of the scarce financial resources these school districts have. The regression output of a
simple linear regression of test scores on class sizes (STR) is shown below.
[Figure: scatterplot of TestScr (vertical axis, 600 to 710) against STR (horizontal axis, 13 to 26).]
Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   698.93295    9.467491    73.82     <.0001*
STR          -2.279808   0.479826    -4.75     <.0001*
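Each t Ratio in the Parameter Estimates table is the estimate divided by its standard error. The sketch below recomputes both ratios and adds an approximate 95% confidence interval for the STR slope. The critical value 1.96 is an assumption (a large-sample normal approximation, reasonable for 420 districts), not a figure taken from the JMP report.

```python
# (estimate, standard error) pairs copied from the Parameter Estimates table
estimates = {
    "Intercept": (698.93295, 9.467491),
    "STR": (-2.279808, 0.479826),
}

# t ratio = estimate / standard error
t_ratios = {term: est / se for term, (est, se) in estimates.items()}

# ASSUMPTION: large-sample normal critical value for a 95% interval
Z_95 = 1.96
str_est, str_se = estimates["STR"]
str_ci = (str_est - Z_95 * str_se, str_est + Z_95 * str_se)

for term, t in t_ratios.items():
    print(f"{term:<10} t ratio = {t:.2f}")
print(f"Approximate 95% CI for STR slope: ({str_ci[0]:.2f}, {str_ci[1]:.2f})")
```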
29. A second predictor, the percentage of students whose native language is not English
(PctEL), is added to the regression. While admitting that this leads to an impressive
increase in R², the school superintendent decides to take a look at the scatterplot of
test scores against PctEL. The scatterplot, reproduced below, MOST LIKELY indicates
that:
[Figure: Bivariate Fit of TestScr By PctEL. Vertical axis: TestScr (600 to 710); horizontal axis: PctEL (0 to 90).]
30. Suppose that you fit a simple regression line between response variable Y and
predictor variable X. Further, suppose that you fit a second regression line between
response variable Y and the predicted values of the response variable Ŷ. Which of the
following statements will be true about this second fitted line?
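The setup in question 30 can be tried out numerically. The sketch below uses made-up data (not from the exam) and a hand-written least squares routine: it first regresses Y on X, then regresses Y on the fitted values Ŷ from that first regression, and prints the coefficients of both lines for comparison.

```python
def fit_line(response, predictor):
    """Return (intercept, slope) of the simple least squares line."""
    n = len(response)
    mean_r = sum(response) / n
    mean_p = sum(predictor) / n
    s_pr = sum((p - mean_p) * (r - mean_r) for p, r in zip(predictor, response))
    s_pp = sum((p - mean_p) ** 2 for p in predictor)
    slope = s_pr / s_pp
    return mean_r - slope * mean_p, slope


# Made-up illustrative data, not taken from the exam
x = [2.0, 4.0, 5.0, 7.0, 9.0, 12.0]
y = [10.0, 13.0, 12.0, 18.0, 21.0, 24.0]

# First regression: Y on X
b0, b1 = fit_line(y, x)
y_hat = [b0 + b1 * xi for xi in x]

# Second regression: Y on the fitted values from the first regression
c0, c1 = fit_line(y, y_hat)

print(f"Y on X:     intercept = {b0:.4f}, slope = {b1:.4f}")
print(f"Y on Y-hat: intercept = {c0:.4f}, slope = {c1:.4f}")
```

Because Ŷ is an exact linear function of X, the second regression reproduces the same fitted values as the first, which is why its intercept and slope come out as 0 and 1 (up to rounding) whenever the first slope is non-zero.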