Homework 1
HOMEWORK
ASSIGNMENT 1
Probability and Statistics
SCENARIO: U.S. CENSUS
We took a random sample from the 2000 U.S. Census. Here is part of the dataset:
Background
Clinical depression is the most common mental illness in the United States, affecting 19
million adults each year (Source: NIMH, 1999). Nearly 50% of individuals who
experience a major episode will have a recurrence within 2 to 3 years. Researchers are
interested in comparing therapeutic solutions that could delay or reduce the incidence of
recurrence.
• Hospt: The patient's hospital, represented by a code for each of the 5 hospitals (1, 2, 3,
5, or 6)
• AcuteT: The time in days that the patient was depressed prior to the study.
• Age: The age of the patient in years, when the patient entered the study.
Here's a snapshot of the first 50 patients in the dataset with gender recoded to display
Female or Male:
4. Who are the individuals described by this data?
5. Which of the following variables is categorical? Check all that apply.
6. Which of the following variables is quantitative? Check all that apply.
7. What is the difference between the two bar charts?
8. What do the results suggest about how the students are divided
across the three body image categories?
9. How does the middle group of students (19.6%) feel about their
weight?
10. How do the vast majority of students (71.3%) feel about their
weight?
11. What was the body perception that occurred the least often?
EXAM GRADES
Here are the exam grades of 15 students:
88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73
We first need to break the range of values into intervals (also called "bins" or "classes").
In this case, since our dataset consists of exam scores, it will make sense to choose
intervals that typically correspond to the range of a letter grade, 10 points wide: 40-50,
50-60, ... 90-100. By counting how many of the 15 observations fall in each of the
intervals, we get the following table:
Exam Grades
Score Count
[40-50) 1
[50-60) 2
[60-70) 4
[70-80) 5
[80-90) 2
[90-100] 1
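The counting step described above can be sketched in a few lines of Python. This is only an illustration: the variable names are made up for the example, and it assumes each interval includes its left endpoint, with only the last interval also including its right endpoint, as the bracket notation in the table indicates.

```python
# Exam grades of the 15 students, as listed above.
grades = [88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73]

# Left edges of the 10-point intervals: 40, 50, ..., 90.
left_edges = range(40, 100, 10)

counts = {}
for left in left_edges:
    right = left + 10
    # Every interval includes its left endpoint; only the last interval,
    # [90-100], also includes its right endpoint.
    if right == 100:
        in_bin = [g for g in grades if left <= g <= right]
    else:
        in_bin = [g for g in grades if left <= g < right]
    counts[(left, right)] = len(in_bin)

# Print the frequency table in the same layout as above.
for (left, right), count in counts.items():
    closing = "]" if right == 100 else ")"
    print(f"[{left}-{right}{closing} {count}")
```

The counts should add up to the number of observations (15), which is a quick check that no grade was skipped or counted twice.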
12. What percentage of students earned less than a grade of 70 on the exam?
13. An instructor asked her students how much time (to the nearest hour) they spent
studying for the midterm. The data are displayed in the following histogram:
What do the numbers on the horizontal axis represent?
14. An instructor asked her students how much time (to the nearest hour) they spent
studying for the midterm. The data are displayed in the following histogram:
15. An instructor asked her students how much time (to the nearest hour) they spent
studying for the midterm. The data are displayed in the following histogram:
What percentage of students study 6 or more hours for the midterm?
16. Thirty-two students were asked the number of servings of fruits and vegetables
they eat daily. The results are displayed in the histogram below.
How many of the students surveyed eat at least 4 servings of fruits and vegetables
daily?
17. Thirty-two students were asked the number of servings of fruits and vegetables
they eat daily. The results are displayed in the histogram below.
What percentage of the students surveyed eat no more than 3 servings of fruits and
vegetables daily?
18. Thirty-two students were asked the number of servings of fruits and vegetables
they eat daily. The results are displayed in the histogram below.
What proportion of the students surveyed eats exactly 5 servings of fruits and
vegetables daily?
19. A survey was conducted to see how many phone calls people made daily. The
results are displayed in the table below:
Number of calls made Frequency
1-4 16
5-8 11
9-12 5
13-16 3
17-20 1
How many of the people surveyed make fewer than 9 phone calls daily?
20. A survey was conducted to see how many phone calls people made daily. The
results are displayed in the table below:
25. Which of the following is the best description of the data used to generate this
histogram (note that the horizontal axis has no scale, so you will make your choice
based solely upon the histogram's shape)?
• SAT Math scores of 1,000 future engineers and scientists.
• Results of rolling a six-sided die 1,000 times.
• Cholesterol levels of 1,000 adults.
• Shoe sizes of 1,000 men and women.
• Prices of 1,000 California homes.
26. Which of the following is the best description of the data used to generate this
histogram (note that the horizontal axis has no scale, so you will make your choice
based solely upon the histogram's shape)?
• SAT Math scores of 1,000 future engineers and scientists.
• Results of rolling a six-sided die 1,000 times.
• Cholesterol levels of 1,000 adults.
• Shoe sizes of 1,000 men and women.
• Prices of 1,000 California homes.
27. Which of the following is the best description of the data used to generate this
histogram (note that the horizontal axis has no scale, so you will make your choice
based solely upon the histogram's shape)?
• SAT Math scores of 1,000 future engineers and scientists.
• Results of rolling a six-sided die 1,000 times.
• Cholesterol levels of 1,000 adults.
• Shoe sizes of 1,000 men and women.
• Prices of 1,000 California homes.
28. Which of the following is the best description of the data used to generate this
histogram (note that the horizontal axis has no scale, so you will make your choice
based solely upon the histogram's shape)?
• SAT Math scores of 1,000 future engineers and scientists.
• Results of rolling a six-sided die 1,000 times.
• Cholesterol levels of 1,000 adults.
• Shoe sizes of 1,000 men and women.
• Prices of 1,000 California homes.
29. Here is the number of hours that each of 9 students spends on the computer on a
typical day: 1 6 7 5 5 8 11 12 15
30. Here is the number of hours that each of 9 students spends on the computer on a
typical day: 1 6 7 5 5 8 11 12 15
What kind of distribution is formed by the data from the above 9 students?
31. Here is the number of hours that each of 9 students spends on the computer on a
typical day: 1 6 7 5 5 8 11 12 15
Which of the following is the mean number of hours spent on the computer?
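As a reminder of the mechanics, the mean is the sum of the observations divided by their count, and the median is the middle value once the data are ordered. A minimal Python sketch using the nine values above (the variable names are chosen for this illustration only):

```python
from statistics import mean, median

# Hours on the computer for the 9 students, as listed above.
hours = [1, 6, 7, 5, 5, 8, 11, 12, 15]

total = sum(hours)          # sum of all observations
n = len(hours)              # number of observations
mean_hours = total / n      # mean = sum / count

ordered = sorted(hours)     # the median is read from the ordered data
median_hours = median(hours)

print("ordered data:", ordered)
print("mean:", round(mean_hours, 2))
print("median:", median_hours)
```

Comparing the mean with the median is also a simple way to start thinking about the shape of the distribution.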
32. A recent survey asked 90 students, "How many hours do you spend on the computer
in a typical day?"
Background
A study was done in order to find out whether pamphlets containing information for
cancer patients are written at a level that the cancer patients can understand. Tests
were administered to measure the reading levels of 63 cancer patients, and the
readability levels of 30 cancer pamphlets were evaluated based on such factors as the
lengths of the sentences and the number of polysyllabic words. Both the reading and
readability levels correspond to grade levels, but patients' reading levels of less than
grade 3 and above grade 12 cannot be determined exactly. (Source: Short, Moriarty,
and Cooly. (1995). "Readability of Educational Materials for Cancer Patients." Journal of
Statistics Education, v.3, n.2)
The following tables indicate the number of patients at each reading level and the
number of pamphlets at each readability level.
Comment
Note: For both the reading level and readability level, the data are presented in a
grouped form where the count represents the frequency of occurrence of that level. In
the readability data, for example, the count of level 6 is 3 which means that the first
three data points are 6 6 6; the count of level 7 is 3 which means that the next three
data points are 7 7 7; the count of level 8 is 8 which means that the next eight data
points are 8 8 8 8 8 8 8 8, and so on.
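The expansion described in the note can be sketched as follows. Only the three counts quoted in the note (levels 6, 7, and 8) are used; the counts for the remaining levels are not reproduced here, so the result is just the first part of the readability data. The names `partial_counts` and `expanded` are made up for this illustration.

```python
# Counts quoted in the note above: level -> number of pamphlets.
# The counts for the other readability levels are not included here.
partial_counts = {6: 3, 7: 3, 8: 8}

# Repeat each level as many times as its count, in increasing order,
# to recover the individual (already ordered) data points.
expanded = []
for level, count in sorted(partial_counts.items()):
    expanded.extend([level] * count)

print(expanded)
print("data points recovered so far:", len(expanded))
```

With the full table of counts, the same loop would produce the complete ordered data set, from which the mode and the median can be read directly.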
34. Which of the following is the mode of the readability level for the pamphlets?
35. Explain why you cannot calculate the mean (average) reading level of patients
given the above data.
36. Now find the median reading level of the patients. Note that the data are already
ordered—that's good!
37. How many observations (n) are there for the reading level of cancer patients?
38. Since n is odd, the median reading level for patients will be which observation in the
ordered data? (center observation or average of the two center observations)
39. What is the rank of the observation that represents the median reading level for
patients?
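The rule for locating the median in ordered data can be written as a small helper. This is a sketch, with `median_positions` a hypothetical name: for odd n the median is the single observation at rank (n + 1) / 2, and for even n it is the average of the observations at ranks n / 2 and n / 2 + 1.

```python
def median_positions(n):
    """Return the 1-based rank(s) of the median in ordered data of size n."""
    if n % 2 == 1:
        # Odd n: one center observation.
        return [(n + 1) // 2]
    # Even n: the median is the average of the two center observations.
    return [n // 2, n // 2 + 1]


n_patients = 63   # patients tested, from the Background above
n_pamphlets = 30  # pamphlets evaluated, from the Background above

print("patients:", median_positions(n_patients))
print("pamphlets:", median_positions(n_pamphlets))
```

The ranks refer to positions in the ordered data, not to the reading levels themselves; the level found at that position is the median.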
40. Find the median readability level of the pamphlets.
41. Can you conclude that the pamphlets are well matched to the patients' reading
levels? Look carefully at the data.
42. In the histogram below, what will be the relationship between the mean and median
of the collected data?
43. The SAT Math scores of 1,000 future engineers and physicists are recorded. What
will be the relationship between the mean and median of the collected data?
Depression Days
Days Count
[20-60) 5
[60-100) 10
[100-140) 20
[140-180) 30
[180-220) 16
[220-260) 10
[260-300) 6
[300-340) 4
[340-380) 2
[380-420) 0
[420-460) 0
[460-500) 0
[500-540] 2
Which of the following is a possible value of the median number of days that patients
were depressed?
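When only grouped counts are available, the exact median cannot be computed, but the interval that contains it can be found by accumulating the counts until the median's rank is reached. A minimal sketch under that approach, using the counts from the table above (the variable names are chosen for this example):

```python
# Interval edges (in days) and counts from the Depression Days table.
bins = [(low, low + 40) for low in range(20, 540, 40)]
counts = [5, 10, 20, 30, 16, 10, 6, 4, 2, 0, 0, 0, 2]

n = sum(counts)               # total number of patients
median_rank = (n + 1) // 2    # n is odd, so the median is one observation

# Walk through the intervals, keeping a running total of the counts,
# and stop at the first interval whose running total reaches the rank.
cumulative = 0
median_bin = None
for (low, high), count in zip(bins, counts):
    cumulative += count
    if cumulative >= median_rank:
        median_bin = (low, high)
        break

print("n =", n)
print("median rank =", median_rank)
print("median lies in the interval", median_bin)
```

Any value inside that interval is a possible median; values outside it are not.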
44. Using this same histogram of 105 patients, which of the following is most likely to
be true?
47. A survey taken of 140 sports fans asked the question: "What is the most you have
ever spent for a ticket to a sporting event?" The five-number summary for the data
collected is: min = 85, Q1 = 130, Median = 145, Q3 = 150, Max = 250
Now let's look at the five-number summary of the age of Best Actor Oscar winners
(1970-2001). The five-number summary is: Min: 29 Q1: 38 M: 43.5 Q3: 50.5 Max: 76
48. Half of the actors won the Oscar before what age?
49. What is the range covered by the middle 50% of the ages?
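For reference, the middle 50% of a data set runs from Q1 to Q3, and its width, Q3 - Q1, is the interquartile range (IQR). The sketch below computes the IQR and the full range from the two five-number summaries quoted above; the dictionary layout and function names are just one convenient choice for the illustration.

```python
# Five-number summaries quoted above.
oscar = {"min": 29, "q1": 38, "median": 43.5, "q3": 50.5, "max": 76}
tickets = {"min": 85, "q1": 130, "median": 145, "q3": 150, "max": 250}


def iqr(summary):
    """Width of the interval covered by the middle 50% of the data."""
    return summary["q3"] - summary["q1"]


def full_range(summary):
    """Distance from the smallest to the largest observation."""
    return summary["max"] - summary["min"]


for name, summary in [("Oscar winners' ages", oscar), ("ticket prices", tickets)]:
    print(name)
    print("  middle 50% runs from", summary["q1"], "to", summary["q3"])
    print("  IQR =", iqr(summary), " range =", full_range(summary))
```

The median splits the data in half: 50% of the observations fall below it and 50% above it.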
50. The boxplot below displays ratings for TV shows during sweeps week.
A) What percentage of the graduates will have a debt greater than $25,000?
B) 50% of the debts owed are smaller than what amount?
C) Within which interval of debts would you expect to find the largest number of
graduates?
SCENARIO: GRADUATION RATE
The percentage of each entering freshman class that graduated on time was recorded
for each of six colleges at a major university over a period of several years.
In order to compare the graduation rates among the different colleges, we created side-
by-side boxplots (graduation rate by college), and supplemented the graph with
numerical measures.
53. Based on the boxplots and data, which of the six colleges has the best on-time
graduation rate?
Background
At the end of a statistics course, the 27 students in the class were asked to rate the
instructor on a number scale of 1 to 9 (1 being "very poor," and 9 being "best instructor
I've ever had"). Assume that the average rating in each of the three classes is 5 (which
should be visually reasonably clear from the histograms), and recall the interpretation of
the SD as a "typical" or "average" distance between the data points and their mean. The
following table provides three hypothetical sets of rating data:
Rating 1 2 3 4 5 6 7 8 9
Class I 1 0 0 0 22 0 0 0 1
Class II 12 0 0 0 1 0 0 0 12
Class III 2 2 2 2 2 2 2 2 2
54. Judging from the table and the histograms, which class would have the largest
standard deviation?
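The visual judgment can be checked numerically by computing the mean and SD of each class directly from the frequency table. The sketch below uses the counts exactly as they appear in the table and assumes the sample standard deviation (divisor n - 1); the function name `mean_and_sd` is made up for this example.

```python
from math import sqrt

ratings = list(range(1, 10))  # rating scale 1 through 9

# Counts per rating, copied from the table above.
classes = {
    "Class I": [1, 0, 0, 0, 22, 0, 0, 0, 1],
    "Class II": [12, 0, 0, 0, 1, 0, 0, 0, 12],
    "Class III": [2, 2, 2, 2, 2, 2, 2, 2, 2],
}


def mean_and_sd(values, counts):
    """Mean and sample SD (divisor n - 1, an assumption) of grouped data."""
    n = sum(counts)
    mean = sum(v * c for v, c in zip(values, counts)) / n
    # Sum of squared distances from the mean, weighted by the counts.
    squared = sum(c * (v - mean) ** 2 for v, c in zip(values, counts))
    return mean, sqrt(squared / (n - 1))


results = {name: mean_and_sd(ratings, counts) for name, counts in classes.items()}

for name, (mean, sd) in results.items():
    print(f"{name}: mean = {mean:.1f}, SD = {sd:.2f}")
```

The class whose ratings sit farthest from 5 on average has the largest SD, and the class whose ratings are concentrated at 5 has the smallest.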