4 (A) - Graphical Presentation of Data
• The first step in constructing a graphical display is often to summarize the data in a table
and then use information in the table to construct the display.
For Example:
NH = noncompliant helmet
CH = compliant helmet
N = no helmet
CH N CH NH N CH CH CH N N
• From the relative frequency distribution, you can see that a large number of the riders
(43%) were not wearing a helmet, but most of those who wore a helmet were wearing one
that met the Department of Transportation safety standard.
• Bar charts and comparative bar charts can be used to summarize univariate
categorical data.
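The tabulation step described above can be sketched in a few lines of Python. This is only an illustration: it uses the ten observations listed in the example (not the full sample), and the function name `frequency_table` is a hypothetical helper, not something from the source.

```python
from collections import Counter

# The ten observations listed in the example above
# (CH = compliant helmet, NH = noncompliant helmet, N = no helmet).
observations = "CH N CH NH N CH CH CH N N".split()

def frequency_table(values):
    """Return {category: (frequency, relative frequency)}."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, n / total) for cat, n in counts.items()}

table = frequency_table(observations)
for category, (freq, rel) in sorted(table.items()):
    print(f"{category:<3} {freq:>3} {rel:.2f}")
```

Each relative frequency is the frequency divided by the number of observations, so the relative frequencies always add up to 1.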
1. BAR CHART
• The bar chart is used when the purpose of the display is to show the data distribution.
• Each category in the frequency distribution is represented by a bar or rectangle, and the
display is constructed so that the area of each bar is proportional to the corresponding
frequency or relative frequency.
When to Use
• Number of variables: 1
• Data type: categorical
• Purpose: displaying data distribution
How to Construct
1. Draw a horizontal axis, and write the category names or labels below the line at
regularly spaced intervals.
For Example:
• The example above used data on helmet use from a sample of 1,700 motorcyclists to
construct the frequency distribution in Table 2.1.
Applying the three construction steps to that frequency distribution gives the following bar chart:
• The bar chart provides a visual representation of the distribution of the 1,700 values that
make up the data set.
• The bar for compliant helmets is about five times as tall (and therefore has five times the
area) as the bar for noncompliant helmets because approximately five times as many
motorcyclists wore compliant helmets as wore noncompliant helmets.
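The idea that each bar is proportional to its relative frequency can be sketched as a plain-text chart. The sketch below is an assumption-laden illustration: it uses only the ten helmet observations listed earlier, draws the bars horizontally for convenience, and `text_bar_chart` is a hypothetical helper name.

```python
from collections import Counter

# The ten observations listed in the helmet example
# (CH = compliant helmet, NH = noncompliant helmet, N = no helmet).
observations = "CH N CH NH N CH CH CH N N".split()

def text_bar_chart(values, width=40):
    """Return one line per category; bar length is proportional to relative frequency."""
    counts = Counter(values)
    total = len(values)
    lines = []
    for category in sorted(counts):
        rel = counts[category] / total
        bar = "#" * round(rel * width)
        lines.append(f"{category:<3}|{bar} {rel:.2f}")
    return lines

chart = text_bar_chart(observations)
print("\n".join(chart))
```

Because all bars share the same thickness, a bar that is five times as long also has five times the area.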
• Bar charts can also be used to provide a visual comparison of two or more groups.
When to Use
How to Construct
• This is constructed by using the same horizontal and vertical axes for the bar charts of
two or more groups.
• When constructing a comparative bar graph, you use the relative frequency rather than
the frequency to construct the scale on the vertical axis, so that you can make meaningful
comparisons even if the sample sizes are not the same.
• The same set of steps that were used to construct a bar chart are used to construct a
comparative bar chart, but in a comparative bar chart each category will have a bar for
each group.
For Example:
Each year, The Princeton Review conducts surveys of high school students who are
applying to college and of parents of college applicants. The report “2009 College
Hopes & Worries Survey Findings” (www.princetonreview/college-hopes-worries-2009)
included a summary of how 12,715 high school students responded to
the question “Ideally how far from home would you like the college you attend to be?”
Students responded by choosing one of four possible distance categories. Also
included was a summary of how 3,007 parents of students applying to college
responded to the question “How far from home would you like the college your child
attends to be?” The accompanying relative frequency table summarizes the student and
parent responses.
• A higher proportion of parents prefer a college close to home, and a higher proportion of
students believe that the ideal distance from home is more than 500 miles.
• To see why it is important to use relative frequencies rather than frequencies to compare
groups of different sizes, consider the incorrect bar chart constructed using the
frequencies rather than the relative frequencies (in the figure below).
• Because there were so many more students than parents who participated in the surveys
(12,715 students and only 3,007 parents), the incorrect bar chart conveys a very different
and misleading impression of the differences between students and parents.
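The effect of unequal sample sizes can be illustrated with a small sketch. The counts below are hypothetical, invented purely for illustration (they are not the survey's data), and the category labels and the helper `relative_frequencies` are likewise assumptions.

```python
def relative_frequencies(counts):
    """Convert {category: frequency} into {category: relative frequency}."""
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

# Hypothetical counts, invented purely for illustration (not the survey's data).
large_group = {"near": 400, "mid": 400, "far": 200}   # 1,000 responses
small_group = {"near": 50, "mid": 30, "far": 20}      # 100 responses

rel_large = relative_frequencies(large_group)
rel_small = relative_frequencies(small_group)

for cat in large_group:
    print(f"{cat:<5} counts: {large_group[cat]:>4} vs {small_group[cat]:>3}   "
          f"relative: {rel_large[cat]:.2f} vs {rel_small[cat]:.2f}")
```

In this made-up case the raw counts suggest the larger group favors "near" far more often, while the relative frequencies show that the smaller group actually favors it more (0.50 versus 0.40).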
Three different types of graphical displays for univariate numerical data are:
1. Dotplots
2. Stem-and-leaf displays
3. Histograms.
1) DOTPLOTS
• A dotplot is a simple way to display numerical data when the data set is not too large.
• Each observation is represented by a dot above the location corresponding to its value
on a number line.
• When a value occurs more than once in a data set, there is a dot for each occurrence,
and these dots are stacked vertically in the plot.
When to Use
• Number of variables: 1
• Data Type: numerical
• Purpose: display data distribution
How to Construct
2. Locate each value in the data set along the measurement scale, and represent it by a
dot. If there are two or more observations with the same value, stack the dots
vertically.
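The stacking rule can be sketched as a plain-text dotplot. This is a minimal illustration that assumes integer data values; the data list is hypothetical and `text_dotplot` is an invented helper name.

```python
from collections import Counter

def text_dotplot(values):
    """Return the rows of a dotplot, top row first, one column per integer on the scale."""
    counts = Counter(values)
    low, high = min(values), max(values)
    tallest = max(counts.values())
    rows = []
    for level in range(tallest, 0, -1):
        # A dot appears at this height only if the value occurs at least `level` times.
        rows.append("".join("o" if counts[v] >= level else " "
                            for v in range(low, high + 1)))
    rows.append("-" * (high - low + 1))  # the number line
    return rows

# Hypothetical small data set, made up for illustration.
data = [3, 5, 5, 6, 6, 6, 8]
plot = text_dotplot(data)
print("\n".join(plot))
```

The value 6 occurs three times, so its column holds three stacked dots; the leftmost column corresponds to 3 and the rightmost to 8.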
For Example:
The article “Keeping Score When It Counts: Graduation Rates and Academic
Progress Rates for 2009 NCAA Division I Basketball Tournament Teams” (The
Institute for Diversity and Ethics in Sport, University of Central Florida, March
2009) included data on graduation rates of basketball players for the universities and
colleges that sent teams to the 2009 Division I playoffs. The following graduation rates
are the percentages of basketball players starting college in 2002 who had graduated
by the end of 2008. (Note: Teams from 65 schools made it to the playoffs, but two of
them—Cornell and North Dakota State—did not report graduation rates.)
• A dotplot is an appropriate choice to summarize these data because the data set consists
of one variable (graduation rate), the variable is numerical, and the purpose is to display
the data distribution.
• The data set is not too large, with 63 observations. (If the data set had been much larger,
a histogram might have been a better choice.)
• The dotplot shows how the 63 graduation rates are distributed along the number line.
• It can be seen that basketball graduation rates vary a great deal from school to school,
ranging from a low of 8% to a high of 100%.
• You can also see that the graduation rates seem to cluster in several groups, denoted by
the colored ovals that have been added to the dotplot.
• There are several schools with graduation rates of 100% (excellent!) and another group
of 13 schools with graduation rates that are higher than most.
• The majority of schools are in the large cluster, with graduation rates from about 30% to
about 72%.
• And then there is that bottom group of four schools with embarrassingly low graduation
rates for basketball players.
When to Use
How to Construct
• A comparative dotplot is constructed using the same numerical scale for two or more
dotplots.
The article referenced in the example above also gave graduation rates for all student
athletes at the 63 schools in the 2009 Division I basketball playoffs. The data are listed
below. Also listed are the differences between the graduation rate for all student athletes
and the graduation rate for basketball players.
• Notice that the comparative dotplot actually consists of two labeled dotplots that use the
same numerical scale.
• There are some striking differences that are easy to see when the data are displayed in
this way.
• The graduation rates for all student athletes tend to be higher and to vary less from
school to school than the graduation rates for just basketball players.
• The dotplots in Figure 2.7 are informative, but we can do even better.
• The data given here are paired data. Each basketball graduation rate can be paired with
the graduation rate for all student athletes from the same school.
• When data are paired, it is usually more informative to look at the differences.
• These differences (all − basketball) are also given in the above data table.
• Notice that:
─ There are 11 schools for which the difference is negative. Negative differences
correspond to schools that have a higher graduation rate for basketball players than
for all student athletes.
─ The most interesting features of the difference dotplot are the very large number of
positive differences and the wide spread. Positive differences correspond to schools
that have a lower graduation rate for basketball players.
─ There is a lot of variability in the graduation rate difference from school to school,
and three schools have differences that are noticeably higher than the rest: 53%,
55%, and 69%.
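Computing paired differences is a one-line operation once the two lists are aligned school by school. The sketch below uses hypothetical graduation rates for four schools, invented for illustration; the variable names are assumptions as well.

```python
# Hypothetical graduation rates (percent) for four schools, invented for illustration.
all_athletes = [80, 75, 90, 60]
basketball = [60, 80, 55, 60]

# Pair the two rates school by school and take all - basketball.
differences = [a - b for a, b in zip(all_athletes, basketball)]

negative = sum(d < 0 for d in differences)   # basketball rate is higher
positive = sum(d > 0 for d in differences)   # basketball rate is lower

print(differences)
print(negative, positive)
```

A dotplot of `differences` is then a single univariate display, with the sign of each value showing which rate is higher at that school.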
2) STEM-AND-LEAF DISPLAY
When to Use
• Number of variables: 1
• Data type: Numerical
• Purpose: Display data distribution
How to Construct
1. Each number in the data set is broken into two pieces, a stem and a leaf.
─ The stem is the first part of the number and consists of the beginning digit(s).
─ The leaf is the last part of the number and consists of the final digit(s).
For Example:
─ The number 213 might be split into a stem of 2 and a leaf of 13 or a stem of 21 and
a leaf of 3.
─ The resulting stems and leaves are then used to construct the display.
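The splitting rule can be sketched with integer division. This is an illustrative sketch for non-negative integers only; `split_stem_leaf` and `stem_and_leaf` are hypothetical helper names, and the values passed to the display are made up.

```python
from collections import defaultdict

def split_stem_leaf(number, leaf_digits=1):
    """Split a non-negative integer into (stem, leaf) with the given number of leaf digits."""
    return divmod(number, 10 ** leaf_digits)

def stem_and_leaf(values, leaf_digits=1):
    """Return display lines such as '  4 | 3', with leaves sorted within each stem."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = split_stem_leaf(value, leaf_digits)
        rows[stem].append(leaf)
    lines = []
    for stem in range(min(rows), max(rows) + 1):   # keep empty stems so gaps stay visible
        leaves = "".join(f"{leaf:0{leaf_digits}d}" for leaf in rows.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines

print(split_stem_leaf(213, 2))   # stem 2, leaf 13
print(split_stem_leaf(213, 1))   # stem 21, leaf 3

# Hypothetical values, made up for illustration.
display = stem_and_leaf([43, 67, 75, 78, 81, 81, 90])
print("\n".join(display))
```

Keeping the empty stem 5 in the output is a deliberate choice: it preserves the gap between 43 and 67 that the display is meant to reveal.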
For Example:
Many auto insurance companies give job-related discounts of 5 to 15%. The article
“Auto-Rate Discounts Seem to Defy Data” (San Luis Obispo Tribune, June 19,
2004) included the accompanying data on the number of automobile accidents per
year for every 1,000 people in 40 occupations.
• The figure below shows a stem-and-leaf display for the accident rate data.
• The numbers in the vertical column on the left of the display are the stems.
• Each number to the right of the vertical line is a leaf corresponding to one of the
observations in the data set.
• The legend
Stem: Tens
Leaf: Ones
tells you that the observation that had a stem of 4 and a leaf of 3 corresponds to the
occupation with an accident rate of 43 per 1,000.
• Similarly, the observation with the stem of 10 and leaf of 2 corresponds to 102 accidents
per 1,000.
• The display in the figure above suggests that a typical or representative value is in the stem 8
or 9 row, perhaps around 90.
• The observations are mostly concentrated in the 75 to 109 range, but there are a couple of
values that stand out on the low end (43 and 67) and one observation (152) that is far
removed from the rest of the data on the high end.
When to Use
How to Construct
• A comparative stem-and-leaf display, in which the leaves for one group are listed to
the right of the stem values and the leaves for the second group are listed to the left,
can show how the two groups are similar and how they differ.
• Be sure to include group labels to identify which group is on the left and which is on
the right.
For Example:
The article “Going Wireless” (AARP Bulletin, June 2009) reported the estimated
percentage of households with only wireless phone service (no landline) for the 50 U.S.
states and the District of Columbia. Data for the 19 eastern states and for 13 western
states are in the following table.
Western states | Stem | Eastern states
           989 |  0   | 559875
       1670681 |  1   | 66301164001
           512 |  2   | 00
Stem: tens
Leaves: ones
• From the comparative stem-and-leaf display, you can see that although there was
state-to-state variability in both the western and the eastern states, the data distributions are quite
similar.
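Reading a comparative stem-and-leaf display back into data values can be sketched as follows. The rows are copied from the display above; the assignment of the left side to the 13 western states and the right side to the 19 eastern states is an assumption inferred from the number of leaves on each side, and `decode` is a hypothetical helper.

```python
# Rows of the display above: (left leaves, stem, right leaves).
# Assumption: the left side is the 13 western states and the right side the
# 19 eastern states, inferred from the number of leaves on each side.
display = [
    ("989", 0, "559875"),
    ("1670681", 1, "66301164001"),
    ("512", 2, "00"),
]

def decode(rows, side):
    """Turn the leaves on one side (0 = left, 2 = right) back into data values."""
    values = []
    for row in rows:
        stem = row[1]
        # Stem is the tens digit and each leaf is a ones digit.
        values.extend(10 * stem + int(leaf) for leaf in row[side])
    return sorted(values)

western = decode(display, 0)
eastern = decode(display, 2)
print(len(western), western)
print(len(eastern), eastern)
```

Both decoded lists span roughly the same range of percentages, which is consistent with the observation that the two distributions are similar.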
3) HISTOGRAMS
• Dotplots and stem-and-leaf displays are not always effective ways to summarize
numerical data.
• Both are awkward when the data set contains a large number of data values.
• Histograms are displays that don’t work well for small data sets but do work well for
larger numerical data sets.
• Histograms are constructed a bit differently, depending on whether the variable of interest
is discrete or continuous.
• A frequency distribution for discrete numerical data lists each possible value (either
individually or grouped into intervals), the associated frequency, and sometimes the
corresponding relative frequency [which (relative frequency) is calculated by dividing
each frequency by the total number of observations in the data set].
For example, you might group the values 1, 2, and 3 to form an interval of 1–3, with a
corresponding frequency of 3, and so on.
─ Frequency Distribution
─ Relative Frequency
─ Grouped Frequency
When to Use
• Number of variables: 1
How to Construct
1. Draw a horizontal scale, and mark the possible values of the variable.
2. Draw a vertical scale, and add either a frequency or relative frequency scale.
3. Above each possible value, draw a rectangle centered at that value (so that the
rectangle for 1 is centered at 1, the rectangle for 5 is centered at 5, and so on).
• General shape
• Before constructing a histogram for continuous numerical data, a frequency distribution
for the data has to be constructed.
• The first step in constructing a frequency distribution for continuous numerical data is to
decide what intervals will be used to group the data. These intervals are called class
intervals.
For Example:
States differ widely in the percentage of college students who attend college in their
home state. The percentages of freshmen who attend college in their home state for each
of the 50 states are shown in the table.
96 73 60 73 79
86 93 58 81 75
81 76 89 73 59
84 86 86 72 59
77 78 80 56 43
90 76 66 55 50
73 88 70 75 64
53 86 90 77 80
90 87 89 82 82
96 64 82 83 75
The frequency distribution and relative frequencies for the above data are in the following table:
Class   Class interval   Frequency   Relative frequency
1       40 to < 50        1          0.02
2       50 to < 60        7          0.14
3       60 to < 70        4          0.08
4       70 to < 80       15          0.30
5       80 to < 90       17          0.34
6       90 to < 100       6          0.12
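The frequencies in the table can be reproduced from the 50 listed percentages with a short sketch. It assumes class intervals of width 10 starting at 40 and running up to 90 to < 100; `class_frequencies` is a hypothetical helper name.

```python
# The 50 home-state percentages listed above, row by row.
data = [
    96, 73, 60, 73, 79, 86, 93, 58, 81, 75,
    81, 76, 89, 73, 59, 84, 86, 86, 72, 59,
    77, 78, 80, 56, 43, 90, 76, 66, 55, 50,
    73, 88, 70, 75, 64, 53, 86, 90, 77, 80,
    90, 87, 89, 82, 82, 96, 64, 82, 83, 75,
]

def class_frequencies(values, start, width, n_classes):
    """Count how many values fall in each interval [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1
    return counts

counts = class_frequencies(data, start=40, width=10, n_classes=6)
for i, n in enumerate(counts):
    lower = 40 + 10 * i
    print(f"{lower} to < {lower + 10}: {n:>2}  {n / len(data):.2f}")
```

Each interval includes its lower endpoint and excludes its upper endpoint, so a value of exactly 90 is counted in the 90 to < 100 class.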
• It is reasonable to start the first class interval at 40 and let each interval have a width of
10.
• This gives class intervals starting with 40 to < 50 and continuing up to 90 to < 100.
• There are no set rules for selecting either the number of class intervals or the length of the
intervals.
• Using a few relatively wide intervals will bunch the data, whereas using a great many
relatively narrow intervals may spread the data over too many intervals, so that no
interval contains more than a few observations.
• In general, with a small amount of data, relatively few intervals, perhaps between 5 and
10, should be used.
• With a large amount of data, a distribution based on 15 to 20 (or even more) intervals is
often recommended.
• The quantity
• When the class intervals in a frequency distribution are all of equal width, you
construct a histogram in a way that is very similar to what is done for discrete data.
When to Use
• Number of variables: 1
• Data Type: Continuous numerical
• Purpose: Displaying data distribution
How to Construct
For Example
• You can use the steps in the previous box to construct a histogram for the data
summarized in Table 2.5.
• Figure 2.19 shows the completed relative frequency histogram. Notice that the
histogram has a single peak, with a majority of the children watching between 0 and 4
hours of TV per day.
• When both x and y are numerical variables, each observation consists of a pair of
numbers, such as (14, 5.2) or (27.63, 18.9).
• The first number in a pair is the value of x, and the second number is the value of y.
• An unorganized list of bivariate data doesn’t tell you much about the distribution of the x
values or the distribution of the y values, and tells you even less about how the two
variables are related to one another.
• Just as graphical displays are used to summarize univariate data, they can also be used to
summarize bivariate data.
• Scatterplots and time series plots can be used to summarize bivariate numerical
data.
1. SCATTERPLOTS
When to Use
• Number of variables: 2
• Data Type: Numerical
• Purpose: Investigate the relationship between variables
How to Construct
1. Draw horizontal and vertical axes. Label the horizontal axis and include an
appropriate scale for the x variable. Label the vertical axis and include an appropriate
scale for the y variable.
2. For each (x, y) pair in the data set, add a dot at the appropriate location in the display.
For Example
The table gives the cost and an overall quality rating for 10 different brands of men’s athletic
shoes (www.consumerreports.org).
Is there a relationship between x = cost and y = quality rating? A scatterplot can help answer
this question.
Cost   Rating
65     71
45     70
45     62
80     59
110    58
110    57
30     56
80     52
110    51
70     51
The figure below shows the completed scatterplot. There is an interesting and unexpected pattern.
The larger costs tend to be paired with the lower quality ratings, suggesting that there is
actually a negative association between cost and quality!
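The two construction steps can be sketched as a rough plain-text scatterplot of the shoe data. This is only an approximation of a real plot: points are rounded to the nearest grid cell, the grid size is an arbitrary choice, and `text_scatterplot` is a hypothetical helper.

```python
# Cost (x) and quality rating (y) for the 10 brands in the table above.
cost = [65, 45, 45, 80, 110, 110, 30, 80, 110, 70]
rating = [71, 70, 62, 59, 58, 57, 56, 52, 51, 51]

def text_scatterplot(xs, ys, width=41, height=11):
    """Return grid rows (top row = largest y) with '*' at each (x, y) pair."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y_max - y) / (y_max - y_min) * (height - 1))
        grid[row][col] = "*"
    return ["".join(line) for line in grid]

grid = text_scatterplot(cost, rating)
for line in grid:
    print("|" + line)          # vertical axis: rating
print("+" + "-" * 41)          # horizontal axis: cost
```

In the printed grid the marks drift from the upper left toward the lower right, which is the visual signature of the negative association described above.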
• Data sets often consist of measurements collected over time at regular intervals so that
you can learn about change over time.
For example: stock prices, sales figures, and other socio-economic indicators might
be recorded on a weekly or monthly basis.
• A time-series plot (sometimes also called a time plot) is a simple graph of data
collected over time that can help you see interesting trends or patterns.
• A time series plot can be constructed by thinking of the data set as a bivariate data set,
where y is the variable observed and x is the time at which the observation was made.
• These (x, y) pairs are plotted as in a scatterplot. Consecutive observations are then
connected by a line segment.
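Treating a time series as (x, y) pairs and joining consecutive observations can be sketched as follows. The observations are hypothetical values invented for illustration, and `time_series_segments` is an assumed helper name.

```python
# Hypothetical yearly observations (year, value), invented for illustration.
observations = [(2003, 12.0), (2001, 10.0), (2002, 11.5), (2004, 11.0)]

def time_series_segments(pairs):
    """Sort by time and return the line segments joining consecutive observations."""
    ordered = sorted(pairs)                      # x is time, so sort on it first
    return list(zip(ordered, ordered[1:]))

segments = time_series_segments(observations)
for (t0, y0), (t1, y1) in segments:
    direction = "up" if y1 > y0 else "down" if y1 < y0 else "flat"
    print(f"{t0} -> {t1}: {y0} -> {y1} ({direction})")
```

Sorting by time before connecting the points matters: joining observations in the order they happen to be listed would draw segments that do not reflect change over time.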
For Example
The Christmas Price Index is calculated each year by PNC Advisors. The year 2008 was the
most costly year since the index began in 1984, with the “cost of Christmas” at $21,080. A
plot of the Christmas Price Index over time appears on the PNC web site
(www.pncchristmaspriceindex.com), and the data given there were used to construct the time
series plot in the figure below.
• The plot shows an upward trend in the index from 1984 until 1993.
• There has also been a clear upward trend in the index since 1995.