Basics of data
-
Categorical variables are those that take on values which are categories, not numerical values.
- A categorical variable is
nominal if the values do not involve some kind of ordering. For example, with a dog’s breed, there’s no point in assigning a corgi, poodle or chihuahua some order.
- A categorical variable is
ordinal if the values can somehow be ordered. For example, an ordering system appears when we categorise speed/velocity into slow, medium or fast.
-
Numeric variables are those that take on values that are numbers.
- A numeric variable is
discrete if we can list out every value it can take, or it can only take integer values (i.e. whole numbers). For example, the number of people living in a household can only be an integer value, say, 4.
- A numeric variable is
continuous if we cannot do the above. For example, when we measure someone’s weight, we might measure it to 2 decimal places (for example, 71.45kg). But in reality, we could go on forever with these decimal places - we just happen to round the values usually.
-
Univariate data is data that only involves one variable.
Presenting univariate data for categorical variables
- A
bar chart plots each category along the x-axis of a graph, and uses vertical bars to represent how much of the data lies in each category. The height of each bar shows exactly how many observations are in that category.
- A
pie chart is constructed by computing \( \frac{\text{number in each category}}{\text{total number of observations (i.e. data)}} \) as a percentage, and then plotting them as sectors on a circle. They show the
proportions of the categories the data lies in.
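As a rough illustration of the two charts above, the sketch below counts each category (the bar heights) and converts the counts to percentages (the pie sectors). The `breeds` data is made up for the example.

```python
from collections import Counter

# Hypothetical sample: the breed of each dog in a small survey.
breeds = ["corgi", "poodle", "corgi", "chihuahua",
          "corgi", "poodle", "corgi", "poodle"]

counts = Counter(breeds)   # bar chart heights: number in each category
total = len(breeds)        # total number of observations

# Pie chart sectors: each category's share of the total, as a percentage.
percentages = {breed: 100 * n / total for breed, n in counts.items()}

for breed in counts:
    print(f"{breed}: count = {counts[breed]}, share = {percentages[breed]:.1f}%")
```

The percentages always add up to 100, which is why they can fill a full circle.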
Presenting univariate data for numerical variables
- For discrete data, a
dot plot literally plots a dot for each of the numeric values. There is only a horizontal axis, and the dots are plotted above each corresponding value on the axis. (The number of dots above a value shows how many times that value occurs.)
- For discrete data, a
stem and leaf plot splits the units digit of each value from the other digits (the tens and above). For example, 25 is split into 2 and 5. 371 is split into 37 and 1.
The units form the ‘leaves’ of the plot, and the other digits in the numbers form the ‘stems’ of the plot. Each value is plotted on a stem as one leaf.
The actual value is computed as \((10\times \text{stem}) + \text{leaf} \). For example, a leaf of 5 on a stem of 12 will be \(12\times 10 + 5\), i.e. \(125\)
- A
histogram plots how often each value occurs, in a manner similar to that of a column chart. For discrete data, each exact value can be given its own column along the x-axis. For continuous data, values are usually grouped into ranges (intervals), and each column covers one range on the x-axis.
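The stem and leaf split described above can be sketched in a few lines. This is only an illustration for non-negative whole numbers; the function name `stem_and_leaf` and the sample data are made up for the example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative whole numbers by stem (all digits except the units)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # e.g. 371 -> stem 37, leaf 1
        plot[stem].append(leaf)
    return dict(plot)

data = [25, 31, 27, 125, 38, 25, 120]    # hypothetical data
plot = stem_and_leaf(data)

# Print one row per stem, with its leaves in ascending order.
for stem in sorted(plot):
    print(f"{stem:>3} | {' '.join(str(leaf) for leaf in plot[stem])}")

# Rebuild every value from its stem and leaf: (10 * stem) + leaf.
rebuilt = sorted(10 * stem + leaf
                 for stem, leaves in plot.items()
                 for leaf in leaves)
```

Because the values are sorted before splitting, the leaves on each stem come out in order, which is also what makes the median easy to read off such a plot.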
[Figure: example for continuous data]
Concepts in numeric data
- A
mode is any value in the data that occurs the most often. (It is usually only observed for discrete values, but it can also be used for categorical variables as well.) The data collected can potentially have more than one mode.
- The
shape of the data reflects how evenly spread out it is compared to where it usually peaks.
- Data that is roughly evenly spread out on both sides of its peak is
symmetric.
- Data that tends to lean more on the lesser end (i.e. bunched to the left in the histogram/dot plot/..., with a tail stretching out to the right) is
positively skewed.
- Data that tends to lean more on the greater end (i.e. bunched to the right in the histogram/dot plot/..., with a tail stretching out to the left) is
negatively skewed.
-
Outliers are values in the data that appear noticeably distinct (really far away) from all of the rest.
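To make the idea of "more than one mode" concrete, here is a small sketch that counts how often each value occurs and keeps every value tied for the highest count. The household data is invented for the example; `statistics.multimode` is part of the Python standard library (version 3.8 onwards).

```python
from collections import Counter
from statistics import multimode

household_sizes = [2, 4, 3, 4, 1, 2, 5, 4, 2]   # hypothetical discrete data

# Count how often each value occurs, as a dot plot would show.
frequencies = Counter(household_sizes)
highest = max(frequencies.values())

# Every value that reaches the highest count is a mode.
modes = sorted(value for value, count in frequencies.items() if count == highest)
print(modes)   # [2, 4]

# The standard library agrees, and also handles categorical data.
print(multimode(household_sizes))
print(multimode(["slow", "fast", "slow", "medium"]))
```

Here both 2 and 4 occur three times, so the data has two modes.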
Central tendency
-
Central tendency seeks to look at what is roughly the "middle" of where all our data is.
- The
mean is one such measure. It reflects the
average of all of the values. The formula is given by
\[ \overline{x} = \frac{\sum x_i}{n}. \]
This says to first add every value we have, and then divide by the number of values given.
- The
median is another such measure. It literally seeks to find the value in the middle, if we arrange all our values in ascending order. As a formula, it is given by
\[ \text{median} = \left( \frac{n+1}{2} \right)^{\text{th}}\text{ data value.}\]
To compute the median by brute force, arrange the data in order. (For stem and leaf plots, this will already be done.) Then cross out the smallest and largest value. Repeat this until we only have 1 or 2 values left.
If we only have 1 value, that is the median. If we have 2 values left, just compute the average of those two values. (This is the case where \(n\) is even, so \(\frac{n+1}{2}\) falls halfway between two positions.)
- The mode
can be used as a measure of central tendency. (Personally though, I do not advise it.)
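The mean formula and the cross-out method for the median can be sketched as follows. The function names and sample data are made up for the example, and the result is checked against `statistics.median` from the standard library.

```python
from statistics import median

def mean(values):
    """Add every value, then divide by the number of values."""
    return sum(values) / len(values)

def median_by_crossing_out(values):
    """Cross out the smallest and largest values until 1 or 2 remain."""
    remaining = sorted(values)
    while len(remaining) > 2:
        remaining = remaining[1:-1]   # drop the current smallest and largest
    # One value left: that is the median. Two left: average them.
    return sum(remaining) / len(remaining)

odd = [7, 3, 9, 4, 5]         # hypothetical data, n = 5
even = [7, 3, 9, 4, 5, 12]    # hypothetical data, n = 6

print(mean(odd))                      # 28 / 5 = 5.6
print(median_by_crossing_out(odd))    # 5.0, the 3rd value of 3, 4, 5, 7, 9
print(median_by_crossing_out(even))   # 6.0, the average of 5 and 7
print(median(odd), median(even))      # the library gives the same medians
```

For \(n = 5\) the formula points at position \(\frac{5+1}{2} = 3\); for \(n = 6\) it points at position 3.5, i.e. halfway between the 3rd and 4th values.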
Spread
-
Spread seeks to examine how far the data tends to branch out from some 'centre'. It essentially looks at how spread out the data is.
- The
standard deviation is one such measure, taken with respect to the mean. It is given by the formula \(s = \sqrt{\frac{\sum (x_i - \overline{x})^2}{n-1}} \), which will usually be computed with technology.
- The
minimum and
maximum are respectively the smallest and largest values in your data. They also serve as a measure of spread, but they can be misleading because outliers influence them a lot.
- The
lower quartile and
upper quartile help determine a measure of spread. Unlike the median, which reflects the middle-most value, the quartiles mark the 25% and 75% cutoffs: roughly a quarter of the data lies below the lower quartile, and roughly a quarter lies above the upper quartile.
- The
interquartile range is the measure of spread based on the quartiles. It is found by computing the difference of the upper and lower quartiles, i.e. upper quartile minus lower quartile.
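The measures of spread above can be computed by hand as in the sketch below. One assumption to flag: there are several conventions for quartiles, and this sketch takes each quartile as the median of the lower or upper half of the ordered data, leaving the overall median out of both halves when \(n\) is odd. Calculators and software may use a different convention and give slightly different quartiles. The function names and data are made up for the example.

```python
from math import sqrt
from statistics import median, stdev

def sample_standard_deviation(values):
    """s = sqrt( sum((x_i - mean)^2) / (n - 1) )."""
    n = len(values)
    x_bar = sum(values) / n
    return sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

def quartiles(values):
    """Assumed convention: Q1 and Q3 are the medians of the lower and upper
    halves; for odd n the overall median belongs to neither half."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

data = [2, 4, 4, 5, 7, 9, 11]   # hypothetical data, mean = 6

s = sample_standard_deviation(data)
q1, q3 = quartiles(data)
iqr = q3 - q1

print(round(s, 4))              # square root of 60 / 6 = 10
print(min(data), max(data))     # 2 11
print(q1, q3, iqr)              # 4 9 5
print(stdev(data))              # library version, also divides by n - 1
```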
Comparing multiple sets of data for a numeric variable
- The
five-point summary of one set of data consists of (usually in this order):
- Minimum
- Lower quartile \(Q_1\)
- Median \(Q_2\)
- Upper quartile \(Q_3\)
- Maximum
- A
boxplot represents the data in a systematic way, along the \(x\)-axis. It shows the five-point summary.
- A box is drawn with \(Q_1\) and \(Q_3\) being the endpoints.
- The median is represented by a vertical line inside the box.
- Usually, a whisker is drawn from each end of the box, extending to the minimum and maximum.
- However, outliers are not represented in the whiskers. Outliers are plotted as separate points beyond the whiskers. If we have outliers, the whiskers stop at the smallest/largest values that are
not outliers.
-
Parallel box plots involve plotting the boxplot for each set of data on the
same set of axes. This allows us to compare the centre and spread of the data sets side by side.
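The numbers behind a pair of parallel box plots can be sketched as below. Two assumptions are not stated in the notes above: quartiles are taken as medians of the lower and upper halves (excluding the overall median when \(n\) is odd), and a value counts as an outlier if it lies more than 1.5 times the interquartile range beyond the box, which is a common convention rather than the only possible rule. The function names and both data sets are made up for the example.

```python
from statistics import median

def five_point_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])      # assumption: median-of-halves quartiles
    q3 = median(ordered[-half:])
    return ordered[0], q1, median(ordered), q3, ordered[-1]

def whisker_ends(values, k=1.5):
    """Assumed rule: outliers lie more than k * IQR beyond Q1 or Q3."""
    _, q1, _, q3, _ = five_point_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in values if low_fence <= x <= high_fence]
    outliers = sorted(x for x in values if x < low_fence or x > high_fence)
    # Whiskers stop at the smallest/largest values that are not outliers.
    return min(inside), max(inside), outliers

group_a = [2, 4, 4, 5, 7, 9, 11]    # hypothetical data
group_b = [3, 5, 6, 6, 8, 9, 30]    # hypothetical data with one extreme value

for name, group in [("A", group_a), ("B", group_b)]:
    print(name, five_point_summary(group), whisker_ends(group))
```

In this example group B's maximum of 30 falls outside the fences, so it would be drawn as a separate point and the upper whisker would stop at 9.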