Author Topic: A quick glossary of statistics concepts  (Read 4262 times)

RuiAce

A quick glossary of statistics concepts
« on: June 17, 2019, 03:49:35 pm »
Remember to register here for FREE to ask any questions you may come across in your QCE studies!

Statistics plays a huge role in Unit 3, and in fact builds directly onto the Unit 2 concepts! The concepts are introduced one after the other and this can become overwhelming quite quickly. For this topic, it is also nice to have a compilation of terms to refer back to at any point.

This glossary includes all the Unit 2 work alongside the Unit 3 work. Once again, do not become reliant on this!

Unit 2 work
Basics of data
- Categorical variables are those that take on values which are categories, not numerical values.
- A categorical variable is nominal if the values do not involve some kind of ordering. For example, with a dog’s breed, there’s no point in assigning a corgi, poodle or chihuahua some order.
- A categorical variable is ordinal if the values can somehow be ordered. For example, an ordering system appears when we categorise speed/velocity into slow, medium or fast.

- Numeric variables are those that take on values that are numbers.
- A numeric variable is discrete if we can list out every value it can take, or it can only take integer values (i.e. whole numbers). For example, the number of people living in a household can only be an integer value, say, 4.
- A numeric variable is continuous if we cannot do the above. For example, when we measure someone’s weight, we might measure it to 2 decimal places (for example, 71.45kg). But in reality, we could go on forever with these decimal places - we just happen to round the values usually.

- Univariate data is data that only involves one variable.

Presenting univariate data for categorical variables
- A bar chart plots each category along the x-axis of a graph, and uses vertical bars to represent how much of the data lies in each category. They essentially reflect exactly how much is in each category.
- A pie chart is constructed by computing \( \frac{\text{number in each category}}{\text{total number of observations (i.e. data)}} \) as a percentage, and then plotting them as sectors on a circle. They show the proportions of the categories the data lies in.
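
To make the pie chart calculation concrete, here is a minimal Python sketch. The list of dog breeds is made-up data, and the variable names are just for illustration. It counts how many observations fall in each category (what a bar chart shows) and then converts each count to a percentage and a sector angle (what a pie chart shows).

```python
from collections import Counter

# Hypothetical sample: the breeds of 8 dogs (made-up data).
breeds = ["corgi", "poodle", "corgi", "chihuahua",
          "corgi", "poodle", "corgi", "chihuahua"]

# Number of observations in each category: the heights of the bars in a bar chart.
counts = Counter(breeds)
total = sum(counts.values())

# Percentage of all observations in each category: the pie chart proportions.
percentages = {breed: 100 * n / total for breed, n in counts.items()}

# Each sector takes up the same proportion of the full 360 degrees.
angles = {breed: 360 * n / total for breed, n in counts.items()}

for breed in counts:
    print(f"{breed}: {counts[breed]} dogs, {percentages[breed]}%, {angles[breed]} degrees")
```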

Presenting univariate data for numerical variables
- For discrete data, a dot plot literally plots a dot for each of the numeric values. There is only a horizontal axis, and the dots are plotted above each corresponding value on the axis. (The number of dots above a value shows how many times that value occurs in the data.)

- For discrete data, a stem and leaf plot splits the units digit of each value from the digits in front of it (the tens and higher). For example, 25 is split into 2 and 5. 371 is split into 37 and 1.
The units form the ‘leaves’ of the plot, and the other digits in the numbers form the ‘stems’ of the plot. Each value is plotted on its stem as one leaf.

The actual value is computed as \((10\times \text{stem}) + \text{leaf} \). For example, a leaf of 5 on a stem of 12 will be \(12\times 10 + 5\), i.e. \(125\).
- A histogram plots how much of each value we have, in a manner similar to that of a column chart. For discrete data, the exact value can be plotted at each point along the x-axis. For continuous data, usually a range of values is grouped into one interval along the x-axis.
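
The stem and leaf split described above can be sketched in a few lines of Python. This is only an illustration: the data is made up, the function name `stem_and_leaf` is my own, and it assumes non-negative whole numbers. Dividing by 10 gives the stem, and the remainder is the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative whole numbers into {stem: [leaves]}, leaves in order."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 371 -> stem 37, leaf 1
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical data for illustration.
data = [25, 371, 125, 21, 128, 29]
plot = stem_and_leaf(data)

# Print one row per stem that actually occurs in the data.
for stem, leaves in plot.items():
    print(f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}")
```

A hand-drawn plot would normally also list the stems that have no leaves; this sketch skips them to stay short.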

Example for continuous data

Concepts in numeric data
- A mode is any value in the data that occurs the most often. (It is usually only observed for discrete values, but it can also be used for categorical variables.) The data collected can potentially have more than one mode.

- The shape of the data reflects how evenly spread out it is compared to where it usually peaks.
- Data that is roughly spread out evenly on both sides of the peak is symmetric.
- Data that bunches up at the lesser end (i.e. to the left in the histogram/dot plot/...), with a tail stretching out to the right, is positively skewed.
- Data that bunches up at the greater end (i.e. to the right in the histogram/dot plot/...), with a tail stretching out to the left, is negatively skewed.

- Outliers are values in the data that appear noticeably distinct (really far away) from all of the rest.

Central tendency
- Central tendency seeks to look at what is roughly the "middle" of where all our data is.

- The mean is one such measure. It reflects the average of all of the values. The formula is given by
\[ \overline{x} = \frac{\sum x_i}{n}. \]
This says to first add every value we have, and then divide by the number of values given.

- The median is another such measure. It literally seeks to find the value in the middle, if we arrange all our values in ascending order. As a formula, it is given by
\[ \text{median} = \left( \frac{n+1}{2} \right)^{\text{th}}\text{ data value.}\]
To compute the median by brute force, arrange the data in order. (For stem and leaf plots, this will already be done.) Then cross out the smallest and largest value. Repeat this, until we only have 1 or 2 values left.
If we only have 1 value, that is the median. If we have 2 values left, just compute the average of those two values.

- The mode can be used as a measure of central tendency. (Personally though, I do not advise it.)
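
Here is a small Python sketch of the three measures, using made-up data. The function `median_by_crossing_out` is my own name for the brute force method described above, and it assumes the data is not empty. The standard library's `statistics.median` is used only as a cross-check.

```python
import statistics

def median_by_crossing_out(values):
    """Cross out the smallest and largest values until 1 or 2 remain."""
    ordered = sorted(values)
    while len(ordered) > 2:
        ordered = ordered[1:-1]  # drop the current smallest and largest
    # One value left: that is the median. Two left: average them.
    return sum(ordered) / len(ordered)

# Hypothetical data for illustration.
data = [3, 7, 7, 2, 9, 4]

mean = sum(data) / len(data)            # add every value, divide by n
median = median_by_crossing_out(data)   # 2 3 4 7 7 9 -> 3 4 7 7 -> 4 7 -> 5.5
modes = statistics.multimode(data)      # list of every value tied for most frequent

print(mean, median, modes)
print(median == statistics.median(data))  # cross-check against the library
```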

Spread
- Spread seeks to examine how far the data tends to branch out from some 'centre'. It essentially looks at how spread out the data is.

- The standard deviation is one such measure, taken with respect to the mean. It is given by the formula \(s = \sqrt{\frac{\sum (x_i - \overline{x})^2}{n-1}} \), which will usually be computed with technology.

- The minimum and maximum are respectively the smallest and largest values in your data. They also serve as a measure of spread, but can be problematic because outliers influence them a lot.

- The lower quartile and upper quartile help determine a measure of spread. Unlike the median which reflects the middle-most value, the quartiles reflect the lower and upper 25%-cutoffs respectively.
- The interquartile range is the measure of spread built from the quartiles. It is found by computing the difference of the upper and lower quartiles, i.e. \(Q_3 - Q_1\).
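
The sketch below computes the standard deviation straight from the formula, and the interquartile range from the quartiles. One assumption to flag loudly: there are several conventions for finding quartiles, and this sketch uses the "median of each half" method (leaving the middle value out of both halves when \(n\) is odd). Your calculator or textbook may use a different convention and give slightly different quartiles. The data and function names are made up for illustration, and at least 2 values are assumed.

```python
import math
import statistics

def sample_sd(values):
    """s = sqrt( sum of (x_i - mean)^2 / (n - 1) )."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves (one convention)."""
    ordered = sorted(values)
    half = len(ordered) // 2      # the middle value is left out when n is odd
    lower, upper = ordered[:half], ordered[-half:]
    return statistics.median(lower), statistics.median(upper)

# Hypothetical data for illustration.
data = [2, 4, 4, 5, 7, 9, 11]

s = sample_sd(data)
q1, q3 = quartiles(data)
iqr = q3 - q1

print(s, q1, q3, iqr)
```

For this data the halves are 2, 4, 4 and 7, 9, 11, so the quartiles are 4 and 9 and the interquartile range is 5.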

Comparing multiple sets of data for a numeric variable
- The five-point summary of one set of data consists of (usually in this order):
  - Minimum
  - Lower quartile \(Q_1\)
  - Median \(Q_2\)
  - Upper quartile \(Q_3\)
  - Maximum

- A boxplot represents the data in a systematic way, along the \(x\)-axis. It shows the five-point summary.
  - A box is drawn with \(Q_1\) and \(Q_3\) being the endpoints.
  - The median is represented by a vertical line inside the box, drawn at the median's value (which is not necessarily halfway along the box).
  - Usually, a whisker will be drawn extending to the minimum and maximum.
  - However, outliers are not represented in the whisker. Outliers are plotted as separate points that hang off the whisker. If we have outliers, the whisker stops at the smallest/largest values that are not outliers.

- Parallel box plots involve plotting the boxplot for each set of data on the same set of axes. This allows us to compare the centres and spreads of the data sets.
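
Here is a rough Python sketch of the numbers behind a boxplot. Two assumptions are not from the glossary itself: quartiles are found with the "median of each half" convention, and a value counts as an outlier if it lies more than 1.5 interquartile ranges beyond a quartile. That 1.5 rule is a common convention, but check which rule your course uses. The data and function names are made up.

```python
import statistics

def five_point_summary(values):
    """(minimum, Q1, median, Q3, maximum), quartiles as medians of each half."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    return ordered[0], q1, statistics.median(ordered), q3, ordered[-1]

def whisker_ends(values, k=1.5):
    """Whisker endpoints and outliers, ASSUMING the k * IQR outlier rule."""
    _, q1, _, q3, _ = five_point_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in values if low_fence <= x <= high_fence]
    outliers = sorted(x for x in values if x < low_fence or x > high_fence)
    # Whiskers stop at the smallest/largest values that are not outliers.
    return min(inside), max(inside), outliers

# Hypothetical data with one very large value.
data = [2, 4, 4, 5, 7, 9, 30]

summary = five_point_summary(data)
low, high, outliers = whisker_ends(data)
print(summary, low, high, outliers)
```

With this data the maximum is 30, but under the assumed rule it is an outlier, so the upper whisker stops at 9 and 30 is plotted as a separate point.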


- Bivariate data is data that involves two variables.
- Association/Correlation is said to occur when we observe some kind of relationship between two variables in bivariate data.

Two categorical variables
- A two-way frequency table records one variable along the rows and the other along the columns. The entries denote how many observations we have in each combination of categories.
- Each row sum involves adding across the row to find how many observations altogether fall in that row's category.
- Each column sum involves adding down the column to find how many observations altogether fall in that column's category.

Multiple images have been taken from here due to how conveniently everything was laid out.

- A percentaged two-way frequency table takes proportions with respect to some total in the data (the whole table, a row, or a column).
- Percentages taken with respect to the entire table measure proportions in each pair of categories.

- Percentages taken with respect to a row or column measure proportions within said row/column. These are more commonly used to find association.
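
As an illustration of the different percentages, here is a short Python sketch on a made-up two-way table (year level against preferred pet; the counts are invented). It computes the row sums, column sums, percentages of the whole table, and percentages within each row.

```python
# Hypothetical counts: rows are year levels, columns are preferred pets.
table = {
    "Year 11": {"dog": 30, "cat": 20},
    "Year 12": {"dog": 15, "cat": 35},
}
columns = ["dog", "cat"]

# Row sums: add across each row. Column sums: add down each column.
row_sums = {row: sum(cells.values()) for row, cells in table.items()}
column_sums = {col: sum(table[row][col] for row in table) for col in columns}
grand_total = sum(row_sums.values())

# Percentages with respect to the entire table.
percent_of_total = {
    row: {col: 100 * n / grand_total for col, n in cells.items()}
    for row, cells in table.items()
}

# Percentages with respect to each row (useful for spotting association).
percent_of_row = {
    row: {col: 100 * n / row_sums[row] for col, n in cells.items()}
    for row, cells in table.items()
}

print(row_sums, column_sums, grand_total)
print(percent_of_row)
```

In this made-up table, 60% of the Year 11 row prefers dogs compared with 30% of the Year 12 row. A difference like that between the row percentages is what suggests an association between the two variables.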

Two numerical variables
- A scatterplot is a plot of every point in our data on a number plane. One of the variables is assigned to the \(x\)-coordinate, whilst the other is assigned to the \(y\)-coordinate.

- The direction of association reflects whether the increase of one variable influences an increase or decrease in the other variable.
- If an increase in one variable tends to increase the other variable, the association is positive. The scatterplot tends to slope upwards.
- If an increase in one variable tends to decrease the other variable, the association is negative. The scatterplot tends to slope downwards.

- The form of association reflects what 'shape' the association appears to look like.
- When the points plotted appear to show a straight-line trend, we say that the association is linear.
- When they appear to form some other shape, we say the association is non-linear. For example, trends that exhibit a parabolic trend have a quadratic association.

- The strength of association measures the extent to which we believe the variables are correlated.
- When the strength of association is strong, we generally expect any association to be clear. The points tend to be closely knit together.
- When the strength of association is moderate, we would probably still expect any association to be clear, but they might not be so tightly bunched. The points would likely be somewhat spread out.
- When the strength of association is weak, we would barely be able to notice some kind of trend.

- When there is no association, the scatterplot will look like a randomly generated bunch of points, showing no real special behaviour.

- Pearson's correlation coefficient assigns a value between -1 and 1 to study strength and direction of linear association. (It is not an appropriate measure for other forms of association.) It is given by the formula
\[ r= \frac1{n-1} \sum \left( \frac{x_i-\overline{x}}{s_x} \right)\left( \frac{y_i-\overline{y}}{s_y} \right) \]
- The sign of the correlation corresponds to the direction of linear association. (Positive means positive association, negative means negative association.)
- The magnitude of the correlation corresponds to the strength of linear association. The closer the number's magnitude is to 1, the stronger the association.
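
The formula for \(r\) can be followed term by term in code. This is a minimal sketch with made-up data (hours studied against marks). The function name `pearson_r` is my own, and in practice you would usually get \(r\) from technology.

```python
import math
import statistics

def pearson_r(xs, ys):
    """r = 1/(n-1) * sum of ((x_i - x_bar)/s_x) * ((y_i - y_bar)/s_y)."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data for illustration.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 68]

r = pearson_r(hours, marks)
print(r)  # positive and close to 1: a strong, positive linear association
```

Reversing the order of `marks` would flip the sign of \(r\) without changing its magnitude, which matches the idea that the sign gives the direction and the magnitude gives the strength.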

Fitting the linear model
- The response variable is the variable we believe to be influenced by the other variable in bivariate data. Values of the response variable are plotted as the \(y\)-coordinate.
- The explanatory variable is the variable we believe to cause changes in the other variable. Often we can control the explanatory variable. Values of the explanatory variable are plotted as the \(x\)-coordinate.

- The least squares line is the line \(y=a+bx\) we believe to best fit the data. The coefficients satisfy the formulas:
\begin{align*}b&=r\frac{s_y}{s_x}\\ a &= \overline{y}-b\overline{x} \end{align*}
- The intercept \(a\) reflects what we expect our response variable to be when our explanatory variable is set to 0.
- The slope \(b\) reflects how much we expect our response variable to increase by (or decrease, if it is negative), when we increase our explanatory variable by 1.
- The least squares line is optimal in the sense that it minimises the sum of squared residuals. A residual is the vertical distance between a point on the scatterplot and the corresponding point on the least squares line. (The corresponding point has the same \(x\)-coordinate.)
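
The two formulas for \(a\) and \(b\) can be applied directly, as in this Python sketch. The data is made up, the function name is my own, and the residuals are taken as signed values (observed \(y\) minus the \(y\) on the line), which is an assumption about sign convention rather than something stated above.

```python
import math
import statistics

def least_squares_line(xs, ys):
    """Return (a, b) for y = a + b x, using b = r * s_y / s_x and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data for illustration.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 68]

a, b = least_squares_line(hours, marks)

# Signed residuals: observed value minus the value on the fitted line.
residuals = [y - (a + b * x) for x, y in zip(hours, marks)]

print(a, b)        # roughly 47.7 and 4.1 for this data
print(residuals)   # these are the values a residual plot would show
```

Here the slope of about 4.1 would be read as "each extra hour is associated with about 4.1 more marks", and the residuals add up to (essentially) zero, as they always do for a least squares line with an intercept.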

- A residual plot is simply a plot of all residuals after the least squares line has been fit. We believe that a linear association is appropriate if the points on the residual plot appear randomly scattered about the line \(y=0\).

- The coefficient of determination (\(R^2\)) is a value that describes how much we believe our response is influenced by our explanatory variable. (For example, if it equals 0.8, then we believe 80% of the variation in the response variable can be attributed to changes in the explanatory variable. The rest may be from other factors.)
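
To see where a value like 0.8 could come from, here is a sketch that computes \(R^2\) as \(1 - \frac{\text{sum of squared residuals}}{\text{total squared variation in } y}\). That formula is a standard definition but is not stated in this glossary, so treat it as an assumption. For a least squares line it gives the same number as squaring Pearson's \(r\), which the sketch also checks. The data is made up, and the slope is computed with the equivalent shortcut \(b = \frac{\sum (x_i-\overline{x})(y_i-\overline{y})}{\sum (x_i-\overline{x})^2}\) rather than via \(r\).

```python
import math
import statistics

# Hypothetical data for illustration.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 68]

x_bar, y_bar = statistics.mean(hours), statistics.mean(marks)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, marks))
s_xx = sum((x - x_bar) ** 2 for x in hours)
s_yy = sum((y - y_bar) ** 2 for y in marks)   # total squared variation in y

# Least squares line y = a + b x (shortcut form of b = r * s_y / s_x).
b = s_xy / s_xx
a = y_bar - b * x_bar

# Variation left over after fitting the line.
ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(hours, marks))

r_squared = 1 - ss_residual / s_yy
r = s_xy / math.sqrt(s_xx * s_yy)

print(r_squared)   # proportion of variation in marks attributed to hours
print(r ** 2)      # agrees with r_squared up to rounding
```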

- An interpolation involves making a prediction with our fitted line, for some value within the range of our data.
- An extrapolation involves making a prediction with our fitted line, for some value outside the range of our data. This tends to be dangerous, as the trend may no longer hold outside the range we are considering.
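
A quick sketch of the difference, in Python. The line \(y = 47.7 + 4.1x\) and the data range of 1 to 5 hours are made-up values for illustration, and the function name `predict` is my own. The function makes a prediction and labels it by checking whether \(x\) lies within the range of the data.

```python
import math

def predict(x, a, b, x_min, x_max):
    """Predict y = a + b x and say whether x is inside the range of the data."""
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return a + b * x, kind

# Hypothetical fitted line (marks against hours studied) and data range.
a, b = 47.7, 4.1
x_min, x_max = 1, 5

inside = predict(3.5, a, b, x_min, x_max)   # 3.5 hours is within 1 to 5
outside = predict(20, a, b, x_min, x_max)   # 20 hours is far outside 1 to 5

print(inside)
print(outside)
```

The second prediction comes out at roughly 130 marks. If the test were out of 100 that would be impossible, which is exactly the kind of danger that comes with extrapolating.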

Association and causation
- Causation occurs when we believe that a change in one variable must directly influence change in another variable.

- The statement association/correlation does not imply causation reflects how just because two variables share some association, we cannot confirm that one must directly influence the other.

- For example, a common response occurs when there's actually a third, hidden variable, that serves as an explanatory variable for two response variables.
« Last Edit: June 21, 2019, 05:18:56 pm by RuiAce »