AP Statistics Study Guide
By Geoffrey Gao
   ii) Two events are disjoint, or mutually exclusive, if the two events cannot both happen simultaneously
2) RANDOM VARIABLES
a) Random Variables are variables that represent the different numbers associated with the potential outcomes of a certain situation
   i) A Discrete random variable has only a countable number of values
   ii) A Continuous random variable can take any value within a range
   iii) The Expected Value of a random variable X is the sum of the products obtained by multiplying each value by its corresponding probability p
      (1) Equation: E(X) = μx = Σ(xi · pi)
   iv) The Variance is the mean of the squared deviations
      (1) Equation: var(X) = σ² = Σ(xi − μx)² · pi
   v) The Standard Deviation is the square root of the variance
      (1) Equation: σ = √(Σ(xi − μx)² · pi)
b) Bernoulli Trials are those that satisfy the following conditions
   i) There are only two possible outcomes on each trial: success and failure
   ii) The probability of success is the same on every trial
   iii) The trials are independent. If this assumption is violated, it is still acceptable if the sample is smaller than 10% of the population
c) Binomial Probability Distributions deal with HOW MANY successes occur in a fixed number of trials
   i) The mean of a binomial distribution describes the expected number of successes
      (1) Equation: μx = np
   ii) The standard deviation is the following:
      (1) Equation: σ = √(npq)
   iii) Probability Equation, where X is the number of successes in n trials
      (1) P(X = x) = nCx · p^x · q^(n−x)
   iv) Calculations
      (1) Binompdf(n,p,x) gives the probability of exactly x successes in n trials, where p is the probability of success on a single trial
      (2) Binomcdf(n,p,x) gives the cumulative probability of x or fewer successes in n trials, where p is the probability of success on a single trial
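The calculator functions Binompdf and Binomcdf above can be sketched directly from the formulas in this section; the free-throw scenario and the values n = 10, p = 0.6 below are made up for illustration:

```python
import math

def binompdf(n: int, p: float, x: int) -> float:
    """Probability of exactly x successes in n trials: nCx * p^x * q^(n-x)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def binomcdf(n: int, p: float, x: int) -> float:
    """Cumulative probability of x or fewer successes in n trials."""
    return sum(binompdf(n, p, k) for k in range(x + 1))

# Hypothetical example: 10 free throws with a 60% success rate
n, p = 10, 0.6
mean = n * p                        # mu = np = 6.0 expected successes
sd = math.sqrt(n * p * (1 - p))     # sigma = sqrt(npq)
print(mean, round(sd, 4))
print(round(binompdf(n, p, 7), 4))  # P(X = 7)
print(round(binomcdf(n, p, 7), 4))  # P(X <= 7)
```

The pdf gives a single-outcome probability; the cdf just sums the pdf from 0 up to x, matching the "x or fewer successes" description above.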
d) Geometric Probability Distributions deal with WHEN the first success occurs in a chain of events
   i) The expected value of a geometric distribution is the expected trial number of the first success
      (1) Equation: E(X) = 1/p
   ii) Standard Deviation
      (1) Equation: σ = √(q/p²) = √q / p
   iii) Probability Equation (where X is the number of trials until the first success occurs)
      (1) Equation: P(X = x) = q^(x−1) · p
   iv) Calculations
      (1) Geometpdf(p,x) solves the probability density function. You specify the probability of success (p) and the trial number of the first success (x)
      (2) Geometcdf(p,x) solves the cumulative distribution function. You specify the probability of success (p) and the value x, and it calculates the probability of a success on or before the xth trial
      (b) Skewed Left
         (i) Mean < Median
      (c) Skewed Right
         (i) Mean > Median
   iii) Spread
      (1) Scope of the values from smallest to largest
      (2) Equation: Interquartile Range (IQR) = Q3 − Q1
   iv) Outliers
      (1) Outliers are extreme values that can be the result of natural chance variation or errors in measurement
      (2) Equation: Outliers lie below Q1 − 1.5·IQR or above Q3 + 1.5·IQR
   v) Standard Deviation/Variance
      (1) Standard Deviation: average distance from the mean
         (a) Equation: σ = √(Σ(x − μ)² / n)
      (2) Variance:
         (a) Equation: σ² = Σ(x − μ)² / n
   vi) Transformations
      (1) Adding a value to every data point changes the mean/median but not the SD. Multiplying every data point changes all three (multiply each by the factor to get the new value)
3) Normal Distributions
a) Normal Distributions are bell-shaped and symmetric and have an infinite base.
b) Empirical Rule (68-95-99.7 rule): 1 standard deviation away on each side of the mean contains 68% of all values, 2 standard deviations contain 95%, and 3 contain 99.7%.
c) Z-Scores tell us how many standard deviations a certain value is away from the mean.
   i) Equation: z = (x − μ) / σ
   ii) Evaluating Z-Scores:
      (1) Equation: Finding a z-score from a percentile: invNorm(%)
      (2) Equation: normalcdf(lower, upper) finds the percentage between two z-scores
4) Regressions
a) A graphical display called a scatterplot gives an immediate visual impression of a possible relationship between two variables, while a numerical measurement, called a correlation coefficient, is often used as a quantitative value of the strength of a linear relationship
   i) r is the correlation coefficient. It ranges from −1 to 1. Values of 1 and −1 indicate the strongest linear associations, and 0 indicates no linear association.
      Positive r values indicate a positive relationship (positive slope), and negative r values indicate a negative relationship. Correlation does not imply causation! It only measures the strength of a linear relationship
   ii) R² is called the coefficient of determination. It is found by squaring the r-value. When you interpret an R² value, you make the statement: [R²] of the variability in [y-axis] can be explained by the linear association with [x-axis].
b) The Line of best fit is the line that gives the best predictions for values given a set of data. We wish to minimize the residual values
   i) Residual Value: Observed − Expected
   ii) Equation: b1 = r · (sy / sx) (slope of the LSRL)
c) Residual Plots
   i) The residual plot is made up of the residuals of all the values. The sum of the residuals is always 0. A large R² value, a small total residual magnitude, and no clear pattern in the residual plot make the regression line an appropriate one
   ii) Influential points are those that sharply change the regression line.
   iii) Transformations are the altering of the y values, the x values, or both to achieve a non-patterned residual plot. Usually if the residual plot has a pattern, a linear model is inappropriate and a nonlinear model is more appropriate (thus the transformation).
      (1) Exponential: log of Y
      (2) Power: log of Y and log of X
      (3) Quadratic: square root of Y
      (4) Reciprocal: 1/Y
      (5) Logarithmic: log of X
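The least-squares line described above can be sketched using the slope formula b1 = r(sy/sx); the small data set below is made up for illustration:

```python
import math

def lsrl(xs, ys):
    """Least-squares regression line: returns (intercept, slope, r)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    r = sum((x - xbar) * (y - ybar)
            for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b1 = r * sy / sx               # slope: b1 = r * (sy / sx)
    b0 = ybar - b1 * xbar          # the LSRL passes through (xbar, ybar)
    return b0, b1, r

# Hypothetical data set
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, r = lsrl(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(round(b1, 4), round(r ** 2, 4))
print(sum(residuals))   # always 0 (up to floating-point rounding)
```

Note that the residuals really do sum to zero, as stated above: that is a built-in property of the least-squares fit, not a sign of a good model on its own.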
   i) A sample is biased if in some critical way it does not represent the population. The main technique to avoid bias is to incorporate randomness
d) Experiment vs. Observational Study
   i) An Experiment is a controlled study. In an experiment, an action is taken on one or more of the groups and the response is observed. There are often treatment groups and control groups. Good experimental designs include:
      (1) Controls: A group that receives similar conditions as the other groups but without the treatment. This is used as a baseline comparison for the response measurement
      (2) Blocking: A process in which the subjects are divided into representative groups (such as by gender) to bring certain differences directly into the picture
      (3) Randomization: Unknown and uncontrollable differences are handled by randomizing who receives which treatments
   ii) An Observational Study is a study in which there is no choice in regard to who goes into the treatment and control groups. No action is taken; it is merely an observation of what has occurred. Observational studies on the impact of one variable on another often fail because explanatory variables are confounded with other variables
      (1) Confounding Variables are variables that are not accounted for in the original design.
2) Planning and Conducting Surveys
a) Simple Random Sample
   i) A Simple Random Sample (SRS) is one in which every possible sample of the desired size has an equal chance of being selected.
      (1) A typical way of taking an SRS is assigning everyone a number and using a random number generator
b) Bias / Sampling Variation
   i) All surveys give a statistic as an estimate for a population parameter. Different samples give different statistics, all of which are estimates for the same population parameter, and so error, called sampling error, is present. This error tends to be smaller when the sample size is larger.
   ii) Bias is the tendency to favor the selection of certain members of a population.
      Here are a few explanations of common biases:
      (1) Response Bias: People don't want to be perceived as having unpopular or unsavory views, and so they respond untruthfully when face to face with an interviewer
      (2) Wording Bias: Non-neutral or poorly worded questions may lead to answers that are unrepresentative of the population
      (3) Selection Bias: Sampling from the wrong population. For instance, asking for opinions regarding welfare reform only in an area that is largely conservative.
      (4) Undercoverage Bias: This occurs when there is inadequate representation. Convenience Samples are based on choosing individuals who are easy to reach. These tend to produce under-representative data
      (5) Voluntary Response Bias: Samples based on individuals who offer to participate typically give too much emphasis to people with strong opinions
      (6) Nonresponse Bias: When certain people refuse to respond, are unreachable, or are too difficult to contact
c) Other Sampling Methods
   i) Systematic Sampling: Involves listing the population in some order, choosing a random starting point, and picking every person from the list at fixed intervals (e.g. every 10th person). This gives a reasonable sample as long as the original order of the list is unrelated to the variables under consideration
   ii) Stratified Sampling: Involves dividing the population into homogeneous (similar) groups called strata, and choosing random samples of persons from every stratum.
   iii) Cluster Sampling: Involves dividing the population into heterogeneous (mixed) groups called clusters, and taking random samples of persons from the clusters. Each cluster should resemble the entire population
   iv) Multistage Sampling: Taking multiple sampling steps.
3) Confounding, Control Groups, Placebo Effects, and Blinding
a) Experiments involve explanatory variables, called factors, which are believed to have an effect on response variables.
b) When there is uncertainty with regard to which variable is causing an effect, the variables are confounded
c) A lurking variable is a variable that drives two other variables, creating the mistaken impression that those two variables are related by cause and effect. Such linkages often arise through a common response
d) The placebo effect is the fact that many people respond to any kind of perceived treatment, even if the treatment is nothing
e) Blinding occurs when the subjects or the response evaluators don't know which subjects are receiving which treatments. Double blind is when both are unaware.
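The SRS procedure described earlier (assign everyone a number, then use a random number generator), along with a stratified variant, can be sketched as follows; the population size and the two strata are made up for illustration:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical population: 500 people, each assigned a number 1..500
population = list(range(1, 501))

# Simple Random Sample: every possible sample of size 25 is equally likely
srs = random.sample(population, 25)

# Stratified sample: divide into two hypothetical homogeneous strata,
# then take a random sample from each stratum
stratum_a = population[:250]
stratum_b = population[250:]
stratified = random.sample(stratum_a, 12) + random.sample(stratum_b, 13)

print(len(srs), len(stratified))
print(len(set(srs)) == len(srs))   # sampling is without replacement
```

`random.sample` draws without replacement, which matches the survey setting where the same person cannot be selected twice.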
   (2) If H0 is plausible (we fail to reject it), state in context that there is not sufficient evidence for Ha
2) Proportion Tests
a) 1-Prop and 2-Prop z-tests deal with proportions of populations. All proportions are between 0 and 1 and describe the proportion of a population with a certain characteristic.
b) Confidence Interval
   i) Conditions:
      (1) Randomization: Is the sample random?
      (2) Normality: np̂ ≥ 10 and n(1 − p̂) ≥ 10
      (3) Independence (population large enough): N > 10n
   ii) Equation: CI = p̂ ± Margin of Error
      (1) Margin of Error: z* · SE
   iii) Equation: Standard Error
      (1) 1-Prop: SE = √(p̂(1 − p̂) / n)
      (2) 2-Prop: SE = √[(p̂1(1 − p̂1)/n1) + (p̂2(1 − p̂2)/n2)]
c) P-Value (Z-Scores)
   i) Conditions:
      (1) Randomization: Is the sample random?
      (2) Normality: np ≥ 10 and nq ≥ 10
      (3) Independence (population large enough): N > 10n
   ii) Equation: Standard Deviation
      (1) 1-Prop: σ = √(pq/n)
      (2) 2-Prop: σ = √[p̂c(1 − p̂c)(1/n1 + 1/n2)]
         (a) Equation: p̂c = (x1 + x2)/(n1 + n2)
   iii) Note that for P-Values you use the hypothesized p and q to solve the equations, whereas for Confidence Intervals you use p̂!
d) Calculations
   (1) z* = invNorm(%)
   (2) Confidence Interval: 1-PropZInt, 2-PropZInt
   (3) P-Value: 1-PropZTest, 2-PropZTest
   (4) Evaluating P-value: normalcdf(lower, upper)
3) Sample Tests
a) 1-Sample and 2-Sample t-tests deal with the averages of populations. You will need to find the means and the standard deviations
b) Confidence Interval / P-Value
   i) Conditions:
      (1) Randomization: Is the sample random?
      (2) Normality: Graph with a histogram/box-and-whisker plot and describe the shape (unimodal + symmetric)
      (3) Independence (population large enough): N > 10n
   ii) Equation: CI = x̄ ± t* · SE(x̄)
   iii) Equation: Standard Error
      (1) 1-Sample: SE(x̄) = s/√n
      (2) 2-Sample: SE = √[(s1²/n1) + (s2²/n2)]
   iv) Equation: Degrees of Freedom: n − 1
   v) Calculations:
      (1) t* = invT(%, df)
      (2) Confidence Interval: TInterval, 2-SampTInt
      (3) P-Value: T-Test, 2-SampTTest
      (4) Evaluating P-Value: tcdf(lower, upper, degrees of freedom)
c) Matched Pair
   i) These occur when two treatments are applied to the same subjects in a sample. These are calculated the same as a 1-sample t-test, and you look at the differences in the data.
4) Chi-Squared Tests
a) Chi-Squared tests were derived to perform significance testing for categorical variables. The focus is on inferring the validity of a sample
   i) Equation: χ² = Σ (observed − expected)² / expected
   ii) Make sure to write the sum above out in a form like x1 + x2 + … + xn
b) Conditions
   i) Randomization: Is the sample chosen randomly?
   ii) Expected Cell Frequency: The expected counts in each cell are at least 5
   iii) Independence: N > 10n
c) Goodness of Fit
   i) Goodness of Fit is used to determine whether our observed data fits the theoretical distribution for that data
   ii) Equation: Degrees of Freedom = n − 1 (number of categories minus 1)
   iii) Equation: Expected = Sum / number of categories (when the theoretical distribution is uniform)
   iv) Hypotheses
      (1) H0: There is no difference between each event
      (2) Ha: There is a difference between each event
   v) Calculations:
      (1) χ²GOF-Test (goodness of fit test)
d) Homogeneity Test
   i) The Homogeneity Test is used to compare the distribution of categories. We check whether the distribution across categories is the same in multiple populations/samples
   ii) Equation: Degrees of Freedom = (rows − 1)(columns − 1)
   iii) Hypotheses
      (1) H0: The proportions across each sample are equal
      (2) Ha: The proportions across each sample are different
   iv) Calculations
      (1) χ²-Test (input values into a matrix)
      (2) χ²cdf(lower, upper, df)
e) Independence Test
   i) The Independence Test is used to gain evidence of association between two categorical variables. The question will usually ask "are the two associated?"
   ii) Equation: Degrees of Freedom = (rows − 1)(columns − 1)
   iii) Hypotheses
      (1) H0: The two events are INDEPENDENT/NOT ASSOCIATED
      (2) Ha: The two events are DEPENDENT/ASSOCIATED
   iv) Calculations
      (1) χ²-Test (input values into a matrix)
      (2) χ²cdf(lower, upper, df)
5) Regression Tests
a) Will usually ask you if there is evidence that the relationship between two things is linear
b) Conditions:
   i) Linearity Assumption: Check the scatterplot to see if the shape is linear
   ii) Independence Assumption: Check the residual plot. The residuals should appear randomly scattered
   iii) Equal Variance Condition: Check the residual plot again. The vertical spread of the residuals should be roughly the same everywhere
   iv) Normal Population Assumption: Check a histogram of the residuals. The distribution of the residuals should be unimodal and symmetric
c) Hypotheses
   (1) H0: β = 0 (no linear association)
   (2) Ha: β > 0 (positive linear association), β < 0 (negative linear association), or β ≠ 0 (some linear association)
d) Confidence Interval
   i) Equation: SEb = s / √(Σ(x − x̄)²)
   ii) Equation: s = √(Σ(y − ŷ)² / (n − 2))
   iii) Calculation
      (1) LinRegTInt
e) P-Value
   i) Equation: t = b / SEb
   ii) P-value: P(a result at least this extreme | β = 0)
   iii) Calculation
      (1) LinRegTTest
f) Degrees of Freedom
   i) Equation: n − 2
6) Errors
a) Type I
   i) This occurs when you reject the null when you shouldn't have. In reality, the null is true.
b) Type II
   i) This occurs when you fail to reject the null when you should have. In reality, the null is false.
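The slope t-test above (t = b/SEb, with s computed from the residuals and df = n − 2) can be sketched as follows; the data set is made up for illustration, and the resulting t would then be evaluated with tcdf or LinRegTTest:

```python
import math

def slope_t_test(xs, ys):
    """t statistic and df for H0: beta = 0, using t = b1 / SE(b1)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    # s: standard deviation of the residuals, with n - 2 degrees of freedom
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    se_b1 = s / math.sqrt(sxx)     # SEb = s / sqrt(sum((x - xbar)^2))
    return b1 / se_b1, n - 2

# Hypothetical, strongly linear data set
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
t, df = slope_t_test(xs, ys)
print(round(t, 2), df)   # a large |t| gives a tiny P-value under H0
```

Because this data set is nearly perfectly linear, the t statistic is very large, which is exactly the situation where the P-value leads you to reject H0: β = 0.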