STANDARD DEVIATION, MEAN, MEDIAN & MODE


ERRATA: The Mean of (1+3+3+5+7+10+100) / 7 is 18.43, NOT 22 like I
mentioned in the video. The text below has been corrected and a note
has been added to the appropriate point in the video

Mean (average) = The sum of all of the values divided by the number of
values. Least Robust
Median (“middle” value) = The value that is in the middle when all of the
values are arranged in ascending order. If there is an even number of
values there is no single middle value. Therefore, you take the average of
the two middle values. Robustness is in between mean and mode
Mode (most common value) = The value that appears the highest number
of times. Most Robust

If you are given the list of values 1, 3, 3, 5, 7, 10, what is the mean,
median, and mode?

 Mean = (1+3+3+5+7+10) / 6 = 4.83
 Median = average of the 2 middle values since there is an even number of
values. (3+5) / 2 = 4
 Mode = most frequent value = 3
If you take the list of values from the previous question but now add an
additional value of 100, how does the mean, median, and mode change?

 Mean = (1+3+3+5+7+10+100) / 7 = 18.43
 Median = middle value = 5
 Mode = most frequent value = 3
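
If you want to double-check these numbers, here is a quick sketch using Python’s built-in statistics module (the two lists are the ones from the questions above):

```python
from statistics import mean, median, mode

values = [1, 3, 3, 5, 7, 10]
print(mean(values))    # 4.83 -> (1+3+3+5+7+10) / 6
print(median(values))  # 4.0  -> average of the two middle values, (3+5) / 2
print(mode(values))    # 3    -> most frequent value

# Adding the outlier 100 moves the mean a lot, the median a little,
# and the mode not at all.
with_outlier = values + [100]
print(mean(with_outlier))    # ~18.43
print(median(with_outlier))  # 5
print(mode(with_outlier))    # 3
```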

Robustness
The above question illustrates how Robust, or resistant to change by an
extreme value, the three measures of central tendency are. You can see
that by adding one extreme value (an outlier) the mean has changed a lot
and mode hasn’t changed at all. This is because mean is the least
robust of the three values and mode is the most robust. Median is less
robust than mode, but more robust than mean.
On Step 1 you may also be asked to compare mean, median, and mode
in certain situations based on a histogram or set of values. For example,
the answer looks like “mean is greater than mode” rather than a precise
numerical answer. In most of these cases the data is skewed significantly
in one direction and is not normally distributed.

[Figure: a normal distribution compared with positively skewed (right-skewed) and negatively skewed (left-skewed) distributions]

Standard Deviation
Standard deviation (greek symbol σ) measures how much the values in a
data set differ from the mean. In other words, standard deviation
measures dispersion or variability in a set of values. A data set with
mostly similar values has a small standard deviation, while a data set with
very different values has a large standard deviation.
Standard deviation changes with changes in sample size (number of
values or participants). With small sample sizes random chance has a
bigger impact and therefore standard deviation for a small sample size is
generally larger. Studies with more values generally have smaller
standard deviations as chance plays less of a role.
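
As a rough illustration of dispersion, here is a small sketch (the two data sets are made up, and both have a mean of 10):

```python
from statistics import pstdev

tight = [9, 10, 10, 11]   # mostly similar values
spread = [1, 5, 10, 24]   # very different values

print(pstdev(tight))   # ~0.71 -> small standard deviation
print(pstdev(spread))  # ~8.69 -> large standard deviation
```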

Now that you are done with this video you should check out the next video in
the Biostats & Epidemiology for the USMLE Step 1 section which covers 2 by 2
Tables, False Positive, False Negative, True Positive & True Negative
2×2 TABLE, FALSE POSITIVE,
FALSE NEGATIVE, TRUE POSITIVE
& TRUE NEGATIVE

True Positive, True Negative, False Positive, and False Negative
Laboratory test results are usually numerical values, but these values are
often converted into a binary system. For example, a urine hCG pregnancy
test may give you values ranging from 0 to 30 mIU/mL, but the
numerical continuum of values can be condensed into two main categories
(positive and negative). We do this by setting a Cut-off Point. All
measurements above this cut-off value are categorized as “positive” and
all values below are “negative.” If you change that cut-off point the
positive vs. negative classifications (as well as TP, FP, TN, & FN)
change. In everyday life, positive things are good and negative things are
bad. But remember in most laboratory tests, a positive result means the
patient has a disease.
A True result is a lab result that matches the truth or our best estimate of
the truth based on the results of the best available test (called the Gold
Standard Test). So a true result would be a positive HIV test in a person
we know to clinically have HIV. A False measurement is obviously when
the result does not match the truth. “Good” tests have mostly true
measurements and few false measurements.
 True Positive (TP) = A diseased person who is correctly identified as having a
disease by the test
 False Positive (FP) = A healthy person that is incorrectly identified as having
the disease by the test
 True Negative (TN) = A healthy person who is correctly identified as healthy
by the test
 False Negative (FN) = A diseased person who is incorrectly identified as
healthy by the test

Here is another way to think about these definitions:

 All with Disease = TP + FN
 All without Disease = TN + FP
 All that Tested Positive = TP + FP
 All that Tested Negative = TN + FN
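
In code form these identities look something like this (the four counts are hypothetical):

```python
# Hypothetical counts from a screening test
TP, FP, TN, FN = 90, 30, 870, 10

all_with_disease = TP + FN      # 100
all_without_disease = TN + FP   # 900
all_tested_positive = TP + FP   # 120
all_tested_negative = TN + FN   # 880
```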

Two-by-Two Tables
Questions involving TP, FP, TN, and FN will usually have a two-by-two
table. Sometimes they will give you the actual table and other times they
will give you all of the data for the table in sentence form and you have to
make the table for yourself. You may have learned to refer to the boxes in
a two-by-two table as A, B, C & D. I am going to strongly recommend you
not do this. First off, those letter labels have no meaning. It is therefore
more likely for you to get confused and make a dumb mistake.
Additionally, the top left box in a two-by-two table may not always
represent true positive. Sometimes, test writers will mix up the order of
the columns and/or rows. I suggest using TP, TN, FP, and FN instead.
Sometimes they will give you an extra row and column that has totals.
Don’t let that throw you off. It is still a two-by-two table even though there
are 3 columns and 3 rows. They just try to save you a step of calculation
by giving you the row and column totals.

Now that you have finished this video on 2 by 2 tables you should check out the
next video in the Biostatistics & Epidemiology section which covers Sensitivity,
Specificity & Confirmatory Tests
SENSITIVITY, SPECIFICITY &
SCREENING TESTS
Before you watch this video you should really watch the previous video which
covers Two by Two Tables, TP, TN, FP & FN. That video lays the foundation for
this video so it may be tough to watch this one by itself.
Sensitivity & Specificity
Sensitivity (Sen) & Specificity (Spec) are used to evaluate the validity of
laboratory tests (not results of the tests). Basically, you use sensitivity and
specificity to determine whether or not to use a certain test or to
determine what situations a certain test would work best in. It is important
to note that Sen and Spec are fixed for a certain test as long as you don’t
change the cutoff point. Therefore, Sensitivity & Specificity are not affected
by changing prevalence. Both are given as a percentage ranging from 0%
to 100%.

Sensitivity is the percentage of patients with the disease that receive a
positive result, or the percentage chance that the test will correctly identify
a person who actually has the disease.
Sensitivity = TP / (TP + FN)
or
Sensitivity = TP / Diseased
Specificity is the percentage of patients without the disease that receive a
negative result.
Specificity = TN / (TN+FP)
or
Specificity = TN / Not Diseased
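
Reusing the same style of hypothetical counts, each formula is a one-liner:

```python
TP, FP, TN, FN = 90, 30, 870, 10  # hypothetical 2x2 counts

sensitivity = TP / (TP + FN)  # 90 / 100  = 0.90 -> TP / Diseased
specificity = TN / (TN + FP)  # 870 / 900 ≈ 0.97 -> TN / Not Diseased
```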
Imagine you have 2 very different guns. The first gun fires when you
barely touch the trigger. A strong gust of wind could set it off. The first gun
has high sensitivity and low specificity. It is sensitive to the smallest of
signals to fire while not being very specific to an intentional pull of the
trigger. You never miss a possible chance to shoot your gun (~ Low FN),
but you often accidentally fire when you shouldn’t (~ High FP). The
second gun only fires if you pull the trigger really hard. This gun has high
specificity and low sensitivity. It is very specific to firing only when you
intentionally pull the trigger (~Low FP), but it isn’t very sensitive to a weak
pull of the trigger (~High FN).

In the real world you never have a test that is 100% Sen and 100% Spec.
We are usually faced with a decision to use a test with high Sen (and
lower spec) or high Spec (and lower Sen). Usually a test with high
sensitivity is used as the Initial Screening Test. Those that receive a
positive result on the first test will be given a second test with high
specificity that is used as the Confirmatory Test. In these situations you
need both tests to be positive to get a definitive diagnosis. Getting a
single positive reading is not enough for a diagnosis as the individual
tests have either a high chance of FP or a high chance of FN. For
example, HIV is diagnosed using 2 tests. First an ELISA screening test is
used and then a confirmatory Western Blot is used if the first test is
positive.
There are also specific situations where having a high specificity or
sensitivity is really important. Consider that you are trying to screen
donations to a blood bank for blood borne pathogens. In this situation you
want a super high sensitivity, because the drawbacks of a false negative
(spreading disease to a recipient) are way higher than the drawbacks of a
false positive (throwing away 1 blood donation). Now consider you are
testing a patient for the presence of a disease. This particular disease is
treatable, but the treatment has very serious side effects. In this case you
want a test that has high specificity, because there are major drawbacks
to a false positive.

Now that you have finished this video you should check out the next video in
the Biostats & Epidemiology section which covers the calculation of Predictive
Value Positive & Negative (PPV & NPV). That video has some mnemonics and
concepts that also apply to this video.
POSITIVE & NEGATIVE PREDICTIVE
VALUE (PPV & NPV)
Before you watch this video you should really watch the previous videos which
cover Two by Two Tables, TP, TN, FP & FN as well as Sensitivity & Specificity.
Those videos lay the foundation for this video so it may be tough to watch this
one by itself.

Positive Predictive Value & Negative Predictive Value
PPV & NPV are used to interpret test results once you have them. For
example, if your patient just received a positive HIV test result you would
use the PPV to evaluate what that test result means to your patient (what
the percentage is that this person actually has HIV). It is important to
remember that the PPV & NPV change as prevalence changes. This makes
sense, because if the prevalence of a disease increases you are going to
automatically get more TPs and fewer TNs just based on the fact that more
people have the disease. Both measurements are given as a percentage
ranging from 0% to 100%.

Positive Predictive Value (PPV) is the percentage chance that a positive
test result is a true positive or the percentage chance that a patient with a
positive result actually has the disease. It is used when determining how
to proceed after a patient gets a positive result. PPV increases with
increases in prevalence. PPV decreases with decreases in prevalence.
Negative Predictive Value (NPV) is the Percentage chance that a negative
test result is a true negative or the percentage chance that a patient with
a negative result is actually disease free. It is used when determining how
to proceed after a patient gets a negative result. NPV decreases with an
increase in prevalence. NPV increases with a decrease in prevalence.
This is how I remember the formulas for Sen, Spec, PPV & NPV. First I
think that the top value (numerator) is always a positive value and the
bottom “left” value always matches the top value. The value of the bottom
“right” is always false. Then I think that the term with positive in the name
(PPV) has “all positives” & the term with negative in the name (NPV) has
“all negatives.” Next I think of Sen looking sort of like PPV & Spec looking
sort of like NPV. You just swap out the bottom “right” value.
To remember which set of values are affected by prevalence I think that
increasing Prevalence increases the formula with the most Ps in it. That
lets you know PPV is directly proportional with prevalence and it is
intuitive that NPV is the opposite because those two are an obvious pair.
So in my head I’m seeing something like this.
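
In code form, that mental picture boils down to these four formulas (the counts are hypothetical; note how the numerator is always a true value and the bottom “right” pairs it with the matching false value):

```python
TP, FP, TN, FN = 90, 30, 870, 10  # hypothetical 2x2 counts

sensitivity = TP / (TP + FN)  # true positives over all diseased
specificity = TN / (TN + FP)  # true negatives over all non-diseased
PPV = TP / (TP + FP)          # true positives over "all positives"
NPV = TN / (TN + FN)          # true negatives over "all negatives"
```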
Occasionally, you will get these types of questions in graphical form.
These questions with a picture are much less common than questions
that test the definition of a term or ask you to make a calculation based on
a two-by-two table. However, I am going to spend some time on this
question format as I believe it helps to solidify the overall concept.

Now that you are done with this video you should check out the next video in
the Biostatistics & Epidemiology section which covers the definitions &
calculations of Prevalence & Incidence
INCIDENCE, PREVALENCE & CASE-
FATALITY RATE

Prevalence & Incidence


In normal everyday conversations incidence and prevalence are used
interchangeably. That is because broadly speaking incidence and
prevalence both measure the frequency of a disease or outcome. In most
situations Incidence & prevalence are also directly proportional. However,
there are a few clear differences between the two measures that the test
writers love to write questions about.

 Prevalence = number of Total Existing cases divided by the total population
 Incidence = number of New cases within a certain time period divided by the
total number of susceptible individuals in the population

To illustrate the differences, here is how you would calculate the incidence and prevalence
of chicken pox in my home town of Boca Raton.
In most cases, incidence and prevalence are directly proportional. When
one goes up the other one goes up and vice versa. This intuitively makes
sense. If you have more new cases of diabetes within a given year then
you are likely to have a higher total number of people with diabetes at any
particular point during that year. However, Prevalence and Incidence are
not always directly proportional and test makers like to focus on these
situations. Most of these situations include a change in the duration of the
disease. Duration of a disease is the time from when a patient is
diagnosed until they are cured or die. When duration is held constant,
prevalence and incidence are directly proportional. The relationship
between prevalence, incidence and duration of disease can be illustrated
with a simple formula.
Prevalence = Incidence * Duration
This relationship makes sense if you think about extreme
examples. Consider a situation where there are 100 new cases of a
disease per year but the disease only lasts one day. Annual incidence will
be higher than prevalence as at any particular moment there is likely only
going to be at most 1 person with the disease. Now consider a disease
that has 100 new cases a year and the disease lasts for 40 years. The
point prevalence is going to be higher than annual incidence, as at any
given point you have the 100 or so newly diagnosed patients from this
year plus people that have been diagnosed over the last 40 years that do
not contribute to incidence.
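
Plugging those two extreme examples into Prevalence = Incidence * Duration (a sketch, with duration expressed in years):

```python
annual_incidence = 100  # new cases per year

# Disease lasting one day (~1/365 of a year): prevalence far below incidence
print(annual_incidence * (1 / 365))  # ~0.27 existing cases at any moment

# Disease lasting 40 years: prevalence far above annual incidence
print(annual_incidence * 40)         # 4,000 existing cases at any moment
```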

Another way to think about the relationship between incidence and
prevalence is the “Sink Metaphor.” The water coming into the sink from
the faucet represents incidence and the newly diagnosed patients. The
level of water building up in the sink is prevalence (the total number of
patients at that moment). The drain of the sink represents patients either
being cured or dying. If patients are cured quickly or are dying quickly
then the level of the sink won’t be very high because the drain is really
big. If there are very few patients dying or being cured that is like the
drain getting clogged up and the sink backing up.
Case-fatality Rate
Case-Fatality Rate is the proportion of the people with a particular disease
that die as a result of that disease. As the name implies, it compares the
number of cases to the number of fatalities related to that disease. So if
20 people have a particular cancer and that cancer is fatal in 5 of them,
the case fatality rate is 5/20 (25%).
BIAS & VALIDITY DEFINITION
Validity is how well the test or study answers the question it was
supposed to answer. With regard to laboratory test results you would use
sensitivity and specificity to measure validity. However, the term validity is
more commonly used when referring to research. It is basically how valid
the conclusions of the study are based on the study’s design and results.
There is internal validity which measures how well your results represent
what is going on in the sample being studied and external validity which
measures how well your results can be applied to other situations (or the
overall population).
Bias is a non-random (directional) deviation from the truth. High bias in a
study means low validity and vice versa. With regards to research studies
bias is problems with the study design or execution that cause you to
consistently get distorted results. These results are non-random as you
are consistently having the results skewed in the same direction. In most
cases this means you are showing a stronger association between the
factor being studied and the health outcome. Bias is different than the
random error you might see with a low sample size. Bias means there is
something fundamentally wrong with the study that is causing you to get
incorrect results that are consistently different than the truth. You can’t
correct bias by having a larger sample size.
The Ideal Research Study has the following characteristics:
• The study population is similar to the overall population of interest
• The two or more groups in the study should be as close to identical as
possible at the start of the study except for the one variable you are trying
to isolate
• The different groups should remain close to identical throughout the
study. This involves keeping as many patients as possible enrolled in the
study until the end and treating the different groups the same except for
the variable you are trying to isolate
• All patients are compliant with any treatments or lifestyle changes
assigned to them
CONFOUNDING, RANDOMIZATION
& BLINDING
Before you watch this video you should really check out the previous video in
the Biostatistics & Epidemiology section which is an introduction to Bias &
Validity. That video forms the foundations for this one.

Sampling Bias & Selection Bias


Sampling Bias or Selection Bias is when selection of the study sample from
the overall population is not random. This leads to a group of study
participants that is not representative of the overall population and results
that are not generalizable to the population (AKA Low external validity). A
common example is when participants volunteer for a study (AKA Self-
selection bias). In this case those that choose to volunteer are likely
different from those that choose not to participate.

Confounding
Confounding is when the study results are distorted by some factor other
than the variable(s) being studied. It appears that there is a relationship
between the exposure and health outcome based on the results, but there
is not really a relationship. Some factor other than what is being studied is
distorting the results. A confounder is a characteristic that is common to
the exposure and the health outcome. Rather than A causing B, C is
associated with A and B. In this example C is the confounder. If you
removed C completely, A and B would not be associated. The problem
with confounders is that an unwise researcher may come to the
conclusion that there is a causal relationship between the exposure and
outcome if he or she does not recognize the confounder.
In research, you would ideally like to be able to show that your variable of
interest caused the observed difference in outcomes. For example, you
want to be able to show that your treatment leads to fewer cases of
disease in the study population. If the treatment and placebo groups
aren’t similar to begin with you can’t come to this conclusion. If the groups
are different at the start of the study you can’t be sure if the observed
differences at the end are due to your treatment or some sort of
predisposing factor that was present to differing degrees in the study
groups.
For example, you can’t learn much if the group receiving your treatment
has an average age of 25 and the group receiving the placebo has an
average age of 75. In this case, your results are being confounded by the
difference in age.

Obviously, when you are creating a research study you want the different
groups to be similar in age, gender, ethnic diversity, socio-economic
factors and lifestyle factors. However, having groups that are similar in
only these types of known prognostic variables is not enough. You also
need the different groups to be similar in characteristics you aren’t even
sure affect the disease process. There could be some type of risk factor
that has not yet been identified as being pivotal to disease development.
You want your groups to be similar with regard to this unknown factor too.
How can you make two groups similar based on an infinite list of
potentially important factors that haven’t even been identified yet? The
answer is randomization.

Randomization
Randomization is just the process of selecting from a group in a fashion
that makes all possibilities equally likely to be selected. To illustrate this
point imagine you have a deck of playing cards. If you take a deck of
cards straight out of the box and pick the top card you are not getting a
random selection. It could be a new deck of cards in which the highest
card is likely on top or you could have last played a game like solitaire
that puts the cards in a particular order. However, if you shuffle the deck
thoroughly before selecting the top card the chances of getting all the
cards are equal. In research studies, randomization is like shuffling the
patients before assigning them to different groups so each patient has an
equal chance of being in the different groups.

The process of randomly assigning patients to different groups should give
you comparable groups with regard to any known or unknown confounders.
However, randomization won’t work as well with very small sample sizes,
because chance will play a larger factor in determining the characteristics
of each group.
When group assignment is not random, baseline differences between
groups can occur and there is an increased possibility of confounding. For
example, if you allow for group assignment to be determined by personal
preference, severity of disease, or day of the week your results could
largely be explained by these differences in baseline characteristics.
Assigning all patients that come in on a Thursday to one group and all
patients that come in on a Saturday to the other is not randomization. You
might get more unemployed patients on a weekday or the weekend bus
routes could limit some populations’ ability to arrive on weekends.

Stratification
Sometimes randomization is not enough on its own. More often than not
you will get an equal distribution between groups for characteristics such
as gender, but there is still a chance that you will get more males than
females in one group. This is especially true if the sample size is small. If
you know that gender is an extremely important prognostic factor for your
disease (like if you were studying the frequency of an X-linked genetic
disease) you don’t want to take the chance that this could happen. The
way to avoid this is called Stratification. In Stratification you first divide
your population by a particular characteristic and then you randomize.
You can think about stratification as randomization that is balanced with
regard to one particularly important factor.

Blinding & Placebos


If a patient knows they are in the group that is not receiving the drug they
might be less likely to be compliant with the prescribed regimen or they
could be more likely to drop out of the study. There is also potentially a
psychological effect of knowing that you are not receiving the “real” drug.
If a patient knows they aren’t getting the drug they could lose hope and
have higher stress. Therefore, which group a participant is in must not be
known by the participant. This process of “hiding” which group a patient is
in is called Blinding.
You also want the providers and research staff to not know which patients
are in which group, because they could treat the groups differently based
on that knowledge. For example, a provider may feel compelled to
prescribe additional treatments to a patient receiving a placebo or could
spend more time with patients receiving the real drug because they want
the study to be successful. If the provider knows which group a patient is
in they may also accidentally tip off the patient in which case the patient
would no longer be blinded. A Double Blinded Study is where patients and
providers are unaware of the patient’s group assignment. Sometimes you
will see the term triple blinded which means some other group like data
analyzers, technicians or other support staff are also blinded. Which
group a patient is in should not be revealed until the very end of the study
when you are analyzing data.
A Placebo is just a “drug” without an active ingredient that mimics the
treatment it is being compared with. Placebos are given to the control
group. If the treatment is a pill, the placebo should also be a pill that is the
exact same size, color, and shape. The patient must not be able to
differentiate the placebo from the actual treatment to prevent “un-
blinding.” By giving a patient a placebo, you are trying to give them “no
drug” without them knowing. Patients receiving placebos can receive
other forms of treatment, but they aren’t given the drug being studied.

Crossover Study
Crossover Studies are experimental studies that have the participants
“switch groups” part way through the study. For example, patients that
started with the placebo switch to getting the treatment halfway through
the study while those that started with the treatment get the placebo after
the halfway point. In this study design there is no separate control group
as participants act as their own controls.
COHORT, CASE-CONTROL, META-
ANALYSIS & CROSS-SECTIONAL
STUDY DESIGNS
Before you watch this video you should check out the 2 previous videos in
the Biostatistics & Epidemiology section which cover Validity & Bias as well
as Confounding & Types of Bias. Those videos have principles that will be
applied to this video on Types of Study Design.

Hierarchy of Evidence
Based on the types of bias that are inherent in some study designs we
can rank different study designs based on their validity. The types of
research studies at the top of the list have the highest validity while those
at the bottom have lower validity. In most cases if 2 studies on the same
topic come to different conclusions, you assume the trial of the more valid
type is correct. However, this is not always the case. Any study design
can have bias. A very well designed and executed cohort study can yield
more valid results than a clinical trial with clear deficiencies.

 Meta-analysis of multiple Randomized Trials (Highest Validity)
 Randomized Trial
 Prospective Cohort Studies
 Case Control Studies or Retrospective Cohort
 Case Series (Lowest Validity)

Meta-Analysis
Meta-analysis is the process of taking results from multiple different studies
and combining them to reach a single conclusion. Doing this is sort of like
having one huge study with a very large sample size and therefore meta-
analysis has higher power than individual studies.

Randomized Clinical Trials (RCT)


Clinical trials are the gold standard of research for therapeutic and
preventative interventions. The researchers have a high level of control
over most factors. This allows for randomization and blinding which aren’t
possible in many other study types. Participants’ groups are assigned by
the researcher in clinical trials while in observational studies “natural
conditions” (personal preference, genetics, social determinants,
environment, lifestyle …) assign the group. As we will see later, the
incidence in different groups is compared using Relative Risk (RR).
Cohort Study
Cohort Studies are studies where you first determine whether or not a
person has had an exposure and then you monitor the occurrence of
health outcomes over time. It is the observational study design with the
highest validity. Cohort is just a fancy name for a group, and this should
help you remember this study design. You start with a group of people
(some of whom happen to have an exposure and some who don’t). Then
you follow this group for a certain amount of time and monitor how often
certain diseases or health outcomes arise. It is easier to conceptually
understand cohort studies that are prospective. However, there are
retrospective cohort studies also. In this scenario you identify a group of
people in the past. You then first identify whether or not these people had
the particular exposure at that point in time and determine whether or not
they ended up getting the health outcomes later on. As we will see later,
the incidence in different groups in a cohort study is compared
using Relative Risk (RR).

Case-Control Study
Case-Control Studies are retrospective and observational. You first identify
people who have the health outcome of interest. Then you carefully select
a group of controls that are very similar to your diseased population
except they don’t have that particular disease. Then you try to determine
whether or not the participants from each group had a particular exposure
in the past. I remember this by thinking that in a case control study
you start off knowing whether a person is diseased (a case) or not diseased
(a control). There isn’t a huge difference between retrospective cohort
and case-control. You are basically doing the same steps but in a slightly
different order. However, the two study designs are used in different
settings. As we will see later, the exposure odds in the different groups of a
case-control study are compared using Odds Ratio (OR).

Case-Series
A Case-Series is a small collection of individual cases. It is an
observational study with a very small sample size and no control group.
Basically you are just reviewing the medical records for a few people with
a particular exposure or disease. A study like this is good for very rare
exposures or diseases. Obviously the small sample size and lack of a
control group limits the validity of any conclusions that are made, but in
certain situations this is the best evidence that is available.

Cross-sectional Study
Cross Sectional Studies are different from the others we have discussed.
While the other studies measure the incidence of a particular health
outcome over time, a cross-sectional study measures Prevalence. In this
observational study the prevalence of the exposure and the health
outcome are measured at the same time. You are basically trying to
figure out how many people in the population have the disease and how
many people have the exposure at one point in time. It is hard to
determine an association between the exposure and disease just from
this information, but you can still learn things from these studies. If the
exposure and disease are both common in a particular population it may
be worth investing more resources to do a different type of study to
determine whether or not there is a causal relationship.
DEFINITION AND CALCULATION OF
ODDS RATIO & RELATIVE RISK

Probability vs. Odds


The terms odds and probability are used interchangeably in everyday
life. However, in the setting of Biostats they are two different things.
Generally speaking they both represent how likely something is, but they
are calculated differently and used in different situations. Probability is
essentially the same thing as a percentage. You are comparing the
number of occurrences of a certain outcome to the number of total events
measured. Probability ranges between zero and one. Odds is a ratio of the
likelihood of an event happening compared to the likelihood of an event
not happening. Odds can be zero or any positive number (not just values
between 0 and 1).

So the probability of rolling a 4 on one attempt with one six-sided die is
1/6. The odds of rolling a 4 are 1/5. Here is another example. If 13 people
of a 60 person sample have lung cancer the probability of a person in that
group having lung cancer is 13/60 and the odds of a person in that group
having lung cancer is 13/47.
When we are talking about common events the difference between odds
and probability is large. For example, a single coin flip gives you
pretty different results. You have 1/1 odds of getting heads and 1/2
probability of getting heads. However, as an event gets more and more
rare the difference between odds and probability gets very small. Pretend
there is a drawing with one winner and 10,000 people entered. The odds
of winning are 1/9,999 (0.0001) and the probability of winning is 1/10,000
(0.0001). In this case, odds and probability are essentially identical.
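
The conversion behind all of these examples is simple enough to sketch in a couple of lines (a probability p becomes odds of p / (1 - p)):

```python
def odds_from_probability(p):
    return p / (1 - p)

print(odds_from_probability(1 / 6))      # rolling a 4: 0.2 = 1/5
print(odds_from_probability(13 / 60))    # lung cancer example: 13/47 ≈ 0.28
print(odds_from_probability(1 / 2))      # coin flip: 1/1 = 1.0
print(odds_from_probability(1 / 10000))  # rare event: ~0.0001, almost equal to p
```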

Relative Risk (RR) & Odds Ratio (OR)


The difference between odds and probability is important because
Relative Risk is calculated with probability and Odds Ratio is calculated
with odds. Relative Risk (RR) is a ratio of probabilities or put another way it
is one probability divided by another. Odds Ratio (OR) is a ratio or
proportion of odds. I just remember that odds ratio is a ratio of odds and
relative risk isn’t a ratio of odds (AKA it is the other option).
Relative Risk = Probability / Probability
Odds Ratio = Odds / Odds
Now that you have a general idea of what odds ratio and relative risk are
you need to know when to use them. They don’t always just ask you to
calculate one or the other. Sometimes questions on Step 1 also require
you to figure out which type of calculation is needed based on the
situation. In clinical trials and cohort studies we use relative risk to
compare the incidence of health outcomes between groups of differing
exposure or treatment. For case-control trials we use odds ratio to
compare the “incidence” of past exposures or treatments.

Cohort Studies (and clinical trials) –> Relative Risk
Case-Control studies –> Odds Ratio
I remember this by thinking about a group of pirates (group = cohort) all
saying “aRRrrr!”. That reminds you that cohort studies use RR and the
“other one” uses OR.

Now that we understand the research setting for each term we can
redefine RR & OR. I should note that I think memorizing these definitions
is unnecessary because if you understand the simpler definitions you
should be able to create these based on the scenario presented in the
question.
An RR or OR of 1 means there is no difference between the two groups
being compared with respect to what you are measuring. In this case the
treatment or risk factor being studied has no effect on the rate of outcome
development. Similarly, an OR or RR of 2 means whatever you are
measuring is two times as likely to occur in the group being studied when
compared with the control group. 0.5 means it is half as likely and so on.
Later in the chapter we will cover how confidence intervals are applied to
RR & OR.
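
In the meantime, here is a minimal sketch of both calculations on a made-up 2x2 table (30 of 100 exposed people and 10 of 100 unexposed people got the disease):

```python
diseased_exposed, healthy_exposed = 30, 70      # hypothetical counts
diseased_unexposed, healthy_unexposed = 10, 90

risk_exposed = diseased_exposed / (diseased_exposed + healthy_exposed)          # 0.30
risk_unexposed = diseased_unexposed / (diseased_unexposed + healthy_unexposed)  # 0.10
relative_risk = risk_exposed / risk_unexposed  # 3.0  -> ratio of probabilities

odds_exposed = diseased_exposed / healthy_exposed        # 30/70
odds_unexposed = diseased_unexposed / healthy_unexposed  # 10/90
odds_ratio = odds_exposed / odds_unexposed               # ~3.86 -> ratio of odds
```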

ERRATA: At about the 3:00 mark the slide says “10,00” when it is really
supposed to say “10,000.” I added a pop up box to fix it. Thanks to Mehdi
Hedjazi for pointing this typo out in a YouTube comment!
NUMBER NEEDED TO TREAT &
ABSOLUTE RISK REDUCTION
Before you watch this video you should check out the previous video in
the Biostats & Epidemiology section which covers the related topics
of Probability, Odds, RR & OR. That video lays the foundation for this video so it
may be difficult to start with this one.

Attributable risk & Absolute Risk Reduction


Attributable Risk (AR) and Absolute Risk Reduction (ARR) are how much of
the observed change in risk is due to the treatment (or exposure) being
studied. Put another way AR is the amount of disease that would be
eliminated if the exposure was eliminated. ARR would be the amount of
disease that would be eliminated if all patients were receiving the drug.
ARR and AR are essentially the same thing but used in different situations.
They are both calculated the same way. The only difference is that in ARR
the probability of disease is going down due to a treatment and in AR
the probability is going up due to an exposure or risk factor.

ARR = probability of disease among the non-treated – probability of disease among the treated
AR = probability of disease among the exposed – probability of disease among the non-exposed

I’m never able to keep the two formulas straight and the difference isn’t
that important so I don’t even try. I just remember it like this:
ARR or AR = higher probability – lower probability

Number Needed to Treat & Number Needed to Harm
Both Number Needed to Treat and Number Needed to Harm are 1
divided by the absolute risk reduction or attributable risk (whichever is
more appropriate). The Number Needed to Treat is how many people you
need to give a particular treatment to in order to have a positive effect on
one person. Number Needed to Harm is the number of people that need to
be exposed to a risk factor to effect one person.

Number needed to treat = 1 / absolute risk reduction
Number needed to harm = 1 / attributable risk
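
A quick sketch with made-up event rates:

```python
risk_without_treatment = 0.20  # hypothetical event rate in the control group
risk_with_treatment = 0.15     # hypothetical event rate in the treated group

ARR = risk_without_treatment - risk_with_treatment  # higher - lower = 0.05
NNT = 1 / ARR  # 20 -> treat 20 patients to help one person
```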
P-VALUE, STATISTICAL
SIGNIFICANCE & TYPES OF ERROR

Null Hypothesis & Alternative Hypothesis


When looking at 2 or more groups that differ based on a treatment or risk
factor, there are two possibilities:

 Null Hypothesis (Ho) = no difference between the groups. The different
groups are the same with regard to what is being studied. There is no
relationship between the risk factor/treatment and occurrence of the health
outcome. By default you assume the null hypothesis is valid until you have
enough evidence to support rejecting this hypothesis.
 Alternative Hypothesis (Ha) = there is a difference between groups. The
groups are different with regard to what is being studied. There is a
relationship between the risk factor/treatment and occurrence of the health
outcome

Obviously, the researcher wants the alternative hypothesis to be true. If
the alternative hypothesis is true it means they discovered a treatment
that improves patient outcomes or identified a risk factor that is important
in the development of a health outcome. However, you never prove the
alternative hypothesis is true. You can only reject a hypothesis (say it is
false) or fail to reject a hypothesis (could be true but you can never be
totally sure). So a researcher really wants to reject the null hypothesis,
because that is as close as they can get to proving the alternative
hypothesis is true. In other words you can’t prove a given treatment
caused a change in outcomes, but you can show that that conclusion is
valid by showing that the opposite hypothesis (or the null hypothesis) is
highly improbable given your data.

Type 1 and Type 2 Error


Anytime you reject a hypothesis there is a chance you made a mistake.
This would mean you rejected a hypothesis that is true or failed to reject a
hypothesis that is false.

 Type 1 Error = incorrectly rejecting the null hypothesis. Researcher says there
is a difference between the groups when there really isn’t. It can be thought of
as a false positive study result. Type I Error is related to p-Value and alpha.
You can remember this by thinking that α is the first letter of the Greek alphabet
 Type 2 Error = fail to reject null when you should have rejected the null
hypothesis. Researcher says there is no difference between the groups when
there is a difference. It can be thought of as a false negative study result. The
probability of making a Type II Error is called beta. You can remember this by
thinking that β is the second letter in the greek alphabet.

Usually we focus on the null hypothesis and type 1 error, because the
researchers want to show a difference between groups. If there is any
intentional or unintentional bias it more likely exaggerates the differences
between groups based on this desire.

Power & Beta


Power is the probability of finding a difference between groups if one truly
exists. It is the percentage chance that you will be able to reject the null
hypothesis if it is really false. Power can also be thought of as the probability
of not making a type 2 error. In equation form, Power equals 1 minus
beta. Where power comes into play most often is while the study is being
designed. Before you even start the study you may do power calculations
based on projections. That way you can tweak the design of the study
before you start it and potentially avoid performing an entire study that
has really low power since you are unlikely to learn anything.

Power increases as you increase sample size, because you have more
data from which to make a conclusion. Power also increases as the effect
size or actual difference between the groups increases. If you are trying
to detect a huge difference between groups it is a lot easier than
detecting a very small difference between groups. Increasing the
precision (or decreasing standard deviation) of your results also increases
power. If all of the results you have are very similar it is easier to come to
a conclusion than if your results are all over the place.

Increased sample size –> increased power
Increased difference between groups (effect size) –> increased power
Increased precision of results (decreased standard deviation) –> increased power

p-Value Definition:
p-value is the probability of obtaining a result at least as extreme as the
current one, assuming that the null hypothesis is true. Imagine we did a
study comparing a placebo group to a group that received a new blood
pressure medication and the mean blood pressure in the treatment group
was 20 mm Hg lower than the placebo group. Assuming the null
hypothesis is correct the p-value is the probability that if we repeated the
study the observed difference between the group averages would be at
least 20.

Now you have probably picked up on the fact that I keep adding the
caveat that this definition of the p-value only holds true if the null
hypothesis is correct (AKA there is no real difference between the groups).
However, don’t let that throw you off. You just assume this is the case in
order to perform this test because we have to start from somewhere. It is
not as if you have to prove the null hypothesis is true before you utilize
the p-value.

The p-value is a measurement to tell us how much the observed data
disagrees with the null hypothesis. When the p-value is very small there is
more disagreement of our data with the null hypothesis and we can begin
to consider rejecting the null hypothesis (AKA saying there is a real
difference between the groups being studied). In other words, when the p-
value is very small it is less likely that the groups being studied are the
same. Therefore, when the p-value is very low our data is incompatible
with the null hypothesis and we will reject the null hypothesis. When the
p-value is high there is less disagreement between our data and the null
hypothesis. In other words, when the p-value is high it is more likely that
the groups being studied are the same. In this scenario we will likely fail
to reject the null hypothesis.

Using Alpha (α) to Determine Statistical Significance
You may be wondering what determines whether a p-value is “low” or
“high.” That is where the selected “Level of Significance” or Alpha (α)
comes in. Alpha is the probability of making a Type I Error (or incorrectly
rejecting the null hypothesis). It is a selected cut off point that determines
whether we consider a p-value acceptably high or low. If our p-value is
lower than alpha we conclude that there is a statistically significant
difference between groups. When the p-value is higher than our
significance level we conclude that the observed difference between
groups is not statistically significant.

Alpha is arbitrarily defined. A 5% (0.05) level of significance is most
commonly used in medicine based only on the consensus of researchers.
Using a 5% alpha implies that having a 5% probability of incorrectly
rejecting the null hypothesis is acceptable. However, other alphas such
as 10% or 1% are used in certain situations.

Misconceptions About p-Value & Alpha


Statistical significance is not the same thing as clinical
significance. Clinical Significance is the practical importance of the finding.
There may be a statistically significant difference between 2 drugs, but
the difference is so small that using one over the other is not a big deal.
For example, you might show a new blood pressure medication is a
statistically significant improvement over an older drug, but if the new
drug only lowers blood pressure on average by 1 more mm Hg it won’t
have a meaningful impact on the outcomes that are important to patients.

It is also often incorrectly stated (by students, researchers, review books
etc.) that “p-Value is the probability that the observed difference between
groups is due to chance (random sampling error).” In other words, “if my
p-Value is less than alpha then there is less than a 5% probability that the
null hypothesis is true.” While this may be easier to understand and
perhaps may even be enough of an understanding to get test questions
right it is a misinterpretation of p-value. For a number of reasons p-Value
is a tool that can only help us determine the observed data’s level of
agreement or disagreement with the null hypothesis and cannot
necessarily be used for a bigger picture discussion about whether our
results were caused by random error. The p-Value alone cannot answer
these larger questions. In order to make larger conclusions about
research results you need to also consider additional factors such as the
design of the study and the results of other studies on similar topics. It is
possible for a study to have a p-value of less than 0.05, but also be poorly
designed and/or disagree with all of the available research on the topic.
Statistics cannot be viewed in a vacuum when attempting to make
conclusions and the results of a single study can only cast doubt on the
null hypothesis if the assumptions made during the design of the study
are true.

A simple way to illustrate this is to remember that by definition the p-value
is calculated using the assumption that the null hypothesis is correct.
Therefore, there is no way that the p-Value can be used to prove that the
alternative hypothesis is true.

Another way to show the pitfalls of blindly applying p-Value is to imagine
a situation where a researcher flips a coin 5 times and gets 5 heads in a
row. If you performed a one-tailed test you would get a p-value of 0.03.
Using the standard alpha of 0.05 this result would be deemed statistically
significant and we would reject the null hypothesis. Based solely on this
data our conclusion would be that there is at least a 95% chance on
subsequent flips of the coin that heads will show up significantly more
often than tails. However, we know this conclusion is incorrect, because
the study’s sample size was too small and there is plenty of external data
to suggest that coins are fair (given enough flips of the coin you will get
heads about 50% of the time and tails about 50% of the time). In actuality
the chance of the null hypothesis being true is not 3% like we calculated,
but is actually 100%.
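
The arithmetic behind that 0.03 is just the chance of 5 heads in a row from a fair coin:

```python
p_value = 0.5 ** 5  # one-tailed: P(5 heads in 5 flips | fair coin)
print(p_value)      # 0.03125 -> below alpha = 0.05, yet the coin is fair
```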

Statistical Hypothesis Tests:

Statistical hypothesis testing is how we test the null hypothesis. For the
USMLE Step 1 Medical Board Exam all you need to know is when to use
the different tests. You don’t need to know how to actually perform them.

Continuous (numerical) values:

 T Test = compares the mean of 2 sets of numerical values
 ANOVA (Analysis of Variance) = compares the mean of 3 or more sets of
numerical values

Categorical (disease vs. no disease, exposed vs. not exposed) Values:

 Chi-Squared = compares the percentage of categorical data for 2 or more
groups
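
If you are curious what running these looks like in practice, here is a sketch using SciPy (assuming it is installed; the data sets are made up):

```python
from scipy import stats

a = [120, 125, 130, 118, 127]  # e.g. blood pressures in group A
b = [135, 140, 129, 138, 142]  # group B
c = [128, 133, 131, 126, 137]  # group C

print(stats.ttest_ind(a, b))    # T Test: compares the means of 2 groups
print(stats.f_oneway(a, b, c))  # ANOVA: compares the means of 3+ groups

# Chi-Squared: categorical counts (e.g. diseased vs. healthy by exposure)
table = [[30, 70], [10, 90]]
chi2, p, dof, expected = stats.chi2_contingency(table)
print(p)
```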
CONFIDENCE INTERVAL
INTERPRETATION
CORRECTION: Although my mistake is beyond the scope of the Step 1
exam, the interpretation of Confidence Interval that I used in the video is
incorrect & a bit oversimplified. I stated that for an individual study there is
a 95% chance that the true value lies within the 95% CI. However,
confidence interval is a type of frequentist inference and the interpretation
I gave in the video is really better suited for interpreting statistics of
Bayesian Inference (Again please don’t feel like you need this information
for the exam). What I should have said is something like “if 100 similarly
designed studies use a 95% confidence interval then 95 of these intervals
will contain the true value and 5 will not. For more info on this
misconception click here https://en.wikipedia.org/wiki/Bayesian_inference

A Confidence Interval (CI) is the range of values the true value in the
population is expected to fall within based on the study results. The
results we receive in any study do not perfectly mirror the overall
population and the confidence interval lets us get a better idea of what the
results in the overall population might be. The confidence interval is
based on a certain level of confidence. Don’t get this confused with the
value of the sample population. If you measured the BMI of 100 people in your
study population and the mean is 25, then you are very confident that the
actual mean BMI in that group is 25. Confidence interval only comes into
play when you try to extrapolate your study results to other situations (like
to the population overall).
If you have a 95% confidence interval (which is most common) that
means there is a 95% chance that the true value lies somewhere in the
confidence interval. You can also alter the width of the confidence interval
by selecting a different percentage of confidence. 90% & 99% are also
commonly used. A 99% confidence interval is wider (has more values) than
a 95% confidence interval, and a 90% confidence interval is the narrowest.
The width of the CI changes with changes in sample size. The width of
the confidence interval is larger with small sample sizes. You don’t have
enough data to get a clear picture of what is going on so your range of
possible values is wider. Imagine your study on a group of 10 individuals
shows an average shoe size of 9. If based on the results you are 95%
sure that the actual average shoe size for the entire population is
somewhere in between 6 and 12, then the 95% CI is 6-12. Based just on
your results you don’t really know what the average in the population is,
because your study population is a very small sliver of the overall
population. Now if you repeat the study with 10,000 individuals and you
get an average shoe size of 9 the confidence interval is going to be
smaller (something like 8.8 to 9.3). Here you have a much larger sample
size and therefore your results give you a much clearer idea of what is
going on with the entire population. Therefore, your 95% CI shrinks. The
width of the confidence interval decreases with an increasing sample size
(n). This is sort of like the standard deviation decreasing with an
increased sample size.
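
A rough sketch of that shrinking interval, using the normal approximation (with only 10 people a t-multiplier would be slightly more accurate; the shoe sizes are made up):

```python
from statistics import mean, stdev
from math import sqrt

shoe_sizes = [9, 8, 10, 9.5, 8.5, 9, 10.5, 8, 9.5, 9]  # hypothetical n = 10
n = len(shoe_sizes)
standard_error = stdev(shoe_sizes) / sqrt(n)  # shrinks as n grows

m = mean(shoe_sizes)
ci_95 = (m - 1.96 * standard_error, m + 1.96 * standard_error)
print(ci_95)  # roughly (8.6, 9.6); quadrupling n about halves the width
```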
Confidence intervals are often applied to RR & OR. For example, the
odds ratio might be 1.2, but you aren’t sure how much of an impact
chance had on determining that value. Therefore, instead of just reporting
the value of 1.2 you also report a range of values where the true value in
the population is likely to lie. So we would report something like the odds
ratio is 1.2 and we are 95% confident that the true value within the overall
population is somewhere between .9 and 1.5.

You can use the confidence interval to determine statistical significance
similar to how you use the p-Value. If the 95% confidence interval crosses
the line of no difference that is the same thing as saying there is a p-
value of greater than 5%. This is intuitive because if the confidence
interval includes the value of no difference then there is a reasonable
chance that there is no difference between the groups. If the confidence
interval does not cross the line of no difference then the observed
difference is statistically significant, because you know it is highly unlikely
that the two groups are the same.

For both relative risk (RR) and odds ratio (OR), the “line of no difference”
is 1. So an RR or OR of 1 means there is no difference between the two
groups being compared with respect to what you are measuring. This is
because RR and OR are ratios and a value divided by itself is 1. If the
95% confidence interval of the RR or OR includes the value 1, that means
it is possible the true value is 1 and there is no difference between
groups. If that is the case, we say the null hypothesis cannot be rejected
or that there is no statistically significant difference shown. This is the
same thing as saying the p-value is greater than .05.
If you are comparing the average between groups we apply the confidence
interval to the difference between groups (the mean of one group minus
the other group). In this case the line of no difference would be 0. So if the
confidence interval for the difference between the means crosses 0, the
results are not statistically significant.
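
As a tiny sketch of the rule, using the OR example from above:

```python
ci = (0.9, 1.5)  # 95% CI for an odds ratio of 1.2
crosses_line_of_no_difference = ci[0] <= 1 <= ci[1]
print(crosses_line_of_no_difference)  # True -> not statistically significant (p > .05)
```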
