R Users Guide
to accompany
Statistics: Unlocking the Power of Data
About R
R is a freely available environment for statistical computing. R works with a command-line interface,
meaning you type in commands telling R what to do. For more information and to download R, visit
cran.r-project.org.
Commands are given using color coding. Code in red represents commands and punctuation that
always need to be entered exactly as is. Code in blue represents names that will change depending
on the context of the problem (such as dataset names and variable names). Text in green following #
is either optional code or comments. This often includes optional arguments that you may want to
include with a function, but do not always need. In R, anything following a # is read as a comment
and is not actually evaluated.
For example, the command mean is used to compute the mean of a set of numbers. The information
for this command is given in this manual as
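mean(x, na.rm=TRUE)   # "x" stands in for the name of your set of numbers; na.rm=TRUE is the optional argument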
Whenever you are computing a mean, you always need to type the parts in red, mean( ). Whatever
you type inside the parentheses (the code in blue) will depend on what you have called the set of
numbers you want to compute the mean of, so if you want to calculate the mean body mass index for
data stored in a variable called BMI , you would type mean(BMI). The code in green represents an
optional argument needed only if there are missing values. If there were missing (NA) values in BMI,
you would compute the mean with mean(BMI, na.rm=TRUE).
Commands can be entered directly into the R console (the window that opens when you start R),
following the red > prompt, and sent to the computer by pressing enter. For example, typing 1 + 2 and
pressing enter will output the result 3:
> 1+2
[1] 3
Your entered code always follows the > prompt, and output always follows a number in square
brackets. Each command should take its own line of code; a command that spans multiple lines should
be grouped with { } (see examples in Chapters 3 and 4).
It is possible to press enter before a line of code is completed, and often R will recognize this. For
example, if you were to type 1 + but then press enter before typing 2, R knows that 1 + by itself
doesn't make sense, so it prompts you to continue the line with a + sign. At this point you could
finish the line by typing 2 and pressing enter. This commonly occurs if you forget to close parentheses or
brackets. If you keep pressing enter and keep seeing a + sign rather than the regular > prompt that
allows you to type new code, and if you can’t figure out why, often the easiest option is to simply press
ESC, which will get you back to the normal > prompt and allow you to enter a new line of code.
Capitalization and punctuation need to be exact in R, but spacing doesn’t matter. If you get errors
when entering code, you may want to check for these common mistakes:
- Did you start your line of code with a fresh prompt (>)? If not, press ESC.
- Are your capitalization and punctuation correct?
- Are all your parentheses and brackets closed? For every opening (, {, or [, make sure there is a
corresponding closing ), }, or ].
R Script
Rather than entering commands into the console directly, however, we recommend creating and using
an R Script, which is basically a text editor for your code. A new script can be created with File -> New Script.
Code (commands) can be typed there and then sent to the console, for example by copying and pasting it,
or by highlighting it and pressing Ctrl+R (Windows) or Command+Enter (Mac).
Using a separate R script is nice because you can save only the code that works, making it easy to
rerun and edit in the future; if you saved the console instead, you would also be saving all your
mistakes and all the output. We recommend always saving your R Scripts so you have the commands
easily accessible and editable for future use.
Basic Commands
Basic Arithmetic
Addition +
Subtraction -
Multiplication *
Division /
Exponentiation ^
Other
Naming objects =
Open help for a command ?
Creating a set of numbers c(1, 2, 3)
The basic arithmetic commands are pretty straightforward. For example, 1 + (2*3) would return 7.
You can also name the result of any command with a name of your choosing with =. For example, if
you type
x = 3*4
you are setting x to equal the result of 3*4, or equivalently setting x = 12. If you now type x into the
console, you will see 12 as the output:
> x
[1] 12
The choice of x here is completely arbitrary, and you could have named it whatever you wanted.
Naming objects and arithmetic works not just with numbers, but with more complex objects like
variables. To get a little fancier, suppose you have variables called Weight (measured in pounds) and
Height (measured in inches), and want to create a new variable for body mass index, which you
decide to name BMI. You can do this with the following code:
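BMI = Weight / Height^2 * 703   # standard BMI formula for weight in pounds and height in inches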
If you want to create your own variable or set of numbers, you can collect numbers together into one
object with c( ) and the numbers separated by commas inside the parentheses. For example, to
create your own variable Weight out of the weights 125, 160, 183, and 137, you would type
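Weight = c(125, 160, 183, 137)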
To get more information on any built-in R commands, simply type ? followed by the command name,
and this will bring up a separate help page.
Using R in Chapter 1
Loading Data
Load a dataset from a .csv file dataname = read.csv(file.choose())
Load a dataset from the textbook data(dataname)
Type in a variable variablename = c(3.2, 3.3, 3.1)
Viewing Data
See the whole dataset, dataname dataname
See the first 6 rows of a dataset head(dataname)
Finding the number of cases and variables dim(dataname)
Get information about a textbook dataset ?dataname
Variables
Seeing the variable names in a dataset names(dataname)
Extract a variable from a dataset dataname$variablename
Random Sample
Generate n random integers up to max sample(1:max, n)
Loading Data
There are three different ways you may want to get data into R: loading data from a spreadsheet, loading
datasets from the textbook, and manually typing in your own data.
Once you have a dataset loaded, you will want to explore basic aspects of it, such as its
structure, the names of the variables, and the number of cases. Let's work with the AllCountries data
from the textbook, loaded with data(AllCountries). To view the dataset, simply type the dataset name
AllCountries
If there are a lot of cases, this may be awkward to view. Often it is useful to look at just the first 6 rows
of a dataset to get a quick feel for the structure:
head(AllCountries)
To find the number of cases and variables in the dataset, use

dim(AllCountries)
The first number is the number of rows (cases) and the second is the number of columns (variables).
If the dataset comes from the textbook, you can type ? followed by the data name to pull up
information about the data:
?AllCountries
Variables
If you want to see just the variable names, type
names(AllCountries)
If you want to extract a particular variable from a dataset, for example, Population, type
AllCountries$Population
If you will be doing a lot with one dataset, sometimes it gets cumbersome to always type the dataset
name and a dollar sign before each variable name. To avoid this, you can type
attach(AllCountries)
Now you can access variables from the AllCountries data simply by typing the variable names directly.
If you choose to use this option however, just remember to detach the dataset when you are done:
detach(AllCountries)
Taking a Random Sample
While you can sample directly from a list of cases in R, a more general way to generate a random
sample is to randomly generate n (the sample size) numbers between 1 and the number of cases you
want to sample from (max):
sample(1:max, n)
Once you have these random numbers, you can use this with either a dataset or a variable to create
your random sample using square brackets.
A vector of numbers in square brackets after a variable says to only look at cases corresponding to the
given numbers. For example, with our gpa variable, if we want only the 1st and 3rd cases, we could
type:
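gpa[c(1, 3)]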
to get a new variable containing just 2.9 and 3.6. As another example, suppose we wanted to take a
random sample of 10 countries from all 213 countries in the world. Because Country within the dataset
AllCountries lists the country names identifying each case, we could use
AllCountries$Country[sample(1:213, 10)]
This is useful if you have the case identifiers for the whole population, but not the data.
If you want to take a random sample from an entire dataset, indicate which rows and which columns
you want within the square brackets, separated by a comma:
data[rows, columns]
So to take a random sample of 10 countries along with all the associated variables in the AllCountries
dataset, we could use
AllCountries[sample(1:213, 10), ]
Notice the only difference when sampling a dataset versus a single column is the comma after the
sample() command.
Randomized Experiment
If you want to randomize a sample into two different treatment groups for a randomized experiment,
you can take a random sample from the whole sample to be the treatment group, and the rest of the
sample would then go in the control group.
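For example, here is a minimal sketch, assuming 20 subjects labeled 1 through 20 (the sample size and the names treatment and control are ours, for illustration):

treatment = sample(1:20, 10)       # randomly choose 10 of the 20 subjects for the treatment group
control = setdiff(1:20, treatment) # the remaining 10 subjects form the control group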
Using R in Chapter 2
One Categorical (x)
Frequency table table(x)
Proportion in group A mean(x == "A")
Pie chart pie(table(x))
Bar chart barplot(table(x))
Two Categorical (x1, x2)
Two-way table table(x1, x2)
Difference in proportions of x1 in group A by x2
diff(by(x1,x2,function(o) mean(o=="A")))
Segmented bar chart barplot(table(x1, x2), legend=TRUE)
Side-by-side bar chart barplot(table(x1,x2),legend=TRUE,beside=TRUE)
One Quantitative (y)
Mean mean(y) #for missing data: na.rm=TRUE
Median median(y) #for missing data: na.rm=TRUE
Standard deviation sd(y) #for missing data: na.rm=TRUE
5-Number summary summary(y)
Percentile quantile(y, 0.05)
Histogram hist(y)
Boxplot boxplot(y) #ylab="y-axis label"
One Quantitative (y) and
One Categorical (x)
Means by group by(y, x, mean) #for missing data: na.rm=TRUE
Difference in means diff(by(y, x, mean))
S.D. by group by(y, x, sd)
Side-by-side boxplots boxplot(y ~ x) #ylab="y-axis label"
Two Quantitative (y1, y2)
Scatterplot plot(y1, y2)
Correlation cor(y1, y2) #missing data: use="complete.obs"
Linear Regression lm(response ~ explanatory)
For example, using variables from the StudentSurvey data:

Award preference by gender (two categorical variables):
table(Award, Gender)
barplot(table(Award, Gender), legend=TRUE)
Pulse rate (one quantitative variable):
summary(Pulse)
hist(Pulse)
Hours of exercise per week by award preference (one quantitative and one categorical variable):
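by(Exercise, Award, mean, na.rm=TRUE)
boxplot(Exercise ~ Award)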
Pulse rate and SAT score (two quantitative variables):
plot(Pulse, SAT)
cor(Pulse, SAT)
lm(SAT~Pulse)
Missing Data
You may notice that if you try to do some of these commands on certain variables, you get NA for a
result. This often means there are some missing values in the data, which R codes as NA. To
calculate the average avoiding missing values, use the argument na.rm=TRUE:
mean(Exercise, na.rm=TRUE)
by(Exercise, Award, mean, na.rm=TRUE)
For correlation a similar problem exists, but the fix just takes a different argument. To calculate the
correlation between SAT score and GPA (for which there are missing values), use
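cor(SAT, GPA, use="complete.obs")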
Using R in Chapter 3
Generating a Bootstrap Distribution:
b = 10000 #number of bootstrap statistics
boot.dist = rep(NA, b)
for (i in 1:b) {
  boot.sample = sample(n, replace=TRUE)
  boot.dist[i] = statistic(y[boot.sample])
}

Using a Bootstrap Distribution:
hist(boot.dist)
quantile(boot.dist, c(0.025, 0.975))
sd(boot.dist)
To generate a bootstrap confidence interval we first learn how to generate one bootstrap statistic, then
how to repeat this procedure many times to generate an entire bootstrap distribution, and then how to
use the bootstrap distribution to calculate a confidence interval.
To generate one bootstrap sample, we sample cases with replacement from the original sample: the
command sample(n, replace=TRUE) generates n numbers between 1 and n, with replacement. For example,
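sample(4, replace=TRUE)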
could yield 2, 2, 1, 4. To use this to get a bootstrap sample from our variable, we use square
brackets, [ ] to select those cases from the variable. For example, if we wanted to create a bootstrap
sample of Atlanta commute times (Time), which has 500 values originally, we would use
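boot.sample = sample(500, replace=TRUE)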
Lastly, we compute our statistic of interest on this bootstrap sample. For example, for the mean
Atlanta commute time we would use
mean(Time[boot.sample])
If we instead were doing a correlation between Distance and Time, we would use
cor(Distance[boot.sample], Time[boot.sample])
For Loop
A for loop is a convenient way to repeat a procedure many times, without having to type it over and
over again. The code
for (i in 1:5) { }
tells the computer to do whatever is inside of the brackets 5 times, once with i = 1, once with i = 2, etc.
up to i = 5. We use for (i in 1:b), where b is the number of bootstrap statistics we want
(usually 10,000 is sufficient).
We want to save the bootstrap statistic from every iteration of the for loop (for every i). To do this we
first need to create a new object to save these results in:
boot.dist = rep(NA, b)
This creates a set of b NA values which will become the bootstrap distribution. Within the for loop, the line

boot.dist[i] = statistic(y[boot.sample])

tells each iteration of the for loop to fill in the ith value of boot.dist with the bootstrap statistic from that
iteration. When the for loop has finished, boot.dist has b different bootstrap statistics.
For example:

result = rep(NA, 5)
for (i in 1:5) {
result[i] = i + 2
}
After the for loop runs, result would be 3, 4, 5, 6, 7, because the for loop when i = 1 would
fill in the 1st value of result with i + 2 = 1 + 2 = 3, when i = 2 it would fill in the 2nd value of result with
i + 2 = 2 + 2 = 4, etc., up to i = 5.
There are two ways to use the bootstrap distribution to create a 95% confidence interval:
1. Compute the standard error of the statistic by taking the standard deviation of the bootstrap
statistics, and then use statistic ± 2 × SE:
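sd(boot.dist)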
2. Find the endpoints that correspond to the middle 95% of the bootstrap distribution. For the
middle 95%, we want the 2.5% in each tail, so the 0.025 and 0.975 percentiles (for other
confidence levels, just replace 0.025 and 0.975 with the appropriate percentiles):
quantile(boot.dist, c(0.025, 0.975))
Let's compute a 95% confidence interval for the correlation between Distance and Time in the
CommuteAtlanta data (500 Atlanta commutes). Load and attach the data:
library(Lock5Data)
data(CommuteAtlanta)
attach(CommuteAtlanta)
b = 10000
boot.dist = rep(NA, b)
for (i in 1:b) {
boot.sample = sample(500, replace=TRUE)
boot.dist[i] = cor(Distance[boot.sample], Time[boot.sample])
}
hist(boot.dist)
se = sd(boot.dist)
cor(Distance, Time) - 2*se
cor(Distance, Time) + 2*se
Although it's redundant to do both, for illustration we also use the percentile method:
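quantile(boot.dist, c(0.025, 0.975))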
Using R in Chapter 4
Generating a Randomization Distribution:
b = 10000 #number of randomization statistics
rand.dist = rep(NA, b)
for (i in 1:b) rand.dist[i] = rand.statistic

Randomization Statistic:
Proportion rbinom(1, n, 0.5) #change 0.5 to null value
The general idea with the for loop is the same as we learned in Chapter 3 with bootstrapping,
although if there is only one line of code within the for loop, we put it on one line for simplicity. For
b iterations we calculate a randomization statistic, each time storing it as a value in rand.dist.
Consider testing whether caffeine increases tap rate, using the Tap and Group variables from the
CaffeineTaps data. To generate one randomization statistic, we first shuffle which group each case is in with

sample(Group)

For this problem the statistic we want is the difference in mean tap rate between the shuffled groups:
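diff(by(Tap, sample(Group), mean))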
Now that we know how to calculate one randomization statistic, and know how to do a for loop from
Chapter 3, creating a randomization distribution is easy!
b = 10000
rand.dist = rep(NA, b)
for (i in 1:b) rand.dist[i] = diff(by(Tap, sample(Group), mean))
To compute a p-value from this randomization distribution, we want the proportion of randomization
statistics above the observed statistic (because we want to see if caffeine increases tap rate):
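observed = diff(by(Tap, Group, mean))   # the observed difference in means
mean(rand.dist >= observed)             # proportion of randomization statistics at or above it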
1. Dogs and Owners. 16 out of 25 dogs were correctly paired with their owners. Is this evidence
that the true proportion is greater than 0.5? In R, you can simulate flipping 25 coins and
counting the number of heads with
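rbinom(1, 25, 0.5)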
(for null proportions other than 0.5, just change the 0.5 above accordingly). Therefore, we can
create a randomization distribution with
b = 10000
rand.dist = rep(NA, b)
for (i in 1:b) rand.dist[i] = rbinom(1, 25, 0.5)
2. Body Temperature. Is a sample mean of 98.26°F based on 50 people evidence that the true
average body temperature in the population differs from 98.6°F? To answer this we create a
randomization distribution by bootstrapping from a sample that has been shifted to make the
null true, so we add 0.34 to each value. We can create the corresponding randomization
distribution with
b = 10000
rand.dist = rep(NA, b)
NullTemp=BodyTemp+0.34
for (i in 1:b) {
boot.samp = sample(50, replace=TRUE)
rand.dist[i] = mean(NullTemp[boot.samp])
}
In this case we have a two-sided Ha, so calculate the p-value as the proportion in the smaller tail
beyond the observed statistic (in this case because 98.26 is less than 98.6), and double it:
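2 * mean(rand.dist <= 98.26)   # lower-tail proportion at or below the observed mean, doubled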
Using R in Chapter 5
Example 1: Find the endpoints of the middle 90% of a standard normal distribution. We want 5% in
each tail, so we need to find the 5th and 95th percentiles:
qnorm(0.05)
qnorm(0.95)
Example 2: Find the p-value corresponding to a statistic of z = 1.5. Because Ha is upper-tailed, we find the area in the normal distribution above 1.5:
pnorm(1.5, lower.tail=FALSE)
Example 3: Find the area below 60 for a normal distribution with mean 75 and
standard deviation 12.
pnorm(60,75,12)
Using R in Chapter 6
Normal Distribution:
Find a percentile qnorm(0.10)
Find the area below z pnorm(z)
Find the area above z pnorm(z, lower.tail=FALSE)
t-Distribution:
Find a percentile qt(0.10, df)
Find the area below t pt(t, df, lower.tail=TRUE)
Find the area above t pt(t, df, lower.tail=FALSE)
Inference for Proportions:
Single proportion prop.test(count, n, p0) #delete p0 for intervals
Difference in proportions prop.test(c(count1, count2), c(n1, n2))
Inference for Means:
Single mean t.test(y, mu = mu0) #delete mu0 for intervals
Difference in means t.test(y ~ x)
Using prop.test or t.test
For p-values #alternative = "two.sided", "less", "greater"
For confidence intervals #conf.level = 0.95 or desired confidence level
There are two ways of using R to compute confidence intervals and p-values using the normal and t-
distributions:
1. Use the formulas in the book and qnorm, qt, pnorm, and pt.
2. Use prop.test and t.test on the raw data, without using any formulas.
The two methods should give very similar answers, but may not match exactly because prop.test
and t.test do things slightly differently from what you have learned (a continuity correction
for proportions, and a more complicated algorithm for the degrees of freedom for a difference in means).
The commands prop.test and t.test give both confidence intervals and p-values. For
confidence intervals, the default level is 95%, but other levels can be specified with conf.level.
For p-values, the default is a two-tailed test, but the alternative can be changed by specifying either
alternative = "less" or alternative = "greater".
Using Option 1 directly parallels the code in Chapter 5, so we refer you to the Chapter 5 examples.
Here we just illustrate the use of prop.test and t.test.
Example 1: In a recent survey of 800 Quebec residents, 224 thought that Quebec should separate
from Canada. Give a 90% confidence interval for the proportion of Quebecers who would like Quebec
to separate from Canada.
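prop.test(224, 800, conf.level = 0.90)   # p0 omitted because we want an interval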
Example 2: Test whether caffeine increases tap rate (based on CaffeineTaps data).
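t.test(Tap ~ Group, alternative = "greater")   # direction of the alternative depends on which group level comes first; check the group ordering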
Using R in Chapter 7
Chi-Square Distribution
Find the area above χ² pchisq(chisquare, df, lower.tail=FALSE)
Chi-Square Test
Goodness-of-fit chisq.test(table(x)) #if null probabilities not
equal, use p = c(p1, p2, p3) to specify
Test for association chisq.test(table(x1, x2))
Randomization Test
Goodness-of-fit chisq.test(table(x), simulate.p.value=TRUE)
Test for association chisq.test(table(x1, x2), simulate.p.value=TRUE)
If we get χ² = 3.1 and the degrees of freedom are 2, we would calculate the p-value with
pchisq(3.1, 2, lower.tail=FALSE)
1. Goodness of Fit. Use the data in APMultipleChoice to see if all five choices (A, B, C, D, E)
are equally likely:
chisq.test(table(Answer))
2. Test for Association. Use the data in StudentSurvey to see if type of award preference is
associated with gender:
chisq.test(table(Award, Gender))
Randomization Test
If the expected counts within any cell are too small, you should not use the chi-square distribution, but
instead do a randomization test. If you use chisq.test with small expected cell counts, R helps
you out by giving a warning message saying the chi-square approximation may be incorrect.
If the sample sizes are too small to use a chi-square distribution, you can do a randomization test with
the optional argument simulate.p.value within the command chisq.test. This tells R to
calculate the p-value by simulating the distribution of the χ² statistic, assuming the null is true, rather
than comparing it to the theoretical chi-square distribution.
For example, for a randomization test for an association between Award and Gender:
chisq.test(table(Award, Gender), simulate.p.value=TRUE)
Using R in Chapter 8
F Distribution
Find the area above F pf(F, df1, df2, lower.tail=FALSE)
Analysis of Variance summary(aov(y ~ x))
Pairwise Comparisons pairwise.t.test(y, x, p.adjust="none")
As with t-tests and chi-square tests, one option is to calculate the F-statistic by hand and then compare
it to the F distribution using pf. The (much easier) option is to use R's built-in analysis of variance
function, aov.
Analysis of Variance
Let's test whether the average number of ants that feed on a sandwich differs by type of filling, using
data from SandwichAnts (Data 8.1).
We can calculate the sample means in each group, visualize the data, and check the conditions for
ANOVA with
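by(Ants, Filling, mean)   # sample mean number of ants for each filling
by(Ants, Filling, sd)     # standard deviation within each group
boxplot(Ants ~ Filling)   # side-by-side boxplots to check the conditions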
In the sample, we see that the most ants came to the ham & pickles sandwich, and the least to the
vegemite. The sample standard deviations within each group are close enough to proceed. The sample
sizes are very small, so we should proceed with caution, but looking at the boxplots we see the data
appear to be at least symmetrically distributed within each group, without any outliers, so we proceed.
summary(aov(Ants ~ Filling))
Pairwise Comparisons
Finding the overall ANOVA significant, we may want to test individual pairwise comparisons. We
can test all pairwise comparisons with
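pairwise.t.test(Ants, Filling, p.adjust="none")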
This gives us the p-value corresponding to each pairwise comparison. The optional argument
p.adjust = "none" tells R to give the raw p-values and not adjust for multiple comparisons. If
you leave off this argument R will increase the p-values to account for multiple comparisons, but the
details here are beyond the scope of this text.
Using R in Chapter 9
Simple Linear Regression
Plot the data plot(y ~ x) # y is the response (vertical)
Fit the model lm(y ~ x) # y is the response
Give model output summary(model)
Add regression line to plot abline(model)
Inference for Correlation cor.test(x, y)
#alternative = "two.sided", "less", "greater"
Prediction
Calculate predicted values predict(model)
Calculate confidence intervals predict(model, interval = "confidence")
Calculate prediction intervals predict(model, interval = "prediction")
Prediction for new data predict(model, as.data.frame(cbind(x=1)))
Let's load and attach the data from RestaurantTips (with data(RestaurantTips) and attach(RestaurantTips)) to regress Tip on Bill. Before doing regression,
we should plot the data to make sure using simple linear regression is reasonable:
plot(Tip~Bill) #Note: plot(Bill, Tip) does the same
The trend appears to be approximately linear. There are a few unusually large tips, but no extreme
outliers, and variability appears to be constant as Bill increases, so we proceed. We fit the simple
linear regression model, saving it under the name mod (short for model - you can call it anything you
want). Once we fit the model, we use summary to see the output:
mod = lm(Tip ~ Bill)
summary(mod)
Results relevant to the intercept are in the (Intercept) row and results relevant to the slope are in the
Bill (the explanatory variable) row. The estimate column gives the estimated coefficients, the std.
error column gives the standard error for these estimates, the t value is simply estimate/SE, and the p-
value is the result of a hypothesis test testing whether that coefficient is significantly different from 0.
We also see the standard deviation of the error as "Residual standard error" and R² as "Multiple R-
squared". The last line of the regression output gives details relevant to the ANOVA table: the F-
statistic, degrees of freedom, and p-value.
After creating a plot, we can add the regression line to see how the line fits the data:
abline(mod)
Suppose a waitress at this bistro is about to deliver a $20 bill, and wants to predict her tip. She can get
a predicted value and 95% (this is the default level, change with level) prediction interval with
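predict(mod, as.data.frame(cbind(Bill=20)), interval="prediction")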
Lastly, we can do inference for the correlation between Bill and Tip:
cor.test(Bill, Tip)
Using R in Chapter 10
Multiple Regression
Fit the model lm(y ~ x1 + x2)
Give model output summary(model)
Residuals
Calculate residuals model$residuals
Residual plot plot(predict(model), model$residuals)
Histogram of residuals hist(model$residuals)
Prediction
Calculate predicted values predict(model)
Calculate confidence intervals predict(model, interval = "confidence")
Calculate prediction intervals predict(model, interval = "prediction")
Prediction for new data predict(model,as.data.frame(cbind(x1=1,x2=3)))
We'll continue the RestaurantTips example, but include additional explanatory variables: number in
party (Guests), and whether or not they pay with a credit card (Credit = 1 for yes, 0 for no).
We fit the multiple regression model with all three explanatory variables, call it tip.mod, and
summarize the model:
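tip.mod = lm(Tip ~ Bill + Guests + Credit)
summary(tip.mod)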
This output should look very similar to the output from Chapter 9, except now there is a row
corresponding to each explanatory variable.
Conditions
To check the conditions, we need to calculate residuals, make a residual versus fitted values plot, and
make a histogram of the residuals:
plot(tip.mod$fit, tip.mod$residuals)
hist(tip.mod$residuals)
Categorical Variables
While Credit was already coded with 0/1 here, this is not necessary for R. You can include any
explanatory variable in a multiple regression model, and R will automatically create corresponding 0/1
variables. For example, if you were to include gender coded as male/female, R would create a variable
GenderMale that is 1 for males and 0 for females. The only thing you should not do is include a
categorical variable with more than two levels that are all coded with numbers, because R will treat
this as a quantitative variable.
CHAPTER 2
One Categorical (x)
Frequency table table(x)
Proportion in group A mean(x == "A")
Pie chart pie(table(x))
Bar chart barplot(table(x))
Two Categorical (x1, x2)
Two-way table table(x1, x2)
Difference in proportions of x1 in group A by x2
diff(by(x1,x2,function(o) mean(o=="A")))
Segmented bar chart barplot(table(x1, x2), legend=TRUE)
Side-by-side bar chart barplot(table(x1,x2),legend=TRUE,beside=TRUE)
One Quantitative (y)
Mean mean(y) #for missing data: na.rm=TRUE
Median median(y) #for missing data: na.rm=TRUE
Standard deviation sd(y) #for missing data: na.rm=TRUE
5-Number summary summary(y)
Percentile quantile(y, 0.05)
Histogram hist(y)
Boxplot boxplot(y) #ylab="y-axis label"
One Quantitative (y) and
One Categorical (x)
Means by group by(y, x, mean) #for missing data: na.rm=TRUE
Difference in means diff(by(y, x, mean))
S.D. by group by(y, x, sd)
Side-by-side boxplots boxplot(y ~ x) #ylab="y-axis label"
Two Quantitative (y1, y2)
Scatterplot plot(y1, y2)
Correlation cor(y1, y2) #missing data: use="complete.obs"
Linear Regression lm(response ~ explanatory)
CHAPTER 3
Generating a Bootstrap Distribution:
b = 10000 #number of bootstrap statistics
boot.dist = rep(NA, b)
for (i in 1:b) {
  boot.sample = sample(n, replace=TRUE)
  boot.dist[i] = statistic(y[boot.sample])
}

Using a Bootstrap Distribution:
hist(boot.dist)
quantile(boot.dist, c(0.025, 0.975))
sd(boot.dist)
CHAPTER 4
Generating a Randomization Distribution:
b = 10000 #number of randomization statistics
rand.dist = rep(NA, b)
for (i in 1:b) rand.dist[i] = rand.statistic

Randomization Statistic:
Proportion rbinom(1, n, 0.5) #change 0.5 to null value
CHAPTER 5
Standard Normal Distribution:
Find a percentile qnorm(0.10)
Find the area below z pnorm(z)
Find the area above z pnorm(z, lower.tail=FALSE)
CHAPTER 6
t-Distribution:
Find a percentile qt(0.10, df)
Find the area below t pt(t, df)
Find the area above t pt(t, df, lower.tail=FALSE)
Inference for Proportions:
Single proportion prop.test(count, n, p0) #delete p0 for intervals
Difference in proportions prop.test(c(count1, count2), c(n1, n2))
Inference for Means:
Single mean t.test(y, mu = mu0) #delete mu0 for intervals
Difference in means t.test(y ~ x)
Using prop.test or t.test
For p-values #alternative = "two.sided", "less", "greater"
For confidence intervals #conf.level = 0.95 or desired confidence level
CHAPTER 7
Chi-Square Distribution
Find the area above χ² pchisq(chisquare, df, lower.tail=FALSE)
Chi-Square Test
Goodness-of-fit chisq.test(table(x)) #if null probabilities not
equal, use p = c(p1, p2, p3) to specify
Test for association chisq.test(table(x1, x2))
Randomization Test
Goodness-of-fit chisq.test(table(x), simulate.p.value=TRUE)
Test for association chisq.test(table(x1, x2), simulate.p.value=TRUE)
CHAPTER 8
F Distribution
Find the area above F pf(F, df1, df2, lower.tail=FALSE)
Analysis of Variance summary(aov(y ~ x))
Pairwise Comparisons pairwise.t.test(y, x, p.adjust="none")
CHAPTER 9
Simple Linear Regression
Plot the data plot(y ~ x) # y is the response (vertical)
Fit the model lm(y ~ x) # y is the response
Give model output summary(model)
Add regression line to plot abline(model)
Inference for Correlation cor.test(x, y)
#alternative = "two.sided", "less", "greater"
Prediction
Calculate predicted values predict(model)
Calculate confidence intervals predict(model, interval = "confidence")
Calculate prediction intervals predict(model, interval = "prediction")
Prediction for new data predict(model, as.data.frame(cbind(x=1)))
CHAPTER 10
Multiple Regression
Fit the model lm(y ~ x1 + x2)
Give model output summary(model)
Residuals
Calculate residuals model$residuals
Residual plot plot(predict(model), model$residuals)
Histogram of residuals hist(model$residuals)
Prediction
Calculate predicted values predict(model)
Calculate confidence intervals predict(model, interval = "confidence")
Calculate prediction intervals predict(model, interval = "prediction")
Prediction for new data predict(model,as.data.frame(cbind(x1=1,x2=3)))