Advanced R Statistical Programming and Data Models: Analysis, Machine Learning, and Visualization, 1st Edition, Matt Wiley
Matt Wiley and Joshua F. Wiley
Joshua F. Wiley
Columbia City, IN, USA
Trademarked names, logos, and images may appear in this book. Rather
than use a trademark symbol with every occurrence of a trademarked
name, logo, or image we use the names, logos, and images only in an
editorial fashion and to the benefit of the trademark owner, with no
intention of infringement of the trademark. The use in this publication
of trade names, trademarks, service marks, and similar terms, even if
they are not identified as such, is not to be taken as an expression of
opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true
and accurate at the date of publication, neither the authors nor the
editors nor the publisher can accept any legal responsibility for any
errors or omissions that may be made. The publisher makes no
warranty, express or implied, with respect to the material contained
herein.
Conventions
Bold lowercase letters are used to refer to a vector, for example, x. Bold
uppercase letters are used to refer to a matrix, for example, X.
Generally, the Latin alphabet is used for data and the Greek alphabet is
used for parameters. Mathematical functions are indicated with
parentheses, for example, f(·).
In the text, references to R code or functions appear in a monospaced
font. R function names include parentheses to indicate that they are
functions, such as mean() for the mean function in R.
Package Setup
Throughout the book, we will make use of many different R packages
that make tasks easier or provide more robust or sophisticated
graphing and analysis options.
Although not required for readers, we make use of the
checkpoint package to help ensure the book is reproducible [23]. If
you do not care about reproducibility and are happy to take your
chances that code that worked with our versions of R and its packages
also works with whatever versions you have, then you can skip
this section. If you want reproducibility but do not care why or
how it works, then create R scripts for the code in each chapter,
save them, and run the checkpoint package at the beginning. If
you want to know why and how it all works, read the next
few paragraphs.
Details on Reproducibility
The many additional packages available for R are one of its greatest
strengths. However, they also create some challenges. For example, as a
reader, suppose that on your computer, you have R v3.4.3 installed
and as part of that in January you had installed the ggplot2 package
for graphs. By default, you will have whatever version of ggplot2 was
available in January when you installed it. Now in one chapter, we tell
you that you need both the ggplot2 and cowplot packages. Because
you already had ggplot2 installed, you do not need to install it again.
However, suppose that you did not have the cowplot package
installed. So, whenever you happen to be reading that chapter, say in
April, you attempt to install the cowplot package. You will
now by default get the latest version of cowplot available for that
version of R as of April.
Now imagine a second reader comes along who also has R v3.4.3
but has neither the ggplot2 nor the cowplot package installed. They
also read the chapter in April, but they install both packages in April, so
they get the latest version of both packages available in April for R
v3.4.3.
Even though both you and the other reader had the same version of
R installed, you will end up with different package versions from each
other, and likely different versions yet from whatever versions we used
to write the book.
The end result is that different people, even with the same version
of R, very likely are using different versions of different packages. This
can pose a major challenge for reproducibility. If you are reading a
book, it can be a frustration because code does not seem to work as we
said it would. If you are using code in production or for scientific
research or decision-making, nonreproducibility can pose an even
bigger challenge.
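As a minimal sketch of how to see which versions you actually have, the following base R calls report the R version and the version of an installed package. The stats package is used only because it ships with every R installation. The ggplot2 check is an illustration that runs only if that package happens to be installed.

```r
# Report the running R version as a single string.
r_version <- R.version.string
print(r_version)

# Report the installed version of a package (stats ships with R).
stats_version <- packageVersion("stats")
print(stats_version)

# The same call works for add-on packages, if they are installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  print(packageVersion("ggplot2"))
}

# sessionInfo() summarizes R, the platform, and all attached packages.
info <- sessionInfo()
print(info)
```

Comparing this output between two machines is a quick way to spot the kind of version drift described above.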
The solution to standardize versions across people and ensure
results are fully reproducible is to control not only the version of R but
also the version of all packages. This requires a different approach to
package installation and management than the default system, which
uses the latest package versions from CRAN. The checkpoint
package is designed to solve this challenge. It does require some extra
steps and processes to use, and at first may seem a nuisance, but the
payoff is that you can be guaranteed that you are not only using the
same version of R but also the same version of all packages.
To understand how the checkpoint package works, we need a bit
more background regarding how R's libraries and package system
work.
Mainstream R packages are distributed through CRAN. Package
authors can submit new versions of their packages to CRAN, and CRAN
updates nightly. For some operating systems, CRAN just stores the
package source code, such as for Linux machines. For others, such as
Windows operating systems, CRAN builds precompiled package
binaries and hosts those. CRAN keeps old source code but generally not
old binary packages for long. On a local machine, when
install.packages() is run, R goes online to a repository, by default
CRAN, finds the package name, downloads it, and installs it into a local
library. The local library is simply a directory on your own
machine. R has a default location it uses as its local library, and
by default when you install packages, they are added to that
library. Once a package is installed, when it is loaded using
library(), R goes to its default library, finds a package with the
same name, and opens it.
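The library and repository settings described above can be inspected directly from an R session. The following is a small sketch using base R functions only. The values printed will differ from machine to machine.

```r
# Directories R searches for installed packages.
lib_dirs <- .libPaths()
print(lib_dirs)

# The first entry is where install.packages() installs by default.
default_lib <- lib_dirs[1]
print(default_lib)

# Repositories that install.packages() downloads from.
repos <- getOption("repos")
print(repos)

# Location on disk of an installed package (stats ships with R).
stats_path <- find.package("stats")
print(stats_path)
```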
The checkpoint package works by creating a new library on the
local machine for a particular version of R and a particular date. Then it
scans all the R script files in R's current working directory (you can
identify this using the getwd() function) and identifies any calls to
the library() or require() functions. Next, it checks
whether those packages are installed in the local library. If they are not,
it goes to a snapshot of CRAN taken by a separate server set up to support
the checkpoint package, so that checkpoint can install the
version of each package that was available on a specific date. In this way, the
checkpoint package can ensure that you have the same specific
version of R and specific version of all packages that we used when
writing the book. Or if you are trying to re-run some analysis from a
year ago, you can get the same version of those packages on a new
computer.
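To make the scanning step concrete, here is a rough sketch of how calls to library() and require() could be found in a folder of scripts. This is not the checkpoint package's actual implementation. The helper find_package_calls() and the demo script are hypothetical, and the regular expression only catches the first simple call on each line.

```r
# Hypothetical helper, not part of checkpoint: list package names that
# appear in library() or require() calls in the .R files of a directory.
find_package_calls <- function(dir) {
  files <- list.files(dir, pattern = "\\.[Rr]$", full.names = TRUE)
  pattern <- "(library|require)\\([[:space:]]*['\"]?([A-Za-z][A-Za-z0-9.]*)"
  pkgs <- character(0)
  for (f in files) {
    lines <- readLines(f, warn = FALSE)
    hits <- regmatches(lines, regexec(pattern, lines))
    for (h in hits) {
      # h holds the full match plus two capture groups when a line matches.
      if (length(h) == 3) pkgs <- c(pkgs, h[3])
    }
  }
  sort(unique(pkgs))
}

# Demo: write a small script to a temporary folder and scan it.
demo_dir <- tempfile("scan_demo")
dir.create(demo_dir)
writeLines(
  c("library(data.table)", "x <- 1", "require(\"ggplot2\")"),
  file.path(demo_dir, "analysis.R")
)
found <- find_package_calls(demo_dir)
print(found)  # "data.table" "ggplot2"
```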
Assuming that you have the following code in an R script, you can
use the checkpoint package to read the R script and find the call to
library(data.table), and it will install the data.table
package, which is a great package for data management [29]. If you do
not want checkpoint to look in the current working directory, you
can specify the project path, as we do for the book in this example. You
can also change where checkpoint creates its library from the
default location to another folder, which we also do. We
accomplish both of these using variables set as part of our R project,
book_directory and checkpoint_directory. If you are using
checkpoint on your own machine, set those variables to the relevant
directories, for example, book_directory <-
"path/to/your/directory". Note that whatever folder you
choose, R will need read and write privileges for that folder.
library(checkpoint)
# Use the CRAN snapshot from 2018-09-28 with R 3.5.1.
checkpoint("2018-09-28", R.version = "3.5.1",
  project = book_directory,
  checkpointLocation = checkpoint_directory,
  scanForPackages = FALSE,
  scan.rnw.with.knitr = TRUE, use.knitr = TRUE)
library(data.table)
# Session-wide defaults for printing and string handling.
options(
  width = 70,
  stringsAsFactors = FALSE,
  digits = 2)
Data Setup
One of the datasets we will use throughout this book is a longitudinal
study, the Americans’ Changing Lives (ACL) [45]. This is publicly
available data and can be downloaded by going to
http://doi.org/10.3886/ICPSR04690.v7.
The ACL study has five waves of data, shown in Table I-1.
Wave Year
W1 1986
W2 1989
W3 1994
W4 2002
W5 2011
load("../ICPSR_04690/DS0001/04690-0001-Data.rda")
ls()
## [1] "book_directory"       "checkpoint_directory"
## [3] "da04690.0001"         "render_apress"
# acl is assumed to be a data.table derived from da04690.0001 with one
# column per name below; that step is not shown in this excerpt.
setnames(acl, names(acl), c(
  "ID", "Sex", "RaceEthnicity", "SESCategory",
  "Employment_W1", "BMI_W1", "Smoke_W1", "PhysActCat_W1",
  "AGE_W1",
  "SWL_W1", "InformalSI_W1", "FormalSI_W1",
  "SelfEsteem_W1", "Mastery_W1", "SelfEfficacy_W1",
  "CESD11_W1", "NChronic12_W1",
  "Employment_W2", "BMI_W2", "Smoke_W2", "PhysActCat_W2",
  "InformalSI_W2", "FormalSI_W2",
  "SelfEsteem_W2", "Mastery_W2", "SelfEfficacy_W2",
  "CESD11_W2", "NChronic12_W2"
))
# Convert ID and SESCategory to factors and flip the sign of SWL_W1,
# modifying acl by reference.
acl[, ID := factor(ID)]
acl[, SESCategory := factor(SESCategory)]
acl[, SWL_W1 := SWL_W1 * -1]
Joshua F. Wiley
is a lecturer in the Monash Institute of Cognitive and Clinical
Neurosciences and School of Psychological Sciences at Monash
University. He earned his PhD from the University of California, Los
Angeles, and completed his postdoctoral training in primary care and
prevention. His research uses advanced quantitative methods to
understand the dynamics between psychosocial factors, sleep, and other
health behaviors in relation to psychological and physical health. He
develops or codevelops a number of R packages including varian, a
package to conduct Bayesian scale-location structural equation models;
MplusAutomation, a popular package that links R to the commercial Mplus
software; and JWileymisc, which provides miscellaneous functions to
explore data or speed up analysis.
library(checkpoint)
checkpoint("2018-09-28", R.version = "3.5.1",
project = book_directory,
checkpointLocation = checkpoint_directory,
scanForPackages = FALSE,
scan.rnw.with.knitr = TRUE, use.knitr = TRUE)
library(knitr)
library(ggplot2)
library(cowplot)
library(MASS)
library(JWileymisc)
library(data.table)
1.1 Distribution
Visualizing the Observed Distribution
Many statistical models require that the distribution of a variable be
specified. Histograms use bars to graph a distribution and are probably
the most common graph used to visualize the distribution of a single
variable. Although relatively rare, stacked dot plots are another
approach and provide a precise way to visualize the distribution of data
that shows the individual data points. Finally, density plots are also
quite common and are graphed by using a line that shows the
approximate density or amount of data falling at any given value.
ggplot(mtcars, aes(mpg)) +
geom_dotplot()
Figure 1-1 Stacked dot plot of miles per gallon from old cars
As a brief aside, much of the code for ggplot2 follows the format
shown in the following code snippet. In our case, we wanted a dot plot,
so the geometric object, or “geom”, is a dot plot (geom_dotplot()).
Many excellent online tutorials and books exist to learn how to use the
ggplot2 package for graphs, so we will not provide a greater
introduction to ggplot2 here. In particular, Hadley Wickham, who
develops ggplot2, has a recently updated book on the package,
ggplot2: Elegant Graphics for Data Analysis [109], which is an excellent
guide. For those who prefer less conceptual background and more of a
cookbook, we recommend the R Graphics Cookbook by Winston Chang
[20].
ggplot(the-data, aes(variable-to-plot)) +
geom_type-of-graph()
Unlike a dot plot that plots the raw data, a histogram is a bar graph
where the height of the bar is the count of the number of values falling
within the range specified by the width of the bar. You can vary the
width of bars to control how many nearby values are aggregated and
counted in one bar. Narrower bars aggregate fewer data points and
provide a more granular view. Wider bars aggregate more and provide
a broader view. A histogram showing the distribution of sepal lengths
of flowers from the famous iris dataset is shown in Figure 1-2.
ggplot(iris, aes(Sepal.Length)) +
geom_histogram()
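The effect of bar width described above can be tried directly. The following sketch assumes the ggplot2 package is installed and uses the binwidth argument of geom_histogram(). The two widths, 0.1 and 0.5, are arbitrary values chosen for illustration.

```r
library(ggplot2)

# Narrow bars: each bar counts values within a range of 0.1.
p_narrow <- ggplot(iris, aes(Sepal.Length)) +
  geom_histogram(binwidth = 0.1)

# Wide bars: each bar counts values within a range of 0.5.
p_wide <- ggplot(iris, aes(Sepal.Length)) +
  geom_histogram(binwidth = 0.5)

print(p_narrow)
print(p_wide)

# layer_data() returns one row per bar, so the narrow version has more rows.
n_bars_narrow <- nrow(layer_data(p_narrow))
n_bars_wide <- nrow(layer_data(p_wide))
```

The narrow version shows more bars with smaller counts, while the wide version gives a smoother, broader picture of the same data.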
ggplot(data.table(lynx = as.vector(lynx)),
aes(lynx)) +
geom_histogram()
ggplot(data.table(lynx = as.vector(lynx)),
aes(log(lynx))) +
geom_histogram()
Density Plots
Another common way to visualize the observed distribution of data is
to plot the empirical density. The code for ggplot2 is identical to
that for histograms except that geom_histogram() is replaced with
geom_density(). The code follows and the result is shown in Figure
1-5.
ggplot(iris, aes(Sepal.Length)) +
geom_density()
Figure 1-5 This is the density plot for our sepal lengths
Empirical density plots include some degree of smoothing because,
with continuous variables, there are never going to be many observations
at any specific value (e.g., it may be that no observation has a value of
3.286, even though there are values of 3.281 and 3.292). The smoothing
lets the plot show the overall shape of the distribution. At times it can
be helpful to adjust the
degree of smoothing to see a coarser (closer to the raw data) or smoother
(closer to the “distribution”) graph. Smoothing is controlled in
ggplot2 using the adjust argument. The default, which we saw in
Figure 1-5, is adjust = 1. Values less than 1 are “noisier” or have less
smoothing, while values greater than 1 increase the smoothness. We
compare a noisier plot in Figure 1-6 with a very smooth one in Figure 1-7.
ggplot(iris, aes(Sepal.Length)) +
geom_density(adjust = .5)
ggplot(iris, aes(Sepal.Length)) +
geom_density(adjust = 5)
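One way to see what adjust does numerically is with base R's density() function, where adjust multiplies the default bandwidth of the kernel density estimate. This sketch assumes geom_density() treats adjust the same way, as a multiplier on the bandwidth, and it uses the same values as the plots above.

```r
x <- iris$Sepal.Length

# Bandwidth actually used at the default setting (adjust = 1).
bw_default <- density(x)$bw

# Less smoothing: half the default bandwidth.
bw_noisy <- density(x, adjust = 0.5)$bw

# More smoothing: five times the default bandwidth.
bw_smooth <- density(x, adjust = 5)$bw

print(c(default = bw_default, noisy = bw_noisy, smooth = bw_smooth))
```

A smaller bandwidth lets the curve follow individual data points more closely, and a larger one averages over a wider neighborhood.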