
451 - Introduction To R Programming


Unit 1

Introduction to R
Structure:

1.1 Introduction

1.2 What is R?

1.3 The R Environment

1.4 Installing R

1.5 Install R Packages

Suggested Reading

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons
Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) as requested by the work’s creator or
licensees. This license is available at https://creativecommons.org/licenses/by-sa/4.0/.
Objectives:

After going through this unit, you will be able to:

 Understand R
 Install R and its packages

1.1 INTRODUCTION

R is a powerful language and environment for statistical computing and graphics. It is a free, open-source ("GNU") project which is similar to the commercial S language and environment developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered a different implementation of S, and is widely used as an educational language and research tool.

The main advantages of R are that it is free and that a lot of help is available online. It is quite similar to other programming packages such as MATLAB (which is not free), but more user-friendly than programming languages such as C++ or Fortran. You can use R as it is, but for educational purposes we prefer to use R in combination with the RStudio interface (also free), which has an organized layout and several extra options.

1.2 WHAT IS R?

While S-PLUS, the commercial implementation of S, is struggling to keep its existing users, R, the open-source version of S, has received a lot of attention in the last five years. This is not only because R is free: the system has proven to be a very effective tool for data manipulation, data analysis, graphing, and developing new functionality. The user community has grown enormously in recent years, and it is an active community that writes new R packages and makes them available to others.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-
series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S
language is often the vehicle of choice for research in statistical methodology, and R provides an Open
Source route to participation in that activity.

One of R’s strengths is the ease with which well-designed publication-quality plots can be produced,
including mathematical symbols and formulae where needed. Great care has been taken over the
defaults for the minor design choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public
License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar
systems (including FreeBSD and Linux), Windows and MacOS.

1.3 THE R ENVIRONMENT

R is an integrated suite of software facilities for data manipulation, calculation and graphical display.
It includes

• an effective data handling and storage facility,


• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• graphical facilities for data analysis and display either on-screen or on hardcopy, and
• a well-developed, simple and effective programming language which includes conditionals,
loops, user-defined recursive functions and input and output facilities.

The term ‘environment’ is intended to characterize it as a fully planned and coherent system, rather
than an incremental accretion of very specific and inflexible tools, as is frequently the case with other
data analysis software. R, like S, is designed around a true computer language, and it allows users to
add additional functionality by defining new functions. Much of the system is itself written in the R
dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally-
intensive tasks, C, C++ and Fortran code can be linked and called at run time.

Advanced users can write C code to manipulate R objects directly. Many users think of R as a statistics
system. We prefer to think of it as an environment within which statistical techniques are
implemented. R can be extended (easily) via packages. There are about eight packages supplied with
the R distribution and many more are available through the CRAN family of Internet sites, covering a
very wide range of modern statistics. R has its own LaTeX-like documentation format, which is used
to supply comprehensive documentation, both on-line in a number of formats and in hardcopy.

1.4 INSTALLING R

R can be downloaded from the ‘Comprehensive R Archive Network’ (CRAN). You can download the
complete source code of R, but more likely as a beginning R user you want to download the
precompiled binary distribution of R.

Go to the R web site http://www.r-project.org, select a CRAN mirror site, and download the base
distribution file (32/64 bit; under Windows: R-3.6.0-win.exe). At the time of writing the latest version
is 3.6.0. The base file has a size of around 79MB; execute it to install R. The installation
wizard will guide you through the installation process. It may be useful to install the R reference
manual as well; by default it is not installed. You can select it in the installation wizard.

Step by step guide to install R on Windows:

1) Download the installable file from the following link: https://cran.r-project.org/bin/windows/base/
2) Click on the latest version of R .exe file. The versions can be updated as per the latest releases.
3) The setup will request permission to be installed on the system. Click yes to proceed.
4) Select the Preferred language from the dropdown to begin an installation in that preferred
language.
5) Click next to proceed with the installation.
6) Choose the path where you wish to install R by clicking on browse and changing the workspace
locations. Click next to proceed with the default installation. The minimum space
requirements are mentioned at the bottom of the dialog box. Please check that you have the
required amount of free space on your drive.
7) Choose the type of installation you require. By default R installs both the 32 and 64 bit versions
on your system. If your system is a 32 bit system you will require the 32 bit installation; if it
is a 64 bit system you will require the 64 bit installation. Do not uncheck the Core
Files and Message Translations. Please make note of the space requirement of the installation.
8) To customize the startup options for R, choose Options and customize them. To proceed with a
vanilla installation use Next.
9) To generate program shortcuts and name them as per your requirements, specify the
necessary customizations. To proceed with the default installation hit next.
10) Click on the next button to begin your installation.
11) After the installation has completed you will see the final screen. Click finish to complete the
installation.
12) Open Start Menu and you will find R in the available set of Programs.
13) Click on the R icon in the menu settings to open R.
14) You are all set to use R to begin programming.

To Install RStudio

1) Go to www.rstudio.com and click on the "Download RStudio" button.


2) Click on "Download RStudio Desktop."
3) Click on the version recommended for your system, or the latest Windows version, and save
the executable file. Run the .exe file and follow the installation instructions.

Step by step guide to install R on Linux:

Install R on Ubuntu 16.04

1) Open /etc/apt/sources.list and add the following line to the end of the file:
deb http://cran.rstudio.com/bin/linux/ubuntu xenial/
2) Add the key ID for the CRAN network (Ubuntu GPG key): sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
3) Update the repository: sudo apt update
4) Install the R binaries: sudo apt install r-base

Download Packages from CRAN:

1) Open the R interpreter


2) The interpreter will open with some information about the version.
Enter install.packages("ggplot2")

3) A list of available mirrors should appear. Pick the closest location to maximize transfer
speeds.
4) When quitting the interpreter, you will be prompted to save the workspace image. If you
choose yes, this will save all the user defined objects for the next session.

To Install RStudio
1) Paste the following commands into the terminal and press enter:

sudo apt-get install gdebi
cd ~/Downloads
wget https://download1.rstudio.org/rstudio-xenial-1.1.419-amd64.deb
sudo gdebi rstudio-xenial-1.1.419-amd64.deb
2) Another install option is to visit RStudio (https://www.rstudio.com/) to obtain the software
and then install it by following the procedure.

Step by step guide to install R on Mac OS:

1) Open an internet browser and go to www.r-project.org.


2) Click the "download R" link in the middle of the page under "Getting Started."
3) Select a CRAN location (a mirror site) and click the corresponding link.
4) Click on the "Download R for (Mac) OS X" link at the top of the page.
5) Click on the file containing the latest version of R under "Files."
6) Save the .pkg file, double-click it to open, and follow the installation instructions.
7) Now that R is installed, you need to download and install RStudio.
To Install RStudio

1) Go to www.rstudio.com and click on the "Download RStudio" button.


2) Click on "Download RStudio Desktop."
3) Click on the version recommended for your system, or the latest Mac version, save the .dmg
file on your computer, double-click it to open, and then drag and drop it to your applications
folder.

1.5 INSTALL R PACKAGES

R packages are collections of functions and data sets developed by the community. They increase the
power of R by improving existing base R functionalities, or by adding new ones.

A package is a suitable way to organize your own work and, if you want to, share it with others.
Typically, a package will include code (not only R code!), documentation for the package and the
functions inside, some tests to check everything works as it should, and data sets. The basic
information about a package is provided in the DESCRIPTION file (https://cran.r-project.org/doc/manuals/r-release/R-exts.html#The-DESCRIPTION-file), where you can find out what the package does, who the author is, what version the documentation belongs to, the date, the type of license governing its use, and the package dependencies.

Besides finding the DESCRIPTION file on repository sites such as cran.r-project.org or stat.ethz.ch, you can also access it inside R with the command packageDescription("package"), via the documentation of the package with help(package = "package"), or online in the repository of the package.
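
For instance, the stats package ships with every R installation, so you can try the following; it prints the full DESCRIPTION file (output omitted here).

packageDescription("stats")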

A repository is a place where packages are located so you can install them from it. Although you or
your organization might have a local repository, typically they are online and accessible to everyone.
Three of the most popular repositories for R packages are:

 CRAN: the official repository. It is a network of ftp and web servers maintained by the R
community around the world. The R Foundation coordinates it, and for a package to be
published here, it needs to pass several tests that ensure it follows CRAN
policies.
 Bioconductor: this is a topic-specific repository, intended for open source software for
bioinformatics. Like CRAN, it has its own submission and review processes, and its community
is very active, with several conferences and meetings per year.
 GitHub: although it is not R-specific, GitHub is probably the most popular repository for
open source projects. Its popularity comes from its unlimited space for open source, its
integration with git (a version control system), and its ease of sharing and collaborating with
others.

To install an R package, open an R session and type at the command line

install.packages("<the package's name>")

R will download the package from CRAN, so you'll need to be connected to the internet. Once you
have a package installed, you can make its contents available to use in your current R session by
running

library("<the package's name>")

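For example, to install and then load the ggplot2 plotting package (this assumes an internet connection):

install.packages("ggplot2")

library("ggplot2")
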
There are thousands of helpful R packages to use. The following is a list of some of the most
downloaded R packages:
To load data

 DBI - The standard for communication between R and relational database management
systems. Packages that connect R to databases depend on the DBI package.
 odbc - Use any ODBC driver with the odbc package to connect R to your database. Note:
RStudio professional products come with professional drivers for some of the most popular
databases.
 RMySQL, RPostgreSQL, RSQLite - If you'd like to read in data from a database, these packages
are a good place to start. Choose the package that fits your type of database.
 XLConnect, xlsx - These packages help you read and write Microsoft Excel files from R. You can
also just export your spreadsheets from Excel as .csv's.
 foreign - Want to read a SAS data set into R? Or an SPSS data set? Foreign provides functions
that help you load data files from other programs into R.
 haven - Enables R to read and write data from SAS, SPSS, and Stata.

R can handle plain text files – no package required. Just use the functions read.csv, read.table, and
read.fwf. If you have even more exotic data, consult the CRAN guide to data import and export.

To manipulate data

 dplyr - Essential shortcuts for subsetting, summarizing, rearranging, and joining together data
sets. dplyr is our go-to package for fast data manipulation.
 tidyr - Tools for changing the layout of your data sets. Use the gather and spread functions to
convert your data into the tidy format, the layout R likes best.
 stringr - Easy to learn tools for regular expressions and character strings.
 lubridate - Tools that make working with dates and times easier.

To visualize data

 ggplot2 - R's famous package for making beautiful graphics. ggplot2 lets you use the grammar
of graphics to build layered, customizable plots.
 ggvis - Interactive, web based graphics built with the grammar of graphics.
 rgl - Interactive 3D visualizations with R
 htmlwidgets - A fast way to build interactive (javascript based) visualizations with R. Packages
that implement htmlwidgets include:
o leaflet (maps)
o dygraphs (time series)
o DT (tables)
o DiagrammeR (diagrams)
o networkD3 (network graphs)
o threejs (3D scatterplots and globes)

 googleVis - Lets you use Google Chart tools to visualize data in R. Google Chart tools used to
be called Gapminder, the graphing software Hans Rosling made famous in his TED talk.

To model data

 car - car's Anova function is popular for making type II and type III Anova tables.
 mgcv - Generalized Additive Models
 lme4/nlme - Linear and Non-linear mixed effects models
 randomForest - Random forest methods from machine learning
 multcomp - Tools for multiple comparison testing
 vcd - Visualization tools and tests for categorical data
 glmnet - Lasso and elastic-net regression methods with cross validation
 survival - Tools for survival analysis
 caret - Tools for training regression and classification models

To report results

 shiny - Easily make interactive web apps with R. A perfect way to explore data and share
findings with non-programmers.
 R Markdown - The perfect workflow for reproducible reporting. Write R code in your
markdown reports. When you run render, R Markdown will replace the code with its results
and then export your report as an HTML, pdf, or MS Word document, or a HTML or pdf
slideshow. The result? Automated reporting. R Markdown is integrated straight into RStudio.
 xtable - The xtable function takes an R object (like a data frame) and returns the LaTeX or HTML
code you need to paste a pretty version of the object into your documents. Copy and paste,
or pair up with R Markdown.

For Spatial data

 sp, maptools - Tools for loading and using spatial data including shapefiles.
 maps - Easy to use map polygons for plots.
 ggmap - Download street maps straight from Google maps and use them as a background in
your ggplots.

For Time Series and Financial data

 zoo - Provides the most popular format for saving time series objects in R.
 xts - Very flexible tools for manipulating time series data sets.
 quantmod - Tools for downloading financial data, plotting common charts, and doing technical
analysis.

To write high performance R code

 Rcpp - Write R functions that call C++ code for lightning fast speed.
 data.table - An alternative way to organize data sets for very, very fast operations. Useful for
big data.
 parallel - Use parallel processing in R to speed up your code or to crunch large data sets.

To work with the web

 XML - Read and create XML documents with R


 jsonlite - Read and create JSON data tables with R
 httr - A set of useful tools for working with http connections

To write your own R packages

 devtools - An essential suite of tools for turning your code into an R package.
 testthat - testthat provides an easy way to write unit tests for your code projects.
 roxygen2 - A quick way to document your R packages. roxygen2 turns inline code comments
into documentation pages and builds a package namespace.
Suggested Reading
1) https://www.datacamp.com/community/tutorials/r-packages-guide
2) http://r-pkgs.had.co.nz/
3) http://www.sthda.com/english/wiki/installing-and-using-r-packages
4) https://cran.r-project.org/bin/windows/base/
Unit 2
Data Types and Data Structures
Structure:

2.1 Introduction

2.2 Data Types

2.3 Data Structures

Summary

Keywords

Self-Assessment Questions

Answers to Check your Progress

Suggested Reading

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) as requested by the work’s creator or
licensees. This license is available at https://creativecommons.org/licenses/by-nc-sa/4.0/ .
Objectives

After going through this unit, you will be able to:

 Understand the different data types and their application


 Use the various data structures in R programs

2.1 INTRODUCTION

To make the best of the R language, we need a strong understanding of the basic data types and data
structures and how to operate on them. One of the essential features of R is its robust ability to handle
and process complicated statistical operations with an optimised strategy. Data structures are very
important to understand because these are the objects we will manipulate on a day-to-day basis in R.
Dealing with object conversions is one of the most common sources of frustration for beginners.
Elements of these data types may be combined to form data structures, such as atomic vectors. When
we call a vector atomic, we mean that the vector only holds data of a single data type.

In this unit, we will discuss the different aspects of data types and structures in R with examples.

2.2 DATA TYPES

2.2.1 Double

If you do calculations on numbers, you can use the data type double to represent the numbers.
Doubles are numbers like 3.14, 8.0 and 9.1. Doubles are used to represent continuous variables like
the weight or length of a person.

x <- 6.14

y <- 1.0

z <- 7.0 + 13.9

Use the function is.double to check if an object is of type double. Alternatively, use the function typeof
to ask R the type of the object x.

typeof(x)

[1] "double"

is.double(7.9)

[1] TRUE

test <- 1122.490

is.double(test)

[1] TRUE

Keep in mind that doubles are just approximations to real numbers. Mathematically there are
infinitely many real numbers, but the computer can of course only represent a finite number of them.
Not only can numbers like π or √2 not be represented exactly; less exotic numbers like 0.1, for
example, can also not be represented exactly.
One of the consequences of this is that when you compare two doubles with each other you should
take some care. Consider the following (surprising) result.

0.2 == 0.1 + 0.1

[1] FALSE
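
A safer way to compare two doubles is therefore to test for equality up to a small numerical tolerance, for example with the built-in all.equal function:

isTRUE(all.equal(0.2, 0.1 + 0.1))

[1] TRUE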

2.2.2 Integer

Integers are whole numbers. They can be used to represent counting variables, for example the
number of students in a class. Use the function as.integer to create objects of type integer.

nstudent <- as.integer(3)

is.integer(nstudent)

[1] TRUE

Note that 3.0 is not an integer, nor is 3 by default an integer!

nstudent <- 3.0

is.integer(nstudent)

[1] FALSE

nstudent <- 3

is.integer(nstudent)

[1] FALSE

So a 3 of type ‘integer' in R is something different than a 3.0 of type ‘double'. However, you can mix
objects of type ‘double' and ‘integer' in one calculation without any problems.

x <- as.integer(7)

y <- 2.0

z <- x/y

In contrast to some other programming languages, the answer is of type double and is 3.5. The
maximum integer in R is 2^31 - 1.

as.integer(2^31 - 1)

[1] 2147483647

as.integer(2^31)

[1] NA

Warning message:

NAs introduced by coercion
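
As an aside, an integer can also be created directly by appending the suffix L to a number:

nstudent <- 3L

is.integer(nstudent)

[1] TRUE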

2.2.3 Complex

Objects of type ‘complex' are used to represent complex numbers. In statistical data analysis you will
not need them often. Use the function as.complex or complex to create objects of type complex.
test1 <- as.complex(-25+5i)

sqrt(test1)

[1] 0.4975427+5.024694i

test2 <- complex(5,real=2,im=6)

test2

[1] 2+6i 2+6i 2+6i 2+6i 2+6i

typeof(test2)

[1] "complex"

Note that by default calculations are done on real numbers, so sqrt(-1) results in NaN (with a warning). Use

sqrt(as.complex(-1))

[1] 0+1i

2.2.4 Logical

An object of data type logical can have the value TRUE or FALSE and is used to indicate if a condition
is true or false. Such objects are usually the result of logical expressions.

x <- 8

y <- x > 12

y

[1] FALSE

The result of the function is.double is an object of type logical (TRUE or FALSE).

is.double(7.75)

[1] TRUE

Logical expressions are often built from logical operators:

< smaller than

<= smaller than or equal to

> larger than

>= larger than or equal to

== is equal to

!= is unequal to

The logical operators and, or and not are given by &, | and !, respectively. The c(...) function is a generic
function which combines its arguments to form a vector or list.

x <- c(9,166)

y <- (3 < x) & (x <= 10)

y

[1] TRUE FALSE

Calculations can also be carried out on logical objects, in which case FALSE is treated as zero and
TRUE as one. For example, the sum function can be used to count the number of
TRUEs in a vector or array.

x <- 1:15

## number of elements in x larger than 9

sum(x>9)

[1] 6
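
In the same spirit, the mean function gives the proportion of TRUEs in a vector:

## proportion of elements in x larger than 9

mean(x>9)

[1] 0.4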

2.2.5 Character

A character object is represented by a collection of characters between double quotes ("). For
example: "x", "test character" and "hello world!". One way to create character objects is as follows.

x <- c("a","b","c")

[1] "a" "b" "c"

mychar1 <- "This is a test"

mychar2 <- "This is another test"

charvector <- c("a", "b", "c", "test")

The double quotes indicate that we are dealing with an object of type ‘character'.

2.2.6 Factor

The factor data type is used to represent categorical data (i.e. data of which the value range is a
collection of codes). For example: variable ‘sex' with values male and female, variable ‘blood type'
with values: A, AB and O.

An individual code of the value range is also called a level of the factor variable. So the variable ‘sex'
is a factor variable with two levels, male and female. Sometimes people confuse factor type with
character type. Characters are often used for labels in graphs, column names or row names. Factors
must be used when you want to represent a discrete variable in a data frame and want to analyse it.

Factor objects can be created from character objects or from numeric objects, using the function
factor. For example, to create a vector of length five of type factor do the following:

sex <- c("male","male","female","male","female")

The object sex is a character object. You need to transform it to a factor.

sex <- factor(sex)

sex

[1] male male female male female


Use the function levels to see the different levels a factor variable has.

levels(sex)

[1] "female" "male"

Note that the result of the levels function is of type character. Another way to generate the sex
variable is as follows:

sex <- c(1,1,2,1,2)

The object ‘sex' is a numeric variable; you need to transform it to a factor.

sex <- factor(sex)

sex

[1] 1 1 2 1 2

Levels: 1 2

The object ‘sex' looks like an integer variable, but it is not: the 1 here represents the level "1". So arithmetic
operations on the sex variable are not possible:

sex + 7

[1] NA NA NA NA NA

Warning message:

+ not meaningful for factors in: Ops.factor(sex, 7)

It is better to rename the levels, so level "1" becomes male and level "2" becomes female:

levels(sex) <- c("male","female")

sex

[1] male male female male female
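
A handy way to count the number of observations per level is the table function:

table(sex)

sex
  male female
     3      2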

You can transform factor variables to double or integer variables using the as.double or as.integer
function.

sex.numeric <- as.double(sex)

sex.numeric

[1] 2 2 1 2 1

The 1 is assigned to the female level, only because alphabetically female comes first. If the order of
the levels is of importance, you will need to use ordered factors. Use the function ordered and specify
the order with the levels argument. For example:

Income <- c("High","Low","Average","Low","Average","High","Low")

Income <- ordered(Income, levels=c("Low","Average","High"))

Income

[1] High Low Average Low Average High Low


Levels: Low < Average < High

The last line indicates the ordering of the levels within the factor variable. When you transform an
ordered factor variable, the order is used to assign numbers to the levels.

Income.numeric <- as.double(Income)

Income.numeric

[1] 3 1 2 1 2 3 1

The order of the levels is also used in linear models. If one or more of the regression variables are
factor variables, the order of the levels is important for the interpretation of the parameter estimates.

2.2.7 Dates and Times

To represent a calendar date in R use the function as.Date to create an object of class Date.

temp <- c("12-09-2018", "29-08-2019")

z <- as.Date(temp, "%d-%m-%Y")

z

[1] "2018-09-12" "2019-08-29"

data.class(z)

[1] "Date"

format(z, "%d-%m-%Y")

[1] "12-09-2018" "29-08-2019"

You can add a number to a date object; the number is interpreted as the number of days to add to the
date.

z + 19

[1] "2018-10-01" "2019-09-17"

You can subtract one date from another; the result is an object of class ‘difftime'.

dz <- z[2] - z[1]

dz

Time difference of 351 days

data.class(dz)

[1] "difftime"

In R the classes POSIXct and POSIXlt can be used to represent calendar dates and times. You can create
POSIXct objects with the function as.POSIXct. The function accepts characters as input, and it can be
used not only to specify a date but also a time within a date.

t1 <- as.POSIXct("2019-01-23")

t2 <- as.POSIXct("2019-04-23 15:34")


t1

t2

[1] "2019-01-23 …. Standard Time"

[1] "2019-04-23 15:34:00 …. Daylight Time"

A handy function is strptime; it is used to convert a character representation of a date (and
time) into a date-time object. You need to provide a conversion specification that starts
with a % followed by a single letter.

# first creating four characters

x <- c("1jan2019", "2jan2019", "31mar2019", "30jul2019")

z <- strptime(x, "%d%b%Y")

zt <- as.POSIXct(z)

zt

[1] "2019-01-01 …. Standard Time"

[2] "2019-01-02 …. Standard Time"

[3] "2019-03-31 …. Daylight Time"

[4] "2019-07-30 …. Daylight Time"

# pasting 4 character dates and 4 character times together

dates <- c("02/27/2019", "02/27/2019", "01/14/2019", "02/28/2019")

times <- c("23:03:20", "22:29:56", "01:03:30", "18:21:03")

x <- paste(dates, times)

z <- strptime(x, "%m/%d/%Y %H:%M:%S")

zt <- as.POSIXct(z)

zt

[1] "2019-02-27 23:03:20 …. Standard Time"

[2] "2019-02-27 22:29:56 …. Standard Time"

[3] "2019-01-14 01:03:30 …. Standard Time"

[4] "2019-02-28 18:21:03 …. Standard Time"

An object of type POSIXct can be used in certain calculations; a number can be added to a POSIXct
object. This number will be interpreted as the number of seconds to add to the POSIXct object.

zt + 13

[1] "2019-02-27 23:03:33 …. Standard Time"

[2] "2019-02-27 22:30:09 …. Standard Time"


[3] "2019-01-14 01:03:43 …. Standard Time"

[4] "2019-02-28 18:21:16 …. Standard Time"

You can subtract two POSIXct objects, the result is a so called ‘difftime' object.

t2 <- as.POSIXct("2019-01-23 14:33")

t1 <- as.POSIXct("2018-04-23")

d <- t2-t1

d

Time difference of 275.6479 days

A ‘difftime' object can also be created using the function as.difftime, and you can add a difftime object
to a POSIXct object. Due to a bug in R this can only safely be done with the function "+.POSIXt".

"+.POSIXt"(zt, d)

[1] "2019-11-29 14:36:20 …. Standard Time"

[2] "2019-11-29 14:02:56 …. Standard Time"

[3] "2019-10-15 17:36:30 …. Daylight Time"

[4] "2019-11-30 09:54:03 …. Standard Time"

To extract the weekday, month or quarter from a POSIXct object use the handy R functions weekdays,
months and quarters. Another handy function is Sys.time, which returns the current date and time.

weekdays(zt)

[1] "Wednesday" "Wednesday" "Monday" "Thursday"

There are some R packages that can handle dates and time objects. For example, the packages zoo,
chron, tseries, its and Rmetrics. Especially Rmetrics has a set of powerful functions to maintain and
manipulate dates and times.

2.2.8 Missing Data and Infinite Values

We have already seen the symbol NA. In R, it is used to represent ‘missing' data (Not Available). It is
not really a separate data type, it could be a missing double or a missing integer. To check if data is
missing, use the function is.na or use a direct comparison with the symbol NA. There is also the symbol
NaN (Not a Number), which can be detected with the function is.nan.

x <- as.double( c("1", "2", "qaz"))

is.na(x)

[1] FALSE FALSE TRUE

z <- sqrt(c(1,-1))

Warning message:

NaNs produced in: sqrt(c(1, -1))


is.nan(z)

[1] FALSE TRUE

Infinite values are represented by Inf or -Inf. You can check if a value is infinite with the function
is.infinite. Use is.finite to check if a value is finite.

x <- c(1,3,4)

y <- c(1,0,4)

x/y

[1] 1 Inf 1

z <- log(c(4,0,8))

is.infinite(z)

[1] FALSE TRUE FALSE
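
Missing values propagate through most calculations, so many functions have an na.rm argument to ignore them; NAs can also be dropped with logical indexing:

x <- c(1, NA, 3)

mean(x)

[1] NA

mean(x, na.rm=TRUE)

[1] 2

x[!is.na(x)]

[1] 1 3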

In R, NULL represents the null object. NULL is used mainly to represent lists with zero length, and
is often returned by expressions and functions whose value is undefined.

2.3 DATA STRUCTURES

Before you can perform statistical analysis in R, your data has to be structured in some coherent way.
To store your data, R has the following structures:

1. vector
2. matrix
3. array
4. data frame
5. time-series
6. list

2.3.1 Vectors

The most basic data-type in R is the vector. A vector is just a 1-dimensional array of values. Several
different kinds of vectors are available:

 numerical vectors,
 logical vectors,
 character-string vectors,
 factors,
 ordered factors, and
 lists

A vector’s defining attributes are its mode—which kind of vector it is— and its length. Vectors can also
have a names attribute, which allows one to refer to elements by name. We’ve already seen how to
create vectors in R using the c function, e.g.,

x <- c(1,3,5,7,9,11)

y <- c(6.5,4.3,9.1,-8.5,0,3.6)
z <- c("dog","cat","dormouse","chinchilla")

w <- c(a=4,b=5.5,c=8.8)

length(x)

## [1] 6

mode(y)

## [1] "numeric"

mode(z)

## [1] "character"

names(w)

## [1] "a" "b" "c"

The nice thing about having vectors as a basic type is that many operations in R are efficiently
vectorised. That is, the operation acts on the vector as a unit, saving you the trouble of treating each
entry individually. For example:

x <- x+1

xx <- sqrt(x)

x; xx

## [1] 2 4 6 8 10 12

## [1] 1.414214 2.000000 2.449490 2.828427 3.162278 3.464102

Notice that the operations were applied to every entry in the vector. Similarly, commands like x-5,
2*x, x/10, and x^2 apply subtraction, multiplication, division, and square to each element of the
vector. The same is true for operations involving multiple vectors:

x+y

## [1] 8.5 8.3 15.1 -0.5 10.0 15.6

In R the default is to apply functions and operations to vectors in an element by element manner;
anything else (e.g. matrix multiplication) is done using special notation.
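
For example, the element-wise product of two vectors is different from their inner (matrix) product, which uses the special operator %*% (a small illustration):

u <- c(1,2,3)

v <- c(4,5,6)

u*v

## [1]  4 10 18

u %*% v

##      [,1]
## [1,]   32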

2.3.1.1 Element Recycling

When performing vector operations in R, it is important to know about recycling. R has a very useful,
but unusual and perhaps unexpected, behaviour when two vector operands in a vectorised operation
are of unequal lengths. It will effectively extend the shorter vector using element “re-cycling”: re-using
elements of the shorter vector. Thus

x <- c(1,2,3)

y <- c(10,20,30,40,50,60)

x+y

## [1] 11 22 33 41 52 63
y-x

## [1] 9 18 27 39 48 57

a <- 1:10
b <- 1:5
a+b
[1] 2 4 6 8 10 7 9 11 13 15
Here, the elements of a and b are added together starting from the first element of both vectors.
When R reaches the end of the shorter vector b, it starts again at the first element of b and continues
until it reaches the last element of the longest vector a. This behaviour may seem crazy at first glance,
but it is very useful when you want to perform the same operation on every element of a vector. For
example, say we want to multiply every element of our vector a by 5:

a <- 1:10
b <- 5
a*b
[1] 5 10 15 20 25 30 35 40 45 50
Remember there are no scalars in R, so b is actually a vector of length 1; in order to multiply every
element of a by its value, it is recycled to match the length of a.

When the length of the longer object is a multiple of the shorter object length (as in our example
above), the recycling occurs silently. When the longer object length is not a multiple of the shorter
object length, a warning is given:

a <- 1:10
b <- 1:7
a+b
Warning in a + b: longer object length is not a multiple of shorter object length
[1] 2 4 6 8 10 12 14 9 11 13
2.3.1.2 Functions for Creating Vectors

A set of regularly spaced values can be created with the seq function, whose syntax is x <-
seq(from,to,by) or x <- seq(from,to) or x <- seq(from,to,length.out). The first form generates a vector
(from,from+by,from+2*by,...) with the last entry not extending further than to; in the second form the
value of by is assumed to be 1 or -1, depending on whether from or to is larger; and the third form
creates a vector with the desired endpoints and length. There is also a shortcut for creating vectors
with by=1:

1:8

## [1] 1 2 3 4 5 6 7 8
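
For example, each of the three forms in action:

seq(1,13,by=4)

## [1]  1  5  9 13

seq(1,13)

## [1]  1  2  3  4  5  6  7  8  9 10 11 12 13

seq(0,1,length.out=5)

## [1] 0.00 0.25 0.50 0.75 1.00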
A constant vector such as (1 1 1 1) can be created with the rep function, whose basic syntax is
rep(values,lengths).

For example,

rep(3,5)

## [1] 3 3 3 3 3

creates a vector in which the value 3 is repeated 5 times. rep() will repeat a whole vector multiple
times

rep(1:3,3)

## [1] 1 2 3 1 2 3 1 2 3

or will repeat each of the elements in a vector a given number of times:

rep(1:3,each=3)

## [1] 1 1 1 2 2 2 3 3 3

Even more flexibly, you can repeat each element in the vector a different number of times:

rep(c(3,4),c(2,5))

## [1] 3 3 4 4 4 4 4

The value 3 was repeated 2 times, followed by the value 4 repeated 5 times. rep() can be a little bit
mind-blowing as you get started, but you’ll get used to it—and it will turn out to be useful.

Some important R functions for creating and working with vectors.

 seq(from,to,by=1) - Vector of evenly spaced values (default increment = 1)


 seq(from,to,length.out) - Vector of evenly spaced values, specified length
 c(u,v,...) - Combine a set of numbers and/or vectors into a single
vector
 rep(a,b) - Create vector by repeating elements of a, b times each
 hist(v) - Histogram plot of the values in v
 mean(v),var(v),sd(v) - Estimate of population mean, variance, standard deviation
based on data values in v
 cov(v,w) - Covariance between two vectors
 cor(v,w) - Correlation between two vectors

Many of these have other optional arguments; use the help system (e.g. ?cor) for more information.
The statistical functions such as var regard the values as samples from a population and compute the
unbiased estimate of the population statistic; for example sd(1:3) = 1.
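
For instance, with the values 2, 4 and 6:

v <- c(2,4,6)

mean(v); var(v); sd(v)

## [1] 4
## [1] 4
## [1] 2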

2.3.1.3 Vector Indexing

It is often necessary to extract a specific entry or other part of a vector. This procedure is called vector
indexing, and uses square brackets ([]):

z <- c(1,3,5,7,9,11); z[3]

## [1] 5
z[3] extracts the third element of the vector z. You can also access a block of elements by giving a
vector of indices:

v <- z[c(2,3,4,5)]

or

v <- z[2:5]; v

## [1] 3 5 7 9

This has extracted the 2nd through 5th elements in the vector.

Extracted parts of a vector don’t have to be regularly spaced. For example

v <- z[c(1,2,5)]; v

## [1] 1 3 9

Indexing is also used to set specific values within a vector. For example,

z[1] <- 12

changes the value of the first entry in z while leaving all the rest alone, and

z[c(1,3,5)] <- c(22,33,44)

changes the 1st, 3rd, and 5th values.

Elements in a named vector can be accessed and modified by name as well as by position. Thus

w

## a b c

## 4.0 5.5 8.8

w["a"]

## a

## 4

w[c("c","b")]

## c b

## 8.8 5.5

w["b"] <- 0

## a b c

## 4.0 0.0 8.8

You may be wondering if vectors in R are row vectors or column vectors (if you don’t know what those
are, don’t worry). The answer is “both and neither”. Vectors are printed out as row vectors, but if you
use a vector in an operation that succeeds or fails depending on the vector’s orientation, R will assume
that you want the operation to succeed and will proceed as if the vector has the necessary orientation.
For example, R will let you add a vector of length 5 to a 5×1 matrix or to a 1×5 matrix, in either case
yielding a matrix of the same dimensions.

2.3.1.4 Logical Operations

Some operations return a logical value (i.e., TRUE or FALSE). For example, try:

a <- 1; b <- 3;

c <- a < b

d <- (a > b)

c; d

## [1] TRUE

## [1] FALSE

The parentheses around a > b above are optional but do make the code easier to read. Be careful
when you make comparisons with negative values: a<-1 may surprise you by setting a to 1, because
<- is the assignment operator in R. Use a < -1 or a < (-1) to make this comparison.

Some comparison operators in R:

R code Comparison

x<y x strictly less than y

x>y x strictly greater than y

x <= y x less than or equal to y

x >= y x greater than or equal to y

x == y x equal to y

x != y x not equal to y

identical(x,y) x completely identical to y

all.equal(x,y) x pretty much equal to y

When we compare two vectors or matrices, comparisons are done element-by-element (and the
recycling rule applies). For example,

x <- 1:5; b <- (x<=3); b

## [1] TRUE TRUE TRUE FALSE FALSE

So if x and y are vectors, then (x==y) will return a vector of values giving the element-by-element
comparisons. If you want to know whether x and y are identical vectors, use identical(x,y) or
all.equal(x,y). You can use ?Logic to read more about logical operations. Note the difference
between = and ==. Can you figure out what happened in the following cautionary tale?

a=1:3

b=2:4
a==b

## [1] FALSE FALSE FALSE

a=b

a==b

## [1] TRUE TRUE TRUE

R can also do arithmetic on logical values, treating TRUE as 1 and FALSE as 0. So sum(x<3) returns the
value 2, telling us that two entries of x satisfied the condition (x<3). This is useful for counting the
number of elements of a vector that satisfy a given condition.

More complicated conditions are built by using logical operators to combine comparisons. The most
important of these are tabulated here.

Logical operators:

Operator Meaning

! logical NOT

& logical AND, elementwise

&& logical AND, first element only

| logical OR, elementwise

|| logical OR, first element only

xor(x,y) exclusive OR, elementwise

For example, try

a <- c(1,2,3,4)

b <- c(1,1,5,5)

(a<b) | (a>3)

## [1] FALSE FALSE TRUE TRUE

(a<b) || (a>3)

## [1] FALSE

and make sure you understand what happened.

The two forms of logical OR (| and ||) are inclusive, meaning that x|y is true if either x or y or both are
true. Use xor when exclusive OR is needed. The two forms of AND and OR differ in how they handle
vectors. The shorter ones (|, &) do element-by-element comparisons; the longer ones (||, &&) look
only at the first element in each vector.
2.3.1.5 More on Vector Indexing

We can also use logical vectors (lists of TRUE and FALSE values) to pick elements out of vectors. This
is useful, for example, in subsetting data.

As a simple example, we might want to focus on just the low-light values of rmax in the Chlorella
example (here Light is a vector of light levels and rmax the corresponding maximum growth rates; the
same data appear as a data frame in section 2.3.6):

lowLight <- Light[Light<50]

lowLightrmax <- rmax[Light<50]

lowLight

## [1] 20 20 20 20 21 24 44

lowLightrmax

## [1] 1.73 1.65 2.02 1.89 2.61 1.36 2.37

What is really happening here (think about it for a minute) is that Light<50 generates a logical vector
the same length as Light (TRUE TRUE TRUE ...), which is then used to select the appropriate values.

If you want the positions at which Light is lower than 50, you can use which: which(Light<50). If you
wanted the position at which the maximum value of Light occurs, you could say
which(Light==max(Light)) or which.max(Light). Note that, if Light has several elements that are
maximal, the first will return the positions of them all, while the second will return the position only
of the first one.

2.3.2 Matrices

2.3.2.1 Creating Matrices

A matrix is a two-dimensional array of items. Most straightforwardly, we can create a matrix by
specifying the number of rows and columns, and specifying the entries. For example

X <- matrix(c(1,2,3,4,5,6),nrow=2,ncol=3); X

## [,1] [,2] [,3]

## [1,] 1 3 5

## [2,] 2 4 6

takes the values 1 to 6 and reshapes them into a 2 by 3 matrix. Note that the values in the data vector
are put into the matrix column-wise by default. You can change this by using the optional argument
byrow:

A <- matrix(1:9,nrow=3,ncol=3,byrow=TRUE); A

## [,1] [,2] [,3]

## [1,] 1 2 3

## [2,] 4 5 6

## [3,] 7 8 9
R will re-cycle through entries in the data vector, if need be, to fill a matrix of the specified size. So for
example

matrix(1,nrow=50,ncol=50)

creates a 50×50 matrix, every entry of which is 1.

Another useful function for creating matrices is diag. diag(v,n) creates an n×n matrix with data vector
v on its diagonal. So for example diag(1,5) creates the 5×5 identity matrix, which has 1s on the diagonal
and 0 everywhere else.

Finally, one can use the data.entry function. This function can only edit existing matrices, but for
example (try this now!)

A <- matrix(0,3,4)

data.entry(A)

will create A as a 3×4 matrix, and then call up a spreadsheet-like interface in which the values can be
edited directly. You can further modify A with the same primitive interface, using fix.

Some important functions for creating and working with matrices:

R code Purpose

matrix(v,nrow=m,ncol=n) m×n matrix using the values in v

t(A) transpose (exchange rows and columns) of matrix A

dim(X) dimensions of matrix X. dim(X)[1]=# rows, dim(X)[2]=# columns

data.entry(A) call up a spreadsheet-like interface to edit the values in A

diag(v,n) diagonal n×n matrix with v on diagonal, 0 elsewhere (v is 1 by default,
so diag(n) gives an n×n identity matrix)

cbind(a,b,c,...) combine compatible objects by attaching them along columns

rbind(a,b,c,...) combine compatible objects by attaching them along rows

as.matrix(x) convert an object of some other type to a matrix, if possible

outer(v,w) “outer product” of vectors v, w: the matrix whose (i,j)-th element
is v[i]*w[j] (see the example after this table)

Many of these functions have additional optional arguments; use the help system for full details.
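
For instance, the outer product of the vectors (1 2 3) and (1 2):

outer(1:3,1:2)

##      [,1] [,2]
## [1,]    1    2
## [2,]    2    4
## [3,]    3    6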

2.3.2.2 cbind and rbind

If their sizes match, vectors can be combined to form matrices, and matrices can be combined with
vectors or matrices to form other matrices. The functions that do this are cbind and rbind.

cbind binds together columns of two objects. One thing it can do is put vectors together to form a
matrix:

C <- cbind(1:3,4:6,5:7); C

## [,1] [,2] [,3]


## [1,] 1 4 5

## [2,] 2 5 6

## [3,] 3 6 7

Remember that R interprets vectors as row or column vectors according to what you’re doing with
them. Here it treats them as column vectors so that columns exist to be bound together. On the other
hand,

D <- rbind(1:3,4:6); D

## [,1] [,2] [,3]

## [1,] 1 2 3

## [2,] 4 5 6

treats them as rows. Now we have two matrices that can be combined.

2.3.2.3 Matrix Indexing

Matrix indexing is like vector indexing except that you have to specify both the row and column, or
range of rows and columns. For example z <- A[2,3] sets z equal to 6, which is the (2nd row, 3rd
column) entry of the matrix A that you recently created, and

A[2,2:3];

## [1] 5 6

B <- A[2:3,1:2]; B

## [,1] [,2]

## [1,] 4 5

## [2,] 7 8

There is an easy shortcut to extract entire rows or columns: leave out the limits, leaving a blank before
or after the comma.

first.row <- A[1,]; first.row

## [1] 1 2 3

second.column <- A[,2]; second.column;

## [1] 2 5 8

As with vectors, indexing also works in reverse for assigning values to matrix entries. For example,

A[1,1] <- 12; A

## [,1] [,2] [,3]

## [1,] 12 2 3

## [2,] 4 5 6

## [3,] 7 8 9
The same can be done with blocks, rows, or columns, for example

A[1,] <- c(2,4,5); A

## [,1] [,2] [,3]

## [1,] 2 4 5

## [2,] 4 5 6

## [3,] 7 8 9

If you use which() on a matrix, R will normally treat the matrix as a vector—so for example which(A==8)
will give the answer 6 (can you see why?). However, which() does have an option that will treat its
argument as a matrix:

which(A>=8,arr.ind=TRUE)

## row col

## [1,] 3 2

## [2,] 3 3

2.3.3 Arrays

The generalization of the matrix to more (or less) than 2 dimensions is the array. In fact, in R, a matrix
is nothing other than a 2-dimensional array. How does R store arrays? In the simplest possible way:
an array is just a vector plus information on the dimensions of the array. Most straightforwardly, we
can create an array from a vector:

X <- array(1:24,dim=c(3,4,2)); X

## , , 1

##

## [,1] [,2] [,3] [,4]

## [1,] 1 4 7 10

## [2,] 2 5 8 11

## [3,] 3 6 9 12

##

## , , 2

##

## [,1] [,2] [,3] [,4]

## [1,] 13 16 19 22

## [2,] 14 17 20 23

## [3,] 15 18 21 24
Note, again, that the arrays are filled in a particular order: the first dimension first, then the second,
and so on. A one-dimensional array is subtly different from a vector:

y <- 1:5; y

## [1] 1 2 3 4 5

z <- array(1:5,dim=5); z

## [1] 1 2 3 4 5

y==z

## [1] TRUE TRUE TRUE TRUE TRUE

identical(y,z)

## [1] FALSE

dim(y); dim(z)

## NULL

## [1] 5
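
Indexing an array works like matrix indexing, with one index per dimension; leaving an index blank extracts a whole slice:

X[2,3,1]

## [1] 8

X[,,2]

##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24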

2.3.4 Factors

For dealing with measurements on the nominal and ordinal scales (Stevens 1946), R provides vectors
of type factor. A factor is a variable that can take one of a finite number of distinct levels. To construct
a factor, we can apply the factor function to a vector of any class:

x <- rep(c(1,2),each=3); factor(x)

## [1] 1 1 1 2 2 2

## Levels: 1 2

trochee <- c("jetpack","ferret","pizza","lawyer")

trochee <- factor(trochee); trochee

## [1] jetpack ferret pizza lawyer

## Levels: ferret jetpack lawyer pizza

By default, factor sets the levels to the unique set of values taken by the vector. To modify that
behaviour, there is the levels argument:

factor(trochee,levels=c("ferret","pizza","cowboy","scrapple"))

## [1] <NA> ferret pizza <NA>

## Levels: ferret pizza cowboy scrapple

Note that the order of the levels is arbitrary, in keeping with the fact that the only operation
permissible on the nominal scale is the test for equality. In particular, the factors created with the
factor command are un-ordered: there is no sense in which we can ask whether, e.g., ferret < cowboy.
To represent variables measured on the ordinal scale, R provides ordered factors, constructed via the
ordered function. An ordered factor is just like an un-ordered factor except that the order of the levels
matters:

x <- ordered(sample(x=letters,size=22,replace=TRUE)); x

## [1] q a v u d o j p w z e k l t g f g t r g f c

## 18 Levels: a < c < d < e < f < g < j < k < l < o < p < q < r < t < ... < z

Here, we’ve relied on ordered’s default behaviour, which is to put the levels in alphabetical order. It’s
typically safer to specify explicitly what order we want:

x <- ordered(x,levels=rev(letters))

x[1:5] < x[18:22]

## [1] FALSE FALSE TRUE TRUE TRUE

2.3.5 Lists

While vectors and matrices may seem familiar, lists may be new to you. Vectors and matrices have to
contain elements that are all the same type: lists in R can contain anything—vectors, matrices, other
lists, arbitrary objects. Indexing is a little different too: use [[ ]] to extract an element of a list by
number or name or $ to extract an element by name (only). Given a list like this:

L <- list(A=x,B=trochee,C=c("a","b","c"))

Then L$A, L[["A"]], and L[[1]] will each return the first element of the list. To extract a sublist, use the
ordinary single square brackets [ ]:

L[c("B","C")]

## $B

## [1] jetpack ferret pizza lawyer

## Levels: ferret jetpack lawyer pizza

##

## $C

## [1] "a" "b" "c"

2.3.6 Data Frames

Vectors, matrices, and lists of one sort or another are found in just about every programming
language. The data frame structure is (or was last time I checked) unique to R, and is central to many
of R’s useful data-analysis features. It’s very natural to want to store data in vectors and matrices.
Thus, in the example above, we stored measurements of two variables (rmax and light level) in vectors.
This was done in such a way that the observations of the first replicate were stored in the first element
of each vector, the second in the second, and so on. To explicitly bind together observations
corresponding to the same replicate, we might join the two vectors into a matrix using cbind. In the
resulting data structure, each row would correspond to an observation, each column to a variable.
This is possible, however, only because both variables are of the same type: they’re both numerical.
More commonly, a data set is made up of several different kinds of variables. The data frame is R’s
solution to this problem.

Data frames are a hybrid of lists and vectors. Internally, they are a list of vectors which can be of
different types but must all be the same length. However, they behave somewhat like matrices, in
that you can do most things to them that you can do with matrices. You can index them either the
way you would index a list, using [[ ]] or $—where each variable is a different item in the list—or the
way you would index a matrix.

You can turn a data frame into a matrix (using as.matrix(), but only if all variables are of the same class)
and a matrix into a data frame (using as.data.frame()).
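
You can also construct a data frame directly from vectors of equal length with the data.frame function. Here is a minimal sketch with made-up values:

df <- data.frame(light=c(20,30), rmax=c(1.73,2.02))

df

##   light rmax
## 1    20 1.73
## 2    30 2.02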

When data are read into R from an external file using one of the read.xxx commands (read.csv,
read.table, read.xls, etc.), the object that is created is a data frame.

data.url <- "https://xyz.com/Growth.csv"

dat <- read.csv(data.url,comment.char='#')

dat

## light rmax

## 1 20 1.73

## 2 20 1.65

## 3 20 2.02

## 4 20 1.89

## 5 21 2.61

## 6 24 1.36

## 7 44 2.37

## 8 60 2.08

## 9 90 2.69

## 10 94 2.32

## 11 101 3.67
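
Columns of the resulting data frame can be extracted by name with $ and used like ordinary vectors:

dat$rmax[1:3]

## [1] 1.73 1.65 2.02

mean(dat$rmax)

## [1] 2.217273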

Check your Progress 1

Fill in the Blanks.

1. The ______ data type is used to represent categorical data.


2. In R, _______ is used to represent ‘missing' data.
3. ______ indexing is like vector indexing except that you have to specify both the row
and column.

Activity 1

1. What happens when the length of the longer vector is not a multiple of that of the
shorter?
2. Use seq to create the vector v=(1 5 9 13), and to create a vector going from 1 to 5 in
increments of 0.2.
3. Write a one-line command to extract a vector consisting of the second, first, and third
elements of z in that order.
4. runif(n) is a function that generates a vector of n random, uniformly distributed
numbers between 0 and 1. Create a vector of 20 numbers, then find the subset of those
numbers that is less than the mean.

Summary
 There are different data types in R; in this unit we have discussed:
• Double, logical, character, integer, complex
• Vectors, lists
• Factors, arrays
• Data frames and matrices
 All R objects can have attributes that help to describe what is in the object. Perhaps the most
useful attribute is names, such as column and row names in a data frame, or simply names in
a vector or list.

Keywords
 Data Type: It is an attribute of data which tells the compiler or interpreter how the
programmer intends to use the data.
 Objects: It can be a variable, a data structure, a function, or a method, and its value in memory
referenced by an identifier.

Self-Assessment Questions

1. What do the %% and %/% operators do?


2. What happens when we set the dimension attribute on a vector? For example:

x <- seq(1,27)

dim(x) <- c(3,9)

is.array(x)

is.matrix(x)

3. List the data types in R.


4. Explain data frames with example.

Answers to Check your Progress

Check your Progress 1

Fill in the Blanks.

1. The factor data type is used to represent categorical data.


2. In R, NA is used to represent ‘missing' data.
3. Matrix indexing is like vector indexing except that you have to specify both the row and
column.
Suggested Reading
1. The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff.
2. https://kingaa.github.io/R_Tutorial/#data-structures-in-r.
3. https://d1b10bmlvqabco.cloudfront.net/attach/ighbo26t3ua52t/igp9099yy4v10/igz7vp4w5
su9/OReilly_HandsOn_Programming_with_R_2014.pdf.
4. The R Book by Michael J. Crawley, Imperial College London at Silwood Park, UK.
5. R Programming for Data Science by Roger D. Peng.
6. An introduction to R by Longhow Lam.
Unit 3
Loops and Functions in R
Structure:

3.1 Introduction

3.2 Loops

3.2.1 For Loops

3.2.2 While Loops

3.2.3 Repeat Loops

3.2.4 If-else

3.2.5 Switch

3.3 Functions

3.3.1 Function Scope

3.3.2 Nested Functions and Environments

3.4 The Apply Family of Functions

3.4.1 List Apply: Lapply

3.4.2 Sloppy List Apply: Sapply

3.4.3 Multiple-List Apply: Mapply

3.4.4 Array Apply: Apply

3.4.5 Table Apply: Tapply

3.4.6 sapply with Expected Result: vapply

3.5 Vectorized Functions Vs Loops

Summary

Keywords

Self-Assessment Questions

Answers To Check Your Progress

Suggested Reading

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) as requested by the work’s creator or
licensees. This license is available at https://creativecommons.org/licenses/by-nc-sa/4.0/ .
Objectives

After going through this unit, you will be able to:

 Understand the different types of loops


 Compare loops with vectorised operations
 Use the apply family of functions

3.1 INTRODUCTION

Very frequently, a computation involves iterating some procedure across a range of cases, and every
computer language we have ever come across has one or more facilities for producing such loops. R
is no exception, though judging by their code, many R programmers look down their noses at loops.
In R, we have multiple options when repeating calculations: vectorised operations, for loops, and apply
functions. In this unit we are going to look at the looping constructs and functions available in R.

3.2 LOOPS

3.2.1 For Loops

A for loop is used to apply the same function calls to a collection of objects. Execute the following code
in R.

phi <- 1

for (k in 1:100) {

phi <- 1+1/phi

print(c(k,phi))

}
What does it do? Sequentially, for each value of k between 1 and 100, phi is modified. More
specifically, at the beginning of the for loop, a vector containing all the integers from 1 to 100 in order
is created. Then, k is set to the first element in that vector, i.e., 1. Then the R expression from the { to
the } is evaluated. When that expression has been evaluated, k is set to the next value in the vector.
The process is repeated until, at the last evaluation, k has value 100.

As an aside, note that the final value of phi is the Golden Ratio, 1.618034. As an example of a situation
where a loop of some sort is really needed, suppose we wish to iterate the Beverton-Holt map (one
of the simplest discrete-time models of population growth),

N(t+1) = a*N(t)/(1 + b*N(t)),

where a and b are constants, for 200 generations. We simply have no option but to do the calculation
one step at a time. Here is R code that does this:

a <- 1.1
b <- 0.001
T <- seq(from=1,to=200,by=1)
N <- numeric(length(T))
n <- 2
for (t in T) {
n <- a*n/(1+b*n)
N[t] <- n
}
Spend some time to make sure you understand what happens at each line of the above. We can plot
the population sizes N(t) through time via plot(T,N).

An alternative way to do the above might be something like

N <- numeric(length(T))
for (t in 1:length(T)) {
n <- a*n/(1+b*n)
N[t] <- n
}
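A slightly safer idiom replaces 1:length(T) with seq_along(T), which also behaves correctly when the vector has length zero:

N <- numeric(length(T))
for (t in seq_along(T)) {
n <- a*n/(1+b*n)
N[t] <- n
}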
3.2.2 While Loops

A second looping construct is the while loop. Using while, we can compute the Golden Ratio as before:

phi <- 20
k <- 1
while (k <= 100) {
phi <- 1+1/phi
print(c(k,phi))
k <- k+1
}
Here first, phi and k are initialised. Then the while loop is started. At each iteration of the loop, phi is
modified, and intermediate results printed, as before. In addition, k is incremented. The while loop
continues to iterate until the condition k <= 100 is no longer TRUE, at which point, the while loop
terminates.

Note that here we’ve chosen a large number (100) of iterations. Perhaps we could get by with fewer.
If we wanted to terminate the iterations as soon as the value of phi stopped changing, we could do:
phi <- 20
conv <- FALSE
while (!conv) {
phi.new <- 1+1/phi
conv <- phi==phi.new
phi <- phi.new
}
Another way to accomplish this would be to use break to stop the iteration when a condition is met.
For example,

phi <- 20
while (TRUE) {
phi.new <- 1+1/phi
if (phi==phi.new) break
phi <- phi.new
}
While this while loop is equivalent to the one before, it does have the drawback that, if the break
condition is never met, the loop will go on indefinitely. An alternative that avoids this is to use a for
loop with a large (but finite) number of iterations, together with break:

phi <- 3
for (k in seq_len(1000)) {
phi.new <- 1+1/phi
if (phi==phi.new) break
phi <- phi.new
}
3.2.3 Repeat Loops

A third looping construct in R involves the repeat keyword. For example,

phi <- 12
repeat {
phi.new <- 1/(1+phi)
if (phi==phi.new) break
phi <- phi.new
}
In addition, R provides the next keyword, which, like break, is used in the body of a looping construct.
Rather than terminating the iteration, however, it aborts the current iteration and leads to the
immediate execution of the next iteration.
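For example, the following loop (a minimal illustration) uses next to skip the even values of k, printing only the odd ones:

for (k in 1:6) {
if (k %% 2 == 0) next # skip even values of k
print(k)
}
## [1] 1
## [1] 3
## [1] 5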
3.2.4 If-else

The if-else combination is probably the most commonly used control structure in R (or perhaps any
language). This structure allows you to test a condition and act on it depending on whether it’s true
or false.

if (<condition>) {
## do something
}
## Continue with rest of code

The above code does nothing if the condition is false. If we have an action which we want to execute
when the condition is false, then we need an else clause.

if (<condition>) {
## do something
} else {
## do something else
}

We can have a series of tests by following the initial if with any number of else ifs.

if (<condition1>) {
## do something
} else if (<condition2>) {
## do something different
} else {
## do something else
}

Here is an example of a valid if/else structure.

## Generate a uniform random number

x <- runif(1, 0, 10)


if(x > 3) {
y <- 10
} else {
y <- 0
}
The value of y is set depending on whether x > 3 or not. This expression can also be written a different,
but equivalent, way in R.

y <- if(x > 3) {
10
} else {
0
}

Neither way of writing this expression is more correct than the other. Which one we use will depend
on our preference and perhaps those of the team we may be working with.

Of course, the else clause is not necessary. We could have a series of if clauses that always get
executed if their respective conditions are true.

if(<condition1>) {
## do something
}
if(<condition2>) {
## do something else
}

3.2.5 Switch

The switch function has the following general form.

switch(object,
"value1" = {expr1},
"value2" = {expr2},
"value3" = {expr3},
{other expressions}
)

If object has value value1 then expr1 is executed, if it has value2 then expr2 is executed, and so on. If
object matches none of the values, then {other expressions} is executed. The {other expressions} block does
not have to be present; switch will return NULL in case object does not match any value. An
expression such as expr1 in the above construction can consist of multiple statements; the statements should
be separated by a ; or placed on separate lines, and surrounded by curly brackets.

Here is a simple example choosing between two calculation methods (my.mlmethod and my.rmlmethod stand for the actual estimation routines):

mycalc <- function(x, method="ml"){
switch(method,
"ml" = { my.mlmethod(x) },
"rml" = { my.rmlmethod(x) }
)
}

3.3 FUNCTIONS

An extremely useful feature in R is the ability to write arbitrary functions. A function, in this context,
is an algorithm that performs a specific computation that depends on inputs (the function’s
arguments) and produces some output (the function’s value) and/or has some side effects. Let’s see
how this is done.

Here is a function that squares a number.

sq <- function (x) x^2

The syntax is function (arglist) expr. The one argument in this case is x. When a particular value of x is
supplied, R performs the squaring operation. The function then returns the value x^2:

sq(3); sq(9); sq(-2);


## [1] 9
## [1] 81
## [1] 4
Here is a function with two arguments and a more complex body, as we call the expression that is
evaluated when the function is called.

f <- function (x, y = 3) {


a <- sq(x)
a+y
}
Here, the body is the R expression from { to }. Unless the return keyword is used elsewhere in the
function body, the value returned is always the last expression evaluated. Thus:

f(3,0); f(2,2); f(3);

## [1] 9

## [1] 6

## [1] 12

Note that in the last case, only one argument was supplied. In this case, y assumed its default value,
3.

Note that functions need not be assigned to symbols; they can be anonymous:

function (x) x^5

## function (x) x^5

(function (x) x^5)(2)

## [1] 32

A function can also have side effects, e.g.,

hat <- "hat"


hattrick <- function (y) {
hat <<- "rabbit"
2*y
}
hat; hattrick(5); hat
## [1] "hat"
## [1] 10
## [1] "rabbit"
However, good programming practice dictates that a function should never produce unintentional side
effects.

If we want the function not to automatically print, we can wrap the return value in invisible():

hattrick <- function (y) {


hat <<- "rabbit"
invisible(2*y)
}
hattrick(5)
print(hattrick(5))
## [1] 10
A function in R is defined by three components:

1. its formal parameters, i.e., its argument list,


2. its body, and
3. its environment, i.e., the context in which the function was defined.

R provides simple functions to interrogate these function components:

formals(hattrick)

## $y

body(hattrick)

## {

## hat <<- "rabbit"

## invisible(2 * y)

## }

environment(hattrick)

## <environment: R_GlobalEnv>

3.3.1 Function Scope

As noted above, a paramount consideration in the implementation of functions in any programming


language is that unintentional side effects should never occur. In particular, we should be free to write
a function that creates temporary variables as an aid to its computations, and be able to rest assured
that no variables we create temporarily will interfere with any other variables we have defined
anywhere else. To accomplish this, R has a specific set of scoping rules.

Consider the function

f <- function (x) {
y <- 2*x
print(x)
print(y)
print(z)
}

In this function’s body, x is a formal parameter, y is a local variable, and z is a free, or unbound variable.
When f is evaluated, each of these variables must be bound to some value. In R, the free variable
bindings are resolved—each time the function is evaluated—by first looking in the environment where
the function was created. This is called lexical scope. Thus, if we execute

f(3)

we get an error, because no object named z can be found. If, however, we do

z <- 10
f(3)
## [1] 3
## [1] 6
## [1] 10
we don’t get an error, because z is defined in the environment, <environment: R_GlobalEnv>, of f.
Similarly, when we do

z <- 13
g <- function (x) {
2*x+z
}
f <- function (x) {
z <- -100
g(x)
}
f(5)
## [1] 23
The relevant value of z is the one in the environment where g was defined, not the one in the
environment wherein it is called.

3.3.2 Nested Functions and Environments

In each of the following examples, make sure you understand exactly what has happened.

Consider this:

y <- 11
f <- function (x) {
y <- 2*x
y+x
}
f(1); y
## [1] 3
## [1] 11
As mentioned above, each function is associated with an environment: the environment within which
it was defined. When a function is evaluated, a new temporary environment is created, within which
the function’s calculations are performed. Every new environment has a parent, the environment
wherein it was created. The parent of this new environment is the function’s environment. To see this,
try

f <- function () {

g <- function () {
h <- function () {
cat("inside function h:\n")
cat("current env: ")
print(environment())
cat("parent env: ")
print(parent.frame(1))
cat("grandparent env: ")
print(parent.frame(2))
cat("great-grandparent env: ")
print(parent.frame(3))
invisible(NULL)
}
cat("inside function g:\n")
cat("environment of h: ")
print(environment(h))
cat("current env: ")
print(environment())
cat("parent env: ")
print(parent.frame(1))
cat("grandparent env: ")
print(parent.frame(2))
h()
}
cat("inside function f:\n")
cat("environment of g: ")
print(environment(g))
cat("current env: ")
print(environment())
cat("parent env: ")
print(parent.frame(1))
g()
}
cat("environment of f: "); print(environment(f))
cat("global env: "); print(environment())
f()

## environment of f: <environment: R_GlobalEnv>


## global env: <environment: R_GlobalEnv>
## inside function f:
## environment of g: <environment: 0x4e9dde0>
## current env: <environment: 0x4e9dde0>
## parent env: <environment: R_GlobalEnv>
## inside function g:
## environment of h: <environment: 0x5a802f0>
## current env: <environment: 0x5a802f0>
## parent env: <environment: 0x4e9dde0>
## grandparent env: <environment: R_GlobalEnv>
## inside function h:
## current env: <environment: 0x5a860b8>
## parent env: <environment: 0x5a802f0>
## grandparent env: <environment: 0x4e9dde0>
## great-grandparent env: <environment: R_GlobalEnv>
Each variable referenced in the function’s body is bound, first, to a formal argument if possible. If a
local variable of that name has previously been created (via one of the assignment operators <-, ->, or
=), this is the variable that is affected by any subsequent assignments. If the variable is neither a formal
parameter nor a local variable, then the parent environment of the function is searched for that
variable. If the variable has not been found in the parent environment, then the grand-parent
environment is searched, and so on.

If the assignment operators <<- or ->> are used, a more extensive search for the referenced assignee
is made. If the variable does not exist in the local environment, the parent environment is searched.
If it does not exist in the parent environment, then the grand-parent environment is searched, and so
on. Finally, if the variable cannot be found anywhere along the lineage of environments, a new global
variable is created, with the assigned value.
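A small, self-contained illustration of this search (counter and increment are hypothetical names used only for this sketch):

counter <- 0
increment <- function () {
counter <<- counter + 1 # not local: found in the global environment, so modified there
}
increment(); increment()
counter
## [1] 2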

3.4 THE APPLY FAMILY OF FUNCTIONS

As mentioned above, there are circumstances under which looping constructs are really necessary.
Very often, however, we wish to perform some operation across all the elements of a vector, array,
or dataset. In such cases, it is faster and more elegant (to the R aficionado's eye) to use the apply family
of functions.

3.4.1 List Apply: lapply

lapply applies a function to each element of a list or vector, returning a list.

x <- list("teenage","mutant","ninja","turtle",
"hamster","plumber","pickle","baby")
lapply(x,nchar)

## [[1]]
## [1] 7
##
## [[2]]
## [1] 6
##
## [[3]]
## [1] 5
##
## [[4]]
## [1] 6
##
## [[5]]
## [1] 7
##
## [[6]]
## [1] 7
##
## [[7]]
## [1] 6
##
## [[8]]
## [1] 4
y <- c("teenage","mutant","ninja","turtle",
"hamster","plumber","pickle","baby")
lapply(y,nchar)

## [[1]]
## [1] 7
##
## [[2]]
## [1] 6
##
## [[3]]
## [1] 5
##
## [[4]]
## [1] 6
##
## [[5]]
## [1] 7
##
## [[6]]
## [1] 7
##
## [[7]]
## [1] 6
##
## [[8]]
## [1] 4

3.4.2 Sloppy List Apply: sapply

sapply isn’t content to always return a list: it attempts to simplify the results into a non-list vector if
possible.

x <- list("pizza","monster","jigsaw","puddle",
"hamster","plumber","pickle","baby")
sapply(x,nchar)
## [1] 5 7 6 6 7 7 6 4

y <- c("pizza","monster","jigsaw","puddle",
"hamster","plumber","pickle","baby")
sapply(y,nchar)
## pizza monster jigsaw puddle hamster plumber pickle baby
## 5 7 6 6 7 7 6 4

3.4.3 Multiple-list Apply: mapply

mapply is a multiple-argument version of sapply:

x <- c("pizza","monster","jigsaw","puddle")

y <- c("cowboy","barbie","slumber","party")

mapply(paste,x,y,sep="/")

## pizza monster jigsaw puddle

## "pizza/cowboy" "monster/barbie" "jigsaw/slumber" "puddle/party"

As usual, the recycling rule applies:

mapply(paste,x,y[2:3])

## pizza monster jigsaw puddle

## "pizza barbie" "monster slumber" "jigsaw barbie" "puddle slumber"

mapply(paste,x[c(1,3)],y)

## pizza jigsaw <NA> <NA>

## "pizza cowboy" "jigsaw barbie" "pizza slumber" "jigsaw party"

3.4.4 Array Apply: apply

apply is very powerful and a bit more complex. It allows an arbitrary function to be applied to each slice
of an array, where the slices can be defined in all possible ways. Let's create a matrix:

A <- array(data=seq_len(15),dim=c(3,5)); A

## [,1] [,2] [,3] [,4] [,5]

## [1,] 1 4 7 10 13

## [2,] 2 5 8 11 14

## [3,] 3 6 9 12 15

To apply an operation to each row, we marginalize over the first dimension (rows). For example, to
sum the rows, we’d do
apply(A,1,sum)

## [1] 35 40 45

To sum the columns (the second dimension), we’d do

apply(A,2,sum)

## [1] 6 15 24 33 42

Now suppose we have a 3-dimensional array:

A <- array(data=seq_len(30),dim=c(3,5,2)); A

## , , 1

##

## [,1] [,2] [,3] [,4] [,5]

## [1,] 1 4 7 10 13

## [2,] 2 5 8 11 14

## [3,] 3 6 9 12 15

##

## , , 2

##

## [,1] [,2] [,3] [,4] [,5]

## [1,] 16 19 22 25 28

## [2,] 17 20 23 26 29

## [3,] 18 21 24 27 30

To sum the rows within each slice, we’d do

apply(A,c(1,3),sum)

## [,1] [,2]

## [1,] 35 110

## [2,] 40 115

## [3,] 45 120

while to sum the slices, we’d do

apply(A,3,sum)

## [1] 120 345

Of course, we can apply an anonymous function wherever we apply a named function:

apply(A,c(2,3),function (x) sd(x)/sqrt(length(x)))


## [,1] [,2]

## [1,] 0.5773503 0.5773503

## [2,] 0.5773503 0.5773503

## [3,] 0.5773503 0.5773503

## [4,] 0.5773503 0.5773503

## [5,] 0.5773503 0.5773503

Additional arguments are passed to the function:

apply(A,c(1,2),function (x, y) sum(x>y),y=8)

## [,1] [,2] [,3] [,4] [,5]

## [1,] 1 1 1 2 2

## [2,] 1 1 1 2 2

## [3,] 1 1 2 2 2

apply(A,c(1,2),function (x, y) sum(x>y),y=-1)

## [,1] [,2] [,3] [,4] [,5]

## [1,] 2 2 2 2 2

## [2,] 2 2 2 2 2

## [3,] 2 2 2 2 2

3.4.5 Table Apply: tapply

tapply is, in a way, an extension of table. The syntax is tapply(X,INDEX,FUN,...), where X is a vector,
INDEX is a list of one or more factors, each the same length as X, and FUN is a function. The vector X
will be split into subvectors according to INDEX, and FUN will be applied to each of the subvectors. By
default, the result is simplified into an array if possible. Some examples:

x <- seq(1,30,by=1)

b <- rep(letters[1:10],times=3)

data.frame(x,b)

## x b

## 1 1 a

## 2 2 b

## 3 3 c

## 4 4 d

## 5 5 e

## 6 6 f

## 7 7 g

## 8 8 h
## 9 9 i

## 10 10 j

## 11 11 a

## 12 12 b

## 13 13 c

## 14 14 d

## 15 15 e

## 16 16 f

## 17 17 g

## 18 18 h

## 19 19 i

## 20 20 j

## 21 21 a

## 22 22 b

## 23 23 c

## 24 24 d

## 25 25 e

## 26 26 f

## 27 27 g

## 28 28 h

## 29 29 i

## 30 30 j

tapply(x,b,sum)

## a b c d e f g h i j

## 33 36 39 42 45 48 51 54 57 60

b <- rep(letters[1:10],each=3)

data.frame(x,b)

## x b

## 1 1 a

## 2 2 a

## 3 3 a

## 4 4 b
## 5 5 b

## 6 6 b

## 7 7 c

## 8 8 c

## 9 9 c

## 10 10 d

## 11 11 d

## 12 12 d

## 13 13 e

## 14 14 e

## 15 15 e

## 16 16 f

## 17 17 f

## 18 18 f

## 19 19 g

## 20 20 g

## 21 21 g

## 22 22 h

## 23 23 h

## 24 24 h

## 25 25 i

## 26 26 i

## 27 27 i

## 28 28 j

## 29 29 j

## 30 30 j

tapply(x,b,sum)

## a b c d e f g h i j

## 6 15 24 33 42 51 60 69 78 87

datafile <- "seed.dat"

seeds <- read.table(datafile,header=TRUE,


colClasses=c(station='factor',dist='factor',date='Date'))

x <- subset(seeds,available>0)

with(x, tapply(tcum,list(dist,station),max,na.rm=TRUE))

## 1 10 100 101 102 103 104 105 106 107 108 109 11 110 111 112 113 114

## 10 24 248 122 248 10 17 39 17 46 10 3 17 NA 10 10 17 248 248

## 25 18 123 11 249 249 11 11 249 81 11 123 249 4 25 249 47 249 249

## 115 116 117 118 119 12 120 121 122 123 124 125 126 127 128 129 13 130

## 10 67 248 248 24 10 248 17 248 248 3 248 248 10 67 17 248 248 248

## 25 40 68 11 32 18 249 32 11 249 68 249 249 18 NA 18 249 249 18

## 131 132 133 134 135 136 137 138 139 14 140 141 142 143 144 145 146 147

## 10 3 248 248 10 10 3 248 248 3 24 248 248 248 248 248 248 248 24

## 25 18 40 54 18 4 40 61 249 4 61 249 68 11 NA 4 249 209 4

## 148 149 15 150 151 152 153 154 155 156 157 158 159 16 160 17 18 19

## 10 248 248 3 248 39 248 248 248 NA 53 248 248 53 248 NA 248 136 24

## 25 249 81 11 249 249 249 249 249 249 123 249 249 32 25 249 4 249 123

## 2 20 21 22 23 24 25 26 27 28 29 3 30 31 32 33 34 35 36 37

## 10 248 248 248 24 24 248 24 248 53 31 80 31 NA NA 248 17 24 248 39 17

## 25 47 54 249 61 40 249 249 249 4 249 249 11 4 NA 18 4 4 159 4 11

## 38 39 4 40 41 42 43 44 45 46 47 48 49 5 50 51 52 53 54 55 56

## 10 3 10 31 248 122 248 NA 3 17 NA NA NA NA 248 3 248 3 10 248 3 3

## 25 NA 249 249 4 NA 123 NA 4 249 11 18 4 11 249 40 18 11 11 249 4 18

## 57 58 59 6 60 61 62 63 64 65 66 67 68 69 7 70 71 72 73 74 75

## 10 24 3 NA 10 3 3 3 3 3 248 248 NA 10 17 10 3 NA 10 248 248 NA

## 25 11 159 25 11 11 209 4 4 11 249 11 40 11 61 25 4 4 4 249 159 32

## 76 77 78 79 8 80 81 82 83 84 85 86 87 88 89 9 90 91 92 93 94 95

## 10 248 3 10 NA NA 3 248 80 NA 241 122 24 10 10 24 115 17 NA 17 10 3 3

## 25 249 47 18 4 249 11 NA 11 11 11 249 11 NA 11 11 123 11 11 25 11 4 11

## 96 97 98 99

## 10 10 10 10 17

## 25 18 18 11 4

3.4.6 sapply with Expected Result: vapply

When we could use sapply and we know exactly what the size and class of the value of the function
will be, it is sometimes faster to use vapply. The syntax is like that of sapply:
vapply(X,FUN,FUN.VALUE,...), where X and FUN are as in sapply, but we specify the size and class of
the value of FUN via the FUN.VALUE argument. For example, suppose we define a function that, given
a number between 1 and 26 will return the corresponding letter of the alphabet:

alph <- function (x) {
stopifnot(x >= 1 && x <= 26)
LETTERS[as.integer(x)]
}

This function will return a vector of length 1 and class character. To apply it to a randomly sampled
set of integers, we might do

x <- sample(1:26,50,replace=TRUE)

y <- vapply(x,alph,character(1))

paste(y,collapse="")

## [1] "HZYNIJFRXXOBJYDTNUBNBAWHDLHAOOSNBKGHVZTDSFVXVBFMRV"

3.5 VECTORISED FUNCTIONS VS LOOPS

A key difference between R and many other languages is a topic known as vectorization. As Ligges &
Fox (2008) point out, the idea that one should avoid loops wherever possible in R, using instead
vectorised functions like those in the apply family, is quite widespread in some quarters. Consider
summing the elements of a vector: the built-in sum is much faster than an equivalent interpreted for loop,
because sum is coded in C to work with a vector of numbers. Many of R's functions
work this way; the loop is hidden from us in C. Learning to use vectorised operations is a key skill in R.

Consider the following loop code that can be vectorised:

x <- runif(n=1e6,min=0,max=2*pi)
y <- numeric(length(x))
for (k in seq_along(x)) {
y[k] <- sin(x[k])
}
To time this, we can wrap the vectorisable parts in a call to system.time:

x <- runif(n=1e6,min=0,max=2*pi)
system.time({
y <- numeric(length(x))
for (k in seq_along(x)) {
y[k] <- sin(x[k])
}
})
## user system elapsed
## 0.116 0.004 0.121
We can compare this with a simple call to sin (which is vectorised):

system.time(z <- sin(x))

## user system elapsed

## 0.032 0.000 0.034

Clearly, calling sin directly is much faster. What about using sapply?

The above example is very simple in that there is a built-in function (sin in this case) which is capable
of the fully vectorized computation. In such a case, it is clearly preferable to use it. Frequently,
however, no such built-in function exists, i.e., we have a custom function of our own we want to apply
to a set of data. Let's compare the relative speeds of loops and sapply in this case.

x <- seq.int(from=20,to=1e6,by=10)

f <- function (x) {
(((x+1)*x+1)*x+1)*x+1
}
system.time({
res1 <- numeric(length(x))
for (k in seq_along(x)) {
res1[k] <- f(x[k])
}
})

## user system elapsed

## 0.052 0.004 0.058

system.time(res2 <- sapply(x,f))

## user system elapsed

## 0.064 0.000 0.064

Actually, in this case, f is vectorized automatically.

system.time(f(x))

## user system elapsed

## 0 0 0

Another example: in this case function g is not vectorized.

g <- function (x) {
if ((x[1] > 30) && (x[1] < 5000)) 1 else 0
}
system.time({
res1 <- numeric(length(x))
for (k in seq_along(x)) {
res1[k] <- g(x[k])
}
})

## user system elapsed

## 0.048 0.000 0.049

system.time(res2 <- sapply(x,g))

## user system elapsed

## 0.056 0.000 0.053
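In fact, even when g itself is not vectorised, we can often rewrite the computation in a vectorised form. Here is a sketch using the built-in ifelse, which operates on the whole vector at once and is typically faster than either the loop or sapply:

res3 <- ifelse(x > 30 & x < 5000, 1, 0) # 1 where the condition holds, 0 elsewhere
all.equal(res1, res3)
## [1] TRUE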

Another example: suppose we want to add pairs of numbers contained in two vectors,

a <- 1:10
b <- 1:10
we could loop over the pairs adding each in turn, but that would be very inefficient in R.

Rather than looping directly over the elements with for (i in a), we use the function seq_along to generate
one index for each element that a contains.

res <- numeric(length = length(a))


for (i in seq_along(a)) {
res[i] <- a[i] + b[i]
}
res
## [1] 2 4 6 8 10 12 14 16 18 20

Instead, + is a vectorized function which can operate on entire vectors at once

res2 <- a + b
all.equal(res, res2)
## [1] TRUE
for or apply?

Deciding whether to use for or one of the apply family is largely a matter of personal preference. Using an apply
family function forces us to encapsulate the operation as a function, rather than as separate statements inside a
for loop. for loops are often more natural in some circumstances; for several related operations, a for loop will
avoid having to pass a lot of extra arguments to the function.

Loops in R are slow?

No, they are not! If we follow some golden rules:


 Don’t use a loop when a vectorized alternative exists
 Don’t grow objects (via c, cbind, etc.) during the loop - R has to create a new object and copy
across the information just to add a new element or row/column
 Allocate an object to hold the results and fill it in during the loop

As an example, we’ll write a function, analyze2, that takes a vector of CSV file names and returns the mean
inflammation per day (column) of each file.

analyze2 <- function(filenames) {


for (f in seq_along(filenames)) {
fdata <- read.csv(filenames[f], header = FALSE)
res <- apply(fdata, 2, mean)
if (f == 1) {
out <- res
} else {
# The loop is slowed by this call to cbind that grows the object
out <- cbind(out, res)
}
}
return(out)
}
system.time(avg2 <- analyze2(filenames))
## user system elapsed
## 0.024 0.004 0.029
Note how we add a new column to out at each iteration: this is a cardinal sin of writing a for loop in
R.

Instead, we can create an empty matrix with the right dimensions (rows/columns) to hold the results.
Then we loop over the files, but this time we fill in the f-th column of our results matrix out. This time
there is no copying/growing for R to deal with.

analyze3 <- function(filenames) {


out <- matrix(ncol = length(filenames), nrow = 40) # assuming 40 here from files
for (f in seq_along(filenames)) {
fdata <- read.csv(filenames[f], header = FALSE)
out[, f] <- apply(fdata, 2, mean)
}
return(out)
}
system.time(avg3 <- analyze3(filenames))
## user system elapsed
## 0.028 0.000 0.027
In this simple example there is little difference in the compute time of analyze2 and analyze3. This is
because we are only iterating over 12 files and hence we only incur 12 copy/grow operations. If we
were doing this over more files or the data objects we were growing were larger, the penalty for
copying/growing would be much larger.

Note that apply handles these memory allocation issues for us, but then we have to write the body of the
loop as a function to pass to apply. If possible, use vectorized operations instead of for loops to make
code faster and more concise, and use apply-style functions to operate on the values in a data structure.

The ... Argument

There is a special argument in R known as the ... argument, which indicates a variable number of
arguments that are usually passed on to other functions. The ... argument is often used when
extending another function and you don’t want to copy the entire argument list of the original
function.

For example, a custom plotting function may want to make use of the default plot() function along
with its entire argument list. The function below changes the default for the type argument to the
value type = "l" (the original default was type = "p").

myplot <- function(x, y, type = "l", ...) {


plot(x, y, type = type, ...) ## Pass '...' to 'plot' function
}
Generic functions use ... so that extra arguments can be passed to methods.

mean
function (x, ...)
UseMethod("mean")
<bytecode: 0x7fe7bc5cf988>
<environment: namespace:base>
The ... argument is necessary when the number of arguments passed to the function cannot be known
in advance. This is clear in functions like paste() and cat().

args(paste)
function (..., sep = " ", collapse = NULL)
NULL
args(cat)
function (..., file = "", sep = " ", fill = FALSE, labels = NULL, append = FALSE)
NULL
Because both paste() and cat() print out text to the console by combining multiple character vectors
together, it is impossible for those functions to know in advance how many character vectors will be
passed to the function by the user. So the first argument to either function is ....
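We can use ... in our own functions too. Here is a minimal sketch (count_args is a hypothetical name) that captures the extra arguments with list(...):

count_args <- function(...) {
args <- list(...) # collect all the ... arguments into a list
length(args)
}
count_args("a", "b", "c")
## [1] 3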

Arguments Coming After the ... Argument

One catch with ... is that any arguments that appear after ... on the argument list must be named
explicitly and cannot be partially matched or matched positionally.

Take a look at the arguments to the paste() function.

args(paste)
function (..., sep = " ", collapse = NULL)
NULL

With the paste() function, the arguments sep and collapse must be named explicitly and in full if the
default values are not going to be used.

Here we specify that we want “a” and “b” to be pasted together and separated by a colon.

paste("a", "b", sep = ":")

## [1] "a:b"

If we don’t specify the sep argument in full and attempt to rely on partial matching, we don’t get the
expected result.

paste("a", "b", se = ":")

## [1] "a b :"

Check your Progress 1


Fill in the Blanks.
1. R provides the _____ keyword which aborts the current iteration and leads to the immediate
execution of the next iteration.
2. A function in R is defined by three components: its formal parameters, its body, and
_________.
3. When a function is _____, a new temporary environment is created, within which the
function’s calculations are performed.
4. _________ attempts to simplify the results into a non-list vector if possible.

Activity 1
1. Execute the following code and check the value of y.
y <- 0
f <- function (x) {
2*x+y
}
f(1); y
2. Write a program in R to show the use of … argument.

Summary
 Control structures like if, while, and for allow programmer to control the flow of an R program.
 “apply” functions are more useful.
 Functions can be defined using the function ( ) directive and are assigned to R objects just like
any other R object.
 Functions arguments can be specified by name or by position in the argument list.
 A variable number of arguments can be specified using the special ... argument in a function
definition.

Keywords
 Argument: It is a value that is passed between programs, subroutines or functions.
 Vector: It is a basic data structure in R which contains element of the same type.
Self-Assessment Questions
1. Write a short note on:
a. For loop
b. Repeat loop
2. State the advantages of apply functions over loop.
3. Write a program in R to find the factorial of even numbers.

Answers to Check your Progress


Check your Progress 1
Fill in the Blanks.
1. R provides the next keyword which aborts the current iteration and leads to the immediate
execution of the next iteration.
2. A function in R is defined by three components: its formal parameters, its body, and its
environment.
3. When a function is evaluated, a new temporary environment is created, within which the
function’s calculations are performed.
4. sapply attempts to simplify the results into a non-list vector if possible.

Suggested Reading
1. The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff.
2. https://kingaa.github.io/R_Tutorial/#data-structures-in-r.
3. https://d1b10bmlvqabco.cloudfront.net/attach/ighbo26t3ua52t/igp9099yy4v10/igz7vp4w5
su9/OReilly_HandsOn_Programming_with_R_2014.pdf.
4. The R Book by Michael J. Crawley, Imperial College London at Silwood Park, UK.
5. R Programming for Data Science by Roger D. Peng.
6. An introduction to R by Longhow Lam.
Unit 4
Mathematics in R

Structure:

4.1 Introduction
4.2 Numeric Functions
4.3 Character Functions
4.4 Statistical Functions
Summary
Keywords
Self-Assessment Questions
Answers To Check Your Progress
Suggested Reading

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) as requested by the work’s creator or licensees.
This license is available at https://creativecommons.org/licenses/by-nc-sa/4.0/ .
Objectives
After going through this unit, you will be able to:
 Understand the different numeric functions and its uses
 Explain and use the statistical and probability distribution functions

4.1 INTRODUCTION
A function is a set of statements organized together to perform a specific task. Almost everything in R is done
by using functions. R has a large number of in-built functions. In this unit, we are going to discuss different types
of functions available in R, with their syntax and examples.

4.2 NUMERIC FUNCTIONS


Numeric functions are used to perform operations on numbers or vectors. Let’s discuss numeric functions.

1. abs(x)
Purpose: computes the absolute value of x, where x is a number, vector or array.
Example:
> x <- 99
> abs(x)
[1] 99

2. sqrt(x)
Purpose: computes the (principal) square root of x.
Example:
> x <- 100
> sqrt(x)
[1] 10

3. ceiling(x)
Purpose: rounds up, returning the smallest integer that is not less than x.
Example:
> ceiling(645.956)
[1] 646
> ceiling(-10.28)
[1] -10

4. floor(x)
Purpose: rounds down, returning the largest integer that is not greater than x.
Example:
> floor(645.956)
[1] 645
> floor(-10.28)
[1] -11

5. trunc(x)
Purpose: rounds to the nearest integer in the direction of 0, i.e., truncates the decimal part.
Example:
> trunc(c(1.234, 2.342, -4.562, 5.671, 12.345, -14.567))
[1] 1 2 -4 5 12 -14

6. round(x, digits=n)
Purpose: rounds the values in its first argument to the specified number of decimal places (default 0), where x is a numeric value or vector and digits is an integer indicating the number of decimal places to be used. Negative values are allowed.
Example:
> round(125.2395, digits=0)
[1] 125
> round(c(1.234, 2.342, 4.562, 5.671, 12.345, 14.567), digits=2)
[1] 1.23 2.34 4.56 5.67 12.35 14.57

7. signif(x, digits=n)
Purpose: similar to round, but the digits argument denotes the total number of significant digits, not the digits after the decimal point.
Example:
> signif(125.2395, digits=4)
[1] 125.2

8. cos(x), sin(x), tan(x)
Purpose: trigonometric functions; compute the cosine, sine and tangent of x, where x is a numeric or complex vector (angles in radians).
Example:
> sin(pi/2)
[1] 1
> cos(pi)
[1] -1

9. log(x, base)
Purpose: computes the logarithm of x (a number or vector) to the given base; with no base argument it computes the natural logarithm.
Example:
> log(6, 3)
[1] 1.63093

10. log10(x)
Purpose: computes common (i.e. base 10) logarithms.
Example:
> log10(6)
[1] 0.7781513

11. exp(x)
Purpose: computes the exponential value of a number or numeric vector, e^x.
Example:
> x <- 5
> exp(x)
[1] 148.4132

12. rep(x, ntimes)
Purpose: repeats x n times.
Example:
> y <- rep(1:3, 2)
> y
[1] 1 2 3 1 2 3

13. cut(x, breaks, labels = NULL, include.lowest = FALSE, right = TRUE, dig.lab = 3, ordered_result = FALSE, ...)
Purpose: divides the range of x into intervals and codes the values in x according to which interval they fall into. The leftmost interval corresponds to level one, the next leftmost to level two, and so on.
Example:
> v <- c(8, 13, 19, 3, 14, 7, 6, 12, 18, 9, 7, 14, 2, 3, 8, 11, 17)
> c <- cut(v, c(0, 5, 10, 15, 20))
> c
[1] (5,10] (10,15] (15,20] (0,5] (10,15] (5,10] (5,10] (10,15] (15,20]
[10] (5,10] (5,10] (10,15] (0,5] (0,5] (5,10] (10,15] (15,20]
Levels: (0,5] (5,10] (10,15] (15,20]
Here the vector v contains elements between 2 and 19. Each element is assigned a bin, the first of which spans 0 to 5, the second 5 to 10, and so on (according to the second argument). The returned value c is a vector of the same length as v containing the bin for each element: the first element (8) falls in (5,10], the second element (13) falls in (10,15].

14. sum(x)
Purpose: returns the sum of all the values present in its arguments.
Example:
> sum(1:5)
[1] 15
> sum(1, 2, 3, 4, 5)
[1] 15

15. diff(x, lag=1, differences=1, ...)
Purpose: returns suitably lagged and iterated differences, where x is a numeric vector or matrix containing the values to be differenced, lag is an integer indicating which lag to use, and differences is an integer indicating the order of the difference.
Example:
> diff(1:10, 2)
[1] 2 2 2 2 2 2 2 2
> diff(1:10, 2, 2)
[1] 0 0 0 0 0 0

16. min(x)
Purpose: finds the minimum of a vector.
Example:
> x <- c(5, 8, 3, 9, 2, 7, 4, 6, 10)
> min(x)
[1] 2

17. max(x)
Purpose: finds the maximum of a vector.
Example:
> max(x)
[1] 10

18. scale(x, center=TRUE, scale=TRUE)
Purpose: a generic function whose default method centers and/or scales the columns of a numeric matrix.
Example:
> x <- matrix(1:10, ncol = 2)
> (centered.x <- scale(x, scale = FALSE))
     [,1] [,2]
[1,]   -2   -2
[2,]   -1   -1
[3,]    0    0
[4,]    1    1
[5,]    2    2
attr(,"scaled:center")
[1] 3 8
> cov(centered.scaled.x <- scale(x))
     [,1] [,2]
[1,]    1    1
[2,]    1    1

4.3 CHARACTER FUNCTIONS


Below is a list of functions which operate on character/string arguments.

1. substr(x, start=n1, stop=n2)
Purpose: extract or replace substrings in a character vector.
Example:
> x <- "abcdef"
> substr(x, 2, 4)
[1] "bcd"

2. grep(pattern, x, ignore.case=FALSE, fixed=FALSE)
Purpose: search for pattern in x. If fixed=FALSE then pattern is a regular expression; if fixed=TRUE then pattern is a literal text string. Returns the matching indices.
Example:
> grep("A", c("b","A","c"), fixed=TRUE)
[1] 2

3. sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)
Purpose: find pattern in x and replace it with the replacement text. If fixed=FALSE then pattern is a regular expression; if fixed=TRUE then pattern is a literal text string.
Example:
> sub("\\s", ".", "Hello There")
[1] "Hello.There"

4. strsplit(x, split)
Purpose: split the elements of character vector x at split; the result is a list.
Example:
> strsplit("abc", "")
[[1]]
[1] "a" "b" "c"

5. paste(..., sep="")
Purpose: concatenate strings, using the sep string to separate them.
Example:
> paste("x", 1:3, sep="")
[1] "x1" "x2" "x3"
> paste("x", 1:3, sep="M")
[1] "xM1" "xM2" "xM3"
> paste("Today is", date()) # pastes the current date and time

6. toupper(x)
Purpose: converts the string x into upper-case.

7. tolower(x)
Purpose: converts the string x into lower-case.

4.4 STATISTICAL FUNCTIONS

1) Mean

The mean of a set of observations is just a normal, old-fashioned average: add all of the values up, and then
divide by the total number of values. The first five AFL margins were 56, 31, 56, 8 and 32, so the mean of these
observations is just:

(56 + 31 + 56 + 8 + 32) / 5 = 183 / 5 = 36.60

Calculating the mean in R


In R, we can do this just by typing:
> (56 + 31 + 56 + 8 + 32) / 5
[1] 36.6

Outputs the answer 36.6, just as if it were a calculator. However, that’s not the only way to do the calculations,
and when the number of observations starts to become large, it’s easily the most tedious. Besides, in almost
every real world scenario, we have already got the actual numbers stored in a variable of some kind, just like
we have with the vector. Under those circumstances, what we want is a function that will just add up all the
values stored in a numeric vector. That’s what the sum( ) function does. If we want to add up all values in the
data set, we can do so using the following command:
>sum( var_name )
[1] 6213

If we only want the sum of the first five observations, then we can use square brackets to pull out only the first
five elements of the vector. So the command would now be:
>sum( var_name[1:5] )
[1] 183
To calculate the mean, we now tell R to divide the output of this summation by five, so the command that we
need to type now becomes the following:
>sum( var_name[1:5] )/ 5
# [1] 36.6

Although it’s pretty easy to calculate the mean using the sum() function, we can do it in an even easier way,
since R also provides us with the mean() function. To calculate the mean for all values, we would use the
following command:
>mean( x = var_name )
# [1] 35.30114

However, since x is the first argument to the function, we could have omitted the argument name. To calculate
the mean for the first five observations:
> mean( var_name[1:5] )
# [1] 36.6

This gives exactly the same answers as the previous calculations.

2) Median

The second measure of central tendency that people use a lot is the median, and it’s even easier to describe
than the mean. The median of a set of observations is just the middle value. As before let’s imagine we were
interested only in the first winning margins values: 56, 31, 56, 8 and 32. To figure out the median, we sort these
numbers into ascending order:
8, 31, 32, 56, 56

From inspection, it’s obvious that the median value of these 5 observations is 32 since that’s the middle one in
the sorted list. Easy stuff. But what should we do if we were interested in the first 6 values rather than the first
5? Since the sixth value is 14, our sorted list is now
8, 14, 31, 32, 56, 56
and there are two middle numbers, 31 and 32. The median is defined as the average of those two numbers,
which is of course 31.5. As before, it’s very tedious to do this by hand when we have got lots of numbers. To
illustrate this, here’s what happens when we use R to sort all values (e.g.: 176 total values). First, we use the
sort () function to display the values in increasing numerical order:
> sort( x = var_name)
# [1] 0 0 1 1 1 1 2 2 3 3 3 3 3
[14] 3 3 3 4 4 5 6 7 7 8 8 8 8
[27] 8 9 9 9 9 9 9 10 10 10 10 10 11
[40] 11 11 12 12 12 13 14 14 15 16 16 16 16
[53] 18 19 19 19 19 19 20 20 20 21 21 22 22
[66] 22 23 23 23 24 24 25 25 26 26 26 26 27
[79] 27 28 28 29 29 29 29 29 29 30 31 32 32
[92] 33 35 35 35 35 36 36 36 36 36 36 37 38
[105] 38 38 38 38 39 39 40 41 42 43 43 44 44
[118] 44 44 44 47 47 47 48 48 48 49 49 50 50
[131] 50 50 52 52 53 53 54 54 55 55 55 56 56
[144] 56 57 60 61 61 63 64 65 65 66 67 68 70
[157] 71 71 72 73 75 75 76 81 82 82 83 84 89
[170] 94 95 98 101 104 108 116

The middle values are 30 and 31, so the median is 30.5. In real life, of course, no-one actually calculates the
median by sorting the data and then looking for the middle value. In real life, we use the median command:
> median( x = var_name)
[1] 30.5

which outputs the median value of 30.5.

Mean or median? What’s the difference?

Knowing how to calculate means and medians is only a part of the story. We also need to understand what each
one is saying about the data, and what that implies for when we should use each one.
 If data are nominal scale, we probably shouldn’t be using either the mean or the median. Both the mean
and the median rely on the idea that the numbers assigned to values are meaningful. If the numbering
scheme is arbitrary, then it’s probably best to use the mode instead.
 If data are ordinal scale, we are more likely to want to use the median than the mean. The median only
makes use of the order information in our data (i.e., which numbers are bigger), but doesn’t depend on
the precise numbers involved. That’s exactly the situation that applies when the data are ordinal scale.
The mean, on the other hand, makes use of the precise numeric values assigned to the observations, so
it’s not really appropriate for ordinal data.
 For interval and ratio scale data, either one is generally acceptable. Which one we pick depends a bit on
what we are trying to achieve. The mean has the advantage that it uses all the information in the data
(which is useful when we don’t have a lot of data), but it’s very sensitive to extreme values.

3) Trimmed Mean

One of the fundamental rules of applied statistics is that the data are messy. The data sets that we obtain are
never as straightforward as the statistical theory says. This can have awkward consequences. To illustrate,
consider this rather strange looking data set:

-100, 2, 3, 4, 5, 6, 7, 8, 9, 10

If we were to observe this in a real life data set, we would probably suspect that something funny was going on
with the -100 value. It's probably an outlier, a value that doesn't really belong with the others. We might consider
removing it from the data set entirely. In real life, however, we don’t always get such cut-and-dried examples.
For instance, we might get this instead:

-15, 2, 3, 4, 5, 6, 7, 8, 9, 12

The -15 looks a bit suspicious, but not anywhere near as much as that -100 did. In this case, it's a little trickier.
It might be a legitimate observation, it might not. When faced with a situation where some of the most extreme-
valued observations might not be quite trustworthy, the mean is not necessarily a good measure of central
tendency. It is highly sensitive to one or two extreme values and is thus not considered to be a robust measure.

One remedy that we’ve seen is to use the median. A more general solution is to use a “trimmed mean”. To
calculate a trimmed mean, what we do is “discard” the most extreme examples on both ends (i.e., the largest
and the smallest), and then take the mean of everything else. The goal is to preserve the best characteristics of
the mean and the median: just like a median, we aren’t highly influenced by extreme outliers, but as the mean,
we “use” more than one of the observations.

Generally, we describe a trimmed mean in terms of the percentage of observation on either side that is
discarded. So, for instance, a 10% trimmed mean discards the largest 10% of the observations and the smallest
10% of the observations and then takes the mean of the remaining 80% of the observations. Not surprisingly,
the 0% trimmed mean is just the regular mean, and the 50% trimmed mean is the median. In that sense,
trimmed means provide a whole family of central tendency measures that span the range from the mean to the
median.

In the above example, we have 10 observations, and so a 10% trimmed mean is calculated by ignoring the largest
value (i.e., 12) and the smallest value (i.e., -15) and taking the mean of the remaining values. First, let’s enter
the data
>dataset <- c( -15,2,3,4,5,6,7,8,9,12 )

Next, let’s calculate means and medians:


>mean( x = dataset )
# [1] 4.1
>median( x = dataset )
# [1] 5.5

That’s a fairly substantial difference, let’s just try trimming the mean a bit. If we take a 10% trimmed mean,
we’ll drop the extreme values on either side and take the mean of the rest:
>mean( x = dataset, trim = .1)
# [1] 5.5

which in this case, gives exactly the same answer as the median. Note that, to get a 10% trimmed mean you
write trim = .1, not trim = 10.

4) Mode

The mode of a sample is very simple: it is the value that occurs most frequently. To illustrate the mode using
the data, let’s examine a different aspect of the data set. Who has played in the most finals? The afl.finalists
variable is a factor that contains the name of every team that played in any AFL final from 1987-2010, so let’s
have a look at it:

>print( afl.finalists )

# [1] Hawthorn Melbourne Carlton


[4] Melbourne Hawthorn Carlton
[7] Melbourne Carlton Hawthorn
[10] Melbourne Melbourne Hawthorn

…….

[391] St Kilda Hawthorn Western Bulldogs


[394] Carlton Fremantle Sydney
[397] Geelong Western Bulldogs St Kilda
[400] St Kilda
17 Levels: Adelaide Brisbane Carlton ... Western Bulldogs
What we could do is read through all 400 entries, and count the number of occasions on which each team name
appears in our list of finalists, thereby producing a frequency table. However, that would be mindless and boring:
exactly the sort of task that computers are great at. So let’s use the table() function to do this task for us:

>table( afl.finalists )
# afl.finalists
Adelaide Brisbane Carlton Collingwood
26 25 26 28
Essendon Fitzroy Fremantle Geelong
32 0 6 39
Hawthorn Melbourne North Melbourne Port Adelaide
27 28 28 17
Richmond St Kilda Sydney West Coast
6 24 26 38
Western Bulldogs
24
Now that we have our frequency table, we can just look at it and see that, over the 24 years for which we have
data, Geelong has played in more finals than any other team. Thus, the mode of the finalist's data is "Geelong".
The core packages in R don’t have a function for calculating the mode, there is a function is called modeOf(),
and here’s how we can use it:

>modeOf( x = afl.finalists )
# [1] "Geelong"

There’s also a function called maxFreq() that tells us what the modal frequency is. If we apply this function to
our finalist's data, we obtain the following:

>maxFreq( x = afl.finalists )
# [1] 39

Taken together, we observe that Geelong (39 finals) played in more finals than any other team during the 1987-
2010 period.
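If the lsr package is not available, the same two answers can be recovered with base R using the table() output we computed above; a quick sketch:

>tab <- table( afl.finalists )
>names( which.max( tab ) ) # the modal value
# [1] "Geelong"
>max( tab ) # the modal frequency
# [1] 39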
One last point to make with respect to the mode. While it’s generally true that the mode is most often calculated
when we have nominal scale data (because means and medians are useless for those sorts of variables), there
are some situations in which we really do want to know the mode of an ordinal, interval or ratio scale variable.
Let’s consider the scenario, a friend of yours is offering a bet. They pick a football game at random, and (without
knowing who is playing) you have to guess the exact margin. If you guess correctly, you win $50. If you don’t,
you lose $1. There are no consolation prizes for “almost” getting the right answer. You have to guess exactly the
right margin. For this bet, the mean and the median are completely useless to you. It is the mode that you
should bet on. So, we calculate this modal value
>modeOf( x = var_name )
# [1] 3
>maxFreq( x = var_name )
# [1] 8

So the data suggest you should bet on a 3-point margin, since that value was observed 8 times, more often than any other.
The statistics that we’ve discussed so far all relate to central tendency. That is, they all talk about which values
are “in the middle” or “popular” in the data. However, the central tendency is not the only type of summary
statistic that we want to calculate. The second thing that we really want is a measure of the variability of the
data. That is, how “spread out” are the data? How “far” away from the mean or median do the observed values
tend to be?

5) Range

The range of a variable is very simple: it’s the biggest value minus the smallest value. For example, consider the
data with the maximum value is 116, and the minimum value is 0. We can calculate these values in R using the
max() and min() functions:
max( var_name )
min( var_name )
The other possibility is to use the range() function, which outputs both the minimum value and the maximum
value in a vector, like this:

>range( var_name )
# [1] 0 116

6) Interquartile Range

The interquartile range (IQR) is like the range, but instead of calculating the difference between the biggest and
smallest value, it calculates the difference between the 25th percentile and the 75th percentile. A quantile is easiest to
define by example: the 10th percentile of a data set is the smallest number x such that 10% of the data is less than x. In fact, we’ve already
come across the idea: the median of a data set is its 50th quantile/percentile! R actually provides a way of
calculating quantiles, using the quantile() function. Let’s use it to calculate the median:
>quantile( x = var_name, probs = .5)
# 50%
# 30.5

And not surprisingly, this agrees with the answer that we saw earlier with the median() function. Now, we can
actually input lots of quantiles at once, by specifying a vector for the probs argument. So let’s do that, and get
the 25th and 75th percentile:

>quantile( x = var_name, probs = c(.25,.75) )


# 25% 75%
# 12.75 50.50

And, by noting that 50.5 - 12.75=37.75, we can see that the interquartile range for the data is 37.75. Of course,
that seems like too much work to do all that typing, so R has a built in function called IQR() that we can use:

>IQR( x = var_name)
# [1] 37.75

While it’s obvious how to interpret the range, it’s a little less obvious how to interpret the IQR. The simplest way
to think about it is like this: the interquartile range is the range spanned by the “middle half” of the data. That
is, one quarter of the data falls below the 25th percentile, one quarter of the data is above the 75th percentile,
leaving the “middle half” of the data lying in between the two. And the IQR is the range covered by that middle
half.

7) Mean Absolute Deviation

The two measures we’ve looked at so far, the range and the interquartile range, both rely on the idea that we
can measure the spread of the data by looking at the quantiles of the data. However, this isn’t the only way to
think about the problem. A different approach is to select a meaningful reference point (usually the mean or
the median) and then report the “typical” deviations from that reference point. What do we mean by “typical”
deviation? Usually, the mean or median value of these deviations! In practice, this leads to two different
measures, the “mean absolute deviation (from the mean)” and the “median absolute deviation (from the
median)”.
One useful thing about this measure is that the name actually tells you exactly how to calculate it. Let’s consider
the data 56, 31, 56, 8 and 32. To calculate mean absolute deviation we execute a series of commands that might
look like this:

X <- c(56,31,56,8,32) # enter the data


X.bar <- mean( X ) # step 1. the mean of the data
AD <- abs( X - X.bar ) # step 2. the absolute deviations from the mean
AAD <- mean( AD ) # step 3. the mean absolute deviations
print( AAD ) # step 4. print the results
# [1] 15.52

Each of those commands is pretty simple, but there are just too many of them. The lsr package wraps all of these
steps into a single function, aad(); if we apply it to our data, we get this:

>aad( X )
# [1] 15.52

8) Variance

Although the mean absolute deviation measure has its uses, it’s not the best measure of variability to use. From
a purely mathematical perspective, there are some solid reasons to prefer squared deviations rather than
absolute deviations. If we do that, we obtain a measure is called the variance. The variance of a data set X is
sometimes written as Var(X), but it’s more commonly denoted s2. The formula that we use to calculate the
variance of a set of observations is as follows:

It’s basically the same formula that we used to calculate the mean absolute deviation, except that instead of
using “absolute deviations” we use “squared deviations”. It is for this reason that the variance is sometimes
referred to as the “mean square deviation”.

Now that we’ve got the basic idea, let’s have a look at a concrete example. Once again, let’s use the first five
AFL games as our data. If we follow the same approach that we took last time, we end up with the following
table:

Value (X_i)    Deviation (X_i - X.bar)    Squared deviation
    56                 19.4                    376.36
    31                 -5.6                     31.36
    56                 19.4                    376.36
     8                -28.6                    817.96
    32                 -4.6                     21.16
That last column contains all of the squared deviations, so all we have to do is average them. If we do that by
typing all the numbers into R by hand then,
> ( 376.36 + 31.36 + 376.36 + 817.96 + 21.16 ) / 5
# [1] 324.64

Or we can calculate the variance of X by using the following command,


>mean( (X - mean(X) )^2)
# [1] 324.64

and as usual, we get the same answer as the one that we got when we did everything by hand. R has a built-in
function called var() which does calculate variances. So we could also do this...
>var(X)
# [1] 405.8
Notice that var() gives a different answer (405.8 rather than 324.64). That's because var() divides the sum of
squared deviations by N - 1 rather than by N; dividing by N - 1 gives the standard unbiased estimate of the
population variance from a sample.
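We can verify the relationship between the two quantities directly: with N = 5 observations, the two estimates differ by a factor of N/(N-1) = 5/4:

>mean( (X - mean(X))^2 ) * 5/4
# [1] 405.8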

9) Standard Deviation

We would like to have a measure that is expressed in the same units as the data itself (i.e. points, not points-
squared). The solution is obvious: take the square root of the variance, known as the standard deviation,
also called the "root mean squared deviation", or RMSD. This solves our problem fairly neatly: while nobody has
a clue what "a variance of 324.64 points-squared" really means, it's much easier to understand "a standard
deviation of 18.01 points" since it's expressed in the original units. It is traditional to refer to the standard
deviation of a sample of data as s, though "sd" and "std dev." are also used at times. Because the standard
deviation is equal to the square root of the variance, we probably won't be surprised to see that the formula is:

s = sqrt( (1/N) * sum( (X_i - X.bar)^2 ) )
and the R function that we use to calculate it is sd(). Calculating standard deviations in R is simple:
>sd( var_name )
# [1] 26.07364
10) Median Absolute Deviation

The last measure of variability that we are going to discuss is the median absolute deviation (MAD). The basic
idea behind MAD is very simple and is pretty much identical to the idea behind the mean absolute deviation.
The difference is that we use the median everywhere.
Consider the data (1, 1, 2, 2, 4, 6, 9). It has a median value of 2. The absolute deviations about 2 are (1, 1, 0, 0,
2, 4, 7) which in turn have a median value of 1 (because the sorted absolute deviations are (0, 0, 1, 1, 2, 4, 7)).
So the median absolute deviation for this data is 1.
If we were to frame this idea as a pair of R commands, they would look like this:

# mean absolute deviation from the mean:


>mean( abs(var_name - mean(var_name )) )
# [1] 21.10124
# *median* absolute deviation from the *median*:
>median( abs(var_name - median(var_name )) )
# [1] 19.5

This has a straightforward interpretation: every observation in the data set lies some distance away from the
typical value (the median). So the MAD is an attempt to describe a typical deviation from a typical value in the
data set. The MAD value here is 19.5, indicating that a typical observation differs from the median value by about
19-20 points.
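We can check the small worked example above directly in R:

>x <- c(1, 1, 2, 2, 4, 6, 9)
>median( abs(x - median(x)) )
# [1] 1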

R has a built-in function for calculating MAD, called mad(). However, it's a little bit more complicated than
the functions that we've been using previously. Its syntax is:
mad(x, center = median(x), constant = 1.4826, na.rm = FALSE, low = FALSE, high = FALSE)
where,
 x -> a numeric vector,
 center -> Optionally, the centre: defaults to the median,
 constant -> scale factor,
 na.rm -> if TRUE then NA values are stripped from x before computation takes place,
 low -> if TRUE, compute the ‘lo-median’, i.e., for even sample size, do not average the two
middle values, but take the smaller one,
 high -> if TRUE, compute the ‘hi-median’, i.e., take the larger of the two middle values for even
sample size.
> mad(c(1:9))
# [1] 2.9652
> print(mad(c(1:9), constant = 1)) == mad(c(1:8, 100), constant = 1)
# [1] 2
# [1] TRUE
> x <- c(1,2,3,5,7,8)
> sort(abs(x - median(x)))
# [1] 1 1 2 3 3 4
> c(mad(x, constant = 1), mad(x, constant = 1, low = TRUE), mad(x, constant = 1, high = TRUE))
# [1] 2.5 2.0 3.0

11) “Summarising” a Variable

The summary() function is an easy thing to use, but a tricky thing to understand in full, since it’s a generic
function. The basic idea behind the summary() function is that it prints out some useful information about
whatever object we specify as the object argument. As a consequence, the behaviour of the summary() function
differs quite dramatically depending on the class of the object that we give it. In R, summary() is a generic
function used to produce result summaries of the results of various model fitting functions. The function invokes
particular methods which depend on the class of the first argument. Usage:

summary(object, …)
# S3 method for default
summary(object, …, digits, quantile.type = 7)
# S3 method for data.frame
summary(object, maxsum = 7, digits = max(3, getOption("digits")-3), …)
# S3 method for factor
summary(object, maxsum = 100, …)
# S3 method for matrix
summary(object, …)

where,

 object -> an object for which a summary is desired.


 maxsum -> integer, indicating how many levels should be shown for factors.
 digits -> integer, used for number formatting with signif() (for summary.default) or format() (for
summary.data.frame).
 quantile.type -> integer code used in quantile(*, type=quantile.type) for the default method.
 … -> additional arguments affecting the summary produced.
Example:

> x <- c("green","red","blue")


> summary(x)
# Length Class Mode
3 character character

Let’s create a logical variable blowouts in which the i-th element is TRUE if the winning margin in that game was
greater than 50 points:
>blowouts <- var_name > 50
>blowouts
# [1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[11] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
[21] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
……
Now let’s ask R for a summary( )
> summary( object = blowouts )
# Mode FALSE TRUE NA’s
logical 132 44 0

In this context, the summary() function gives us a count of the number of TRUE values, the number of FALSE
values, and the number of missing values (i.e., the NAs).

> summary(Age)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
5.00 17.75 30.00 32.30 44.25 70.00

12) A Few More Functions for Probability Distributions

Every distribution that R handles has four functions. There is a root name: for example, the root name for the normal distribution is norm. This root is prefixed by one of the letters:
• p for "probability", the cumulative distribution function (c. d. f.)
• q for "quantile", the inverse c. d. f.
• d for "density", the density function (p. f. or p. d. f.)
• r for "random", a random variable having the specified distribution
For the normal distribution, these functions are pnorm, qnorm, dnorm, and rnorm. For the binomial distribution,
these functions are pbinom, qbinom, dbinom, and rbinom and so forth.
For a continuous distribution (like the normal), the most useful functions for doing problems involving probability calculations are the "p" and "q" functions (c. d. f. and inverse c. d. f.), because the density (p. d. f.) calculated by the "d" function can only be used to calculate probabilities via integrals, and R doesn't do integrals.

For a discrete distribution (like the binomial), the "d" function calculates the density (p. f.), which in this case is
a probability
f(x) = P(X = x)
and hence is useful in calculating probabilities.
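As a quick illustration of these naming conventions (a small sketch, using values that are easy to verify by hand):

> pnorm( 1.96 ) # P(X <= 1.96) for a standard normal
# [1] 0.9750021
> qnorm( 0.9750021 ) # the "q" function inverts the "p" function
# [1] 1.96
> dbinom( 2, size = 3, prob = 0.5 ) # P(X = 2) for Binomial(3, 0.5), i.e., 3/8
# [1] 0.375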
The main probability distribution functions in R are summarised below.

1. Normal distribution
Syntax:
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
Purpose: Density, distribution function, quantile function and random generation for the normal distribution with mean equal to mean and standard deviation equal to sd.
where,
• x, q -> vector of quantiles,
• p -> vector of probabilities,
• n -> the number of observations. If length(n) > 1, the length is taken to be the number required,
• mean -> vector of means,
• sd -> vector of standard deviations,
• log, log.p -> logical; if TRUE, probabilities p are given as log(p),
• lower.tail -> logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x].
Example:
> dnorm(0, mean=4, sd=10)
# [1] 0.03682701
> pnorm(0, mean=2, sd=3)
# [1] 0.2524925
> qnorm(0.5, mean=1, sd=2)
# [1] 1
> rnorm(4, mean=3, sd=3)
# [1] 4.580556 2.974903 4.756097 6.395894

2. Binomial distribution
Syntax:
dbinom(x, size, prob, log = FALSE)
pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
qbinom(p, size, prob, lower.tail = TRUE, log.p = FALSE)
rbinom(n, size, prob)
Purpose: Density, distribution function, quantile function and random generation for the binomial distribution with parameters size and prob.
where,
• x, q -> vector of quantiles,
• p -> vector of probabilities,
• n -> number of observations. If length(n) > 1, the length is taken to be the number required,
• size -> number of trials (zero or more),
• prob -> probability of success on each trial,
• log, log.p -> logical; if TRUE, probabilities p are given as log(p),
• lower.tail -> logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x].
Example:
> dbinom(5, size=10, prob=0.5)
# [1] 0.2460938
> rbinom(7, 150, .05)
# [1] 10 12 10 2 5 5 14
> pbinom(5, 10, 0.5)
# [1] 0.6230469
> qbinom(0.25, 10, .5)
# [1] 4

3. Poisson distribution
Syntax:
dpois(x, lambda, log = FALSE)
ppois(q, lambda, lower.tail = TRUE, log.p = FALSE)
qpois(p, lambda, lower.tail = TRUE, log.p = FALSE)
rpois(n, lambda)
Purpose: Density, distribution function, quantile function and random generation for the Poisson distribution with parameter lambda.
where,
• x -> vector of (non-negative integer) quantiles,
• q -> vector of quantiles,
• p -> vector of probabilities,
• n -> number of random values to return,
• lambda -> vector of (non-negative) means,
• log, log.p -> logical; if TRUE, probabilities p are given as log(p),
• lower.tail -> logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x].
Example:
> ppois(16, lambda=12) # lower tail
# [1] 0.89871
> rpois(10, 10)
# [1] 6 10 11 3 10 7 7 8 14 12
> dpois(20, lambda=12)
# [1] 0.009682032
> qpois(0.25, lambda=12)
# [1] 10

4. Uniform distribution
Syntax:
dunif(x, min = 0, max = 1, log = FALSE)
punif(q, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
qunif(p, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
runif(n, min = 0, max = 1)
Purpose: These functions provide information about the uniform distribution on the interval from min to max. dunif gives the density, punif gives the distribution function, qunif gives the quantile function and runif generates random deviates.
where,
• x, q -> vector of quantiles,
• p -> vector of probabilities,
• n -> number of observations. If length(n) > 1, the length is taken to be the number required,
• min, max -> lower and upper limits of the distribution; must be finite,
• log, log.p -> logical; if TRUE, probabilities p are given as log(p),
• lower.tail -> logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x].
Example:
> u <- runif(20)
> punif(u) == u
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE
> dunif(u) == 1
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE
> var(runif(10000))
# [1] 0.08273603
Check your Progress 1
Fill in the Blanks.
1. _______ computes common logarithms.
2. The mode is the value that occurs most ______.
3. Median absolute deviation is calculated using _______ function.

Activity 1
1. Write a program in R to print the sum, minimum and maximum of a given vector.
2. Apply all the character functions on a given string and note the output.
3. Try the probability distribution functions on the given data.

Summary
• Calculating some basic descriptive statistics is one of the very first things we do when analysing real data.
• In R, we have various built-in functions which are helpful in performing statistical operations on data.
• Every distribution that R handles has four functions – probability, quantile, density and random.

Keywords
• String: It is traditionally a sequence of characters.
• Probability: It is a measure quantifying the likelihood that events will occur.

Self-Assessment Questions
1. State the difference between mean and median.
2. List and explain the use of character functions.
3. Write the syntax of the following functions:
a. mean
b. median
c. quantile
d. pnorm
e. substr

Answers To Check Your Progress


Check your Progress 1
Fill in the Blanks.
1. log10(x) computes common logarithms.
2. Mode is the value that occurs most frequently.
3. Median absolute deviation is calculated using mad() function.
Suggested Reading
• The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff.
• https://kingaa.github.io/R_Tutorial/#data-structures-in-r.
• https://d1b10bmlvqabco.cloudfront.net/attach/ighbo26t3ua52t/igp9099yy4v10/igz7vp4w5su9/OReilly_HandsOn_Programming_with_R_2014.pdf.
• The R Book by Michael J. Crawley, Imperial College London at Silwood Park, UK.
• R Programming for Data Science by Roger D. Peng.
• An introduction to R by Longhow Lam.
Unit 5
Graphs
Structure:
5.1 Introduction
5.2 An Overview of R Graphics
5.3 An Introduction to Plotting
5.3.1 A Tedious Digression
5.3.2 Customising the Title and the Axis Labels
5.3.3 Changing the Plot Type
5.3.4 Changing Other Features of the Plot
5.3.5 Changing the Appearance of the Axes
5.4 Histograms
5.4.1 Visual Style of the Histogram
5.5 Stem and Leaf Plots
5.6 Boxplots
5.6.1 Visual Style of Boxplot
5.6.2 Using Box Plots to Detect Outliers
5.6.3 Drawing Multiple Boxplots
5.7 Scatterplots
5.7.1 More Elaborate Options
5.8 Bar Graphs
5.8.1 Changing Global Settings Using par()
5.9 Saving Image Files Using R and Rstudio
Summary
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) as requested by the work’s creator or
licensees. This license is available at https://creativecommons.org/licenses/by-nc-sa/4.0/ .
Objectives
After going through this unit, you will be able to:

• Understand the R graphics system
• Visualise the data in different plots and graphs

5.1 INTRODUCTION
Visualising data is one of the most important tasks facing the data analyst. It's important for two distinct but closely related reasons. Firstly, there's the matter of drawing "presentation graphics": displaying data in a clean, visually appealing fashion makes it easier for the reader to understand what we're trying to tell them. Equally important, perhaps even more important, is the fact that drawing graphs helps us understand the data. To that end, it's important to draw "exploratory graphics" that help us learn about the data as we go about analysing it. In this unit, we discuss several fairly standard graphs that are used a lot when analysing and presenting data, and show how to create these graphs in R.

5.2 AN OVERVIEW OF R GRAPHICS


Reduced to its simplest form, you can think of an R graphic as being much like a painting. You start out
with an empty canvas. Every time you use a graphics function, it paints some new things onto your
canvas. Later on, you can paint more things over the top if you want; but just like painting, you can’t
“undo" your strokes. If you make a mistake, you have to throw away your painting and start over.

Fortunately, this is easier to do when using R than it is when painting a picture in real life: you delete the plot and then type a new set of commands. This way of thinking about drawing graphs is referred to as the painter's model. So far, this probably doesn't sound particularly complicated, and for the vast majority of graphs you'll want to draw it's exactly as simple as it sounds.

In RStudio (regardless of which operating system we're on), there's a separate device called RStudioGD that forces R to paint inside the "Plots" panel in RStudio. However, from the computer's perspective there's nothing terribly special about drawing pictures on screen, and so R is quite happy to paint pictures directly into a file. R can paint several different types of image files: jpeg, png, pdf, postscript, tiff and bmp files are all among the options that are available. For the most part, these different devices all behave the same way, so we don't really need to know much about the differences between them when learning how to draw pictures. But, just like real life painting, sometimes the specifics do matter.

In R, a graphics system defines a collection of very low-level graphics commands about what to draw and where to draw it. Something that surprises most new R users is the discovery that R actually has two completely independent graphics systems, known as traditional graphics (in the graphics package) and grid graphics (in the grid package). Not surprisingly, the traditional graphics system is the older of the two: in fact, it's actually older than R, since it has its origins in S, the system from which R is descended. Grid graphics are newer, and in some respects more powerful, so many of the more recent, fancier graphical tools in R make use of grid graphics.

R has quite a number of different packages, each of which provides a collection of high-level graphics commands. A single high-level command is capable of drawing an entire graph, complete with a range of customisation options. The grid universe relies heavily on two packages - lattice and ggplot2 - each of which provides a quite different visual style.
5.3 AN INTRODUCTION TO PLOTTING

Before we discuss any specialised graphics, let's start by drawing a few very simple graphs just to get
a feel for what it's like to draw pictures using R. To that end, let's create a small vector Fibonacci that
contains a few numbers we'd like R to draw for us. Then, we'll ask R to plot() those numbers:

> Fibonacci <- c( 1,1,2,3,5,8,13 )

> plot( Fibonacci )

The result is Fig. 5.1. As we can see, what R has done is plot the values stored in the Fibonacci variable
on the vertical axis (y-axis) and the corresponding index on the horizontal axis (x-axis). In other words,
since the 4th element of the vector has a value of 3, we get a dot plotted at the location (4,3). That's
pretty straightforward. However, there's quite a lot of customisation options available.

Figure 5.1

5.3.1 A Tedious Digression


Before we go into any discussion of customising plots, we need a little more background. The important thing to note when using the plot() function is that it's another example of a generic function, much like print() and summary(), and so its behaviour changes depending on what kind of input we give it. However, the plot() function is somewhat fancier than the other two, and its behaviour depends on two arguments, x (the first input, which is required) and y (which is optional). This makes it (a) extremely powerful and (b) hilariously unpredictable when we're not sure what we're doing. If we look at the help documentation for the default plotting method (i.e., help("plot.default")) we can see a very long list of arguments that we can specify to customise the plot.

What exactly is a graphical parameter? Basically, the idea is that some characteristics of a plot are pretty universal: for instance, regardless of what kind of graph you're drawing, you probably need to specify what colour to use for the plot, right? So shouldn't there be something like a col argument to every single graphics function in R? Well, sort of. In order to avoid having hundreds of arguments for every single function, what R does is refer to a bunch of these "graphical parameters" which are pretty general purpose. Graphical parameters can be changed directly by using the low-level par() function. Fortunately, (a) the default settings are generally pretty good, so we can ignore the majority of the parameters, and (b) we very rarely need to use par() directly, because we can "pretend" that graphical parameters are just additional arguments to the high-level function (e.g., plot.default()).
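For example, we can query a graphical parameter by passing its name to par(), or override it for a single plot by passing it as an extra argument to the high-level function. A quick sketch (the default colour on a fresh device is "black"):

> par( "col" ) # query the current default plotting colour
# [1] "black"
> plot( Fibonacci, col = "red" ) # override it for this one plot only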

5.3.2 Customising the Title and the Axis Labels


One of the first things we often want to do when customising a plot is to label it better. We might want to specify more appropriate axis labels, or add a title or a subtitle. The arguments that we need to specify to make this happen are:

• main. A character string containing the title.
• sub. A character string containing the subtitle.
• xlab. A character string containing the x-axis label.
• ylab. A character string containing the y-axis label.

These aren't graphical parameters; they're arguments to the high-level function. However, because the high-level functions all rely on the same low-level function to do the drawing, the names of these arguments are identical for pretty much every high-level function. Let's have a look at what happens when we make use of all these arguments. Here's the command:

> plot( x = Fibonacci,


+ main = "You specify title using the 'main' argument",
+ sub = "The subtitle appears here! (Use the 'sub' argument for this)",
+ xlab = "The x-axis label is 'xlab'",
+ ylab = "The y-axis label is 'ylab'"
+)
The picture that this draws is shown in Fig. 5.2.

Figure 5.2
There are a couple of interesting features worth calling attention to. Firstly, notice that the subtitle is drawn below the plot. Secondly, notice that R has used boldface text and a larger font size for the title: these are the default settings in R graphics, and arguably they draw too much attention to the title. To that end, there are a bunch of graphical parameters that we can use to customise the font style:

• Font styles: font.main, font.sub, font.lab, font.axis. These four parameters control the font style used for the plot title (font.main), the subtitle (font.sub), the axis labels (font.lab: note that we can't specify separate styles for the x-axis and y-axis without using low-level commands), and the numbers next to the tick marks on the axis (font.axis). Somewhat irritatingly, these arguments are numbers instead of meaningful names: a value of 1 corresponds to plain text, 2 means boldface, 3 means italic and 4 means bold italic.
• Font colours: col.main, col.sub, col.lab, col.axis. These parameters do pretty much what the name says: each one specifies a colour in which to type each of the different bits of text. Conveniently, R has a very large number of named colours (type colours() to see a list of over 650 colour names that R knows), so we can use the English language name of the colour to select it. Thus, the parameter value here is a string like "red", "gray25" or "springgreen4" (yes, R really does recognise four different shades of "spring green").
• Font size: cex.main, cex.sub, cex.lab, cex.axis. Font size is handled in a slightly curious way in R. The "cex" part here is short for "character expansion", and it's essentially a magnification value. By default, all of these are set to a value of 1, except for the title font: cex.main has a default magnification of 1.2, which is why the title font is 20% bigger than the others.
• Font family: family. This argument specifies a font family to use: the simplest way to use it is to set it to "sans", "serif", or "mono", corresponding to a sans serif font, a serif font, or a monospaced font. We can give the name of a specific font, but keep in mind that different operating systems use different fonts, so it's probably safest to keep it simple. Better yet, unless we have some deep objections to the R defaults, just ignore this parameter entirely.

The following command can be used to draw Fig. 5.3:

> plot( x = Fibonacci, # the data to plot


+ main = "The first 7 Fibonacci numbers", # the title
+ xlab = "Position in the sequence", # x-axis label
+ ylab = "The Fibonacci number", # y-axis label
+ font.main = 1, # plain text for title
+ cex.main = 1, # normal size for title
+ font.axis = 2, # bold text for numbering
+ col.lab = "gray50" # grey colour for labels
+)
Figure 5.3
Although this command is quite long, it's not complicated: all it does is override a bunch of the default
parameter values. The only difficult aspect to this is that we have to remember what each of these
parameters is called, and what all the different values are.

5.3.3 Changing the Plot Type


Adding and customising the titles associated with the plot is one way in which we can play around with what the picture looks like. Another thing that we will want to do is customise the appearance of the actual plot! To start with, let's look at the single most important option that the plot() function (or, recalling that we're dealing with a generic function, in this case the plot.default() function, since that's the one doing all the work) provides, which is the type argument. The type argument specifies the visual style of the plot. The possible values for this are:

• type = "p". Draw the points only.
• type = "l". Draw a line through the points.
• type = "o". Draw the line over the top of the points.
• type = "b". Draw both points and lines, but don't overplot.
• type = "h". Draw "histogram-like" vertical bars.
• type = "s". Draw a staircase, going horizontally then vertically.
• type = "S". Draw a staircase, going vertically then horizontally.
• type = "c". Draw only the connecting lines from the "b" version.
• type = "n". Draw nothing. (Apparently this is useful sometimes?)
Figure 5.4

The simplest way to illustrate what each of these really looks like is just to draw them. To that end,
Fig. 5.4 shows the same Fibonacci data, drawn using six different types of plot. As you can see, by
altering the type argument you can get a qualitatively different appearance to your plot. In other
words, as far as R is concerned, the only difference between a scatterplot and a line plot is that we
can draw a scatterplot by setting type = "p" and we can draw a line plot by setting type = "l".

As you can see by looking at Fig. 5.4, a line plot implies that there is some notion of continuity from
one point to the next, whereas a scatterplot does not.
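For instance, the line-based panels of Fig. 5.4 can be reproduced with a one-line change to our earlier command; a small sketch using the same data:

> plot( Fibonacci, type = "l" ) # line plot: connect the points, no dots
> plot( Fibonacci, type = "b" ) # draw both points and lines, without overplotting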

Fig. 5.5 (a) Changing the pch parameter, (b) the lty parameter
5.3.4 Changing Other Features of the Plot
In the previous section, we talked about a group of graphical parameters that are related to the
formatting of titles, axis labels etc. The second group of parameters to discuss are those related to the
formatting of the plot itself:

• Colour of the plot: col. As we saw with the previous colour-related parameters, the simplest way to specify this parameter is using a character string: e.g., col = "blue". It's a pretty straightforward parameter to specify: the only real subtlety is that every high-level function tends to draw a different "thing" as its output, and so this parameter gets interpreted a little differently by different functions. However, for the plot.default() function it's pretty simple: the col argument refers to the colour of the points and/or lines that get drawn!
• Character used to plot points: pch. The plot character parameter is a number, usually between 1 and 25. What it does is tell R what symbol to use to draw the points that it plots. The simplest way to illustrate what the different values do is with a picture. Fig. 5.5a shows the first 25 plotting characters. The default plotting character is a hollow circle (i.e., pch = 1).
• Plot size: cex. This parameter describes a character expansion factor (i.e., magnification) for the plotted characters. By default cex = 1, but if we want bigger symbols in our graph we should specify a larger value.
• Line type: lty. The line type parameter describes the kind of line that R draws. It has seven values which you can specify using a number between 0 and 6, or using a meaningful character string: "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash". Note that the "blank" version (value 0) just means that R doesn't draw the lines at all. The other six versions are shown in Fig. 5.5b.

Fig. 5.6 Customising various aspects of the plot itself

• Line width: lwd. The last graphical parameter in this category is the line width parameter, which is just a number specifying the width of the line. The default value is 1. Not surprisingly, larger values produce thicker lines and smaller values produce thinner lines. Try playing around with different values of lwd to see what happens.

To illustrate what you can do by altering these parameters, let's try the following command:

> plot( x = Fibonacci, # the data set


+ type = "b", # plot both points and lines
+ col = "blue", # change the plot colour to blue
+ pch = 19, # plotting character is a solid circle
+ cex = 5, # plot it at 5x the usual size
+ lty = 2, # change line type to dashed
+ lwd = 4 # change line width to 4x the usual
+)
The output is shown in Fig. 5.6.

5.3.5 Changing the Appearance of the Axes


There are a few other arguments to the plot.default() function. As before, many of these are standard arguments that are used by a lot of high-level graphics functions:

• Changing the axis scales: xlim, ylim. Generally R does a pretty good job of figuring out where to set the edges of the plot. However, we can override its choices by setting the xlim and ylim arguments. For instance, if we want the vertical scale of the plot to run from 0 to 100, then set ylim = c(0, 100).
• Suppress labelling: ann. This is a logical-valued argument that we can use if we don't want R to include any text for a title, subtitle or axis label. To do so, set ann = FALSE. This will stop R from including any text that would normally appear in those places. Note that this will override any of our manual titles. For example, if we try to add a title using the main argument, but we also specify ann = FALSE, no title will appear.
• Suppress axis drawing: axes. Again, this is a logical-valued argument. Suppose we don't want R to draw any axes at all. To suppress the axes, all we have to do is add axes = FALSE. This will remove the axes and the numbering, but not the axis labels (i.e., the xlab and ylab text). Note that we can get finer grain control over this by specifying the xaxt and yaxt graphical parameters instead (see below).
• Include a framing box: frame.plot. Suppose we have removed the axes by setting axes = FALSE, but we still want to have a simple box drawn around the plot; that is, we only wanted to get rid of the numbering and the tick marks, but we want to keep the box. To do that, we set frame.plot = TRUE.

Note that this list isn't exhaustive. There are a few other arguments to the plot.default() function that we can use. As always, however, if these aren't enough options, there are also a number of other graphical parameters that we might want to use. Here is a command that makes use of all these different options:

> plot( x = Fibonacci, # the data


+ xlim = c(0, 15), # expand the x-scale
+ ylim = c(0, 15), # expand the y-scale
+ ann = FALSE, # delete all annotations
+ axes = FALSE, # delete the axes
+ frame.plot = TRUE # but include a framing box
+)
The output is shown in Fig. 5.7, and it's pretty much exactly as we expect. The axis scales on both the
horizontal and vertical dimensions have been expanded, the axes have been suppressed as have the
annotations.
Before moving on, there are several graphical parameters relating to the axes, the box, and the general
appearance of the plot which allow finer grain control over the appearance of the axes and the
annotations.

• Suppressing the axes individually: xaxt, yaxt. These graphical parameters are basically just fancier versions of the axes argument we discussed earlier. If we want to stop R from drawing the vertical axis but we'd like it to keep the horizontal axis, set yaxt = "n".
• Box type: bty. In the same way that xaxt and yaxt are just fancy versions of axes, the box type parameter is really just a fancier version of the frame.plot argument, allowing us to specify exactly which of the four borders we want to keep. The possible values are "o" (the default), "l", "7", "c", "u", or "]", each of which will draw only those edges that the corresponding character suggests. That is, the letter "c" has a top, a bottom and a left, but is blank on the right hand side, whereas "7" has a top and a right, but is blank on the left and the bottom. Alternatively, a value of "n" means that no box will be drawn.

Fig. 5.7 Altering the scale and appearance of the plot axes

Fig. 5.8 Other ways to customise the axes

• Orientation of the axis labels: las. I presume that the name of this parameter is an acronym of label style or something along those lines; but what it actually does is govern the orientation of the text used to label the individual tick marks (i.e., the numbering, not the xlab and ylab axis labels). There are four possible values for las: A value of 0 means that the labels of both axes are printed parallel to the axis itself (the default). A value of 1 means that the text is always horizontal. A value of 2 means that the labelling text is printed at right angles to the axis. Finally, a value of 3 means that the text is always vertical.

Let's try the following command:

> plot( x = Fibonacci, # the data


+ xaxt = "n", # don't draw the x-axis
+ bty = "]", # keep bottom, right and top of box only
+ las = 1 # rotate the text
+)
The output is shown in Fig. 5.8.

5.4 HISTOGRAMS
Histograms are one of the simplest and most useful ways of visualising data. They make most sense when you have interval or ratio scale data and you want to get an overall impression of the data. The idea is to divide up the possible values into bins, and then count the number of observations that fall within each bin. This count is referred to as the frequency of the bin, and is displayed as a bar: in the AFL winning margins data, there are 33 games in which the winning margin was less than 10 points, and it is this fact that is represented by the height of the leftmost bar in Fig. 5.9(a). Drawing this histogram in R is pretty straightforward.

The function we need to use is called hist(), and it has pretty reasonable default settings. In fact,
Fig. 5.9(a) is exactly what we get if we just type this:

> hist( afl.margins ) # panel a

Although this image would need a lot of cleaning up in order to make a good presentation graphic (i.e., one you'd include in a report), it nevertheless does a pretty good job of describing the data. In fact, the big strength of a histogram is that (properly used) it does show the entire spread of the data, so we can get a pretty good sense of what it looks like. The downsides are that histograms aren't very compact, and that if the data are nominal scale then histograms are useless.

The main subtlety that we need to be aware of when drawing histograms is determining where the
breaks that separate bins should be located, and (relatedly) how many breaks there should be. In
Fig. 5.9 (a), you can see that R has made pretty sensible choices all by itself: the breaks are located at
0, 10, 20, . . . 120. On the other hand, consider the two histograms in Fig. 5.9 (b) and 5.9 (c), which are
produced using the following two commands:

> hist( x = afl.margins, breaks = 3 ) # panel b

> hist( x = afl.margins, breaks = 0:116 ) # panel c

In Fig. 5.9 (c), the bins are only 1 point wide. As a result, although the plot is very informative (it displays the entire data set with no loss of information at all!), it is very hard to interpret and feels quite cluttered. On the other hand, the plot in Fig. 5.9 (b) has a bin width of 50 points, and has the opposite problem: it's very easy to "read" this plot, but it doesn't convey a lot of information. One gets the sense that this histogram is hiding too much. In short, the way in which the breaks are specified has a big effect on what the histogram looks like, so it's important to make sure to choose the breaks sensibly.

In general, R does a pretty good job of selecting the breaks on its own, since it makes use of some
quite clever tricks that statisticians have devised for automatically selecting the right bins for a
histogram, but nevertheless it's usually a good idea to play around with the breaks a bit to see what
happens.

There is one fairly important thing to add regarding how the breaks argument works. There are two
different ways we can specify the breaks. We can either specify how many breaks we want and let R
figure out where they should go, or we can provide a vector that tells R exactly where the breaks
should be placed (panel c, breaks = 0:116). The behaviour of the hist() function is slightly different
depending on which version we use. If all you do is tell it how many breaks you want, R treats it as a
“suggestion" not as a demand. It assumes you want “approximately 3" breaks, but if it doesn't think
that this would look very pretty on screen, it picks a different (but similar) number. It does this for a
sensible reason - it tries to make sure that the breaks are located at sensible values (like 10) rather
than stupid ones (like 7.224414). And most of the time R is right: usually, when a human researcher
says “give me 3 breaks", he or she really does mean “give me approximately 3 breaks, and don't put
them in stupid places". However, sometimes R is dead wrong. Sometimes you really do mean “exactly
3 breaks", and you know precisely where you want them to go. So you need to invoke “real person
privilege", and order R to do what it's bloody well told. In order to do that, you have to input the full
vector that tells R exactly where you want the breaks. If you do that, R will go back to behaving like
the nice little obedient calculator that it's supposed to be.
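For example, to force the breaks to fall at exact multiples of 20 points, we could build the full vector of break locations with seq(); a small sketch (seq() here returns c(0, 20, 40, ..., 120), which spans the afl.margins data):

> hist( x = afl.margins, breaks = seq(from = 0, to = 120, by = 20) )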
Fig. 5.9 Four different histograms of the afl.margins variable: (a) the default histogram that R
produces, (b) a histogram with too few bins, (c) a histogram with too many bins, and (d) a “prettier"
histogram making use of various optional arguments to hist()

5.4.1 Visual Style of the Histogram


At this point we can draw a basic histogram, and we can alter the number and even the location of
the breaks. To improve the visual style of the histograms we can use some of the other arguments to
the hist() function.

• Shading lines: density, angle. You can add diagonal lines to shade the bars: the density value is a number indicating how many lines per inch R should draw (the default value of NULL means no lines), and the angle is a number indicating how many degrees from horizontal the lines should be drawn at (default is angle = 45 degrees).
• Specifics regarding colours: col, border. You can also change the colours: in this instance the col parameter sets the colour of the shading (either the shading lines if there are any, or else the colour of the interior of the bars if there are not), and the border argument sets the colour of the edges of the bars.
• Labelling the bars: labels. You can also attach labels to each of the bars using the labels argument. The simplest way to do this is to set labels = TRUE, in which case R will add a number just above each bar, that number being the exact number of observations in the bin. Alternatively, you can choose the labels yourself, by inputting a vector of strings, e.g., labels = c("label 1","label 2","etc").

Not surprisingly, this doesn't exhaust the possibilities. If you type help("hist") and have a look at the
help documentation for histograms, you'll see a few more options. A histogram that makes use of the
histogram-specific customisations as well as several of the options we discussed is shown in Fig. 5.9(d).
The R command that I used to draw it is this:
> hist( x = afl.margins,

+ main = "2010 AFL margins", # title of the plot

+ xlab = "Margin", # set the x-axis label

+ density = 10, # draw shading lines: 10 per inch

+ angle = 40, # set the angle of the shading lines to 40 degrees

+ border = "gray20", # set the colour of the borders of the bars

+ col = "gray80", # set the colour of the shading lines

+ labels = TRUE, # add frequency labels to each bar

+ ylim = c(0,40) # change the scale of the y-axis

+)

5.5 STEM AND LEAF PLOTS


Histograms are one of the most widely used methods for displaying the observed values for a variable.
They're simple, pretty, and very informative. However, they do take a little bit of effort to draw.
Sometimes it can be quite useful to make use of simpler, if less visually appealing, options. One such
alternative is the stem and leaf plot.

The AFL margins data contains 176 observations, which is at the upper end for what you can
realistically plot this way. The function in R for drawing stem and leaf plots is called stem() and if we
ask for a stem and leaf plot of the afl.margins data, here's what we get:

> stem( afl.margins )


The decimal point is 1 digit(s) to the right of the |
0 | 001111223333333344567788888999999
1 | 0000011122234456666899999
2 | 00011222333445566667788999999
3 | 01223555566666678888899
4 | 012334444477788899
5 | 00002233445556667
6 | 0113455678
7 | 01123556
8 | 122349
9 | 458
10 | 148
11 | 6
The values to the left of the | are called stems and the values to the right are called leaves. If you just
look at the shape that the leaves make, you can see something that looks a lot like a histogram made
out of numbers, just rotated by 90 degrees. But if you know how to read the plot, there's quite a lot
of additional information here. In fact, it's also giving you the actual values of all of the observations
in the data set. To illustrate, let's have a look at the last line in the stem and leaf plot, namely 11 | 6.

Specifically, let's compare this to the largest values of the afl.margins data set:

> max( afl.margins )

[1] 116

11 | 6 versus 116. Obviously the stem and leaf plot is trying to tell us that the largest value in the data set is 116. Similarly, when we look at the line that reads 10 | 148, the way we interpret it is to note that the stem and leaf plot is telling us that the data set contains observations with values 101, 104 and 108. Finally, when we see something like 5 | 00002233445556667, the four 0s in the stem and leaf plot are telling us that there are four observations with value 50.

Some customisation options are available for stem and leaf plots in R. The two arguments that you can use to do this are:

• scale. Changing the scale of the plot (default value is 1) is analogous to changing the number of breaks in a histogram. Reducing the scale causes R to reduce the number of stem values (i.e., the number of breaks, if this were a histogram) that the plot uses.
• width. The second way to customise a stem and leaf plot is to alter the width (default value is 80). Changing the width alters the maximum number of leaf values that can be displayed for any given stem.
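For instance, halving the scale collapses neighbouring stems together and gives a coarser plot; a quick sketch on the same data:

> stem( afl.margins, scale = 0.5 ) # roughly half as many stems as the default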

The only other thing to note about stem and leaf plots is the line in which R tells you where the decimal
point is. If our data set had included only the numbers .11, .15, .23, .35 and .59 and we'd drawn a stem
and leaf plot of these data, then R would move the decimal point: the stem values would be 1,2,3,4
and 5, but R would tell you that the decimal point has moved to the left of the | symbol. If you want
to see this in action, try the following command:

> stem( x = afl.margins / 1000 )

The stem and leaf plot itself will look identical to the original one we drew, except for the fact that R
will tell you that the decimal point has moved.

5.6 BOXPLOTS
Another alternative to histograms is a boxplot, sometimes called a “box and whiskers" plot. Like
histograms, they're most suited to interval or ratio scale data. The idea behind a boxplot is to provide
a simple visual depiction of the median, the interquartile range, and the range of the data. And
because they do so in a fairly compact way, boxplots have become a very popular statistical graphic,
especially during the exploratory stage of data analysis. Let's have a look at how they work, again using
the afl.margins data as our example.

Firstly, let's actually calculate these numbers ourselves using the summary() function:

> summary( afl.margins )

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.00 12.75 30.50 35.30 50.50 116.00


The function in R is boxplot(). As always there's a lot of optional arguments that you can specify if you
want, but for the most part you can just let R choose the defaults for you. Let's try the following
command:

> boxplot( x = afl.margins, range = 100 )

What R draws is shown in Fig. 5.10(a), the most basic boxplot possible. When you look at this plot, this
is how you should interpret it: the thick line in the middle of the box is the median; the box itself spans
the range from the 25th percentile to the 75th percentile; and the “whiskers" cover the full range from
the minimum value to the maximum value. This is summarised in the annotated plot in Fig. 5.10(b).

Fig. 5.10 A basic boxplot (panel a), plus the same plot with annotations added to explain what aspect
of the data set each part of the boxplot corresponds to (panel b)

In most applications, the “whiskers" don't cover the full range from minimum to maximum. Instead,
they actually go out to the most extreme data point that doesn't exceed a certain bound. By default,
this value is 1.5 times the interquartile range, corresponding to a range value of 1.5. Any observation
whose value falls outside this range is plotted as a circle instead of being covered by the whiskers, and
is commonly referred to as an outlier. For our AFL margins data, there is one observation (a game with
a margin of 116 points) that falls outside this range.

As a consequence, the upper whisker is pulled back to the next largest observation (a value of 108),
and the observation at 116 is plotted as a circle. This is illustrated in Fig. 5.11(a). Since the default
value is range = 1.5, we can draw this plot using the simple command

> boxplot( afl.margins )


Figure 5.11

5.6.1 Visual Style of Boxplot


Boxplots in R are extremely customisable. In addition to the usual range of graphical parameters that you can tweak to make the plot look nice, you can also exercise nearly complete control over every element of the plot. Consider the boxplot in Fig. 5.11(b): in this version of the plot, we've added labels (xlab, ylab), removed the border (frame.plot), and dimmed all of the graphical elements of the boxplot except the central bar that plots the median (border), so as to draw more attention to the median rather than the rest of the boxplot.

Here, the cross-bars at the top and bottom of the whiskers (known as the "staples" of the plot) are deleted, and the whiskers themselves are converted to solid lines. The arguments used to do this are staplewex and whisklty. Here's the actual command used to draw this plot:

> boxplot( x = afl.margins, # the data


+ xlab = "AFL games, 2010", # x-axis label
+ ylab = "Winning Margin", # y-axis label
+ border = "grey50", # dim the border of the box
+ frame.plot = FALSE, # don't draw a frame
+ staplewex = 0, # don't draw staples
+ whisklty = 1 # solid line for whisker
+)
Overall, the resulting boxplot is a huge improvement in visual design over the default version.

Okay, what commands can we use to customise the boxplot? If you type ?boxplot and flick through
the help documentation, you'll notice that it does mention staplewex as an argument, but there's no
mention of whisklty. The reason for this is that the function that handles the drawing is called bxp(),
so if you type ?bxp all the gory details appear.

The first part of the argument name specifies one part of the box plot: staple refers to the staples of
the plot (i.e., the cross-bars), and whisk refers to the whiskers. The second part of the name specifies
a graphical parameter: wex is a width parameter, and lty is a line type parameter. The parts of the plot
you can customise are:

• box. The box that covers the interquartile range.
• med. The line used to show the median.
• whisk. The vertical lines used to draw the whiskers.
• staple. The cross bars at the ends of the whiskers.
• out. The points used to show the outliers.

The actual graphical parameters that you might want to specify are slightly different for each visual
element, just because they're different shapes from each other. As a consequence, the following
options are available:

• Width expansion: boxwex, staplewex, outwex. These are scaling factors that govern the width of various parts of the plot. The default scaling factor is (usually) 0.8 for the box, and 0.5 for the other two. Note that in the case of the outliers this parameter is meaningless unless you decide to draw lines plotting the outliers rather than use points.
• Line type: boxlty, medlty, whisklty, staplelty, outlty. These govern the line type for the relevant elements. The values for this are exactly the same as those used for the regular lty parameter, with two exceptions. There's an additional option where you can set medlty = "blank" to suppress the median line completely (useful if you want to draw a point for the median rather than plot a line). Similarly, by default the outlier line type is set to outlty = "blank", because the default behaviour is to draw outliers as points instead of lines.
• Line width: boxlwd, medlwd, whisklwd, staplelwd, outlwd. These govern the line widths for the relevant elements, and behave the same way as the regular lwd parameter. The only thing to note is that the default medlwd value is three times the value of the others.
• Line colour: boxcol, medcol, whiskcol, staplecol, outcol. These govern the colour of the lines used to draw the relevant elements. Specify a colour in the same way that you usually do.
• Fill colour: boxfill. What colour should we use to fill the box?
• Point character: medpch, outpch. These behave like the regular pch parameter used to select the plot character. Note that you can set outpch = NA to stop R from plotting the outliers at all, and you can also set medpch = NA to stop it from drawing a character for the median (this is the default!)
• Point expansion: medcex, outcex. Size parameters for the points used to plot medians and outliers. These are only meaningful if the corresponding points are actually plotted. So for the default boxplot, which includes outlier points but uses a line rather than a point to draw the median, only the outcex parameter is meaningful.
• Background colours: medbg, outbg. Again, the background colours are only meaningful if the points are actually plotted.

Taken as a group, these parameters give you almost complete freedom to select the graphical style for your boxplot that you feel is most appropriate to the data set you're trying to describe. Following are a few other arguments that you might want to make use of:

• horizontal. Set this to TRUE to display the plot horizontally rather than vertically. (Fig. 5.12 uses the horizontal argument to draw a boxplot sideways in order to save space.)
• varwidth. Set this to TRUE to get R to scale the width of each box so that the areas are proportional to the number of observations that contribute to the boxplot. This is only useful if you're drawing multiple boxplots at once.
• show.names. Set this to TRUE to get R to attach labels to the boxplots.
• notch. If you set notch = TRUE, R will draw little notches in the sides of each box. If the notches of two boxplots don't overlap, then there is a "statistically significant" difference between the corresponding medians.

5.6.2 Using Box Plots to Detect Outliers


Sometimes it's convenient to have the boxplot automatically label the outliers. The original boxplot()
function doesn't allow you to do this; however, the Boxplot() function in the car package does. The
design of the Boxplot() function is very similar to boxplot(). It just adds a few new arguments that allow
you to tweak the labelling scheme.
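A minimal sketch (assuming the car package is installed); by default, Boxplot() labels any outlying points, which makes them easy to track down in the data:

> library( car )
> Boxplot( afl.margins ) # outliers are labelled so we can identify them in the data set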

5.6.3 Drawing Multiple Boxplots


One last thing. What if you want to draw multiple boxplots at once? Suppose, for instance, we wanted
separate boxplots showing the AFL margins not just for 2010, but for every year between 1987 and
2010. To do that, the first thing we'll have to do is find the data. These are stored in the aflsmall2.Rdata
file. So let's load it and take a quick peek at what's inside:

> load( "aflsmall2.Rdata" )

> who( TRUE )

-- Name -- -- Class -- -- Size --

afl2 data.frame 4296 x 2

$margin numeric 4296

$year numeric 4296

Notice that the afl2 data frame is pretty big. It contains 4296 games, which is far more than we want to see printed out on our computer screen. To that end, R provides you with a few useful functions to print out only a few of the rows in the data frame. The first of these is head(), which prints out the first 6 rows of the data frame, like this:

> head( afl2 )
  margin year
1     33 1987
2     59 1987
3     45 1987
4     91 1987
5     39 1987
6     61 1987

You can also use the tail() function to print out the last 6 rows. The car package also provides a handy
little function called some() which prints out a random subset of the rows.
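For completeness, here's how those two look in practice; the output is omitted since, like head(), both simply print rows of the data frame:

> tail( afl2 ) # print the last 6 rows
> library( car )
> some( afl2 ) # print a random subset of the rows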

In any case, the important thing is that we have the afl2 data frame which contains the variables that
we're interested in. What we want to do is have R draw boxplots for the margin variable, plotted
separately for each separate year. The way to do this using the boxplot() function is to input a formula
rather than a variable as the input. In this case, the formula we want is margin ~ year. So our boxplot
command now looks like this:

> boxplot( formula = margin ~ year,

+ data = afl2

+)

The result is shown in Fig. 5.13. The default boxplot leaves a great deal to be desired in terms of visual
clarity. The outliers are too visually prominent, the dotted lines look messy, and the interesting
content (i.e., the behaviour of the median and the interquartile range across years) gets a little
obscured.

Fig. 5.13 Default plot created by R, with no annotations added and no changes to the visual design
See Fig. 5.14; the command below, which was used to produce it, is long but not complicated:

> boxplot( formula = margin ~ year, # the formula


+ data = afl2, # the data set
+ xlab = "AFL season", # x axis label
+ ylab = "Winning Margin", # y axis label
+ frame.plot = FALSE, # don't draw a frame
+ staplewex = 0, # don't draw staples
+ staplecol = "white", # (fixes a tiny display issue)
+ boxwex = .75, # narrow the boxes slightly
+ boxfill = "grey80", # lightly shade the boxes
+ whisklty = 1, # solid line for whiskers
+ whiskcol = "grey70", # dim the whiskers
+ boxcol = "grey70", # dim the box borders
+ outcol = "grey70", # dim the outliers
+ outpch = 20, # outliers as solid dots
+ outcex = .5, # shrink the outliers
+ medlty = "blank", # no line for the medians
+ medpch = 20, # instead, draw solid dots
+ medlwd = 1.5 # make them larger
+)

Figure 5.14
5.7 SCATTERPLOTS
Scatterplots are a simple but effective tool for visualising data. We've already seen scatterplots in this
unit, when using the plot() function to draw the Fibonacci variable as a collection of dots.

Fig. 5.15 Two different scatterplots: (a) the default scatterplot that R produces, (b) one that makes
use of several options for fancier display

Instead of just plotting one variable, what we want to do with our scatterplot is display the relationship between two variables. In this kind of plot, each observation corresponds to one dot: the horizontal location of the dot plots the value of the observation on one variable, and the vertical location displays its value on the other variable. In many situations you don't really have a clear opinion about what the causal relationship is (e.g., does A cause B, or does B cause A, or does some other variable C control both A and B). If that's the case, it doesn't really matter which variable you plot on the x-axis and which one you plot on the y-axis. However, in many situations you do have a pretty strong idea about which variable you think is most likely to be causal, or at least you have some suspicions in that direction. If so, then it's conventional to plot the cause variable on the x-axis, and the effect variable on the y-axis. With that in mind, let's look at how to draw scatterplots in R.

Suppose the goal is to draw a scatterplot displaying the relationship between the amount of sleep a person gets (dan.sleep) and how grumpy they are the next day (dan.grump). As you might expect given our earlier use of plot() to display the Fibonacci data, the function that we use is the plot() function, but because it's a generic function all the hard work is still being done by the plot.default() function. In any case, there are two different ways in which we can get the plot that we're after. The first way is to specify the name of the variable to be plotted on the x-axis and the variable to be plotted on the y-axis. When we do it this way, the command looks like this:
> plot( x = parenthood$dan.sleep, # data on the x-axis
+ y = parenthood$dan.grump # data on the y-axis
+)
In Fig. 5.15 (a), R has selected the scales so that the data fall neatly in the middle. But, in this case, we happen to know that the grumpiness measure falls on a scale from 0 to 100, and the hours slept fall on a natural scale between 0 hours and about 12 or so hours. So the command used to draw Fig. 5.15 (b) is:

> plot( x = parenthood$dan.sleep, # data on the x-axis


+ y = parenthood$dan.grump, # data on the y-axis
+ xlab = "My sleep (hours)", # x-axis label
+ ylab = "My grumpiness (0-100)", # y-axis label
+ xlim = c(0,12), # scale the x-axis
+ ylim = c(0,100), # scale the y-axis
+ pch = 20, # change the plot type
+ col = "gray50", # dim the dots slightly
+ frame.plot = FALSE # don't draw a box
+)
This command produces the scatterplot in Fig. 5.15(b), or at least very nearly. What it doesn't do is draw the line through the middle of the points. Sometimes it can be very useful to do this, and we can do so using lines(), which is a low-level plotting function. Better yet, the arguments that we need to specify are pretty much the exact same ones that we use when calling the plot() function. That is, suppose we want to draw a line that goes from the point (4,93) to the point (9.5,37). Then the x locations can be specified by the vector c(4,9.5) and the y locations correspond to the vector c(93,37). In other words, we use this command:

> lines( x = c(4,9.5), # the horizontal locations


+ y = c(93,37), # the vertical locations
+ lwd = 2 # line width
+)
In most realistic data analysis situations, you absolutely don't want to just guess where the line
through the points goes, since there's about a billion different ways in which you can get R to do a
better job. However, it does at least illustrate the basic idea.
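One standard base-R approach, sketched below, is to fit a linear regression with lm() and let abline() add the fitted line to the scatterplot currently on the graphics device, rather than guessing the endpoints ourselves:

> model <- lm( dan.grump ~ dan.sleep, data = parenthood ) # fit a simple regression
> abline( model, lwd = 2 ) # draw the fitted line over the existing scatterplot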

One possibility, if you do want to get R to draw nice clean lines through the data for you, is to use the
scatterplot() function in the car package. Before we can use scatterplot(), we need to load the package:

> library( car )

Having done so, we can now use the function. The command we need is this one:
> scatterplot( dan.grump ~ dan.sleep,
+ data = parenthood,
+ smooth = FALSE
+)
The first two arguments should be familiar: the first input is a formula (dan.grump ~ dan.sleep) telling R what variables to plot, and the second specifies a data frame. The third argument, smooth, is set to FALSE to stop the scatterplot() function from drawing a fancy "smoothed" trendline (since it's a bit confusing to beginners). The scatterplot itself is shown in Fig. 5.16. As you can see, it has not only drawn the scatterplot, but also boxplots for each of the two variables, as well as a simple line of best fit showing the relationship between the two variables.

Fig. 5.16 A fancy scatterplot drawn using the scatterplot() function in the car package

5.7.1 More Elaborate Options


Often you find yourself wanting to look at the relationships between several variables at once. One
useful tool for doing so is to produce a scatterplot matrix, analogous to the correlation matrix.

> cor( x = parenthood ) # calculate correlation matrix


dan.sleep baby.sleep dan.grump day
dan.sleep 1.00000000 0.62794934 -0.90338404 -0.09840768
baby.sleep 0.62794934 1.00000000 -0.56596373 -0.01043394
dan.grump -0.90338404 -0.56596373 1.00000000 0.07647926
day -0.09840768 -0.01043394 0.07647926 1.00000000
We can get the corresponding scatterplot matrix by using the pairs() function:

> pairs( x = parenthood ) # draw corresponding scatterplot matrix

The output of the pairs() command is shown in Fig. 5.17. An alternative way of calling the pairs()
function, which can be useful in some situations, is to specify the variables to include using a one-
sided formula. For instance, this

> pairs( formula = ~ dan.sleep + baby.sleep + dan.grump,


+ data = parenthood
+)
would produce a 3x3 scatterplot matrix that only compares dan.sleep, dan.grump and baby.sleep. Obviously, the first version is much easier, but there are cases where you really only want to look at a few of the variables, so it's nice to use the formula interface.

5.8 BAR GRAPHS


Another form of graph that you often want to plot is the bar graph. The main function that you can use in R to draw them is the barplot() function. What we want to do is draw a bar graph that displays the number of finals that each team has played in over the time spanned by the afl data set.

So, let's start by creating a vector that contains this information. We use the tabulate() function to do
this, since it creates a simple numeric vector:

> freq <- tabulate( afl.finalists )


> print( freq )
[1] 26 25 26 28 32 0 6 39 27 28 28 17 6 24 26 39 24
This isn't exactly the prettiest of frequency tables, of course. Still, it lets us see the barplot() function in its "purest" form, where the input is just an ordinary numeric vector. We also need the team names to create some labels, so let's create a variable with those. We can do this using the levels() function, which outputs the names of all the levels of a factor:

> teams <- levels( afl.finalists )


> print( teams )
[1] "Adelaide" "Brisbane" "Carlton" "Collingwood"
[5] "Essendon" "Fitzroy" "Fremantle" "Geelong"
[9] "Hawthorn" "Melbourne" "North Melbourne" "Port Adelaide"
[13] "Richmond" "St Kilda" "Sydney" "West Coast"
[17] "Western Bulldogs"
Fig. 5.17 A matrix of scatterplots produced using pairs()
Fig. 5.18 Four bargraphs. Panel a shows the simplest version of a bargraph, containing the data but
no labels. In panel b, we've added the labels, but because the text runs horizontally R only includes a
few of them. In panel c, we've rotated the labels, but now the text is too long to fit. Finally, in panel
d, we fix this by expanding the margin at the bottom, and add several other customisations to make
the chart a bit nicer

Now that we have the information we need, let's draw our bar graph. The main argument that you
need to specify for a bar graph is the height of the bars, which in our case correspond to the values
stored in the freq variable:

> barplot( height = freq ) # specifying the argument name (panel a)

> barplot( freq ) # the lazier version (panel a)

Either of these two commands will produce the simple bar graph shown in Fig. 5.18 (a). As you can
see, R has drawn a pretty minimal plot. It doesn't have any labels, obviously, because we didn't actually
tell the barplot() function what the labels are! To do this, we need to specify the names.arg argument.
The names.arg argument needs to be a vector of character strings containing the text that needs to
be used as the label for each of the items. In this case, the teams vector is exactly what we need, so
the command we're looking for is:

> barplot( height = freq, names.arg = teams ) # panel b

This is an improvement, but not much of an improvement. R has only included a few of the labels,
because it can't fit them in the plot. This is the same behaviour we saw earlier with the multiple-
boxplot graph in Fig. 5.13. However, in Fig. 5.13, it wasn't an issue: it's pretty obvious from inspection
that the two unlabelled plots in between 1987 and 1990 must correspond to the data from 1988 and
1989. However, the fact that barplot() has omitted the names of every team in between Adelaide and
Fitzroy is a lot more problematic.

The simplest way to fix this is to rotate the labels, so that the text runs vertically rather than horizontally.
To do this, we need to set the las parameter, which tells R to rotate the text so that it's always
perpendicular to the axes (i.e., we set las = 2). Use the following command:

> barplot( height = freq, # the frequencies


+ names.arg = teams, # the labels
+ las = 2 # rotate the labels
+ ) # (see panel c)
The result is the bar graph shown in Fig. 5.18(c).

5.8.1 Changing Global Settings Using par()


Altering the margins of the plot is actually a somewhat more complicated exercise. In principle, it's a
very simple thing to do: the size of the margins is governed by a graphical parameter called mar, so all
we need to do is alter this parameter. First, let's look at what the mar argument specifies. The mar
argument is a vector containing four numbers, specifying the amount of space at the bottom, the left,
the top and the right, in that order. The units are "number of lines". The default value for mar is c(5.1,
4.1, 4.1, 2.1), meaning that R leaves 5.1 "lines" empty at the bottom, 4.1 lines on the left and at the
top, and only 2.1 lines on the right.

In order to make more room at the bottom, what we need to do is change the first of these numbers.
A value of 10.1 should do the trick.

So far this doesn't seem any different to the other graphical parameters that we've talked about.
However, because of the way that the traditional graphics system in R works, you need to specify what
the margins will be before calling your high-level plotting function. Unlike the other cases we've seen,
you can't treat mar as if it were just another argument in your plotting function. Instead, you have to
use the par() function to change the graphical parameters beforehand, and only then try to draw your
figure. In other words, the first thing we would do is this:

> par( mar = c( 10.1, 4.1, 4.1, 2.1) )

There's no visible output here, but behind the scenes, R has changed the graphical parameters
associated with the current device (remember, in R terminology all graphics are drawn onto a
“device"). Now that this is done, we could use the exact same command as before, but this time you'd
see that the labels all fit, because R now leaves twice as much room for the labels at the bottom.
Consider the command below:
> barplot( height = freq, # the data to plot
+ names.arg = teams, # label the plots
+ las = 2, # rotate the labels
+ ylab = "Number of Finals", # y axis label
+ main = "Finals Played, 1987-2010", # figure title
+ density = 10, # shade the bars
+ angle = 20 # shading lines angle
+)
However, one thing to remember about the par() function is that it doesn't just change the graphical
parameters for the current plot. Rather, the changes pertain to any subsequent plot that you draw
onto the same device. This might be exactly what you want, in which case there's no problem. But if
not, you need to reset the graphical parameters to their original settings. To do this, you can either
close the device (e.g., close the window, or click the "Clear All" button in the Plots panel in RStudio) or
you can reset the graphical parameters to their original values, using a command like this:

> par( mar = c(5.1, 4.1, 4.1, 2.1) )
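A common idiom, worth knowing as a small sketch: when you change a graphical parameter, par() invisibly returns the previous values, so you can store them and restore them afterwards without retyping the defaults.

> old.par <- par( mar = c(10.1, 4.1, 4.1, 2.1) ) # save the old settings
> barplot( height = freq, names.arg = teams, las = 2 )
> par( old.par ) # restore the old settings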

5.9 SAVING IMAGE FILES USING R AND RSTUDIO


If you're running R through RStudio, then the easiest way to save your image is to click on the "Export"
button in the Plots panel (i.e., the area in RStudio where all the plots have been appearing). When you
do that, you'll see a menu that contains the options "Save Plot as PDF" and "Save Plot as Image". Either
version works: both bring up dialog boxes that give you a few options to play with, but beyond that
it's pretty simple.
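If you're not using RStudio, or you want a script to save the figure automatically, you can draw straight to a file device instead. A minimal sketch, using a hypothetical file name: the pdf() function opens a PDF file as the current device, you draw as usual, and dev.off() closes the device and writes the file.

> pdf( file = "finals.pdf", width = 8, height = 6 ) # open a PDF device
> barplot( height = freq, names.arg = teams, las = 2 ) # draw as usual
> dev.off() # close the device; the file is written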

Check your Progress 1


Fill in the Blanks.

1. The _____ type parameter describes the kind of line that R draws.
2. ________ is a logical-valued argument that we can use if we don't want R to include any text
for a title, subtitle or axis label.

Activity 1
1. Try the following two commands to see what happens:

> stem( x = afl.margins, scale = .25 )


> stem( x = afl.margins, width = 20 )
2. Take any statistical data set, then visualise and customise it using the different graphs and
options discussed in this unit.

Summary
 R has quite a number of different packages, each of which provide a collection of high-level
graphics commands. A single high-level command is capable of drawing an entire graph,
complete with a range of customisation options.
 In this unit, we have discussed the standard graphs including histograms, stem and leaf plots,
boxplots, scatterplots and bar graphs that statisticians like to produce.
 Histograms are one of the simplest and most useful ways of visualising data. They make most
sense when you have an interval or ratio scale and what you want to do is get an overall
impression of the data.
 Boxplots provide a simple visual depiction of the median, the interquartile range, and the range
of the data.
 Scatterplots are a simple but effective tool for visualising data.

Keywords
 Plot: It refers to a graphical representation of data, drawn onto a graphics device.
 Graph: It is a visual display of data, such as a histogram, boxplot, scatterplot or bar graph.

Self-Assessment Questions
1. Write a short note on:
a. Histograms
b. Bar graphs
c. Scatterplots
2. What are the different visual styles of a plot?
3. List and explain different font style options.

Answers to Check your Progress


Check your Progress 1

Fill in the Blanks.

1. The line type parameter describes the kind of line that R draws.
2. ann is a logical-valued argument that we can use if we don't want R to include any text for a
title, subtitle, or axis label.

Suggested Reading
1. The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff.
2. https://kingaa.github.io/R_Tutorial/#data-structures-in-r.
3. https://d1b10bmlvqabco.cloudfront.net/attach/ighbo26t3ua52t/igp9099yy4v10/igz7vp4w5
su9/OReilly_HandsOn_Programming_with_R_2014.pdf.
4. The R Book by Michael J. Crawley, Imperial College London at Silwood Park, UK.
5. R Programming for Data Science by Roger D. Peng.
6. An Introduction to R by Longhow Lam.
7. Learning Statistics with R by Danielle Navarro.
Unit 6
String Manipulation and Input/Output
Structure:
6.1 Introduction
6.2 Shortening a String
6.3 Pasting Strings Together
6.4 Splitting Strings
6.5 Making Simple Conversions
6.6 Applying Logical Operations to Text
6.7 Concatenating and Printing with cat()
6.8 Using Escape Characters in Text
6.9 Matching and Substituting Text
6.10 Regular Expressions
6.11 Input and Output
6.11.1 Accessing the Keyboard and Monitor
6.11.2 Reading and Writing Files
Summary

Keywords

Self-Assessment Questions

Answers to Check your Progress

Suggested Reading

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) as requested by the work’s creator or
licensees. This license is available at https://creativecommons.org/licenses/by-nc-sa/4.0/ .
Objectives
After going through this unit, you will be able to:

• Perform various operations on strings


• Discuss different functions used for reading and writing files

6.1 INTRODUCTION
Sometimes your data set is quite text heavy. This can be for a lot of different reasons. Maybe the raw
data are actually taken from text sources (e.g., newspaper articles), or maybe your data set contains
a lot of free responses to survey questions, in which people can write whatever text they like in
response to some query. Or maybe you just need to rejig some of the text used to describe nominal
scale variables. Regardless of what the reason is, you’ll probably want to know a little bit about how
to handle text in R. R provides a lot of additional tools that are quite specific to text. In this unit we
discuss only those tools that come as part of the base packages, but there are other possibilities out
there: the stringr package provides a powerful alternative that is a lot more coherent than the basic
tools, and is well worth looking into.

6.2 SHORTENING A STRING


The first task is to shorten a character string. For example, suppose that we have a vector that contains
the names of several different animals:
> animals <- c( "cat", "dog", "kangaroo", "whale" )
It might be useful in some contexts to extract the first three letters of each word. This is often useful
when annotating figures, or when creating variable labels: it’s often very inconvenient to use the full
name, so you want to shorten it to a short code for space reasons. The strtrim() function can be used
for this purpose. It has two arguments: x is a vector containing the text to be shortened and width
specifies the number of characters to keep. When applied to the animals data, here’s what we get:
> strtrim( x = animals, width = 3 )
[1] "cat" "dog" "kan" "wha"
Note that, the only thing that strtrim() does is chop off excess characters at the end of a string. It
doesn’t insert any whitespace characters to fill them out if the original string is shorter than the width
argument. For example, if we trim the animals data to 4 characters, here’s what we get:
> strtrim( x = animals, width = 4 )
[1] "cat" "dog" "kang" "whal"
The "cat" and "dog" strings still only use 3 characters. Okay, but what if you don’t want to start from
the first letter? Suppose, for instance, you only wanted to keep the second and third letter of each
word. That doesn’t happen quite as often, but there are some situations where you need to do
something like that. If that does happen, then the function you need is substr(), in which you specify
a start point and a stop point instead of specifying the width. For instance, to keep only the 2nd and
3rd letters of the various animals, we can do the following:
> substr( x = animals, start = 2, stop = 3 )
[1] "at" "og" "an" "ha"

6.3 PASTING STRINGS TOGETHER


Much more commonly, you will need either to glue several character strings together or to pull them
apart. To glue several strings together, the paste() function is very useful. There are three arguments
to the paste() function:
• ... As usual, the dots “match” up against any number of inputs. In this case, the inputs should
be the various different strings you want to paste together.
• sep. This argument should be a string, indicating what characters R should use as separators,
in order to keep each of the original strings separate from each other in the pasted output. By
default the value is a single space, sep = " ". This is made a little clearer when we look at the
examples.
• collapse. This is an argument indicating whether the paste() function should interpret vector
inputs as things to be collapsed, or whether a vector of inputs should be converted into a
vector of outputs. The default value is collapse = NULL which is interpreted as meaning that
vectors should not be collapsed. If you want to collapse vectors into a single string, then you
should specify a value for collapse. Specifically, the value of collapse should correspond to the
separator character that you want to use for the collapsed inputs. Again, see the examples
below for more details.
That probably doesn’t make much sense yet, so let’s start with a simple example. First, let’s try to
paste two words together, like this:
> paste( "hello", "world" )
[1] "hello world"
Notice that R has inserted a space between the "hello" and "world". Suppose that’s not what you
wanted. Instead, you might want to use . as the separator character, or to use no separator at all. To
do either of those, we would need to specify sep = "." or sep = "".
For instance:
> paste( "hello", "world", sep = "." )
[1] "hello.world"
Now let’s consider a slightly more complicated example. Suppose we have two vectors that we want
to paste() together. Let’s say something like this:
> hw <- c( "hello", "world" )
> ng <- c( "nasty", "government" )
And suppose we want to paste these together. However, if you think about it, this statement is kind
of ambiguous. It could mean that you want to do an “element wise” paste, in which all of the first
elements get pasted together ("hello nasty") and all the second elements get pasted together ("world
government"). Or, alternatively, you might intend to collapse everything into one big string ("hello
nasty world government"). By default, the paste() function assumes that you want to do an element-
wise paste:
> paste( hw, ng )
[1] "hello nasty" "world government"
However, there’s nothing stopping you from overriding this default. All you have to do is specify a
value for the collapse argument, and R will chuck everything into one dirty big string. To give you a
sense of exactly how this works, what I’ll do in this next example is specify different values for sep and
collapse:
> paste( hw, ng, sep = ".", collapse = ":::")
[1] "hello.nasty:::world.government"

6.4 SPLITTING STRINGS


At other times you have the opposite problem to the one in the last section: you have a whole lot of
text bundled together into a single string that needs to be pulled apart and stored as several different
variables. For instance, the data set that you get sent might include a single variable containing
someone’s full name, and you need to separate it into first names and last names. To do this in R, you
can use the strsplit() function, and for the sake of argument, let’s assume that the string you want to
split up is the following string:
> monkey <- "It was the best of times. It was the blurst of times."
To use the strsplit() function to break this apart, there are three arguments that you need to pay
particular attention to:
• x. A vector of character strings containing the data that you want to split.
• split. Depending on the value of the fixed argument, this is either a fixed string that specifies
a delimiter, or a regular expression that matches against one or more possible delimiters. If
you don’t know what regular expressions are, don’t use this option. Just specify a separator
string, just like you would for the paste() function.
• fixed. Set fixed = TRUE if you want to use a fixed delimiter. As noted above, unless you
understand regular expressions this is definitely what you want. However, the default value is
fixed = FALSE, so you have to set it explicitly.
Let’s look at a simple example:
> monkey.1 <- strsplit( x = monkey, split = " ", fixed = TRUE )
> monkey.1
[[1]]
[1] "It" "was" "the" "best" "of" "times." "It" "was"
[9] "the" "blurst" "of" "times."
One thing to note in passing is that the output here is a list (you can tell from the [[1]] part of the
output), whose first and only element is a character vector. This is useful in a lot of ways, since it
means that you can input a character vector for x and then have the strsplit() function split all of
them, but it’s kind of annoying when you only have a single input. To that end, it’s useful to know that
you can unlist() the output:
> unlist( monkey.1 )
[1] "It" "was" "the" "best" "of" "times." "It" "was"
[9] "the" "blurst" "of" "times."
To understand why it’s important to remember to use the fixed = TRUE argument, suppose we wanted
to split this into two separate sentences. That is, we want to use split = "." as our delimiter string. As
long as we remember to tell R to treat this as a fixed separator character, then we get the right answer:
> strsplit( x = monkey, split = ".", fixed = TRUE )
[[1]]
[1] "It was the best of times" " It was the blurst of times"
However, if we don’t do this, then R will assume that when you typed split = "." you were trying to
construct a “regular expression”, and as it happens the character . has a special meaning within a
regular expression. As a consequence, if you forget to include the fixed = TRUE part, you won’t get the
answers you’re looking for.

6.5 MAKING SIMPLE CONVERSIONS


A slightly different task that comes up quite often is making transformations to text. A simple example
of this would be converting text to lower case or upper case, which you can do using the toupper()
and tolower() functions. Both of these functions have a single argument x which contains the text that
needs to be converted. An example of this is shown below:
> text <- c( "lIfe", "Impact" )
> tolower( x = text )
[1] "life" "impact"
A slightly more powerful way of doing text transformations is to use the chartr() function, which allows
you to specify a “character by character” substitution. This function contains three arguments, old,
new and x. As usual x specifies the text that needs to be transformed. The old and new arguments are
strings of the same length, and they specify how x is to be converted. Every instance of the first
character in old is converted to the first character in new and so on. For instance, suppose we wanted
to convert "albino" to "libido". To do this, we need to convert all of the "a" characters (all 1 of them)
in "albino" into "l" characters (i.e., a → l). Additionally, we need to make the substitutions l → i and n
→ d. To do so, we would use the following command:
> old.text <- "albino"
> chartr( old = "aln", new = "lid", x = old.text )
[1] "libido"
6.6 APPLYING LOGICAL OPERATIONS TO TEXT
We discussed a very basic text processing tool, namely the ability to use the equality operator == to
test to see if two strings are identical to each other. However, you can also use other logical operators
too. For instance R also allows you to use the < and > operators to determine which of two strings
comes first, alphabetically speaking. Sort of. Actually, it’s a bit more complicated than that, but let’s
start with a simple example:
> "cat" < "dog"
[1] TRUE
In this case, we see that "cat" does come before "dog" alphabetically, so R judges the statement to be
true. However, if we ask R to tell us if "cat" comes before "anteater",
> "cat" < "anteater"
[1] FALSE
It tells us that the statement is false. So far, so good. But text data is a bit more complicated than the
dictionary suggests. What about "cat" and "CAT"? Which of these comes first? Let’s try it and find out:
> "CAT" < "cat"
[1] TRUE

Table 6.1 The ordering of various text characters used by the < and > operators, as well as by the
sort() function. Not shown is the “space” character, which actually comes first on the list
In other words, R assumes that uppercase letters come before lowercase ones. Fair enough. No-one
is likely to be surprised by that. What you might find surprising is that R assumes that all uppercase
letters come before all lowercase ones. That is, while "anteater" < "zebra" is a true statement, and the
uppercase equivalent "ANTEATER" < "ZEBRA" is also true, it is not true to say that "anteater" <
"ZEBRA", as the following extract illustrates:
> "anteater" < "ZEBRA"
[1] FALSE
This may seem slightly counterintuitive. With that in mind, it may help to have a quick look at Table 6.1,
which lists various text characters in the order that R uses.

6.7 CONCATENATING AND PRINTING WITH cat()


The cat() function is a mixture of paste() and print(). That is, what it does is concatenate strings and
then print them out. In your own work, you can probably survive without it, since print() and paste()
will actually do what you need, but the cat() function is so widely used that it is worth knowing about. The basic idea behind cat() is
straightforward. Like paste(), it takes several arguments as inputs, which it converts to strings,
collapses (using a separator character specified using the sep argument), and prints on screen.
If you want, you can use the file argument to tell R to print the output into a file rather than on screen
(we won't do that here). However, it's important to note that the cat() function collapses vectors first,
and then concatenates them. That is, notice that when we use cat() to combine hw and ng, we get a
different result than if we used paste()
> cat( hw, ng )
hello world nasty government
> paste( hw, ng, collapse = " " )
[1] "hello nasty world government"
Notice the difference in the ordering of words. There are a few additional details about cat(). Firstly,
cat() really is a function for printing, and not for creating text strings to store for later. You can’t assign
the output to a variable, as the following example illustrates:
> x <- cat( hw, ng )
hello world nasty government
> x
NULL
Despite our attempt to store the output as a variable, cat() printed the results on screen anyway, and
it turns out that the variable created doesn’t contain anything at all. Secondly, the cat() function makes
use of a number of “special” characters. For instance, compare the behaviour of print() and cat() when
asked to print the string "hello\nworld":
> print( "hello\nworld" ) # print literally:
[1] "hello\nworld"
> cat( "hello\nworld" ) # interpret as newline
hello
world

6.8 USING ESCAPE CHARACTERS IN TEXT


The previous section brings us quite naturally to a fairly fundamental issue when dealing with strings,
namely the issue of delimiters and escape characters. Reduced to its most basic form, the problem we
have is that R commands are written using text characters, and our strings also consist of text
characters.
So, suppose we want to type in the word “hello”, and have R encode it as a string. If we were to just
type hello, R will think that we are referring to a variable or a function called hello rather than interpret
it as a string. The solution that R adopts is to require you to enclose your string in delimiter characters,
which can be either double quotes or single quotes. So, when we type "hello" or 'hello' then R knows
that it should treat the text in between the quote marks as a character string. However, this isn't a
complete solution to the problem: after all, " and ' are themselves perfectly legitimate text characters,
and so we might want to include those in our string as well. For instance, suppose we wanted to
encode the name "O'Rourke" as a string. It's not legitimate for us to type 'O'Rourke' because R does
not realise that "O'Rourke" is a single word. So it will interpret the 'O' part as a complete string, and
then will get confused when it reaches the Rourke' part. As a consequence, what you get is an error
message:
> 'O'Rourke'
Error: unexpected symbol in "'O'Rourke"
To some extent, R offers us a cheap fix to the problem because of the fact that it allows us to use
either " or ' as the delimiter character. Although 'O'Rourke' will make R cry, it is perfectly happy with
"O'Rourke":
> "O'Rourke"
[1] "O'Rourke"
This is a real advantage to having two different delimiter characters. Unfortunately, anyone with even
the slightest bit of deviousness to them can see the problem with this. Suppose you are reading a book
that contains the following passage,
P.J. O’Rourke says, “Yay, money!”. It’s a joke, but no-one laughs.
and you want to enter this as a string. Neither the ’ or " delimiters will solve the problem here, since
this string contains both a single quote character and a double quote character. To encode strings like
this one, we have to do something a little bit clever.
The solution to the problem is to designate an escape character, which in this case is \, the humble
backslash. The escape character is a bit of a sacrificial lamb: if you include a backslash character in
your string, R will not treat it as a literal character at all. It’s actually used as a way of inserting “special”
characters into your string. For instance, if you want to force R to insert actual quote marks into the
string, then what you actually type is \' or \" (these are called escape sequences). So, in order to
encode the string discussed earlier, here's a command we could use:
> PJ <- "P.J. O\'Rourke says, \"Yay, money!\". It\'s a joke, but no-one laughs."

Table 6.2 Standard escape characters that are evaluated by some text processing commands,
including cat(). Type ?Quotes for the corresponding R help file
Notice that we have included the backslashes for both the single quotes and double quotes. That’s
actually overkill: since we have used " as our delimiter, we only needed to do this for the double
quotes. Nevertheless, the command has worked, since we didn’t get an error message. Now let’s see
what happens when we print it out:
> print( PJ )
[1] "P.J. O’Rourke says, \"Yay, money!\". It’s a joke, but no-one laughs."
Why has R printed out the string using \"? For the exact same reason that we needed to insert the
backslash in the first place. That is, when R prints out the PJ string, it has enclosed it with delimiter
characters, and it wants to unambiguously show us which of the double quotes are delimiters and
which ones are actually part of the string. Fortunately, if this bugs you, you can make it go away by
using the print.noquote() function, which will just print out the literal string that you encoded in the
first place:
> print.noquote( PJ )
[1] P.J. O'Rourke says, "Yay, money!". It's a joke, but no-one laughs.
Typing cat(PJ) will produce a similar output.
Introducing the escape character solves a lot of problems, since it provides a mechanism by which we
can insert all sorts of characters that aren’t on the keyboard. For instance, as far as a computer is
concerned, “new line” is actually a text character. It’s the character that is printed whenever you hit
the “return” key on your keyboard. If you want to insert a new line character into your string, you can
actually do this by including the escape sequence \n. Or, if you want to insert a backslash character,
then you can use \\. A list of the standard escape sequences recognised by R is shown in Table 6.2. A
lot of these actually date back to the days of the typewriter (e.g., carriage return), so they might seem
a bit counterintuitive to people who've never used one. In order to get a sense for what the various
escape sequences do, we'll have to use the cat() function, because it actually interprets these
sequences rather than printing them in their escaped form:
> cat( "xxxx\boo" ) # \b is a backspace, so it deletes the preceding x
xxxoo
> cat( "xxxx\too" ) # \t is a tab, so it inserts a tab space
xxxx oo
> cat( "xxxx\noo" ) # \n is a newline character
xxxx
oo
> cat( "xxxx\roo" ) # \r returns you to the beginning of the line
ooxx
And that’s pretty much it. There are a few other escape sequence that R recognises, which you can
use to insert arbitrary ASCII or Unicode characters into your string (type ?Quotes for more details).

6.9 MATCHING AND SUBSTITUTING TEXT


Another task that we often want to solve is to find all strings that match a certain criterion, and possibly
even to make alterations to the text on that basis. There are several functions in R that allow you to
do this, three of which are grep(), gsub() and sub().
All three of these functions are intended to be used in conjunction with regular expressions, but you
can also use them in a simpler fashion, since they all allow you to set fixed = TRUE, which means we
can ignore all this regular expression business and just use simple text matching.
So, how do these functions work? Let’s start with the grep() function. The purpose of this function is
to input a vector of character strings x, and to extract all those strings that fit a certain pattern. In our
examples, we will assume that the pattern in question is a literal sequence of characters that the string
must contain (that’s what fixed = TRUE does). To illustrate this, let’s start with a simple data set, a
vector that contains the names of three beers. Something like this:
> beers <- c( "little creatures", "sierra nevada", "coopers pale" )
Next, let’s use grep() to find out which of these strings contains the substring "er". That is, the pattern
that we need to match is the fixed string "er", so the command we need to use is:
> grep( pattern = "er", x = beers, fixed = TRUE )
[1] 2 3
The output here is telling us that the second and third elements of beers both contain the substring
"er". Alternatively, however, we might prefer it if grep() returned the actual strings themselves. We
can do this by specifying value = TRUE in our function call. That is, we’d use a command like this:
> grep( pattern = "er", x = beers, fixed = TRUE, value = TRUE )
[1] "sierra nevada" "coopers pale"
The other two functions that we wanted to mention in this section are gsub() and sub(). These are
both similar in spirit to grep() insofar as what they do is search through the input strings (x) and find
all of the strings that match a pattern. However, what these two functions do is replace the pattern
with a replacement string. The gsub() function will replace all instances of the pattern, whereas the
sub() function just replaces the first instance of it in each string. To illustrate how this works, suppose
we want to replace all instances of the letter "a" with the string "BLAH". We can do this to the beers
data using the gsub() function:
> gsub( pattern = "a", replacement = "BLAH", x = beers, fixed = TRUE )
[1] "little creBLAHtures" "sierrBLAH nevBLAHdBLAH"
[3] "coopers pBLAHle"
Notice that all three of the "a"s in "sierra nevada" have been replaced. In contrast, let’s see what
happens when we use the exact same command, but this time using the sub() function instead:
> sub( pattern = "a", replacement = "BLAH", x = beers, fixed = TRUE )
[1] "little creBLAHtures" "sierrBLAH nevada" "coopers pBLAHle"
Only the first "a" is changed.

6.10 REGULAR EXPRESSIONS


There’s one last thing regarding text manipulation, and that’s the concept of a regular expression.
Throughout this section, we’ve often needed to specify fixed = TRUE in order to force R to treat some
of our strings as actual strings, rather than as regular expressions. So, before moving on, let's discuss
what regular expressions are. They're genuinely complicated, but they are extremely powerful tools
that are widely used by people who have to work with lots of text data (e.g., people who work with
natural language data), so it's handy to at least have a vague idea of what they are.
The basic idea is quite simple.
Suppose you want to extract all strings in your beers vector that contain a vowel followed immediately
by the letter "s". That is, you want to finds the beer names that contain either "as", "es", "is", "os" or
"us". One possibility would be to manually specify all of these possibilities and then match against
these as fixed strings one at a time, but that’s tedious. The alternative is to try to write out a single
“regular” expression that matches all of these. The regular expression that does this is "[aeiou]s", and
you can kind of see what the syntax is doing here. The bracketed expression means “any of the things
in the middle”, so the expression as a whole means “any of the things in the middle” (i.e. vowels)
followed by the letter "s". When applied to your beer names you get this:
> grep( pattern = "[aeiou]s", x = beers, value = TRUE )
[1] "little creatures"
So it turns out that only "little creatures" contains a vowel followed by the letter "s". But of course,
had the data contained a beer like "fosters", that would have matched as well, because it contains the
string "os". However, we deliberately chose not to include it because Fosters is not a proper beer. As
you can tell from this example, regular expressions are a neat tool for specifying patterns in text: in
this case, “vowel then s”. So they are definitely things worth knowing about if you ever find yourself
needing to work with a large body of text.
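One related base function worth knowing about, although we haven't used it above, is grepl(). It takes the same pattern and x arguments as grep(), but returns a logical vector indicating, for each element, whether the pattern matched:

> grepl( pattern = "[aeiou]s", x = beers )
[1]  TRUE FALSE FALSE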

6.11 INPUT AND OUTPUT


Input/Output (I/O) plays a central role in most real-world applications of computers. Just consider an
ATM cash machine, which uses multiple I/O operations for both input—reading your card and reading
your typed-in cash request—and output—printing instructions on the screen, printing your receipt,
and most important, controlling the machine to output your money! R features a highly versatile array
of I/O capabilities. We’ll start with the basics of access to the keyboard and monitor, and then go into
considerable detail on reading and writing files, including the navigation of file directories. Finally, we
discuss R’s facilities for accessing the Internet.

6.11.1 Accessing the Keyboard and Monitor


R provides several functions for accessing the keyboard and monitor. Here, we’ll look at the scan(),
readline(), print(), and cat() functions.
Using the scan() Function
You can use scan() to read in a vector, whether numeric or character, from a file or the keyboard. With
a little extra work, you can even read in data to form a list.
Suppose we have files named z1.txt, z2.txt, z3.txt, and z4.txt. The z1.txt file contains the following:
123
45
6
The z2.txt file contents are as follows:
123
4.2 5
6
The z3.txt file contains this:
abc
de f
g
And finally, the z4.txt file has these contents:
abc
123 6
y
Let’s see what we can do with these files using the scan() function.
> scan("z1.txt")
Read 4 items
[1] 123 4 5 6
> scan("z2.txt")
Read 4 items
[1] 123.0 4.2 5.0 6.0
> scan("z3.txt")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got 'abc'
> scan("z3.txt",what="")
Read 4 items
[1] "abc" "de" "f" "g"
> scan("z4.txt",what="")
Read 4 items
[1] "abc" "123" "6" "y"
In the first call, we got a vector of four integers (though the mode is numeric). The second time, since
one number was non-integral, the others were shown as floating-point numbers, too.
In the third case, we got an error. The scan() function has an optional argument named what, which
specifies mode, defaulting to double mode. So, the nonnumeric contents of the file z3 produced an
error. But we then tried again, with what="". This assigns a character string to what, indicating that
we want character mode. (We could have set what to any character string.)
The last call worked the same way. The first item was a character string, so it treated all the items that
followed as strings too. Of course, in typical usage, we would assign the return value of scan() to a
variable. Here’s an example:
> v <- scan("z1.txt")
By default, scan() assumes that the items of the vector are separated by whitespace, which includes
blanks, carriage return/line feeds, and horizontal tabs. You can use the optional sep argument for
other situations. As an example, we can set sep to the newline character to read in each line as a string,
as follows:
> x1 <- scan("z3.txt",what="")
Read 4 items
> x2 <- scan("z3.txt",what="",sep="\n")
Read 3 items
> x1
[1] "abc" "de" "f" "g"
> x2
[1] "abc" "de f" "g"
> x1[2]
[1] "de"
> x2[2]
[1] "de f"
In the first case, the strings "de" and "f" were assigned to separate elements of x1. But in the second
case, we specified that elements of x2 were to be delineated by end-of-line characters, not spaces.
Since "de" and "f" are on the same line, they are assigned together to x[2].
More sophisticated methods for reading files will be presented later in this unit, such as methods to
read in a file one line at a time. But if you want to read the entire file at once, scan() provides a quick
solution.
You can use scan() to read from the keyboard by specifying an empty string for the filename:
> v <- scan("")
1: 12 5 13
4: 3 4 5
7: 8
8:
Read 7 items
> v
[1] 12 5 13 3 4 5 8
Note that we are prompted with the index of the next item to be input, and we signal the end of input
with an empty line.
If you do not wish scan() to announce the number of items it has read, include the argument
quiet=TRUE.
Using the readline() Function
If you want to read in a single line from the keyboard, readline() is very handy.
> w <- readline()
abc de f
>w
[1] "abc de f"
Typically, readline() is called with its optional prompt, as follows:
> inits <- readline("type your initials: ")
type your initials: NM
> inits
[1] "NM"
Printing to the Screen
At the top level of interactive mode, you can print the value of a variable or expression by simply
typing the variable name or expression. This won’t work if you need to print from within the body of
a function. In that case, you can use the print() function, like this:
> x <- 1:3
> print(x^2)
[1] 1 4 9
Recall that print() is a generic function, so the actual function called will depend on the class of the
object that is printed. If, for example, the argument is of class "table", then the print.table() function
will be called.
It’s a little better to use cat() instead of print(), as the latter can print only one expression and its
output is numbered, which may be a nuisance.
Compare the results of the functions:
> print("abc")
[1] "abc"
> cat("abc\n")
abc
Note that we needed to supply our own end-of-line character, "\n", in the call to cat(). Without it, our
next call would continue to write to the same line.
The arguments to cat() will be printed out with intervening spaces:
> x
[1] 1 2 3
> cat(x,"abc","de\n")
1 2 3 abc de
If you don’t want the spaces, set sep to the empty string "", as follows:
> cat(x,"abc","de\n",sep="")
123abcde
Any string can be used for sep. Here, we use the newline character:
> cat(x,"abc","de\n",sep="\n")
1
2
3
abc
de
You can even set sep to be a vector of strings, like this:
> x <- c(5,12,13,8,88)
> cat(x,sep=c(".",".",".","\n","\n"))
5.12.13.8
88
6.11.2 Reading and Writing Files
Now that we’ve covered the basics of I/O, let’s get to some more practical applications of reading and
writing files. The following sections discuss reading data frames or matrices from files, working with
text files, accessing files on remote machines, and getting file and directory information.
Reading a Data Frame or Matrix from a File
The function read.table() is used to read in a data frame. As a quick review, suppose the file z looks
like this:
name age
John 25
Mary 28
Jim 19
The first line contains an optional header, specifying column names. We could read the file this way:
> z <- read.table("z",header=TRUE)
> z
name age
1 John 25
2 Mary 28
3 Jim 19
Note that scan() would not work here, because our file has a mixture of numeric and character data
(and a header).
There appears to be no direct way of reading in a matrix from a file, but it can be done easily with
other tools. A simple, quick way is to use scan() to read in the matrix row by row. You use the byrow
option in the function matrix() to indicate that you are defining the elements of the matrix in a row-
wise, rather than column-wise, manner.
For instance, say the file x contains a 5-by-3 matrix, stored row-wise:
1 0 1
1 1 1
1 1 0
1 1 0
0 0 1
We can read it into a matrix this way:
> x <- matrix(scan("x"),nrow=5,byrow=TRUE)
This is fine for quick, one-time operations, but for generality, you can use read.table(), which returns
a data frame, and then convert via as.matrix(). Here is a general method:
read.matrix <- function(filename) {
as.matrix(read.table(filename))
}
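Usage is then a one-liner; a small sketch, assuming the 5-by-3 file x shown above is in the working directory:
> m <- read.matrix( "x" ) # m is now a 5-by-3 numeric matrix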
Reading Text Files
In computer literature, there is often a distinction made between text files and binary files. That
distinction is somewhat misleading—every file is binary in the sense that it consists of 0s and 1s. Let’s
take the term text file to mean a file that consists mainly of ASCII characters or coding for some other
human language (such as GB for Chinese) and that uses newline characters to give humans the
perception of lines. The latter aspect will turn out to be central here. Non-text files, such as JPEG
images or executable program files, are generally called binary files.
You can use readLines() to read in a text file, either one line at a time or in a single operation. For
example, suppose we have a file z1 with the following contents:
John 25
Mary 28
Jim 19
We can read the file all at once, like this:
> z1 <- readLines("z1")
> z1
[1] "John 25" "Mary 28" "Jim 19"
Since each line is treated as a string, the return value here is a vector of strings—that is, a vector of
character mode. There is one vector element for each line read, thus three elements here.
Alternatively, we can read it in one line at a time. For this, we first need to create a connection, as
described in the next section.
If your data are saved as a text file but aren’t quite in the proper CSV format, then there’s still a pretty
good chance that the read.csv() function (or equivalently, read.table()) will be able to open it. You just
need to specify a few more of the optional arguments to the function. If you type ?read.csv you’ll see
that the read.csv() function actually has several arguments that you can specify. Obviously you need
to specify the file that you want it to load, but the others all have sensible default values. Nevertheless,
you will sometimes need to change them. A few of them are:
• header. A lot of the time when you’re storing data as a CSV file, the first row actually contains
the column names and not data. If that’s not true, you need to set header = FALSE.
• sep. As the name “comma separated value” indicates, the values in a row of a CSV file are
usually separated by commas. This isn’t universal, however. In Europe the decimal point is
typically written as , instead of . and as a consequence it would be somewhat awkward to use
, as the separator. Therefore it is not unusual to use ; over there. At other times, a TAB
character is used. To handle these cases, we'd need to set sep = ";" or sep = "\t".
• quote. It’s conventional in CSV files to include a quoting character for textual data. As you can
see by looking at the booksales.csv file, this is usually a double quote character, ". But
sometimes there is no quoting character at all, or you might see a single quote mark ’ used
instead. In those cases, you’d need to specify quote = "" or quote = "’".
• skip. It’s actually very common to receive CSV files in which the first few rows have nothing to
do with the actual data. Instead, they provide a human readable summary of where the data
came from, or maybe they include some technical info that doesn’t relate to the data. To tell
R to ignore the first (say) three lines, you’d need to set skip = 3
• na.strings. Often you’ll get given data with missing values. For one reason or another, some
entries in the table are missing. The data file needs to include a “special” string to indicate
that the entry is missing. By default R assumes that this string is NA, since that's what R itself
uses, but there's no universal agreement on what to use in this situation. If the file uses ???
instead, then you’ll need to set na.strings = "???".
It’s kind of nice to be able to have all these options that you can tinker with. For instance, have a look
at the data file shown in Fig. 6.1. This file contains almost the same data as the last file (except it
doesn’t have a header), and it uses a bunch of wacky features that you don’t normally see in CSV files.
In fact, it just so happens that we have to change all five of those arguments listed above in order to
load this file. Here’s how we would do it:
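> data <- read.csv( file = "booksales2.csv", # the file to load
+ header = FALSE, # no variable names in the file
+ skip = 8, # ignore the first 8 lines
+ quote = "*", # the quoting character
+ sep = "\t", # entries are separated by tabs
+ na.strings = "NFI" ) # the code for missing data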

If you now have a look at the data we have loaded, you see that this is what you have got:
> head( data )
V1 V2 V3 V4
1 January 31 0 high
2 February 28 100 high
3 March 31 200 low
4 April 30 50 out
5 May 31 NA out
6 June 30 0 high

Fig. 6.1 The booksales2.csv data file. It contains more or less the same data as the original
booksales.csv data file, but has a lot of very quirky features
Here we told R to expect * to be used as the quoting character instead of ", to look for tabs (which we
write like this: \t) instead of commas, and to skip the first 8 lines of the file. With those settings, R has basically loaded the
right data. However, since booksales2.csv doesn’t contain the column names, R has made them up.
Showing the kind of imagination we expect from insentient software, R decided to call them V1, V2,
V3 and V4. Finally, because we told it that the file uses “NFI” to denote missing data, R correctly figures
out that the sales data for May are actually missing.
Loading data from SPSS (and other statistics packages)
In real life, we have many more possibilities. For example, you might want to read data files in from
other statistics programs. Since SPSS is probably the most widely used statistics package in psychology,
it’s worth briefly showing how to open SPSS data files (file extension .sav). It’s surprisingly easy. The
extract below should illustrate how to do so:
> library( foreign ) # load the package
> X <- read.spss( "datafile.sav" ) # create a list containing the data
> X <- as.data.frame( X ) # convert to data frame
If you wanted to import from an SPSS file to a data frame directly, instead of importing a list and then
converting the list to a data frame, you can do that too:
> X <- read.spss( file = "datafile.sav", to.data.frame = TRUE )
And that’s pretty much it, at least as far as SPSS goes. As far as other statistical software goes, the
foreign package provides a wealth of possibilities. To open SAS files, check out the read.ssd()and
read.xport() functions. To open data from Minitab, the read.mtp() function is what you’re looking for.
For Stata, the read.dta() function is what you want. For Systat, the read.systat() function is what you’re
after.
Loading Excel files
A different problem is posed by Excel files. In general, R does a pretty good job of opening them, but
it’s bit finicky because Microsoft don’t seem to be terribly fond of people using non-Microsoft
products, and go to some lengths to make it tricky. If you get an Excel file, suggestion would be to
open it up in Excel (or better yet, OpenOffice, since that’s free software) and then save the
spreadsheet as a CSV file. Once you’ve got the data in that format, you can open it using read.csv().
However, if for some reason you’re desperate to open the .xls or .xlsx file directly, then you can use
the read.xls() function in the gdata package:
> library( gdata ) # load the package
> X <- read.xls( "datafile.xlsx" ) # create a data frame
This usually works. And if it doesn’t, you’re probably justified in “suggesting” to the person that sent
you the file that they should send you a nice clean CSV file instead.
Loading Matlab (& Octave) files
A lot of scientific labs use Matlab as their default platform for scientific computing, or Octave as a free
alternative. Opening Matlab data files (file extension .mat) is slightly more complicated, but since
Matlab is so widespread, it's worth knowing how to do it.
The way to do this is to install the R.matlab package (don’t forget to install the dependencies too).
Once you’ve installed and loaded the package, you have access to the readMat() function. As any
Matlab user will know, the .mat files that Matlab produces are workspace files, very much like the
.Rdata files that R produces. So you can’t import a .mat file as a data frame. However, you can import
it as a list. So, when we do this:
> library( R.matlab ) # load the package
> data <- readMat( "matlabfile.mat" ) # read the data file to a list
The data object that gets created will be a list, containing one variable for every variable stored in the
Matlab file. It’s fairly straightforward, though there are some subtleties that I’m ignoring. In particular,
note that if you don’t have the Rcompression package, you can’t open Matlab files above the version
format. So, if you’ve got a recent version of Matlab, and don’t have the Rcompression package, you’ll
need to save your files using the -v6 flag otherwise R can’t open them.
For Octave users, the foreign package contains a read.octave() command.
Saving other kinds of data
R is also pretty good at writing data into other file formats besides its own native ones. The write.csv()
function can write CSV files, and the write.foreign() function (in the foreign package) can write SPSS,
Stata and SAS files. There are also a lot of low level commands that you can use to write very specific
information to a file, so if you really, really needed to you could create your own
write.obscurefiletype() function.
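For instance, here is a minimal sketch using a hypothetical file name and the same little Jack-and-Jill data frame that we construct again in the write.table() example later in this unit. Note that write.csv() quotes character values and column names by default:
> d <- data.frame( kids = c("Jack","Jill"), ages = c(12,10) )
> write.csv( x = d, file = "kids.csv", row.names = FALSE )
The resulting kids.csv file would contain:
"kids","ages"
"Jack",12
"Jill",10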
6.11.2.1 Introduction to Connections
Connection is R’s term for a fundamental mechanism used in various kinds of I/O operations. Here, it
will be used for file access. The connection is created by calling file(), url(), or one of several other R
functions. To see a list of those functions, type this:
> ?connection
So, we can now read in the z1 file (introduced in the previous section) line by line, as follows:
> c <- file("z1","r")
> readLines(c,n=1)
[1] "John 25"
> readLines(c,n=1)
[1] "Mary 28"
> readLines(c,n=1)
[1] "Jim 19"
> readLines(c,n=1)
character(0)
We opened the connection, assigned the result to c, and then read the file one line at a time, as
specified by the argument n=1. When R encountered the end of file (EOF), it returned an empty result.
We needed to set up a connection so that R could keep track of our position in the file as we read
through it. We can detect EOF in our code:
> c <- file("z","r")
> while(TRUE) {
+ rl <- readLines(c,n=1)
+ if (length(rl) == 0) {
+ print("reached the end")
+ break
+ } else print(rl)
+}
[1] "John 25"
[1] "Mary 28"
[1] "Jim 19"
[1] "reached the end"
If we wish to “rewind”—to start again at the beginning of the file—we can use seek():
> c <- file("z1","r")
> readLines(c,n=2)
[1] "John 25" "Mary 28"
> seek(con=c,where=0)
[1] 16
> readLines(c,n=1)
[1] "John 25"
The argument where=0 in our call to seek() means that we wish to position the file pointer zero
characters from the start of the file—in other words, directly at the beginning.
The call returns 16, meaning that the file pointer was at position 16 before we made the call. That
makes sense. The first line consists of "John 25" plus the end-of-line character, for a total of eight
characters, and the same is true for the second line. So, after reading the first two lines, we were at
position 16.
You can close a connection by calling—what else?—close(). You would use this to let the system know
that the file you have been writing is complete and should now be officially written to disk. As another
example, in a client/server relationship over the Internet, a client would use close() to indicate to the
server that the client is signing off.
6.11.2.2 Accessing Files on Remote Machines via URLs
Certain I/O functions, such as read.table() and scan(), accept web URLs as arguments. As an example,
we’ll read some data from the University of California, Irvine archive at
http://archive.ics.uci.edu/ml/datasets.html, using the Echocardiogram data set. After navigating the
links, we find the location of that file and then read it from R, as follows:
> uci <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
> uci <- paste(uci,"echocardiogram/echocardiogram.data",sep="")
> ecc <- read.csv(uci)
Let’s take a look at what we downloaded:

We could then do our analyses. For example, the third column is age, so we could find its mean or
perform other calculations on that data. See the echocardiogram.names page at
http://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.names
for descriptions of all of the variables.
6.11.2.3 Writing to a File
Given the statistical basis of R, file reads are probably much more common than writes. But writes are
sometimes necessary, and this section will present methods for writing to files.
The function write.table() works very much like read.table(), except that it writes a data frame instead
of reading one. For instance, let’s take the little Jack and Jill example:
> kids <- c("Jack","Jill")
> ages <- c(12,10)
> d <- data.frame(kids,ages,stringsAsFactors=FALSE)
> d
kids ages
1 Jack 12
2 Jill 10
> write.table(d,"kds")
The file kds will now have these contents:
"kids" "ages"
"1" "Jack" 12
"2" "Jill" 10
In the case of writing a matrix to a file, just state that you do not want row or column names. For a
matrix xc that we have already created, it looks as follows:
> write.table(xc,"xcnew",row.names=FALSE,col.names=FALSE)
The function cat() can also be used to write to a file, one part at a time. Here’s an example:
> cat("abc\n",file="u")
> cat("de\n",file="u",append=TRUE)
The first call to cat() creates the file u, consisting of one line with contents "abc". The second call
appends a second line. Unlike the case of using the writeLines() function, the file is automatically saved
after each operation. For instance, after the previous calls, the file will look like this:
abc
de
You can write multiple fields as well. So:
> cat(file="v",1,2,"xyz\n")
would produce a file v consisting of a single line:
1 2 xyz
You can also use writeLines(), the counterpart of readLines(). If you use a connection, you must specify
"w" to indicate you are writing to the file, not reading from it:
> c <- file("www","w")
> writeLines(c("abc","de","f"),c)
> close(c)
The file www will be created with these contents:
abc
de
f
Note the need to proactively close the file.
6.11.2.4 Getting File and Directory Information
R has a variety of functions for getting information about directories and files, setting file access
permissions, and the like. The following are a few examples:
• file.info(): Gives file size, creation time, directory-versus-ordinary file status, and so on for
each file whose name is in the argument, a character vector.
• dir(): Returns a character vector listing the names of all the files in the directory specified in
its first argument. If the optional argument recursive=TRUE is specified, the result will show
the entire directory tree rooted at the first argument.
• file.exists(): Returns a Boolean vector indicating whether the given file exists for each name in
the first argument, a character vector.
• getwd() and setwd(): Used to determine or change the current working directory.
To see all the file- and directory-related functions, type the following:
> ?files
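A couple of quick sketches follow; the output shown here is hypothetical and will differ on your machine:

> getwd() # the current working directory
[1] "/home/user/project"
> file.exists( "z1.txt" ) # does the file from earlier exist here?
[1] TRUE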
6.11.2.5 Accessing the Internet
R’s socket facilities give the programmer access to the Internet’s TCP/IP protocol. A very important
point to keep in mind is that all the bytes sent by A to B during the time the connection between them
exists are collectively considered one big message. Say A sends one line of text of 8 characters and
then another of 20 characters. From A’s point of view, that’s two lines, but to TCP/IP, it’s just 28
characters of a yet incomplete message. Splitting that long message back into lines can take a bit of
doing. R provides various functions for this purpose, including the following:
• readLines() and writeLines(): These allow you to program as if TCP/IP were sending messages
line by line, even though this is not actually the case. If your application is naturally viewed in
terms of lines, these two functions can be quite handy.
• serialize() and unserialize(): You can use these to send R objects, such as a matrix or the
complex output of a call to a statistical function. The object is converted to character string
form by the sender and then converted back to the original object form at the receiver.
• readBin() and writeBin(): These are for sending data in binary form.
It’s important to choose the right function for each job. If you have a long vector, for example, using
serialize() and unserialize() may be more convenient but far more time-consuming. This is not only
because numbers must be converted to and from their character representations but also because
the character representation is typically much longer, which means greater transmission time.
Here are two other R socket functions:
• socketConnection(): This establishes an R connection via sockets. You specify the port number
in the argument port, and state whether a server or client is to be created, by setting the
argument server to TRUE or FALSE, respectively. In the client case, you must also supply the
server's IP address in the argument host (see the sketch after this list).
• socketSelect(): This is useful when a server is connected to multiple clients. Its main argument,
socklist, is a list of connections, and its return value is the sublist of connections that have data
ready for the server to read.
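To make the client/server distinction concrete, here is a hypothetical minimal sketch. The port number is arbitrary, and the two halves would be run in two separate R sessions:

> # session 1 (server): wait for a client on port 6011, then send one line
> con <- socketConnection( port = 6011, server = TRUE, blocking = TRUE, open = "r+" )
> writeLines( "hello from the server", con )
> close( con )
> # session 2 (client): connect to the server and read that line
> con <- socketConnection( host = "localhost", port = 6011, blocking = TRUE, open = "r+" )
> readLines( con, n = 1 )
[1] "hello from the server"
> close( con )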

Check your Progress 1


Fill in the Blanks.

1. The _______ function is used to break a string apart.


2. The chartr() function allows you to specify a ______ substitution.
3. The _____ function is a mixture of paste() and print().
4. The purpose of the ______ function is to input a vector of character strings x and to extract all
those strings that fit a certain pattern.
5. The function ______ is used to read in a data frame.
6. The read.xls() function is in the ___ package.
7. ______ gives file size, creation time, directory-versus-ordinary file status, and so on for each
file whose name is in the argument, a character vector.
8. socketSelect() is useful when a server is connected to ______ clients.

Activity 1
1. Write a program in R and apply all the string manipulation functions discussed in the unit.
2. Write a program in R to read the .csv file and print the content on the screen.

Summary
 R has a number of string-manipulation utilities.
 We can perform various operations on string in R like, concatenating, splitting, shortening,
matching and substituting a string.
 R provides several functions for accessing the keyboard and monitor like scan(), readline(),
print(), and cat() functions.
 We can also read and write different types of files using functions in R.

Keywords
• String: It is a sequence of characters.
• Vector: Vector is a basic data structure in R and it contains elements of the same type.
• Socket: A socket is one endpoint of a two-way communication link between two programs
running on the network.
• TCP/IP: Transmission Control Protocol/Internet Protocol (TCP/IP), is a suite of
communication protocols used to interconnect network devices on the internet.
• Server: A server is a computer program or a device that provides functionality for other
programs or devices, called "clients".

Self-Assessment Questions
1. State the difference between the cat() and paste() functions.
2. Write a short note on strsplit() function.
3. Explain the use of escape character.
4. What is the difference between gsub() and sub()?
5. Discuss the function used for reading a .csv file.

Answers To Check Your Progress


Check your Progress 1

Fill in the Blanks.

1. The strsplit() function is used to break the string.


2. The chartr() function allows you to specify a “character by character” substitution.
3. The cat() function is a mixture of paste() and print().
4. The purpose of the grep() function is to input a vector of character strings x and to extract all
those strings that fit a certain pattern.
5. The function read.table() is used to read in a data frame.
6. The read.xls() function is in the gdata package.
7. file.info() gives file size, creation time, directory-versus-ordinary file status, and so on for each
file whose name is in the argument, a character vector.
8. socketSelect() is useful when a server is connected to multiple clients.

Suggested Reading
1. The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff.
2. https://kingaa.github.io/R_Tutorial/#data-structures-in-r.
3. https://d1b10bmlvqabco.cloudfront.net/attach/ighbo26t3ua52t/igp9099yy4v10/igz7vp4w5
su9/OReilly_HandsOn_Programming_with_R_2014.pdf.
4. The R Book by Michael J. Crawley, Imperial College London at Silwood Park, UK.
5. R Programming for Data Science by Roger D. Peng.
6. An introduction to R by Longhow Lam.
7. Learning Statistics with R by Danielle Navarro.
8. Advanced R by Hadley Wickham, The R Series.
Unit 7
Object Oriented Programming – I
Structure:

7.1 Introduction

7.2 OOP Systems

7.3 OOP in R

7.4 Sloop

7.5 Base Types

7.5.1 Base versus OO Objects

7.5.2 Base Types

7.6 S3

7.6.1 Basics

7.6.2 Classes

7.6.3 Generics and Methods

7.6.4 Object Styles

7.6.5 Inheritance

7.6.6 Dispatch Details

Summary

Keywords

Self-Assessment Questions

Answers to Check your Progress

Suggested Reading

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) as requested by the work’s creator or
licensees. This license is available at https://creativecommons.org/licenses/by-nc-sa/4.0/ .
Objectives
After going through this unit, you will be able to:

 Understand OOP systems


 Explain a complete set of the base types used to build all objects
 Describes how S3 generics and methods work, including the basics of method dispatch

7.1 INTRODUCTION
OOP is a little more challenging in R than in other languages because:

 There are multiple OOP systems to choose from: S3, R6, and S4. S3 and S4 are provided by
base R. R6 is provided by the R6 package, and is similar to the Reference Classes, or RC for
short, from base R.
 There is disagreement about the relative importance of the OOP systems. S3 is most
important, followed by R6, then S4. Some believe that S4 is most important, followed by RC,
and that S3 should be avoided. This means that different R communities use different systems.
 S3 and S4 use generic function OOP which is rather different from the encapsulated OOP used
by most languages popular today. Basically, while the underlying ideas of OOP are the same
across languages, their expressions are rather different. This means that you can’t
immediately transfer your existing OOP skills to R.

Generally in R, functional programming is much more important than object-oriented programming,


because you typically solve complex problems by decomposing them into simple functions, not simple
objects. Nevertheless, there are important reasons to learn each of the three systems:

 S3 allows your functions to return rich results with user-friendly display and programmer-
friendly internals. S3 is used throughout base R, so it’s important to master if you want to
extend base R functions to work with new types of input.
 R6 provides a standardised way to escape R’s copy-on-modify semantics. This is particularly
important if you want to model objects that exist independently of R. Today, a common need
for R6 is to model data that comes from a web API, and where changes come from inside or
outside of R.
 S4 is a rigorous system that forces you to think carefully about program design. It’s particularly
well-suited for building large systems that evolve over time and will receive contributions from
many programmers. This is why it is used by the Bioconductor project, so another reason to
learn S4 is to equip you to contribute to that project.

The goal of this unit is to give you some important vocabulary and some tools to identify OOP systems
in the wild.

7.2 OOP SYSTEMS


Different people use OOP terms in different ways, so this section provides a quick overview of
important vocabulary. The explanations are necessarily compressed, but we will come back to these
ideas multiple times.

The main reason to use OOP is polymorphism (literally: many shapes). Polymorphism means that a
developer can consider a function’s interface separately from its implementation, making it possible
to use the same function form for different types of input. This is closely related to the idea of
encapsulation: the user doesn’t need to worry about details of an object because they are
encapsulated behind a standard interface.

To be concrete, polymorphism is what allows summary() to produce different outputs for numeric and
factor variables:

diamonds <- ggplot2::diamonds

summary(diamonds$carat)

#> Min. 1st Qu. Median Mean 3rd Qu. Max.

#> 0.20 0.40 0.70 0.80 1.04 5.01

summary(diamonds$cut)

#> Fair Good Very Good Premium Ideal

#> 1610 4906 12082 13791 21551

You could imagine summary() containing a series of if-else statements, but that would mean only the
original author could add new implementations. An OOP system makes it possible for any developer
to extend the interface with implementations for new types of input.

To be more precise, OO systems call the type of an object its class, and an implementation for a specific
class is called a method. Roughly speaking, a class defines what an object is and methods describe
what that object can do. The class defines the fields, the data possessed by every instance of that
class. Classes are organised in a hierarchy so that if a method does not exist for one class, its parent’s
method is used, and the child is said to inherit behaviour. For example, in R, an ordered factor inherits
from a regular factor, and a generalised linear model inherits from a linear model. The process of
finding the correct method given a class is called method dispatch.
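For example, you can see both of the relationships mentioned above directly in the class vectors (using the built-in cars data set):

class(ordered(c("low", "high")))

#> [1] "ordered" "factor"

class(glm(dist ~ speed, data = cars))

#> [1] "glm" "lm"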

There are two main paradigms of object-oriented programming which differ in how methods and
classes are related, call these paradigms encapsulated and functional:

 In encapsulated OOP, methods belong to objects or classes, and method calls typically look
like object.method(arg1, arg2). This is called encapsulated because the object encapsulates
both data (with fields) and behaviour (with methods), and is the paradigm found in most
popular languages.
 In functional OOP, methods belong to generic functions, and method calls look like ordinary
function calls: generic(object, arg2, arg3). This is called functional because from the outside it
looks like a regular function call, and internally the components are also functions.

7.3 OOP IN R
Base R provides three OOP systems: S3, S4, and reference classes (RC):

 S3 is R’s first OOP system, and is described in Statistical Models in S (Chambers and Hastie
1992). S3 is an informal implementation of functional OOP and relies on common conventions
rather than ironclad guarantees. This makes it easy to get started with, providing a low cost
way of solving many simple problems.
 S4 is a formal and rigorous rewrite of S3, and was introduced in Programming with Data
(Chambers 1998). It requires more upfront work than S3, but in return provides more
guarantees and greater encapsulation. S4 is implemented in the base methods package, which
is always installed with R.
(You might wonder if S1 and S2 exist. They don’t: S3 and S4 were named according to the
versions of S that they accompanied. The first two versions of S didn’t have any OOP
framework.)
 RC implements encapsulated OO. RC objects are a special type of S4 objects that are also
mutable, i.e., instead of using R’s usual copy-on-modify semantics, they can be modified in
place. This makes them harder to reason about, but allows them to solve problems that are
difficult to solve in the functional OOP style of S3 and S4.

A number of other OOP systems are provided by CRAN packages:

 R6 (Chang 2017) implements encapsulated OOP like RC, but resolves some important issues.
 R.oo (Bengtsson 2003) provides some formalism on top of S3, and makes it possible to have
mutable S3 objects.
 proto (Grothendieck, Kates, and Petzoldt 2016) implements another style of OOP based on
the idea of prototypes, which blur the distinctions between classes and instances of classes
(objects).

7.4 SLOOP
Before we go on, let's introduce the sloop package:

library(sloop)

The sloop package (think “sail the seas of OOP”) provides a number of helpers that fill in missing pieces
in base R. The first of these is sloop::otype(). It makes it easy to figure out the OOP system used by a
wild-caught object:

otype(1:10)

#> [1] "base"

otype(mtcars)

#> [1] "S3"

mle_obj <- stats4::mle(function(x = 1) (x - 2) ^ 2)

otype(mle_obj)

#> [1] "S4"

7.5 BASE TYPES


To talk about objects and OOP in R we first need to clear up a fundamental confusion about two uses
of the word “object”. The general sense is captured by John Chambers’ pithy quote: “Everything that
exists in R is an object”. However, while everything is an object, not everything is object-oriented. This
confusion arises because the base objects come from S, and were developed before anyone thought
that S might need an OOP system. The tools and nomenclature evolved organically over many years
without a single guiding principle.

Most of the time, the distinction between objects and object-oriented objects is not important. But
here we need to get into the nitty gritty details so we’ll use the terms base objects and OO objects to
distinguish them.
7.5.1 Base versus OO Objects
To tell the difference between a base and OO object, use is.object() or sloop::otype():

# A base object:

is.object(1:10)

#> [1] FALSE

sloop::otype(1:10)

#> [1] "base"

# An OO object

is.object(mtcars)

#> [1] TRUE

sloop::otype(mtcars)

#> [1] "S3"

Technically, the difference between base and OO objects is that OO objects have a “class” attribute:

attr(1:10, "class")

#> NULL

attr(mtcars, "class")

#> [1] "data.frame"

You may already be familiar with the class() function. This function is safe to apply to S3 and S4 objects,
but it returns misleading results when applied to base objects. It’s safer to use sloop::s3_class(), which
returns the implicit class that the S3 and S4 systems will use to pick methods.

x <- matrix(1:4, nrow = 2)

class(x)

#> [1] "matrix"

sloop::s3_class(x)

#> [1] "matrix" "integer" "numeric"


7.5.2 Base Types
While only OO objects have a class attribute, every object has a base type:

typeof(1:10)

#> [1] "integer"

typeof(mtcars)

#> [1] "list"

Base types do not form an OOP system because functions that behave differently for different base
types are primarily written in C code that uses switch statements. This means that only R-core can
create new types, and creating a new type is a lot of work because every switch statement needs to
be modified to handle a new case. As a consequence, new base types are rarely added. The most
recent change, in 2011, added two exotic types that you never see in R itself, but are needed for
diagnosing memory problems. Prior to that, the last type added was a special base type for S4 objects
added in 2005.

In total, there are 25 different base types. A few of them are listed below.

Vectors include the types NULL (NILSXP), logical (LGLSXP), integer (INTSXP), double (REALSXP),
complex (CPLXSXP), character (STRSXP), list (VECSXP), and raw (RAWSXP).

typeof(NULL)

#> [1] "NULL"

typeof(1L)

#> [1] "integer"

typeof(1i)

#> [1] "complex"

Functions include the types closure (regular R functions, CLOSXP), special (internal functions,
SPECIALSXP), and builtin (primitive functions, BUILTINSXP).

typeof(mean)

#> [1] "closure"

typeof(`[`)

#> [1] "special"

typeof(sum)

#> [1] "builtin"

Environments have the type environment (ENVSXP).

typeof(globalenv())

#> [1] "environment"

The S4 type (S4SXP) is used for S4 classes that don’t inherit from an existing base type.
mle_obj <- stats4::mle(function(x = 1) (x - 2) ^ 2)

typeof(mle_obj)

#> [1] "S4"

Language components include the types symbol (also known as name, SYMSXP), language (usually called
calls, LANGSXP), and pairlist (used for function arguments, LISTSXP).

typeof(quote(a))

#> [1] "symbol"

typeof(quote(a + 1))

#> [1] "language"

typeof(formals(mean))

#> [1] "pairlist"

expression (EXPRSXP) is a special purpose type that’s only returned by parse() and expression().
Expressions are generally not needed in user code.

The remaining types are esoteric and rarely seen in R. They are important primarily for C code:
externalptr (EXTPTRSXP), weakref (WEAKREFSXP), bytecode (BCODESXP), promise (PROMSXP), ...
(DOTSXP), and any (ANYSXP).

You may have heard of mode() and storage.mode(). Do not use these functions: they exist only to
provide type names that are compatible with S.

7.5.2.1 Numeric Type

Be careful when talking about the numeric type, because R uses “numeric” to mean three slightly
different things:

1. In some places numeric is used as an alias for the double type. For example as.numeric() is
identical to as.double(), and numeric() is identical to double().
(R also occasionally uses real instead of double; NA_real_ is the one place that you’re likely to
encounter this in practice.)
2. In the S3 and S4 systems, numeric is used as a shorthand for either integer or double type,
and is used when picking methods:

sloop::s3_class(1)

#> [1] "double" "numeric"

sloop::s3_class(1L)

#> [1] "integer" "numeric"

3. is.numeric() tests for objects that behave like numbers. For example, factors have type
“integer” but don’t behave like numbers (i.e., it doesn’t make sense to take the mean of a factor).

typeof(factor("x"))

#> [1] "integer"


is.numeric(factor("x"))

#> [1] FALSE

7.6 S3
S3 is R’s first and simplest OO system. S3 is informal and ad hoc, but there is a certain elegance in its
minimalism: you can’t take away any part of it and still have a useful OO system. For these reasons,
you should use it, unless you have a compelling reason to do otherwise. S3 is the only OO system used
in the base and stats packages, and it’s the most commonly used system in CRAN packages.

S3 is very flexible, which means it allows you to do things that are quite ill-advised. If you’re coming
from a strict environment like Java this will seem pretty frightening, but it gives R programmers a
tremendous amount of freedom. It may be very difficult to prevent people from doing something you
don’t want them to do, but your users will never be held back because there is something you haven’t
implemented yet. Since S3 has few built-in constraints, the key to its successful use is applying the
constraints yourself. This unit will therefore teach you the conventions you should (almost) always
follow.

This section covers how the S3 system works, not how to use it effectively to create new classes and
generics.

7.6.1 Basics
An S3 object is a base type with at least a class attribute (other attributes may be used to store other
data). For example, take the factor. Its base type is the integer vector, it has a class attribute of
“factor”, and a levels attribute that stores the possible levels:

f <- factor(c("a", "b", "c"))

typeof(f)

#> [1] "integer"

attributes(f)

#> $levels

#> [1] "a" "b" "c"

#>

#> $class

#> [1] "factor"

You can get the underlying base type by unclass()ing it, which strips the class attribute, causing it to
lose its special behaviour:

unclass(f)

#> [1] 1 2 3

#> attr(,"levels")

#> [1] "a" "b" "c"


An S3 object behaves differently from its underlying base type whenever it’s passed to a generic (short
for generic function). The easiest way to tell if a function is a generic is to use sloop::ftype() and look
for “generic” in the output:

ftype(print)

#> [1] "S3" "generic"

ftype(str)

#> [1] "S3" "generic"

ftype(unclass)

#> [1] "primitive"

A generic function defines an interface, which uses a different implementation depending on the class
of an argument (almost always the first argument). Many base R functions are generic, including the
important print():

print(f)

#> [1] a b c

#> Levels: a b c

# stripping class reverts to integer behaviour

print(unclass(f))

#> [1] 1 2 3

#> attr(,"levels")

#> [1] "a" "b" "c"

Beware that str() is generic, and some S3 classes use that generic to hide the internal details. For
example, the POSIXlt class used to represent date-time data is actually built on top of a list, a fact
which is hidden by its str() method:

time <- strptime(c("2017-01-01", "2020-05-04 03:21"), "%Y-%m-%d")

str(time)

#> POSIXlt[1:2], format: "2017-01-01" "2020-05-04"

str(unclass(time))

#> List of 9

#> $ sec : num [1:2] 0 0

#> $ min : int [1:2] 0 0

#> $ hour : int [1:2] 0 0

#> $ mday : int [1:2] 1 4

#> $ mon : int [1:2] 0 4


#> $ year : int [1:2] 117 120

#> $ wday : int [1:2] 0 1

#> $ yday : int [1:2] 0 124

#> $ isdst: int [1:2] 0 0

#> - attr(*, "tzone")= chr "UTC"

The generic is a middleman: its job is to define the interface (i.e. the arguments) then find the right
implementation for the job. The implementation for a specific class is called a method, and the generic
finds that method by performing method dispatch.

You can use sloop::s3_dispatch() to see the process of method dispatch:

s3_dispatch(print(f))

#> => print.factor

#> * print.default

Note that S3 methods are functions with a special naming scheme, generic.class(). For example, the
factor method for the print() generic is called print.factor(). You should never call the method directly,
but instead rely on the generic to find it for you.

Generally, you can identify a method by the presence of . in the function name, but there are a number
of important functions in base R that were written before S3, and hence use . to join words. If you’re
unsure, check with sloop::ftype():

ftype(t.test)

#> [1] "S3" "generic"

ftype(t.data.frame)

#> [1] "S3" "method"

Unlike most functions, you can’t see the source code for most S3 methods just by typing their names.
That’s because S3 methods are not usually exported: they live only inside the package, and are not
available from the global environment. Instead, you can use sloop::s3_get_method(), which will work
regardless of where the method lives:

weighted.mean.Date

#> Error in eval(expr, envir, enclos): object 'weighted.mean.Date' not found

s3_get_method(weighted.mean.Date)

#> function (x, w, ...)

#> structure(weighted.mean(unclass(x), w, ...), class = "Date")

#> <bytecode: 0x3eb1a58>

#> <environment: namespace:stats>


7.6.2 Classes
If you have done object-oriented programming in other languages, you may be surprised to learn that
S3 has no formal definition of a class: to make an object an instance of a class, you simply set the class
attribute. You can do that during creation with structure(), or after the fact with class<-():

# Create and assign class in one step

x <- structure(list(), class = "my_class")

# Create, then set class

x <- list()

class(x) <- "my_class"

You can determine the class of an S3 object with class(x), and see if an object is an instance of a class
using inherits(x, "classname").

class(x)

#> [1] "my_class"

inherits(x, "my_class")

#> [1] TRUE

inherits(x, "your_class")

#> [1] FALSE

The class name can be any string, but it is recommended to use only letters and _. Avoid . because (as
mentioned earlier) it can be confused with the . separator between a generic name and a class name.
When using a class in a package, I recommend including the package name in the class name. That
ensures you won’t accidentally clash with a class defined by another package.

S3 has no checks for correctness which means you can change the class of existing objects:

# Create a linear model

mod <- lm(log(mpg) ~ log(disp), data = mtcars)

class(mod)

#> [1] "lm"

print(mod)

#>

#> Call:

#> lm(formula = log(mpg) ~ log(disp), data = mtcars)

#>

#> Coefficients:

#> (Intercept) log(disp)


#> 5.381 -0.459

# Turn it into a date (?!)

class(mod) <- "Date"

# Unsurprisingly this doesn't work very well

print(mod)

#> Error in as.POSIXlt.Date(x): (list) object cannot be coerced to type

#> 'double'

If you’ve used other OO languages, this might make you feel queasy, but in practice this flexibility
causes few problems. R doesn’t stop you from shooting yourself in the foot, but as long as you don’t
aim the gun at your toes and pull the trigger, you won’t have a problem.

To avoid foot-bullet intersections when creating your own class, you usually provide three functions:

1. A low-level constructor, new_myclass(), that efficiently creates new objects with the correct
structure.
2. A validator, validate_myclass(), that performs more computationally expensive checks to
ensure that the object has correct values.
3. A user-friendly helper, myclass(), that provides a convenient way for others to create objects
of your class.

You don’t need a validator for very simple classes, and you can skip the helper if the class is for internal
use only, but you should always provide a constructor.

7.6.2.1 Constructors

S3 doesn’t provide a formal definition of a class, so it has no built-in way to ensure that all objects of
a given class have the same structure (i.e. the same base type and the same attributes with the same
types). Instead, you must enforce a consistent structure by using a constructor.

The constructor should follow three principles:

 Be called new_myclass().
 Have one argument for the base object, and one for each attribute.
 Check the type of the base object and the types of each attribute.

To illustrate these ideas, let’s create constructors for some base classes. To start, we will make a constructor for the
simplest S3 class: Date. A Date is just a double with a single attribute: its class is “Date”. This makes
for a very simple constructor:

new_Date <- function(x = double()) {
  stopifnot(is.double(x))
  structure(x, class = "Date")
}

new_Date(c(-1, 0, 1))

#> [1] "1969-12-31" "1970-01-01" "1970-01-02"


The purpose of constructors is to help you, the developer. That means you can keep them simple, and
you don’t need to optimise error messages for public consumption. If you expect users to also create
objects, you should create a friendly helper function, called class_name().

A slightly more complicated constructor is that for difftime, which is used to represent time
differences. It is again built on a double, but has a units attribute that must take one of a small set of
values:

new_difftime <- function(x = double(), units = "secs") {
  stopifnot(is.double(x))
  units <- match.arg(units, c("secs", "mins", "hours", "days", "weeks"))

  structure(x,
    class = "difftime",
    units = units
  )
}

new_difftime(c(1, 10, 3600), "secs")

#> Time differences in secs

#> [1] 1 10 3600

new_difftime(52, "weeks")

#> Time difference of 52 weeks

The constructor is a developer function: it will be called in many places, by an experienced user. That
means it’s OK to trade a little safety in return for performance, and you should avoid potentially time-
consuming checks in the constructor.

7.6.2.2 Validators

More complicated classes require more complicated checks for validity. Take factors, for example. A
constructor only checks that types are correct, making it possible to create malformed factors:

new_factor <- function(x = integer(), levels = character()) {

stopifnot(is.integer(x))

stopifnot(is.character(levels))

structure(

x,

levels = levels,

class = "factor"

)
}

new_factor(1:5, "a")

#> Error in as.character.factor(x): malformed factor

new_factor(0:1, "a")

#> Error in as.character.factor(x): malformed factor

Rather than encumbering the constructor with complicated checks, it’s better to put them in a
separate function. Doing so allows you to cheaply create new objects when you know that the values
are correct, and easily re-use the checks in other places.

validate_factor <- function(x) {
  values <- unclass(x)
  levels <- attr(x, "levels")

  if (!all(!is.na(values) & values > 0)) {
    stop(
      "All `x` values must be non-missing and greater than zero",
      call. = FALSE
    )
  }

  if (length(levels) < max(values)) {
    stop(
      "There must be at least as many `levels` as possible values in `x`",
      call. = FALSE
    )
  }

  x
}

validate_factor(new_factor(1:5, "a"))

#> Error: There must be at least as many `levels` as possible values in `x`

validate_factor(new_factor(0:1, "a"))

#> Error: All `x` values must be non-missing and greater than zero

This validator function is called primarily for its side-effects (throwing an error if the object is invalid),
so you’d expect it to invisibly return its primary input. However, it’s useful for validation methods to
return their input visibly, as the factor() helper in the next section shows.
7.6.2.3 Helpers

If you want users to construct objects from your class, you should also provide a helper method that
makes their life as easy as possible. A helper should always:

 Have the same name as the class, e.g. myclass().


 Finish by calling the constructor, and the validator, if it exists.
 Create carefully crafted error messages tailored towards an end-user.
 Have a thoughtfully crafted user interface with carefully chosen default values and useful
conversions.

The last bullet is the trickiest, and it’s hard to give general advice. However, there are three common
patterns:

 Sometimes all the helper needs to do is coerce its inputs to the desired type. For example,
new_difftime() is very strict, and violates the usual convention that you can use an integer
vector wherever you can use a double vector:

new_difftime(1:10)

#> Error in new_difftime(1:10): is.double(x) is not TRUE

It’s not the job of the constructor to be flexible, so here we create a helper that just coerces
the input to a double.

difftime <- function(x = double(), units = "secs") {
  x <- as.double(x)
  new_difftime(x, units = units)
}

difftime(1:10)

#> Time differences in secs

#> [1] 1 2 3 4 5 6 7 8 9 10

 Often, the most natural representation of a complex object is a string. For example, it’s very
convenient to specify factors with a character vector. The code below shows a simple version
of factor(): it takes a character vector, and guesses that the levels should be the unique values.
This is not always correct (since some levels might not be seen in the data), but it’s a useful
default.

factor <- function(x = character(), levels = unique(x)) {
  ind <- match(x, levels)
  validate_factor(new_factor(ind, levels))
}

factor(c("a", "a", "b"))

#> [1] a a b
#> Levels: a b

 Some complex objects are most naturally specified by multiple simple components. For
example, we think it’s natural to construct a date-time by supplying the individual components
(year, month, day etc). That leads us to this POSIXct() helper that resembles the existing
ISODatetime() function:

POSIXct <- function(year = integer(),


month = integer(),
day = integer(),
hour = 0L,
minute = 0L,
sec = 0,
tzone = "") {
ISOdatetime(year, month, day, hour, minute, sec, tz = tzone)
}
POSIXct(2020, 1, 1, tzone = "America/New_York")
#> [1] "2020-01-01 EST"
For more complicated classes, you should feel free to go beyond these patterns to make life as easy
as possible for your users.

7.6.3 Generics and Methods


The job of an S3 generic is to perform method dispatch, i.e. find the specific implementation for a
class. Method dispatch is performed by UseMethod(), which every generic calls. UseMethod() takes
two arguments: the name of the generic function (required), and the argument to use for method
dispatch (optional). If you omit the second argument, it will dispatch based on the first argument,
which is almost always what is desired.
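To see the effect of the second argument, here is a small sketch (the generic name describe2 is made up for illustration); it dispatches on its second argument rather than its first:

describe2 <- function(x, y) {
  UseMethod("describe2", y)  # dispatch on y, not x
}

describe2.factor <- function(x, y) "second argument is a factor"

describe2(1, factor("a"))

#> [1] "second argument is a factor"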

Most generics are very simple, and consist of only a call to UseMethod(). Take mean() for example:

mean

#> function (x, ...)

#> UseMethod("mean")

#> <bytecode: 0x1235500>

#> <environment: namespace:base>

Creating your own generic is similarly simple:

my_new_generic <- function(x) {

UseMethod("my_new_generic")

}
You don’t pass any of the arguments of the generic to UseMethod(); it uses deep magic to pass them to the
method automatically. The precise process is complicated and frequently surprising, so you should
avoid doing any computation in a generic. To learn the full details, carefully read the Technical Details
section in ?UseMethod.

7.6.3.1 Method Dispatch

How does UseMethod() work? It basically creates a vector of method names, paste0("generic", ".",
c(class(x), "default")), and then looks for each potential method in turn. We can see this in action with
sloop::s3_dispatch(). You give it a call to an S3 generic, and it lists all the possible methods. For
example, what method is called when you print a Date object?

x <- Sys.Date()

s3_dispatch(print(x))

#> => print.Date

#> * print.default

The output here is simple:

 => indicates the method that is called, here print.Date()


 * indicates a method that is defined, but not called, here print.default().

The “default” class is a special pseudo-class. This is not a real class, but is included to make it possible
to define a standard fallback that is found whenever a class-specific method is not available.

The essence of method dispatch is quite simple, but as the unit proceeds you’ll see it get
progressively more complicated to encompass inheritance, base types, internal generics, and group
generics. The code below shows a couple of more complicated cases which we’ll discuss in the next
section:

x <- matrix(1:10, nrow = 2)

s3_dispatch(mean(x))

#> mean.matrix

#> mean.integer

#> mean.numeric

#> => mean.default

s3_dispatch(sum(Sys.time()))

#> sum.POSIXct

#> sum.POSIXt

#> sum.default

#> => Summary.POSIXct

#> Summary.POSIXt
#> Summary.default

#> -> sum (internal)

7.6.3.2 Finding Methods

sloop::s3_dispatch() lets you find the specific method used for a single call. What if you want to find
all methods defined for a generic or associated with a class? That’s the job of
sloop::s3_methods_generic() and sloop::s3_methods_class():

s3_methods_generic("mean")

#> # A tibble: 6 x 4

#> generic class visible source

#> <chr> <chr> <lgl> <chr>

#> 1 mean Date TRUE base

#> 2 mean default TRUE base

#> 3 mean difftime TRUE base

#> 4 mean POSIXct TRUE base

#> 5 mean POSIXlt TRUE base

#> 6 mean quosure FALSE registered S3method

s3_methods_class("ordered")

#> # A tibble: 4 x 4

#> generic class visible source

#> <chr> <chr> <lgl> <chr>

#> 1 as.data.frame ordered TRUE base

#> 2 Ops ordered TRUE base

#> 3 relevel ordered FALSE registered S3method

#> 4 Summary ordered TRUE base

7.6.3.3 Creating Methods

There are two wrinkles to be aware of when you create a new method:

 First, you should only ever write a method if you own the generic or the class. R will allow you
to define a method even if you don’t, but it is exceedingly bad manners. Instead, work with
the author of either the generic or the class to add the method in their code.
 A method must have the same arguments as its generic. This is enforced in packages by R CMD
check, but it’s good practice even if you’re not creating a package.
There is one exception to this rule: if the generic has ..., the method can contain a superset of the
arguments. This allows methods to take arbitrary additional arguments. The downside of using ...,
however, is that any misspelled arguments will be silently swallowed.
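Here is a small sketch of that pitfall (the class name myclass and the verbose argument are made up for illustration):

print.myclass <- function(x, ..., verbose = FALSE) {
  cat("myclass object (verbose = ", verbose, ")\n", sep = "")
  invisible(x)
}

obj <- structure(list(), class = "myclass")

print(obj, verbos = TRUE) # the misspelled argument is silently absorbed by ...

#> myclass object (verbose = FALSE)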

7.6.4 Object Styles


So far we have focussed on vector style classes like Date and factor. These have the key property that
length(x) represents the number of observations in the vector. There are three variants that do not
have this property:

 Record style objects use a list of equal-length vectors to represent individual components of
the object. The best example of this is POSIXlt, which underneath the hood is a list of 11 date-
time components like year, month, and day. Record style classes override length() and
subsetting methods to conceal this implementation detail.

x <- as.POSIXlt(ISOdatetime(2020, 1, 1, 0, 0, 1:3))

x

#> [1] "2020-01-01 00:00:01 UTC" "2020-01-01 00:00:02 UTC"

#> [3] "2020-01-01 00:00:03 UTC"

length(x)

#> [1] 3

length(unclass(x))

#> [1] 9

x[[1]] # the first date time

#> [1] "2020-01-01 00:00:01 UTC"

unclass(x)[[1]] # the first component, the number of seconds

#> [1] 1 2 3

 Data frames are similar to record style objects in that both use lists of equal length vectors.
However, data frames are conceptually two dimensional, and the individual components are
readily exposed to the user. The number of observations is the number of rows, not the
length:

x <- data.frame(x = 1:100, y = 1:100)

length(x)

#> [1] 2

nrow(x)

#> [1] 100


 Scalar objects typically use a list to represent a single thing. For example, an lm object is a list
of length 12 but it represents one model.

mod <- lm(mpg ~ wt, data = mtcars)

length(mod)

#> [1] 12

Scalar objects can also be built on top of functions, calls, and environments. This is less
generally useful, but you can see applications in stats::ecdf(), R6, and rlang::quo().
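For instance, stats::ecdf() returns a function that carries extra classes:

e <- stats::ecdf(rpois(100, 10))

class(e)

#> [1] "ecdf" "stepfun" "function"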

7.6.5 Inheritance
S3 classes can share behaviour through a mechanism called inheritance. Inheritance is powered by
three ideas:

 The class can be a character vector. For example, the ordered and POSIXct classes have two
components in their class:

class(ordered("x"))

#> [1] "ordered" "factor"

class(Sys.time())

#> [1] "POSIXct" "POSIXt"

 If a method is not found for the class in the first element of the vector, R looks for a method
for the second class (and so on):

s3_dispatch(print(ordered("x")))

#> print.ordered

#> => print.factor

#> * print.default

s3_dispatch(print(Sys.time()))

#> => print.POSIXct

#> print.POSIXt

#> * print.default

 A method can delegate work by calling NextMethod(). We’ll come back to that very shortly;
for now, note that s3_dispatch() reports delegation with ->.

s3_dispatch(ordered("x")[1])

#> [.ordered

#> => [.factor

#> [.default

#> -> [ (internal)


s3_dispatch(Sys.time()[1])

#> => [.POSIXct

#> [.POSIXt

#> [.default

#> -> [ (internal)

Before we continue, we need a bit of vocabulary to describe the relationship between the classes that
appear together in a class vector. We’ll say that ordered is a subclass of factor because it always
appears before it in the class vector, and, conversely, we’ll say factor is a superclass of ordered.

S3 imposes no restrictions on the relationship between sub- and superclasses but your life will be
easier if you impose some. It is recommended to adhere to two simple principles when creating a
subclass:

 The base type of the subclass should be the same as that of the superclass.
 The attributes of the subclass should be a superset of the attributes of the superclass.

POSIXt does not adhere to these principles because POSIXct has type double, and POSIXlt has type list.
This means that POSIXt is not a superclass, and illustrates that it’s quite possible to use the S3
inheritance system to implement other styles of code sharing (here POSIXt plays a role more like an
interface), but you’ll need to figure out safe conventions yourself.

7.6.5.1 NextMethod()

NextMethod() is the hardest part of inheritance to understand, so we’ll start with a concrete example
for the most common use case: [. We’ll start by creating a simple toy class: a secret class that hides its
output when printed:

new_secret <- function(x = double()) {
  stopifnot(is.double(x))
  structure(x, class = "secret")
}

print.secret <- function(x, ...) {
  print(strrep("x", nchar(x)))
  invisible(x)
}

x <- new_secret(c(15, 1, 456))

x

#> [1] "xx" "x" "xxx"

This works, but the default [ method doesn’t preserve the class:

s3_dispatch(x[1])
#> [.secret

#> [.default

#> => [ (internal)

x[1]

#> [1] 15

To fix this, we need to provide a [.secret method. How could we implement this method? The naive
approach won’t work because we’ll get stuck in an infinite loop:

`[.secret` <- function(x, i) {
  new_secret(x[i])
}

Instead, we need some way to call the underlying [ code, i.e. the implementation that would get called
if we didn’t have a [.secret method. One approach would be to unclass() the object:

`[.secret` <- function(x, i) {
  x <- unclass(x)
  new_secret(x[i])
}

x[1]

#> [1] "xx"

This works, but is inefficient because it creates a copy of x. A better approach is to use NextMethod(),
which concisely solves the problem by delegating to the method that would have been called if [.secret
didn’t exist:

`[.secret` <- function(x, i) {
  new_secret(NextMethod())
}

x[1]

#> [1] "xx"

We can see what’s going on with sloop::s3_dispatch():

s3_dispatch(x[1])

#> => [.secret

#> [.default

#> -> [ (internal)

The => indicates that [.secret is called, but that NextMethod() delegates work to the underlying
internal [ method, as shown by the ->.
As with UseMethod(), the precise semantics of NextMethod() are complex. In particular, it tracks the
list of potential next methods with a special variable, which means that modifying the object that’s
being dispatched upon will have no impact on which method gets called next.

7.6.5.2 Allowing Subclassing

When you create a class, you need to decide if you want to allow subclasses, because it requires some
changes to the constructor and careful thought in your methods.

To allow subclasses, the parent constructor needs to have ... and class arguments:

new_secret <- function(x, ..., class = character()) {
  stopifnot(is.double(x))

  structure(
    x,
    ...,
    class = c(class, "secret")
  )
}

Then the subclass constructor can just call the parent class constructor with additional arguments
as needed. For example, imagine we want to create a supersecret class which also hides the number
of characters:

new_supersecret <- function(x) {
  new_secret(x, class = "supersecret")
}

print.supersecret <- function(x, ...) {
  print(rep("xxxxx", length(x)))
  invisible(x)
}

x2 <- new_supersecret(c(15, 1, 456))

x2

#> [1] "xxxxx" "xxxxx" "xxxxx"

To allow inheritance, you also need to think carefully about your methods, as you can no longer use
the constructor. If you do, the method will always return the same class, regardless of the input. This
forces whoever makes a subclass to do a lot of extra work.

Concretely, this means we need to revise the [.secret method. Currently it always returns a secret(),
even when given a supersecret:
`[.secret` <- function(x, ...) {
  new_secret(NextMethod())
}

x2[1:3]

#> [1] "xx" "x" "xxx"

We want to make sure that [.secret returns the same class as x even if it’s a subclass. You’ll need to
use the vctrs package, which provides a solution in the form of the vctrs::vec_restore() generic. This
generic takes two inputs: an object which has lost subclass information, and a template object to use
for restoration.

Typically vec_restore() methods are quite simple: you just call the constructor with appropriate
arguments:

vec_restore.secret <- function(x, to, ...) new_secret(x)

vec_restore.supersecret <- function(x, to, ...) new_supersecret(x)

(If your class has attributes, you’ll need to pass them from the to argument into the constructor.)

Now we can use vec_restore() in the [.secret method:

`[.secret` <- function(x, ...) {
  vctrs::vec_restore(NextMethod(), x)
}

x2[1:3]

#> [1] "xxxxx" "xxxxx" "xxxxx"

If you build your class using the tools provided by the vctrs package, [ will gain this behaviour
automatically. You will only need to provide your own [ method if you use attributes that depend on
the data or want non-standard subsetting behaviour. See ?vctrs::new_vctr for details.

7.6.6 Dispatch Details


7.6.6.1 S3 and Base Types

What happens when you call an S3 generic with a base object, i.e. an object with no class? You might
think it would dispatch on what class() returns:

class(matrix(1:5))

#> [1] "matrix"

But unfortunately dispatch actually occurs on the implicit class, which has three components:

 The string “array” or “matrix” if the object has dimensions


 The result of typeof() with a few minor tweaks
 The string “numeric” if object is “integer” or “double”
There is no base function that will compute the implicit class, but you can use sloop::s3_class()

s3_class(matrix(1:5))

#> [1] "matrix" "integer" "numeric"

This is used by s3_dispatch():

s3_dispatch(print(matrix(1:5)))

#> print.matrix

#> print.integer

#> print.numeric

#> => print.default

This means that the class() of an object does not uniquely determine its dispatch:

x1 <- 1:5

class(x1)

#> [1] "integer"

s3_dispatch(mean(x1))

#> mean.integer

#> mean.numeric

#> => mean.default

x2 <- structure(x1, class = "integer")

class(x2)

#> [1] "integer"

s3_dispatch(mean(x2))

#> mean.integer

#> => mean.default

7.6.6.2 Internal Generics

Some base functions, like [, sum(), and cbind(), are called internal generics because they don’t call
UseMethod() but instead call the C functions DispatchGroup() or DispatchOrEval(). s3_dispatch()
shows internal generics by including the name of the generic followed by (internal):

s3_dispatch(Sys.time()[1])

#> => [.POSIXct

#> [.POSIXt

#> [.default
#> -> [ (internal)

For performance reasons, internal generics do not dispatch to methods unless the class attribute has
been set, which means that internal generics do not use the implicit class. Again, if you’re ever
confused about method dispatch, you can rely on s3_dispatch().
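Here is a sketch of that behaviour (length.integer is defined purely for illustration; you should not define such a method in real code):

length.integer <- function(x) 10L

length(1:5) # no class attribute is set, so the internal default runs

#> [1] 5

length(structure(1:5, class = "integer")) # an explicit class attribute triggers dispatch

#> [1] 10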

7.6.6.3 Group Generics

Group generics are the most complicated part of S3 method dispatch because they involve both
NextMethod() and internal generics. Like internal generics, they only exist in base R, and you cannot
define your own group generic.

There are four group generics:

 Math: abs(), sign(), sqrt(), floor(), cos(), sin(), log(), and more (see ?Math for the complete list).
 Ops: +, -, *, /, ^, %%, %/%, &, |, !, ==, !=, <, <=, >=, and >.
 Summary: all(), any(), sum(), prod(), min(), max(), and range().
 Complex: Arg(), Conj(), Im(), Mod(), Re().

Defining a single group generic for your class overrides the default behaviour for all of the members
of the group. Methods for group generics are looked for only if the methods for the specific generic
do not exist:

s3_dispatch(sum(Sys.time()))

#> sum.POSIXct

#> sum.POSIXt

#> sum.default

#> => Summary.POSIXct

#> Summary.POSIXt

#> Summary.default

#> -> sum (internal)

Most group generics involve a call to NextMethod(). For example, take difftime() objects. If you look
at the method dispatch for abs(), you’ll see there’s a Math group generic defined.

y <- as.difftime(10, units = "mins")

s3_dispatch(abs(y))

#> abs.difftime

#> abs.default

#> => Math.difftime

#> Math.default

#> -> abs (internal)

Math.difftime basically looks like this:


Math.difftime <- function(x, ...) {
  new_difftime(NextMethod(), units = attr(x, "units"))
}

It dispatches to the next method, here the internal default, to perform the actual computation, then
restores the class and attributes.

Inside a group generic function, the special variable .Generic provides the name of the actual generic function that was called.
This can be useful when producing error messages, and can sometimes be useful if you need to
manually re-call the generic with different arguments.
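As a small sketch (the loud class is made up for illustration):

Math.loud <- function(x, ...) {
  message("dispatched via generic: ", .Generic) # report which generic fired
  structure(NextMethod(), class = "loud")       # delegate, then restore class
}

y <- structure(c(1, 4, 9), class = "loud")

z <- sqrt(y)

#> dispatched via generic: sqrt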

7.6.6.4 Double Dispatch

Generics in the Ops group, which includes the two-argument arithmetic and Boolean operators like -
and &, implement a special type of method dispatch. They dispatch on the type of both of the
arguments, which is called double dispatch. This is necessary to preserve the commutative property
of many operators, i.e. a + b should equal b + a. Take the following simple example:

date <- as.Date("2017-01-01")

integer <- 1L

date + integer

#> [1] "2017-01-02"

integer + date

#> [1] "2017-01-02"

If + dispatched only on the first argument, it would return different values for the two cases. To
overcome this problem, generics in the Ops group use a slightly different strategy from usual. Rather
than doing a single method dispatch, they do two, one for each input. There are three possible
outcomes of this lookup:

 The methods are the same, so it doesn’t matter which method is used.
 The methods are different, and R falls back to the internal method with a warning.
 One method is internal, in which case R calls the other method.

This approach is error prone so if you want to implement robust double dispatch for algebraic
operators, use the vctrs package. See ?vctrs::vec_arith for details.

Check your Progress 1


State True or False.

1. An S3 object is a base type with at least a class attribute.


2. A helper should always have a different name from the class.
3. In S3, method dispatch is performed by NextMethod().
4. Record style objects use a list of equal-length vectors to represent individual components of
the object.
Activity 1
1. Describe the difference in behaviour in these two calls.

set.seed(1014)

some_days <- as.Date("2017-01-31") + sample(10, 5)

mean(some_days)

#> [1] "2017-02-06"

mean(unclass(some_days))

#> [1] 17203

2. What class of object does the following code return? What base type is it built on? What
attributes does it use?

x <- table(rpois(100, 5))

x

#>

#> 1 2 3 4 5 6 7 8 9 10

#> 7 5 18 14 15 15 14 4 5 3

3. Read the documentation for utils::as.roman(). How would you write a constructor for this
class? Does it need a validator? What might a helper do?
4. Carefully read the documentation for UseMethod() and explain why the following code
returns the results that it does. What two usual rules of function evaluation does UseMethod()
violate?

g <- function(x) {
  x <- 10
  y <- 10
  UseMethod("g")
}

g.default <- function(x) c(x = x, y = y)

x <- 1

y <- 1

g(x)

#> x y

#> 1 10
5. What do you expect this code to return? What does it actually return? Why?

generic2 <- function(x) UseMethod("generic2")

generic2.a1 <- function(x) "a1"

generic2.a2 <- function(x) "a2"

generic2.b <- function(x) {
  class(x) <- "a1"
  NextMethod()
}

generic2(structure(list(), class = c("b", "a2")))

Summary
 Everything in R—ranging from numbers to character strings to matrices—is an object.
 R promotes encapsulation, which is packaging separate but related data items into one class
instance. Encapsulation helps you keep track of related variables, enhancing clarity.
 R classes are polymorphic, which means that the same function call leads to different
operations for objects of different classes. For instance, a call to print() on an object of a
certain class triggers a call to a print function tailored to that class. Polymorphism promotes
reusability.
 R allows inheritance, which allows extending a given class to a more specialised class.
 There are multiple OOP systems to choose from: S3, R6, and S4. S3 and S4 are provided by
base R. R6 is provided by the R6 package, and is similar to the Reference Classes, or RC for
short, from base R.

Keywords
 Polymorphism: It is the provision of a single interface to entities of different types or the use
of a single symbol to represent multiple different types.
 Inheritance: It is the mechanism of basing an object or class upon another object or class,
retaining similar implementation.
 Encapsulation: Encapsulation is defined as the wrapping up of data under a single unit.
 Class: A class is a blueprint for creating objects, providing initial values for state and
implementations of behaviour.
 Generic function: It is a function defined for polymorphism.

Self-Assessment Questions
1. Make a list of commonly used base R functions that contain . in their name but are not S3
methods.
2. What does the as.data.frame.data.frame() method do? Why is it confusing? How could you
avoid this confusion in your own code?
3. Write a constructor for data.frame objects. What base type is a data frame built on? What
attributes does it use? What are the restrictions placed on the individual elements? What
about the names?
4. Write a short note on Sloop.
5. Explain the OOP concept in detail.

Answers to Check your Progress


Check your Progress 1

State True or False.

1. True
2. False
3. False
4. True
Suggested Reading
1. The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff.
2. https://kingaa.github.io/R_Tutorial/#data-structures-in-r.
3. https://d1b10bmlvqabco.cloudfront.net/attach/ighbo26t3ua52t/igp9099yy4v10/igz7vp4w5
su9/OReilly_HandsOn_Programming_with_R_2014.pdf.
4. The R Book by Michael J. Crawley, Imperial College London at Silwood Park, UK.
5. R Programming for Data Science by Roger D. Peng.
6. An introduction to R by Longhow Lam.
7. Learning Statistics with R by Danielle Navarro.
8. Advanced R by Hadley Wickham, The R Series.
Unit 8
Object Oriented Programming – II
Structure:
8.1 Introduction

8.2 R6

8.2.1 Classes and Methods

8.2.2 Controlling Access

8.2.3 Reference Semantics

8.3 S4

8.3.1 Basics

8.3.2 Classes

8.3.3 Generics and Methods

8.3.4 Method Dispatch

8.3.5 S4 and S3

8.4 Trade-Offs

8.4.1 S4 versus S3

8.4.2 R6 versus S3

Summary

Keywords

Self-Assessment Questions

Answers to Check your Progress

Suggested Reading

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) as requested by the work’s creator or
licensees. This license is available at https://creativecommons.org/licenses/by-nc-sa/4.0/ .
Objectives

After going through this unit, you will be able to:

 Understand the main components of R6 and S4


 Discuss the access mechanisms of R6
 Discuss the interaction between S4 and S3

8.1 INTRODUCTION
R6 is very similar to a built-in OO system called reference classes, or RC for short. R6 is much simpler.
Both R6 and RC are built on top of environments, but while R6 uses S3, RC uses S4. R6 has a simpler
mechanism for cross-package subclassing, which just works without you having to think about it. This
unit covers the R6 and S4 systems in details.

8.2 R6
R6 has two special properties:

 It uses the encapsulated OOP paradigm, which means that methods belong to objects, not
generics, and you call them like object$method().
 R6 objects are mutable, which means that they are modified in place, and hence have
reference semantics.

If you’ve learned OOP in another programming language, it’s likely that R6 will feel very natural, and
you’ll be inclined to prefer it over S3. Resist the temptation to follow the path of least resistance: in
most cases R6 will lead you to non-idiomatic R code.

R6 is very similar to a base OOP system called reference classes, or RC for short. Because R6 is not built
into base R, you’ll need to install and load the R6 package to use it:

# install.packages("R6")

library(R6)

R6 objects have reference semantics which means that they are modified in-place, not copied-on-
modify.

8.2.1 Classes and Methods


R6 only needs a single function call to create both the class and its methods: R6::R6Class(). This is the
only function from the package that you’ll ever use!

The following example shows the two most important arguments to R6Class():

 The first argument is the classname. It’s not strictly needed, but it improves error messages
and makes it possible to use R6 objects with S3 generics. By convention, R6 classes have
UpperCamelCase names.
 The second argument, public, supplies a list of methods (functions) and fields (anything else)
that make up the public interface of the object. By convention, methods and fields use
snake_case. Methods can access the methods and fields of the current object via self$.

Accumulator <- R6Class("Accumulator", list(
  sum = 0,
  add = function(x = 1) {
    self$sum <- self$sum + x
    invisible(self)
  }
))

You should always assign the result of R6Class() into a variable with the same name as the class,
because R6Class() returns an R6 object that defines the class:

Accumulator

#> <Accumulator> object generator

#> Public:

#> sum: 0

#> add: function (x = 1)

#> clone: function (deep = FALSE)

#> Parent env: <environment: R_GlobalEnv>

#> Locked objects: TRUE

#> Locked class: FALSE

#> Portable: TRUE

You construct a new object from the class by calling the new() method. In R6, methods belong to
objects, so you use $ to access new():

x <- Accumulator$new()

You can then call methods and access fields with $:

x$add(4)

x$sum

#> [1] 4

In this class, the fields and methods are public, which means that you can get or set the value of any
field. Later, we’ll see how to use private fields and methods to prevent casual access to the internals
of your class.

To make it clear when we’re talking about fields and methods as opposed to variables and functions,
we prefix their names with $. For example, the Accumulator class has the field $sum and the method
$add().

8.2.1.1 Method Chaining

$add() is called primarily for its side-effect of updating $sum.

Accumulator <- R6Class("Accumulator", list(
  sum = 0,
  add = function(x = 1) {
    self$sum <- self$sum + x
    invisible(self)
  }
))

Side-effect R6 methods should always return self invisibly. This returns the “current” object and makes
it possible to chain together multiple method calls:

x$add(10)$add(10)$sum

#> [1] 24

For readability, you might put one method call on each line:

x$

add(10)$

add(10)$

sum

#> [1] 44

This technique is called method chaining and is commonly used in languages like Python and
JavaScript. Method chaining is deeply related to the pipe, and we’ll discuss the pros and cons of each
approach in next section.

8.2.1.2 Important Methods

There are two important methods that should be defined for most classes: $initialize() and $print().
They’re not required, but providing them will make your class easier to use.

$initialize() overrides the default behaviour of $new(). For example, the following code defines an
Person class with fields $name and $age. To ensure that $name is always a single string, and $age is
always a single number, I placed checks in $initialize().

Person <- R6Class("Person", list(

name = NULL,

age = NA,

initialize = function(name, age = NA) {

stopifnot(is.character(name), length(name) == 1)

stopifnot(is.numeric(age), length(age) == 1)

self$name <- name

self$age <- age


}

))

hadley <- Person$new("Hadley", age = "thirty-eight")

#> Error in .subset2(public_bind_env, "initialize")(...): is.numeric(age) is

#> not TRUE

hadley <- Person$new("Hadley", age = 38)

If you have more expensive validation requirements, implement them in a separate $validate() and
only call when needed.
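A minimal sketch of that pattern (the Checked class and the $validate() name are our own illustration, not something R6 requires):

Checked <- R6Class("Checked", list(
  values = NULL,
  initialize = function(values) {
    self$values <- values # keep construction cheap: no checks here
  },
  validate = function() {
    stopifnot(is.numeric(self$values)) # the expensive checks live here
    invisible(self)
  }
))

obj <- Checked$new(1:10)

obj$validate()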

Defining $print() allows you to override the default printing behaviour. As with any R6 method called
for its side effects, $print() should return invisible(self).

Person <- R6Class("Person", list(
  name = NULL,
  age = NA,
  initialize = function(name, age = NA) {
    self$name <- name
    self$age <- age
  },
  print = function(...) {
    cat("Person: \n")
    cat(" Name: ", self$name, "\n", sep = "")
    cat(" Age: ", self$age, "\n", sep = "")
    invisible(self)
  }
))

hadley2 <- Person$new("Hadley")

hadley2

#> Person:

#> Name: Hadley

#> Age: NA

This code illustrates an important aspect of R6. Because methods are bound to individual objects, the
previously created hadley object does not get this new method:

hadley
#> <Person>

#> Public:

#> age: 38

#> clone: function (deep = FALSE)

#> initialize: function (name, age = NA)

#> name: Hadley

hadley$print

#> NULL

From the perspective of R6, there is no relationship between hadley and hadley2; they just
coincidentally share the same class name. This doesn’t cause problems when using already developed
R6 objects but can make interactive experimentation confusing. If you’re changing the code and can’t
figure out why the results of method calls aren’t any different, make sure you’ve re-constructed R6
objects with the new class.

8.2.1.3 Adding Methods after Creation

Instead of continuously creating new classes, it’s also possible to modify the fields and methods of an
existing class. This is useful when exploring interactively, or when you have a class with many functions
that you’d like to break up into pieces. Add new elements to an existing class with $set(), supplying
the visibility, the name, and the component.

Accumulator <- R6Class("Accumulator")

Accumulator$set("public", "sum", 0)

Accumulator$set("public", "add", function(x = 1) {
  self$sum <- self$sum + x
  invisible(self)
})

As above, new methods and fields are only available to new objects; they are not retrospectively
added to existing objects.
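
For example, an object constructed after the $set() calls picks up the newly assembled members and behaves just like the original class (the result follows from starting the sum at 0):

x <- Accumulator$new()
x$add(2)$add(3)$sum
#> [1] 5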

8.2.1.4 Inheritance

To inherit behaviour from an existing class, provide the class object to the inherit argument:

AccumulatorChatty <- R6Class("AccumulatorChatty",
  inherit = Accumulator,
  public = list(
    add = function(x = 1) {
      cat("Adding ", x, "\n", sep = "")
      super$add(x = x)
    }
  )
)

x2 <- AccumulatorChatty$new()

x2$add(10)$add(1)$sum

#> Adding 10

#> Adding 1

#> [1] 11

$add() overrides the superclass implementation, but we can still delegate to the superclass implementation by using super$. (This is analogous to NextMethod() in S3.) Any methods which are not overridden will use the implementation in the parent class.

8.2.1.5 Introspection

Every R6 object has an S3 class that reflects its hierarchy of R6 classes. This means that the easiest way
to determine the class (and all classes it inherits from) is to use class():

class(hadley2)

#> [1] "Person" "R6"

The S3 hierarchy includes the base “R6” class. This provides common behaviour, including a print.R6()
method which calls $print(), as described above.

You can list all methods and fields with names():

names(hadley2)

#> [1] ".__enclos_env__" "age" "name" "clone"

#> [5] "print" "initialize"

We defined $name, $age, $print, and $initialize.

8.2.2 Controlling Access


R6Class() has two other arguments that work similarly to public:

 private allows you to create fields and methods that are only available from within the class,
not outside of it.
 active allows you to use accessor functions to define dynamic, or active, fields.

These are described in the following sections.

8.2.2.1 Privacy

With R6, you can define private fields and methods, elements that can only be accessed from within
the class, not from the outside. There are two things that you need to know to take advantage of
private elements:
 The private argument to R6Class works in the same way as the public argument: you give it a
named list of methods (functions) and fields (everything else).
 Fields and methods defined in private are available within the methods using private$ instead
of self$. You cannot access private fields or methods outside of the class.

To make this concrete, we could make $age and $name fields of the Person class private. With this
definition of Person, we can only set $age and $name during object creation, and we cannot access
their values from outside of the class.

Person <- R6Class("Person",
  public = list(
    initialize = function(name, age = NA) {
      private$name <- name
      private$age <- age
    },
    print = function(...) {
      cat("Person: \n")
      cat(" Name: ", private$name, "\n", sep = "")
      cat(" Age: ", private$age, "\n", sep = "")
      invisible(self)
    }
  ),
  private = list(
    age = NA,
    name = NULL
  )
)

hadley3 <- Person$new("Hadley")

hadley3

#> Person:

#> Name: Hadley

#> Age: NA

hadley3$name

#> NULL

The distinction between public and private fields is important when you create complex networks of
classes, and you want to make it as clear as possible what it’s ok for others to access. Anything that’s
private can be more easily refactored because you know others aren’t relying on it. Private methods
tend to be less important in R compared to other programming languages because the object
hierarchies in R tend to be simpler.

8.2.2.2 Active Fields

Active fields allow you to define components that look like fields from the outside, but are defined
with functions, like methods. Active fields are implemented using active bindings. Each active binding
is a function that takes a single argument: value. If the argument is missing(), the value is being
retrieved; otherwise it’s being modified.

For example, you could make an active field random that returns a different value every time you
access it:

Rando <- R6::R6Class("Rando", active = list(
  random = function(value) {
    if (missing(value)) {
      runif(1)
    } else {
      stop("Can't set `$random`", call. = FALSE)
    }
  }
))

x <- Rando$new()

x$random

#> [1] 0.0808

x$random

#> [1] 0.834

x$random

#> [1] 0.601

Active fields are particularly useful in conjunction with private fields, because they make it possible to
implement components that look like fields from the outside but provide additional checks. For
example, we can use them to make a read-only age field, and to ensure that name is a length 1
character vector.

Person <- R6Class("Person",
  private = list(
    .age = NA,
    .name = NULL
  ),
  active = list(
    age = function(value) {
      if (missing(value)) {
        private$.age
      } else {
        stop("`$age` is read only", call. = FALSE)
      }
    },
    name = function(value) {
      if (missing(value)) {
        private$.name
      } else {
        stopifnot(is.character(value), length(value) == 1)
        private$.name <- value
        self
      }
    }
  ),
  public = list(
    initialize = function(name, age = NA) {
      private$.name <- name
      private$.age <- age
    }
  )
)

hadley4 <- Person$new("Hadley", age = 38)

hadley4$name

#> [1] "Hadley"

hadley4$name <- 10
#> Error in (function (value) : is.character(value) is not TRUE

hadley4$age <- 20

#> Error: `$age` is read only

8.2.3 Reference Semantics


One of the big differences between R6 and most other objects is that they have reference semantics.
The primary consequence of reference semantics is that objects are not copied when modified:

y1 <- Accumulator$new()

y2 <- y1

y1$add(10)

c(y1 = y1$sum, y2 = y2$sum)

#> y1 y2

#> 10 10

Instead, if you want a copy, you’ll need to explicitly $clone() the object:

y1 <- Accumulator$new()

y2 <- y1$clone()

y1$add(10)

c(y1 = y1$sum, y2 = y2$sum)

#> y1 y2

#> 10 0

($clone() does not recursively clone nested R6 objects. If you want that, you’ll need to use $clone(deep
= TRUE).)
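
To see the difference, here is a small sketch with two illustrative classes (Wallet and Owner are hypothetical names): a shallow clone shares any nested R6 object, while a deep clone copies it.

Wallet <- R6::R6Class("Wallet", list(cash = 0))

Owner <- R6::R6Class("Owner", list(
  wallet = NULL,
  initialize = function() {
    self$wallet <- Wallet$new()
  }
))

a <- Owner$new()
b <- a$clone()             # shallow: b$wallet is the same object as a$wallet
d <- a$clone(deep = TRUE)  # deep: d$wallet is an independent copy

a$wallet$cash <- 100
b$wallet$cash
#> [1] 100
d$wallet$cash
#> [1] 0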

There are three other less obvious consequences:

 It is harder to reason about code that uses R6 objects because you need to understand more
context.
 It makes sense to think about when an R6 object is deleted, and you can write a $finalize() to
complement the $initialize().
 If one of the fields is an R6 object, you must create it inside $initialize(), not R6Class().

These consequences are described in more detail below.

8.2.3.1 Reasoning

Generally, reference semantics makes code harder to reason about. Take this very simple example:
x <- list(a = 1)

y <- list(b = 2)

z <- f(x, y)

For the vast majority of functions, you know that the final line only modifies z.

Take a similar example that uses an imaginary List reference class:

x <- List$new(a = 1)

y <- List$new(b = 2)

z <- f(x, y)

The final line is much harder to reason about: if f() calls methods of x or y, it might modify them as
well as z. This is the biggest potential downside of R6 and you should take care to avoid it by writing
functions that either return a value, or modify their R6 inputs, but not both. That said, doing both can
lead to substantially simpler code in some cases.

8.2.3.2 Finalizer

One useful property of reference semantics is that it makes sense to think about when an R6 object is
finalised, i.e. when it’s deleted. This doesn’t make sense for most objects because copy-on-modify
semantics mean that there may be many transient versions of an object. For example, the following
creates two factor objects: the second is created when the levels are modified, leaving the first to be
destroyed by the garbage collector.

x <- factor(c("a", "b", "c"))

levels(x) <- c("c", "b", "a")

Since R6 objects are not copied-on-modify, they are only deleted once, and it makes sense to think
about $finalize() as a complement to $initialize(). Finalizers usually play a similar role to on.exit(),
cleaning up any resources created by the initializer. For example, the following class wraps up a
temporary file, automatically deleting it when the class is finalized.

TemporaryFile <- R6Class("TemporaryFile", list(
  path = NULL,
  initialize = function() {
    self$path <- tempfile()
  },
  finalize = function() {
    message("Cleaning up ", self$path)
    unlink(self$path)
  }
))
The finalize method will be run when the object is deleted (or more precisely, by the first garbage
collection after the object has been unbound from all names) or when R exits. This means that the
finalizer can be called effectively anywhere in your R code, and therefore it’s almost impossible to
reason about finalizer code that touches shared data structures. Avoid these potential problems by
only using the finalizer to clean up private resources allocated by the initializer.

tf <- TemporaryFile$new()

rm(tf)

#> Cleaning up /tmp/Rtmpk73JdI/file155f31d8424bd

8.2.3.3 R6 Fields

A final consequence of reference semantics can crop up where you don’t expect it. If you use an R6
class as the default value of a field, it will be shared across all instances of the object! Take the
following code: we want to create a temporary database every time we call
TemporaryDatabase$new(), but the current code always uses the same path.

TemporaryDatabase <- R6Class("TemporaryDatabase", list(
  con = NULL,
  file = TemporaryFile$new(),
  initialize = function() {
    self$con <- DBI::dbConnect(RSQLite::SQLite(), path = file$path)
  },
  finalize = function() {
    DBI::dbDisconnect(self$con)
  }
))

db_a <- TemporaryDatabase$new()

db_b <- TemporaryDatabase$new()

db_a$file$path == db_b$file$path

#> [1] TRUE

(If you’re familiar with Python, this is very similar to the “mutable default argument” problem.)

The problem arises because TemporaryFile$new() is called only once when the TemporaryDatabase
class is defined. To fix the problem, we need to make sure it’s called every time that
TemporaryDatabase$new() is called, i.e. we need to put it in $initialize():

TemporaryDatabase <- R6Class("TemporaryDatabase", list(
  con = NULL,
  file = NULL,
  initialize = function() {
    self$file <- TemporaryFile$new()
    self$con <- DBI::dbConnect(RSQLite::SQLite(), path = file$path)
  },
  finalize = function() {
    DBI::dbDisconnect(self$con)
  }
))

db_a <- TemporaryDatabase$new()

db_b <- TemporaryDatabase$new()

db_a$file$path == db_b$file$path

#> [1] FALSE

8.3 S4
S4 provides a formal approach to functional OOP. The underlying ideas are similar to S3, but
implementation is much stricter and makes use of specialised functions for creating classes
(setClass()), generics (setGeneric()), and methods (setMethod()). Additionally, S4 provides both
multiple inheritance (i.e. a class can have multiple parents) and multiple dispatch (i.e. method dispatch
can use the class of multiple arguments).

An important new component of S4 is the slot, a named component of the object that is accessed
using the specialised subsetting operator @ (pronounced at). The set of slots, and their classes, forms
an important part of the definition of an S4 class.

All functions related to S4 live in the methods package. This package is always available when you’re
running R interactively, but may not be available when running R in batch mode, i.e. from Rscript. For
this reason, it’s a good idea to call library(methods) whenever you use S4. This also signals to the
reader that you’ll be using the S4 object system.

library(methods)

8.3.1 Basics
We’ll start with a quick overview of the main components of S4. You define an S4 class by calling
setClass() with the class name and a definition of its slots, and the names and classes of the class data:

setClass("Person",

slots = c(

name = "character",

age = "numeric"

)
)

Once the class is defined, you can construct new objects from it by calling new() with the name of the
class and a value for each slot:

john <- new("Person", name = "John Smith", age = NA_real_)

Given an S4 object you can see its class with is() and access slots with @ (equivalent to $) and slot()
(equivalent to [[):

is(john)

#> [1] "Person"

john@name

#> [1] "John Smith"

slot(john, "age")

#> [1] NA

Generally, you should only use @ in your methods. If you’re working with someone else’s class, look
for accessor functions that allow you to safely set and get slot values. As the developer of a class, you
should also provide your own accessor functions. Accessors are typically S4 generics allowing multiple
classes to share the same external interface.

Here we’ll create a setter and getter for the age slot by first creating generics with setGeneric():

setGeneric("age", function(x) standardGeneric("age"))

setGeneric("age<-", function(x, value) standardGeneric("age<-"))

And then defining methods with setMethod():

setMethod("age", "Person", function(x) x@age)

setMethod("age<-", "Person", function(x, value) {

x@age <- value

})

age(john) <- 50

age(john)

#> [1] 50

If you’re using an S4 class defined in a package, you can get help on it with class?Person. To get help
for a method, put ? in front of a call (e.g. ?age(john)) and ? will use the class of the arguments to figure
out which help file you need.

Finally, you can use sloop functions to identify S4 objects and generics found in the wild:
sloop::otype(john)

#> [1] "S4"

sloop::ftype(age)

#> [1] "S4" "generic"

8.3.2 Classes
To define an S4 class, call setClass() with three arguments:

 The class name. By convention, S4 class names use UpperCamelCase.


 A named character vector that describes the names and classes of the slots (fields). For
example, a person might be represented by a character name and a numeric age: c(name =
"character", age = "numeric"). The pseudo-class ANY allows a slot to accept objects of any
type.
 A prototype, a list of default values for each slot. Technically, the prototype is optional, but
you should always provide it.

The code below illustrates the three arguments by creating a Person class with character name and
numeric age slots.

setClass("Person",

slots = c(

name = "character",

age = "numeric"

),

prototype = list(

name = NA_character_,

age = NA_real_

me <- new("Person", name = "Hadley")

str(me)

#> Formal class 'Person' [package ".GlobalEnv"] with 2 slots

#> ..@ name: chr "Hadley"

#> ..@ age : num NA


8.3.2.1 Inheritance

There is one other important argument to setClass(): contains. This specifies a class (or classes) to
inherit slots and behaviour from. For example, we can create an Employee class that inherits from the
Person class, adding an extra slot that describes their boss.

setClass("Employee",

contains = "Person",

slots = c(

boss = "Person"

),

prototype = list(

boss = new("Person")

str(new("Employee"))

#> Formal class 'Employee' [package ".GlobalEnv"] with 3 slots

#> ..@ boss:Formal class 'Person' [package ".GlobalEnv"] with 2 slots

#> .. .. ..@ name: chr NA

#> .. .. ..@ age : num NA

#> ..@ name: chr NA

#> ..@ age : num NA

setClass() has 9 other arguments but they are either deprecated or not recommended.

8.3.2.2 Introspection

To determine what classes an object inherits from, use is():

is(new("Person"))

#> [1] "Person"

is(new("Employee"))

#> [1] "Employee" "Person"

To test if an object inherits from a specific class, use the second argument of is():

is(john, "Person")

#> [1] TRUE


8.3.2.3 Redefinition

In most programming languages, class definition occurs at compile-time and object construction
occurs later, at run-time. In R, however, both definition and construction occur at run time. When you
call setClass(), you are registering a class definition in a (hidden) global variable. As with all state-
modifying functions you need to use setClass() with care. It’s possible to create invalid objects if you
redefine a class after already having instantiated an object:

setClass("A", slots = c(x = "numeric"))

a <- new("A", x = 10)

setClass("A", slots = c(a_different_slot = "numeric"))

#> An object of class "A"

#> Slot "a_different_slot":

#> Error in slot(object, what): no slot of name "a_different_slot" for this

#> object of class "A"

This can cause confusion during interactive creation of new classes.

8.3.2.4 Helper

new() is a low-level constructor suitable for use by you, the developer. User-facing classes should
always be paired with a user-friendly helper. A helper should always:

 Have the same name as the class, e.g. myclass().
 Have a thoughtfully crafted user interface with carefully chosen default values and useful
conversions.
 Create carefully crafted error messages tailored towards an end-user.
 Finish by calling methods::new().

The Person class is so simple that a helper is almost superfluous, but we can use it to clearly define the contract: age is optional but name is required. We’ll also coerce age to a double so the helper also works when passed an integer.

Person <- function(name, age = NA) {
  age <- as.double(age)
  new("Person", name = name, age = age)
}

Person("Hadley")

#> An object of class "Person"

#> Slot "name":

#> [1] "Hadley"

#>
#> Slot "age":

#> [1] NA

8.3.2.5 Validator

The constructor automatically checks that the slots have correct classes:

Person(mtcars)

#> Error in validObject(.Object): invalid class "Person" object: invalid

#> object for slot "name" in class "Person": got class "data.frame", should

#> be or extend class "character"

You will need to implement more complicated checks (i.e. checks that involve lengths, or multiple
slots) yourself. For example, we might want to make it clear that the Person class is a vector class, and
can store data about multiple people. That’s not currently clear because @name and @age can be
different lengths:

Person("Hadley", age = c(30, 37))

#> An object of class "Person"

#> Slot "name":

#> [1] "Hadley"

#>

#> Slot "age":

#> [1] 30 37

To enforce these additional constraints we write a validator with setValidity(). It takes a class and a
function that returns TRUE if the input is valid, and otherwise returns a character vector describing
the problem(s):

setValidity("Person", function(object) {

if (length(object@name) != length(object@age)) {

"@name and @age must be same length"

} else {

TRUE

})

Now we can no longer create an invalid object:

Person("Hadley", age = c(30, 37))

#> Error in validObject(.Object): invalid class "Person" object: @name and

#> @age must be same length


NB: The validity method is only called automatically by new(), so you can still create an invalid object
by modifying it:

alex <- Person("Alex", age = 30)

alex@age <- 1:10

You can explicitly check the validity yourself by calling validObject():

validObject(alex)

#> Error in validObject(alex): invalid class "Person" object: @name and @age

#> must be same length

8.3.3 Generics and Methods


The job of a generic is to perform method dispatch, i.e. find the specific implementation for the
combination of classes passed to the generic. Here you’ll learn how to define S4 generics and methods,
then in the next section we’ll explore precisely how S4 method dispatch works.

To create a new S4 generic, call setGeneric() with a function that calls standardGeneric():

setGeneric("myGeneric", function(x) standardGeneric("myGeneric"))

By convention, new S4 generics should use lowerCamelCase.

It is bad practice to use {} in the generic as it triggers a special case that is more expensive, and
generally best avoided.

# Don't do this!

setGeneric("myGeneric", function(x) {

standardGeneric("myGeneric")

})

8.3.3.1 Signature

Like setClass(), setGeneric() has many other arguments. There is only one that you need to know
about: signature. This allows you to control the arguments that are used for method dispatch. If
signature is not supplied, all arguments (apart from ...) are used. It is occasionally useful to remove
arguments from dispatch. This allows you to require that methods provide arguments like verbose =
TRUE or quiet = FALSE, but they don’t take part in dispatch.

setGeneric("myGeneric",

function(x, ..., verbose = TRUE) standardGeneric("myGeneric"),

signature = "x"

8.3.3.2 Methods

A generic isn’t useful without some methods, and in S4 you define methods with setMethod(). There
are three important arguments: the name of the generic, the name of the class, and the method itself.
setMethod("myGeneric", "Person", function(x) {

# method implementation

})

More formally, the second argument to setMethod() is called the signature. In S4, unlike S3, the
signature can include multiple arguments. This makes method dispatch in S4 substantially more
complicated, but avoids having to implement double-dispatch as a special case. We’ll talk more about
multiple dispatch in the next section. setMethod() has other arguments, but you should never use
them.

To list all the methods that belong to a generic, or that are associated with a class, use
methods("generic") or methods(class = "class"); to find the implementation of a specific method, use
selectMethod("generic", "class").

8.3.3.3 Show Method

The most commonly defined S4 method that controls printing is show(), which controls how the object
appears when it is printed. To define a method for an existing generic, you must first determine the
arguments. You can get those from the documentation or by looking at the args() of the generic:

args(getGeneric("show"))

#> function (object)

#> NULL

Our show method needs to have a single argument object:

setMethod("show", "Person", function(object) {

cat(is(object)[[1]], "\n",

" Name: ", object@name, "\n",

" Age: ", object@age, "\n",

sep = ""

})

john

#> Person

#> Name: John Smith

#> Age: 50

8.3.3.4 Accessors

Slots should be considered an internal implementation detail: they can change without warning and
user code should avoid accessing them directly. Instead, all user-accessible slots should be
accompanied by a pair of accessors. If the slot is unique to the class, this can just be a function:
person_name <- function(x) x@name

Typically, however, you’ll define a generic so that multiple classes can use the same interface:

setGeneric("name", function(x) standardGeneric("name"))

setMethod("name", "Person", function(x) x@name)

name(john)

#> [1] "John Smith"

If the slot is also writeable, you should provide a setter function. You should always include
validObject() in the setter to prevent the user from creating invalid objects.

setGeneric("name<-", function(x, value) standardGeneric("name<-"))

setMethod("name<-", "Person", function(x, value) {

x@name <- value

validObject(x)

})

name(john) <- "Jon Smythe"

name(john)

#> [1] "Jon Smythe"

name(john) <- letters

#> Error in validObject(x): invalid class "Person" object: @name and @age

#> must be same length

8.3.4 Method Dispatch


S4 dispatch is complicated because S4 has two important features:

 Multiple inheritance, i.e. a class can have multiple parents,


 Multiple dispatch, i.e. a generic can use multiple arguments to pick a method.

These features make S4 very powerful, but can also make it hard to understand which method will get
selected for a given combination of inputs. In practice, keep method dispatch as simple as possible by
avoiding multiple inheritance, and reserving multiple dispatch only for where it is absolutely
necessary.

But it’s important to describe the full details, so here we’ll start simple with single inheritance and
single dispatch, and work our way up to the more complicated cases. To illustrate the ideas without
getting bogged down in the details, we’ll use an imaginary class graph based on emoji:
Emoji give us very compact class names that evoke the relationships between the classes. It should be straightforward to remember that 😜 inherits from 😉, which inherits from 😶, and that 😎 inherits from two parent classes at once.

8.3.4.1 Single Dispatch

Let’s start with the simplest case: a generic function that dispatches on a single class with a single
parent. The method dispatch here is simple so it’s a good place to define the graphical conventions
we’ll use for the more complex cases.

There are two parts to such a dispatch diagram:

 The top part, f(...), defines the scope of the diagram. Here we have a generic with one
argument that has a class hierarchy that is three levels deep.
 The bottom part is the method graph and displays all the possible methods that could be
defined. Methods that exist, i.e. that have been defined with setMethod(), have a grey
background.
To find the method that gets called, you start with the most specific class of the actual arguments,
then follow the arrows until you find a method that exists. For example, if you called the function with
an object of class 😉, you would follow the arrow right to find the method defined for the more general
😶 class. If no method is found, method dispatch has failed and an error is thrown. In practice, this means that you should always define methods for the terminal nodes, i.e. those on the far right.

There are two pseudo-classes that you can define methods for. These are called pseudo-classes because they don’t actually exist, but they allow you to define useful behaviours. The first pseudo-class is ANY, which matches any class. For technical reasons that we’ll get to later, the link to the ANY method is longer than the links between the other classes.

The second pseudo-class is MISSING. If you define a method for this pseudo-class, it will match
whenever the argument is missing. It’s not useful for single dispatch, but is important for functions
like + and - that use double dispatch and behave differently depending on whether they have one or
two arguments.
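
As a small sketch of the ANY fallback (describe() is a hypothetical generic; Person and john are from earlier in this unit):

setGeneric("describe", function(x) standardGeneric("describe"))

setMethod("describe", "Person", function(x) {
  cat("A person named ", x@name, "\n", sep = "")
})

setMethod("describe", "ANY", function(x) {
  cat("An object of class ", class(x)[[1]], "\n", sep = "")
})

describe(john)    # dispatches to the more specific Person method
describe(letters) # no character method exists, so the ANY fallback is used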

8.3.4.2 Multiple Inheritance

Things get more complicated when the class has multiple parents.

The basic process remains the same: you start from the actual class supplied to the generic, then
follow the arrows until you find a defined method. The wrinkle is that now there are multiple arrows
to follow, so you might find multiple methods. If that happens, you pick the method that is closest,
i.e. requires travelling the fewest arrows.

NB: while the method graph is a powerful metaphor for understanding method dispatch,
implementing it in this way would be rather inefficient, so the actual approach that S4 uses is
somewhat different. You can read the details in ?Methods_Details.

What happens if two methods are the same distance away? For example, imagine we’ve defined methods for two of the classes that 😎 inherits from, but not for 😎 itself or for the more general 😶 class. Dispatch then finds two methods that are equally close. This is called an ambiguous method. When this happens in R, you’ll get a warning, and the method for the class that comes earlier in the alphabet will be picked (this is effectively random and should not be relied upon). When you discover ambiguity you should always resolve it by providing a more precise method for the ambiguous class.

The fallback ANY method still exists, but the rules are a little more complex: the ANY method is always considered further away than a method for a real class. This means that it will never contribute to ambiguity.

With multiple inheritance it is hard to simultaneously prevent ambiguity, ensure that every terminal method has an implementation, and minimise the number of defined methods (in order to benefit from OOP). For example, of the six ways to define only two methods for such a class graph, only one is free from problems.
8.3.4.3 Multiple Dispatch

Once you understand multiple inheritance, understanding multiple dispatch is straightforward. You
follow multiple arrows in the same way as previously, but now each method is specified by two classes
(separated by a comma).

The main difference between multiple inheritance and multiple dispatch is that there are many more arrows to follow. With four defined methods, for example, it is easy to end up with two ambiguous cases.

Multiple dispatch tends to be less tricky to work with than multiple inheritance because there are usually fewer terminal class combinations; often there is only one. That means, at a minimum, you can define a single method and have default behaviour for all inputs.

8.3.4.4 Multiple Dispatch and Multiple Inheritance

Of course you can combine multiple dispatch with multiple inheritance. A still more complicated case dispatches on two classes, both of which have multiple inheritance.

As the method graph gets more and more complicated it gets harder and harder to predict which
method will get called given a combination of inputs, and it gets harder and harder to make sure that
you haven’t introduced ambiguity. If you have to draw diagrams to figure out what method is actually
going to be called, it’s a strong indication that you should go back and simplify your design.

8.3.5 S4 and S3
When writing S4 code, you’ll often need to interact with existing S3 classes and generics. This section
describes how S4 classes, methods, and generics interact with existing code.
8.3.5.1 Classes

In slots and contains you can use S4 classes, S3 classes, or the implicit class of a base type. To use an
S3 class, you must first register it with setOldClass(). You call this function once for each S3 class, giving
it the class attribute. For example, the following definitions are already provided by base R:

setOldClass("data.frame")

setOldClass(c("ordered", "factor"))

setOldClass(c("glm", "lm"))

However, it’s generally better to be more specific and provide a full S4 definition with slots and a
prototype:

setClass("factor",

contains = "integer",

slots = c(

levels = "character"

),

prototype = structure(

integer(),

levels = character()

setOldClass("factor", S4Class = "factor")

Generally, these definitions should be provided by the creator of the S3 class. If you’re trying to build
an S4 class on top of an S3 class provided by a package, you should request that the package
maintainer add this call to their package, rather than adding it to your own code.

If an S4 object inherits from an S3 class or a base type, it will have a special virtual slot called .Data.
This contains the underlying base type or S3 object:

RangedNumeric <- setClass(
  "RangedNumeric",
  contains = "numeric",
  slots = c(min = "numeric", max = "numeric"),
  prototype = structure(numeric(), min = NA_real_, max = NA_real_)
)

rn <- RangedNumeric(1:10, min = 1, max = 10)

rn@min
#> [1] 1

rn@.Data

#> [1] 1 2 3 4 5 6 7 8 9 10

It is possible to define S3 methods for S4 generics, and S4 methods for S3 generics (provided you’ve
called setOldClass()). However, it’s more complicated than it might appear at first glance, so make
sure you thoroughly read ?Methods_for_S3.

8.3.5.2 Generics

As well as creating a new generic from scratch, it’s also possible to convert an existing S3 generic to
an S4 generic:

setGeneric("mean")

In this case, the existing function becomes the default (ANY) method:

selectMethod("mean", "ANY")

#> Method Definition (Class "derivedDefaultMethod"):

#>

#> function (x, ...)

#> UseMethod("mean")

#> <bytecode: 0x1aa1380>

#> <environment: namespace:base>

#>

#> Signatures:

#> x

#> target "ANY"

#> defined "ANY"

NB: setMethod() will automatically call setGeneric() if the first argument isn’t already a generic,
enabling you to turn any existing function into an S4 generic. It is OK to convert an existing S3 generic
to S4, but you should avoid converting regular functions to S4 generics in packages because that
requires careful coordination if done by multiple packages.

8.4 TRADE-OFFS
You now know about the three most important OOP toolkits available in R. Now that you understand
their basic operation and the principles that underlie them, we can start to compare and contrast the
systems in order to understand their strengths and weaknesses. This will help you pick the system that
is most likely to solve new problems.

S3 is simple, and widely used throughout base R and CRAN. While it’s far from perfect, its
idiosyncrasies are well understood and there are known approaches to overcome most shortcomings.
If you have an existing background in programming, you are likely to lean towards R6, because it will
feel familiar. I think you should resist this tendency for two reasons. Firstly, if you use R6 it’s very easy
to create a non-idiomatic API that will feel very odd to native R users, and will have surprising pain
points because of the reference semantics. Secondly, if you stick to R6, you’ll lose out on learning a
new way of thinking about OOP that gives you a new set of tools for solving problems.

8.4.1 S4 versus S3
Once you’ve mastered S3, S4 is not too difficult to pick up: the underlying ideas are the same, S4 is
just more formal, stricter, and more verbose. The strictness and formality of S4 make it well suited for
large teams. Since more structure is provided by the system itself, there is less need for convention,
and new contributors don’t need as much training. S4 tends to require more upfront design than S3,
and this investment is more likely to pay off on larger projects where greater resources are available.

One large team effort where S4 is used to good effect is Bioconductor. Bioconductor is similar to CRAN:
it’s a way of sharing packages amongst a wider audience. Bioconductor is smaller than CRAN (~1,300
versus ~10,000 packages, July 2017) and the packages tend to be more tightly integrated because of
the shared domain and because Bioconductor has a stricter review process. Bioconductor packages
are not required to use S4, but most will because the key data structures (e.g. SummarizedExperiment,
IRanges, DNAStringSet) are built using S4.

S4 is also a good fit for complex systems of interrelated objects, and it’s possible to minimise code
duplication through careful implementation of methods. The best example of such a system is the
Matrix package (Bates and Maechler 2018). It is designed to efficiently store and compute with many
different types of sparse and dense matrices. As of version 1.2.17, it defines 102 classes, 21 generic
functions, and 1994 methods, and to give you some idea of the complexity, a small subset of the class
graph is shown in Fig. 8.1.

Fig. 8.1 A small subset of the Matrix class graph showing the inheritance of sparse matrices. Each
concrete class inherits from two virtual parents: one that describes how the data is stored
(C = column oriented, R = row oriented, T = tagged) and one that describes any restriction on the
matrix (s = symmetric, t = triangle, g = general)
This domain is a good fit for S4 because there are often computational shortcuts for specific
combinations of sparse matrices. S4 makes it easy to provide a general method that works for all
inputs, and then provide more specialised methods where the inputs allow a more efficient
implementation. This requires careful planning to avoid method dispatch ambiguity, but the planning
pays off with higher performance.

The biggest challenge to using S4 is the combination of increased complexity and absence of a single
source of documentation. S4 is a complex system and it can be challenging to use effectively in
practice.

8.4.2 R6 versus S3
R6 is a profoundly different OO system from S3 and S4 because it is built on encapsulated objects,
rather than generic functions. Additionally R6 objects have reference semantics, which means that
they can be modified in place. These two big differences have a number of non-obvious consequences
which we’ll explore here:

 A generic is a regular function so it lives in the global namespace. An R6 method belongs to an object so it lives in a local namespace. This influences how we think about naming.
 R6’s reference semantics allow methods to simultaneously return a value and modify an
object. This solves a painful problem called “threading state”.
 You invoke an R6 method using $, which is an infix operator. If you set up your methods
correctly you can use chains of method calls as an alternative to the pipe.

These are general trade-offs between functional and encapsulated OOP, so they also serve as a
discussion of system design in R versus Python.

8.4.2.1 Namespacing

One non-obvious difference between S3 and R6 is the space in which methods are found:

 Generic functions are global: all packages share the same namespace.
 Encapsulated methods are local: methods are bound to a single object.

The advantage of a global namespace is that multiple packages can use the same verbs for working
with different types of objects. Generic functions provide a uniform API that makes it easier to perform
typical actions with a new object because there are strong naming conventions. This works well for
data analysis because you often want to do the same thing to different types of objects. In particular,
this is one reason that R’s modelling system is so useful: regardless of where the model has been
implemented you always work with it using the same set of tools (summary(), predict(), …).
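
For example, a base-R linear model responds to the same verbs that a model from any other package would, because those verbs are generics:

m <- lm(mpg ~ wt, data = mtcars)

summary(m)                 # the same verb works for any model class
predict(m, mtcars[1:3, ])  # and so does prediction on new data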

The disadvantage of a global namespace is that it forces you to think more deeply about naming. You
want to avoid multiple generics with the same name in different packages because it requires the user
to type :: frequently. This can be hard because function names are usually English verbs, and verbs
often have multiple meanings. Take plot() for example:

plot(data) # plot some data

plot(bank_heist) # plot a crime

plot(land) # create a new plot of land

plot(movie) # extract plot of a movie


Generally, you should avoid methods that are homonyms of the original generic, and instead define a
new generic.

This problem doesn’t occur with R6 methods because they are scoped to the object. The following
code is fine, because there is no implication that the plot method of two different R6 objects has the
same meaning:

data$plot()

bank_heist$plot()

land$plot()

movie$plot()

These considerations also apply to the arguments to the generic. S3 generics must have the same core
arguments, which means they generally have non-specific names like x or .data. S3 generics generally
need ... to pass on additional arguments to methods, but this has the downside that misspelled
argument names will not create an error. In comparison, R6 methods can vary more widely and use
more specific and evocative argument names.

A secondary advantage of local namespacing is that creating an R6 method is very cheap. Most
encapsulated OO languages encourage you to create many small methods, each doing one thing well
with an evocative name. Creating a new S3 method is more expensive, because you may also have to
create a generic, and think about the naming issues described above. That means that the advice to
create many small methods does not apply to S3. It’s still a good idea to break your code down into
small, easily understood chunks, but they should generally just be regular functions, not methods.

8.4.2.2 Threading State

One challenge of programming with S3 is when you want to both return a value and modify the object.
This violates our guideline that a function should either be called for its return value or for its side
effects, but is necessary in a handful of cases.

For example, imagine you want to create a stack of objects. A stack has two main methods:

 push() adds a new object to the top of the stack.


 pop() returns the top most value, and removes it from the stack.

The implementation of the constructor and the push() method is straightforward. A stack contains a
list of items, and pushing an object to the stack simply appends to this list.

new_stack <- function(items = list()) {
  structure(list(items = items), class = "stack")
}

push <- function(x, y) {
  x$items <- c(x$items, list(y))
  x
}
Implementing pop() is more challenging because it has to both return a value (the object at the top of the stack) and have a side-effect (remove that object from the top). Since we can’t modify the input object in S3, we need to return two things: the value and the updated object.

pop <- function(x) {
  n <- length(x$items)
  item <- x$items[[n]]
  x$items <- x$items[-n]
  list(item = item, x = x)
}

This leads to rather awkward usage:

s <- new_stack()

s <- push(s, 10)

s <- push(s, 20)

out <- pop(s)

out$item

#> [1] 20

s <- out$x
s
#> $items
#> $items[[1]]
#> [1] 10
#>
#>
#> attr(,"class")
#> [1] "stack"

This problem is known as threading state or accumulator programming: no matter how deeply pop() is called, you have to thread the modified stack object all the way back to where it lives.

One way that other FP languages deal with this challenge is to provide a multiple assign (or
destructuring bind) operator that allows you to assign multiple values in a single step. The zeallot
package (Teetor 2018) provides multi-assign for R with %<-%. This makes the code more elegant, but
doesn’t solve the key problem:

library(zeallot)

c(value, s) %<-% pop(s)

value

#> [1] 10

An R6 implementation of a stack is simpler because $pop() can modify the object in place, and return
only the top-most value:

Stack <- R6::R6Class("Stack", list(
  items = list(),
  push = function(x) {
    self$items <- c(self$items, x)
    invisible(self)
  },
  pop = function() {
    item <- self$items[[self$length()]]
    self$items <- self$items[-self$length()]
    item
  },
  length = function() {
    length(self$items)
  }
))

This leads to more natural code:

s <- Stack$new()

s$push(10)

s$push(20)

s$pop()

#> [1] 20

8.4.2.3 Method Chaining

The pipe, %>%, is useful because it provides an infix operator that makes it easy to compose functions
from left-to-right. Interestingly, the pipe is not so important for R6 objects because they already use
an infix operator: $. This allows the user to chain together multiple method calls in a single expression,
a technique known as method chaining:

s <- Stack$new()

s$
  push(10)$
  push(20)$
  pop()

#> [1] 20

This technique is commonly used in other programming languages, like Python and JavaScript, and is
made possible with one convention: any R6 method that is primarily called for its side-effects (usually
modifying the object) should return invisible(self).

The primary advantage of method chaining is that you can get useful autocomplete; the primary
disadvantage is that only the creator of the class can add new methods (and there’s no way to use
multiple dispatch).

Check your Progress 1


Fill in the Blanks.

1. R6 only needs a single function call to create both the class and its methods: _____.
2. $add() overrides the superclass implementation, but we can still delegate to the superclass
implementation by using _____.
3. The ______ can be called effectively anywhere in your R code.
4. To define an S4 class, call setClass() with ____, _________ and ______ .
5. To enforce additional constraints we write a ______ with setValidity().
6. The most commonly defined S4 method that controls printing is _______, which controls how
the object appears when it is printed.

Activity 1
1. Create an R6 class that represents a shuffled deck of cards. You should be able to draw cards
from the deck with $draw(n), and return all cards to the deck and reshuffle with $reshuffle().
Use the following code to make a vector of cards.

suit <- c("♠", "♥", "♦", "♣")


value <- c("A", 2:10, "J", "Q", "K")
cards <- paste0(rep(value, 4), suit)
Why can’t you model a deck of cards with an S3 class?

2. Create a class that allows you to write a line to a specified file. You should open a connection
to the file in $initialize(), append a line using cat() in $append_line(), and close the connection
in $finalize().
3. Imagine you were going to reimplement factors, dates, and data frames in S4. Sketch out the
setClass() calls that you would use to define the classes. Think about appropriate slots and
prototype.
4. Draw the method graph for f(😃, 😉, 😙).

Summary
 R6 is very similar to a base OOP system called reference classes, or RC for short.
 R6 classes provide capabilities that are common in other object-oriented programming
languages. They’re similar to R’s built-in reference classes, but are simpler, smaller, and faster,
and they allow inheritance across packages.
 Some features of R6: R6 objects have reference semantics, R6 cleanly supports inheritance
across packages and R6 classes have public and private members.
 The S4 approach differs from the S3 approach to creating a class in that it is a more rigid
definition. The idea is that an object is created using the setClass command. The command
takes a number of options.

Keywords
 Threads: A thread of execution is the smallest sequence of programmed instructions that can
be managed independently by a scheduler.
 Type Introspection: It is the ability of a program to examine the type or properties of an object
at runtime.
 Validation: It is the process of checking that a system meets specifications and that it fulfils
its intended purpose.
 Namespace: A namespace is a set of symbols that are used to organise objects of various
kinds, so that these objects may be referred to by name.

Self-Assessment Questions
1. Create a bank account R6 class that stores a balance and allows you to deposit and withdraw
money. Create a subclass that throws an error if you attempt to go into overdraft. Create
another subclass that allows you to go into overdraft, but charges you a fee.
2. What happens if you define a method with different argument names to the generic?
3. Write a short note on R6Class().
4. Explain the concept of Inheritance in s4.
5. What is the purpose of Show method?
6. Explain multiple dispatch with example.
7. State the difference between R6 and S3.

Answers to Check your Progress


Check your Progress 1

Fill in the Blanks.

1. R6 only needs a single function call to create both the class and its methods: R6::R6Class().
2. $add() overrides the superclass implementation, but we can still delegate to the superclass
implementation by using super$.
3. The finalizer can be called effectively anywhere in your R code.
4. To define an S4 class, call setClass() with the class name, a named character vector of slot names and classes, and a prototype.
5. To enforce additional constraints we write a validator with setValidity().
6. The most commonly defined S4 method that controls printing is show(), which controls how
the object appears when it is printed.

Suggested Reading
1. The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff.
2. https://kingaa.github.io/R_Tutorial/#data-structures-in-r.
3. https://d1b10bmlvqabco.cloudfront.net/attach/ighbo26t3ua52t/igp9099yy4v10/igz7vp4w5
su9/OReilly_HandsOn_Programming_with_R_2014.pdf.
4. The R Book by Michael J. Crawley, Imperial College London at Silwood Park, UK.
5. R Programming for Data Science by Roger D. Peng.
6. An introduction to R by Longhow Lam.
7. Learning Statistics with R by Danielle Navarro.
8. Advanced R by Hadley Wickham, The R Series.
Unit 9
Debugging and Condition Handling
Structure:
9.1 Introduction
9.2 Overview
9.3 Locating Errors
9.3.1 Lazy Evaluation
9.4 Interactive Debugger
9.4.1 browser() Commands
9.4.2 Alternatives
9.4.3 Compiled Code
9.5 Non-Interactive Debugging
9.5.1 dump.frames()
9.5.2 Print Debugging
9.5.3 RMarkdown
9.6 Non-Error Failures
9.7 Condition Handling
9.8 Defensive Programming
Summary
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) as requested by the work’s creator or
licensees. This license is available at https://creativecommons.org/licenses/by-nc-sa/4.0/ .
Objectives
After going through this unit, you will be able to:
 Understand the general strategy for finding and fixing errors
 Use function which helps to locate exactly where an error occurred
 Discuss the challenging problem of debugging

9.1 INTRODUCTION
What do you do when R code throws an unexpected error? What tools do you have to find and fix the
problem? This unit will teach you the art and science of debugging, starting with a general strategy,
then following up with specific tools.

9.2 OVERVIEW
Finding your bug is a process of confirming the many things that you believe are true — until you find
one which is not true. —Norm Matloff
Finding the root cause of a problem is always challenging. Most bugs are subtle and hard to find because if they were obvious, you would have avoided them in the first place. A good strategy helps. Below is a four-step process that has proven useful:

 Google!
Whenever you see an error message, start by googling it. If you’re lucky, you’ll discover that it’s a
common error with a known solution. When googling, improve your chances of a good match by
removing any variable names or values that are specific to your problem.
 Make it repeatable
To find the root cause of an error, you’re going to need to execute the code many times as you consider and reject hypotheses. To make that iteration as quick as possible, it’s worth some upfront investment to make the problem both easy and fast to reproduce.
Start by creating a reproducible example. Next, make the example minimal by removing code and
simplifying data. As you do this, you may discover inputs that don’t trigger the error. Make note of
them: they will be helpful when diagnosing the root cause.
If you’re using automated testing, this is also a good time to create an automated test case. If your
existing test coverage is low, take the opportunity to add some nearby tests to ensure that existing
good behaviour is preserved. This reduces the chances of creating a new bug.
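
For example, a minimal sketch of such a test using the testthat package, assuming the f() defined in the next section (which adds 10 to numeric input and errors otherwise):

library(testthat)

test_that("f() works for numbers and fails informatively for strings", {
  expect_equal(f(1), 11)                  # known-good behaviour is preserved
  expect_error(f("a"), "must be numeric") # the failure we are pinning down
})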
 Figure out where it is
If you’re lucky, one of the tools in the following section will help you to quickly identify the line of code
that’s causing the bug. Usually, however, you’ll have to think a bit more about the problem. It’s a great
idea to adopt the scientific method. Generate hypotheses, design experiments to test them, and
record your results. This may seem like a lot of work, but a systematic approach will end up saving you
time.
 Fix it and test it
Once you’ve found the bug, you need to figure out how to fix it and to check that the fix actually
worked. Again, it’s very useful to have automated tests in place. Not only does this help to ensure that
you’ve actually fixed the bug, it also helps to ensure you haven’t introduced any new bugs in the
process. In the absence of automated tests, make sure to carefully record the correct output, and
check against the inputs that previously failed.
9.3 LOCATING ERRORS
Once you’ve made the error repeatable, the next step is to figure out where it comes from. The most
important tool for this part of the process is traceback(), which shows you the sequence of calls (also
known as the call stack) that lead to the error.
Here’s a simple example: you can see that f() calls g() calls h() calls i(), which checks if its argument is
numeric:
f <- function(a) g(a)
g <- function(b) h(b)
h <- function(c) i(c)
i <- function(d) {
if (!is.numeric(d)) {
stop("`d` must be numeric", call. = FALSE)
}
d + 10
}
When we run f("a") code in RStudio we see:

Two options appear to the right of the error message: “Show Traceback” and “Rerun with Debug”. If
you click “Show traceback” you see:

If you’re not using RStudio, you can use traceback() to get the same information:
traceback()
#> 5: stop("`d` must be numeric", call. = FALSE) at debugging.R#6
#> 4: i(c) at debugging.R#3
#> 3: h(b) at debugging.R#2
#> 2: g(a) at debugging.R#1
#> 1: f("a")
NB: You read the traceback() output from bottom to top: the initial call is f(), which calls g(), then h(),
then i(), which triggers the error. If you’re calling code that you source() d into R, the traceback will
also display the location of the function, in the form filename.r#linenumber. These are clickable in
RStudio, and will take you to the corresponding line of code in the editor.
The traceback() function can be used to print a summary of how your program arrived at the error.
This is also called a call stack, stack trace or backtrace. In R this gives you each call that lead up to the
error, which can be very useful for determining what lead to the error.
You can use traceback() in two different ways, either by calling it immediately after the error has
occurred.
f <- function(x) x + 1
g <- function(x) f(x)
g("a")
#> Error in x + 1 : non-numeric argument to binary operator

traceback()
#> 2: f(x) at #1
#> 1: g("a")
Or by using traceback() as an error handler, which will call it immediately on any error. (You could
even put this in your .Rprofile)
options(error = traceback)
g("a")
#> Error in x + 1 : non-numeric argument to binary operator
#> 2: f(x) at #1
#> 1: g("a")

9.3.1 Lazy Evaluation


One drawback to traceback() is that it always linearises the call tree, which can be confusing if there
is much lazy evaluation involved. For example, take the following example where the error happens
when evaluating the first argument to f():
j <- function() k()
k <- function() stop("Oops!", call. = FALSE)
f(j())
#> Error: Oops!

traceback()
#> 7: stop("Oops!") at #1
#> 6: k() at #1
#> 5: j() at debugging.R#1
#> 4: i(c) at debugging.R#3
#> 3: h(b) at debugging.R#2
#> 2: g(a) at debugging.R#1
#> 1: f(j())
You can use rlang::with_abort() and rlang::last_trace() to see the call tree. Here, it makes it much
easier to see the source of the problem. Look at the last branch of the call tree to see that the error
comes from j() calling k().
rlang::with_abort(f(j()))
#> Warning: `rlang__backtrace_on_error` is no longer experimental.
#> It has been renamed to `rlang_backtrace_on_error`. Please update your RProfile.
#> This warning is displayed once per session.
#> Error: Oops!
rlang::last_trace()
#>
#> 1. ├─rlang::with_abort(f(j()))
#> 2. │ └─base::withCallingHandlers(...)
#> 3. ├─global::f(j())
#> 4. │ └─global::g(a)
#> 5. │ └─global::h(b)
#> 6. │ └─global::i(c)
#> 7. └─global::j()
#> 8. └─global::k()
NB: rlang::last_trace() is ordered in the opposite way to traceback().

9.4 INTERACTIVE DEBUGGER


Sometimes, the precise location of the error is enough to let you track it down and fix it. Frequently,
however, you need more information, and the easiest way to get it is with the interactive debugger
which allows you to pause execution of a function and interactively explore its state.
If you’re using RStudio, the easiest way to enter the interactive debugger is through RStudio’s “Rerun
with Debug” tool. This reruns the command that created the error, pausing execution where the error
occurred. Otherwise, you can insert a call to browser() where you want to pause, and re-run the
function. For example, we could insert a call to browser() in g():
g <- function(b) {
browser()
h(b)
}
f(10)
browser() is just a regular function call which means that you can run it conditionally by wrapping it in
an if statement:
g <- function(b) {
if (b < 0) {
browser()
}
h(b)
}
In either case, you’ll end up in an interactive environment inside the function where you can run
arbitrary R code to explore the current state. You’ll know when you’re in the interactive debugger
because you get a special prompt:
Browse[1]>
In RStudio, you’ll see the corresponding code in the editor (with the statement that will be run next
highlighted), objects in the current environment in the Environment pane, and the call stack in the
Traceback pane.

9.4.1 browser() Commands


As well as allowing you to run regular R code, browser() provides a few special commands. You can
use them by either typing short text commands, or by clicking a button in the RStudio toolbar, Fig. 9.1:

Fig. 9.1 RStudio debugging toolbar

 Next, n: executes the next step in the function. If you have a variable named n, you’ll need
print(n) to display its value.
 Step into, or s: works like next, but if the next step is a function, it will step into that
function so you can explore it interactively.
 Finish, or f: finishes execution of the current loop or function.
 Continue, c: leaves interactive debugging and continues regular execution of the function. This
is useful if you’ve fixed the bad state and want to check that the function proceeds correctly.
 Stop, Q: stops debugging, terminates the function, and returns to the global workspace. Use
this once you’ve figured out where the problem is, and you’re ready to fix it and reload the
code.
There are two other slightly less useful commands that aren’t available in the toolbar:

 Enter: repeats the previous command. This is too easy to activate accidentally, so you can turn
it off using options(browserNLdisabled = TRUE).
 where: prints the stack trace of active calls (the interactive equivalent of traceback()).
If you are trying to hunt down a particular error, it is often useful to have RStudio enter the debugger
when it occurs. You can control the error behaviour with Debug -> On Error -> Error Inspector.

9.4.2 Alternatives
There are three alternatives to using browser(): setting breakpoints in RStudio, options(error =
recover), and debug() and other related functions.
9.4.2.1 Breakpoints
In RStudio, you can set a breakpoint by clicking to the left of the line number, or pressing Shift + F9.
Breakpoints behave similarly to browser() but they are easier to set (one click instead of nine key
presses), and you don’t run the risk of accidentally including a browser() statement in your source
code. There are two small downsides to breakpoints:
 There are a few unusual situations in which breakpoints will not work. Read breakpoint
troubleshooting for more details.
 RStudio currently does not support conditional breakpoints.
9.4.2.2 recover()
Another way to activate browser() is to use options(error = recover). Now when you get an error, you’ll
get an interactive prompt that displays the traceback and gives you the ability to interactively debug
inside any of the frames:
options(error = recover)
f("x")
#> Error: `d` must be numeric
#>
#> Enter a frame number, or 0 to exit
#>
#> 1: f("x")
#> 2: debugging.R#1: g(a)
#> 3: debugging.R#2: h(b)
#> 4: debugging.R#3: i(c)
#>
#> Selection:
You can return to default error handling with options(error = NULL).
9.4.2.3 debug()
Another approach is to call a function that inserts the browser() call for you:

 debug() inserts a browser statement in the first line of the specified function. undebug()
removes it. Alternatively, you can use debugonce() to browse only on the next run (see the
sketch after this list).
 utils::setBreakpoint() works similarly, but instead of taking a function name, it takes a file
name and line number and finds the appropriate function for you.
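A minimal sketch of this workflow, using a hypothetical function f():
f <- function(x) mean(x)
debugonce(f)   # enter the debugger on the next call to f() only
f(1:10)        # opens a Browse[1]> prompt inside f()
debug(f)       # enter the debugger on every call to f()
isdebugged(f)  # TRUE while the flag is set
undebug(f)     # remove the flag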
debug() and utils::setBreakpoint() are both special cases of trace(), which inserts arbitrary code at any
position in an existing function. trace() is occasionally useful when you’re debugging code that you
don’t have the source for. To remove tracing from a function, use untrace(). You can only perform one
trace per function, but that one trace can call multiple functions.
If called with no additional arguments, trace() simply prints a message when the function is entered.
If called with a function as the second argument, it inserts that function at the start of the traced
function; trace(fun, browser) is functionally equivalent to debug(fun). browser() and recover() are
generally the most useful functions to use here, but the tracer could be any R function or even a
regular R expression. This is often useful to open the debugger only when a certain condition is met.
trace(print, quote(if (is.numeric(x) && x >= 3) cat("hi\n")), print = FALSE)
## Tracing function "print" in package "base"
## [1] "print"
print(1)
## [1] 1
print(3)
## hi
## [1] 3
# Use untrace to remove the tracing code
untrace(print)
## Untracing function "print" in package "base"
You can also use the at argument to trace() to insert the tracing expressions at other points in the
function body. To determine the number of the expression to insert, convert the body of the function
to a list, e.g. as.list(body(fun)).
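A short sketch of at in action, using a hypothetical function fun():
fun <- function(x) {
  y <- x + 1
  z <- y * 2
  z
}
as.list(body(fun))  # step 1 is `{`, step 2 is y <- x + 1, step 3 is z <- y * 2
trace(fun, quote(cat("y =", y, "\n")), at = 3, print = FALSE)
fun(1)
## y = 2
## [1] 4
untrace(fun)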
9.4.2.4 Call Stack
Unfortunately, the call stacks printed by traceback(), browser()’s where command, and recover() are
not consistent. For the same nested set of calls, the numbering differs between traceback() and
where, and recover() displays the calls in the opposite order.
RStudio displays calls in the same order as traceback(). rlang functions use the same ordering and
numbering as recover(), but also use indenting to reinforce the hierarchy of calls.

9.4.3 Compiled Code


It is possible, with a little extra work, to use an interactive debugger (gdb or lldb) to debug compiled
code (like C or C++) in the same way that you can use browser() and debug() to debug your R code.
Unfortunately you won’t be able to use RStudio; you’ll have to run R from the command line.
Open a shell (e.g. with Tools | Shell…) and start R by typing:
# If you compile with clang
R --debugger=lldb
# If you compile with gcc
R --debugger=gdb
This will start either lldb or gdb, the debuggers that work with code produced by clang or gcc
respectively. Like R, lldb and gdb provide a REPL, a run-eval-print loop where you enter commands
and then look at the results. The examples below show the results of lldb; the output from gdb is
similar. For each interactive command, we give the explicit, but long, lldb command and the short,
but cryptic, gdb command. Because lldb understands all gdb commands, you can choose to be explicit
or terse.
Once you’ve started the debugger, start R by typing process start (lldb) or run (gdb). Now when your
C/C++ code crashes, you’ll be dumped into an interactive debugger instead of getting a cryptic error
message and a crash.
Let’s start with a simple C++ function that writes to memory it doesn’t “own”:
Rcpp::cppFunction("
bool mistake() {
  NumericVector x(1);
  int n = INT_MAX;
  x[n] = 0;
  return true;
}
", plugins = "debug", verbose = TRUE, rebuild = TRUE)
mistake()
If the crash occurs in a package you are developing, use devtools::load_all() to load the current
package, then copy and paste the code that creates the bug. Here’s a crash report from one such package:
Process 32743 stopped
* thread #1: tid = 0x1f79f6, 0x... gggeom.so...`
frame #0: 0x0.. gggeom.so`vw_distance(x=..., y=...) + ... at vw-distance.cpp:54
51 int prev_idx = prev[idx];
52
53 next[prev[idx]] = next_idx;
-> 54 prev[next[idx]] = prev_idx;
55 prev[idx] = -1;
56 next[idx] = -1;
57
It tells us that the crash occurred because of an EXC_BAD_ACCESS - this is one of the most common
types of crash in C/C++ code. Helpfully, lldb shows exactly which line of C++ code caused the problem:
vw-distance.cpp:54. Often just knowing where the problem occurs is enough to fix it. But we’re also
now at an interactive prompt. There are many commands you can run here to explore what’s going
on. The most useful are listed below:

 See a list of all commands: help.


 Show your location on the callstack with thread backtrace/bt. This will print a list of calls
leading up to the error, much like traceback() does in R. Navigate the callstack with frame
select <n>/frame <n>, or up and down.
 Evaluate the next expression with thread step-over/next, or step into it with thread step-
in/step. Continue executing the rest of the code with thread step-out/finish.
 Show all variables defined in the current frame with frame variable/ info locals, or print the
value of a single variable with frame variable <var>/p <var>.
Instead of waiting for a crash to occur you can also set breakpoints in your code. To do so, start the
debugger, run R, then:
1. Press Ctrl + C
2. Type breakpoint set --file foo.c --line 12/break foo.c:12.
3. process continue/c to go back to the R console. Now run the C code you’re interested in, and
the debugger will stop when it gets to the specified line.
You can also set a breakpoint for any C++ exception: this allows you to figure out exactly where a C++
error occurs:
1. Press Ctrl + C.
2. Type breakpoint set -E c++.
3. process continue/c to go back to the R console. Now if an exception is thrown in C++ code (or
by R’s C API when wrapped in Rcpp code), the debugger will stop.
Finally, you can also use the debugger if your code is stuck in an infinite loop. Press Ctrl + C to break
into the debugger and you’ll see which line of code is causing the problem.

9.5 NON-INTERACTIVE DEBUGGING


Debugging is most challenging when you can’t run code interactively, typically because it’s part of
some pipeline run automatically (possibly on another computer), or because the error doesn’t occur
when you run the same code interactively. This can be extremely frustrating!
When you can’t explore interactively, it’s particularly important to spend some time making the
problem as small as possible so you can iterate quickly. Sometimes callr::r(f, list(1, 2)) can be useful;
this calls f(1, 2) in a fresh session, and can help to reproduce the problem.
You might also want to double check for these common issues (a quick sketch for comparing them
follows this list):
 Is the global environment different? Have you loaded different packages? Are objects left
from previous sessions causing differences?
 Is the working directory different?
 Is the PATH environment variable, which determines where external commands (like git) are
found, different?
 Is the R_LIBS environment variable, which determines where library() looks for packages,
different?
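A quick way to compare these between a working and a failing session is to print them from R:
getwd()               # working directory
search()              # attached packages
Sys.getenv("PATH")    # where external commands are found
Sys.getenv("R_LIBS")  # extra library locations
.libPaths()           # where library() looks for packages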

9.5.1 dump.frames()
dump.frames() is the equivalent of recover() for non-interactive code; it saves a last.dump.rda file in
the working directory. Later, in an interactive session, you can run load("last.dump.rda"); debugger()
to enter an interactive debugger with the same interface as recover(). This lets you “cheat”,
interactively debugging code that was run non-interactively.
# In batch R process ----
dump_and_quit <- function() {
  # Save debugging info to file last.dump.rda
  dump.frames(to.file = TRUE)
  # Quit R with error status
  q(status = 1)
}
options(error = dump_and_quit)
# In a later interactive session ----
load("last.dump.rda")
debugger()

9.5.2 Print Debugging


If dump.frames() doesn’t help, a good fallback is print debugging, where you insert numerous print
statements to precisely locate the problem, and see the values of important variables. Print debugging
is slow and primitive, but it always works, so it’s particularly useful if you can’t get a good traceback.
Start by inserting coarse-grained markers, and then make them progressively more fine-grained as
you determine exactly where the problem is.
f <- function(a) {
  cat("f()\n")
  g(a)
}
g <- function(b) {
  cat("g()\n")
  cat("b =", b, "\n")
  h(b)
}
h <- function(c) {
  cat("i()\n")
  i(c)
}

f(10)
#> f()
#> g()
#> b = 10
#> i()
#> [1] 20
Print debugging is particularly useful for compiled code because it’s not uncommon for the compiler
to modify your code to such an extent you can’t figure out the root problem even when inside an
interactive debugger.

9.5.3 RMarkdown
Debugging code inside RMarkdown files requires some special tools. First, if you’re knitting the file
using RStudio, switch to calling rmarkdown::render("path/to/file.Rmd") instead. This runs the code
in the current session, which makes it easier to debug. If doing this makes the problem go away, you’ll
need to figure out what makes the environments different.
If the problem persists, you’ll need to use your interactive debugging skills. Whatever method you use,
you’ll need an extra step: in the error handler, you’ll need to call sink(). This removes the default sink
that knitr uses to capture all output, and ensures that you can see the results in the console. For
example, to use recover() with RMarkdown, you’d put the following code in your setup block:
options(error = function() {
  sink()
  recover()
})
This will generate a “no sink to remove” warning when knitr completes; you can safely ignore this
warning.
If you simply want a traceback, the easiest option is to use rlang::trace_back(), taking advantage of
the rlang_trace_top_env option. This ensures that you only see the traceback from your code, instead
of all the functions called by RMarkdown and knitr.
options(rlang_trace_top_env = rlang::current_env())
options(error = function() {
sink()
print(rlang::trace_back(bottom = sys.frame(-1)), simplify = "none")
})
9.6 NON-ERROR FAILURES
There are other ways for a function to fail apart from throwing an error:
 A function may generate an unexpected warning. The easiest way to track down warnings is
to convert them into errors with options(warn = 2) and then use the regular debugging tools.
When you do this you’ll see some extra calls in the call stack, like doWithOneRestart(),
withOneRestart(), withRestarts(), and .signalSimpleWarning(). Ignore these: they are internal
functions used to turn warnings into errors.
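A minimal sketch of this warnings-to-errors conversion:
f <- function() { warning("Careful!"); 10 }
options(warn = 2)  # promote warnings to errors
f()
#> Error in f() : (converted from warning) Careful!
options(warn = 0)  # restore the default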
 A function may generate an unexpected message. You can use rlang::with_abort() to turn
these messages into errors:
f <- function() g()
g <- function() message("Hi!")
f()
#> Hi!

rlang::with_abort(f(), "message")
#> Error: Hi!
rlang::last_trace()
#>
#> 1. ├─rlang::with_abort(f(), "message")
#> 2. │ └─base::withCallingHandlers(...)
#> 3. └─global::f()
#> 4. └─global::g()
 A function might never return. This is particularly hard to debug automatically, but sometimes
terminating the function and looking at the traceback() is informative. Otherwise, use print
debugging, as in Section 9.5.2.
 The worst scenario is that your code might crash R completely, leaving you with no way to
interactively debug your code. This indicates a bug in compiled (C or C++) code.
 If the bug is in your compiled code, use an interactive C debugger (or insert many print
statements).
 If the bug is in a package or base R, you’ll need to contact the package maintainer. In either
case, work on making the smallest possible reproducible example to help the developer help
you.

9.7 CONDITION HANDLING


Unexpected errors require interactive debugging to figure out what went wrong. Some errors,
however, are expected, and you want to handle them automatically. In R, expected errors crop up
most frequently when you’re fitting many models to different datasets, such as bootstrap replicates.
Sometimes the model might fail to fit and throw an error, but you don’t want to stop everything.
Instead, you want to fit as many models as possible and then perform diagnostics after the fact.
In R, there are three tools for handling conditions (including errors) programmatically:
1. try() gives you the ability to continue execution even when an error occurs.
2. tryCatch() lets you specify handler functions that control what happens when a condition is
signalled.
3. withCallingHandlers() is a variant of tryCatch() that establishes local handlers, whereas
tryCatch() registers exiting handlers. Local handlers are called in the same context as where
the condition is signalled, without interrupting the execution of the function. When an exiting
handler from tryCatch() is called, the execution of the function is interrupted and the handler
is called. withCallingHandlers() is rarely needed, but is useful to be aware of.
The following sections describe these tools in more detail.

 Ignore errors with try()


try() allows execution to continue even after an error has occurred. For example, normally if you run
a function that throws an error, it terminates immediately and doesn’t return a value:
f1 <- function(x) {
  log(x)
  10
}
f1("x")
## Error in log(x): non-numeric argument to mathematical function
However, if you wrap the statement that creates the error in try(), the error message will be printed
but execution will continue:

f2 <- function(x) {
  try(log(x))
  10
}
f2("a")
#> Error in log(x) : non-numeric argument to mathematical function
#> [1] 10
You can suppress the message with try(..., silent = TRUE).
To pass larger blocks of code to try(), wrap them in {}:
try({
  a <- 1
  b <- "x"
  a + b
})
You can also capture the output of the try() function. If successful, it will be the last result evaluated
in the block (just like a function). If unsuccessful it will be an (invisible) object of class “try-error”:
success <- try(1 + 2)
failure <- try("a" + "b")
class(success)

## [1] "numeric"
class(failure)
## [1] "try-error"
try() is particularly useful when you’re applying a function to multiple elements in a list:
elements <- list(1:10, c(-1, 10), c(TRUE, FALSE), letters)
results <- lapply(elements, log)

## Warning in FUN(X[[i]], ...): NaNs produced


## Error in FUN(X[[i]], ...): non-numeric argument to mathematical function

results <- lapply(elements, function(x) try(log(x)))


## Warning in log(x): NaNs produced
There isn’t a built-in function to test for the try-error class, so we’ll define one. Then you can easily
find the locations of errors with vapply() and extract the successes or look at the inputs that led to
failures.
is.error <- function(x) inherits(x, "try-error")
succeeded <- !vapply(results, is.error, logical(1))
# look at successful results
str(results[succeeded])

## List of 3
## $ : num [1:10] 0 0.693 1.099 1.386 1.609 ...
## $ : num [1:2] NaN 2.3
## $ : num [1:2] 0 -Inf

# look at inputs that failed


str(elements[!succeeded])
## List of 1
## $ : chr [1:26] "a" "b" "c" "d" ...
Another useful try() idiom is using a default value if an expression fails. Simply assign the default value
outside the try block, and then run the risky code:
default <- NULL
try(default <- read.csv("possibly-bad-input.csv"), silent = TRUE)
There is also plyr::failwith(), which makes this strategy even easier to implement.
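For instance, a brief sketch of this approach (assuming plyr is installed; safe_read is our own
hypothetical name):
safe_read <- plyr::failwith(default = NULL, read.csv, quiet = TRUE)
result <- safe_read("possibly-bad-input.csv")  # NULL if the file can't be read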
 Handle conditions with tryCatch()
tryCatch() is a general tool for handling conditions: in addition to errors, you can take different actions
for warnings, messages, and interrupts. You’ve seen errors (made by stop()), warnings (warning()) and
messages (message()) before, but interrupts are new. They can’t be generated directly by the
programmer, but are raised when the user attempts to terminate execution by pressing Ctrl + Break,
Escape, or Ctrl + C (depending on the platform).
With tryCatch() you map conditions to handlers, named functions that are called with the condition
as an input. If a condition is signalled, tryCatch() will call the first handler whose name matches one of
the classes of the condition. The only useful built-in names are error, warning, message, interrupt, and
the catch-all condition. A handler function can do anything, but typically it will either return a value or
create a more informative error message. For example, the show_condition() function below sets up
handlers that return the type of condition signalled:
show_condition <- function(code) {
  tryCatch(code,
    error = function(c) "error",
    warning = function(c) "warning",
    message = function(c) "message"
  )
}
show_condition(stop("!"))
## [1] "error"
show_condition(warning("?!"))
## [1] "warning"
show_condition(message("?"))
## [1] "message"
# If no condition is captured, tryCatch returns the
# value of the input
show_condition(10)
## [1] 10
You can use tryCatch() to implement try(). A simple implementation is shown below. base::try() is
more complicated in order to make the error message look more like what you’d see if tryCatch()
wasn’t used. Note the use of conditionMessage() to extract the message associated with the original
error.
try2 <- function(code, silent = FALSE) {
  tryCatch(code, error = function(c) {
    msg <- conditionMessage(c)
    if (!silent) message(c)
    invisible(structure(msg, class = "try-error"))
  })
}
try2(1)
## [1] 1
try2(stop("Hi"))
try2(stop("Hi"), silent = TRUE)
As well as returning default values when a condition is signalled, handlers can be used to make more
informative error messages. For example, by modifying the message stored in the error condition
object, the following function wraps read.csv() to add the file name to any errors:
read.csv2 <- function(file, ...) {
  tryCatch(read.csv(file, ...), error = function(c) {
    c$message <- paste0(c$message, " (in ", file, ")")
    stop(c)
  })
}
read.csv("code/dummy.csv")
## Error in file(file, "rt"): cannot open the connection
read.csv2("code/dummy.csv")
## Error in file(file, "rt"): cannot open the connection (in code/dummy.csv)
Catching interrupts can be useful if you want to take special action when the user tries to abort running
code. But be careful, it’s easy to create a loop that you can never escape (unless you kill R)!
# Don't let the user interrupt the code
i <- 1
while(i < 3) {
  tryCatch({
    Sys.sleep(0.5)
    message("Try to escape")
  }, interrupt = function(x) {
    message("Try again!")
    i <<- i + 1
  })
}
tryCatch() has one other argument: finally. It specifies a block of code (not a function) to run regardless
of whether the initial expression succeeds or fails. This can be useful for clean up (e.g., deleting files,
closing connections). This is functionally equivalent to using on.exit() but it can wrap smaller chunks
of code than an entire function.
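As a small sketch (our own example, not from the text above), finally can guarantee clean-up of a
connection:
con <- file("output.txt", "w")
tryCatch({
  writeLines("partial results", con)
  stop("something went wrong")
},
  error = function(c) message("Caught: ", conditionMessage(c)),
  finally = close(con)  # runs whether or not an error occurred
)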
 withCallingHandlers()
An alternative to tryCatch() is withCallingHandlers(). The difference between the two is that the
former establishes exiting handlers while the latter registers local handlers. Here are the main
differences between the two kinds of handlers:
 The handlers in withCallingHandlers() are called in the context of the call that generated the
condition whereas the handlers in tryCatch() are called in the context of tryCatch(). This is
shown here with sys.calls(), which is the run-time equivalent of traceback() — it lists all calls
leading to the current function.
f <- function() g()
g <- function() h()
h <- function() stop("!")

tryCatch(f(), error = function(e) print(sys.calls()))


# [[1]] tryCatch(f(), error = function(e) print(sys.calls()))
# [[2]] tryCatchList(expr, classes, parentenv, handlers)
# [[3]] tryCatchOne(expr, names, parentenv, handlers[[1L]])
# [[4]] value[[3L]](cond)

withCallingHandlers(f(), error = function(e) print(sys.calls()))


# [[1]] withCallingHandlers(f(),
# error = function(e) print(sys.calls()))
# [[2]] f()
# [[3]] g()
# [[4]] h()
# [[5]] stop("!")
# [[6]] .handleSimpleError(
# function (e) print(sys.calls()), "!", quote(h()))
# [[7]] h(simpleError(msg, call))
This also affects the order in which on.exit() is called.
 A related difference is that with tryCatch(), the flow of execution is interrupted when a handler
is called, while with withCallingHandlers(), execution continues normally when the handler
returns. This includes the signalling function which continues its course after having called the
handler (e.g., stop() will continue stopping the program and message() or warning() will
continue signalling a message/warning). This is why it is often better to handle a message with
withCallingHandlers() rather than tryCatch(), since the latter will stop the program:
message_handler <- function(c) cat("Caught a message!\n")
tryCatch(message = message_handler, {
  message("Someone there?")
  message("Why, yes!")
})
## Caught a message!
withCallingHandlers(message = message_handler, {
  message("Someone there?")
  message("Why, yes!")
})
## Caught a message!
## Someone there?
## Caught a message!
## Why, yes!
 The return value of a handler is returned by tryCatch(), whereas it is ignored with
withCallingHandlers():
f <- function() message("!")
tryCatch(f(), message = function(m) 1)
## [1] 1
withCallingHandlers(f(), message = function(m) 1)
## !
These subtle differences are rarely useful, except when you’re trying to capture exactly what went
wrong and pass it on to another function. For most purposes, you should never need to use
withCallingHandlers().
Custom Signal Classes
One of the challenges of error handling in R is that most functions just call stop() with a string. That
means if you want to figure out if a particular error occurred, you have to look at the text of the error
message. This is error prone, not only because the text of the error might change over time, but also
because many error messages are translated, so the message might be completely different to what
you expect.
R has a little-known and little-used feature to solve this problem. Conditions are S3 classes, so you can
define your own classes if you want to distinguish different types of error. Each condition signalling
function, stop(), warning(), and message(), can be given either a list of strings, or a custom S3 condition
object. Custom condition objects are not used very often, but are very useful because they make it
possible for the user to respond to different errors in different ways. For example, “expected” errors
(like a model failing to converge for some input datasets) can be silently ignored, while unexpected
errors (like no disk space available) can be propagated to the user.
R doesn’t come with a built-in constructor function for conditions, but we can easily add one.
Conditions must contain message and call components, and may contain other useful components.
When creating a new condition, it should always inherit from condition and should in most cases
inherit from one of error, warning, or message.
condition <- function(subclass, message, call = sys.call(-1), ...) {
  structure(
    class = c(subclass, "condition"),
    list(message = message, call = call),
    ...
  )
}
is.condition <- function(x) inherits(x, "condition")
You can signal an arbitrary condition with signalCondition(), but nothing will happen unless you’ve
instantiated a custom signal handler (with tryCatch() or withCallingHandlers()). Instead, pass this
condition to stop(), warning(), or message() as appropriate to trigger the usual handling. R won’t
complain if the class of your condition doesn’t match the function, but in real code you should pass a
condition that inherits from the appropriate class: "error" for stop(), "warning" for warning(), and
"message" for message().
e <- condition(c("my_error", "error"), "This is an error")
signalCondition(e)
# NULL
stop(e)
# Error: This is an error
w <- condition(c("my_warning", "warning"), "This is a warning")
warning(w)
# Warning message: This is a warning
m <- condition(c("my_message", "message"), "This is a message")
message(m)
# This is a message
You can then use tryCatch() to take different actions for different types of errors. In this example, we
make a convenient custom_stop() function that allows us to signal error conditions with arbitrary
classes. In a real application, it would be better to have individual S3 constructor functions that you
could document, describing the error classes in more detail.
custom_stop <- function(subclass, message, call = sys.call(-1), ...) {
  c <- condition(c(subclass, "error"), message, call = call, ...)
  stop(c)
}
my_log <- function(x) {
  if (!is.numeric(x))
    custom_stop("invalid_class", "my_log() needs numeric input")
  if (any(x < 0))
    custom_stop("invalid_value", "my_log() needs positive inputs")
  log(x)
}
tryCatch(
  my_log("a"),
  invalid_class = function(c) "class",
  invalid_value = function(c) "value"
)
## [1] "class"
Note that, when using tryCatch() with multiple handlers and custom classes, the first handler to match
any class in the signal’s class hierarchy is called, not the best match. For this reason, you need to make
sure to put the most specific handlers first:
tryCatch(custom_stop("my_error", "!"),
  error = function(c) "error",
  my_error = function(c) "my_error"
)
## [1] "error"
tryCatch(custom_stop("my_error", "!"),
  my_error = function(c) "my_error",
  error = function(c) "error"
)
## [1] "my_error"

9.8 DEFENSIVE PROGRAMMING


Defensive programming is the art of making code fail in a well-defined manner even when something
unexpected occurs. A key principle of defensive programming is to “fail fast”: as soon as something
wrong is discovered, signal an error. This is more work for the author of the function (you!), but it
makes debugging easier for users because they get errors earlier rather than later, after unexpected
input has passed through several functions.
In R, the “fail fast” principle is implemented in three ways:

 Be strict about what you accept. For example, if your function is not vectorised in its inputs,
but uses functions that are, make sure to check that the inputs are scalars. You can use
stopifnot(), the assertthat package, or simple if statements and stop() (a minimal sketch
follows this list).
 Avoid functions that use non-standard evaluation, like subset, transform, and with. These
functions save time when used interactively, but because they make assumptions to reduce
typing, when they fail, they often fail with uninformative error messages.
 Avoid functions that return different types of output depending on their input. The two
biggest offenders are [ and sapply(). Whenever subsetting a data frame in a function, you
should always use drop = FALSE, otherwise you will accidentally convert 1-column data frames
into vectors. Similarly, never use sapply() inside a function: always use the stricter vapply()
which will throw an error if the inputs are incorrect types and return the correct type of output
even for zero-length inputs.
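A minimal sketch combining these ideas (col_means() is our own hypothetical example):
col_means <- function(df) {
  stopifnot(is.data.frame(df))              # fail fast on bad input
  vapply(df, mean, FUN.VALUE = numeric(1))  # vapply() guarantees a numeric result
}
col_means(mtcars[, 1:3])
col_means("not a data frame")
#> Error in col_means("not a data frame") : is.data.frame(df) is not TRUE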
There is a tension between interactive analysis and programming. When you’re working interactively,
you want R to do what you mean. If it guesses wrong, you want to discover that right away so you can
fix it. When you’re programming, you want functions that signal errors if anything is even slightly
wrong or underspecified. Keep this tension in mind when writing functions. If you’re writing functions
to facilitate interactive data analysis, feel free to guess what the analyst wants and recover from minor
misspecifications automatically. If you’re writing functions for programming, be strict. Never try to
guess what the caller wants.

Check your Progress 1


State True or False.
1. The traceback() function can be used to print a summary of how your program arrived at the
error.
2. recover() inserts a browser statement in the first line of the specified function.
3. tryCatch() gives you the ability to continue execution even when an error occurs.

Activity 1
1. Given this function, use trace() to add a browser() statement before the call to stop():
fun <- function() {
  for (i in 1:10000) {
    if (i == 9876)
      stop("Ohno!")
  }
}
2. Determine the bugs in your R code and try to use different functions.
3. Compare the following two implementations of message2error(). What is the main advantage
of withCallingHandlers() in this scenario? (Hint: look carefully at the traceback.)
message2error <- function(code) {
  withCallingHandlers(code, message = function(e) stop(e))
}
message2error <- function(code) {
  tryCatch(code, message = function(e) stop(e))
}
4. The following function “lags” a vector, returning a version of x that is n values behind the
original. Improve the function so that it (1) returns a useful error message if n is not a vector,
and (2) has reasonable behaviour when n is 0 or longer than x.
lag <- function(x, n = 1L) {
  xlen <- length(x)
  c(rep(NA, n), x[seq_len(xlen - n)])
}
Summary
 Debugging is designed to help you find bugs by figuring out where the code is not behaving in
the way that you expect. To do this, you need to: Begin running the code, Stop the code at the
point where you suspect the problem is arising, and Look at and/or walk through the code,
step-by-step at that point.
 The traceback() function can be used to print a summary of how your program arrived at the
error. This is also called a call stack, stack trace or backtrace.
 There are three alternatives to using browser(): setting breakpoints in RStudio, options(error
= recover), and debug() and other related functions.
 It is also possible to use an interactive debugger (gdb or lldb) for compiled code like C or C++.
 In R, there are three tools for handling conditions (including errors) programmatically: try(),
tryCatch() and withCallingHandlers().
 Defensive programming is the art of making code fail in a well-defined manner even when
something unexpected occurs. A key principle of defensive programming is to “fail fast”: as
soon as something wrong is discovered, signal an error.

Keywords
 Bug: It is an error, flaw, failure or fault in a computer program or system that causes it to
produce an incorrect or unexpected result, or to behave in unintended ways.
 Compiled code: It is a set of files that must be linked together and with one master list of steps
in order for it to run as a program.

Self-Assessment Questions
1. Explain interactive debugger.
2. State the difference between recover() and debug().
3. What is dump.frames()?
4. Write a short note on “fail fast” principle.
5. Explain the tryCatch() with example.

Answers to Check your Progress


Check your Progress 1
State True or False.
1. True
2. False
3. False

Suggested Reading
1. The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff.
2. https://kingaa.github.io/R_Tutorial/#data-structures-in-r.
3. https://d1b10bmlvqabco.cloudfront.net/attach/ighbo26t3ua52t/igp9099yy4v10/igz7vp4w5
su9/OReilly_HandsOn_Programming_with_R_2014.pdf.
4. The R Book by Michael J. Crawley, Imperial College London at Silwood Park, UK.
5. R Programming for Data Science by Roger D. Peng.
6. An introduction to R by Longhow Lam.
7. Learning Statistics with R by Danielle Navarro.
8. Balamuta, James. 2018a. Errorist: Automatically Search Errors or Warnings.
https://github.com/coatless/errorist.
9. Balamuta, James. 2018b. Searcher: Query Search Interfaces.
https://github.com/coatless/searcher.
Unit 10
Introduction to Parallel Computing in R
Structure:

10.1 Introduction

10.2 Why Parallelism?

10.3 Processors (CPUs) and Cores

10.4 When to Parallelise

10.5 Loops and Repetitive Tasks using lapply

10.6 Parallelise using: mclapply

Summary

Keywords

Self-Assessment Questions

Answers to Check your Progress

Suggested Reading

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) as requested by the work’s creator or
licensees. This license is available at https://creativecommons.org/licenses/by-nc-sa/4.0/ .
Objectives

After going through this unit, you will be able to:


 Understand what parallel computing is and when it may be useful
 Understand how parallelism can work
 Review sequential loops and *apply functions
 Understand and use the parallel package multicore functions

10.1 INTRODUCTION
Processing large amounts of data with complex models can be time consuming. New types of sensing
mean the scale of data collection today is massive, and modelled outputs can be large as well. For
example, consider a 2 TB (terabyte) set of modelled output data from Ofir Levy et al. 2016 that
models 15 environmental variables at hourly time scales for hundreds of years across a regular grid
spanning a good chunk of North America.

There are over 400,000 individual netCDF files in the Levy et al. microclimate data set. Processing them
would benefit massively from parallelisation.

Alternatively, think of remote sensing data. Processing airborne hyperspectral data can involve
processing each of hundreds of bands of data for each image in a flight path that is repeated many
times over months and years.

10.2 WHY PARALLELISM?


Much R code runs fast and fine on a single processor. But at times, computations can be:

 CPU-bound: Take too much CPU time


 Memory-bound: Take too much memory
 I/O-bound: Take too much time to read/write from disk
 Network-bound: Take too much time to transfer

To help with CPU-bound computations, one can take advantage of modern processor architectures
that provide multiple cores on a single processor, and thereby enable multiple computations to take
place at the same time. In addition, some machines ship with multiple processors, allowing large
computations to occur across the entire cluster of those computers. Plus, these machines also have
large amounts of memory to avoid memory-bound computing jobs.

10.3 PROCESSORS (CPUS) AND CORES


A modern CPU (Central Processing Unit) is at the heart of every computer. While traditional computers
had a single CPU, modern computers can ship with multiple processors, which in turn can each contain
multiple cores. These processors and cores are available to perform computations.

A computer with one processor may still have 4 cores (quad-core), allowing 4 computations to be
executed at the same time.
A typical modern computer has multiple cores, ranging from one or two in laptops to thousands in
high performance compute clusters. Imagine, for example, a machine with four quad-core processors,
for a total of 16 cores.

You can think of this as allowing 16 computations to happen at the same time. Theoretically, your
computation would take 1/16 of the time (but only theoretically, more on that later).

10.4 WHEN TO PARALLELISE


It’s not as simple as it may seem. While in theory each added processor would linearly increase the
throughput of a computation, there is overhead that reduces that efficiency. For example, the code
and, importantly, the data need to be copied to each additional CPU, and this takes time and
bandwidth. Plus, new processes and/or threads need to be created by the operating system, which
also takes time. This overhead reduces the efficiency enough that realistic performance gains are
much less than theoretical, and usually do not scale linearly as a function of processing power. For
example, if the time that a computation takes is short, then the overhead of setting up these additional
resources may actually overwhelm any advantages of the additional processing power, and the
computation could potentially take longer!

In addition, not all of a task can be parallelised. Depending on the proportion, the expected speedup
can be significantly reduced. This relationship is captured by Amdahl’s Law, under which the speedup
of the computation is a function of both the number of cores and the proportion of the computation
that can be parallelised.
So, it’s important to evaluate the computational efficiency of requests, and work to ensure that
additional compute resources brought to bear will pay off in terms of increased work being done. With
that, let’s do some parallel computing.

10.5 LOOPS AND REPETITIVE TASKS USING lapply


When you have a list of repetitive tasks, you may be able to speed it up by adding more computing
power. If each task is completely independent of the others, then it is a prime candidate for executing
those tasks in parallel, each on its own core. For example, let’s build a simple loop that uses sample
with replacement to do a bootstrap analysis. In this case, we select Sepal.Length and Species from the
iris dataset, subset it to 100 observations, and then iterate across 10,000 trials, each time resampling
the observations with replacement. We then run a logistic regression fitting species as a function of
length, and record the coefficients for each trial to be returned.

x <- iris[which(iris[,5] != "setosa"), c(1,5)]


trials <- 10000
res <- data.frame()
system.time({
  trial <- 1
  while(trial <= trials) {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    r <- coefficients(result1)
    res <- rbind(res, r)
    trial <- trial + 1
  }
})

## user system elapsed


## 20.031 0.458 21.220
The issue with this loop is that we execute each trial sequentially, which means that only one of the 8
cores on this machine is in use. In order to exploit parallelism, we need to be able to dispatch
our tasks as functions, with one task going to each core. To do that, we need to convert our task
to a function, and then use the *apply() family of R functions to apply that function to all of the
members of a set. In R, using apply is often significantly faster than the equivalent code in a loop.
Here’s the same code rewritten to use lapply(), which applies a function to each of the members of a
list (in this case the trials we want to run):

x <- iris[which(iris[,5] != "setosa"), c(1,5)]


trials <- seq(1, 10000)
boot_fx <- function(trial) {
  ind <- sample(100, 100, replace=TRUE)
  result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
  r <- coefficients(result1)
  res <- rbind(data.frame(), r)
}
system.time({
  results <- lapply(trials, boot_fx)
})
## user system elapsed
## 19.340 0.553 20.315

Approaches to Parallelisation

When parallelising jobs, one can:

 Use the multiple cores on a local computer through mclapply


 Use multiple processors on local (and remote) machines using makeCluster and clusterApply
o In this approach, one has to manually copy data and code to each cluster member
using clusterExport (see the sketch below)
o This is extra work, but sometimes gaining access to a large cluster is worth it.
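A brief sketch of the cluster-based approach, reusing the x, trials, and boot_fx objects defined above:
library(parallel)
cl <- makeCluster(4)                       # start 4 worker processes
clusterExport(cl, "x")                     # manually copy data to each worker
results <- parLapply(cl, trials, boot_fx)  # like lapply(), but spread across the workers
stopCluster(cl)                            # always shut the workers down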

10.6 PARALLELISE USING: mclapply


The parallel library can be used to send tasks (encoded as function calls) to each of the processing
cores on your machine in parallel. This is done with the parallel::mclapply function, which is
analogous to lapply, but distributes the tasks across multiple cores. mclapply gathers up the
responses from each of these function calls, and returns a list of responses that is the same length as
the list or vector of input data (one return per input item). Note that mclapply achieves this by forking
the R process, which is not available on Windows; there, mc.cores must be 1.

library(parallel)
library(MASS)

starts <- rep(100, 40)


fx <- function(nstart) kmeans(Boston, 4, nstart=nstart)
numCores <- detectCores()
numCores
## [1] 8
system.time(
  results <- lapply(starts, fx)
)

## user system elapsed


## 1.346 0.024 1.372
system.time(
  results <- mclapply(starts, fx, mc.cores = numCores)
)
## user system elapsed
## 0.801 0.178 0.367

Now let’s demonstrate with our bootstrap example:

x <- iris[which(iris[,5] != "setosa"), c(1,5)]


trials <- seq(1, 10000)
boot_fx <- function(trial) {
  ind <- sample(100, 100, replace=TRUE)
  result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
  r <- coefficients(result1)
  res <- rbind(data.frame(), r)
}
system.time({
  results <- mclapply(trials, boot_fx, mc.cores = numCores)
})

## user system elapsed


## 25.672 1.343 5.003

Parallelise using: foreach and doParallel

The normal for loop in R looks like:

for (i in 1:3) {
  print(sqrt(i))
}
## [1] 1
## [1] 1.414214
## [1] 1.732051

The foreach method is similar, but uses the sequential %do% operator to indicate an expression to
run. Note the difference in the returned data structure.

library(foreach)
foreach (i=1:3) %do% {
  sqrt(i)
}
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051

In addition, foreach supports a parallelisable operator %dopar% from the doParallel package. This
allows each iteration through the loop to use different cores or different machines in a cluster. Here,
we demonstrate with using all the cores on the current machine:

library(foreach)
library(doParallel)

## Loading required package: iterators


registerDoParallel(numCores) # use multicore, set to the number of our cores
foreach (i=1:3) %dopar% {
  sqrt(i)
}
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
# To simplify output, foreach has the .combine parameter that can simplify return values

# Return a vector
foreach (i=1:3, .combine=c) %dopar% {
  sqrt(i)
}
## [1] 1.000000 1.414214 1.732051
# Return a data frame
foreach (i=1:3, .combine=rbind) %dopar% {
  sqrt(i)
}
## [,1]
## result.1 1.000000
## result.2 1.414214
## result.3 1.732051
The doParallel vignette on CRAN shows a much more realistic example, where one can use %dopar%
to parallelise a bootstrap analysis where a data set is resampled 10,000 times and the analysis is rerun
on each sample, and then the results combined:

# Let's use the iris data set to do a parallel bootstrap


# From the doParallel vignette, but slightly modified
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
system.time({
  r <- foreach(icount(trials), .combine=rbind) %dopar% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)
  }
})
## user system elapsed
## 24.117 1.303 4.944
# And compare that to what it takes to do the same analysis in serial
system.time({
  r <- foreach(icount(trials), .combine=rbind) %do% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)
  }
})
## user system elapsed
## 19.445 0.571 20.302
# When you're done, clean up the cluster
stopImplicitCluster()

Check your Progress 1


State True or False.

1. When parallelising jobs, one can use multiple processors on a local computer through
mclapply.
2. foreach supports a parallelisable operator %dopar% from the doParallel package.

Activity 1

1. Install doParallel package and execute a code using the same.

Summary

 In this unit, we showed examples of computing tasks that are likely limited by the number of
CPU cores that can be applied, and we reviewed the architecture of computers to understand
the relationship between CPU processors and cores.
 We reviewed the way in which traditional for loops in R can be rewritten as functions that are
applied to a list serially using lapply, and then how the parallel package mclapply function can
be substituted in order to utilize multiple cores on the local computer to speed up
computations.
 Finally, we reviewed the use of the foreach package with the %dopar% operator to accomplish
a similar parallelisation using multiple cores.

Keywords

 Cores: A core is part of a CPU that receives instructions and performs calculations, or
actions, based on those instructions.
 Processor: A processor is the logic circuitry that responds to and processes the basic
instructions that drive a computer. The four primary functions of a processor are fetch,
decode, execute and writeback.
 Parallel Computing/Processing: It is a type of computation in which many calculations or the
execution of processes are carried out simultaneously.

Self-Assessment Questions

1. Explain the concept of parallelism.


2. Discuss the packages which support parallelism in R.

Answers To Check Your Progress


Check your Progress 1

State True or False.

1. False
2. True

Suggested Reading

1. The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff.
2. https://kingaa.github.io/R_Tutorial/#data-structures-in-r.
3. https://d1b10bmlvqabco.cloudfront.net/attach/ighbo26t3ua52t/igp9099yy4v10/igz7vp4w5su9/OReilly_HandsOn_Programming_with_R_2014.pdf.
4. The R Book by Michael J. Crawley, Imperial College London at Silwood Park, UK.
5. R Programming for Data Science by Roger D. Peng.
6. An introduction to R by Longhow Lam.
7. Learning Statistics with R by Danielle Navarro.
8. Balamuta, James. 2018a. Errorist: Automatically Search Errors or Warnings.
https://github.com/coatless/errorist.
9. Balamuta, James. 2018b. Searcher: Query Search Interfaces.
https://github.com/coatless/searcher.
