Working With R
StatsBomb Data In R
What Is R and Why Use It?
R is a programming language that is well suited to managing large datasets.
It is especially useful in the world of football data, as it allows us to manipulate that data to various ends, such as creating metrics out of it and visualising it.
R itself can be downloaded here:
https://cran.r-project.org/mirrors.html
We at StatsBomb use R regularly (amongst other coding languages) in day-to-day work, particularly within
our analysis department. Spreadsheets are a viable route when you’re just starting out, but eventually the
datasets become too big and unwieldy, and performing nuanced dissection of them becomes too complicated.
Once you’ve gotten over the learning curve, R is ideal for parsing data quickly and working with it however
you like.
RStudio
The base version of R is a somewhat cumbersome piece of software. This has led to the creation of many
different IDEs (integrated development environments). These are wrappers around the initial R install that
make most tasks within R easier and more manageable for the end user. The most popular of these is
RStudio:
https://www.rstudio.com/products/rstudio/
It is recommended that you install RStudio (or any similar IDE that you find and prefer) as most users do. It
will make working with StatsBomb’s data a cleaner, simpler process.
Opening a New R Project
This (minus the annotations, of course) is what you should see when you load up RStudio.
If you’re wondering what a particular option or section of RStudio does, there’s a handy set of cheat sheets for it and many other R-related subjects at:
https://www.rstudio.com/resources/cheatsheets/
There is a wealth of resources out there with detailed answers to just about any question you could have related to R.
Packages & ‘StatsBombR’
What is an ‘R Package’?
‘Packages’ are downloadable bundles of functions that make tasks easier. Most packages are installed by
running install.packages("PackageNameHere"). However, if the package comes via Github we use the devtools
package to install it (this includes StatsBombR, which we will walk through installing below).
The main packages we will focus on here and which need installing are:
‘tidyverse’: tidyverse contains a whole host of other packages (such as dplyr and magrittr) that are useful for
manipulating data. install.packages("tidyverse")
‘devtools’: Most packages are hosted on CRAN. However, there are also countless useful ones hosted on
Github. Devtools allows for downloading of packages directly from Github. install.packages("devtools")
‘ggplot2’: The most popular package for visualising data within R. It is contained within tidyverse.
‘StatsBombR’: This is StatsBomb’s own package for parsing our data.
Once a package is installed it can be loaded into R by running library(PackageNameHere). You should load all
of these at the start of any session.
What is ‘StatsBombR’ and How to Install It?
StatsBomb’s former data scientist Derrick Yam created ‘StatsBombR’, an R package dedicated to making
StatsBomb’s data much easier to use in R. It can be found on Github at the following link, along with much more
information on its uses. There are lots of helpful functions within it that you should get to know.
https://github.com/statsbomb/StatsBombR
To install the package in R, you’ll first need the ‘devtools’ and ‘remotes’ packages, along with a specific
version of ‘SDMTools’. These can be installed by running the following lines of code:
install.packages("devtools")
install.packages("remotes")
remotes::install_version("SDMTools", "1.1-221")
Then, to install StatsBombR itself, run:
devtools::install_github("statsbomb/StatsBombR")
Finding More Info On Packages
If you want more detail on the various functions within
a package then click on the package’s name in the
viewer in the bottom right. That will take you to the
documentation for that package. It should contain all
sorts of information on the ins and outs of its functions.
Pulling in StatsBomb Data
Key Functions for Getting the Free Data
There are several key functions within StatsBombR to familiarise yourself with for bringing StatsBomb Data into R.
FreeCompetitions() - This shows you all the competitions that are available as free data.
If you want to store the output of this (or any other function) so you can pull it up at any time, instead of just
having it in the R console, you can run something like the following:
Comp <- FreeCompetitions()
Then, any time you run Comp (or whatever name you choose to store it under), you will see the output of
FreeCompetitions().
Matches <- FreeMatches(Comp) - This shows the available matches within the competitions chosen.
StatsBombData <- StatsBombFreeEvents(MatchesDF = Matches, Parallel = T) - This pulls all the event data
for the matches that are chosen.
Pulling the Free Data
Now we’re going to run through an example of how to pull the data into R. Open up a new ‘script’, so we can store this
code and have it easily accessible, by going to File -> New File -> R Script. This script can be saved at any time.
library(tidyverse)
library(StatsBombR) #1
Comp <- FreeCompetitions() %>%
  filter(competition_id==37 & season_name=="2020/2021") #2
Matches <- FreeMatches(Comp) #3
StatsBombData <- StatsBombFreeEvents(MatchesDF = Matches, Parallel = T) #4
StatsBombData = allclean(StatsBombData) #5
#1: tidyverse loads many different packages. Most important for this task are dplyr and magrittr. library(StatsBombR) loads the StatsBombR package.
#2: This grabs the competitions that are available to the user and filters them down, using dplyr’s ‘filter’ function, to just the 2020/21 FA Women’s Super League season in this example.
#3: This pulls all the matches for the desired competition.
#4: Now we have created a ‘dataframe’ (essentially a table) called ‘StatsBombData’ (or whatever you choose to call it) of the free event data for the FAWSL season in 2020/2021.
#5: allclean() adds lots of new, helpful columns derived from the existing ones (see ‘Useful StatsBombR Functions’).
Data Use Case 2: From Data to a Chart
library(ggplot2)
ggplot(data = shots_goals,
       aes(x = reorder(team.name, shots), y = shots)) + #1
  geom_bar(stat = "identity", width = 0.5) + #2
  labs(y="Shots") + #3
  theme(axis.title.y = element_blank()) + #4
  scale_y_continuous( expand = c(0,0)) + #5
  coord_flip() + #6
  theme_SB() #7
#1: Here we are telling ggplot what data we are using and what we want to plot on the x/y axes of our chart. ‘Reorder’ quite literally reorders the team names according to the number of shots they have.
#2: Now we are telling ggplot to format it as a bar chart.
#3: This relabels the shots axis.
#4: This removes the title for the axis.
#5: Here we cut down on the space between the bars and the edge of the plot.
#6: This flips the entire plot, with the bars now going horizontally instead.
#7: theme_SB() is our own internal visual aesthetic for ggplot charts that we have packaged with StatsBombR. Optional of course.
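The chart code above reads from a table called shots_goals, which is not defined anywhere in this section. The sketch below shows one way such a table could be built, assuming it holds one row per team with a shots column (and, going by its name, a goals column). The toy events data frame, its values and the shot.outcome.name labels used here are invented for illustration. With real data you would start from StatsBombData instead.

```r
library(dplyr)

# Toy stand-in for StatsBombData: one row per event (values are made up).
events <- data.frame(
  team.name = c("Team A", "Team A", "Team A", "Team B", "Team B"),
  type.name = c("Shot", "Pass", "Shot", "Shot", "Pass"),
  shot.outcome.name = c("Goal", NA, "Saved", "Saved", NA)
)

# Count shots and goals per team, using the same group_by/summarise
# pattern that Use Case 3 applies at player level.
shots_goals <- events %>%
  group_by(team.name) %>%
  summarise(shots = sum(type.name == "Shot", na.rm = TRUE),
            goals = sum(shot.outcome.name == "Goal", na.rm = TRUE))
```

The na.rm = TRUE matters for the goals column, since shot.outcome.name is empty for every event that is not a shot.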
All that should result in a chart like this.
Of course this is a basic chart, fairly visually plain on its own, and it could be altered in many ways to add your own spin on it.
Almost every element of a ggplot chart - from the text to the plotted data itself and beyond - can be changed how you see fit. There’s lots of room for creativity.
For an in-depth reference point on what kind of charts you can create or how you can modify them, you can look here:
https://ggplot2.tidyverse.org/reference/
Data Use Case 3: Player Shots Per 90
player_shots = StatsBombData %>%
  group_by(player.name, player.id) %>%
  summarise(shots = sum(type.name=="Shot", na.rm = TRUE)) #1
player_minutes = get.minutesplayed(StatsBombData) #2
player_minutes = player_minutes %>%
  group_by(player.id) %>%
  summarise(minutes = sum(MinutesPlayed)) #3
player_shots = left_join(player_shots, player_minutes) #4
player_shots = player_shots %>% mutate(nineties = minutes/90) #5
player_shots = player_shots %>% mutate(shots_per90 = shots/nineties) #6
#1: Much the same as the team calculation. We are including ‘player.id’ here as it will be important later.
#2: This function gives us the minutes played in each match by every player in the dataset.
#3: Now we group that by player and sum it all together to get their total minutes played.
#4: left_join allows us to combine our shots table and our minutes table, with the player.id acting as a reference point.
#5: mutate is a dplyr function that creates a new column. In this instance we are creating a column that divides the minutes totals by 90, giving us each player’s number of 90s played for the season.
#6: Finally we divide our shots totals by our number of 90s to get our shots per 90 column.
Now you’ll have shots per 90 for all the players across the WSL (or your league of choice). This can be cleaned up using dplyr’s ‘filter’ function, in order to get rid of the players with few minutes played.
This same process can of course be applied to all sorts of events with StatsBomb Data, such as certain types of passes, defensive actions and so on.
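The clean-up step could look something like the sketch below. It uses a toy version of the player_shots table with made-up values, and the 600-minute threshold is only an illustrative choice (min_minutes is a name introduced here, not part of StatsBombR).

```r
library(dplyr)

# Toy stand-in for the player_shots table built above (values are made up).
player_shots <- data.frame(
  player.name = c("Player A", "Player B", "Player C"),
  shots = c(30, 2, 12),
  minutes = c(1800, 45, 900)
)

# Same per-90 arithmetic as steps #5 and #6.
player_shots <- player_shots %>%
  mutate(nineties = minutes / 90,
         shots_per90 = shots / nineties)

# Keep only players at or above a chosen minutes threshold.
min_minutes <- 600
player_shots_filtered <- player_shots %>%
  filter(minutes >= min_minutes)
```

In this toy table Player B has the highest shots per 90 (2 shots in half a match), which is exactly the kind of small-sample noise the filter is there to remove.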
Data Use Case 4: Plotting Passes
Finally, we’re going to look at plotting a player’s passes on a pitch. For this we of course need some sort of
pitch visualisation. You might want to create your own once you become more familiar with ggplot and using
it for more complex purposes (there will be a flexible version that we use later in this presentation). However,
handily, there are several pre-made solutions out there.
The one we’ll be using here comes courtesy of FC rStats, a Twitter user who has put together various helpful,
public R packages for parsing football data. The package is called ‘SBpitch’ and it does exactly what it says on
the tin. There are further options in the ‘Other Useful Packages’ section later in this document. First let’s get it
installed with the following code:
devtools::install_github("FCrSTATS/SBpitch")
We’re going to plot Fran Kirby’s completed passes into the box for the 2020/21 FA Women’s Super League
season. Plotting all of her passes would get messy of course, so this is a clearer subset. Make sure you’ve used
the functions previously discussed to pull that data.
library(SBpitch)
passes = wsldata %>%
  filter(type.name=="Pass" & is.na(pass.outcome.name) &
         player.id==4641) %>% #1
  filter(pass.end_location.x>=102 & pass.end_location.y<=62 &
         pass.end_location.y>=18) #2
create_Pitch() +
  geom_segment(data = passes, aes(x = location.x, y = location.y,
               xend = pass.end_location.x, yend = pass.end_location.y),
               lineend = "round", size = 0.5, colour = "#000000", arrow =
               arrow(length = unit(0.07, "inches"), ends = "last", type = "open")) + #3
  labs(title = "Fran Kirby, Completed Box Passes", subtitle = "WSL, 2020-21") + #4
  scale_y_reverse() + #5
  coord_fixed(ratio = 105/100) #6
#1: Pull some of the FA WSL data of your choice and call it ‘wsldata’ for us to work with here. Then we can filter to Fran Kirby’s passes. is.na(pass.outcome.name) filters to only completed passes.
#2: Filtering to passes within the box. The coordinates for pitch markings in SBD can be found in our event spec.
#3: This creates an arrow from one point (location.x/y, the start of the pass) to an end point (pass.end_location.x/y, the end of the pass). Lineend, size and length are all customisation options for the arrow.
#4: Creates a title and a subtitle for the plot. You can also add captions using caption =, along with other options.
#5: Reverses the y axis.
#6: Fixes the plot to a set aspect ratio.
You’ll have this plot. Again, it’s simple and bare, but it starts you off, and from here you can layer on all sorts of customisation.
theme() options allow you to change the size, placement, font and much more of the titles, as well as to alter lots of other aesthetic aspects of the plot.
You can map colour = inside aes() in geom_segment() in order to colour passes according to a variable of your choice.
Again, be sure to dig into the ggplot documentation to get the full scope of how powerful it is.
This is a great cheat sheet for various ways you can use the package.
Useful StatsBombR Functions
There are all sorts of functions within StatsBombR for different purposes. You can find them all here, deeper into
the Github page. Not all functions are related to the free data. Some are only accessible for customers via our
API. Here’s a quick rundown of some you may find useful:
allclean() - Mentioned previously but to elucidate: this derives lots of new, helpful columns from the
pre-existing columns. For example, it takes the location column and splits it up into separate x/y columns. It also
extracts freeze frame data and goalkeeper information. Make sure to use it.
get.playerfootedness() - Gives you a player’s assumed preferred foot using our pass footedness data.
get.opposingteam() - Returns an opposing team column for each team in each match.
get.gamestate() - Returns information for how much time each team spent in various game states
(winning/drawing/losing) for each match.
annotate_pitchSB() - Our own solution for plotting a pitch with ggplot.
Other Useful Packages
The community around R is packed with packages that fulfill all sorts of needs. Chances are that, if you’re
looking to do something in R or fix some sort of issue, there’s a package out there for it. There are far too many
to name but here’s a brief selection of some that may be relevant to working with StatsBomb Data:
Ben Torvaney, ggsoccer - A package that contains an alternative for plotting a pitch with SB Data.
Joe Gallagher, soccermatics - Also offers an option for pitch plotting along with other useful shortcuts for
creating heatmaps and so on.
ggrepel - Useful for when you’re having issues with overlapping labels on a chart.
gganimate - If you ever feel like getting more elaborate with your graphics, this gives you a simple way to
create animated ones within R and ggplot.
Doing More With StatsBomb Data In R
More Data Use Cases
The content beyond this point in the guide is aimed at those that have been through the first part of the guide
and have been playing about with SBD for a while now.
It's important that you have done this first as we will not be walking through absolutely everything -- we
assume a certain level of familiarity with R in this section.
There’ll be three use cases this time:
Use Case 5: xG Assisted, Joining, and xG+xGA - An example of how to create and then plot custom metrics
with the data, creating xG Assisted in a dataframe using ‘joining’ and then creating an xG + xG Assisted plot.
Use Case 6: Heatmaps - Heatmaps are one of the ever-presents in football data. They are
fairly easy to make in R once you get your head round how to do so, but can be unintuitive without having it
explained first.
Use Case 7: Shot Maps - Another of the quintessential football visualisations, shot maps come in many
shapes and sizes with an inconsistent overlap in design language between them. This version will attempt to
give you the basics.
Data Use Case 5: xG Assisted, Joining, and xG+xGA
xG assisted does not exist in our data initially. However, given that xGA is the xG value of a shot that a key pass/assist created, and that xG values do exist in our data, we can create xGA quite easily via joining.
library(tidyverse)
library(StatsBombR)
xGA = events %>%
  filter(type.name=="Shot") %>% #1
  select(shot.key_pass_id, xGA = shot.statsbomb_xg) #2
shot_assists = left_join(events, xGA, by = c("id" = "shot.key_pass_id")) %>% #3
  select(team.name, player.name, player.id, type.name, pass.shot_assist,
         pass.goal_assist, xGA ) %>% #4
  filter(pass.shot_assist==TRUE | pass.goal_assist==TRUE) #5
#1 Filtering the data to just shots, as they are the only events with xG values.
#2 Select() allows you to choose which columns you want to, well, select, from your data, as not all are always necessary - especially with big datasets. First we are selecting the shot.key_pass_id column, which is a variable attached to shots that is just the ID of the pass that created the shot. You can also rename columns within select(), which is what we are doing with xGA = shot.statsbomb_xg. This is so that, when we join it with the passes, it already has the correct name.
#3 left_join() lets you combine the columns from two different DFs by using two columns within either side of the join as reference keys. So in this example we are taking our initial DF (‘events’) and joining it with the one we just made (‘xGA’). The key is the by = c("id" = "shot.key_pass_id") part. This is saying ‘join these two DFs on instances where the id column in events matches the ‘shot.key_pass_id’ column in xGA’. So now the passes have the xG of the shots they created attached to them under the new column ‘xGA’.
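To see what the keyed join does, here is a minimal sketch on a toy events table. The ids and xG values are made up, and only the columns needed for the join are included, so this illustrates the mechanism rather than reproducing real StatsBomb data.

```r
library(dplyr)

# Toy events: two passes and two shots (ids and xG values are made up).
events <- data.frame(
  id = c("p1", "p2", "s1", "s2"),
  type.name = c("Pass", "Pass", "Shot", "Shot"),
  shot.key_pass_id = c(NA, NA, "p1", NA),
  shot.statsbomb_xg = c(NA, NA, 0.25, 0.10)
)

# Shots only, keeping the key pass id and renaming xG to xGA.
xGA <- events %>%
  filter(type.name == "Shot") %>%
  select(shot.key_pass_id, xGA = shot.statsbomb_xg)

# Match each event's id against the key pass id stored on the shots.
joined <- left_join(events, xGA, by = c("id" = "shot.key_pass_id")) %>%
  select(id, type.name, xGA)
```

Only pass p1 picks up an xGA value (0.25), because it is the only event whose id appears as a shot.key_pass_id. Every other row gets NA, which is why the original pipeline then filters down to shot assists and goal assists.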
#2 Filtering out penalties and summing each player's xG, then joining with the xGA and adding the two together to get a third combined column.
#3 Getting minutes played for each player. If you went through the earlier data use cases in this guide you will have done this already.
#4 Joining the xG/xGA to the minutes, creating the 90s and dividing each stat by the 90s to get xG per 90 etc.
#5 Here we ungroup as we need the data in ungrouped form for what we're about to do. First we filter to players with a minimum of 600 minutes, just to get rid of notably small samples. Then we use top_n(). This filters your DF to the top *insert number of your choice here* based on a column you specify. So here we're filtering to the top 15 players in terms of xG90+xGA90.
#6 The pivot_longer() function flattens out the data. It's easier to explain what that means if you see it first:
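As a stand-in illustration of that flattening, here is a minimal sketch on a toy per-90 table. The player names, values and the column names variable and value are assumptions made for this example.

```r
library(tidyr)

# Toy per-90 table in "wide" form: one row per player (values are made up).
chart <- data.frame(
  player.name = c("Player A", "Player B"),
  xG90 = c(0.50, 0.30),
  xGA90 = c(0.20, 0.40)
)

# Stack the two metric columns into name/value pairs,
# giving one row per player per metric.
chart_long <- pivot_longer(chart,
                           cols = c(xG90, xGA90),
                           names_to = "variable",
                           values_to = "value")
```

The two-row wide table becomes a four-row long table: each player now appears once for xG90 and once for xGA90, with the metric name in variable and the number in value. That shape lets a single value column be plotted and coloured by variable.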
#3: Here we are providing specific colour hex codes to the values (so xG
= red and xGA = blue) and then labelling them so they are named
correctly on the chart's legend.
The end result should look like this:
Data Use Case 6: Heatmaps
For this example we're going to do a defensive heatmap, looking at what percentage of each team's
overall defensive actions are made in certain zones, then comparing that % vs league average:
library(tidyverse)
#1 Some of the coordinates in our data sit outside the bounds of the pitch (you can see the layout of our pitch coordinates in our event spec,
but it's 0-120 along the x axis and 0-80 along the y axis). This will cause issues with a heatmap and give you dodgy looking zones outside the
pitch. So what we're doing here is using ifelse() to say 'if a location.x/y coordinate is outside the bounds that we want, then replace it with one
that's within the boundaries. If it is not outside the bounds just leave it as is'.
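A minimal sketch of that ifelse() clamping, using a toy set of coordinates with made-up values and assuming each out-of-bounds value is replaced by the nearest boundary:

```r
# Toy event locations, some outside the 0-120 by 0-80 pitch (values are made up).
events <- data.frame(
  location.x = c(-1.5, 60, 120.4),
  location.y = c(40, 80.7, -0.2)
)

# Replace any out-of-bounds coordinate with the nearest boundary value
# and leave everything else as it is.
events$location.x <- ifelse(events$location.x > 120, 120, events$location.x)
events$location.x <- ifelse(events$location.x < 0, 0, events$location.x)
events$location.y <- ifelse(events$location.y > 80, 80, events$location.y)
events$location.y <- ifelse(events$location.y < 0, 0, events$location.y)
```

After this, every x value sits between 0 and 120 and every y value between 0 and 80, so each event can be assigned to a zone on the pitch.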
#2 cut() literally cuts up the data how you ask it to. Here, we're cutting along the x axis (from 0-120, again the length of our pitch according to our
coordinates in the spec) and the y axis (0-80), and we're cutting them 'by' the value we feed it, in this case 20. So we're splitting it up into
buckets of 20. This gives 6 buckets/zones along the x axis (120/20 = 6) and 4 along the y axis (80/20 = 4), which are the buckets we need to
plot our zones.
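The bucketing can be sketched with base R's cut() on a few made-up coordinates. The variable names xbin and ybin match the ones used in the next block of code, and the use of include.lowest = TRUE is an assumption made here so that a coordinate of exactly 0 still lands in the first bucket.

```r
# Toy coordinates already inside the pitch bounds (values are made up).
location.x <- c(0, 19.9, 20.1, 75, 120)
location.y <- c(0, 35, 40, 79, 80)

# Break points every 20 units: 7 breaks on x give 6 zones, 5 on y give 4.
xbin <- cut(location.x, breaks = seq(from = 0, to = 120, by = 20), include.lowest = TRUE)
ybin <- cut(location.y, breaks = seq(from = 0, to = 80, by = 20), include.lowest = TRUE)
```

By default cut() builds intervals that are open on the left, so without include.lowest = TRUE a value sitting exactly on the lowest break would come back as NA rather than falling in a zone.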
heatmap = heatmap%>%
  filter(type.name=="Pressure" | duel.type.name=="Tackle" |
         type.name=="Foul Committed" | type.name=="Interception" |
         type.name=="Block" ) %>%
  group_by(team.name) %>%
  mutate(total_DA = n()) %>%
  group_by(team.name, xbin, ybin) %>%
  summarise(total_DA = max(total_DA),
            bin_DA = n(),
            bin_pct = bin_DA/total_DA,
            location.x = median(location.x),
            location.y = median(location.y)) %>%
  group_by(xbin, ybin) %>%
  mutate(league_ave = mean(bin_pct)) %>%
  group_by(team.name, xbin, ybin) %>%
  mutate(diff_vs_ave = bin_pct - league_ave) #3
#3: This is using those buckets to create the zones. Let's break it down bit-by-bit:
- Filtering to only defensive events.
- Grouping by team and getting how many defensive events they made in total ( n() just counts every row that you ask it to, so here we're counting every row for every team - i.e. counting every defensive event for each team).
- Then we group again by team and the xbin/ybin to count how many defensive events a team has in a given bin/zone - that's what 'bin_DA = n()' is doing. 'total_DA = max(total_DA),' is just grabbing the team totals we made earlier. 'bin_pct = bin_DA/total_DA,' is dividing the two to see what percentage of a team's overall defensive events were made in a given zone. The 'location.x = median(location.x)' and 'location.y = median(location.y)' lines do what they say on the tin, getting the median coordinate for each zone. This is used later in the plotting.
- Then we regroup by bin alone and mutate to find the league average for each bin, followed by grouping by team/bin again and subtracting the league average in each bin from each team's % in those bins to get the difference.
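The grouping logic is easier to follow on a tiny example. The sketch below uses made-up events and, as a simplification, a single zone column in place of the xbin/ybin pair. It follows the same steps but is not the guide's exact code.

```r
library(dplyr)

# Toy defensive events already assigned to zones (values are made up).
events <- data.frame(
  team.name = c("Team A", "Team A", "Team A", "Team A", "Team B", "Team B"),
  zone = c("z1", "z1", "z1", "z2", "z1", "z2")
)

zone_share <- events %>%
  group_by(team.name) %>%
  mutate(total_DA = n()) %>%                  # team total, repeated on every row
  group_by(team.name, zone) %>%
  summarise(total_DA = max(total_DA),
            bin_DA = n(),                     # events in this team/zone pair
            bin_pct = bin_DA / total_DA,
            .groups = "drop") %>%
  group_by(zone) %>%
  mutate(league_ave = mean(bin_pct)) %>%      # average share across teams
  ungroup() %>%
  mutate(diff_vs_ave = bin_pct - league_ave)
```

Team A makes 75% of its actions in z1 against a league average of 62.5%, so its diff_vs_ave there is +0.125. One thing to be aware of with this approach: a team with no events at all in a zone has no row for that zone, so the average for that zone is taken only over the teams that do.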
Now onto the plotting. For this, load in the package 'grid' (it ships with R, so there is nothing extra
to install). You could use a package like 'ggsoccer' or 'SBpitch' for drawing the pitch, but for these
purposes it's helpful to show you how to create your own pitch, should you want to:
library(grid)
shotmapxgcolors <- c("#192780", "#2a5d9f", "#40a7d0", "#87cdcf", "#e7f8e6", "#f4ef95", "#FDE960", "#FCDC5F",
"#F5B94D", "#F0983E", "#ED8A37", "#E66424", "#D54F1B", "#DC2608", "#BF0000", "#7F0000", "#5F0000") #2
#1: Simple filtering, leaving out penalties. Choose any player you like of course.
#2: Much like the defensive activity colours earlier, these will set the colours for our xG values.
Data Use Case 7: Shot Maps
Again bear in mind that this next set of ggplot code (on this slide and the next two) should be pasted in one block.
ggplot() +
  annotate("rect",xmin = 0, xmax = 120, ymin = 0, ymax = 80, fill = NA, colour = "black", size = 0.6) +
  annotate("rect",xmin = 0, xmax = 60, ymin = 0, ymax = 80, fill = NA, colour = "black", size = 0.6) +
  annotate("rect",xmin = 18, xmax = 0, ymin = 18, ymax = 62, fill = NA, colour = "black", size = 0.6) +
  annotate("rect",xmin = 102, xmax = 120, ymin = 18, ymax = 62, fill = NA, colour = "black", size = 0.6) +
  annotate("rect",xmin = 0, xmax = 6, ymin = 30, ymax = 50, fill = NA, colour = "black", size = 0.6) +
  annotate("rect",xmin = 120, xmax = 114, ymin = 30, ymax = 50, fill = NA, colour = "black", size = 0.6) +
  annotate("rect",xmin = 120, xmax = 120.5, ymin =36, ymax = 44, fill = NA, colour = "black", size = 0.6) +
  annotate("rect",xmin = 0, xmax = -0.5, ymin =36, ymax = 44, fill = NA, colour = "black", size = 0.6) +
  annotate("segment", x = 60, xend = 60, y = -0.5, yend = 80.5, colour = "black", size = 0.6)+
  annotate("segment", x = 0, xend = 0, y = 0, yend = 80, colour = "black", size = 0.6)+
  annotate("segment", x = 120, xend = 120, y = 0, yend = 80, colour = "black", size = 0.6)+
  theme(rect = element_blank(),
        line = element_blank()) +
  # add penalty spot right
  annotate("point", x = 108 , y = 40, colour = "black", size = 1.05) +
  annotate("path", colour = "black", size = 0.6,
           x=60+10*cos(seq(0,2*pi,length.out=2000)),
           y=40+10*sin(seq(0,2*pi,length.out=2000)))+
  # add centre spot
  annotate("point", x = 60 , y = 40, colour = "black", size = 1.05) +
  annotate("path", x=12+10*cos(seq(-0.3*pi,0.3*pi,length.out=30)), size = 0.6,
           y=40+10*sin(seq(-0.3*pi,0.3*pi,length.out=30)), col="black") +
  annotate("path", x=107.84-10*cos(seq(-0.3*pi,0.3*pi,length.out=30)), size = 0.6,
           y=40-10*sin(seq(-0.3*pi,0.3*pi,length.out=30)), col="black") +
  geom_point(data = shots, aes(x = location.x, y = location.y, fill = shot.statsbomb_xg, shape = shot.body_part.name),
             size = 6, alpha = 0.8) + #3
#3: Here's where the actual plotting of shots comes in, via geom_point. We're using the xG values as the fill and the body part for the shape of the points. This could reasonably be anything though. You could even add in colour parameters which would change the colour of the outline of the shape.
  theme(axis.text.x=element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.caption=element_text(size=13,family="Source Sans Pro", hjust=0.5, vjust=0.5),
        plot.subtitle = element_text(size = 18, family="Source Sans Pro", hjust = 0.5),
        axis.text.y=element_blank(),
        legend.position = "top",
        legend.title=element_text(size=22,family="Source Sans Pro"),
        legend.text=element_text(size=20,family="Source Sans Pro"),
        legend.margin = margin(c(20, 10, -85, 50)),
        legend.key.size = unit(1.5, "cm"),
        plot.title = element_text(margin = margin(r = 10, b = 10), face="bold",size = 32.5,
                                  family="Source Sans Pro", colour = "black", hjust = 0.5),
        legend.direction = "horizontal",
        axis.ticks=element_blank(),
        aspect.ratio = c(65/100),
        plot.background = element_rect(fill = "white"),
        strip.text.x = element_text(size=13,family="Source Sans Pro")) +
  labs(title = "Sam Kerr, Shot Map", subtitle = "FA Women's Super League, 2020/21") + #4
  scale_fill_gradientn(colours = shotmapxgcolors, limits = c(0,0.8), oob=scales::squish,
                       name = "Expected Goals Value") + #5
#4: Again titling. This can be done dynamically so that it changes according to the player/season etc but we will leave that for now. Feel free to explore for yourself though.
#5: Same as last time but worth pointing out that 'name' allows you to change the title of a legend from within the gradient setting.
scale_shape_manual(values = c("Head" = 21, "Right Foot" = 23, "Left Foot" = 24), name ="") + #6
guides(fill = guide_colourbar(title.position = "top"),
shape = guide_legend(override.aes = list(size = 7, fill = "black"))) + #7
coord_flip(xlim = c(85, 125)) #8
#6: Setting the shapes for each body part name. The shape numbers correspond to ggplot's
pre-set shapes, which you can find here. The shapes numbered 21 and up are the ones
which have inner colouring (controlled by fill) and outline colouring (controlled by colour), so
that's why those have been chosen here. Back in the gradient setting (#5), oob=scales::squish
takes any values that are outside the bounds of our limits and squishes them to within them.
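To make the squishing concrete, here is a small sketch that calls squish() from the scales package directly on a few made-up xG values, using the same 0 to 0.8 limits as the gradient setting.

```r
library(scales)

# Made-up xG values, one of which exceeds the upper limit of 0.8.
xg <- c(0.05, 0.50, 0.95)

# squish() pulls out-of-range values to the nearest limit instead of dropping them.
xg_squished <- squish(xg, range = c(0, 0.8))
```

The 0.95 shot is treated as 0.8 for colouring purposes, so it takes the darkest colour in the gradient. Under the scale's default handling, out-of-range values would instead be treated as missing.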
#7: guides() allows you to alter the legends for shape, fill and so on. Here we are changing
the title position for the fill so that it is positioned above the legend, as well as changing
the size and colour of the shape symbols on that legend.
#8: coord_flip() does what it says on the tin - switches the x and y axes. xlim allows us to set
boundaries for the x axis so that we can show only a certain part of the pitch, giving us:
We Hope You Enjoy the Data!
If you have any questions, please email Euan Dewar, Senior Analyst:
euan.dewar@statsbomb.com