Soccer Analytics Tutorial: Scraping EPL Data using R (2022 Update)

Reading Time: 10 minutes

It’s been a year since my last web scraping tutorial, and a few things have changed. Thanks to feedback on the accompanying YouTube video, an issue with getting the data was found, and I have added an update with the fix.

I will be focusing on English Premier League data from fbref.com in this tutorial. The start of the season is just a few days away and I’m super excited!

I hope you enjoy the tutorial. I recommend you check out the video below for a better explanation.

If you’re interested in online courses to enhance your skillset, check out the Recommended e-Learning Courses.

Load libraries

library(rvest)
library(stringr)
library(dplyr)
library(ggplot2)
library(plotly)
library(readr)

Scrape the main page to get the links

In the first tutorial I made, we copied and pasted the URLs manually. For a full Premier League season, with 380 games in total, that quickly becomes impractical.

The method I use here is to first read the HTML of the Premier League fbref.com page and identify the links I want to use.

page <- read_html("https://fbref.com/en/comps/9/schedule/Premier-League-Scores-and-Fixtures")
links_1 <- unlist(page %>% html_nodes("a") %>% html_attr('href'))
links_2 <- strsplit(links_1, '"')
links_2[100:120]

I can see that the game-related links contain the word “matches” and also “Premier-League”.

What we do is filter using the grepl function and come up with the 380 links – one for each game.


links_3 <- links_2[grepl("matches", links_2)]
links_4 <- links_3[grepl("Premier", links_3)]
all_urls <- unique(links_4)
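As a quick optional sanity check, the number of unique links should come out to 380 for a full Premier League season:

length(all_urls)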

This analysis is about the Premier League champions, Manchester City, and this is how I filter for their games.

team_urls <- all_urls[grepl("Manchester-City", all_urls)]

Scraping the data

Scrape data for one game

Create the actual links by pasting the domain and the rest of the URL, picked up from our previous scraping effort.

selected_urls <- paste("https://fbref.com", team_urls, sep="")
g=1
selected_urls[g]

The URL contains the basic game data, i.e. the names of the teams and the date. To extract it, I use the substr function. The two nchar calls below tell us to skip the first 38 characters (the domain, the path, and the match id) and the last 15 characters ("-Premier-League").

nchar("https://fbref.com/en/matches/ff51efc7/")
nchar("-Premier-League")
game_data <- substr(selected_urls[g], 39, nchar(selected_urls[g])-15)
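If you prefer pattern matching over counting characters, the same result can be obtained with two sub() calls (just an alternative sketch, assuming every match URL ends in "-Premier-League"):

game_data <- sub("^https://fbref\\.com/en/matches/[^/]+/", "", selected_urls[g])
game_data <- sub("-Premier-League$", "", game_data)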

Abbreviate the month names to three characters so that the substr calls work for all months (May is already three characters long).

game_data <- str_replace(game_data, "January", "Jan")
game_data <- str_replace(game_data, "February", "Feb")
game_data <- str_replace(game_data, "March", "Mar")
game_data <- str_replace(game_data, "April", "Apr")
game_data <- str_replace(game_data, "June", "Jun")
game_data <- str_replace(game_data, "July", "Jul")
game_data <- str_replace(game_data, "August", "Aug")
game_data <- str_replace(game_data, "September", "Sep")
game_data <- str_replace(game_data, "October", "Oct")
game_data <- str_replace(game_data, "November", "Nov")
game_data <- str_replace(game_data, "December", "Dec")

Remove the match type (the derby name) from the string first, since it gets in the way of the substr calls below.

game_data <- str_replace(game_data, "North-West-London-Derby-", "")
game_data <- str_replace(game_data, "Merseyside-Derby-", "")
game_data <- str_replace(game_data, "North-London-Derby-", "")
game_data <- str_replace(game_data, "Manchester-Derby-", "")
game_data <- str_replace(game_data, "North-West-Derby-", "")

Then, let’s get the date and the team names. The nchar call below shows that the date plus its leading dash takes up the last 12 characters.

date <- substr(game_data, nchar(game_data)-10, nchar(game_data))
nchar("-Aug-15-2021")
teams <- substr(game_data, 1, nchar(game_data)-12)

Replace the dash inside multi-word team names. When we split into teamA and teamB below, the split happens on a dash, so we want exactly one dash left in the string: the one between the two teams.

teams <- str_replace(teams, "Manchester-United", "Manchester Utd")
teams <- str_replace(teams, "Manchester-City", "Manchester City")
teams <- str_replace(teams, "Leeds-United", "Leeds United")
teams <- str_replace(teams, "Crystal-Palace", "Crystal Palace")
teams <- str_replace(teams, "Leicester-City", "Leicester City")
teams <- str_replace(teams, "Aston-Villa", "Aston Villa")
teams <- str_replace(teams, "Norwich-City", "Norwich City")
teams <- str_replace(teams, "Newcastle-United", "Newcastle Utd")
teams <- str_replace(teams, "Wolverhampton-Wanderers", "Wolves")
teams <- str_replace(teams, "West-Ham-United", "West Ham")
teams <- str_replace(teams, "Brighton-and-Hove-Albion", "Brighton")
teams <- str_replace(teams, "Tottenham-Hotspur", "Tottenham")

Now we can identify which team is team A and which is team B by removing (with sub) everything after or before the dash, respectively, from the teams string.

teamA <- sub("-.*", "", teams)
teamB <- sub(".*-", "", teams)

Read the first pair of tables

First we need the URL.

url <- selected_urls[g]

Let’s get the HTML data and assign the 4th table to the variable statA, which holds team A’s stats.

statA <- curl::curl(url) %>% 
  xml2::read_html() %>%
  rvest::html_nodes('table') %>%
  rvest::html_table() %>%
  .[[4]]

We see that the column names are messed up because of the way the stats table is set up: the header row and the first data row both contain header information. Let’s build new column names from both rows (for example, the group header "Performance" over the sub-header "Gls" becomes "Performance >> Gls"), take the first six names (Player through Min) straight from the first row, and then delete that first row.

colnames(statA) <- paste0(colnames(statA), " >> ", statA[1, ])
names(statA)[1:6] <- paste0(statA[1,1:6])
statA <- statA[-c(1),]

Add the date and team names to the stats.

statA <- cbind(date, Team=teamA, Opponent=teamB, statA)

Read the html and get the 11th table, which is the same type of stats for the opposing team.

statB <- curl::curl(url)  %>% 
  xml2::read_html() %>%
  rvest::html_nodes('table') %>%
  rvest::html_table() %>%
  .[[11]]
colnames(statB) <- paste0(colnames(statB), " >> ", statB[1, ])
names(statB)[1:6] <- paste0(statB[1,1:6])
statB <- statB[-c(1),]
statB <- cbind(date, Team=teamB, Opponent=teamA, statB)
stat_both <- rbind(statA, statB)

Loop to read all data tables

We got the data for one set of stats. Now let’s set up a loop to get the other six sets of stats.

In the loop below, i runs from 5 to 10 because those are the remaining team A tables in the scraped HTML, while i+7 points to the corresponding table for team B. Before the loop starts, the first pair of tables becomes the game's starting data frame (all_stat <- stat_both), and each pass merges the new pair into it.

#use the first pair of tables as the game's starting data frame
all_stat <- stat_both

#loop for all tables related to the game
  for(i in 5:10){
    statA <- curl::curl(url) %>% 
      xml2::read_html() %>%
      rvest::html_nodes('table') %>%
      rvest::html_table() %>%
      .[[i]]
    colnames(statA) <- paste0(colnames(statA), " >> ", statA[1, ])
    names(statA)[1:6] <- paste0(statA[1,1:6])
    statA <- statA[-c(1),]
    statA <- cbind(date, Team=teamA, Opponent=teamB, statA)
    statB <- curl::curl(url)  %>% 
      xml2::read_html() %>%
      rvest::html_nodes('table') %>%
      rvest::html_table() %>%
      .[[i+7]]
    colnames(statB) <- paste0(colnames(statB), " >> ", statB[1, ])
    names(statB)[1:6] <- paste0(statB[1,1:6])
    statB <- statB[-c(1),]
    statB <- cbind(date, Team=teamB, Opponent=teamA, statB)
    stat_both <- rbind(statA, statB)
    if (i==10) {
      all_stat <- merge(all_stat, stat_both, by="Player", all=T)
    } else {
      all_stat <- merge(all_stat, stat_both, all=T)
    }
    
    #remove any duplicates
    all_stat <- unique(all_stat)
    
    #remove any leading or trailing whitespaces
    all_stat$Player <- str_trim(all_stat$Player, side = "both")
    
    #convert all stats into numeric variables
    if(colnames(all_stat[7])=="Pos") all_stat <- cbind(all_stat[,1:8], mutate_all(all_stat[,9:ncol(all_stat)], function(x) as.numeric(as.character(x))))
    
    write.csv(all_stat,paste0("premier_league_2021-22_",game_data,".csv"))
    
    Sys.sleep(15)
  }
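Note that the loop above downloads the same page twice per iteration (on top of the two reads before the loop). A small refactor, sketched below, reads the list of tables once and indexes into it; the output is the same and it is much kinder to fbref.com:

#read the page once and keep the full list of tables
all_tables <- curl::curl(url) %>%
  xml2::read_html() %>%
  rvest::html_nodes('table') %>%
  rvest::html_table()

#then, inside the loop, the two downloads become:
#statA <- all_tables[[i]]
#statB <- all_tables[[i+7]]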

Download game stats from a URL

Create a directory in which to store the downloaded file.

dir.create("PL21-22_ManCity")

Download the file.

download.file("https://sweepsportsanalytics.com/wp-content/uploads/2022/07/PL21-22_ManCity.zip","PL21-22_ManCity/PL21-22_ManCity.zip")

Unzip the file.

unzip('PL21-22_ManCity/PL21-22_ManCity.zip')

Create a list with all the files that contain the words Manchester-City in their name.

file_names <- list.files(pattern = "Manchester-City", full.names = TRUE)

Read the first file.

all_stat <- read_csv(file_names[1], show_col_types = FALSE)

Read all the remaining files and add them to the first table.

for (f in file_names[-1]) all_stat <- rbind(all_stat, read_csv(f, show_col_types = FALSE))
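Equivalently, dplyr's bind_rows() can combine a list of data frames in a single call, as an alternative to the rbind loop:

all_stat <- bind_rows(lapply(file_names, read_csv, show_col_types = FALSE))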

Check how many unique dates we have to confirm that all data is there.

length(unique(all_stat$date))

Remove the first column (the row index that write.csv added).

all_stat[,1] <- NULL

Fix some column names.

all_stat <- all_stat %>% rename_at(vars(starts_with(">> ")), ~ str_replace(., ">> ", ""))
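If you are on dplyr 1.0 or later, the superseded rename_at() can also be swapped for rename_with(); a minimal equivalent sketch:

all_stat <- all_stat %>% rename_with(~ str_replace(.x, ">> ", ""), starts_with(">> "))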

Create plots using ggplot and plotly

I first want to create a graph to see Manchester City’s opponents’ performance. Specifically, I want to see how many expected goals and goals scored they had.

To do this, I take the all_stat dataset and filter for rows with more than 500 minutes (any cutoff above what a single player can play and below 990 works, because the team-totals row always sums to at least 990 minutes) and where the opponent is Manchester City.

Then, I map goals to x and expected goals (xG) to y, adding geom_point() so that the data points are shown as dots. I use the geom_text_repel function to make sure the labels do not overlap; for the labels, I paste together (paste0) the team name and the date. Last, I add a title, subtitle, and caption.

all_stat %>% filter(Min>500 & Opponent=="Manchester City") %>% 
      ggplot(aes(x = `Performance >> Gls`, y = `Expected >> xG`)) +
      geom_point() +
      ggrepel::geom_text_repel(aes(label = paste0(Team,"\n",date)), color = "black", size = 2.5, segment.color = "grey") +
      labs(title = "Expected Goals and Goals Conceded\n by Manchester City",
         subtitle = "2021-22 Premier League",
         caption = "Data source: fbref.com")

Next, I want to see how Manchester City players performed. I change the initial filter to keep only rows where the minutes played are at most 120 and the player's team is Manchester City. I also switch the theme to theme_light().

In the geom_text_repel arguments, I use a filter so that only players with more than 1 goal scored or more than 0.9 expected goals get a label. I add the titles and a sloped dashed line (geom_abline) where goals equal expected goals.

all_stat %>% filter(Min<=120 & Team=="Manchester City") %>%
  ggplot(aes(x = `Performance >> Gls`, y = `Expected >> xG`)) +
  geom_point() +
  theme_light() +
  ggrepel::geom_text_repel(aes(label = ifelse(`Performance >> Gls`>1 | `Expected >> xG`>0.9,paste0(Player,"\n",Opponent,"-",date),"")), color = "black", size = 2.5, segment.color = "grey") +
  labs(title = "Expected Goals and Goals Scored\n by Manchester City Players",
       subtitle = "2021-22 Premier League",
       caption = "Data source: fbref.com")+
  geom_abline(intercept = 0, slope = 1,
              linetype="dashed", size=0.5)

Last, I create the same graph as above using plotly, which is a cool plotting library that makes the graphs interactive.

p <- all_stat %>% filter(Min<100 & Team=="Manchester City") %>%
  ggplot(aes(x = `Performance >> Gls`, y = `Expected >> xG`, text=paste0(Player,"\n",Opponent,"-",date))) +
  geom_point() +
  theme_light() +
  labs(title = "Expected Goals and Goals Scored by Manchester City Players in 2021-22 Premier League (Data source: fbref.com)")+
  geom_abline(intercept = 0, slope = 1,
              linetype="dashed", size=0.5)

ggplotly(p)
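If you only want the custom text in the hover box (rather than the x and y values as well), ggplotly() accepts a tooltip argument:

ggplotly(p, tooltip = "text")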

That’s it for now! Feel free to reach out for any questions and to let me know what you think.

I would really appreciate you following us on our social media accounts below!

Link to download the dataset

R Markdown Code

---
title: "Manchester City Analysis"
author: "Sweep Sports Analytics"
date: '2022-07-31'
output: html_document
---
# Load libraries
```{r message=FALSE, warning=FALSE}
library(rvest)
library(stringr)
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
```
# Download and read all game data 
```{r message=FALSE, warning=FALSE}
# create a directory to which we will store the downloaded file
dir.create("PL21-22_ManCity")
# download the file
download.file("https://sweepsportsanalytics.com/wp-content/uploads/2022/07/PL21-22_ManCity.zip","PL21-22_ManCity/PL21-22_ManCity.zip")
# unzip the file
unzip('PL21-22_ManCity/PL21-22_ManCity.zip')
# create a list with all the files that contain the words Manchester-City in their name
file_names <- list.files(pattern = "Manchester-City", full.names = TRUE)
# read the first file
all_stat <-  read_csv(file_names[1], show_col_types = FALSE)
# read all the remaining files and add them to the first table
for (f in file_names[-1]) all_stat <- rbind(all_stat, read_csv(f, show_col_types = FALSE))
# check how many unique dates we have to confirm that all data is there
length(unique(all_stat$date))
# remove the first column
all_stat[,1] <- NULL
# fix some column names
all_stat <- all_stat %>% rename_at(vars(starts_with(">> ")), ~ str_replace(., ">> ", ""))
```
# Create graphs
```{r}
all_stat %>% filter(is.na(Pos) & Opponent=="Manchester City") %>% 
      ggplot(aes(x = `Performance >> Gls`, y = `Expected >> xG`)) +
      geom_point() +
      ggrepel::geom_text_repel(aes(label = paste0(Team,"\n",date)), color = "black", size = 2.5, segment.color = "grey") +
      labs(title = "Goals and Expected Goals Conceded\n by Manchester City",
         subtitle = "2021-22 Premier League",
         caption = "Data source: fbref.com")
```

```{r}
all_stat %>% filter(Min<100 & Team=="Manchester City") %>%
  ggplot(aes(x = `Performance >> Gls`, y = `Expected >> xG`)) +
  geom_point() +
  theme_light() +
  ggrepel::geom_text_repel(aes(label = ifelse(`Performance >> Gls`>1 | `Expected >> xG`>0.9,paste0(Player,"\n",Opponent,"-",date),"")), color = "black", size = 2.5, segment.color = "grey") +
  labs(title = "Expected Goals and Goals Scored\n by Manchester City Players",
       subtitle = "2021-22 Premier League",
       caption = "Data source: fbref.com")+
  geom_abline(intercept = 0, slope = 1,
              linetype="dashed", size=0.5)
```

```{r}
p <- all_stat %>% filter(Min<100 & Team=="Manchester City") %>%
  ggplot(aes(x = `Performance >> Gls`, y = `Expected >> xG`, text=paste0(Player,"\n",Opponent,"-",date))) +
  geom_point() +
  theme_light() +
  labs(title = "Expected Goals and Goals Scored by Manchester City Players\n in 2021-22 Premier League (Data source: fbref.com)")+
  geom_abline(intercept = 0, slope = 1,
              linetype="dashed", size=0.5)

ggplotly(p)
```

Full R Code

library(rvest)
library(stringr)
library(dplyr)
library(ggplot2)
library(readr)

page <- read_html("https://fbref.com/en/comps/9/schedule/Premier-League-Scores-and-Fixtures")
links_1 <- unlist(page %>% html_nodes("a") %>% html_attr('href'))
head(links_1)

links_2 <- strsplit(links_1, '"')
#links_2[100:120]
links_3 <- links_2[grepl("matches", links_2)]
links_4 <- links_3[grepl("Premier", links_3)]
all_urls <- unique(links_4)

team_urls <- all_urls[grepl("Manchester-City", all_urls)]

# Create the actual links by pasting the domain and the rest of the URL, picked up from our previous scraping effort
selected_urls <- paste("https://fbref.com", team_urls, sep="")

# Pad single-digit days (e.g. "-5-" becomes "-05-") so the date in every URL has the same length
selected_urls <- str_replace(selected_urls, "-1-", "-01-")
selected_urls <- str_replace(selected_urls, "-2-", "-02-")
selected_urls <- str_replace(selected_urls, "-3-", "-03-")
selected_urls <- str_replace(selected_urls, "-4-", "-04-")
selected_urls <- str_replace(selected_urls, "-5-", "-05-")
selected_urls <- str_replace(selected_urls, "-6-", "-06-")
selected_urls <- str_replace(selected_urls, "-7-", "-07-")
selected_urls <- str_replace(selected_urls, "-8-", "-08-")
selected_urls <- str_replace(selected_urls, "-9-", "-09-")

#initialize tables
all_stat <- NULL
full_stat <- NULL

for (g in 1:length(selected_urls)){
  # Get the game info from the URL
  game_data <- substr(selected_urls[g], 39, nchar(selected_urls[g])-15)
  game_data <- str_replace(game_data, "January", "Jan")
  game_data <- str_replace(game_data, "February", "Feb")
  game_data <- str_replace(game_data, "March", "Mar")
  game_data <- str_replace(game_data, "April", "Apr")
  game_data <- str_replace(game_data, "June", "Jun")
  game_data <- str_replace(game_data, "July", "Jul")
  game_data <- str_replace(game_data, "August", "Aug")
  game_data <- str_replace(game_data, "September", "Sep")
  game_data <- str_replace(game_data, "October", "Oct")
  game_data <- str_replace(game_data, "November", "Nov")
  game_data <- str_replace(game_data, "December", "Dec")
  
  game_data <- str_replace(game_data, "North-West-London-Derby-", "")
  game_data <- str_replace(game_data, "Merseyside-Derby-", "")
  game_data <- str_replace(game_data, "North-London-Derby-", "")
  game_data <- str_replace(game_data, "Manchester-Derby-", "")
  game_data <- str_replace(game_data, "North-London-Derby-", "")
  
  date <- substr(game_data, nchar(game_data)-10, nchar(game_data))
  teams <- substr(game_data, 1, nchar(game_data)-12)
  teams <- str_replace(teams, "Manchester-United", "Manchester Utd")
  teams <- str_replace(teams, "Manchester-City", "Manchester City")
  teams <- str_replace(teams, "Leeds-United", "Leeds United")
  teams <- str_replace(teams, "Crystal-Palace", "Crystal Palace")
  teams <- str_replace(teams, "Leicester-City", "Leicester City")
  teams <- str_replace(teams, "Aston-Villa", "Aston Villa")
  teams <- str_replace(teams, "Norwich-City", "Norwich City")
  teams <- str_replace(teams, "Newcastle-United", "Newcastle Utd")
  teams <- str_replace(teams, "Wolverhampton-Wanderers", "Wolves")
  teams <- str_replace(teams, "West-Ham-United", "West Ham")
  teams <- str_replace(teams, "Brighton-and-Hove-Albion", "Brighton")
  teams <- str_replace(teams, "Tottenham-Hotspur", "Tottenham")
  
  teamA <- sub("-.*", "", teams)
  teamB <- sub(".*-", "", teams)
  
  #read the first pair of tables
  url <- selected_urls[g]
  statA <- curl::curl(url) %>% 
    xml2::read_html() %>%
    rvest::html_nodes('table') %>%
    rvest::html_table() %>%
    .[[4]]
  colnames(statA) <- paste0(colnames(statA), " >> ", statA[1, ])
  names(statA)[1:6] <- paste0(statA[1,1:6])
  statA <- statA[-c(1),]
  statA <- cbind(date, Team=teamA, Opponent=teamB, statA)
  statB <- curl::curl(url)  %>% 
    xml2::read_html() %>%
    rvest::html_nodes('table') %>%
    rvest::html_table() %>%
    .[[11]]
  colnames(statB) <- paste0(colnames(statB), " >> ", statB[1, ])
  names(statB)[1:6] <- paste0(statB[1,1:6])
  statB <- statB[-c(1),]
  statB <- cbind(date, Team=teamB, Opponent=teamA, statB)
  stat_both <- rbind(statA, statB)
  #define the game's data frame
  all_stat <- stat_both
  Sys.sleep(15)
  
  #loop for all tables related to the game
  for(i in 5:10){
    statA <- curl::curl(url) %>% 
      xml2::read_html() %>%
      rvest::html_nodes('table') %>%
      rvest::html_table() %>%
      .[[i]]
    colnames(statA) <- paste0(colnames(statA), " >> ", statA[1, ])
    names(statA)[1:6] <- paste0(statA[1,1:6])
    statA <- statA[-c(1),]
    statA <- cbind(date, Team=teamA, Opponent=teamB, statA)
    statB <- curl::curl(url)  %>% 
      xml2::read_html() %>%
      rvest::html_nodes('table') %>%
      rvest::html_table() %>%
      .[[i+7]]
    colnames(statB) <- paste0(colnames(statB), " >> ", statB[1, ])
    names(statB)[1:6] <- paste0(statB[1,1:6])
    statB <- statB[-c(1),]
    statB <- cbind(date, Team=teamB, Opponent=teamA, statB)
    stat_both <- rbind(statA, statB)
    all_stat <- merge(all_stat, stat_both, all=T)
    
    #remove any duplicates
    all_stat <- unique(all_stat)
    
    #remove any leading or trailing whitespaces
    all_stat$Player <- str_trim(all_stat$Player, side = "both")
    
    #convert all stats into numeric variables
    all_stat <- cbind(all_stat[,1:8], mutate_all(all_stat[,9:ncol(all_stat)], function(x) as.numeric(as.character(x))))
    
    #rename columns such as " >> xG" to "xG"
    #if(i==10){all_stat <- all_stat %>% rename_at(vars(starts_with(" >> ")), funs(str_replace(., " >> ", "")))}
    
    write.csv(all_stat,paste0("premier_league_2021-22_",game_data,".csv"))
    
    Sys.sleep(15)
  }
  #add the game tables to the total data frame
  full_stat <- rbind(full_stat, all_stat)
}

#####
# Download the saved game data and read it into one data frame
#####
dir.create("PL21-22_ManCity")
download.file("https://sweepsportsanalytics.com/wp-content/uploads/2022/07/PL21-22_ManCity-1.zip","PL21-22_ManCity/PL21-22_ManCity.zip")
unzip('PL21-22_ManCity/PL21-22_ManCity.zip')
file_names <- list.files(pattern = "Manchester-City", full.names = TRUE)
all_stat <-  read_csv(file_names[1], show_col_types = FALSE)
for (f in file_names[-1]) all_stat <- rbind(all_stat, read_csv(f, show_col_types = FALSE))

all_stat[,1] <- NULL
all_stat <- all_stat %>% rename_at(vars(starts_with(">> ")), ~ str_replace(., ">> ", ""))

all_stat %>% filter(is.na(Pos) & Opponent=="Manchester City") %>% ggplot(aes(x = `Performance >> Gls`, y = `Expected >> xG`)) +
  ggrepel::geom_text_repel(aes(label = paste0(Team,"\n",date)), color = "black", size = 2.5, segment.color = "grey")+
  geom_point()

#####
# Players' goals vs expected goals
#####
all_stat %>% filter(Min<100) %>% ggplot(aes(x = `Performance >> Gls`, y = `Expected >> xG`)) +
  ggrepel::geom_text_repel(aes(label = paste0(Player,"\n",Opponent,"-",date)), color = "black", size = 2.5, segment.color = "grey")+
  geom_point()

#####
# Manchester City players' assists vs expected assists
#####
all_stat %>% filter(Team=="Manchester City" & Min<100) %>% ggplot(aes(x = `Performance >> Ast`, y = `Expected >> xA`)) +
  ggrepel::geom_text_repel(aes(label = paste0(Player,"\n",Opponent,"-",date)), color = "black", size = 2.5, segment.color = "grey")+
  geom_point()
