NBA Analytics Tutorial – Part 1: Using R to Analyze the Chicago Bulls’ Last Dance

It’s time for basketball analytics, folks, with a focus on the NBA! This tutorial is for beginners and intermediate sports analytics enthusiasts. I will show you how to extract and prepare NBA data, create basic plots, and run two clustering algorithms.

It’s been a while since my first tutorial. Raising a daughter has nothing to do with it! The EURO 2020, Copa América, and NBA playoffs drained a lot of my energy. I did manage to publish the article on getting started with sports analytics so that’s something, I guess!

We are two months away from the start of the 2021/22 NBA season. The NBA does have things going on, though. The draft class may turn out to include some great future All-Stars. So far there have been a bunch of interesting signings and trades. There have been some crazy summer pro-am performances, like Isaiah Thomas scoring 81 points and Payton Pritchard dropping 92. There’s a lot of speculation about where players such as Damian Lillard will continue their careers.

Besides following the news to get an idea of what next season’s storylines will be, there are still some ongoing classic discussions. The center of most is the G.O.A.T. Michael Jordan. This tutorial is focused on his last three years at the Chicago Bulls and the “Last Dance” season.

There will be a visual walkthrough soon, so make sure to subscribe to our YouTube channel for updates.

Let’s Sweep!

Disclosure: Some of the links below are affiliate links. This means that, at zero cost to you, we will earn an affiliate commission if you click through the link and finalize a purchase. This post is not sponsored in any way.

Step 1: Download R and RStudio

The debate about which programming language is best for data science has been going on for a while. R and Python are the main choices. Both are awesome and it’s rather a matter of preference, as well as what kind of projects you have in mind. For some additional info, check out Step 3 of the article on getting started with sports analytics.

That being said, having a statistical background, I have opted to use R. So, as a first step, if you have not done so already, download the latest version of R and RStudio from the links below.

https://cran.r-project.org/

https://www.rstudio.com/products/rstudio/download/

Step 2: Install packages

R has A LOT of packages you can use. Let’s start by installing the ones we use.

Open RStudio and run the commands below.

#####################
# Step 2: Install packages
#####################
install.packages("tidyverse")
install.packages("devtools")
devtools::install_github("abresler/nbastatR")
install.packages("BasketballAnalyzeR")
install.packages("jsonlite")
install.packages("janitor")
install.packages("extrafont")
install.packages("ggrepel")
install.packages("scales")
install.packages("teamcolors")
install.packages("zoo")
install.packages("future")
install.packages("lubridate")

You only need to install these packages once; in later sessions you can skip this step and go straight to loading them.
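
If you come back to this tutorial later, a quick check like the one below can tell you which packages are still missing. This is only a minimal sketch in base R: the helper name missing_packages is my own invention and not part of any package, and nbastatR is left out of the list because it is installed from GitHub rather than CRAN.

```r
# Hypothetical helper (not from any package): return the entries of `pkgs`
# that are not yet installed in the current R library paths.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

cran_pkgs <- c("tidyverse", "BasketballAnalyzeR", "jsonlite", "janitor",
               "extrafont", "ggrepel", "scales", "teamcolors", "zoo",
               "future", "lubridate")

to_install <- missing_packages(cran_pkgs)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is commented out on purpose, so running the snippet only reports what is missing.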

Step 3: Load libraries

Run the commands below to load the libraries we use. We also increase the vroom connection size to accommodate the large files we read.

#####################
# Step 3: Load libraries
#####################
library(tidyverse)
library(nbastatR)
library(BasketballAnalyzeR)
library(jsonlite)
library(janitor)
library(extrafont)
library(ggrepel)
library(scales)
library(teamcolors)
library(zoo)
library(future)
library(lubridate)

Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 2)

As you may know I’ve been doing a bunch of basketball analytics. I can’t stress how lucky I feel to have come across the great book Basketball Data Science with Applications in R. Anyone interested in basketball analytics should definitely get their hands on a copy.

The authors of the above book have released an awesome R package named BasketballAnalyzeR. The data preparation, graphs, and data science techniques here have the BasketballAnalyzeR package in mind. This means that the schema of the data matches what the package requires.

Step 4: Get game IDs and gamelog data

In this step, we use the nbastatR package to get the game IDs and gamelog data we need for the analysis. Game IDs are unique IDs for each NBA game and are common across almost all datasets available. Gamelog data refer to rows that contain player or team stats for each game of a season.

Run the below code and see for yourself!

#####################
## Get game IDs
#####################
# Select the seasons to analyze (any from 1949 onwards); here 1996 to 1998
selectedSeasons <- c(1996:1998)
# Get game IDs for Regular Season and Playoffs
gameIds_Reg <- suppressWarnings(seasons_schedule(seasons = selectedSeasons, season_types = "Regular Season") %>% select(idGame, slugMatchup))
gameIds_PO <- suppressWarnings(seasons_schedule(seasons = selectedSeasons, season_types = "Playoffs") %>% select(idGame, slugMatchup))
gameIds_all <- rbind(gameIds_Reg, gameIds_PO)

# Peek at the game IDs
head(gameIds_all)
tail(gameIds_all)

#####################
## Retrieve gamelog data for players and teams
#####################
# Get player gamelogs
P_gamelog_reg <- suppressWarnings(game_logs(seasons = selectedSeasons, league = "NBA", result_types = "player", season_types = "Regular Season"))
P_gamelog_po <- suppressWarnings(game_logs(seasons = selectedSeasons, league = "NBA", result_types = "player", season_types = "Playoffs"))
P_gamelog_all <- rbind(P_gamelog_reg, P_gamelog_po)
View(head(P_gamelog_all))

# Get team gamelogs
T_gamelog_reg <- suppressWarnings(game_logs(seasons = selectedSeasons, league = "NBA", result_types = "team", season_types = "Regular Season"))
T_gamelog_po <- suppressWarnings(game_logs(seasons = selectedSeasons, league = "NBA", result_types = "team", season_types = "Playoffs"))
T_gamelog_all <- rbind(T_gamelog_reg, T_gamelog_po)
View(head(T_gamelog_all))

Step 5: Create team and player boxscores

Using the gamelogs we retrieved in the previous step, let’s create boxscores for a whole season. Tbox contains statistics for each team in a set of games, e.g. over a season. Obox contains statistics of a team’s opponents in a set of games. Pbox contains the statistics of a player over a set of games.

For each of the three above boxscores, the process we follow is:

  • Select a table
  • Group the table by the variables we want each row to represent, e.g. “Season” and “Team”
  • Summarise the table, i.e. define the columns we want and the way to calculate them.
    • For example, “GP” indicates the games played and to get them we use n() which counts the number of rows we have for the grouping we defined above.
    • “W” indicates the Wins and is measured by summing the rows where the outcome of the game is equal to “W”.
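
To make the group-and-summarise idea concrete before running it on real data, here is a tiny sketch on a made-up game log. The column names mirror the ones used in this tutorial, but the six rows and their values are invented for illustration only.

```r
library(dplyr)

# Made-up game log: one row per team per game (values are illustrative only)
toy_gamelog <- data.frame(
  yearSeason  = c(1998, 1998, 1998, 1998, 1998, 1998),
  slugTeam    = c("CHI", "CHI", "CHI", "UTA", "UTA", "UTA"),
  outcomeGame = c("W", "W", "L", "W", "L", "L"),
  ptsTeam     = c(101, 95, 88, 99, 90, 85)
)

toy_tbox <- toy_gamelog %>%
  group_by(Season = yearSeason, Team = slugTeam) %>%  # one output row per season and team
  summarise(GP  = n(),                       # number of rows, i.e. games played
            W   = sum(outcomeGame == "W"),   # TRUE counts as 1, so this counts wins
            PTS = sum(ptsTeam),              # total points scored
            .groups = "drop") %>%
  as.data.frame()

print(toy_tbox)
```

Each team ends up with a single row per season, which is the shape the boxscore tables in this step follow.
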
########
### Create player and team boxscores
########
# Create Tbox (Team boxscore) per season
Tbox_all <- T_gamelog_all %>%
  group_by("Season"=yearSeason, "Team"=slugTeam) %>%
  dplyr::summarise(GP=n(), MIN=sum(round(minutesTeam/5)),
                   PTS=sum(ptsTeam),
                   W=sum(outcomeGame=="W"), L=sum(outcomeGame=="L"),
                   P2M=sum(fg2mTeam), P2A=sum(fg2aTeam), P2p=P2M/P2A,
                   P3M=sum(fg3mTeam), P3A=sum(fg3aTeam), P3p=P3M/P3A,
                   FTM=sum(ftmTeam), FTA=sum(ftaTeam), FTp=FTM/FTA,
                   OREB=sum(orebTeam), DREB=sum(drebTeam), AST=sum(astTeam),
                   TOV=sum(tovTeam), STL=sum(stlTeam), BLK=sum(blkTeam),
                   PF=sum(pfTeam), PM=sum(plusminusTeam)) %>%
  as.data.frame()

# Create Obox (Opponent Team boxscore) per season
Obox_all <- T_gamelog_all %>%
  group_by("Season"=yearSeason, "Team"=slugOpponent) %>%
  dplyr::summarise(GP=n(), MIN=sum(round(minutesTeam/5)),
                   PTS=sum(ptsTeam),
                   W=sum(outcomeGame=="L"), L=sum(outcomeGame=="W"),
                   P2M=sum(fg2mTeam), P2A=sum(fg2aTeam), P2p=P2M/P2A,
                   P3M=sum(fg3mTeam), P3A=sum(fg3aTeam), P3p=P3M/P3A,
                   FTM=sum(ftmTeam), FTA=sum(ftaTeam), FTp=FTM/FTA,
                   OREB=sum(orebTeam), DREB=sum(drebTeam), AST=sum(astTeam),
                   TOV=sum(tovTeam), STL=sum(stlTeam), BLK=sum(blkTeam),
                   PF=sum(pfTeam), PM=sum(plusminusTeam)) %>%
  as.data.frame()

# Create Pbox (Player boxscore) per season
Pbox_all <- P_gamelog_all %>%
  group_by("Season"=yearSeason, "Team"=slugTeam, "Player"=namePlayer) %>%
  dplyr::summarise(GP=n(), MIN=sum(minutes), PTS=sum(pts),
                   P2M=sum(fg2m), P2A=sum(fg2a), P2p=100*P2M/P2A,
                   P3M=sum(fg3m), P3A=sum(fg3a), P3p=100*P3M/P3A,
                   FTM=sum(ftm), FTA=sum(fta), FTp=100*FTM/FTA,
                   OREB=sum(oreb), DREB=sum(dreb), AST=sum(ast),
                   TOV=sum(tov), STL=sum(stl), BLK=sum(blk),
                   PF=sum(pf)) %>%
  as.data.frame()

Let’s have a look at the Tbox and Obox tables for the Chicago Bulls.

View(Tbox_all[Tbox_all$Team=="CHI",])
View(Obox_all[Obox_all$Team=="CHI",])

Let’s also look at the Pbox table for Michael Jordan.

View(Pbox_all[Pbox_all$Player=="Michael Jordan",])

Below I modify the code to use regular season data only, i.e. the “T_gamelog_reg” and “P_gamelog_reg” data frames.

#####################
## Use Regular Season data
#####################
# Create Tbox (Team boxscore) for each Regular Season
Tbox <- T_gamelog_reg %>%
  group_by("Season"=yearSeason, "Team"=slugTeam) %>%
  dplyr::summarise(GP=n(), MIN=sum(round(minutesTeam/5)),
                   PTS=sum(ptsTeam),
                   W=sum(outcomeGame=="W"), L=sum(outcomeGame=="L"),
                   P2M=sum(fg2mTeam), P2A=sum(fg2aTeam), P2p=P2M/P2A,
                   P3M=sum(fg3mTeam), P3A=sum(fg3aTeam), P3p=P3M/P3A,
                   FTM=sum(ftmTeam), FTA=sum(ftaTeam), FTp=FTM/FTA,
                   OREB=sum(orebTeam), DREB=sum(drebTeam), AST=sum(astTeam),
                   TOV=sum(tovTeam), STL=sum(stlTeam), BLK=sum(blkTeam),
                   PF=sum(pfTeam), PM=sum(plusminusTeam)) %>%
  as.data.frame()

# Create Obox (Opponent Team boxscore) for each Regular Season
Obox <- T_gamelog_reg %>%
  group_by("Season"=yearSeason, "Team"=slugOpponent) %>%
  dplyr::summarise(GP=n(), MIN=sum(round(minutesTeam/5)),
                   PTS=sum(ptsTeam),
                   W=sum(outcomeGame=="L"), L=sum(outcomeGame=="W"),
                   P2M=sum(fg2mTeam), P2A=sum(fg2aTeam), P2p=P2M/P2A,
                   P3M=sum(fg3mTeam), P3A=sum(fg3aTeam), P3p=P3M/P3A,
                   FTM=sum(ftmTeam), FTA=sum(ftaTeam), FTp=FTM/FTA,
                   OREB=sum(orebTeam), DREB=sum(drebTeam), AST=sum(astTeam),
                   TOV=sum(tovTeam), STL=sum(stlTeam), BLK=sum(blkTeam),
                   PF=sum(pfTeam), PM=sum(plusminusTeam)) %>%
  as.data.frame()

# Create Pbox (Player boxscore) for each Regular Season
Pbox <- P_gamelog_reg %>%
  group_by("Season"=yearSeason, "Team"=slugTeam, "Player"=namePlayer) %>%
  dplyr::summarise(GP=n(), MIN=sum(minutes), PTS=sum(pts),
                   P2M=sum(fg2m), P2A=sum(fg2a), P2p=100*P2M/P2A,
                   P3M=sum(fg3m), P3A=sum(fg3a), P3p=100*P3M/P3A,
                   FTM=sum(ftm), FTA=sum(fta), FTp=100*FTM/FTA,
                   OREB=sum(oreb), DREB=sum(dreb), AST=sum(ast),
                   TOV=sum(tov), STL=sum(stl), BLK=sum(blk),
                   PF=sum(pf)) %>%
  as.data.frame()

View(Pbox[Pbox$Player=="Michael Jordan",])

Step 6: Bar plots, scatter plots, and bubble plots

Barplots

Using the description from the R Graph Gallery,  “a barplot is used to display the relationship between a numeric and a categorical variable”. What we want to do is display the relationship between the points and types of shots (our numeric variables) and the players of the Chicago Bulls in the 1998 season (our categorical variables).

The code below does the following:

  • Use the teamSelected variable to store the team in scope, using the three-letter abbreviation.
  • Create a subset of player boxscores named Pbox.sel from the main Pbox data frame, filtering for players that are part of the teamSelected and have played at least 1,000 minutes in a season.
  • Use the seasonSelected variable to store the season in scope.
  • Create a barplot using the “barline” command.
    • The data selected is the Pbox.sel, filtered to contain the seasonSelected variable we defined above.
    • The id, i.e. the categorical variable, is the Player.
    • The bars, i.e. the numerical variables, are the types of shots made.
    • I also add an additional numerical variable, the points, as a line.
    • Order the players by total points scored.
    • Add labels to the bars.
    • Add a title.
#####################
# Bar plots
#####################
teamSelected <- "CHI"
Pbox.sel <- subset(Pbox, Team==teamSelected &
                    MIN>=1000)
seasonSelected <- 1998
barline(data=Pbox.sel[Pbox.sel$Season==seasonSelected,], id="Player",
        bars=c("P2M","P3M","FTM"), line="PTS",
        order.by="PTS", labels.bars=c("2PM","3PM","FTM"),
        title=teamSelected)

Scatterplots

Using the description from the R Graph Gallery, “A Scatterplot displays the relationship between 2 numeric variables. Each dot represents an observation. Their position on the X (horizontal) and Y (vertical) axis represents the values of the 2 variables.”

What we want to do is display the relationship between the assists and turnovers (our numeric variables). I have added an additional numeric variable, the points, depicted as a color. Each dot represents a player of the Chicago Bulls in the respective season (our categorical variable).

The code below does the following:

  • Use the teamSelected variable to store the team in scope, using the three-letter abbreviation.
  • Create a subset of player boxscores named Pbox.sel from the main Pbox data frame, filtering for players that are part of the teamSelected and have played at least 1,000 minutes in a season.
  • Use the “attach” function to attach the Pbox.sel data frame to the R search path. This means that the data frame is searched by R when evaluating a variable, so objects in the data frame can be accessed by simply giving their names.
  • Create a new data frame, with the numerical variables AST, TOV, and PTS per minute.
  • Create a color palette for the color coding of the PTS variable.
  • Create a scatter plot using the “scatterplot” function.
    • The data selected is the X data frame we used above. The AST variable is placed on the X axis, the TOV variable on the Y axis, and the PTS as the additional variable.
    • The label is defined as the player name and the respective season.
    • The color varies depending on the numeric values of the PTS.
  • Last, we zoom in to the area between 0.08 and 0.16 assists per minute and 0.05 and 0.10 turnovers per minute.
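
The per-minute step is easy to check on a toy table. The sketch below uses three made-up players and base R only. It also uses with() instead of attach() and detach(), which is my own suggestion rather than what the code in this tutorial does; it evaluates the expression inside the data frame without touching the search path.

```r
# Made-up season totals for three hypothetical players
toy_pbox <- data.frame(
  Player = c("Player A", "Player B", "Player C"),
  MIN    = c(3000, 2000, 1000),
  AST    = c(300, 100, 150),
  TOV    = c(150, 80, 50),
  PTS    = c(2400, 1000, 400)
)

# Dividing a data frame by a vector with one value per row
# divides every column by that player's minutes
per_min <- with(toy_pbox, data.frame(AST, TOV, PTS) / MIN)
rownames(per_min) <- toy_pbox$Player

print(per_min)
```

Working per minute keeps players with very different playing time comparable on the same axes.
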
#####################
# Scatter plots
#####################
teamSelected <- "CHI"
Pbox.sel <- subset(Pbox, Team==teamSelected & MIN>=1000)
attach(Pbox.sel)
X <- data.frame(AST, TOV, PTS)/MIN
detach(Pbox.sel)
mypal <- colorRampPalette(c("blue","yellow","red"))

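# paste0() avoids the extra spaces that paste() would add around the comma
scatterplot(X, data.var=c("AST","TOV"), z.var="PTS",
            labels=paste0(Pbox.sel$Player,", ",Pbox.sel$Season), palette=mypal)

# Same plot, zoomed in to a specific area of the x and y axes
scatterplot(X, data.var=c("AST","TOV"), z.var="PTS",
            labels=paste0(Pbox.sel$Player,", ",Pbox.sel$Season), palette=mypal,
            zoom=c(0.08,0.16,0.05,0.10))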

I find it quite interesting that in his 1998 season, Michael Jordan had a similar performance to Dennis Rodman’s 1997 season in terms of assists and turnovers. Also, it’s nice to see how efficient Steve Kerr was, with his low ratio of turnovers to assists.

Bubble plots

Using the description from the R Graph Gallery, “A bubble plot is a scatter plot with a third numeric variable mapped to circle size.”

Below I have two separate bubble plots.

The code below does the following:

  • Use the seasonSelected variable to store the season in scope.
  • For the first plot, create a subset of team boxscores named Tbox.sel from the Tbox_all data frame (regular season and playoffs), filtering for the seasonSelected.
  • Use the “attach” function to attach Tbox.sel to the R search path, build the X data frame, and detach Tbox.sel again. X contains the team abbreviation, the 2-point, 3-point, and free-throw percentages, and the total shots attempted (AS).
  • Create the first bubble plot using the “bubbleplot” function.
    • The P2p variable is placed on the X axis and the P3p variable on the Y axis.
    • The color varies depending on FTp and the bubble size depending on AS.
  • For the second plot, use the teamsSelected variable to store four teams and create Pbox.sel from the Pbox data frame, filtering for players of those teams who played at least 1,500 minutes in the seasonSelected.
  • Build a new X data frame with the defensive rebounds, steals, and blocks per minute, plus the total minutes played.
  • Create the second bubble plot.
    • The defensive rebounds per minute are placed on the X axis and the steals per minute on the Y axis.
    • The color varies depending on the blocks per minute and the bubble size depending on the total minutes played.
    • The player labels are colored by team.
#####################
# Bubble plots
#####################
teamSelected <- "CHI"
seasonSelected <- 1998
Tbox.sel <- subset(Tbox_all,Season==seasonSelected)

attach(Tbox.sel)
X <- data.frame(T=Team, P2p, P3p, FTp, AS=P2A+P3A+FTA)
detach(Tbox.sel)
labs <- c("2-point shots (% made)",
          "3-point shots (% made)",
          "free throws (% made)",
          "Total shots attempted")
bubbleplot(X, id="T", x="P2p", y="P3p", col="FTp",
           size="AS", labels=labs)

teamsSelected <- c("CHI", "UTA", "IND", "LAL")
seasonSelected <- 1998
Pbox.sel <- subset(Pbox, Team %in% teamsSelected & MIN>=1500 & Season==seasonSelected)
                   
attach(Pbox.sel)
X <- data.frame(ID=Player, Team, V1=DREB/MIN, V2=STL/MIN,
                V3=BLK/MIN, V4=MIN)
detach(Pbox.sel)
labs <- c("Defensive Rebounds","Steals","Blocks",
          "Total minutes played")
bubbleplot(X, id="ID", x="V1", y="V2", col="V3",
           size="V4", text.col="Team", labels=labs,
           title=paste0("NBA Players in ", seasonSelected),
           text.legend=TRUE, text.size=3.5, scale=FALSE)

The first bubble plot displays the relationship between the 2-point percentage and 3-point percentage (our numeric variables), with the attempted shots being the circle size. I have added an additional numeric variable, the free-throw percentage, depicted as a color. Each bubble represents an NBA team in the 1998 season (our categorical variable).

The Lakers were abysmal in free throw % (blue color) but had a high number of shots attempted (big bubble size) and over 50% of 2-point shots made. The Jazz were 2nd in 2-point % but also had a good free throw percentage. From a scoring perspective, the Bulls were average in 2-point % and below average from beyond the arc, with a fairly good FT%.

The second bubble plot displays the relationship between defensive rebounds and steals per minute (our numeric variables), with the total minutes played being the circle size. I have added an additional numeric variable, the blocks per minute, depicted as a color. Each bubble represents a player from the four teams that made it to the Eastern and Western Conference Finals in the 1998 season (our categorical variable).

Look at how dominant Dennis Rodman and Shaquille O’Neal were on the glass! Shaq also has a red bubble, indicating he was very good at blocking shots. Even more impressive though was Karl Malone: 3rd in defensive rebounds but also quite good at steals.

Step 7: K-means and hierarchical clustering of teams and players

With all this data we have now in our R environment, let’s try to make sense of it by finding groups. We do this with Cluster Analysis.

Cluster Analysis is a classification technique aiming at dividing individual cases into groups (clusters) such that the cases in a cluster are very similar (according to a given criterion) to one another and very different from the cases in other clusters.
As mentioned before, Cluster Analysis is unsupervised, and it should not be confused with supervised classification methods, such as discriminant analysis, where the groups are known a priori and the aim of the analysis is to create rules for classifying new observation units into one or an other of the known groups. On the contrary, Cluster Analysis is an exploratory method that aims to recognize the natural groups that appear in the data structure. In some sense, Cluster Analysis is a dimensionality reduction technique, because the high (sometimes huge) number of units observed at the beginning is reduced to a smaller number of groups that are homogeneous, allowing a parsimonious description and an easy interpretation of the data. Results can then be used for several aims, for example to identify outliers or find out some hidden relationships.

Paola Zuccolotto and Marica Manisera, Basketball Data Science (2020)

K-Means Clustering of 1998 NBA Teams Using Four Factors

K-means clustering is a machine learning algorithm used for partitioning a given data set into groups, i.e. clusters. K represents the number of groups and is decided by the analyst before the algorithm is run. The result of the algorithm is clusters of similar objects. Each cluster is represented by its centroid, which is the mean of the data points assigned to that cluster.
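
Here is a small sketch of the centroid idea on made-up data, using the kmeans function from base R (the stats package). I am not claiming this is what BasketballAnalyzeR does internally; it only illustrates that each centroid is the mean of the points assigned to its cluster.

```r
set.seed(1)

# Two made-up groups of 20 points each in two dimensions
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)
colnames(toy) <- c("x", "y")

# K is chosen up front; nstart repeats the random initialisation 10 times
km <- kmeans(toy, centers = 2, nstart = 10)

# Recompute each centroid by hand as the column-wise mean of its members
manual_centers <- rbind(
  colMeans(toy[km$cluster == 1, , drop = FALSE]),
  colMeans(toy[km$cluster == 2, , drop = FALSE])
)

print(km$centers)
print(manual_centers)
```

The two printed tables should agree, which is the property described above.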

For more info, have a look at the article below:

https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/

The code below runs the k-means clustering algorithm on the 1998 NBA teams. The variables used for the grouping are built mainly from the Four Factors, introduced by Dean Oliver in his 2004 book Basketball on Paper, alongside the 3-pointers made and a steals ratio.

They answer the question “what are the main strategies related to success?”. The Four Factors can be simply described as Score, Protect, Crash, and Attack:

  • Effective Field Goal Percentage
  • Turnover Ratio
  • Rebound Percentage
  • Free Throw Rate
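
For reference, here is a rough sketch of how the Four Factors can be computed from season totals. The formulas are the commonly quoted definitions, including the usual 0.44 free-throw weight in the possession estimate; I am assuming them here, and the fourfactors function in BasketballAnalyzeR may differ in its details. The helper name four_factors and all the numbers are made up for illustration.

```r
# Made-up season totals for one team and its opponents (illustrative only)
team <- list(P2M = 2400, P2A = 5000, P3M = 400, P3A = 1100,
             FTM = 1500, FTA = 2000, OREB = 1000, TOV = 1100)
opp  <- list(DREB = 2300)

# Hypothetical helper, not the package function: offensive Four Factors
four_factors <- function(team, opp) {
  fga  <- team$P2A + team$P3A
  # Assumed possession estimate: shots + 0.44 * free throws - off. rebounds + turnovers
  poss <- fga + 0.44 * team$FTA - team$OREB + team$TOV
  c(eFG = (team$P2M + 1.5 * team$P3M) / fga,       # Score: effective field goal %
    TOV = team$TOV / poss,                         # Protect: turnovers per possession
    REB = team$OREB / (team$OREB + opp$DREB),      # Crash: offensive rebound %
    FT  = team$FTM / fga)                          # Attack: free throw rate
}

ff <- four_factors(team, opp)
print(round(ff, 3))
```

The defensive version of each factor would be the same calculation with the roles of the team and its opponents swapped.
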
####################
# K-means clustering of NBA teams
#####################
seasonSelected <- 1998
Tbox.sel <- subset(Tbox_all,Season==seasonSelected)
Obox.sel <- subset(Obox_all,Season==seasonSelected)

FF <- fourfactors(Tbox.sel,Obox.sel)
OD.Rtg <- FF$ORtg/FF$DRtg
F1.r <- FF$F1.Off/FF$F1.Def
F2.r <- FF$F2.Def/FF$F2.Off
F3.Off <- FF$F3.Off
F3.Def <- FF$F3.Def
P3M.ff <- Tbox.sel$P3M
STL.r <- Tbox.sel$STL/Obox.sel$STL
data <- data.frame(OD.Rtg, F1.r, F2.r, F3.Off, F3.Def, P3M.ff, STL.r)

RNGkind(sample.kind="Rounding")
set.seed(29)
kclu1 <- kclustering(data)
plot(kclu1)

set.seed(29)
kclu2 <- kclustering(data, labels=Tbox.sel$Team, k=7)
plot(kclu2)

kclu2.W <- tapply(Tbox.sel$W, kclu2$Subjects$Cluster, mean)

cluster <- as.factor(kclu2$Subjects$Cluster)
Xbubble <- data.frame(Team=Tbox.sel$Team, PTS=Tbox.sel$PTS,
                      PTS.Opp=Obox.sel$PTS, cluster,
                      W=Tbox.sel$W)
labs <- c("PTS", "PTS.Opp", "cluster", "Wins")
bubbleplot(Xbubble, id="Team", x="PTS", y="PTS.Opp",
           col="cluster", size="W", labels=labs,
           title=paste0("NBA Team Clusters - ",seasonSelected))

The graph below on the left indicates that 7 clusters would be the best choice. We want to minimize the number of clusters while at the same time achieving the most consistency and information.
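
If you want to see the logic behind choosing the number of clusters without the package, the sketch below computes the share of variance explained for a range of k values on made-up data. It uses kmeans from base R, not the kclustering function from this tutorial, so treat it as an illustration of the idea rather than a reproduction of the package's plot.

```r
set.seed(42)

# Three made-up, well-separated groups of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2),
  cbind(rnorm(30, mean = 0), rnorm(30, mean = 6))
)

# Share of the total variance explained by the clustering, for k = 1 to 8
explained <- sapply(1:8, function(k) {
  km <- kmeans(toy, centers = k, nstart = 20)
  km$betweenss / km$totss
})

print(round(explained, 2))

# Look for the point where adding clusters stops paying off
plot(1:8, explained, type = "b",
     xlab = "Number of clusters", ylab = "Explained variance share")
```

On data like this the curve climbs steeply up to the true number of groups and then flattens, which is the pattern to look for in the real plot as well.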

The radial plots of the average cluster profiles are shown on the bottom right graph. They give us an idea of what the clusters represent. For example, Cluster 4 has a Cluster Heterogeneity Index (CHI) of 0.21 and contains the Celtics and the Nets, with good steals ratio, turnover ratio, and offensive rebounding performance. Cluster 2 contains my beloved Knicks; it has a CHI of 0.33 and contains teams that average high defensive rebound ratios, good offensive/defensive rating ratios, and good effective field goal ratios.

The bubble plot above depicts the 1998 NBA teams, with the x-axis depicting the points scored and the y-axis the points against. The colors indicate the cluster in which a team is placed and the size of the bubble the number of wins. The Jazz, Knicks, Spurs, and Trail Blazers are similar according to the clusters. The Bulls are in the same cluster as teams such as the Timberwolves, the Hornets, and the Hawks.

Hierarchical Clustering of NBA Players in 1996, 1997, and 1998

Hierarchical clustering is an alternative to partitional clustering for grouping objects based on their similarity. Unlike the k-means clustering algorithm above, hierarchical clustering does not require the number of clusters to be specified in advance.

There are two types of hierarchical clustering: agglomerative clustering and divisive clustering. The result of hierarchical clustering is a dendrogram, a tree-based representation of the objects.
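
As a quick illustration of the agglomerative flavour, the sketch below uses hclust from base R on ten made-up observations. The Ward linkage and the standardisation step are my own choices for this example; I am not claiming they match the settings used inside BasketballAnalyzeR.

```r
set.seed(7)

# Made-up data: 10 observations in two loose groups
toy <- rbind(
  matrix(rnorm(10, mean = 0), ncol = 2),
  matrix(rnorm(10, mean = 5), ncol = 2)
)
rownames(toy) <- paste0("obs", 1:10)

# Agglomerative clustering: start from single points and merge step by step
d  <- dist(scale(toy))               # Euclidean distances on standardised data
hc <- hclust(d, method = "ward.D2")  # Ward linkage (an assumption for this sketch)

# The dendrogram shows the full merge history
plot(hc)

# The number of clusters is only chosen afterwards, by cutting the tree
groups <- cutree(hc, k = 2)
print(groups)
```

Note that the tree is built once and can be cut at any height, so trying a different number of clusters does not require re-running the algorithm.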

For more details check out the link below:

https://www.geeksforgeeks.org/hierarchical-clustering-in-r-programming/

The code below is what we run to get the clusters using hierarchical clustering. It contains comments explaining each row.

#####################
## Hierarchical clustering of NBA players
#####################
#select seasons to analyze
seasonSelected <- c(1996, 1997, 1998)
#filter the player boxscores dataset to include the seasons we selected before and select the top 100 players in points scored
Pbox.sel <- Pbox %>% filter(Season %in% seasonSelected) %>% slice_max(PTS, n = 100)

#attach the player boxscores dataset from above
attach(Pbox.sel)
#create a data frame that contains the columns/stats that we want to use to cluster players
data <- data.frame(PTS, P3M, REB=OREB+DREB,
                   AST, TOV, STL, BLK, PF)
#detach the player boxscores dataset
detach(Pbox.sel)

#create the ID variable to annotate each data point, which is essentially the player name, their team, and season
ID <- paste0(Pbox.sel$Player,"-",Pbox.sel$Team,", ", Pbox.sel$Season)

#run the hierarchical clustering algorithm
hclu1 <- hclustering(data)
#plot the algorithm to choose the optimal number of clusters.
plot(hclu1)

#run the hierarchical clustering algorithm, adding the IDs and opting for 5 clusters
hclu2 <- hclustering(data, labels=ID, k=5)
#show the radar plot of each cluster
plot(hclu2, profiles=TRUE)
#plot the dendrogram
plot(hclu2, rect=TRUE, labels=ID, cex.labels=0.75)

### Variability of the clusters
#create a player boxscore subset, containing only players that played at least 1,000 minutes
Pbox.subset <- subset(Pbox.sel, MIN>=1000)
#define MIN
MIN <- Pbox.subset$MIN
#create a data frame with the player clusters, scaling the data and adding the minutes played
X <- data.frame(hclu2$Subjects, scale(data), MIN)

#select the variables we want to see the variability for
dvar <- c("PTS","P3M","REB","AST",
          "TOV","STL","BLK","PF")
#select the variable to use as the size
svar <- "MIN"
yRange <- range(X[,dvar])
quant <- quantile(x = X$MIN, type = 3)
sizeRange <- c(quant[[1]], quant[[5]])

#define the number of clusters
no.clu <- 5

p <- vector(no.clu, mode="list")
for (k in 1:no.clu) {
  XC <- subset(X, Cluster==k)
  vrb <- variability(XC[,3:11], data.var=dvar,
                     size.var=svar, weight=FALSE,
                     VC=FALSE)
  title <- paste("Cluster", k)
  p[[k]] <- plot(vrb, size.lim=sizeRange, ylim=yRange,
                 title=title, leg.pos=c(0,1),
                 leg.just=c(-0.5,0),
                 leg.box="vertical",
                 leg.brk=seq(quant[[1]],quant[[5]],(quant[[5]]-quant[[1]])/5),
                 leg.title.pos="left", leg.nrow=1,
                 max.circle=7)
}
library(gridExtra)
grid.arrange(grobs=p, ncol=3)

View(X)

The graph below on the left indicates that 5 clusters would be the best choice. This is where the incremental increase in number of clusters reaches a plateau.

The radial plots of the average cluster profiles are shown on the bottom right graph. It gives us an idea of what the clusters represent. For example, Cluster 1 has a high Cluster Heterogeneity Index and contains players with many points scored and around average stats in all other areas. Cluster 2 contains 3-point shooters.

The dendrogram below has the top 100 scorers in the 1996, 1997, and 1998 NBA seasons and their clusters. Surprise surprise: Michael Jordan’s seasons were awesome and unique. It’s interesting to see all Karl Malone and Grant Hill seasons in the cluster. I also find it interesting that Allen Iverson’s rookie season was placed in Cluster 1, with his 2nd season placed in Cluster 5, alongside the likes of Scottie Pippen and Gary Payton.

The variability diagrams above depict all players’ season performance in each cluster. We can see that in Cluster 1 there are some clear outliers in terms of points scored. We also see some awesome block stats in Cluster 3. Can you guess who? Look at the screenshot below for the answers!


I hope you enjoyed this tutorial. Stay tuned for part 2 which will be on Giannis and the Bucks’ championship season.


Full Code

#####################
# Install packages
#####################
install.packages("tidyverse")
install.packages("devtools")
devtools::install_github("abresler/nbastatR")
install.packages("BasketballAnalyzeR")
install.packages("jsonlite")
install.packages("janitor")
install.packages("extrafont")
install.packages("ggrepel")
install.packages("scales")
install.packages("teamcolors")
install.packages("zoo")
install.packages("future")
install.packages("lubridate")

######################
## Load packages
#####################
library(tidyverse)
library(nbastatR)
library(BasketballAnalyzeR)
library(jsonlite)
library(janitor)
library(extrafont)
library(ggrepel)
library(scales)
library(teamcolors)
library(zoo)
library(future)
library(lubridate)

Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 2)
#####################
## Get game IDs
#####################
# Select the seasons to analyze (any from 1949 onwards); here 1996 to 1998
selectedSeasons <- c(1996:1998)

# Get game IDs for Regular Season and Playoffs
gameIds_Reg <- suppressWarnings(seasons_schedule(seasons = selectedSeasons, season_types = "Regular Season") %>% select(idGame, slugMatchup))
gameIds_PO <- suppressWarnings(seasons_schedule(seasons = selectedSeasons, season_types = "Playoffs") %>% select(idGame, slugMatchup))
gameIds_all <- rbind(gameIds_Reg, gameIds_PO)
# Peek at the game IDs
head(gameIds_all)
tail(gameIds_all)

#####################
## Retrieve gamelog data for players and teams
#####################
# Get player gamelogs
P_gamelog_reg <- suppressWarnings(game_logs(seasons = selectedSeasons, league = "NBA", result_types = "player", season_types = "Regular Season"))
P_gamelog_po <- suppressWarnings(game_logs(seasons = selectedSeasons, league = "NBA", result_types = "player", season_types = "Playoffs"))
P_gamelog_all <- rbind(P_gamelog_reg, P_gamelog_po)
View(head(P_gamelog_all))

# Get team gamelogs
T_gamelog_reg <- suppressWarnings(game_logs(seasons = selectedSeasons, league = "NBA", result_types = "team", season_types = "Regular Season"))
T_gamelog_po <- suppressWarnings(game_logs(seasons = selectedSeasons, league = "NBA", result_types = "team", season_types = "Playoffs"))
T_gamelog_all <- rbind(T_gamelog_reg, T_gamelog_po)
View(head(T_gamelog_all))

#####################
## Create player and team boxscores
#####################
# Create Tbox (Team boxscore) per season
Tbox_all <- T_gamelog_all %>%
  group_by("Season"=yearSeason, "Team"=slugTeam) %>%
  dplyr::summarise(GP=n(), MIN=sum(round(minutesTeam/5)),
                   PTS=sum(ptsTeam),
                   W=sum(outcomeGame=="W"), L=sum(outcomeGame=="L"),
                   P2M=sum(fg2mTeam), P2A=sum(fg2aTeam), P2p=P2M/P2A,
                   P3M=sum(fg3mTeam), P3A=sum(fg3aTeam), P3p=P3M/P3A,
                   FTM=sum(ftmTeam), FTA=sum(ftaTeam), FTp=FTM/FTA,
                   OREB=sum(orebTeam), DREB=sum(drebTeam), AST=sum(astTeam),
                   TOV=sum(tovTeam), STL=sum(stlTeam), BLK=sum(blkTeam),
                   PF=sum(pfTeam), PM=sum(plusminusTeam)) %>%
  as.data.frame()
View(Tbox_all[Tbox_all$Team=="CHI",])

# Create Obox (Opponent Team boxscore) per season
Obox_all <- T_gamelog_all %>%
  group_by("Season"=yearSeason, "Team"=slugOpponent) %>%
  dplyr::summarise(GP=n(), MIN=sum(round(minutesTeam/5)),
                   PTS=sum(ptsTeam),
                   W=sum(outcomeGame=="L"), L=sum(outcomeGame=="W"),
                   P2M=sum(fg2mTeam), P2A=sum(fg2aTeam), P2p=P2M/P2A,
                   P3M=sum(fg3mTeam), P3A=sum(fg3aTeam), P3p=P3M/P3A,
                   FTM=sum(ftmTeam), FTA=sum(ftaTeam), FTp=FTM/FTA,
                   OREB=sum(orebTeam), DREB=sum(drebTeam), AST=sum(astTeam),
                   TOV=sum(tovTeam), STL=sum(stlTeam), BLK=sum(blkTeam),
                   PF=sum(pfTeam), PM=sum(plusminusTeam)) %>%
  as.data.frame()
View(Obox_all[Obox_all$Team=="CHI",])

# Create Pbox (Player boxscore) per season
Pbox_all <- P_gamelog_all %>%
  group_by("Season"=yearSeason, "Team"=slugTeam, "Player"=namePlayer) %>%
  dplyr::summarise(GP=n(), MIN=sum(minutes), PTS=sum(pts),
                   P2M=sum(fg2m), P2A=sum(fg2a), P2p=100*P2M/P2A,
                   P3M=sum(fg3m), P3A=sum(fg3a), P3p=100*P3M/P3A,
                   FTM=sum(ftm), FTA=sum(fta), FTp=100*FTM/FTA,
                   OREB=sum(oreb), DREB=sum(dreb), AST=sum(ast),
                   TOV=sum(tov), STL=sum(stl), BLK=sum(blk),
                   PF=sum(pf)) %>%
  as.data.frame()
View(Pbox_all[Pbox_all$Player=="Michael Jordan",])


#####################
## Use Regular Season data
#####################
Tbox <- T_gamelog_reg %>%
  group_by("Season"=yearSeason, "Team"=slugTeam) %>%
  dplyr::summarise(GP=n(), MIN=sum(round(minutesTeam/5)),
                   PTS=sum(ptsTeam),
                   W=sum(outcomeGame=="W"), L=sum(outcomeGame=="L"),
                   P2M=sum(fg2mTeam), P2A=sum(fg2aTeam), P2p=P2M/P2A,
                   P3M=sum(fg3mTeam), P3A=sum(fg3aTeam), P3p=P3M/P3A,
                   FTM=sum(ftmTeam), FTA=sum(ftaTeam), FTp=FTM/FTA,
                   OREB=sum(orebTeam), DREB=sum(drebTeam), AST=sum(astTeam),
                   TOV=sum(tovTeam), STL=sum(stlTeam), BLK=sum(blkTeam),
                   PF=sum(pfTeam), PM=sum(plusminusTeam)) %>%
  as.data.frame()

Obox <- T_gamelog_reg %>%
  group_by("Season"=yearSeason, "Team"=slugOpponent) %>%
  dplyr::summarise(GP=n(), MIN=sum(round(minutesTeam/5)),
                   PTS=sum(ptsTeam),
                   W=sum(outcomeGame=="L"), L=sum(outcomeGame=="W"),
                   P2M=sum(fg2mTeam), P2A=sum(fg2aTeam), P2p=P2M/P2A,
                   P3M=sum(fg3mTeam), P3A=sum(fg3aTeam), P3p=P3M/P3A,
                   FTM=sum(ftmTeam), FTA=sum(ftaTeam), FTp=FTM/FTA,
                   OREB=sum(orebTeam), DREB=sum(drebTeam), AST=sum(astTeam),
                   TOV=sum(tovTeam), STL=sum(stlTeam), BLK=sum(blkTeam),
                   PF=sum(pfTeam), PM=sum(plusminusTeam)) %>%
  as.data.frame()

# Create Pbox (Player boxscore) per season
Pbox <- P_gamelog_reg %>%
  group_by("Season"=yearSeason, "Team"=slugTeam, "Player"=namePlayer) %>%
  dplyr::summarise(GP=n(), MIN=sum(minutes), PTS=sum(pts),
                   P2M=sum(fg2m), P2A=sum(fg2a), P2p=100*P2M/P2A,
                   P3M=sum(fg3m), P3A=sum(fg3a), P3p=100*P3M/P3A,
                   FTM=sum(ftm), FTA=sum(fta), FTp=100*FTM/FTA,
                   OREB=sum(oreb), DREB=sum(dreb), AST=sum(ast),
                   TOV=sum(tov), STL=sum(stl), BLK=sum(blk),
                   PF=sum(pf)) %>%
  as.data.frame()
View(Pbox[Pbox$Player=="Michael Jordan",])
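
# Optional sketch (not part of the original script): Tbox_all, Obox_all, Tbox and
# Obox repeat the same summarise() call. A helper such as the hypothetical
# make_team_box() below could build all four. It assumes the nbastatR column
# names used above and keeps the shooting columns as proportions (0-1).
library(dplyr)
make_team_box <- function(gamelog, opponent = FALSE) {
  team_col <- if (opponent) "slugOpponent" else "slugTeam"
  # In the opponent view, a win for the listed team is a loss for the opponent
  won  <- if (opponent) "L" else "W"
  lost <- if (opponent) "W" else "L"
  gamelog %>%
    group_by(Season = yearSeason, Team = .data[[team_col]]) %>%
    dplyr::summarise(GP = n(), MIN = sum(round(minutesTeam/5)),
                     PTS = sum(ptsTeam),
                     W = sum(outcomeGame == won), L = sum(outcomeGame == lost),
                     P2M = sum(fg2mTeam), P2A = sum(fg2aTeam), P2p = P2M/P2A,
                     P3M = sum(fg3mTeam), P3A = sum(fg3aTeam), P3p = P3M/P3A,
                     FTM = sum(ftmTeam), FTA = sum(ftaTeam), FTp = FTM/FTA,
                     OREB = sum(orebTeam), DREB = sum(drebTeam), AST = sum(astTeam),
                     TOV = sum(tovTeam), STL = sum(stlTeam), BLK = sum(blkTeam),
                     PF = sum(pfTeam), PM = sum(plusminusTeam),
                     .groups = "drop") %>%
    as.data.frame()
}
# Example usage:
# Tbox <- make_team_box(T_gamelog_reg)
# Obox <- make_team_box(T_gamelog_reg, opponent = TRUE)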


#####################
## Bar plots
#####################
teamSelected <- "CHI"
Pbox.sel <- subset(Pbox, Team==teamSelected &
                    MIN>=1000)
seasonSelected <- 1998
barline(data=Pbox.sel[Pbox.sel$Season==seasonSelected,], id="Player",
        bars=c("P2M","P3M","FTM"), line="PTS",
        order.by="PTS", labels.bars=c("2PM","3PM","FTM"),
        title=paste0(teamSelected," - ",seasonSelected))

#####################
## Scatter plots
#####################
teamSelected <- "CHI"
Pbox.sel <- subset(Pbox, Team==teamSelected & MIN>=1000)
attach(Pbox.sel)
X <- data.frame(AST, TOV, PTS)/MIN
detach(Pbox.sel)
mypal <- colorRampPalette(c("blue","yellow","red"))

# paste0() avoids the extra spaces that paste() would insert around the comma
scatterplot(X, data.var=c("AST","TOV"), z.var="PTS",
            labels=paste0(Pbox.sel$Player,", ",Pbox.sel$Season), palette=mypal)

scatterplot(X, data.var=c("AST","TOV"), z.var="PTS",
            labels=paste0(Pbox.sel$Player,", ",Pbox.sel$Season), palette=mypal,
            zoom=c(0.08,0.16,0.05,0.10))

#####################
## Bubble plots
#####################
seasonSelected <- 1998
Tbox.sel <- subset(Tbox_all,Season==seasonSelected)

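attach(Tbox.sel)
# Tbox_all stores P2p, P3p and FTp as proportions (0-1); convert them to percentages to match the "% made" labels
X <- data.frame(T=Team, P2p=100*P2p, P3p=100*P3p, FTp=100*FTp, AS=P2A+P3A+FTA)
detach(Tbox.sel)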
labs <- c("2-point shots (% made)",
          "3-point shots (% made)",
          "free throws (% made)",
          "Total shots attempted")
bubbleplot(X, id="T", x="P2p", y="P3p", col="FTp",
           size="AS", labels=labs, title=paste0("NBA - ", seasonSelected))

teamsSelected <- c("CHI", "UTA", "IND", "LAL")
seasonSelected <- 1998
Pbox.sel <- subset(Pbox, Team %in% teamsSelected & MIN>=1500 & Season==seasonSelected)
                   
attach(Pbox.sel)
X <- data.frame(ID=Player, Team, V1=DREB/MIN, V2=STL/MIN,
                V3=BLK/MIN, V4=MIN)
detach(Pbox.sel)
labs <- c("Defensive Rebounds","Steals","Blocks",
          "Total minutes played")
bubbleplot(X, id="ID", x="V1", y="V2", col="V3",
           size="V4", text.col="Team", labels=labs,
           title=paste0("NBA Players in ", seasonSelected, "\n (values per minute)"),
           text.legend=TRUE, text.size=3.5, scale=FALSE)

#####################
## k-means clustering of NBA teams - using four factors
#####################
seasonSelected <- 1998
Tbox.sel <- subset(Tbox_all,Season==seasonSelected)
Obox.sel <- subset(Obox_all,Season==seasonSelected)

FF <- fourfactors(Tbox.sel,Obox.sel)
OD.Rtg <- FF$ORtg/FF$DRtg
F1.r <- FF$F1.Off/FF$F1.Def
F2.r <- FF$F2.Def/FF$F2.Off
F3.Off <- FF$F3.Off
F3.Def <- FF$F3.Def
P3M.ff <- Tbox.sel$P3M
STL.r <- Tbox.sel$STL/Obox.sel$STL
data <- data.frame(OD.Rtg, F1.r, F2.r, F3.Off, F3.Def,
                   P3M.ff, STL.r)

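# Use the pre-3.6.0 sampling algorithm so the seed reproduces results from older R versions
# (newer R versions print a warning here)
RNGkind(sample.kind="Rounding")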
set.seed(29)
kclu1 <- kclustering(data)
plot(kclu1)

set.seed(29)
kclu2 <- kclustering(data, labels=Tbox.sel$Team, k=7)
plot(kclu2)
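
# Optional sketch (not part of the original script): to see the general idea
# behind k-means outside BasketballAnalyzeR, base R's kmeans() can be run on
# standardized variables. This is only an analogy on made-up toy data, not a
# description of kclustering()'s internals; toy_* and demo_* names are hypothetical.
set.seed(1)
toy_teams <- data.frame(off = c(rnorm(10, 100, 2), rnorm(10, 110, 2)),
                        def = c(rnorm(10, 108, 2), rnorm(10, 101, 2)))
# scale() standardizes each column so no single variable dominates the distances
demo_km <- kmeans(scale(toy_teams), centers = 2, nstart = 25)
table(demo_km$cluster)   # number of toy teams in each cluster
demo_km$tot.withinss     # total within-cluster sum of squares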

# Average number of wins per cluster (Tbox_all counts regular-season and playoff wins)
kclu2.W <- tapply(Tbox.sel$W, kclu2$Subjects$Cluster, mean)
kclu2.W

cluster <- as.factor(kclu2$Subjects$Cluster)
Xbubble <- data.frame(Team=Tbox.sel$Team, PTS=Tbox.sel$PTS,
                      PTS.Opp=Obox.sel$PTS, cluster,
                      W=Tbox.sel$W)
labs <- c("PTS", "PTS.Opp", "cluster", "Wins")
bubbleplot(Xbubble, id="Team", x="PTS", y="PTS.Opp",
           col="cluster", size="W", labels=labs,
           title=paste0("NBA Team Clusters - ",seasonSelected))

#####################
## Hierarchical clustering of NBA players
#####################
#select seasons to analyze
seasonSelected <- c(1996, 1997, 1998)
#filter the player boxscores dataset to include the seasons we selected before and select the top 100 players in points scored
Pbox.sel <- Pbox %>% filter(Season %in% seasonSelected) %>% slice_max(PTS, n = 100)

#attach the player boxscores dataset from above
attach(Pbox.sel)
#create a data frame that contains the columns/stats that we want to use to cluster players
data <- data.frame(PTS, P3M, REB=OREB+DREB,
                   AST, TOV, STL, BLK, PF)
#detach the player boxscores dataset
detach(Pbox.sel)

#create the ID variable to annotate each data point, which is essentially the player name, their team, and season
ID <- paste0(Pbox.sel$Player,"-",Pbox.sel$Team,", ", Pbox.sel$Season)

#run the hierarchical clustering algorithm
hclu1 <- hclustering(data)
#plot the algorithm to choose the optimal number of clusters.
plot(hclu1)

#run the hierarchical clustering algorithm, adding the IDs and opting for 5 clusters
hclu2 <- hclustering(data, labels=ID, k=5)
#show the radar plot of each cluster
plot(hclu2, profiles=TRUE)
#plot the dendrogram
plot(hclu2, rect=TRUE, labels=ID, cex.labels=0.75)
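
# Optional sketch (not part of the original script): the same idea with base R,
# using hclust() on standardized toy data and cutree() to cut the tree into k
# groups. Ward linkage is an assumption made for this illustration, not a
# statement about hclustering()'s defaults; toy_* and demo_* names are hypothetical.
set.seed(1)
toy_players <- data.frame(PTS = runif(30, 500, 2500), AST = runif(30, 50, 800),
                          REB = runif(30, 100, 1000))
# Euclidean distances between standardized rows feed the agglomerative algorithm
demo_hc <- hclust(dist(scale(toy_players)), method = "ward.D2")
demo_groups <- cutree(demo_hc, k = 5)   # cluster label (1 to 5) for each row
table(demo_groups)
# plot(demo_hc)                         # dendrogram of the toy data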

### Variability of the clusters
#use the minutes played by the same 100 player-seasons that were clustered
#(filtering rows at this point, e.g. MIN>=1000, would misalign them with hclu2$Subjects and data)
MIN <- Pbox.sel$MIN
#create a data frame with the player clusters, scaling the data and adding the minutes played
X <- data.frame(hclu2$Subjects, scale(data), MIN)

#select the variables we want to see the variability for
dvar <- c("PTS","P3M","REB","AST",
          "TOV","STL","BLK","PF")
#select the variable to use as the size
svar <- "MIN"
yRange <- range(X[,dvar])
quant <- quantile(x = X$MIN, type = 3)
sizeRange <- c(quant[[1]], quant[[5]])

#define the number of clusters
no.clu <- 5

p <- vector(no.clu, mode="list")
for (k in 1:no.clu) {
  XC <- subset(X, Cluster==k)
  vrb <- variability(XC[,3:11], data.var=dvar,
                     size.var=svar, weight=FALSE,
                     VC=FALSE)
  title <- paste("Cluster", k)
  p[[k]] <- plot(vrb, size.lim=sizeRange, ylim=yRange,
                 title=title, leg.pos=c(0,1),
                 leg.just=c(-0.5,0),
                 leg.box="vertical",
                 leg.brk=seq(quant[[1]],quant[[5]],(quant[[5]]-quant[[1]])/5),
                 leg.title.pos="left", leg.nrow=1,
                 max.circle=7)
}
library(gridExtra)
grid.arrange(grobs=p, ncol=3)

8 comments

  1. I’ve attempted to run install.packages("BasketballAnalyzeR"). It installs, but I am unable to call on the library. It gives me an error message:

    Error: package or namespace load failed for ‘BasketballAnalyzeR’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
    there is no package called ‘rle’

    Here is a link to the other ways of calling on the library that I have tried:

    https://bodai.unibs.it/bdsports/basketballanalyzer/

    Could I get some help with this?

    1. Hey Josh!

      Try installing “rle” by running:
      install.packages("rle")

      If that still throws you errors, try:
      install.packages("devtools")
      devtools::install_github("sndmrc/BasketballAnalyzeR")

      If you still get errors, try:
      install.packages("BasketballAnalyzeR", repos="http://cran.rstudio.com/", dependencies=TRUE)

      If that still doesn’t work, I’ll try asking around for help too 🙂

      Let me know if any of the above works! Cheers!

      1. OK, it looks like BasketballAnalyzeR came through once the rle package was installed. For nbastatR, I needed to install it through devtools::install_github("abresler/nbastatR")

        Thanks for your help!
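
The install steps from this thread can be collected into one snippet. This is a minimal sketch: `missing_pkgs()` is a hypothetical helper written for illustration, and the GitHub repository names are the ones mentioned in the replies above.

```r
# Return the packages from `pkgs` that are not installed yet
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Check the packages discussed in this thread
missing_pkgs(c("rle", "devtools", "BasketballAnalyzeR", "nbastatR"))

# Then install whatever is reported as missing, for example:
# install.packages("rle")
# install.packages("devtools")
# devtools::install_github("sndmrc/BasketballAnalyzeR")
# devtools::install_github("abresler/nbastatR")
```

Note the straight quotes: R does not accept the curly quotes that blog software often substitutes in pasted code.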
