Introduction

With 10:25 remaining in the 3rd quarter of Game 3 of the 2021 NBA Finals, Suns center Deandre Ayton was called for his 4th foul of the game after he fouled Giannis Antetokounmpo on a layup. In a controversial decision, Suns head coach Monty Williams subbed out Ayton, one of the Suns’ most impactful players during their unlikely run to the Finals. Ayton had 24 points and 5 rebounds when he left the game, and by the time he was subbed back in at the beginning of the 4th quarter, the Suns were down 22 points and would never get back to within 15. They ended up losing Game 3 by 20 points and would go on to lose 4 consecutive games, losing the series after having a 2-0 lead.

This scenario raises the question: was Monty Williams correct to pull Ayton? This question is one of the most common in the world of basketball, and despite how often players run into foul trouble during big games, there is no clear, definitive answer on whether or not you should pull your players.

Many people believe decisions like the one Monty made in the NBA Finals are justified because you should not risk a player like Ayton fouling out of the game early. Players are of no use to a team if they are stuck on the bench, and coaches should make sure their best players are available during the most important minutes of the game, crunch time, rather than having them pick up fouls in less impactful minutes and be unavailable when the team really needs them.

Of course, many fans disagree with this line of thinking. Fans who believe pulling your players is wrong claim that coaches are essentially fouling out their own players by guaranteeing they play fewer minutes. While most people would claim having good players is more important at the end of close games, keeping your talent in as long as possible may keep the game from ever getting close in the first place. The odds they foul out are low to begin with, and even if they do, the better players have hopefully run up the score enough that you don’t even need them at the end.

Who is correct? The question is a complicated calculus: there are simply too many factors within a basketball game, like team match-ups, personnel, game plan, and the flow of the game so far, that could influence whether or not pulling a player is smart. However, this doesn’t mean there isn’t information available that could help coaches make more informed decisions about whether they are accomplishing what they think they are by pulling players or keeping them in.

Gathering Data

To investigate this question, I first needed to gather information on how often players get fouls. Unfortunately, there are no good databases available with information just on fouls in the NBA, meaning I needed to create my own. I include the results of my code in this report, but the code itself only when necessary. The full code can be viewed in the RMarkdown file.

Next, I load the play-by-play file for the 2020-2021 NBA season found on Big Data Ball. I chose this dataset because it includes the 10 players who were on the court at the time of each event, which I use to calculate the playing time a player had between fouls, not just the game time between fouls. I also ran a loop over the data frame to properly account for substitutions that take place at the beginning of quarters, which are not listed in the play-by-play.

If you are reading within the RMarkdown file, most of these early lines of code are commented out because I load the finished data, complete with all of the changes I make, to save time and computing power. Given the size of the raw data frame (506,000 observations), it takes over an hour for the computer to run all of the calculations.

## # A tibble: 6 × 23
##    game_id a1     a2     a3     a4    a5    h1    h2    h3    h4    h5    period
##      <dbl> <chr>  <chr>  <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>  <dbl>
## 1 22000001 Steph… James… Andre… Kell… Eric… DeAn… Kyri… Kevi… Joe … Spen…      1
## 2 22000001 Steph… James… Andre… Kell… Eric… DeAn… Kyri… Kevi… Joe … Spen…      1
## 3 22000001 Steph… James… Andre… Kell… Eric… DeAn… Kyri… Kevi… Joe … Spen…      1
## 4 22000001 Steph… James… Andre… Kell… Eric… DeAn… Kyri… Kevi… Joe … Spen…      1
## 5 22000001 Steph… James… Andre… Kell… Eric… DeAn… Kyri… Kevi… Joe … Spen…      1
## 6 22000001 Steph… James… Andre… Kell… Eric… DeAn… Kyri… Kevi… Joe … Spen…      1
## # … with 11 more variables: remaining_time <dbl>, elapsed <dbl>,
## #   event_type <chr>, entered <chr>, left <chr>, player_name <chr>, type <chr>,
## #   id <int>, game_seconds_remaining <dbl>, all_players <chr>,
## #   total_rownumber <int>

Next, I imported a dataset for the 2020-2021 NBA season, which I use to get a complete record of every NBA player’s name and position. I then create a list of all players that played in each game, with their position.

game_id   player_name      pos
22000001  Stephen Curry    G
22000001  James Wiseman    C
22000001  Andrew Wiggins   F
22000001  Kelly Oubre Jr.  F-G
22000001  Eric Paschall    F
22000001  DeAndre Jordan   C

I next created a function that takes a player’s name, a row number from the pbp data frame, and a game ID, and outputs the amount of time the player has been in that game up until the desired event. The full code can be viewed in the RMarkdown file.

For example, to see how many seconds of playing time JJ Redick had in game 22000005 up until row 7158, you would run:

#Returns Redick's playing time, in seconds, until row 7158
#in the Pelicans' game against the Heat
playing_time("JJ Redick", 7158, 22000005)
## [1] 1524

This returns 1524 seconds. Checking against ESPN, we can see Redick played 1524 seconds, or about 25 minutes, meaning the playing_time function works as intended.
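The idea behind a function like playing_time can be sketched on toy data. This is not the author's implementation (that lives in the RMarkdown file); the data frame `pbp_toy`, its columns `row_id`, `elapsed_total`, and `on_court`, and the function `playing_time_sketch` are all hypothetical names invented for illustration. The sketch assumes each row lists the lineup on the court from that event until the next one.

```r
# Hypothetical toy play-by-play: one row per event.
# `elapsed_total` is game seconds elapsed at the event; `on_court` lists the
# players on the floor from this event until the next (an assumption).
pbp_toy <- data.frame(
  game_id       = 1,
  row_id        = 1:5,
  elapsed_total = c(0, 100, 250, 400, 600),
  on_court      = c("A;B", "A;B", "B;C", "A;C", "A;C"),
  stringsAsFactors = FALSE
)

playing_time_sketch <- function(player, row, game, pbp) {
  # Keep this game's events up to and including the requested row
  g <- pbp[pbp$game_id == game & pbp$row_id <= row, ]
  # Was the player listed on the court at each event?
  on <- vapply(strsplit(g$on_court, ";", fixed = TRUE),
               function(p) player %in% p, logical(1))
  # Seconds between consecutive events
  gaps <- diff(g$elapsed_total)
  # Credit each gap to the player if they were on court at the event starting it
  sum(gaps[on[-length(on)]])
}

playing_time_sketch("A", 5, 1, pbp_toy)
```

In this toy game, player A sits out only the stretch between rows 3 and 4, so A is credited with the other three gaps.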

I then iterate through the entire pbp data frame, running the playing_time function on each row to get how much the player responsible for that row’s event has played in the game up until (and including) that row. Given the size of the data frame, this can take a long time. In the RMarkdown file, I comment this out because I load the completed csv, with these calculations, later.

I then make a data frame of the total playing time, in seconds, of each player in each game.

game_id   player_name     pos  end_game_pt
22000001  Andrew Wiggins  F    1874
22000001  Brad Wanamaker  G    1015
22000001  Bruce Brown     G-F  421
22000001  Caris LeVert    G    1474
22000001  Damion Lee      G-F  726
22000001  DeAndre Jordan  C    1013

Lastly, I create a new data frame which lists each time a player gets a foul in a given game, whether or not they get a given foul in a game, how long it takes to get from one foul to the next, and the total playing time for a player in a game.

I create two data frames with this data. The first has purely the playing times at which each player gets each foul in a game, along with the other information described above.

## # A tibble: 6 × 22
##    game_id player_name      pos   foul_1 foul_2 foul_3 foul_4 foul_5 foul_6
##      <dbl> <chr>            <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 22000015 Brandon Goodwin  G          1     40    625    696     NA     NA
## 2 22000027 Patty Mills      G          1   1260     NA     NA     NA     NA
## 3 22000773 Gabe Vincent     G          1     NA     NA     NA     NA     NA
## 4 42000101 Daniel Gafford   F-C        1    251    581    880   1187     NA
## 5 42000105 Garrison Mathews G          1     23     80     NA     NA     NA
## 6 22000230 Josh Jackson     G-F        2    859    907   1540   1665   1843
## # … with 13 more variables: foul_1_indicator <dbl>, foul_2_indicator <dbl>,
## #   foul_3_indicator <dbl>, foul_4_indicator <dbl>, foul_5_indicator <dbl>,
## #   foul_6_indicator <dbl>, end_game_pt <dbl>, 0-1 <dbl>, 1-2 <dbl>, 2-3 <dbl>,
## #   3-4 <dbl>, 4-5 <dbl>, 5-6 <dbl>

For the other data frame, I decided to impute the first foul a player did not get with the total playing time they had in the game. For example, if Mikal Bridges gets 3 fouls in a game, the unimputed data frame has the amount of playing time Bridges had when he got each of his first 3 fouls, and NA for fouls 4, 5, and 6. This makes sense in theory, but it doesn’t account for the fact that he played minutes after his 3rd foul without fouling! I decided to pretend that every player gets their next foul in their last second of playing time. We don’t know how long Bridges could have played before getting his 4th foul, since he doesn’t play forever, and this is the next best substitute we have (the foul indicators here still reflect the fact that Bridges never actually got his 4th foul).

## # A tibble: 6 × 22
##    game_id player_name      pos   foul_1 foul_2 foul_3 foul_4 foul_5 foul_6
##      <dbl> <chr>            <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 22000015 Brandon Goodwin  G          1     40    625    696     NA     NA
## 2 22000027 Patty Mills      G          1   1260   1747     NA     NA     NA
## 3 22000773 Gabe Vincent     G          1     55     NA     NA     NA     NA
## 4 42000101 Daniel Gafford   F-C        1    251    581    880   1187     NA
## 5 42000105 Garrison Mathews G          1     23     80    136     NA     NA
## 6 22000230 Josh Jackson     G-F        2    859    907   1540   1665   1843
## # … with 13 more variables: foul_1_indicator <dbl>, foul_2_indicator <dbl>,
## #   foul_3_indicator <dbl>, foul_4_indicator <dbl>, foul_5_indicator <dbl>,
## #   foul_6_indicator <dbl>, 0-1 <dbl>, 1-2 <dbl>, 2-3 <dbl>, 3-4 <dbl>,
## #   4-5 <dbl>, 5-6 <dbl>, end_game_pt <dbl>
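The imputation rule described above can be sketched in base R. This is a minimal illustration rather than the author's code: the toy data frame `fouls_toy` and the helper `impute_next_foul` are hypothetical, and only the column layout (foul_1, foul_2, ..., end_game_pt) is borrowed from the output shown.

```r
# Hypothetical toy rows mirroring the foul_N / end_game_pt layout;
# the values are invented for illustration.
fouls_toy <- data.frame(
  player_name = c("P1", "P2"),
  foul_1      = c(100, 50),
  foul_2      = c(400, NA),
  foul_3      = c(NA, NA),
  end_game_pt = c(900, 300)
)

impute_next_foul <- function(df, foul_cols) {
  m <- as.matrix(df[, foul_cols])
  # Index of the first missing foul in each row (NA if none is missing)
  first_na <- apply(m, 1, function(r) which(is.na(r))[1])
  rows <- which(!is.na(first_na))
  # Replace only that first missing foul with the player's total playing time
  m[cbind(rows, first_na[rows])] <- df$end_game_pt[rows]
  df[, foul_cols] <- m
  df
}

imputed <- impute_next_foul(fouls_toy, c("foul_1", "foul_2", "foul_3"))
# Time between consecutive fouls, e.g. "1-2" = foul_2 - foul_1
imputed$`1-2` <- imputed$foul_2 - imputed$foul_1
imputed
```

Only the first missing foul is filled in, so P2's foul_3 stays NA, matching the pattern in the imputed output above where later fouls remain NA.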

We can now begin to analyze the data.

Summarizing the Data

I want to start by calculating some summary statistics.

Summary statistics for unimputed fouls by foul_change and position. The mean and standard deviation of the time between fouls are calculated for each foul_change and position. I also calculate the percentage of players who foul out from each row.

pos  foul_change  n     mean     sd
C    0-1          1713  531.858  436.771
C-F  0-1          1009  444.625  374.800
F    0-1          5331  545.659  451.061
F-C  0-1          1587  444.708  374.998
F-G  0-1          1002  596.766  465.092
G    0-1          6834  585.550  465.507
G-F  0-1          2266  535.584  429.542
C    1-2          1264  412.883  356.490
C-F  1-2          760   373.736  329.990
F    1-2          3595  444.404  377.165
F-C  1-2          1127  395.578  346.549
F-G  1-2          707   478.402  386.147
G    1-2          4375  466.188  388.280
G-F  1-2          1530  451.837  377.778
C    2-3          774   320.578  285.904
C-F  2-3          510   316.124  271.711
F    2-3          2070  382.985  323.407
F-C  2-3          713   334.539  299.953
F-G  2-3          375   424.443  357.980
G    2-3          2293  396.813  345.891
G-F  2-3          892   348.876  300.638
C    3-4          375   267.480  251.704
C-F  3-4          263   306.084  270.198
F    3-4          955   307.426  257.361
F-C  3-4          386   280.122  244.157
F-G  3-4          159   362.440  277.916
G    3-4          953   314.781  262.346
G-F  3-4          395   321.122  268.611
C    4-5          133   189.564  172.310
C-F  4-5          103   213.019  169.047
F    4-5          323   217.672  181.181
F-C  4-5          140   213.450  174.020
F-G  4-5          52    170.712  161.144
G    4-5          290   224.797  205.542
G-F  4-5          144   243.542  213.837
C    5-6          27    139.296  103.541
C-F  5-6          28    137.250  103.863
F    5-6          68    148.044  124.763
F-C  5-6          26    174.962  138.536
F-G  5-6          13    208.615  164.469
G    5-6          39    142.795  119.264
G-F  5-6          27    171.778  139.180
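A grouped summary of this shape (n, mean, and sd per position and foul_change) can be sketched as follows. The report's own code uses dplyr; base R is used here only to keep the example self-contained. The data frame `toy` and its values are invented for illustration, and only the column names come from the report.

```r
# Hypothetical long-format toy data: one row per time between fouls.
toy <- data.frame(
  pos              = c("C", "C", "C", "G", "G", "G"),
  foul_change      = c("0-1", "0-1", "1-2", "0-1", "0-1", "0-1"),
  foul_change_time = c(500, 700, 300, 400, 600, 800)
)

# n, mean, and sd of foul_change_time for each pos x foul_change cell
summ <- aggregate(
  foul_change_time ~ pos + foul_change,
  data = toy,
  FUN  = function(x) c(n = length(x), mean = mean(x), sd = sd(x))
)
# aggregate() stores the three statistics in one matrix column; flatten it
summ <- do.call(data.frame, summ)
names(summ) <- c("pos", "foul_change", "n", "mean", "sd")
summ
```

A cell with a single observation, like C 1-2 in this toy data, gets an sd of NA, which is worth keeping in mind for the sparse 5-6 rows.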

I then summarize the foul statistics by only foul_change.

foul_change  n      mean     sd
0-1          19744  546.397  446.244
1-2          13359  443.083  375.482
2-3          7627   369.859  322.200
3-4          3486   306.076  260.560
4-5          1185   216.441  188.388
5-6          228    154.118  124.900

Lastly I summarize foul statistics by only position.

pos  n      mean     sd
C    4286   422.390  381.981
C-F  2673   374.176  333.033
F    12342  459.673  401.254
F-C  3979   385.185  342.656
F-G  2308   504.581  419.060
G    14784  495.256  421.898
G-F  5254   453.501  388.432

Position Linear Model

I then run an F-test of foul_change_time on position to find whether or not the mean foul_change_time varies depending on position.

pos.lm <- foul_pivot_imputed %>% lm(foul_change_time ~ as.factor(pos), data = .)

summary(pos.lm)
## 
## Call:
## lm(formula = foul_change_time ~ as.factor(pos), data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -548.5 -327.5 -124.7  209.5 2429.5 
## 
## Coefficients:
##                   Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)        439.915      5.680  77.443 < 0.0000000000000002 ***
## as.factor(pos)C-F  -50.736      9.230  -5.497    0.000000038822302 ***
## as.factor(pos)F     46.818      6.555   7.142    0.000000000000927 ***
## as.factor(pos)F-C  -37.959      8.162  -4.651    0.000003312721135 ***
## as.factor(pos)F-G  109.540      9.457  11.583 < 0.0000000000000002 ***
## as.factor(pos)G     92.616      6.383  14.511 < 0.0000000000000002 ***
## as.factor(pos)G-F   47.057      7.569   6.217    0.000000000508183 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 435.5 on 65987 degrees of freedom
## Multiple R-squared:  0.01156,    Adjusted R-squared:  0.01147 
## F-statistic: 128.6 on 6 and 65987 DF,  p-value: < 0.00000000000000022

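Given the large F-statistic (128.6 on 6 and 65987 degrees of freedom) and the correspondingly small p-value, we can safely conclude that position significantly impacts the rate at which players get fouls, although position alone explains only about 1% of the variance (R-squared of 0.012). This should make sense. Big men often have a very different role in defenses, and given their job of defending the rim, where more fouls occur, Centers should get called for more fouls and therefore pick them up in quicker succession (they theoretically should play as many minutes per game as any other position). The coefficients agree: relative to the baseline C, only C-F and F-C have shorter mean times between fouls, while F, G-F, G, and F-G all take longer.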
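As a rough check on how the overall F-test is read off a fitted model, the sketch below fits the same kind of one-factor regression on simulated data and then recomputes the p-value for the F-statistic reported above. The data frame `toy` and its simulated values are assumptions for illustration only; the numbers 128.6, 6, and 65987 are taken from the summary output.

```r
# Simulated toy data (invented values), three positions with different means
set.seed(1)
toy <- data.frame(
  pos = rep(c("C", "F", "G"), each = 50),
  foul_change_time = c(rnorm(50, 440, 100),
                       rnorm(50, 490, 100),
                       rnorm(50, 530, 100))
)

fit <- lm(foul_change_time ~ as.factor(pos), data = toy)

# summary.lm stores the overall F-test as c(value, numdf, dendf)
f <- summary(fit)$fstatistic
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# anova() reports the same F-statistic for a single-factor model
anova(fit)

# p-value implied by the F-statistic reported for pos.lm
p_reported <- pf(128.6, 6, 65987, lower.tail = FALSE)
p_reported < 2.2e-16
```

The last line agrees with the "< 0.00000000000000022" shown in the summary, which is simply the smallest p-value R prints by default.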

Foul Change Linear Model

I then run a similar linear regression to test whether the rate at which a player gets fouls changes depending on which foul they are receiving.

fc.lm <- foul_pivot_imputed %>% lm(foul_change_time ~ as.factor(foul_change), data = .)

summary(fc.lm)
## 
## Call:
## lm(formula = foul_change_time ~ as.factor(foul_change), data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -608.1 -309.1 -103.1  209.9 2352.9 
## 
## Coefficients:
##                           Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)                609.106      2.694  226.12 <0.0000000000000002 ***
## as.factor(foul_change)1-2 -116.948      4.090  -28.60 <0.0000000000000002 ***
## as.factor(foul_change)2-3 -200.793      4.659  -43.10 <0.0000000000000002 ***
## as.factor(foul_change)3-4 -283.370      5.852  -48.42 <0.0000000000000002 ***
## as.factor(foul_change)4-5 -374.722      8.511  -44.03 <0.0000000000000002 ***
## as.factor(foul_change)5-6 -441.033     15.460  -28.53 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 422.7 on 65988 degrees of freedom
## Multiple R-squared:  0.06868,    Adjusted R-squared:  0.06861 
## F-statistic: 973.2 on 5 and 65988 DF,  p-value: < 0.00000000000000022

Given the very large F-statistic of this linear model (973.2), we can also safely conclude that the mean foul times for these fouls are significantly different.

Foul Time Linear Model

Lastly, I run a linear regression of the time between fouls on the game time at which the foul was called, to see whether when in the game you play has a significant impact on how often fouls are called.

foul_w_time_pivot2 <- foul_w_time_pivot %>% select(game_id, player_name, foul_type, foul_game_time)

foul_pivot_unimputed2 <- foul_pivot_unimputed %>% left_join(foul_w_time_pivot2,
            by = c("game_id", "player_name", "foul_type"))

# Add 0/1 indicator columns for position and for which foul was received.
# (These indicators are not used by ft_reg below.)
foul_pivot_unimputed3 <- foul_pivot_unimputed2 %>%
  mutate(is_GF = case_when(pos == "G-F" ~ 1, TRUE ~ 0),
         is_F = case_when(pos == "F" ~ 1, TRUE ~ 0),
         is_FC = case_when(pos == "F-C" ~ 1, TRUE ~ 0),
         is_C = case_when(pos == "C" ~ 1, TRUE ~ 0)) %>%
  mutate(is_f1 = ifelse(foul_type == "foul_1", 1, 0),
         is_f2 = ifelse(foul_type == "foul_2", 1, 0),
         is_f3 = ifelse(foul_type == "foul_3", 1, 0),
         is_f4 = ifelse(foul_type == "foul_4", 1, 0))

ft_reg <- foul_pivot_unimputed2 %>% lm(foul_change_time ~ foul_game_time, data = .)

summary(ft_reg)
## 
## Call:
## lm(formula = foul_change_time ~ foul_game_time, data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -870.6 -262.4  -67.5  194.8 2147.9 
## 
## Coefficients:
##                  Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)    220.980674   3.747077   58.97 <0.0000000000000002 ***
## foul_game_time   0.150930   0.002115   71.36 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 378.5 on 46199 degrees of freedom
## Multiple R-squared:  0.09927,    Adjusted R-squared:  0.09925 
## F-statistic:  5092 on 1 and 46199 DF,  p-value: < 0.00000000000000022

Like the previous regressions, given the very low p-value, we can safely conclude that game time significantly impacts how long it takes for players to get fouls, with times closer to the end of games being associated with more fouls. This doesn’t immediately make sense: there isn’t any inherent reason why the end of games would have more fouls than any other period. It can likely be explained by refs scrutinizing the end of games more closely and by intentional fouls.

Visualizations

Box plot of distribution of foul times.

Similar box plot of foul_time by foul_change.

This confirms what should have been evident already about the rate at which players get fouls. The first 4 fouls a player gets are rarely strategic and often come during the beginning and middle of a game, so the time it takes to get a given foul is relatively random. The 5th and 6th fouls, however, are more often strategic. They typically come in the form of intentional fouls at the end of a game, when players are more likely to have more fouls, which is a likely reason the mean time between the 5th and 6th fouls is much smaller than the mean time to the 1st.

Lastly, I look at the distribution of when in a player’s playing time they receive each foul.

It is clear that fouls are not normally distributed, since they come much more often at the end of the game. However, for reasons elaborated on later, I am going to make a simplifying assumption that they are, with the means and standard deviations calculated earlier.

Foul Out Model

The final step is using these distributions to estimate the probability a player will foul out given an initial starting condition, particularly their position and seconds remaining in a game.

For this analysis, I make four important assumptions in my model, based on the regressions I ran earlier in the paper.

  1. A player’s position does significantly impact the rate at which they get fouls.

  2. Which foul a player is “waiting on” does significantly impact the mean and standard deviation of the foul rate.

  3. The time in the game you play does not significantly impact the rate at which you earn fouls.

  4. The rate at which you get fouls is normally distributed.

It is important to note that the 3rd assumption is the opposite of the conclusion the earlier regression showed. I justify this assumption because the primary factor in fouls being called at faster rates later in the game is intentional fouls, which, by definition, are a choice. If you had a player in foul trouble, you could simply choose not to have them intentionally foul, keeping them earning fouls at a typical rate.

The 4th assumption is also not perfectly supported by the data. However, games that would otherwise end and truncate the distribution can go into overtime, and intentional fouls distort the data, so I am treating the time between fouls as normally distributed. Future analysis would likely remove this assumption.

I am going to be using the imputed data frame for this model in the hopes of having a result that better reflects the time players play without fouling.

The foul-out probability will be calculated with a probabilistic model, with the summary statistics used depending on the player’s position.

Probabilistic Model

This model simply treats the time it takes to get all of a player’s remaining fouls, from an initial starting point, as a new normal distribution. The probability that this time is less than the remaining time in the game is then the probability that the player fouls out. Iterating over the additional minutes a player plays from a given starting point gives a graph of the probability the player fouls out versus the additional time they play.

It’s important to note this function does not return the probability you foul out in the game, since that is completely dependent on how much longer the coach chooses to play their player, but the probability they foul out given ‘X’ number of additional minutes from this point. Obviously, as X increases and the player plays more time, the probability they would foul out increases and approaches 1. In addition, you can input the seconds remaining in a game to see the probability you foul out if you play all of the remaining time.

foul_out_model("Deandre Ayton", "C", 4, 1345, T)
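The calculation behind a model like this can be sketched in a few lines. This is not the author's foul_out_model; the function `foul_out_prob_sketch` and the data frame `c_stats` are hypothetical. The sketch assumes the times between fouls are independent, so their sum is normal with the means added and the variances added, and it uses the unimputed statistics for centers from the summary table earlier (the report's model uses the imputed data frame, whose statistics are not shown).

```r
# Unimputed summary statistics for pos == "C" from the earlier table:
# mean and sd (seconds of playing time) for the 4-5 and 5-6 transitions.
c_stats <- data.frame(
  foul_change = c("4-5", "5-6"),
  mean        = c(189.564, 139.296),
  sd          = c(172.310, 103.541)
)

foul_out_prob_sketch <- function(stats, seconds_played) {
  # Time to collect all remaining fouls: sum of the individual waits
  total_mean <- sum(stats$mean)
  # Variances add only under the independence assumption made here
  total_sd <- sqrt(sum(stats$sd^2))
  # P(time to foul out <= additional seconds played)
  pnorm(seconds_played, mean = total_mean, sd = total_sd)
}

# Probability of fouling out versus additional seconds played,
# for a center with 4 fouls and 1345 seconds left
extra_seconds <- seq(0, 1345, by = 60)
probs <- foul_out_prob_sketch(c_stats, extra_seconds)
round(probs, 3)
```

One side effect of the normality assumption is visible at zero additional seconds, where the sketch still returns a probability of about 0.05 rather than 0, because the normal curve puts some mass on negative times.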