With 10:25 remaining in the 3rd quarter of Game 3 of the 2021 NBA Finals, Suns center Deandre Ayton was called for his 4th foul of the game after he fouled Giannis Antetokounmpo on a layup. In a controversial decision, Suns head coach Monty Williams subbed out Ayton, one of the Suns' most impactful players during their unlikely run to the Finals. Ayton had 24 points and 5 rebounds when he was subbed out of the game, and by the time he was subbed back in at the beginning of the 4th quarter, the Suns were down 22 points and would never get the game back to within 15 points. They ended up losing Game 3 by 20 points and would go on to lose 4 consecutive games, losing the series after having a 2-0 lead.

This scenario raises the question: was Monty Williams correct to pull Ayton? This question is one of the most common in the world of basketball, and despite how often players run into foul trouble during big games, there is no clear, definitive answer on whether or not you should pull your players.

Many people believe decisions like the one Monty made in the NBA Finals are justified because you should not risk a player like Ayton fouling out of the game early. Players are of no use to a team if they are stuck on the bench, and coaches should make sure their best players are available during the most important minutes of the game - crunch time - rather than having them pick up fouls in less impactful minutes and be unavailable when they are really needed.

Of course, many fans disagree with this line of thinking. Fans who believe pulling your players is wrong claim that coaches are essentially fouling out their own players by guaranteeing they play fewer minutes. While most people would claim having good players is more important at the end of close games, keeping your talent in as long as possible may keep the game from ever getting close in the first place. The odds they foul out are low to begin with, and even if they do, the better players have hopefully run up the score enough that you don’t even need them at the end.

Who is correct? The question is complicated: there are simply too many factors within a basketball game, like team match-ups, personnel, game plan, and the flow of the game so far, that could influence whether pulling a player is smart. However, this doesn’t mean there isn’t information available that could help coaches make more informed decisions about whether they are accomplishing what they think they are by pulling players or keeping them in.

To investigate this question, I need to start by gathering information on how often players get fouls. Unfortunately, there are no good databases available with information just on fouls in the NBA, meaning I need to create my own. I will include the results of my code in this report, but the code itself only when necessary. The full code can be viewed in the RMarkdown file.

Next, I load the play-by-play file for the 2020-2021 NBA season found on Big Data Ball. I chose this dataset because each event lists the 10 players who were on the court at the time, which lets me calculate the playing time a player had between fouls, not just the game time between fouls. I also ran a loop on the data frame to properly account for substitutions that take place at the beginning of quarters, which are not listed in the play-by-play.

If you are reading within the RMarkdown file, I have most of these early lines of code commented out because I load the finished data, complete with all of the changes I make, to save time and computing power. Given the size of the raw data frame (506,000 observations), it takes over an hour for the computer to run all of the calculations.

```
## # A tibble: 6 × 23
## game_id a1 a2 a3 a4 a5 h1 h2 h3 h4 h5 period
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 22000001 Steph… James… Andre… Kell… Eric… DeAn… Kyri… Kevi… Joe … Spen… 1
## 2 22000001 Steph… James… Andre… Kell… Eric… DeAn… Kyri… Kevi… Joe … Spen… 1
## 3 22000001 Steph… James… Andre… Kell… Eric… DeAn… Kyri… Kevi… Joe … Spen… 1
## 4 22000001 Steph… James… Andre… Kell… Eric… DeAn… Kyri… Kevi… Joe … Spen… 1
## 5 22000001 Steph… James… Andre… Kell… Eric… DeAn… Kyri… Kevi… Joe … Spen… 1
## 6 22000001 Steph… James… Andre… Kell… Eric… DeAn… Kyri… Kevi… Joe … Spen… 1
## # … with 11 more variables: remaining_time <dbl>, elapsed <dbl>,
## # event_type <chr>, entered <chr>, left <chr>, player_name <chr>, type <chr>,
## # id <int>, game_seconds_remaining <dbl>, all_players <chr>,
## # total_rownumber <int>
```

Next, I import a data set for the 2020-2021 NBA season, which I use to get a complete record of every NBA player’s name and position. I then create a list of all players that played in each game, along with their position.

game_id | player_name | pos |
---|---|---|
22000001 | Stephen Curry | G |
22000001 | James Wiseman | C |
22000001 | Andrew Wiggins | F |
22000001 | Kelly Oubre Jr. | F-G |
22000001 | Eric Paschall | F |
22000001 | DeAndre Jordan | C |

I next created a function, playing_time, that takes a player’s name, a row number from the pbp data frame, and a game ID, and outputs the amount of time the player has been on the court in that game up until the given row. The full code can be viewed in the RMarkdown file.

For example, to see how many seconds of playing time ‘JJ Redick’ has gotten in game ‘22000005’ up until row ‘7158’, you would run:

```
# Returns Redick's playing time, in seconds, until row 7158
# in the Pelicans' game against the Heat
playing_time("JJ Redick", 7158, 22000005)
```

`## [1] 1524`

This returns 1524 seconds, or about 25 minutes. Checking against ESPN confirms Redick's playing time in that game, meaning the playing_time function works as intended.
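As a rough illustration of how a function like this could work, here is a minimal sketch. It is not the playing_time function from the RMarkdown file: the name `playing_time_sketch`, the toy data, the assumption that `all_players` holds the on-court names as a single string, and the rule that the lineup listed on a row was on court for the interval ending at that row are all assumptions made for the example.

```r
# Hypothetical sketch, NOT the author's playing_time().
# Column names follow the pbp tibble shown earlier.
playing_time_sketch <- function(pbp, player, row, game) {
  # Keep this game's events up to and including the requested row.
  g <- pbp[pbp$game_id == game & pbp$total_rownumber <= row, ]
  g <- g[order(g$total_rownumber), ]
  # Seconds elapsed between each event and the one before it
  # (game_seconds_remaining counts down, so the differences are negated).
  secs <- c(0, -diff(g$game_seconds_remaining))
  # Assumption: the lineup listed on a row was on court for the
  # interval ending at that row.
  on_court <- grepl(player, g$all_players, fixed = TRUE)
  sum(secs[on_court])
}

# Toy play-by-play with made-up players and times.
toy_pbp <- data.frame(
  game_id = 1,
  total_rownumber = 1:4,
  game_seconds_remaining = c(2880, 2850, 2800, 2700),
  all_players = c("Player A, Player B", "Player A, Player B",
                  "Player B, Player C", "Player A, Player C")
)

playing_time_sketch(toy_pbp, "Player A", 4, 1)
```

Substring matching on names is fragile (one name can be contained in another), so a real implementation would more likely compare against the individual lineup columns `a1` through `h5`.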

I then iterate through the entire pbp data frame, running the playing_time function on each row to get how much the player responsible for that row’s event has played in the game up until (and including) that row. Given the size of the data frame, this can take a long time. In the RMarkdown file, I comment this out because I upload the completed csv, with these calculations, later.

I then make a data frame of the total playing time, in seconds, of each player in each game (end_game_pt).

game_id | player_name | pos | end_game_pt |
---|---|---|---|
22000001 | Andrew Wiggins | F | 1874 |
22000001 | Brad Wanamaker | G | 1015 |
22000001 | Bruce Brown | G-F | 421 |
22000001 | Caris LeVert | G | 1474 |
22000001 | Damion Lee | G-F | 726 |
22000001 | DeAndre Jordan | C | 1013 |

Lastly, I create a new data frame which lists the playing time at which a player gets each foul in a given game, an indicator for whether or not they got each foul, how long it took to get from one foul to the next, and the player's total playing time in the game.

I create two data frames with this data. The first has purely the playing times at which each player gets each foul in a game, along with the other information described above.

```
## # A tibble: 6 × 22
## game_id player_name pos foul_1 foul_2 foul_3 foul_4 foul_5 foul_6
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 22000015 Brandon Goodwin G 1 40 625 696 NA NA
## 2 22000027 Patty Mills G 1 1260 NA NA NA NA
## 3 22000773 Gabe Vincent G 1 NA NA NA NA NA
## 4 42000101 Daniel Gafford F-C 1 251 581 880 1187 NA
## 5 42000105 Garrison Mathews G 1 23 80 NA NA NA
## 6 22000230 Josh Jackson G-F 2 859 907 1540 1665 1843
## # … with 13 more variables: foul_1_indicator <dbl>, foul_2_indicator <dbl>,
## # foul_3_indicator <dbl>, foul_4_indicator <dbl>, foul_5_indicator <dbl>,
## # foul_6_indicator <dbl>, end_game_pt <dbl>, 0-1 <dbl>, 1-2 <dbl>, 2-3 <dbl>,
## # 3-4 <dbl>, 4-5 <dbl>, 5-6 <dbl>
```
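The gap columns (`0-1`, `1-2`, and so on) hold the playing time between consecutive fouls. A minimal sketch of how they could be derived from the foul times, assuming the clock for the first gap starts at 0 seconds of playing time (the helper name `foul_gaps` is hypothetical and not taken from the RMarkdown file):

```r
# Hypothetical helper: playing time between consecutive fouls.
# foul_times: playing time (seconds) at which each foul occurred, in order.
foul_gaps <- function(foul_times) {
  # Assumption: the "0-1" gap is measured from 0 seconds of playing time.
  gaps <- diff(c(0, foul_times))
  names(gaps) <- paste(seq_along(gaps) - 1, seq_along(gaps), sep = "-")
  gaps
}

# Daniel Gafford's five fouls from the table above.
gafford_gaps <- foul_gaps(c(1, 251, 581, 880, 1187))
gafford_gaps
```

For Gafford this gives gaps of 1, 250, 330, 299, and 307 seconds, labeled `0-1` through `4-5`.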

For the other data frame, I decided to impute the first foul a player did not get with the total playing time they had in the game. For example, if Mikal Bridges gets 3 fouls in a game, the first data frame would have the amount of playing time Bridges had when he got each of his first 3 fouls, and NA for fouls 4, 5, and 6. This makes sense in theory, but it doesn’t account for the fact that he played minutes after his 3rd foul without fouling! I decided to pretend that every player gets their next foul in their last second of playing time. We don’t know how long Bridges could have played before he got his 4th foul, since he doesn’t play forever, and this is the next best substitute we have (the foul indicators here still reflect the fact that Bridges never actually got his 4th foul).

```
## # A tibble: 6 × 22
## game_id player_name pos foul_1 foul_2 foul_3 foul_4 foul_5 foul_6
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 22000015 Brandon Goodwin G 1 40 625 696 NA NA
## 2 22000027 Patty Mills G 1 1260 1747 NA NA NA
## 3 22000773 Gabe Vincent G 1 55 NA NA NA NA
## 4 42000101 Daniel Gafford F-C 1 251 581 880 1187 NA
## 5 42000105 Garrison Mathews G 1 23 80 136 NA NA
## 6 22000230 Josh Jackson G-F 2 859 907 1540 1665 1843
## # … with 13 more variables: foul_1_indicator <dbl>, foul_2_indicator <dbl>,
## # foul_3_indicator <dbl>, foul_4_indicator <dbl>, foul_5_indicator <dbl>,
## # foul_6_indicator <dbl>, 0-1 <dbl>, 1-2 <dbl>, 2-3 <dbl>, 3-4 <dbl>,
## # 4-5 <dbl>, 5-6 <dbl>, end_game_pt <dbl>
```
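The imputation rule can be sketched in a few lines. This is only an illustration of the rule as described, with a hypothetical helper name (`impute_next_foul`); the RMarkdown file may implement it differently.

```r
# Hypothetical helper: fill the first foul a player did NOT get with
# their total playing time, leaving later fouls as NA.
impute_next_foul <- function(foul_times, end_game_pt) {
  first_missing <- which(is.na(foul_times))[1]
  # If the player got all six fouls there is nothing to impute.
  if (!is.na(first_missing)) {
    foul_times[first_missing] <- end_game_pt
  }
  foul_times
}

# Patty Mills: two real fouls, and the imputed table above shows 1747
# as his imputed third foul.
mills_imputed <- impute_next_foul(c(1, 1260, NA, NA, NA, NA), 1747)
mills_imputed
```

This reproduces the Patty Mills row of the imputed table: fouls at 1 and 1260 seconds, an imputed third foul at 1747 seconds, and NA afterwards.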

We can now begin to analyze the data.

I want to start by calculating some summary statistics.

Summary statistics for unimputed fouls by foul_change and position. The mean and standard deviation of the time between fouls, in seconds, are calculated for each foul_change and position. I also calculate the percentage of players who foul out from each row.

pos | foul_change | n | mean | sd |
---|---|---|---|---|
C | 0-1 | 1713 | 531.858 | 436.771 |
C-F | 0-1 | 1009 | 444.625 | 374.800 |
F | 0-1 | 5331 | 545.659 | 451.061 |
F-C | 0-1 | 1587 | 444.708 | 374.998 |
F-G | 0-1 | 1002 | 596.766 | 465.092 |
G | 0-1 | 6834 | 585.550 | 465.507 |
G-F | 0-1 | 2266 | 535.584 | 429.542 |
C | 1-2 | 1264 | 412.883 | 356.490 |
C-F | 1-2 | 760 | 373.736 | 329.990 |
F | 1-2 | 3595 | 444.404 | 377.165 |
F-C | 1-2 | 1127 | 395.578 | 346.549 |
F-G | 1-2 | 707 | 478.402 | 386.147 |
G | 1-2 | 4375 | 466.188 | 388.280 |
G-F | 1-2 | 1530 | 451.837 | 377.778 |
C | 2-3 | 774 | 320.578 | 285.904 |
C-F | 2-3 | 510 | 316.124 | 271.711 |
F | 2-3 | 2070 | 382.985 | 323.407 |
F-C | 2-3 | 713 | 334.539 | 299.953 |
F-G | 2-3 | 375 | 424.443 | 357.980 |
G | 2-3 | 2293 | 396.813 | 345.891 |
G-F | 2-3 | 892 | 348.876 | 300.638 |
C | 3-4 | 375 | 267.480 | 251.704 |
C-F | 3-4 | 263 | 306.084 | 270.198 |
F | 3-4 | 955 | 307.426 | 257.361 |
F-C | 3-4 | 386 | 280.122 | 244.157 |
F-G | 3-4 | 159 | 362.440 | 277.916 |
G | 3-4 | 953 | 314.781 | 262.346 |
G-F | 3-4 | 395 | 321.122 | 268.611 |
C | 4-5 | 133 | 189.564 | 172.310 |
C-F | 4-5 | 103 | 213.019 | 169.047 |
F | 4-5 | 323 | 217.672 | 181.181 |
F-C | 4-5 | 140 | 213.450 | 174.020 |
F-G | 4-5 | 52 | 170.712 | 161.144 |
G | 4-5 | 290 | 224.797 | 205.542 |
G-F | 4-5 | 144 | 243.542 | 213.837 |
C | 5-6 | 27 | 139.296 | 103.541 |
C-F | 5-6 | 28 | 137.250 | 103.863 |
F | 5-6 | 68 | 148.044 | 124.763 |
F-C | 5-6 | 26 | 174.962 | 138.536 |
F-G | 5-6 | 13 | 208.615 | 164.469 |
G | 5-6 | 39 | 142.795 | 119.264 |
G-F | 5-6 | 27 | 171.778 | 139.180 |
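A grouped summary of this shape can be produced with dplyr. The sketch below runs on a small made-up data frame rather than the real foul data; the column names `pos`, `foul_change`, and `foul_change_time` match those used in the regressions later in this report, while `toy_fouls` and `foul_summary` are names invented for the example.

```r
library(dplyr)

# Toy stand-in for the long-format foul table
# (one row per player, game, and foul).
toy_fouls <- data.frame(
  pos = c("C", "C", "C", "G", "G", "G"),
  foul_change = c("0-1", "0-1", "1-2", "0-1", "0-1", "1-2"),
  foul_change_time = c(400, 600, 300, 500, 700, 450)
)

# Count, mean, and standard deviation of time between fouls
# for each position and foul transition.
foul_summary <- toy_fouls %>%
  group_by(pos, foul_change) %>%
  summarise(
    n = n(),
    mean = mean(foul_change_time),
    sd = sd(foul_change_time),
    .groups = "drop"
  )

foul_summary
```

Dropping `pos` or `foul_change` from `group_by()` would give the coarser one-variable summaries.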

I then summarize the foul statistics by only foul_change.

foul_change | n | mean | sd |
---|---|---|---|
0-1 | 19744 | 546.397 | 446.244 |
1-2 | 13359 | 443.083 | 375.482 |
2-3 | 7627 | 369.859 | 322.200 |
3-4 | 3486 | 306.076 | 260.560 |
4-5 | 1185 | 216.441 | 188.388 |
5-6 | 228 | 154.118 | 124.900 |

Lastly I summarize foul statistics by only position.

pos | n | mean | sd |
---|---|---|---|
C | 4286 | 422.390 | 381.981 |
C-F | 2673 | 374.176 | 333.033 |
F | 12342 | 459.673 | 401.254 |
F-C | 3979 | 385.185 | 342.656 |
F-G | 2308 | 504.581 | 419.060 |
G | 14784 | 495.256 | 421.898 |
G-F | 5254 | 453.501 | 388.432 |

I then run an F-test of foul_change_time on position to find whether or not the mean foul_change_time varies depending on position.

```
pos.lm <- foul_pivot_imputed %>% lm(foul_change_time ~ as.factor(pos), data = .)
summary(pos.lm)
```

```
##
## Call:
## lm(formula = foul_change_time ~ as.factor(pos), data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -548.5 -327.5 -124.7 209.5 2429.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 439.915 5.680 77.443 < 0.0000000000000002 ***
## as.factor(pos)C-F -50.736 9.230 -5.497 0.000000038822302 ***
## as.factor(pos)F 46.818 6.555 7.142 0.000000000000927 ***
## as.factor(pos)F-C -37.959 8.162 -4.651 0.000003312721135 ***
## as.factor(pos)F-G 109.540 9.457 11.583 < 0.0000000000000002 ***
## as.factor(pos)G 92.616 6.383 14.511 < 0.0000000000000002 ***
## as.factor(pos)G-F 47.057 7.569 6.217 0.000000000508183 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 435.5 on 65987 degrees of freedom
## Multiple R-squared: 0.01156, Adjusted R-squared: 0.01147
## F-statistic: 128.6 on 6 and 65987 DF, p-value: < 0.00000000000000022
```

Given the large F-statistic (128.6) and its very small p-value, we can conclude that position does significantly impact the rate at which players get fouls, although position alone explains only about 1% of the variance (R-squared of 0.012). This should make sense. Big men often have a very different role in defenses, and given their job of defending the rim, where more fouls occur, it is reasonable that centers get called for more fouls, and therefore get fouls in quicker succession (they theoretically should play as many minutes per game as any other position).

I then run a similar linear regression to test whether the rate at which a player gets fouls depends on which foul they are receiving.

```
fc.lm <- foul_pivot_imputed %>% lm(foul_change_time ~ as.factor(foul_change), data = .)
summary(fc.lm)
```

```
##
## Call:
## lm(formula = foul_change_time ~ as.factor(foul_change), data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -608.1 -309.1 -103.1 209.9 2352.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 609.106 2.694 226.12 <0.0000000000000002 ***
## as.factor(foul_change)1-2 -116.948 4.090 -28.60 <0.0000000000000002 ***
## as.factor(foul_change)2-3 -200.793 4.659 -43.10 <0.0000000000000002 ***
## as.factor(foul_change)3-4 -283.370 5.852 -48.42 <0.0000000000000002 ***
## as.factor(foul_change)4-5 -374.722 8.511 -44.03 <0.0000000000000002 ***
## as.factor(foul_change)5-6 -441.033 15.460 -28.53 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 422.7 on 65988 degrees of freedom
## Multiple R-squared: 0.06868, Adjusted R-squared: 0.06861
## F-statistic: 973.2 on 5 and 65988 DF, p-value: < 0.00000000000000022
```

Given the very large F-statistic of this linear model (973.2), we can also conclude that the mean foul times for each of these fouls are significantly different.

Lastly, I run a linear regression of the time between fouls on the game time at which the foul was called, to see whether when in the game you play has a significant impact on how often fouls are called.

```
# Attach the game time of each foul to the unimputed long-format foul table.
foul_w_time_pivot2 <- foul_w_time_pivot %>%
  select(game_id, player_name, foul_type, foul_game_time)
foul_pivot_unimputed2 <- foul_pivot_unimputed %>%
  left_join(foul_w_time_pivot2, by = c("game_id", "player_name", "foul_type"))
# Indicator columns for position and foul number (not used in the regression below).
foul_pivot_unimputed3 <- foul_pivot_unimputed2 %>%
  mutate(
    is_GF = case_when(pos == "G-F" ~ 1, TRUE ~ 0),
    is_F = case_when(pos == "F" ~ 1, TRUE ~ 0),
    is_FC = case_when(pos == "F-C" ~ 1, TRUE ~ 0),
    is_C = case_when(pos == "C" ~ 1, TRUE ~ 0)
  ) %>%
  mutate(
    is_f1 = ifelse(foul_type == "foul_1", 1, 0),
    is_f2 = ifelse(foul_type == "foul_2", 1, 0),
    is_f3 = ifelse(foul_type == "foul_3", 1, 0),
    is_f4 = ifelse(foul_type == "foul_4", 1, 0)
  )
# Regress the time between fouls on the game time at which the foul was called.
ft_reg <- foul_pivot_unimputed2 %>% lm(foul_change_time ~ foul_game_time, data = .)
summary(ft_reg)
```

```
##
## Call:
## lm(formula = foul_change_time ~ foul_game_time, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -870.6 -262.4 -67.5 194.8 2147.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 220.980674 3.747077 58.97 <0.0000000000000002 ***
## foul_game_time 0.150930 0.002115 71.36 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 378.5 on 46199 degrees of freedom
## Multiple R-squared: 0.09927, Adjusted R-squared: 0.09925
## F-statistic: 5092 on 1 and 46199 DF, p-value: < 0.00000000000000022
```

Like the previous regressions, given the very low p-value, we can conclude that game time does significantly impact how long it takes for players to get fouls, with times closer to the end of games being associated with more fouls. This doesn’t immediately make sense. There isn’t any inherent reason why the end of games would have more fouls than any other period, but it can likely be explained by refs scrutinizing the end of games more closely and by intentional fouls.

Box plot of distribution of foul times.

Similar box plot of foul_time by foul_change.

This confirms what should have been evident already about the rate at which players get fouls. The first 4 fouls a player gets are rarely strategic and often come during the beginning and middle of a game, so the time it takes to get a given foul is relatively random. The 5th and 6th fouls, however, are more often strategic, typically coming as intentional fouls at the end of a game, when players are more likely to have accumulated fouls. This is a likely reason the mean time between the 5th and 6th fouls is much smaller than the mean time to the 1st foul.

Lastly, I look at the distribution of when in a player’s playing time they receive each foul.

It is clear that fouls are not normally distributed, since they come much more often at the end of the game. However, for reasons elaborated on later, I am going to make a simplifying assumption that they are, with the means and standard deviations calculated earlier.

The final step is using these distributions to estimate the probability a player will foul out given an initial starting condition, particularly their position and seconds remaining in a game.

For this analysis, I am going to make four important assumptions in my model, informed by the regressions I ran earlier in the paper.

1. A player’s position *does* significantly impact the rate at which players get fouls.
2. Which foul a player is “waiting on” *does* significantly impact the mean and standard deviation for foul rate.
3. The time in the game you play *does not* significantly impact the rate at which you earn fouls.
4. The rate at which you get fouls is **normally distributed**.

It is important to note that the 3rd assumption is the opposite of the conclusion the earlier regression showed. I justify this assumption because the primary factor in fouls being called at faster rates later in the game is intentional fouls, which, by definition, are a choice. If you had a player in foul trouble, you could simply choose to not have them intentionally foul, keeping them earning fouls at a typical rate.

The 4th assumption is also not perfectly supported by the data, but given the possibility that games which would end and truncate the distribution have a chance of going into overtimes, as well as intentional fouls distorting the data, I am acting as if all fouls are normally distributed. Future analysis would likely remove this assumption.

I am going to be using the imputed data frame for this model in the hopes of having a result that better reflects the time players play without fouling.

The foul-out probability will be calculated with a probabilistic model, with the summary statistics used depending on the player's position.

This model simply treats the time it takes to get all of a player's remaining fouls as a new normal distribution, measured from an initial starting point. The probability that this time is less than the time the player has left to play is then the probability that the player fouls out. Iterating over the additional minutes a player plays from a given starting point gives a graph of the probability a player fouls out versus the additional time they play.

It’s important to note this function does not return the probability you foul out in the game, since that is completely dependent on how much longer the coach chooses to play their player, but the probability they foul out given ‘X’ number of additional minutes from this point. Obviously, as X increases and the player plays more time, the probability they foul out increases and approaches 1. In addition, you can input the seconds remaining in a game to see the probability you foul out if you play the rest of the game.

`foul_out_model("Deandre Ayton", "C", 4, 1345, T)`
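To make the calculation concrete, here is a minimal sketch of the idea under stated assumptions. It is not the foul_out_model function from the RMarkdown file: the name `foul_out_prob_sketch` and its arguments are hypothetical, and it assumes the gaps between fouls are independent normal variables, so that their sum is normal with mean equal to the sum of the means and variance equal to the sum of the variances.

```r
# Hypothetical sketch, NOT the author's foul_out_model().
# means, sds: mean and sd (seconds of playing time) of each gap the player
#             still has to "use up" before fouling out, e.g. "4-5" and "5-6".
# extra_seconds: additional playing time from the current point.
foul_out_prob_sketch <- function(means, sds, extra_seconds) {
  # Assumption: independent normal gaps, so the total time to foul out is
  # normal with mean = sum(means) and variance = sum(sds^2).
  total_mean <- sum(means)
  total_sd <- sqrt(sum(sds^2))
  pnorm(extra_seconds, mean = total_mean, sd = total_sd)
}

# Illustrative inputs only: the unimputed "4-5" and "5-6" rows for centers
# from the summary table earlier. The model described in the text uses the
# imputed data, so its numbers would differ.
c_means <- c(189.564, 139.296)
c_sds <- c(172.310, 103.541)

# Probability of fouling out after each additional minute, up to 1345 seconds.
extra <- seq(60, 1345, by = 60)
probs <- foul_out_prob_sketch(c_means, c_sds, extra)
plot(extra / 60, probs, type = "l",
     xlab = "Additional minutes played", ylab = "P(foul out)")
```

Because the unimputed gaps only include fouls that actually happened, they understate how long players typically go without fouling, which is one reason the imputed data frame is the better input for the real model.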