Introduction

I created a multilevel logistic regression model with the goal of predicting upsets in the National Football League (NFL). I defined an upset as the underdog according to Vegas' closing spread beating the favorite. The goal of my research was to use game facts, such as the stadium surface, game type (playoff or regular season), time of game, whether the game was a divisional matchup, and other binary predictor variables to predict whether a specific game would be an upset. The purpose of this research is to help correctly predict future upsets in the NFL by building a model that flags games with high upset potential. Such a model would benefit NFL analysts as well as fans or bettors who wish to learn which games are likely to result in an upset. My research question is: which variables (facts about the game) are the best predictors of whether an NFL game will result in an upset, and are the effects of these variables different when the underdog is the home team compared to when the underdog is the away team?

My data was compiled using play-by-play and schedule data from the nflfastR package. I pulled NFL schedule and play-by-play data dating back to 2000, the first season for which the package has complete data. After cleaning these datasets to create important predictor variables, I joined them together to form my final dataset. I chose not to use in-game statistics, such as turnovers or yards, as these are not available before the game is played and are therefore not true "predictors" of an upset. The only quantitative predictor in my model is spread, which represents the closing number of points the underdog was expected to lose by. I also chose not to use cumulative season-level data or previous-season data, as it is too difficult to create cumulative statistics for teams at different points in the season. Also, teams change substantially from year to year in personnel and coaching, so it does not make much sense to assume correlation for a team across multiple seasons. Instead of accounting for individual teams, I grouped home and away underdogs into cold weather and dome teams because of the similar conditions these types of teams face during the winter. I decided to ignore football-statistic data altogether and focus on which game attributes best predict an upset, in addition to how these predictors change depending on the type of underdog.

Data

As mentioned in the introduction, my data was obtained by merging two datasets from the nflfastR package. First, I used the fast_scraper_schedules function to scrape NFL schedule data containing every game dating back to the 2000 season. This raw data had facts about each game such as the season, week, kickoff time, day of week, and facts about the stadium. I turned it into usable predictor variables, such as late_season, an indicator encoded as 1 if the game took place after week 13 and 0 otherwise. I used week 13 as the cutoff because it usually marks the start of December, about when cold weather starts becoming a factor. Additionally, I used the kickoff time to create the indicators early_game, encoded as 1 if the game started before 3:00 pm EST and 0 otherwise, and late_game, encoded as 1 if the game started after 7:00 pm EST and 0 otherwise. Next, I used the game type to encode a new variable playoff, which is 1 for a playoff game and 0 for a regular season game, and a variable sunday_game, which is 1 if the game took place on a Sunday and 0 otherwise. Lastly, I turned the playing surface and stadium type into the indicators grass, encoded as 1 if the game was played on grass and 0 otherwise, and outdoors, encoded as 1 if the game was played outdoors and 0 if it was played indoors.
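The encoding rules above can be sketched as follows. This is a Python illustration of logic I actually implemented in R with nflfastR; the raw field names below (gametime, game_type, weekday, surface, roof) are assumptions for the sketch.

```python
# Python sketch of the schedule-data indicator encoding; the field names
# passed in are assumed, not the exact names used in my R cleaning code.

def encode_schedule_flags(game):
    """Turn raw schedule fields for one game into 0/1 indicator variables."""
    hour = int(game["gametime"].split(":")[0])  # kickoff hour (EST, 24-hour clock)
    return {
        "late_season": 1 if game["week"] > 13 else 0,     # December onward
        "early_game": 1 if hour < 15 else 0,              # before 3:00 pm EST
        "late_game": 1 if hour >= 19 else 0,              # 7:00 pm EST or later
        "playoff": 0 if game["game_type"] == "REG" else 1,
        "sunday_game": 1 if game["weekday"] == "Sunday" else 0,
        "grass": 1 if game["surface"] == "grass" else 0,
        "outdoors": 1 if game["roof"] == "outdoors" else 0,
    }

# Example: a Week 15 regular-season Sunday 1:00 pm game on outdoor grass.
flags = encode_schedule_flags({
    "week": 15, "gametime": "13:00", "game_type": "REG",
    "weekday": "Sunday", "surface": "grass", "roof": "outdoors",
})
```

This example game would be coded as late_season = 1, early_game = 1, and all of late_game and playoff equal to 0.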

I then scraped every play of NFL game data dating back to the 2000 season using the load_pbp function and filtered each game down to its final play. Each row then contained the final score as well as other important information, such as div_game, indicating whether the game was a divisional matchup, and spread_line, a numeric variable giving the number of points the home team was favored by (negative when the away team was favored). I used spread_line to create binary variables for whether the underdog was the home team or the away team. Then, I combined this indicator with the final score (originally home_score and away_score) to encode the response variable upset, which is "Yes" if the game resulted in an upset and "No" otherwise. I also included spread as a potential quantitative predictor, defined as the absolute value of spread_line (how many points the underdog is expected to lose by).
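The upset encoding can be sketched as a small function. Again this is a Python illustration of the R logic, using the sign convention above: spread_line is positive when the home team is favored and negative when the away team is favored.

```python
# Python sketch of the upset encoding described above.

def encode_upset(spread_line, home_score, away_score):
    """Return (home_underdog, spread, upset) for one completed game,
    or None for pick'em games (spread_line == 0), which were dropped."""
    if spread_line == 0:
        return None
    home_underdog = 1 if spread_line < 0 else 0
    spread = abs(spread_line)  # points the underdog is expected to lose by
    if home_underdog:
        underdog_won = home_score > away_score
    else:
        underdog_won = away_score > home_score
    return home_underdog, spread, "Yes" if underdog_won else "No"
```

For example, a game with spread_line of -3 where the home team wins 24-20 is a home-underdog upset, while a game with spread_line of 7 where the home team wins 27-17 is not an upset.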

Both the play-by-play dataset and schedule dataset shared a game_id key, so I joined the two datasets on game_id and created new variables based on whether the underdog or their opponent was a cold weather team or a dome team. I defined cold weather teams as NFL teams whose average December temperature is below 40 degrees Fahrenheit and who play their home games outside. I also hard coded dome teams by season, based on which teams played in a dome in which seasons. This handled cases like the Minnesota Vikings, who played in a dome from 2000 through 2013 and again from 2016 to the present, but played outdoors in cold Minnesota during the 2014 and 2015 seasons. Then, I grouped all underdogs into categories depending on whether they were home or away and whether they were a cold weather team or a dome team. In the end, I had 6 "types" of underdogs: away underdogs who normally play in the cold, away underdogs who normally play in a dome, away underdogs who play in neither the cold nor a dome, and the three corresponding types of home underdogs. Because my analysis treats seasons and games as independent, grouping teams into these categories preserves the multilevel structure that a separate model for each team would have, without the issues caused by team turnover each year. After finishing my data cleaning, the final product was a dataset with 5,827 rows and 13 variables (one response variable and twelve predictor variables).
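The six underdog types can be sketched as a simple classification. This Python function illustrates the grouping; the cold/dome inputs are the hand-coded, season-level labels described above, and by construction a team cannot be both (cold weather teams play outside).

```python
# Python sketch of the six underdog "types" built after the join.

def underdog_type(home_underdog, dog_cold_team, dog_dome_team):
    """Label one underdog by location and home-stadium climate."""
    location = "Home" if home_underdog else "Away"
    if dog_dome_team:
        climate = "Dome"
    elif dog_cold_team:
        climate = "Cold"
    else:
        climate = "Normal"
    return f"{location} {climate}"
```

Under this scheme, the 2014 and 2015 Vikings as a home underdog would be labeled "Home Cold" rather than "Home Dome" for those two seasons.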

Exploratory Data Analysis

type_dog      upsets   games    prop
Away Cold        346    1042   0.332
Away Dome        258     821   0.314
Away Normal      661    1987   0.333
Home Cold        186     502   0.371
Home Dome        144     402   0.358
Home Normal      370    1073   0.345

As we can see from the table above, for each type of team (cold, dome, normal) there are far more games where the away team is the underdog than games where the home team is the underdog. This is mostly due to the roughly three-point home-field advantage built into the spread, discussed further in the Methodology section. Additionally, home underdogs win at a higher rate than away underdogs for each type of team. Home underdogs who are cold weather teams have the highest upset rate, which makes sense given what is known about home underdogs and about cold weather teams holding an advantage in the winter. While these summary statistics alone do not establish a multilevel structure, the difference in the proportion of games won by away versus home underdogs, together with the perceived differences in home and away predictor effects, suggests that underdog location might warrant a second level in the model.
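The proportions in the table above are simply upsets divided by games for each underdog type; recomputing them directly confirms the table:

```python
# Recompute the upset proportions from the counts in the table above.
counts = {  # type_dog: (upsets, games)
    "Away Cold": (346, 1042), "Away Dome": (258, 821), "Away Normal": (661, 1987),
    "Home Cold": (186, 502), "Home Dome": (144, 402), "Home Normal": (370, 1073),
}
props = {k: round(upsets / games, 3) for k, (upsets, games) in counts.items()}
# Home underdogs of every type win more often than their away counterparts,
# and "Home Cold" has the highest upset rate (0.371).
```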

The histogram above shows the distribution of spreads across NFL games, where negative x-values correspond to the away team being favored and positive x-values to the home team being favored. The right side of the histogram is much denser, consistent with the table above showing that home teams are favored more often than away teams. Additionally, the boxplot on the right shows the median spread for underdogs in games lost versus games won: underdogs clearly win more often when the spread is smaller. We can also see that the largest spread any winning underdog has overcome is about 17.5 points. While losing underdogs have spreads approaching 30, every team expected to lose by 18 or more points ended up losing the game.

The above graphs show the probability of an upset in playoff games and divisional games. As we can see, the probability of an upset is higher when the game occurs in the playoffs as opposed to the regular season. However, the probability of an upset is lower when the game occurs between two divisional opponents.

The bar graphs on the top show the probability of an upset for underdogs playing against cold teams, while the graphs on the bottom show the probability for underdogs playing against dome teams. On the top, we can see that underdogs are most likely to win when they are at home after Week 13 and playing a cold team, while the probability of an upset is about the same before Week 13. While the graphs on the bottom do not show us much about opp_dome_team overall, home underdogs playing dome teams after Week 13 perform slightly better than away underdogs, whereas home underdogs playing dome teams before Week 13 actually perform worse than away underdogs. This highlights my proposed interaction between opp_dome_team and late_season: the probability of an upset is higher for home underdogs late in the season but higher for away underdogs early in the season.

As we can see in the bar graphs on the left, which represent games after Week 13, cold underdogs and dome underdogs both perform significantly better at home than on the road. The graphs on the right, which represent games before Week 13, show that cold and dome underdogs still perform slightly better at home, but the home and away probabilities are much closer together than they are after Week 13. This suggests that cold weather and dome underdogs gain a larger home advantage later in the season, as we might expect.

Methodology

The main reason I used a multilevel model is to account for the difference between winning as an underdog on the road and winning as an underdog at home. In the NFL, home field advantage is very important and is thought by many to move the spread by as many as 3 points in the direction of the home team (Panayotovich 2021). Many teams have distinct stadium features, such as the type of grass, whether they play in a dome or outside, and whether the city they play in is cold in the winter. Many past studies have attempted to quantify the effect of a warm-weather team or dome team playing on the road in the cold. While these home-stadium factors matter in general, they become even more important as the weather gets colder later in the season.

For this reason, I wanted to examine how teams who normally play in domes fare late in the season when they are not playing in domes, as well as how teams who play most of their games in warm weather fare in cold conditions. While the play-by-play data had the temperature available in the Notes column for about half of the games, there was too much missing data to use temperature as a predictor variable. Instead, I hard-coded the variables dome_team and cold_team as well as opp_dome_team and opp_cold_team, as mentioned above, and examined the interaction between these variables and late_season. I believe these variables, together with outdoors, depend on one another in ways that vary according to whether the underdog is the away team or the home team.

Modeling

I started by creating my level one model with every possible predictor variable, including the Vegas spread. I also included the interaction terms between late_season and outdoors, late_season and opp_cold_team, and late_season and opp_dome_team. While I also wanted to look at the interactions between late_season and whether the underdogs themselves were cold weather or dome teams, that question is addressed in the second level of the model. The output for this model is listed below:

term estimate std.error statistic p.value
(Intercept) 0.374 0.121 3.096 0.002
div_game1 -0.012 0.060 -0.203 0.839
grass1 0.124 0.067 1.858 0.063
outdoors1 -0.211 0.103 -2.047 0.041
playoffs1 0.246 0.153 1.607 0.108
early_game1 -0.050 0.067 -0.755 0.450
late_game1 -0.101 0.091 -1.118 0.264
late_season1 -0.159 0.178 -0.892 0.373
opp_dome_team1 -0.293 0.104 -2.813 0.005
opp_cold_team1 -0.211 0.077 -2.719 0.007
spread -0.154 0.010 -15.858 0.000
outdoors1:late_season1 0.017 0.175 0.097 0.922
late_season1:opp_cold_team1 0.019 0.147 0.129 0.897
late_season1:opp_dome_team1 0.368 0.190 1.942 0.052
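To make the output concrete, here is a hedged worked example: the predicted upset probability for a hypothetical game of my own choosing (a 6.5-point underdog in a divisional, regular-season, 1:00 pm game on outdoor grass before Week 13, against an opponent that is neither a dome nor a cold weather team), computed from the estimates in the table above.

```python
import math

# Worked example using the coefficient estimates above; the game
# characteristics are hypothetical, chosen only for illustration.
log_odds = (0.374              # intercept
            + (-0.012)         # div_game = 1
            + 0.124            # grass = 1
            + (-0.211)         # outdoors = 1
            + (-0.050)         # early_game = 1
            + (-0.154) * 6.5)  # spread = 6.5; all other indicators are 0
upset_prob = 1 / (1 + math.exp(-log_odds))  # inverse-logit transform
```

This gives a log odds of about -0.776, or roughly a 31.5% chance of an upset, dominated by the spread term.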

I also wanted to fit a model without the Vegas spread using solely game facts without any indication of how good the teams are perceived to be. I used the same interaction terms and predictors as the last model without spread. The output for that model is listed below:

term estimate std.error statistic p.value
(Intercept) -0.446 0.109 -4.100 0.000
div_game1 -0.031 0.058 -0.534 0.593
grass1 0.177 0.065 2.709 0.007
outdoors1 -0.189 0.102 -1.856 0.063
playoffs1 0.305 0.150 2.032 0.042
early_game1 -0.027 0.065 -0.417 0.676
late_game1 -0.065 0.089 -0.729 0.466
late_season1 -0.191 0.176 -1.085 0.278
opp_dome_team1 -0.257 0.103 -2.502 0.012
opp_cold_team1 -0.234 0.076 -3.084 0.002
outdoors1:late_season1 -0.029 0.174 -0.168 0.867
late_season1:opp_cold_team1 -0.010 0.143 -0.068 0.946
late_season1:opp_dome_team1 0.304 0.188 1.618 0.106

I then performed a drop-in deviance test on the two above models to determine which model I should continue with:

Resid. Df  Resid. Dev  df  Deviance  p.value
5814 7409.477 NA NA NA
5813 7119.246 1 290.231 0

The p-value for the drop-in-deviance test is essentially 0, so we reject the null hypothesis that the coefficient of spread is 0 and conclude that spread is a significant predictor. While I knew the spread model would be more informative, since knowing how many points the underdog is expected to lose by almost certainly provides more information about the odds of an upset, I originally wanted my model to contain solely game facts. However, the spread model is clearly better based on the deviance and p-value above, and the signs of almost all the coefficients agree between the two models, so we would likely reach similar interpretations of the predictor variables either way. Therefore, I will continue with the spread model.
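The drop-in-deviance test above can be verified by hand: the test statistic is the difference in residual deviances, compared to a chi-squared distribution with 1 degree of freedom.

```python
import math

# Verify the drop-in-deviance test using only the standard library.
stat = 7409.477 - 7119.246  # difference in residual deviances = 290.231
# For df = 1, the chi-squared survival function is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2))  # effectively 0
```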

My original model started with 10 predictors as well as 3 interaction terms, so I performed backward selection on my spread model to identify which predictor variables to keep. I used AIC as the selection criterion, which penalizes model complexity and thus favors simpler models. The resulting level one model after backward selection is output below:

term estimate std.error statistic p.value
(Intercept) 0.318 0.097 3.280 0.001
grass1 0.129 0.067 1.941 0.052
outdoors1 -0.207 0.089 -2.338 0.019
playoffs1 0.269 0.149 1.798 0.072
late_season1 -0.137 0.077 -1.774 0.076
opp_dome_team1 -0.293 0.097 -3.002 0.003
opp_cold_team1 -0.209 0.066 -3.173 0.002
spread -0.154 0.010 -15.844 0.000
late_season1:opp_dome_team1 0.352 0.152 2.314 0.021

I then began testing multilevel models. I started with the variables from my backward selection model and added a random intercept for type_dog as the level two grouping variable.

effect group term estimate std.error statistic p.value
fixed NA (Intercept) 0.318 0.097 3.280 0.001
fixed NA grass1 0.129 0.067 1.941 0.052
fixed NA outdoors1 -0.207 0.089 -2.338 0.019
fixed NA playoffs1 0.269 0.149 1.798 0.072
fixed NA late_season1 -0.137 0.077 -1.774 0.076
fixed NA opp_dome_team1 -0.293 0.097 -3.002 0.003
fixed NA opp_cold_team1 -0.209 0.066 -3.173 0.002
fixed NA spread -0.154 0.010 -15.844 0.000
fixed NA late_season1:opp_dome_team1 0.352 0.152 2.314 0.021
ran_pars type_dog sd__(Intercept) 0.000 NA NA NA

While the fixed effects look the same as in the model before, the estimated standard deviation of the random intercept is 0, meaning the model finds no between-group variability to explain. To examine this issue further, I created different models with different fixed and random effects.

effect group term estimate std.error statistic p.value
fixed NA (Intercept) -0.621 0.037 -16.869 0.000
fixed NA late_season1 -0.177 0.071 -2.498 0.012
fixed NA opp_dome_team1 -0.114 0.079 -1.441 0.150
fixed NA late_season1:opp_dome_team1 0.320 0.148 2.165 0.030
ran_pars type_dog sd__(Intercept) 0.000 NA NA NA

I fit another model that removed many of the less important variables from the previous model, and the same issue arose. I then began experimenting with the random effects.

effect group term estimate std.error statistic p.value
fixed NA (Intercept) -0.521 0.046 -11.428 0.000
fixed NA late_season1 -0.168 0.071 -2.377 0.017
fixed NA opp_dome_team1 -0.212 0.094 -2.257 0.024
fixed NA opp_cold_team1 -0.262 0.064 -4.105 0.000
fixed NA late_season1:opp_dome_team1 0.319 0.148 2.151 0.031
ran_pars type_dog sd__(Intercept) 0.026 NA NA NA
ran_pars type_dog cor__(Intercept).opp_dome_team1 -1.000 NA NA NA
ran_pars type_dog sd__opp_dome_team1 0.098 NA NA NA

When I add a random slope for the opp_dome_team effect alongside the random intercept, the variance components are no longer estimated at 0 and the model is usable. I will now return to some of the previous model specifications and compare them to this one.

effect group term estimate std.error statistic p.value
fixed NA (Intercept) 0.287 0.103 2.784 0.005
fixed NA grass1 0.134 0.067 2.009 0.045
fixed NA outdoors1 -0.171 0.098 -1.745 0.081
fixed NA playoffs1 0.263 0.150 1.758 0.079
fixed NA late_season1 -0.135 0.077 -1.750 0.080
fixed NA opp_dome_team1 -0.284 0.113 -2.519 0.012
fixed NA opp_cold_team1 -0.209 0.066 -3.169 0.002
fixed NA spread -0.155 0.010 -15.830 0.000
fixed NA late_season1:opp_dome_team1 0.364 0.153 2.387 0.017
ran_pars type_dog sd__(Intercept) 0.012 NA NA NA
ran_pars type_dog cor__(Intercept).opp_dome_team1 1.000 NA NA NA
ran_pars type_dog sd__opp_dome_team1 0.122 NA NA NA
term npar AIC BIC logLik deviance statistic df p.value
l2_model4 8 7439.925 7493.287 -3711.963 7423.925 NA NA NA
l2_model_5 12 7143.195 7223.238 -3559.597 7119.195 304.73 4 0
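The comparison table above can be checked by hand: for each model, AIC equals the deviance plus twice the number of parameters, and the chi-squared statistic is the drop in deviance on degrees of freedom equal to the difference in parameter counts.

```python
# Verify the model-comparison table: AIC = deviance + 2 * npar, and the
# test statistic is the deviance drop on (12 - 8) = 4 degrees of freedom.
aic_small = 7423.925 + 2 * 8    # l2_model4  -> 7439.925
aic_large = 7119.195 + 2 * 12   # l2_model_5 -> 7143.195
lrt_stat = 7423.925 - 7119.195  # = 304.73
```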

The model with all of the terms from backward selection in addition to the opp_dome_team random effect performs better than the smaller model in the chi-squared test above. While the explained intraclass variability goes down when there are more level one predictors, the resulting model is a better overall predictor of an upset. Therefore, I will proceed with this model as my final model. The level one model for game \(j\) played by underdog type \(i\) (there are 5,827 games across the six types) is: \(Y_{ij} = a_{ij} + b_{ij} \times\) grass + \(c_{ij} \times\) outdoors + \(d_{ij} \times\) playoffs + \(e_{ij} \times\) late_season + \(f_{ij} \times\) opp_dome_team + \(g_{ij} \times\) opp_cold_team + \(h_{ij} \times\) spread + \(k_{ij} \times\) (late_season \(\times\) opp_dome_team) + \(\epsilon_{ij}\). There are then nine level two models: one for the intercept, and one for each of the eight level one coefficients. The intercept and the opp_dome_team coefficient also receive random effects, as mentioned above.

\(a_{ij} = \alpha_{01} + u_{i1}\)

\(b_{ij} = \alpha_{02}\)

\(c_{ij} = \alpha_{03}\)

\(d_{ij} = \alpha_{04}\)

\(e_{ij} = \alpha_{05}\)

\(f_{ij} = \alpha_{06} + u_{i6}\)

\(g_{ij} = \alpha_{07}\)

\(h_{ij} = \alpha_{08}\)

\(k_{ij} = \alpha_{09}\)

The final composite model is therefore:

\(Y_{ij} = (\alpha_{01} + u_{i1}) + \alpha_{02} \times\) grass + \(\alpha_{03} \times\) outdoors + \(\alpha_{04} \times\) playoffs + \(\alpha_{05} \times\) late_season + \((\alpha_{06} + u_{i6}) \times\) opp_dome_team + \(\alpha_{07} \times\) opp_cold_team + \(\alpha_{08} \times\) spread + \(\alpha_{09} \times\) late_season \(\times\) opp_dome_team + \(\epsilon_{ij}\).

Grouping the fixed terms and error terms together, we get our final composite model as:

\(Y_{ij} = [\alpha_{01} + \alpha_{02}\) grass + \(\alpha_{03}\) outdoors + \(\alpha_{04}\) playoffs + \(\alpha_{05}\) late_season + \(\alpha_{06}\) opp_dome_team + \(\alpha_{07}\) opp_cold_team + \(\alpha_{08}\) spread + \(\alpha_{09}\) (late_season \(\times\) opp_dome_team)] + [\(\epsilon_{ij} + u_{i1} + u_{i6}\) opp_dome_team].

Results

effect group term estimate std.error statistic p.value
fixed NA (Intercept) 0.287 0.103 2.784 0.005
fixed NA grass1 0.134 0.067 2.009 0.045
fixed NA outdoors1 -0.171 0.098 -1.745 0.081
fixed NA playoffs1 0.263 0.150 1.758 0.079
fixed NA late_season1 -0.135 0.077 -1.750 0.080
fixed NA opp_dome_team1 -0.284 0.113 -2.519 0.012
fixed NA opp_cold_team1 -0.209 0.066 -3.169 0.002
fixed NA spread -0.155 0.010 -15.830 0.000
fixed NA late_season1:opp_dome_team1 0.364 0.153 2.387 0.017
ran_pars type_dog sd__(Intercept) 0.012 NA NA NA
ran_pars type_dog cor__(Intercept).opp_dome_team1 1.000 NA NA NA
ran_pars type_dog sd__opp_dome_team1 0.122 NA NA NA

The final model with estimates is shown above, as well as in the Methodology section in statistical notation. Based on the model above, the estimates for the fixed effects are: \(\hat \alpha_{01} = 0.287\), \(\hat \alpha_{02} = 0.134\), \(\hat \alpha_{03} = -0.171\), \(\hat \alpha_{04} = 0.263\), \(\hat \alpha_{05} = -0.135\), \(\hat \alpha_{06} = -0.284\), \(\hat \alpha_{07} = -0.209\), \(\hat \alpha_{08} = -0.155\), \(\hat \alpha_{09} = 0.364\). The estimated standard deviations of the random effects are 0.012 for the random intercept \(u_{i1}\) and 0.122 for the random opp_dome_team slope \(u_{i6}\).

My results were mostly on par with my previous hypotheses about the estimates of the predictor variables. One important takeaway is that the estimated intercept, \(\hat \alpha_{01} = 0.287\), is positive. This is the expected log odds of an upset when all predictor variables equal 0. Because this value is positive while we expect the log odds of an upset to be negative, the intercept likely corresponds to an uncommon point; in particular, games with a spread of 0 were dropped from the data, so no observed game has every predictor equal to 0.
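The intercept can be made concrete with the inverse-logit transform: a positive log odds corresponds to an upset probability above one half.

```python
import math

# Convert the estimated intercept (log odds = 0.287) to a probability.
baseline_prob = 1 / (1 + math.exp(-0.287))  # about 0.571
```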

The estimated coefficient for opp_dome_team is -0.284, suggesting that underdogs generally do worse against teams whose home stadium is a dome. However, the coefficient for the interaction between opp_dome_team and late_season is 0.364. This tells us that late in the season, underdogs fare much better against dome teams. Compared to the odds of beating a team who does not play in a dome, the odds of beating a dome team late in the season are expected to be multiplied by a factor of \(e^{-0.284 + 0.364} = e^{0.08}\), or about 1.083, holding all else constant. However, the estimate of late_season alone (-0.135) lowers the log odds of an upset by more than this change. Therefore, the exact effect of opp_dome_team and late_season will depend on the type of underdog.
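The combined late-season dome effect follows directly from the two coefficients in the final model:

```python
import math

# Combined late-season effect of facing a dome team: the opp_dome_team
# main effect plus the late_season:opp_dome_team interaction.
late_season_dome = -0.284 + 0.364             # = 0.08 on the log odds scale
odds_multiplier = math.exp(late_season_dome)  # about 1.083
```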

Additionally, the estimated standard deviation of the opp_dome_team effect across underdog types (0.122) is an order of magnitude larger than that of the random intercept (0.012), which suggests that the effect of facing a dome team genuinely varies between the types of underdogs and that the multilevel approach is a reasonable modeling choice.

Discussion + Conclusion

One of the main limitations of my analysis was the lack of actual football data in my model. My original idea was to predict the outcome of NFL games using data from the game itself, but this type of modeling does not make sense, as it would use data that contributed to the result being predicted. I also wanted to incorporate a level of the model that accounts for season-to-date team statistics for both the underdog and their opponent. However, I again ran into the issue of potentially using the individual game data in both level one and level two, and it proved infeasible to make the level two statistics cumulative for each week. Therefore, I focused my final model on using binary predictors about the game itself to find the best predictors of an upset for home and away underdogs.

Another important consideration is that "underdog" in this paper is defined as the closing-line Vegas underdog. However, the team encoded as the underdog was not always the betting underdog for the entire week. Many games have spreads that swing between -1 and +1, which means one team can open as the favorite and close as the underdog. In addition to these games with an unclear favorite, some games have a spread of 0, meaning there is no favorite or underdog, and I could not use that data in my model. If I were to repeat this process, I would consider only trying to predict games where the underdog is expected to lose by more than 3 points, because I am already losing the spread-0 games, and many games with spreads near 0 do not have "true" underdogs or upsets.

Another decision involving the spread was whether to use it in my model at all. I began by testing whether the spread should be a predictor variable in the logistic regression model. Because I had the final score as well as the spread, I also considered fitting a linear regression with the numeric response underdog_pred, the number of points the underdog was expected to lose by. However, I decided that predicting whether the game was an upset was more applicable than predicting the final score, which is sometimes misleading. While I continued with a model containing spread, I still wonder how the results would have compared had I built the multilevel model without spread, using only facts about the game itself rather than the teams.

The last issue I had to deal with was classifying teams into groups. I knew one of my main predictors would involve the weather of the game as well as the typical weather for both the underdog and their opponent. While the play-by-play data had weather for about half of the games, using it would have limited my findings to only half of the available data. For that reason, I did my best to create binary predictor variables that correspond to games late in the season, when it should be cold, as well as a variable for teams whose stadiums get abnormally cold in the winter. I believe my model would have performed much better if I had had the temperature for every game as well as more accurate statistics about the teams and stadiums. However, I believe I did my best to tackle this problem given the information I had.

References

Panayotovich, Sam. 2021. “NFL Odds: How Much Is Home-Field Advantage Really Worth on the Spread?” FOX Sports. https://www.foxsports.com/stories/nfl/nfl-odds-how-much-home-field-advantage-worth-spread.