I built a multilevel logistic regression model to predict upsets in the National Football League (NFL). I define an upset as a game in which the underdog, according to the closing Vegas spread, beats the favorite. The model uses facts about the game that are known before kickoff, such as the stadium surface, game type (playoff or regular season), time of game, whether it is a divisional game, and other binary predictor variables, to predict whether a specific game will be an upset. The purpose of this research is to help predict future NFL upsets by identifying games with high upset potential. Such a model would benefit NFL analysts as well as fans and bettors who want to know which games are likely to result in an upset. My research question has two parts: which variables (facts about the game) are the best predictors of whether an NFL game will result in an upset, and do the effects of these variables differ when the underdog is the home team compared to when the underdog is the away team?

I compiled my data from the play-by-play and schedule data in the nflfastR package. I pulled the NFL schedule and play-by-play data dating back to 2000, which is the first year for which the package has full data. After cleaning these datasets to create the predictor variables, I joined them to produce my final dataset. I chose not to use in-game statistics, such as turnovers or yards, because they are not available before the game is played and therefore cannot serve as “predictors” of an upset. The only quantitative predictor in my model is spread, which represents the number of points the underdog was expected to lose by at the closing line. I also chose not to use cumulative season-level data or data from previous seasons, because it is too difficult to construct cumulative statistics for teams at different points in the season. In addition, personnel and coaching change a great deal from year to year, so it does not make much sense to assume that a team's performance is correlated across seasons. Instead of accounting for each team individually, I grouped the home and away underdogs into cold weather teams and dome teams because of the similar conditions these types of teams face during the winter. I decided to ignore football-statistic data altogether and focus on which game attributes best predict an upset and how the effects of these predictors change depending on the type of underdog.


As mentioned in the introduction, I obtained my data from the nflfastR package by merging two datasets. First, I used the fast_scraper_schedules function to scrape NFL schedule data containing every game dating back to the 2000 season. The raw data included facts about each game such as the season, week, kickoff time, day of the week, and details about the stadium. I turned this raw data into usable predictor variables, such as late_season, an indicator variable encoded as 1 if the game took place after week 13 and 0 if it took place before week 13. I used week 13 as the cutoff because it usually falls at the start of December, which is about when cold weather starts to become a factor. Additionally, I used the kickoff time to create two indicator variables: early_game, encoded as 1 if the game started before 3:00 pm EST and 0 otherwise, and late_game, encoded as 1 if the game started after 7:00 pm EST and 0 otherwise. Next, I used the game type to create the variable playoff, encoded as 1 for a playoff game and 0 for a regular season game. I then created the variable sunday_game, encoded as 1 if the game took place on a Sunday and 0 if it took place on any other day of the week. Lastly, I turned the playing surface and whether the game was played outside into indicator variables: grass, encoded as 1 if the game was played on grass and 0 otherwise, and outdoors, encoded as 1 if the game was played outdoors and 0 if it was played indoors.
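The encoding rules above can be sketched in a few lines. The original analysis was done in R with nflfastR; the Python below is only an illustration, and the field names (week, gametime, game_type, weekday, surface, roof) and their value spellings are assumptions about how one schedule record might look. Treating week 13 itself as 0 is also an assumption, since the text only specifies games before and after week 13.

```python
from datetime import time

def schedule_indicators(game):
    """Turn one schedule record (a dict) into the 0/1 predictors described above.

    Field names and value spellings are assumed for illustration only.
    """
    hour, minute = map(int, game["gametime"].split(":"))
    kickoff = time(hour, minute)  # kickoff assumed to be in Eastern time
    return {
        # 1 after week 13; week 13 itself is treated as 0 (assumption)
        "late_season": int(game["week"] > 13),
        "early_game": int(kickoff < time(15, 0)),   # before 3:00 pm
        "late_game": int(kickoff > time(19, 0)),    # after 7:00 pm
        "playoff": int(game["game_type"] != "REG"),
        "sunday_game": int(game["weekday"] == "Sunday"),
        "grass": int(game["surface"] == "grass"),
        "outdoors": int(game["roof"] == "outdoors"),
    }

# A made-up record: a week 15 Monday night regular season game in a dome.
example = {
    "week": 15, "gametime": "20:20", "game_type": "REG",
    "weekday": "Monday", "surface": "fieldturf", "roof": "dome",
}
features = schedule_indicators(example)
```

For this made-up record, late_season and late_game come out as 1 and every other indicator as 0.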

I then scraped every play of NFL game data dating back to the 2000 season using the load_pbp function and kept only the final play of each game. Each remaining row contained the final score as well as other important information, such as div_game, which indicates whether the game was a divisional game, and spread_line, a numeric variable giving the number of points by which the home team was favored. I used spread_line to create new binary variables indicating whether the underdog was the away team or the home team. I then used these indicators together with the final score (originally home_score and away_score) to encode the response variable upset, which is "Yes" if the game resulted in an upset and "No" if it did not. I also included spread as a potential quantitative predictor; it is the absolute value of spread_line (the number of points the underdog is expected to lose by).
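The derivation of the underdog indicators, spread, and upset from spread_line and the final score can be illustrated as follows. This is a minimal Python sketch rather than the R code used for the analysis. The function name label_upset is hypothetical, and two edge cases the text does not address are handled by assumption: a spread of zero is treated as having no underdog, and a tie is treated as "No" because the underdog did not beat the favorite.

```python
def label_upset(spread_line, home_score, away_score):
    """Derive underdog indicators, spread, and the upset label for one game.

    spread_line > 0 means the home team was favored by that many points.
    """
    if spread_line == 0:
        return None  # pick'em game: no underdog to label (assumption)
    home_underdog = spread_line < 0
    if home_underdog:
        underdog_won = home_score > away_score
    else:
        underdog_won = away_score > home_score
    return {
        "home_underdog": int(home_underdog),
        "away_underdog": int(not home_underdog),
        "spread": abs(spread_line),  # points the underdog was expected to lose by
        "upset": "Yes" if underdog_won else "No",  # a tie counts as "No" (assumption)
    }

# Home team favored by 7 but loses 10-17: the away underdog pulls the upset.
away_upset = label_upset(7, 10, 17)
# Home team is a 3.5-point underdog and wins 24-20.
home_upset = label_upset(-3.5, 24, 20)
```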

The play-by-play and schedule datasets shared a game_id key, so I joined them on game_id and created new variables indicating whether the underdog or its opponent was a cold weather team or a dome team. I defined cold weather teams as NFL teams that play their home games outdoors and whose average December temperature is below 40 degrees Fahrenheit. I hard coded dome teams by season by checking which teams played in a dome in which seasons. This produced some special cases, such as the Minnesota Vikings, who played in a dome from 2000 until 2013 and again from 2016 until the present, but played outdoors in cold Minnesota during the 2014 and 2015 seasons. Then, I grouped all teams into categories depending on whether they were home or away and whether they were a cold weather team or a dome team. I ended up with six “types” of underdogs: away underdogs who normally play in the cold, away underdogs who normally play in a dome, away underdogs who play in neither the cold nor a dome, home underdogs who normally play in the cold, home underdogs who normally play in a dome, and home underdogs who play in neither the cold nor a dome. Because my analysis treats each season and each game as independent, I thought that grouping teams into these categories would preserve the multilevel structure that a separate model for each team would provide, without the problem of teams changing from year to year. After I finished cleaning the data, the final dataset had 5,827 rows and 13 variables (one response variable and twelve predictor variables).
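The grouping into six underdog types can be sketched as a lookup with season-specific exceptions. The Python below is illustrative only. The function names, the override table, and the example team sets are hypothetical, and the sets passed in at the bottom are placeholders rather than the lists used in the analysis. The only classification taken from the text is the Vikings' 2014 and 2015 seasons.

```python
# Season-specific exceptions to a team's usual category.
# The Vikings entries follow the case described in the text.
SEASON_OVERRIDES = {("MIN", 2014): "Cold", ("MIN", 2015): "Cold"}

def team_climate(team, season, cold_teams, dome_teams):
    """Return "Cold", "Dome", or "Normal" for a team in a given season."""
    override = SEASON_OVERRIDES.get((team, season))
    if override is not None:
        return override
    if team in dome_teams:
        return "Dome"
    if team in cold_teams:
        return "Cold"
    return "Normal"

def underdog_type(underdog_is_home, team, season, cold_teams, dome_teams):
    """Combine location and climate into one of the six underdog types."""
    location = "Home" if underdog_is_home else "Away"
    return f"{location} {team_climate(team, season, cold_teams, dome_teams)}"

# Placeholder team sets for illustration, not the lists from the analysis.
cold = {"GB"}
dome = {"MIN"}
vikings_2010 = underdog_type(False, "MIN", 2010, cold, dome)  # "Away Dome"
vikings_2014 = underdog_type(True, "MIN", 2014, cold, dome)   # "Home Cold"
```

Keeping the exceptions in a separate table means a stadium change is recorded once and applies to both the underdog and the opponent classification.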

Exploratory Data Analysis

type_dog      upsets   games    prop
Away Cold        346    1042   0.332
Away Dome        258     821   0.314
Away Normal      661    1987   0.333
Home Cold        186     502   0.371
Home Dome        144     402   0.358
Home Normal      370    1073   0.345
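
The proportions in the table can be re-derived from the upset and game counts. The short Python sketch below uses only the counts shown in the table; it also pools the three away types and the three home types, a summary that the table itself does not report.

```python
# (upsets, games) for each underdog type, copied from the table.
counts = {
    "Away Cold": (346, 1042),
    "Away Dome": (258, 821),
    "Away Normal": (661, 1987),
    "Home Cold": (186, 502),
    "Home Dome": (144, 402),
    "Home Normal": (370, 1073),
}

# Upset proportion per type, rounded to three decimals as in the table.
props = {k: round(u / g, 3) for k, (u, g) in counts.items()}

# Total number of games across all six types.
total_games = sum(g for _, g in counts.values())

def pooled_rate(location):
    """Upset rate pooled over all types that start with "Home" or "Away"."""
    upsets = sum(u for k, (u, _) in counts.items() if k.startswith(location))
    games = sum(g for k, (_, g) in counts.items() if k.startswith(location))
    return round(upsets / games, 3)

away_rate = pooled_rate("Away")
home_rate = pooled_rate("Home")
```

The game counts sum to 5,827, matching the number of rows in the final dataset, and the pooled upset rate is about 0.329 for away underdogs and 0.354 for home underdogs.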

As the table above shows, for each type of team (dome, cold, normal) there are far more games in which the away team is the underdog than games in which the home team is the underdog. This is mostly due to the roughly three-point home-field advantage built into the spread. We can also see that home underdogs win more often than away underdogs for each type of team. Home underdogs that are cold teams have the highest upset proportion (0.371), which makes sense given what is known about home underdogs and the advantage cold teams have in the winter. These summary statistics alone do not show that the data follows a multilevel structure. However, the difference in the proportion of games won by away underdogs compared to home underdogs, together with the perceived difference in how the predictor variables behave for home and away underdogs, suggests that location might warrant a second level in the model.

The histogram above shows the distribution of spreads across NFL games, where negative X values correspond to a favored away team and positive X values correspond to a favored home team. The right side of the histogram is much denser, which agrees with the table above: home teams are favored more often than away teams. Additionally, the boxplot on the right compares the spread for underdogs in games lost and games won, and it shows that underdogs win more often when the spread is smaller. The biggest underdog to win a game was an underdog of about 17.5 points. Losing underdogs have spreads that approach 30, and every team that was expected to lose by 18 or more points ended up losing the game.

The above graphs show the probability of an upset in playoff games and divisional games. As we can see, the probability of an upset is higher when the game occurs in the playoffs as opposed to the regular season. However, the probability of an upset is lower when the game occurs between two divisional opponents.

The bar graphs on the top show the probability of an upset for underdogs playing against cold teams, while the graphs on the bottom show the probability of an upset for underdogs playing against dome teams. On the top, underdogs are most likely to win when they are at home after week 13 and playing against a cold team, while before week 13 the probability of an upset is about the same for home and away underdogs. The graphs on the bottom do not show much about opp_dome_team on its own, but home underdogs playing dome teams after week 13 perform slightly better than away underdogs, while home underdogs playing dome teams before week 13 actually perform worse than away underdogs. This pattern supports my proposed interaction between opp_dome_team and late_season: against dome teams, the probability of an upset is higher for home underdogs late in the season but higher for away underdogs early in the season.
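
Each bar in graphs like these is an upset rate within one combination of indicator values. The Python sketch below shows one way to compute such grouped rates. The function name upset_rate_by_group is hypothetical, and the four records are made up purely to show the mechanics; they are not drawn from the dataset.

```python
from collections import defaultdict

def upset_rate_by_group(games, keys):
    """Return the upset rate for each combination of the given indicator keys."""
    totals = defaultdict(lambda: [0, 0])  # group -> [upsets, games]
    for game in games:
        group = tuple(game[k] for k in keys)
        totals[group][0] += int(game["upset"] == "Yes")
        totals[group][1] += 1
    return {group: upsets / n for group, (upsets, n) in totals.items()}

# Made-up records for illustration only; all are games against a dome team.
toy_games = [
    {"home_underdog": 1, "late_season": 1, "opp_dome_team": 1, "upset": "Yes"},
    {"home_underdog": 1, "late_season": 1, "opp_dome_team": 1, "upset": "No"},
    {"home_underdog": 0, "late_season": 1, "opp_dome_team": 1, "upset": "No"},
    {"home_underdog": 0, "late_season": 0, "opp_dome_team": 1, "upset": "Yes"},
]
rates = upset_rate_by_group(toy_games, ["home_underdog", "late_season"])
```

An interaction shows up when the gap between home and away underdogs changes sign or size across the levels of late_season, which is the comparison the bar graphs make visually.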