I created a multilevel logistic regression model with the goal of predicting upsets in the National Football League (NFL). I defined an upset as a game in which the underdog according to Vegas’ closing spread beat the favorite. The goal of my research was to use game facts, such as the stadium surface, game type (playoff or regular season), time of game, whether or not it was a divisional game, and other binary predictor variables, to predict whether or not a specific game would be an upset. The purpose of this research is to help predict future upsets in the NFL by creating a model that flags games with high upset potential. This model would benefit NFL analysts as well as fans or bettors who want to learn which games are likely to result in an upset. My research question has two parts: which variables (facts about the game) are the best predictors of whether or not an NFL game will result in an upset, and do the effects of these variables differ when the underdog is the home team compared to when the underdog is the away team?
My data was compiled using play-by-play and schedule data from the
`nflfastR` package. I pulled the NFL schedule and play-by-play data
dating back to 2000, which is the first year for which the package has
full data. After cleaning these datasets to create
important predictor variables, I joined the two datasets together and
ended up with my final dataset. I chose not to use actual game stats,
such as turnovers or yards, because these are not available before the
game happens and are therefore not true “predictors” of an upset. The
only quantitative predictor in my model is `spread`, which represents
the closing number of points the underdog was expected to lose by. I
also chose not to use
cumulative season-level data or previous season data as it is too
difficult to create different cumulative stats for teams at different
points in the season. Also, teams change a great deal from year to year
in terms of personnel and coaching, so it does not make much sense to
assume that a team's results are correlated across multiple seasons.
Instead of accounting for different teams
individually, I grouped the home and away underdogs into cold weather
and dome teams because of the similarities that these types of teams
face during the winter. I decided to ignore football-statistic data
altogether and focus on which game attributes are the best predictors
of an upset, as well as how these predictors change depending on the
type of underdog.
As mentioned in the introduction, my data was obtained using the
`nflfastR` package and merging two datasets together. First, I used the
`fast_scraper_schedules` function to scrape NFL schedule data containing
every game dating back to the 2000 season. This data originally had
facts about each game such as the season, week, kickoff time, day of
week, and other facts about the stadium. I turned this raw data into
usable predictor variables, such as `late_season`, an indicator variable
that was encoded as `1` if the game took place after week 13 and `0` if
it was before week 13. I used week 13 as the cutoff because this is
usually the start of December and about when the cold weather starts
becoming a factor. Additionally, I used the time of the game to create
the indicator variables `early_game`, which was encoded as `1` if the
game took place before 3:00 pm EST and `0` otherwise, and `late_game`,
which was encoded as `1` if the game took place after 7:00 pm EST and
`0` otherwise. Next, I used the game type to encode a new variable
`playoff`, which was encoded as `1` if the game was a playoff game and
`0` if it was a regular season game. I then created a variable
`sunday_game`, which was encoded as `1` if the game took place on a
Sunday and `0` if the game took place on another day of the week.
Lastly, I turned the playing surface and whether the game was outdoors
into indicator variables: `grass`, which was encoded as `1` if the game
was played on grass and `0` otherwise, and `outdoors`, which was encoded
as `1` if the game was outdoors and `0` if the game was indoors.
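The encoding rules above can be sketched in a few lines. This is a minimal Python illustration, not the R code used for the analysis. The input field names (`week`, `gametime`, `game_type`, `weekday`, `surface`, `roof`) and their values (`"REG"`, `"grass"`, `"outdoors"`) are assumptions about the raw schedule data, and treating week 13 itself as not late is also an assumption, since the text only covers games before and after it.

```python
from datetime import time


def encode_schedule_row(game):
    """Turn one raw schedule record (a dict) into the binary predictors."""
    hour, minute = (int(part) for part in game["gametime"].split(":"))
    kickoff = time(hour, minute)  # assumed: Eastern time on a 24-hour clock
    return {
        # Assumption: week 13 itself is coded 0; the text leaves this open.
        "late_season": int(game["week"] > 13),
        "early_game": int(kickoff < time(15, 0)),  # before 3:00 pm
        "late_game": int(kickoff > time(19, 0)),   # after 7:00 pm
        # Assumption: any game type other than "REG" is a playoff game.
        "playoff": int(game["game_type"] != "REG"),
        "sunday_game": int(game["weekday"] == "Sunday"),
        # Assumption: the raw surface and roof fields use these exact labels.
        "grass": int(game["surface"] == "grass"),
        "outdoors": int(game["roof"] == "outdoors"),
    }


# Hypothetical record: a 1:00 pm Sunday regular season game in week 15.
example_game = {
    "week": 15,
    "gametime": "13:00",
    "game_type": "REG",
    "weekday": "Sunday",
    "surface": "grass",
    "roof": "outdoors",
}
encoded = encode_schedule_row(example_game)
```

Note that a game kicking off between 3:00 pm and 7:00 pm gets `0` for both `early_game` and `late_game`, so the two indicators together describe three time slots.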
I then scraped every play of NFL game data dating back to the 2000
season using the `load_pbp` function and filtered each game down to its
final play. Each remaining row contained the final score as well as
other important information such as `div_game`, which indicates whether
the game was a divisional game, and `spread_line`, a numeric variable
giving the number of points the home team was favored by. I used
`spread_line` to mutate new binary variables indicating whether the
underdog in the game was the away team or the home team. Then, I used
these indicator variables and the final score (originally `home_score`
and `away_score`) to encode the response variable `upset`, an indicator
variable which is encoded as `"Yes"` if the game resulted in an upset
and `"No"` if it did not. I also included `spread` as a potential
quantitative predictor to test, which is the absolute value of
`spread_line` (how many points the underdog is expected to lose by).
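The response encoding described above can be illustrated with a short Python sketch. It is not the original R code. The function name `encode_upset` and the output keys `home_underdog` and `away_underdog` are hypothetical, and two edge cases the text does not address are handled by assumption: a game with a spread of zero has no underdog and is skipped, and a tie is not counted as an upset.

```python
def encode_upset(spread_line, home_score, away_score):
    """Derive the underdog side, spread, and upset label for one finished game.

    spread_line > 0 means the home team was favored by that many points.
    """
    if spread_line == 0:
        return None  # assumption: a pick'em game has no underdog, so skip it

    home_underdog = spread_line < 0
    if home_underdog:
        underdog_won = home_score > away_score
    else:
        underdog_won = away_score > home_score

    return {
        "home_underdog": int(home_underdog),
        "away_underdog": int(not home_underdog),
        "spread": abs(spread_line),  # points the underdog is expected to lose by
        # Assumption: a tie is labeled "No" because the underdog did not win.
        "upset": "Yes" if underdog_won else "No",
    }


# Hypothetical games: home team favored by 3 and loses, then home team is a
# 7.5-point underdog and loses.
away_upset = encode_upset(3.0, home_score=17, away_score=20)
home_loss = encode_upset(-7.5, home_score=10, away_score=24)
```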
Both the play-by-play dataset and the schedule dataset had a matching
`game_id` key, so I joined the two datasets on `game_id` and created
some new variables based on whether the underdog or their opponent was a
cold weather team or a dome team. I defined cold weather teams as NFL
teams whose average December temperature is below 40 degrees Fahrenheit
and who play their games outside. I also hard coded dome teams by season
by looking at which teams played in a dome and for which seasons. This
handles cases like the Minnesota Vikings, who played in a dome from 2000
until 2013 and again from 2016 until the present, but played outdoors in
cold Minnesota during the 2014 and 2015 seasons. Then, I grouped all
teams into categories depending on whether they were home or away and
whether they were a cold weather team, a dome team, or neither. I ended
up with six “types” of underdogs: away underdogs who normally play in
the cold, away underdogs who normally play in a dome, away underdogs who
play in neither the cold nor a dome, home underdogs who normally play in
the cold, home underdogs who normally play in a dome, and home underdogs
who play in neither the cold nor a dome. Because I treat games as
independent from season to season as well as from game to game, I
thought that grouping teams into these categories would keep the
multilevel approach that a different model for each team would have,
without the problem of teams changing each year. After finishing my data
cleaning, the final dataset had 5,827 rows and 13 variables (one
response variable and twelve predictor variables).
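The season-dependent grouping can be sketched as a small lookup. This is a hedged Python illustration, not the hard-coded table used in the analysis. Only the Vikings rule comes from the text; the abbreviation `"MIN"`, the placeholder team `"XYZ"`, the helper names, and the choice to label every unlisted team `Normal` are assumptions made for the example.

```python
def vikings_venue(season):
    """Vikings: outdoors in the cold in 2014 and 2015, in a dome otherwise."""
    return "Cold" if season in (2014, 2015) else "Dome"


# Hypothetical lookup keyed by team abbreviation. A full version would need
# one rule per cold weather or dome team; only the Vikings rule is sourced.
VENUE_RULES = {"MIN": vikings_venue}


def venue_type(team, season):
    """Return "Cold", "Dome", or "Normal" for a team in a given season."""
    rule = VENUE_RULES.get(team)
    # Assumption: teams without a rule are neither cold weather nor dome teams.
    return rule(season) if rule else "Normal"


def type_dog(underdog_team, season, home_underdog):
    """Combine location and venue type into one of the six underdog types."""
    side = "Home" if home_underdog else "Away"
    return f"{side} {venue_type(underdog_team, season)}"


labels = [
    type_dog("MIN", 2013, home_underdog=True),
    type_dog("MIN", 2014, home_underdog=False),
    type_dog("XYZ", 2010, home_underdog=True),  # "XYZ" is a placeholder team
]
```

Keying the rule on both team and season is what lets one franchise fall into different groups in different years.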
| type_dog | upsets | games | prop |
|---|---|---|---|
| Away Cold | 346 | 1042 | 0.332 |
| Away Dome | 258 | 821 | 0.314 |
| Away Normal | 661 | 1987 | 0.333 |
| Home Cold | 186 | 502 | 0.371 |
| Home Dome | 144 | 402 | 0.358 |
| Home Normal | 370 | 1073 | 0.345 |
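As a quick check on the table, the `prop` column is the number of upsets divided by the number of games for each underdog type. The Python sketch below recomputes it from the counts shown and also takes the home minus away gap for each venue type. The variable names are illustrative, and the gaps are computed from the rounded proportions.

```python
# Counts copied from the table: (type_dog, upsets, games).
rows = [
    ("Away Cold", 346, 1042),
    ("Away Dome", 258, 821),
    ("Away Normal", 661, 1987),
    ("Home Cold", 186, 502),
    ("Home Dome", 144, 402),
    ("Home Normal", 370, 1073),
]

# Upset proportion per underdog type, rounded to three decimals as in the table.
props = {name: round(upsets / games, 3) for name, upsets, games in rows}

# The game counts add up to the number of rows in the final dataset.
total_games = sum(games for _, _, games in rows)

# Home minus away upset proportion within each venue type.
home_minus_away = {
    venue: round(props[f"Home {venue}"] - props[f"Away {venue}"], 3)
    for venue in ("Cold", "Dome", "Normal")
}
```

The recomputed proportions match the table, `total_games` equals 5,827, and every gap in `home_minus_away` is positive.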
As the table above shows, there are far more games for each type of team (dome, cold, normal) where the away team is the underdog than where the home team is. This is mostly due to the three-point home-field advantage mentioned before. Additionally, home underdogs win more often than away underdogs for each type of team. Home underdogs that are cold teams have the highest upset rate (0.371), which makes sense given what is known about home underdogs as well as cold teams having an advantage in the winter. While these summary statistics by themselves do not show that the data follows a multilevel structure, the difference in the proportion of games won by away underdogs compared to home underdogs, together with the perceived difference in how the predictor variables behave for home and away underdogs, suggests that location might warrant a second level of the model.
The histogram above shows the distribution of spreads across NFL games, where negative X values correspond to the away team being favored and positive X values correspond to the home team being favored. The right side of the histogram is much denser, which agrees with the table above: home teams are favored more often than away teams. Additionally, the boxplot on the right compares the spread for underdogs in games lost vs. games won and shows that underdogs win more often when the spread is smaller. The biggest underdog to win was getting about 17.5 points. While losing underdogs have spreads that approach 30, every team that was expected to lose by 18 or more ended up losing the game.
The above graphs show the probability of an upset in playoff games and divisional games. As we can see, the probability of an upset is higher when the game occurs in the playoffs as opposed to the regular season. However, the probability of an upset is lower when the game occurs between two divisional opponents.
The bar graphs on the top represent the probability of an upset for
teams playing against cold teams, while the graphs on the bottom
represent the probability of an upset for teams playing against dome
teams. As the top graphs show, underdogs are most likely to win when
they are at home after week 13 and playing against a cold team, while
the probability of an upset is about the same before week 13. While the
graphs on the bottom don’t show us much about `opp_dome_team`, we can
see that home underdogs playing dome teams after week 13 perform
slightly better than away underdogs, while home underdogs playing dome
teams before week 13 actually perform worse than away underdogs. This
supports my proposed interaction between `opp_dome_team` and
`late_season`, as the probability of an upset is higher for home
underdogs late in the season but higher for away underdogs early in the
season.
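One way to write down the interaction described above is a logistic regression with a product term whose coefficients vary by group. This is a sketch under assumptions, not the fitted model: the index $j$ (home or away underdog), the coefficient names, and the choice of which coefficients vary by group are all illustrative.

$$
\operatorname{logit} P(\mathrm{upset}_{ij} = \text{Yes}) = \beta_{0j} + \beta_{1j}\,\mathrm{opp\_dome\_team}_{ij} + \beta_{2j}\,\mathrm{late\_season}_{ij} + \beta_{3j}\,\bigl(\mathrm{opp\_dome\_team}_{ij} \times \mathrm{late\_season}_{ij}\bigr)
$$

Here $i$ indexes games and $j$ indexes the location group. A nonzero $\beta_{3j}$ means the effect of facing a dome team changes after week 13, and letting it vary with $j$ allows that change to differ between home and away underdogs, which is the pattern the bottom graphs suggest.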