I created a multilevel logistic regression model with the goal of predicting upsets in the National Football League (NFL). I defined an upset as the underdog according to Vegas' closing spread beating the favorite. The goal of my research was to use game facts, such as the stadium surface, game type (playoff or regular season), time of game, whether or not it was a divisional game, and other binary predictor variables, to predict whether or not a specific game would be an upset. The purpose of this research is to help correctly identify future NFL upsets by building a model that flags high-potential upset games. Such a model would benefit NFL analysts as well as fans or bettors who wish to learn which games are likely to result in an upset. My research question is twofold: which variables (facts about the game) are the best predictors of whether or not an NFL game will result in an upset, and are the effects of these variables different when the underdog is the home team compared to when the underdog is the away team?
My data was compiled using play-by-play and schedule data from the `nflfastR` package. I pulled NFL schedule and play-by-play data dating back to 2000, the first season for which the package has complete data. After cleaning these datasets to create important predictor variables, I joined them together to form my final dataset. I chose not to use actual game stats, such as turnovers or yards, as these are not available before the game happens and are therefore not true "predictors" of an upset. The only quantitative predictor in my model is `spread`, which represents the closing number of points the underdog was expected to lose by. I also chose not to use cumulative season-level data or previous-season data, as it is too difficult to create cumulative stats for teams at different points in the season. Also, teams change substantially from year to year in terms of personnel and coaching, so it does not make much sense to assume correlation within teams across multiple seasons. Instead of accounting for individual teams, I grouped the home and away underdogs into cold weather and dome teams because of the similarities these types of teams face during the winter. I decided to ignore football-statistic data altogether and focus on which game attributes are the best predictors of an upset, in addition to how these predictors change depending on the type of underdog.
As mentioned in the introduction, my data was obtained by merging two datasets from the `nflfastR` package. First, I used the `fast_scraper_schedules` function to scrape NFL schedule data containing every game dating back to the 2000 season. This data originally had facts about each game such as the season, week, kickoff time, day of week, and other facts about the stadium. I turned this raw data into usable predictor variables, such as `late_season`, an indicator variable encoded as 1 if the game took place after Week 13 and 0 otherwise. I used Week 13 as the cutoff because it usually marks the start of December, about when cold weather starts becoming a factor. Additionally, I used the time of the game to create the indicator variables `early_game`, encoded as 1 if the game kicked off before 3:00 pm EST and 0 otherwise, and `late_game`, encoded as 1 if the game kicked off after 7:00 pm EST and 0 otherwise. Next, I used the game type to encode a new variable `playoff`, encoded as 1 if the game was a playoff game and 0 if it was a regular season game. I then created a variable `sunday_game`, encoded as 1 if the game took place on a Sunday and 0 if it took place on another day of the week. Lastly, I turned the playing surface and whether the game was outside into indicator variables: `grass` was encoded as 1 if the game was played on grass and 0 otherwise, and `outdoors` was encoded as 1 if the game was outdoors and 0 if it was indoors.
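To make this cleaning step concrete, here is a minimal sketch of how these schedule-based indicators could be built. The column names `gametime`, `weekday`, `game_type`, `surface`, and `roof` reflect my understanding of the `nflfastR` schedule output and may differ by package version; the `2000:2021` season range is an assumed endpoint for illustration:

```r
library(nflfastR)
library(dplyr)

# Pull schedules for every season since 2000 (assumed column names below).
sched <- fast_scraper_schedules(2000:2021) %>%
  mutate(
    kickoff_hour = as.numeric(substr(gametime, 1, 2)),        # "HH:MM" Eastern
    late_season  = if_else(week > 13, 1, 0),                  # after Week 13
    early_game   = if_else(kickoff_hour < 15, 1, 0),          # before 3:00 pm EST
    late_game    = if_else(kickoff_hour >= 19, 1, 0),         # after 7:00 pm EST
    playoffs     = if_else(game_type != "REG", 1, 0),
    sunday_game  = if_else(weekday == "Sunday", 1, 0),
    grass        = if_else(grepl("grass", surface, ignore.case = TRUE), 1, 0),
    outdoors     = if_else(roof %in% c("outdoors", "open"), 1, 0)
  )
```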
I then scraped every play of NFL game data dating back to the 2000 season using the `load_pbp` function and filtered each game down to its final play. Each row then contained the final score as well as other important information, such as `div_game`, indicating whether the game was a divisional matchup, and `spread_line`, a numeric variable giving the number of points the home team was favored by. I used this `spread_line` variable to create new binary variables indicating whether the underdog was the away team or the home team. Then, I used this indicator and the final score (originally `home_score` and `away_score`) to encode the response variable `upset`, an indicator variable encoded as `"Yes"` if the game resulted in an upset and `"No"` otherwise. I also included `spread` as a potential quantitative predictor to test, which is the absolute value of `spread_line` (how many points the underdog was expected to lose by).
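A minimal sketch of this step, assuming the play-by-play columns `game_id`, `home_score`, `away_score`, `spread_line`, and `div_game` (in `nflfastR`, a positive `spread_line` means the home team is favored):

```r
# Load play-by-play data and keep only the final play of each game.
pbp <- load_pbp(2000:2021)

finals <- pbp %>%
  group_by(game_id) %>%
  slice_tail(n = 1) %>%
  ungroup() %>%
  filter(spread_line != 0) %>%                      # pick'em games have no underdog
  mutate(
    home_underdog = if_else(spread_line < 0, 1, 0), # home team is the underdog
    spread        = abs(spread_line),               # points the underdog should lose by
    upset = case_when(
      home_underdog == 1 & home_score > away_score ~ "Yes",
      home_underdog == 0 & away_score > home_score ~ "Yes",
      TRUE                                         ~ "No"
    )
  ) %>%
  select(game_id, div_game, spread, home_underdog, upset)
```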
Both the play-by-play and schedule datasets shared a `game_id` key, so I joined the two datasets on `game_id` and created some new variables based on whether or not the underdog or their opponent was a cold weather team or a dome team. I defined cold weather teams as NFL teams whose average December temperature is below 40 degrees Fahrenheit and who play their games outside. I also hard-coded dome teams by season by looking at which teams played in a dome and for which seasons. Then, I grouped all teams into categories depending on whether they were home or away, a cold weather team, or a dome team. In the end, I had six "types" of underdogs: away underdogs who normally play in the cold, away underdogs who normally play in a dome, away underdogs who play in neither the cold nor a dome, home underdogs who normally play in the cold, home underdogs who normally play in a dome, and home underdogs who play in neither the cold nor a dome. Because my analysis is independent from season to season as well as game to game, I thought that grouping teams into these categories would keep the multilevel structure that a separate model for each team would have, without the issues caused by teams changing each year. Hard-coding by season also handles cases like the Minnesota Vikings, who played in a dome from 2000 through 2013 and again from 2016 to the present, but played outdoors in cold Minnesota during the 2014 and 2015 seasons. After finishing my data cleaning, the final product was a dataset with 5,827 rows and 13 variables (one response variable and twelve predictor variables).
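The join and grouping step might look like the sketch below. The `cold_teams` and `dome_teams` vectors here are illustrative placeholders, not the full hard-coded, season-specific lists described above:

```r
# Hypothetical (incomplete) team lists; the real ones vary by season.
cold_teams <- c("GB", "CHI", "BUF", "CLE", "NE", "DEN", "PIT")
dome_teams <- c("NO", "ATL", "DET", "IND", "MIN")

nfl_upsets <- finals %>%
  left_join(sched, by = "game_id") %>%
  mutate(
    dog = if_else(home_underdog == 1, home_team, away_team),
    type_dog = case_when(
      home_underdog == 0 & dog %in% cold_teams ~ "Away Cold",
      home_underdog == 0 & dog %in% dome_teams ~ "Away Dome",
      home_underdog == 0                       ~ "Away Normal",
      dog %in% cold_teams                      ~ "Home Cold",
      dog %in% dome_teams                      ~ "Home Dome",
      TRUE                                     ~ "Home Normal"
    )
  )
```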
type_dog | upsets | games | prop |
---|---|---|---|
Away Cold | 346 | 1042 | 0.332 |
Away Dome | 258 | 821 | 0.314 |
Away Normal | 661 | 1987 | 0.333 |
Home Cold | 186 | 502 | 0.371 |
Home Dome | 144 | 402 | 0.358 |
Home Normal | 370 | 1073 | 0.345 |
As we can see from the table above, there are far more games for each type of team (dome, cold, normal) where the away team is the underdog than where the home team is. This is mostly due to the roughly three-point home-field advantage mentioned before. Additionally, home underdogs win at a higher rate than away underdogs for each type of team. Another thing to note is that home underdogs who are cold weather teams have the highest win proportion, which makes sense given what is known about home underdogs as well as the advantage cold weather teams have in the winter. While these summary statistics alone do not prove that the data follows any sort of multilevel structure, the difference in the proportion of games won by away underdogs compared to home underdogs, in addition to the perceived difference in home and away predictor effects, suggests that game location might warrant a second level of the model.
The histogram above shows the distribution of spreads across NFL games, where negative x-values correspond to the away team being favored and positive x-values to the home team being favored. The right side of the histogram is much denser, which confirms the numbers in the table above showing that home teams are favored more often than away teams. Additionally, the boxplot on the right shows the median spread for underdogs in games lost vs. games won, which shows that underdogs win more often when the spread is smaller. Also, the biggest underdog to win any game was about a 17.5-point underdog. While losing underdogs have spreads that approach 30, every team expected to lose by 18 or more points ended up losing the game.
The above graphs show the probability of an upset in playoff games and divisional games. As we can see, the probability of an upset is higher when the game occurs in the playoffs as opposed to the regular season. However, the probability of an upset is lower when the game occurs between two divisional opponents.
The bar graphs on the top represent the probability of an upset for teams playing against cold weather teams, while the graphs on the bottom represent the probability of an upset against dome teams. As we can see on the top, underdogs are most likely to win when they are at home after Week 13 and playing a cold weather team, while the probability of an upset is about the same before Week 13. While the graphs on the bottom don't show us much about `opp_dome_team` on its own, we can see that home underdogs playing dome teams after Week 13 perform slightly better than away underdogs, while home underdogs playing dome teams before Week 13 actually perform worse than away underdogs. This highlights my proposed interaction between `opp_dome_team` and `late_season`, as the probability of an upset is higher for home underdogs late in the season but higher for away underdogs early in the season.
As we can see in the bar graphs on the left, which represent games after Week 13, cold weather underdogs and dome underdogs both perform significantly better at home than on the road. The graphs on the right, which represent games before Week 13, show that cold weather and dome underdogs still perform slightly better at home than on the road, but these probabilities are much closer together than after Week 13. This suggests that home field matters more for dome and cold weather teams later in the season, as we might expect.
The main reason I used a multilevel model is to account for the difference between winning as an underdog on the road and winning as an underdog at home. In the NFL, home field advantage is very important and is thought by many to move the spread by as many as 3 points in the direction of the home team (Panayotovich 2021). Many teams have distinct stadium features, such as the type of grass, whether they play in a dome or outside, and whether the city they play in is cold in the winter. Many past studies have attempted to quantify the effect that a warm-weather or dome team playing on the road in the cold has on the outcome of a game. While these home-stadium factors matter all season, they become even more important as the weather turns colder later in the year.
For this reason, I wanted to examine how teams who normally play in domes fare late in the season when they are not playing in domes, as well as how teams who play most of their games in warm weather fare in the cold. While the play-by-play data had the temperature available in the `Notes` column for about half of the games, there was too much missing data to use temperature as a predictor variable. Instead, I hard-coded the variables `dome_team` and `cold_team`, as well as `opp_dome_team` and `opp_cold_team`, as mentioned above, and examined the interaction between these variables and `late_season`. I believe these variables, in addition to `outdoors`, have a dependence on each other that varies depending on whether the underdog is the away team or the home team.
I first started by creating my level one model with every possible predictor variable, including the Vegas spread. I also included the interaction terms between `late_season` and `outdoors`, `late_season` and `opp_cold_team`, and `late_season` and `opp_dome_team`. While I also wanted to look at interactions between `late_season` and whether the underdogs themselves were cold weather or dome teams, that question is addressed in the second level of the model. The output for this model is listed below:
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.374 | 0.121 | 3.096 | 0.002 |
div_game1 | -0.012 | 0.060 | -0.203 | 0.839 |
grass1 | 0.124 | 0.067 | 1.858 | 0.063 |
outdoors1 | -0.211 | 0.103 | -2.047 | 0.041 |
playoffs1 | 0.246 | 0.153 | 1.607 | 0.108 |
early_game1 | -0.050 | 0.067 | -0.755 | 0.450 |
late_game1 | -0.101 | 0.091 | -1.118 | 0.264 |
late_season1 | -0.159 | 0.178 | -0.892 | 0.373 |
opp_dome_team1 | -0.293 | 0.104 | -2.813 | 0.005 |
opp_cold_team1 | -0.211 | 0.077 | -2.719 | 0.007 |
spread | -0.154 | 0.010 | -15.858 | 0.000 |
outdoors1:late_season1 | 0.017 | 0.175 | 0.097 | 0.922 |
late_season1:opp_cold_team1 | 0.019 | 0.147 | 0.129 | 0.897 |
late_season1:opp_dome_team1 | 0.368 | 0.190 | 1.942 | 0.052 |
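For reference, a minimal sketch of how a fit like this could be produced with `glm()`, assuming the cleaned data frame is called `nfl_upsets` and the binary predictors are factor-coded (matching the `1` suffixes in the output above):

```r
# Level one logistic regression with all predictors and interactions.
# Make "Yes" the modeled event level.
nfl_upsets$upset <- factor(nfl_upsets$upset, levels = c("No", "Yes"))

spread_model <- glm(
  upset ~ div_game + grass + outdoors + playoffs + early_game + late_game +
    late_season + opp_dome_team + opp_cold_team + spread +
    outdoors:late_season + late_season:opp_cold_team + late_season:opp_dome_team,
  data = nfl_upsets, family = binomial
)
summary(spread_model)
```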
I also wanted to fit a model without the Vegas spread, using solely game facts with no indication of how good the teams are perceived to be. I used the same interaction terms and predictors as the last model, minus `spread`. The output for that model is listed below:
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.446 | 0.109 | -4.100 | 0.000 |
div_game1 | -0.031 | 0.058 | -0.534 | 0.593 |
grass1 | 0.177 | 0.065 | 2.709 | 0.007 |
outdoors1 | -0.189 | 0.102 | -1.856 | 0.063 |
playoffs1 | 0.305 | 0.150 | 2.032 | 0.042 |
early_game1 | -0.027 | 0.065 | -0.417 | 0.676 |
late_game1 | -0.065 | 0.089 | -0.729 | 0.466 |
late_season1 | -0.191 | 0.176 | -1.085 | 0.278 |
opp_dome_team1 | -0.257 | 0.103 | -2.502 | 0.012 |
opp_cold_team1 | -0.234 | 0.076 | -3.084 | 0.002 |
outdoors1:late_season1 | -0.029 | 0.174 | -0.168 | 0.867 |
late_season1:opp_cold_team1 | -0.010 | 0.143 | -0.068 | 0.946 |
late_season1:opp_dome_team1 | 0.304 | 0.188 | 1.618 | 0.106 |
I then performed a drop-in-deviance test on the two models above to determine which model to continue with:
Resid. Df | Resid. Dev | df | Deviance | p.value |
---|---|---|---|---|
5814 | 7409.477 | NA | NA | NA |
5813 | 7119.246 | 1 | 290.231 | 0 |
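Assuming the no-spread fit is called `game_facts_model`, this comparison can be reproduced with `anova()`:

```r
# Drop-in-deviance (likelihood ratio) test: does adding spread help?
game_facts_model <- update(spread_model, . ~ . - spread)
anova(game_facts_model, spread_model, test = "Chisq")
```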
The p-value for the drop-in-deviance test is essentially 0, so we reject the null hypothesis that the coefficient of `spread` is 0 and conclude that `spread` is a significant predictor. While I knew the spread model would be more informative, since knowing how many points the underdog is expected to lose by will almost certainly provide more information about the odds of an upset, I originally wanted my model to contain solely game facts. However, the spread model appears much better based on the output above when comparing the deviance and p-value, and the signs of almost all the coefficients are the same, meaning we will likely get similar interpretations for the predictor variables whether or not we include `spread`. Therefore, I will continue using the spread model from above.
My original model started with 10 predictors as well as 3 interaction terms, so I performed backward selection on my `spread_model` to identify which predictor variables are relevant. I used AIC as the selection criterion, which penalizes model complexity and therefore favors simpler models. The resulting level one model after backward selection is outputted below:
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.318 | 0.097 | 3.280 | 0.001 |
grass1 | 0.129 | 0.067 | 1.941 | 0.052 |
outdoors1 | -0.207 | 0.089 | -2.338 | 0.019 |
playoffs1 | 0.269 | 0.149 | 1.798 | 0.072 |
late_season1 | -0.137 | 0.077 | -1.774 | 0.076 |
opp_dome_team1 | -0.293 | 0.097 | -3.002 | 0.003 |
opp_cold_team1 | -0.209 | 0.066 | -3.173 | 0.002 |
spread | -0.154 | 0.010 | -15.844 | 0.000 |
late_season1:opp_dome_team1 | 0.352 | 0.152 | 2.314 | 0.021 |
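A sketch of this selection step, using `step()` with its default AIC criterion:

```r
# Backward selection by AIC; trace = FALSE suppresses the step-by-step log.
reduced_model <- step(spread_model, direction = "backward", trace = FALSE)
summary(reduced_model)
```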
I then needed to start testing multilevel models. I decided to start with the variables from my backward selection model and added a random intercept for `type_dog` as the level two grouping variable.
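A minimal sketch of this random-intercept fit using `lme4::glmer()`; the `(1 | type_dog)` term adds the level two random intercept:

```r
library(lme4)

# Random intercept for the six underdog types.
l2_model <- glmer(
  upset ~ grass + outdoors + playoffs + late_season + opp_dome_team +
    opp_cold_team + spread + late_season:opp_dome_team + (1 | type_dog),
  data = nfl_upsets, family = binomial
)
summary(l2_model)
```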
effect | group | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|---|
fixed | NA | (Intercept) | 0.318 | 0.097 | 3.280 | 0.001 |
fixed | NA | grass1 | 0.129 | 0.067 | 1.941 | 0.052 |
fixed | NA | outdoors1 | -0.207 | 0.089 | -2.338 | 0.019 |
fixed | NA | playoffs1 | 0.269 | 0.149 | 1.798 | 0.072 |
fixed | NA | late_season1 | -0.137 | 0.077 | -1.774 | 0.076 |
fixed | NA | opp_dome_team1 | -0.293 | 0.097 | -3.002 | 0.003 |
fixed | NA | opp_cold_team1 | -0.209 | 0.066 | -3.173 | 0.002 |
fixed | NA | spread | -0.154 | 0.010 | -15.844 | 0.000 |
fixed | NA | late_season1:opp_dome_team1 | 0.352 | 0.152 | 2.314 | 0.021 |
ran_pars | type_dog | sd__(Intercept) | 0.000 | NA | NA | NA |
While the fixed effects look the same as in the model from before, the estimated standard deviation of the random intercept is 0, meaning the model finds no between-group variability to explain. I continued to examine this issue by fitting different models with different fixed and random effects.
effect | group | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|---|
fixed | NA | (Intercept) | -0.621 | 0.037 | -16.869 | 0.000 |
fixed | NA | late_season1 | -0.177 | 0.071 | -2.498 | 0.012 |
fixed | NA | opp_dome_team1 | -0.114 | 0.079 | -1.441 | 0.150 |
fixed | NA | late_season1:opp_dome_team1 | 0.320 | 0.148 | 2.165 | 0.030 |
ran_pars | type_dog | sd__(Intercept) | 0.000 | NA | NA | NA |
I fit another model that removed the variables that were not very important in the last model, and the same issue arose: the random intercept variance was still estimated as 0. I then began experimenting with the random effects themselves.
effect | group | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|---|
fixed | NA | (Intercept) | -0.521 | 0.046 | -11.428 | 0.000 |
fixed | NA | late_season1 | -0.168 | 0.071 | -2.377 | 0.017 |
fixed | NA | opp_dome_team1 | -0.212 | 0.094 | -2.257 | 0.024 |
fixed | NA | opp_cold_team1 | -0.262 | 0.064 | -4.105 | 0.000 |
fixed | NA | late_season1:opp_dome_team1 | 0.319 | 0.148 | 2.151 | 0.031 |
ran_pars | type_dog | sd__(Intercept) | 0.026 | NA | NA | NA |
ran_pars | type_dog | cor__(Intercept).opp_dome_team1 | -1.000 | NA | NA | NA |
ran_pars | type_dog | sd__opp_dome_team1 | 0.098 | NA | NA | NA |
When I add a random slope for `opp_dome_team`, allowing its effect to vary by underdog type, I get a model with a nonzero variance estimate that is usable. I will now add back the fixed effects from the previous models and compare the resulting models to this one.
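The full model whose output is shown below could be specified as follows; the `(opp_dome_team | type_dog)` term gives each underdog type its own intercept and its own `opp_dome_team` effect:

```r
# Random intercept plus a random slope for opp_dome_team.
l2_model_5 <- glmer(
  upset ~ grass + outdoors + playoffs + late_season + opp_dome_team +
    opp_cold_team + spread + late_season:opp_dome_team +
    (opp_dome_team | type_dog),
  data = nfl_upsets, family = binomial
)
summary(l2_model_5)
```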
effect | group | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|---|
fixed | NA | (Intercept) | 0.287 | 0.103 | 2.784 | 0.005 |
fixed | NA | grass1 | 0.134 | 0.067 | 2.009 | 0.045 |
fixed | NA | outdoors1 | -0.171 | 0.098 | -1.745 | 0.081 |
fixed | NA | playoffs1 | 0.263 | 0.150 | 1.758 | 0.079 |
fixed | NA | late_season1 | -0.135 | 0.077 | -1.750 | 0.080 |
fixed | NA | opp_dome_team1 | -0.284 | 0.113 | -2.519 | 0.012 |
fixed | NA | opp_cold_team1 | -0.209 | 0.066 | -3.169 | 0.002 |
fixed | NA | spread | -0.155 | 0.010 | -15.830 | 0.000 |
fixed | NA | late_season1:opp_dome_team1 | 0.364 | 0.153 | 2.387 | 0.017 |
ran_pars | type_dog | sd__(Intercept) | 0.012 | NA | NA | NA |
ran_pars | type_dog | cor__(Intercept).opp_dome_team1 | 1.000 | NA | NA | NA |
ran_pars | type_dog | sd__opp_dome_team1 | 0.122 | NA | NA | NA |
term | npar | AIC | BIC | logLik | deviance | statistic | df | p.value |
---|---|---|---|---|---|---|---|---|
l2_model4 | 8 | 7439.925 | 7493.287 | -3711.963 | 7423.925 | NA | NA | NA |
l2_model_5 | 12 | 7143.195 | 7223.238 | -3559.597 | 7119.195 | 304.73 | 4 | 0 |
The model with all of the terms from backward selection, in addition to the `opp_dome_team` random effect, performs better than the smaller model in the chi-squared test above. While the explained intraclass variability goes down when there are more level one predictors, the resulting model is a better overall predictor of an upset. Therefore, I will proceed with this model as my final model.
The level one model for game \(j\) within underdog type \(i\) (there are six underdog types, so six sets of level one coefficients) is:

\(Y_{ij} = a_{i} + b_{i} \times\) grass \(+\, c_{i} \times\) outdoors \(+\, d_{i} \times\) playoffs \(+\, e_{i} \times\) late_season \(+\, f_{i} \times\) opp_dome_team \(+\, g_{i} \times\) opp_cold_team \(+\, h_{i} \times\) spread \(+\, k_{i} \times\) (late_season \(\times\) opp_dome_team) \(+\, \epsilon_{ij}\),

where \(Y_{ij}\) is the log odds of an upset. There are then nine level two models: one for the intercept and one for each of the eight level one coefficients. Only the intercept and the coefficient of `opp_dome_team` receive random effects, as mentioned above.
\(a_{i} = \alpha_{01} + u_{i1}\)
\(b_{i} = \alpha_{02}\)
\(c_{i} = \alpha_{03}\)
\(d_{i} = \alpha_{04}\)
\(e_{i} = \alpha_{05}\)
\(f_{i} = \alpha_{06} + u_{i6}\)
\(g_{i} = \alpha_{07}\)
\(h_{i} = \alpha_{08}\)
\(k_{i} = \alpha_{09}\)
The final composite model is therefore:
\(Y_{ij} = (\alpha_{01} + u_{i1}) + \alpha_{02} \times\) grass + \(\alpha_{03} \times\) outdoors + \(\alpha_{04} \times\) playoffs + \(\alpha_{05} \times\) late_season + \((\alpha_{06} + u_{i6}) \times\) opp_dome_team + \(\alpha_{07} \times\) opp_cold_team + \(\alpha_{08} \times\) spread + \(\alpha_{09} \times\) late_season \(\times\) opp_dome_team + \(\epsilon_{ij}\).
Grouping the fixed terms and error terms together, we get our final composite model as:
\(Y_{ij} = [\alpha_{01} + \alpha_{02}\) grass + \(\alpha_{03}\) outdoors + \(\alpha_{04}\) playoffs + \(\alpha_{05}\) late_season + \(\alpha_{06}\) opp_dome_team + \(\alpha_{07}\) opp_cold_team + \(\alpha_{08}\) spread + \(\alpha_{09}\) (late_season \(\times\) opp_dome_team)] + [\(\epsilon_{ij} + u_{i1} + u_{i6}\) opp_dome_team].
effect | group | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|---|
fixed | NA | (Intercept) | 0.287 | 0.103 | 2.784 | 0.005 |
fixed | NA | grass1 | 0.134 | 0.067 | 2.009 | 0.045 |
fixed | NA | outdoors1 | -0.171 | 0.098 | -1.745 | 0.081 |
fixed | NA | playoffs1 | 0.263 | 0.150 | 1.758 | 0.079 |
fixed | NA | late_season1 | -0.135 | 0.077 | -1.750 | 0.080 |
fixed | NA | opp_dome_team1 | -0.284 | 0.113 | -2.519 | 0.012 |
fixed | NA | opp_cold_team1 | -0.209 | 0.066 | -3.169 | 0.002 |
fixed | NA | spread | -0.155 | 0.010 | -15.830 | 0.000 |
fixed | NA | late_season1:opp_dome_team1 | 0.364 | 0.153 | 2.387 | 0.017 |
ran_pars | type_dog | sd__(Intercept) | 0.012 | NA | NA | NA |
ran_pars | type_dog | cor__(Intercept).opp_dome_team1 | 1.000 | NA | NA | NA |
ran_pars | type_dog | sd__opp_dome_team1 | 0.122 | NA | NA | NA |
The final model with estimates is shown above, as well as in the Methodology section in statistical notation. Based on the model above, the estimates for the fixed effects are: \(\hat \alpha_{01} = 0.287\), \(\hat \alpha_{02} = 0.134\), \(\hat \alpha_{03} = -0.171\), \(\hat \alpha_{04} = 0.263\), \(\hat \alpha_{05} = -0.135\), \(\hat \alpha_{06} = -0.284\), \(\hat \alpha_{07} = -0.209\), \(\hat \alpha_{08} = -0.155\), \(\hat \alpha_{09} = 0.364\). The estimated standard deviations of the random effects are \(\hat \sigma_{u_{1}} = 0.012\) for the random intercept and \(\hat \sigma_{u_{6}} = 0.122\) for the random slope on `opp_dome_team`.
My results were mostly on par with my previous hypotheses about the estimates of the predictor variables. One important interpretation is that the estimated intercept, \(\hat \alpha_{01} = 0.287\), is positive. This is the expected log odds of an upset when all predictor variables equal 0, including a spread of 0. Because games with a spread of 0 were removed from the data and we generally expect the log odds of an upset to be negative, the intercept describes a combination of predictor values that rarely, if ever, occurs and should not be over-interpreted.
The estimated coefficient for `opp_dome_team` is -0.284, suggesting that underdogs generally do worse against teams whose home stadium is a dome. However, the coefficient for the interaction between `opp_dome_team` and `late_season` is 0.364, telling us that late in the season, underdogs are much more likely to beat dome teams. Late in the season, compared to the odds of beating a team who does not play in a dome, the odds of beating a dome team are expected to multiply by a factor of \(e^{-0.284 + 0.364} = e^{0.08}\), or about 1.083, holding all else constant. However, the estimate of `late_season` alone lowers the odds of an upset by more than this change raises them. Therefore, the net effect of `opp_dome_team` and `late_season` depends on the type of underdog.
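As a quick check of this arithmetic:

```r
# Odds multiplier for a dome opponent late in the season:
# main effect plus interaction, exponentiated.
exp(-0.284 + 0.364)  # ~ 1.083
exp(-0.135)          # late_season alone: ~ 0.874, a larger drop than the gain above
```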
Additionally, the estimated standard deviation of the `opp_dome_team` effect across groups (0.122) is an order of magnitude larger than that of the random intercept (0.012), which suggests that the multilevel approach is a reasonable modeling choice: the underdog types differ meaningfully in how the dome-opponent effect plays out.
One of the main limitations of my analysis was the lack of actual football data in my model. My original idea was to predict the outcome of NFL games using data from the game itself, but this type of modeling does not make sense, as we would be using data that contributed to the result we are predicting. I also wanted to incorporate a level of the model that takes into account season-long team stats for both the underdog and their opponent. However, I again ran into the issue of potentially using individual game data in both level one and level two, and it was impractical to make level two cumulative for each week. Therefore, I focused my final model on binary predictors about the game itself to find the best predictors of an upset for home underdogs and away underdogs.
Another important thing to consider is that "underdog" in this paper is defined as the closing-line Vegas underdog. However, that does not always mean that the team encoded as the underdog was the betting underdog for the entire week. Many games have spreads that swing between -1 and +1, meaning one team can open as the favorite and close as the underdog. In addition to these games with an unclear favorite, some games close with a spread of 0, meaning there is no favorite or underdog, and I could not use those games in my model. If I were to repeat this process, I would consider only predicting games where the underdog is expected to lose by more than 3 points, because I am already losing data for games with a spread of 0, and many games near 0 do not have "true" underdogs or upsets.
Another problem I ran into with the spread was whether or not I should use the Vegas spread in my model at all. I first began by testing whether I should use the spread as a predictor variable in the logistic regression model. Because I had the result as well as the spread, I also considered fitting a linear regression model with the response variable `underdog_pred`, a numeric variable representing the number of points the underdog was expected to lose by. However, I realized that predicting whether or not the game was an upset was more applicable than predicting the final score, which is sometimes misleading. While I decided to continue with a model containing `spread`, I wonder how the results would have compared had I created a multilevel model without `spread`, using only facts relating to the game itself and not the teams.
The last issue I had to deal with was classifying teams into groups. I knew one of my main predictors was going to involve the weather of the game as well as the typical weather for both the underdog and their opponent. While the play-by-play data had weather for about half of the games, there were too many games with missing data, and I would have been limiting my findings to only half of the available data. For that reason, I did my best to create binary predictor variables that correspond to games late in the season, when it should be cold, as well as a variable for teams whose stadiums get abnormally cold in the winter. I believe my model would have performed much better if the temperature variable, as well as more accurate statistics about the teams and stadiums, had been available. However, I believe I did my best to tackle this problem given the information I had.