Introduction

The FIFA World Cup is the most prominent international soccer tournament in the world, drawing a projected audience of over 5 billion viewers across the 29-day tournament. It is held every four years and brings together the best national teams from around the globe to compete for the title of world champions. Not only does the World Cup give players an opportunity to showcase their skills on the biggest stage, it also generates an enormous amount of interest and excitement (and thus revenue) and gives people from all over the world a chance to come together and celebrate their love of soccer, fostering a sense of global unity and understanding.

For a subset of the followers of the World Cup, sports betting has become a prominent part of the experience. The sports betting industry is a large and growing market in which people place bets on the outcomes of various sports events. This can include bets on individual games or on the overall results of a season or tournament. According to Bloomberg, an estimated $35 billion will be wagered on the 2022 FIFA World Cup, a 65% increase over the previous World Cup.

An integral part of sports betting is the use of statistics, which can provide valuable information about the likelihood of certain outcomes in a given game or match. By using statistics, bettors can assess the true risk of various bets (as opposed to a sportsbook’s “odds”) and make more informed decisions about which bets to place, giving them a better chance of winning and achieving profitable returns.

In short, the goal of our project is to build a predictive model for the 2022 World Cup games using historical international football results and FIFA team rankings, in order to provide sports bettors with useful information about the probability of certain outcomes in this year’s World Cup. We aim to predict goal-line bets (point spread/goal differential) as well as three-way moneyline bets (home win, away win, or draw) for the group stage, comparing our estimated odds for each match with the odds given by sportsbooks to determine which bets are more likely to be profitable than the posted odds suggest.

The dataset we use is a set of thousands of international soccer matches from June 2002 to June 2021, with metrics including team FIFA rank, team FIFA points, match results, offense/defense/midfield rating metrics, and more. We will first fit Poisson regression models on the home and away team scores to predict each team’s score distribution, with a lasso penalty to select significant predictor variables and reduce collinearity. Using the selected predictors, we will then fit a bivariate Poisson model that accounts for the dependency between the home and away teams’ goal distributions. Poisson regression is a sensible choice because goals scored are small, non-negative counts, exactly the type of response the model is intended for. Obtaining full probability distributions over results (rather than single point predictions) is important in our case because sports bettors need the distribution of outcomes in order to quantify their risk.

Data

Our analysis utilized a few different datasets, which were combined (and later cleaned) into one final dataset. The main dataset, found on GitHub, gathers data from Wikipedia, the Rec.Sport.Soccer Statistics Foundation (a group which “strives to be the most comprehensive and complete” archive of soccer statistics), and individual soccer team websites. It features the results of 44,341 international soccer matches between 1872 (the year of the first official international match) and 2022.

We also used three other datasets to provide the predictor variables we need to analyze the results and scores of international soccer matches: FIFA World Rankings scraped from 2002 onwards; FIFA Player and Team Data, which details the ratings, positions, and other metrics of individual players in the FIFA video game series from the 2015 through 2022 editions of the game; and a box-score dataset, which provides additional game-specific statistics such as shots on target, possession, and red/yellow cards.

Data Cleaning

In order to combine these datasets into a usable one, we first had to clean the data. Upon merging all of the relevant datasets for the international matches, we discovered that many of these matches had missing values for box scores or FIFA ratings. Because we wanted to be able to consider all of these potential predictors in our model diagnostics, we elected to remove those observations. This left us with a dataset of mostly recent international matches (box-score data was not widely recorded in soccer until the 2010s).

Our final dataset held data for 786 international matches, including box score data for each team. To make building a predictive model easier, we combined the data for the two teams in each match into “differential” metrics, each computed as the difference between a statistic for the home team and the same statistic for the away team. Our final set of predictors included the FIFA goalkeeper rating differential, FIFA defense rating differential, FIFA midfield rating differential, FIFA offense rating differential, possession percentage differential, shots taken differential, shots on target differential, fouls differential, yellow cards differential, red cards differential, FIFA team ranking differential, whether the match was played at a neutral stadium (i.e., neither team was playing in its home stadium), and the teams playing in the match. Since in-game statistics such as shots on target, fouls, and yellow/red cards are not available before a match starts, we used each team’s averages over its past 5 games.

For our response variables, we have the home team’s score in the match, the away team’s score in the match, the computed score differential (home team score minus away team score), and a categorical outcome of the game with three levels: home win, home loss, and draw (note that for neutral-site matches, a team is randomly assigned to be the home team).
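As a concrete illustration of these two steps (the differential metrics and the past-5-game averages), here is a minimal sketch using dplyr and zoo. The data frame and column names (matches, team_history, home_offense_rating, shots_on_target, and so on) are placeholders for illustration, not our actual variable names.

    library(dplyr)
    library(zoo)

    # Differential metrics: home value minus away value for the same statistic
    # (column names are illustrative placeholders).
    matches <- matches %>%
      mutate(
        offense_differential = home_offense_rating - away_offense_rating,
        shots_on_target_differential = home_shots_on_target - away_shots_on_target
      )

    # In-game statistics are unknown before kickoff, so each team's value is
    # replaced by its average over the previous 5 matches (excluding the current one).
    team_history <- team_history %>%
      group_by(team) %>%
      arrange(date, .by_group = TRUE) %>%
      mutate(
        avg_shots_on_target_last5 = rollapplyr(lag(shots_on_target), width = 5,
                                               FUN = mean, partial = TRUE, na.rm = TRUE)
      ) %>%
      ungroup()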

For the predictor variables above, it is important to note that a positive differential does not always indicate a good outcome for the home team - for example, a positive shots on target differential means the home team placed more shot attempts on goal than the away team, which is a positive outcome, whereas a positive fouls (or yellow/red cards) differential means the home team committed more fouls, which is a negative outcome. This is important to keep in mind when reading the graphs and models featured later in this report.

Lastly, we also had to create a World Cup 2022 dataset with the same columns as our model dataset in order to use it as a final test set. To do this, we acquired a CSV with the group-stage teams and their FIFA metrics (offense, defense, midfield, rank, points, goalkeeper), then used averages over each team’s last 5 games as values for the other predictors (average goals, average shots on target, etc.). We had to do this because, in the sports betting setting, the game has not yet occurred, so those in-game metrics are not available; we therefore use recent averages as the inputs. Finally, using these values, we computed the home-minus-away differentials so that the test set has exactly the same columns as our model dataset.

Exploratory Data Analysis

To begin our analysis, we first looked at the distribution of the score differentials and the game outcomes. Since the outcome of an international match is a categorical variable that can take only three values, we summarize it in a table:

International Match Results
Home Team Result    Count
Draw                201
Lose                227
Win                 369

We see that a home team win is the most common outcome - this will be an important thing to factor into our model, as it could skew predictions in favor of the home team. Diving further into these games, we next look at a histogram of the score differentials of all the matches in our dataset:

The score differentials range from -6 to 7, and the distribution of score differentials appears to be close to normal. The graph indicates that when a team wins or loses, it is most often by only a goal or two, as those outcomes make up the bulk of the distribution. The median (blue line) of the score differentials is zero and the mean (red line) is 0.37.
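For reference, a minimal base R sketch that would produce this kind of histogram with the median and mean marked is below; the column name score_differential is an assumption, not necessarily our actual variable name.

    # Histogram of score differentials with the median (blue) and mean (red) marked.
    hist(matches$score_differential,
         breaks = seq(-6.5, 7.5, by = 1),
         main = "Distribution of Score Differentials",
         xlab = "Home score minus away score")
    abline(v = median(matches$score_differential), col = "blue", lwd = 2)
    abline(v = mean(matches$score_differential), col = "red", lwd = 2)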

Next, we wanted to see if we could associate any of the predictors visually with the outcome of the game. The predictor variable that appeared to have the largest effect on the goal differential was the shots on target differential:

This made sense to us, as an increase in the number of shots on target relative to the other team means more opportunities for a team to score. Therefore, an increase in the shots on target differential indicates that a team is more likely to win (and to win by more goals).

We also saw similar trends (but to a lesser extent) with each of the rating metrics from the FIFA series, shown below:

We see positive relationships between each of these potential predictors and the goal differential. These trends are something interesting we would like to explore in our numerical and probabilistic models. However, the similarly-distributed scatterplots and lines of best fit could indicate multicollinearity between these predictors - this is something we must explore and do our best to avoid when fitting our models.
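A quick way to check this suspicion is to look at the pairwise correlations among the FIFA rating differentials; a short sketch (with assumed column names) is:

    # Pairwise correlations among the FIFA rating differentials; values near 1
    # would confirm the multicollinearity suggested by the similar scatterplots.
    rating_cols <- c("goalkeeper_differential", "defense_differential",
                     "midfield_differential", "offense_differential")
    round(cor(matches[, rating_cols], use = "complete.obs"), 2)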

Methodology

Model selection and validation

We initially chose to use a GLM-based Poisson regression model to predict goals for the home and away teams in each match, as Poisson regression is most appropriate when the response is a discrete count, as is the case for goals scored in a match. We wanted to perform both regularization and variable selection, since our dataset had a very large number of correlated predictors and we suspected that performance could be predicted using a much sparser set of variables; therefore, we introduced a LASSO penalty using the glmnet package and used 5-fold cross-validation to select the optimal value of lambda. However, we quickly realized that since the outcome of a soccer game depends heavily on how two specific teams interact with each other, we could not use univariate models that would predict a team’s score in a vacuum. Hence, we finally decided to use a bivariate Poisson regression model to account for these possible correlations.
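A rough sketch of this univariate LASSO step is shown below. It assumes a data frame predictors holding only the differential predictor columns, and home_score/away_score responses in matches; it is illustrative rather than a reproduction of our exact code.

    library(glmnet)

    # Design matrix built from the differential predictors (numeric columns assumed).
    x <- model.matrix(~ ., data = predictors)[, -1]

    # LASSO-penalized Poisson regression for home goals, with 5-fold CV over lambda.
    cv_home <- cv.glmnet(x, matches$home_score, family = "poisson",
                         alpha = 1, nfolds = 5)

    # The away-goal model is fit in the same way.
    cv_away <- cv.glmnet(x, matches$away_score, family = "poisson",
                         alpha = 1, nfolds = 5)

    # Coefficients at the CV-optimal lambda; terms shrunk to zero are dropped.
    coef(cv_home, s = "lambda.min")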

The general form of this model is as follows (Karlis & Ntzoufras, 2005):

Consider random variables \(X_k\), k = 1, 2, 3, which follow independent Poisson distributions with parameters \(\lambda_k\), respectively. Then the random variables \(X = X_1+X_3\) and \(Y = X_2+X_3\) jointly follow a bivariate Poisson distribution, so E(X) = \(\lambda_1 + \lambda_3\) and E(Y) = \(\lambda_2 + \lambda_3\). In our case, X represents goals scored by the home team, while Y represents goals scored by the away team.

When we add covariates, we then have:

\[(X_i, Y_i) \sim BP(\lambda_{1i}, \lambda_{2i}, \lambda_{3i})\] \[\log(\lambda_{1i}) = w_{1i}^T\beta_1\] \[\log(\lambda_{2i}) = w_{2i}^T\beta_2\] \[\log(\lambda_{3i}) = w_{3i}^T\beta_3\]

where i = 1, . . . , n is the observation number, \(w_{ki}\) is the vector of predictors for the i-th observation used to model \(\lambda_{ki}\), and \(\beta_k\) denotes the corresponding vector of regression coefficients, k = 1, 2, 3.
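In code, the log link means each rate is simply the exponential of a linear predictor. The small sketch below illustrates this structure; the coefficient vectors are rounded versions of the fitted values reported later in Model Results, while the predictor vector w is made up purely for illustration.

    # Rate parameters under the log link: lambda_k = exp(w' beta_k).
    # w holds the intercept and the defense/midfield/offense differentials (made-up values).
    w     <- c(1, 5, -2, 3)
    beta1 <- c( 0.17,  0.024,  0.010,  0.013)   # home-specific component
    beta2 <- c( 0.06, -0.004, -0.025, -0.024)   # away-specific component
    beta3 <- c(-2.97, -0.089,  0.156, -0.025)   # shared component inducing correlation

    lambda1 <- exp(sum(w * beta1))
    lambda2 <- exp(sum(w * beta2))
    lambda3 <- exp(sum(w * beta3))

    # Expected goals: E(home) = lambda1 + lambda3, E(away) = lambda2 + lambda3.
    c(expected_home = lambda1 + lambda3, expected_away = lambda2 + lambda3)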

Upon further research, we found that bivariate Poisson models with LASSO regularization do not appear to have a readily available implementation. To capture the potential benefits of this framework as much as possible, we decided to screen our variables using the results of our univariate LASSO models, choosing the sparser of the two selected variable sets (home and away) as the final predictors in the bivariate model.
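A sketch of this screening step, assuming the two cross-validated fits cv_home and cv_away from the earlier sketch, could look like:

    # Names of predictors with nonzero LASSO coefficients in a cv.glmnet fit.
    nonzero_vars <- function(cv_fit) {
      b <- as.matrix(coef(cv_fit, s = "lambda.min"))
      setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
    }

    home_vars <- nonzero_vars(cv_home)
    away_vars <- nonzero_vars(cv_away)

    # Keep the sparser of the two selected sets as the bivariate model's predictors.
    screened_vars <- if (length(home_vars) <= length(away_vars)) home_vars else away_vars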

During this process, we tried a number of other models and datasets to build the most robust model possible - this included a multinomial model that predicted outcomes directly, as well as a bivariate Poisson model on a far larger dataset containing possession stats, shot accuracy, etc. for individual matches, which we could use to calculate running averages for each team. However, when comparing accuracy using misclassification error from Monte Carlo draws (this will be elucidated shortly), both with 5-fold cross-validation on our training set and when predicting outcomes for the test set (i.e., the 2022 group-stage games), we found that our original dataset and model worked best and were the most statistically robust.

Assumptions

Clearly, independence cannot be fully satisfied here, as a team’s result in one match is not independent of its results in others; however, for the purposes of our analysis, we assume that our fairly large sample size counters this to some extent and proceed. With that caveat, the two primary assumptions that need to be satisfied are:

  1. The response is a count, and these counts are Poisson distributed. In particular, the mean should be roughly equal to the variance of goals scored across the historical matches we use as a training set.
  2. There is a linear relationship between the log of each rate parameter (i.e., \(\log{\lambda_k}\)) and the predictor variables.

The first of these is undoubtedly the most important - we can examine the mean and variance of historical goals, as well as visualize their distribution below (it is self-evident that the response is a count):

We see that for both home and away goals, mean and variance across games are nearly equal, while the distributions themselves also look roughly Poisson distributed. We can assume this condition is satisfied.
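The dispersion check itself is a one-liner; a sketch (with the home_score/away_score column names assumed) is:

    # Poisson check: the mean and variance of goals scored should be roughly equal.
    c(home_mean = mean(matches$home_score), home_var = var(matches$home_score),
      away_mean = mean(matches$away_score), away_var = var(matches$away_score))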

The second assumption is strongly satisfied for \(\lambda_1\) and \(\lambda_2\), while slightly less strongly satisfied for \(\lambda_3\). See Figure 1 in Appendix for details.

Creating Moneyline Estimations

As sports bettors are generally interested in a probabilistic outlook on game outcomes to make informed decisions, we decided to run 100,000 Monte Carlo simulations of each game to obtain estimated probabilities of the game closing with a home team win, loss, or draw - this is the “three-way moneyline” referred to in our introduction. We did this by drawing randomly from bivariate Poisson distributions whose parameters were estimated by our regression model from the predictor values. Although moneyline estimations are generally presented in an odds-based format, we chose to report probabilities instead, as this is more intuitive from a statistical standpoint. American odds can, however, easily be converted to probabilities using the following formulas:

\(\text{Negative Odds: Probability} = \frac{-1\times\text{odds}}{-1\times \text{odds} +100}\)

\(\text{Positive Odds: Probability} = \frac{100}{\text{odds} +100}\)
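As a small helper, this conversion can be written as a function; the sketch below is illustrative rather than part of our original code.

    # Convert American moneyline odds to implied probabilities.
    # Negative odds (favorites): p = -odds / (-odds + 100); positive: p = 100 / (odds + 100).
    american_to_prob <- function(odds) {
      ifelse(odds < 0, -odds / (-odds + 100), 100 / (odds + 100))
    }

    american_to_prob(c(148, 208, 223))   # e.g., a pregame win/draw/loss line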

We did try computing goal-line estimations, as we had initially hoped to; however, we had limited success with their accuracy and could not find published estimations to compare against, so we chose to exclude them from this report.
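To make the moneyline simulation described above concrete, here is a minimal sketch for a single match. It uses the trivariate-reduction form of the bivariate Poisson and assumes the three rate parameters lambda1, lambda2, and lambda3 have already been computed from the fitted model, as in the earlier sketch.

    set.seed(2022)
    n_sims <- 100000

    # Bivariate Poisson draws via trivariate reduction:
    # home = X1 + X3, away = X2 + X3, with X_k ~ Poisson(lambda_k) independent.
    x1 <- rpois(n_sims, lambda1)
    x2 <- rpois(n_sims, lambda2)
    x3 <- rpois(n_sims, lambda3)
    home_goals <- x1 + x3
    away_goals <- x2 + x3

    # Estimated three-way moneyline probabilities for this match.
    c(win  = mean(home_goals > away_goals),
      loss = mean(home_goals < away_goals),
      draw = mean(home_goals == away_goals))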

Model Results

Interpretation

Bivariate Poisson Model Coefficients
Parameter                       Estimate
(l1):(Intercept)                 0.1717589
(l1):defense_differential        0.0237813
(l1):midfield_differential       0.0095639
(l1):offense_differential        0.0133452
(l2):(Intercept)                 0.0591298
(l2):defense_differential       -0.0044015
(l2):midfield_differential      -0.0250488
(l2):offense_differential       -0.0237410
(l3):(Intercept)                -2.9651838
(l3):offense_differential       -0.0253598
(l3):defense_differential       -0.0890923
(l3):midfield_differential       0.1558049

Our final predictors for the bivariate Poisson model were defense_differential, midfield_differential, and offense_differential. This suggests that the FIFA ratings of the players on the respective teams are the most indicative predictors of match score outcomes. To summarize each predictor’s effect, we combine the coefficients of the two \(\lambda\) components that contribute to a team’s goals (\(\lambda_1\) and \(\lambda_3\) for home goals, \(\lambda_2\) and \(\lambda_3\) for away goals). For every one unit the home defense is better than the away team’s, we expect the rate parameter for home goals to be multiplied by a factor of \(e^{0.0237813-0.0890923}\) = 0.937. For every one unit the home midfield is better, we expect the rate parameter for home goals to be multiplied by a factor of \(e^{0.0095639+0.1558049}\) = 1.18. For every one unit the home offense is better, we expect the rate parameter for home goals to be multiplied by a factor of \(e^{0.0133452-0.0253598}\) = 0.988.

For every one unit the home defense is better than the away team’s, we expect the rate parameter for away goals to be multiplied by a factor of \(e^{-0.0044015-0.0890923}\) = 0.911. For every one unit the home midfield is better, we expect the rate parameter for away goals to be multiplied by a factor of \(e^{-0.0250488+0.1558049}\) = 1.14. For every one unit the home offense is better, we expect the rate parameter for away goals to be multiplied by a factor of \(e^{-0.0237410-0.0253598}\) = 0.952.

From the model coefficients, it is interesting to note that some of the conclusions for home and away goals are seemingly contradictory - for example, our model says that for every one unit the home midfield is better than the away team’s, the rate parameter for home goals is multiplied by a factor of \(e^{0.0095639+0.1558049}\) = 1.18, but also that the rate parameter for away goals is multiplied by a factor of \(e^{-0.0250488+0.1558049}\) = 1.14. However, since 1.18 is still greater than 1.14, the expected goal differential (home minus away) still moves in the home team’s favor - when the home midfield is one unit better, the rate parameter for home goals grows faster than the rate parameter for away goals. The coefficients for defense and offense seem somewhat counterintuitive, as they state that as the defense/offense rating differential increases, the rate parameter for home goals decreases, implying that teams should weaken their defense/offense to score more, which does not really make sense. A possible interpretation is to view a team’s emphasis on defense, midfield, and offense as a spectrum: the more a team focuses on defense, the less it focuses on offense, and thus the less likely it may be to score, or vice versa.

For sports bettors, the main takeaway is that a team’s player composition - its defense, midfield, and offense player ratings - is the most important set of metrics to investigate (ahead of metrics such as shots or saves in past games) when predicting the result of a matchup. The relative strengths of the two teams in defense, midfield, and offense are indicative of how many goals each will score relative to the other, and thus of the final outcome of the game.

Estimated Moneyline Probabilities

Based on our model, we can generate moneyline predictions for all matches as below; as mentioned earlier, we chose to stick with probabilities for a more intuitive understanding of match outcomes from a statistical point of view.

Match Number Home Team Away Team Win Prob Loss Prob Draw Prob
1 Qatar Ecuador 0.295 0.428 0.277
2 Senegal Netherlands 0.254 0.479 0.266
3 England Iran 0.634 0.131 0.235
4 United States Wales 0.387 0.322 0.291
5 France Australia 0.727 0.092 0.181
6 Denmark Tunisia 0.602 0.154 0.244
7 Mexico Poland 0.408 0.299 0.293
8 Argentina Saudi Arabia 0.748 0.081 0.171
9 Belgium Canada 0.580 0.177 0.243
10 Spain Costa Rica 0.685 0.110 0.204
11 Germany Japan 0.615 0.152 0.233
12 Morocco Croatia 0.290 0.440 0.270
13 Switzerland Cameroon 0.494 0.228 0.278
14 Uruguay South Korea 0.520 0.207 0.273
15 Portugal Ghana 0.560 0.179 0.261
16 Brazil Serbia 0.564 0.182 0.254
17 Wales Iran 0.437 0.270 0.293
18 Qatar Senegal 0.240 0.490 0.270
19 Netherlands Ecuador 0.607 0.156 0.238
20 England United States 0.594 0.166 0.240
21 Tunisia Australia 0.417 0.298 0.286
22 Poland Saudi Arabia 0.558 0.197 0.246
23 France Denmark 0.488 0.251 0.262
24 Argentina Mexico 0.573 0.179 0.248
25 Japan Costa Rica 0.463 0.256 0.282
26 Belgium Morocco 0.521 0.214 0.266
27 Croatia Canada 0.543 0.197 0.260
28 Spain Germany 0.389 0.322 0.289
29 Cameroon Serbia 0.275 0.445 0.281
30 South Korea Ghana 0.352 0.361 0.287
31 Brazil Switzerland 0.585 0.171 0.244
32 Portugal Uruguay 0.467 0.254 0.280
33 Wales England 0.212 0.528 0.260
34 Iran United States 0.336 0.395 0.269
35 Ecuador Senegal 0.327 0.386 0.287
36 Netherlands Qatar 0.694 0.104 0.202
37 Australia Denmark 0.175 0.596 0.230
38 Tunisia France 0.144 0.628 0.229
39 Poland Argentina 0.208 0.544 0.248
40 Saudi Arabia Mexico 0.226 0.504 0.270
41 Croatia Belgium 0.363 0.342 0.295
42 Canada Morocco 0.336 0.367 0.298
43 Japan Spain 0.194 0.551 0.255
44 Costa Rica Germany 0.146 0.632 0.222
45 Ghana Uruguay 0.294 0.434 0.271
46 South Korea Portugal 0.195 0.564 0.241
47 Serbia Switzerland 0.403 0.317 0.280
48 Cameroon Brazil 0.143 0.639 0.218

A good barometer that our probabilities are generally in the right direction is to compare them to the odds marked by professional sports books. Using an aggregation of odds from OddsPortal (a popular sports betting website) and converting them to probabilities, we found that the median difference in home team win probabilities was only about 0.059, with the corresponding medians for loss and draw probabilities being roughly 0.041 and 0.026.

Predictive Accuracy

Finally, based on our model’s probabilities, we wanted to see how often our model’s most likely outcome for a given match would be the correct one. Our confusion matrix is below:

Confusion Matrix for Model
Actual Outcome    Likely Loss    Likely Win
Draw              2              8
Loss              12             7
Win               5              14

This implied a home team win accuracy of about 74%, a home team loss/away team win accuracy of about 63%, and an overall accuracy of about 54%. Note that our model did not predict any draws at all, despite there being 10 draws across the 48 games, which put our draw accuracy at zero. Although this last finding seems problematic, it must be acknowledged that draws are notoriously difficult to predict accurately - in fact, using the aggregation of odds from OddsPortal to predict outcomes leads to exactly the same behavior:

Confusion Matrix for Sports Books
Actual Outcome    Likely Loss    Likely Win
Draw              3              7
Loss              10             9
Win               5              14

Clearly, no “Likely Draw” column exists here either. We can also calculate accuracies as above: the sports books’ win accuracy is about 74% (14 of 19), their loss accuracy about 53% (10 of 19), and their overall accuracy 50%. We see that our model performed somewhat better than the sports books’ odds for home team losses/away team wins; the ultimate effect and implications of this will be discussed briefly in our conclusion.
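For completeness, a sketch of how these confusion matrices and accuracies can be computed from the simulated probabilities is below; the results data frame and its columns (win_prob, loss_prob, draw_prob, actual_outcome) are assumed names, not our literal objects.

    # Most likely outcome for each match according to the simulated probabilities.
    outcomes  <- c("Win", "Loss", "Draw")
    predicted <- outcomes[apply(results[, c("win_prob", "loss_prob", "draw_prob")],
                                1, which.max)]

    # Rows are actual outcomes, columns are the model's most likely calls.
    confusion_mat <- table(Actual = results$actual_outcome, Predicted = predicted)

    # Per-outcome and overall accuracy.
    win_accuracy     <- mean(predicted[results$actual_outcome == "Win"] == "Win")
    overall_accuracy <- mean(predicted == results$actual_outcome)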

Conclusion

To summarize our work - our model tackles the most important bet in soccer, the three-way moneyline (win, loss, draw). For the group stage of the 2022 World Cup, we produced probabilities for each possible match outcome by modeling the number of home goals and away goals with a bivariate Poisson model. We further analyzed our model coefficients, what they say about the drivers of performance in soccer, and their implications for sports bettors. Finally, we compared our model’s average accuracy over all matches to the accuracy implied by sports books’ odds.

In closing out our analysis, we wanted to see if our model could actually make us money on average; with payouts based on the pregame moneyline odds taken from OddsPortal, we chose to simply bet on the most likely outcome for each match according to our model. Of course, this is not the optimal betting strategy - importantly, the objective of the model is to mark odds as accurately as possible for each match, not to tell bettors whom to bet on - but it gives us an idea of how profitable the model would be if we were to make the safest bets possible.

For moneyline bets, our model would have obtained a net profit of -$327; i.e., it would lose $327 over all 48 bets across the group-stage matches. This assumes a flat $100 bet on each match on the most likely line as predicted by our model, as previously mentioned. One could apply a number of more complex betting strategies given the set of odds our model generates, but that is beyond the scope of this project. However, while our model lost money, it still outperforms a bettor who places the same $100 on the most likely line as indicated by the sports books’ odds, who would lose roughly $2,600. This makes sense - according to most sports betting experts, a sound strategy involves betting on underdogs at least some of the time, i.e., betting on less likely outcomes in at least some matches. Lines for unlikely outcomes have far greater payouts, so while betting on likely outcomes means making the right call more often, it does not ensure long-term profit. It is reassuring all the same that our model’s odds and calls were marked more accurately for these “safe” bets.
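A sketch of this flat-stake evaluation is below; it assumes a data frame bets with the model’s pick, the actual outcome, and the American odds for the pick, which is a hypothetical structure rather than our literal object.

    # Profit on a flat $100 bet at American odds: positive odds pay `odds` dollars per
    # $100 staked, negative odds pay 100 * 100 / |odds|; a losing bet costs the stake.
    bet_profit <- function(won, odds, stake = 100) {
      payout <- ifelse(odds > 0, stake * odds / 100, stake * 100 / abs(odds))
      ifelse(won, payout, -stake)
    }

    net_profit <- sum(bet_profit(bets$pick == bets$actual, bets$odds_for_pick))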

One intuitive way to apply our findings to future matchups is to spot significant differences between the sports book odds and the ones our model puts out. A significant difference could indicate a value bet: an outcome a sports book considers less likely but our model considers more likely. For example, consider the Ecuador - Senegal match, which had a pregame line of +148, +208, and +223 for an Ecuador win, draw, and loss respectively. Those odds imply probabilities of roughly 0.40, 0.32, and 0.31, making an Ecuador win the books’ most likely outcome, whereas our model gave probabilities of roughly 0.33 (win), 0.29 (draw), and 0.39 (loss), predicting an Ecuador loss - which did happen, as Ecuador lost to Senegal 1-2.

Most of our work’s limitations stem from the nature of the problem we are trying to solve - sports are inherently hard to predict, which is why no model we built achieved outstanding accuracy. If building high-performing models were easy, sports books would simply go out of business! Much of the intrigue of sports betting lies in knowing when to make the call on the underdog, something our model simply cannot capture. Furthermore, it is difficult to take into account the interaction between two specific teams, as they prepare customized game plans that differ by opponent and are not released to the public. These intangibles were not accounted for by the predictors on which our models were built. Sports bettors should also exercise caution when using our model (or any model) to predict draws; as we saw, it predicted none of the 10 draws across the 48 matches correctly. This limitation is likely linked to the low-scoring nature of soccer. Future work may include building models that target other bets, such as the over/under on total goals scored or various team/player prop bets.

Appendix

Figure 1

Citations

“The Introduction Page of the RSSSF – the Rec.sport.soccer Statistics Foundation.” The Introduction Page of the RSSSF – The Rec.Sport.Soccer Statistics Foundation., https://rsssf.org/.

Karlis, Dimitris, and Ioannis Ntzoufras. “Bivariate Poisson and Diagonal Inflated Bivariate Poisson Regression Models in R.” Journal of Statistical Software, vol. 14, no. 10, Sept. 2005, https://doi.org/10.18637/jss.v014.i10.

Leone, Stefano. “FIFA 22 Complete Player Dataset.” Kaggle, 1 Nov. 2021, https://www.kaggle.com/datasets/stefanoleone992/fifa-22-complete-player-dataset.

martj42. “Martj42/international_results.” GitHub, https://github.com/martj42/international_results.

“Men’s Ranking.” FIFA, https://www.fifa.com/fifa-world-ranking/men?dateId=id13792.

“Search Results.” Wikipedia, Wikimedia Foundation, https://en.wikipedia.org/wiki/Special:Search?go=Go&search=soccer%2Bresults&ns0=1.

“World Cup 2022 Results & Historical Odds.” Oddsportal.com, LiveSport, https://www.oddsportal.com/soccer/world/world-cup-2022/results/.