Today I am super excited to share with you an article from some of the global academic leaders in soccer analytics!
The founding members of the AUEB Sports Analytics Group, Ioannis Ntzoufras and Dimitris Karlis, have been publishing important papers since 2003. If you are seriously interested in soccer analytics, as a professional or as a researcher, you MUST look at their work if you have not done so already. They have been a personal inspiration to me and many others as they have opened the field to many sports analytics enthusiasts. I was actually lucky enough to do my thesis with Prof. Karlis a few years back.
Recently they published articles with their EURO 2020 predictions using advanced statistical modeling here and here, both in Greek, alongside Leonardo Egidi from the University of Trieste. Their predictions have in fact been quite successful! Most importantly though, their work will soon be open to an even larger audience. With an upcoming book on soccer analytics and an R package close to its final version, these are promising times for the analytics field.
I would like to thank the authors for allowing us to republish their article. It is loosely translated from Greek to English by myself. Feel free to reach out for any questions. Enjoy!
The article below was originally published in Greek at the link here on June 26, 2021.
EURO 2020: Round of 16 – Predictions Based on Statistical Football Analytics Models
After our first analysis on June 22, 2021, which covered the last games of the EURO 2020 Group Stage, we move on to the scientific predictions for the Round of 16 of the Knockout Stage.
In the previous phase, our model did well, correctly predicting the winner (not the score) in 5 out of 6 games. The only game we got wrong was Germany against Hungary, which surprised everyone: Germany came close to elimination against a combative team that no one expected to play so well.
Note here that statistical and machine learning models will not predict upsets; they quantify what we expect to happen based on each team’s performance up to that point. If a team does not do well in its latest games, our model will “learn” and decrease that team’s chances of winning, without looking at how “big” the team is or what players it has (this is captured indirectly through the historical data). Moreover, if the historical data are plentiful and only the last 2-3 games are poor, the probabilities are not expected to change much. In this case, the slow learning and adjustment is not the model’s fault; it is (mainly) the data we fed the model that may not reflect the current state of the team (and, secondarily, the structure of the model may not be flexible enough).
We mention all this so that you don’t think there is a magic model or equation that will always find the winner. If there were such a model, be certain that we would not be presenting it here but would have used it for our own benefit. Even though the models will not predict the future precisely, they are very useful because they put numbers on what we see and help us understand the significance of an upset (i.e. a game whose result we did not expect). Before we move on to the predictions, let us recall some basic details about the methodology we used.
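To make the point about slow adjustment concrete, here is a minimal sketch using a simple Gamma-Poisson update. This is not the authors’ model; the function name and all numbers are hypothetical and chosen only for illustration. The same three poor games move the estimated scoring rate much less when the history behind the team is long.

```python
def posterior_mean_goals(prior_goals, prior_games, recent_goals, recent_games):
    """Posterior mean goals per game under a Gamma(prior_goals, prior_games)
    prior combined with Poisson-distributed goals in the recent games."""
    return (prior_goals + recent_goals) / (prior_games + recent_games)


# Hypothetical team that averaged 2 goals per game historically,
# then scored only 1 goal in total over its last 3 games.
long_history = posterior_mean_goals(prior_goals=80, prior_games=40,
                                    recent_goals=1, recent_games=3)
short_history = posterior_mean_goals(prior_goals=8, prior_games=4,
                                     recent_goals=1, recent_games=3)

print(f"40 historical games: {long_history:.2f} goals per game")
print(f" 4 historical games: {short_history:.2f} goals per game")
```

With 40 historical games the estimate stays close to 2 goals per game, while with only 4 it drops sharply, which is the behavior described above.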
A few words about the model
The technique and art of statistical modeling apply directly to sports, and to soccer in particular, where they can produce reliable predictions for upcoming games at exactly the moments when fan interest rises sharply.
The use of statistical techniques for predicting the outcomes of soccer games first appeared in the scientific literature in 1968, with the pioneering publication of Reep & Benjamin. The next true innovations came with Michael Maher’s work in the 1980s and with Lee’s 1997 paper, which asked whether Manchester United was truly the best team. The question was answered in the affirmative using a simple statistical model and simulation. This analysis set the foundations of modern modeling in soccer and sports. The next important publications were the Dixon & Coles paper in 1997 and the bivariate Poisson model of Karlis and Ntzoufras in 2003 (two of the authors of this analysis). These two models set the foundation of modern prediction models for soccer games.
The statistical model of Athens University of Economics and Business professors Karlis and Ntzoufras is based on an extension of the well-known Poisson distribution to predict the number of goals each team will score. The expected number of goals is written as a function of the home effect, which can thus be quantified, and of the attacking and defensive abilities of the teams. A variation of this model is used here to predict the EURO 2020 games. In addition, the model uses time-dynamic parameters that reflect team strength, as well as the difference in ranking between the two opponents based on the Coca-Cola FIFA ranking of May 27th, 2021. The model was estimated using the Bayesian approach with the statistical software R and Stan. The accuracy of these predictions is comparable to that of the predictions used by betting companies.
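As a rough illustration of how expected goals can be written as a function of these ingredients, the sketch below assumes a log-linear form. The function name, the sign conventions (a larger defensive parameter meaning more goals conceded, a positive ranking difference favoring Team 1) and every numeric value are assumptions made for this example, not estimates from the authors’ model.

```python
import math


def expected_goals(att_1, def_1, att_2, def_2, rank_diff, gamma, home=0.0):
    """Return the expected goals (rate_1, rate_2) of the two teams.

    Assumed form (illustrative only):
      log(rate_1) = home + att_1 + def_2 + (gamma / 2) * rank_diff
      log(rate_2) =        att_2 + def_1 - (gamma / 2) * rank_diff
    """
    log_rate_1 = home + att_1 + def_2 + 0.5 * gamma * rank_diff
    log_rate_2 = att_2 + def_1 - 0.5 * gamma * rank_diff
    return math.exp(log_rate_1), math.exp(log_rate_2)


# Hypothetical parameter values for a match at a neutral venue (home = 0).
rate_1, rate_2 = expected_goals(att_1=0.2, def_1=-0.1,
                                att_2=0.0, def_2=0.1,
                                rank_diff=0.5, gamma=0.4)
print(f"Expected goals: Team 1 {rate_1:.2f}, Team 2 {rate_2:.2f}")
```

Working on the log scale keeps both rates positive whatever values the parameters take, which is why this form is common for goal-count models.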
The definition of the model is given in detail at the end of this article.
The Model’s Predictions for the Round of 16
The predictions of the model are summarized in the table that follows. Along with the probabilities for each result, the table gives the score with the highest probability (with its probability in parentheses) and the expected score, rounded to the nearest integer.
| Team 1 | Team 2 | Team 1 Win | Draw | Team 2 Win | Most Probable Score (Probability) | Expected Score (Rounded) | Qualification |
|---|---|---|---|---|---|---|---|
| Wales | Denmark | 0.370 | 0.287 | 0.344 | 0-0 (0.122) | 1-1 | Even match |
| Netherlands | Czech Republic | 0.569 | 0.241 | 0.190 | 1-0 (0.132) | 2-1 | Netherlands, but the Czech Republic has chances |
| Croatia | Spain | 0.327 | 0.269 | 0.403 | 1-1 (0.124) | 1-1 | Balanced match, edge to Spain |
| Sweden | Ukraine | 0.539 | 0.246 | 0.215 | 1-0 (0.129) | 2-1 | Sweden, but Ukraine has chances |
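The quantities in such a table can all be derived from a grid of score probabilities. The sketch below is a simplification: it assumes independent Poisson goal counts rather than the authors’ model, and the two scoring rates are hypothetical. It shows how the most probable score and the rounded expected score can differ, as they do in several rows above.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def summarize_match(rate_1, rate_2, max_goals=15):
    """Outcome probabilities, most probable score and rounded expected score."""
    win_1 = draw = win_2 = 0.0
    best_score, best_prob = (0, 0), 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            # Independence between the two teams' goals is an assumption here.
            p = poisson_pmf(x, rate_1) * poisson_pmf(y, rate_2)
            if x > y:
                win_1 += p
            elif x == y:
                draw += p
            else:
                win_2 += p
            if p > best_prob:
                best_score, best_prob = (x, y), p
    return {
        "win_1": win_1,
        "draw": draw,
        "win_2": win_2,
        "most_probable_score": best_score,
        "most_probable_prob": best_prob,
        "expected_score": (round(rate_1), round(rate_2)),
    }


# Hypothetical scoring rates, not taken from the article's model.
summary = summarize_match(rate_1=1.6, rate_2=0.9)
print(summary)
```

With these rates the single most likely score is 1-0, while rounding the expected goals gives 2-1: the first is the mode of the score distribution, the second comes from its means.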
Based on the above results we see that Italy, Belgium, France, and England have good chances of making it to the next round, with probabilities over 60%. For Italy (facing Austria) and France (against Switzerland), these results are expected. Italy performed well and won by big margins, while France (without performing exceptionally) collected many points against very tough opponents.
The model’s clear preference for Belgium over Portugal (65.7% win probability) and for England over Germany (61.4%) runs counter to the intuition of fans, who expect closely balanced and interesting games. The model favors Belgium and England because they had great results, whereas Portugal and Germany did worse than expected in a tough group. Portugal and Germany also conceded many goals in the group stage (Germany let in 5 and Portugal 6), which led the model to downgrade their defensive ratings and give them smaller winning probabilities in the round of 16 (i.e. the key for them is defensive improvement). The graphs that follow show the evolution of offensive and defensive ratings (model parameters) for all teams.
Two games can be considered balanced based on the model’s predictions. The match between Wales and Denmark appears very balanced, with a slight edge for Wales. For this match, though, let’s keep in mind that Denmark lost to Finland while possibly affected by the unfortunate event with Christian Eriksen. The second balanced match is Croatia against Spain, where Spain has a slight edge with a 40% chance of winning.
Last, in the final two games, Sweden has an edge over Ukraine with a 54% chance of winning, and the Netherlands has an edge over the Czech Republic with a 57% chance of winning. In both cases, the weaker teams (Ukraine and the Czech Republic) have a real chance of contesting the game, either taking it to extra time or winning, with probabilities of 46% and 43% respectively. The darker colors indicate the most probable results while the lighter areas indicate results with lower chances.
The predictions are made for scientific purposes and are not encouragement or advice for betting.
Bibliography for fans that like to read
· Dixon, M.J. and Coles, S.G. (1997), Modelling Association Football Scores and Inefficiencies in the Football Betting Market. Journal of the Royal Statistical Society: Series C (Applied Statistics), 46, 265-280.
· Karlis, D. and Ntzoufras, I. (2003), Analysis of sports data by using bivariate Poisson models. Journal of the Royal Statistical Society: Series D (The Statistician), 52, 381-393.
· Lee, A.J. (1997), Modeling Scores in the Premier League: Is Manchester United Really the Best? Chance, 10, 15-19.
· Maher, M.J. (1982), Modelling association football scores. Statistica Neerlandica, 36, 109-118.
· Reep, C. and Benjamin, B. (1968), Skill and Chance in Association Football. Journal of the Royal Statistical Society: Series A (General), 131, 581-585.
The Magic Equations of the statistical model
- i is the game identifier
- X_i and Y_i are the numbers of goals scored by Team 1 and Team 2 respectively in game i
- home is the home effect (only for games where applicable). In EURO tournaments most matches usually take place at a neutral venue, so this bonus is not added for either of the opposing teams
- λ_1i and λ_2i are the expected numbers of goals for Team 1 and Team 2 respectively (or home and away team, where applicable) in game i
- att_k,t and def_k,t are the parameters that estimate the attacking and defensive ability respectively of team k at time t (dynamic parameters that change over time)
- ranking is the Coca-Cola FIFA ranking on May 27, 2021 for team k
- γ/2 is the effect of the Coca-Cola FIFA ranking on the log of expected goals
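Read together, the symbols above suggest a log-linear specification along the following lines. This is only a sketch inferred from the definitions: the Poisson margins, the indices h_i and a_i for the two teams in game i, the sign conventions and the way the ranking difference enters are all assumptions, and the exact joint distribution and priors used by the authors are not stated here.

```latex
\begin{align*}
X_i &\sim \text{Poisson}(\lambda_{1i}), \qquad Y_i \sim \text{Poisson}(\lambda_{2i}), \\
\log \lambda_{1i} &= \text{home} + \text{att}_{h_i,t} + \text{def}_{a_i,t}
  + \frac{\gamma}{2}\left(\text{ranking}_{h_i} - \text{ranking}_{a_i}\right), \\
\log \lambda_{2i} &= \text{att}_{a_i,t} + \text{def}_{h_i,t}
  - \frac{\gamma}{2}\left(\text{ranking}_{h_i} - \text{ranking}_{a_i}\right).
\end{align*}
```

Under this reading, the ranking difference shifts the two log rates in opposite directions by γ/2 each, so the total gap it creates between the teams on the log scale is γ times the difference.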
A few words about the Authors
Leonardo Egidi is assistant professor of Statistics at the University of Trieste and a member of the research team of the AUEB Sports Analytics Group. He holds a PhD in modeling and soccer analytics and has conducted extensive research in Bayesian statistical methodology.
Ioannis Ntzoufras is professor of Statistics and president of the Department of Statistics at Athens University of Economics and Business. He is a founding member of the AUEB Sports Analytics Group research team along with Dimitris Karlis. He has recognized scientific work in subjects such as Bayesian statistical modeling, computational statistics, biostatistics, psychometrics, and sports analytics.
Dimitris Karlis is professor of Statistics and deputy president of the Department of Statistics at Athens University of Economics and Business. He is a founding member of the AUEB Sports Analytics Group research team along with Ioannis Ntzoufras. He has recognized scientific work in subjects such as statistical methodology, computational statistics, biostatistics, and sports analytics.
The three authors of this article are currently writing a book on Football Analytics for an international publisher, and at the team’s latest workshop they gave a seminar lecture on Football Analytics.
The AUEB Sports Analytics Group, the research team of Athens University of Economics and Business, was founded in 2015 by professors Ioannis Ntzoufras and Dimitris Karlis. Its members include prominent figures of the sports analytics community such as Stefan Kesenne (University of Antwerp & Leuven), Leonardo Egidi (University of Trieste), Ioannis Kosmidis (Warwick), Constantinos Pelechrinis (Pittsburgh), Nial Friel (UCD), and Gianluca Baio (UCL), as well as the former coach of the Greek National Volleyball team, Sotiris Drikos. The research team runs an annual series of conferences under the name AUEB Sports Analytics Workshop (5 in total), and in 2019 it organized the international conference MathSport 2019 with 200 participating scientists from around the world. Last, the team has a series of important scientific publications in the field of sports analytics.