Open Access Article

Predicting Football Team Performance with Explainable AI: Leveraging SHAP to Identify Key Team-Level Performance Metrics

Serafeim Moustakidis, Spyridon Plakias, Christos Kokkotis ³, Themistoklis Tsatalas and Dimitrios Tsaopoulos ⁴,*

AIDEAS OÜ, 10117 Tallinn, Estonia

Department of Physical Education and Sport Science, University of Thessaly, 42100 Trikala, Greece

Department of Physical Education and Sport Science, Democritus University of Thrace, 69100 Komotini, Greece

⁴ Institute for Bio-Economy and Agri-Technology, Center for Research and Technology Hellas, 38333 Volos, Greece

* Author to whom correspondence should be addressed.

Future Internet 2023, 15(5), 174; https://doi.org/10.3390/fi15050174

Submission received: 21 March 2023 / Revised: 30 April 2023 / Accepted: 3 May 2023 / Published: 5 May 2023

(This article belongs to the Section Big Data and Augmented Intelligence)


Figure 1. Proposed pipeline.
Figure 2. SHAP summary dot plot presenting the overall contribution of the most important variables on the teams’ score performance.
Figure 3. Comparison of predicted and actual averaged team score performance (normalized). The scatter plot illustrates the relationship between the model’s predictions (y-axis) and the actual results (x-axis), with the line of best fit providing an indication of the model’s accuracy. Each data point represents a specific team.
Figure 4. SHAP force plot indicating key team-level performance variables of Liverpool FC and their impact on the team’s score performance.
Figure 5. SHAP force plot indicating key team-level performance variables of Manchester City FC and their impact on the team’s score performance.
Figure 6. SHAP force plot indicating key team-level performance variables of West Ham FC and their impact on the team’s score performance.
Figure 7. SHAP force plot indicating key team-level performance variables of Lazio FC and their impact on the team’s score performance.


Abstract

Understanding the performance indicators that contribute to the final score of a football match is crucial for directing the training process towards specific goals. This paper presents a pipeline for identifying key team-level performance variables in football using explainable ML techniques. The input data includes various team-specific features such as ball possession and pass behaviors, with the target output being the average scoring performance of each team over a season. The pipeline includes data preprocessing, sequential forward feature selection, model training, prediction, and explainability using SHapley Additive exPlanations (SHAP). Results show that 14 variables have the greatest contribution to the outcome of a match, with 12 having a positive effect and 2 having a negative effect. The study also identified the importance of certain performance indicators, such as shots, chances, passing, and ball possession, to the final score. This pipeline provides valuable insights for coaches and sports analysts to understand which aspects of a team’s performance need improvement and enable targeted interventions to improve performance. The use of explainable ML techniques allows for a deeper understanding of the factors contributing to the predicted average team score performance.

Keywords: soccer; machine learning; explainability; performance metrics

1. Introduction

Artificial intelligence (AI) and specifically machine learning (ML) are quickly becoming popular methods for predicting the average scoring performance of European football teams [1]. This is because the technical data collected during football matches can provide valuable insights into a team’s playing style and tactics [2]. By analyzing this data, coaches and analysts can gain a deeper understanding of a team’s strengths and weaknesses, and use this information to make more informed decisions about player recruitment and opposition analysis [3].

One of the key challenges in analyzing this data is that it comes in a variety of forms, including match sheets, ball events, and tracking data [4,5]. These data types differ in their granularity and availability, but data collection companies are increasingly annotating more types of events and providing information about each event [6]. To effectively analyze team behavior, it is necessary to summarize its playing style in a way that is both humanly interpretable and suitable for data analysis [7]. This typically involves constructing a “fingerprint” of the team’s behavior, capturing characteristics such as the types of actions they tend to perform and the types of gameplay patterns the team’s players participate in [8]. Similarly, there are three ways to analyze team tactics: by summarizing the team’s playing style in a number of features [9], extracting patterns from the data using pattern mining algorithms [10], and attempting to model the complete behavior of the team in a network-based approach [11]. Modern match analysis in soccer goes beyond evaluating classical game characteristics like ball possession and pass behaviors. These data often have little predictive power for determining the outcome of a match, but with the help of AI, it is possible to gain valuable insights into successful soccer performance using computer science and statistics [12,13].

The use of AI for predicting football team performance is a growing area of research. Machine learning algorithms were used to analyze large amounts of data on team and player performance, such as statistics, results, and player attributes. These algorithms can be trained to identify patterns and make predictions about future performance [14]. Moreover, Natural Language Processing techniques were applied for the analysis of the content of news articles, social media posts, and other sources of information to gain insights into the sentiment and emotions of fans, players, and teams [15]. Computer vision techniques [16,17] have been used to analyze video footage of matches to extract data on player and team performance, such as ball possession, passing accuracy, and tactical formation. The use of AI in football predictions is still in its early stages, and most of the current solutions are based on historical data. Additionally, the results of these predictions are not always accurate and may be affected by multiple variables that are hard to predict.

In recent years, predicting football match outcomes has gained significant attention in the research community. Various approaches have been explored, focusing on target types such as win/loss/draw and final scores. The methods employed can be categorized into statistical models, machine learning algorithms, and rating systems. Over time, researchers have transitioned from predicting specific scores to predicting whether both teams will score [18], as score prediction is often challenging and results in low accuracy. Approaches such as the Poisson distribution [19] and Elo ratings [20] have been used to evaluate teams’ offensive and defensive capabilities, taking home advantage into account. Machine learning techniques have evolved from simple methods [21,22,23] to more advanced deep learning strategies [24,25]. Feature engineering has played a critical role in improving prediction accuracy through the inclusion of diverse and sophisticated variables. Despite the emphasis on the need for richer predictive features and more robust validation, many studies still rely on existing features [26]. Factors such as the nature of the features used, the complexity of the predictive problem, and the size of the dataset significantly influence prediction accuracy.

This paper aims to position itself uniquely within the realm of football match outcome prediction by shifting the focus from predicting specific match outcomes to analyzing a team’s overall performance throughout the year. The main novelties of this study include identifying key parameters that contribute to the overall scoring performance of a team and utilizing the extensive dataset collected during the regular season of the top division in 11 European countries. When referencing a team’s overall scoring performance, we are referring to the average goal difference for the team throughout the entire season. Moreover, this research employs a comprehensive feature dataset comprising 160 parameters, providing a richer context for analysis and potentially enhancing the accuracy and insights derived from the study.

Moreover, one of the major challenges with using AI in sports performance analysis is the lack of transparency and interpretability of the results. Traditional AI models, such as neural networks, can be difficult to understand and interpret, making it hard to explain how and why a particular decision or prediction was made. This is where explainable AI (XAI) comes in. Explainable AI is a subfield of AI that focuses on creating models that can provide clear and interpretable explanations for their predictions and decisions [27]. One of the most popular methods for explainable AI is SHapley Additive exPlanations (SHAP), which is a unified method for interpreting the predictions of any machine learning model [28]. It is based on the concept of Shapley values from cooperative game theory, which provides a way to fairly distribute a value among a group of individuals, such as the players on a football team. Recently, two studies [29,30] have employed SHAP as a post-hoc explainability tool to evaluate the impact of each feature on the final outcome, specifically in the context of match-specific score prediction. Both studies primarily concentrated on individual match predictions and were limited to data from a single league, utilizing a moderately sized feature set.

The objective of this paper is to use XAI to identify the key team-level performance metrics that are most important in predicting a football team’s performance. By concentrating on overall performance and identifying critical parameters influencing scoring performance (average goal difference over the season), this paper seeks to offer a broader understanding of the factors that determine success in football. The use of an expansive dataset from multiple European leagues, combined with an extensive set of features, sets this study apart from previous research and enhances its potential to uncover novel insights and patterns in football performance. This approach allows for a more transparent and interpretable understanding of how a machine learning model is making its predictions, which can be particularly useful in high-stakes decision-making scenarios such as predicting a team’s performance in a football match. By leveraging explainable AI, coaches and analysts can gain a deeper understanding of team performance and make more informed decisions about player recruitment and opposition analysis.

The structure of the paper is as follows: In Section 2, the proposed methodology and characteristics of the dataset are presented. The results of the study, including the performance of the proposed ML model and the explanations generated at both the global and local levels, are discussed in Section 3. The implications and limitations of the study are discussed in Section 4, and conclusions are presented in Section 5.

2. Materials and Methods

In this article, we aim to predict the average goal difference of football teams over an entire season using machine learning and subsequently explain the predictions made by the model. Our input data consists of various team-specific features such as ball possession and pass behaviors. The target output is the average scoring performance (goal difference) of each football team over the season. The proposed pipeline (Figure 1) includes three main steps: (1) data preprocessing, (2) model training and prediction, and (3) explainability. We first preprocess and clean the data to ensure that it is suitable for training and testing our model. Next, we use XGBoost, a powerful and widely used machine learning algorithm, to train the model and make predictions. Finally, to ensure the explainability of the predictions, we will use SHAP to provide an explanation for each prediction at a team-specific or overall level. This will allow us to understand the factors that drive the prediction and the contribution of each feature to the predicted average team score performance.

2.1. Dataset

Our dataset includes all matches played during the regular season of the top division in 11 European countries for the 2021–2022 season (Table 1). For each match, data were recorded for both teams, resulting in a total of 5992 observations. However, data were unavailable or incomplete for eight matches, as recorded by InstatScout (https://football.instatscout.com/ (accessed on 20 June 2022)).

Specifically, this study involved collecting 160 variables, either obtained directly from InstatScout or calculated by the authors from data on this platform. These variables were recorded in a Microsoft Excel spreadsheet (a full description of the variables is given in Appendix A). Prior research has shown that the indicators obtained through InstatScout have high reliability, with K values ranging from 0.90 to 0.98, as per studies by Casal et al. (2019), Castellano and Echeazarra (2019) and Gómez et al. (2018) [31,32,33].

2.2. Data Pre-Processing

To ensure the quality and consistency of our dataset, we performed several pre-processing steps before conducting the analysis: (1) Data cleaning: We first cleaned the raw data by removing any missing values or erroneous records to ensure the accuracy and reliability of the dataset. (2) Averaging variables: All variables, including the output variable, were averaged over the course of the season to provide a general representation of the team’s performance throughout the year. For instance, an average team score performance of +2.1 indicates that, on average, the team scored 2.1 more goals per match than it conceded. (3) Feature scaling: To ensure that all features are on the same scale, feature scaling was implemented using StandardScaler [34]. This is important for many ML algorithms, as it can help prevent one feature from dominating the others during the training process.
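Steps (2) and (3) can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes per-match rows keyed by a hypothetical `team` field, uses invented numbers, and reproduces the standard z-score transform (zero mean, unit population standard deviation) in plain Python instead of calling StandardScaler.

```python
from collections import defaultdict
from statistics import mean, pstdev


def season_averages(match_rows):
    """Average every numeric column per team over all of its matches."""
    grouped = defaultdict(list)
    for row in match_rows:
        grouped[row["team"]].append(row)
    return {
        team: {k: mean(r[k] for r in rows) for k in rows[0] if k != "team"}
        for team, rows in grouped.items()
    }


def standardize(values):
    """Z-score a column: subtract the mean, divide by the population std."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no signal
    return [(v - mu) / sigma for v in values]


# Invented toy data: two matches each for two hypothetical teams.
matches = [
    {"team": "A", "passes": 500, "goal_diff": 2},
    {"team": "A", "passes": 540, "goal_diff": 1},
    {"team": "B", "passes": 300, "goal_diff": -1},
    {"team": "B", "passes": 340, "goal_diff": -2},
]
averages = season_averages(matches)
scaled_passes = standardize([averages[t]["passes"] for t in sorted(averages)])
```

Here team A ends up with a season-average goal difference of +1.5, and the averaged `passes` column is rescaled so that both teams sit one standard deviation from the mean.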

2.3. Machine Learning

Just prior to the model training process, we applied a feature selection technique, sequential forward selection, to identify the most important features for the task at hand [35]. This is a wrapper-based feature selection method, where we used XGBoost as the model and R-squared as the selection criterion. The algorithm iteratively adds features to the model one by one and evaluates their impact on its performance. This helps us identify the most relevant features that contribute the most to the model’s accuracy and avoid overfitting.
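The greedy loop behind sequential forward selection can be sketched as below. The scoring function is a hypothetical stand-in: in the study the criterion was the R-squared of an XGBoost model, whereas here `toy_score` and the `GAIN` table are invented so that the example runs without any external library. The stopping rule (stop when no candidate improves the score) is also an assumption, since the source does not state one.

```python
def sequential_forward_selection(features, score_fn, max_features=None):
    """Greedily add the feature that most improves score_fn(subset)."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    limit = max_features or len(remaining)
    while remaining and len(selected) < limit:
        # Score every candidate subset formed by adding one more feature.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        cand_score, cand = max(scored)
        if cand_score <= best_score:
            break  # no remaining feature improves the criterion
        selected.append(cand)
        remaining.remove(cand)
        best_score = cand_score
    return selected, best_score


# Hypothetical stand-in for a cross-validated R-squared: each feature adds
# some explained variance, and every extra feature costs a small penalty.
GAIN = {"shots": 0.50, "passes": 0.20, "corners": 0.01, "offsides": 0.00}


def toy_score(subset):
    return sum(GAIN[f] for f in subset) - 0.02 * len(subset)


chosen, score = sequential_forward_selection(GAIN, toy_score)
```

With these invented gains the procedure keeps `shots` and `passes` and stops, because adding `corners` or `offsides` would lower the penalized score.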

Once the relevant features have been selected, we trained and tested an XGBoost regressor on the data by using a ten-fold cross-validation strategy and internal hyperparameter tuning in the training phase [36]. XGBoost is a powerful gradient-boosting algorithm that has been shown to perform well on a wide range of tasks. We also used the SHAP model to understand the contribution of each feature to the global and local predictions. The performance of the proposed model was compared to that of three other well-known regression algorithms: Support Vector Regression (SVR) [37], Random Forest (RF) [38], and the k-Nearest Neighbor Regressor (kNN) [39].
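The ten-fold cross-validation strategy can be illustrated with a small, library-free sketch. The learner here is a placeholder that predicts the training mean; the study used XGBoost with internal hyperparameter tuning, which is not reproduced. The function names, the shuffling seed, and the toy data are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs that partition range(n_samples)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_val_predict(X, y, fit, predict, k=10):
    """Return one out-of-fold prediction per sample."""
    preds = [None] * len(y)
    for train_idx, test_idx in k_fold_indices(len(y), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            preds[i] = predict(model, X[i])
    return preds


def fit_mean(X_train, y_train):
    """Placeholder learner: remembers only the mean of the training targets."""
    return sum(y_train) / len(y_train)


def predict_mean(model, x):
    return model


# Invented toy data: 20 samples, one feature each.
y = [float(v) for v in range(20)]
X = [[v] for v in y]
oof = cross_val_predict(X, y, fit_mean, predict_mean, k=10)
```

Every sample is predicted exactly once by a model that never saw it during fitting, which is what makes the resulting error estimate out-of-sample.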

2.4. Explainability

To understand and explain the predictions made by our machine learning model, we used the SHAP library [28,40]. SHAP values provide a unified measure of feature importance that can be used for both linear and non-linear models, offering a consistent approach to understanding the impact of features on model predictions. SHAP values are derived from cooperative game theory and provide an interpretable allocation of each feature’s contribution to a prediction, while ensuring that the sum of all feature attributions equals the difference between the predicted outcome and the average baseline prediction. This approach allows for a fair distribution of each feature’s influence on the prediction, accounting for potential interactions and dependencies among features. In our study, we employ SHAP as a post-hoc explainability tool to quantify the effects of each feature on the final outcome, helping us identify the key parameters that contribute to a team’s overall scoring performance. For a given prediction, SHAP values attribute a contribution value to each feature, with positive values indicating that the feature pushed the prediction higher and negative values indicating that the feature pushed the prediction lower. This allows us to understand how each feature contributed to the final prediction and how the features compare to one another. Overall, the use of SHAP values provides a detailed, accurate, and easily interpretable explanation of the inner workings of our regression model.
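The additivity property described above can be checked on a small example by computing exact Shapley values through enumeration of all feature subsets. This is only a sketch: the model, the input point, and the all-zeros baseline are hypothetical, and the SHAP library itself relies on faster, model-specific algorithms rather than this brute-force approach.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset)."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical model with an interaction term, explained at point x
# against an all-zeros baseline (absent features take the baseline value).
def model(z):
    return 2.0 * z[0] + 1.0 * z[1] + 0.5 * z[0] * z[2]


x, baseline = [1.0, 2.0, 4.0], [0.0, 0.0, 0.0]


def value_fn(present):
    z = [x[j] if j in present else baseline[j] for j in range(len(x))]
    return model(z)


phi = shapley_values(value_fn, len(x))
```

The attributions sum to `model(x) - model(baseline)`, and the contribution of the interaction term is split equally between the two features involved in it.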

In addition to the SHAP values employed in our study, there are other explainable AI methods, such as Local Interpretable Model-agnostic Explanations (LIME) [41], that can be used to provide insights into the importance of features in complex models. LIME is a popular technique that explains individual predictions by fitting a locally interpretable model around the specific data point. While both SHAP and LIME aim to increase the interpretability of machine learning models, they differ in their approach. SHAP values are grounded in cooperative game theory and provide a unified measure of feature importance that is both locally and globally accurate. In comparison, LIME focuses on local interpretability and may not provide the same level of global accuracy. Additionally, SHAP values satisfy a consistency property: if a model changes so that a feature’s marginal contribution increases or stays the same, the attribution assigned to that feature does not decrease. LIME does not guarantee this property. For our study, we chose to use SHAP values because they provide a more consistent and accurate measure of feature importance. However, future work could explore the use of LIME or other XAI methods to analyze football team performance and compare the resulting insights with our findings.

3. Results

This section presents the results of the proposed explainable machine learning pipeline, including the explanations generated by the SHAP algorithm, which provides insight into the factors that influence the model’s predictions.

Based on the sequential forward selection method, we identified 141 out of the 160 initial variables as the most relevant features for predicting football team performance. The selected variables are listed in Appendix A, and the importance of the top 14 variables is visualized in Figure 2. By using these 141 features, our model was able to achieve a satisfactory performance in terms of both accuracy and interpretability.

Table 2 and Figure 3 present the results of the model’s performance in predicting the average team score over the season. The scatter plot in Figure 3 compares the actual values (x-axis) with the predicted values (y-axis). Each point in the scatter plot represents a team, with the x-coordinate denoting the actual averaged team score performance and the y-coordinate denoting the predicted averaged team score performance. The line of best fit is a visual representation of how closely the predictions align with the actual results, with a slope of 1 indicating a perfect fit. The distribution of the points around the line of best fit demonstrates the accuracy and balance of the predictions. Additionally, the RMSE of the proposed model reported in Table 2 (0.32) is lower than that of SVR (0.42), RF (0.36) and kNN (0.40), with lower values indicating better performance.
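The two quantities discussed here, the RMSE and the slope of the line of best fit between actual and predicted values, can be computed as in the sketch below. The numbers are invented and do not correspond to any team or to the values in Table 2; the predicted values are deliberately shrunk toward zero so that the fitted slope falls below 1.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def best_fit_line(x, y):
    """Ordinary least-squares slope and intercept of y regressed on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Invented values: predictions are exactly half of the actual scores.
actual = [-1.0, 0.0, 1.0, 2.0]
predicted = [-0.5, 0.0, 0.5, 1.0]

error = rmse(actual, predicted)
slope, intercept = best_fit_line(actual, predicted)
```

A slope below 1, as in this toy case, would indicate that the model underestimates strong teams and overestimates weak ones, even when the RMSE looks moderate.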

The next step in analyzing the model’s performance is to examine its explainability. This analysis aims to understand the factors that influence the model’s predictions and how they relate to the actual outcome (the team’s average score performance). By understanding the underlying relationships and patterns, we can gain insight into the behavior of the model and identify areas for improvement (modifiable key team-level performance metrics). This can also provide valuable information for stakeholders (e.g., coaches, sport analysts) to understand the decision-making process of the model and the rationale behind its predictions.

Global explanations: Figure 2 shows the contribution of each feature to the model’s predictions, and how each feature affects the final outcome. Specifically, the summary plot presents a combination of feature importance and feature effects. Each point on the plot represents a Shapley value for a specific feature and instance. The y-axis displays the feature, while the x-axis shows the corresponding Shapley value. The color of each point represents the feature’s value, with low values being represented by one color and high values by another. To avoid overlap, points with the same feature are slightly shifted in the y-axis direction, providing a clear understanding of the distribution of Shapley values per feature. Additionally, the features are arranged in order of their importance. In our analysis, the 14 most significant variables are shown in Figure 2. As depicted, variables such as shots per possession percentage, missed chances, entries into the penalty box, conversion percentage of chances, and passes have a positive impact on the team’s predicted score performance. Conversely, variables such as lost balls in the team’s own half and the ratio of dribbles per minute of possession have a negative effect on the score, indicating that an increase in these variables leads to a decrease in the team’s predicted score.
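The information conveyed by a summary plot can be approximated numerically: features are ranked by their mean absolute SHAP value, and the direction of the effect is read from whether high feature values coincide with high SHAP values. The sketch below assumes that a matrix of SHAP values has already been computed; the feature names echo the text, but all numbers are invented, and using the sign of the covariance as the "direction" is a simplification chosen for illustration.

```python
def global_summary(feature_names, feature_values, shap_values):
    """Rank features by mean |SHAP| and report the sign of their effect.

    feature_values and shap_values are row-major lists (one row per team).
    """
    n = len(shap_values)
    summary = []
    for j, name in enumerate(feature_names):
        col_x = [row[j] for row in feature_values]
        col_s = [row[j] for row in shap_values]
        importance = sum(abs(s) for s in col_s) / n
        mx, ms = sum(col_x) / n, sum(col_s) / n
        # Positive covariance: higher feature values push predictions up.
        cov = sum((a - mx) * (b - ms) for a, b in zip(col_x, col_s))
        direction = "positive" if cov > 0 else "negative" if cov < 0 else "flat"
        summary.append((name, importance, direction))
    return sorted(summary, key=lambda item: item[1], reverse=True)


# Invented values for three hypothetical teams and two features.
names = ["shots", "lost_balls_own_half"]
values = [[10.0, 5.0], [14.0, 9.0], [18.0, 7.0]]
shap_vals = [[-0.4, 0.1], [0.0, -0.1], [0.4, 0.0]]

ranking = global_summary(names, values, shap_vals)
```

In this toy case `shots` ranks first with a positive direction, while `lost_balls_own_half` ranks second with a negative direction, mirroring the kind of reading made from the summary plot.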

Local explanations (team-specific): Figure 4, Figure 5, Figure 6 and Figure 7 are SHAP force plots that allow us to see how the different variables contributed to the model’s prediction f(x) for specific teams. The higher the score, the more the model is likely to predict a positive outcome (good score performance), and the lower the score, the more the model is likely to predict a negative outcome (bad score performance). The variables that were important to making the prediction for each team are shown in red and blue, with red representing features that pushed the score higher, and blue representing features that pushed the score lower. The features that had more of an impact on the score are located closer to the dividing boundary between red and blue, and the size of that impact is represented by the size of the bar.
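The arithmetic behind a force plot is simple: the prediction f(x) equals a base value plus the sum of the per-feature SHAP contributions, and the features are split into those pushing the score up and those pushing it down. The following sketch uses invented contributions for a hypothetical team; the function name and the numbers are illustrative assumptions, not output from the study.

```python
def force_plot_breakdown(base_value, contributions):
    """Split SHAP contributions into upward and downward pushes.

    Returns the reconstructed prediction and two lists of (name, value)
    pairs, each sorted by decreasing absolute contribution.
    """
    prediction = base_value + sum(contributions.values())
    by_size = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    pushes_up = [(name, v) for name, v in by_size if v > 0]
    pushes_down = [(name, v) for name, v in by_size if v < 0]
    return prediction, pushes_up, pushes_down


# Invented contributions for one hypothetical team.
contribs = {"shots": 0.30, "high_pressing_pct": -0.20, "passes": 0.10}
f_x, up, down = force_plot_breakdown(0.0, contribs)
```

Here `shots` and `passes` would be drawn in red and `high_pressing_pct` in blue, and the three bars together account for the full gap between the base value and f(x).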

In the case of Liverpool FC, all the variables pushed the score higher (as indicated by the red bars), indicating that they are important for the model’s prediction of a positive outcome. Similar findings were obtained for Manchester City FC, where the team performed well in all key team-level performance variables. On the other hand, using SHAP force plots, it is possible to identify which variables have a negative effect on the team’s performance. For example, four key variables (shots per quantity of possession percent, chances percent of conversion, accurate passes, and high pressing percent) were identified as negatively impacting the scoring performance (average goal difference) of West Ham FC. Similarly, lost balls in their own half, offsides, and corners were identified as key performance variables for Lazio FC that have a negative effect on the scoring performance and would require improvement. In summary, SHAP force plots allow stakeholders such as coaches or sports analysts to see which aspects of a specific team’s game performance are satisfactory and which need improvement, enabling targeted interventions and adjustments to be made to improve the team’s performance.

4. Discussion

Recognizing the performance indicators that contribute to the final score of a match is important in order to direct the training process toward specific goals. Consequently, the purpose of the current study was to identify and measure the contribution of each performance indicator to the final score of a match. We managed: (i) to predict the goal difference between teams in a match and (ii) to identify the contribution of each performance indicator to the match score both for the teams as a whole and for each team individually. The results showed that for the teams as a whole, fourteen variables had the greatest contribution to the outcome of the match. Of these, twelve (shots per quantity of possession percent, missed chances, entrance to the penalty box, chances percent of conversion, key passes accurate, passes, key passes, accurate passes, ratio passes per lost balls, high pressing percent, positional attacks with shots, sum duration of ball possession) had a positive effect, while two (lost balls in own half, ratio dribbles per minute of ball possession) had a negative effect. When we looked at each team separately, the variables that contribute the most to shaping the scores in their matches differ.

Shots per quantity of possession percent is the variable with the biggest contribution. In addition, among the fourteen most important performance indicators is the variable positional attacks with shots. Both of the above variables show that the ability of teams to make shots has a significant positive contribution to the final score in their favor. This finding is in agreement with other research that showed that the total number of shots made by a team is an important factor in determining the match outcome [42,43,44], but also with the research of Castellano et al. (2012) which showed that successful teams make more shots [45].

However, besides the shots, there are also other variables that contribute positively to the final score of the match. Firstly, our research showed that the creation of chances, even if they are lost (chances missed), but also the ability to convert the chances into goals (chances percent of conversion) had a significant positive contribution. Although chances are the factor that determines the variable xG [46], only one study conducted on beach soccer [47] has examined their effect on match score and found that chance creation is a factor that can distinguish winners from defeated teams. Secondly, four variables related to passing and ball possession (passes, passes accurate, ratio passes per lost balls, sum duration with ball possession) are among the fourteen most important. This finding confirms almost all previous research that has examined the contribution of ball possession to match outcome [14,42,43,44,48,49,50]. On the contrary, the research of Harrop and Nevill (2014) showed that only successful passes help distinguish the games that a team wins [51], while total passes showed the opposite. However, it should be pointed out that this research was carried out with data that concerned only one team. Thirdly, entrance to the penalty box is another variable that we found to significantly contribute to a positive score in a match and this is in agreement with research that had similar objectives [52]. Finally, key passes and high pressing percent (high pressing success rate) have not been examined by relevant research for their contribution to the match outcome. However, other studies showed the importance of key passes to the playing effectiveness of a team [53,54,55], but also the usefulness of a successful high pressing because defending near the opponent’s goal seems to be associated with success in soccer [56,57,58].

On the other hand, among the fourteen variables that affect the outcome of the match, there are two variables (ratio dribbles per min of possession, lost balls in own half) that have a negative contribution. Liu et al. (2015) and Harrop and Nevill (2014) had already shown that dribbles had clearly negative effects on the probability of winning [43,51], which agrees with our own finding. The variable lost balls in own half has not been considered in research investigating the contribution of performance indicators to the match outcome. However, both among coaches and in the scientific literature, it is commonly accepted that the closer to the rival goal the start of the offensive action, the greater the probability of success in ball possessions [59,60,61].

In addition to applying our methods to all teams as a whole, we also applied them to some teams separately. In these cases, there were differences in the fourteen variables that had a greater contribution to shaping the outcome of their matches depending on the philosophy of their coach and the tactical principles they adopted. For example, Liverpool manager Jurgen Klopp’s preference for the “high press” is well known [62,63,64]. This is reflected in the results of our research, since three of the fourteen variables (high pressing percent, ball recoveries in opponent’s half, ratio defensive challenges attacking 3rd plus defensive challenges midfield 3rd per defensive challenges) for Liverpool are related to this particular philosophy, while for the teams as a whole only one of them appeared.

On the other hand, the style of Guardiola’s teams (tiki-taka) is characterized by high percentages of possession with many and short passes [65,66,67]. The results of our research showed that in Pep’s team (Manchester City) five of the fourteen variables that have the greatest contribution to the final score of the matches are related to this style of play (passes, accurate passes, sum duration with ball possession, ratio passes per lost balls, ball possession percent). We looked at one more English Premier League team (West Ham). When they attempted to press their opponent high, they usually did so in a 4-2-4 formation and the players in the front four line often had long distances from the remaining six players. This made them vulnerable in the given situations. This particular observation was made after a qualitative analysis of West Ham’s games by one of our authors, who is a certified soccer performance analyst. The results of our quantitative research confirm this particular observation, since although in the teams as a whole the “high pressing percentage” variable contributes positively to the result, in West Ham it contributes negatively.

In addition to the three English teams, we also looked at an Italian team (Lazio) whose manager (Maurizio Sarri) has given his name to a style of play called Sarriball [68]. Sarriball is characterized by persistence in building up from the back even if the opponent presses high with many players. That is, the team uses many short passes in the defensive half with the aim of drawing the opponent high. When this is done, the players are instructed to make vertical forward passes behind the opposing defensive line. The results of our study fully reflect this specific style of play. In particular, (a) many short passes increase the number of passes, accurate passes and ball possession percentage, (b) the persistence in the build-up and the large number of passes in the team’s half can increase the opponent’s recoveries closer to the team’s goal (lost balls in own half), (c) the vertical passes are often key passes that increase the number of final actions (shots per quantity of possession percentage), while (d) the runs attempted by players behind the opposing defensive line (to receive vertical passes) can also increase offsides.

In this paper, we presented a pipeline for predicting the average team score performance of football teams using machine learning, data preprocessing, and explainability. However, there are certain limitations to the study that should be acknowledged. First, the data used in this study is limited to one season, the 2021–2022 season, which may not fully capture the dynamics and complexities of team performance over time. Additionally, our prediction targets the average team score performance over the season; the pipeline does not predict a team’s score performance in individual matches, because such match-level predictions would not be sufficiently accurate. This limits the scope of the study and the potential applications of the proposed pipeline. To improve the model’s performance and to provide more robust predictions, it would be beneficial to gather data from multiple seasons and to work on predicting score performance in individual matches.

In addition to the proposed sequential forward selection technique, there are other robust feature selection techniques, such as BORUTA, which is a wrapper-based method built around the random forest algorithm [69]. BORUTA iteratively compares the importance of features to that of shadow features, which are shuffled copies of the original features, to determine their relevance. While BORUTA is considered more robust and can handle non-linear relationships better than sequential forward selection, it may be computationally more expensive. We acknowledge that comparing different feature selection methods could provide further insights into the best approach for our specific task. Future work could investigate the performance of BORUTA and other feature selection techniques in the context of predicting football team performance.
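
The shadow-feature idea described above can be illustrated with a minimal sketch. It is not the Boruta implementation: the random forest importance is swapped for an absolute Pearson correlation to keep the example dependency-free, the statistical test that Boruta applies across iterations is omitted, and the data, the function names (`abs_pearson`, `shadow_feature_hits`) and the column `Noise` are hypothetical.

```python
import random
from statistics import mean


def abs_pearson(x, y):
    """Absolute Pearson correlation, used here as a stand-in importance score
    (assumption: Boruta itself uses random forest importances)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def shadow_feature_hits(features, target, n_rounds=50, seed=0):
    """Count, per feature, the rounds in which it beats the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_rounds):
        # Shadow features are shuffled copies of the original columns.
        best_shadow = 0.0
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)
            best_shadow = max(best_shadow, abs_pearson(shadow, target))
        # A feature scores a hit when it is more important than every shadow.
        for name, values in features.items():
            if abs_pearson(values, target) > best_shadow:
                hits[name] += 1
    return hits


# Synthetic toy data (not from the study): one informative and one noise column.
data_rng = random.Random(42)
n_teams = 60
shots = [data_rng.gauss(0, 1) for _ in range(n_teams)]
noise = [data_rng.gauss(0, 1) for _ in range(n_teams)]
goal_diff = [2.0 * s + data_rng.gauss(0, 0.5) for s in shots]

hits = shadow_feature_hits({"Shots": shots, "Noise": noise}, goal_diff)
print(hits)
```

Features that rarely beat their shuffled counterparts would be candidates for rejection, while features that win in nearly every round would be retained.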

Finally, our model’s primary objective is to understand the importance of various team-level performance metrics within the current season; the pipeline does not predict future performance. The inputs and the output refer to the same season, which means that the model cannot be used as a predictor for subsequent seasons. Future work could explore incorporating lagged variables or historical data to enable predictions for upcoming seasons. Nevertheless, our current approach still provides valuable insights into the factors that contribute to a team’s performance, helping stakeholders make informed decisions.
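
One possible way to set up the lagged variables mentioned above is sketched here, assuming per-season feature and target dictionaries. The season labels, team names, values and the helper `build_lagged_pairs` are hypothetical and are not part of the presented pipeline.

```python
def build_lagged_pairs(season_stats, season_targets, seasons):
    """Pair each team's features from season t with its target from season t+1."""
    pairs = []
    for prev, nxt in zip(seasons, seasons[1:]):
        for team, feats in season_stats.get(prev, {}).items():
            # Teams missing from the next season (e.g., relegated) are skipped.
            if team in season_targets.get(nxt, {}):
                pairs.append((team, prev, nxt, feats, season_targets[nxt][team]))
    return pairs


# Hypothetical values for illustration only.
seasons = ["2020-2021", "2021-2022"]
season_stats = {
    "2020-2021": {
        "Team A": {"Shots": 14.2, "Ball_possession__percent": 58.0},
        "Team B": {"Shots": 9.8, "Ball_possession__percent": 44.5},
    },
}
season_targets = {
    "2021-2022": {"Team A": 0.9, "Team C": -0.3},
}

pairs = build_lagged_pairs(season_stats, season_targets, seasons)
for team, prev, nxt, feats, target in pairs:
    print(team, prev, "->", nxt, feats, target)
```

With inputs taken from season t and the target from season t+1, the inputs and the output are no longer simultaneous, so a model trained on such pairs could in principle be evaluated as a forecaster.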

5. Conclusions

This paper aimed to identify and measure the contribution of various performance indicators to the final score of a football match. Through the use of explainable machine learning techniques, we were able to identify the contribution of each team-level performance indicator to the match score for all teams as a whole and for each team individually. The results provided valuable insights into which performance indicators had the greatest impact on the outcome of a match. This information can be used by coaches and sports analysts to make targeted interventions and adjustments to improve the performance of teams. It is important to note that the study is based on data from one season and that the model does not predict individual match scores; these limitations should be considered when interpreting the findings. Despite this, the study provides a useful framework for understanding the key factors that contribute to a team’s performance and can be applied to future research using data from multiple seasons.
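
To make the notion of a per-indicator contribution concrete, the following minimal sketch computes exact Shapley values for a toy model by enumerating all feature coalitions. It is not the SHAP library and not the model used in this study: the function `toy_model`, its coefficients, the input values and the all-zero baseline are assumptions chosen for illustration, and "absent" features are simply replaced by their baseline value.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    names = list(x)
    n = len(names)

    def value(coalition):
        mixed = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight of a coalition with `size` members.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {name}) - value(members))
        phi[name] = total
    return phi


def toy_model(f):
    """Hypothetical score model with one interaction term (illustration only)."""
    return (0.4 * f["Shots"]
            - 0.2 * f["Lost_balls_in_own_half"]
            + 0.1 * f["Shots"] * f["Key_passes"])


team = {"Shots": 1.0, "Lost_balls_in_own_half": 0.5, "Key_passes": 2.0}
reference = {"Shots": 0.0, "Lost_balls_in_own_half": 0.0, "Key_passes": 0.0}

phi = shapley_values(toy_model, team, reference)
print(phi)
```

The contributions sum to the difference between the prediction for the team and the prediction for the baseline, which is the property that allows a force plot to decompose a single team’s predicted score into per-variable effects. Exhaustive enumeration is only feasible for a handful of features; with many variables an approximate or model-specific algorithm would be needed.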

Author Contributions

Conceptualization, S.M. and S.P.; Data curation, S.M. and S.P.; Methodology, S.M.; Software, S.M. and C.K.; Supervision, D.T.; Validation, T.T. and D.T.; Visualization, C.K.; Writing—original draft, S.M., S.P. and C.K.; Writing—review and editing, T.T. and D.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not accessible.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Full list of variables used in our analysis.

Features	Description
Sum_long_passes	Passes with a length of at least 40 m, regardless of the area from which they were made
Pass_long_def_3rd	Passes made in the defensive third that were at least 40 m long
Pass_long_mid_3rd	Passes made in the midfield third that were at least 40 m long
Pass_long_att_3rd	Passes made in the attacking third that were at least 40 m long
RATIO_long_passes_PER_passes	Passes with a length of at least 40 m/total number of passes
Defensive_challenges	Duels involving the players of the defending team
Def_challenges_def_3rd	Duels involving the players of the defending team and taking place in the defensive third of that team
Def_challenges_mid_3rd	Duels involving the players of the defending team and taking place in the midfield third of that team
Def_challenges_att_3rd	Duels involving the players of the defending team and taking place in the attacking third of that team
Air_challenges	Duels in which the ball is above shoulder height and players try to play with their heads
Air_challenges_won	Successful air challenges
Air_challenges_missed	Unsuccessful air challenges
Air_challenges_won__percent	Air challenges won/air challenges (%)
Air_challenges_def_3rd	Air challenges in the team’s defensive third
Air_challenges_mid_3rd	Air challenges in the team’s midfield third
Air_challenges_att_3rd	Air challenges in the team’s attacking third
Challenges	Total number of duels
Challenges_won	Successful challenge is registered for a team that keeps possession of a ball after a duel
Challenges_missed	Duels after which a team does not keep possession of the ball
Challenges_won__percent	Challenges won/challenges (%)
Defensive_challenges_won	Successful attempts of defensive challenges that lead to a touch made by own team player
Defensive_challenges_missed	Defensive challenges minus defensive challenges won
Challenges_in_defence_won__percent	Defensive challenges won/defensive challenges (%)
Attacking_challenges	Duels involving the players of the attacking team
Attacking_challenges_won	Successful attacking challenges
Attacking_challenges_missed	Unsuccessful attacking challenges
Challenges_in_attack_won__percent	Attacking challenges won/attacking challenges (%)
Ground_challenges	Challenges minus air challenges
Ground_challenges_won	Successful ground challenges
Ground_challenges_missed	Unsuccessful ground challenges
Ground_challenges_won_percent	Successful ground challenges/ground challenges (%)
RATIO_ground_challenges_PER_air_challenges	Ground challenges/air challenges
RATIO_def_challenges_def_3rd_PER_defensive_challenges	Duels involving the players of the defending team and taking place in the defensive third of that team/total duels involving the players of the defending team
RATIO_def_challenges_mid_3rd_PER_defensive_challenges	Duels involving the players of the defending team and taking place in the midfield third of that team/total duels involving the players of the defending team
RATIO_def_challenges_att_3rd_PER_defensive_challenges	Duels involving the players of the defending team and taking place in the attacking third of that team/total duels involving the players of the defending team
RATIO_def_challenges_att_3rd__def_chall_mid_3rd_PER_defensive_c	Duels involving the players of the defending team and taking place in the midfield and attacking third of that team/total duels involving the players of the defending team
DIFFERENCE_air_challenges_att_3rd_MINUS_air_challenges_def_3rd	Air challenges in the team’s attacking third minus air challenges in the team’s defensive third
RATIO_air_challenges_att_3rd___air_challenges_def_3rd_PER_air_c	Air challenges in the team’s attacking and defensive thirds/total air challenges
Chances	A goal-scoring opportunity
Missed_chances	A goal-scoring opportunity which did not result in a goal
Fouls	An action that is not compatible with the rules of the game and is used to stop the progress of the opponent’s attack
Yellow_cards	An illegal action punishable by a yellow card from the referee
Yellow_cards_Fouls	The ratio yellow cards/fouls
Red_cards	An illegal action punishable by a red card by the referee and results in the player being sent off from the match
Corners	The total number of corners for a team
Shots	Total number of all shots made by a team
RATIO_shots_PER_10_minOf_ball_possession	The average number of shots a team made for every 10 min they had the ball
Shots_on_target	Shots heading inside the goal, which may end in a goal or be deflected by the goalkeeper or by a field player from the GK zone
Shots_on_target__percent	Shots on target/shots (%)
Shots_wide	Shots off target
Blocked_shots	Shots stopped by an opposing player
Shots_on_post_Bar	Shots that hit the post or the bar
Passes	Total number of passes
Accurate_passes	Successful attempt to pass a ball from one teammate to another
Accurate_passes__percent	Successful passes/passes (%)
Wrong_passes	Passes minus accurate passes
RATIO_passes_PER_wrong_passes	Passes/wrong passes
Key_passes	Pass that if successful creates a goal scoring opportunity
Key_passes_accurate	Successful pass that creates a goal scoring opportunity
Crosses	Passes from a wide area of the field towards the opponent’s box
Crosses_accurate	Successful crosses
Accurate_crosses__percent	Successful crosses/crosses (%)
Lost_balls	Any loss of ball for a team whether it comes from an unsuccessful pass, dribble or control
RATIO_passes_PER_lost_balls	Passes/lost balls
Lost_balls_in_own_half	Lost balls in the team’s own half
Lost_balls_in_opponent_s_half	Lost balls in the opposite half
Ball_recoveries	Action by which the team wins possession of the ball from the opponent
Ball_recoveries_in_opponent_s_half	Ball recoveries in the opposite half
Ball_recoveries_in_own_half	Ball recoveries in the team’s own half
Pressing_efficiency__percent	Percentage share of successful team pressing in the total number of team pressing attempts
Entrances_to_the_opposition_half	Number of team possessions during which at least one entrance into the opponent’s half was made
Entrances_to_the_finalThird	Number of team possessions during which at least one entrance into the opponent’s final third was made
Entrance_to_the_penalty_box	Number of team possessions during which at least one entrance into the opponent’s penalty box was made
RATIO_Entrances_to_the_final_third_PER_10_min_of_ball_possessio	Average entries into the attacking third per 10 min of possession
RATIO_Entrance_to_the_penalty_box_PER_10_min_of_ball_possession	Average entries into the opponent’s penalty box per 10 min of possession
Dribbles	The ball possessor’s attempt to outrun an opponent while maintaining possession of the ball
RATIO_dribbles_PER_min_of_possession	Average number of dribbles attempted by a team per minute of possession
Dribbles_successful	When the player attempting a dribble retains possession of the ball
Successful_dribbles__percent	Successful dribbles/dribbles (%)
Tackles	The attempt of a player to stop an opponent who is dribbling
RATIO_tackles_PER_min_of_opponent_s_ball_possession	The average number of tackles attempted by a team per minute of possession by the opposing team
Tackles_successful	When an opposing player attempts a dribble and loses possession of the ball
Tackles_won__percent	Tackles successful/tackles (%)
Ball_interceptions	A player’s attempt to stop a pass
Free_ball_pick_ups	When a player wins possession of the ball, when it was not in the possession of either team
Opponent_s_passes_per_defensive_action	Total number of passes attempted by the opponent team/total number of defensive challenges
Building_ups	When a team builds an attack in its own half
Building_ups_without_pressing	Build-up without pressing from the opponent
Team_pressing	Is counted for the opponents of a team that is building its attack when players are actively trying to get the ball back
Team_pressing_successful	When pressing results in the ball being recovered
High_pressing	Pressing in the attacking third
High_pressing_successful	Successful pressing in the attacking third
High_pressing__percent	Successful high pressing/high pressing (%)
Low_pressing	Pressing in the defensive and midfield third
Low_pressing_successful	Successful pressing in the defensive and midfield third
Low_pressing__percent	Successful low pressing/low pressing (%)
Passing_rate	Average passes per minute of possession
AVERAGE_passes_PER_ball_possession	Average passes per possession
Ball_possessions__quantity	The number of ball possessions
Average_duration_of_ball_possession_sec	The average duration of each ball possession
Sum_duration_with_ball_possession	The total duration of possession for a team
Ball_possession__percent	The percentage of ball possession for a team
Opponent_s_ball_possession_percent	The percentage of ball possession for the opponent’s team
RATIO_interceptions___free_balls_pick_up_PER_min_of_opponents_b	Interceptions plus free balls pick up/minutes of opponent’s ball possession
RATIO_defensive_challenges_PER_min_of_opponent_s_ball_possessio	Defensive challenges/minutes of opponent’s ball possession
Opponent_s_sum_duration_of_ball_possession_sec	The total duration of possession for the opponent’s team
Effective_time_secs	The total time that the ball is in the possession of one or the other team (i.e., the time that the ball is contestable, and the interruptions of the match are not included)
Attacks	Possession contains at least one action made by the team in the opposition half (except fouls) and continues for more than 3 s
Attacks_Left_flank	Attacks from left flank (flank is the zone on the side of the pitch 20 m from each sideline)
Attacks_with_shots_Left_flank	Attacks from left flank which resulted in a shot
Attacks_Center	Attacks that are not made from the flanks
Attacks_with_shots_Center	Attacks from center which resulted in a shot
Attacks_Right_flank	Attacks from right flank
Attacks_with_shots_Right_flank	Attacks from right flank which resulted in a shot
RATIO_left_attacks_PER_total_attacks_percent	Attacks left flank/attacks (%)
RATIO_right_attacks_PER_total_attacks_percent	Attacks right flank/attacks (%)
Wide_attacks_percent	Attacks left flank plus attacks right flank/attacks (%)
Attacks_center_percent	Attacks center/attacks (%)
Counterattacks	Open play situation where the team that wins the ball from the opponent makes a quick offensive transition (<8 s)
Positional_attacks	Open play attacks minus counterattacks
Positional_attacks_with_shots	Positional attacks which resulted in a shot
RATIO_counterattacks_PER_ballRecoveries	Counterattacks/ball recoveries
Counterattacks_with_a_shot	Counterattacks which resulted in a shot
Set_pieces_attacks	Attacks from free kick, corner kick, throw in and penalty
Set_pieces_attacks_with_shots	Attacks from set pieces which resulted in a shot
Free_kick_attacks	Attacks from free kicks
Free_kick_attacks_with_shots	Attacks from free kicks which resulted in a shot
Corner_attacks	Attacks from corner kicks
Corner_attacks_with_shots	Attacks from corner kicks which resulted in a shot
Throw_in_attacks	Attacks from throw ins
Throw_in_attacks_with_shots	Attacks from throw ins which resulted in a shot
Free_kick_shots	Free kicks taken directly towards the goal
Penalties	Penalty kicks
Chances_percent_of_conversion	Percentage of chances that resulted in goals
Shots_on_target_per_shot_percent	Shots on target/shots (%)
Open_play_attacks	Attacks minus set pieces attacks
Open_play_attacks_percent	Open play attacks/attacks (%)
Set_pieces_attacks_percent	Set pieces attacks/attacks (%)
Counterattacks_percent	Counterattacks/attacks (%)
Positional_attacks_percent	Positional attacks/attacks (%)
Ratio_counterattacks_per_open_play_attacks_percent	Counterattacks/open play attacks (%)
Ratio_posit_att_from_openplay_PER_openplay_att_percent	Positional attacks from open play/open play attacks (%)
Offsides	When a team player is caught offside
Opponent_Attacks	Attacks by the opposing team
Opponent_Positional_attacks	Positional attacks by the opposing team
Opponent_Counterattacks	Counterattacks by the opposing team
Opponent_Set_pieces_attacks	Set pieces attacks by the opposing team
Opponent_Open_play_attacks	Open play attacks by the opposing team
Opponent_open_play_attacks_percent	Open play attacks by the opposing team/attacks by the opposing team (%)
Opponent_set_pieces_attacks_percent	Set pieces attacks by the opposing team/attacks by the opposing team (%)
Opponent_Counterattacks_percent	Counterattacks/attacks (%) for the opponent’s team
Opponent_Positional_attacks_percent	Positional attacks/attacks (%) for the opponent’s team
Opp_Ratio_counteratt_per_openplay_att_percent	Counterattacks/open play attacks (%) for the opponent’s team
Opp_Ratio_posit_att_from_openplay_PER_openplay_att_percent	Positional attacks from open play/open play attacks (%) for the opponent’s team
Opponent_Offsides	When a player of the opposing team is caught offside
Crosses_per_quantity_of_possession_percent	Crosses/quantity of possessions (%)
Crosses_per_attacks_percent	Crosses/attacks (%)
Shots_per_quantity_of_possession_percent	Shots/quantity of possessions (%)
Shots_per_entrances_to_final_third_percent	Shots/entrances to final third (%)
GoalDif	Goals for minus goals against
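
Several entries in Table A1 are derived from other entries by subtraction or division. The sketch below shows how a few of them could be computed from raw counts, following the definitions in the table. The helper names (`percent`, `ratio`, `derive_features`), the input values and the choice to express "(%)" variables on a 0–100 scale are assumptions made for illustration.

```python
def percent(numerator, denominator):
    """Share on a 0-100 scale; NaN when the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else float("nan")


def ratio(numerator, denominator):
    """Plain ratio; NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def derive_features(raw):
    """Compute a few derived variables from Table A1 definitions."""
    derived = {}
    # Wrong_passes: passes minus accurate passes
    derived["Wrong_passes"] = raw["Passes"] - raw["Accurate_passes"]
    # Accurate_passes__percent: successful passes/passes (%)
    derived["Accurate_passes__percent"] = percent(raw["Accurate_passes"], raw["Passes"])
    # Ground_challenges: challenges minus air challenges
    derived["Ground_challenges"] = raw["Challenges"] - raw["Air_challenges"]
    # RATIO_ground_challenges_PER_air_challenges: ground challenges/air challenges
    derived["RATIO_ground_challenges_PER_air_challenges"] = ratio(
        derived["Ground_challenges"], raw["Air_challenges"]
    )
    # Wide_attacks_percent: (left flank + right flank attacks)/attacks (%)
    derived["Wide_attacks_percent"] = percent(
        raw["Attacks_Left_flank"] + raw["Attacks_Right_flank"], raw["Attacks"]
    )
    return derived


# Hypothetical per-match averages for one team.
raw = {
    "Passes": 500,
    "Accurate_passes": 425,
    "Challenges": 200,
    "Air_challenges": 40,
    "Attacks": 100,
    "Attacks_Left_flank": 35,
    "Attacks_Right_flank": 30,
}
derived = derive_features(raw)
print(derived)
```

Because such variables are deterministic functions of one another, many columns in the full feature set are strongly correlated, which is one reason a feature selection step is useful before model fitting.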

References

1. Rathi, K.; Somani, P.; Koul, A.V.; Manu, K. Applications of Artificial Intelligence in the Game of Football: The Global Perspective. Res. World 2020, 11, 18–29.
2. Fernandez-Navarro, J.; Fradua, L.; Zubillaga, A.; Ford, P.R.; McRobert, A.P. Attacking and Defensive Styles of Play in Soccer: Analysis of Spanish and English Elite Teams. J. Sport. Sci. 2016, 34, 2195–2204.
3. Rein, R.; Memmert, D. Big Data and Tactical Analysis in Elite Soccer: Future Challenges and Opportunities for Sports Science. SpringerPlus 2016, 5, 1410.
4. Brand, A.; Niemann, A.; Spitaler, G. The Europeanization of Austrian Football: History, Adaptation and Transnational Dynamics. Soccer Soc. 2010, 11, 761–774.
5. Goes, F.; Kempe, M.; Lemmink, K. Predicting Match Outcome in Professional Dutch Football Using Tactical Performance Metrics Computed from Position Tracking Data; Propobos Publications: Athens, Greece, 2019; pp. 105–115.
6. Park, E.-M.; Seo, J.-H.; Ko, M.-H. The Effects of Leadership by Types of Soccer Instruction on Big Data Analysis. Clust. Comput. 2016, 19, 1647–1658.
7. Decroos, T.; Van Roy, M.; Davis, J. SoccerMix: Representing Soccer Actions with Mixture Models; Springer: Berlin/Heidelberg, Germany, 2021; pp. 459–474.
8. Plakias, S.; Moustakidis, S.; Kokkotis, C.; Tsatalas, T.; Papalexi, M.; Plakias, D.; Giakas, G.; Tsaopoulos, D. Identifying Soccer Teams’ Styles of Play: A Scoping and Critical Review. J. Funct. Morphol. Kinesiol. 2023, 8, 39.
9. Lago-Peñas, C.; Gómez-Ruano, M.; Yang, G. Styles of Play in Professional Soccer: An Approach of the Chinese Soccer Super League. Int. J. Perform. Anal. Sport 2017, 17, 1073–1084.
10. Decroos, T.; Van Haaren, J.; Davis, J. Automatic Discovery of Tactics in Spatio-Temporal Soccer Match Data. In Proceedings of the KDD ‘18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2018; pp. 223–232.
11. Perl, J.; Grunz, A.; Memmert, D. Tactics Analysis in Soccer–an Advanced Approach. Int. J. Comput. Sci. Sport 2013, 12, 33–44.
12. Fialho, G.; Manhães, A.; Teixeira, J.P. Predicting Sports Results with Artificial Intelligence—A Proposal Framework for Soccer Games. Procedia Comput. Sci. 2019, 164, 131–136.
13. Ulmer, B.; Fernandez, M.; Peterson, M. Predicting Soccer Match Results in the English Premier League. Ph.D. Thesis, Stanford University, Stanford, CA, USA, 2013.
14. Bilek, G.; Ulas, E. Predicting Match Outcome According to the Quality of Opponent in the English Premier League Using Situational Variables and Team Performance Indicators. Int. J. Perform. Anal. Sport 2019, 19, 930–941.
15. Patel, R.; Passi, K. Sentiment Analysis on Twitter Data of World Cup Soccer Tournament Using Machine Learning. IoT 2020, 1, 14.
16. Naik, B.T.; Hashmi, M.F.; Bokde, N.D. A Comprehensive Review of Computer Vision in Sports: Open Issues, Future Trends and Research Directions. Appl. Sci. 2022, 12, 4429.
17. Barris, S.; Button, C. A Review of Vision-Based Motion Analysis in Sport. Sport. Med. 2008, 38, 1025–1043.
18. Danisik, N.; Lacko, P.; Farkas, M. Football Match Prediction Using Players Attributes; IEEE: Piscataway, NJ, USA, 2018; pp. 201–206.
19. Inan, T. Using Poisson Model for Goal Prediction in European Football. 2021. Available online: https://rua.ua.es/dspace/bitstream/10045/107443/6/JHSE_16-4_16.pdf (accessed on 24 April 2023).
20. Robberechts, P.; Davis, J. Forecasting the FIFA World Cup–Combining Result-and Goal-Based Team Ability Parameters; Springer: Berlin/Heidelberg, Germany, 2019; pp. 16–30.
21. Prasetio, D. Predicting Football Match Results with Logistic Regression; IEEE: Piscataway, NJ, USA, 2016; pp. 1–5.
22. Bunker, R.P.; Thabtah, F. A Machine Learning Framework for Sport Result Prediction. Appl. Comput. Inform. 2019, 15, 27–33.
23. Hubáček, O.; Šourek, G.; Železný, F. Learning to Predict Soccer Results from Relational Data with Gradient Boosted Trees. Mach. Learn. 2019, 108, 29–47.
24. Hsu, Y.-C. Using Convolutional Neural Network and Candlestick Representation to Predict Sports Match Outcomes. Appl. Sci. 2021, 11, 6594.
25. Zhang, Q.; Zhang, X.; Hu, H.; Li, C.; Lin, Y.; Ma, R. Sports Match Prediction Model for Training and Exercise Using Attention-Based LSTM Network. Digit. Commun. Netw. 2022, 8, 508–515.
26. Wunderlich, F.; Memmert, D. The Betting Odds Rating System: Using Soccer Forecasts to Forecast Soccer. PLoS ONE 2018, 13, e0198668.
27. Samek, W.; Montavon, G.; Vedaldi, A.; Hansen, L.K.; Müller, K.-R. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning; Springer: Berlin/Heidelberg, Germany, 2019; Volume 11700, ISBN 3-030-28954-0.
28. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.-I. From Local Explanations to Global Understanding with Explainable AI for Trees. Nat. Mach. Intell. 2020, 2, 56–67.
29. Geurkink, Y.; Boone, J.; Verstockt, S.; Bourgois, J.G. Machine Learning-Based Identification of the Strongest Predictive Variables of Winning and Losing in Belgian Professional Soccer. Appl. Sci. 2021, 11, 2378.
30. Ren, Y.; Susnjak, T. Predicting Football Match Outcomes with EXplainable Machine Learning and the Kelly Index. arXiv 2022, arXiv:2211.15734.
31. Gómez, M.-Á.; Mitrotasios, M.; Armatas, V.; Lago-Peñas, C. Analysis of Playing Styles According to Team Quality and Match Location in Greek Professional Soccer. Int. J. Perform. Anal. Sport 2018, 18, 986–997.
32. Casal, C.A.; Maneiro, R.; Ardá, A.; Losada, J.L. Gender Differences in Technical-Tactical Behaviour of Laliga Spanish Football Teams. J. Hum. Sport Exerc. 2019, 16, 37–52.
33. Castellano, J.; Echeazarra, I. Network-Based Centrality Measures and Physical Demands in Football Regarding Player Position: Is There a Connection? A Preliminary Study. J. Sport. Sci. 2019, 37, 2631–2638.
34. Kumar, R.; Kumar, P.; Tripathi, R.; Gupta, G.P.; Garg, S.; Hassan, M.M. A Distributed Intrusion Detection System to Detect DDoS Attacks in Blockchain-Enabled IoT Network. J. Parallel Distrib. Comput. 2022, 164, 55–68.
35. Zhang, Z.; Li, Y.; Jin, S.; Zhang, Z.; Wang, H.; Qi, L.; Zhou, R. Modulation Signal Recognition Based on Information Entropy and Ensemble Learning. Entropy 2018, 20, 198.
36. Shahani, N.M.; Zheng, X.; Liu, C.; Hassan, F.U.; Li, P. Developing an XGBoost Regression Model for Predicting Young’s Modulus of Intact Sedimentary Rocks for the Stability of Surface and Subsurface Structures. Front. Earth Sci. 2021, 9, 761990.
37. Malik, A.; Tikhamarine, Y.; Souag-Gamane, D.; Kisi, O.; Pham, Q.B. Support Vector Regression Optimized by Meta-Heuristic Algorithms for Daily Streamflow Prediction. Stoch. Environ. Res. Risk Assess. 2020, 34, 1755–1773.
38. Babar, B.; Luppino, L.T.; Boström, T.; Anfinsen, S.N. Random Forest Regression for Improved Mapping of Solar Irradiance at High Latitudes. Sol. Energy 2020, 198, 81–92.
39. Zhou, Y.; Huang, M.; Pecht, M. Remaining Useful Life Estimation of Lithium-Ion Cells Based on k-Nearest Neighbor Regression with Differential Evolution Optimization. J. Clean. Prod. 2020, 249, 119409.
40. Lipovetsky, S.; Conklin, M. Analysis of Regression in Game Theory Approach. Appl. Stoch. Model. Bus. Ind. 2001, 17, 319–330.
41. Palatnik de Sousa, I.; Maria Bernardes Rebuzzi Vellasco, M.; Costa da Silva, E. Local Interpretable Model-Agnostic Explanations for Classification of Lymph Node Metastases. Sensors 2019, 19, 2969.
42. Lago-Ballesteros, J.; Lago-Peñas, C. Performance in Team Sports: Identifying the Keys to Success in Soccer. J. Hum. Kinet. 2010, 25, 85–91.
43. Liu, H.; Gomez, M.-Á.; Lago-Peñas, C.; Sampaio, J. Match Statistics Related to Winning in the Group Stage of 2014 Brazil FIFA World Cup. J. Sport. Sci. 2015, 33, 1205–1213.
44. Liu, H.; Hopkins, W.G.; Gómez, M.-A. Modelling Relationships between Match Events and Match Outcome in Elite Football. Eur. J. Sport. Sci. 2016, 16, 516–525.
45. Castellano, J.; Casamichana, D.; Lago, C. The Use of Match Statistics That Discriminate between Successful and Unsuccessful Soccer Teams. J. Hum. Kinet. 2012, 31, 137–147.
46. Rathke, A. An Examination of Expected Goals and Shot Efficiency in Soccer. J. Hum. Sport Exerc. 2017, 12, 514–529.
47. Muazu Musa, R.; PP Abdul Majeed, A.; Abdullah, M.R.; Ab. Nasir, A.F.; Arif Hassan, M.H.; Mohd Razman, M.A. Technical and Tactical Performance Indicators Discriminating Winning and Losing Team in Elite Asian Beach Soccer Tournament. PLoS ONE 2019, 14, e0219138.
48. Pappalardo, L.; Cintia, P. Quantifying the Relation between Performance and Success in Soccer. Adv. Complex Syst. 2018, 21, 1750014.
49. Zhou, C.; Zhang, S.; Lorenzo Calvo, A.; Cui, Y. Chinese Soccer Association Super League, 2012–2017: Key Performance Indicators in Balance Games. Int. J. Perform. Anal. Sport 2018, 18, 645–656.
50. Zhou, C.; Calvo, A.L.; Robertson, S.; Gómez, M.-Á. Long-Term Influence of Technical, Physical Performance Indicators and Situational Variables on Match Outcome in Male Professional Chinese Soccer. J. Sport. Sci. 2021, 39, 598–608.
51. Harrop, K.; Nevill, A. Performance Indicators That Predict Success in an English Professional League One Soccer Team. Int. J. Perform. Anal. Sport 2014, 14, 907–920.
52. Yang, G.; Leicht, A.S.; Lago, C.; Gómez, M.-Á. Key Team Physical and Technical Performance Indicators Indicative of Team Quality in the Soccer Chinese Super League. Res. Sport. Med. 2018, 26, 158–167.
53. Akyildiz, Z.; Nobari, H.; González-Fernández, F.T.; Praça, G.M.; Sarmento, H.; Guler, A.H.; Saka, E.K.; Clemente, F.M.; Figueiredo, A.J. Variations in the Physical Demands and Technical Performance of Professional Soccer Teams over Three Consecutive Seasons. Sci. Rep. 2022, 12, 2412.
54. Cakmak, A.; Uzun, A.; Delibas, E. Computational Modeling of Pass Effectiveness in Soccer. Adv. Complex Syst. 2018, 21, 1850010.
55. Mulazimoglu, O. The Effect of Special Technical Events in the Game on the Success of Professional Soccer Teams: Turkish Super League. Rev. Line Política Gestão Educ. 2021, 25, 1418–1431.
56. Almeida, C.H.; Ferreira, A.P.; Volossovitch, A. Effects of Match Location, Match Status and Quality of Opposition on Regaining Possession in UEFA Champions League. J. Hum. Kinet. 2014, 41, 203–214.
57. Bojinov, I.; Bornn, L. The Pressing Game: Optimal Defensive Disruption in Soccer. In Proceedings of the 10th MIT Sloan Sports Analytics Conference, Boston, MA, USA, 12 March 2016.
58. Merckx, S.; Robberechts, P.; Euvrard, Y.; Davis, J. Measuring the Effectiveness of Pressing in Soccer. In Proceedings of the Workshop on Machine Learning and Data Mining for Sports Analytics, Virtual, 13 September 2021.
59. Iván-Baragaño, I.; Maneiro, R.; Losada, J.L.; Ardá, A. Multivariate Analysis of the Offensive Phase in High-Performance Women’s Soccer: A Mixed Methods Study. Sustainability 2021, 13, 6379.
60. Maneiro, R.; Casal, C.A.; Álvarez, I.; Moral, J.E.; López, S.; Ardá, A.; Losada, J.L. Offensive Transitions in High-Performance Football: Differences between UEFA Euro 2008 and UEFA Euro 2016. Front. Psychol. 2019, 10, 1230.
61. Scanlan, M.; Harms, C.; Cochrane Wilkie, J.; Ma’ayah, F. The Creation of Goal Scoring Opportunities at the 2015 Women’s World Cup. Int. J. Sport. Sci. Coach. 2020, 15, 803–808.
62. Hughes, M.; Lovell, T. Transition to Attack in Elite Soccer. J. Hum. Sport Exerc. 2019, 14, 1.
63. Warwick, J. The Efficacy of Counter-Pressing as an Offensive-Defensive Philosophy. Master’s Thesis, University of Miami, Miami, FL, USA, 2019.
64. Stöckl, M.; Seidl, T.; Marley, D.; Power, P. Making Offensive Play Predictable-Using a Graph Convolutional Network to Understand Defensive Performance in Soccer. In Proceedings of the 15th MIT Sloan Sports Analytics Conference, Virtual, 8–9 April 2021; Volume 2022.
65. Davies, J.C. Coaching the Tiki Taka Style of Play; SoccerTutor.com Limited: Essex, UK, 2013; ISBN 978-0-9576705-4-9.
66. Llopis-Goig, R. The Decline of the Spanish Fury. In Spanish Football and Social Change: Sociological Investigations; Palgrave Macmillan: London, UK, 2015; pp. 64–85.
67. Rashid, M.F.F.A. Tiki-Taka Algorithm: A Novel Metaheuristic Inspired by Football Playing Style. Eng. Comput. 2020, 38, 313–343.
68. Cintia, P.; Pappalardo, L. Coach2vec: Autoencoding the Playing Style of Soccer Coaches. arXiv 2021, arXiv:2106.15444.
69. Ahmed, A.M.; Deo, R.C.; Feng, Q.; Ghahramani, A.; Raj, N.; Yin, Z.; Yang, L. Deep Learning Hybrid Model with Boruta-Random Forest Optimiser Algorithm for Streamflow Forecasting with Climate Mode Indices, Rainfall, and Periodicity. J. Hydrol. 2021, 599, 126350.

Figure 1. Proposed pipeline.

Figure 2. SHAP summary dot plot presenting the overall contribution of the most important variables on the teams’ score performance.

Figure 3. Comparison of predicted and actual averaged team score performance (normalized). The scatter plot illustrates the relationship between the model’s predictions (y-axis) and the actual results (x-axis), with the line of best fit providing an indication of the model’s accuracy. Each data point represents a specific team.

Figure 4. SHAP force plot indicating key team-level performance variables of Liverpool FC and their impact on the team’s score performance.

Figure 5. SHAP force plot indicating key team-level performance variables of Manchester FC and their impact on the team’s score performance.

Figure 6. SHAP force plot indicating key team-level performance variables of West Ham FC and their impact on the team’s score performance.

Figure 7. SHAP force plot indicating key team-level performance variables of Lazio FC and their impact on the team’s score performance.

Table 1. Sample characteristics and number of variables per category.

Sample Characteristics
Number of leagues	11
Number of teams	174
Number of matches	2996
Variables
Number of variables related to attack	74
Number of variables related to defense	44
Number of variables related to attacking transition	5
Number of variables related to defensive transition	3
Number of variables related to attacking set pieces	12
Number of variables related to defensive set pieces	2
Number of variables not related to a specific phase	20

Table 2. ML model characteristics and performance.

Machine Learning Model Specifications
Model	XGBRegressor	SVR	RF	kNN
Hyperparameters	{colsample_bytree: 0.3; learning_rate: 0.1; max_depth: 3; n_estimators: 500}	{C: 1; kernel: ‘sigmoid’; epsilon: 1}	{min_samples_leaf: 2; min_samples_split: 6; n_estimators: 25}	{leaf_size: 1; n_neighbors: 9; p: 2}
Validation Strategy
Approach	Cross-validation
Number of folds	10
Performance
Root mean squared error	32.09%	42.41%	36.01%	40.71%
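
The validation strategy in Table 2 can be sketched as follows for the kNN column (n_neighbors = 9, p = 2, 10 folds). This is a minimal standard-library illustration and not the authors' implementation: the data are synthetic, the function names are hypothetical, and averaging the per-fold RMSE is an assumption, since the table does not state how the fold errors were aggregated.

```python
import math
import random


def knn_predict(train_X, train_y, x, n_neighbors=9, p=2):
    """Average the targets of the n_neighbors closest rows (Minkowski distance of order p)."""
    distances = sorted(
        (sum(abs(a - b) ** p for a, b in zip(row, x)) ** (1.0 / p), target)
        for row, target in zip(train_X, train_y)
    )
    nearest = distances[:n_neighbors]
    return sum(target for _, target in nearest) / len(nearest)


def kfold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle the sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validated_rmse(X, y, n_folds=10, n_neighbors=9, p=2, seed=0):
    """Mean of the per-fold RMSE values (aggregation choice is an assumption)."""
    fold_rmse = []
    for held_out in kfold_indices(len(X), n_folds, seed):
        held = set(held_out)
        train_X = [X[i] for i in range(len(X)) if i not in held]
        train_y = [y[i] for i in range(len(X)) if i not in held]
        squared_errors = [
            (knn_predict(train_X, train_y, X[i], n_neighbors, p) - y[i]) ** 2
            for i in held_out
        ]
        fold_rmse.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return sum(fold_rmse) / len(fold_rmse)


# Synthetic stand-in data: 174 rows (one per team) with 3 made-up variables.
rng = random.Random(42)
X = [[rng.random() for _ in range(3)] for _ in range(174)]
y = [0.5 * row[0] - 0.3 * row[1] + rng.gauss(0.0, 0.05) for row in X]

rmse = cross_validated_rmse(X, y)
print(f"10-fold cross-validated RMSE: {rmse:.4f}")
```

Each team appears in exactly one held-out fold, so every prediction that contributes to the error is made by a model that never saw that team during fitting. The same loop could wrap any of the other regressors in Table 2 by swapping out `knn_predict`.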

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
