December 2022
This project focuses on modeling expected goals (xG) in soccer using a Quasi-Poisson regression approach. Expected goals has become a critical metric in soccer analytics, representing the probability of a shot resulting in a goal based on various factors like shot location, angle, and game context.
The study analyzed data from the top five European soccer leagues (English Premier League, Spanish La Liga, German Bundesliga, Italian Serie A, and French Ligue 1) to identify factors that significantly influence goal-scoring and to develop predictive models for expected goals.
The methodology for this project included:
Quasi-Poisson regression was chosen specifically to address the overdispersion typically observed in soccer scoring data, where the variance of goals exceeds the mean, violating a key assumption of standard Poisson models.
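As a minimal sketch of what the quasi-Poisson adjustment does in practice, the example below fits both families to simulated overdispersed counts. The data, the single predictor `x`, and the negative binomial generator are illustrative assumptions, not the project's shot data. It shows that the coefficient estimates are unchanged while the standard errors are scaled by the square root of the estimated dispersion.

```r
# Simulated, overdispersed counts (hypothetical data, not the project's shots).
set.seed(42)
n  <- 2000
x  <- runif(n)
mu <- exp(0.2 + 0.8 * x)
# Negative binomial draws: variance = mu + mu^2 / size, which exceeds the mean.
y  <- rnbinom(n, mu = mu, size = 1.5)

fit_pois  <- glm(y ~ x, family = poisson())
fit_quasi <- glm(y ~ x, family = quasipoisson())

# Pearson-based dispersion estimate: values well above 1 signal overdispersion.
phi <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)

# Point estimates are the same under both families.
coef_equal <- isTRUE(all.equal(coef(fit_pois), coef(fit_quasi)))

# Quasi-Poisson standard errors are the Poisson ones multiplied by sqrt(phi).
se_pois  <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]
se_ratio <- se_quasi / se_pois

print(c(dispersion = phi, sqrt_dispersion = sqrt(phi)))
print(se_ratio)
```

Because only the standard errors change, the choice between the two families affects inference on the coefficients (confidence intervals, significance tests) rather than the fitted values.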
The following code demonstrates the implementation of the Quasi-Poisson model for expected goals:
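# Load necessary libraries
# MASS is attached before tidyverse so that MASS::select() does not mask dplyr::select()
library(MASS)
library(boot)
library(tidyverse)  # attaches dplyr and ggplot2
library(pROC)       # roc() and auc()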
# Load soccer match data
soccer_data <- read.csv("soccer_shots_data.csv")
# Data preprocessing
soccer_data <- soccer_data %>%
  # Create additional features
  mutate(
    distance_squared = distance^2,
    angle_rad = angle * pi / 180,
    # Calculate shot quality metrics
    shot_quality = cos(angle_rad) / (1 + distance / 10),
    is_big_chance = factor(is_big_chance),
    is_counterattack = factor(is_counterattack),
    body_part = factor(body_part, levels = c("foot", "head", "other")),
    league = factor(league),
    is_goal = as.numeric(outcome == "goal")
  )
# Train-test split (70% train, 30% test)
set.seed(123)
n_train <- floor(0.7 * nrow(soccer_data))  # sample size must be a whole number
train_indices <- sample(seq_len(nrow(soccer_data)), size = n_train)
train_data <- soccer_data[train_indices, ]
test_data <- soccer_data[-train_indices, ]
# Fit standard Poisson model
poisson_model <- glm(
  is_goal ~ distance + distance_squared + angle + body_part + is_counterattack +
    is_big_chance + league,
  data = train_data,
  family = poisson()
)
# Fit Quasi-Poisson model
quasi_poisson_model <- glm(
  is_goal ~ distance + distance_squared + angle + body_part + is_counterattack +
    is_big_chance + league,
  data = train_data,
  family = quasipoisson()
)
# Compare dispersion parameters
summary(poisson_model)        # dispersion fixed at 1
summary(quasi_poisson_model)  # dispersion estimated from the data
# Calculate dispersion parameter explicitly (Pearson chi-squared / residual df)
dispersion <- sum(residuals(poisson_model, type = "pearson")^2) /
  poisson_model$df.residual
cat("Estimated dispersion parameter:", dispersion, "\n")
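# Predict expected goals for test data
# Both fits share the same coefficients (quasi-Poisson only rescales the standard
# errors), so xg_poisson and xg_quasi are identical. With a log link, predicted
# values are not capped at 1.
test_data$xg_poisson <- predict(poisson_model, newdata = test_data, type = "response")
test_data$xg_quasi <- predict(quasi_poisson_model, newdata = test_data, type = "response")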
# Evaluate model performance
evaluate_predictions <- function(actual, predicted) {
  # Calculate ROC and AUC (as.numeric drops the pROC "auc" class)
  roc_result <- pROC::roc(actual, predicted, quiet = TRUE)
  auc_value <- as.numeric(pROC::auc(roc_result))
  # Calculate mean squared error
  mse <- mean((actual - predicted)^2)
  # Calculate mean absolute error
  mae <- mean(abs(actual - predicted))
  # Calculate log loss (more appropriate for probabilistic predictions)
  eps <- 1e-15 # Small value to prevent log(0)
  # Clip to [eps, 1 - eps]; this also caps any prediction above 1
  pred_bounded <- pmax(pmin(predicted, 1 - eps), eps)
  logloss <- -mean(actual * log(pred_bounded) + (1 - actual) * log(1 - pred_bounded))
  # Return metrics
  list(AUC = auc_value, MSE = mse, MAE = mae, LogLoss = logloss)
}
# Compare models
poisson_metrics <- evaluate_predictions(test_data$is_goal, test_data$xg_poisson)
quasi_metrics <- evaluate_predictions(test_data$is_goal, test_data$xg_quasi)
# Print results
results_df <- data.frame(
  Model = c("Poisson", "Quasi-Poisson"),
  AUC = c(poisson_metrics$AUC, quasi_metrics$AUC),
  MSE = c(poisson_metrics$MSE, quasi_metrics$MSE),
  MAE = c(poisson_metrics$MAE, quasi_metrics$MAE),
  LogLoss = c(poisson_metrics$LogLoss, quasi_metrics$LogLoss)
)
print(results_df)
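Alongside the summary metrics above, a calibration table is a common way to check whether predicted xG values behave like probabilities. The sketch below is illustrative only: the `xg` and `goal` vectors are simulated, and the bin edges are arbitrary assumptions rather than values from the project. In practice, `test_data$xg_quasi` and `test_data$is_goal` would take their place.

```r
set.seed(1)
n <- 5000
# Hypothetical predicted xG values, with outcomes drawn to be consistent with them.
xg   <- rbeta(n, 1, 8)
goal <- rbinom(n, size = 1, prob = xg)

# Group predictions into xG bands (assumed bin edges).
breaks <- c(0, 0.05, 0.1, 0.2, 0.4, 1)
bin <- cut(xg, breaks = breaks, include.lowest = TRUE)

# For each band, compare the mean predicted xG with the observed goal rate.
calibration <- data.frame(
  bin       = levels(bin),
  n_shots   = as.vector(table(bin)),
  mean_xg   = as.vector(tapply(xg, bin, mean)),
  goal_rate = as.vector(tapply(goal, bin, mean))
)
print(calibration)
```

In a well-calibrated model, `mean_xg` and `goal_rate` stay close within each band; large gaps in the high-xG bands would suggest the model over- or under-rates clear chances.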
The following code summarizes shot characteristics across leagues and visualizes the fitted xG values:
# Analyze shot characteristics across leagues
league_analysis <- soccer_data %>%
  group_by(league) %>%
  summarize(
    n_shots = n(),
    n_goals = sum(is_goal),
    conversion_rate = mean(is_goal) * 100,
    avg_distance = mean(distance),
    avg_angle = mean(angle),
    big_chance_pct = mean(is_big_chance == "1") * 100,
    counter_pct = mean(is_counterattack == "1") * 100
  )
# Visualize xG model
ggplot(test_data, aes(x = distance, y = xg_quasi, color = factor(angle > 45))) +
  geom_smooth(method = "loess") +
  facet_wrap(~body_part) +
  labs(
    title = "Expected Goals by Distance and Angle",
    x = "Distance from Goal (meters)",
    y = "Expected Goals (xG)",
    color = "Wide Angle (>45°)"
  ) +
  theme_minimal()
# Calculate average xG by pitch position (for heatmap)
# xg_quasi exists only on test_data, so the heatmap uses the held-out shots
pitch_xg <- test_data %>%
  mutate(
    x_bin = cut(x_coordinate, breaks = 10),
    y_bin = cut(y_coordinate, breaks = 10)
  ) %>%
  group_by(x_bin, y_bin) %>%
  summarize(
    avg_xg = mean(xg_quasi),
    n_shots = n(),
    .groups = "drop"
  )
# Visualize xG heatmap
ggplot(pitch_xg, aes(x = x_bin, y = y_bin, fill = avg_xg)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red") +
  labs(
    title = "Expected Goals Heatmap by Pitch Position",
    x = "X Position",
    y = "Y Position",
    fill = "Average xG"
  ) +
  theme_minimal()
The analysis of expected goals across the top five leagues revealed several key insights:
Expected goals by shot distance and angle:
[xG by distance and angle plot would appear here]
xG comparison across different leagues:
[League comparison chart would appear here]
This project demonstrated the value of Quasi-Poisson models for soccer analytics:
The expected goals model developed in this project can be used to evaluate player finishing ability, assess team offensive and defensive performance, and identify potential market inefficiencies in player valuation. Future work could incorporate more detailed spatial data and defensive pressure metrics to further refine xG estimates.
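One way the finishing-ability use case could be approached is to compare each player's actual goals with the sum of their shots' xG. The sketch below is a hypothetical illustration: the `shots` table, the `player` column, and the xG values are made up for the example and do not come from the project's data.

```r
# Hypothetical shot-level table: one row per shot with a model xG value.
shots <- data.frame(
  player  = c("A", "A", "A", "B", "B", "C"),
  is_goal = c(1, 0, 1, 0, 0, 1),
  xg      = c(0.30, 0.10, 0.45, 0.25, 0.05, 0.08)
)

# Sum goals and xG per player.
finishing <- aggregate(
  shots[, c("is_goal", "xg")],
  by = list(player = shots$player),
  FUN = sum
)
names(finishing) <- c("player", "goals", "xg")

# Positive values mean more goals than the model expected from those shots.
finishing$goals_minus_xg <- finishing$goals - finishing$xg
finishing <- finishing[order(-finishing$goals_minus_xg), ]
print(finishing)
```

With only a handful of shots per player, the difference is dominated by noise, so a comparison like this is more informative over a full season or longer.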