May 2023
This project used Generalized Additive Models (GAMs) to explore and predict diabetes risk from a set of health factors. The goal was to develop a robust predictive model that captures non-linear relationships between risk factors and diabetes status.
GAMs provide significant advantages over traditional logistic regression by allowing flexible modeling of non-linear effects without requiring a priori specification of the functional form. This flexibility is particularly valuable when analyzing complex health data where relationships often don't follow simple linear patterns.
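To make the contrast concrete, the two models can be written side by side. The notation is introduced here for illustration and is not from the original write-up: $p_i$ is the probability of diabetes for patient $i$, $x_{ij}$ is the $j$-th of the eight predictors, and $f_j$ is an unknown smooth function estimated from the data.

```latex
\begin{align}
\text{Logistic regression:}\quad \operatorname{logit}(p_i) &= \beta_0 + \sum_{j=1}^{8} \beta_j x_{ij} \\
\text{GAM:}\quad \operatorname{logit}(p_i) &= \beta_0 + \sum_{j=1}^{8} f_j(x_{ij})
\end{align}
```

Each $f_j$ is typically estimated by penalized likelihood, with a wiggliness penalty of the form $\lambda_j \int f_j''(x)^2\,dx$. The smoothing parameter $\lambda_j$ controls how far $f_j$ may depart from a straight line, so logistic regression is recovered as the heavily penalized limit.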
The approach for this project followed these steps:
I used thin plate regression splines (the default basis for s() in mgcv) as the smoothing basis, with smoothing parameters selected automatically by restricted maximum likelihood (method = "REML" in the code below) to balance model fit against complexity.
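As a rough illustration of how the selection criterion affects the fitted smooth, the sketch below fits the same thin plate spline to simulated data under two criteria. The data, the variable names (sim, x, y) and the basis dimension k = 10 are assumptions made for this example and are not taken from the diabetes analysis.

```r
library(mgcv)  # recommended package bundled with standard R installations

set.seed(1)
n <- 500
sim <- data.frame(x = runif(n, -2, 2))
# Hypothetical non-linear effect on the log-odds scale
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + sin(2 * sim$x)))

# Same thin plate basis (bs = "tp", k = 10), two smoothness selection criteria
fit_reml <- gam(y ~ s(x, bs = "tp", k = 10), data = sim,
                family = binomial(), method = "REML")
fit_gcv  <- gam(y ~ s(x, bs = "tp", k = 10), data = sim,
                family = binomial(), method = "GCV.Cp")

# Effective degrees of freedom of the smooth and the selected smoothing parameter
comparison <- data.frame(
  method     = c("REML", "GCV.Cp"),
  smooth_edf = c(summary(fit_reml)$edf, summary(fit_gcv)$edf),
  sp         = unname(c(fit_reml$sp, fit_gcv$sp))
)
print(comparison)
```

For a binomial model the scale parameter is known, so method = "GCV.Cp" in mgcv falls back to a Cp/UBRE-type criterion rather than GCV proper. The two criteria often select similar smoothness, but REML tends to be less prone to occasional undersmoothing.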
The following code demonstrates the implementation of GAMs for diabetes prediction:
# Load necessary libraries
library(mgcv)
library(ggplot2)
library(dplyr)
library(pROC)
library(caret)
# Load data
diabetes_data <- read.csv("diabetes_data.csv")
# Data preprocessing
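diabetes_data <- diabetes_data %>%
  # Convert outcome to factor
  mutate(Outcome = factor(Outcome, levels = c(0, 1), labels = c("No Diabetes", "Diabetes"))) %>%
  # Standardize continuous predictors. scale() returns a one-column matrix,
  # so as.numeric() keeps each column a plain numeric vector (mgcv's s()
  # treats matrix arguments specially).
  # Note: scaling before the split uses test-set means and SDs as well.
  mutate(across(c(Pregnancies, Glucose, BloodPressure, SkinThickness,
                  Insulin, BMI, DiabetesPedigreeFunction, Age),
                ~ as.numeric(scale(.x))))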
# Split data into training and testing sets
set.seed(123)
train_index <- createDataPartition(diabetes_data$Outcome, p = 0.7, list = FALSE)
train_data <- diabetes_data[train_index, ]
test_data <- diabetes_data[-train_index, ]
# Fit logistic regression model (for comparison)
logistic_model <- glm(Outcome ~ .,
                      data = train_data,
                      family = binomial())
# Fit GAM with smooth terms for continuous predictors
gam_model <- gam(Outcome ~ s(Pregnancies) + s(Glucose) + s(BloodPressure) +
                   s(SkinThickness) + s(Insulin) + s(BMI) +
                   s(DiabetesPedigreeFunction) + s(Age),
                 data = train_data,
                 family = binomial(),
                 method = "REML")
# Model summary
summary(gam_model)
# Display effective degrees of freedom for smoothers
edf_summary <- data.frame(
  Variable = c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
               "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"),
  EDF = summary(gam_model)$edf
)
print(edf_summary)
# Predictions on test data
test_data$logistic_pred <- predict(logistic_model, newdata = test_data, type = "response")
test_data$gam_pred <- predict(gam_model, newdata = test_data, type = "response")
# Calculate ROC curves
logistic_roc <- roc(test_data$Outcome, test_data$logistic_pred)
gam_roc <- roc(test_data$Outcome, test_data$gam_pred)
# Compare AUC values
auc_comparison <- data.frame(
  Model = c("Logistic Regression", "GAM"),
  AUC = c(auc(logistic_roc), auc(gam_roc))
)
print(auc_comparison)
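For readers less familiar with the metric, the AUC compared above is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. A minimal base-R sketch of that rank-based (Mann-Whitney) calculation follows. The helper auc_rank and the four toy observations are hypothetical; they are not part of pROC or of the diabetes data.

```r
# Rank-based AUC; auc_rank is a hypothetical helper written for illustration
auc_rank <- function(labels, scores) {
  pos   <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # average ranks, so ties count as half a win
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(labels, scores)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a small difference between the two models should be read with the test-set size in mind.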
Visualization of the smooth functions for key risk factors:
# Function to create better visualizations of GAM smooth terms
visualize_smooths <- function(model, pred_var, pred_range = NULL, data = train_data) {
  if (is.null(pred_range)) {
    pred_range <- seq(min(data[[pred_var]]), max(data[[pred_var]]), length.out = 100)
  }
  # Create prediction data frame: every column is held at 0 (the sample mean
  # of the standardized predictors) and only pred_var varies
  pred_data <- data.frame(matrix(0, nrow = length(pred_range), ncol = ncol(data)))
  colnames(pred_data) <- colnames(data)
  pred_data[[pred_var]] <- pred_range
  # Generate predictions on the link (log-odds) scale
  preds <- predict(model, newdata = pred_data, type = "link", se.fit = TRUE)
  # Convert to probability scale with approximate 95% confidence intervals
  pred_results <- data.frame(
    x = pred_range,
    fit = as.numeric(preds$fit),
    se = as.numeric(preds$se.fit)
  ) %>%
    mutate(
      fit_prob = plogis(fit),
      lower_prob = plogis(fit - 1.96 * se),
      upper_prob = plogis(fit + 1.96 * se)
    )
  # Plot results (the x-axis is in standardized units because predictors were scaled)
  ggplot(pred_results, aes(x = x, y = fit_prob)) +
    # linewidth replaces the size argument for lines, deprecated in ggplot2 3.4.0
    geom_line(linewidth = 1.2, color = "blue") +
    geom_ribbon(aes(ymin = lower_prob, ymax = upper_prob), alpha = 0.2) +
    labs(
      title = paste("Effect of", pred_var, "on Diabetes Risk"),
      x = paste(pred_var, "(standardized)"),
      y = "Probability of Diabetes"
    ) +
    theme_minimal()
}
# Generate plots for key variables
bmi_plot <- visualize_smooths(gam_model, "BMI")
glucose_plot <- visualize_smooths(gam_model, "Glucose")
age_plot <- visualize_smooths(gam_model, "Age")
# Display plots
print(bmi_plot)
print(glucose_plot)
print(age_plot)
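mgcv also ships its own plotting method for smooth terms, which can serve as a quick cross-check on custom plots like the ones above. The sketch below uses simulated data. The names sim, x1, x2 and fit are assumptions for this example and do not refer to the diabetes model.

```r
library(mgcv)

set.seed(2)
n <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# Hypothetical data: a U-shaped effect of x1 and a linear effect of x2
sim$y <- rbinom(n, size = 1, prob = plogis(-1 + sim$x1^2 + 0.5 * sim$x2))

fit <- gam(y ~ s(x1) + s(x2), data = sim, family = binomial(), method = "REML")

# Centered partial effects on the log-odds scale, all smooths on one page
plot(fit, pages = 1, shade = TRUE, seWithMean = TRUE)

# First smooth only, shifted by the intercept and mapped to the probability scale
plot(fit, select = 1, shade = TRUE, shift = coef(fit)[1], trans = plogis)
```

The two views answer slightly different questions. plot.gam shows each smooth as a centered partial effect, whereas a predicted-probability curve with the other predictors fixed at a reference value shows the risk for one particular covariate profile.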
The GAM-based analysis revealed several important insights:
Visualization of BMI's non-linear effect on diabetes risk:
[BMI smooth function plot would appear here]
Comparison of ROC curves between GAM and logistic regression:
[ROC curve comparison would appear here]
This project demonstrated the advantages of GAMs for health risk modeling:
The ability to visualize these complex relationships through smooth function plots provides not only better predictions but also more interpretable insights for healthcare providers. Future work could explore incorporating interactions between risk factors and extending the approach to longitudinal data.
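One possible starting point for the interaction extension is a tensor product interaction term, ti(), added to the main-effect smooths. The sketch below uses simulated data with a built-in interaction. The variables x1 and x2 and the effect sizes are hypothetical and are not estimates from the diabetes data.

```r
library(mgcv)

set.seed(3)
n <- 600
sim <- data.frame(x1 = runif(n, -2, 2), x2 = runif(n, -2, 2))
# Hypothetical log-odds with main effects plus an x1-by-x2 interaction
eta <- -0.5 + sin(sim$x1) + 0.5 * sim$x2 + 0.8 * sim$x1 * sim$x2
sim$y <- rbinom(n, size = 1, prob = plogis(eta))

# Additive model versus the same model with a smooth interaction term
fit_add <- gam(y ~ s(x1) + s(x2), data = sim,
               family = binomial(), method = "REML")
fit_int <- gam(y ~ s(x1) + s(x2) + ti(x1, x2), data = sim,
               family = binomial(), method = "REML")

# Compare the two fits; a lower AIC favors keeping the interaction
aic_table <- AIC(fit_add, fit_int)
print(aic_table)
```

Because ti() excludes the main effects, the interaction surface can be inspected and tested separately from the individual smooths, which keeps the additive interpretation of each risk factor intact.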