October 2022
This project investigates the similarities and differences between classical (frequentist) and Bayesian approaches to regression analysis through simulation studies. While both approaches aim to estimate relationships between variables, they differ fundamentally in their philosophical foundations and practical implementations.
The study explores how these differences manifest in parameter estimation, uncertainty quantification, and prediction under various data conditions. Through controlled simulations, I evaluated the performance of both approaches across different sample sizes, noise levels, and model misspecifications.
The methodology for this project included generating synthetic data with known parameters, fitting both ordinary least squares and Bayesian regression models to the same datasets, and comparing their estimates, interval coverage, and predictive accuracy across the simulation conditions described above.
For the Bayesian implementation, I explored both conjugate priors (which allow for analytical solutions) and more flexible prior specifications that required Markov Chain Monte Carlo (MCMC) sampling. This allowed for a comprehensive comparison of different approaches within the Bayesian framework itself.
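To make the conjugate case concrete, the sketch below (not part of the simulation code; the function name conjugate_posterior is mine, and it assumes the noise standard deviation sigma is known for simplicity) computes the analytical posterior for the coefficients under a normal prior, with no MCMC required:

# Minimal sketch of conjugate Bayesian linear regression: with a normal
# prior beta ~ N(m0, S0) and known noise sd sigma, the posterior for beta
# is also normal, so it can be computed in closed form.
# X should include a column of 1s if an intercept is desired.
conjugate_posterior <- function(X, y, sigma, m0, S0) {
  S0_inv <- solve(S0)
  # Posterior covariance: (S0^-1 + X'X / sigma^2)^-1
  S_n <- solve(S0_inv + crossprod(X) / sigma^2)
  # Posterior mean: S_n (S0^-1 m0 + X'y / sigma^2)
  m_n <- S_n %*% (S0_inv %*% m0 + crossprod(X, y) / sigma^2)
  list(mean = as.vector(m_n), cov = S_n)
}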
The following code demonstrates the implementation of classical and Bayesian regression models:
# Load necessary libraries
library(rstanarm) # For Bayesian regression
library(bayesplot)
library(ggplot2)
library(dplyr)
# Function to generate synthetic data
generate_data <- function(n_samples, beta_true, sigma, x_range = c(-3, 3)) {
  # Generate predictor variables (the first element of beta_true is the
  # intercept, so draw one fewer column than there are coefficients)
  n_pred <- length(beta_true) - 1
  X <- matrix(runif(n_samples * n_pred,
                    min = x_range[1], max = x_range[2]),
              nrow = n_samples)
  # Generate response variable with noise, using a design matrix that
  # includes a column of 1s for the intercept
  y <- cbind(1, X) %*% beta_true + rnorm(n_samples, mean = 0, sd = sigma)
  # Return as data frame (columns: y, X1, X2, ...)
  data.frame(y = as.vector(y), X)
}
# Simulation parameters
set.seed(123)
n_samples <- 100
beta_true <- c(2.5, 1.5, -0.5) # Intercept, X1, X2
sigma_true <- 2
variable_names <- c("Intercept", "X1", "X2")
# Generate data
sim_data <- generate_data(n_samples, beta_true, sigma_true)
colnames(sim_data) <- c("y", "X1", "X2")
# Fit classical regression model
classical_model <- lm(y ~ X1 + X2, data = sim_data)
classical_summary <- summary(classical_model)
# Extract classical estimates and confidence intervals
classical_coef <- coef(classical_model)
classical_se <- classical_summary$coefficients[, "Std. Error"]
classical_ci <- confint(classical_model, level = 0.95)
classical_sigma <- classical_summary$sigma
# Fit Bayesian regression model with weakly informative priors
bayesian_model <- stan_glm(y ~ X1 + X2,
                           data = sim_data,
                           family = gaussian(),
                           prior = normal(0, 10),
                           prior_intercept = normal(0, 10),
                           prior_aux = cauchy(0, 5),
                           chains = 4,
                           iter = 2000,
                           seed = 123)
# Extract Bayesian estimates and credible intervals
posterior_draws <- as.matrix(bayesian_model)
bayesian_coef <- colMeans(posterior_draws[, c("(Intercept)", "X1", "X2")])
bayesian_ci <- posterior_interval(bayesian_model, prob = 0.95)
bayesian_sigma <- mean(posterior_draws[, "sigma"])
# Compare results
comparison_df <- data.frame(
  Parameter = c(variable_names, "Sigma"),
  True_Value = c(beta_true, sigma_true),
  Classical_Estimate = c(classical_coef, classical_sigma),
  Classical_Lower = c(classical_ci[, 1], NA),  # lm() gives no interval for sigma
  Classical_Upper = c(classical_ci[, 2], NA),
  Bayesian_Estimate = c(bayesian_coef, bayesian_sigma),
  Bayesian_Lower = bayesian_ci[, 1],  # posterior_interval() includes sigma
  Bayesian_Upper = bayesian_ci[, 2]
)
# Print comparison
print(comparison_df)
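It is also good practice to verify that the MCMC sampler converged before trusting the Bayesian summaries. A minimal diagnostic sketch using the bayesplot package loaded above (parameter names follow rstanarm's defaults):

# Convergence diagnostics: trace plots should look like well-mixed,
# stationary noise, and Rhat values near 1 indicate the chains agree
mcmc_trace(as.array(bayesian_model), pars = c("(Intercept)", "X1", "X2", "sigma"))
print(summary(bayesian_model)[, "Rhat"])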
The following code compares model performance across different data conditions:
# Function to run simulation for different conditions
run_simulation <- function(n_samples_list, n_reps, beta_true, sigma) {
  results <- data.frame()
  for (n in n_samples_list) {
    for (rep in 1:n_reps) {
      # Generate data
      sim_data <- generate_data(n, beta_true, sigma)
      colnames(sim_data) <- c("y", "X1", "X2")
      # Fit classical model
      classical_model <- lm(y ~ X1 + X2, data = sim_data)
      # Fit Bayesian model
      bayesian_model <- stan_glm(y ~ X1 + X2,
                                 data = sim_data,
                                 family = gaussian(),
                                 prior = normal(0, 10),
                                 prior_intercept = normal(0, 10),
                                 prior_aux = cauchy(0, 5),
                                 chains = 4,
                                 iter = 2000,
                                 refresh = 0, # Suppress sampler output
                                 seed = 123 + rep)
      # Extract estimates (posterior means for the Bayesian model)
      classical_coef <- coef(classical_model)
      bayesian_coef <- colMeans(as.matrix(bayesian_model)[, c("(Intercept)", "X1", "X2")])
      # Calculate error metrics
      classical_mse <- mean((beta_true - classical_coef)^2)
      bayesian_mse <- mean((beta_true - bayesian_coef)^2)
      # Save results
      results <- rbind(results, data.frame(
        n_samples = n,
        rep = rep,
        classical_mse = classical_mse,
        bayesian_mse = bayesian_mse
      ))
    }
  }
  return(results)
}
# Run simulation for different sample sizes
n_samples_list <- c(20, 50, 100, 200, 500)
n_reps <- 10 # Reduced for demonstration, use more for actual analysis
sim_results <- run_simulation(n_samples_list, n_reps, beta_true, sigma_true)
# Summarize results
summary_results <- sim_results %>%
  group_by(n_samples) %>%
  summarize(
    classical_mse_mean = mean(classical_mse),
    bayesian_mse_mean = mean(bayesian_mse),
    classical_mse_sd = sd(classical_mse),
    bayesian_mse_sd = sd(bayesian_mse)
  )
# Visualize performance comparison
ggplot(summary_results, aes(x = n_samples)) +
  geom_line(aes(y = classical_mse_mean, color = "Classical")) +
  geom_line(aes(y = bayesian_mse_mean, color = "Bayesian")) +
  geom_ribbon(aes(ymin = classical_mse_mean - classical_mse_sd,
                  ymax = classical_mse_mean + classical_mse_sd,
                  fill = "Classical"), alpha = 0.2) +
  geom_ribbon(aes(ymin = bayesian_mse_mean - bayesian_mse_sd,
                  ymax = bayesian_mse_mean + bayesian_mse_sd,
                  fill = "Bayesian"), alpha = 0.2) +
  scale_color_manual(values = c("Classical" = "blue", "Bayesian" = "red")) +
  scale_fill_manual(values = c("Classical" = "blue", "Bayesian" = "red")) +
  labs(
    title = "MSE Comparison by Sample Size",
    x = "Sample Size",
    y = "Mean Squared Error",
    color = "Method",
    fill = "Method"
  ) +
  theme_minimal()
The comparative analysis of classical and Bayesian regression revealed several key findings, illustrated in the figures below.
Comparison of parameter estimates by sample size:
[Parameter estimate comparison plot would appear here]
Posterior distributions for key parameters:
[Posterior distribution plot would appear here]
This project highlighted the complementary nature of classical and Bayesian approaches to regression.
The simulation study demonstrated that while classical and Bayesian methods often yield similar point estimates, the Bayesian approach offers advantages in terms of uncertainty quantification and incorporation of external knowledge. These insights are valuable for researchers deciding between methodological approaches for specific applications.
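To make the uncertainty-quantification comparison concrete, a sketch along these lines (a hypothetical extension, reusing the models fitted earlier; the covers helper is mine) checks whether each method's 95% interval contains the true coefficients for a single fit. Averaging these indicators over many replications, as in run_simulation above, would yield empirical coverage rates for each method:

# Hypothetical coverage check for one fitted pair of models: TRUE means the
# 95% interval contains the true coefficient value
covers <- function(ci, truth) truth >= ci[, 1] & truth <= ci[, 2]
classical_cover <- covers(confint(classical_model, level = 0.95), beta_true)
bayesian_cover <- covers(posterior_interval(bayesian_model, prob = 0.95)[1:3, ],
                         beta_true)
print(rbind(Classical = classical_cover, Bayesian = bayesian_cover))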