May 2023
This project investigates bike rental patterns using multivariate analysis techniques. I developed a multivariate multiple linear regression model to understand how various factors influence bike rental dynamics, while also examining the effects of non-normal errors on regression inference.
The study explores how weather conditions, calendar events, and time-related factors collectively affect bike sharing systems, with a focus on modeling complex relationships between predictors and multiple response variables.
The methodology for this project involved:
I implemented a semi-parametric approach that allowed for flexible modeling of both linear and non-linear relationships in the data, while accounting for potential violations of standard regression assumptions.
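One common way to realize a semi-parametric model of this kind is a generalized additive model, where some predictors enter linearly and others through a smooth function. The sketch below is only an illustration under that assumption: the write-up does not say which estimator was used, and the data frame `sim`, its columns, and the sine-shaped temperature effect are simulated stand-ins rather than the bike-sharing data.

```r
library(mgcv)

set.seed(1)
n <- 300

# Simulated stand-in data (hypothetical; not the bike-sharing dataset)
sim <- data.frame(
  temp = runif(n, 0, 1),
  workingday = factor(rbinom(n, 1, 0.7))
)

# Rentals rise and then fall with temperature; working days add a constant shift
sim$rentals <- 100 + 400 * sin(pi * sim$temp) +
  50 * (sim$workingday == "1") + rnorm(n, sd = 30)

# Parametric (linear) term for workingday, penalized smooth term for temp
gam_fit <- gam(rentals ~ workingday + s(temp), data = sim, method = "REML")

# The summary reports the parametric coefficients and the effective
# degrees of freedom (edf) of the smooth; edf near 1 means almost linear
summary(gam_fit)
```

Calling `plot(gam_fit)` on such a fit draws the estimated smooth with its confidence band, which is a quick check of whether a linear term would have been adequate.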
The following code demonstrates the implementation of a multivariate multiple regression model:
# Load necessary libraries
library(car)
library(MASS)
library(ggplot2)
library(dplyr)
library(mgcv)
library(readr)
# Load and prepare dataset
bike_data <- read_csv("bike_sharing.csv")
# Data preprocessing
bike_data <- bike_data %>%
  mutate(
    season = as.factor(season),
    holiday = as.factor(holiday),
    workingday = as.factor(workingday),
    weather = as.factor(weather),
    # Convert date and extract time features
    date = as.Date(dteday),
    month = factor(format(date, "%m")),
    dayofweek = factor(weekdays(date))
  )
# Create multivariate response variable (casual and registered users)
Y <- as.matrix(bike_data[, c("casual", "registered")])
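# Create design matrix for predictors
# Note: if workingday is fully determined by holiday and dayofweek,
# the design matrix is rank-deficient and lm() reports NA for the aliased coefficient
X <- model.matrix(~ temp + atemp + humidity + windspeed +
                    season + holiday + workingday + dayofweek, data = bike_data)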
# Fit multivariate multiple regression model
mmr_model <- lm(Y ~ X - 1) # -1 to exclude intercept already in X
# Summary of the model
summary(mmr_model)
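# Test for multivariate normality of errors
# mshapiro.test() comes from the mvnormtest package; it expects variables in rows
# and observations in columns (between 3 and 5000 observations), hence the transpose
library(mvnormtest)
mshapiro.test(t(residuals(mmr_model)))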
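A multivariate fit is usually followed by a joint test of each predictor across both responses. The sketch below shows one way to do that with base R only, using the `cbind()` formula interface and `anova()` with Pillai's trace. The data frame `sim` and its coefficients are simulated placeholders, not results from the project.

```r
set.seed(42)
n <- 200

# Simulated stand-in data with two correlated-by-design responses (hypothetical)
sim <- data.frame(
  temp = runif(n),
  workingday = factor(rbinom(n, 1, 0.7))
)
sim$casual <- 20 + 60 * sim$temp - 15 * (sim$workingday == "1") +
  rnorm(n, sd = 8)
sim$registered <- 80 + 90 * sim$temp + 40 * (sim$workingday == "1") +
  rnorm(n, sd = 15)

# cbind() on the left-hand side makes lm() fit one column of coefficients per response
fit <- lm(cbind(casual, registered) ~ temp + workingday, data = sim)

# Coefficient matrix: rows are predictors, columns are responses
coef(fit)

# Sequential multivariate tests of each term on both responses jointly
pillai <- anova(fit, test = "Pillai")
print(pillai)
```

The formula interface gives the same least-squares estimates as passing a prebuilt design matrix, but it keeps term labels, so the test table is organized by predictor rather than by matrix column.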
The following code explores the impact of non-normal errors on regression inference:
# Function to simulate data with different error distributions
simulate_nonormal_errors <- function(n, distribution, parameters) {
  # Create predictors
  X1 <- rnorm(n, mean = 0, sd = 1)
  X2 <- rnorm(n, mean = 0, sd = 1)
  # True coefficients
  beta0 <- 2
  beta1 <- 1.5
  beta2 <- -0.8
  # Generate errors based on specified distribution
  if (distribution == "normal") {
    errors <- rnorm(n, mean = 0, sd = parameters$sd)
  } else if (distribution == "t") {
    # Heavy-tailed errors: scaled Student's t
    errors <- rt(n, df = parameters$df) * parameters$scale
  } else if (distribution == "skewed") {
    # Right-skewed errors: chi-squared centered at zero by subtracting its mean (df)
    errors <- rchisq(n, df = parameters$df) - parameters$df
  } else {
    # Fail early instead of hitting an undefined `errors` object below
    stop("Unknown distribution: ", distribution)
  }
  # Generate response
  Y <- beta0 + beta1 * X1 + beta2 * X2 + errors
  # Return data
  data.frame(Y = Y, X1 = X1, X2 = X2)
}
# Run simulation with different error distributions
set.seed(123)
normal_data <- simulate_nonormal_errors(1000, "normal", list(sd = 1))
t_data <- simulate_nonormal_errors(1000, "t", list(df = 3, scale = 1))
skewed_data <- simulate_nonormal_errors(1000, "skewed", list(df = 3))
# Fit models to each dataset
normal_model <- lm(Y ~ X1 + X2, data = normal_data)
t_model <- lm(Y ~ X1 + X2, data = t_data)
skewed_model <- lm(Y ~ X1 + X2, data = skewed_data)
# Compare coefficient estimates and standard errors
models_comparison <- data.frame(
  Distribution = c("Normal", "t", "Skewed"),
  Intercept = c(coef(normal_model)[1], coef(t_model)[1], coef(skewed_model)[1]),
  X1 = c(coef(normal_model)[2], coef(t_model)[2], coef(skewed_model)[2]),
  X2 = c(coef(normal_model)[3], coef(t_model)[3], coef(skewed_model)[3]),
  SE_Intercept = c(summary(normal_model)$coefficients[1, 2],
                   summary(t_model)$coefficients[1, 2],
                   summary(skewed_model)$coefficients[1, 2]),
  SE_X1 = c(summary(normal_model)$coefficients[2, 2],
            summary(t_model)$coefficients[2, 2],
            summary(skewed_model)$coefficients[2, 2]),
  SE_X2 = c(summary(normal_model)$coefficients[3, 2],
            summary(t_model)$coefficients[3, 2],
            summary(skewed_model)$coefficients[3, 2])
)
# Print comparison table
print(models_comparison)
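A single fitted dataset per distribution shows the estimates but says little about how reliable the inference is. One way to probe that is a small Monte Carlo check of how often the nominal 95% confidence interval for a slope covers the true value. The sketch below is an illustrative extension, not part of the original analysis: the helper `coverage()`, the sample size of 30, and the 1000 replications are all assumptions chosen to keep it quick.

```r
set.seed(2023)

# Empirical coverage of the nominal 95% t-interval for the slope.
# `rerr` is a function that draws n mean-zero errors (hypothetical helper argument).
coverage <- function(rerr, n = 30, reps = 1000, beta1 = 1.5) {
  hits <- replicate(reps, {
    x <- rnorm(n)
    y <- 2 + beta1 * x + rerr(n)
    ci <- confint(lm(y ~ x))["x", ]
    # TRUE when the interval contains the true slope
    ci[1] <= beta1 && beta1 <= ci[2]
  })
  mean(hits)
}

results <- c(
  normal = coverage(function(n) rnorm(n)),
  t3     = coverage(function(n) rt(n, df = 3)),
  skewed = coverage(function(n) rchisq(n, df = 3) - 3)
)

# Proportion of intervals that contained the true slope, per error distribution
print(round(results, 3))
```

Coverage close to 0.95 suggests the usual t-based intervals hold up for that error distribution at this sample size; repeating the check with smaller `n` shows where the approximation starts to degrade.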
The analysis yielded several key findings:
Visualization of the relationship between temperature and bike rentals:
[Temperature vs. Rentals plot would appear here]
Comparison of error distributions and their impact on regression inference:
[Error distribution comparison would appear here]
This project demonstrated the value of multivariate analysis in understanding complex systems like bike rentals. Key conclusions include:
This work contributed to a published paper in Scientific African, providing insights for bike-sharing system operators to optimize fleet management based on predicted usage patterns.