Multivariate Analysis of Bike Rental Systems

Project Overview

This project investigates bike rental patterns using multivariate analysis techniques. I developed a multivariate multiple linear regression model to understand how various factors influence bike rental dynamics, while also examining the effects of non-normal errors on regression inference.

The study explores how weather conditions, calendar events, and time-related factors collectively affect bike sharing systems, with a focus on modeling complex relationships between predictors and multiple response variables.

Methodology

The methodology for this project involved:

Data collection and preprocessing of bike rental records
Exploratory data analysis to identify key patterns and relationships
Development of multivariate multiple regression models
Assessment of model assumptions, particularly focusing on non-normal error distributions
Evaluation of prediction accuracy and model robustness

I implemented a semi-parametric approach that allowed for flexible modeling of both linear and non-linear relationships in the data, while accounting for potential violations of standard regression assumptions.
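
As a minimal sketch of what a semi-parametric fit can look like, the example below uses gam() from the mgcv package on simulated data, with a smooth term for a temperature-like predictor and a linear term for a second predictor. The simulated data, the variable names (temp, windspeed, rentals), and the assumed true curve are illustrative only and are not the project's actual model or data.

```r
library(mgcv)

set.seed(42)
n <- 500
sim <- data.frame(
  temp = runif(n, 0, 1),       # normalized temperature (hypothetical)
  windspeed = runif(n, 0, 1)   # normalized wind speed (hypothetical)
)
# Assumed truth: rentals rise with temperature, then fall off at the high end
sim$rentals <- 100 + 400 * sin(pi * sim$temp) - 50 * sim$windspeed +
  rnorm(n, sd = 30)

# Parametric baseline: every predictor enters linearly
lin_fit <- lm(rentals ~ temp + windspeed, data = sim)

# Semi-parametric fit: smooth in temp, linear in windspeed
gam_fit <- gam(rentals ~ s(temp) + windspeed, data = sim, method = "REML")

# Compare the two fits; a lower AIC indicates the better trade-off
AIC(lin_fit, gam_fit)

# Effective degrees of freedom of the smooth term
summary(gam_fit)$s.table
```

An effective degrees of freedom value well above 1 for s(temp) would indicate that the data support a curved rather than a straight-line effect.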

R Code Implementation

The following code demonstrates the implementation of a multivariate multiple regression model:

# Load necessary libraries
library(car)
library(MASS)
library(ggplot2)
library(dplyr)
library(mgcv)
library(readr)

# Load and prepare dataset
bike_data <- read_csv("bike_sharing.csv")

# Data preprocessing
bike_data <- bike_data %>%
  mutate(
    season = as.factor(season),
    holiday = as.factor(holiday),
    workingday = as.factor(workingday),
    weather = as.factor(weather),
    # Convert date and extract time features
    date = as.Date(dteday),
    month = factor(format(date, "%m")),
    dayofweek = factor(weekdays(date))
  )

# Create multivariate response variable (casual and registered users)
Y <- as.matrix(bike_data[, c("casual", "registered")])

# Create design matrix for predictors
X <- model.matrix(~ temp + atemp + humidity + windspeed + 
                  season + holiday + workingday + dayofweek, data = bike_data)

# Fit multivariate multiple regression model
mmr_model <- lm(Y ~ X - 1)  # -1 to exclude intercept already in X

# Summary of the model
summary(mmr_model)

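# Test for multivariate normality of errors.
# mshapiro.test() is in the mvnormtest package; it expects variables in rows
# (hence t()) and accepts between 3 and 5000 observations.
mvnormtest::mshapiro.test(t(residuals(mmr_model)))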
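
Calling summary() on a multivariate lm fit reports each response separately. The sketch below, on simulated data, shows one way to obtain joint tests across both responses and to inspect the residual correlation between them when the model is written with the formula interface. The data, effect sizes, and column names are assumptions for illustration, not values from the bike rental dataset.

```r
set.seed(1)
n <- 300
sim <- data.frame(
  temp = runif(n),
  holiday = factor(rep(c(0, 1), times = c(270, 30)))
)
is_holiday <- sim$holiday == "1"

# Errors share a common component, so the two responses are correlated
# (assumed for illustration)
shared <- rnorm(n)
sim$casual <- 20 + 60 * sim$temp + 15 * is_holiday +
  5 * shared + rnorm(n, sd = 5)
sim$registered <- 80 + 90 * sim$temp - 30 * is_holiday +
  5 * shared + rnorm(n, sd = 5)

# Formula interface: cbind() on the left-hand side gives an "mlm" fit
fit <- lm(cbind(casual, registered) ~ temp + holiday, data = sim)

# Joint test of each term across both responses (Pillai's trace by default)
anova(fit)

# Coefficient estimates match a separate univariate fit ...
fit_casual <- lm(casual ~ temp + holiday, data = sim)
all.equal(unname(coef(fit)[, "casual"]), unname(coef(fit_casual)))

# ... the multivariate part is the residual correlation between responses
cor(residuals(fit))
```

With identical predictors for every response, the point estimates are the same as in separate regressions; what the multivariate fit adds is the joint testing that uses the residual covariance between responses.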

The following code explores the impact of non-normal errors on regression inference:

# Function to simulate data with different error distributions
simulate_nonormal_errors <- function(n, distribution, parameters) {
  # Create predictors
  X1 <- rnorm(n, mean = 0, sd = 1)
  X2 <- rnorm(n, mean = 0, sd = 1)
  
  # True coefficients
  beta0 <- 2
  beta1 <- 1.5
  beta2 <- -0.8
  
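  # Generate mean-zero errors from the specified distribution
  if (distribution == "normal") {
    errors <- rnorm(n, mean = 0, sd = parameters$sd)
  } else if (distribution == "t") {
    # Heavy tails; variance is scale^2 * df / (df - 2) for df > 2
    errors <- rt(n, df = parameters$df) * parameters$scale
  } else if (distribution == "skewed") {
    # Centered chi-square: right-skewed, mean 0, variance 2 * df
    errors <- rchisq(n, df = parameters$df) - parameters$df
  } else {
    stop("Unknown distribution: ", distribution)
  }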
  
  # Generate response
  Y <- beta0 + beta1 * X1 + beta2 * X2 + errors
  
  # Return data
  data.frame(Y = Y, X1 = X1, X2 = X2)
}

# Run simulation with different error distributions
set.seed(123)
normal_data <- simulate_nonormal_errors(1000, "normal", list(sd = 1))
t_data <- simulate_nonormal_errors(1000, "t", list(df = 3, scale = 1))
skewed_data <- simulate_nonormal_errors(1000, "skewed", list(df = 3))

# Fit models to each dataset
normal_model <- lm(Y ~ X1 + X2, data = normal_data)
t_model <- lm(Y ~ X1 + X2, data = t_data)
skewed_model <- lm(Y ~ X1 + X2, data = skewed_data)

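# Compare coefficient estimates and standard errors.
# Note: the error variances differ across scenarios (1, 3, and 6 here), so
# differences in standard errors reflect spread as well as distributional shape.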
models_comparison <- data.frame(
  Distribution = c("Normal", "t", "Skewed"),
  Intercept = c(coef(normal_model)[1], coef(t_model)[1], coef(skewed_model)[1]),
  X1 = c(coef(normal_model)[2], coef(t_model)[2], coef(skewed_model)[2]),
  X2 = c(coef(normal_model)[3], coef(t_model)[3], coef(skewed_model)[3]),
  SE_Intercept = c(summary(normal_model)$coef[1,2], 
                   summary(t_model)$coef[1,2], 
                   summary(skewed_model)$coef[1,2]),
  SE_X1 = c(summary(normal_model)$coef[2,2], 
            summary(t_model)$coef[2,2], 
            summary(skewed_model)$coef[2,2]),
  SE_X2 = c(summary(normal_model)$coef[3,2], 
            summary(t_model)$coef[3,2], 
            summary(skewed_model)$coef[3,2])
)

# Print comparison table
print(models_comparison)
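
A single simulated dataset per distribution shows only one realization. One way to see how error shape affects inference is a repeated-sampling check of confidence-interval coverage. The sketch below is self-contained; the small sample size, the replication count, and the helper names (draw_errors, coverage_for) are assumptions for illustration. Errors are rescaled to unit variance so that the scenarios differ in shape rather than spread.

```r
set.seed(123)

# Draw mean-zero, unit-variance errors from the named distribution
draw_errors <- function(n, distribution) {
  switch(distribution,
    normal = rnorm(n),
    t      = rt(n, df = 3) / sqrt(3),            # variance of t with 3 df is 3
    skewed = (rchisq(n, df = 3) - 3) / sqrt(6),  # variance of chi-square with 3 df is 6
    stop("Unknown distribution: ", distribution)
  )
}

# Share of replications in which the 95% CI for the X1 slope covers the truth
coverage_for <- function(distribution, n = 30, reps = 1000, beta1 = 1.5) {
  covered <- replicate(reps, {
    X1 <- rnorm(n)
    X2 <- rnorm(n)
    Y <- 2 + beta1 * X1 - 0.8 * X2 + draw_errors(n, distribution)
    ci <- confint(lm(Y ~ X1 + X2))["X1", ]
    ci[1] <= beta1 && beta1 <= ci[2]
  })
  mean(covered)
}

coverage <- sapply(c(normal = "normal", t = "t", skewed = "skewed"),
                   coverage_for)
print(round(coverage, 3))
```

Coverage close to 0.95 in every scenario would suggest that the usual t-based intervals hold up at that sample size; a marked shortfall would point toward robust or bootstrap standard errors.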

Results

The analysis yielded several key findings:

Temperature and season were the strongest predictors of bike rental patterns
Casual and registered users showed distinct usage patterns, justifying the multivariate approach
Non-normal errors significantly affected coefficient standard errors and confidence intervals
The semi-parametric model outperformed standard linear models in prediction accuracy

Visualization of the relationship between temperature and bike rentals:

[Temperature vs. Rentals plot would appear here]

Comparison of error distributions and their impact on regression inference:

[Error distribution comparison would appear here]

Conclusions

This project demonstrated the value of multivariate analysis in understanding complex systems like bike rentals. Key conclusions include:

Multivariate models capture interdependencies between different user types that would be missed in separate univariate analyses
Non-normal errors can substantially distort inference (standard errors, confidence intervals, and hypothesis tests) if not properly addressed
Semi-parametric approaches provide flexibility for modeling non-linear relationships
Weather variables exhibit complex, non-linear relationships with rental patterns that are better captured with advanced modeling techniques

This work contributed to a published paper in Scientific African, providing insights for bike-sharing system operators to optimize fleet management based on predicted usage patterns.