Mastering Linear Regression

Introduction to Linear Regression

Linear regression is one of the most widely used statistical techniques for modeling the relationship between a dependent variable (response) and one or more independent variables (predictors). The objective is to fit a linear equation to the observed data, enabling prediction, estimation, and inference.

In its simplest form, simple linear regression models a single predictor variable:

y = β₀ + β₁x + ε

where β₀ is the intercept, β₁ is the slope, and ε is the error term.

In multiple linear regression, the model generalizes to more predictors:

y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε

This allows the inclusion of multiple independent variables, enhancing the model’s complexity and ability to explain the variance in the dependent variable.
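
As a minimal illustration of the multiple regression form, the following sketch fits the coefficients by ordinary least squares with NumPy; the synthetic data and coefficient values are made up for the example:

import numpy as np

# Synthetic data: 100 observations, 2 predictors (illustrative values only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Add an intercept column and solve the least-squares problem for the coefficients
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("Estimated (intercept, beta_1, beta_2):", beta)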

Advanced Techniques in Linear Regression

  1. Regularization (Ridge and Lasso Regression) 
When the number of predictors is large, or when multicollinearity exists, regular linear regression can lead to overfitting. Ridge regression and Lasso regression apply penalties to the size of coefficients to prevent overfitting.

Ridge regression penalizes large coefficients by adding the sum of squared coefficients to the loss function:

Loss = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ βⱼ²

Lasso regression (Least Absolute Shrinkage and Selection Operator) uses the absolute values of the coefficients in the penalty term:

Loss = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ |βⱼ|

Lasso has the additional property of performing variable selection by driving some coefficients exactly to zero, effectively eliminating irrelevant predictors.
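
To make the ridge penalty concrete, here is a minimal NumPy sketch of the closed-form ridge estimate β = (XᵀX + λI)⁻¹Xᵀy on made-up data (for simplicity the intercept is omitted, although in practice it is usually left unpenalized):

import numpy as np

# Illustrative data: 50 observations, 3 predictors, one of them irrelevant
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.3, size=50)

lam = 1.0  # regularization strength lambda (illustrative value)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print("Ridge coefficient estimates:", beta_ridge)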

 2. Polynomial Regression

Sometimes the relationship between the dependent and independent variables is nonlinear. Polynomial regression fits a nonlinear relationship by extending the linear regression model to include polynomial terms:

y = β₀ + β₁x + β₂x² + … + βₖxᵏ + ε

Though the model is still linear in terms of the coefficients, the predictor variable is raised to powers greater than one, allowing it to capture nonlinear relationships.
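
A short scikit-learn sketch of polynomial regression using PolynomialFeatures in a pipeline; the quadratic data below is made up for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative quadratic relationship with noise
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 0.5 * x[:, 0] + 2.0 * x[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

# Degree-2 polynomial regression: expand x into [x, x^2], then fit a linear model
poly_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
poly_model.fit(x, y)
print("Recovered coefficients:", poly_model.named_steps['linearregression'].coef_)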

 3. Interaction Terms
Linear regression can include interaction terms to model the joint effect of two or more variables. For example, if two predictors x₁ and x₂ interact, the model can include their product term:

y = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂ + ε
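
One way to add an interaction term in scikit-learn is PolynomialFeatures with interaction_only=True; the two predictors below are made-up illustrations:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two illustrative predictors whose product also affects the response
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 1.5 * X[:, 0] * X[:, 1] + rng.normal(scale=0.3, size=200)

# interaction_only=True adds the x1*x2 column without the squared terms
interaction_model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)
interaction_model.fit(X, y)
print("Coefficients for [x1, x2, x1*x2]:", interaction_model.named_steps['linearregression'].coef_)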

4. Multicollinearity and Variance Inflation Factor (VIF)
Multicollinearity occurs when two or more predictors in the model are highly correlated, which can inflate the variance of the coefficient estimates and make the model unstable. To detect multicollinearity, the variance inflation factor (VIF) is used:

VIFⱼ = 1 / (1 − Rⱼ²)

where Rⱼ² is the R-squared value obtained by regressing the j-th predictor against all other predictors. A VIF value greater than 5 or 10 is often considered problematic.
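
VIFs can be computed with statsmodels' variance_inflation_factor; a minimal sketch on made-up predictors where x2 is nearly a copy of x1:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative correlated predictors
rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
X = pd.DataFrame({'x1': x1,
                  'x2': x1 + rng.normal(scale=0.1, size=200),  # nearly a copy of x1
                  'x3': rng.normal(size=200)})

X_const = sm.add_constant(X)  # include an intercept column for each auxiliary regression
vifs = pd.Series([variance_inflation_factor(X_const.values, i)
                  for i in range(1, X_const.shape[1])], index=X.columns)
print(vifs)  # x1 and x2 should show very large VIFs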

5. Assumption Diagnostics and Model Robustness
Linear regression relies on several assumptions that need to be tested for model validity:
  • Linearity: The relationship between the predictors and the response is linear.
  • Homoscedasticity: The residuals have constant variance.
  • Normality of Errors: The residuals are normally distributed.
  • Independence of Errors: No autocorrelation in the residuals.
Violations of these assumptions can be diagnosed using techniques like residual plots, QQ plots, and Durbin-Watson tests.
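
As a sketch of how some of these checks can be run numerically with statsmodels and SciPy (the small OLS fit below is only a placeholder so there are residuals to test):

import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Placeholder OLS fit so there are residuals and a design matrix to test
rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=200)
fit = sm.OLS(y, X).fit()

print("Durbin-Watson (values near 2 suggest no autocorrelation):", durbin_watson(fit.resid))
print("Shapiro-Wilk p-value (normality of residuals):", stats.shapiro(fit.resid)[1])
print("Breusch-Pagan p-value (constant variance):", het_breuschpagan(fit.resid, X)[1])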

6. Generalized Linear Models (GLM)
Generalized linear models extend linear regression to model responses that are not normally distributed. For example, for a binary response variable, the logistic regression model can be used, which transforms the response using a logit link function:

log(p / (1 − p)) = β₀ + β₁x₁ + … + βₚxₚ

where p is the probability of the positive outcome. GLMs also handle count data using the Poisson regression model, binary outcomes via the probit link, and ordinal responses via proportional-odds models.
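
A minimal statsmodels sketch of a binomial GLM (logistic regression) on made-up data:

import numpy as np
import statsmodels.api as sm

# Illustrative binary outcome driven by two predictors
rng = np.random.default_rng(6)
X = sm.add_constant(rng.normal(size=(300, 2)))
linear_part = X @ np.array([-0.5, 1.5, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-linear_part)))

# Binomial family uses the logit link by default
logit_model = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(logit_model.summary())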


Applications of Advanced Linear Regression

1. Finance and Actuarial Science
Linear regression models are used to forecast financial metrics, such as stock prices, asset returns, and risk factors. Extending the basic model with interaction terms and regularization techniques makes these predictions more robust to financial market volatility.

2. Healthcare and Survival Analysis
In healthcare, regression models predict patient outcomes based on features such as age, treatment, and medical history. Advanced techniques like the Cox Proportional Hazards Model (which can be seen as a generalized regression model for survival data) are used to analyze time-to-event data.

3. Marketing and Consumer Analytics
Regression models help in predicting customer behavior and sales based on demographic variables and past purchasing patterns. Regularization techniques like Lasso are particularly useful in this context due to the large number of predictors.

4. Natural Sciences and Engineering
In fields like biology, chemistry, and engineering, polynomial regression and interaction terms are used to model nonlinear relationships between variables, such as temperature and reaction rates.

Conclusion

Linear regression is a foundational technique in statistics and machine learning, but advanced versions like regularization, polynomial regression, and interaction terms significantly expand its applicability. Regular diagnostic checks ensure the model's assumptions are valid, while extensions like GLMs provide flexibility for non-normal response data. Understanding these advanced methods allows for more accurate modeling of complex real-world phenomena, making linear regression an indispensable tool across industries.

Dataset and Problem

We’ll use a common dataset, the California Housing Dataset, which contains various predictors of house prices for California districts. The goal is to predict the median house value based on predictors like median income, average number of rooms, house age, and other socioeconomic factors.

Let’s start with a code walkthrough for basic linear regression, then move to Ridge and Lasso regression.

Step 1: Basic Linear Regression

# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load the California housing dataset
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['MedHouseVal'] = housing.target  # MedHouseVal is the median house value (in $100,000s)

# Define predictors and response variable
X = df.drop('MedHouseVal', axis=1)  # Predictors
y = df['MedHouseVal']  # Response (median house value)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the linear regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict on test data
y_pred = lr_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Linear Regression MSE: {mse}")
print(f"Linear Regression R-squared: {r2}")

# Visualize predicted vs actual values
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Linear Regression: Actual vs Predicted Prices")
plt.show()

# Residual plot to check model assumptions
residuals = y_test - y_pred
sns.histplot(residuals, kde=True)
plt.title('Residuals Distribution')
plt.show()

Linear Regression MSE: 0.5558915986952444
Linear Regression R-squared: 0.5757877060324508

In this code:

  • We first load the California Housing Dataset and split it into predictors (X) and the response variable (y).
  • We use LinearRegression from sklearn to build a linear regression model and evaluate it using Mean Squared Error (MSE) and R-squared.
  • A scatter plot visualizes predicted vs actual values, and a histogram of the residuals helps us check whether the errors are roughly normally distributed.


Step 2: Ridge and Lasso Regression

Ridge and Lasso help handle multicollinearity and prevent overfitting by shrinking the coefficients. Here’s how to implement these models:

from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

# Standardize the data (important for regularization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Ridge Regression
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_scaled, y_train)
ridge_pred = ridge_model.predict(X_test_scaled)

# Evaluate Ridge Regression
ridge_mse = mean_squared_error(y_test, ridge_pred)
ridge_r2 = r2_score(y_test, ridge_pred)
print(f"Ridge Regression MSE: {ridge_mse}")
print(f"Ridge Regression R-squared: {ridge_r2}")

# Lasso Regression
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train_scaled, y_train)
lasso_pred = lasso_model.predict(X_test_scaled)

# Evaluate Lasso Regression
lasso_mse = mean_squared_error(y_test, lasso_pred)
lasso_r2 = r2_score(y_test, lasso_pred)
print(f"Lasso Regression MSE: {lasso_mse}")
print(f"Lasso Regression R-squared: {lasso_r2}")

Ridge Regression MSE: 0.5558548589435971
Ridge Regression R-squared: 0.5758157428913684
Lasso Regression MSE: 0.6796290284328825
Lasso Regression R-squared: 0.48136113250290735

In this part:

  • We scale the data using StandardScaler since Ridge and Lasso are sensitive to the scale of the predictors.
  • Ridge applies L2 regularization (penalizes the sum of squared coefficients), and Lasso applies L1 regularization (penalizes the absolute values of the coefficients).
  • We evaluate both Ridge and Lasso models similarly to the basic linear regression model.


Step 3: Diagnostics and Model Comparison

After fitting all three models, you can compare their performance:

# Compare MSE and R-squared for all models
print(f"Linear Regression MSE: {mse}, R2: {r2}")
print(f"Ridge Regression MSE: {ridge_mse}, R2: {ridge_r2}")
print(f"Lasso Regression MSE: {lasso_mse}, R2: {lasso_r2}")

Linear Regression MSE: 0.5558915986952444, R2: 0.5757877060324508
Ridge Regression MSE: 0.5558548589435971, R2: 0.5758157428913684
Lasso Regression MSE: 0.6796290284328825, R2: 0.48136113250290735

In this example, Ridge performs almost identically to ordinary least squares, while Lasso with alpha=0.1 shrinks the coefficients too aggressively and loses some accuracy. The comparison illustrates how regularization trades a small amount of bias for lower variance, which pays off mainly when multicollinearity is present or when some predictors have little relevance.


Conclusion

  • Linear Regression is the baseline model, useful for modeling linear relationships.
  • Ridge Regression handles multicollinearity by shrinking the coefficients but keeps all variables in the model.
  • Lasso Regression not only shrinks coefficients but can set some of them exactly to zero, performing variable selection (see the coefficient check below).
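
A quick way to see this with the models fitted above is to inspect which Lasso coefficients were driven exactly to zero (this reuses X, ridge_model, and lasso_model from the earlier steps):

# Inspect which coefficients Lasso set exactly to zero (Ridge keeps all of them non-zero)
coef_comparison = pd.DataFrame({'feature': X.columns,
                                'ridge': ridge_model.coef_,
                                'lasso': lasso_model.coef_})
print(coef_comparison)
print("Features dropped by Lasso:", list(X.columns[lasso_model.coef_ == 0]))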

Let's dive deeper into residual diagnostics and regularization techniques to ensure we understand the behavior of the models and how to fine-tune them for better performance.

1. Residual Diagnostics in Linear Regression

Residuals (the difference between actual and predicted values) provide important insights into how well the model fits the data. We typically look for:

  • Normal distribution of residuals: Residuals should ideally follow a normal distribution. Deviations from normality can indicate a poor model fit.
  • Homoscedasticity: The residuals should have constant variance. If the spread of residuals increases or decreases with the predicted values, this suggests heteroscedasticity, indicating a non-constant error variance.
  • No Autocorrelation: Residuals should not be correlated with each other. This is especially important in time series data.

Let's expand the residual analysis:

# Residual diagnostics for Linear Regression
residuals = y_test - y_pred

# 1. Residuals vs Fitted Values
plt.scatter(y_pred, residuals)
plt.axhline(0, color='r', linestyle='--')
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted Values")
plt.show()

# 2. QQ Plot to check normality of residuals
import statsmodels.api as sm
sm.qqplot(residuals, line='45')
plt.title("QQ Plot of Residuals")
plt.show()

# 3. Plot residuals distribution
sns.histplot(residuals, kde=True)
plt.title('Residuals Distribution')
plt.show()

  • Residuals vs Fitted Plot: If the residuals are randomly scattered around zero without any pattern, it suggests that the model is well-fitted. Any visible patterns could indicate a need for a nonlinear transformation.
  • QQ Plot: This checks if residuals follow a normal distribution. If residuals deviate from the straight line, it implies non-normality, indicating model inadequacy.
  • Residuals Distribution: A bell-shaped distribution suggests normality. Skewness in this plot can indicate issues with the model fit or outliers.

If you find heteroscedasticity or non-normality, you may consider transforming the response variable (e.g., a log or square root transformation).
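
If the residuals do look skewed or heteroscedastic, one option is to refit the model on a log-transformed response; a minimal sketch reusing the earlier train/test split (np.log1p and np.expm1 keep the back-transform explicit):

# Fit the same linear model on a log-transformed response
log_model = LinearRegression()
log_model.fit(X_train, np.log1p(y_train))

# Back-transform predictions to the original scale before scoring
y_pred_log = np.expm1(log_model.predict(X_test))
print("Log-response model MSE:", mean_squared_error(y_test, y_pred_log))
print("Log-response model R-squared:", r2_score(y_test, y_pred_log))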

2. Regularization Techniques (Ridge and Lasso)

Tuning Hyperparameters

Both Ridge and Lasso have a hyperparameter (the alpha term) that controls the strength of regularization. The goal is to find the optimal value of alpha that balances bias and variance, leading to a better-performing model. This can be done using cross-validation.

Here's how to tune alpha for both Ridge and Lasso using cross-validation:

from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning for Ridge
ridge_params = {'alpha': [0.1, 1.0, 10.0, 100.0, 1000.0]}
ridge_cv = GridSearchCV(Ridge(), ridge_params, scoring='neg_mean_squared_error', cv=5)
ridge_cv.fit(X_train_scaled, y_train)
best_ridge_alpha = ridge_cv.best_params_['alpha']
print(f"Best alpha for Ridge: {best_ridge_alpha}")

# Hyperparameter tuning for Lasso
lasso_params = {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
lasso_cv = GridSearchCV(Lasso(), lasso_params, scoring='neg_mean_squared_error', cv=5)
lasso_cv.fit(X_train_scaled, y_train)
best_lasso_alpha = lasso_cv.best_params_['alpha']
print(f"Best alpha for Lasso: {best_lasso_alpha}")

# Re-fit Ridge and Lasso with the optimal alpha
ridge_model = Ridge(alpha=best_ridge_alpha)
ridge_model.fit(X_train_scaled, y_train)
lasso_model = Lasso(alpha=best_lasso_alpha)
lasso_model.fit(X_train_scaled, y_train)

  • GridSearchCV is used to perform cross-validation for different values of alpha. This automatically selects the value that minimizes the error, making the model more generalizable.
  • Once the best alpha is found, we re-fit the models to use the optimal parameter.

Best alpha for Ridge: 0.1
Best alpha for Lasso: 0.001


3. ElasticNet: Combining Ridge and Lasso

ElasticNet combines both L1 (Lasso) and L2 (Ridge) regularization, giving you the flexibility to control both types of penalties. It is especially useful when dealing with datasets with many correlated predictors.

from sklearn.linear_model import ElasticNet

# ElasticNet with cross-validation to tune both alpha and l1_ratio
elastic_params = {
    'alpha': [0.1, 1.0, 10.0],
    'l1_ratio': [0.2, 0.5, 0.8]  # l1_ratio controls the mix of Lasso and Ridge
}
elastic_cv = GridSearchCV(ElasticNet(), elastic_params, scoring='neg_mean_squared_error', cv=5)
elastic_cv.fit(X_train_scaled, y_train)

best_elastic_alpha = elastic_cv.best_params_['alpha']
best_elastic_l1_ratio = elastic_cv.best_params_['l1_ratio']
print(f"Best alpha for ElasticNet: {best_elastic_alpha}, Best L1 ratio: {best_elastic_l1_ratio}")

# Re-fit ElasticNet with the optimal alpha and l1_ratio
elastic_model = ElasticNet(alpha=best_elastic_alpha, l1_ratio=best_elastic_l1_ratio)
elastic_model.fit(X_train_scaled, y_train)
elastic_pred = elastic_model.predict(X_test_scaled)

# Evaluate ElasticNet
elastic_mse = mean_squared_error(y_test, elastic_pred)
elastic_r2 = r2_score(y_test, elastic_pred)
print(f"ElasticNet MSE: {elastic_mse}, R-squared: {elastic_r2}")

  • ElasticNet blends the benefits of Ridge and Lasso. The l1_ratio controls the proportion of Lasso vs Ridge. An l1_ratio of 0 means it’s pure Ridge, and a ratio of 1 means it’s pure Lasso.
  • GridSearchCV is used again to tune both alpha and l1_ratio.

Best alpha for ElasticNet: 0.1, Best L1 ratio: 0.2
ElasticNet MSE: 0.6012812713499678, R-squared: 0.5411499147715463

4. Model Comparison and Final Evaluation

After performing residual diagnostics and regularization, we can compare the performance of all models to choose the best one:

# Compare performance metrics
print(f"Linear Regression - MSE: {mse}, R2: {r2}")
print(f"Ridge Regression - MSE: {ridge_mse}, R2: {ridge_r2}")
print(f"Lasso Regression - MSE: {lasso_mse}, R2: {lasso_r2}")
print(f"ElasticNet - MSE: {elastic_mse}, R2: {elastic_r2}")

Linear Regression - MSE: 0.5558915986952444, R2: 0.5757877060324508
Ridge Regression - MSE: 0.5558548589435971, R2: 0.5758157428913684
Lasso Regression - MSE: 0.6796290284328825, R2: 0.48136113250290735
ElasticNet - MSE: 0.6012812713499678, R2: 0.5411499147715463

Conclusion

1. Residual Diagnostics: Residual analysis is crucial for checking model assumptions (normality, homoscedasticity, and autocorrelation). Non-random patterns or skewed distributions suggest model inadequacies that may require transformations or more complex models.

2. Regularization Techniques:
  • Ridge Regression is best suited for handling multicollinearity by shrinking coefficients.
  • Lasso Regression performs feature selection by forcing some coefficients to zero, especially useful for sparse models.
  • ElasticNet offers the flexibility to mix Ridge and Lasso regularization, making it ideal when you expect both sparse features and multicollinearity.

3. Tuning Hyperparameters: Cross-validation ensures the selection of an optimal regularization strength (alpha), improving model generalizability.
