Multiple Linear Regression Walkthrough

This walkthrough explores how multiple linear regression (MLR) is used to model and predict student performance based on various academic and lifestyle factors.

📥 Data Import & Description

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

dataset = pd.read_csv('Student_Performance.csv')

We work with a dataset of 10,000 students, containing:

🔍 Exploratory Data Analysis (EDA)

We start by analyzing data types, distribution, and visualizing feature relationships with the target variable.

Distributions

Feature Distributions

Scatter Plots vs Target

Scatter plots vs Performance Index

Correlation Heatmap

Correlation Heatmap

Insights:

🧹 Data Preprocessing

Binary Encoding for Extracurricular Activities:

dataset['Extracurricular Activities'] = dataset['Extracurricular Activities'].map({'Yes': 1, 'No': 0})

Standardization for continuous variables:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
cols = ['Hours Studied', 'Previous Scores', 'Sleep Hours', 'Sample Question Papers Practiced']
dataset[cols] = scaler.fit_transform(dataset[cols])

✂️ Train-Test Split

from sklearn.model_selection import train_test_split

X = dataset.drop("Performance Index", axis=1)
y = dataset["Performance Index"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

🧠 Model Fitting

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

Model Coefficients

print(pd.Series(model.coef_, index=X.columns))

Equation of the fitted model:

Y = β₀ + β₁·X₁ + β₂·X₂ + β₃·X₃ + β₄·X₄ + β₅·X₅

📊 Model Evaluation

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

Results:

📈 Residual Analysis

Residuals vs Predicted

Residuals are evenly spread around 0 ; no pattern detected. Indicates assumptions of linearity and homoscedasticity hold.

Residual Distribution

Residuals appear normally distributed ; supports the assumption of normality.