PGD01C02
Module 3 · Exploratory Data Analytics and Model Development

Multiple Linear Regression

Core Titles
Key headlines and terms for quick recall
  • Multiple Linear Regression (MLR) — many predictors
  • Model: $\hat y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$
  • Matrix form: $\hat y = X \beta$
  • OLS estimate: $\hat \beta = (X^T X)^{-1} X^T y$
  • Adjusted $R^2$: penalises extra predictors
  • Multicollinearity — VIF
  • Categorical encoding, interactions
  • Assumptions: LINE + no multicollinearity
Basic Idea
What it is, why it matters, how it works

What it is

Multiple Linear Regression (MLR) generalises SLR to multiple predictors:

$$\hat y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p.$$

Each $\beta_j$ is the expected change in $y$ for a 1-unit change in $x_j$, holding the other features constant.

Matrix form

Stack predictors into a design matrix $X \in \mathbb{R}^{n \times (p+1)}$ (with a leading column of 1s for the intercept):

$$y = X \beta + \varepsilon$$

OLS solution: $\boxed{\hat \beta = (X^T X)^{-1} X^T y}.$

(Same idea as SLR: minimise $\|y - X \beta\|^2$.)
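
The closed-form solution can be checked numerically. The sketch below is a minimal illustration, assuming NumPy is available; the helper name `fit_ols` and the small noise-free dataset are made up for this example and are not part of the course material.

```python
import numpy as np

def fit_ols(features, y):
    """Return OLS coefficients [b0, b1, ..., bp] for y regressed on features."""
    features = np.asarray(features, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend the column of 1s so the first coefficient is the intercept.
    X = np.column_stack([np.ones(len(features)), features])
    # Solve (X^T X) beta = X^T y instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up, noise-free data generated from y = 1 + 2*x1 + 3*x2.
features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * features[:, 0] + 3.0 * features[:, 1]

beta_hat = fit_ols(features, y)

# Cross-check against NumPy's least-squares solver.
X = np.column_stack([np.ones(len(features)), features])
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(beta_hat, 6))
print(np.allclose(beta_hat, beta_lstsq))
```

Because the data contain no noise, the recovered coefficients should match the generating values 1, 2 and 3 up to floating-point error. In practice a least-squares solver such as `np.linalg.lstsq` is usually preferred to the normal equations when $X^T X$ is poorly conditioned.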

Goodness of fit

  • SSE $= \sum (y_i - \hat y_i)^2$.
  • $R^2 = 1 - \text{SSE}/\text{SST}$, where SST $= \sum (y_i - \bar y)^2$.
  • Adjusted $R^2$ penalises extra predictors: $R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}.$ It rises only when an added feature reduces SSE enough to overcome the penalty, which makes it useful for model selection.
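
The two fit measures can be sketched in a few lines of plain Python. The function names and the numbers in the example ($n = 30$, $R^2$ values of 0.80 and 0.802) are hypothetical and chosen only to show the penalty at work.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SSE/SST for observed y and fitted y_hat."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst

def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical comparison: a 4th predictor lifts R^2 only from 0.800 to 0.802.
adj_three = adjusted_r_squared(0.800, n=30, p=3)
adj_four = adjusted_r_squared(0.802, n=30, p=4)
print(round(adj_three, 4), round(adj_four, 4))
```

In this made-up case $R^2$ goes up slightly while adjusted $R^2$ goes down, so the extra predictor would not be judged worth keeping on this criterion.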

Categorical predictors

  • One-hot encode: $K$ categories → $K - 1$ dummy variables (drop one as reference).
  • Each coefficient is interpreted as the effect relative to the reference category.
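
A minimal sketch of $K - 1$ dummy encoding in plain Python follows. The helper `dummy_encode` and the sample region values are illustrative assumptions; libraries such as pandas offer their own encoders, which are not shown here.

```python
def dummy_encode(values, reference):
    """Encode a categorical column as K-1 dummy columns, dropping `reference`.

    Returns (levels, rows): the non-reference levels in sorted order and,
    for each input value, a list of 0/1 indicators aligned with `levels`.
    """
    levels = sorted(set(values) - {reference})
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows

# Hypothetical region column with K = 4 levels and N as the reference.
levels, rows = dummy_encode(["N", "S", "E", "W", "S"], reference="N")
print(levels)
for row in rows:
    print(row)
```

A row of all zeros represents the reference category, which is why its effect is absorbed into the intercept and the other coefficients are read relative to it.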

Interactions

  • Add a term $x_1 \cdot x_2$ if the effect of one feature depends on the value of another.
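
To see what an interaction term does to interpretation, the sketch below fits a model with a product column, assuming NumPy. The data are noise-free and generated from made-up coefficients, and `slope_of_x1` is a hypothetical helper.

```python
import numpy as np

# Hypothetical noise-free data on a 3x3 grid, generated from
# y = 1 + 2*x1 + 3*x2 + 0.5*x1*x2 (coefficients chosen for illustration).
x1, x2 = np.meshgrid(np.arange(3.0), np.arange(3.0))
x1, x2 = x1.ravel(), x2.ravel()
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 0.5 * x1 * x2

# Design matrix: intercept, both main effects, and the product column.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def slope_of_x1(x2_value):
    """Effect of a 1-unit change in x1 at a given x2: beta1 + beta3 * x2."""
    return beta[1] + beta[3] * x2_value

print(slope_of_x1(0.0), slope_of_x1(2.0))
```

With the interaction included, the effect of $x_1$ is no longer a single number: it equals $\beta_1 + \beta_3 x_2$, so it changes with the level of $x_2$.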

Assumptions

Same as SLR plus:

  • No multicollinearity: predictors are not strongly linearly dependent. Otherwise $X^T X$ is near-singular and coefficient variances explode.
  • Detect via:
    • Pairwise correlations with absolute value above 0.9 are red flags.
    • Variance Inflation Factor $\text{VIF}_j = 1 / (1 - R^2_j)$, where $R^2_j$ comes from regressing $x_j$ on the other predictors. A VIF above 5 (or, by a looser rule of thumb, 10) is concerning.
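
The VIF definition translates directly into code: regress each predictor on the others and convert the resulting $R^2_j$. This is a sketch assuming NumPy; the function name `vif` and the simulated predictors (two nearly identical, one independent) are illustrative only.

```python
import numpy as np

def vif(features):
    """Variance Inflation Factor for each column of `features`."""
    features = np.asarray(features, dtype=float)
    n, p = features.shape
    factors = []
    for j in range(p):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        # Regress x_j on the remaining predictors (with an intercept).
        X = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(X, target, rcond=None)
        residuals = target - X @ coef
        r2_j = 1.0 - residuals @ residuals / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2_j))
    return np.array(factors)

# Simulated predictors: x2 is x1 plus small noise, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

The two near-duplicate columns should show VIFs far above the 5 to 10 range, while the independent column stays close to 1.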

Coefficient interpretation

$\beta_j$ is the expected change in $y$ per unit increase in $x_j$, holding all other predictors constant. The "ceteris paribus" caveat matters: predictors must be able to vary independently in the data for this interpretation to hold.

Regularisation — when pp is large or features are correlated

  • Ridge: adds the penalty $\lambda \sum \beta_j^2$; shrinks all coefficients.
  • Lasso: adds the penalty $\lambda \sum |\beta_j|$; can set some coefficients exactly to zero, so it performs feature selection.
  • Elastic Net: combines both penalties.
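
Ridge has a closed form, $(X^T X + \lambda I)^{-1} X^T y$, which makes the shrinkage easy to demonstrate. The sketch below assumes NumPy and centres the data so that the intercept is left unpenalised, which is a common convention rather than something these notes specify. The function name `fit_ridge`, the simulated data and $\lambda = 10$ are all illustrative choices.

```python
import numpy as np

def fit_ridge(features, y, lam):
    """Ridge fit on centred data; returns (intercept, slopes).

    Centring keeps the intercept out of the penalty. With lam = 0 this
    reduces to ordinary least squares.
    """
    features = np.asarray(features, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = features.mean(axis=0), y.mean()
    Xc, yc = features - x_mean, y - y_mean
    p = Xc.shape[1]
    # Adding lam * I makes the matrix better conditioned before solving.
    slopes = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ slopes
    return intercept, slopes

# Simulated, strongly correlated predictors.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.05 * rng.normal(size=100)
features = np.column_stack([x1, x2])
y = 3.0 + x1 + x2 + rng.normal(scale=0.5, size=100)

_, slopes_ols = fit_ridge(features, y, lam=0.0)
_, slopes_ridge = fit_ridge(features, y, lam=10.0)
print(np.round(slopes_ols, 3), np.round(slopes_ridge, 3))
```

The ridge slopes have a smaller norm than the OLS slopes. In practice predictors are usually standardised first so that the penalty treats them on a common scale, and $\lambda$ is chosen by cross-validation.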

Diagnostic plots

  • Residual vs fitted — random scatter ideal.
  • Q-Q plot — normality.
  • Scale-location — homoscedasticity.
  • Residual vs leverage — Cook's distance for influential points.
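
The quantities behind the residual vs leverage plot can be computed directly. The sketch below assumes NumPy and uses the standard definitions: leverage $h_{ii}$ is the diagonal of the hat matrix $H = X (X^T X)^{-1} X^T$, and Cook's distance is $D_i = \frac{e_i^2}{k s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$ with $k$ fitted parameters and $s^2 = \text{SSE}/(n - k)$. The function name and the toy data with one planted outlier are hypothetical.

```python
import numpy as np

def leverage_and_cooks(X, y):
    """Return (leverage, Cook's distance) for an OLS fit of y on X.

    X must already contain the intercept column.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape  # k counts the intercept column too
    # Hat matrix H maps observed y to fitted values: y_hat = H y.
    H = X @ np.linalg.solve(X.T @ X, X.T)
    leverage = np.diag(H)
    residuals = y - H @ y
    s2 = residuals @ residuals / (n - k)
    cooks = residuals**2 / (k * s2) * leverage / (1.0 - leverage) ** 2
    return leverage, cooks

# Toy data: ten points on the line y = 2x plus one distant, off-line point.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30], dtype=float)
y = 2.0 * x
y[-1] = 20.0
X = np.column_stack([np.ones_like(x), x])

leverage, cooks = leverage_and_cooks(X, y)
print(int(np.argmax(cooks)), np.round(leverage.sum(), 6))
```

The planted point has both the highest leverage and the largest Cook's distance. The leverages sum to the number of fitted parameters, which is a quick sanity check on the computation.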

Worked sketch — yield prediction

$$\hat{\text{yield}} = 1.2 + 0.005 \cdot \text{rainfall} - 0.08 \cdot \text{temp} + 0.3 \cdot \text{fertiliser}.$$

Interpretation:

  • 1 mm extra rainfall → +0.005 t/ha yield.
  • 1 °C extra → −0.08 t/ha (heat stress).
  • 1 unit fertiliser → +0.3 t/ha.
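
The fitted equation can be evaluated directly to confirm these readings. In the sketch below the coefficients come from the equation above; the function name and the input values (600 mm, 25 °C, 2 units of fertiliser) are hypothetical.

```python
def predict_yield(rainfall_mm, temp_c, fertiliser_units):
    """Predicted yield in t/ha from the fitted equation in the notes."""
    return 1.2 + 0.005 * rainfall_mm - 0.08 * temp_c + 0.3 * fertiliser_units

# Hypothetical season, then the same season one degree warmer.
baseline = predict_yield(600, 25, 2)
hotter = predict_yield(600, 26, 2)
print(round(baseline, 2), round(hotter - baseline, 2))
```

Changing only the temperature by 1 °C moves the prediction by exactly the temperature coefficient, which is the "holding other predictors constant" reading in action.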

Why it matters

  • Workhorse of statistics and econometrics.
  • Strong, interpretable baseline before going non-linear.
  • Foundation for logistic, ridge, lasso, GLMs.
Mind Map
Visual structure of the concept
MULTIPLE LINEAR REGRESSION
├── Model: ŷ = β₀ + Σ βⱼ xⱼ
├── Matrix form: ŷ = Xβ
├── OLS: β̂ = (XᵀX)⁻¹ Xᵀy
├── Fit measures
│   ├── R²
│   └── Adjusted R²  (penalises p)
├── Categorical → one-hot (K−1 dummies)
├── Interactions: xᵢ · xⱼ
├── Assumptions (LINE + no multicollinearity)
├── Multicollinearity check
│   ├── Pairwise corr > 0.9
│   └── VIF > 5 or 10
└── Regularisation
    ├── Ridge (L2)
    ├── Lasso (L1)
    └── Elastic Net
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Write the general form of multiple linear regression. $\hat y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$. In matrix form $y = X \beta + \varepsilon$.

Q2. How is the OLS estimate of $\beta$ computed in matrix form? $\hat \beta = (X^T X)^{-1} X^T y$.

Q3. What is multicollinearity? How can you detect it? A condition where two or more predictors are strongly linearly related, making $X^T X$ near-singular and inflating coefficient variances. Detected via pairwise correlations $> 0.9$ or the Variance Inflation Factor $\text{VIF}_j = 1 / (1 - R^2_j)$, with VIF $> 5$ or 10 considered problematic.


Part B (20 marks)

Q. Discuss Multiple Linear Regression. Derive the OLS estimator in matrix form. Explain how to handle multicollinearity, categorical predictors, and how Adjusted $R^2$ differs from $R^2$. Give an example application.

Model. $\hat y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$, or in matrix form $y = X \beta + \varepsilon, \quad X \in \mathbb{R}^{n \times (p+1)}, \; \beta \in \mathbb{R}^{p+1}.$

Derivation of OLS. Minimise

$$\text{SSE}(\beta) = \| y - X \beta \|^2 = (y - X \beta)^T (y - X \beta).$$

Differentiate w.r.t. $\beta$ and set the gradient to zero: $\frac{\partial \text{SSE}}{\partial \beta} = -2 X^T y + 2 X^T X \beta = 0.$

Solve: $\boxed{\hat \beta = (X^T X)^{-1} X^T y}.$

(Assumes $X^T X$ is invertible, i.e., predictors are linearly independent and $n > p$.)
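
For a full-marks answer it may help to show the expansion that sits between the SSE definition and its gradient, along with the second-order condition. The block below is a sketch of those intermediate steps using the same symbols as the derivation above.

```latex
\begin{aligned}
\text{SSE}(\beta)
  &= (y - X\beta)^T (y - X\beta) \\
  &= y^T y - 2\,\beta^T X^T y + \beta^T X^T X \beta, \\[4pt]
\frac{\partial\,\text{SSE}}{\partial \beta}
  &= -2 X^T y + 2 X^T X \beta, \\[4pt]
\frac{\partial^2\,\text{SSE}}{\partial \beta\,\partial \beta^T}
  &= 2 X^T X \succeq 0 .
\end{aligned}
```

The cross terms combine because $\beta^T X^T y$ is a scalar and therefore equals its transpose $y^T X \beta$. Since $2 X^T X$ is positive semi-definite (positive definite when $X^T X$ is invertible), the stationary point is a minimum.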

Categorical predictors. One-hot encode a $K$-level categorical into $K - 1$ dummy columns (one reference dropped). Each coefficient is the effect relative to the reference. E.g., region with levels {N, S, E, W} and reference N gives dummies $\mathbb{1}_S, \mathbb{1}_E, \mathbb{1}_W$; their coefficients give the average $y$-difference vs region N.

Multicollinearity.

  • Effect. $X^T X$ is near-singular, so $(X^T X)^{-1}$ is ill-conditioned → coefficients unstable, large standard errors, possibly wrong signs, p-values unreliable.

  • Detection.

    • Pairwise correlation matrix; values $> 0.9$ in absolute value are suspicious.
    • Variance Inflation Factor $\text{VIF}_j = 1 / (1 - R^2_j)$, where $R^2_j$ is from regressing $x_j$ on the remaining predictors. $\text{VIF} > 5$ or $10$ is problematic.
    • Condition number of $X^T X$.
  • Treatment.

    • Drop one of the offending features.
    • Combine into a single composite (PCA, summed index).
    • Use Ridge regression, which adds $\lambda I$ to $X^T X$ and so stabilises the inverse.

$R^2$ vs Adjusted $R^2$.

| Metric | Formula | Behaviour |
| --- | --- | --- |
| $R^2$ | $1 - \text{SSE}/\text{SST}$ | Never decreases when a predictor is added, even a useless one. |
| Adj. $R^2$ | $1 - \dfrac{(1 - R^2)(n - 1)}{n - p - 1}$ | Adds a penalty for $p$; can decrease if a new feature doesn't help enough. Better for model comparison. |

Example — agricultural yield prediction.

Predict crop yield (t/ha) from rainfall (mm), temperature (°C), fertiliser (kg).

After fitting on historical seasons: $\hat{\text{yield}} = 1.2 + 0.005 \cdot \text{rain} - 0.08 \cdot \text{temp} + 0.3 \cdot \text{fert}.$

Interpretation:

  • Each extra mm rainfall raises yield by 0.005 t/ha.
  • Each extra °C reduces yield by 0.08 t/ha (heat stress).
  • Each extra unit fertiliser raises yield by 0.3 t/ha.

Diagnostics. Plot residuals vs fitted (random scatter ✓), Q-Q plot (approximately normal ✓), check VIFs (rainfall and temperature highly correlated → VIF > 10 → either drop one, or use Ridge).

Why MLR matters.

  • Interpretable, fast, strong baseline.
  • Foundation for logistic regression, GLMs, Ridge / Lasso.
  • Used in econometrics, finance, agriculture, marketing-mix modelling.