PGDDSA Study · Semester 1

Core Titles

Key headlines and terms for quick recall

Multiple Linear Regression (MLR) — many predictors
Model: $\hat y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$
Matrix form: $\hat y = X \beta$
OLS estimate: $\hat \beta = (X^T X)^{-1} X^T y$
Adjusted $R^2$ — penalises extra predictors
Multicollinearity — VIF
Categorical encoding, interactions
Assumptions: LINE + no multicollinearity

Basic Idea

What it is, why it matters, how it works

What it is

Multiple Linear Regression (MLR) generalises SLR to multiple predictors:

$\hat y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p.$

Each $\beta_j$ is the expected change in $y$ for a 1-unit change in $x_j$ , holding other features constant.

Matrix form

Stack predictors into a design matrix $X \in \mathbb{R}^{n \times (p+1)}$ (with a leading column of 1s for the intercept):

$y = X \beta + \varepsilon$

OLS solution: $\boxed{\hat \beta = (X^T X)^{-1} X^T y}.$

(Same idea as SLR — minimise $\|y - X \beta\|^2$ .)
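The closed-form solution above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production fitter: the function name `ols_fit` and the toy data are assumptions made for the example, and the code solves the normal equations directly rather than forming $(X^T X)^{-1}$ explicitly.

```python
import numpy as np

def ols_fit(X_raw, y):
    """Return OLS coefficients [beta_0, beta_1, ..., beta_p] for y ~ X_raw."""
    X_raw = np.asarray(X_raw, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend the column of 1s so the first coefficient is the intercept.
    X = np.column_stack([np.ones(len(X_raw)), X_raw])
    # Solve the normal equations (X^T X) beta = X^T y without forming the inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
X_raw = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]])
y = 1 + 2 * X_raw[:, 0] + 3 * X_raw[:, 1]
beta_hat = ols_fit(X_raw, y)
print(np.round(beta_hat, 6))  # approximately [1, 2, 3]
```

Because the toy data has no noise, the recovered coefficients match the generating values up to floating-point error. When predictors are nearly collinear, `np.linalg.lstsq` is usually the numerically safer choice.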

Goodness of fit

SSE $= \sum (y_i - \hat y_i)^2$ .
$R^2 = 1 - \text{SSE}/\text{SST}$ , where SST $= \sum (y_i - \bar y)^2$ is the total sum of squares.
Adjusted $R^2$ penalises extra predictors: $R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}.$ It rises only if a new feature reduces SSE enough to overcome the penalty, which makes it useful for model selection.
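The penalty is easiest to see with numbers. The sketch below implements the two formulas above in plain Python; the function names and the $R^2$ values (0.800 and 0.802, with $n = 30$) are hypothetical, chosen only to show a case where $R^2$ goes up but adjusted $R^2$ goes down.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SSE/SST."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical numbers: a 4th predictor lifts R^2 only from 0.800 to 0.802.
before = adjusted_r_squared(0.800, n=30, p=3)
after = adjusted_r_squared(0.802, n=30, p=4)
print(round(before, 4), round(after, 4))  # prints 0.7769 0.7703
```

Here the extra predictor barely reduces SSE, so the adjusted value falls and the smaller model would be preferred on this criterion.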

Categorical predictors

One-hot encode: $K$ categories → $K - 1$ dummy variables (drop one as reference).
Coefficients interpret as effect relative to the reference category.

Interactions

Add term $x_1 \cdot x_2$ if effect of one feature depends on the value of another.
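As a small illustration, an interaction is just one more column in the design matrix. The sketch below uses made-up, noise-free data (the generating coefficients 1, 2, 3, 0.5 are assumptions for the example) and shows that the slope in $x_1$ then depends on $x_2$.

```python
import numpy as np

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 + 0.5*x1*x2 (no noise).
x1 = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 3.0])
x2 = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 1.0])
y = 1 + 2 * x1 + 3 * x2 + 0.5 * x1 * x2

# Design matrix: intercept, main effects, and the product column x1*x2.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# With the interaction, the slope of y in x1 is beta_1 + beta_3 * x2.
slope_x1_at_x2_0 = beta_hat[1] + beta_hat[3] * 0
slope_x1_at_x2_2 = beta_hat[1] + beta_hat[3] * 2
print(np.round(beta_hat, 6), slope_x1_at_x2_0, slope_x1_at_x2_2)
```

In this toy case a unit increase in $x_1$ adds about 2 to $y$ when $x_2 = 0$ but about 3 when $x_2 = 2$.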

Assumptions

Same as SLR (LINE: Linearity, Independence of errors, Normality of errors, Equal variance) plus:

No multicollinearity — predictors not strongly linearly dependent. Otherwise $X^T X$ is near-singular and coefficient variances explode.
Detect via:
- Pairwise correlations > 0.9 are red flags.
- Variance Inflation Factor $\text{VIF}_j = 1 / (1 - R^2_j)$ , where $R^2_j$ comes from regressing $x_j$ on the other predictors. $\text{VIF} > 5$ or 10 is concerning.
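The VIF definition above translates directly into code: regress each predictor on the others and plug the resulting $R^2_j$ into $1 / (1 - R^2_j)$. The following is a minimal NumPy sketch; the function name `vif` and the synthetic data are assumptions for illustration, and a perfectly collinear column would make the division blow up.

```python
import numpy as np

def vif(X):
    """Return VIF_j for each column of X (predictors only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress x_j on the remaining predictors plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2_j = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1 / (1 - r2_j)
    return out

# Synthetic predictors: x2 is nearly a copy of x1, x3 is unrelated noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)
x3 = rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

The first two VIFs come out far above 10 while the third stays close to 1, matching the rule of thumb in the list above.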

Coefficient interpretation

$\beta_j$ is the expected change in $y$ per unit increase in $x_j$ holding all other predictors constant. The "ceteris paribus" caveat matters: if predictors move together in the data (multicollinearity), the effect of one cannot be cleanly separated from the others, and this interpretation becomes unreliable.

Regularisation — when $p$ is large or features are correlated

Ridge — adds $\lambda \sum \beta_j^2$ penalty; shrinks all coefficients.
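Lasso — adds $\lambda \sum |\beta_j|$ penalty; can shrink some coefficients exactly to zero, so it performs feature selection.
Elastic Net — combines the Ridge (L2) and Lasso (L1) penalties.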
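Of the three, only Ridge has a simple closed form, $\hat \beta = (X^T X + \lambda I)^{-1} X^T y$, so it is the one sketched here; Lasso and Elastic Net need an iterative solver and are not shown. The function name `ridge_fit`, the synthetic data and the choice $\lambda = 10$ are assumptions for illustration. The data is centred so that the intercept is not penalised, and the sketch assumes features are on comparable scales (in practice they are usually standardised first).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge on centred data; returns (intercept, coefficients)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centring keeps the intercept unpenalised
    p = X.shape[1]
    # Solve (Xc^T Xc + lam * I) beta = Xc^T yc
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    return y_mean - x_mean @ beta, beta

# Two almost identical predictors; only x1 truly drives y.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=100)

_, beta_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 reduces to plain OLS
_, beta_ridge = ridge_fit(X, y, lam=10.0)
print(np.round(beta_ols, 2), np.round(beta_ridge, 2))
```

With near-duplicate predictors the OLS coefficients can be large and offsetting, while the ridge coefficients are pulled toward smaller, more stable values.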

Diagnostic plots

Residual vs fitted — random scatter ideal.
Q-Q plot — normality.
Scale-location — homoscedasticity.
Residual vs leverage — Cook's distance for influential points.
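The quantities behind the last plot can be computed directly. The sketch below uses the usual textbook definitions, which are not spelled out in these notes: leverage $h_{ii}$ is the diagonal of the hat matrix $H = X (X^T X)^{-1} X^T$, and Cook's distance is $D_i = \frac{e_i^2}{k s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$ with $k = p + 1$ parameters and $s^2 = \text{SSE}/(n - k)$. The function name and the toy data are assumptions for illustration.

```python
import numpy as np

def cooks_distance(X_raw, y):
    """Return (leverage, Cook's distance) for an OLS fit with intercept."""
    y = np.asarray(y, dtype=float)
    X_raw = np.asarray(X_raw, dtype=float).reshape(len(y), -1)
    X = np.column_stack([np.ones(len(y)), X_raw])
    n, k = X.shape  # k = p + 1 fitted parameters
    # Hat matrix H = X (X^T X)^{-1} X^T; its diagonal is the leverage h_ii.
    H = X @ np.linalg.solve(X.T @ X, X.T)
    h = np.diag(H)
    resid = y - H @ y
    s2 = resid @ resid / (n - k)  # residual variance estimate
    cooks = resid**2 / (k * s2) * h / (1 - h) ** 2
    return h, cooks

# Five points on the line y = 2x + 1, plus one far-out point that breaks it.
x = np.array([1, 2, 3, 4, 5, 20])
y = np.array([3, 5, 7, 9, 11, 10])
leverage, cooks = cooks_distance(x, y)
print(int(np.argmax(cooks)))  # index of the most influential point
```

The last observation sits far from the other $x$ values and off the trend, so it has both the highest leverage and by far the largest Cook's distance.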

Worked sketch — yield prediction

$\hat{\text{yield}} = 1.2 + 0.005 \cdot \text{rainfall} - 0.08 \cdot \text{temp} + 0.3 \cdot \text{fertiliser}$ .

Interpretation:

1 mm extra rainfall → +0.005 t/ha yield.
1 °C extra → −0.08 t/ha (heat stress).
1 unit fertiliser → +0.3 t/ha.
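The interpretation can be checked by evaluating the fitted equation at two inputs that differ in only one feature. The input values below (600 mm, 25 °C, 2 units of fertiliser) are hypothetical and used only to exercise the formula.

```python
def predict_yield(rainfall_mm, temp_c, fertiliser):
    """Evaluate the fitted equation from the worked sketch (yield in t/ha)."""
    return 1.2 + 0.005 * rainfall_mm - 0.08 * temp_c + 0.3 * fertiliser

# Hypothetical season: 600 mm rain, 25 degrees C, 2 units of fertiliser.
base = predict_yield(600, 25, 2)     # 1.2 + 3.0 - 2.0 + 0.6 = 2.8
warmer = predict_yield(600, 26, 2)   # one extra degree, everything else fixed
print(round(base, 2), round(warmer - base, 2))  # prints 2.8 -0.08
```

The difference between the two predictions equals the temperature coefficient, which is exactly what "holding other features constant" means.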

Why it matters

Workhorse of statistics and econometrics.
Strong, interpretable baseline before going non-linear.
Foundation for logistic, ridge, lasso, GLMs.

Mind Map

Visual structure of the concept

MULTIPLE LINEAR REGRESSION
├── Model: ŷ = β₀ + Σ βⱼ xⱼ
├── Matrix form: ŷ = Xβ
├── OLS: β̂ = (XᵀX)⁻¹ Xᵀy
├── Fit measures
│   ├── R²
│   └── Adjusted R²  (penalises p)
├── Categorical → one-hot (K−1 dummies)
├── Interactions: xᵢ · xⱼ
├── Assumptions (LINE + no multicollinearity)
├── Multicollinearity check
│   ├── Pairwise corr > 0.9
│   └── VIF > 5 or 10
└── Regularisation
    ├── Ridge (L2)
    ├── Lasso (L1)
    └── Elastic Net

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Write the general form of multiple linear regression.
$\hat y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$ . In matrix form, $y = X \beta + \varepsilon$ .

Q2. How is the OLS estimate of $\beta$ computed in matrix form?
$\hat \beta = (X^T X)^{-1} X^T y$ .

Q3. What is multicollinearity? How can you detect it?
A condition where two or more predictors are strongly linearly related, making $X^T X$ near-singular and inflating coefficient variances. Detected via pairwise correlations $> 0.9$ or the Variance Inflation Factor $\text{VIF}_j = 1 / (1 - R^2_j)$ , with VIF $> 5$ or 10 considered problematic.

Part B (20 marks)

Q. Discuss Multiple Linear Regression. Derive the OLS estimator in matrix form. Explain how to handle multicollinearity, categorical predictors, and how Adjusted $R^2$ differs from $R^2$ . Give an example application.

Model. $\hat y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$ , or in matrix form $y = X \beta + \varepsilon, \quad X \in \mathbb{R}^{n \times (p+1)}, \; \beta \in \mathbb{R}^{p+1}.$

Derivation of OLS. Minimise

$\text{SSE}(\beta) = \| y - X \beta \|^2 = (y - X \beta)^T (y - X \beta).$

Differentiate w.r.t. $\beta$ : $\frac{\partial \text{SSE}}{\partial \beta} = -2 X^T y + 2 X^T X \beta = 0.$

Solve: $\boxed{\hat \beta = (X^T X)^{-1} X^T y}.$

(Assumes $X^T X$ is invertible, i.e., the columns of $X$ , including the intercept column, are linearly independent, which requires $n \ge p + 1$ .)
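For a full-marks answer, the differentiation step above can be expanded term by term. This uses the standard identities $\partial (\beta^T a) / \partial \beta = a$ and $\partial (\beta^T A \beta) / \partial \beta = 2 A \beta$ for symmetric $A$ :

$\text{SSE}(\beta) = y^T y - 2 \beta^T X^T y + \beta^T X^T X \beta$

$\frac{\partial \text{SSE}}{\partial \beta} = -2 X^T y + 2 X^T X \beta$

Setting this to zero gives the normal equations $X^T X \hat \beta = X^T y$ . The second derivative is $2 X^T X$ , which is positive definite when $X$ has full column rank, so the stationary point is a minimum.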

Categorical predictors. One-hot encode a $K$ -level categorical into $K - 1$ dummy columns (one reference dropped). Each coefficient is the effect relative to the reference. E.g., region with levels {N, S, E, W} and reference N gives dummies $\mathbb{1}_S, \mathbb{1}_E, \mathbb{1}_W$ ; their coefficients give average $y$ -difference vs region N.
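The region example can be written out as a short encoding routine. This is a plain-Python sketch; the function name `one_hot_drop_reference` and the five sample values are assumptions for illustration.

```python
def one_hot_drop_reference(values, levels, reference):
    """Encode values as K-1 dummy columns, omitting the reference level."""
    kept = [lvl for lvl in levels if lvl != reference]
    rows = [[1 if v == lvl else 0 for lvl in kept] for v in values]
    return kept, rows

levels = ["N", "S", "E", "W"]
columns, dummies = one_hot_drop_reference(
    ["N", "S", "W", "E", "S"], levels, reference="N"
)
print(columns)  # ['S', 'E', 'W']
print(dummies)  # [[0, 0, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

A row of all zeros denotes the reference region N, which is why a fourth column is unnecessary: adding it would duplicate the intercept column and make $X^T X$ singular.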

Multicollinearity.

Effect. $X^T X$ near-singular → coefficients unstable, with large standard errors, possibly wrong signs, and unreliable p-values.
Detection.
- Pairwise correlation matrix; values $> 0.9$ suspicious.
- Variance Inflation Factor $\text{VIF}_j = 1 / (1 - R^2_j)$ where $R^2_j$ is from regressing $x_j$ on the remaining predictors. $\text{VIF} > 5$ or $10$ is problematic.
- Condition number of $X^T X$ ; a very large value signals near-singularity.
Treatment.
- Drop one of the offending features.
- Combine into a single composite (PCA, summed index).
- Use Ridge regression, which adds $\lambda I$ to $X^T X$ before inversion, stabilising the inverse.

$R^2$ vs Adjusted $R^2$ .

Metric	Formula	Behaviour
$R^2$	$1 - \text{SSE}/\text{SST}$	Never decreases when adding a predictor — even useless ones.
Adj. $R^2$	$1 - \dfrac{(1 - R^2)(n - 1)}{n - p - 1}$	Adds penalty for $p$ ; can decrease if new feature doesn't help enough. Better for model comparison.

Example — agricultural yield prediction.

Predict crop yield (t/ha) from rainfall (mm), temperature (°C), fertiliser (kg).

After fitting on historical seasons: $\hat{\text{yield}} = 1.2 + 0.005 \cdot \text{rain} - 0.08 \cdot \text{temp} + 0.3 \cdot \text{fert}.$