PGDDSA Study · Semester 1

Core Titles

Key headlines and terms for quick recall

Simple Linear Regression (SLR) — one predictor
Model: $\hat y = \beta_0 + \beta_1 x$
Goal: minimise Sum of Squared Errors (SSE)
OLS — Ordinary Least Squares
Closed-form: $\beta_1 = \dfrac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}$, $\beta_0 = \bar y - \beta_1 \bar x$
$R^2$ — coefficient of determination
Assumptions: linearity, independence, homoscedasticity, normality

Basic Idea

What it is, why it matters, how it works

What it is

Simple Linear Regression (SLR) models the relationship between one predictor $x$ and a continuous response $y$ as a straight line:

$\hat y = \beta_0 + \beta_1 x$

$\beta_0$: intercept (predicted $y$ when $x = 0$).
$\beta_1$: slope (change in predicted $y$ per one-unit increase in $x$).

Goal — least squares

We want the line that minimises the sum of squared residuals: $\text{SSE} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2.$

This is Ordinary Least Squares (OLS).
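To make the objective concrete, here is a minimal Python sketch of the SSE computation. The function name `sse` and the three toy data points are illustrative assumptions, not part of the course material.

```python
def sse(beta0, beta1, xs, ys):
    """Sum of squared residuals for the line y_hat = beta0 + beta1 * x."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data lying exactly on y = 1 + 2x.
xs = [0, 1, 2]
ys = [1, 3, 5]

perfect = sse(1, 2, xs, ys)   # the true line: every residual is zero
shifted = sse(0, 2, xs, ys)   # intercept off by 1: every residual is 1

print(perfect, shifted)  # 0 3
```

OLS picks the pair $(\beta_0, \beta_1)$ for which this quantity is smallest.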

Closed-form solution

Setting $\partial \text{SSE}/\partial \beta_0 = 0$ and $\partial \text{SSE}/\partial \beta_1 = 0$ and solving the two equations gives:

$\boxed{\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}}, \quad \boxed{\beta_0 = \bar y - \beta_1 \bar x.}$

Equivalently, $\beta_1 = \text{Cov}(x, y) / \text{Var}(x)$, and $\beta_1 = r \cdot s_y / s_x$, where $r$ is the Pearson correlation and $s_x$, $s_y$ are the sample standard deviations of $x$ and $y$.
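Both equivalent forms follow from the closed-form slope in one step each. A short derivation, assuming $s_{xy}$ and $s_x^2$ denote the sample covariance and variance computed with the same divisor $n - 1$ (the symbol $s_{xy}$ is introduced here for convenience):

$\beta_1 = \dfrac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2} = \dfrac{\frac{1}{n-1}\sum (x_i - \bar x)(y_i - \bar y)}{\frac{1}{n-1}\sum (x_i - \bar x)^2} = \dfrac{s_{xy}}{s_x^2} = \dfrac{\text{Cov}(x, y)}{\text{Var}(x)}$

$r = \dfrac{s_{xy}}{s_x s_y} \;\Rightarrow\; s_{xy} = r \, s_x s_y \;\Rightarrow\; \beta_1 = \dfrac{r \, s_x s_y}{s_x^2} = r \cdot \dfrac{s_y}{s_x}$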

Goodness of fit — $R^2$

$R^2 = 1 - \dfrac{\text{SSE}}{\text{SST}}, \quad \text{SST} = \sum (y_i - \bar y)^2.$

For an OLS fit with an intercept, $R^2$ lies between 0 and 1 on the data used for fitting; it is the fraction of the variance in $y$ explained by the model.

For SLR, $R^2 = r^2$ (Pearson correlation squared).

Assumptions (LINE)

Linearity — relationship truly is linear.
Independence — observations independent.
Normality — residuals approximately Gaussian.
Equal variance (homoscedasticity) — constant variance of residuals.

Violations show up in residual plots.

Residuals

$e_i = y_i - \hat y_i$. A residual plot of $e_i$ vs $\hat y_i$ should look like random noise:

A funnel → heteroscedasticity.
A U-shape → missed non-linearity.
A trend → biased model.
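The U-shape pattern can be seen numerically without drawing a plot. The sketch below fits a straight line to hypothetical data generated from $y = x^2$ (an assumption chosen purely for illustration) and prints the residuals; `fit_line` is an illustrative helper that applies the closed-form OLS formulas.

```python
def fit_line(xs, ys):
    """Closed-form OLS for one predictor; returns (beta0, beta1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Hypothetical data from a curved relationship, y = x^2.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

beta0, beta1 = fit_line(xs, ys)
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]

print(beta0, beta1)  # -2.0 4.0
print(residuals)     # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle, which is the U-shape that signals a missed non-linearity.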

Worked example

Hours studied $x = (1, 2, 3, 4, 5)$; marks $y = (52, 55, 60, 65, 70)$.

$\bar x = 3, \; \bar y = 60.4$.
$\sum (x - \bar x)(y - \bar y) = (-2)(-8.4) + (-1)(-5.4) + 0 + (1)(4.6) + (2)(9.6) = 16.8 + 5.4 + 0 + 4.6 + 19.2 = 46$.
$\sum (x - \bar x)^2 = 4 + 1 + 0 + 1 + 4 = 10$.
$\beta_1 = 46 / 10 = 4.6$.
$\beta_0 = 60.4 - 4.6 \cdot 3 = 46.6$.

$\hat y = 46.6 + 4.6 x$. Each extra hour of study is associated with about 4.6 more marks.
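The hand calculation can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names and the `predict` helper are illustrative choices.

```python
# Data from the worked example: hours studied and marks.
xs = [1, 2, 3, 4, 5]
ys = [52, 55, 60, 65, 70]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Centred sums used in the closed-form solution.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)

beta1 = s_xy / s_xx
beta0 = y_bar - beta1 * x_bar

def predict(x):
    """Predicted marks for x hours of study."""
    return beta0 + beta1 * x

print(round(beta1, 4), round(beta0, 4))  # 4.6 46.6
```

Note that `predict(3)` returns about 60.4: the fitted line always passes through the point $(\bar x, \bar y)$.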

When SLR fails

Non-linear relationships (use polynomial / non-linear regression).
Multiple drivers (use multiple regression).
Categorical predictors (encode).
Heavy outliers (use robust regression).
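The sensitivity to outliers is easy to demonstrate. In the sketch below, the clean marks come from the worked example, while the corrupted list is a hypothetical scenario in which the last mark is mis-recorded as 20 instead of 70; `slope` is an illustrative helper.

```python
def slope(xs, ys):
    """Closed-form OLS slope for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx

xs = [1, 2, 3, 4, 5]
clean = [52, 55, 60, 65, 70]       # marks from the worked example
corrupted = [52, 55, 60, 65, 20]   # hypothetical: last mark mis-recorded as 20

print(round(slope(xs, clean), 4))      # 4.6
print(round(slope(xs, corrupted), 4))  # -5.4
```

A single bad point flips the sign of the slope, because squared errors give large residuals a disproportionate pull on the fit.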

Why it matters in data science

Simplest interpretable model — strong baseline.
Foundation for all linear methods (Logistic Regression, GLMs).
Closed-form solution needs only a single pass over the data (a handful of running sums), so it trains quickly even on very large datasets.
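One way to see why the closed form scales well: the estimates depend on the data only through five running totals. The sketch below is an illustration under that observation, not a production implementation; `fit_streaming` is a hypothetical name, and it uses the algebraically equivalent raw-sum form $\beta_1 = (n\sum xy - \sum x \sum y) / (n\sum x^2 - (\sum x)^2)$.

```python
def fit_streaming(pairs):
    """One pass over (x, y) pairs, keeping only five running totals."""
    n = 0
    sum_x = sum_y = sum_xy = sum_xx = 0.0
    for x, y in pairs:
        n += 1
        sum_x += x
        sum_y += y
        sum_xy += x * y
        sum_xx += x * x
    beta1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    beta0 = (sum_y - beta1 * sum_x) / n
    return beta0, beta1

# Works on any iterable, so the data never has to sit in memory at once.
data = zip([1, 2, 3, 4, 5], [52, 55, 60, 65, 70])
beta0, beta1 = fit_streaming(data)

print(round(beta0, 4), round(beta1, 4))  # 46.6 4.6
```

The raw-sum form can lose precision when the values are large relative to their spread, so the centred formulas are the safer choice when the data fits in memory.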

Mind Map

Visual structure of the concept

SIMPLE LINEAR REGRESSION
├── Model: ŷ = β₀ + β₁ x
├── Method: OLS (minimise SSE)
├── Closed-form
│   ├── β₁ = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)²
│   └── β₀ = ȳ − β₁ x̄
├── R² = 1 − SSE/SST
├── Assumptions (LINE)
│   ├── Linearity
│   ├── Independence
│   ├── Normality of residuals
│   └── Equal variance (homoscedasticity)
└── Diagnostics
    ├── Residual plot
    │   ├── U-shape → non-linear
    │   ├── Funnel → heteroscedasticity
    │   └── Trend → biased
    └── Q-Q plot for normality

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Write the equation of simple linear regression. $\hat y = \beta_0 + \beta_1 x$, where $\beta_0$ is the intercept and $\beta_1$ is the slope.

Q2. What is the least-squares criterion? Choose $\beta_0, \beta_1$ to minimise the sum of squared residuals $\sum (y_i - \hat y_i)^2$.

Q3. What does $R^2$ represent? The fraction of variance in $y$ explained by the regression model: $R^2 = 1 - \text{SSE}/\text{SST}$. For an OLS fit with an intercept it ranges from 0 to 1; for SLR, $R^2 = r^2$, where $r$ is the Pearson correlation.

Part B (20 marks)

Q. Explain Simple Linear Regression. Derive the OLS estimates of $\beta_0$ and $\beta_1$. Apply to the data: hours studied $x = (1, 2, 3, 4, 5)$, marks $y = (52, 55, 60, 65, 70)$. Discuss assumptions and interpretation of $R^2$.

Model. $\hat y = \beta_0 + \beta_1 x$, with residual $e_i = y_i - \hat y_i$.

Goal. Minimise $\text{SSE} = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$.

Derivation of OLS estimates.

Take partial derivatives and set to zero:

$\dfrac{\partial \text{SSE}}{\partial \beta_0} = -2 \sum (y_i - \beta_0 - \beta_1 x_i) = 0$

$\dfrac{\partial \text{SSE}}{\partial \beta_1} = -2 \sum x_i (y_i - \beta_0 - \beta_1 x_i) = 0$

From the first equation: $\beta_0 = \bar y - \beta_1 \bar x$.

Substitute into the second:

$\sum x_i (y_i - \bar y + \beta_1 \bar x - \beta_1 x_i) = 0$

$\sum x_i (y_i - \bar y) = \beta_1 \sum x_i (x_i - \bar x)$

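Since $\sum (y_i - \bar y) = 0$ and $\sum (x_i - \bar x) = 0$, replacing the leading factor $x_i$ by $x_i - \bar x$ on each side changes neither sum: $\sum x_i (y_i - \bar y) = \sum (x_i - \bar x)(y_i - \bar y)$ and $\sum x_i (x_i - \bar x) = \sum (x_i - \bar x)^2$. Hence:

$\boxed{\beta_1 = \dfrac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}}, \quad \boxed{\beta_0 = \bar y - \beta_1 \bar x.}$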
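As a sanity check on the derivation, the fitted residuals should satisfy both normal equations: $\sum e_i = 0$ and $\sum x_i e_i = 0$. The following Python sketch verifies this on the data given in the question; the variable names `eq1` and `eq2` are illustrative.

```python
# Data from the question: hours studied and marks.
xs = [1, 2, 3, 4, 5]
ys = [52, 55, 60, 65, 70]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
beta1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
beta0 = y_bar - beta1 * x_bar

residuals = [y - beta0 - beta1 * x for x, y in zip(xs, ys)]

# First normal equation: the residuals sum to zero.
eq1 = sum(residuals)
# Second normal equation: the residuals are orthogonal to x.
eq2 = sum(x * e for x, e in zip(xs, residuals))

# Compare against a small tolerance because of floating-point rounding.
print(abs(eq1) < 1e-9, abs(eq2) < 1e-9)  # True True
```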

Application — given data.

$x$	$y$	$x - \bar x$	$y - \bar y$	$(x-\bar x)(y - \bar y)$	$(x - \bar x)^2$
1	52	−2	−8.4	16.8	4
2	55	−1	−5.4	5.4	1
3	60	0	−0.4	0	0
4	65	1	4.6	4.6	1
5	70	2	9.6	19.2	4
Sum	302			46.0	10

$\bar x = 3, \; \bar y = 60.4$.

$\beta_1 = 46 / 10 = \boxed{4.6}$. $\beta_0 = 60.4 - 4.6 \cdot 3 = \boxed{46.6}$.

Fitted line: $\hat y = 46.6 + 4.6 x$. Each extra hour of study is associated with about 4.6 more marks.

Predictions and residuals.

$x$	$y$	$\hat y$	residual $e$
1	52	51.2	0.8
2	55	55.8	−0.8
3	60	60.4	−0.4
4	65	65.0	0.0
5	70	69.6	0.4

SSE = $\sum e_i^2 = 0.64 + 0.64 + 0.16 + 0 + 0.16 = 1.60$.

SST = $\sum (y - \bar y)^2 = 70.56 + 29.16 + 0.16 + 21.16 + 92.16 = 213.2$.

$R^2 = 1 - 1.6 / 213.2 \approx 0.9925$.

Interpretation. $R^2 \approx 99.25\%$: the model explains nearly all of the variance in marks, an excellent linear fit on these five points. (With so few observations, validate on held-out data before trusting it on new students.)
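The SSE, SST and $R^2$ figures can be reproduced with a short Python sketch, which also checks the identity $R^2 = r^2$ for SLR. The coefficients are taken from the fitted line; the variable names are illustrative.

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [52, 55, 60, 65, 70]
beta0, beta1 = 46.6, 4.6   # coefficients of the fitted line

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

y_hat = [beta0 + beta1 * x for x in xs]
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
sst = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - sse / sst

# Pearson correlation, to check that R^2 equals r^2 for SLR.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
r = s_xy / math.sqrt(s_xx * sst)

print(round(sse, 2), round(sst, 2))            # 1.6 213.2
print(round(r_squared, 4), round(r ** 2, 4))   # 0.9925 0.9925
```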

Assumptions (LINE).

Linearity — relationship truly linear.
Independence — observations independent (students don't copy answers).
Normality of residuals — needed for inference (CIs, p-values), not for point estimates.
Equal variance (homoscedasticity) — constant spread of residuals.

Diagnostic plots.

Residual vs fitted plot should look like random noise. A U-shape → missed non-linearity (try adding an $x^2$ term). A funnel → heteroscedasticity (try a log transform of $y$).
Q-Q plot of residuals — checks normality.
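The numbers behind a Q-Q plot can be computed with the standard library alone. This is a rough sketch: the residuals are copied from the predictions table, and the plotting positions $(i - 0.5)/n$ are one common convention assumed here (other conventions exist).

```python
from statistics import NormalDist

# Residuals from the predictions table (x = 1..5).
residuals = [0.8, -0.8, -0.4, 0.0, 0.4]

n = len(residuals)
ordered = sorted(residuals)

# Plotting positions (i - 0.5) / n, then the matching standard normal quantiles.
probs = [(i - 0.5) / n for i in range(1, n + 1)]
theoretical = [NormalDist().inv_cdf(p) for p in probs]

# Each row is one point of the Q-Q plot: theoretical quantile, observed residual.
for q, e in zip(theoretical, ordered):
    print(f"{q:+.3f}  {e:+.2f}")
```

If the residuals are roughly normal, these pairs fall close to a straight line. With only five points the check is indicative at best.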
