PGD01C02
Module 3 · Exploratory Data Analytics and Model Development

Simple Linear Regression

Core Titles
Key headlines and terms for quick recall
  • Simple Linear Regression (SLR) — one predictor
  • Model: $\hat y = \beta_0 + \beta_1 x$
  • Goal: minimise Sum of Squared Errors (SSE)
  • OLS — Ordinary Least Squares
  • Closed-form: $\beta_1 = \dfrac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}$, $\beta_0 = \bar y - \beta_1 \bar x$
  • $R^2$ — coefficient of determination
  • Assumptions: linearity, independence, homoscedasticity, normality
Basic Idea
What it is, why it matters, how it works

What it is

Simple Linear Regression (SLR) models the relationship between one predictor $x$ and a continuous response $y$ as a straight line:

$$\hat y = \beta_0 + \beta_1 x$$

  • $\beta_0$ — intercept (predicted $y$ when $x = 0$).
  • $\beta_1$ — slope (change in $y$ per unit change in $x$).

Goal — least squares

We want the line that minimises the sum of squared residuals:

$$\text{SSE} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2.$$

This is Ordinary Least Squares (OLS).
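
As a rough illustration of the criterion, the sketch below evaluates SSE for three candidate lines on the hours/marks data from the worked example later in these notes. The helper name `sse` and the two alternative lines are illustrative choices, not part of the notes.

```python
# Data from the worked example in these notes (hours studied, marks).
xs = [1, 2, 3, 4, 5]
ys = [52, 55, 60, 65, 70]


def sse(b0, b1, xs, ys):
    """Sum of squared residuals for the line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# The OLS line from the worked example, plus two arbitrary alternatives.
sse_ols = sse(46.6, 4.6, xs, ys)      # least-squares line
sse_steeper = sse(46.6, 5.0, xs, ys)  # same intercept, steeper slope
sse_shifted = sse(47.6, 4.6, xs, ys)  # same slope, intercept raised by 1

print(round(sse_ols, 2), round(sse_steeper, 2), round(sse_shifted, 2))
```

On this data the least-squares line gives the smallest of the three values, which is what the criterion asks for.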

Closed-form solution

Setting $\partial \text{SSE}/\partial \beta_0 = 0$ and $\partial \text{SSE}/\partial \beta_1 = 0$:

$$\boxed{\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}}, \quad \boxed{\beta_0 = \bar y - \beta_1 \bar x.}$$

Equivalently $\beta_1 = \text{Cov}(x, y) / \text{Var}(x)$, and $\beta_1 = r \cdot s_y / s_x$, where $r$ is the Pearson correlation and $s_x$, $s_y$ are the sample standard deviations of $x$ and $y$.
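
The three expressions for the slope can be checked numerically. The following is a minimal sketch in plain Python on the hours/marks data from the worked example; the variable names (`s_xy`, `b1_cov`, and so on) are illustrative.

```python
from math import sqrt

# Data from the worked example in these notes.
xs = [1, 2, 3, 4, 5]
ys = [52, 55, 60, 65, 70]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Centred sums used by the closed-form solution.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

# 1) Direct closed form.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# 2) Cov(x, y) / Var(x); the 1/(n - 1) factors cancel.
b1_cov = (s_xy / (n - 1)) / (s_xx / (n - 1))

# 3) r * s_y / s_x, using sample standard deviations.
r = s_xy / sqrt(s_xx * s_yy)
b1_corr = r * sqrt(s_yy / (n - 1)) / sqrt(s_xx / (n - 1))

print(b0, b1, b1_cov, b1_corr)
```

All three slope values agree up to floating-point rounding, at roughly 4.6, with an intercept of roughly 46.6.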

Goodness of fit — $R^2$

$$R^2 = 1 - \dfrac{\text{SSE}}{\text{SST}}, \quad \text{SST} = \sum (y_i - \bar y)^2.$$

For an OLS fit with an intercept, $R^2$ lies between 0 and 1 on the fitting data and is the fraction of the variance in $y$ explained by the model.

For SLR, $R^2 = r^2$ (Pearson correlation squared).
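
A short sketch of the identity, again on the hours/marks data from the worked example: it computes $R^2$ from SSE and SST, and separately squares the Pearson correlation. Variable names are illustrative.

```python
from math import sqrt

# Data from the worked example in these notes.
xs = [1, 2, 3, 4, 5]
ys = [52, 55, 60, 65, 70]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)  # this is SST

# Closed-form OLS fit.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# R^2 from its definition.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sst = s_yy
r_squared = 1 - sse / sst

# Pearson correlation between x and y.
r = s_xy / sqrt(s_xx * s_yy)

print(round(r_squared, 4), round(r ** 2, 4))  # 0.9925 0.9925
```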

Assumptions (LINE)

  1. Linearity — relationship truly is linear.
  2. Independence — observations independent.
  3. Normality — residuals approximately Gaussian.
  4. Equal variance (homoscedasticity) — constant variance of residuals.

Violations show up in residual plots.

Residuals

$e_i = y_i - \hat y_i$. A residual plot of $e_i$ vs $\hat y_i$ should look like random noise:

  • A funnel → heteroscedasticity.
  • A U-shape → missed non-linearity.
  • A trend → biased model.
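
To make the U-shape case concrete, the sketch below fits a straight line to hypothetical data generated from $y = x^2$ (an assumption for illustration, not data from these notes) and prints the residuals in order of $x$. The helper `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Closed-form OLS fit for one predictor; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


# Hypothetical data with a purely quadratic relationship, y = x^2.
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle, which is the U-shape that signals a missed non-linear term.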

Worked example

Hours studied $x = (1, 2, 3, 4, 5)$; marks $y = (52, 55, 60, 65, 70)$.

  • $\bar x = 3, \; \bar y = 60.4$.
  • $\sum (x - \bar x)(y - \bar y) = (-2)(-8.4) + (-1)(-5.4) + 0 + (1)(4.6) + (2)(9.6) = 16.8 + 5.4 + 0 + 4.6 + 19.2 = 46$.
  • $\sum (x - \bar x)^2 = 4 + 1 + 0 + 1 + 4 = 10$.
  • $\beta_1 = 46 / 10 = 4.6$.
  • $\beta_0 = 60.4 - 4.6 \cdot 3 = 46.6$.

$\hat y = 46.6 + 4.6 x$. So each extra hour of study raises the predicted mark by ~4.6.

When SLR fails

  • Non-linear relationships (use polynomial / non-linear regression).
  • Multiple drivers (use multiple regression).
  • Categorical predictors (encode).
  • Heavy outliers (use robust regression).

Why it matters in data science

  • Simplest interpretable model — strong baseline.
  • Foundation for other linear methods (multiple regression, Logistic Regression, GLMs).
  • The closed-form solution needs only a single pass over the data to accumulate a few sums, so it trains quickly even on huge datasets.
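
One way to see why the closed form scales well is a single-pass accumulator. The sketch below is an illustration under stated assumptions, not a production implementation: the class name `StreamingSLR` is hypothetical, and it uses the raw-sum form $\beta_1 = \dfrac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2}$, which is algebraically equal to the centred form but can lose precision through cancellation when values are very large.

```python
class StreamingSLR:
    """Accumulates the running sums needed for the closed-form SLR fit."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xy = 0.0
        self.sum_xx = 0.0

    def update(self, x, y):
        """Fold one (x, y) observation into the running sums."""
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xy += x * y
        self.sum_xx += x * x

    def coefficients(self):
        """Return (b0, b1) from the raw-sum form of the closed-form solution."""
        denom = self.n * self.sum_xx - self.sum_x ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        b1 = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        b0 = (self.sum_y - b1 * self.sum_x) / self.n
        return b0, b1


# Feed the worked-example data one observation at a time.
model = StreamingSLR()
for x, y in zip([1, 2, 3, 4, 5], [52, 55, 60, 65, 70]):
    model.update(x, y)

b0, b1 = model.coefficients()
print(round(b0, 2), round(b1, 2))  # 46.6 4.6
```

Memory use is constant regardless of the number of rows, since only five numbers are stored.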
Mind Map
Visual structure of the concept
SIMPLE LINEAR REGRESSION
├── Model: ŷ = β₀ + β₁ x
├── Method: OLS (minimise SSE)
├── Closed-form
│   ├── β₁ = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)²
│   └── β₀ = ȳ − β₁ x̄
├── R² = 1 − SSE/SST
├── Assumptions (LINE)
│   ├── Linearity
│   ├── Independence
│   ├── Normality of residuals
│   └── Equal variance (homoscedasticity)
└── Diagnostics
    ├── Residual plot
    │   ├── U-shape → non-linear
    │   ├── Funnel → heteroscedasticity
    │   └── Trend → biased
    └── Q-Q plot for normality
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Write the equation of simple linear regression. $\hat y = \beta_0 + \beta_1 x$, where $\beta_0$ is the intercept and $\beta_1$ is the slope.

Q2. What is the least-squares criterion? Choose $\beta_0, \beta_1$ to minimise the sum of squared residuals $\sum (y_i - \hat y_i)^2$.

Q3. What does $R^2$ represent? The fraction of variance in $y$ explained by the regression model: $R^2 = 1 - \text{SSE}/\text{SST}$. Ranges 0 to 1; for SLR $R^2 = r^2$, where $r$ is the Pearson correlation.


Part B (20 marks)

Q. Explain Simple Linear Regression. Derive the OLS estimates of $\beta_0$ and $\beta_1$. Apply to the data: hours studied $x = (1, 2, 3, 4, 5)$, marks $y = (52, 55, 60, 65, 70)$. Discuss assumptions and interpretation of $R^2$.

Model. $\hat y = \beta_0 + \beta_1 x$, with residual $e_i = y_i - \hat y_i$.

Goal. Minimise $\text{SSE} = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$.

Derivation of OLS estimates.

Take partial derivatives and set to zero:

$$\dfrac{\partial \text{SSE}}{\partial \beta_0} = -2 \sum (y_i - \beta_0 - \beta_1 x_i) = 0$$

$$\dfrac{\partial \text{SSE}}{\partial \beta_1} = -2 \sum x_i (y_i - \beta_0 - \beta_1 x_i) = 0$$

From the first equation: $\beta_0 = \bar y - \beta_1 \bar x$.

Substitute into the second:

$$\sum x_i (y_i - \bar y + \beta_1 \bar x - \beta_1 x_i) = 0$$

$$\sum x_i (y_i - \bar y) = \beta_1 \sum x_i (x_i - \bar x)$$

Because $\sum (y_i - \bar y) = 0$ and $\sum (x_i - \bar x) = 0$, subtracting $\bar x$ times each of these zero sums changes nothing, so $\sum x_i (y_i - \bar y) = \sum (x_i - \bar x)(y_i - \bar y)$ and $\sum x_i (x_i - \bar x) = \sum (x_i - \bar x)^2$. Both sides are therefore centred sums, so:

$$\boxed{\beta_1 = \dfrac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}}, \quad \boxed{\beta_0 = \bar y - \beta_1 \bar x.}$$
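
The two first-order conditions say that the fitted residuals sum to zero and are orthogonal to $x$. A quick numerical check on the data from this question, as a sketch with illustrative variable names:

```python
# Data from this question (hours studied, marks).
xs = [1, 2, 3, 4, 5]
ys = [52, 55, 60, 65, 70]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form OLS estimates from the derivation.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# First normal equation: sum of residuals is zero.
sum_e = sum(residuals)
# Second normal equation: sum of x_i * e_i is zero.
sum_xe = sum(x * e for x, e in zip(xs, residuals))

# Compare against a small tolerance to allow for floating-point rounding.
print(abs(sum_e) < 1e-9, abs(sum_xe) < 1e-9)  # True True
```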

Application — given data.

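| $x$ | $y$ | $x - \bar x$ | $y - \bar y$ | $(x - \bar x)(y - \bar y)$ | $(x - \bar x)^2$ |
|---|---|---|---|---|---|
| 1 | 52 | −2 | −8.4 | 16.8 | 4 |
| 2 | 55 | −1 | −5.4 | 5.4 | 1 |
| 3 | 60 | 0 | −0.4 | 0 | 0 |
| 4 | 65 | 1 | 4.6 | 4.6 | 1 |
| 5 | 70 | 2 | 9.6 | 19.2 | 4 |
| Sum | 302 | | | 46.0 | 10 |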

$\bar x = 3, \; \bar y = 60.4$.

$\beta_1 = 46 / 10 = \boxed{4.6}$. $\beta_0 = 60.4 - 4.6 \cdot 3 = \boxed{46.6}$.

Fitted line: $\hat y = 46.6 + 4.6 x$. Each extra hour of study raises the predicted mark by ~4.6.

Predictions and residuals.

| $x$ | $y$ | $\hat y$ | residual $e$ |
|---|---|---|---|
| 1 | 52 | 51.2 | 0.8 |
| 2 | 55 | 55.8 | −0.8 |
| 3 | 60 | 60.4 | −0.4 |
| 4 | 65 | 65.0 | 0.0 |
| 5 | 70 | 69.6 | 0.4 |

SSE = $\sum e_i^2 = 0.64 + 0.64 + 0.16 + 0 + 0.16 = 1.60$.

SST = $\sum (y - \bar y)^2 = 70.56 + 29.16 + 0.16 + 21.16 + 92.16 = 213.2$.

$R^2 = 1 - 1.6 / 213.2 \approx 0.9925$.

Interpretation. $R^2 \approx 99.25\%$ — the model explains nearly all of the variance in marks. Excellent linear fit. (Cross-validate before trusting on new students.)

Assumptions (LINE).

  1. Linearity — relationship truly linear.
  2. Independence — observations independent (students don't copy answers).
  3. Normality of residuals — needed for inference (CIs, p-values), not for point estimates.
  4. Equal variance (homoscedasticity) — constant spread of residuals.

Diagnostic plots.

  • Residual vs fitted plot should look like random noise. A U-shape → missed non-linearity (try $x^2$). A funnel → heteroscedasticity (try log-transform).
  • Q-Q plot of residuals — checks normality.
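
The numbers behind a Q-Q plot can be sketched without a plotting library. The example below pairs the sorted residuals from this question with standard normal quantiles from the standard library's `statistics.NormalDist`. The plotting positions $(i - 0.5)/n$ are one common convention and are an assumption here; with only five points the check is illustrative at best.

```python
from statistics import NormalDist

# Data from this question (hours studied, marks).
xs = [1, 2, 3, 4, 5]
ys = [52, 55, 60, 65, 70]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

# Residuals of the fitted line, sorted for comparison with quantiles.
sorted_residuals = sorted(y - (b0 + b1 * x) for x, y in zip(xs, ys))

# Assumed plotting positions (i - 0.5) / n for i = 1..n.
probs = [(i - 0.5) / n for i in range(1, n + 1)]
theoretical = [NormalDist().inv_cdf(p) for p in probs]

# Each row is one point of the Q-Q plot: (normal quantile, residual).
for q, e in zip(theoretical, sorted_residuals):
    print(f"{q:+.3f}  {e:+.3f}")
```

If the residuals are roughly Gaussian, the printed pairs fall close to a straight line when plotted against each other.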