PGDDSA Study · Semester 1

PGD01C02

Module 4 · Model Evaluation

Visualization: Residual Plot and Distribution Plot

Core Titles

Key headlines and terms for quick recall

Residual plot — residuals $e = y - \hat y$ vs fitted $\hat y$
Ideal: random scatter around 0 (white-noise)
Patterns indicate violations:
- U-shape → non-linearity
- Funnel → heteroscedasticity
- Trend → bias / missed feature
Distribution plot — overlay of predicted vs actual densities
Q-Q plot — checks residual normality
Scale-location plot — homoscedasticity check
Cook's distance — influential points

Basic Idea

What it is, why it matters, how it works

Residual plot

After fitting a regression model, compute residuals $e_i = y_i - \hat y_i$ and plot them against the fitted values $\hat y_i$ (or against each predictor).

Ideal residual plot = random horizontal noise around 0, no pattern.

Common patterns and what they mean.

Pattern	Cause	Fix
Random scatter	Model OK	–
U-shape / parabolic	Missing non-linearity	Add $x^2$ or non-linear model
Funnel (variance grows with $\hat y$)	Heteroscedasticity	Log/Box-Cox transform $y$; weighted least squares
Trend (linear slope in residuals, e.g. against an omitted predictor or on held-out data)	Missing variable or bias	Add omitted feature
Step pattern	Categorical effect not modelled	Encode and include
Clustering	Sub-populations	Mixed models or group-specific terms
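The U-shape row of the table can be reproduced with a small sketch. Everything here is an illustrative assumption: the data are synthetic (a curved relationship fitted with a straight line), and `plot_residuals` is a hypothetical helper name, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): the true relationship is curved.
x = np.linspace(0.0, 10.0, 200)
y = 2.0 + 0.5 * x**2 + rng.normal(scale=2.0, size=x.size)

# Fit a straight line by least squares and compute residuals e = y - y_hat.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

# A U-shape shows up as positive curvature when a quadratic is fitted
# to the residuals as a function of the fitted values.
curvature = np.polyfit(fitted, residuals, deg=2)[0]
print(f"mean residual = {residuals.mean():.2e}, curvature = {curvature:.4f}")


def plot_residuals(fitted, residuals):
    """Scatter residuals against fitted values with a zero reference line."""
    import matplotlib.pyplot as plt  # lazy import: the numbers above need only NumPy

    fig, ax = plt.subplots()
    ax.scatter(fitted, residuals, s=10)
    ax.axhline(0.0, color="black", linewidth=1)
    ax.set_xlabel("fitted value")
    ax.set_ylabel("residual")
    return ax
```

Calling `plot_residuals(fitted, residuals)` should show the smile shape. Adding an $x^2$ term to the fit is the remedy listed in the table.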

Distribution plot

Overlay the distribution of predicted values $\hat y$ with the distribution of actual values $y$. A kernel density estimate (KDE) per series works well, for example seaborn's kdeplot drawn on matplotlib axes.

If they overlap closely → model captures the marginal distribution.
A shift means systematic over- or under-prediction.
A shape mismatch (e.g., model is unimodal but reality is bimodal) flags missed structure.

This is a fast visual check for regression. In classification the analogue is to compare the predicted-probability distributions of each class.
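A minimal sketch of the overlay idea, under stated assumptions: the data are synthetic (bimodal actual values, predictions that run about 2 units high), and the KDE is a small hand-rolled Gaussian estimator with a normal-reference bandwidth, so the numbers can be checked without a plotting library. `gaussian_kde_1d` and `plot_densities` are hypothetical helper names.

```python
import numpy as np


def gaussian_kde_1d(samples, grid):
    """Gaussian kernel density estimate on `grid` (normal-reference bandwidth)."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    bandwidth = 1.06 * samples.std(ddof=1) * n ** (-0.2)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2.0 * np.pi))


rng = np.random.default_rng(1)
# Illustrative data: bimodal actual values, predictions shifted upward by about 2.
y_actual = np.concatenate([rng.normal(10.0, 1.0, 300), rng.normal(20.0, 1.0, 300)])
y_pred = y_actual + 2.0 + rng.normal(scale=0.5, size=y_actual.size)

grid = np.linspace(min(y_actual.min(), y_pred.min()) - 10.0,
                   max(y_actual.max(), y_pred.max()) + 10.0, 500)
density_actual = gaussian_kde_1d(y_actual, grid)
density_pred = gaussian_kde_1d(y_pred, grid)

# A horizontal shift between the curves corresponds to a non-zero mean error.
mean_shift = y_pred.mean() - y_actual.mean()
# Sanity check: an estimated density should integrate to roughly 1.
area_actual = density_actual.sum() * (grid[1] - grid[0])
print(f"mean shift = {mean_shift:.2f}, area under KDE = {area_actual:.3f}")


def plot_densities(grid, density_actual, density_pred):
    """Overlay the two density curves on one set of axes."""
    import matplotlib.pyplot as plt  # lazy import: only needed for drawing

    fig, ax = plt.subplots()
    ax.plot(grid, density_actual, label="actual y")
    ax.plot(grid, density_pred, label="predicted y_hat")
    ax.set_xlabel("value")
    ax.set_ylabel("density")
    ax.legend()
    return ax
```

Here the two curves have the same shape but the predicted one sits to the right, which is the "shift" case: systematic over-prediction.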

Q-Q plot

Plots the empirical quantiles of residuals against the theoretical quantiles of a normal distribution.

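Straight diagonal → residuals are approximately normal (the usual OLS confidence intervals and p-values can be trusted).
S-curve (ends bend in opposite directions) → heavy or light tails.
Bow-shaped curve (both ends bend the same way) → skew.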
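The Q-Q construction can be sketched in a few lines. Assumptions: residuals are simulated rather than taken from a fitted model, the plotting positions $(i - 0.5)/n$ are one common convention among several, and `normal_qq_points` is a hypothetical helper name.

```python
import numpy as np
from statistics import NormalDist


def normal_qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    # Standardise so that the reference line is y = x.
    sample_q = (r - r.mean()) / r.std(ddof=1)
    # Plotting positions (i - 0.5) / n for i = 1..n.
    probs = (np.arange(1, n + 1) - 0.5) / n
    std_normal = NormalDist()
    theoretical_q = np.array([std_normal.inv_cdf(p) for p in probs])
    return theoretical_q, sample_q


rng = np.random.default_rng(2)
normal_resid = rng.normal(size=500)       # should hug the diagonal
skewed_resid = rng.exponential(size=500)  # right-skewed: bow-shaped Q-Q plot

tq_n, sq_n = normal_qq_points(normal_resid)
tq_s, sq_s = normal_qq_points(skewed_resid)

# Correlation of the Q-Q points: close to 1 when they lie on a straight line.
corr_normal = np.corrcoef(tq_n, sq_n)[0, 1]
corr_skewed = np.corrcoef(tq_s, sq_s)[0, 1]
print(f"normal: {corr_normal:.4f}, skewed: {corr_skewed:.4f}")
```

Scattering `sq_n` against `tq_n` gives the plot itself. The correlation is only a rough numeric summary of how straight the points are and does not replace looking at the plot.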

Scale-location (spread-location) plot

Square root of absolute standardised residuals against fitted values.

Flat horizontal trend ⇒ constant variance.
Upward / downward trend ⇒ heteroscedasticity.
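A sketch of the quantities behind a scale-location plot, assuming ordinary least squares with an explicit intercept column and synthetic data whose noise grows with $x$. "Standardised residual" is taken here to mean $e_i / (s\sqrt{1 - h_{ii}})$, with $h_{ii}$ the leverage; `standardised_residuals` is a hypothetical helper name.

```python
import numpy as np


def standardised_residuals(X, y):
    """Standardised OLS residuals and fitted values (X includes the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    n, p = X.shape
    # Leverages h_ii: diagonal of the hat matrix H = X X^+.
    leverage = np.einsum("ij,ji->i", X, np.linalg.pinv(X))
    s2 = resid @ resid / (n - p)  # residual variance estimate
    return resid / np.sqrt(s2 * (1.0 - leverage)), fitted


rng = np.random.default_rng(3)
# Illustrative heteroscedastic data: the noise standard deviation is 0.5 * x.
x = rng.uniform(1.0, 10.0, size=300)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x)

X = np.column_stack([np.ones_like(x), x])
r, fitted = standardised_residuals(X, y)
scale_loc = np.sqrt(np.abs(r))  # the y-axis of the scale-location plot

# Slope of a straight line through the scale-location points.
trend = np.polyfit(fitted, scale_loc, deg=1)[0]
print(f"scale-location slope = {trend:.4f}")
```

A clearly positive slope matches the upward trend described above. With constant-variance noise the slope would be close to zero.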

Cook's distance / leverage

Identifies influential points: observations whose removal significantly changes the fitted coefficients. Cook's distance $> 4/n$, where $n$ is the number of observations, is a common rule-of-thumb threshold for further inspection.
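A minimal sketch of the $4/n$ screening rule, assuming OLS and the standard closed form $D_i = \dfrac{e_i^2}{p\,s^2}\cdot\dfrac{h_{ii}}{(1 - h_{ii})^2}$, where $p$ is the number of fitted coefficients, $s^2$ the residual variance estimate and $h_{ii}$ the leverage. The data are synthetic with one planted outlier, and `cooks_distance` is a hypothetical helper name.

```python
import numpy as np


def cooks_distance(X, y):
    """Cook's distance for each observation of an OLS fit (X includes the intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, p = X.shape
    leverage = np.einsum("ij,ji->i", X, np.linalg.pinv(X))  # h_ii
    s2 = resid @ resid / (n - p)
    return resid**2 * leverage / (p * s2 * (1.0 - leverage) ** 2)


rng = np.random.default_rng(4)
x = rng.uniform(0.0, 10.0, size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=50)
# Plant one high-leverage point far from the line (illustrative assumption).
x = np.append(x, 25.0)
y = np.append(y, 10.0)

X = np.column_stack([np.ones_like(x), x])
d = cooks_distance(X, y)
threshold = 4 / len(y)
flagged = np.flatnonzero(d > threshold)
print("flagged indices:", flagged, "most influential:", d.argmax())
```

Flagged points are candidates for inspection, not automatic deletions.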

Why these plots matter

Numbers like $R^2$ can hide problems. Anscombe's quartet famously shows four different datasets with nearly the same $R^2$, mean, and variance; only the plots reveal how different they are. Always plot residuals.
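The claim is easy to check numerically. The sketch below uses the published quartet values (Anscombe, 1973) and prints the summary statistics of each dataset; the variable names are illustrative.

```python
import numpy as np

x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
quartet = {
    "I": (x123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])),
    "II": (x123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])),
    "III": (x123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])),
    "IV": (x4, np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])),
}

summary = {}
for name, (x, y) in quartet.items():
    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line
    r = np.corrcoef(x, y)[0, 1]
    summary[name] = {
        "mean_y": y.mean(),
        "var_y": y.var(ddof=1),
        "slope": slope,
        "intercept": intercept,
        "r2": r**2,
    }
    print(name, {k: round(float(v), 3) for k, v in summary[name].items()})
```

The four rows of output agree to about two decimal places, yet a scatter plot of each pair shows a linear cloud, a curve, a line with one outlier, and a vertical stack with one leverage point.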

Mind Map

Visual structure of the concept

RESIDUAL & DISTRIBUTION PLOTS
├── Residual plot (e vs ŷ)
│   ├── Random  → model OK
│   ├── U-shape → non-linearity
│   ├── Funnel → heteroscedasticity
│   ├── Trend → missed feature
│   └── Clusters → sub-populations
├── Distribution plot
│   └── Compare densities of y and ŷ
├── Q-Q plot
│   └── Residual normality check
├── Scale-location plot
│   └── Homoscedasticity
└── Cook's distance
    └── Influential points

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. What is a residual plot? A scatter plot of residuals $e_i = y_i - \hat y_i$ against fitted values $\hat y_i$ (or against a predictor). It diagnoses model fit — ideal pattern is random scatter around zero.

Q2. A residual plot shows a U-shaped pattern. What does this imply? That the regression model is missing a non-linear component. The linear model is underfitting a curved relationship; add polynomial / interaction terms, transform features, or switch to a non-linear model (decision tree, kernel regression).

Q3. What is a Q-Q plot used for? To check whether residuals follow a normal distribution. A straight diagonal indicates approximate normality; deviations at the ends indicate skew or heavy tails.

Part B (20 marks)

Q. Explain the role of residual plots and distribution plots in model evaluation. Discuss the common residual-plot patterns and their interpretations.

Why plot residuals?

Summary statistics like $R^2$ can hide model defects. The famous Anscombe's quartet (1973) shows four datasets with nearly identical mean, variance, correlation and $R^2$, yet plots reveal them as wildly different. The lesson: numbers without plots can mislead.

A residual plot exposes whether the assumptions of regression hold:

Linearity — relationship is truly linear.
Independence of residuals.
Equal variance (homoscedasticity).
Normality of residuals.

Residual plot patterns.

Pattern	What it looks like	Likely cause	Remedy
Random scatter	Cloud around 0	Model OK	–
U-shape / parabolic	Smile or frown	Missing non-linearity	Add $x^2$, log/sqrt features, switch to non-linear model
Funnel	Spread widens with $\hat y$	Heteroscedasticity	Log-transform $y$, weighted least squares
Linear trend	Slope in residuals	Omitted variable bias	Add the missing predictor
Step pattern	Jumps at certain $\hat y$	Categorical effect ignored	Include the categorical feature
Clustering	Distinct groups	Mixed sub-populations	Mixed-effect model, group-specific intercepts

Distribution plot.

Overlay KDE of $y$ (actual) and $\hat y$ (predicted).

Coinciding curves → model captures the marginal distribution.
Shift → systematic over- or under-prediction.
Shape mismatch (e.g., predicted unimodal, actual bimodal) → model missing a categorical effect or non-linearity.

In classification, the same idea applies to predicted probabilities: plot the distribution of $P(y = 1 | x)$ separately for each true class. For a model that discriminates well, the two distributions should separate cleanly.
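A sketch of the classification version, with loudly stated assumptions: the "predicted probabilities" are simulated Beta draws standing in for a real classifier's output, and the overlap coefficient is just one convenient way to put a number on how well the two histograms separate.

```python
import numpy as np

rng = np.random.default_rng(5)
# Illustrative scores only: stand-ins for a classifier's predicted P(y = 1 | x).
p_class0 = rng.beta(2.0, 5.0, size=1000)  # true class 0: mostly low probabilities
p_class1 = rng.beta(5.0, 2.0, size=1000)  # true class 1: mostly high probabilities

bins = np.linspace(0.0, 1.0, 21)
hist0, _ = np.histogram(p_class0, bins=bins, density=True)
hist1, _ = np.histogram(p_class1, bins=bins, density=True)

# Overlap coefficient: shared area under the two histograms
# (0 = perfectly separated, 1 = identical distributions).
overlap = np.minimum(hist0, hist1).sum() * (bins[1] - bins[0])
print(f"mean p | class 0 = {p_class0.mean():.2f}, "
      f"mean p | class 1 = {p_class1.mean():.2f}, overlap = {overlap:.2f}")
```

Plotting `hist0` and `hist1` on shared axes gives the two-class distribution plot. A large overlap means many observations receive similar probabilities regardless of their true class.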

Q-Q plot.

Quantiles of the standardised residuals vs quantiles of the standard normal $N(0, 1)$.

Straight diagonal: residuals are approximately normal; OLS inference (CIs, p-values) is valid.
S-curve: heavy or light tails (kurtosis ≠ 3).
Bow-shaped curve (both extremes bend the same way): skewness.

Scale-location plot. $\sqrt{|\text{standardised residual}|}$ vs $\hat y$. Flat horizontal trend = constant variance. Upward or downward trend = heteroscedasticity.

Cook's distance and leverage. Identifies observations with large influence on coefficients. Inspect any with Cook's distance $> 4/n$ — they may be valid extreme observations or data-entry errors.

Worked example.

A linear model predicts house prices. The residual plot shows a clear funnel: residuals are tiny for cheap houses and huge for expensive ones. Diagnosis: heteroscedasticity. Fix: log-transform price and refit (the model now predicts log-price). The new residual plot is a random cloud → variance is roughly constant on the log scale, so the usual CIs can be trusted.
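The worked example can be imitated with synthetic data. Assumptions: a single hypothetical predictor (`area`), multiplicative noise so that spread grows with price, and the correlation between $|e|$ and $\hat y$ used as a rough numeric stand-in for "funnel or no funnel".

```python
import numpy as np


def abs_resid_vs_fitted_corr(x, y):
    """Fit y = a + b*x by least squares; return corr(|residual|, fitted value)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    return np.corrcoef(np.abs(y - fitted), fitted)[0, 1]


rng = np.random.default_rng(6)
# Illustrative data: errors are multiplicative, so spread grows with price.
area = rng.uniform(50.0, 300.0, size=500)
log_price = 11.0 + 0.006 * area + rng.normal(scale=0.25, size=area.size)
price = np.exp(log_price)

funnel_raw = abs_resid_vs_fitted_corr(area, price)      # raw price: funnel expected
funnel_log = abs_resid_vs_fitted_corr(area, log_price)  # log price: roughly flat
print(f"raw price: {funnel_raw:.2f}, log price: {funnel_log:.2f}")
```

A clearly positive value on the raw scale and a value near zero on the log scale correspond to the funnel before the transform and the random cloud after it.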

Take-away. Always plot residuals after fitting. They reveal what summary numbers cannot.
