PGD01C02
Module 4 · Model Evaluation

Visualization: Residual Plot and Distribution Plot

Core Titles
Key headlines and terms for quick recall
  • Residual plot — residuals $e = y - \hat y$ vs fitted $\hat y$
  • Ideal: random scatter around 0 (white-noise)
  • Patterns indicate violations:
    • U-shape → non-linearity
    • Funnel → heteroscedasticity
    • Trend → bias / missed feature
  • Distribution plot — overlay of predicted vs actual densities
  • Q-Q plot — checks residual normality
  • Scale-location plot — homoscedasticity check
  • Cook's distance — influential points
Basic Idea
What it is, why it matters, how it works

Residual plot

After fitting a regression model, compute residuals $e_i = y_i - \hat y_i$ and plot them against the fitted values $\hat y_i$ (or against each predictor).

Ideal residual plot = random horizontal noise around 0, no pattern.
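As a minimal sketch of the computation above, the following fits ordinary least squares to synthetic data and builds the residuals. The data, the seed, and the helper name `plot_residuals` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): a truly linear relationship plus noise.
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=200)

# Ordinary least squares with an explicit intercept column.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta        # fitted values
residuals = y - y_hat   # e_i = y_i - y_hat_i


def plot_residuals(y_hat, residuals):
    """Scatter residuals against fitted values with a reference line at 0."""
    import matplotlib.pyplot as plt  # lazy import: the numeric part runs without it

    fig, ax = plt.subplots()
    ax.scatter(y_hat, residuals, s=12, alpha=0.7)
    ax.axhline(0.0, color="black", linewidth=1)
    ax.set_xlabel("fitted value")
    ax.set_ylabel("residual")
    ax.set_title("Residuals vs fitted")
    return fig
```

Calling `plot_residuals(y_hat, residuals)` and then `matplotlib.pyplot.show()` should display a structureless cloud here, because the simulated relationship really is linear with constant noise.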

Common patterns and what they mean.

| Pattern | Cause | Fix |
|---|---|---|
| Random scatter | Model OK | None needed |
| U-shape / parabolic | Missing non-linearity | Add an $x^2$ term or use a non-linear model |
| Funnel (variance grows with $\hat y$) | Heteroscedasticity | Log/Box-Cox transform of $y$; weighted least squares |
| Trend (linear slope in residuals) | Missing variable or bias | Add omitted feature |
| Step pattern | Categorical effect not modelled | Encode and include |
| Clustering | Sub-populations | Mixed models or group-specific terms |
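The U-shape row can be checked numerically as well as visually. The sketch below assumes synthetic quadratic data and uses the correlation between the residuals and a centred squared term as a rough stand-in for "the residuals look curved"; the helper `ols_residuals` is a hypothetical name, not a library function.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumed): the true relationship has a quadratic component.
x = rng.uniform(-3, 3, size=300)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(0, 1.0, size=300)


def ols_residuals(X, y):
    """Residuals of an ordinary least squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


# Centred squared term: large at both ends of x, small in the middle.
curvature = (x - x.mean()) ** 2

# Straight-line fit: residuals still carry the missing x^2 structure.
X_lin = np.column_stack([np.ones_like(x), x])
corr_linear = np.corrcoef(ols_residuals(X_lin, y), curvature)[0, 1]

# After adding the x^2 column the residuals are orthogonal to it.
X_quad = np.column_stack([np.ones_like(x), x, x**2])
corr_quadratic = np.corrcoef(ols_residuals(X_quad, y), curvature)[0, 1]

print(f"linear fit:    corr(residual, curvature) = {corr_linear:.3f}")
print(f"quadratic fit: corr(residual, curvature) = {corr_quadratic:.3f}")
```

A strong correlation for the straight-line fit corresponds to the smile-shaped residual plot; after the fix it drops to numerical zero.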

Distribution plot

Overlay the distribution of predicted values $\hat y$ with the distribution of actual values $y$, for example as KDE curves drawn with seaborn (which plots onto matplotlib axes).

  • If they overlap closely → model captures the marginal distribution.
  • A shift means systematic over- or under-prediction.
  • A shape mismatch (e.g., model is unimodal but reality is bimodal) flags missed structure.

This is a fast visual check for regression. For classification, the analogous check compares the predicted-probability distributions of the classes.
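A minimal sketch of the overlay, assuming synthetic data and a plain least-squares fit. The two summary numbers (`shift`, `spread_ratio`) are illustrative names introduced here to quantify what the eye reads off the curves. `sns.kdeplot` is seaborn's KDE function; everything else is NumPy.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic regression problem (assumed for illustration).
x = rng.uniform(0, 10, size=500)
y = 5.0 + 2.0 * x + rng.normal(0, 4.0, size=500)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# Numeric companions to the plot:
# shift        -> horizontal offset between the two densities
# spread_ratio -> how much narrower (or wider) the predicted density is
shift = y_hat.mean() - y.mean()
spread_ratio = y_hat.std() / y.std()


def plot_distributions(y, y_hat):
    """Overlay KDE curves of actual and predicted values."""
    import matplotlib.pyplot as plt  # lazy imports: only needed for plotting
    import seaborn as sns

    fig, ax = plt.subplots()
    sns.kdeplot(x=y, ax=ax, label="actual")
    sns.kdeplot(x=y_hat, ax=ax, label="predicted")
    ax.set_xlabel("target value")
    ax.legend()
    return fig
```

For least squares with an intercept the shift is zero by construction, and the predicted density is narrower than the actual one whenever noise is present, so some mismatch in spread is expected even from a correctly specified model.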

Q-Q plot

Plots the empirical quantiles of residuals against the theoretical quantiles of a normal distribution.

  • Straight diagonal → residuals are normal (OLS inference valid).
  • S-curve (ends bend in opposite directions) → heavy or light tails.
  • Bow or arc (both ends bend the same way) → skew.
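The Q-Q coordinates can be computed with NumPy and the standard library alone. This is a sketch under stated assumptions: plotting positions $(i - 0.5)/n$ are one common convention among several, the residuals are standardised so the reference line is $y = x$, and `qq_points` is a hypothetical helper name.

```python
import numpy as np
from statistics import NormalDist


def qq_points(residuals):
    """Return (theoretical normal quantiles, standardised sorted residuals)."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    probs = (np.arange(1, n + 1) - 0.5) / n  # plotting positions in (0, 1)
    std_normal = NormalDist()                # standard normal, mean 0, sd 1
    theoretical = np.array([std_normal.inv_cdf(float(p)) for p in probs])
    empirical = (r - r.mean()) / r.std(ddof=1)
    return theoretical, empirical


rng = np.random.default_rng(3)
normal_res = rng.normal(0, 2.0, size=500)            # well-behaved residuals
skewed_res = rng.exponential(2.0, size=500) - 2.0    # right-skewed residuals

t_n, e_n = qq_points(normal_res)
t_s, e_s = qq_points(skewed_res)

# Correlation of the Q-Q points: close to 1 means "nearly a straight line".
r_normal = np.corrcoef(t_n, e_n)[0, 1]
r_skewed = np.corrcoef(t_s, e_s)[0, 1]
print(f"normal: {r_normal:.4f}  skewed: {r_skewed:.4f}")
```

Plotting `e_n` against `t_n` should hug the diagonal, while `e_s` against `t_s` should bow away from it at both ends.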

Scale-location (spread-location) plot

Square root of absolute standardised residuals against fitted values.

  • Flat horizontal trend ⇒ constant variance.
  • Upward / downward trend ⇒ heteroscedasticity.
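A sketch of the scale-location quantities, assuming internally standardised residuals $r_i = e_i / (s\sqrt{1 - h_{ii}})$ with leverages $h_{ii}$ taken from a QR decomposition. The data are synthetic, with noise that deliberately grows with $x$; `standardised_residuals` is a hypothetical helper name.

```python
import numpy as np


def standardised_residuals(X, y):
    """Fitted values and internally standardised residuals of an OLS fit."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    e = y - fitted
    Q, _ = np.linalg.qr(X)         # reduced QR: Q has shape (n, p)
    h = np.sum(Q**2, axis=1)       # leverages = diagonal of the hat matrix
    s2 = e @ e / (n - p)           # residual variance estimate
    return fitted, e / np.sqrt(s2 * (1.0 - h))


rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=2000)
# Assumed noise model: standard deviation proportional to x (a funnel).
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=2000) * 0.5 * x

X = np.column_stack([np.ones_like(x), x])
fitted, r_std = standardised_residuals(X, y)

scale = np.sqrt(np.abs(r_std))            # y-axis of the scale-location plot
trend = np.corrcoef(fitted, scale)[0, 1]  # crude summary of the trend
print(f"corr(fitted, sqrt|r|) = {trend:.3f}")
```

Scattering `scale` against `fitted` gives the plot itself; the clearly positive correlation here corresponds to the upward trend that signals heteroscedasticity.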

Cook's distance / leverage

Identifies influential points: observations whose removal significantly changes the fitted coefficients. A Cook's distance $> 4/n$, where $n$ is the number of observations, is a common threshold for further inspection.
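One standard closed form is $D_i = \frac{e_i^2}{p\,s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$, with $p$ fitted coefficients, residual variance estimate $s^2$ and leverage $h_{ii}$. The sketch below implements that formula on synthetic data with one deliberately planted bad point; `cooks_distance` is a hypothetical helper, not a library call.

```python
import numpy as np


def cooks_distance(X, y):
    """Cook's distance for every observation of an OLS fit."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)       # leverages
    s2 = e @ e / (n - p)           # residual variance estimate
    return (e**2 / (p * s2)) * h / (1.0 - h) ** 2


rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=50)

# Planted point (assumed): far out in x and far off the true line.
x = np.append(x, 25.0)
y = np.append(y, 20.0)

X = np.column_stack([np.ones_like(x), x])
d = cooks_distance(X, y)

threshold = 4.0 / len(y)               # the 4/n rule of thumb
flagged = np.flatnonzero(d > threshold)
print("flagged indices:", flagged)
```

The planted observation sits at index 50 and should dominate the distances. Other points may also cross the threshold, which is why the rule prompts inspection rather than automatic deletion.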

Why these plots matter

Numbers like $R^2$ can hide problems. Anscombe's quartet famously shows four different datasets with nearly the same $R^2$, mean, and variance; only the plots reveal how different they are. Always plot residuals.
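The claim can be checked directly on the quartet's published values. The sketch below computes the summary statistics for each of the four datasets; the dictionary layout and variable names are choices made here for illustration.

```python
import numpy as np

# Anscombe's quartet (datasets I-III share the same x values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (xs, ys) in quartet.items():
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    slope, intercept = np.polyfit(xs, ys, 1)   # least-squares line
    r = np.corrcoef(xs, ys)[0, 1]
    summary[name] = {
        "mean_y": ys.mean(),
        "var_y": ys.var(ddof=1),
        "slope": slope,
        "intercept": intercept,
        "r2": r**2,
    }

for name, stats in summary.items():
    print(name, {k: round(float(v), 3) for k, v in stats.items()})
```

The four rows of output agree to about two decimal places, yet scatter plots of the same data show a noisy line, a smooth curve, a line with one outlier, and a vertical stack with one high-leverage point.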

Mind Map
Visual structure of the concept
RESIDUAL & DISTRIBUTION PLOTS
├── Residual plot (e vs ŷ)
│   ├── Random  → model OK
│   ├── U-shape → non-linearity
│   ├── Funnel → heteroscedasticity
│   ├── Trend → missed feature
│   └── Clusters → sub-populations
├── Distribution plot
│   └── Compare densities of y and ŷ
├── Q-Q plot
│   └── Residual normality check
├── Scale-location plot
│   └── Homoscedasticity
└── Cook's distance
    └── Influential points
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. What is a residual plot? A scatter plot of residuals $e_i = y_i - \hat y_i$ against fitted values $\hat y_i$ (or against a predictor). It diagnoses model fit; the ideal pattern is random scatter around zero.

Q2. A residual plot shows a U-shaped pattern. What does this imply? That the regression model is missing a non-linear component. The linear model is underfitting a curved relationship; add polynomial / interaction terms, transform features, or switch to a non-linear model (decision tree, kernel regression).

Q3. What is a Q-Q plot used for? To check whether residuals follow a normal distribution. A straight diagonal indicates approximate normality; deviations at the ends indicate skew or heavy tails.


Part B (20 marks)

Q. Explain the role of residual plots and distribution plots in model evaluation. Discuss the common residual-plot patterns and their interpretations.

Why plot residuals?

Summary statistics like $R^2$ can hide model defects. The famous Anscombe's quartet (1973) shows four datasets with nearly identical mean, variance, correlation and $R^2$, yet plots reveal them as wildly different. The lesson: numbers without plots lie.

A residual plot exposes whether the assumptions of regression hold:

  • Linearity — relationship is truly linear.
  • Independence of residuals.
  • Equal variance (homoscedasticity).
  • Normality of residuals.

Residual plot patterns.

| Pattern | What it looks like | Likely cause | Remedy |
|---|---|---|---|
| Random scatter | Cloud around 0 | Model OK | None needed |
| U-shape / parabolic | Smile or frown | Missing non-linearity | Add $x^2$, log/sqrt features, switch to non-linear model |
| Funnel | Spread widens with $\hat y$ | Heteroscedasticity | Log-transform $y$, weighted least squares |
| Linear trend | Slope in residuals | Omitted variable bias | Add the missing predictor |
| Step pattern | Jumps at certain $\hat y$ | Categorical effect ignored | Include the categorical feature |
| Clustering | Distinct groups | Mixed sub-populations | Mixed-effect model, group-specific intercepts |

Distribution plot.

Overlay KDE of $y$ (actual) and $\hat y$ (predicted).

  • Coinciding curves → model captures the marginal distribution.
  • Shift → systematic over- or under-prediction.
  • Shape mismatch (e.g., predicted unimodal, actual bimodal) → model missing a categorical effect or non-linearity.

In classification, the same idea applies to predicted probabilities: plot the distribution of $P(y = 1 \mid x)$ separately for each of the two classes. For a good classifier the two distributions should separate cleanly.
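A rough sketch of the classification version. The scores come from a hand-picked logistic function rather than a fitted model, purely to keep the example dependency-free; the weight 2.0, the class means and the name `plot_probabilities` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000

# Synthetic binary problem (assumed): one feature, class means at -1 and +1.
labels = rng.integers(0, 2, size=n)
feature = rng.normal(loc=np.where(labels == 1, 1.0, -1.0), scale=1.0)

# Stand-in for a model's P(y = 1 | x): logistic function, hand-picked weight.
proba = 1.0 / (1.0 + np.exp(-2.0 * feature))

p_pos = proba[labels == 1]   # predicted probabilities for true class 1
p_neg = proba[labels == 0]   # predicted probabilities for true class 0
separation = p_pos.mean() - p_neg.mean()


def plot_probabilities(p_neg, p_pos):
    """Overlay per-class histograms of predicted probability."""
    import matplotlib.pyplot as plt  # lazy import: only needed for plotting

    fig, ax = plt.subplots()
    bins = np.linspace(0.0, 1.0, 21)
    ax.hist(p_neg, bins=bins, alpha=0.5, density=True, label="class 0")
    ax.hist(p_pos, bins=bins, alpha=0.5, density=True, label="class 1")
    ax.set_xlabel("predicted P(y = 1 | x)")
    ax.legend()
    return fig
```

Histograms are used instead of KDE curves because probabilities are bounded to [0, 1] and a KDE would smear mass outside that interval. Heavy overlap between the two histograms indicates weak class separation.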

Q-Q plot.

Quantiles of residuals vs quantiles of $\mathcal{N}(0, 1)$.

  • Straight diagonal → residuals are approximately normal; OLS inference (CIs, p-values) is valid.
  • S-curve (ends bend in opposite directions) → heavy or light tails (kurtosis ≠ 3).
  • Bow or arc (both ends bend the same way) → skewness.

Scale-location plot. $\sqrt{|\text{standardised residual}|}$ vs $\hat y$. Flat horizontal trend = constant variance. Upward or downward trend = heteroscedasticity.

Cook's distance and leverage. Identifies observations with large influence on coefficients. Inspect any with Cook's distance $> 4/n$; they may be valid extreme observations or data-entry errors.

Worked example.

A linear model predicts house prices. Residual plot shows a clear funnel — residuals are tiny for cheap houses and huge for expensive ones. Diagnosis: heteroscedasticity. Fix: log-transform price, refit. The new residual plot is a random cloud → homoscedasticity restored; CIs now valid.
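The worked example can be imitated on simulated data. This sketch assumes prices with multiplicative noise (so spread scales with price level) and a single hypothetical feature `size`; `funnel_score`, the correlation between fitted values and absolute residuals, is an ad hoc summary introduced here, not a standard test.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000

# Simulated houses (assumed): log-price is linear in size with constant noise,
# so raw price has noise that grows with the price level.
size = rng.uniform(0, 10, size=n)
log_price = 4.0 + 0.2 * size + rng.normal(0, 0.3, size=n)
price = np.exp(log_price)


def funnel_score(X, target):
    """Correlation between fitted values and |residuals| of an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    fitted = X @ beta
    return np.corrcoef(fitted, np.abs(target - fitted))[0, 1]


X = np.column_stack([np.ones(n), size])

funnel_raw = funnel_score(X, price)          # model fitted to raw price
funnel_log = funnel_score(X, np.log(price))  # model refitted to log(price)

print(f"raw price: {funnel_raw:.3f}")
print(f"log price: {funnel_log:.3f}")
```

A clearly positive score on the raw scale matches the funnel in the residual plot, and a score near zero after the log transform matches the random cloud. Real data rarely follow the multiplicative-noise assumption this exactly, so the residual plot should be redrawn after any transform.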

Take-away. Always plot residuals after fitting. They reveal what summary numbers cannot.