Visualization: Residual Plot and Distribution Plot
Core Titles
Key headlines and terms for quick recall
- Residual plot: residuals vs fitted values
  - Ideal: random scatter around 0 (white noise)
  - Patterns indicate violations:
    - U-shape → non-linearity
    - Funnel → heteroscedasticity
    - Trend → bias / missed feature
- Distribution plot: overlay of predicted vs actual densities
- Q-Q plot: checks residual normality
- Scale-location plot: homoscedasticity check
- Cook's distance: influential points
Basic Idea
What it is, why it matters, how it works
Residual plot
After fitting a regression model, compute the residuals e = y − ŷ (actual minus fitted) and plot them against the fitted values ŷ (or against each predictor).
Ideal residual plot = random horizontal noise around 0, no pattern.
Common patterns and what they mean.
| Pattern | Cause | Fix |
|---|---|---|
| Random scatter | Model OK | – |
| U-shape / parabolic | Missing non-linearity | Add polynomial terms (e.g., x²) or use a non-linear model |
| Funnel (variance grows with ŷ) | Heteroscedasticity | Log/Box-Cox transform of y; weighted least squares |
| Trend (linear slope in residuals) | Missing variable or bias | Add omitted feature |
| Step pattern | Categorical effect not modelled | Encode and include |
| Clustering | Sub-populations | Mixed models or group-specific terms |
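The U-shape row of the table can be reproduced with a small sketch. The data here are synthetic and purely illustrative (a quadratic truth fitted with a straight line), and the helper names `fit_and_residuals` and `plot_residuals` are hypothetical, not from any library.

```python
import numpy as np

def fit_and_residuals(x, y):
    """Fit y ~ a + b*x by least squares; return fitted values and residuals."""
    slope, intercept = np.polyfit(x, y, deg=1)  # highest degree first
    fitted = intercept + slope * x
    return fitted, y - fitted

def plot_residuals(fitted, residuals):
    """Scatter residuals against fitted values with a reference line at 0."""
    import matplotlib.pyplot as plt  # imported here so the numeric part runs without it
    fig, ax = plt.subplots()
    ax.scatter(fitted, residuals, s=12, alpha=0.7)
    ax.axhline(0.0, color="black", linewidth=1)
    ax.set_xlabel("Fitted values")
    ax.set_ylabel("Residuals")
    ax.set_title("Residuals vs fitted")
    return fig

# Synthetic data (an assumption for illustration): the truth is quadratic,
# so a straight-line fit should leave a U-shaped residual pattern.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(0, 0.5, size=x.size)

fitted, residuals = fit_and_residuals(x, y)

# Numeric echo of the visual U-shape: residuals are positive at both ends
# of the x range and negative in the middle.
ends_mean = residuals[np.abs(x) > 2].mean()
middle_mean = residuals[np.abs(x) < 1].mean()
```

Calling `plot_residuals(fitted, residuals)` draws the plot itself. Adding an `x**2` column to the fit should flatten the pattern back to a random cloud.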
Distribution plot
Overlay the distribution of predicted values ŷ with the distribution of actual values y. Use KDE plots from seaborn / matplotlib.
- If they overlap closely → model captures the marginal distribution.
- A shift means systematic over- or under-prediction.
- A shape mismatch (e.g., model is unimodal but reality is bimodal) flags missed structure.
This is a fast visual check for regression; for classification, the analogue is to compare the predicted-probability distributions by class.
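A minimal sketch of the overlay, assuming synthetic data in which the actual values are bimodal and a hypothetical model predicts a single, shifted hump. The KDE is hand-rolled with a normal-reference bandwidth so the numbers can be inspected; in practice `seaborn.kdeplot` draws the same kind of curve. The names `gaussian_kde_1d` and `plot_overlay` are illustrative.

```python
import numpy as np

def gaussian_kde_1d(sample, grid):
    """Evaluate a Gaussian kernel density estimate of `sample` on `grid`.

    Bandwidth: normal-reference rule 1.06 * sd * n^(-1/5) (one common choice).
    """
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    bandwidth = 1.06 * sample.std(ddof=1) * n ** (-1 / 5)
    z = (grid[:, None] - sample[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

def plot_overlay(grid, density_actual, density_pred):
    """Draw both density curves on the same axes."""
    import matplotlib.pyplot as plt  # imported here so the numeric part runs without it
    fig, ax = plt.subplots()
    ax.plot(grid, density_actual, label="actual y")
    ax.plot(grid, density_pred, label="predicted y-hat")
    ax.set_xlabel("Value")
    ax.set_ylabel("Density")
    ax.legend()
    return fig

# Synthetic data (assumptions for illustration only).
rng = np.random.default_rng(1)
y_actual = np.concatenate([rng.normal(10, 1, 300), rng.normal(16, 1, 300)])  # bimodal
y_pred = rng.normal(14.0, 2.0, 600)  # hypothetical model output: unimodal, shifted

grid = np.linspace(4, 22, 400)
density_actual = gaussian_kde_1d(y_actual, grid)
density_pred = gaussian_kde_1d(y_pred, grid)

# Shift: positive means the model over-predicts on average.
shift = y_pred.mean() - y_actual.mean()
# Each curve should integrate to roughly 1 over the grid.
step = grid[1] - grid[0]
area_actual = density_actual.sum() * step
area_pred = density_pred.sum() * step
```

In the resulting plot the actual curve has two peaks with a dip between them, while the predicted curve has one peak in the dip: a shape mismatch plus a shift.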
Q-Q plot
Plots the empirical quantiles of residuals against the theoretical quantiles of a normal distribution.
- Straight diagonal → residuals are approximately normal (OLS inference valid).
- S-curve → heavy or light tails.
- Bow shape (both ends bend the same way off the line) → skew.
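The Q-Q construction can be sketched with the standard library's `statistics.NormalDist` for the theoretical quantiles. The plotting positions `(i - 0.5) / n` are one common convention among several, and the two residual samples are synthetic stand-ins, not output from a real model.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical, empirical) quantile pairs for a normal Q-Q plot."""
    r = np.sort(np.asarray(residuals, dtype=float))
    r = (r - r.mean()) / r.std(ddof=1)           # standardise
    n = r.size
    probs = (np.arange(1, n + 1) - 0.5) / n      # plotting positions (one convention)
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, r

# Synthetic residuals (assumptions for illustration only).
rng = np.random.default_rng(2)
normal_resid = rng.normal(0, 1, 500)
skewed_resid = rng.exponential(1.0, 500) - 1.0   # right-skewed, mean near 0

t_norm, e_norm = qq_points(normal_resid)
t_skew, e_skew = qq_points(skewed_resid)

# Correlation of the Q-Q points: close to 1 when they hug the diagonal.
r_norm = np.corrcoef(t_norm, e_norm)[0, 1]
r_skew = np.corrcoef(t_skew, e_skew)[0, 1]

# Right skew bends both ends above the diagonal (empirical > theoretical).
low_end_above = e_skew[0] > t_skew[0]
high_end_above = e_skew[-1] > t_skew[-1]
```

Plotting `e_norm` against `t_norm` gives a near-straight diagonal, while the skewed sample bows away from the line at both ends in the same direction.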
Scale-location (spread-location) plot
Square root of absolute standardised residuals against fitted values.
- Flat horizontal trend ⇒ constant variance.
- Upward / downward trend ⇒ heteroscedasticity.
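A sketch of the quantities behind a scale-location plot, assuming ordinary least squares with an intercept and internally standardised residuals (each residual divided by its estimated standard deviation, using the leverage). The data are synthetic with noise that grows with x, and `standardised_residuals` is a hypothetical helper.

```python
import numpy as np

def standardised_residuals(X, y):
    """OLS with intercept; return fitted values and internally standardised residuals."""
    X1 = np.column_stack([np.ones(len(y)), X])        # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    fitted = X1 @ beta
    resid = y - fitted
    # Leverages h_ii: diagonal of the hat matrix X1 @ pinv(X1)
    hat_diag = np.einsum("ij,ji->i", X1, np.linalg.pinv(X1))
    n, p = X1.shape
    sigma2 = resid @ resid / (n - p)                  # residual variance estimate
    return fitted, resid / np.sqrt(sigma2 * (1.0 - hat_diag))

# Synthetic funnel (an assumption for illustration): noise sd proportional to x.
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 300)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5 * x)

fitted, std_resid = standardised_residuals(x, y)
scale_loc = np.sqrt(np.abs(std_resid))                # y-axis of the plot

# A clearly positive correlation corresponds to the upward trend in the plot.
trend = np.corrcoef(fitted, scale_loc)[0, 1]
```

Scattering `scale_loc` against `fitted` gives the plot. With constant-variance noise the same `trend` value would sit near zero.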
Cook's distance / leverage
Identifies influential points: observations whose removal significantly changes the fitted coefficients. A Cook's distance above a rule-of-thumb cutoff (commonly D > 1 or D > 4/n) is a common trigger for further inspection.
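Cook's distance can be computed directly from the residuals and leverages of an OLS fit. The sketch below uses the closed form D_i = e_i² / (p·s²) · h_ii / (1 − h_ii)², with p the number of coefficients and s² the residual variance. The data are synthetic with one planted outlier, and the 4/n cutoff is a common rule of thumb chosen here as an assumption.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation of an OLS fit with intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    hat_diag = np.einsum("ij,ji->i", X1, np.linalg.pinv(X1))  # leverages h_ii
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    n, p = X1.shape
    s2 = resid @ resid / (n - p)
    return resid**2 / (p * s2) * hat_diag / (1.0 - hat_diag) ** 2

# Synthetic data (assumption for illustration): 50 well-behaved points...
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, 50)
# ...plus one planted point that is far out in x (high leverage)
# and far below the trend (large residual).
x = np.append(x, 25.0)
y = np.append(y, 10.0)

d = cooks_distance(x, y)
most_influential = int(np.argmax(d))          # index of the planted point
flagged = np.flatnonzero(d > 4 / len(y))      # rule-of-thumb cutoff (assumption)
```

Flagged points are candidates for inspection, not automatic deletions: the planted point here would be removed or corrected only after checking whether it is a data-entry error.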
Why these plots matter
Summary numbers like R² can hide problems. Anscombe's quartet famously shows four different datasets with nearly the same R², mean, and variance; only the plots reveal how different they are. Always plot residuals.
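The Anscombe point can be checked numerically. The sketch below uses datasets I and II of the quartet (they share the same x values) and compares their summary statistics with their residuals from a straight-line fit. The helper names are illustrative.

```python
import numpy as np

# Anscombe's quartet, datasets I and II (shared x values).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def r_squared(x, y, deg=1):
    """R² of a polynomial least-squares fit of the given degree."""
    resid = y - np.polyval(np.polyfit(x, y, deg), x)
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def linear_residuals(x, y):
    """Residuals from a straight-line fit."""
    return y - np.polyval(np.polyfit(x, y, 1), x)

# Summary statistics: (mean, sample variance, R² of the linear fit).
summary = {
    name: (y.mean(), y.var(ddof=1), r_squared(x, y))
    for name, y in [("I", y1), ("II", y2)]
}

# Residuals of dataset II ordered by x: negative at both ends, positive in
# the middle, which is the "frown" variant of the U-shape.
resid_II_sorted = linear_residuals(x, y2)[np.argsort(x)]

# A quadratic fit explains dataset II almost perfectly, but not dataset I.
r2_quad_I = r_squared(x, y1, deg=2)
r2_quad_II = r_squared(x, y2, deg=2)
```

The two rows of `summary` agree to about two decimal places, yet the residuals of dataset II trace a clear curve that the summary numbers give no hint of.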
Mind Map
Visual structure of the concept
RESIDUAL & DISTRIBUTION PLOTS
├── Residual plot (e vs ŷ)
│   ├── Random → model OK
│   ├── U-shape → non-linearity
│   ├── Funnel → heteroscedasticity
│   ├── Trend → missed feature
│   └── Clusters → sub-populations
├── Distribution plot
│   └── Compare densities of y and ŷ
├── Q-Q plot
│   └── Residual normality check
├── Scale-location plot
│   └── Homoscedasticity
└── Cook's distance
    └── Influential points
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions
Part A (2 marks each)
Q1. What is a residual plot?
A scatter plot of residuals against fitted values (or against a predictor). It diagnoses model fit; the ideal pattern is random scatter around zero.
Q2. A residual plot shows a U-shaped pattern. What does this imply?
The regression model is missing a non-linear component. The linear model is underfitting a curved relationship; add polynomial / interaction terms, transform features, or switch to a non-linear model (decision tree, kernel regression).
Q3. What is a Q-Q plot used for?
To check whether residuals follow a normal distribution. A straight diagonal indicates approximate normality; deviations at the ends indicate skew or heavy tails.
Part B (20 marks)
Q. Explain the role of residual plots and distribution plots in model evaluation. Discuss the common residual-plot patterns and their interpretations.
Why plot residuals?
Summary statistics like R² can hide model defects. The famous Anscombe's quartet (1973) shows four datasets with nearly identical mean, variance, correlation and R², yet plots reveal them as wildly different. The lesson: numbers without plots can mislead.
A residual plot exposes whether the assumptions of regression hold:
- Linearity — relationship is truly linear.
- Independence of residuals.
- Equal variance (homoscedasticity).
- Normality of residuals.
Residual plot patterns.
| Pattern | What it looks like | Likely cause | Remedy |
|---|---|---|---|
| Random scatter | Cloud around 0 | Model OK | – |
| U-shape / parabolic | Smile or frown | Missing non-linearity | Add x², log/sqrt features, switch to non-linear model |
| Funnel | Spread widens with ŷ | Heteroscedasticity | Log-transform y, weighted least squares |
| Linear trend | Slope in residuals | Omitted variable bias | Add the missing predictor |
| Step pattern | Jumps at certain values | Categorical effect ignored | Include the categorical feature |
| Clustering | Distinct groups | Mixed sub-populations | Mixed-effect model, group-specific intercepts |
Distribution plot.
Overlay KDE of y (actual) and ŷ (predicted).
- Coinciding curves → model captures the marginal distribution.
- Shift → systematic over- or under-prediction.
- Shape mismatch (e.g., predicted unimodal, actual bimodal) → model missing a categorical effect or non-linearity.
In classification, the same idea applies to predicted probabilities: plot their distributions separately for the two classes; for a well-performing classifier the two should separate cleanly.
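For the classification case, a rough sketch: histogram the predicted probability of class 1 separately for each true class and measure how much the two histograms overlap. The probabilities come from two imaginary classifiers simulated with beta distributions, and `overlap_area` is a hypothetical summary, not a standard metric name.

```python
import numpy as np

def class_probability_histograms(probs, labels, bins=20):
    """Density histograms of predicted P(class 1), split by true class."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    h0, _ = np.histogram(probs[labels == 0], bins=edges, density=True)
    h1, _ = np.histogram(probs[labels == 1], bins=edges, density=True)
    return edges, h0, h1

def overlap_area(edges, h0, h1):
    """Area shared by the two histograms: 0 = fully separated, 1 = identical."""
    return float(np.sum(np.minimum(h0, h1) * np.diff(edges)))

rng = np.random.default_rng(5)
labels = np.repeat([0, 1], 500)

# Hypothetical predicted probabilities (assumptions for illustration only).
good_probs = np.where(labels == 0, rng.beta(2, 8, 1000), rng.beta(8, 2, 1000))
weak_probs = np.where(labels == 0, rng.beta(4, 5, 1000), rng.beta(5, 4, 1000))

good_overlap = overlap_area(*class_probability_histograms(good_probs, labels))
weak_overlap = overlap_area(*class_probability_histograms(weak_probs, labels))
```

Plotted as two overlaid histograms or KDE curves, the first classifier shows two well-separated humps near 0 and 1, while the second shows two humps stacked almost on top of each other.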
Q-Q plot.
Quantiles of residuals vs quantiles of Normal.
- Straight diagonal: residuals are approximately normal; OLS inference (CIs, p-values) valid.
- S-curve: heavy or light tails (kurtosis ≠ 3).
- Bow shape (both ends bend the same way off the line): skewness.
Scale-location plot. √|standardised residuals| vs ŷ. Flat horizontal trend = constant variance. Upward trend = heteroscedasticity.
Cook's distance and leverage. Identifies observations with large influence on coefficients. Inspect any with Cook's distance above a rule-of-thumb cutoff (commonly D > 1 or D > 4/n); they may be valid extreme observations or data-entry errors.
Worked example.
A linear model predicts house prices. Residual plot shows a clear funnel — residuals are tiny for cheap houses and huge for expensive ones. Diagnosis: heteroscedasticity. Fix: log-transform price, refit. The new residual plot is a random cloud → homoscedasticity restored; CIs now valid.
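The worked example can be mimicked on simulated data. The sketch assumes prices with multiplicative noise (so errors scale with the price level) and a single hypothetical predictor `size_m2`. The `funnel_score`, the correlation between |residual| and fitted value, is an ad hoc numeric stand-in for eyeballing the funnel.

```python
import numpy as np

def fit_line(x, y):
    """Straight-line least-squares fit; return fitted values and residuals."""
    slope, intercept = np.polyfit(x, y, 1)
    fitted = intercept + slope * x
    return fitted, y - fitted

def funnel_score(fitted, resid):
    """Correlation between |residual| and fitted value; near 0 means no funnel."""
    return float(np.corrcoef(fitted, np.abs(resid))[0, 1])

# Simulated houses (assumptions for illustration only).
rng = np.random.default_rng(6)
size_m2 = rng.uniform(50, 250, 1000)
# Multiplicative noise: the error is a percentage of the price, not a fixed amount.
price = np.exp(11.5 + 0.005 * size_m2 + rng.normal(0, 0.3, 1000))

score_raw = funnel_score(*fit_line(size_m2, price))          # model on raw price
score_log = funnel_score(*fit_line(size_m2, np.log(price)))  # model on log(price)
```

`score_raw` comes out clearly positive (a funnel), while `score_log` sits near zero. Note that after the transform, predictions and intervals are on the log(price) scale and need to be converted back to prices for reporting.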
Take-away. Always plot residuals after fitting. They reveal what summary numbers cannot.