PGD01C02
Module 4 · Model Evaluation

Generalisation Error and Out-of-Sample Metrics

Core Titles
Key headlines and terms for quick recall
  • Generalisation error — expected loss on unseen data
  • Training vs Test error — train error optimistic
  • Bias / Variance / Irreducible noise decomposition
  • Out-of-sample (OOS) metrics — held-out test, CV
  • Optimism gap = test error − train error
  • Learning curves — error vs training size
  • Why generalisation is the real goal in ML
Basic Idea
What it is, why it matters, how it works

What is generalisation error?

Generalisation error is the expected loss a model incurs on data drawn from the same distribution as the training data — but not used in training:

$$\text{GenErr}(f) = E_{(X, Y) \sim D}[L(Y, f(X))].$$

Unlike training error (in-sample), generalisation error tells us how the model will perform in production.

Bias–Variance decomposition

For squared loss, where $f^*$ denotes the true (noise-free) regression function and the expectation is taken over training samples and noise:

$$E[(Y - \hat f(X))^2] = \underbrace{(E[\hat f] - f^*)^2}_{\text{Bias}^2} + \underbrace{E[(\hat f - E[\hat f])^2]}_{\text{Variance}} + \underbrace{\sigma^2_\varepsilon}_{\text{Irreducible}}.$$

  • Bias — error from wrong model assumptions (linear when reality is non-linear).
  • Variance — error from sensitivity to specific training sample.
  • Irreducible noise — intrinsic noise in $Y$; cannot be reduced.

Trade-off. Simpler models have high bias / low variance; complex models the reverse. The optimal complexity balances both.
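
The trade-off can be checked numerically. The sketch below is illustrative only: it assumes a synthetic "true" function $f^*(x) = \sin(2\pi x)$, Gaussian noise, and polynomial models of increasing degree (none of these come from the notes). It refits the model on many fresh training samples and estimates bias² and variance from the spread of predictions.

```python
import numpy as np

def bias_variance(degree, n_train=30, n_repeats=300, noise_sd=0.3, seed=0):
    """Monte Carlo estimate of (bias^2, variance, noise) for a polynomial fit."""
    rng = np.random.default_rng(seed)
    f_star = lambda x: np.sin(2 * np.pi * x)   # assumed true function
    x_test = np.linspace(0.05, 0.95, 50)       # fixed evaluation points
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        # A fresh training sample per repeat mimics "a random training set".
        x = rng.uniform(0, 1, n_train)
        y = f_star(x) + rng.normal(0, noise_sd, n_train)
        coefs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coefs, x_test)
    mean_pred = preds.mean(axis=0)             # estimate of E[f_hat(x)]
    bias2 = float(np.mean((mean_pred - f_star(x_test)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias2, variance, noise_sd ** 2

for d in (1, 3, 7):
    b2, var, noise = bias_variance(d)
    print(f"degree={d}: bias^2={b2:.3f}  variance={var:.3f}  noise={noise:.3f}")
```

Under these assumptions the degree-1 model should show the largest bias² and the degree-7 model the largest variance, matching the direction of the trade-off described above.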

Estimating generalisation error

1. Hold-out test set. Split data once (e.g., 80% train / 20% test). Fit on the training portion; score on the test portion. Simple, but the score is noisy on small data.

2. k-Fold cross-validation. Split into $k$ folds. Train on $k-1$, validate on the held-out fold. Repeat $k$ times; average. Better use of data.

3. Leave-One-Out CV (LOOCV). $k = n$. Almost unbiased but expensive.

4. Repeated / Stratified CV. Multiple runs, preserved class proportions.
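
A minimal sketch of k-fold cross-validation, written by hand with NumPy so each step is visible. The data, the polynomial model, and the function name `kfold_mse` are assumptions made for illustration. Setting `k = len(x)` gives LOOCV.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Average validation MSE of a polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))      # shuffle once, then cut into k folds
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train], y[train], degree)   # fit on k-1 folds
        resid = y[val] - np.polyval(coefs, x[val])       # score on held-out fold
        scores.append(np.mean(resid ** 2))
    return float(np.mean(scores))

# Synthetic data (assumed): noisy sine curve.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 60)

print("5-fold MSE:", kfold_mse(x, y, degree=3, k=5))
print("LOOCV MSE: ", kfold_mse(x, y, degree=3, k=len(x)))
```

In practice a library routine such as scikit-learn's `KFold` or `cross_val_score` would normally replace the hand-written loop; the loop is shown only to make the procedure explicit.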

Optimism gap

$$\text{Optimism} = \text{Test error} - \text{Train error}.$$

  • Small gap, both errors low → well-fit model.
  • Large gap → over-fitting.
  • Train and test both high → under-fitting.
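
The gap is straightforward to compute once both errors are available. The sketch below assumes synthetic data (a noisy sine curve) and polynomial models of degree 1, 3 and 9; these choices are illustrative and are not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: np.sin(2 * np.pi * x)   # assumed true function

# Small training set, large test set from the same distribution.
x_train = rng.uniform(0, 1, 20)
y_train = f_star(x_train) + rng.normal(0, 0.3, 20)
x_test = rng.uniform(0, 1, 2000)
y_test = f_star(x_test) + rng.normal(0, 0.3, 2000)

def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((y - np.polyval(coefs, x)) ** 2))

results = {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)
    train_err = mse(coefs, x_train, y_train)
    test_err = mse(coefs, x_test, y_test)
    results[degree] = (train_err, test_err, test_err - train_err)
    print(f"degree={degree}: train={train_err:.3f}  test={test_err:.3f}  "
          f"gap={test_err - train_err:.3f}")
```

The training error can only fall as the degree grows, because each polynomial family contains the previous one. The test error, and therefore the gap, is what reveals over-fitting.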

Learning curves

Plot training and CV errors as functions of:

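  • Training-set size — the two errors should converge towards each other; their common level approaches the irreducible error only when the model is flexible enough to have low bias.
  • Model complexity — train error typically keeps falling; CV error is typically U-shaped.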

These curves diagnose whether to add data, simplify the model, or call it done.
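
A sketch of how the size-based curve could be computed, assuming the same kind of synthetic sine data and polynomial models used for illustration only. The function name `learning_curve` here is a local helper, not a library call. It returns numbers rather than a plot, to stay dependency-free.

```python
import numpy as np

def learning_curve(degree, sizes, n_repeats=50, noise_sd=0.3, seed=0):
    """Return [(n, mean train MSE, mean validation MSE)] for each n in sizes."""
    rng = np.random.default_rng(seed)
    f_star = lambda x: np.sin(2 * np.pi * x)      # assumed true function
    x_val = rng.uniform(0, 1, 1000)               # fixed validation set
    y_val = f_star(x_val) + rng.normal(0, noise_sd, 1000)
    rows = []
    for n in sizes:
        train_errs, val_errs = [], []
        for _ in range(n_repeats):                # average out sampling noise
            x = rng.uniform(0, 1, n)
            y = f_star(x) + rng.normal(0, noise_sd, n)
            coefs = np.polyfit(x, y, degree)
            train_errs.append(np.mean((y - np.polyval(coefs, x)) ** 2))
            val_errs.append(np.mean((y_val - np.polyval(coefs, x_val)) ** 2))
        rows.append((n, float(np.mean(train_errs)), float(np.mean(val_errs))))
    return rows

for degree in (1, 3):
    print(f"degree={degree}")
    for n, tr, va in learning_curve(degree, sizes=(10, 40, 160)):
        print(f"  n={n:4d}  train={tr:.3f}  val={va:.3f}")
```

With these assumptions, the degree-1 curves should flatten at a high error (more data does not help), while the degree-3 curves should close their gap as n grows.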

Regression vs classification metrics

| Task | OOS metrics |
| --- | --- |
| Regression | RMSE, MAE, $R^2$, MAPE |
| Classification | Accuracy, F1, ROC-AUC, PR-AUC, log-loss |
| Ranking | NDCG, MAP, MRR |
| Detection | mAP, IoU |
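
For reference, a few of these metrics written out from their definitions. This is a sketch for binary labels in {0, 1}; the helper names and the toy arrays are made up for illustration. Libraries such as scikit-learn provide tested versions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RMSE, MAE and R^2 computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "r2": float(1.0 - ss_res / ss_tot),
    }

def classification_metrics(y_true, y_pred):
    """Accuracy and F1 for binary labels, with 1 as the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": float(np.mean(y_true == y_pred)), "f1": f1}

reg = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
clf = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(reg)   # rmse ~ 0.707, mae = 0.5, r2 = 0.9
print(clf)   # accuracy ~ 0.667, f1 ~ 0.667
```

These must be computed on held-out data to count as out-of-sample metrics; the same functions applied to training data only give the in-sample figure.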

Why generalisation is the real goal

Training error is easy to drive to zero: a lookup table memorises the training set perfectly. But generalisation is what the business cares about: how the model performs on tomorrow's new customers, transactions, and images. Every modelling choice should ultimately serve generalisation.

Mind Map
Visual structure of the concept
GENERALISATION ERROR
├── Definition — expected loss on unseen data
├── Bias–Variance decomposition
│   ├── Bias² — model too simple
│   ├── Variance — too sensitive to data
│   └── Irreducible noise
├── Estimation
│   ├── Held-out test
│   ├── k-Fold CV
│   ├── LOOCV
│   └── Repeated stratified CV
├── Optimism gap = OOS − train
│   ├── Large → overfitting
│   └── Both high → underfitting
├── Learning curves
│   ├── Error vs N (size)
│   └── Error vs complexity
└── Why care?
    └── Business wants production performance
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Define generalisation error. The expected loss of a model on data drawn from the same distribution as the training data but not used in training: $E_{(X, Y) \sim D}[L(Y, f(X))]$. It measures real-world (production) performance.

Q2. What is the bias-variance trade-off? A trade-off in model complexity: simple models have high bias / low variance (underfit); complex models have low bias / high variance (overfit). The optimal model minimises total error = bias² + variance + irreducible noise.

Q3. Why is training error not a good measure of model quality? Because the model is fit to that data — even a memorising lookup table can achieve zero training error while failing on new data. Out-of-sample error (test, CV) is needed.


Part B (20 marks)

Q. Explain generalisation error and the bias-variance trade-off. Discuss techniques to estimate out-of-sample error and methods to diagnose under- and over-fitting using learning curves.

Generalisation error.

The expected loss on unseen data from the same distribution: $\text{GenErr}(f) = E_{(X, Y) \sim D}[L(Y, f(X))].$

It is the quantity we actually care about — production performance — not training error.

Bias–Variance decomposition (squared loss).

For an estimator $\hat f$ trained on a random sample:

$$E[(Y - \hat f(X))^2] = \underbrace{(E[\hat f(X)] - f^*(X))^2}_{\text{Bias}^2} + \underbrace{\text{Var}(\hat f(X))}_{\text{Variance}} + \underbrace{\sigma_\varepsilon^2}_{\text{Noise}}.$$
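
A short derivation sketch of this decomposition, useful for a 20-mark answer. It rests on assumptions the notes do not state explicitly: an additive noise model $Y = f^*(X) + \varepsilon$ with $E[\varepsilon] = 0$ and $\text{Var}(\varepsilon) = \sigma_\varepsilon^2$, with $\varepsilon$ independent of the training sample, and expectations taken at a fixed input $X$ over training samples and noise.

```latex
\begin{aligned}
Y - \hat f(X)
  &= \varepsilon
   + \bigl(f^*(X) - E[\hat f(X)]\bigr)
   + \bigl(E[\hat f(X)] - \hat f(X)\bigr), \\[4pt]
E\bigl[(Y - \hat f(X))^2\bigr]
  &= E[\varepsilon^2]
   + \bigl(f^*(X) - E[\hat f(X)]\bigr)^2
   + E\bigl[(\hat f(X) - E[\hat f(X)])^2\bigr]
   + \text{(cross terms)}, \\[4pt]
\text{cross terms} &= 0
  \quad\text{since } E[\varepsilon] = 0,\;
  E\bigl[\hat f(X) - E[\hat f(X)]\bigr] = 0,\;
  \varepsilon \text{ independent of } \hat f, \\[4pt]
E\bigl[(Y - \hat f(X))^2\bigr]
  &= \bigl(E[\hat f(X)] - f^*(X)\bigr)^2
   + \text{Var}(\hat f(X))
   + \sigma_\varepsilon^2 .
\end{aligned}
```

The middle term of the first line is a constant at fixed $X$, which is why it contributes a square (the bias²) rather than a variance.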

| Source | Meaning | Reduce by |
| --- | --- | --- |
| Bias | Wrong model assumptions (e.g., linear when truth is curved) | More complex model, better features |
| Variance | Sensitivity to training sample | More data, simpler model, regularisation, ensembling |
| Noise | Intrinsic randomness in $Y$ | Irreducible |

Trade-off. As complexity grows: bias ↓, variance ↑. The total error is U-shaped — optimal complexity sits at the bottom.

Estimating generalisation error.

1. Hold-out test set. Simple 80/20 split. Cheap but noisy on small data.

2. k-Fold Cross-Validation. Split into $k$ folds; train on $k-1$, validate on the held-out fold; repeat $k$ times. The average is a good estimate of generalisation error. Default $k = 5$ or $10$.

3. Leave-One-Out CV (LOOCV). Maximum data use, but it requires $n$ model fits and is therefore expensive.

4. Repeated stratified k-fold. Average over multiple random splits to reduce variance of the estimate.

5. Time-series CV / Group k-fold. For sequential or grouped data — never let future / same-entity data leak into training.

6. Nested CV. An inner loop tunes hyperparameters and an outer loop evaluates, so the reported score is not biased by the tuning.
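
The leakage rules in item 5 can be made concrete with index generators. The two helpers below are hypothetical and hand-written for illustration; scikit-learn's `TimeSeriesSplit` and `GroupKFold` are the usual library equivalents. The first keeps validation strictly after training in time, and the second keeps every group on one side of the split.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, val_idx); validation always follows training in time."""
    fold = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = i * fold
        # The last split absorbs any leftover samples.
        val_end = n_samples if i == n_splits else train_end + fold
        yield np.arange(train_end), np.arange(train_end, val_end)

def group_kfold_splits(groups, k):
    """Yield (train_idx, val_idx) with each group confined to one side."""
    groups = np.asarray(groups)
    unique = np.unique(groups)
    for fold_groups in np.array_split(unique, k):
        val_mask = np.isin(groups, fold_groups)
        yield np.flatnonzero(~val_mask), np.flatnonzero(val_mask)

for train, val in expanding_window_splits(n_samples=20, n_splits=4):
    print("time split:  train up to", train.max(), "| val", val.min(), "to", val.max())

customer = np.array(["a", "a", "b", "c", "c", "c", "d", "b"])  # assumed entity ids
for train, val in group_kfold_splits(customer, k=2):
    print("group split: train", sorted(set(customer[train])),
          "| val", sorted(set(customer[val])))
```

Rows are assumed to be sorted by time for the first helper. The second helper assigns groups to folds in sorted order without balancing fold sizes, which a production splitter would normally handle.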

Optimism gap. $\text{Optimism} = \text{Test error} - \text{Train error}.$

  • Both errors low and close → well-fit, deploy.
  • Big gap (test ≫ train) → overfitting; simplify or regularise.
  • Both high → underfitting; add complexity or features.

Learning curves — diagnostic tool.

Plot training and validation error as functions of:

(a) Training-set size $n$.

  • Underfitting: both curves plateau at a high error → adding data won't help; the model is too simple.
  • Overfitting: training error stays very low while validation error stays high — large gap → add data, regularise, or simplify.
  • Well-fit: both curves converge to near-irreducible error.

(b) Model complexity (e.g., polynomial degree, tree depth).

  • Training error monotonically decreases.
  • Validation error U-shapes — falls then rises.
  • Optimal complexity = bottom of validation curve.

Worked diagnostic.

| Train error | Test/CV error | Diagnosis | Fix |
| --- | --- | --- | --- |
| 0.05 | 0.06 | Well-fit | Deploy |
| 0.01 | 0.40 | High variance / overfit | More data, regularise, prune |
| 0.30 | 0.32 | High bias / underfit | More features, more complex model |
| 0.30 | 0.50 | Both | Investigate data quality |
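
The table's logic can be written as a small rule. The thresholds below (`high_error=0.2`, `gap_tol=0.1`) are assumptions chosen only so that the four rows above are reproduced; what counts as a "high" error depends on the task and the loss scale.

```python
def diagnose(train_err, test_err, high_error=0.2, gap_tol=0.1):
    """Classify a (train, test) error pair; thresholds are illustrative."""
    high_bias = train_err > high_error                  # poor fit even in-sample
    high_variance = (test_err - train_err) > gap_tol    # large optimism gap
    if high_bias and high_variance:
        return "both"
    if high_variance:
        return "overfit"
    if high_bias:
        return "underfit"
    return "well-fit"

for train_err, test_err in [(0.05, 0.06), (0.01, 0.40), (0.30, 0.32), (0.30, 0.50)]:
    print(train_err, test_err, "->", diagnose(train_err, test_err))
```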

Take-away. Always evaluate out-of-sample. Generalisation — not training fit — is the real goal of every ML system.