PGDDSA Study · Semester 1

PGD01C02

Module 4 · Model Evaluation

Generalisation Error and Out-of-Sample Metrics

Core Titles

Key headlines and terms for quick recall

Generalisation error — expected loss on unseen data
Training vs Test error — train error optimistic
Bias / Variance / Irreducible noise decomposition
Out-of-sample (OOS) metrics — held-out test, CV
Optimism gap = test error − train error
Learning curves — error vs training size
Why generalisation is the real goal in ML

Basic Idea

What it is, why it matters, how it works

What is generalisation error?

Generalisation error is the expected loss a model incurs on data drawn from the same distribution as the training data — but not used in training:

$\text{GenErr}(f) = E_{(X, Y) \sim D}[L(Y, f(X))].$

Unlike training error (in-sample), generalisation error tells us how the model will perform in production.
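When the data distribution is known, as in a simulation, the expectation in the definition can be approximated by averaging the loss over many fresh draws. The sketch below assumes a toy distribution ($X$ uniform on $[0, 1]$, $Y = 2X + \varepsilon$ with noise s.d. 0.5) and a fixed, slightly wrong model; `sample_xy`, `estimate_gen_error` and `f` are illustrative names, not part of the notes.

```python
import random

def sample_xy(rng, noise_sd=0.5):
    """Draw one (x, y) pair from an assumed toy distribution D."""
    x = rng.uniform(0.0, 1.0)
    y = 2.0 * x + rng.gauss(0.0, noise_sd)
    return x, y

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def estimate_gen_error(f, n_samples=50_000, seed=0):
    """Monte Carlo estimate of E[L(Y, f(X))] using fresh draws from D."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x, y = sample_xy(rng)
        total += squared_loss(y, f(x))
    return total / n_samples

def f(x):
    """A fixed, slightly wrong model (the true slope is 2.0)."""
    return 1.5 * x

gen_err = estimate_gen_error(f)
# Analytic value for this toy setup: 0.25 * E[X^2] + 0.5^2 = 0.25/3 + 0.25
print(round(gen_err, 3))
```

In practice $D$ is unknown, which is why held-out data stands in for the fresh draws.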

Bias–Variance decomposition

For squared loss, with $Y = f^*(X) + \varepsilon$ and the expectation taken over training samples and noise: $E[(Y - \hat f(X))^2] = \underbrace{(E[\hat f(X)] - f^*(X))^2}_{\text{Bias}^2} + \underbrace{E[(\hat f(X) - E[\hat f(X)])^2]}_{\text{Variance}} + \underbrace{\sigma^2_\varepsilon}_{\text{Irreducible}}.$

Bias — error from wrong model assumptions (linear when reality is non-linear).
Variance — error from sensitivity to specific training sample.
Irreducible noise — intrinsic noise in $Y$; cannot be reduced.

Trade-off. Simpler models have high bias / low variance; complex models the reverse. The optimal complexity balances both.
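The trade-off can be seen numerically by refitting two extreme models on many fresh training sets and looking at their predictions at one input. This is a minimal sketch under assumed settings: true function $f^*(x) = x^2$, noise s.d. 0.3, 30 training points, evaluation at $x_0 = 0.9$. The "mean predictor" stands in for a too-simple model and 1-nearest-neighbour for a too-flexible one; all names are illustrative.

```python
import random
import statistics

def f_star(x):
    """Assumed true regression function for this toy example."""
    return x * x

def draw_training_set(rng, n=30, noise_sd=0.3):
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [f_star(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_mean(xs, ys):
    """Very simple model: ignore x, always predict the training mean."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y

def fit_1nn(xs, ys):
    """Very flexible model: predict the y of the nearest training x."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict

def bias2_and_variance(fit, x0=0.9, n_repeats=500, seed=0):
    """Refit on many fresh training sets and summarise predictions at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_repeats):
        xs, ys = draw_training_set(rng)
        preds.append(fit(xs, ys)(x0))
    bias2 = (statistics.fmean(preds) - f_star(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias2, variance

bias2_mean, var_mean = bias2_and_variance(fit_mean)
bias2_1nn, var_1nn = bias2_and_variance(fit_1nn)
print(f"mean predictor: bias^2={bias2_mean:.3f}, variance={var_mean:.3f}")
print(f"1-NN predictor: bias^2={bias2_1nn:.3f}, variance={var_1nn:.3f}")
```

The mean predictor should show the larger squared bias and the 1-NN predictor the larger variance, which is the pattern the trade-off describes.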

Estimating generalisation error

1. Hold-out test set. Split the data once (e.g., 80% train / 20% test). Train on the training portion; score on the test portion. Simple, but the score is noisy on small data.

2. k-Fold cross-validation. Split the data into $k$ folds. Train on $k-1$ folds and validate on the held-out fold. Repeat $k$ times so that each fold is held out once, then average the scores. Makes better use of the data than a single split.

3. Leave-One-Out CV (LOOCV). The case $k = n$. Almost unbiased, but expensive: it needs $n$ model fits.

4. Repeated / Stratified CV. Repeated CV averages over several random splits; stratified CV preserves class proportions in every fold.
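The first three estimators can be sketched from scratch in a few lines. This is an illustrative, standard-library-only version, assuming toy data ($y = 1 + 2x$ plus Gaussian noise) and a one-feature least-squares line as the model; `fit_line`, `k_fold_indices` and `cross_val_mse` are hypothetical helper names. In real projects a library such as scikit-learn would normally handle the splitting.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (closed form for one feature)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def mse(model, xs, ys):
    return statistics.fmean([(y - model(x)) ** 2 for x, y in zip(xs, ys)])

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_mse(xs, ys, k=5, seed=0):
    """Average validation MSE over k folds; k = len(xs) gives LOOCV."""
    scores = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mse(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    return statistics.fmean(scores)

# Toy data (assumed): y = 1 + 2x + Gaussian noise with s.d. 0.5.
rng = random.Random(42)
xs = [rng.uniform(0.0, 1.0) for _ in range(100)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

# 1. Single 80/20 hold-out split.
order = list(range(len(xs)))
random.Random(0).shuffle(order)
train_idx, test_idx = order[:80], order[80:]
model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
holdout_mse = mse(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])

# 2. 5-fold CV and 3. LOOCV reuse the same routine.
cv5_mse = cross_val_mse(xs, ys, k=5)
loocv_mse = cross_val_mse(xs, ys, k=len(xs))
print(f"hold-out={holdout_mse:.3f}  5-fold={cv5_mse:.3f}  LOOCV={loocv_mse:.3f}")
```

All three numbers estimate the same quantity; the hold-out figure rests on only 20 points, so it is the one most likely to move if the split seed changes.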

Optimism gap

$\text{Optimism} = \text{Test error} - \text{Train error}.$

Small gap with both errors low → well-fit model.
Large gap → over-fitting.
Train and test both high → under-fitting.
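An extreme case makes the gap concrete: a model that simply memorises its training pairs. The sketch below is illustrative only, assuming toy data ($y = 3x$ plus noise) and a hypothetical `LookupTable` class that falls back to the training mean for inputs it has never seen.

```python
import random
import statistics

class LookupTable:
    """Memorises training pairs; falls back to the training mean otherwise."""

    def fit(self, xs, ys):
        self.table = dict(zip(xs, ys))
        self.default = statistics.fmean(ys)
        return self

    def predict(self, x):
        return self.table.get(x, self.default)

def mse(model, xs, ys):
    return statistics.fmean([(y - model.predict(x)) ** 2 for x, y in zip(xs, ys)])

def make_data(rng, n):
    """Assumed toy data: y = 3x + Gaussian noise with s.d. 0.2."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0.0, 0.2) for x in xs]
    return xs, ys

rng = random.Random(1)
x_train, y_train = make_data(rng, 50)
x_test, y_test = make_data(rng, 50)

model = LookupTable().fit(x_train, y_train)
train_error = mse(model, x_train, y_train)   # every training x is in the table
test_error = mse(model, x_test, y_test)      # unseen x falls back to the mean
optimism = test_error - train_error
print(f"train={train_error:.3f}  test={test_error:.3f}  optimism={optimism:.3f}")
```

Training error is zero by construction, so the whole test error shows up as optimism.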

Learning curves

Plot training and CV errors as functions of:

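Training-set size — the two errors should converge towards each other as $n$ grows; the common limit is close to the irreducible error only if the model is flexible enough (low bias).
Model complexity — train error typically keeps falling; CV error is U-shaped.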

These curves diagnose whether to add data, simplify the model, or call it done.
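A learning curve against training-set size can be computed by refitting at several sizes and scoring on a fixed validation set. This is a minimal sketch under assumed settings: the truth is linear ($y = 1 + 2x$, noise s.d. 0.5) and the model is a least-squares line, so the model is well specified and both curves should approach the noise variance of 0.25. Helper names are illustrative.

```python
import random
import statistics

def make_data(rng, n, noise_sd=0.5):
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (closed form for one feature)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def mse(model, xs, ys):
    return statistics.fmean([(y - model(x)) ** 2 for x, y in zip(xs, ys)])

def learning_curve(sizes, n_repeats=50, seed=0):
    """Return (n, mean train MSE, mean validation MSE) for each size."""
    rng = random.Random(seed)
    x_val, y_val = make_data(rng, 2000)       # fixed validation set
    curve = []
    for n in sizes:
        train_errs, val_errs = [], []
        for _ in range(n_repeats):            # average out sampling noise
            xs, ys = make_data(rng, n)
            model = fit_line(xs, ys)
            train_errs.append(mse(model, xs, ys))
            val_errs.append(mse(model, x_val, y_val))
        curve.append((n, statistics.fmean(train_errs), statistics.fmean(val_errs)))
    return curve

curve = learning_curve([5, 20, 80, 320])
for n, train_err, val_err in curve:
    print(f"n={n:4d}  train={train_err:.3f}  validation={val_err:.3f}")
```

Training error should rise and validation error fall as $n$ grows, with the gap between them shrinking.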

Regression vs classification metrics

Task	OOS metrics
Regression	RMSE, MAE, $R^2$, MAPE
Classification	Accuracy, F1, ROC-AUC, PR-AUC, log-loss
Ranking	NDCG, MAP, MRR
Detection	mAP, IoU
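A few of the metrics in the table are short enough to write out directly. The sketch below uses the standard textbook definitions with plain Python lists; the function names and the tiny example arrays are illustrative, and a library implementation would normally be used in practice.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def accuracy(y_true, y_pred):
    """Fraction of labels predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Regression example: errors are 1, 0, -2.
reg_true, reg_pred = [3.0, 5.0, 7.0], [2.0, 5.0, 9.0]
print(rmse(reg_true, reg_pred), mae(reg_true, reg_pred), r2(reg_true, reg_pred))

# Classification example: 2 true positives, 1 false positive, 1 false negative.
cls_true, cls_pred = [1, 0, 1, 1, 0], [1, 0, 0, 1, 1]
print(accuracy(cls_true, cls_pred), f1(cls_true, cls_pred))
```

These are out-of-sample metrics only when `y_true` and `y_pred` come from data the model was not trained on.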

Why generalisation is the real goal

Training error is easy to drive to zero — a lookup table memorises training perfectly. But generalisation is what the business cares about: how the model performs on tomorrow's new customers, transactions, images. Every modelling choice should ultimately serve generalisation.

Mind Map

Visual structure of the concept

GENERALISATION ERROR
├── Definition — expected loss on unseen data
├── Bias–Variance decomposition
│   ├── Bias² — model too simple
│   ├── Variance — too sensitive to data
│   └── Irreducible noise
├── Estimation
│   ├── Held-out test
│   ├── k-Fold CV
│   ├── LOOCV
│   └── Repeated stratified CV
├── Optimism gap = OOS − train
│   ├── Large → overfitting
│   └── Both high → underfitting
├── Learning curves
│   ├── Error vs N (size)
│   └── Error vs complexity
└── Why care?
    └── Business wants production performance

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Define generalisation error. The expected loss of a model on data drawn from the same distribution as the training data but not used in training: $E_{(X, Y) \sim D}[L(Y, f(X))]$. It measures real-world (production) performance.

Q2. What is the bias-variance trade-off? A trade-off in model complexity: simple models have high bias / low variance (underfit); complex models have low bias / high variance (overfit). The optimal model minimises total error = bias² + variance + irreducible noise.

Q3. Why is training error not a good measure of model quality? Because the model is fit to that data — even a memorising lookup table can achieve zero training error while failing on new data. Out-of-sample error (test, CV) is needed.

Part B (20 marks)

Q. Explain generalisation error and the bias-variance trade-off. Discuss techniques to estimate out-of-sample error and methods to diagnose under- and over-fitting using learning curves.

Generalisation error.

The expected loss on unseen data from the same distribution $D$: $\text{GenErr}(f) = E_{(X, Y) \sim D}[L(Y, f(X))].$

It is the quantity we actually care about — production performance — not training error.

Bias–Variance decomposition (squared loss).

For an estimator $\hat f$ trained on a random sample: $E[(Y - \hat f(X))^2] = \underbrace{(E[\hat f(X)] - f^*(X))^2}_{\text{Bias}^2} + \underbrace{\text{Var}(\hat f(X))}_{\text{Variance}} + \underbrace{\sigma_\varepsilon^2}_{\text{Noise}}.$
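
A short sketch of where the three terms come from, assuming $Y = f^*(X) + \varepsilon$ with $E[\varepsilon] = 0$, $\text{Var}(\varepsilon) = \sigma_\varepsilon^2$, and $\varepsilon$ independent of the training sample (standard assumptions, not stated explicitly in these notes). Write $\hat f$ for $\hat f(X)$ and $f^*$ for $f^*(X)$ at a fixed input. First separate the noise:

$E[(Y - \hat f)^2] = E[(f^* - \hat f)^2] + 2E[\varepsilon (f^* - \hat f)] + E[\varepsilon^2] = E[(f^* - \hat f)^2] + \sigma_\varepsilon^2,$

because $E[\varepsilon (f^* - \hat f)] = E[\varepsilon]\,E[f^* - \hat f] = 0$. Then add and subtract $E[\hat f]$ inside the remaining square:

$E[(f^* - \hat f)^2] = (f^* - E[\hat f])^2 + 2(f^* - E[\hat f])\,E[E[\hat f] - \hat f] + E[(\hat f - E[\hat f])^2] = (E[\hat f] - f^*)^2 + \text{Var}(\hat f),$

since $E[E[\hat f] - \hat f] = 0$. Combining the two lines gives Bias² + Variance + Noise.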

Source	Meaning	Reduce by
Bias	Wrong model assumptions (e.g., linear when truth is curved)	More complex model, better features
Variance	Sensitivity to training sample	More data, simpler model, regularisation, ensembling
Noise	Intrinsic randomness in $Y$	Irreducible

Trade-off. As complexity grows: bias ↓, variance ↑. The total error is U-shaped — optimal complexity sits at the bottom.

Estimating generalisation error.

1. Hold-out test set. Simple 80/20 split. Cheap but noisy on small data.

2. k-Fold Cross-Validation. Split into $k$ folds; train on $k-1$, validate on the held-out fold; repeat for each fold. The average is a good estimate of generalisation error. Common defaults: $k = 5$ or $10$.

3. Leave-One-Out CV (LOOCV). Maximum data use, but expensive: it needs $n$ model fits.

4. Repeated stratified k-fold. Average over multiple random splits to reduce variance of the estimate.

5. Time-series CV / Group k-fold. For sequential or grouped data — never let future / same-entity data leak into training.

6. Nested CV. An inner CV loop tunes hyperparameters; an outer loop evaluates the tuned model, so the reported error is not biased by the tuning.
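The leakage rule in item 5 comes down to how indices are split. Below is a minimal, standard-library sketch of a forward-chaining time-series split and a group-aware k-fold; `time_series_splits` and `group_k_fold` are hypothetical helpers written for illustration and are simpler than the library versions (for example, trailing observations that do not fill a block are dropped).

```python
def time_series_splits(n, n_splits):
    """Forward-chaining splits: train on a prefix, validate on the next block.

    Assumes observations 0..n-1 are already in time order.
    """
    block = n // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, i * block))
        valid = list(range(i * block, (i + 1) * block))
        yield train, valid

def group_k_fold(groups, k):
    """Assign whole groups to folds so no group spans train and validation."""
    unique = sorted(set(groups))
    for fold in range(k):
        held_out = set(unique[fold::k])
        train = [i for i, g in enumerate(groups) if g not in held_out]
        valid = [i for i, g in enumerate(groups) if g in held_out]
        yield train, valid

# Time-ordered data: every validation index comes after every training index.
ts_splits = list(time_series_splits(n=10, n_splits=4))
for train, valid in ts_splits:
    print("time-series  train:", train, " validate:", valid)

# Grouped data (e.g. several rows per customer): groups never straddle a split.
groups = ["a", "a", "b", "b", "b", "c", "d", "d"]
grp_splits = list(group_k_fold(groups, k=2))
for train, valid in grp_splits:
    print("group k-fold train:", train, " validate:", valid)
```

A plain shuffled k-fold would break both guarantees, which is how future or same-entity rows leak into training.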

Optimism gap. $\text{Optimism} = \text{Test error} - \text{Train error}.$

Both errors low and close → well-fit, deploy.
Big gap (test ≫ train) → overfitting; simplify or regularise.
Both high → underfitting; add complexity or features.

Learning curves — diagnostic tool.

Plot training and validation error as functions of:

(a) Training-set size $n$.

Underfitting: both curves plateau at a high error → adding data won't help; the model is too simple.
Overfitting: training error stays very low while validation error stays high — large gap → add data, regularise, or simplify.
Well-fit: both curves converge to near-irreducible error.

(b) Model complexity (e.g., polynomial degree, tree depth).

Training error monotonically decreases.
Validation error U-shapes — falls then rises.
Optimal complexity = bottom of validation curve.
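
The complexity curve can be reproduced with k-nearest-neighbour regression, where a smaller $k$ means a more flexible model. This is an illustrative sketch under assumed settings (truth $\sin(2\pi x)$, noise s.d. 0.3, 200 training and 1000 validation points); k-NN is a stand-in chosen for brevity, not a method named in these notes.

```python
import math
import random
import statistics

def make_data(rng, n, noise_sd=0.3):
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def knn_predict(x_train, y_train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x))[:k]
    return statistics.fmean([y_train[i] for i in nearest])

def knn_mse(x_train, y_train, xs, ys, k):
    return statistics.fmean(
        [(y - knn_predict(x_train, y_train, x, k)) ** 2 for x, y in zip(xs, ys)]
    )

rng = random.Random(0)
x_train, y_train = make_data(rng, 200)
x_val, y_val = make_data(rng, 1000)

# Smaller k = more flexible model, so complexity DEcreases along this list.
ks = [1, 5, 15, 50, 200]
train_errors = {k: knn_mse(x_train, y_train, x_train, y_train, k) for k in ks}
val_errors = {k: knn_mse(x_train, y_train, x_val, y_val, k) for k in ks}
best_k = min(ks, key=val_errors.get)

for k in ks:
    print(f"k={k:3d}  train={train_errors[k]:.3f}  validation={val_errors[k]:.3f}")
print("best k by validation error:", best_k)
```

At $k = 1$ the model memorises the training set (zero training error), at $k = 200$ it predicts the global mean, and the validation minimum should fall somewhere in between.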

Worked diagnostic.

Train error	Test/CV error	Diagnosis	Fix
0.05	0.06	Well-fit	Deploy
0.01	0.40	High variance / overfit	More data, regularise, prune
0.30	0.32	High bias / underfit	More features, more complex model
0.30	0.50	High bias and high variance	Investigate data quality
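
The table's logic can be written as a small rule. The thresholds below (training error above 0.2 counts as high, a gap above 0.1 counts as large) are assumptions chosen only so that the four rows above are reproduced; sensible values depend on the metric and the problem, and `diagnose` is a hypothetical helper.

```python
def diagnose(train_err, test_err, high_error=0.2, large_gap=0.1):
    """Classify a (train, test) error pair using two assumed thresholds."""
    high_bias = train_err > high_error                   # poor fit even in-sample
    high_variance = (test_err - train_err) > large_gap   # large optimism gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_variance:
        return "overfit"
    if high_bias:
        return "underfit"
    return "well-fit"

# The four (train, test/CV) pairs from the worked diagnostic table.
rows = [(0.05, 0.06), (0.01, 0.40), (0.30, 0.32), (0.30, 0.50)]
for train_err, test_err in rows:
    print(train_err, test_err, diagnose(train_err, test_err))
```

The fourth row combines the two problems, so the rule reports both rather than picking one.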

Take-away. Always evaluate out-of-sample. Generalisation — not training fit — is the real goal of every ML system.
