Generalization Error and Out-of-Sample Metrics
Core Titles
Key headlines and terms for quick recall
- Generalisation error — expected loss on unseen data
- Training vs Test error — train error optimistic
- Bias / Variance / Irreducible noise decomposition
- Out-of-sample (OOS) metrics — held-out test, CV
- Optimism gap = test error − train error
- Learning curves — error vs training size
- Why generalisation is the real goal in ML
Basic Idea
What it is, why it matters, how it works
What is generalisation error?
Generalisation error is the expected loss a model $\hat f$ incurs on data drawn from the same distribution $P$ as the training data but not used in training:
$$\mathrm{Err}(\hat f) = \mathbb{E}_{(x, y) \sim P}\big[L(y, \hat f(x))\big]$$
Unlike training error (in-sample), generalisation error tells us how the model will perform in production.
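The difference between in-sample and out-of-sample error can be illustrated with a small simulation. This is only a sketch under stated assumptions: the toy distribution (y = 2x plus Gaussian noise) and the helper names `make_data`, `fit_lookup` and `mse` are invented for illustration, and a large fresh sample stands in for the expectation.

```python
import random

def make_data(n, rng):
    """Draw n points with y = 2x + Gaussian noise (a made-up toy distribution)."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 0.5)) for x in xs]

def fit_lookup(train):
    """'Train' a memoriser: predict the y of the nearest stored x."""
    def predict(x):
        nearest = min(train, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict

def mse(predict, data):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((y - predict(x)) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(50, rng)
test = make_data(2000, rng)   # fresh draws approximate the expected loss

model = fit_lookup(train)
train_error = mse(model, train)   # in-sample error
test_error = mse(model, test)     # estimate of generalisation error
print(f"train MSE = {train_error:.3f}, test MSE = {test_error:.3f}")
```

The memoriser reproduces every training target exactly, so its training error is zero, while its error on fresh draws stays well above the noise level.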
Bias–Variance decomposition
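For squared loss, the expected prediction error at a point $x$ (averaged over training samples and noise) decomposes as:
$$\mathbb{E}\big[(y - \hat f(x))^2\big] = \mathrm{Bias}\big[\hat f(x)\big]^2 + \mathrm{Var}\big[\hat f(x)\big] + \sigma^2$$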
- Bias — error from wrong model assumptions (linear when reality is non-linear).
- Variance — error from sensitivity to the specific training sample.
- Irreducible noise — intrinsic noise in $y$; cannot be reduced.
Trade-off. Simpler models have high bias / low variance; complex models the reverse. The optimal complexity balances both.
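The trade-off can be checked numerically by retraining two models on many independent training sets and looking at their predictions at one fixed point. The sketch below assumes a made-up truth `sin(2πx)`, a noise level of 0.3 and a query point 0.25; the function names are hypothetical, not from any library.

```python
import math
import random
import statistics

def true_f(x):
    """Assumed ground-truth function for this toy experiment."""
    return math.sin(2.0 * math.pi * x)

def sample_training_set(n, noise_sd, rng):
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0.0, noise_sd)) for x in xs]

def fit_constant(train):
    """Very simple model: always predict the mean training target."""
    mean_y = statistics.fmean([y for _, y in train])
    return lambda x: mean_y

def fit_nearest(train):
    """Very flexible model: copy the target of the nearest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def bias_variance_at(x0, fit, n_repeats, n_train, noise_sd, rng):
    """Estimate bias^2 and variance of the prediction at x0 over many training sets."""
    preds = []
    for _ in range(n_repeats):
        model = fit(sample_training_set(n_train, noise_sd, rng))
        preds.append(model(x0))
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance

rng = random.Random(42)
x0, noise_sd = 0.25, 0.3   # true_f(0.25) = 1, far from the overall mean of 0
simple = bias_variance_at(x0, fit_constant, 500, 30, noise_sd, rng)
flexible = bias_variance_at(x0, fit_nearest, 500, 30, noise_sd, rng)
print("constant model  bias^2=%.3f  variance=%.3f" % simple)
print("nearest-point   bias^2=%.3f  variance=%.3f" % flexible)
```

Under these assumptions the constant model shows large bias and small variance, and the nearest-point model shows the reverse.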
Estimating generalisation error
1. Hold-out test set. Split data once (e.g., 80% train / 20% test). Train on the training split; score on the test split. Simple, but the score is noisy on small data.
2. k-Fold cross-validation. Split into $k$ folds. Train on $k-1$ folds, validate on the held-out fold. Repeat $k$ times; average. Better use of data.
3. Leave-One-Out CV (LOOCV). $k = N$, one fold per observation. Almost unbiased but expensive ($N$ model fits).
4. Repeated / Stratified CV. Repeat CV over several random splits to stabilise the estimate; stratify so each fold preserves the class proportions.
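The k-fold procedure in the list above can be sketched in plain Python. The data, the closed-form line fit and the names `k_fold_indices`, `cross_val_mse` and `fit_line` are assumptions made for illustration; a library routine would normally be used in practice.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and cut it into k folds whose sizes differ by at most 1."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_mse(xs, ys, fit, k=5, seed=0):
    """Return (mean validation MSE, per-fold MSEs) for a fit(train) -> predict function."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [(xs[i], ys[i]) for i in range(len(xs)) if i not in held]
        model = fit(train)   # trained without the held-out fold
        fold_mse = sum((ys[i] - model(xs[i])) ** 2 for i in held_out) / len(held_out)
        scores.append(fold_mse)
    return sum(scores) / len(scores), scores

def fit_line(train):
    """Ordinary least squares for y = a + b*x in closed form."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    b = sum((x - mx) * (y - my) for x, y in train) / sxx
    a = my - b * mx
    return lambda x: a + b * x

# Toy data: a noisy line (assumed for the example).
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [3.0 + 0.5 * x + rng.gauss(0, 1.0) for x in xs]

cv_mse, fold_scores = cross_val_mse(xs, ys, fit_line, k=5)
loo_mse, _ = cross_val_mse(xs, ys, fit_line, k=len(xs))   # LOOCV: one fold per point
print(f"5-fold MSE = {cv_mse:.3f}, LOOCV MSE = {loo_mse:.3f}")
```

Setting `k` equal to the number of observations turns the same routine into LOOCV, at the cost of one model fit per observation.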
Optimism gap
Optimism gap = test error − train error.
- Small gap, both errors low → well-fit model.
- Large gap → over-fitting.
- Train and test both high → under-fitting.
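The rules above can be written as a small helper. The thresholds `high_error` and `big_gap` are hypothetical values picked for this sketch; sensible cut-offs depend on the task and on the noise level of the data.

```python
def diagnose(train_error, test_error, high_error=0.20, big_gap=0.10):
    """Label a fit from its train/test errors.

    high_error and big_gap are hypothetical thresholds chosen for this sketch.
    """
    gap = test_error - train_error          # optimism gap
    underfit = train_error >= high_error    # even the training data is fit poorly
    overfit = gap >= big_gap                # test error much worse than train error
    if underfit and overfit:
        return gap, "high bias and high variance"
    if overfit:
        return gap, "overfit"
    if underfit:
        return gap, "underfit"
    return gap, "well-fit"

for train_err, test_err in [(0.05, 0.06), (0.01, 0.40), (0.30, 0.32), (0.30, 0.50)]:
    gap, label = diagnose(train_err, test_err)
    print(f"train={train_err:.2f} test={test_err:.2f} gap={gap:.2f} -> {label}")
```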
Learning curves
Plot training and CV errors as functions of:
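- Training-set size — the two errors converge toward each other; for a model flexible enough to capture the truth, the common limit approaches the irreducible error.
- Model complexity — train error typically keeps falling; CV error is U-shaped.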
These curves diagnose whether to add data, simplify the model, or call it done.
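A learning curve over training-set size can be produced without plotting, as a table of numbers. The sketch below assumes a noisy linear truth with noise variance 1 and a straight-line model, so the model is well specified; the function names and the sizes tried are illustrative choices only.

```python
import random

def fit_line(train):
    """Ordinary least squares for y = a + b*x in closed form."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    b = sum((x - mx) * (y - my) for x, y in train) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def mse(model, data):
    return sum((y - model(x)) ** 2 for x, y in data) / len(data)

def learning_curve(sizes, n_val=1000, n_repeats=50, seed=0):
    """Average train and validation MSE for each training-set size."""
    rng = random.Random(seed)

    def draw(n):
        xs = [rng.uniform(0, 10) for _ in range(n)]
        return [(x, 3.0 + 0.5 * x + rng.gauss(0, 1.0)) for x in xs]

    val = draw(n_val)   # one fixed validation set shared by all sizes
    rows = []
    for n in sizes:
        train_total, val_total = 0.0, 0.0
        for _ in range(n_repeats):   # average over training sets to smooth the curve
            train = draw(n)
            model = fit_line(train)
            train_total += mse(model, train)
            val_total += mse(model, val)
        rows.append((n, train_total / n_repeats, val_total / n_repeats))
    return rows

curve = learning_curve([5, 20, 80, 320])
for n, train_err, val_err in curve:
    print(f"N={n:4d}  train MSE={train_err:.3f}  validation MSE={val_err:.3f}")
```

With these assumptions the training error rises and the validation error falls as N grows, and both approach the noise variance, which is the pattern the notes describe for a well-fit model.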
Regression vs classification metrics
| Task | OOS metrics |
|---|---|
| Regression | RMSE, MAE, $R^2$, MAPE |
| Classification | Accuracy, F1, ROC-AUC, PR-AUC, log-loss |
| Ranking | NDCG, MAP, MRR |
| Detection | mAP, IoU |
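A few of the metrics in the table are simple enough to compute by hand, which helps when checking a library result. The implementations below are a plain-Python sketch for the regression and binary-classification rows only; the sample values are invented.

```python
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented held-out targets and predictions.
y_reg_true, y_reg_pred = [3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0]
y_cls_true, y_cls_pred = [1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]

print("RMSE", rmse(y_reg_true, y_reg_pred))   # sqrt(0.5)
print("MAE ", mae(y_reg_true, y_reg_pred))    # 0.5
print("R^2 ", r2(y_reg_true, y_reg_pred))     # 0.9
print("Acc ", accuracy(y_cls_true, y_cls_pred))
print("F1  ", f1(y_cls_true, y_cls_pred))
```

All of these should be computed on held-out data; the same functions applied to training data give the optimistic in-sample figures.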
Why generalisation is the real goal
Training error is easy to drive to zero — a lookup table memorises training perfectly. But generalisation is what the business cares about: how the model performs on tomorrow's new customers, transactions, images. Every modelling choice should ultimately serve generalisation.
Mind Map
Visual structure of the concept
GENERALISATION ERROR
├── Definition — expected loss on unseen data
├── Bias–Variance decomposition
│   ├── Bias² — model too simple
│   ├── Variance — too sensitive to data
│   └── Irreducible noise
├── Estimation
│   ├── Held-out test
│   ├── k-Fold CV
│   ├── LOOCV
│   └── Repeated stratified CV
├── Optimism gap = OOS − train
│   ├── Large → overfitting
│   └── Both high → underfitting
├── Learning curves
│   ├── Error vs N (size)
│   └── Error vs complexity
└── Why care?
    └── Business wants production performance
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions
Part A (2 marks each)
Q1. Define generalisation error. The expected loss of a model $\hat f$ on data drawn from the same distribution $P$ as the training data but not used in training: $\mathbb{E}_{(x, y) \sim P}[L(y, \hat f(x))]$. It measures real-world (production) performance.
Q2. What is the bias-variance trade-off? A trade-off in model complexity: simple models have high bias / low variance (underfit); complex models have low bias / high variance (overfit). The optimal model minimises total error = bias² + variance + irreducible noise.
Q3. Why is training error not a good measure of model quality? Because the model is fit to that data — even a memorising lookup table can achieve zero training error while failing on new data. Out-of-sample error (test, CV) is needed.
Part B (20 marks)
Q. Explain generalisation error and the bias-variance trade-off. Discuss techniques to estimate out-of-sample error and methods to diagnose under- and over-fitting using learning curves.
Generalisation error.
The expected loss on unseen data from the same distribution $P$:
$$\mathrm{Err}(\hat f) = \mathbb{E}_{(x, y) \sim P}\big[L(y, \hat f(x))\big]$$
It is the quantity we actually care about — production performance — not training error.
Bias–Variance decomposition (squared loss).
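For an estimator $\hat f$ trained on a random sample, with $y = f(x) + \varepsilon$, $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$:
$$\mathbb{E}\big[(y - \hat f(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}$$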
| Source | Meaning | Reduce by |
|---|---|---|
| Bias | Wrong model assumptions (e.g., linear when truth is curved) | More complex model, better features |
| Variance | Sensitivity to training sample | More data, simpler model, regularisation, ensembling |
| Noise | Intrinsic randomness in $y$ | Irreducible |
Trade-off. As complexity grows: bias ↓, variance ↑. The total error is U-shaped — optimal complexity sits at the bottom.
Estimating generalisation error.
1. Hold-out test set. Simple 80/20 split. Cheap but noisy on small data.
2. k-Fold Cross-Validation. Split into $k$ folds; train on $k-1$, validate on the held-out fold; repeat $k$ times. The average is a good estimate of generalisation error. Common defaults are $k = 5$ or $k = 10$.
3. Leave-One-Out CV (LOOCV). Maximum data use, but it needs $N$ model fits, so it is expensive.
4. Repeated stratified k-fold. Average over multiple random splits to reduce variance of the estimate.
5. Time-series CV / Group k-fold. For sequential or grouped data — never let future / same-entity data leak into training.
6. Nested CV. Tune hyperparameters in an inner CV loop and evaluate in an outer loop, so the reported score is not biased by the tuning.
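The leakage rule for sequential and grouped data can be made concrete with two small index generators. Both are sketches with hypothetical names (`forward_chaining_splits`, `group_k_fold`) and simple splitting rules chosen for illustration; they are not a library's exact algorithm.

```python
def forward_chaining_splits(n, n_splits, min_train):
    """Yield (train_idx, test_idx) where every test index comes after every train index."""
    test_size = (n - min_train) // n_splits
    for s in range(n_splits):
        train_end = min_train + s * test_size
        # Train on everything before train_end, test on the next block in time.
        yield list(range(train_end)), list(range(train_end, train_end + test_size))

def group_k_fold(groups, k):
    """Yield (train_idx, test_idx) with whole groups assigned to one side only."""
    unique = sorted(set(groups))
    for f in range(k):
        test_groups = set(unique[f::k])
        test_idx = [i for i, g in enumerate(groups) if g in test_groups]
        train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train_idx, test_idx

# Ten time-ordered observations: the training window grows, the test block moves forward.
ts_splits = list(forward_chaining_splits(10, n_splits=3, min_train=4))
for train_idx, test_idx in ts_splits:
    print("train", train_idx, "test", test_idx)

# Six rows belonging to four entities (e.g. customers): no entity is split across sides.
groups = ["a", "a", "b", "c", "c", "d"]
grp_splits = list(group_k_fold(groups, k=2))
for train_idx, test_idx in grp_splits:
    print("train", train_idx, "test", test_idx)
```

In both cases the test rows are ones the model could not have seen in any form during training, which is what makes the resulting score an honest out-of-sample estimate.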
Optimism gap. Gap = test (OOS) error − train error.
- Both errors low and close → well-fit, deploy.
- Big gap (test ≫ train) → overfitting; simplify or regularise.
- Both high → underfitting; add complexity or features.
Learning curves — diagnostic tool.
Plot training and validation error as functions of:
(a) Training-set size $N$.
- Underfitting: both curves plateau at a high error → adding data won't help; the model is too simple.
- Overfitting: training error stays very low while validation error stays high — large gap → add data, regularise, or simplify.
- Well-fit: both curves converge to near-irreducible error.
(b) Model complexity (e.g., polynomial degree, tree depth).
- Training error monotonically decreases.
- Validation error U-shapes — falls then rises.
- Optimal complexity = bottom of validation curve.
Worked diagnostic.
| Train error | Test/CV error | Diagnosis | Fix |
|---|---|---|---|
| 0.05 | 0.06 | Well-fit | Deploy |
| 0.01 | 0.40 | High variance / overfit | More data, regularise, prune |
| 0.30 | 0.32 | High bias / underfit | More features, more complex model |
| 0.30 | 0.50 | High bias and high variance | Investigate data quality |
Take-away. Always evaluate out-of-sample. Generalisation — not training fit — is the real goal of every ML system.