Overfitting, Underfitting and Model Selection
Core Titles
Key headlines and terms for quick recall
- Underfitting — model too simple (high bias)
- Overfitting — model too complex (high variance)
- Symptoms — gaps between train and test errors
- Causes — capacity, data size, noise, leakage
- Remedies
  - More data / augmentation
  - Regularisation (L1, L2)
  - Cross-validation
  - Early stopping
  - Dropout / pruning / ensembling
- Model selection — pick complexity that minimises CV error
Basic Idea
What it is, why it matters, how it works
The two failure modes
Underfitting — the model is too simple to capture the underlying pattern. Both training and test errors are high.
Overfitting — the model is too complex: it memorises training data (including noise) and fails on new data. Training error is low, test error is high.
Good fit — model captures the signal but not the noise. Training and test errors are both low and close.
Symptoms
| Train error | Test error | Diagnosis |
|---|---|---|
| Low | Low (close) | Good fit ✓ |
| Low | High | Overfit (high variance) |
| High | High | Underfit (high bias) |
| High | Low | Suspect leakage / bug |
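The table can be read as a small decision rule. The sketch below is only an illustration: the function name `diagnose` and the thresholds `high` and `gap` are assumptions chosen for the example, not standard values, and sensible cut-offs depend on the task and metric.

```python
def diagnose(train_error, test_error, high=0.20, gap=0.05):
    """Map a (train, test) error pair to a row of the symptoms table.

    `high` and `gap` are illustrative thresholds, not standard values.
    """
    train_high = train_error >= high
    test_high = test_error >= high
    if train_high and test_high:
        return "underfit (high bias)"
    if not train_high and test_high:
        return "overfit (high variance)"
    if train_high and not test_high:
        return "suspect leakage / bug"
    # Both errors are low: they must also be close to count as a good fit.
    if test_error - train_error > gap:
        return "overfit (high variance)"
    return "good fit"


for pair in [(0.04, 0.06), (0.02, 0.31), (0.35, 0.38), (0.30, 0.05)]:
    print(pair, "->", diagnose(*pair))
```

The four pairs in the loop exercise the four rows of the table in order.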
Causes of overfitting
- Model capacity too high for the data size (deep tree, large NN).
- Too few training samples.
- Noisy or mislabeled training data.
- Training for too long (NN).
- Data leakage from test into train.
- Too many features / high-cardinality categoricals.
Causes of underfitting
- Model too simple (linear when truth is non-linear).
- Insufficient features.
- Over-aggressive regularisation.
- Too little training (NN under-trained).
Remedies — overfitting
- More data — most reliable cure. Data augmentation effectively increases data.
- Simpler model — fewer parameters, shallower trees, smaller NN.
- Regularisation — L1 (Lasso) for feature selection; L2 (Ridge) for shrinkage; Elastic Net for both.
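- Cross-validation — estimates generalisation error on held-out folds, so overfitting shows up during model selection rather than after deployment; pick the model with the best CV score.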
- Early stopping — for iterative models, stop when validation error stops improving.
- Dropout — randomly disable neurons during training.
- Pruning — for decision trees.
- Ensembling — bagging (Random Forest), boosting (XGBoost).
- Reduce features — selection, PCA.
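Of the remedies above, early stopping is the easiest to show in a few lines. The following is a minimal sketch, assuming a patience-based rule; the helper `early_stopping_epoch`, its `patience` and `min_delta` parameters, and the validation losses are all made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    # Ran out of epochs before the patience budget was used up.
    return best_epoch, len(val_losses) - 1


# Hypothetical validation curve: improves, then drifts upward (overfitting).
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.60]
best, stopped = early_stopping_epoch(val_losses, patience=3)
print(f"best epoch {best}, stopped at epoch {stopped}")
```

In practice the weights saved at the best epoch are the ones kept, not the weights at the epoch where training stopped.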
Remedies — underfitting
- More complex model.
- Better features. Feature engineering, interactions, polynomial expansion.
- Reduce regularisation.
- Train longer / better.
- Combine via boosting.
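A small numerical sketch of the first two remedies, assuming a noise-free quadratic target chosen purely for illustration: a straight line underfits it (train and test errors both high), while adding the squared feature removes the error. It uses NumPy's `polyfit` and `polyval`; the helper `mse_for_degree` is a name made up for this example.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 61)
y = x ** 2                          # assumed ground truth, noise-free for clarity
x_train, y_train = x[::2], y[::2]   # every other point for training
x_test, y_test = x[1::2], y[1::2]   # the remaining points for testing


def mse_for_degree(degree):
    """Fit a polynomial of the given degree; return (train_mse, test_mse)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse


linear = mse_for_degree(1)       # too simple: high bias
quadratic = mse_for_degree(2)    # matches the true form of the data
print("degree 1 (train, test):", linear)
print("degree 2 (train, test):", quadratic)
```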
Diagnosis tool — learning curve
Plot train and validation error vs training-set size or epochs.
- Underfitting: both curves plateau high → adding data won't help; model too simple.
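- Overfitting: training error very low, validation error high, with a persistent gap between the curves (the gap widens with more epochs and narrows only slowly as data is added) → simplify, regularise, or add data.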
- Good fit: both converge close to irreducible error.
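The curves can be computed without any plotting library. The sketch below is an illustration under stated assumptions: the target `sin(3x)`, the noise level, the training sizes and the two polynomial degrees are arbitrary choices, and `sample` and `learning_curve` are hypothetical helper names. A degree-1 model stands in for the underfitting case and a degree-5 model for a more flexible one.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample(n):
    """Draw n noisy points from an assumed target, y = sin(3x) + noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, np.sin(3.0 * x) + rng.normal(0.0, 0.3, size=n)


x_pool, y_pool = sample(256)   # training pool, used in growing prefixes
x_val, y_val = sample(500)     # fixed validation set


def learning_curve(degree, sizes=(8, 16, 64, 256)):
    """Return [(n, train_mse, val_mse)] for a polynomial of the given degree."""
    rows = []
    for n in sizes:
        xs, ys = x_pool[:n], y_pool[:n]
        coeffs = np.polyfit(xs, ys, degree)
        train_mse = float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))
        val_mse = float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))
        rows.append((n, train_mse, val_mse))
    return rows


for degree in (1, 5):
    print(f"degree {degree}")
    for n, train_mse, val_mse in learning_curve(degree):
        print(f"  n={n:3d}  train={train_mse:.3f}  val={val_mse:.3f}")
```

Reading the printed rows the same way as a plot: for the linear model both errors should settle well above the noise floor, while for the degree-5 model the train and validation errors start far apart at small `n` and move towards each other as data is added.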
Why this matters
In production, only generalisation matters. A model that scores well on its training data but only mediocre on real users delivers little business value. Good ML practice therefore means actively guarding against overfitting.
Mind Map
Visual structure of the concept
OVER- / UNDER-FITTING
├── Underfit (high bias)
│   ├── Train high + Test high
│   ├── Model too simple
│   └── Fix: more features, deeper model
├── Overfit (high variance)
│   ├── Train low + Test high
│   ├── Memorises noise
│   └── Fix: more data, regularise, prune, early-stop, ensemble
├── Good fit
│   └── Train low, Test low, gap small
└── Diagnosis
    ├── Learning curves
    ├── Train ↔ CV gap
    └── Hold-out test
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions
Part A (2 marks each)
Q1. What is overfitting? A model that learns the training data — including noise — so closely that it fails on new data. Symptom: low training error but high test error.
Q2. What is underfitting? A model that is too simple to capture the underlying pattern; both training and test errors are high.
Q3. Name three techniques to reduce overfitting.
- Regularisation (L1, L2).
- Cross-validation to catch overfitting and choose simpler models.
- More data or data augmentation; early stopping; dropout; pruning; ensembling.
Part B (20 marks)
Q. What is overfitting and underfitting? Discuss their causes and remedies. How do learning curves help diagnose them?
Definitions.
- Underfitting. Model too simple to capture the underlying pattern. Both training and test errors are high. High bias.
- Overfitting. Model too complex; learns noise instead of signal. Training error very low, test error high. High variance.
- Good fit. Captures the signal without the noise; train and test errors are both low and close.
Symptoms table.
| Train error | Test error | Diagnosis |
|---|---|---|
| Low | Low, close | Good fit |
| Low | High | Overfit |
| High | High | Underfit |
| High | Low | Suspect leakage / bug |
Causes of overfitting.
- Model capacity too high relative to dataset size (deep tree, large NN).
- Too few training samples.
- Noisy or mislabeled data.
- Training for too long (NN).
- Data leakage from test into train.
- Too many or high-cardinality features.
Causes of underfitting.
- Model too simple (linear when truth is non-linear).
- Insufficient or uninformative features.
- Over-aggressive regularisation.
- Insufficient training (under-trained NN).
Remedies for overfitting.
- More data. Most reliable cure. Data augmentation (rotations/crops for images, back-translation for text) increases effective data without new labels.
- Simpler model. Fewer parameters, shallower trees, smaller neural net.
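- Regularisation.
  - L2 (Ridge): adds the penalty $\lambda \sum_j w_j^2$ to the loss. Shrinks all weights.
  - L1 (Lasso): adds the penalty $\lambda \sum_j |w_j|$ to the loss. Drives some weights to zero.
  - Elastic Net: combines L1 and L2.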
- Cross-validation. Picks the simplest model whose validation error is competitive.
- Early stopping. Stop training when validation error stops improving.
- Dropout. Randomly disable neurons during training; common in deep nets.
- Pruning. Cut back branches of decision trees that don't improve validation.
- Ensembling. Bagging (Random Forest), boosting (XGBoost), stacking. Combining learners averages out their individual errors; bagging in particular reduces variance.
- Feature reduction. Selection or dimensionality reduction (PCA).
- Proper validation. Stratified, group, time-series splits to prevent leakage.
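Two of these remedies, L2 regularisation and cross-validation, combine naturally: cross-validation can choose the penalty strength. The sketch below is a minimal illustration on synthetic data, assuming a linear model without intercept and the closed-form ridge solution; `ridge_fit`, `cv_mse`, the grid of penalty values and the data itself are all invented for the example. It simply picks the penalty with the lowest CV error, which is a simplification of "the simplest model whose validation error is competitive".

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 40, 15           # few samples, many features
X = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:3] = [2.0, -1.0, 0.5]            # assumed: only three features matter
y = X @ true_w + rng.normal(scale=1.0, size=n_samples)


def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def cv_mse(X, y, lam, k=5):
    """Mean validation MSE over k contiguous folds."""
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False      # hold this fold out of training
        w = ridge_fit(X[train_mask], y[train_mask], lam)
        errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(errors))


lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(X, y, lam) for lam in lambdas}
norms = {lam: float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lambdas}
best_lam = min(scores, key=scores.get)

for lam in lambdas:
    print(f"lambda={lam:6.1f}  cv_mse={scores[lam]:.3f}  ||w||={norms[lam]:.3f}")
print("selected lambda:", best_lam)
```

The weight norm falls as the penalty grows, which is the shrinkage effect described for Ridge; the CV column shows where that shrinkage stops helping.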
Remedies for underfitting.
- More complex model — switch from linear to GBT / NN.
- Better features — polynomial expansion, interactions, domain-driven engineering.
- Reduce regularisation — relax $\lambda$ (the penalty strength).
- Train longer — more epochs for NN.
- Boost / ensemble.
Diagnostic — Learning curves.
Plot train and validation error against training-set size or epochs.
| Pattern | Diagnosis | Remedy |
|---|---|---|
| Both errors high and plateau | Underfit | Complexity, features |
| Train very low, validation high; gap widens | Overfit | Data, regularise, simplify |
| Both converge to small error | Good fit | Deploy |
Worked example. A Random Forest on a 2 000-patient readmission dataset reports train F1 = 0.95, validation F1 = 0.71. The 24-point gap signals overfitting. Remedies: limit max_depth, increase min_samples_leaf, validate with stratified group k-fold so that no patient appears in both the training and validation folds, and restrict the features considered per split with max_features='sqrt'. After tuning, train F1 = 0.78, validation F1 = 0.73: the gap closes to 5 points and the model generalises better.
Take-away. Generalisation is the goal. Detect and prevent overfitting aggressively — through validation, regularisation, ensembling and learning-curve diagnostics. The best models are the simplest that still work.