PGD01C02
Module 4 · Model Evaluation

Overfitting, Underfitting and Model Selection

Core Titles
Key headlines and terms for quick recall
  • Underfitting — model too simple (high bias)
  • Overfitting — model too complex (high variance)
  • Symptoms — gaps between train and test errors
  • Causes — capacity, data size, noise, leakage
  • Remedies
    • More data / augmentation
    • Regularisation (L1, L2)
    • Cross-validation
    • Early stopping
    • Dropout / pruning / ensembling
  • Model selection — pick complexity that minimises CV error
Basic Idea
What it is, why it matters, how it works

The two failure modes

Underfitting — the model is too simple to capture the underlying pattern. Both training and test errors are high.

Overfitting — the model is too complex: it memorises training data (including noise) and fails on new data. Training error is low, test error is high.

Good fit — model captures the signal but not the noise. Training and test errors are both low and close.
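
The three regimes can be reproduced on a toy problem. The sketch below is only an illustration, assuming a made-up signal (a noisy sine wave) and using polynomial degree as the complexity knob; the degrees 1, 3 and 12, the noise level and the sample sizes are all arbitrary choices, not values from these notes.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1] (an assumed toy signal)."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, y

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(200)     # larger held-out set

results = {}
for degree in (1, 3, 12):           # too simple, moderate, very flexible
    model = Polynomial.fit(x_train, y_train, degree, domain=[0, 1])
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree={degree:2d}  train MSE={results[degree][0]:.3f}  "
          f"test MSE={results[degree][1]:.3f}")
```

Training error can only fall as the degree grows, because each higher-degree fit contains the lower-degree one as a special case. The test error is the number to watch: it is the one that separates a good fit from an overfit.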

Symptoms

Train error | Test error  | Diagnosis
Low         | Low (close) | Good fit ✓
Low         | High        | Overfit (high variance)
High        | High        | Underfit (high bias)
High        | Low         | Suspect leakage / bug
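
The table reads naturally as a small decision rule. The function below is a hypothetical helper, not a standard API. The cut-offs `high` and `close` are illustrative assumptions, since what counts as a "high" error depends entirely on the task and metric.

```python
def diagnose(train_error, test_error, high=0.20, close=0.05):
    """Map a (train, test) error pair onto the diagnosis table.

    `high` and `close` are illustrative thresholds, not standard values.
    """
    train_high = train_error >= high
    test_high = test_error >= high
    if train_high and test_high:
        return "underfit (high bias)"
    if train_high and not test_high:
        return "suspect leakage / bug"
    # Training error is low from here on.
    if test_high or (test_error - train_error) > close:
        return "overfit (high variance)"
    return "good fit"

for pair in [(0.03, 0.05), (0.02, 0.35), (0.30, 0.32), (0.30, 0.05)]:
    print(pair, "->", diagnose(*pair))
```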

Causes of overfitting

  • Model capacity too high for the data size (deep tree, large NN).
  • Too few training samples.
  • Noisy or mislabeled training data.
  • Training for too long (NN).
  • Data leakage from test into train.
  • Too many features / high-cardinality categoricals.

Causes of underfitting

  • Model too simple (linear when truth is non-linear).
  • Insufficient features.
  • Over-aggressive regularisation.
  • Too little training (NN under-trained).

Remedies — overfitting

  1. More data — most reliable cure. Data augmentation increases the effective size of the training set.
  2. Simpler model — fewer parameters, shallower trees, smaller NN.
  3. Regularisation — L1 (Lasso) for feature selection; L2 (Ridge) for shrinkage; Elastic Net for both.
  4. Cross-validation — detects overfitting by scoring each candidate on held-out folds; pick the model with the lowest CV error.
  5. Early stopping — for iterative models, stop when validation error stops improving.
  6. Dropout — randomly disable neurons during training.
  7. Pruning — for decision trees.
  8. Ensembling — bagging (Random Forest), boosting (XGBoost).
  9. Reduce features — selection, PCA.
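
Remedy 4 can be sketched in a few lines: score each candidate complexity on held-out folds and keep the one with the lowest cross-validated error. This is a minimal illustration, assuming a toy noisy-sine dataset and polynomial degree as the complexity knob. `kfold_mse` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np
from numpy.polynomial import Polynomial

def kfold_mse(x, y, degree, k=5, seed=0):
    """Mean validation MSE of a degree-`degree` polynomial over k folds."""
    rng = np.random.default_rng(seed)             # same folds for every degree
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = Polynomial.fit(x[train_idx], y[train_idx], degree, domain=[0, 1])
        scores.append(np.mean((model(x[val_idx]) - y[val_idx]) ** 2))
    return float(np.mean(scores))

# Assumed toy data: noisy samples of sin(2*pi*x).
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 60)

cv_error = {d: kfold_mse(x, y, d) for d in range(1, 11)}
best_degree = min(cv_error, key=cv_error.get)
print("best degree:", best_degree, "CV MSE:", round(cv_error[best_degree], 3))
```

Cross-validation does not change the model; it only gives an honest score for comparing candidates, which is why it usually appears alongside the other remedies rather than instead of them.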

Remedies — underfitting

  1. More complex model.
  2. Better features. Feature engineering, interactions, polynomial expansion.
  3. Reduce regularisation.
  4. Train longer / better.
  5. Boosting. Combine weak learners sequentially, each one correcting the errors of the previous ones.

Diagnosis tool — learning curve

Plot train and validation error vs training-set size or epochs.

  • Underfitting: both curves plateau high → adding data won't help; model too simple.
  • Overfitting: training error very low, validation error high. The gap widens as training continues over epochs and narrows only slowly as data is added → simplify, regularise, or add data.
  • Good fit: both converge close to irreducible error.
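
A learning curve against training-set size can be computed by refitting the same model on growing subsets. The sketch below is one way to do it, assuming the same kind of toy noisy-sine data and two polynomial models (degree 1 as a deliberately simple model, degree 9 as a flexible one). All names and sizes are illustrative.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def sample(n):
    """Assumed toy data: noisy samples of sin(2*pi*x) on [0, 1]."""
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)

x_pool, y_pool = sample(200)   # pool that training subsets are drawn from
x_val, y_val = sample(200)     # fixed validation set

def learning_curve_points(degree, sizes):
    """Return (n, train MSE, validation MSE) for each training-set size n."""
    rows = []
    for n in sizes:
        model = Polynomial.fit(x_pool[:n], y_pool[:n], degree, domain=[0, 1])
        train_mse = np.mean((model(x_pool[:n]) - y_pool[:n]) ** 2)
        val_mse = np.mean((model(x_val) - y_val) ** 2)
        rows.append((n, float(train_mse), float(val_mse)))
    return rows

sizes = [15, 30, 60, 120, 200]
for degree in (1, 9):
    print(f"degree {degree}")
    for n, tr, va in learning_curve_points(degree, sizes):
        print(f"  n={n:3d}  train MSE={tr:.3f}  val MSE={va:.3f}")
```

Plotting the two error columns against `n` gives the curves described above. For the degree-1 model both errors should settle well above the noise level, which is the underfitting pattern.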

Why this matters

In production, only generalisation matters. A model that overfits its training data delivers little business value, because its performance on real users falls well short of its training scores. Good ML practice means preventing overfitting aggressively.

Mind Map
Visual structure of the concept
OVER- / UNDER-FITTING
├── Underfit (high bias)
│   ├── Train high + Test high
│   ├── Model too simple
│   └── Fix: more features, deeper model
├── Overfit (high variance)
│   ├── Train low + Test high
│   ├── Memorises noise
│   └── Fix: more data, regularise, prune, early-stop, ensemble
├── Good fit
│   └── Train low, Test low, gap small
└── Diagnosis
    ├── Learning curves
    ├── Train ↔ CV gap
    └── Hold-out test
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. What is overfitting? A model that learns the training data — including noise — so closely that it fails on new data. Symptom: low training error but high test error.

Q2. What is underfitting? A model that is too simple to capture the underlying pattern; both training and test errors are high.

Q3. Name three techniques to reduce overfitting.

  1. Regularisation (L1, L2).
  2. Cross-validation to catch overfitting and choose simpler models.
  3. More data or data augmentation. (Early stopping, dropout, pruning and ensembling are equally valid answers.)

Part B (20 marks)

Q. What is overfitting and underfitting? Discuss their causes and remedies. How do learning curves help diagnose them?

Definitions.

  • Underfitting. Model too simple to capture the underlying pattern. Both training and test errors are high. High bias.
  • Overfitting. Model too complex; learns noise instead of signal. Training error very low, test error high. High variance.
  • Good fit. Captures the signal without the noise; train and test errors are both low and close.

Symptoms table.

Train error | Test error | Diagnosis
Low         | Low, close | Good fit
Low         | High       | Overfit
High        | High       | Underfit
High        | Low        | Suspect leakage / bug

Causes of overfitting.

  • Model capacity too high relative to dataset size (deep tree, large NN).
  • Too few training samples.
  • Noisy or mislabeled data.
  • Training for too long (NN).
  • Data leakage from test into train.
  • Too many or high-cardinality features.

Causes of underfitting.

  • Model too simple (linear when truth is non-linear).
  • Insufficient or uninformative features.
  • Over-aggressive regularisation.
  • Insufficient training (under-trained NN).

Remedies for overfitting.

  1. More data. Most reliable cure. Data augmentation (rotations/crops for images, back-translation for text) increases effective data without new labels.
  2. Simpler model. Fewer parameters, shallower trees, smaller neural net.
  3. Regularisation.
    • L2 (Ridge): penalty $\lambda \sum_j w_j^2$. Shrinks all weights.
    • L1 (Lasso): penalty $\lambda \sum_j |w_j|$. Drives some weights to zero.
    • Elastic Net: combines L1 and L2.
  4. Cross-validation. Picks the simplest model whose validation error is competitive.
  5. Early stopping. Stop training when validation error stops improving.
  6. Dropout. Randomly disable neurons during training; common in deep nets.
  7. Pruning. Cut back branches of decision trees that don't improve validation.
  8. Ensembling. Bagging (Random Forest), boosting (XGBoost) and stacking combine many learners; bagging in particular averages away the variance of the individual models.
  9. Feature reduction. Selection or dimensionality reduction (PCA).
  10. Proper validation. Stratified, group, time-series splits to prevent leakage.
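
The shrinking effect of the L2 penalty in remedy 3 is easy to see numerically. The sketch below assumes plain linear regression without an intercept, where minimising the squared error plus $\lambda \sum_j w_j^2$ has the closed-form solution $w = (X^\top X + \lambda I)^{-1} X^\top y$. The data, the true weights and the $\lambda$ values are made up for illustration. The L1 penalty has no such closed form and is not covered here.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam * I)^{-1} X^T y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Assumed synthetic data: five features, two of which are irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_w + rng.normal(0.0, 0.5, 50)

norms = {}
for lam in (0.0, 1.0, 10.0, 100.0):
    w = ridge_fit(X, y, lam)
    norms[lam] = float(np.linalg.norm(w))
    print(f"lambda={lam:6.1f}  ||w||={norms[lam]:.3f}  w={np.round(w, 2)}")
```

As $\lambda$ grows, the weight norm falls: every weight is pulled towards zero, but none is forced exactly to zero, which is the practical difference from L1.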

Remedies for underfitting.

  1. More complex model — switch from a linear model to gradient-boosted trees (GBT) or a neural network (NN).
  2. Better features — polynomial expansion, interactions, domain-driven engineering.
  3. Reduce regularisation — relax $\lambda$.
  4. Train longer — more epochs for NN.
  5. Boost / ensemble.

Diagnostic — Learning curves.

Plot train and validation error against training-set size or epochs.

Pattern                                     | Diagnosis | Remedy
Both errors high and plateau                | Underfit  | Complexity, features
Train very low, validation high; gap widens | Overfit   | Data, regularise, simplify
Both converge to small error                | Good fit  | Deploy
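
When the x-axis is epochs, the overfit row is exactly what early stopping watches for: training error keeps falling while validation error stops improving. The sketch below is one possible implementation, assuming plain gradient descent on a linear model. The `patience` rule, the learning rate and the synthetic data (more features than training samples, so overfitting is easy) are all illustrative choices.

```python
import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              lr=0.01, max_epochs=5000, patience=20):
    """Gradient descent on MSE; stop once validation MSE has not
    improved for `patience` consecutive epochs and return the best weights."""
    w = np.zeros(X_tr.shape[1])
    best_w, best_val, best_epoch = w.copy(), np.inf, 0
    for epoch in range(1, max_epochs + 1):
        grad = 2.0 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w = w - lr * grad
        val_mse = float(np.mean((X_val @ w - y_val) ** 2))
        if val_mse < best_val:                      # new best: remember it
            best_w, best_val, best_epoch = w.copy(), val_mse, epoch
        elif epoch - best_epoch >= patience:        # no progress for too long
            break
    return best_w, best_val, best_epoch, epoch

# Assumed synthetic data: 40 features, only three carry signal.
rng = np.random.default_rng(0)
n_features = 40
true_w = np.zeros(n_features)
true_w[:3] = [2.0, -1.0, 0.5]

def make_split(n):
    X = rng.normal(size=(n, n_features))
    return X, X @ true_w + rng.normal(0.0, 1.0, n)

X_tr, y_tr = make_split(30)     # fewer samples than features
X_val, y_val = make_split(200)

best_w, best_val, best_epoch, stopped_epoch = train_with_early_stopping(
    X_tr, y_tr, X_val, y_val)
print(f"best epoch={best_epoch}, stopped at epoch={stopped_epoch}, "
      f"best val MSE={best_val:.3f}")
```

Returning the weights from the best epoch, rather than the last one, is the design choice that matters: the epochs run during the patience window are discarded.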

Worked example. A Random Forest on a 2 000-patient readmission dataset reports train F1 = 0.95, validation F1 = 0.71. The 24-point gap signals overfitting. Remedies: limit max_depth, increase min_samples_leaf, and restrict each split with max_features='sqrt'; validate with stratified group k-fold so the estimate itself is trustworthy. After tuning, train F1 = 0.78 and validation F1 = 0.73: the gap narrows to 5 points and the model generalises better.
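
The arithmetic behind the worked example, using only the F1 scores quoted above (the helper name `gap_points` is made up for this check):

```python
# F1 scores quoted in the worked example.
before = {"train": 0.95, "validation": 0.71}
after = {"train": 0.78, "validation": 0.73}

def gap_points(scores):
    """Train-validation gap in percentage points."""
    return round((scores["train"] - scores["validation"]) * 100)

validation_gain = round((after["validation"] - before["validation"]) * 100)
print(gap_points(before), gap_points(after), validation_gain)  # 24 5 2
```

Training F1 drops by 17 points, yet validation F1 rises by 2: the tuned forest has given up memorised detail, not useful signal.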

Take-away. Generalisation is the goal. Detect and prevent overfitting aggressively — through validation, regularisation, ensembling and learning-curve diagnostics. The best models are the simplest that still work.