PGDDSA Study · Semester 1

PGD01C02

Module 4 · Model Evaluation

Overfitting, Underfitting and Model Selection

Core Titles

Key headlines and terms for quick recall

Underfitting — model too simple (high bias)
Overfitting — model too complex (high variance)
Symptoms — gaps between train and test errors
Causes — capacity, data size, noise, leakage
Remedies
- More data / augmentation
- Regularisation (L1, L2)
- Cross-validation
- Early stopping
- Dropout / pruning / ensembling
Model selection — pick complexity that minimises CV error
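As a rough illustration of "pick complexity that minimises CV error", the sketch below runs k-fold cross-validation over a few candidate complexities. The k-nearest-neighbour regressor, the synthetic sine data, the candidate list and all function names are assumptions made for this example; they do not come from the notes. For k-NN, a small neighbour count means a more flexible (more complex) model.

```python
import math
import random

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def cv_error(xs, ys, k_neighbours, n_folds=5):
    """Mean squared error averaged over the held-out folds."""
    fold_errors = []
    for fold in k_fold_indices(len(xs), n_folds):
        held_out = set(fold)
        tr_x = [xs[i] for i in range(len(xs)) if i not in held_out]
        tr_y = [ys[i] for i in range(len(xs)) if i not in held_out]
        sq = [(knn_predict(tr_x, tr_y, xs[i], k_neighbours) - ys[i]) ** 2
              for i in fold]
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / len(fold_errors)

# Synthetic data (assumption): noisy sine, already in random order.
rng = random.Random(0)
xs = [rng.uniform(0, 6) for _ in range(60)]
ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]

candidates = [1, 3, 5, 9, 15, 25]          # hypothetical complexity grid
scores = {k: cv_error(xs, ys, k) for k in candidates}
best_k = min(scores, key=scores.get)       # complexity with the lowest CV error
```

The same loop works for any model family: replace `knn_predict` with the model of interest and `candidates` with its complexity knob (tree depth, polynomial degree, regularisation strength).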

Basic Idea

What it is, why it matters, how it works

The two failure modes

Underfitting — the model is too simple to capture the underlying pattern. Both training and test errors are high.

Overfitting — the model is too complex: it memorises training data (including noise) and fails on new data. Training error is low, test error is high.

Good fit — model captures the signal but not the noise. Training and test errors are both low and close.
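The "high bias" and "high variance" labels can be tied together with the standard decomposition of expected squared error. As a sketch, assume a regression target $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$ that is independent of the training set, and let $\hat f$ be the fitted model (this notation is introduced here and is not part of the notes):

$$\mathbb{E}\big[(y - \hat f(x))^2\big] = \big(\mathbb{E}[\hat f(x)] - f(x)\big)^2 + \mathrm{Var}\big[\hat f(x)\big] + \sigma^2$$

The first term is squared bias (large when underfitting), the second is variance (large when overfitting), and $\sigma^2$ is the irreducible error that no model can remove.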

Symptoms

Train error	Test error	Diagnosis
Low	Low (close)	Good fit ✓
Low	High	Overfit (high variance)
High	High	Underfit (high bias)
High	Low	Suspect leakage / bug
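The symptom table can be read as a small decision rule. The sketch below is one possible encoding; the function name and the two thresholds (`high`, `gap`) are illustrative assumptions, since what counts as a "high" error depends entirely on the task and metric.

```python
def diagnose_fit(train_err, test_err, high=0.20, gap=0.05):
    """Map a (train, test) error pair onto the symptom table.

    `high` and `gap` are illustrative thresholds, not standard values.
    """
    if train_err >= high and test_err < train_err - gap:
        return "suspect leakage / bug"      # test clearly better than train
    if train_err >= high:
        return "underfit (high bias)"       # both errors high
    if test_err - train_err > gap:
        return "overfit (high variance)"    # low train error, much higher test error
    return "good fit"                       # both low and close

print(diagnose_fit(0.05, 0.25))  # overfit (high variance)
print(diagnose_fit(0.30, 0.32))  # underfit (high bias)
```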

Causes of overfitting

Model capacity too high for the data size (deep tree, large NN).
Too few training samples.
Noisy or mislabeled training data.
Training for too long (NN).
Data leakage from test into train.
Too many features / high-cardinality categoricals.

Causes of underfitting

Model too simple (linear when truth is non-linear).
Insufficient features.
Over-aggressive regularisation.
Too little training (NN under-trained).

Remedies — overfitting

More data — most reliable cure. Data augmentation effectively increases data.
Simpler model — fewer parameters, shallower trees, smaller NN.
Regularisation — L1 (Lasso) for feature selection; L2 (Ridge) for shrinkage; Elastic Net for both.
Cross-validation — estimates generalisation error, so it detects overfitting rather than curing it; pick the model with the lowest CV error.
Early stopping — for iterative models, stop when validation error stops improving.
Dropout — randomly disable neurons during training.
Pruning — for decision trees.
Ensembling — bagging (Random Forest), boosting (XGBoost).
Reduce features — selection, PCA.
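Of the remedies above, early stopping is the easiest to show in a few lines. The sketch below scans a list of per-epoch validation errors and reports where training would halt. The `patience` and `min_delta` parameters are a common convention rather than something these notes specify, and the error values are made up for illustration.

```python
def early_stopping_epoch(val_errors, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors.

    Training stops once `patience` consecutive epochs pass without the
    validation error improving on the best value by more than `min_delta`.
    """
    best_err = float("inf")
    best_epoch = -1
    for epoch, err in enumerate(val_errors):
        if err < best_err - min_delta:
            best_err, best_epoch = err, epoch   # new best: remember this epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch            # out of patience: stop here
    return best_epoch, len(val_errors) - 1      # never triggered: ran to the end

# Illustrative curve: validation error bottoms out at epoch 3, then rises.
val_errors = [0.50, 0.40, 0.35, 0.33, 0.34, 0.36, 0.39, 0.45]
best, stopped = early_stopping_epoch(val_errors, patience=3)
print(best, stopped)  # 3 6
```

In practice the weights saved at `best_epoch` are the ones to keep, not those from the epoch at which training stopped.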

Remedies — underfitting

More complex model.
Better features. Feature engineering, interactions, polynomial expansion.
Reduce regularisation.
Train longer / better.
Combine via boosting.
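A tiny example of "better features" curing underfitting: a straight line fitted to noise-free quadratic data cannot do better than predicting the mean, while the same linear fit on the engineered feature x² is exact. The data and helper names below are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form, one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def mse(xs, ys, a, b):
    """Mean squared error of the line a + b*x on (xs, ys)."""
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [x / 2 for x in range(-6, 7)]   # -3.0, -2.5, ..., 3.0
ys = [x ** 2 for x in xs]            # noise-free quadratic target

# Linear model on the raw feature: too simple, so the error stays high.
a, b = fit_line(xs, ys)
raw_mse = mse(xs, ys, a, b)          # about 9.6 on this data

# Same linear model on the engineered feature x**2: the pattern is captured.
sq = [x ** 2 for x in xs]
a2, b2 = fit_line(sq, ys)
eng_mse = mse(sq, ys, a2, b2)        # essentially zero
```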

Diagnosis tool — learning curve

Plot train and validation error vs training-set size or epochs.

Underfitting: both curves plateau high → adding data won't help; model too simple.
Overfitting: training error very low, validation error high, with a large gap between the curves (widening over epochs; narrowing only slowly as data is added) → simplify, regularise, or add data.
Good fit: both converge close to irreducible error.
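The curve shapes described above can be reproduced with two deliberately extreme models. In the sketch below a constant (mean) predictor stands in for an underfitting model and a 1-nearest-neighbour regressor for an overfitting one. Both models, the synthetic sine data and all names are assumptions made for illustration only.

```python
import math
import random

def mean_model(train_x, train_y):
    """Underfits: always predicts the training mean."""
    m = sum(train_y) / len(train_y)
    return lambda x: m

def one_nn_model(train_x, train_y):
    """Overfits: copies the target of the single nearest training point."""
    def predict(x):
        i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
        return train_y[i]
    return predict

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def learning_curve(make_model, train, val, sizes):
    """Return (n, train_mse, val_mse) for each training-set size n."""
    rows = []
    for n in sizes:
        tx, ty = train[0][:n], train[1][:n]
        model = make_model(tx, ty)
        rows.append((n, mse(model, tx, ty), mse(model, val[0], val[1])))
    return rows

rng = random.Random(1)

def sample(n):
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

train, val = sample(200), sample(100)
sizes = [10, 50, 100, 200]
underfit_curve = learning_curve(mean_model, train, val, sizes)
overfit_curve = learning_curve(one_nn_model, train, val, sizes)

for n, tr, va in overfit_curve:
    print(f"1-NN  n={n:3d}  train={tr:.3f}  val={va:.3f}")
```

The 1-NN training error is exactly zero at every size because each training point is its own nearest neighbour, which is the "memorises training data" symptom in its purest form. The mean predictor shows the opposite pattern: both errors stay high however much data is added.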

Why this matters

In production, only generalisation matters. A model that overfits its training data delivers little business value, because its scores on real users will be mediocre however good its training metrics look. Good ML practice means guarding against overfitting at every stage.

Mind Map

Visual structure of the concept

OVER- / UNDER-FITTING
├── Underfit (high bias)
│   ├── Train high + Test high
│   ├── Model too simple
│   └── Fix: more features, deeper model
├── Overfit (high variance)
│   ├── Train low + Test high
│   ├── Memorises noise
│   └── Fix: more data, regularise, prune, early-stop, ensemble
├── Good fit
│   └── Train low, Test low, gap small
└── Diagnosis
    ├── Learning curves
    ├── Train ↔ CV gap
    └── Hold-out test

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. What is overfitting? A model that learns the training data — including noise — so closely that it fails on new data. Symptom: low training error but high test error.

Q2. What is underfitting? A model that is too simple to capture the underlying pattern; both training and test errors are high.

Q3. Name three techniques to reduce overfitting.

Regularisation (L1, L2).
Cross-validation to catch overfitting and choose simpler models.
More data or data augmentation; early stopping; dropout; pruning; ensembling.

Part B (20 marks)

Q. What is overfitting and underfitting? Discuss their causes and remedies. How do learning curves help diagnose them?

Definitions.

Underfitting. Model too simple to capture the underlying pattern. Both training and test errors are high. High bias.
Overfitting. Model too complex; learns noise instead of signal. Training error very low, test error high. High variance.
Good fit. Captures the signal without the noise; train and test errors are both low and close.

Symptoms table.

Train error	Test error	Diagnosis
Low	Low, close	Good fit
Low	High	Overfit
High	High	Underfit
High	Low	Suspect leakage / bug

Causes of overfitting.

Model capacity too high relative to dataset size (deep tree, large NN).
Too few training samples.
Noisy or mislabeled data.
Training for too long (NN).
Data leakage from test into train.
Too many or high-cardinality features.

Causes of underfitting.

Model too simple (linear when truth is non-linear).
Insufficient or uninformative features.
Over-aggressive regularisation.
Insufficient training (under-trained NN).

Remedies for overfitting.

More data. Most reliable cure. Data augmentation (rotations/crops for images, back-translation for text) increases effective data without new labels.
Simpler model. Fewer parameters, shallower trees, smaller neural net.
Regularisation.
- L2 (Ridge): $\lambda \sum w_j^2$. Shrinks all weights.
- L1 (Lasso): $\lambda \sum |w_j|$. Drives some weights to zero.
- Elastic Net: combines L1 and L2.
Cross-validation. Picks the simplest model whose validation error is competitive.
Early stopping. Stop training when validation error stops improving.
Dropout. Randomly disable neurons during training; common in deep nets.
Pruning. Cut back branches of decision trees that don't improve validation.
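Ensembling. Bagging (Random Forest) averages many high-variance learners, which cuts variance. Boosting (XGBoost) and stacking also combine learners, but boosting can itself overfit and needs its own regularisation (shrinkage, early stopping).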
Feature reduction. Selection or dimensionality reduction (PCA).
Proper validation. Stratified, group, time-series splits to prevent leakage.
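Why L2 shrinks every weight while L1 sets some to exactly zero can be seen in a one-weight toy problem. The sketch below assumes the data-fit term is $\tfrac{1}{2}(w - z)^2$, where $z$ is the unregularised weight; that simplification and the function names are assumptions for illustration, not the general Ridge or Lasso solvers.

```python
def ridge_shrink(z, lam):
    """Minimiser of 0.5*(w - z)**2 + lam * w**2: proportional shrinkage."""
    return z / (1.0 + 2.0 * lam)

def lasso_shrink(z, lam):
    """Minimiser of 0.5*(w - z)**2 + lam * abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0   # small weights are set exactly to zero

weights = [3.0, -1.5, 0.4, -0.2]   # illustrative unregularised weights
lam = 0.5

ridge = [ridge_shrink(z, lam) for z in weights]
lasso = [lasso_shrink(z, lam) for z in weights]
print(ridge)  # [1.5, -0.75, 0.2, -0.1]
print(lasso)  # [2.5, -1.0, 0.0, 0.0]
```

Ridge halves every weight here but leaves none at zero; Lasso removes the two small weights entirely, which is why it doubles as a feature selector.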

Remedies for underfitting.

More complex model — switch from linear to GBT / NN.
Better features — polynomial expansion, interactions, domain-driven engineering.
Reduce regularisation — relax $\lambda$.
Train longer — more epochs for NN.
Boost / ensemble.

Diagnostic — Learning curves.

Plot train and validation error against training-set size or epochs.

Pattern	Diagnosis	Remedy
Both errors high and plateau	Underfit	Complexity, features
Train very low, validation high; gap widens	Overfit	Data, regularise, simplify
Both converge to small error	Good fit	Deploy

Worked example. A Random Forest on a 2 000-patient readmission dataset reports train F1 = 0.95 and validation F1 = 0.71. The 24-point gap signals overfitting. Remedies: limit max_depth, increase min_samples_leaf, restrict max_features (e.g. 'sqrt'), and validate with stratified group k-fold so that records from the same group (e.g. the same patient) never fall in both train and validation folds. After tuning, train F1 = 0.78 and validation F1 = 0.73: the gap closes to 5 points and the model generalises better.
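The arithmetic in the worked example can be checked in a few lines. The F1 values are the ones quoted above; the hyperparameter values in `rf_params` are hypothetical, since the example names the knobs but not their settings.

```python
def generalisation_gap(train_score, val_score):
    """Train-minus-validation score, rounded to two decimals."""
    return round(train_score - val_score, 2)

before = generalisation_gap(0.95, 0.71)   # 0.24: the 24-point gap
after = generalisation_gap(0.78, 0.73)    # 0.05: gap after tuning
val_gain = round(0.73 - 0.71, 2)          # 0.02: validation F1 also improved

# Hypothetical settings (assumed values). The keys match scikit-learn's
# RandomForestClassifier keyword arguments of the same names.
rf_params = {"max_depth": 6, "min_samples_leaf": 10, "max_features": "sqrt"}

print(before, after, val_gain)  # 0.24 0.05 0.02
```

Note that training F1 dropped from 0.95 to 0.78 while validation F1 rose slightly. Giving up training fit for a smaller gap is the expected trade when constraining an overfit model.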

Take-away. Generalisation is the goal. Detect and prevent overfitting aggressively — through validation, regularisation, ensembling and learning-curve diagnostics. The best models are the simplest that still work.
