PGD01C02
Module 4 · Model Evaluation

Cross-Validation

Core Titles
Key headlines and terms for quick recall
  • Cross-validation (CV) — rotate train/test splits across the data
  • k-Fold CV (standard $k = 5$ or $10$)
  • Stratified k-Fold — preserves class ratios
  • Leave-One-Out CV ($k = n$)
  • Repeated k-Fold
  • Time-Series CV — respects order
  • Group k-Fold — keep entities intact
  • Nested CV — unbiased tuning + scoring
Basic Idea
What it is, why it matters, how it works

Why CV?

A single train/test split gives a noisy estimate of generalisation error. CV systematically rotates the split, using every example for both training and validation at some point — yielding a more stable estimate.

Specific benefits.

  • Use all data efficiently (especially valuable when small).
  • Pick hyperparameters reliably.
  • Detect overfitting (large train ↔ CV gap).
  • Compare models fairly.

k-Fold CV

  1. Split data randomly into $k$ folds of (roughly) equal size.
  2. For each fold $i = 1, \dots, k$: train on the other $k - 1$ folds, validate on fold $i$.
  3. Report mean and std of the fold scores.

Default $k = 5$ or $10$: an empirical sweet spot between bias, variance, and cost.
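
The three steps above can be sketched as an explicit loop. This is a minimal illustration, not a reference implementation: the synthetic dataset and the choice of LogisticRegression are assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Step 1: split the data randomly into k = 5 folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Step 2: train on the other k - 1 folds, validate on the held-out fold.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# Step 3: report the mean and standard deviation of the fold scores.
fold_scores = np.array(fold_scores)
print(f"accuracy: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

A fresh model is created inside the loop so that no fitted state carries over from one fold to the next.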

Stratified k-Fold

Preserves the class distribution in each fold. Essential for imbalanced classification so every fold contains both classes.
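
To see the effect, one can count the positives that land in each validation fold with and without stratification. The toy labels below (90 negatives, 10 positives) are an assumption chosen for illustration, and positives_per_fold is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 90 negatives, 10 positives (illustrative assumption).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for inspecting the split


def positives_per_fold(splitter):
    """Count positive labels in each validation fold (hypothetical helper)."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]


plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
stratified = positives_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print("plain k-fold:     ", plain)       # counts depend on the shuffle
print("stratified k-fold:", stratified)  # 2 positives in every fold
```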

Leave-One-Out (LOOCV)

$k = n$. Each fold has exactly one validation point.

  • Pros: maximises training data per fold; almost unbiased.
  • Cons: very expensive ($n$ fits); high variance of the estimate.
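
A short sketch of LOOCV on a deliberately tiny dataset. The synthetic regression data and the LinearRegression model are assumptions for illustration. Mean squared error is used because R² is not defined on a single validation point.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic regression problem (assumption for illustration).
X, y = make_regression(n_samples=30, n_features=3, noise=5.0, random_state=0)

loo = LeaveOneOut()
n_fits = loo.get_n_splits(X)  # one fit per sample, so n fits in total

# Each validation fold holds one sample, so score with a per-sample error.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=loo, scoring="neg_mean_squared_error"
)
print(n_fits, "fits, mean squared error:", -neg_mse.mean())
```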

Leave-P-Out (LPOCV)

Hold $p$ samples out as validation; iterate over all $\binom{n}{p}$ subsets. Even more expensive than LOOCV.
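
The cost can be checked directly: the number of fits is the binomial coefficient C(n, p). The sketch below compares scikit-learn's LeavePOut split count with math.comb on a toy array of 10 samples (an assumption for illustration).

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(10).reshape(-1, 1)  # n = 10 toy samples

for p in (1, 2, 3):
    n_splits = LeavePOut(p=p).get_n_splits(X)
    # The number of splits equals the binomial coefficient C(n, p).
    assert n_splits == comb(len(X), p)
    print(f"p={p}: {n_splits} fits")

# Even a modest n grows quickly: C(100, 2) = 4950, C(100, 5) = 75287520.
print(comb(100, 2), comb(100, 5))
```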

Repeated k-Fold

Run k-fold multiple times with different random seeds; average. Reduces variance of the estimate.
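
A minimal sketch using scikit-learn's RepeatedKFold. The synthetic data, the model, and the choice of 3 repeats are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 5 folds repeated 3 times with different shuffles gives 15 scores in total.
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=rkf)

print(len(scores), "scores:", scores.mean(), "+/-", scores.std())
```

For classification, RepeatedStratifiedKFold takes the same arguments and additionally preserves class ratios in each fold.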

Time-Series CV (rolling / expanding window)

Sequential data must respect time order. Train on $[0, t]$, validate on $[t+1, t+h]$, then slide the window forward. Never lets future data leak into training.
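
One way to sketch the expanding-window variant is scikit-learn's TimeSeriesSplit. The 12 toy observations are an assumption, and test_size=2 plays the role of the horizon h.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations (toy data)

# Expanding window: each split trains on everything before its validation block.
tscv = TimeSeriesSplit(n_splits=3, test_size=2)
splits = list(tscv.split(X))

for train_idx, val_idx in splits:
    print("train:", train_idx.min(), "to", train_idx.max(),
          "| validate:", val_idx.min(), "to", val_idx.max())
    assert train_idx.max() < val_idx.min()  # no future data in training
```

Passing max_train_size caps the length of the training window, which turns the expanding window into a rolling one.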

Group k-Fold / Leave-One-Group-Out

When data has groups (patients, users, sessions) that must not span train and test (to avoid leakage), use groups as the splitting unit.
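
A small sketch with GroupKFold, assuming 12 samples from four hypothetical patients (p1 to p4). The check inside the loop confirms that no patient appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 12 samples from 4 hypothetical patients, 3 samples each.
groups = np.repeat(["p1", "p2", "p3", "p4"], 3)
X = np.arange(12).reshape(-1, 1)
y = np.tile([0, 1, 0], 4)

gkf = GroupKFold(n_splits=4)
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    train_groups = set(groups[train_idx])
    val_groups = set(groups[val_idx])
    assert train_groups.isdisjoint(val_groups)  # no patient on both sides
    print("validate on", sorted(val_groups))
```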

Nested CV

Two-level CV:

  • Outer loop estimates performance.
  • Inner loop tunes hyperparameters.

Eliminates the optimistic bias of tuning and reporting on the same CV.
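
The two loops can be sketched by passing a GridSearchCV object to cross_val_score. The dataset, the model, and the grid over C are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: tune C using only the training portion of each outer fold.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    cv=inner_cv,
)

# Outer loop: score the tuned model on folds the search never saw.
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print(outer_scores.mean(), outer_scores.std())
```

The hyperparameters chosen may differ from one outer fold to the next. The outer scores estimate the performance of the whole tuning procedure, not of one fixed setting.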

How to choose

| Data type | CV strategy |
|---|---|
| Tabular, classification | Stratified k-fold ($k = 5$ or $10$) |
| Tabular, regression | k-fold |
| Very small data | LOOCV |
| Time-series | Time-series CV |
| Grouped (patients, users) | Group k-fold |
| Reporting + tuning | Nested CV |

Implementation

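# Assumes model (a binary classifier), X (features) and y (labels) are already defined.
from sklearn.model_selection import StratifiedKFold, cross_val_score
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # shuffled, stratified 5-fold
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")  # one F1 score per fold
print(scores.mean(), scores.std())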

Important caveat

Fit preprocessing inside the CV loop (use a Pipeline) — fitting a scaler/imputer on the whole data before splitting causes data leakage, inflating CV scores.
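
A minimal sketch of the leak-free pattern: the scaler is wrapped with the model in a pipeline, so cross_val_score refits it on the training folds of every split. The synthetic data and the choice of StandardScaler with LogisticRegression are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The scaler is refit on the training folds of every split, so validation
# rows never influence the scaling statistics.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")
print(scores.mean(), scores.std())
```

The pattern to avoid is calling StandardScaler().fit_transform(X) on the full data first and then cross-validating the model on the scaled array.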

Mind Map
Visual structure of the concept
CROSS-VALIDATION
├── Why?
│   ├── Stable error estimate
│   ├── Hyperparameter tuning
│   └── Detect overfit
├── Types
│   ├── k-Fold (5/10 default)
│   ├── Stratified k-Fold (classes)
│   ├── LOOCV (k=n)
│   ├── Leave-P-Out
│   ├── Repeated k-Fold
│   ├── Time-Series CV
│   ├── Group k-Fold
│   └── Nested CV
└── Caveat
    └── Fit preprocessing INSIDE CV (Pipeline)
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Define k-fold cross-validation. Split the data into $k$ equal folds. For each fold $i$, train on the other $k - 1$ folds and validate on the held-out fold. Average the $k$ scores. Default $k = 5$ or $10$.

Q2. Why use stratified k-fold for classification? To preserve the class distribution in every fold. Without stratification, an imbalanced dataset can leave some folds with very few or no positive examples, producing unstable or biased scores.

Q3. What is time-series cross-validation? A CV variant that respects temporal order: train on data up to time $t$, validate on $t+1$ to $t+h$, then slide the window forward. Prevents future data from leaking into training.


Part B (20 marks)

Q. Why is cross-validation used in machine learning? Discuss different types of cross-validation with examples and when each is appropriate.

Why CV?

A single train/test split gives a noisy score. CV rotates the role of train and validation across the data so every sample is used both ways, producing:

  1. More stable estimate of generalisation error.
  2. Reliable basis for hyperparameter tuning (grid / random / Bayesian search).
  3. Fair model-to-model comparison.
  4. Sensitive detection of overfitting (large train↔CV gap).

Types of CV.

1. k-Fold CV. Default approach. Split into $k$ folds; train on $k - 1$; validate on the remaining fold; repeat $k$ times. Average the scores.

  • Typical $k = 5$ or $10$, an empirical sweet spot.
  • Use for general regression / classification on tabular data.

2. Stratified k-Fold. Same as k-fold but preserves the class proportions in each fold.

  • Essential for imbalanced classification so every fold contains both classes.
  • Use for any classification task.

3. Leave-One-Out (LOOCV). $k = n$. Each fold's validation set is a single sample.

  • Almost unbiased estimate.
  • Expensive: $n$ fits.
  • Use when data is very small.

4. Leave-P-Out. Hold $p$ samples out; iterate over all $\binom{n}{p}$ subsets. Even more expensive; rarely used.

5. Repeated k-Fold. Run k-fold several times with different random seeds; average. Reduces variance of the estimate.

  • Use when single k-fold scores fluctuate too much.

6. Time-Series CV (rolling / expanding window). For sequential data, train on $[0, t]$, validate on $[t+1, t+h]$, then slide the window forward.

  • Never lets future data leak into training.
  • Use for forecasting, financial models, demand prediction.

7. Group k-Fold / Leave-One-Group-Out. When data has groups (patients, users, sessions) that must not span train and test.

  • Treats group as splitting unit.
  • Use for healthcare (one patient → one fold), recommender systems (one user), audio (one speaker).

8. Nested CV. Two-level CV:

  • Inner loop — tunes hyperparameters on training portion.
  • Outer loop — evaluates on untouched fold.
  • Removes the bias caused by tuning and scoring on the same CV.
  • Use when reporting and tuning.

Comparison.

| Strategy | Best for | Cost |
|---|---|---|
| k-Fold | General-purpose | Moderate |
| Stratified k-Fold | Imbalanced classification | Moderate |
| LOOCV | Very small data | High |
| Repeated k-Fold | Reducing score noise | High |
| Time-Series CV | Sequential / forecast | Moderate |
| Group k-Fold | Repeated entities | Moderate |
| Nested CV | Tuning + reporting | High |

Worked example. A churn model on 5000 customers with 8% positive rate. We use Stratified 5-fold CV because of the imbalance. With nested CV: outer 5-fold reports F1 = 0.41 ± 0.03; inner 5-fold tunes XGBoost hyperparameters per outer fold. No leakage; trustworthy comparison.
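
As a quick check on the fold composition in this example, 8% of 5000 customers is 400 churners, so stratified 5-fold should place 80 of them in each 1000-customer validation fold. The sketch below only inspects the split. The placeholder features are an assumption and no model is trained.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels matching the worked example: 5000 customers, 8% churners.
y = np.array([1] * 400 + [0] * 4600)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = [val_idx for _, val_idx in cv.split(X, y)]
positives = [int(y[val_idx].sum()) for val_idx in folds]

print(positives)  # 80 churners in each 1000-customer validation fold
```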

Critical caveat — preprocessing. Fit all preprocessing (imputer, scaler, encoder) inside each CV fold using a Pipeline. Fitting on the entire data before splitting leaks information from validation into training and inflates scores. sklearn.pipeline.Pipeline makes this automatic.