PGDDSA Study · Semester 1

PGD01C02

Module 4 · Model Evaluation

Cross-Validation

Core Titles

Key headlines and terms for quick recall

Cross-validation (CV) — rotate train/test splits across the data
k-Fold CV — standard $k=5$ or $10$
Stratified k-Fold — preserves class ratios
Leave-One-Out CV — $k = n$
Repeated k-Fold
Time-Series CV — respects order
Group k-Fold — keep entities intact
Nested CV — unbiased tuning + scoring

Basic Idea

What it is, why it matters, how it works

Why CV?

A single train/test split gives a noisy estimate of generalisation error. CV systematically rotates the split, using every example for both training and validation at some point — yielding a more stable estimate.

Specific benefits.

Use all data efficiently (especially valuable when small).
Pick hyperparameters reliably.
Detect overfitting (large train ↔ CV gap).
Compare models fairly.
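The train ↔ CV gap mentioned above can be measured directly. The sketch below is one possible way to do it, assuming scikit-learn is available; the synthetic dataset and the choice of an unpruned decision tree are illustrative assumptions, not part of these notes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data used only for illustration (not from the notes).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

# An unpruned tree can memorise its training folds, so it should overfit.
model = DecisionTreeClassifier(random_state=0)

# return_train_score=True also records the score on each training portion.
result = cross_validate(model, X, y, cv=5, scoring="accuracy",
                        return_train_score=True)

train_mean = result["train_score"].mean()
cv_mean = result["test_score"].mean()
gap = train_mean - cv_mean
print(f"train={train_mean:.3f}  cv={cv_mean:.3f}  gap={gap:.3f}")
```

A training score far above the CV score suggests the model is fitting noise; a small gap with both scores low points to underfitting instead.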

k-Fold CV

Shuffle the data and split it into $k$ folds of (nearly) equal size.
For each fold $i = 1, \dots, k$: train on the other $k - 1$ folds, validate on fold $i$.
Report mean and std of the fold scores.

Default $k = 5$ or $10$ — empirical sweet spot of bias vs variance vs cost.
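The three steps above can be sketched from scratch with NumPy alone. This is a minimal illustration, not a replacement for a library splitter; the toy regression data and the least-squares line model are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data, invented for illustration: y = 3x + 1 + noise.
n, k = 100, 5
X = rng.uniform(-1, 1, size=(n, 1))
y = 3 * X[:, 0] + 1 + rng.normal(scale=0.1, size=n)

# Step 1: shuffle the indices and cut them into k roughly equal folds.
folds = np.array_split(rng.permutation(n), k)

scores = []
for i in range(k):
    # Step 2: fold i validates, the other k - 1 folds train.
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    # Least-squares line fitted on the training folds only.
    A_train = np.column_stack([X[train_idx, 0], np.ones(len(train_idx))])
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

    A_val = np.column_stack([X[val_idx, 0], np.ones(len(val_idx))])
    mse = np.mean((A_val @ coef - y[val_idx]) ** 2)
    scores.append(mse)

# Step 3: report mean and standard deviation of the fold scores.
scores = np.array(scores)
print(f"MSE: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Every index lands in exactly one validation fold, which is what lets each example serve for both training and validation across the $k$ rounds.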

Stratified k-Fold

Preserves the class distribution in each fold. Essential for imbalanced classification so every fold contains both classes.
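To see the effect of stratification, one can count the positives that land in each validation fold. The sketch below assumes scikit-learn's StratifiedKFold; the 10%-positive toy labels are invented for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels, invented for illustration: 10 positives among 100 samples.
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((len(y), 1))  # feature values do not affect how the split is made

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

positives_per_fold = []
for train_idx, val_idx in cv.split(X, y):
    positives_per_fold.append(int(y[val_idx].sum()))

print(positives_per_fold)
```

The positives are spread evenly across the folds, so each fold sees roughly the same 10% positive rate as the full dataset.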

Leave-One-Out (LOOCV)

$k = n$. Each fold has exactly one validation point.

Pros: maximises training data per fold; almost unbiased.
Cons: very expensive ($n$ fits); high variance of the estimate.
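A small sketch of LOOCV, assuming scikit-learn's LeaveOneOut splitter and a toy dataset invented for the example. Squared error is used as the metric here because $R^2$ cannot be computed on a single validation point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)

# Tiny toy dataset (n = 20), invented for illustration.
X = rng.uniform(0, 10, size=(20, 1))
y = 2 * X[:, 0] + rng.normal(scale=1.0, size=20)

loo = LeaveOneOut()
n_fits = loo.get_n_splits(X)  # one fit per sample

# scikit-learn reports negated MSE so that larger is always better.
scores = cross_val_score(LinearRegression(), X, y, cv=loo,
                         scoring="neg_mean_squared_error")
loo_mse = -scores.mean()
print(n_fits, round(loo_mse, 3))
```

The number of fits equals the number of samples, which is why LOOCV is usually reserved for very small datasets.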

Leave-P-Out (LPOCV)

Hold $p$ samples out as validation; iterate over all $\binom{n}{p}$ subsets. Even more expensive.
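The cost claim is easy to check by enumerating the splits. The sketch below uses only the Python standard library; the sizes $n = 6$, $p = 2$ and the comparison with $n = 100$ are arbitrary choices for illustration.

```python
from itertools import combinations
from math import comb

n, p = 6, 2  # toy sizes chosen for illustration

# Every size-p subset of the indices serves once as the validation set.
splits = []
for val_idx in combinations(range(n), p):
    train_idx = [i for i in range(n) if i not in val_idx]
    splits.append((train_idx, list(val_idx)))

n_splits = len(splits)

# The count grows quickly with n: compare a modest dataset of 100 samples.
big = comb(100, 2)
print(n_splits, big)
```

Even with $p = 2$, a dataset of 100 samples already needs thousands of fits, against 100 for LOOCV.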

Repeated k-Fold

Run k-fold multiple times with different random seeds; average. Reduces variance of the estimate.
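A minimal sketch of repeated k-fold, assuming scikit-learn's RepeatedKFold; the synthetic data and the logistic regression model are placeholders chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 5 folds repeated 3 times with different shuffles gives 15 scores.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")

print(len(scores), scores.mean(), scores.std())
```

Averaging over the repeats smooths out the luck of any single shuffle, at the price of proportionally more fits. For imbalanced classification, RepeatedStratifiedKFold is the stratified counterpart.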

Time-Series CV (rolling / expanding window)

Sequential data must respect time order. Train on $[0, t]$, validate on $[t+1, t+h]$, slide window forward. Never lets future data leak into training.
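The expanding-window scheme can be sketched with scikit-learn's TimeSeriesSplit. The 12-point series is a stand-in invented for the example, with row order playing the role of time.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered observations; index order stands in for time.
X = np.arange(12).reshape(-1, 1)

# Expanding window: the training set grows, the validation block moves forward.
tscv = TimeSeriesSplit(n_splits=3)

splits = [(train_idx, val_idx) for train_idx, val_idx in tscv.split(X)]
for train_idx, val_idx in splits:
    print("train:", train_idx.tolist(), "validate:", val_idx.tolist())
```

In every split the last training index precedes the first validation index. Setting the splitter's `max_train_size` parameter caps the training length, which turns the expanding window into a rolling one.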

Group k-Fold / Leave-One-Group-Out

When data has groups (patients, users, sessions) that must not span train and test (to avoid leakage), use groups as the splitting unit.
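A short sketch of group-aware splitting, assuming scikit-learn's GroupKFold. The four patient identifiers are hypothetical and exist only to show that no group appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 12 rows from 4 hypothetical patients, 3 rows each.
groups = np.repeat(["p1", "p2", "p3", "p4"], 3)
X = np.zeros((len(groups), 1))
y = np.zeros(len(groups))

gkf = GroupKFold(n_splits=4)

overlaps = []
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    # Groups present in both the training and validation portions.
    shared = set(groups[train_idx]) & set(groups[val_idx])
    overlaps.append(shared)
    print("validate on:", sorted(set(groups[val_idx])))
```

With as many splits as groups this behaves like Leave-One-Group-Out; fewer splits would place several patients in each validation fold.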

Nested CV

Two-level CV:

Outer loop estimates performance.
Inner loop tunes hyperparameters.

Eliminates the optimistic bias of tuning and reporting on the same CV.
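One common way to write the two loops is to wrap a grid search inside an outer cross-validation call. The sketch below assumes scikit-learn; the SVC model, the small grid over `C`, and the synthetic data are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: the grid search tunes C using only the data it is given.
search = GridSearchCV(SVC(kernel="rbf"), param_grid={"C": [0.1, 1, 10]},
                      cv=inner_cv, scoring="accuracy")

# Outer loop: each outer training portion is tuned from scratch,
# then scored on an outer fold the search never saw.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
print(outer_scores.mean(), outer_scores.std())
```

The outer scores estimate the performance of the whole procedure (tuning included), and different outer folds may end up selecting different values of `C`.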

How to choose

Data type	CV strategy
Tabular, classification	Stratified k-fold ($k=5$ or 10)
Tabular, regression	k-fold
Very small data	LOOCV
Time-series	Time-series CV
Grouped (patients, users)	Group k-fold
Reporting + tuning	Nested CV

Implementation

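from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
X, y = make_classification(n_samples=500, weights=[0.9], random_state=42)  # placeholder data; swap in your own
model = LogisticRegression(max_iter=1000)  # placeholder estimator
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # stratified, shuffled folds
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")  # one F1 score per fold
print(scores.mean(), scores.std())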

Important caveat

Fit preprocessing inside the CV loop (use a Pipeline) — fitting a scaler/imputer on the whole data before splitting causes data leakage, inflating CV scores.
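The caveat can be put into practice as follows. This is a sketch assuming scikit-learn; the scaler, the logistic regression model, and the synthetic data are placeholders for whatever preprocessing and estimator are actually in use.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is a pipeline step, so it is refitted on each training portion
# and only applied (never fitted) to the matching validation fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")
print(scores.mean(), scores.std())
```

The leaky alternative would be calling `StandardScaler().fit_transform(X)` on the full data first and then cross-validating the model alone; the validation folds would then have influenced the scaling statistics.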

Mind Map

Visual structure of the concept

CROSS-VALIDATION
├── Why?
│   ├── Stable error estimate
│   ├── Hyperparameter tuning
│   └── Detect overfit
├── Types
│   ├── k-Fold (5/10 default)
│   ├── Stratified k-Fold (classes)
│   ├── LOOCV (k=n)
│   ├── Leave-P-Out
│   ├── Repeated k-Fold
│   ├── Time-Series CV
│   ├── Group k-Fold
│   └── Nested CV
└── Caveat
    └── Fit preprocessing INSIDE CV (Pipeline)

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Define k-fold cross-validation. Split the data into $k$ equal folds. For each $i$, train on $k-1$ folds and validate on the held-out fold. Average the $k$ scores. Default $k = 5$ or $10$.

Q2. Why use stratified k-fold for classification? To preserve the class distribution in every fold. Without stratification, a random split of an imbalanced dataset can leave some folds with very few or no positive examples, producing unstable or biased scores.

Q3. What is time-series cross-validation? A CV variant that respects temporal order: train on data up to time $t$, validate on $t+1$ to $t+h$, then slide the window forward. Prevents future data from leaking into training.

Part B (20 marks)

Q. Why is cross-validation used in machine learning? Discuss different types of cross-validation with examples and when each is appropriate.

Why CV?

A single train/test split gives a noisy score. CV rotates the role of train and validation across the data so every sample is used both ways, producing:

More stable estimate of generalisation error.
Reliable basis for hyperparameter tuning (grid / random / Bayesian search).
Fair model-to-model comparison.
Sensitive detection of overfitting (large train↔CV gap).

Types of CV.

1. k-Fold CV. Default approach. Split into $k$ folds; train on $k-1$; validate on the remaining fold; repeat $k$ times. Average the scores.

Typical $k = 5$ or 10 — empirical sweet spot.
Use for general regression / classification on tabular data.

2. Stratified k-Fold. Same as k-fold but preserves the class proportions in each fold.

Essential for imbalanced classification so every fold contains both classes.
Use for any classification task.

3. Leave-One-Out (LOOCV). $k = n$. Each fold's validation is a single sample.

Almost unbiased estimate.
Expensive — $n$ fits.
Use when data is very small.

4. Leave-P-Out. Hold $p$ samples out; iterate over all $\binom{n}{p}$ subsets. Even more expensive; rarely used.

5. Repeated k-Fold. Run k-fold several times with different random seeds; average. Reduces variance of the estimate.

Use when single k-fold scores fluctuate too much.

6. Time-Series CV (rolling / expanding window). For sequential data, train on $[0, t]$, validate on $[t+1, t+h]$, slide window forward.

Never lets future data leak into training.
Use for forecasting, financial models, demand prediction.

7. Group k-Fold / Leave-One-Group-Out. When data has groups (patients, users, sessions) that must not span train and test.

Treats group as splitting unit.
Use for healthcare (one patient → one fold), recommender systems (one user), audio (one speaker).

8. Nested CV. Two-level CV:

Inner loop — tunes hyperparameters on training portion.
Outer loop — evaluates on untouched fold.
Removes the bias caused by tuning and scoring on the same CV.
Use when the same data must serve both hyperparameter tuning and final performance reporting.

Comparison.

Strategy	Best for	Cost
k-Fold	General-purpose	Moderate
Stratified k-Fold	Imbalanced classification	Moderate
LOOCV	Very small data	High
Repeated k-Fold	Reducing score noise	High
Time-Series CV	Sequential / forecast	Moderate
Group k-Fold	Repeated entities	Moderate
Nested CV	Tuning + reporting	High

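Worked example. A churn model on 5000 customers with an 8% positive rate. Stratified 5-fold CV is used because of the imbalance. With nested CV, the outer 5-fold loop reports F1 = 0.41 ± 0.03, while an inner 5-fold loop tunes the XGBoost hyperparameters separately within each outer training portion. Because tuning never sees the outer validation fold, the reported score is free of tuning leakage and can be trusted for model comparison.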

Critical caveat: preprocessing. Fit all preprocessing (imputer, scaler, encoder) inside each CV fold, on the training portion only. Fitting on the entire data before splitting leaks information from validation into training and inflates scores. Wrapping the preprocessing steps and the model in sklearn.pipeline.Pipeline and passing that pipeline to the CV routine handles this automatically.
