Cross-Validation
Core Titles
Key headlines and terms for quick recall
- Cross-validation (CV): rotate train/test splits across the data
- k-Fold CV: standard k = 5 or 10
- Stratified k-Fold: preserves class ratios
- Leave-One-Out CV: k = n
- Repeated k-Fold
- Time-Series CV: respects order
- Group k-Fold: keeps entities intact
- Nested CV: unbiased tuning + scoring
Basic Idea
What it is, why it matters, how it works
Why CV?
A single train/test split gives a noisy estimate of generalisation error. CV systematically rotates the split, using every example for both training and validation at some point — yielding a more stable estimate.
Specific benefits.
- Use all data efficiently (especially valuable when small).
- Pick hyperparameters reliably.
- Detect overfitting (large train ↔ CV gap).
- Compare models fairly.
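The train-versus-CV gap mentioned in the list can be measured directly. The following is a minimal sketch, assuming scikit-learn's `cross_validate` with `return_train_score=True`; the synthetic dataset and the unpruned decision tree are illustrative choices, not part of the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (assumption: stands in for a real dataset).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# An unpruned tree can memorise its training folds, so it tends to overfit.
model = DecisionTreeClassifier(random_state=0)

result = cross_validate(model, X, y, cv=5, scoring="accuracy",
                        return_train_score=True)

train_mean = result["train_score"].mean()
cv_mean = result["test_score"].mean()
gap = train_mean - cv_mean  # a large positive gap suggests overfitting
print(f"train={train_mean:.3f}  cv={cv_mean:.3f}  gap={gap:.3f}")
```

A gap near zero suggests the model generalises about as well as it fits; a large gap is a cue to regularise or simplify.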
k-Fold CV
- Split data randomly into k equal folds.
- For each fold i: train on the other k − 1 folds, validate on fold i.
- Report mean and std of the fold scores.
Default k = 5 or k = 10: an empirical sweet spot of bias vs variance vs cost.
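The splitting procedure above can be sketched without any library. This is an illustrative pure-Python version; the function name `k_fold_indices` and its parameters are hypothetical, and in practice `sklearn.model_selection.KFold` would normally be used instead.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold CV over range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment to folds
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    print("validate on", sorted(val_idx), "train on", len(train_idx), "samples")
```

Every sample lands in exactly one validation fold and in the training set of the other k − 1 folds.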
Stratified k-Fold
Preserves the class distribution in each fold. Essential for imbalanced classification so every fold contains both classes.
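To see the effect of stratification, one can count positives per validation fold. A small sketch, assuming scikit-learn's `StratifiedKFold` and toy labels invented for illustration (90 negatives, 10 positives):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 90 negatives, 10 positives (10% positive rate).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features do not influence how the split is made

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y[val_idx].sum()) for _, val_idx in cv.split(X, y)]
print(positives_per_fold)
```

With 10 positives spread over 5 folds, each validation fold should receive 2 of them, matching the overall 10% rate.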
Leave-One-Out (LOOCV)
k = n, where n is the number of samples. Each fold has exactly one validation point.
- Pros: maximises training data per fold; almost unbiased.
- Cons: very expensive (n fits, one per sample); high variance of the estimate.
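A short LOOCV sketch, assuming scikit-learn's `LeaveOneOut` and a synthetic regression problem made up for illustration. Squared error is used as the metric because R² cannot be computed on a single validation point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # n = 20 samples
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=20)

loo = LeaveOneOut()
# One model fit per sample; each score is the negated squared error on one point.
scores = cross_val_score(LinearRegression(), X, y, cv=loo,
                         scoring="neg_mean_squared_error")
n_fits = loo.get_n_splits(X)
print(n_fits, -scores.mean())
```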
Leave-P-Out (LPOCV)
Hold p samples out as validation; iterate over all C(n, p) subsets of size p. Even more expensive than LOOCV.
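The cost is easiest to appreciate by counting splits. A pure-Python sketch; the helper `leave_p_out` is hypothetical and written only for illustration (scikit-learn offers `LeavePOut` for real use).

```python
from itertools import combinations
from math import comb

def leave_p_out(n_samples, p):
    """Yield (train_idx, val_idx) for every size-p validation subset."""
    all_idx = range(n_samples)
    for val_idx in combinations(all_idx, p):
        held_out = set(val_idx)
        train_idx = [i for i in all_idx if i not in held_out]
        yield train_idx, list(val_idx)

splits = list(leave_p_out(n_samples=5, p=2))
print(len(splits), comb(5, 2))  # number of model fits required for n = 5, p = 2
print(comb(100, 2))             # the count grows quickly with n
```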
Repeated k-Fold
Run k-fold multiple times with different random seeds; average. Reduces variance of the estimate.
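A minimal sketch of repetition, assuming scikit-learn's `RepeatedKFold`; the synthetic data and the logistic regression model are placeholders chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 5 folds repeated 3 times with different shuffles gives 15 scores in total.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(len(scores), scores.mean(), scores.std())
```

For classification, `RepeatedStratifiedKFold` combines repetition with stratification in the same way.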
Time-Series CV (rolling / expanding window)
Sequential data must respect time order. Train on observations up to time t, validate on the next block (t+1 to t+h), then slide the window forward. Never lets future data leak into training.
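An expanding-window split can be sketched with scikit-learn's `TimeSeriesSplit`; the 12-point toy series is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)  # training window grows with each split
splits = list(tscv.split(X))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

Every validation index is later than every training index. Passing `max_train_size` caps the training window, which turns the expanding window into a rolling one.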
Group k-Fold / Leave-One-Group-Out
When data has groups (patients, users, sessions) that must not span train and test (to avoid leakage), use groups as the splitting unit.
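A small sketch of group-aware splitting, assuming scikit-learn's `GroupKFold`; the patient IDs below are hypothetical and exist only to show that no group ends up on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)
# Hypothetical patient IDs: each patient contributes two rows.
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

gkf = GroupKFold(n_splits=3)
splits = list(gkf.split(X, y, groups=groups))
for train_idx, val_idx in splits:
    train_groups = set(groups[train_idx].tolist())
    val_groups = set(groups[val_idx].tolist())
    # The intersection is empty when no patient spans train and validation.
    print("validation patients:", sorted(val_groups),
          "shared with train:", train_groups & val_groups)
```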
Nested CV
Two-level CV:
- Outer loop estimates performance.
- Inner loop tunes hyperparameters.
Eliminates the optimistic bias of tuning and reporting on the same CV.
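One common way to express the two loops is to pass a `GridSearchCV` object to `cross_val_score`. This is a sketch under stated assumptions: the synthetic data, the logistic regression model and the grid over `C` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: the grid search picks C using only the outer-training portion.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                      cv=inner_cv, scoring="f1")

# Outer loop: each fold re-runs the whole search, then scores on held-out data.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="f1")
print(outer_scores.mean(), outer_scores.std())
```

The outer scores estimate the performance of the whole tuning procedure, not of one fixed hyperparameter setting.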
How to choose
| Data type | CV strategy |
|---|---|
| Tabular, classification | Stratified k-fold (k = 5 or 10) |
| Tabular, regression | k-fold |
| Very small data | LOOCV |
| Time-series | Time-series CV |
| Grouped (patients, users) | Group k-fold |
| Reporting + tuning | Nested CV |
Implementation
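from sklearn.model_selection import StratifiedKFold, cross_val_score
# model, X, y are assumed to exist: a scikit-learn classifier and a binary-labelled dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # shuffled, stratified 5-fold
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")  # one F1 score per fold
print(scores.mean(), scores.std())  # mean and spread across folds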
Important caveat
Fit preprocessing inside the CV loop (use a Pipeline) — fitting a scaler/imputer on the whole data before splitting causes data leakage, inflating CV scores.
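A minimal sketch of the leak-free pattern, assuming scikit-learn's `Pipeline` with a `StandardScaler`; the step names, synthetic data and model are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The scaler is re-fitted on the training part of every fold,
# so validation rows never influence its mean and variance.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print(scores.mean(), scores.std())
```

The leaky alternative would call `StandardScaler().fit_transform(X)` once on the full data and then cross-validate on the result.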
Mind Map
Visual structure of the concept
CROSS-VALIDATION
├── Why?
│   ├── Stable error estimate
│   ├── Hyperparameter tuning
│   └── Detect overfit
├── Types
│   ├── k-Fold (5/10 default)
│   ├── Stratified k-Fold (classes)
│   ├── LOOCV (k=n)
│   ├── Leave-P-Out
│   ├── Repeated k-Fold
│   ├── Time-Series CV
│   ├── Group k-Fold
│   └── Nested CV
└── Caveat
    └── Fit preprocessing INSIDE CV (Pipeline)
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions
Part A (2 marks each)
Q1. Define k-fold cross-validation. Split the data into k equal folds. For each fold i, train on the other k − 1 folds and validate on the held-out fold. Average the scores. Default k = 5 or k = 10.
Q2. Why use stratified k-fold for classification? To preserve the class distribution in every fold. Without stratification, an imbalanced dataset might place all positive examples in a single fold, producing unstable or biased scores.
Q3. What is time-series cross-validation? A CV variant that respects temporal order: train on data up to time t, validate on t+1 to t+h, then slide the window forward. Prevents future data from leaking into training.
Part B (20 marks)
Q. Why is cross-validation used in machine learning? Discuss different types of cross-validation with examples and when each is appropriate.
Why CV?
A single train/test split gives a noisy score. CV rotates the role of train and validation across the data so every sample is used both ways, producing:
- More stable estimate of generalisation error.
- Reliable basis for hyperparameter tuning (grid / random / Bayesian search).
- Fair model-to-model comparison.
- Sensitive detection of overfitting (large train↔CV gap).
Types of CV.
1. k-Fold CV. Default approach. Split into k folds; train on k − 1 of them; validate on the remaining fold; repeat k times. Average the scores.
- Typical k = 5 or 10: empirical sweet spot.
- Use for general regression / classification on tabular data.
2. Stratified k-Fold. Same as k-fold but preserves the class proportions in each fold.
- Essential for imbalanced classification so every fold contains both classes.
- Use for any classification task.
3. Leave-One-Out (LOOCV). k = n. Each fold's validation set is a single sample.
- Almost unbiased estimate.
- Expensive: n fits.
- Use when data is very small.
4. Leave-P-Out. Hold p samples out; iterate over all C(n, p) subsets. Even more expensive; rarely used.
5. Repeated k-Fold. Run k-fold several times with different random seeds; average. Reduces variance of the estimate.
- Use when single k-fold scores fluctuate too much.
6. Time-Series CV (rolling / expanding window). For sequential data, train on observations up to time t, validate on t+1 to t+h, then slide the window forward.
- Never lets future data leak into training.
- Use for forecasting, financial models, demand prediction.
7. Group k-Fold / Leave-One-Group-Out. When data has groups (patients, users, sessions) that must not span train and test.
- Treats group as splitting unit.
- Use for healthcare (one patient → one fold), recommender systems (one user), audio (one speaker).
8. Nested CV. Two-level CV:
- Inner loop — tunes hyperparameters on training portion.
- Outer loop — evaluates on untouched fold.
- Removes the bias caused by tuning and scoring on the same CV.
- Use when the same data must serve for both tuning hyperparameters and reporting final performance.
Comparison.
| Strategy | Best for | Cost |
|---|---|---|
| k-Fold | General-purpose | Moderate |
| Stratified k-Fold | Imbalanced classification | Moderate |
| LOOCV | Very small data | High |
| Repeated k-Fold | Reducing score noise | High |
| Time-Series CV | Sequential / forecast | Moderate |
| Group k-Fold | Repeated entities | Moderate |
| Nested CV | Tuning + reporting | High |
Worked example. A churn model on 5000 customers with 8% positive rate. We use Stratified 5-fold CV because of the imbalance. With nested CV: outer 5-fold reports F1 = 0.41 ± 0.03; inner 5-fold tunes XGBoost hyperparameters per outer fold. No leakage; trustworthy comparison.
Critical caveat — preprocessing. Fit all preprocessing (imputer, scaler, encoder) inside each CV fold using a Pipeline. Fitting on the entire data before splitting leaks information from validation into training and inflates scores. sklearn.pipeline.Pipeline makes this automatic.