PGD01C02
Module 4 · Model Evaluation

In-sample Evaluation Measures

Core Titles
Key headlines and terms for quick recall
  • In-sample evaluation — measure model fit on the training data
  • Mean Squared Error (MSE): $\text{MSE} = \dfrac{1}{n}\sum (y_i - \hat y_i)^2$
  • Root Mean Squared Error (RMSE): same units as $y$
  • Mean Absolute Error (MAE)
  • Coefficient of determination: $R^2 = 1 - \text{SSE}/\text{SST}$
  • Adjusted $R^2$: penalises extra predictors
  • Information criteria: AIC, BIC
  • Caveat: in-sample metrics over-estimate generalisation
Basic Idea
What it is, why it matters, how it works

What is in-sample evaluation?

Measuring how well a model fits the training data it was estimated from. It tells you whether the model captured the signal in the data — but not how it will perform on new data.

Common in-sample metrics

1. Mean Squared Error (MSE). $\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat y_i)^2$. Penalises large errors heavily; differentiable (used as a training loss).

2. Root Mean Squared Error (RMSE). $\text{RMSE} = \sqrt{\text{MSE}}$. Same units as $y$, so more interpretable.

3. Mean Absolute Error (MAE). $\text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat y_i|$. More robust to outliers than MSE; linear in error magnitude.

4. Coefficient of Determination ($R^2$). $R^2 = 1 - \frac{\text{SSE}}{\text{SST}}$, with $\text{SSE} = \sum (y_i - \hat y_i)^2$ and $\text{SST} = \sum (y_i - \bar y)^2$. Fraction of variance explained; lies between 0 and 1 on training data for OLS with an intercept.

5. Adjusted $R^2$. $R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors. Penalises adding non-informative predictors.

6. Information criteria.

  • $\text{AIC} = 2k - 2 \ln L$, where $L$ is the maximised likelihood and $k$ is the number of parameters.
  • $\text{BIC} = k \ln n - 2 \ln L$.
  • Lower is better. BIC penalises complexity more strongly.
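
The first four metrics above can be computed directly from their definitions. The following is a minimal sketch in plain Python; the function name `regression_metrics` and the small data set are illustrative assumptions, not part of the course material.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for two equal-length sequences."""
    n = len(y_true)
    residuals = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    sse = sum(r * r for r in residuals)             # sum of squared errors
    y_bar = sum(y_true) / n
    sst = sum((y - y_bar) ** 2 for y in y_true)     # total sum of squares
    mse = sse / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),                     # back in the units of y
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1.0 - sse / sst,                      # undefined if y_true is constant
    }

# Hypothetical training targets and fitted values (made up for illustration).
y_train = [3.0, 5.0, 7.0, 9.0]
y_fitted = [2.5, 5.5, 7.0, 10.0]

metrics = regression_metrics(y_train, y_fitted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On this toy data the single error of size 1 contributes two thirds of the SSE, which is why RMSE comes out larger than MAE.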

The big caveat — in-sample optimism

A model's in-sample error is biased downward because the model was fitted to those very data points. For instance, a deep tree can achieve $R^2 = 1$ on training data and only 0.5 on test data.

To estimate true generalisation, we need out-of-sample metrics (held-out test set, cross-validation). In-sample metrics alone can give a misleadingly rosy picture and let over-fitting go unnoticed.

Rule of thumb. Use in-sample metrics to confirm the model fits at all (e.g., $R^2 \gg 0$). Use out-of-sample metrics to decide whether to deploy.
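
In-sample optimism is easy to reproduce with a model that simply memorises its training set. The sketch below swaps the deep tree for a 1-nearest-neighbour regressor, which has the same memorising behaviour and needs no external library. The synthetic data (a linear signal plus Gaussian noise), the seed and all names are assumptions chosen for illustration.

```python
import random

def one_nn_predict(x_train, y_train, x):
    """Predict with the target of the closest training point (1-nearest neighbour)."""
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return y_train[nearest]

def r_squared(y_true, y_pred):
    """R^2 = 1 - SSE / SST."""
    y_bar = sum(y_true) / len(y_true)
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sst = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - sse / sst

def make_data(n):
    # Synthetic data: y = 2x + noise (an illustrative assumption).
    xs = [random.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + random.gauss(0.0, 3.0) for x in xs]
    return xs, ys

random.seed(0)  # arbitrary seed, only for repeatability
x_tr, y_tr = make_data(50)   # training set
x_te, y_te = make_data(50)   # held-out test set

# On training points the nearest neighbour is the point itself, so the fit is exact.
train_r2 = r_squared(y_tr, [one_nn_predict(x_tr, y_tr, x) for x in x_tr])
test_r2 = r_squared(y_te, [one_nn_predict(x_tr, y_tr, x) for x in x_te])
print(f"training R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

The training score is perfect by construction, so only the test score carries information about generalisation.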

Classification analogues

For classification, in-sample equivalents include:

  • Training accuracy, precision, recall, F1.
  • Training log-loss / cross-entropy.

Same caveat applies — training accuracy is over-optimistic.
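
For a binary classifier that outputs probabilities, training accuracy and training log-loss can be sketched as follows. The function names, the 0.5 threshold and the example labels and probabilities are illustrative assumptions.

```python
import math

def training_accuracy(y_true, p_hat, threshold=0.5):
    """Share of training points whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in p_hat]
    return sum(int(pred == y) for pred, y in zip(preds, y_true)) / len(y_true)

def training_log_loss(y_true, p_hat, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Hypothetical training labels and predicted probabilities of class 1.
labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.6, 0.4]

acc = training_accuracy(labels, probs)
ll = training_log_loss(labels, probs)
print(f"training accuracy = {acc:.2f}, training log-loss = {ll:.4f}")
```

Log-loss also penalises the confident-but-wrong case that accuracy treats like any other miss, so the two are worth reading together.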

Mind Map
Visual structure of the concept
IN-SAMPLE EVALUATION
├── On training data
├── Regression metrics
│   ├── MSE — Σ(y − ŷ)² / n
│   ├── RMSE — √MSE (same units)
│   ├── MAE — Σ|y − ŷ| / n
│   ├── R² — variance explained
│   └── Adjusted R² — penalty for p
├── Information criteria
│   ├── AIC — lower better
│   └── BIC — stronger penalty
├── Classification metrics (training)
│   ├── Accuracy, Precision, Recall, F1
│   └── Log-loss
└── Caveat
    └── Over-optimistic; use OOS too
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Define MSE and RMSE.

  • $\text{MSE} = \dfrac{1}{n}\sum (y_i - \hat y_i)^2$.
  • $\text{RMSE} = \sqrt{\text{MSE}}$, in the same units as $y$.

Q2. What is the difference between $R^2$ and Adjusted $R^2$? $R^2$ measures variance explained but never decreases when a predictor is added, even a useless one. Adjusted $R^2$ penalises the number of predictors, decreasing if a new feature doesn't improve fit enough.
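
A small numerical check of the Q2 answer, using the adjusted $R^2$ formula from these notes. The figures ($n = 50$, $R^2$ rising from 0.800 to 0.801 when a sixth predictor is added) are made up for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical fit before and after adding one weak predictor.
before = adjusted_r2(0.800, n=50, p=5)
after = adjusted_r2(0.801, n=50, p=6)
print(f"adjusted R^2 before: {before:.4f}, after: {after:.4f}")
```

Plain $R^2$ went up by 0.001, yet the adjusted value falls from about 0.777 to about 0.773, because the gain in fit is too small to offset the extra parameter.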

Q3. Why are in-sample metrics overoptimistic? Because the model was estimated to fit the training data — those very points. Even a degenerate model can memorise the training set and score perfectly while failing on unseen data. Out-of-sample evaluation is needed.


Part B (20 marks)

Q. Explain the in-sample evaluation measures for regression models. Discuss MSE, RMSE, MAE, $R^2$, Adjusted $R^2$, AIC and BIC with their formulas and use cases.

Why in-sample evaluation? To answer the basic question "Did the model fit at all?". Sound in-sample fit is a necessary but not sufficient condition for deployment — out-of-sample metrics tell the rest.

1. Mean Squared Error. $\text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat y_i)^2$.

  • Penalises large errors quadratically.
  • Differentiable — preferred as a training loss.
  • Units: $y$ squared (so hard to interpret directly).

2. Root Mean Squared Error. $\text{RMSE} = \sqrt{\text{MSE}}$.

  • Most commonly reported metric; same units as $y$.
  • Comparable to standard deviation.

3. Mean Absolute Error. $\text{MAE} = \frac{1}{n}\sum_{i=1}^n |y_i - \hat y_i|$.

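  • More robust to outliers than MSE (no squaring).
  • Not differentiable at zero error (kink at 0).
  • Used as a training loss in robust regression: it is the median case of quantile loss, and Huber loss behaves like it for large errors.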

4. Coefficient of Determination $R^2$. $R^2 = 1 - \frac{\text{SSE}}{\text{SST}}$, with $\text{SSE} = \sum (y_i - \hat y_i)^2$ and $\text{SST} = \sum (y_i - \bar y)^2$.

  • 0 = model no better than predicting the mean.
  • 1 = perfect fit.
  • For OLS with an intercept, $R^2 \ge 0$ on training data. On test data, or for other models, it can be negative.

5. Adjusted $R^2$. $R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$.

  • Penalises adding non-informative predictors.
  • Preferred when comparing models with different numbers of features.

6. Information Criteria.

$\text{AIC} = 2k - 2 \ln L$, where $L$ is the maximised likelihood and $k$ the number of parameters.

  • Lower is better.
  • Balances fit vs complexity.

$\text{BIC} = k \ln n - 2 \ln L$.

  • Larger penalty on complexity than AIC ($\ln n > 2$ for $n > 7$).
  • Tends to favour simpler models than AIC.
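
To make the two criteria concrete, the sketch below assumes a regression model with Gaussian errors, for which the maximised log-likelihood depends only on SSE and $n$: $\ln L = -\frac{n}{2}\left(\ln 2\pi + \ln(\text{SSE}/n) + 1\right)$. The two candidate models, their SSE values and their parameter counts are hypothetical.

```python
import math

def gaussian_log_likelihood(sse, n):
    """Maximised log-likelihood under Gaussian errors (variance estimated as SSE / n)."""
    return -0.5 * n * (math.log(2.0 * math.pi) + math.log(sse / n) + 1.0)

def aic(log_l, k):
    """AIC = 2k - 2 ln L."""
    return 2.0 * k - 2.0 * log_l

def bic(log_l, k, n):
    """BIC = k ln n - 2 ln L."""
    return k * math.log(n) - 2.0 * log_l

n = 100
# Hypothetical candidates: (SSE on the training data, number of parameters k).
candidates = {"small": (52.0, 3), "large": (47.0, 6)}

scores = {}
for name, (sse, k) in candidates.items():
    log_l = gaussian_log_likelihood(sse, n)
    scores[name] = {"aic": aic(log_l, k), "bic": bic(log_l, k, n)}
    print(f"{name}: AIC = {scores[name]['aic']:.2f}, BIC = {scores[name]['bic']:.2f}")
```

With these made-up numbers AIC prefers the larger model while BIC prefers the smaller one, which illustrates the stronger complexity penalty of BIC at $n = 100$.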

Use cases for each.

Metric      | Use when
RMSE        | Default; comparable scale to $y$
MAE         | Outliers present; robust comparison
$R^2$       | Quick communication of fit quality
Adj. $R^2$  | Comparing models with different $p$
AIC         | Model selection, especially nested models
BIC         | Model selection with stronger simplicity preference

Caveat: over-optimism. In-sample metrics measure fit on training data and are biased downward (errors look smaller than they truly will be on new data). A common trap is reporting only training $R^2$: a deep decision tree can reach $R^2 = 1$ on training and 0.3 on test.

Best practice.

  1. Compute in-sample metrics to verify the model fits.
  2. Then compute out-of-sample metrics (held-out test, k-fold CV) — these drive deployment.
  3. Always report both side-by-side; the gap reveals over- or under-fitting.
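
The three steps above can be sketched with a simple straight-line model and a hand-rolled k-fold split. Everything here is an illustrative assumption: the helper names, the interleaved fold assignment (real data would usually be shuffled first) and the synthetic data.

```python
import math

def fit_line(xs, ys):
    """Closed-form least-squares line; returns a prediction function."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def rmse(ys, preds):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

def train_and_cv_rmse(xs, ys, k=5):
    """Return (training RMSE, k-fold cross-validated RMSE) for the line model."""
    # Step 1: in-sample fit on all data.
    model = fit_line(xs, ys)
    train_rmse = rmse(ys, [model(x) for x in xs])
    # Step 2: out-of-sample errors, each point predicted by a model that never saw it.
    n, squared_error = len(xs), 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))   # every k-th point forms one fold
        train_idx = [i for i in range(n) if i not in test_idx]
        fold_model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        squared_error += sum((ys[i] - fold_model(xs[i])) ** 2 for i in test_idx)
    return train_rmse, math.sqrt(squared_error / n)

# Synthetic data: a line plus a small alternating disturbance.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]

# Step 3: report both numbers side by side.
train_rmse, cv_rmse = train_and_cv_rmse(xs, ys)
print(f"training RMSE = {train_rmse:.4f}, 5-fold CV RMSE = {cv_rmse:.4f}")
```

For a simple, well-specified model like this one the two numbers stay close; a large gap between them is the warning sign for over-fitting.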

Example. A model on house prices reports training RMSE = ₹25 k and $R^2 = 0.95$. The 5-fold CV RMSE is ₹95 k, nearly four times the training value. This gap (in-sample optimism) signals overfitting: simplify the model, regularise, or get more data before deploying.