PGDDSA Study · Semester 1

PGD01C02

Module 4 · Model Evaluation

In-sample Evaluation Measures

Core Titles

Key headlines and terms for quick recall

In-sample evaluation — measure model fit on the training data
Mean Squared Error (MSE) $= \dfrac{1}{n}\sum (y_i - \hat y_i)^2$
Root Mean Squared Error (RMSE) — same units as $y$
Mean Absolute Error (MAE)
Coefficient of determination $R^2$ = 1 − SSE / SST
Adjusted $R^2$ — penalises extra predictors
Information criteria: AIC, BIC
Caveat: in-sample metrics over-estimate generalisation performance

Basic Idea

What it is, why it matters, how it works

What is in-sample evaluation?

Measuring how well a model fits the training data it was estimated from. It tells you whether the model captured the signal in the data — but not how it will perform on new data.

Common in-sample metrics

1. Mean Squared Error (MSE). $\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat y_i)^2.$ Penalises large errors heavily; differentiable (used as a training loss).

2. Root Mean Squared Error (RMSE). $\text{RMSE} = \sqrt{\text{MSE}}.$ Same units as $y$ — more interpretable.

3. Mean Absolute Error (MAE). $\text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat y_i|.$ Less sensitive to outliers than MSE; linear in error magnitude.

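4. Coefficient of Determination ( $R^2$ ). $R^2 = 1 - \frac{\text{SSE}}{\text{SST}}, \quad \text{SSE} = \sum (y_i - \hat y_i)^2, \quad \text{SST} = \sum (y_i - \bar y)^2.$ Fraction of variance explained; lies between 0 and 1 for an OLS fit with an intercept, evaluated on its training data.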

5. Adjusted $R^2$ . $R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1},$ where $n$ is the number of observations and $p$ the number of predictors. Penalises adding non-informative predictors.

6. Information criteria.

AIC = $2k - 2 \ln(L)$ where $L$ is the maximised likelihood and $k$ is the number of estimated parameters.
BIC = $k \ln n - 2 \ln L$ , where $n$ is the number of observations.
Lower is better. BIC penalises complexity more strongly.
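
The regression metrics listed above can be computed directly from their formulas. The sketch below is a minimal illustration using NumPy; the function name `regression_metrics` and the toy values are made up for this example, and `p` is assumed to count predictors excluding the intercept.

```python
import numpy as np

def regression_metrics(y, y_hat, p):
    """Return MSE, RMSE, MAE, R^2 and adjusted R^2 for fitted values y_hat.

    p is the number of predictors (intercept not counted).
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    residuals = y - y_hat
    sse = np.sum(residuals ** 2)        # sum of squared errors
    sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
    mse = sse / n
    r2 = 1.0 - sse / sst
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(residuals)),
        "R2": r2,
        "Adj_R2": 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1),
    }

# Toy values, invented for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(y_true, y_pred, p=1)
print(metrics)
```

Here SSE = 2 and SST = 20, so MSE = 0.5, MAE = 0.5, $R^2 = 0.9$ and $R^2_{\text{adj}} = 1 - 0.1 \times 3 / 2 = 0.85$.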

The big caveat — in-sample optimism

A model's in-sample error is biased downward — it has seen those data points. For instance, a deep tree can achieve $R^2 = 1$ on training and 0.5 on test.

To estimate true generalisation, we need out-of-sample metrics (held-out test, cross-validation). In-sample metrics alone can give a misleadingly rosy picture and hide over-fitting.

Rule of thumb. Use in-sample metrics to confirm the model fits at all (e.g., $R^2 \gg 0$ ). Use out-of-sample metrics to decide whether to deploy.
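
In-sample optimism is easy to reproduce with a model that simply memorises its training set. The sketch below swaps the deep tree for a 1-nearest-neighbour regressor written in NumPy, which behaves similarly on training data; the synthetic data, the seed and the helper names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data: a linear signal plus noise (illustrative only)."""
    x = rng.uniform(0.0, 10.0, size=n)
    y = 2.0 * x + rng.normal(0.0, 3.0, size=n)
    return x, y

def predict_1nn(x_train, y_train, x_new):
    """Predict with the target of the closest training point (pure memorisation)."""
    nearest = np.abs(x_new[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[nearest]

def r2_score(y, y_hat):
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse / sst

x_train, y_train = make_data(50)
x_test, y_test = make_data(50)

r2_train = r2_score(y_train, predict_1nn(x_train, y_train, x_train))
r2_test = r2_score(y_test, predict_1nn(x_train, y_train, x_test))
print(f"training R^2 = {r2_train:.3f}, test R^2 = {r2_test:.3f}")
```

Each training point is its own nearest neighbour, so the training $R^2$ is exactly 1 regardless of the noise, while the test $R^2$ is lower because the memorised noise does not carry over to new points.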

Classification analogues

For classification, in-sample equivalents include:

Training accuracy, precision, recall, F1.
Training log-loss / cross-entropy.

Same caveat applies — training accuracy is over-optimistic.
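
The classification metrics above follow from the confusion-matrix counts. A minimal NumPy sketch for binary labels (1 = positive) is given below; the function names and the toy labels are invented for illustration, and the code assumes at least one predicted and one actual positive so that no division by zero occurs.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels in {0, 1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / y_true.size,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

# Toy training labels and predictions, invented for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = classification_metrics(y_true, y_pred)
uninformed_loss = log_loss([1, 0], [0.5, 0.5])  # equals ln 2
print(scores, uninformed_loss)
```

With TP = 3, FP = 1, FN = 1 and TN = 3, all four scores come out at 0.75. Computed on training labels, these are in-sample figures and carry the same optimism as training $R^2$.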

Mind Map

Visual structure of the concept

IN-SAMPLE EVALUATION
├── On training data
├── Regression metrics
│   ├── MSE — Σ(y − ŷ)² / n
│   ├── RMSE — √MSE (same units)
│   ├── MAE — Σ|y − ŷ| / n
│   ├── R² — variance explained
│   └── Adjusted R² — penalty for p
├── Information criteria
│   ├── AIC — lower better
│   └── BIC — stronger penalty
├── Classification metrics (training)
│   ├── Accuracy, Precision, Recall, F1
│   └── Log-loss
└── Caveat
    └── Over-optimistic; use OOS too

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Define MSE and RMSE.

$\text{MSE} = \dfrac{1}{n}\sum (y_i - \hat y_i)^2$ .
$\text{RMSE} = \sqrt{\text{MSE}}$ , in the same units as $y$ .

Q2. What is the difference between $R^2$ and Adjusted $R^2$ ? $R^2$ measures the fraction of variance explained, but in a least-squares fit it never decreases when a predictor is added, even a useless one. Adjusted $R^2$ penalises the number of predictors, so it decreases if a new feature does not improve the fit enough.

Q3. Why are in-sample metrics overoptimistic? Because the model was estimated to fit the training data, the very points it is then scored on. Even a model that has learned nothing general can memorise the training set and score perfectly while failing on unseen data. Out-of-sample evaluation is needed.
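
The claim in Q2, that $R^2$ cannot fall when a predictor is added, can be checked numerically. The sketch below fits OLS with `numpy.linalg.lstsq` on synthetic data, with and without a pure-noise feature; the data, the seed and the helper name `ols_r2` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 40
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)
noise_feature = rng.normal(size=n)  # unrelated to y by construction

def ols_r2(features, y):
    """Fit OLS with an intercept; return (R^2, adjusted R^2)."""
    X = np.column_stack([np.ones(len(y)), features])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - sse / sst
    p = X.shape[1] - 1  # predictors, intercept not counted
    adj = 1.0 - (1.0 - r2) * (len(y) - 1) / (len(y) - p - 1)
    return r2, adj

r2_base, adj_base = ols_r2(x, y)
r2_more, adj_more = ols_r2(np.column_stack([x, noise_feature]), y)
print(f"R^2:     {r2_base:.4f} -> {r2_more:.4f}")
print(f"Adj R^2: {adj_base:.4f} -> {adj_more:.4f}")
```

The larger model contains the smaller one as a special case, so its least-squares SSE cannot be higher and $R^2$ cannot drop. Adjusted $R^2$ carries no such guarantee and may go either way.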

Part B (20 marks)

Q. Explain the in-sample evaluation measures for regression models. Discuss MSE, RMSE, MAE, $R^2$ , Adjusted $R^2$ , AIC and BIC with their formulas and use cases.

Why in-sample evaluation? To answer the basic question "Did the model fit at all?". Sound in-sample fit is a necessary but not sufficient condition for deployment — out-of-sample metrics tell the rest.

1. Mean Squared Error. $\text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat y_i)^2.$

Penalises large errors quadratically.
Differentiable — preferred as a training loss.
Units: $y$ squared (so hard to interpret directly).

2. Root Mean Squared Error. $\text{RMSE} = \sqrt{\text{MSE}}.$

Most commonly reported metric, since it is in the same units as $y$ .
Can be read like a standard deviation of the residuals.

3. Mean Absolute Error. $\text{MAE} = \frac{1}{n}\sum_{i=1}^n |y_i - \hat y_i|.$

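More robust to outliers than MSE (no squaring).
Not differentiable at zero error (kink at 0), so optimisation relies on subgradients or similar methods.
Used as the training loss in median (0.5-quantile) regression; the Huber loss blends absolute and squared error.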

4. Coefficient of Determination $R^2$ . $R^2 = 1 - \frac{\text{SSE}}{\text{SST}}, \quad \text{SSE} = \sum (y_i - \hat y_i)^2, \quad \text{SST} = \sum (y_i - \bar y)^2.$

0 = model no better than predicting the mean.
1 = perfect fit.
For OLS with an intercept, $0 \le R^2 \le 1$ on the training data. On test data, or for models not fitted by least squares with an intercept, $R^2$ can be negative.

5. Adjusted $R^2$ . $R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1},$ where $n$ is the number of observations and $p$ the number of predictors.

Penalises adding non-informative predictors.
Preferred when comparing models with different numbers of features.

6. Information Criteria.

AIC = $2k - 2 \ln L$ , where $L$ is the maximum likelihood and $k$ the number of parameters.

Lower is better.
Balances fit vs complexity.

BIC = $k \ln n - 2 \ln L$ .

Larger penalty on complexity ( $\ln n > 2$ for $n > 7$ ).
Tends to favour simpler models than AIC.
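
For a least-squares fit, AIC and BIC can be computed from the residuals once a likelihood is chosen. The sketch below assumes Gaussian errors, for which the maximised log-likelihood is $\ln L = -\frac{n}{2}\left(\ln(2\pi\hat\sigma^2) + 1\right)$ with $\hat\sigma^2 = \text{SSE}/n$, and counts the error variance as one extra parameter. The function name, the synthetic data and the polynomial degrees are illustrative choices, not part of the notes.

```python
import numpy as np

def gaussian_aic_bic(y, y_hat, n_coefs):
    """AIC and BIC of a least-squares fit under an assumed Gaussian error model.

    n_coefs counts fitted coefficients including the intercept; the error
    variance is added as one more estimated parameter.
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    sigma2 = np.sum((y - y_hat) ** 2) / n  # ML estimate of the error variance
    log_lik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    k = n_coefs + 1
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(n) - 2 * log_lik
    return aic, bic

# Synthetic data from a straight line plus noise (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, size=x.size)

results = {}
for degree in (1, 4):
    coefs = np.polyfit(x, y, degree)
    results[degree] = gaussian_aic_bic(y, np.polyval(coefs, x), n_coefs=degree + 1)
    print(degree, results[degree])
```

For any single model, BIC exceeds AIC by $k(\ln n - 2)$, so with $n = 30$ the degree-4 polynomial pays a larger extra penalty under BIC than under AIC.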

Use cases for each.

Metric	Use when
RMSE	Default; comparable scale to $y$
MAE	Outliers present; robust comparison
$R^2$	Quick communication of fit quality
Adj. $R^2$	Comparing models with different $p$
AIC	Model selection among likelihood-based models (nested or not) fitted to the same data
BIC	Model selection with stronger simplicity preference

Caveat — over-optimism. In-sample metrics measure fit on training data and are biased downward (errors look smaller than they truly will be on new data). A common trap is reporting only training $R^2$ — a deep decision tree can reach $R^2 = 1$ on training and 0.3 on test.

Best practice.

Compute in-sample metrics to verify the model fits.
Then compute out-of-sample metrics (held-out test, k-fold CV) — these drive deployment.
Always report both side by side: a large gap signals over-fitting, while poor scores on both signal under-fitting.

Example. A model on house prices reports training RMSE = ₹25 k, $R^2 = 0.95$ . The 5-fold CV RMSE = ₹95 k. The huge gap (in-sample optimism) signals overfitting — simplify the model, regularise, or get more data before deploying.
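
The side-by-side comparison of training RMSE and 5-fold CV RMSE can be sketched as follows. This is a minimal NumPy version with an OLS model on synthetic data; the helper names, the seed and the data are assumptions for illustration and have no connection to the house-price figures above.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients, with an intercept column prepended."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_ols(beta, X):
    return np.column_stack([np.ones(X.shape[0]), X]) @ beta

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def kfold_rmse(X, y, k=5, seed=0):
    """RMSE pooled over k held-out folds; each fold is predicted by a model
    fitted on the remaining folds only."""
    idx = np.random.default_rng(seed).permutation(len(y))
    squared_errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        beta = fit_ols(X[train], y[train])
        squared_errors.append((y[fold] - predict_ols(beta, X[fold])) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(squared_errors))))

# Synthetic data: 10 predictors, only the first one matters (illustrative).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
y = 3.0 * X[:, 0] + rng.normal(size=60)

train_rmse = rmse(y, predict_ols(fit_ols(X, y), X))
cv_rmse = kfold_rmse(X, y, k=5)
print(f"training RMSE = {train_rmse:.3f}, 5-fold CV RMSE = {cv_rmse:.3f}")
```

Reporting the two numbers together makes the size of the gap visible; how large a gap is acceptable depends on the problem.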
