In-sample Evaluation Measures
Core Titles
Key headlines and terms for quick recall
- In-sample evaluation: measure model fit on the training data
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE): same units as $y$
- Mean Absolute Error (MAE)
- Coefficient of determination $R^2 = 1 - \mathrm{SSE}/\mathrm{SST}$
- Adjusted $R^2$: penalises extra predictors
- Information criteria: AIC, BIC
- Caveat: in-sample metrics over-estimate generalisation
Basic Idea
What it is, why it matters, how it works
What is in-sample evaluation?
Measuring how well a model fits the training data it was estimated from. It tells you whether the model captured the signal in the data — but not how it will perform on new data.
Common in-sample metrics
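1. Mean Squared Error (MSE). $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. Penalises large errors heavily; differentiable (used as a training loss).
2. Root Mean Squared Error (RMSE). $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$. Same units as $y$, so more interpretable.
3. Mean Absolute Error (MAE). $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$. More robust to outliers than MSE; linear in error magnitude.
4. Coefficient of Determination ($R^2$). $R^2 = 1 - \mathrm{SSE}/\mathrm{SST}$. Fraction of variance explained; ranges from 0 to 1 on the training data for OLS with an intercept.
5. Adjusted $R^2$. $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$ for $n$ observations and $p$ predictors. Penalises adding non-informative predictors.
6. Information criteria.
- AIC = $2k - 2\ln\hat{L}$, where $\hat{L}$ is the maximised likelihood and $k$ is the number of parameters.
- BIC = $k\ln n - 2\ln\hat{L}$, where $n$ is the number of observations.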
- Lower is better. BIC penalises complexity more strongly.
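The first four metrics in the list can be computed directly from the training targets and the fitted values. The following is a minimal sketch using only the Python standard library; the function name `regression_metrics` and the sample numbers are illustrative assumptions, not part of the original notes.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for two equal-length lists of numbers."""
    n = len(y_true)
    residuals = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    sse = sum(r * r for r in residuals)            # sum of squared errors
    mean_y = sum(y_true) / n
    sst = sum((y - mean_y) ** 2 for y in y_true)   # total sum of squares
    mse = sse / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1.0 - sse / sst,
    }

# Hypothetical training targets and fitted values, for illustration only.
y_train = [3.0, 5.0, 7.0, 9.0]
y_fit = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(y_train, y_fit)
print(metrics)
```

Because the fitted values come from the same data the model was trained on, every number returned here is an in-sample figure.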
The big caveat — in-sample optimism
A model's in-sample error is biased downward because the model has already seen those data points. For instance, a deep tree can achieve $R^2 \approx 1$ on training and 0.5 on test.
To estimate true generalisation, we need out-of-sample metrics (held-out test, cross-validation). In-sample metrics alone can give a misleadingly rosy picture and let over-fitting go unnoticed.
Rule of thumb. Use in-sample metrics to confirm the model fits at all (e.g., a training $R^2$ clearly above 0). Use out-of-sample metrics to decide whether to deploy.
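In-sample optimism is easiest to see with a model that simply memorises its training set. The sketch below uses a 1-nearest-neighbour regressor as a stand-in for the deep tree mentioned above; the data-generating process ($y = 2x$ plus Gaussian noise), the seed, and all function names are assumptions made for illustration.

```python
import random

def predict_1nn(x_train, y_train, x):
    """Predict with the target of the nearest training point (pure memorisation)."""
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return y_train[nearest]

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

rng = random.Random(0)  # fixed seed so the run is repeatable

def make_data(n):
    # Assumed data-generating process: y = 2x + Gaussian noise.
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

x_tr, y_tr = make_data(50)
x_te, y_te = make_data(50)

# Each training point is its own nearest neighbour, so training error vanishes.
train_mse = mse(y_tr, [predict_1nn(x_tr, y_tr, x) for x in x_tr])
# Unseen points inherit the noise of whichever training point is closest.
test_mse = mse(y_te, [predict_1nn(x_tr, y_tr, x) for x in x_te])
print(f"train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The training error is zero by construction, so it says nothing about how the memorised noise will hurt predictions on new points.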
Classification analogues
For classification, in-sample equivalents include:
- Training accuracy, precision, recall, F1.
- Training log-loss / cross-entropy.
Same caveat applies — training accuracy is over-optimistic.
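The classification metrics listed above can be sketched in the same way for a binary problem. The function name, the 0.5 decision threshold, and the example labels and probabilities below are hypothetical; precision, recall and F1 are set to 0 when their denominators are zero, which is one common convention rather than the only one.

```python
import math

def training_classification_metrics(y_true, p_pred, threshold=0.5):
    """Accuracy, precision, recall, F1 and log-loss for binary 0/1 labels."""
    y_hat = [1 if p >= threshold else 0 for p in p_pred]
    pairs = list(zip(y_true, y_hat))
    tp = sum(1 for y, h in pairs if y == 1 and h == 1)
    fp = sum(1 for y, h in pairs if y == 0 and h == 1)
    fn = sum(1 for y, h in pairs if y == 1 and h == 0)
    tn = sum(1 for y, h in pairs if y == 0 and h == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    eps = 1e-15  # clip probabilities so log(0) never occurs
    clipped = [min(max(p, eps), 1 - eps) for p in p_pred]
    log_loss = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, clipped)
    ) / len(y_true)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "log_loss": log_loss,
    }

# Hypothetical training labels and predicted probabilities of class 1.
labels = [1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.4, 0.8, 0.6]
clf_metrics = training_classification_metrics(labels, probs)
print(clf_metrics)
```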
Mind Map
Visual structure of the concept
IN-SAMPLE EVALUATION
├── On training data
├── Regression metrics
│   ├── MSE: Σ(y − ŷ)² / n
│   ├── RMSE: √MSE (same units)
│   ├── MAE: Σ|y − ŷ| / n
│   ├── R²: variance explained
│   └── Adjusted R²: penalty for p
├── Information criteria
│   ├── AIC: lower better
│   └── BIC: stronger penalty
├── Classification metrics (training)
│   ├── Accuracy, Precision, Recall, F1
│   └── Log-loss
└── Caveat
    └── Over-optimistic; use OOS too
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions
Part A (2 marks each)
Q1. Define MSE and RMSE.
- $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.
- $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$, in the same units as $y$.
Q2. What is the difference between $R^2$ and Adjusted $R^2$? $R^2$ measures variance explained but never decreases when a predictor is added, even a useless one. Adjusted $R^2$ penalises the number of predictors, decreasing if a new feature doesn't improve fit enough.
Q3. Why are in-sample metrics overoptimistic? Because the model was estimated to fit the training data — those very points. Even a degenerate model can memorise the training set and score perfectly while failing on unseen data. Out-of-sample evaluation is needed.
Part B (20 marks)
Q. Explain the in-sample evaluation measures for regression models. Discuss MSE, RMSE, MAE, $R^2$, Adjusted $R^2$, AIC and BIC with their formulas and use cases.
Why in-sample evaluation? To answer the basic question "Did the model fit at all?". Sound in-sample fit is a necessary but not sufficient condition for deployment — out-of-sample metrics tell the rest.
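1. Mean Squared Error. $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.
- Penalises large errors quadratically.
- Differentiable, so preferred as a training loss.
- Units: squared units of $y$ (so hard to interpret directly).
2. Root Mean Squared Error. $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$.
- Most commonly reported metric; same units as $y$.
- Comparable to the standard deviation of the residuals.
3. Mean Absolute Error. $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$.
- More robust to outliers than MSE (no squaring).
- Not differentiable at zero error (kink at 0).
- MAE is the loss behind median regression and is closely related to the Huber and quantile (pinball) losses used in robust regression.
4. Coefficient of Determination. $R^2 = 1 - \mathrm{SSE}/\mathrm{SST}$.
- 0 = model no better than predicting the mean.
- 1 = perfect fit.
- For OLS with an intercept, $0 \le R^2 \le 1$ on the training data. For other models, or on test data, $R^2$ can be negative.
5. Adjusted $R^2$. $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, with $n$ observations and $p$ predictors.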
- Penalises adding non-informative predictors.
- Preferred when comparing models with different numbers of features.
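The penalty can be made concrete with the usual formula $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors. The sketch below applies it to two hypothetical models; the $R^2$ values, $n$ and $p$ are invented for illustration and do not come from any real fit.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept not counted)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for adjusted R^2 to be defined")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical comparison: five extra predictors lift R^2 only slightly.
small_model = adjusted_r2(0.800, n=50, p=5)
large_model = adjusted_r2(0.805, n=50, p=10)
print(f"adjusted R^2: small = {small_model:.3f}, large = {large_model:.3f}")
```

In this made-up case the larger model has the higher plain $R^2$ but the lower adjusted $R^2$, which is the situation the penalty is meant to flag.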
6. Information Criteria.
AIC = $2k - 2\ln\hat{L}$, where $\hat{L}$ is the maximised likelihood and $k$ the number of parameters.
- Lower is better.
- Balances fit vs complexity.
BIC = $k\ln n - 2\ln\hat{L}$, where $n$ is the number of observations.
- Larger penalty on complexity ($\ln n > 2$ for $n \ge 8$).
- Tends to favour simpler models than AIC.
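For a regression with Gaussian errors, both criteria can be computed from the training SSE, using AIC = $2k - 2\ln\hat{L}$ and BIC = $k\ln n - 2\ln\hat{L}$. The sketch below assumes the noise variance is estimated by maximum likelihood as SSE / n; the two candidate models, their SSE values and their parameter counts are hypothetical.

```python
import math

def gaussian_log_likelihood(sse, n):
    """Maximised Gaussian log-likelihood, assuming variance estimate SSE / n."""
    sigma2 = sse / n
    return -0.5 * n * (math.log(2 * math.pi) + math.log(sigma2) + 1.0)

def aic(log_lik, k):
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    return k * math.log(n) - 2 * log_lik

n_obs = 100
# Hypothetical candidates: (training SSE, number of estimated parameters k).
simple = (250.0, 3)
complex_ = (245.0, 6)

ll_simple = gaussian_log_likelihood(simple[0], n_obs)
ll_complex = gaussian_log_likelihood(complex_[0], n_obs)

aic_simple, aic_complex = aic(ll_simple, simple[1]), aic(ll_complex, complex_[1])
bic_simple, bic_complex = bic(ll_simple, simple[1], n_obs), bic(ll_complex, complex_[1], n_obs)
print(f"AIC: simple = {aic_simple:.2f}, complex = {aic_complex:.2f}")
print(f"BIC: simple = {bic_simple:.2f}, complex = {bic_complex:.2f}")
```

With these invented numbers both criteria prefer the simpler model, and the BIC margin is wider than the AIC margin because $\ln 100 > 2$.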
Use cases for each.
| Metric | Use when |
|---|---|
| RMSE | Default; comparable scale to $y$ |
| MAE | Outliers present; robust comparison |
| $R^2$ | Quick communication of fit quality |
| Adj. $R^2$ | Comparing models with different numbers of predictors $p$ |
| AIC | Model selection, especially nested models |
| BIC | Model selection with stronger simplicity preference |
Caveat: over-optimism. In-sample metrics measure fit on training data and are biased downward (errors look smaller than they truly will be on new data). A common trap is reporting only training $R^2$: a deep decision tree can reach $R^2 \approx 1$ on training and 0.3 on test.
Best practice.
- Compute in-sample metrics to verify the model fits.
- Then compute out-of-sample metrics (held-out test, k-fold CV) — these drive deployment.
- Always report both side-by-side; the gap reveals over- or under-fitting.
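A side-by-side report of this kind can be sketched with a one-predictor least-squares line and a hand-rolled k-fold split. Everything here is an illustrative assumption: the synthetic data, the seeds, the function names, and the choice to pool squared errors across folds before taking the square root (averaging per-fold RMSEs is an equally common alternative).

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def squared_errors(xs, ys, slope, intercept):
    return [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]

def kfold_rmse(xs, ys, k=5, seed=0):
    """Pooled held-out RMSE: each point is predicted by a model that never saw it."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors += squared_errors([xs[i] for i in fold], [ys[i] for i in fold],
                                 slope, intercept)
    return math.sqrt(sum(errors) / len(errors))

# Synthetic data, assumed for illustration: y = 3x + 1 + Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(40)]
ys = [3.0 * x + 1.0 + rng.gauss(0.0, 2.0) for x in xs]

slope, intercept = fit_line(xs, ys)
errs = squared_errors(xs, ys, slope, intercept)
in_sample_rmse = math.sqrt(sum(errs) / len(errs))
cv_rmse = kfold_rmse(xs, ys, k=5)
print(f"in-sample RMSE = {in_sample_rmse:.3f}, 5-fold CV RMSE = {cv_rmse:.3f}")
```

For a model this simple the two numbers should be close; a large gap between them is the over-fitting signal described above.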
Example. A model on house prices reports training RMSE = ₹25 k. The 5-fold CV RMSE = ₹95 k. The huge gap (in-sample optimism) signals overfitting: simplify the model, regularise, or get more data before deploying.