PGDDSA Study · Semester 1

PGD01C02

Module 4 · Model Evaluation

Polynomial Regression and Pipelines

Core Titles

Key headlines and terms for quick recall

Polynomial Regression — linear in coefficients, non-linear in inputs
Model: $\hat y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d$
Degree $d$ — bias/variance trade-off
Pipeline — chain preprocessing + model into a reproducible object
Why pipelines: prevent leakage, ensure train/infer parity
sklearn Pipeline, ColumnTransformer
Cross-validation integrates seamlessly with pipelines

Basic Idea

What it is, why it matters, how it works

Polynomial Regression

When the relationship between $x$ and $y$ is curved, simple linear regression underfits. Polynomial regression extends the model with powers of $x$:

$\hat y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \dots + \beta_d x^d.$

This is still linear regression — linear in the coefficients $\beta_j$ — and can be fit with OLS by treating $x, x^2, x^3, \dots$ as separate features.
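To make the "powers as separate features" idea concrete, the sketch below builds the columns $1, x, x^2$ by hand and solves the least-squares problem with NumPy. The data, the coefficient values and names such as `X_design` are made up for illustration; with noise-free data the fit should recover the generating coefficients up to rounding.

```python
import numpy as np

# Noise-free synthetic data from a known quadratic (illustrative values).
x = np.linspace(-2.0, 2.0, 20)
y = 1.0 + 2.0 * x - 0.5 * x**2

# Treat 1, x and x^2 as three separate features (columns of the design matrix).
X_design = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares: minimise ||X_design @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(np.round(beta, 3))  # estimates of beta_0, beta_1, beta_2
```

Nothing in the solver knows the columns are powers of the same variable, which is why the model counts as linear regression.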

Degree $d$ — bias / variance trade-off.

Low $d$ (1, 2) → may underfit (high bias).
High $d$ (10+) → can fit the training data almost exactly but oscillates wildly between and beyond the training points (high variance, overfitting).
Choose $d$ via cross-validation.
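One way to carry out this selection is sketched below, assuming synthetic data from a cubic with added noise (the generating function, sample size and degree range are illustrative choices, not part of the notes). Each candidate degree is scored by 5-fold cross-validated MSE and the smallest value wins.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a noisy cubic (assumed purely for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=1.0, size=200)

cv_mse = {}
for d in range(1, 9):
    model = make_pipeline(
        PolynomialFeatures(degree=d, include_bias=False),
        LinearRegression(),
    )
    # sklearn reports negated MSE so that larger is always better.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[d] = -scores.mean()

best_d = min(cv_mse, key=cv_mse.get)
print(best_d, round(cv_mse[best_d], 3))
```

On data like this, degrees 1 and 2 should score clearly worse than degree 3, while higher degrees typically bring little or no further improvement.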

Multi-feature polynomial expansion. With features $x_1, x_2$ and degree 2, expand to $(x_1, x_2, x_1^2, x_2^2, x_1 x_2)$. This includes the interaction term $x_1 x_2$, which captures how the effect of $x_1$ depends on $x_2$.

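from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)   # include_bias=True by default: adds a constant column of 1s
X_poly = poly.fit_transform(X)        # X: 2-D array of shape (n_samples, n_features)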

Pipelines

Why pipelines? A modern ML workflow has multiple steps: impute → scale → polynomial expand → model. Without pipelines:

Easy to leak test data (e.g., scaler fitted on the whole dataset including test).
Easy to forget transformations at inference time, causing train/serve skew.
Harder to version, test and deploy.
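The leakage risk in the first point can be made visible with a small sketch. It assumes random synthetic data and compares a scaler fitted on every row with one fitted on the training rows only; the variable names `leaky` and `clean` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 1))
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

leaky = StandardScaler().fit(X)         # statistics include the test rows: leakage
clean = StandardScaler().fit(X_train)   # statistics from the training rows only

print("mean from all rows:  ", leaky.mean_[0])
print("mean from train rows:", clean.mean_[0])

# Correct usage: transform the test rows with the training statistics.
X_test_scaled = clean.transform(X_test)
```

The two means differ because the leaky scaler has already seen the test rows, so any score computed afterwards is no longer an honest estimate of performance on unseen data.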

A pipeline chains transformers + estimator into a single object with the same .fit() and .predict() interface.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ("scale", StandardScaler()),
    # include_bias=False: LinearRegression already fits the intercept
    ("poly",  PolynomialFeatures(degree=3, include_bias=False)),
    ("model", LinearRegression())
])

pipe.fit(X_train, y_train)         # fits all steps on train only
pipe.predict(X_test)               # applies same transforms then predicts
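As a check on what `.predict()` does internally, the sketch below fits a comparable pipeline on synthetic quadratic data (an assumption made for illustration) and reproduces its predictions by applying each fitted step by hand through `named_steps`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(120, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=120)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)

# Manual equivalent of pipe.predict: apply each fitted transformer in order.
Z = pipe.named_steps["scale"].transform(X_test)
Z = pipe.named_steps["poly"].transform(Z)
manual_pred = pipe.named_steps["model"].predict(Z)

same = np.allclose(manual_pred, pipe.predict(X_test))
print(same)
```

The pipeline saves you from writing these three lines correctly, in the right order, everywhere predictions are made.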

Benefits.

No leakage — transformers fit only on training data inside each CV fold.
Train/inference parity — production scoring uses the same pipeline object.
Hyperparameter search can span all steps simultaneously.

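from sklearn.model_selection import GridSearchCV
grid = {"poly__degree": [1, 2, 3, 4, 5]}   # "<step name>__<parameter>" addresses a pipeline step
search = GridSearchCV(pipe, grid, cv=5, scoring="r2")
search.fit(X_train, y_train)               # tune on training data; keep the test set for the final check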

ColumnTransformer — apply different transformers to different columns (e.g., scale numerics, one-hot encode categoricals) within one pipeline.
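A minimal sketch of this idea follows, assuming a hypothetical toy table with two numeric columns and one categorical column (the column names and values are invented, and pandas is assumed to be available).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
df = pd.DataFrame({
    "area": [50.0, 80.0, 120.0, 65.0, 95.0, 150.0],
    "rooms": [2, 3, 4, 2, 3, 5],
    "city": ["A", "B", "A", "C", "B", "C"],
})
y = [100.0, 170.0, 240.0, 150.0, 200.0, 330.0]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["area", "rooms"]),                # scale numerics
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),   # one-hot encode categoricals
])

pipe = Pipeline([("prep", preprocess), ("model", LinearRegression())])
pipe.fit(df, y)

# 2 scaled numeric columns + 3 one-hot columns (cities A, B, C).
n_features = pipe.named_steps["prep"].transform(df).shape[1]
print(n_features)
```

Setting `handle_unknown="ignore"` is a design choice for inference: a city not seen during training is encoded as all zeros instead of raising an error.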

Why polynomial + pipelines together

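Polynomial features expand dimensionality fast: a full expansion of $n$ inputs to degree $d$ yields $\binom{n+d}{d}$ columns, counting the constant. The expansion should therefore be paired with scaling and regularisation (Ridge, Lasso) to limit overfitting.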
Pipelines make this cleanly composable and CV-friendly.
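How fast the expansion grows can be checked with a few lines of arithmetic. The helper `n_poly_features` below is a hypothetical name; it counts the monomials of total degree at most $d$ in $n$ variables, which is the number of columns a full expansion with interaction terms produces.

```python
from math import comb

def n_poly_features(n_inputs: int, degree: int, include_bias: bool = True) -> int:
    """Number of monomials of total degree <= `degree` in `n_inputs` variables."""
    total = comb(n_inputs + degree, degree)
    return total if include_bias else total - 1

for n_inputs in (2, 10, 50):
    print(n_inputs, [n_poly_features(n_inputs, d) for d in (1, 2, 3)])
# 2  -> [3, 6, 10]
# 10 -> [11, 66, 286]
# 50 -> [51, 1326, 23426]
```

With 50 inputs a cubic expansion already produces more than 23,000 columns, which usually exceeds the number of rows in a small dataset.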

Diminishing returns

For genuinely complex non-linearities, polynomials lose to:

Splines: piecewise polynomials with local control; natural splines extrapolate linearly instead of exploding.
GAMs: generalised additive models.
Tree ensembles: random forests, XGBoost.
Kernel methods: SVR with RBF kernel.
Neural networks.

But the polynomial pipeline remains a clean, interpretable starting point.

Mind Map

Visual structure of the concept

POLYNOMIAL REGRESSION + PIPELINES
├── Polynomial regression
│   ├── ŷ = β₀ + β₁x + β₂x² + … + β_d xᵈ
│   ├── Linear in β — fit with OLS
│   ├── Degree d via CV (bias ↔ variance)
│   └── Multi-feature: includes interactions
├── Pipelines (sklearn)
│   ├── Chain: scaler → polyExpand → model
│   ├── Same .fit / .predict interface
│   ├── Prevent leakage in CV
│   ├── Train/serve parity
│   └── Grid search across all steps
└── ColumnTransformer for mixed features

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. What is polynomial regression? A regression model that extends linear regression with powers of the input: $\hat y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d$. It is linear in the coefficients and can be fit by OLS, treating each $x^k$ as a separate feature.

Q2. Why are pipelines important in machine learning? They chain preprocessing and modelling into one reproducible object, ensuring the same transformations apply at training and inference. This prevents data leakage during cross-validation and guarantees train/serve parity in production.

Q3. What is the bias-variance trade-off in polynomial regression? Low-degree polynomials underfit (high bias); high-degree polynomials overfit (high variance). The optimal degree balances both — typically chosen via cross-validation.

Part B (20 marks)

Q. Explain polynomial regression and how it differs from linear regression. Discuss the importance of pipelines in machine learning and demonstrate with an example.

Polynomial regression.

When the true relationship between $x$ and $y$ is curved, a straight line underfits. Polynomial regression captures curvature by adding powers of $x$ to the predictor set:

$\hat y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d.$

Important. This is still linear regression — linear in the coefficients $\beta_j$ — fit with OLS by treating $x, x^2, x^3, \dots$ as separate features.

Multi-feature. With $x_1, x_2$ and degree 2, the expanded features are $(x_1, x_2, x_1^2, x_2^2, x_1 x_2)$, including the interaction $x_1 x_2$.

Choosing degree $d$.

$d$	Effect
1	Linear; may underfit (high bias)
2–3	Captures gentle curvature
5–10+	Risks oscillation (high variance, overfit)

Use cross-validation to choose the $d$ that minimises validation MSE.

Difference from linear regression.

Aspect	Linear	Polynomial
Relationship	Straight line	Curve of degree $d$
Features	$x$	$x, x^2, \dots, x^d$
Fit method	OLS	OLS on expanded features
Bias / variance	High bias if the true relationship is curved	Tunable via $d$
Interpretability	Easy	Harder for $d > 2$
Risk	Underfit	Overfit

Pipelines.

Modern ML workflows chain many steps — impute, scale, expand, model. Doing them ad-hoc invites:

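Data leakage: fitting the scaler on the whole dataset, test set included, lets test-set statistics influence training and inflates the reported score.
Train/serve skew: production inference omits or misapplies a transformation that was used in training.
Difficulty in tuning: every preprocessing step has to be refitted by hand inside each CV fold.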

Pipelines chain transformers + estimator into one object with the same .fit() and .predict() interface.

Code example.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly",  PolynomialFeatures(include_bias=False)),
    ("model", Ridge())
])

grid = {
    "poly__degree":  [1, 2, 3, 4, 5],
    "model__alpha":  [0.01, 0.1, 1, 10]
}
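search = GridSearchCV(pipe, grid, cv=5, scoring="r2")
search.fit(X_train, y_train)              # 5-fold CV over all 20 (degree, alpha) combinations

print(search.best_params_)                # best degree and alpha by mean CV R^2
print(search.score(X_test, y_test))       # R^2 of the refitted best pipeline on held-out data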

Benefits illustrated.

The StandardScaler is fitted inside each CV fold on its own training subset — no leakage.
Grid search explores degree × alpha jointly.
The single search.best_estimator_ object can be saved and deployed; production code only calls .predict(), and the same fitted transformations are applied automatically.
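Saving and reloading such an object might look like the sketch below. It assumes `joblib` as the persistence tool (a common choice, not something the notes prescribe), uses a directly fitted pipeline as a stand-in for `search.best_estimator_`, and writes to a hypothetical file name inside a temporary directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(80, 1))
y = 1.0 + X[:, 0] ** 2 + rng.normal(scale=0.1, size=80)

# Stand-in for search.best_estimator_: any fitted pipeline is saved the same way.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", Ridge(alpha=0.1)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "poly_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)      # training side: persist the whole fitted pipeline
    loaded = joblib.load(path)   # serving side: load it back

X_new = np.array([[0.5], [-1.5]])  # raw, unscaled inputs
identical = np.allclose(pipe.predict(X_new), loaded.predict(X_new))
print(identical)
```

Because the scaler and the polynomial expansion travel inside the saved object, the serving code never has to reimplement them. Loading is generally only reliable with the same scikit-learn version that was used for saving.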

ColumnTransformer lets you apply different transforms per column (scale numerics, one-hot categoricals) within the same pipeline.

Take-away. Polynomial regression is a powerful, interpretable extension of linear regression — but its tendency to overfit makes pipelines + regularisation + cross-validation essential.
