PGD01C02
Module 4 · Model Evaluation

Polynomial Regression and Pipelines

Core Titles
Key headlines and terms for quick recall
  • Polynomial Regression — linear in coefficients, non-linear in inputs
  • Model: $\hat y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d$
  • Degree $d$ — bias/variance trade-off
  • Pipeline — chain preprocessing + model into a reproducible object
  • Why pipelines: prevent leakage, ensure train/infer parity
  • sklearn Pipeline, ColumnTransformer
  • Cross-validation integrates seamlessly with pipelines
Basic Idea
What it is, why it matters, how it works

Polynomial Regression

When the relationship between $x$ and $y$ is curved, simple linear regression underfits. Polynomial regression extends the model with powers of $x$:

$$\hat y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \dots + \beta_d x^d.$$

This is still linear regression (linear in the coefficients $\beta_j$) and can be fit with OLS by treating $x, x^2, x^3, \dots$ as separate features.
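
The "powers as separate features" idea can be sketched with plain NumPy. This is a minimal illustration on synthetic data; the coefficients in `true_beta`, the sample size, and the noise level are assumptions chosen only for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known cubic (coefficients chosen only for illustration).
true_beta = np.array([1.0, 2.0, -1.0, 0.5])   # beta_0 .. beta_3
x = rng.uniform(-2, 2, size=200)
y = true_beta[0] + true_beta[1] * x + true_beta[2] * x**2 + true_beta[3] * x**3
y = y + rng.normal(scale=0.1, size=x.shape)

# Design matrix with columns 1, x, x^2, x^3: each power is just another feature.
A = np.column_stack([x**k for k in range(4)])

# Ordinary least squares on the expanded features.
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(beta_hat, 2))
```

With this little noise the estimates should land close to `true_beta`, which is the sense in which a curved fit is still a linear least-squares problem.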

Degree $d$ — bias / variance trade-off.

  • Low $d$ (1, 2) → may underfit (high bias).
  • High $d$ (10+) → can fit the training data almost exactly but oscillates wildly between the points (high variance, overfitting).
  • Choose $d$ via cross-validation.
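
The trade-off in the list above can be seen by comparing training error with error on fresh data. The sketch below assumes a made-up cubic ground truth with Gaussian noise; the helper `make_data` and the degrees 1, 3 and 12 are illustrative choices, not part of the course material.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

def make_data(n):
    # Hypothetical ground truth: a cubic plus Gaussian noise.
    x = rng.uniform(-1, 1, size=(n, 1))
    y = 1 + 2 * x[:, 0] - 3 * x[:, 0] ** 3 + rng.normal(scale=0.1, size=n)
    return x, y

X_train, y_train = make_data(40)
X_test, y_test = make_data(200)

train_mse, test_mse = {}, {}
for d in (1, 3, 12):
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[d] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[d] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={d:2d}  train MSE={train_mse[d]:.4f}  test MSE={test_mse[d]:.4f}")
```

Training error can only go down as the degree grows, because each model contains the previous one. The held-out error is the number to watch: it drops sharply from degree 1 to degree 3 here and typically stops improving, or worsens, at high degree.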

Multi-feature polynomial expansion. With features $x_1, x_2$ and degree 2, expand to $(x_1, x_2, x_1^2, x_2^2, x_1 x_2)$. This includes the interaction term $x_1 x_2$, which captures how the effect of $x_1$ depends on $x_2$.

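from sklearn.preprocessing import PolynomialFeatures

# X is assumed to be an existing array of shape (n_samples, n_features).
poly = PolynomialFeatures(degree=3)   # all monomials up to total degree 3, plus a bias column
X_poly = poly.fit_transform(X)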

Pipelines

Why pipelines? A modern ML workflow has multiple steps: impute → scale → polynomial expand → model. Without pipelines:

  • Easy to leak test data (e.g., scaler fitted on the whole dataset including test).
  • Easy to forget transformations at inference time, causing train/serve skew.
  • Harder to version, test and deploy.

A pipeline chains transformers + estimator into a single object with the same .fit() and .predict() interface.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly",  PolynomialFeatures(degree=3)),
    ("model", LinearRegression())
])

pipe.fit(X_train, y_train)         # fits all steps on train only
pipe.predict(X_test)               # applies same transforms then predicts
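
The two comments above can be checked directly. The sketch below uses synthetic data (an assumption for the demo) and inspects the fitted steps through `named_steps`: the scaler's stored mean should match the training rows only, and `predict` should agree with applying each step by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 1))         # synthetic feature
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=100)  # synthetic target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2)),
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)

# The fitted scaler holds statistics of the training rows only.
scaler = pipe.named_steps["scale"]
print(scaler.mean_, X_train.mean(axis=0), X.mean(axis=0))

# predict() is equivalent to transforming step by step, then calling the model.
manual = pipe.named_steps["model"].predict(
    pipe.named_steps["poly"].transform(scaler.transform(X_test))
)
print(np.allclose(manual, pipe.predict(X_test)))
```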

Benefits.

  • No leakage — transformers fit only on training data inside each CV fold.
  • Train/inference parity — production scoring uses the same pipeline object.
  • Hyperparameter search can span all steps simultaneously.

from sklearn.model_selection import GridSearchCV

# "<step name>__<parameter>" addresses a parameter of a pipeline step.
grid = {"poly__degree": [1, 2, 3, 4, 5]}
search = GridSearchCV(pipe, grid, cv=5, scoring="r2")
search.fit(X, y)

ColumnTransformer — apply different transformers to different columns (e.g., scale numerics, one-hot encode categoricals) within one pipeline.
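
A minimal sketch of a `ColumnTransformer` inside a pipeline, assuming pandas is available. The DataFrame, its column names ("area", "rooms", "city") and the prices are hypothetical toy values made up for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
df = pd.DataFrame({
    "area":  [50.0, 80.0, 120.0, 65.0, 95.0, 110.0],
    "rooms": [2, 3, 4, 2, 3, 4],
    "city":  ["A", "B", "A", "C", "B", "C"],
})
price = [150.0, 260.0, 340.0, 210.0, 300.0, 365.0]

# Each entry is (name, transformer, columns it applies to).
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["area", "rooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

pipe = Pipeline([
    ("prep", preprocess),
    ("model", LinearRegression()),
])
pipe.fit(df, price)

n_columns = pipe.named_steps["prep"].transform(df).shape[1]
print(n_columns)  # 2 scaled numerics + 3 one-hot columns
preds = pipe.predict(df.head(2))
```

`handle_unknown="ignore"` is one possible choice here: it lets the pipeline score a row whose category was never seen in training instead of raising an error.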

Why polynomial + pipelines together

  • Polynomial features expand dimensionality fast, so they are usually paired with scaling and regularisation (Ridge, Lasso) to limit overfitting.
  • Pipelines make this cleanly composable and CV-friendly.
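
How fast the expansion grows can be counted directly. With $n$ input features and degree $d$ there are $\binom{n+d}{d}$ monomials including the constant, so one fewer without the bias column. The sketch below checks this for an assumed $n = 10$.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.zeros((1, 10))  # one dummy row with 10 features; only the width matters

counts = {}
for degree in range(1, 6):
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(X)
    counts[degree] = poly.n_output_features_
    expected = comb(10 + degree, degree) - 1   # closed form, bias column excluded
    print(degree, counts[degree], expected)
```

Ten features already become 285 at degree 3 and 3002 at degree 5, which is why the degree is normally kept small or combined with a penalty such as Ridge.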

Diminishing returns

For genuinely complex non-linearities, polynomials lose to:

  • Splines — piecewise polynomials, smoother extrapolation.
  • GAMs — generalised additive models.
  • Tree ensembles — random forests, XGBoost.
  • Kernel methods — SVR with RBF kernel.
  • Neural networks.

But the polynomial pipeline remains a clean, interpretable starting point.

Mind Map
Visual structure of the concept
POLYNOMIAL REGRESSION + PIPELINES
├── Polynomial regression
│   ├── ŷ = β₀ + β₁x + β₂x² + … + β_d xᵈ
│   ├── Linear in β — fit with OLS
│   ├── Degree d via CV (bias ↔ variance)
│   └── Multi-feature: includes interactions
├── Pipelines (sklearn)
│   ├── Chain: scaler → polyExpand → model
│   ├── Same .fit / .predict interface
│   ├── Prevent leakage in CV
│   ├── Train/serve parity
│   └── Grid search across all steps
└── ColumnTransformer for mixed features
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. What is polynomial regression? A regression model that extends linear regression with powers of the input: $\hat y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d$. It is linear in coefficients and can be fit by OLS treating $x^k$ as separate features.

Q2. Why are pipelines important in machine learning? They chain preprocessing and modelling into one reproducible object, ensuring the same transformations apply at training and inference. This prevents data leakage during cross-validation and guarantees train/serve parity in production.

Q3. What is the bias-variance trade-off in polynomial regression? Low-degree polynomials underfit (high bias); high-degree polynomials overfit (high variance). The optimal degree balances both — typically chosen via cross-validation.


Part B (20 marks)

Q. Explain polynomial regression and how it differs from linear regression. Discuss the importance of pipelines in machine learning and demonstrate with an example.

Polynomial regression.

When the true relationship between $x$ and $y$ is curved, a straight line underfits. Polynomial regression captures curvature by adding powers of $x$ to the predictor set:

$$\hat y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d.$$

Important. This is still linear regression (linear in the coefficients $\beta_j$), fit with OLS by treating $x, x^2, x^3, \dots$ as separate features.

Multi-feature. With $x_1, x_2$ and degree 2, the expanded features are $(x_1, x_2, x_1^2, x_2^2, x_1 x_2)$, including the interaction $x_1 x_2$.

Choosing degree $d$.

| $d$ | Effect |
|---|---|
| 1 | Linear; may underfit (high bias) |
| 2–3 | Captures gentle curvature |
| 5–10+ | Risks oscillation (high variance, overfit) |

Use cross-validation to choose the $d$ that minimises validation MSE.
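
One way to carry out this selection is a loop over candidate degrees with `cross_val_score`. The data below are synthetic (an assumed cubic signal with noise), and the candidate range 1 to 6 is an arbitrary choice for the sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(80, 1))
y = 1 + 2 * X[:, 0] - 3 * X[:, 0] ** 3 + rng.normal(scale=0.1, size=80)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for d in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    # scikit-learn maximises scores, so MSE is exposed as its negative.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[d] = -scores.mean()

best_degree = min(cv_mse, key=cv_mse.get)
print(cv_mse)
print("best degree:", best_degree)
```

Degrees 1 and 2 cannot represent the cubic term, so their validation MSE stays well above the noise level; from degree 3 upward the scores are close, and the exact winner depends on the random split.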

Difference from linear regression.

| Aspect | Linear | Polynomial |
|---|---|---|
| Relationship | Straight line | Curve of degree $d$ |
| Features | $x$ | $x, x^2, \dots, x^d$ |
| Fit method | OLS | OLS on expanded features |
| Bias / variance | High bias | Tunable via $d$ |
| Interpretability | Easy | Harder for $d > 2$ |
| Risk | Underfit | Overfit |

Pipelines.

Modern ML workflows chain many steps — impute, scale, expand, model. Doing them ad-hoc invites:

  • Data leakage — fitting the scaler on the whole dataset, test set included, inflates the score.
  • Train/serve skew — production inference forgets a transformation.
  • Difficulty in tuning — every step must be repeated by hand inside each fold.

Pipelines chain transformers + estimator into one object with the same .fit() and .predict() interface.

Code example.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly",  PolynomialFeatures(include_bias=False)),
    ("model", Ridge())
])

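# Keys follow "<step name>__<parameter>"; the search tries every degree/alpha pair.
grid = {
    "poly__degree":  [1, 2, 3, 4, 5],
    "model__alpha":  [0.01, 0.1, 1, 10]
}
search = GridSearchCV(pipe, grid, cv=5, scoring="r2")
search.fit(X_train, y_train)          # the test set is never seen during tuning

print(search.best_params_)
print(search.score(X_test, y_test))   # R² of the refitted best pipeline on held-out data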

Benefits illustrated.

  • The StandardScaler is fitted inside each CV fold on its own training subset — no leakage.
  • Grid search explores degree × alpha jointly.
  • The single search.best_estimator_ can be saved and deployed; production code calls .predict(), which applies exactly the same fitted transformations.
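
Saving and reloading a fitted pipeline can be sketched with joblib, which is installed alongside scikit-learn. The pipeline below is a stand-in for `search.best_estimator_`, trained on synthetic data; the file name is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=50)

# Stand-in for search.best_estimator_: any fitted pipeline is saved the same way.
fitted = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", Ridge(alpha=1.0)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "poly_pipeline.joblib")  # hypothetical file name
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The reloaded object scales, expands and predicts exactly like the original.
X_new = rng.normal(size=(3, 2))
same = np.allclose(fitted.predict(X_new), restored.predict(X_new))
print(same)
```

A saved pipeline is generally only safe to load with the same scikit-learn version that wrote it, so the library version is worth recording next to the file.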

ColumnTransformer lets you apply different transforms per column (scale numerics, one-hot categoricals) within the same pipeline.

Take-away. Polynomial regression is a powerful, interpretable extension of linear regression — but its tendency to overfit makes pipelines + regularisation + cross-validation essential.