Polynomial Regression and Pipelines
Core Titles
Key headlines and terms for quick recall
- Polynomial Regression: linear in coefficients, non-linear in inputs
- Model: $\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d$
- Degree $d$: bias/variance trade-off
- Pipeline: chain preprocessing + model into a reproducible object
- Why pipelines: prevent leakage, ensure train/infer parity
- sklearn Pipeline, ColumnTransformer
- Cross-validation integrates seamlessly with pipelines
Basic Idea
What it is, why it matters, how it works
Polynomial Regression
When the relationship between $x$ and $y$ is curved, simple linear regression underfits. Polynomial regression extends the model with powers of $x$:
$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d$$
This is still linear regression (linear in the coefficients $\beta_j$), so it can be fit with OLS by treating $x, x^2, \dots, x^d$ as separate features.
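The "powers as separate features" idea can be sketched with plain NumPy. This is a minimal illustration, not a reference implementation: the quadratic ground truth, the noise level, and the variable names are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 50)
# Hypothetical ground truth: y = 1 + 2x - 0.5x^2, plus a little noise.
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.05, size=x.shape)

# Design matrix with columns [1, x, x^2]: each power is just another feature.
A = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the expanded features.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(beta, 2))  # estimates of (beta_0, beta_1, beta_2)
```

The fitted coefficients should land close to the assumed values 1, 2 and -0.5, because the model is linear in the coefficients even though it is quadratic in x.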
Degree $d$: bias / variance trade-off.
- Low $d$ (1, 2) → may underfit (high bias).
- High $d$ (10+) → can fit the training data almost perfectly but oscillates wildly between and beyond the data points (high variance, overfitting).
- Choose $d$ via cross-validation.
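Choosing the degree by cross-validation might look like the sketch below. The cubic ground truth, the noise level, and the range of degrees tried are assumptions for illustration only; on real data the best degree will differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
# Hypothetical cubic ground truth with noise.
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=1.0, size=200)

cv_mse = {}
for d in range(1, 9):
    model = make_pipeline(
        PolynomialFeatures(degree=d, include_bias=False),
        LinearRegression(),
    )
    # scikit-learn maximises scores, so MSE is reported negated.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[d] = -scores.mean()

best_d = min(cv_mse, key=cv_mse.get)
print(best_d, round(cv_mse[best_d], 3))
```

Degrees 1 and 2 cannot represent the assumed cubic, so their cross-validated MSE is clearly higher; beyond degree 3 the error typically flattens or creeps up as variance grows.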
Multi-feature polynomial expansion. With features $x_1, x_2$ and degree 2, expand to $1, x_1, x_2, x_1^2, x_1 x_2, x_2^2$. This includes the interaction term $x_1 x_2$, which captures how the effect of $x_1$ depends on $x_2$.
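from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)  # all terms up to degree 3, plus a bias column by default
X_poly = poly.fit_transform(X)       # X: array of shape (n_samples, n_features)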
Pipelines
Why pipelines? A modern ML workflow has multiple steps: impute → scale → polynomial expand → model. Without pipelines:
- Easy to leak test data (e.g., scaler fitted on the whole dataset including test).
- Easy to forget transformations at inference time, causing train/serve skew.
- Harder to version, test and deploy.
A pipeline chains transformers + estimator into a single object with the same .fit() and .predict() interface.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=3, include_bias=False)),  # LinearRegression already fits an intercept
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)  # fits every step on the training data only
pipe.predict(X_test)        # applies the same fitted transforms, then predicts
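One way to check that the transformers really are fitted on the training split only is to inspect the fitted scaler inside the pipeline. The sketch below uses synthetic data and a degree-2 model chosen purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 1))
# Hypothetical quadratic relationship with a little noise.
y = 3.0 + 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.1, size=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)

scaler = pipe.named_steps["scale"]   # the fitted StandardScaler step
train_mean = X_train.mean(axis=0)
full_mean = X.mean(axis=0)
print(scaler.mean_, train_mean, full_mean)
print(pipe.score(X_test, y_test))    # R^2 on held-out data
```

The scaler's stored mean matches the training-split mean rather than the mean of the full dataset, which is exactly the property that keeps test data out of preprocessing.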
Benefits.
- No leakage — transformers fit only on training data inside each CV fold.
- Train/inference parity — production scoring uses the same pipeline object.
- Hyperparameter search can span all steps simultaneously.
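from sklearn.model_selection import GridSearchCV

grid = {"poly__degree": [1, 2, 3, 4, 5]}  # key format: "<step name>__<parameter>"
search = GridSearchCV(pipe, grid, cv=5, scoring="r2")
search.fit(X_train, y_train)  # tune on training data; keep the test set for the final check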
ColumnTransformer — apply different transformers to different columns (e.g., scale numerics, one-hot encode categoricals) within one pipeline.
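A minimal sketch of a ColumnTransformer inside a pipeline follows. The toy DataFrame, its column names (`area`, `rooms`, `city`) and the target values are hypothetical and exist only to make the example runnable.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
df = pd.DataFrame({
    "area": [50.0, 80.0, 120.0, 65.0, 95.0, 150.0],
    "rooms": [2, 3, 4, 2, 3, 5],
    "city": ["A", "B", "A", "C", "B", "C"],
})
price = [100.0, 180.0, 240.0, 150.0, 210.0, 330.0]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["area", "rooms"]),                   # scale numerics
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),      # one-hot categoricals
])
pipe = Pipeline([("prep", preprocess), ("model", LinearRegression())])
pipe.fit(df, price)

Xt = pipe.named_steps["prep"].transform(df)
print(Xt.shape)  # 2 scaled numeric columns + 3 one-hot columns
```

Setting `handle_unknown="ignore"` is a design choice for this sketch: a category unseen during training is encoded as all zeros at inference time instead of raising an error.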
Why polynomial + pipelines together
- Polynomial features expand dimensionality fast, so the expansion is usually paired with scaling and regularisation (Ridge, Lasso) to limit overfitting.
- Pipelines make this cleanly composable and CV-friendly.
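How fast the dimensionality grows can be checked directly. The sketch below assumes 10 input features (an arbitrary choice) and compares the column count reported by PolynomialFeatures with the closed-form count C(n + d, d), which applies when the bias column is included.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 10                      # assumed number of input features
X = np.zeros((1, n_features))        # values are irrelevant; only the shape matters

counts = {}
for degree in (1, 2, 3, 4):
    n_out = PolynomialFeatures(degree=degree).fit(X).n_output_features_
    # With the bias column included, the number of terms is C(n + d, d).
    assert n_out == comb(n_features + degree, degree)
    counts[degree] = n_out

print(counts)  # {1: 11, 2: 66, 3: 286, 4: 1001}
```

Going from degree 2 to degree 4 multiplies the number of columns by roughly fifteen here, which is why regularisation becomes important as the degree increases.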
Diminishing returns
For genuinely complex non-linearities, polynomials lose to:
- Splines — piecewise polynomials, smoother extrapolation.
- GAMs — generalised additive models.
- Tree ensembles — random forests, XGBoost.
- Kernel methods — SVR with RBF kernel.
- Neural networks.
But the polynomial pipeline remains a clean, interpretable starting point.
Mind Map
Visual structure of the concept
POLYNOMIAL REGRESSION + PIPELINES
├── Polynomial regression
│ ├── ŷ = β₀ + β₁x + β₂x² + … + β_d xᵈ
│ ├── Linear in β — fit with OLS
│ ├── Degree d via CV (bias ↔ variance)
│ └── Multi-feature: includes interactions
├── Pipelines (sklearn)
│ ├── Chain: scaler → polyExpand → model
│ ├── Same .fit / .predict interface
│ ├── Prevent leakage in CV
│ ├── Train/serve parity
│ └── Grid search across all steps
└── ColumnTransformer for mixed features
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions
Part A (2 marks each)
Q1. What is polynomial regression? A regression model that extends linear regression with powers of the input: $\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d$. It is linear in coefficients and can be fit by OLS treating $x, x^2, \dots, x^d$ as separate features.
Q2. Why are pipelines important in machine learning? They chain preprocessing and modelling into one reproducible object, ensuring the same transformations apply at training and inference. This prevents data leakage during cross-validation and guarantees train/serve parity in production.
Q3. What is the bias-variance trade-off in polynomial regression? Low-degree polynomials underfit (high bias); high-degree polynomials overfit (high variance). The optimal degree balances both — typically chosen via cross-validation.
Part B (20 marks)
Q. Explain polynomial regression and how it differs from linear regression. Discuss the importance of pipelines in machine learning and demonstrate with an example.
Polynomial regression.
When the true relationship between $x$ and $y$ is curved, a straight line underfits. Polynomial regression captures curvature by adding powers of $x$ to the predictor set:
$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d$$
Important. This is still linear regression (linear in the coefficients), fit with OLS by treating $x, x^2, \dots, x^d$ as separate features.
Multi-feature. With $x_1, x_2$ and degree 2, the expanded features are $1, x_1, x_2, x_1^2, x_1 x_2, x_2^2$, including the interaction $x_1 x_2$.
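Written out for two features $x_1$ and $x_2$, a degree-2 model makes the role of the interaction term concrete. The coefficient numbering $\beta_3, \beta_4, \beta_5$ is chosen here for illustration only.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2
\qquad\Longrightarrow\qquad
\frac{\partial \hat{y}}{\partial x_1} = \beta_1 + 2\beta_3 x_1 + \beta_4 x_2
```

The $\beta_4 x_2$ term in the derivative shows that the marginal effect of $x_1$ shifts with the value of $x_2$; without the interaction column ($\beta_4 = 0$) it would not.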
Choosing degree $d$.
| $d$ | Effect |
|---|---|
| 1 | Linear; may underfit (high bias) |
| 2–3 | Captures gentle curvature |
| 5–10+ | Risks oscillation (high variance, overfit) |
Use cross-validation to choose the $d$ that minimises validation MSE.
Difference from linear regression.
| Aspect | Linear | Polynomial |
|---|---|---|
| Relationship | Straight line | Curve of degree $d$ |
| Features | $x$ | $x, x^2, \dots, x^d$ |
| Fit method | OLS | OLS on expanded features |
| Bias / variance | High bias | Tunable via $d$ |
| Interpretability | Easy | Harder for large $d$ |
| Risk | Underfit | Overfit |
Pipelines.
Modern ML workflows chain many steps — impute, scale, expand, model. Doing them ad-hoc invites:
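- Data leakage: fitting a scaler on the whole dataset, including the test set, lets test-set statistics influence training and can inflate the score.
- Train/serve skew: production inference forgets a transformation that was applied during training.
- Difficulty in tuning: every preprocessing step must be repeated by hand inside each fold.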
Pipelines chain transformers + estimator into one object with the same .fit() and .predict() interface.
Code example.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("model", Ridge()),
])
grid = {
    "poly__degree": [1, 2, 3, 4, 5],
    "model__alpha": [0.01, 0.1, 1, 10],
}
search = GridSearchCV(pipe, grid, cv=5, scoring="r2")
search.fit(X_train, y_train)         # 5-fold CV over all 20 degree/alpha combinations
print(search.best_params_)           # best degree and alpha found by CV
print(search.score(X_test, y_test))  # R^2 of the refitted best pipeline on held-out data
Benefits illustrated.
- The StandardScaler is fitted inside each CV fold on its own training subset — no leakage.
- Grid search explores degree × alpha jointly.
- The single search.best_estimator_ can be saved and deployed; production code calls .predict(), so exactly the same transformations are applied.
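Saving and reloading a fitted pipeline could be sketched as follows, assuming joblib is used for persistence. The pipeline here is a stand-in for search.best_estimator_, and the file name and synthetic data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)
X_train = rng.uniform(-2.0, 2.0, size=(80, 1))
y_train = 1.0 + X_train[:, 0] ** 2 + rng.normal(scale=0.1, size=80)

# Stand-in for search.best_estimator_: any fitted Pipeline behaves the same way.
best = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", Ridge(alpha=0.1)),
]).fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "poly_pipeline.joblib")  # hypothetical file name
    joblib.dump(best, path)          # persists scaler, expansion and model together
    restored = joblib.load(path)

X_new = np.array([[0.5], [1.5]])     # raw, unscaled inputs as seen in production
same = np.allclose(best.predict(X_new), restored.predict(X_new))
print(same)
```

Because the whole pipeline is stored as one object, the restored copy takes raw inputs and reproduces the original predictions. joblib files are pickle-based, so they should only be loaded from trusted sources.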
ColumnTransformer lets you apply different transforms per column (scale numerics, one-hot categoricals) within the same pipeline.
Take-away. Polynomial regression is a powerful, interpretable extension of linear regression — but its tendency to overfit makes pipelines + regularisation + cross-validation essential.