Data Reduction
Core Titles
Key headlines and terms for quick recall
- Goal: reduce data volume / dimensions while preserving information
- Why: curse of dimensionality, faster training, less overfitting, easier viz
- Feature selection — drop unhelpful features
- Filter: variance threshold, χ², mutual information
- Wrapper: RFE
- Embedded: L1 (Lasso), tree importance
- Feature extraction: PCA, LDA, t-SNE, UMAP, autoencoders
- Sampling / aggregation — work with subset
- Numerosity reduction — store summary statistics
- Data compression — lossless (gzip) and lossy (JPEG, low-rank SVD)
Basic Idea
What it is, why it matters, how it works
Why reduce?
High-dimensional, voluminous data is expensive to store and train on, and suffers from the curse of dimensionality: pairwise distances become nearly uniform, so nearest-neighbour comparisons lose meaning, and models overfit easily. Reduction:
- Speeds up training and inference.
- Reduces overfitting (fewer parameters to fit).
- Enables visualisation (2-D / 3-D plots).
- Removes irrelevant / redundant features.
Two big categories
- Feature selection — keep a subset of original features.
- Feature extraction — construct new features from combinations of originals.
Feature selection methods
1. Filter methods (use statistics, model-independent).
- Variance threshold — drop features with near-zero variance.
- χ² test: categorical features vs categorical target.
- ANOVA F-test — numeric features vs categorical target.
- Mutual information — captures non-linear relationships.
- Correlation filter — drop one of two highly-correlated features.
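The filter steps above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not a recommended pipeline: the three columns, the threshold value `1e-8`, and `k=1` are all assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)                  # binary (categorical) target
informative = y + rng.normal(0, 0.3, size=n)    # numeric feature tied to the target
noise = rng.normal(0, 1, size=n)                # numeric feature unrelated to the target
constant = np.full(n, 5.0)                      # zero-variance feature
X = np.column_stack([informative, noise, constant])

# Step 1: variance threshold drops the constant column.
vt = VarianceThreshold(threshold=1e-8)
X_var = vt.fit_transform(X)
kept_after_variance = vt.get_support(indices=True)

# Step 2: ANOVA F-test keeps the single highest-scoring remaining column.
skb = SelectKBest(score_func=f_classif, k=1)
X_best = skb.fit_transform(X_var, y)
best_original_index = kept_after_variance[skb.get_support(indices=True)[0]]
```

No model is trained at any point; both steps score features from statistics alone, which is what makes filters fast.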
2. Wrapper methods (use the model itself).
- Forward selection — start empty, add the feature that most improves CV score.
- Backward elimination — start with all, remove the least useful.
- Recursive Feature Elimination (RFE) — repeatedly remove the lowest-importance feature.
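Computationally expensive (the model is refit for every candidate subset), but usually more accurate than filters because features are judged by their effect on the model.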
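As a rough sketch of RFE, the snippet below uses scikit-learn's `RFE` with logistic regression on synthetic data. The dataset sizes and the choice of 3 features to keep are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 carry signal (illustrative values).
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Fit, drop the feature with the smallest coefficient magnitude, refit, repeat.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe.fit(X, y)

selected = rfe.get_support(indices=True)   # indices of the surviving features
ranking = rfe.ranking_                     # 1 = kept; larger = eliminated earlier
```

With `step=1` the model is refit once per eliminated feature, which is where the cost of wrapper methods comes from.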
3. Embedded methods (feature selection inside the model).
- L1 (Lasso) regression — shrinks unimportant weights to zero.
- Tree-based importance — Random Forest / XGBoost feature importance.
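A small sketch of L1-based selection, assuming synthetic data in which only the first two of six features affect the target. The penalty strength `alpha=0.1` is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Only features 0 and 1 drive the target; the other four are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=200)

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Features whose coefficient was not shrunk to exactly zero are "selected".
selected = np.flatnonzero(lasso.coef_ != 0)
```

Selection here is a by-product of training: the same fit that produces the model also produces the feature subset.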
Feature extraction (dimensionality reduction)
1. Linear methods.
Principal Component Analysis (PCA).
- Project data onto orthogonal directions of maximum variance.
- Components = eigenvectors of covariance matrix.
- Keep the top k components that retain, say, 95% of the variance.
- Unsupervised.
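The PCA recipe above can be written out directly in NumPy. This is a sketch on synthetic 3-D data that varies mostly along one direction; the data and the 95% cut-off are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# 3-D data generated from a single latent factor plus small noise.
latent = rng.normal(size=(500, 1))
X = latent @ np.array([[2.0, 1.0, 0.5]]) + rng.normal(0, 0.1, size=(500, 3))

Xc = X - X.mean(axis=0)                    # centre each feature
cov = np.cov(Xc, rowvar=False)             # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Smallest k whose cumulative explained variance reaches 95%.
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.95) + 1)

Z = Xc @ eigvecs[:, :k]                    # project onto the top k components
```

In practice `sklearn.decomposition.PCA(n_components=0.95)` performs the same selection of k by explained variance.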
Linear Discriminant Analysis (LDA).
- Like PCA but supervised: maximises the ratio of between-class variance to within-class variance.
- Reduces to at most C − 1 dimensions (C = number of classes).
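A minimal sketch of supervised reduction with scikit-learn's `LinearDiscriminantAnalysis`, using the bundled Iris data (4 features, 3 classes) as an assumed example. With 3 classes, at most 2 discriminant axes are available.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)          # 150 samples, 4 features, 3 classes

# Upper bound on LDA output dimensions: min(n_features, n_classes - 1).
max_components = min(X.shape[1], len(np.unique(y)) - 1)

lda = LinearDiscriminantAnalysis(n_components=max_components)
Z = lda.fit_transform(X, y)                # labels y are required, unlike PCA
```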
Singular Value Decomposition (SVD).
- Foundation of PCA; works on any matrix.
- Truncated SVD = best rank-k approximation.
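To make the rank-k approximation concrete, here is a NumPy sketch on a small random matrix. The matrix size and `k = 2` are arbitrary assumptions; the check at the end uses the standard fact that the Frobenius error of the truncation equals the root of the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # singular values in descending order

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # keep only the top k singular triplets

error = np.linalg.norm(A - A_k, "fro")
expected_error = np.sqrt(np.sum(s[k:] ** 2))       # energy in the discarded singular values
```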
2. Non-linear methods.
- t-SNE: preserves local neighbourhoods; great for visualisation, but distances between clusters in the embedding are not meaningful.
- UMAP — faster than t-SNE, preserves more global structure.
- Autoencoders — neural nets that compress to a latent code.
- Kernel PCA — PCA in a high-dim kernel space.
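As one illustration of why non-linear methods exist, the sketch below compares linear PCA with Kernel PCA on two concentric rings. The dataset, the RBF kernel, and `gamma=10` are assumptions picked for the demonstration. Linear PCA with all components kept only rotates the data, so each point keeps its distance from the centre and the rings stay nested; Kernel PCA maps the points through a non-linear feature space first.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric rings: not separable by any linear projection.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

Z_lin = PCA(n_components=2).fit_transform(X)
Z_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Linear PCA preserves every point's distance from the data centre.
radius_before = np.linalg.norm(X - X.mean(axis=0), axis=1)
radius_after = np.linalg.norm(Z_lin, axis=1)
```

Plotting `Z_lin` and `Z_rbf` coloured by `y` is a quick way to compare the two embeddings.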
Numerosity reduction (less data, same info)
- Sampling — random or stratified subset.
- Aggregation — daily → monthly summaries.
- Clustering — represent groups by centroids.
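Sampling and aggregation can be sketched with pandas. The table below (90 days of hypothetical sales with a `region` column) is made up for illustration, as is the 20% sampling fraction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=90, freq="D")
daily = pd.DataFrame(
    {
        "sales": rng.integers(50, 150, size=90),
        # every third day is "north": 30 north rows, 60 south rows
        "region": np.where(np.arange(90) % 3 == 0, "north", "south"),
    },
    index=idx,
)

# Aggregation: 90 daily rows become 3 monthly totals.
monthly = daily["sales"].resample("MS").sum()

# Stratified sampling: draw 20% from each region so proportions are preserved.
sample = daily.groupby("region").sample(frac=0.2, random_state=0)
```

The monthly totals still sum to the same grand total as the daily data, while the stratified sample keeps the 1:2 north-to-south ratio of the full table.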
Data compression
- Lossless — gzip, Parquet snappy compression.
- Lossy — JPEG, low-rank SVD truncation, quantisation.
Discretisation (covered separately) also reduces effective cardinality.
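The difference between lossless and lossy compression can be shown in a few lines. This sketch assumes a random float32 signal in [0, 1]; gzip stands in for lossless compression and 8-bit quantisation for lossy compression.

```python
import gzip
import numpy as np

rng = np.random.default_rng(0)
signal = rng.uniform(0.0, 1.0, size=1000).astype(np.float32)

# Lossless: the decompressed bytes reproduce the array exactly.
packed = gzip.compress(signal.tobytes())
restored = np.frombuffer(gzip.decompress(packed), dtype=np.float32)

# Lossy: store each value as one of 256 levels (1 byte instead of 4).
quantised = np.round(signal * 255).astype(np.uint8)
approx = quantised.astype(np.float32) / 255
max_error = float(np.abs(signal - approx).max())   # bounded by half a quantisation step
```

Quantisation cuts storage to a quarter here, at the price of an error of at most about 0.5/255 per value that cannot be undone.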
When to use what
| Need | Use |
|---|---|
| Drop noise features | Filter (variance, MI) |
| Find smallest accurate subset | Wrapper (RFE) |
| Train as you select | Embedded (Lasso, tree importance) |
| Visualise high-dim | PCA, t-SNE, UMAP |
| Compress images / matrices | SVD truncation |
Mind Map
Visual structure of the concept
DATA REDUCTION
├── Why?
│ ├── Curse of dimensionality
│ ├── Speed
│ ├── Avoid overfitting
│ └── Visualisation
├── Feature Selection (keep subset)
│ ├── Filter
│ │ ├── Variance threshold
│ │ ├── χ², mutual info
│ │ └── ANOVA F
│ ├── Wrapper
│ │ ├── Forward / Backward
│ │ └── RFE
│ └── Embedded
│ ├── L1 (Lasso)
│ └── Tree importance
├── Feature Extraction (new features)
│ ├── Linear
│ │ ├── PCA — variance
│ │ ├── LDA — between-class
│ │ └── SVD
│ └── Non-linear
│ ├── t-SNE / UMAP
│ └── Autoencoders
├── Numerosity reduction
│ ├── Sampling
│ ├── Aggregation
│ └── Clustering
└── Compression (gzip, JPEG)
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions
Part A (2 marks each)
Q1. Why is dimensionality reduction useful? It combats the curse of dimensionality, speeds up training and inference, removes noisy/redundant features, reduces overfitting, and enables 2-D / 3-D visualisation of high-dimensional data.
Q2. Differentiate filter and wrapper feature selection.
- Filter: uses statistical scores (variance, χ², mutual info) independent of any model. Fast.
- Wrapper — searches for the subset that maximises a specific model's CV score (e.g., RFE). Accurate but expensive.
Q3. What does PCA do? PCA projects data onto a low-dimensional space whose basis vectors (principal components) are the eigenvectors of the covariance matrix — the orthogonal directions of maximum variance.
Part B (20 marks)
Q. Explain data reduction techniques used in data science. Discuss feature selection and dimensionality reduction with examples.
Why reduce. High-dimensional data is expensive to store, slow to train on, easy to overfit, and hard to visualise. Reduction strategies cut features (selection) or compress them into fewer informative ones (extraction) while preserving signal.
Feature Selection — keep a subset of original features.
1. Filter methods. Use statistics independent of any model.
- Variance threshold — drop near-constant features.
- χ² test: categorical features vs categorical target.
- ANOVA F-test — numeric features vs categorical target.
- Mutual information — captures non-linear dependencies.
- Correlation filter — drop one of two highly correlated features.
Example. In a 200-feature gene-expression dataset, drop features whose variance falls below a chosen threshold.
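The claim that mutual information captures non-linear dependencies can be checked on a toy case. In this sketch (synthetic data, all names hypothetical) the target is the square of a symmetric feature, so the Pearson correlation is close to zero even though the target is almost fully determined by the feature.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500)              # symmetric around zero
noise_feature = rng.normal(size=500)          # unrelated to the target
y = x ** 2 + rng.normal(0, 0.01, size=500)    # strong but non-linear dependence

X = np.column_stack([x, noise_feature])

pearson = np.corrcoef(x, y)[0, 1]             # linear measure: close to zero
mi = mutual_info_regression(X, y, random_state=0)   # one MI estimate per column
```

A correlation-based filter would discard `x` here, while a mutual-information filter would rank it well above the noise column.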
2. Wrapper methods. Score subsets by running the model.
- Forward selection — start empty; add the best feature each step.
- Backward elimination — start with all; drop the worst each step.
- Recursive Feature Elimination (RFE) — fit, drop lowest-importance, refit.
Example. RFE with logistic regression picks 10 of 100 features that give the best CV F1.
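Forward selection, the first wrapper listed above, can be sketched with scikit-learn's `SequentialFeatureSelector`. The synthetic regression data, the linear model, and the choice of 2 features are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Only features 0 and 1 influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=200)

# Start empty; at each step add the feature that most improves the 5-fold CV score.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2,
                                direction="forward", cv=5)
sfs.fit(X, y)
selected = sfs.get_support(indices=True)
```

Setting `direction="backward"` gives backward elimination with the same interface.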
3. Embedded methods. Selection happens inside the model.
- L1 (Lasso) shrinks unimportant coefficients to zero.
- Tree-based importance (Random Forest, XGBoost) ranks features by how much they reduce loss.
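Tree-based importance can be read directly off a fitted forest. In this sketch the label depends only on feature 2 of five synthetic features; the forest size and seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 2] > 0).astype(int)          # the label is determined by feature 2 alone

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_     # impurity-based, normalised to sum to 1
ranking = np.argsort(importances)[::-1]       # most important feature first
```

A threshold on `importances` (or `sklearn.feature_selection.SelectFromModel`) then turns the ranking into a feature subset.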
Feature Extraction — construct new features.
1. Principal Component Analysis (PCA).
- Centre the data and compute the covariance matrix Σ.
- Eigendecompose Σ: eigenvectors = principal components, eigenvalues = variance along them.
- Keep the top k components to retain (say) 95% cumulative variance.
- Unsupervised, linear, orthogonal.
Example. Eigenfaces: represent face images by their top 50 principal components instead of all 4096 pixel values.
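The PCA steps can be written compactly as equations. The notation is an assumption introduced here, not taken from the notes: X is the n × d data matrix, X_c its centred version, λ_i and v_i the eigenvalues and eigenvectors of Σ, and V_k the matrix of the top k eigenvectors.

```latex
\begin{aligned}
X_c &= X - \mathbf{1}\,\bar{x}^{\top} && \text{(centre each feature)} \\
\Sigma &= \frac{1}{n-1}\, X_c^{\top} X_c && \text{(sample covariance)} \\
\Sigma v_i &= \lambda_i v_i, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0 \\
k &= \min\left\{ m : \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \ge 0.95 \right\} \\
Z &= X_c V_k, \qquad V_k = [\, v_1 \ \cdots \ v_k \,]
\end{aligned}
```

Z is the n × k reduced representation; for the eigenfaces example, d = 4096 and k = 50.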
2. Linear Discriminant Analysis (LDA).
- Supervised — maximises between-class variance over within-class variance.
- Reduces to at most C − 1 dimensions for C classes.
- Example. With two classes (ALL vs AML leukaemia), LDA reduces roughly 7000 gene features to a single discriminant axis (C − 1 = 1) along which the two classes separate.
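The limit of C − 1 dimensions follows from the objective LDA optimises. The symbols below are assumptions introduced for this sketch: μ_c is the mean of class c, μ the overall mean, n_c the number of samples in class c, and w a projection direction.

```latex
\begin{aligned}
S_W &= \sum_{c=1}^{C} \sum_{i \,:\, y_i = c} (x_i - \mu_c)(x_i - \mu_c)^{\top} && \text{(within-class scatter)} \\
S_B &= \sum_{c=1}^{C} n_c\, (\mu_c - \mu)(\mu_c - \mu)^{\top} && \text{(between-class scatter)} \\
J(w) &= \frac{w^{\top} S_B\, w}{w^{\top} S_W\, w} && \text{(criterion to maximise)} \\
\operatorname{rank}(S_B) &\le C - 1
\end{aligned}
```

The rank bound holds because the C vectors μ_c − μ satisfy the linear constraint Σ_c n_c (μ_c − μ) = 0, so at most C − 1 directions have non-zero between-class variance.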
3. Truncated SVD. Best rank-k approximation; used for sparse text matrices (TF-IDF), where the mean-centring step of PCA would destroy sparsity and make the computation impractical.
4. t-SNE / UMAP. Non-linear; preserve local neighbourhoods. Used for visualising clusters in 2-D / 3-D.
5. Autoencoders. Neural nets with a bottleneck layer — encode data into a low-dim latent code, decode back. Captures non-linear structure.
Numerosity reduction.
- Sampling — random or stratified subset.
- Aggregation — daily summaries instead of per-second readings.
- Clustering — represent groups by centroids.
When to use what.
| Need | Use |
|---|---|
| Drop irrelevant features | Filter (variance, MI) |
| Find smallest accurate subset | Wrapper (RFE) |
| Selection inside training | Embedded (Lasso, RF) |
| Linear dimensionality reduction | PCA, LDA, SVD |
| Visualise high-dim data | t-SNE, UMAP |
| Non-linear compression | Autoencoders |
Caveat. Feature selection keeps interpretability; feature extraction may obscure original meaning. Choose accordingly.