PGDDSA Study · Semester 1

PGD01C02

Module 2 · Data Collection and Pre-Processing

Data Reduction

Core Titles

Key headlines and terms for quick recall

Goal: reduce data volume / dimensions while preserving information
Why: curse of dimensionality, faster training, less overfitting, easier visualisation
Feature selection — drop unhelpful features
- Filter: variance threshold, $\chi^2$, mutual information
- Wrapper: RFE
- Embedded: L1 (Lasso), tree importance
Feature extraction: PCA, LDA, t-SNE, UMAP, autoencoders
Sampling / aggregation — work with subset
Numerosity reduction — store summary statistics
Data compression — lossless (gzip) and lossy (JPEG, low-rank SVD)

Basic Idea

What it is, why it matters, how it works

Why reduce?

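High-dimensional, voluminous data is expensive to store and train on, and it suffers from the curse of dimensionality: as the number of dimensions grows, pairwise distances concentrate (nearest and farthest neighbours look almost equally far away) and models overfit easily. Reduction: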

Speeds up training and inference.
Reduces overfitting (fewer parameters to fit).
Enables visualisation (2-D / 3-D plots).
Removes irrelevant / redundant features.

Two big categories

Feature selection — keep a subset of original features.
Feature extraction — construct new features from combinations of originals.

Feature selection methods

1. Filter methods (use statistics, model-independent).

Variance threshold — drop features with near-zero variance.
$\chi^2$ test — for categorical features vs categorical target.
ANOVA F-test — numeric features vs categorical target.
Mutual information — captures non-linear relationships.
Correlation filter — drop one of two highly-correlated features.
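As an illustration of a filter score, here is a minimal sketch of the $\chi^2$ statistic for one categorical feature against a categorical target, using only the standard library. The helper name `chi2_statistic` and the toy columns `colour`, `shape`, and `label` are hypothetical, made up for this example. A larger score means stronger dependence on the target.

```python
from collections import Counter

def chi2_statistic(feature, target):
    """Pearson chi-squared statistic for two categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, target))   # joint counts
    f_counts = Counter(feature)                # totals per feature value
    t_counts = Counter(target)                 # totals per target class
    stat = 0.0
    for f, fc in f_counts.items():
        for t, tc in t_counts.items():
            expected = fc * tc / n             # count expected under independence
            stat += (observed[(f, t)] - expected) ** 2 / expected
    return stat

# Hypothetical toy data: 'colour' tracks the label exactly, 'shape' is unrelated.
label  = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
colour = ["red", "red", "red", "red", "blue", "blue", "blue", "blue"]
shape  = ["sq", "ci", "sq", "ci", "sq", "ci", "sq", "ci"]

scores = {
    "colour": chi2_statistic(colour, label),
    "shape": chi2_statistic(shape, label),
}
print(scores)  # {'colour': 8.0, 'shape': 0.0}
```

A filter would keep `colour` and drop `shape`, without fitting any model.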

2. Wrapper methods (use the model itself).

Forward selection — start empty, add the feature that most improves CV score.
Backward elimination — start with all, remove the least useful.
Recursive Feature Elimination (RFE) — repeatedly remove the lowest-importance feature.

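Computationally expensive (the model is refit for every candidate subset) but often more accurate than filters, because features are judged by their effect on the actual model.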
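Forward selection can be sketched as follows. This is a minimal illustration, assuming ordinary least squares as the model and 5-fold cross-validated mean squared error as the score; the function names `cv_mse` and `forward_selection` and the synthetic data are hypothetical.

```python
import numpy as np

def cv_mse(X, y, cols, k=5):
    """Mean k-fold validation MSE of least squares on the chosen columns."""
    indices = np.arange(len(y))
    errors = []
    for fold in np.array_split(indices, k):
        train = np.setdiff1d(indices, fold)
        A_train = np.column_stack([np.ones(len(train)), X[np.ix_(train, cols)]])
        A_val = np.column_stack([np.ones(len(fold)), X[np.ix_(fold, cols)]])
        w, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        errors.append(np.mean((A_val @ w - y[fold]) ** 2))
    return float(np.mean(errors))

def forward_selection(X, y, max_features):
    """Start empty; repeatedly add the feature that most lowers the CV error."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = float("inf")
    while remaining and len(selected) < max_features:
        scores = {j: cv_mse(X, y, selected + [j]) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_score:   # no candidate improves the score
            break
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

# Synthetic data (assumption): only columns 1 and 4 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=120)

selected = forward_selection(X, y, max_features=3)
print(selected)  # columns 1 and 4 should be picked first
```

Note the cost: each step refits the model once per remaining feature per fold, which is why wrappers scale poorly to thousands of features.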

3. Embedded methods (feature selection inside the model).

L1 (Lasso) regression — shrinks unimportant weights to zero.
Tree-based importance — Random Forest / XGBoost feature importance.
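To see how L1 regularisation sets weights exactly to zero, here is a small sketch of Lasso solved by cyclic coordinate descent with soft-thresholding. It is one common way to fit Lasso, not the only one; the objective scaling, the penalty `lam=0.5`, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t; values inside [-t, t] become exactly 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n) * ||y - Xw||^2 + lam * ||w||_1 by coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            residual = y - X @ w + X[:, j] * w[j]   # residual ignoring feature j
            rho = X[:, j] @ residual / n            # correlation of feature j with it
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

# Synthetic data (assumption): only columns 0 and 2 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)

w = lasso_cd(X, y, lam=0.5)
selected = np.flatnonzero(w)   # features with non-zero weight survive
print(w.round(2), selected)
```

The surviving weights are also shrunk towards zero (here by roughly `lam`), which is the price paid for the built-in selection.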

Feature extraction (dimensionality reduction)

1. Linear methods.

Principal Component Analysis (PCA).

Project data onto orthogonal directions of maximum variance.
Components = eigenvectors of covariance matrix.
Keep the top $k$ components that together retain, say, 95% of the total variance.
Unsupervised.

Linear Discriminant Analysis (LDA).

Like PCA but supervised: maximises the ratio of between-class variance to within-class variance.
Reduces to at most $K - 1$ dimensions ($K$ = number of classes).
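The ratio that LDA maximises can be written as the Fisher criterion. The notation below ($w$ for a projection direction, $\mu_c$ and $n_c$ for the mean and size of class $c$, $\mu$ for the overall mean) is introduced here for illustration and is not taken from the notes:

$$
J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w}, \qquad
S_B = \sum_{c=1}^{K} n_c\, (\mu_c - \mu)(\mu_c - \mu)^\top, \qquad
S_W = \sum_{c=1}^{K} \sum_{i:\, y_i = c} (x_i - \mu_c)(x_i - \mu_c)^\top
$$

Because $\sum_{c} n_c (\mu_c - \mu) = 0$, the $K$ vectors $\mu_c - \mu$ are linearly dependent, so $S_B$ has rank at most $K - 1$. That is where the limit of $K - 1$ useful dimensions comes from.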

Singular Value Decomposition (SVD).

Foundation of PCA; works on any matrix.
Truncated SVD = best rank-$k$ approximation.
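A short NumPy sketch of truncated SVD as a rank-$k$ approximation. The matrix `A` is synthetic (an assumption for the demo): a rank-5 matrix plus a little noise. The check at the end uses the known property that the Frobenius error of the rank-$k$ truncation equals the root of the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 50 x 30 matrix that is approximately rank 5.
A = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 30))
A += 0.01 * rng.normal(size=A.shape)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s is sorted descending

k = 5
A_k = (U[:, :k] * s[:k]) @ Vt[:k]                  # keep the top-k singular triplets

error = np.linalg.norm(A - A_k)                    # Frobenius norm of the residual
expected = np.sqrt(np.sum(s[k:] ** 2))             # energy in the discarded values

stored_full = A.size                               # 50 * 30 numbers
stored_trunc = k * (A.shape[0] + A.shape[1] + 1)   # U_k, V_k and k singular values
print(error, expected, stored_full, stored_trunc)
```

Here the truncated form stores 405 numbers instead of 1500, which is the sense in which low-rank SVD acts as lossy compression.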

2. Non-linear methods.

t-SNE preserves local neighbourhoods; great for visualisation, but distances between clusters and cluster sizes in the embedding are not meaningful.
UMAP is typically faster than t-SNE and tends to preserve more global structure.
Autoencoders — neural nets that compress to a latent code.
Kernel PCA — PCA in a high-dim kernel space.

Numerosity reduction (fewer records, similar information)

Sampling — random or stratified subset.
Aggregation — daily → monthly summaries.
Clustering — represent groups by centroids.
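Sampling and aggregation can be sketched with the standard library alone. The record layout `(day, label, reading)`, the helper name `stratified_sample`, and the 20% fraction are assumptions made up for this example.

```python
import random
from collections import defaultdict
from statistics import mean

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every group so class proportions are kept."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        n_take = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n_take))
    return sample

# Hypothetical records: (day, label, reading); 90 'ok' rows and 10 'fault' rows.
rows = [(i % 5, "fault" if i % 10 == 0 else "ok", float(i)) for i in range(100)]

# Sampling: 20% of each class, so the rare 'fault' class is not lost.
sample = stratified_sample(rows, key=lambda r: r[1], fraction=0.2)

# Aggregation: replace 100 readings by one mean per day.
by_day = defaultdict(list)
for day, _, reading in rows:
    by_day[day].append(reading)
daily_mean = {day: mean(values) for day, values in sorted(by_day.items())}

print(len(sample), daily_mean)
```

A plain random 20% sample could by chance contain no `fault` rows at all; stratifying by label guarantees the 9:1 class ratio survives.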

Data compression

Lossless — gzip, Parquet snappy compression.
Lossy — JPEG, low-rank SVD truncation, quantisation.
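The difference between lossless and lossy compression can be shown in a few lines. This sketch uses `gzip` for the lossless round trip and a simple 8-bit uniform quantiser as the lossy example; the `values` list is a hypothetical series of sensor readings.

```python
import gzip
import struct

# Hypothetical sensor readings, stored as 8-byte floats.
values = [round(20 + 0.01 * i, 2) for i in range(1000)]
raw = struct.pack(f"{len(values)}d", *values)

# Lossless: decompression returns the original bytes exactly.
packed = gzip.compress(raw)
restored = list(struct.unpack(f"{len(values)}d", gzip.decompress(packed)))
lossless_ok = restored == values

# Lossy: quantise each value to one of 256 levels (1 byte per value).
lo, hi = min(values), max(values)
step = (hi - lo) / 255
codes = bytes(round((v - lo) / step) for v in values)
approx = [lo + c * step for c in codes]
max_error = max(abs(a - v) for a, v in zip(approx, values))

print(len(raw), len(codes), lossless_ok, max_error <= step / 2 + 1e-9)
```

Quantisation cuts storage from 8 bytes to 1 byte per value, but each reconstructed value may be off by up to half a quantisation step; gzip loses nothing.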

Discretisation (covered separately) also reduces effective cardinality.

When to use what

Need	Use
Drop noise features	Filter (variance, MI)
Find smallest accurate subset	Wrapper (RFE)
Train as you select	Embedded (Lasso, tree importance)
Visualise high-dim	PCA, t-SNE, UMAP
Compress images / matrices	SVD truncation

Mind Map

Visual structure of the concept

DATA REDUCTION
├── Why?
│   ├── Curse of dimensionality
│   ├── Speed
│   ├── Avoid overfitting
│   └── Visualisation
├── Feature Selection (keep subset)
│   ├── Filter
│   │   ├── Variance threshold
│   │   ├── χ², mutual info
│   │   └── ANOVA F
│   ├── Wrapper
│   │   ├── Forward / Backward
│   │   └── RFE
│   └── Embedded
│       ├── L1 (Lasso)
│       └── Tree importance
├── Feature Extraction (new features)
│   ├── Linear
│   │   ├── PCA — variance
│   │   ├── LDA — between-class
│   │   └── SVD
│   └── Non-linear
│       ├── t-SNE / UMAP
│       └── Autoencoders
├── Numerosity reduction
│   ├── Sampling
│   ├── Aggregation
│   └── Clustering
└── Compression (gzip, JPEG)

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Why is dimensionality reduction useful?

It combats the curse of dimensionality, speeds up training and inference, removes noisy/redundant features, reduces overfitting, and enables 2-D / 3-D visualisation of high-dimensional data.

Q2. Differentiate filter and wrapper feature selection.

Filter: uses statistical scores (variance, $\chi^2$, mutual info) independent of any model. Fast.
Wrapper: searches for the subset that maximises a specific model's CV score (e.g., RFE). Accurate but expensive.

Q3. What does PCA do?

PCA projects data onto a low-dimensional space whose basis vectors (principal components) are the eigenvectors of the covariance matrix, that is, the orthogonal directions of maximum variance.

Part B (20 marks)

Q. Explain data reduction techniques used in data science. Discuss feature selection and dimensionality reduction with examples.

Why reduce. High-dimensional data is expensive to store, slow to train on, easy to overfit, and hard to visualise. Reduction strategies cut features (selection) or compress them into fewer informative ones (extraction) while preserving signal.

Feature Selection — keep a subset of original features.

1. Filter methods. Use statistics independent of any model.

Variance threshold — drop near-constant features.
$\chi^2$ test — categorical features vs categorical target.
ANOVA F-test — numeric features vs categorical target.
Mutual information — captures non-linear dependencies.
Correlation filter — drop one of two highly correlated features.

Example. In a 200-feature gene-expression dataset, drop features with variance $< 0.01$.
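A variance threshold followed by a correlation filter might look like the NumPy sketch below. The four synthetic columns and the 0.95 correlation cut-off are assumptions for illustration; the 0.01 variance cut-off matches the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)                             # informative feature
b = a + 0.01 * rng.normal(size=n)                  # near-duplicate of a
c = rng.normal(size=n)                             # independent feature
d = np.full(n, 3.0) + 0.001 * rng.normal(size=n)   # almost constant
X = np.column_stack([a, b, c, d])

# Step 1: variance threshold, drop features with variance below 0.01.
variances = X.var(axis=0)
keep = [j for j in range(X.shape[1]) if variances[j] >= 0.01]

# Step 2: correlation filter, keep a feature only if it is not highly
# correlated (|r| >= 0.95) with one that has already been selected.
corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
selected = []
for pos, j in enumerate(keep):
    if all(corr[pos, keep.index(s)] < 0.95 for s in selected):
        selected.append(j)

print(keep, selected)  # [0, 1, 2] then [0, 2]
```

Column 3 is removed for being near-constant and column 1 for duplicating column 0. Neither step looks at the target, so both are cheap but blind to relevance.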

2. Wrapper methods. Score subsets by running the model.

Forward selection — start empty; add the best feature each step.
Backward elimination — start with all; drop the worst each step.
Recursive Feature Elimination (RFE) — fit, drop lowest-importance, refit.

Example. RFE with logistic regression picks 10 of 100 features that give the best CV F1.
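The fit, drop, refit loop of RFE can be sketched as below. To keep the example short, ordinary least squares stands in for the logistic regression mentioned above, and the absolute coefficient on standardised features is used as the importance score; both choices, the function name `rfe`, and the synthetic data are assumptions.

```python
import numpy as np

def rfe(X, y, n_keep):
    """Recursive feature elimination with least squares as the model."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise so |coef| is comparable
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        w, *_ = np.linalg.lstsq(Xs[:, active], y - y.mean(), rcond=None)
        weakest = int(np.argmin(np.abs(w)))     # position of the least important feature
        del active[weakest]                     # drop it, then refit on the rest
    return active

# Synthetic data (assumption): only columns 0, 3 and 6 influence the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 8))
y = 4.0 * X[:, 0] + 3.0 * X[:, 3] - 2.0 * X[:, 6] + 0.1 * rng.normal(size=150)

kept = rfe(X, y, n_keep=3)
print(kept)  # [0, 3, 6]
```

In practice the number of features to keep is itself tuned by cross-validation rather than fixed in advance.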

3. Embedded methods. Selection happens inside the model.

L1 (Lasso) shrinks unimportant coefficients to zero.
Tree-based importance (Random Forest, XGBoost) ranks features by how much they reduce loss.

Feature Extraction — construct new features.

1. Principal Component Analysis (PCA).

Centre the data, then compute the covariance matrix $\Sigma$.
Eigendecompose $\Sigma$ — eigenvectors = principal components, eigenvalues = variance along them.
Keep top $k$ to retain (say) 95% cumulative variance.
Unsupervised, linear, orthogonal.
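The steps above can be sketched directly in NumPy. The data are synthetic (an assumption): 300 samples in 10 dimensions that really vary along only 3 latent directions, plus a little noise.

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3)) * np.array([5.0, 3.0, 2.0])
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(300, 10))

Xc = X - X.mean(axis=0)                      # 1. centre the data
Sigma = np.cov(Xc, rowvar=False)             # 2. covariance matrix (10 x 10)
eigvals, eigvecs = np.linalg.eigh(Sigma)     # 3. eigendecompose (ascending order)
order = np.argsort(eigvals)[::-1]            #    sort by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

ratio = np.cumsum(eigvals) / eigvals.sum()   # cumulative explained variance
k = int(np.searchsorted(ratio, 0.95)) + 1    # 4. smallest k reaching 95%
Z = Xc @ eigvecs[:, :k]                      # 5. project onto the top-k components

print(k, Z.shape, ratio[:3].round(4))
```

`np.linalg.eigh` is used because the covariance matrix is symmetric. For wide or very large data, the same components are usually obtained from an SVD of the centred matrix instead of forming $\Sigma$ explicitly.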

Example. Eigenfaces: represent each 4096-pixel face image by its top 50 principal components instead of the raw pixels.

2. Linear Discriminant Analysis (LDA).

Supervised — maximises between-class variance over within-class variance.
Reduces to at most $K - 1$ dimensions for $K$ classes.
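Example. For two classes (ALL vs AML leukaemia), LDA projects about 7000 gene features onto a single axis ($K - 1 = 1$). With far more features than samples, regularisation or a prior PCA step is usually needed, otherwise the apparent separation is largely overfitting.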
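For the two-class case, LDA reduces to a single direction that can be computed in closed form. The sketch below uses synthetic 5-D data (an assumption, not the leukaemia dataset) in which the classes differ only along feature 0.

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(size=(100, 5))           # class 0
X1 = rng.normal(size=(100, 5))           # class 1 ...
X1[:, 0] += 4.0                          # ... shifted along feature 0 only

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter: pooled spread of each class around its own mean.
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

# Fisher direction for two classes: S_w^{-1} (mu1 - mu0).
w = np.linalg.solve(S_w, mu1 - mu0)
w /= np.linalg.norm(w)

z0, z1 = X0 @ w, X1 @ w                  # 1-D projections (K - 1 = 1 axis)
threshold = (z0.mean() + z1.mean()) / 2  # midpoint between projected class means
accuracy = (np.sum(z0 < threshold) + np.sum(z1 >= threshold)) / 200

print(w.round(2), accuracy)
```

The learned direction points almost entirely along feature 0, the only one that separates the classes. When there are more features than samples, `S_w` is singular and this solve fails, which is why regularisation or a prior PCA step is used.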

3. Truncated SVD. Best rank-$k$ approximation; used for sparse text matrices (TF-IDF), where PCA's mean-centring step would turn the sparse matrix dense and make it impractical to store.

4. t-SNE / UMAP. Non-linear; preserve local neighbourhoods. Used for visualising clusters in 2-D / 3-D.

5. Autoencoders. Neural nets with a bottleneck layer — encode data into a low-dim latent code, decode back. Captures non-linear structure.

Numerosity reduction.

Sampling — random or stratified subset.
Aggregation — daily summaries instead of per-second readings.
Clustering — represent groups by centroids.
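Representing groups by centroids can be sketched with a few iterations of k-means (Lloyd's algorithm) in NumPy. The two synthetic blobs and the choice of starting centroids (one point taken from each blob) are assumptions that keep the demo deterministic; real use needs a proper initialisation such as k-means++.

```python
import numpy as np

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
blob_b = rng.normal(loc=10.0, scale=1.0, size=(100, 2))
X = np.vstack([blob_a, blob_b])                # 200 points

centroids = X[[0, -1]].copy()                  # assumption: one start point per blob
for _ in range(10):
    # Assign every point to its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Move each centroid to the mean of its assigned points.
    centroids = np.array([X[labels == c].mean(axis=0) for c in range(2)])

counts = np.bincount(labels, minlength=2)
print(centroids.round(2), counts)
```

The 200 points are now summarised by 2 centroids plus their counts. How much information survives depends on how tight the clusters are.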

When to use what.

Need	Use
Drop irrelevant features	Filter (variance, MI)
Find smallest accurate subset	Wrapper (RFE)
Selection inside training	Embedded (Lasso, RF)
Linear dimensionality reduction	PCA, LDA, SVD
Visualise high-dim data	t-SNE, UMAP
Non-linear compression	Autoencoders

Caveat. Feature selection keeps interpretability; feature extraction may obscure original meaning. Choose accordingly.
