PGDDSA Study · Semester 1

PGD01C02

Module 2 · Data Collection and Pre-Processing

Data Pre-Processing Overview

Core Titles

Key headlines and terms for quick recall

Data preprocessing — preparing raw data for analysis
Five major steps: Cleaning → Integration → Transformation → Reduction → Discretization
Goal: clean, consistent, model-ready data
Time cost: often estimated at 60–80% of project effort
Garbage in, garbage out
Tools: pandas, scikit-learn, dbt, Airflow, Great Expectations

Basic Idea

What it is, why it matters, how it works

Why preprocess?

Raw data is messy, inconsistent and noisy. Preprocessing transforms it into the clean, consistent format a model expects. Without it:

Missing values make many algorithms raise errors or silently drop rows.
Outliers distort fitted parameters and gradient updates.
Features on different scales unbalance distance metrics.
Strongly skewed features strain the assumptions behind linear models.
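The scale problem can be seen with two points and plain NumPy. This is a minimal sketch: the two customers and the "typical spread" values are made up for illustration and are not from any real dataset.

```python
import numpy as np

# Two hypothetical customers described by (age in years, income).
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 51_000.0])

# Raw Euclidean distance: the income difference of 1000 swamps the 35-year age gap.
raw_distance = np.linalg.norm(a - b)

# Rescale each feature by an assumed typical spread (illustrative values only).
spread = np.array([10.0, 20_000.0])
scaled_distance = np.linalg.norm((a - b) / spread)

print(round(raw_distance, 1))     # almost entirely the income difference
print(round(scaled_distance, 2))  # after rescaling, the age gap dominates
```

Any distance-based method (k-NN, k-means) would treat these two customers as differing mainly in income unless the features are rescaled first.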

The five steps

1. Data Cleaning. Remove noise and fix errors:

Missing values: drop, impute (mean/median/mode), or predict with a model.
Outliers: detect (z-score, IQR), then cap, remove or transform.
Duplicates: remove exact and fuzzy matches.
Inconsistencies: convert units and normalise formats.
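The first two cleaning tasks can be sketched with pandas. The column name and values below are hypothetical toy data; median imputation and capping are only one of the options listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one obvious outlier.
df = pd.DataFrame({"income": [30.0, 32.0, np.nan, 35.0, 31.0, 400.0]})

# 1) Impute missing values with the median, which is robust to the outlier.
median = df["income"].median()
df["income_imputed"] = df["income"].fillna(median)

# 2) Detect outliers with the IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1 = df["income_imputed"].quantile(0.25)
q3 = df["income_imputed"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["is_outlier"] = ~df["income_imputed"].between(lower, upper)

# 3) Cap (winsorise) the value instead of dropping the row.
df["income_capped"] = df["income_imputed"].clip(lower, upper)

print(df)
```

Keeping the original column next to the imputed and capped versions makes it easy to audit what the cleaning step changed.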

2. Data Integration. Merge multiple data sources into a unified view:

Resolve schema conflicts (different column names for the same thing).
Reconcile entities (the same customer in two systems).
Handle redundant or contradictory records.

3. Data Transformation. Reshape data for the model:

Scaling — standardisation $(x-\mu)/\sigma$ or min-max.
Encoding categoricals — one-hot, ordinal, target encoding.
Skew correction — log, square-root, Box–Cox.
Feature engineering — derive new features.
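The two scaling formulas above can be checked by hand against scikit-learn. This is a small sketch on made-up numbers; it assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data, shaped (n_samples, 1) as scikit-learn expects.
x = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

# Manual standardisation: z = (x - mu) / sigma.
mu = x.mean()
sigma = x.std()  # population standard deviation (ddof=0)
z_manual = (x - mu) / sigma

# StandardScaler applies the same formula column by column.
z_sklearn = StandardScaler().fit_transform(x)
assert np.allclose(z_manual, z_sklearn)

# Min-max scaling maps the feature onto the interval [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

print(mu, sigma)
```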

4. Data Reduction. Reduce volume while preserving information:

Feature selection: drop irrelevant or redundant features.
Dimensionality reduction: PCA; t-SNE (mainly for visualisation).
Sampling: work with a representative subset of very large data.
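One of the simplest forms of feature selection is removing columns that never vary. The sketch below uses scikit-learn's VarianceThreshold on a made-up matrix; real projects usually combine it with more informative criteria.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical matrix: the middle column is constant and carries no information.
X = np.array([
    [1.0, 7.0, 0.2],
    [2.0, 7.0, 0.9],
    [3.0, 7.0, 0.4],
    [4.0, 7.0, 0.7],
])

# threshold=0.0 (the default) removes features with zero variance.
selector = VarianceThreshold(threshold=0.0)
X_selected = selector.fit_transform(X)

print(selector.get_support())  # boolean mask of the columns that were kept
print(X_selected.shape)
```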

5. Data Discretization. Bin continuous variables into discrete intervals:

Equal-width, equal-frequency, model-based.
Useful for trees, fairness audits, business reporting.

Pipeline pattern

Modern practice wraps all preprocessing steps and the model in a single pipeline (e.g., scikit-learn Pipeline, TFX) so that:

Same transforms apply at training and inference.
Transform statistics are fit on training data only, so nothing leaks from the test set.
Easy to version, monitor and reproduce.
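The leakage point is easiest to see with cross-validation. In the sketch below the data is synthetic (generated only to make the example runnable), and the two-step pipeline is an illustrative choice rather than a recommended model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# The whole pipeline is re-fit inside each fold, so the scaler sees only
# that fold's training rows; held-out rows never influence the mean or
# standard deviation used for scaling.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Scaling the full dataset first and cross-validating only the model would let test-fold statistics influence training, which is the leakage the pipeline avoids.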

Output

A clean, well-shaped dataset — often saved to a feature store so multiple models reuse the same features.

Tools

Need	Tool
Tabular cleaning	pandas, NumPy, sklearn
Interactive cleaning	OpenRefine
Pipeline orchestration	Airflow, Prefect, Dagster
In-warehouse transforms	dbt
Quality testing	Great Expectations, Soda
Feature stores	Feast, Tecton

Mind Map

Visual structure of the concept

DATA PREPROCESSING — overview
├── Why? Garbage in, garbage out
├── Five steps
│   ├── 1 Cleaning (missing, outliers, dup, format)
│   ├── 2 Integration (merge sources)
│   ├── 3 Transformation (scale, encode, FE)
│   ├── 4 Reduction (select, PCA)
│   └── 5 Discretization (bin continuous)
├── Pipeline pattern
│   ├── Same transforms train ↔ infer
│   └── No leakage; reproducible
└── Tools
    ├── pandas, sklearn
    ├── Airflow, dbt
    └── Great Expectations

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. What are the five main steps of data preprocessing? Cleaning, Integration, Transformation, Reduction, Discretization.

Q2. What does "garbage in, garbage out" mean in data science? It means that the quality of a model's output is bounded by the quality of its input data. Noisy, biased or incomplete data produce unreliable models — no algorithm can rescue dirty data.

Q3. What is a preprocessing pipeline and why is it useful? A pipeline chains all preprocessing steps and the model into one reproducible artifact. It guarantees that the same transformations apply at training and inference, prevents data leakage, and makes the workflow versionable, testable and deployable.

Part B (20 marks)

Q. Explain the data-preprocessing process in detail. Discuss each step with examples and tools.

Why preprocessing matters. Raw data is rarely model-ready. Preprocessing fixes quality issues and reshapes features for the chosen algorithm. It is often estimated to consume 60–80% of project time, and it frequently influences model quality more than the choice of algorithm. Garbage in → garbage out.

Step 1 — Data Cleaning.

Tasks.

Handle missing values: drop rows/columns; impute (mean, median, mode, KNN, MICE); flag with a missingness indicator.
Detect outliers: z-score, IQR rule, isolation forest; cap (winsorise), remove, or transform.
Remove duplicates — exact and fuzzy (Levenshtein on names).
Fix inconsistencies — standardise dates, units, capitalisation.

Example. In a customer table, dates appear as "2024-11-15", "11/15/24", "Nov 15 2024" — normalise to ISO 8601.

Tools. pandas (dropna, fillna), sklearn (SimpleImputer, KNNImputer), OpenRefine.
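The date example above can be sketched with the standard library alone. The list of candidate formats is an assumption about the data: it reads "11/15/24" as month/day/year and relies on English month abbreviations.

```python
from datetime import datetime

# Candidate input formats, tried in order (an assumption about the data).
FORMATS = ["%Y-%m-%d", "%m/%d/%y", "%b %d %Y"]

def to_iso8601(raw: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if no format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")

dates = ["2024-11-15", "11/15/24", "Nov 15 2024"]
normalised = [to_iso8601(d) for d in dates]
print(normalised)  # ['2024-11-15', '2024-11-15', '2024-11-15']
```

Raising on unknown formats, rather than guessing, keeps ambiguous strings visible so they can be reviewed instead of being silently mis-parsed.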

Step 2 — Data Integration.

Tasks.

Merge multiple sources (CRM + billing + support logs).
Resolve schema conflicts ("cust_id" vs "CustomerID").
Reconcile duplicate entities across sources.

Example. Joining MySQL transactions with MongoDB user profiles into one Snowflake table.

Tools. SQL JOINs, pandas merge, dbt, Spark.
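A schema conflict such as "cust_id" vs "CustomerID" can be resolved in pandas before merging. The two tables below are hypothetical extracts, and the outer join is one reasonable choice for spotting unmatched entities.

```python
import pandas as pd

# Hypothetical extracts from two systems that name the key differently.
crm = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})
billing = pd.DataFrame({"CustomerID": [2, 3, 4], "amount_due": [120.0, 0.0, 75.5]})

# Resolve the schema conflict by mapping both sources to one column name.
billing = billing.rename(columns={"CustomerID": "cust_id"})

# An outer join keeps unmatched rows; indicator=True adds a "_merge" column
# recording whether each row came from the left, the right, or both tables.
merged = crm.merge(billing, on="cust_id", how="outer", indicator=True)

# Rows present in only one source are candidates for entity reconciliation.
unmatched = merged[merged["_merge"] != "both"]
print(merged)
```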

Step 3 — Data Transformation.

Tasks.

Scaling: standardisation $z = (x - \mu)/\sigma$, min-max, robust. Important for k-NN, SVM, neural nets and other distance- or gradient-based methods.
Categorical encoding: one-hot, ordinal, target/mean, hashing.
Skew correction: log, square-root, Box–Cox.
Feature engineering: derive new informative features (e.g., days_since_last_login).

Example. Encoding "city" as one-hot for a churn model; log-transforming "income" because it's right-skewed.

Tools. sklearn (StandardScaler, OneHotEncoder), pandas get_dummies, Featuretools.
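The churn example above might look like this in pandas. The city names and income figures are invented, and log1p is used instead of a plain log so that zero incomes stay defined.

```python
import numpy as np
import pandas as pd

# Hypothetical churn features.
df = pd.DataFrame({
    "city": ["Kochi", "Delhi", "Kochi", "Mumbai"],
    "income": [20_000, 45_000, 30_000, 900_000],
})

# log1p compresses the long right tail and is defined at zero.
df["log_income"] = np.log1p(df["income"])

# One-hot encode the nominal column: one indicator column per city.
encoded = pd.get_dummies(df, columns=["city"], prefix="city")
print(encoded.columns.tolist())
```

In a production pipeline, sklearn's OneHotEncoder is usually preferred over get_dummies because it remembers the categories seen during training.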

Step 4 — Data Reduction.

Tasks.

Feature selection: variance threshold, chi-square, mutual information, L1 (Lasso), Recursive Feature Elimination.
Dimensionality reduction: PCA, LDA, t-SNE, UMAP, autoencoders.
Sampling when data is huge.

Example. On a 7000-gene microarray dataset, PCA might reduce the features to around 50 components that capture about 95% of the variance, which speeds up modelling and reduces overfitting.

Tools. sklearn (PCA, SelectKBest, Lasso), Boruta.
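Keeping "enough components for 95% of the variance" can be sketched with scikit-learn's PCA. The data here is synthetic (50 correlated features generated from 5 latent factors), not a microarray, so the number of components kept is specific to this toy setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 samples, 50 correlated features driven by 5 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

# PCA is scale-sensitive, so standardise first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```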

Step 5 — Data Discretization.

Tasks.

Bin continuous values into intervals: equal-width, equal-frequency, model-based (decision-tree bins).

Example. Age binned into <25, 25–39, 40–59, ≥60 (non-overlapping bands) for a fairness audit or as a categorical input to a decision tree.

Tools. pandas cut, qcut; sklearn KBinsDiscretizer.
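Age banding with explicit edges, alongside an equal-frequency alternative, can be sketched as follows. The ages, edges and labels are illustrative; the bands are half-open so that each age falls into exactly one bin.

```python
import pandas as pd

ages = pd.Series([18, 24, 25, 39, 40, 59, 60, 83], name="age")

# Explicit edges with right=False give half-open bins [a, b),
# so a boundary value such as 40 belongs to exactly one band.
edges = [0, 25, 40, 60, 120]
labels = ["<25", "25-39", "40-59", "60+"]
age_band = pd.cut(ages, bins=edges, labels=labels, right=False)

# Equal-frequency alternative: four bins with roughly equal counts.
age_quartile = pd.qcut(ages, q=4, labels=False)

print(age_band.tolist())
print(age_quartile.tolist())
```

pd.cut suits business-defined bands, while pd.qcut adapts the edges to the data so that no bin ends up nearly empty.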

Pipeline pattern (production best practice).

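from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Chain imputation, scaling and the model into one estimator.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)  # X_train, y_train: your training split
pipe.predict(X_new)  # same transforms at inference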

A single pipeline:

Guarantees train/inference parity and, when fit on training data only, avoids leakage.
Is versionable in MLflow / DVC.
Is testable end-to-end.

Result. A clean, well-shaped dataset that feeds a robust, generalisable model.
