PGD01C02
Module 2 · Data Collection and Pre-Processing

Data Pre-Processing Overview

Core Titles
Key headlines and terms for quick recall
  • Data preprocessing — preparing raw data for analysis
  • Five major steps: Cleaning → Integration → Transformation → Reduction → Discretization
  • Goal: clean, consistent, model-ready data
  • Time cost: commonly estimated at 60–80% of project time
  • Garbage in, garbage out
  • Tools: pandas, scikit-learn, dbt, Airflow, Great Expectations
Basic Idea
What it is, why it matters, how it works

Why preprocess?

Raw data is messy, inconsistent and noisy. Preprocessing transforms it into the clean, consistent format a model expects. Without it:

  • Missing values cause many algorithms to fail with errors.
  • Outliers distort gradient updates.
  • Different scales unbalance distance metrics.
  • Skewed distributions break linear assumptions.
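
As a small illustration of the missing-values bullet, the sketch below fits scikit-learn's LogisticRegression on a toy array that contains one NaN. The data and variable names are made up for illustration; the only point is that this estimator rejects missing values rather than handling them silently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature matrix (hypothetical values) with one missing entry.
X = np.array([[1.0, 20.0], [2.0, np.nan], [3.0, 35.0], [4.0, 50.0]])
y = np.array([0, 0, 1, 1])

try:
    LogisticRegression().fit(X, y)
    failed = False
except ValueError as err:
    # scikit-learn validates its input and refuses NaN for this estimator.
    failed = True
    message = str(err)

print("fit failed:", failed)
```

Some estimators do accept NaN natively, so the failure shown here is typical rather than universal.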

The five steps

1. Data Cleaning. Remove noise and fix errors:

  • Missing values — drop, impute (mean/median/mode), model.
  • Outliers — detect (z-score, IQR), then cap/remove/transform.
  • Duplicates — exact and fuzzy.
  • Inconsistencies — unit conversions, format normalisation.
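
The cleaning tasks above can be sketched with pandas on a tiny made-up table. The column names and values are hypothetical, and median imputation plus IQR capping is just one reasonable combination among those listed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one missing income, one extreme income, one duplicate row.
df = pd.DataFrame({
    "customer": ["a", "b", "c", "d", "e", "e"],
    "income": [30.0, 32.0, np.nan, 35.0, 400.0, 400.0],
})

# Duplicates: drop exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# Missing values: impute with the median, which is robust to the outlier.
df["income"] = df["income"].fillna(df["income"].median())

# Outliers: flag with the IQR rule, then cap (winsorise) to the fences.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["is_outlier"] = ~df["income"].between(lower, upper)
df["income_capped"] = df["income"].clip(lower, upper)

print(df)
```

Keeping the flag column alongside the capped value preserves the information that a row was extreme, which can itself be a useful feature.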

2. Data Integration. Merge multiple data sources into a unified view:

  • Resolve schema conflicts (different column names for the same thing).
  • Reconcile entities (the same customer in two systems).
  • Handle redundant or contradictory records.
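
Entity reconciliation can be sketched with the standard library alone. The example below uses difflib's similarity ratio as a stand-in for a proper fuzzy-matching library; the names, the helper functions and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical customer names from two systems.
crm_names = ["John Smith", "Alice Wong"]
billing_names = ["Jon Smith", "Alice  Wong", "Bob Lee"]


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so formatting differences do not count."""
    return " ".join(name.lower().split())


def best_match(name, candidates, threshold=0.85):
    """Return the most similar candidate, or None if nothing clears the threshold."""
    scored = [
        (SequenceMatcher(None, normalise(name), normalise(c)).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored)
    return candidate if score >= threshold else None


# Map each billing name to its likely CRM counterpart (None means no match).
matches = {name: best_match(name, crm_names) for name in billing_names}
print(matches)
```

In practice a threshold like this needs tuning against manually checked pairs, since too low a value merges distinct customers and too high a value misses typos.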

3. Data Transformation. Reshape data for the model:

  • Scaling: standardisation $(x - \mu)/\sigma$ or min-max.
  • Encoding categoricals — one-hot, ordinal, target encoding.
  • Skew correction — log, square-root, Box–Cox.
  • Feature engineering — derive new features.
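
The two scaling formulas can be written out by hand and cross-checked against scikit-learn. This is a minimal sketch on a made-up column; note that StandardScaler uses the population standard deviation, which matches NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])  # hypothetical feature column

# Standardisation: z = (x - mu) / sigma, with sigma the population standard deviation.
z = (x - x.mean()) / x.std()

# Min-max: rescale to the [0, 1] range.
m = (x - x.min()) / (x.max() - x.min())

# Cross-check the hand-written formulas against scikit-learn's scalers.
assert np.allclose(z, StandardScaler().fit_transform(x))
assert np.allclose(m, MinMaxScaler().fit_transform(x))

print(z.ravel(), m.ravel())
```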

4. Data Reduction. Reduce volume while preserving information:

  • Feature selection — drop irrelevant features.
  • Dimensionality reduction — PCA, t-SNE.
  • Sampling — work with a subset of huge data.

5. Data Discretization. Bin continuous variables into discrete intervals:

  • Equal-width, equal-frequency, model-based.
  • Useful for trees, fairness audits, business reporting.
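
The difference between equal-width and equal-frequency binning shows up clearly on skewed data. The sketch below uses pandas cut and qcut on a made-up series with one large value; the choice of four bins is arbitrary.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # hypothetical, one large value

# Equal-width: 4 intervals of the same length across the value range.
equal_width = pd.cut(values, bins=4)

# Equal-frequency: 4 intervals holding roughly the same number of rows.
equal_freq = pd.qcut(values, q=4)

width_counts = equal_width.value_counts().tolist()
freq_counts = equal_freq.value_counts().tolist()

print(equal_width.value_counts())
print(equal_freq.value_counts())
```

With equal-width bins, nine of the ten values land in the first interval because the single large value stretches the range; equal-frequency bins spread the rows far more evenly.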

Pipeline pattern

Modern practice wraps all steps and the model in a single pipeline (sklearn Pipeline, TFX) so that:

  • Same transforms apply at training and inference.
  • Transform statistics are fitted on training data only, so nothing leaks from test to train.
  • Easy to version, monitor and reproduce.

Output

A clean, well-shaped dataset — often saved to a feature store so multiple models reuse the same features.

Tools

Need                      Tool
Tabular cleaning          pandas, NumPy, sklearn
Interactive cleaning      OpenRefine
Pipeline orchestration    Airflow, Prefect, Dagster
In-warehouse transforms   dbt
Quality testing           Great Expectations, Soda
Feature stores            Feast, Tecton
Mind Map
Visual structure of the concept
DATA PREPROCESSING — overview
├── Why? Garbage in, garbage out
├── Five steps
│   ├── 1 Cleaning (missing, outliers, dup, format)
│   ├── 2 Integration (merge sources)
│   ├── 3 Transformation (scale, encode, FE)
│   ├── 4 Reduction (select, PCA)
│   └── 5 Discretization (bin continuous)
├── Pipeline pattern
│   ├── Same transforms train ↔ infer
│   └── No leakage; reproducible
└── Tools
    ├── pandas, sklearn
    ├── Airflow, dbt
    └── Great Expectations
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. What are the five main steps of data preprocessing? Cleaning, Integration, Transformation, Reduction, Discretization.

Q2. What does "garbage in, garbage out" mean in data science? It means that the quality of a model's output is bounded by the quality of its input data. Noisy, biased or incomplete data produce unreliable models — no algorithm can rescue dirty data.

Q3. What is a preprocessing pipeline and why is it useful? A pipeline chains all preprocessing steps and the model into one reproducible artifact. It guarantees that the same transformations apply at training and inference, prevents data leakage, and makes the workflow versionable, testable and deployable.


Part B (20 marks)

Q. Explain the data-preprocessing process in detail. Discuss each step with examples and tools.

Why preprocessing matters. Raw data is rarely model-ready. Preprocessing fixes quality issues and reshapes features for the chosen algorithm. It is commonly estimated to take 60–80% of project time, and it often affects model success more than the choice of algorithm. Garbage in → garbage out.

Step 1 — Data Cleaning.

Tasks.

  • Handle missing values: drop rows/columns; impute (mean, median, mode, KNN, MICE); flag with a missingness indicator.
  • Detect outliers: z-score, IQR rule, isolation forest; cap (winsorise), remove, or transform.
  • Remove duplicates — exact and fuzzy (Levenshtein on names).
  • Fix inconsistencies — standardise dates, units, capitalisation.

Example. In a customer table, dates appear as "2024-11-15", "11/15/24", "Nov 15 2024" — normalise to ISO 8601.

Tools. pandas (dropna, fillna), sklearn (SimpleImputer, KNNImputer), OpenRefine.
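
The date example can be sketched with the standard library. The helper name to_iso8601 and the list of formats are assumptions for illustration; the sketch also assumes month-first order for "11/15/24" and an English locale for month abbreviations.

```python
from datetime import datetime

# Formats taken from the example above; a real table may need more.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%y", "%b %d %Y"]


def to_iso8601(raw: str) -> str:
    """Try each known format in turn and return the date as YYYY-MM-DD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    # Fail loudly instead of guessing when no format fits.
    raise ValueError(f"unrecognised date format: {raw!r}")


raw_dates = ["2024-11-15", "11/15/24", "Nov 15 2024"]
clean_dates = [to_iso8601(d) for d in raw_dates]
print(clean_dates)
```

Raising on unknown formats is a deliberate choice here: a silently mis-parsed date is harder to spot later than an explicit error during cleaning.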

Step 2 — Data Integration.

Tasks.

  • Merge multiple sources (CRM + billing + support logs).
  • Resolve schema conflicts ("cust_id" vs "CustomerID").
  • Reconcile duplicate entities across sources.

Example. Joining MySQL transactions with MongoDB user profiles into one Snowflake table.

Tools. SQL JOINs, pandas merge, dbt, Spark.
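
A schema conflict like "cust_id" vs "CustomerID" can be resolved before the join. The sketch below uses two small made-up pandas tables; the outer join with indicator=True is one way to surface rows that exist in only one source.

```python
import pandas as pd

# Hypothetical extracts from two systems that name the key differently.
crm = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ann", "Ben", "Cy"]})
billing = pd.DataFrame({"CustomerID": [2, 3, 4], "amount": [50.0, 20.0, 75.0]})

# Resolve the schema conflict by renaming to one agreed key.
billing = billing.rename(columns={"CustomerID": "cust_id"})

# Outer join keeps unmatched rows; indicator=True records where each row came from.
merged = crm.merge(billing, on="cust_id", how="outer", indicator=True)

# Rows present in only one system are candidates for manual reconciliation.
unmatched = merged[merged["_merge"] != "both"]
print(merged)
```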

Step 3 — Data Transformation.

Tasks.

  • Scaling: standardisation $z = (x - \mu)/\sigma$, min-max, robust. Required for k-NN, SVM, neural nets, gradient methods.
  • Categorical encoding: one-hot, ordinal, target/mean, hashing.
  • Skew correction: log, square-root, Box–Cox.
  • Feature engineering: derive new informative features (e.g., days_since_last_login).

Example. Encoding "city" as one-hot for a churn model; log-transforming "income" because it's right-skewed.

Tools. sklearn (StandardScaler, OneHotEncoder), pandas get_dummies, Featuretools.
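
The churn example can be sketched with a ColumnTransformer that treats each column differently. The table and its values are made up; log1p followed by standardisation is one plausible treatment for a right-skewed income column, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical churn features.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Kochi"],
    "income": [20_000.0, 35_000.0, 50_000.0, 900_000.0],
})

preprocess = ColumnTransformer([
    # One-hot encode city; ignore cities unseen during fitting instead of erroring.
    ("city", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    # log1p compresses the right tail of income, then standardise.
    ("income", Pipeline([
        ("log", FunctionTransformer(np.log1p)),
        ("scale", StandardScaler()),
    ]), ["income"]),
])

features = preprocess.fit_transform(df)
print(features.shape)  # three city indicator columns plus one income column
```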

Step 4 — Data Reduction.

Tasks.

  • Feature selection: variance threshold, chi-square, mutual information, L1 (Lasso), Recursive Feature Elimination.
  • Dimensionality reduction: PCA, LDA, t-SNE, UMAP, autoencoders.
  • Sampling when data is huge.

Example. On a 7000-gene microarray, PCA might reduce the data to 50 components that capture 95% of the variance, which speeds up modelling and reduces overfitting.

Tools. sklearn (PCA, SelectKBest, Lasso), Boruta.
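
Choosing the number of components by variance share can be sketched as follows. The data is synthetic (100 noisy features built from 5 hidden factors) and stands in for a real high-dimensional table; passing a float to n_components asks scikit-learn's PCA to keep enough components to reach that share of variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for high-dimensional data: 200 samples, 100 features
# that are noisy mixtures of only 5 underlying factors.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 100))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 100))

X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

# A float n_components keeps the fewest components reaching that variance share.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape, round(explained, 3))
```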

Step 5 — Data Discretization.

Tasks.

  • Bin continuous values into intervals: equal-width, equal-frequency, model-based (decision-tree bins).

Example. Age binned into non-overlapping bands <25, 25–39, 40–59, ≥60 for a fairness audit and for a decision-tree split.

Tools. pandas cut, qcut; sklearn KBinsDiscretizer.
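
One way to make age bands unambiguous at their boundaries is to pass explicit edges to pandas cut. The ages, edges and labels below are illustrative assumptions.

```python
import pandas as pd

ages = pd.Series([18, 25, 39, 40, 59, 60, 83])  # hypothetical ages

# Explicit edges; right=False makes each bin include its left edge only,
# so 25 falls in "25-39" and 40 in "40-59" with no overlap.
edges = [0, 25, 40, 60, float("inf")]
labels = ["<25", "25-39", "40-59", "60+"]
age_band = pd.cut(ages, bins=edges, labels=labels, right=False)

print(age_band.tolist())
```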

Pipeline pattern (production best practice).

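from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X_train, y_train and X_new are assumed to be defined by the caller.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill gaps with the training median
    ("scale", StandardScaler()),                   # standardise with training mean and std
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)  # fits imputer, scaler and model on training data only
pipe.predict(X_new)  # same transforms at inference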

A single pipeline:

  • Guarantees train/inference parity and, when fitted on training data only, avoids leakage.
  • Is versionable in MLflow / DVC.
  • Is testable end-to-end.
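
End-to-end testing of such a pipeline can be sketched with cross-validation. The data below is synthetic with some values knocked out at random; it stands in for a real table and is not from the course material.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with about 5% of values removed, standing in for a real table.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# Each fold refits the imputer and scaler on that fold's training part only,
# so held-out rows never influence the preprocessing statistics.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Passing the whole pipeline to cross_val_score, rather than pre-scaling X once, is what keeps the held-out folds out of the imputation and scaling statistics.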

Result. A clean, well-shaped dataset that feeds a robust, generalisable model.