PGDDSA Study · Semester 1

PGD01C02

Module 2 · Data Collection and Pre-Processing

Data Cleaning

Core Titles

Key headlines and terms for quick recall

Goal: remove noise, fix errors, prepare data for modelling
Missing values: deletion vs imputation
Imputation: mean / median / mode / KNN / MICE / model-based
Outliers: z-score, IQR rule, isolation forest, DBSCAN
Outlier treatment: cap / remove / transform
Duplicates: exact and fuzzy (Levenshtein)
Noise: smoothing, binning
Format consistency: date / unit / capitalisation

Basic Idea

What it is, why it matters, how it works

Why clean?

Raw data has missing entries, outliers, duplicates and inconsistent formats. Each of these corrupts model training:

Missing values cause crashes or biased estimates.
Outliers pull regression lines and inflate variance.
Duplicates double-count and break independence assumptions.
Inconsistent formats cause joins to fail and features to misalign.

Handling missing values

Why missing?

MCAR — Missing Completely At Random (rare; treat as noise).
MAR — Missing At Random (missingness depends on observed features; impute).
MNAR — Missing Not At Random (missingness depends on the missing value itself; informative — add an indicator).

Strategies.

Deletion — drop rows (listwise) or columns. Quick but lossy and can bias.
Mean / median / mode imputation — simple, but understates variance.
KNN imputation — use nearest neighbours' values; preserves relationships.
MICE (Multiple Imputation by Chained Equations) — model each missing column from the others; produces multiple imputations.
Model-based — regression / random-forest imputation.
Missingness indicator — add a binary flag — captures MNAR signal.

Handling outliers

Detection.

Z-score — $|z_i| > 3$ flag.
IQR rule — outside $[Q1 - 1.5 \cdot IQR, \; Q3 + 1.5 \cdot IQR]$ .
Isolation Forest — randomly partitions data; outliers isolate quickly.
DBSCAN — points not in any cluster are outliers.
Mahalanobis distance for multivariate outliers.

Treatment.

Cap (winsorise) — clip to 1st/99th percentile.
Remove — when clearly an error.
Transform — log / square-root reduces leverage.
Robust algorithms — use median-based or tree-based methods (which are less sensitive).
Leave alone if outliers are real and important (fraud, rare-disease).

Duplicates

Exact — pandas drop_duplicates on full row or key columns.
Fuzzy — names with typos, e.g., "Mohammed Ali" vs "Mohamed Ali". Tools: recordlinkage, dedupe library, Levenshtein distance, Jaccard similarity.

Noise

Smoothing — moving average, exponential smoothing on time series.
Binning — group nearby values into buckets.

Format consistency

Standardise dates to ISO 8601.
Normalise units (kg vs lbs).
Lowercase + trim strings.
Encoding: UTF-8 everywhere; handle BOM/locale.

Validate after cleaning

Use Great Expectations or Soda to assert:

No missing in critical columns.
Values in expected ranges.
Uniqueness on primary keys.
Foreign-key integrity across tables.

Treat data quality as a production guarantee, not a one-off fix.

Mind Map

Visual structure of the concept

DATA CLEANING
├── Missing values
│   ├── Why? MCAR / MAR / MNAR
│   └── Fix
│       ├── Delete (listwise / pairwise)
│       ├── Impute (mean, median, mode)
│       ├── KNN / MICE
│       └── Add missingness indicator
├── Outliers
│   ├── Detect
│   │   ├── Z-score |z| > 3
│   │   ├── IQR rule
│   │   ├── Isolation Forest
│   │   └── DBSCAN
│   └── Treat
│       ├── Cap (winsorise)
│       ├── Remove
│       └── Transform (log / √)
├── Duplicates
│   ├── Exact
│   └── Fuzzy (Levenshtein, Jaccard)
├── Noise — smoothing, binning
└── Format consistency (date, unit, case)
    └── Validate via Great Expectations

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Mention two methods of handling missing data.

Deletion — drop rows or columns with missing entries. Quick but lossy and may bias.
Imputation — fill with mean/median/mode or model-based (KNN, MICE, regression imputation).

Q2. What is the IQR rule for outlier detection? A data point is an outlier if it lies outside $[Q1 - 1.5 \cdot IQR, \; Q3 + 1.5 \cdot IQR]$ , where $IQR = Q3 - Q1$ is the interquartile range.

Q3. Differentiate MCAR, MAR and MNAR.

MCAR (Missing Completely At Random) — missingness is independent of everything.
MAR (Missing At Random) — missingness depends on observed variables.
MNAR (Missing Not At Random) — missingness depends on the missing value itself — informative.

Part B (20 marks)

Q. Describe the data-cleaning process in detail. Discuss how to handle missing values, outliers and duplicates with appropriate techniques.

Goal of data cleaning. Detect and fix errors, missing data, outliers and inconsistencies so the model trains on accurate, trustworthy data.

1. Handling missing values.

First, understand the mechanism.

MCAR — independent of all variables; rare; can drop safely.
MAR — depends on observed features; impute conditional on them.
MNAR — depends on the missing value itself; informative — encode a flag.

Strategies.

Method	Pros	Cons
Listwise deletion (drop rows)	Simple	Lossy; biased if not MCAR
Pairwise deletion	Uses available data per analysis	Inconsistent sample size
Mean / median / mode imputation	Fast	Understates variance
KNN imputation	Preserves local structure	Slow on big data
MICE	Multiple imputations capture uncertainty	Computationally heavier
Model-based (regression / RF)	Accurate	Complex
Indicator variable	Captures informative missingness	Adds features

Example. In a patient dataset, impute missing systolic BP with the median by age band and add a bp_missing flag.

2. Handling outliers.

Detection.

Z-score — flag $|z_i| = |x_i - \mu|/\sigma > 3$ .
IQR rule — outside $[Q1 - 1.5 \cdot IQR, \; Q3 + 1.5 \cdot IQR]$ .
Isolation Forest — anomalies are easier to isolate by random splits.
DBSCAN — points not assigned to any cluster.
Mahalanobis distance — multivariate; accounts for correlation.

Treatment.

Cap (winsorise) — clip to 1st / 99th percentile.
Remove when clearly an error.
Transform — log / Box–Cox reduces leverage.
Use robust algorithms — median-based or tree-based methods.
Leave alone if the outlier is meaningful (fraud, rare disease, system anomaly).

Example. A 200-year age entry is clearly a data-entry error → remove. A ₹10 million transaction in a fraud dataset should be kept — it's the signal we care about.

3. Handling duplicates.

Exact duplicates — df.drop_duplicates() after deciding on the key columns.

Fuzzy duplicates — "John Smith" vs "Jon Smith" or "Mohammed Ali" vs "Mohamed Ali". Detected with:

Levenshtein / edit distance for short strings.
Jaccard / Sørensen–Dice for token sets.
Soundex / Metaphone for phonetic similarity.
Specialised tools — Python recordlinkage, dedupe.

Example. In a CRM, fuzzy-merge customer records by name + birthdate similarity.

4. Format consistency.

Standardise date formats to ISO 8601.
Normalise units (kg vs lbs vs grams).
Trim and case-normalise strings ("New York " vs "new york" vs "NEW YORK").
Fix encoding issues (UTF-8, BOM).

5. Noise smoothing. For sensor / time-series data, apply moving average, exponential smoothing, or wavelet denoising.

6. Validation step. After cleaning, assert quality with Great Expectations or Soda:

No missing in critical columns.
Values in expected ranges.
Primary-key uniqueness.

Treat data quality as a continuous guarantee — not a one-off fix.

Data Pre-Processing Overview Data Integration and Transformation