Data Cleaning
Core Titles
Key headlines and terms for quick recall- Goal: remove noise, fix errors, prepare data for modelling
- Missing values: deletion vs imputation
- Imputation: mean / median / mode / KNN / MICE / model-based
- Outliers: z-score, IQR rule, isolation forest, DBSCAN
- Outlier treatment: cap / remove / transform
- Duplicates: exact and fuzzy (Levenshtein)
- Noise: smoothing, binning
- Format consistency: date / unit / capitalisation
Basic Idea
What it is, why it matters, how it worksWhy clean?
Raw data has missing entries, outliers, duplicates and inconsistent formats. Each of these corrupts model training:
- Missing values cause crashes or biased estimates.
- Outliers pull regression lines and inflate variance.
- Duplicates double-count and break independence assumptions.
- Inconsistent formats cause joins to fail and features to misalign.
Handling missing values
Why missing?
- MCAR — Missing Completely At Random (rare; treat as noise).
- MAR — Missing At Random (missingness depends on observed features; impute).
- MNAR — Missing Not At Random (missingness depends on the missing value itself; informative — add an indicator).
Strategies.
- Deletion — drop rows (listwise) or columns. Quick but lossy and can bias.
- Mean / median / mode imputation — simple, but understates variance.
- KNN imputation — use nearest neighbours' values; preserves relationships.
- MICE (Multiple Imputation by Chained Equations) — model each missing column from the others; produces multiple imputations.
- Model-based — regression / random-forest imputation.
- Missingness indicator — add a binary flag — captures MNAR signal.
Handling outliers
Detection.
- Z-score — flag.
- IQR rule — outside .
- Isolation Forest — randomly partitions data; outliers isolate quickly.
- DBSCAN — points not in any cluster are outliers.
- Mahalanobis distance for multivariate outliers.
Treatment.
- Cap (winsorise) — clip to 1st/99th percentile.
- Remove — when clearly an error.
- Transform — log / square-root reduces leverage.
- Robust algorithms — use median-based or tree-based methods (which are less sensitive).
- Leave alone if outliers are real and important (fraud, rare-disease).
Duplicates
- Exact — pandas
drop_duplicateson full row or key columns. - Fuzzy — names with typos, e.g., "Mohammed Ali" vs "Mohamed Ali". Tools:
recordlinkage,dedupelibrary, Levenshtein distance, Jaccard similarity.
Noise
- Smoothing — moving average, exponential smoothing on time series.
- Binning — group nearby values into buckets.
Format consistency
- Standardise dates to ISO 8601.
- Normalise units (kg vs lbs).
- Lowercase + trim strings.
- Encoding: UTF-8 everywhere; handle BOM/locale.
Validate after cleaning
Use Great Expectations or Soda to assert:
- No missing in critical columns.
- Values in expected ranges.
- Uniqueness on primary keys.
- Foreign-key integrity across tables.
Treat data quality as a production guarantee, not a one-off fix.
Mind Map
Visual structure of the conceptDATA CLEANING
├── Missing values
│ ├── Why? MCAR / MAR / MNAR
│ └── Fix
│ ├── Delete (listwise / pairwise)
│ ├── Impute (mean, median, mode)
│ ├── KNN / MICE
│ └── Add missingness indicator
├── Outliers
│ ├── Detect
│ │ ├── Z-score |z| > 3
│ │ ├── IQR rule
│ │ ├── Isolation Forest
│ │ └── DBSCAN
│ └── Treat
│ ├── Cap (winsorise)
│ ├── Remove
│ └── Transform (log / √)
├── Duplicates
│ ├── Exact
│ └── Fuzzy (Levenshtein, Jaccard)
├── Noise — smoothing, binning
└── Format consistency (date, unit, case)
└── Validate via Great Expectations
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questionsPart A (2 marks each)
Q1. Mention two methods of handling missing data.
- Deletion — drop rows or columns with missing entries. Quick but lossy and may bias.
- Imputation — fill with mean/median/mode or model-based (KNN, MICE, regression imputation).
Q2. What is the IQR rule for outlier detection? A data point is an outlier if it lies outside , where is the interquartile range.
Q3. Differentiate MCAR, MAR and MNAR.
- MCAR (Missing Completely At Random) — missingness is independent of everything.
- MAR (Missing At Random) — missingness depends on observed variables.
- MNAR (Missing Not At Random) — missingness depends on the missing value itself — informative.
Part B (20 marks)
Q. Describe the data-cleaning process in detail. Discuss how to handle missing values, outliers and duplicates with appropriate techniques.
Goal of data cleaning. Detect and fix errors, missing data, outliers and inconsistencies so the model trains on accurate, trustworthy data.
1. Handling missing values.
First, understand the mechanism.
- MCAR — independent of all variables; rare; can drop safely.
- MAR — depends on observed features; impute conditional on them.
- MNAR — depends on the missing value itself; informative — encode a flag.
Strategies.
| Method | Pros | Cons |
|---|---|---|
| Listwise deletion (drop rows) | Simple | Lossy; biased if not MCAR |
| Pairwise deletion | Uses available data per analysis | Inconsistent sample size |
| Mean / median / mode imputation | Fast | Understates variance |
| KNN imputation | Preserves local structure | Slow on big data |
| MICE | Multiple imputations capture uncertainty | Computationally heavier |
| Model-based (regression / RF) | Accurate | Complex |
| Indicator variable | Captures informative missingness | Adds features |
Example. In a patient dataset, impute missing systolic BP with the median by age band and add a bp_missing flag.
2. Handling outliers.
Detection.
- Z-score — flag .
- IQR rule — outside .
- Isolation Forest — anomalies are easier to isolate by random splits.
- DBSCAN — points not assigned to any cluster.
- Mahalanobis distance — multivariate; accounts for correlation.
Treatment.
- Cap (winsorise) — clip to 1st / 99th percentile.
- Remove when clearly an error.
- Transform — log / Box–Cox reduces leverage.
- Use robust algorithms — median-based or tree-based methods.
- Leave alone if the outlier is meaningful (fraud, rare disease, system anomaly).
Example. A 200-year age entry is clearly a data-entry error → remove. A ₹10 million transaction in a fraud dataset should be kept — it's the signal we care about.
3. Handling duplicates.
Exact duplicates — df.drop_duplicates() after deciding on the key columns.
Fuzzy duplicates — "John Smith" vs "Jon Smith" or "Mohammed Ali" vs "Mohamed Ali". Detected with:
- Levenshtein / edit distance for short strings.
- Jaccard / Sørensen–Dice for token sets.
- Soundex / Metaphone for phonetic similarity.
- Specialised tools — Python
recordlinkage,dedupe.
Example. In a CRM, fuzzy-merge customer records by name + birthdate similarity.
4. Format consistency.
- Standardise date formats to ISO 8601.
- Normalise units (kg vs lbs vs grams).
- Trim and case-normalise strings ("New York " vs "new york" vs "NEW YORK").
- Fix encoding issues (UTF-8, BOM).
5. Noise smoothing. For sensor / time-series data, apply moving average, exponential smoothing, or wavelet denoising.
6. Validation step. After cleaning, assert quality with Great Expectations or Soda:
- No missing in critical columns.
- Values in expected ranges.
- Primary-key uniqueness.
Treat data quality as a continuous guarantee — not a one-off fix.