PGD01C02
Module 2 · Data Collection and Pre-Processing

Data Integration and Transformation

Core Titles
Key headlines and terms for quick recall
  • Data integration — combine multiple sources into one view
  • Schema integration — resolve naming, type, unit conflicts
  • Entity resolution — identify the same entity across sources
  • Redundancy detection — correlation, χ2\chi^2 test
  • Data transformation — convert into model-ready form
  • Normalisation / Standardisation — scale to common range
  • Encoding categoricals — one-hot, ordinal, target
  • Aggregation — daily → monthly summaries
  • Feature engineering — derive new informative features
Basic Idea
What it is, why it matters, how it works

Data Integration

Real organisations have data scattered across many systems — CRM, billing, web logs, IoT, third-party APIs. Data integration unifies them into a consistent view that downstream models can use.

Challenges.

  1. Schema integration / matching. Different sources use different names / types for the same concept.

    • "cust_id" vs "CustomerID".
    • Phone stored as int vs string.
    • "DOB" in one as date, in another as string "1990-01-15".
  2. Entity resolution. The same real-world entity appears differently across sources — "Mohammed Ali" / "Mohamed Ali" / "M. Ali". Use fuzzy matching, deterministic rules, or probabilistic record linkage.

  3. Value conflicts. Two sources disagree on the same fact — which one wins? Strategies: most recent, majority vote, weighted by source trust.

  4. Redundancy detection. When merged, features may be redundant (correlated). Drop duplicates via correlation matrix, χ2\chi^2 test for categoricals.

  5. Granularity mismatch. One source has hourly data, another daily — aggregate or interpolate.

Approaches.

  • ETL — Extract, Transform, Load. Transform in an external engine before loading.
  • ELT — Extract, Load, Transform. Load raw into the warehouse, transform with SQL/dbt.
  • Data virtualisation — unified view without physically moving data (Denodo, Trino).
  • Data lake / lakehouse — store raw + curated in one place (Delta Lake, Iceberg).

Data Transformation

After integration, transform data to suit the chosen model.

1. Scaling.

  • Standardisation (z-score): x=(xμ)/σx' = (x - \mu) / \sigma — assumes / produces mean 0, std 1.
  • Min-max: x=(xmin)/(maxmin)x' = (x - \min)/(\max - \min) — maps to [0,1][0, 1].
  • Robust scaling: (xmedian)/IQR(x - \text{median})/IQR — resistant to outliers.
  • Log transformation: x=ln(x+1)x' = \ln(x + 1) — corrects right-skew.

Required for k-NN, SVM, k-means, gradient methods, neural nets — distance and gradient calculations are unfair when features have wildly different scales.

2. Categorical encoding.

  • One-hotKK categories → KK binary columns. Simple but causes feature blow-up at high cardinality.
  • Ordinal — assign integers when there's a natural order (small / medium / large).
  • Target / mean encoding — replace category with target mean for that group. Powerful but risks leakage; use cross-fold encoding.
  • Hashing — bucket categories into fixed-size hash; for very high cardinality.

3. Date / time feature extraction.

  • Day of week, hour of day, month, is_weekend, is_holiday.
  • Time since last event.
  • Cyclical encoding sin(2πh/24),cos(2πh/24)\sin(2\pi h / 24), \cos(2\pi h / 24) for hours.

4. Text feature extraction.

  • Bag-of-Words, TF-IDF, word embeddings (Word2Vec, BERT).

5. Aggregation.

  • Customer-level: total spend per month, avg session length, count of distinct categories.

6. Feature engineering. Derive new informative features:

  • Ratio of features (debt/income\text{debt}/\text{income} in credit modelling).
  • Polynomial / interaction terms.
  • Domain-specific (RFM in retail: Recency, Frequency, Monetary).

Good feature engineering often beats a fancier algorithm.

Mind Map
Visual structure of the concept
INTEGRATION & TRANSFORMATION
├── INTEGRATION (combine sources)
│   ├── Schema matching (column names, types)
│   ├── Entity resolution (fuzzy match)
│   ├── Value conflicts (which wins?)
│   ├── Redundancy detection (corr, χ²)
│   └── Approaches
│       ├── ETL  (transform before load)
│       ├── ELT  (load raw, transform in warehouse)
│       ├── Data virtualisation
│       └── Data lake / lakehouse
└── TRANSFORMATION (reshape for model)
    ├── Scaling
    │   ├── Standardisation z = (x − μ)/σ
    │   ├── Min-max
    │   ├── Robust
    │   └── Log
    ├── Categorical encoding
    │   ├── One-hot, Ordinal
    │   ├── Target encoding (careful: leakage)
    │   └── Hashing
    ├── Date features (DOW, hour, is_weekend)
    ├── Text features (TF-IDF, embeddings)
    ├── Aggregation (summarise by entity)
    └── Feature engineering (ratios, FE)
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. What is one reason for using data integration? To produce a unified, consistent view of data from multiple heterogeneous sources — removing redundancy, resolving schema/naming conflicts, and enabling holistic analytics that no single source could provide.

Q2. Differentiate standardisation and min-max scaling.

  • Standardisation: x=(xμ)/σx' = (x - \mu)/\sigma — zero mean, unit variance; suitable when data is approximately Gaussian.
  • Min-max: x=(xmin)/(maxmin)x' = (x - \min)/(\max - \min) — maps to [0,1][0, 1]; sensitive to outliers.

Q3. Why use one-hot encoding for categorical variables? Because most ML models expect numeric inputs and treating arbitrary integer codes as ordered would imply false ordinality. One-hot creates an independent binary column per category, removing that bias — at the cost of feature dimensionality.


Part B (20 marks)

Q. Discuss data integration and transformation in detail. Explain the challenges in integration and the common transformation techniques used in machine learning.

Data Integration.

Goal. Merge multiple data sources into a single consistent view.

Challenges.

ChallengeDescriptionSolution
Schema mismatch"cust_id" vs "CustomerID"Mapping table, dbt models
Type / unit conflictkg vs lbs; int vs stringType coercion, unit normalisation
Entity resolution"Mohammed Ali" vs "Mohamed Ali"Fuzzy matching, record linkage
Value conflictTwo sources disagreeLast-write-wins, source trust
RedundancySame info from two sourcesCorrelation, χ2\chi^2, drop
Granularity mismatchhourly vs dailyAggregate or interpolate

Approaches.

  • ETL — transform before load. Best when target storage is limited / regulated data must be cleansed first.
  • ELT — load raw into a cloud warehouse, transform with SQL/dbt. Modern default with Snowflake/BigQuery.
  • Data virtualisation — query across sources without moving data.
  • Data lake / lakehouse — store raw + curated together (Delta Lake, Iceberg).

Data Transformation.

After integration, reshape the data so the model can consume it.

1. Scaling.

  • Standardisation z=(xμ)/σz = (x - \mu)/\sigma — required for k-NN, SVM, neural nets.
  • Min-max (xmin)/(maxmin)[0,1](x - \min)/(\max - \min) \in [0,1] — useful for sigmoid-activated networks.
  • Robust scaling (xmedian)/IQR(x - \text{median})/IQR — robust to outliers.
  • Log ln(x+1)\ln(x + 1) — corrects right-skew (incomes, counts).

Example. For a churn model with annual income (₹) and tenure (months), standardise both so k-means clusters aren't dominated by income.

2. Encoding categorical variables.

  • One-hotKK categories become KK binary columns. Used for nominal categoricals.
  • Ordinal — integer codes when categories have natural order (low, medium, high).
  • Target / mean encoding — replace category with target mean. Powerful but risks leakage; use cross-fold encoding.
  • Hashing — for high cardinality (millions of zip codes).

Example. "city" → 100 one-hot columns; "education" → ordinal codes 1–4.

3. Date / time feature extraction.

  • Day-of-week, hour, month, is_weekend, holiday flag.
  • Time since last event.
  • Cyclical encoding sin(2πh/24),cos(2πh/24)\sin(2\pi h/24), \cos(2\pi h/24) to preserve cyclical proximity (23 ≈ 1).

4. Text feature extraction.

  • Bag-of-Words, TF-IDF, n-grams, embeddings (Word2Vec, BERT).

5. Aggregation.

  • Customer-level RFM (Recency, Frequency, Monetary) in retail.
  • Patient-level summary statistics from time-series vitals.

6. Feature engineering. Domain-driven derived features:

  • debt/income\text{debt}/\text{income} ratio for credit scoring.
  • Polynomial / interaction terms.
  • Lag features for time series.

Often a single well-engineered feature beats switching from logistic regression to XGBoost. Integration + transformation deserve careful attention because they shape the upper bound of model performance.