PGDDSA Study · Semester 1

PGD01C02

Module 2 · Data Collection and Pre-Processing

Data Integration and Transformation

Core Titles

Key headlines and terms for quick recall

Data integration — combine multiple sources into one view
Schema integration — resolve naming, type, unit conflicts
Entity resolution — identify the same entity across sources
Redundancy detection — correlation, $\chi^2$ test
Data transformation — convert into model-ready form
Normalisation / Standardisation — scale to common range
Encoding categoricals — one-hot, ordinal, target
Aggregation — daily → monthly summaries
Feature engineering — derive new informative features

Basic Idea

What it is, why it matters, how it works

Data Integration

Real organisations have data scattered across many systems — CRM, billing, web logs, IoT, third-party APIs. Data integration unifies them into a consistent view that downstream models can use.

Challenges.

Schema integration / matching. Different sources use different names / types for the same concept.
- "cust_id" vs "CustomerID".
- Phone stored as int vs string.
- "DOB" in one as date, in another as string "1990-01-15".
Entity resolution. The same real-world entity appears differently across sources — "Mohammed Ali" / "Mohamed Ali" / "M. Ali". Use fuzzy matching, deterministic rules, or probabilistic record linkage.
Value conflicts. Two sources disagree on the same fact — which one wins? Strategies: most recent, majority vote, weighted by source trust.
Redundancy detection. When merged, features may be redundant (correlated). Drop duplicates via correlation matrix, $\chi^2$ test for categoricals.
Granularity mismatch. One source has hourly data, another daily — aggregate or interpolate.

Approaches.

ETL — Extract, Transform, Load. Transform in an external engine before loading.
ELT — Extract, Load, Transform. Load raw into the warehouse, transform with SQL/dbt.
Data virtualisation — unified view without physically moving data (Denodo, Trino).
Data lake / lakehouse — store raw + curated in one place (Delta Lake, Iceberg).

Data Transformation

After integration, transform data to suit the chosen model.

1. Scaling.

Standardisation (z-score): $x' = (x - \mu) / \sigma$ — assumes / produces mean 0, std 1.
Min-max: $x' = (x - \min)/(\max - \min)$ — maps to $[0, 1]$ .
Robust scaling: $(x - \text{median})/IQR$ — resistant to outliers.
Log transformation: $x' = \ln(x + 1)$ — corrects right-skew.

Required for k-NN, SVM, k-means, gradient methods, neural nets — distance and gradient calculations are unfair when features have wildly different scales.

2. Categorical encoding.

One-hot — $K$ categories → $K$ binary columns. Simple but causes feature blow-up at high cardinality.
Ordinal — assign integers when there's a natural order (small / medium / large).
Target / mean encoding — replace category with target mean for that group. Powerful but risks leakage; use cross-fold encoding.
Hashing — bucket categories into fixed-size hash; for very high cardinality.

3. Date / time feature extraction.

Day of week, hour of day, month, is_weekend, is_holiday.
Time since last event.
Cyclical encoding $\sin(2\pi h / 24), \cos(2\pi h / 24)$ for hours.

4. Text feature extraction.

Bag-of-Words, TF-IDF, word embeddings (Word2Vec, BERT).

5. Aggregation.

Customer-level: total spend per month, avg session length, count of distinct categories.

6. Feature engineering. Derive new informative features:

Ratio of features ( $\text{debt}/\text{income}$ in credit modelling).
Polynomial / interaction terms.
Domain-specific (RFM in retail: Recency, Frequency, Monetary).

Good feature engineering often beats a fancier algorithm.

Mind Map

Visual structure of the concept

INTEGRATION & TRANSFORMATION
├── INTEGRATION (combine sources)
│   ├── Schema matching (column names, types)
│   ├── Entity resolution (fuzzy match)
│   ├── Value conflicts (which wins?)
│   ├── Redundancy detection (corr, χ²)
│   └── Approaches
│       ├── ETL  (transform before load)
│       ├── ELT  (load raw, transform in warehouse)
│       ├── Data virtualisation
│       └── Data lake / lakehouse
└── TRANSFORMATION (reshape for model)
    ├── Scaling
    │   ├── Standardisation z = (x − μ)/σ
    │   ├── Min-max
    │   ├── Robust
    │   └── Log
    ├── Categorical encoding
    │   ├── One-hot, Ordinal
    │   ├── Target encoding (careful: leakage)
    │   └── Hashing
    ├── Date features (DOW, hour, is_weekend)
    ├── Text features (TF-IDF, embeddings)
    ├── Aggregation (summarise by entity)
    └── Feature engineering (ratios, FE)

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. What is one reason for using data integration? To produce a unified, consistent view of data from multiple heterogeneous sources — removing redundancy, resolving schema/naming conflicts, and enabling holistic analytics that no single source could provide.

Q2. Differentiate standardisation and min-max scaling.

Standardisation: $x' = (x - \mu)/\sigma$ — zero mean, unit variance; suitable when data is approximately Gaussian.
Min-max: $x' = (x - \min)/(\max - \min)$ — maps to $[0, 1]$ ; sensitive to outliers.

Q3. Why use one-hot encoding for categorical variables? Because most ML models expect numeric inputs and treating arbitrary integer codes as ordered would imply false ordinality. One-hot creates an independent binary column per category, removing that bias — at the cost of feature dimensionality.

Part B (20 marks)

Q. Discuss data integration and transformation in detail. Explain the challenges in integration and the common transformation techniques used in machine learning.

Data Integration.

Goal. Merge multiple data sources into a single consistent view.

Challenges.

Challenge	Description	Solution
Schema mismatch	"cust_id" vs "CustomerID"	Mapping table, dbt models
Type / unit conflict	kg vs lbs; int vs string	Type coercion, unit normalisation
Entity resolution	"Mohammed Ali" vs "Mohamed Ali"	Fuzzy matching, record linkage
Value conflict	Two sources disagree	Last-write-wins, source trust
Redundancy	Same info from two sources	Correlation, $\chi^2$ , drop
Granularity mismatch	hourly vs daily	Aggregate or interpolate

Approaches.

ETL — transform before load. Best when target storage is limited / regulated data must be cleansed first.
ELT — load raw into a cloud warehouse, transform with SQL/dbt. Modern default with Snowflake/BigQuery.
Data virtualisation — query across sources without moving data.
Data lake / lakehouse — store raw + curated together (Delta Lake, Iceberg).

Data Transformation.

After integration, reshape the data so the model can consume it.

1. Scaling.

Standardisation $z = (x - \mu)/\sigma$ — required for k-NN, SVM, neural nets.
Min-max $(x - \min)/(\max - \min) \in [0,1]$ — useful for sigmoid-activated networks.
Robust scaling $(x - \text{median})/IQR$ — robust to outliers.
Log $\ln(x + 1)$ — corrects right-skew (incomes, counts).

Example. For a churn model with annual income (₹) and tenure (months), standardise both so k-means clusters aren't dominated by income.

2. Encoding categorical variables.

One-hot — $K$ categories become $K$ binary columns. Used for nominal categoricals.
Ordinal — integer codes when categories have natural order (low, medium, high).
Target / mean encoding — replace category with target mean. Powerful but risks leakage; use cross-fold encoding.
Hashing — for high cardinality (millions of zip codes).

Example. "city" → 100 one-hot columns; "education" → ordinal codes 1–4.

3. Date / time feature extraction.

Day-of-week, hour, month, is_weekend, holiday flag.
Time since last event.
Cyclical encoding $\sin(2\pi h/24), \cos(2\pi h/24)$ to preserve cyclical proximity (23 ≈ 1).

4. Text feature extraction.

Bag-of-Words, TF-IDF, n-grams, embeddings (Word2Vec, BERT).

5. Aggregation.

Customer-level RFM (Recency, Frequency, Monetary) in retail.
Patient-level summary statistics from time-series vitals.

6. Feature engineering. Domain-driven derived features:

$\text{debt}/\text{income}$ ratio for credit scoring.
Polynomial / interaction terms.
Lag features for time series.

Often a single well-engineered feature beats switching from logistic regression to XGBoost. Integration + transformation deserve careful attention because they shape the upper bound of model performance.

Data Cleaning Data Reduction