Data Integration and Transformation
Core Titles
Key headlines and terms for quick recall- Data integration — combine multiple sources into one view
- Schema integration — resolve naming, type, unit conflicts
- Entity resolution — identify the same entity across sources
- Redundancy detection — correlation, test
- Data transformation — convert into model-ready form
- Normalisation / Standardisation — scale to common range
- Encoding categoricals — one-hot, ordinal, target
- Aggregation — daily → monthly summaries
- Feature engineering — derive new informative features
Basic Idea
What it is, why it matters, how it worksData Integration
Real organisations have data scattered across many systems — CRM, billing, web logs, IoT, third-party APIs. Data integration unifies them into a consistent view that downstream models can use.
Challenges.
-
Schema integration / matching. Different sources use different names / types for the same concept.
- "cust_id" vs "CustomerID".
- Phone stored as int vs string.
- "DOB" in one as date, in another as string "1990-01-15".
-
Entity resolution. The same real-world entity appears differently across sources — "Mohammed Ali" / "Mohamed Ali" / "M. Ali". Use fuzzy matching, deterministic rules, or probabilistic record linkage.
-
Value conflicts. Two sources disagree on the same fact — which one wins? Strategies: most recent, majority vote, weighted by source trust.
-
Redundancy detection. When merged, features may be redundant (correlated). Drop duplicates via correlation matrix, test for categoricals.
-
Granularity mismatch. One source has hourly data, another daily — aggregate or interpolate.
Approaches.
- ETL — Extract, Transform, Load. Transform in an external engine before loading.
- ELT — Extract, Load, Transform. Load raw into the warehouse, transform with SQL/dbt.
- Data virtualisation — unified view without physically moving data (Denodo, Trino).
- Data lake / lakehouse — store raw + curated in one place (Delta Lake, Iceberg).
Data Transformation
After integration, transform data to suit the chosen model.
1. Scaling.
- Standardisation (z-score): — assumes / produces mean 0, std 1.
- Min-max: — maps to .
- Robust scaling: — resistant to outliers.
- Log transformation: — corrects right-skew.
Required for k-NN, SVM, k-means, gradient methods, neural nets — distance and gradient calculations are unfair when features have wildly different scales.
2. Categorical encoding.
- One-hot — categories → binary columns. Simple but causes feature blow-up at high cardinality.
- Ordinal — assign integers when there's a natural order (small / medium / large).
- Target / mean encoding — replace category with target mean for that group. Powerful but risks leakage; use cross-fold encoding.
- Hashing — bucket categories into fixed-size hash; for very high cardinality.
3. Date / time feature extraction.
- Day of week, hour of day, month, is_weekend, is_holiday.
- Time since last event.
- Cyclical encoding for hours.
4. Text feature extraction.
- Bag-of-Words, TF-IDF, word embeddings (Word2Vec, BERT).
5. Aggregation.
- Customer-level: total spend per month, avg session length, count of distinct categories.
6. Feature engineering. Derive new informative features:
- Ratio of features ( in credit modelling).
- Polynomial / interaction terms.
- Domain-specific (RFM in retail: Recency, Frequency, Monetary).
Good feature engineering often beats a fancier algorithm.
Mind Map
Visual structure of the conceptINTEGRATION & TRANSFORMATION
├── INTEGRATION (combine sources)
│ ├── Schema matching (column names, types)
│ ├── Entity resolution (fuzzy match)
│ ├── Value conflicts (which wins?)
│ ├── Redundancy detection (corr, χ²)
│ └── Approaches
│ ├── ETL (transform before load)
│ ├── ELT (load raw, transform in warehouse)
│ ├── Data virtualisation
│ └── Data lake / lakehouse
└── TRANSFORMATION (reshape for model)
├── Scaling
│ ├── Standardisation z = (x − μ)/σ
│ ├── Min-max
│ ├── Robust
│ └── Log
├── Categorical encoding
│ ├── One-hot, Ordinal
│ ├── Target encoding (careful: leakage)
│ └── Hashing
├── Date features (DOW, hour, is_weekend)
├── Text features (TF-IDF, embeddings)
├── Aggregation (summarise by entity)
└── Feature engineering (ratios, FE)
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questionsPart A (2 marks each)
Q1. What is one reason for using data integration? To produce a unified, consistent view of data from multiple heterogeneous sources — removing redundancy, resolving schema/naming conflicts, and enabling holistic analytics that no single source could provide.
Q2. Differentiate standardisation and min-max scaling.
- Standardisation: — zero mean, unit variance; suitable when data is approximately Gaussian.
- Min-max: — maps to ; sensitive to outliers.
Q3. Why use one-hot encoding for categorical variables? Because most ML models expect numeric inputs and treating arbitrary integer codes as ordered would imply false ordinality. One-hot creates an independent binary column per category, removing that bias — at the cost of feature dimensionality.
Part B (20 marks)
Q. Discuss data integration and transformation in detail. Explain the challenges in integration and the common transformation techniques used in machine learning.
Data Integration.
Goal. Merge multiple data sources into a single consistent view.
Challenges.
| Challenge | Description | Solution |
|---|---|---|
| Schema mismatch | "cust_id" vs "CustomerID" | Mapping table, dbt models |
| Type / unit conflict | kg vs lbs; int vs string | Type coercion, unit normalisation |
| Entity resolution | "Mohammed Ali" vs "Mohamed Ali" | Fuzzy matching, record linkage |
| Value conflict | Two sources disagree | Last-write-wins, source trust |
| Redundancy | Same info from two sources | Correlation, , drop |
| Granularity mismatch | hourly vs daily | Aggregate or interpolate |
Approaches.
- ETL — transform before load. Best when target storage is limited / regulated data must be cleansed first.
- ELT — load raw into a cloud warehouse, transform with SQL/dbt. Modern default with Snowflake/BigQuery.
- Data virtualisation — query across sources without moving data.
- Data lake / lakehouse — store raw + curated together (Delta Lake, Iceberg).
Data Transformation.
After integration, reshape the data so the model can consume it.
1. Scaling.
- Standardisation — required for k-NN, SVM, neural nets.
- Min-max — useful for sigmoid-activated networks.
- Robust scaling — robust to outliers.
- Log — corrects right-skew (incomes, counts).
Example. For a churn model with annual income (₹) and tenure (months), standardise both so k-means clusters aren't dominated by income.
2. Encoding categorical variables.
- One-hot — categories become binary columns. Used for nominal categoricals.
- Ordinal — integer codes when categories have natural order (low, medium, high).
- Target / mean encoding — replace category with target mean. Powerful but risks leakage; use cross-fold encoding.
- Hashing — for high cardinality (millions of zip codes).
Example. "city" → 100 one-hot columns; "education" → ordinal codes 1–4.
3. Date / time feature extraction.
- Day-of-week, hour, month, is_weekend, holiday flag.
- Time since last event.
- Cyclical encoding to preserve cyclical proximity (23 ≈ 1).
4. Text feature extraction.
- Bag-of-Words, TF-IDF, n-grams, embeddings (Word2Vec, BERT).
5. Aggregation.
- Customer-level RFM (Recency, Frequency, Monetary) in retail.
- Patient-level summary statistics from time-series vitals.
6. Feature engineering. Domain-driven derived features:
- ratio for credit scoring.
- Polynomial / interaction terms.
- Lag features for time series.
Often a single well-engineered feature beats switching from logistic regression to XGBoost. Integration + transformation deserve careful attention because they shape the upper bound of model performance.