Overview of Data Collection and Pre-Processing
Core Titles
Key headlines and terms for quick recall
- Data collection: first stage of any DS project
- Data sources: primary vs secondary, internal vs external
- Data formats: structured, semi-structured, unstructured
- Common formats: CSV, JSON, XML, Parquet, images, audio
- Data preprocessing — preparing raw data for analysis
- Steps: cleaning → integration → transformation → reduction → discretisation
- Why preprocessing matters: commonly estimated at 60–80% of project time
- Garbage in, garbage out
Basic Idea
What it is, why it matters, how it works
Data collection
Data collection is the systematic gathering of data needed to answer a question or build a model. It is the first stage of every data-science project — and its quality bounds everything downstream.
Sources.
| Source type | Examples |
|---|---|
| Primary (newly gathered) | Surveys, sensors, experiments, web scraping, A/B tests |
| Secondary (existing) | Public datasets (Kaggle, UCI, World Bank), company DBs, APIs |
| Internal | CRM, billing, web/mobile logs |
| External | Social media, market reports, government open data |
Data formats.
- Structured — neatly tabular (RDBMS tables, CSV). Predefined schema.
- Semi-structured — JSON, XML, YAML. Schema flexible.
- Unstructured — text, images, audio, video. Needs feature extraction.
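The three format families can be sketched with pandas as follows. This is a minimal illustration, assuming small in-memory samples; the column names and values are hypothetical, and the word count is only a stand-in for real feature extraction.

```python
import io
import json

import pandas as pd

# Structured: CSV text with a fixed set of columns (hypothetical sample data).
csv_text = "customer_id,plan,monthly_fee\n1,basic,10.0\n2,premium,25.5\n"
structured = pd.read_csv(io.StringIO(csv_text))

# Semi-structured: JSON records whose keys may nest or be absent.
json_text = """
[
  {"customer_id": 1, "profile": {"city": "Pune", "age": 31}},
  {"customer_id": 2, "profile": {"city": "Delhi"}}
]
"""
records = json.loads(json_text)
# json_normalize flattens nested keys into dotted column names;
# keys missing from a record become NaN.
semi_structured = pd.json_normalize(records)

# Unstructured: free text has no columns until features are extracted.
# A word count is about the simplest possible feature.
reviews = pd.Series(["great service", "slow and expensive support"])
word_counts = reviews.str.split().str.len()
```

Note that the second JSON record has no age key, so the flattened table already contains a missing value that preprocessing must handle.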
Why preprocessing?
Raw data is rarely model-ready:
- Missing values.
- Outliers and noise.
- Inconsistent formats and units.
- Multiple sources with conflicting keys.
- Skewed distributions, unscaled features.
Garbage in, garbage out: a brilliant algorithm cannot rescue dirty data. Commonly cited estimates put data preprocessing at 60–80% of project time.
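Before choosing any fixes, it helps to measure how many of the problems listed above are present. The sketch below is one possible audit using pandas; `quality_report` is a hypothetical helper, not a library function, and the sample data is invented.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise per-column issues that usually need preprocessing."""
    numeric = df.select_dtypes(include="number")
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })
    # Skewness is only computed for numeric columns; others stay NaN.
    report["skew"] = numeric.skew()
    return report


# Hypothetical raw data: one missing income, one extreme income,
# and a city spelled with inconsistent capitalisation.
raw = pd.DataFrame({
    "income": [30_000, 32_000, 31_000, np.nan, 900_000],
    "city": ["Pune", "pune", "Delhi", "Delhi", None],
})
report = quality_report(raw)
n_duplicate_rows = int(raw.duplicated().sum())
```

Here the unique count for city is inflated by "Pune" vs "pune", and the large positive skew on income points at the single extreme value.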
High-level preprocessing pipeline
1. Data cleaning.
- Handle missing values (drop, impute, model).
- Detect and treat outliers.
- Remove duplicates.
- Fix inconsistent formats (date formats, units, capitalisation).
2. Data integration.
- Merge multiple data sources.
- Resolve schema conflicts ("cust_id" vs "customer_ID").
- Reconcile duplicates across sources.
3. Data transformation.
- Scaling — standardisation, min-max.
- Encoding categoricals — one-hot, ordinal, target.
- Skew correction — log, square-root.
- Feature engineering — derive new informative features.
4. Data reduction.
- Feature selection — drop irrelevant / redundant features.
- Dimensionality reduction: PCA for model inputs; t-SNE mainly for 2-D or 3-D visualisation.
- Sampling — work with a subset when full data is too big.
5. Data discretisation.
- Bin continuous values into intervals — useful for tree models, naive Bayes, fairness analysis.
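The first two steps of the pipeline above (cleaning and integration) can be sketched in pandas as follows. The two tables, their columns and the choice of a left join are assumptions made for illustration; the key names mirror the "cust_id" vs "customer_ID" conflict mentioned in step 2.

```python
import pandas as pd

# Hypothetical extracts from two systems that name the join key differently.
crm = pd.DataFrame({
    "cust_id": [1, 2, 2, 3],
    "name": [" John Smith", "Asha Rao", "Asha Rao", "LI WEI "],
    "signup": ["2023-01-05", "2023-02-05", "2023-02-05", None],
})
billing = pd.DataFrame({
    "customer_ID": [1, 2, 4],
    "monthly_fee": [10.0, None, 40.0],
})

# 1. Cleaning: consistent text format, parsed dates, no exact duplicates.
crm_clean = crm.assign(
    name=crm["name"].str.strip().str.title(),
    signup=pd.to_datetime(crm["signup"], format="%Y-%m-%d"),
).drop_duplicates()

# Impute the missing fee with the median of the observed fees.
billing_clean = billing.assign(
    monthly_fee=billing["monthly_fee"].fillna(billing["monthly_fee"].median())
)

# 2. Integration: resolve the schema conflict, then merge.
billing_clean = billing_clean.rename(columns={"customer_ID": "cust_id"})
customers = crm_clean.merge(billing_clean, on="cust_id", how="left")
```

A left join keeps every CRM customer even when no billing row exists (customer 3 gets a missing fee) and drops billing-only records such as customer 4. Whether that is the right reconciliation rule depends on the project.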
Tools
- pandas, NumPy — Python tabular processing.
- OpenRefine — interactive cleaning.
- dbt, Airflow — production data transformation.
- Great Expectations, Soda — automated data-quality testing.
Output of preprocessing
A clean, consistent, model-ready dataset, often published to a feature store so that the same features feed both training and production inference, avoiding train-serve skew.
Mind Map
Visual structure of the concept
DATA COLLECTION & PREPROCESSING: OVERVIEW
├── Collection
│   ├── Sources: primary vs secondary, internal vs external
│   ├── Formats
│   │   ├── Structured (CSV, RDBMS)
│   │   ├── Semi-structured (JSON, XML)
│   │   └── Unstructured (text, image, audio)
│   └── Tools: web scraping, APIs, sensors, surveys
└── Preprocessing (60–80% of project time)
    ├── 1 Cleaning
    │   ├── Missing values
    │   ├── Outliers
    │   ├── Duplicates
    │   └── Format consistency
    ├── 2 Integration (merge sources)
    ├── 3 Transformation
    │   ├── Scaling
    │   ├── Encoding
    │   └── Feature engineering
    ├── 4 Reduction
    │   ├── Feature selection
    │   └── Dimensionality reduction
    └── 5 Discretisation (binning)
Garbage in, garbage out
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions
Part A (2 marks each)
Q1. Differentiate between structured, semi-structured and unstructured data.
- Structured — fixed schema, tabular (CSV, RDBMS).
- Semi-structured — flexible schema with tags / keys (JSON, XML).
- Unstructured — no schema (free text, images, audio, video) — needs feature extraction before modelling.
Q2. Why is data preprocessing important?
Raw data is rarely model-ready: it contains missing values, outliers, noise, inconsistent formats and irrelevant features. Without preprocessing, even the best algorithm produces poor models (the garbage in, garbage out principle). Preprocessing is commonly estimated to take 60–80% of project time, and it largely determines model quality.
Q3. Identify two sources of data collection.
- Primary — surveys, sensors, experiments, web scraping.
- Secondary — public datasets, government databases, third-party APIs.
Part B (20 marks)
Q. Explain the role of data collection and preprocessing in a data science project. Discuss the key steps of preprocessing with examples.
Role of data collection.
Data collection is the first stage of every DS project. It decides what data we have, and therefore what questions we can answer. Bad collection (biased sample, missing variables, late-arriving data) caps what the project can achieve, regardless of how sophisticated the modelling is.
Sources:
- Primary: designed surveys, instrumentation, sensors, A/B logs.
- Secondary: internal CRM/ERP, public datasets, third-party APIs.
Formats:
- Structured (CSV, RDBMS rows), semi-structured (JSON), unstructured (text, image, audio).
Considerations: representativeness (does the sample mirror reality?), freshness (when collected), consent / privacy (GDPR, HIPAA), provenance (lineage and trust).
Role of preprocessing.
Raw data is rarely model-ready. Preprocessing bridges raw inputs and the model by fixing quality issues, harmonising sources, and shaping features for the algorithm of choice. It typically consumes 60–80% of project time but has an outsized impact on model quality.
Key steps with examples.
1. Data cleaning.
- Missing values — drop rows, impute mean/median/mode, model-based imputation. Example: in a churn dataset, impute missing tenure with median by region.
- Outliers: detect with z-score, IQR or isolation forest; then cap, remove or transform. Example: a single transaction of ₹10⁹ can dominate a least-squares regression fit, so clip or log-transform it.
- Duplicates: exact and fuzzy deduplication. Example: "John Smith" vs "john smith " differ only in case and whitespace, so they should resolve to one record.
- Format consistency — standardise date formats, currencies, capitalisation.
2. Data integration.
- Merge multiple sources, resolve key conflicts ("cust_id" vs "customer_ID"), reconcile entity duplicates. Example: join CRM, billing and support tickets into one unified customer table.
3. Data transformation.
- Scaling: standardise or min-max. Essential for distance-based models and helpful for gradient-based training. Example: without scaling, "annual income (₹)" dominates the distances that k-means uses to form clusters.
- Encoding categoricals — one-hot, ordinal, target.
- Skew correction — log, square-root, Box–Cox. Example: log-transform incomes for linear regression.
- Feature engineering — derive new features. Example: extract day-of-week from timestamps.
4. Data reduction.
- Feature selection — variance threshold, chi-square, mutual information, L1. Example: drop 200 noisy gene features that don't correlate with target.
- Dimensionality reduction: PCA, autoencoders; t-SNE mainly for visualisation.
- Sampling / aggregation when data is huge.
5. Data discretisation.
- Bin continuous variables into intervals (uniform, quantile, model-based). Example: bin age into <25, 25–40, 40–60, >60 for fairness audits and decision-tree splits.
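Several of the examples in the steps above can be worked through on a toy table. The sketch below uses pandas and NumPy on invented data. The column names are hypothetical, and because the age bands in the text share their boundary values, the code assumes each band includes its lower edge.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["N", "N", "N", "S", "S", "S"],
    "tenure": [12.0, np.nan, 20.0, 4.0, 6.0, np.nan],
    "amount": [100.0, 120.0, 110.0, 90.0, 1e9, 105.0],
    "age": [22, 25, 39, 40, 60, 71],
    "ts": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03",
        "2024-01-04", "2024-01-05", "2024-01-06",
    ]),
})

# Cleaning: impute missing tenure with the median of the same region.
df["tenure"] = df["tenure"].fillna(
    df.groupby("region")["tenure"].transform("median")
)

# Cleaning: cap outliers at the IQR fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR).
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df["amount_capped"] = df["amount"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Transformation: log1p for skew, and z-score standardisation.
df["amount_log"] = np.log1p(df["amount"])
df["amount_z"] = (
    df["amount_capped"] - df["amount_capped"].mean()
) / df["amount_capped"].std()

# Feature engineering: day of week from the timestamp (Monday = 0).
df["day_of_week"] = df["ts"].dt.dayofweek

# Discretisation: right=False makes each bin include its left edge,
# so 25 falls in "25-40" and 60 falls in ">=60".
df["age_band"] = pd.cut(
    df["age"],
    bins=[0, 25, 40, 60, np.inf],
    labels=["<25", "25-40", "40-60", ">=60"],
    right=False,
)
```

Capping and log-transforming are shown side by side as alternatives for the ₹10⁹ transaction; in practice one of the two is usually enough.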
Pipelining.
Wrap all preprocessing steps and the model in a single pipeline (sklearn, TFX, MLflow) so that transformations are fitted on training data only and applied identically at inference, preventing leakage and train-serve skew.
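A minimal sketch of such a pipeline in scikit-learn is shown below. The churn table, its column names and the choice of logistic regression are assumptions made for illustration; any estimator could take the final slot.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical churn data; the column names are illustrative only.
train = pd.DataFrame({
    "tenure": [1.0, 3.0, np.nan, 24.0, 30.0, 36.0, 2.0, 28.0],
    "income": [20e3, 25e3, 22e3, 80e3, 90e3, 85e3, 21e3, 88e3],
    "plan": ["basic", "basic", "basic", "pro", "pro", "pro", "basic", "pro"],
    "churned": [1, 1, 1, 0, 0, 0, 1, 0],
})
numeric_cols = ["tenure", "income"]
categorical_cols = ["plan"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    # handle_unknown="ignore" encodes unseen categories as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])

# fit() learns medians, scaling statistics and categories from the
# training data only; predict() reuses them unchanged on new rows.
model.fit(train[numeric_cols + categorical_cols], train["churned"])

new_rows = pd.DataFrame({
    "tenure": [2.0, np.nan],
    "income": [23e3, 87e3],
    "plan": ["basic", "enterprise"],  # "enterprise" never appears in training
})
predictions = model.predict(new_rows)
```

Because imputation, scaling and encoding live inside the fitted object, the new rows with a missing tenure and an unseen plan are handled with the training-time statistics rather than being recomputed at inference.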
Impact. A clean, well-prepared dataset enables simpler models that generalise better, are easier to explain, and survive longer in production.