PGD01C02
Module 1 · Introduction to Data Science

Overview of Data Collection and Pre-Processing

Core Titles
Key headlines and terms for quick recall
  • Data collection — first stage of any DS project
  • Data sources: primary vs secondary, internal vs external
  • Data formats: structured, semi-structured, unstructured
  • Common formats: CSV, JSON, XML, Parquet, images, audio
  • Data preprocessing — preparing raw data for analysis
  • Steps: cleaning → integration → transformation → reduction → discretization
  • Why preprocessing matters — 60–80% of project time
  • Garbage in, garbage out
Basic Idea
What it is, why it matters, how it works

Data collection

Data collection is the systematic gathering of data needed to answer a question or build a model. It is the first stage of every data-science project — and its quality bounds everything downstream.

Sources.

Source type                Examples
Primary (newly gathered)   Surveys, sensors, experiments, web scraping, A/B tests
Secondary (existing)       Public datasets (Kaggle, UCI, World Bank), company DBs, APIs
Internal                   CRM, billing, web/mobile logs
External                   Social media, market reports, government open data

Data formats.

  • Structured — neatly tabular (RDBMS tables, CSV). Predefined schema.
  • Semi-structured — JSON, XML, YAML. Schema flexible.
  • Unstructured — text, images, audio, video. Needs feature extraction.
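
As a minimal sketch of the difference between the first two formats, the snippet below parses two hypothetical customer records from CSV and from JSON using only the Python standard library. The field names and values are invented for illustration.

```python
import csv
import io
import json

# Structured: every row follows the same predefined columns.
csv_text = "customer_id,name,city\n1,Asha,Kochi\n2,Ravi,Chennai\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys label each value, but records may differ in shape.
json_text = """
[
  {"customer_id": 1, "name": "Asha", "city": "Kochi"},
  {"customer_id": 2, "name": "Ravi", "phones": ["98xx", "97xx"]}
]
"""
json_rows = json.loads(json_text)

csv_columns = [set(row) for row in csv_rows]
json_columns = [set(row) for row in json_rows]

print(csv_columns[0] == csv_columns[1])    # True: fixed schema
print(json_columns[0] == json_columns[1])  # False: flexible schema
```

Note that the CSV reader returns every value as a string, whereas JSON keeps numbers and lists typed; type conversion is therefore part of cleaning CSV data.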

Why preprocessing?

Raw data is rarely model-ready:

  • Missing values.
  • Outliers and noise.
  • Inconsistent formats and units.
  • Multiple sources with conflicting keys.
  • Skewed distributions, unscaled features.

Garbage in, garbage out: a brilliant algorithm cannot rescue dirty data. Commonly cited practitioner estimates put data preprocessing at 60–80% of project time.

High-level preprocessing pipeline

1. Data cleaning.

  • Handle missing values (drop, impute, model).
  • Detect and treat outliers.
  • Remove duplicates.
  • Fix inconsistent formats (date formats, units, capitalisation).
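
A minimal pandas sketch of three of these cleaning steps (format fixes, duplicate removal, median imputation) might look as follows. The column names and values are hypothetical, and median imputation is only one of the options listed above.

```python
import pandas as pd

# Hypothetical raw data with typical quality problems.
raw = pd.DataFrame({
    "name": ["Asha", "asha ", "Ravi", "Meena"],
    "city": ["Kochi", "Kochi", "CHENNAI", "chennai"],
    "tenure_months": [12.0, 12.0, None, 30.0],
})

clean = raw.copy()

# Fix inconsistent formats: trim whitespace and normalise capitalisation.
clean["name"] = clean["name"].str.strip().str.title()
clean["city"] = clean["city"].str.strip().str.title()

# Remove duplicates (rows 0 and 1 are identical after normalisation).
clean = clean.drop_duplicates().reset_index(drop=True)

# Handle missing values: impute with the column median.
median_tenure = clean["tenure_months"].median()
clean["tenure_months"] = clean["tenure_months"].fillna(median_tenure)

print(clean)
```

The order matters here: formats are normalised first so that near-identical rows become exact duplicates before deduplication.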

2. Data integration.

  • Merge multiple data sources.
  • Resolve schema conflicts ("cust_id" vs "customer_ID").
  • Reconcile duplicates across sources.
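
The integration step can be sketched with a pandas merge. The two tables below are hypothetical stand-ins for a CRM export and a billing export, and choosing "cust_id" as the agreed key is an assumption made for the example.

```python
import pandas as pd

# Two hypothetical sources that name the customer key differently.
crm = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
billing = pd.DataFrame({"customer_ID": [1, 2, 4], "amount_due": [250.0, 0.0, 99.0]})

# Resolve the schema conflict by renaming to one agreed key.
billing = billing.rename(columns={"customer_ID": "cust_id"})

# An outer join keeps customers missing from either source;
# indicator=True adds a "_merge" column recording where each row came from.
merged = crm.merge(billing, on="cust_id", how="outer", indicator=True)

print(merged)
```

Rows marked "left_only" or "right_only" in the "_merge" column are the ones that need reconciling across sources.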

3. Data transformation.

  • Scaling — standardisation, min-max.
  • Encoding categoricals — one-hot, ordinal, target.
  • Skew correction — log, square-root.
  • Feature engineering — derive new informative features.
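
The scaling, skew-correction and encoding steps above can be sketched in a few lines of pandas and NumPy. The data are hypothetical, and the population standard deviation (ddof=0) is an assumed convention.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [20_000.0, 40_000.0, 60_000.0, 1_000_000.0],
    "plan": ["basic", "premium", "basic", "basic"],
})

# Standardisation: zero mean, unit variance (population std, ddof=0).
df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std(ddof=0)

# Min-max scaling: map the observed range onto [0, 1].
span = df["income"].max() - df["income"].min()
df["income_minmax"] = (df["income"] - df["income"].min()) / span

# Skew correction: log1p compresses the long right tail.
df["income_log"] = np.log1p(df["income"])

# One-hot encoding: one 0/1 indicator column per category.
df = pd.get_dummies(df, columns=["plan"], dtype=int)

print(df.round(3))
```

In a real project the mean, standard deviation, minimum and maximum would be computed on the training split only and then reused for new data.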

4. Data reduction.

  • Feature selection — drop irrelevant / redundant features.
  • Dimensionality reduction — PCA, t-SNE.
  • Sampling — work with a subset when full data is too big.
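
To illustrate dimensionality reduction, here is a small sketch of PCA computed directly from the singular value decomposition in NumPy. The synthetic data, the fixed random seed and the choice of two components are all assumptions made for the example; in practice a library implementation would normally be used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 3 features, where the third feature
# is almost a linear combination of the first two (redundant).
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + b + 0.01 * rng.normal(size=200)])

# PCA via SVD of the mean-centred data.
X_centred = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# Fraction of total variance explained by each principal component.
explained = S**2 / np.sum(S**2)

# Keep the first two components: 3 features reduced to 2.
X_reduced = X_centred @ Vt[:2].T

print(explained.round(4))
print(X_reduced.shape)
```

Because the third feature is redundant, the first two components carry almost all of the variance, so little information is lost by dropping the third.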

5. Data discretization.

  • Bin continuous values into intervals — useful for tree models, naive Bayes, fairness analysis.
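
Two common binning strategies, equal-width and equal-frequency, can be compared with a short pandas sketch. The values, the number of bins and the labels are hypothetical.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], name="spend")

# Equal-width bins: the extreme value 100 stretches the range,
# so almost every value falls into the first bin.
equal_width = pd.cut(values, bins=3, labels=["low", "mid", "high"])

# Equal-frequency (quantile) bins: each bin holds the same number of values here.
equal_freq = pd.qcut(values, q=5, labels=False)

print(equal_width.value_counts().to_dict())
print(equal_freq.value_counts().sort_index().to_dict())
```

Quantile bins are usually the safer default for skewed variables, since equal-width bins can leave some intervals almost empty.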

Tools

  • pandas, NumPy — Python tabular processing.
  • OpenRefine — interactive cleaning.
  • dbt, Airflow — production data transformation.
  • Great Expectations, Soda — automated data-quality testing.

Output of preprocessing

A clean, consistent, model-ready dataset, often saved in a feature store so the same features feed both training and production inference, avoiding train-serve skew.

Mind Map
Visual structure of the concept
DATA COLLECTION & PREPROCESSING — OVERVIEW
├── Collection
│   ├── Sources: primary vs secondary, internal vs external
│   ├── Formats
│   │   ├── Structured (CSV, RDBMS)
│   │   ├── Semi-structured (JSON, XML)
│   │   └── Unstructured (text, image, audio)
│   └── Tools: web scraping, APIs, sensors, surveys
└── Preprocessing (60–80% of project time)
    ├── 1 Cleaning
    │   ├── Missing values
    │   ├── Outliers
    │   ├── Duplicates
    │   └── Format consistency
    ├── 2 Integration (merge sources)
    ├── 3 Transformation
    │   ├── Scaling
    │   ├── Encoding
    │   └── Feature engineering
    ├── 4 Reduction
    │   ├── Feature selection
    │   └── Dimensionality reduction
    └── 5 Discretization (binning)

Garbage in, garbage out
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Differentiate between structured, semi-structured and unstructured data.

  • Structured — fixed schema, tabular (CSV, RDBMS).
  • Semi-structured — flexible schema with tags / keys (JSON, XML).
  • Unstructured — no schema (free text, images, audio, video) — needs feature extraction before modelling.

Q2. Why is data preprocessing important?

Because raw data is rarely model-ready: it contains missing values, outliers, noise, inconsistent formats and irrelevant features. Without preprocessing, even the best algorithm produces poor models (the garbage in, garbage out principle). Preprocessing is commonly estimated to consume 60–80% of project time, and it largely determines model quality.

Q3. Identify two sources of data collection.

  1. Primary — surveys, sensors, experiments, web scraping.
  2. Secondary — public datasets, government databases, third-party APIs.

Part B (20 marks)

Q. Explain the role of data collection and preprocessing in a data science project. Discuss the key steps of preprocessing with examples.

Role of data collection.

Data collection is the first stage of every DS project. It decides what data we have, and therefore what questions we can answer. Bad collection (biased sample, missing variables, late-arriving data) caps the project regardless of how sophisticated the modelling is.

Sources:

  • Primary: designed surveys, instrumentation, sensors, A/B logs.
  • Secondary: internal CRM/ERP, public datasets, third-party APIs.

Formats:

  • Structured (CSV, RDBMS rows), semi-structured (JSON), unstructured (text, image, audio).

Considerations: representativeness (does the sample mirror reality?), freshness (when collected), consent / privacy (GDPR, HIPAA), provenance (lineage and trust).

Role of preprocessing.

Raw data is rarely model-ready. Preprocessing bridges raw inputs and the model by fixing quality issues, harmonising sources, and shaping features for the algorithm of choice. It is commonly estimated to consume 60–80% of project time, and it has an outsized impact on model quality.

Key steps with examples.

1. Data cleaning.

  • Missing values: drop rows, impute with mean/median/mode, or use model-based imputation. Example: in a churn dataset, impute missing tenure with the median for the customer's region.
  • Outliers: detect with z-score, IQR or isolation forest; then cap, remove or transform. Example: a single transaction of ₹10⁹ distorts a regression fit, so clip it or log-transform the column.
  • Duplicates: exact and fuzzy deduplication. Example: "John Smith" vs "john  smith" (different case, extra space).
  • Format consistency: standardise date formats, currencies, capitalisation.
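
The two worked examples for missing values and outliers can be sketched in pandas as follows. The table is hypothetical, and the 1.5 × IQR fence is the conventional rule of thumb rather than a fixed requirement.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "tenure": [10.0, None, 14.0, 40.0, 44.0, None],
    "amount": [120.0, 150.0, 130.0, 140.0, 1_000_000_000.0, 160.0],
})

# Impute missing tenure with the median of the same region.
region_median = df.groupby("region")["tenure"].transform("median")
df["tenure"] = df["tenure"].fillna(region_median)

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["is_outlier"] = ~df["amount"].between(lower, upper)

# Cap (winsorise) the extreme value instead of dropping the row.
df["amount_capped"] = df["amount"].clip(lower=lower, upper=upper)

print(df)
```

Capping keeps the row, and therefore its other columns, available for modelling, which is often preferable to deletion when data are scarce.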

2. Data integration.

  • Merge multiple sources, resolve key conflicts ("cust_id" vs "customer_ID"), reconcile entity duplicates. Example: join CRM, billing and support tickets into one unified customer table.

3. Data transformation.

  • Scaling: standardise with $z = (x - \mu)/\sigma$, or apply min-max scaling. Important for distance-based and gradient-based models. Example: without scaling, "annual income (₹)" dominates k-means clusters.
  • Encoding categoricals: one-hot, ordinal, target.
  • Skew correction: log, square-root, Box–Cox. Example: log-transform incomes for linear regression.
  • Feature engineering: derive new features. Example: extract day-of-week from timestamps.
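
Two of these examples can be made concrete with a short sketch: extracting day-of-week from timestamps, and measuring how much of a Euclidean distance comes from income before and after standardisation. The dates, ages and incomes are hypothetical, and `income_share` is a helper invented for this illustration.

```python
import numpy as np
import pandas as pd

# Feature engineering: derive day-of-week from a timestamp column.
orders = pd.DataFrame({"placed_at": pd.to_datetime(["2024-01-01", "2024-01-06"])})
orders["day_of_week"] = orders["placed_at"].dt.day_name()

# Scale effect on Euclidean distance: columns are [age, annual_income].
X = np.array([[25.0, 500_000.0], [60.0, 510_000.0], [26.0, 900_000.0]])


def income_share(a, b):
    """Fraction of the squared distance between a and b that is due to income."""
    sq = (a - b) ** 2
    return sq[1] / sq.sum()


# Standardise each column with its own mean and standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

raw_share = income_share(X[0], X[1])
scaled_share = income_share(X_std[0], X_std[1])

print(orders)
print(round(raw_share, 5), round(scaled_share, 5))
```

Before scaling, the 35-year age gap between the first two customers is practically invisible to a distance-based method such as k-means; after scaling it accounts for most of the distance between them.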

4. Data reduction.

  • Feature selection: variance threshold, chi-square, mutual information, L1 regularisation. Example: drop 200 noisy gene features that show no association with the target.
  • Dimensionality reduction: PCA, autoencoders; t-SNE mainly for visualisation.
  • Sampling / aggregation when data is huge.
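
As a rough sketch of filter-style feature selection, the snippet below applies a variance threshold and then ranks the surviving features by absolute correlation with the target. Correlation is used here as a simple stand-in for chi-square or mutual information; the data and the threshold value are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1, 1],           # zero variance: carries no information
    "signal":   [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],
    "noise":    [0.5, 0.1, 0.9, 0.2, 0.6, 0.4],
    "target":   [0, 0, 0, 1, 1, 1],
})
features = df.drop(columns="target")

# Step 1: variance threshold. Drop features whose variance is (near) zero.
variances = features.var()
kept = variances[variances > 1e-8].index.tolist()

# Step 2: rank the remaining features by absolute correlation with the target.
ranking = features[kept].corrwith(df["target"]).abs().sort_values(ascending=False)

print(kept)
print(ranking.round(3))
```

A cut-off on the ranking (for example, keeping the top k features) would then be chosen by validation rather than fixed in advance.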

5. Data discretization.

  • Bin continuous variables into intervals (uniform, quantile, model-based). Example: bin age into non-overlapping bands <25, 25–39, 40–59, 60+ for fairness audits and decision-tree splits.
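
The age-band example can be sketched with explicit bin edges in pandas. The sketch assumes left-closed intervals so that each boundary age (25, 40, 60) falls into exactly one band; the labels are chosen to make that explicit.

```python
import numpy as np
import pandas as pd

ages = pd.Series([18, 24, 25, 39, 40, 59, 60, 72], name="age")

# Assumption: intervals are left-closed, so 25 -> "25-39", 40 -> "40-59", 60 -> "60+".
edges = [0, 25, 40, 60, np.inf]
labels = ["<25", "25-39", "40-59", "60+"]

age_band = pd.cut(ages, bins=edges, labels=labels, right=False)

print(pd.DataFrame({"age": ages, "age_band": age_band}))
```

Stating the boundary convention explicitly avoids the same person being counted in two bands, which matters when group-level rates are compared in a fairness audit.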

Pipelining.

Wrap all preprocessing steps and the model in a single pipeline (for example in scikit-learn, TFX or MLflow) so that transformations are fitted on training data only and applied identically at inference, which prevents data leakage and train-serve skew.
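
A minimal scikit-learn sketch of this idea is shown below. The churn-style data, the feature meanings and the choice of logistic regression are assumptions made for illustration, not a recommended model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: [tenure_months, monthly_spend] -> churned (1) or not (0).
X_train = np.array([[1, 200.0], [3, 250.0], [24, 900.0], [36, 1200.0],
                    [2, 180.0], [30, 1000.0]])
y_train = np.array([1, 1, 0, 0, 1, 0])

pipe = Pipeline([
    ("scale", StandardScaler()),       # statistics learned from training data only
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)

# At inference, the pipeline reuses the training mean and std automatically.
X_new = np.array([[2, 210.0], [28, 950.0]])
predictions = pipe.predict(X_new)

print(predictions)
```

Because the scaler lives inside the pipeline, cross-validation refits it on each training fold, so no statistics from held-out data leak into preprocessing.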

Impact. A clean, well-prepared dataset enables simpler models that generalise better, are easier to explain, and survive longer in production.