PGDDSA Study · Semester 1

PGD01C02

Module 1 · Introduction to Data Science

Overview of Data Collection and Pre-Processing

Core Titles

Key headlines and terms for quick recall

Data collection — first stage of any DS project
Data sources: primary vs secondary, internal vs external
Data formats: structured, semi-structured, unstructured
Common formats: CSV, JSON, XML, Parquet, images, audio
Data preprocessing — preparing raw data for analysis
Steps: cleaning → integration → transformation → reduction → discretization
Why preprocessing matters — 60–80% of project time
Garbage in, garbage out

Basic Idea

What it is, why it matters, how it works

Data collection

Data collection is the systematic gathering of data needed to answer a question or build a model. It is the first stage of every data-science project — and its quality bounds everything downstream.

Sources.

Source type	Examples
Primary (newly gathered)	Surveys, sensors, experiments, web scraping, A/B tests
Secondary (existing)	Public datasets (Kaggle, UCI, World Bank), company DBs, APIs
Internal	CRM, billing, web/mobile logs
External	Social media, market reports, government open data

Data formats.

Structured — neatly tabular (RDBMS tables, CSV). Predefined schema.
Semi-structured — JSON, XML, YAML. Schema flexible.
Unstructured — text, images, audio, video. Needs feature extraction.
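
As a rough illustration of the first two formats, the sketch below parses similar customer records from CSV text and from JSON text using only Python's standard library. The field names (`id`, `name`, `city`, `tags`) and the records themselves are invented for the example.

```python
import csv
import io
import json

# Structured: every row follows the same predefined columns.
csv_text = "id,name,city\n1,Asha,Kochi\n2,Ravi,Pune\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys describe the data, and records may differ in shape.
json_text = '[{"id": 1, "name": "Asha", "tags": ["new"]}, {"id": 2, "name": "Ravi"}]'
records = json.loads(json_text)

# CSV values arrive as strings; JSON keeps numbers and nested lists.
csv_id_type = type(rows[0]["id"]).__name__      # 'str'
json_id_type = type(records[0]["id"]).__name__  # 'int'

# Optional keys need a default when flattening semi-structured records.
tags = [r.get("tags", []) for r in records]
```

Unstructured data has no such direct parse step, which is why it needs feature extraction first.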

Why preprocessing?

Raw data is rarely model-ready:

Missing values.
Outliers and noise.
Inconsistent formats and units.
Multiple sources with conflicting keys.
Skewed distributions, unscaled features.

Garbage in, garbage out: a brilliant algorithm cannot rescue dirty data. By a commonly cited rule of thumb, preprocessing consumes 60–80% of project time.

High-level preprocessing pipeline

1. Data cleaning.

Handle missing values (drop, impute, model).
Detect and treat outliers.
Remove duplicates.
Fix inconsistent formats (date formats, units, capitalisation).
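
Three of the cleaning steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a recommended recipe: the records, the `name` and `age` fields, and the choice of median imputation are all hypothetical.

```python
from statistics import median

# Hypothetical raw records: inconsistent capitalisation, a duplicate, a missing age.
raw = [
    {"name": "Asha ", "age": 34},
    {"name": "asha", "age": 34},
    {"name": "Ravi", "age": None},
    {"name": "Meera", "age": 52},
]

def clean(records):
    # 1. Fix inconsistent formats: trim whitespace, normalise capitalisation.
    tidy = [{"name": r["name"].strip().title(), "age": r["age"]} for r in records]

    # 2. Remove exact duplicates while keeping the first occurrence.
    seen, unique = set(), []
    for r in tidy:
        key = (r["name"], r["age"])
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # 3. Impute missing ages with the median of the observed ages.
    observed = [r["age"] for r in unique if r["age"] is not None]
    fill = median(observed)
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in unique]

cleaned = clean(raw)
```

The order matters here: formats are normalised before deduplication so that "Asha " and "asha" are recognised as the same record.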

2. Data integration.

Merge multiple data sources.
Resolve schema conflicts ("cust_id" vs "customer_ID").
Reconcile duplicates across sources.
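
A minimal sketch of integration, assuming two small in-memory sources whose key columns are named differently. The helper names `rename_key` and `left_join` and the sample records are hypothetical; a library such as pandas would normally do this work.

```python
# Two hypothetical sources that name the customer key differently.
crm = [
    {"cust_id": 1, "name": "Asha"},
    {"cust_id": 2, "name": "Ravi"},
]
billing = [
    {"customer_ID": 1, "amount_due": 1200},
    {"customer_ID": 3, "amount_due": 450},
]

def rename_key(records, old, new):
    """Return copies of the records with one key renamed."""
    return [{(new if k == old else k): v for k, v in r.items()} for r in records]

def left_join(left, right, key):
    """Keep every left record; attach matching right fields when present."""
    index = {r[key]: r for r in right}
    return [{**row, **index.get(row[key], {})} for row in left]

# Resolve the schema conflict first, then merge on the shared key.
billing_std = rename_key(billing, "customer_ID", "cust_id")
customers = left_join(crm, billing_std, "cust_id")
```

Customer 2 has no billing record and so keeps only its CRM fields, while billing customer 3 is dropped because a left join keeps only keys present in the left source.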

3. Data transformation.

Scaling — standardisation, min-max.
Encoding categoricals — one-hot, ordinal, target.
Skew correction — log, square-root.
Feature engineering — derive new informative features.
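
The scaling and encoding steps above can be written out directly. The sketch below uses only the standard library; the function names and the sample values are assumptions made for illustration.

```python
from statistics import mean, pstdev

def standardise(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """Map each label to a 0/1 vector with one position per distinct category."""
    categories = sorted(set(labels))
    return categories, [[int(lab == c) for c in categories] for lab in labels]

incomes = [20_000, 40_000, 60_000]
z_scores = standardise(incomes)
scaled = min_max(incomes)
categories, encoded = one_hot(["red", "blue", "red"])
```

Both scalers divide by a spread measure, so a constant column (zero standard deviation or zero range) would need special handling that this sketch omits.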

4. Data reduction.

Feature selection — drop irrelevant / redundant features.
Dimensionality reduction — PCA; t-SNE (mainly for visualisation).
Sampling — work with a subset when full data is too big.
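
One simple form of feature selection is to drop columns that barely vary. The sketch below assumes a small table stored as a dictionary of columns; the column names and the zero threshold are hypothetical.

```python
from statistics import pvariance

# Hypothetical feature table: column name -> values for four rows.
features = {
    "age": [23, 35, 47, 59],
    "country_code": [91, 91, 91, 91],   # constant, carries no information
    "visits": [1, 4, 2, 8],
}

def drop_low_variance(table, threshold=0.0):
    """Keep only columns whose population variance exceeds the threshold."""
    return {name: col for name, col in table.items() if pvariance(col) > threshold}

reduced = drop_low_variance(features)
```

A variance filter only removes uninformative columns; it says nothing about whether the remaining features relate to the target.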

5. Data discretization.

Bin continuous values into intervals — useful for tree models, naive Bayes, fairness analysis.
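
Binning with fixed cut points can be sketched with the standard `bisect` module. The cut points and labels below are hypothetical and chosen only to show the mechanics.

```python
from bisect import bisect_right

def to_bin(value, edges, labels):
    """Return the label of the interval that contains value.

    edges are the ascending cut points; a value equal to an edge
    falls into the interval that starts at that edge.
    """
    return labels[bisect_right(edges, value)]

edges = [18, 65]                       # hypothetical cut points
labels = ["minor", "adult", "senior"]  # one more label than cut points

binned = [to_bin(age, edges, labels) for age in [12, 18, 40, 65, 80]]
```

Deciding explicitly which side of a boundary a value belongs to avoids ambiguous, overlapping intervals.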

Tools

pandas, NumPy — Python tabular processing.
OpenRefine — interactive cleaning.
dbt, Airflow — production data transformation.
Great Expectations, Soda — automated data-quality testing.
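
The kind of rule that data-quality tools automate can be pictured with a few hand-written checks. The sketch below is plain Python and is not the API of Great Expectations or Soda; the rules (unique `id`, `age` present and between 0 and 120) and the sample rows are hypothetical.

```python
def check_quality(rows):
    """Return a list of human-readable rule violations (empty means all passed)."""
    problems = []
    ids = [r["id"] for r in rows]
    if len(ids) != len(set(ids)):
        problems.append("id values are not unique")
    for r in rows:
        if r["age"] is None:
            problems.append(f"row {r['id']}: age is missing")
        elif not 0 <= r["age"] <= 120:
            problems.append(f"row {r['id']}: age {r['age']} out of range")
    return problems

good = [{"id": 1, "age": 30}, {"id": 2, "age": 41}]
bad = [{"id": 1, "age": 30}, {"id": 1, "age": 150}, {"id": 3, "age": None}]

good_report = check_quality(good)
bad_report = check_quality(bad)
```

Running such checks on every new batch turns data-quality expectations into tests that fail loudly instead of silently degrading a model.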

Output of preprocessing

A clean, consistent, model-ready dataset, often stored in a feature store so that the same features feed both training and production inference, which avoids train-serve skew.

Mind Map

Visual structure of the concept

DATA COLLECTION & PREPROCESSING — OVERVIEW
├── Collection
│   ├── Sources: primary vs secondary, internal vs external
│   ├── Formats
│   │   ├── Structured (CSV, RDBMS)
│   │   ├── Semi-structured (JSON, XML)
│   │   └── Unstructured (text, image, audio)
│   └── Methods: web scraping, APIs, sensors, surveys
└── Preprocessing (60–80% of project time)
    ├── 1 Cleaning
    │   ├── Missing values
    │   ├── Outliers
    │   ├── Duplicates
    │   └── Format consistency
    ├── 2 Integration (merge sources)
    ├── 3 Transformation
    │   ├── Scaling
    │   ├── Encoding
    │   └── Feature engineering
    ├── 4 Reduction
    │   ├── Feature selection
    │   └── Dimensionality reduction
    └── 5 Discretization (binning)

Garbage in, garbage out

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Differentiate between structured, semi-structured and unstructured data.

Structured — fixed schema, tabular (CSV, RDBMS).
Semi-structured — flexible schema with tags / keys (JSON, XML).
Unstructured — no schema (free text, images, audio, video) — needs feature extraction before modelling.

Q2. Why is data preprocessing important?

Because raw data is rarely model-ready: it contains missing values, outliers, noise, inconsistent formats and irrelevant features. Without preprocessing, even the best algorithm produces poor models (the garbage in, garbage out principle). Preprocessing is commonly estimated to consume 60–80% of project time, and it largely determines model quality.

Q3. Identify two sources of data collection.

Primary — surveys, sensors, experiments, web scraping.
Secondary — public datasets, government databases, third-party APIs.

Part B (20 marks)

Q. Explain the role of data collection and preprocessing in a data science project. Discuss the key steps of preprocessing with examples.

Role of data collection.

Data collection is the first stage of every DS project. It decides what data we have, and therefore what questions we can answer. Bad collection (biased sample, missing variables, late-arriving data) caps the project regardless of how sophisticated the modelling is.

Sources:

Primary: designed surveys, instrumentation, sensors, A/B logs.
Secondary: internal CRM/ERP, public datasets, third-party APIs.

Formats:

Structured (CSV, RDBMS rows), semi-structured (JSON), unstructured (text, image, audio).

Considerations: representativeness (does the sample mirror reality?), freshness (when collected), consent / privacy (GDPR, HIPAA), provenance (lineage and trust).

Role of preprocessing.

Raw data is rarely model-ready. Preprocessing bridges raw inputs and the model, fixing quality issues, harmonising sources, and shaping features for the algorithm of choice. It typically consumes 60–80% of project time — but has out-sized impact on model quality.

Key steps with examples.

1. Data cleaning.

Missing values — drop rows, impute with mean/median/mode, or use model-based imputation. Example: in a churn dataset, impute missing tenure with the median tenure of the customer's region.
Outliers — detect with z-score, IQR or isolation forest; then cap, remove or transform. Example: a single transaction of ₹10⁹ can dominate a least-squares regression fit, so clip or log-transform it.
Duplicates — exact and fuzzy deduplication. Example: "John Smith" vs "john  smith" (lower case, double space).
Format consistency — standardise date formats, currencies, capitalisation.
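
The IQR rule for outliers mentioned above can be sketched as follows. The multiplier `k=1.5` is the conventional Tukey choice; the transaction amounts are made up, and capping is only one of the treatments listed.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of Tukey's IQR rule."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

def cap(values, lower, upper):
    """Clip every value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

amounts = [120, 150, 130, 160, 140, 155, 135, 1_000_000]  # hypothetical transactions
lower, upper = iqr_bounds(amounts)
outliers = [v for v in amounts if v < lower or v > upper]
capped = cap(amounts, lower, upper)
```

Because quartiles are robust to extreme values, the fences stay near the bulk of the data even though one amount is several orders of magnitude larger.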

2. Data integration.

Merge multiple sources, resolve key conflicts ("cust_id" vs "customer_ID"), reconcile entity duplicates. Example: join CRM, billing and support tickets into one unified customer table.

3. Data transformation.

Scaling — standardise with $z = (x - \mu)/\sigma$ or rescale with min-max, $x' = (x - x_{\min})/(x_{\max} - x_{\min})$. Important for distance-based and gradient-based models. Example: without scaling, "annual income (₹)" dominates k-means clusters.
Encoding categoricals — one-hot, ordinal, target.
Skew correction — log, square-root, Box–Cox. Example: log-transform incomes for linear regression.
Feature engineering — derive new features. Example: extract day-of-week from timestamps.
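
The skew-correction and feature-engineering examples above might look like this in plain Python. The income values and timestamps are invented, and `log1p` is used instead of a bare logarithm on the assumption that zeros can occur.

```python
import math
from datetime import datetime

# Skew correction: log1p compresses a long right tail and tolerates zeros.
incomes = [0, 9, 99, 999_999]
log_incomes = [math.log1p(x) for x in incomes]

def day_of_week(timestamp):
    """Derive the weekday (0 = Monday ... 6 = Sunday) from an ISO timestamp."""
    return datetime.fromisoformat(timestamp).weekday()

# Feature engineering: a new column computed from an existing one.
days = [day_of_week(t) for t in ["2024-01-01T09:30:00", "2024-01-06T18:00:00"]]
```

The log transform preserves the ordering of the values while shrinking the gap between the largest and the rest.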

4. Data reduction.

Feature selection — variance threshold, chi-square, mutual information, L1. Example: drop 200 noisy gene features that don't correlate with target.
Dimensionality reduction — PCA, autoencoders; t-SNE (mainly for visualisation).
Sampling / aggregation when data is huge.
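
When the full dataset is too large, a reproducible random subset is often enough for exploration. The sketch below is an assumption-laden minimum: `rows` stands in for a large table's row identifiers and the seed value is arbitrary.

```python
import random

def sample_rows(rows, k, seed=0):
    """Draw k rows without replacement; a fixed seed makes the draw repeatable."""
    return random.Random(seed).sample(rows, k)

rows = list(range(1_000))          # stand-in for a large table's row ids
subset = sample_rows(rows, 50)
again = sample_rows(rows, 50)      # same seed, same subset
```

Simple random sampling can under-represent rare classes, in which case sampling within each class separately is the usual alternative.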

5. Data discretization.

Bin continuous variables into intervals (uniform, quantile, model-based). Example: bin age into <25, 25–39, 40–59 and ≥60 for fairness audits and decision-tree splits.
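
The difference between uniform and quantile binning shows up in where the cut points fall. The sketch below computes both for a small, deliberately skewed list of ages; the data and the helper names are hypothetical.

```python
from statistics import quantiles

def uniform_edges(values, bins):
    """Interior cut points that split [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]

def quantile_edges(values, bins):
    """Interior cut points that put roughly equal counts in each interval."""
    return quantiles(values, n=bins, method="inclusive")

ages = [18, 19, 20, 21, 22, 23, 24, 60, 80]  # hypothetical, right-skewed
equal_width = uniform_edges(ages, 2)
equal_count = quantile_edges(ages, 2)
```

With two bins the equal-width cut lands at 49 and leaves seven of the nine ages in the first bin, whereas the quantile cut lands at the median, 22.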

Pipelining.

Wrap all steps and the model in a single pipeline (sklearn, TFX, MLflow) so that transformations are fitted on training data only and applied identically at training and at inference, which prevents leakage and train-serve skew.
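
The core idea of a pipeline step, learning parameters once and reusing them, can be sketched without any framework. `MeanStdScaler` below is a hypothetical stand-in written for this note and is not a class from sklearn, TFX or MLflow.

```python
from statistics import mean, pstdev

class MeanStdScaler:
    """Tiny stand-in for a pipeline step: learn parameters once, reuse them."""

    def fit(self, values):
        # Parameters are estimated from whatever data is passed to fit.
        self.mu = mean(values)
        self.sigma = pstdev(values)
        return self

    def transform(self, values):
        # Applies the stored parameters; nothing is re-estimated here.
        return [(v - self.mu) / self.sigma for v in values]

train = [10.0, 20.0, 30.0]
new_data = [40.0]          # arrives later, at inference time

scaler = MeanStdScaler().fit(train)      # statistics come from training data only
train_scaled = scaler.transform(train)
new_scaled = scaler.transform(new_data)  # same mu and sigma, no refitting
```

Fitting the scaler on training and new data together would let information from unseen data influence the training features, which is the leakage the pipeline is meant to prevent.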

Impact. A clean, well-prepared dataset enables simpler models that generalise better, are easier to explain, and survive longer in production.
