PGDDSA Study · Semester 1

PGD01C02

Module 2 · Data Collection and Pre-Processing

Data Collection Strategies

Core Titles

Key headlines and terms for quick recall

Primary collection — gather new data for the project
Secondary collection — reuse existing data
Web scraping — automated extraction from websites
APIs — structured data feeds (REST, GraphQL)
Sensors / IoT — instrumented data
Surveys & forms
Public datasets — Kaggle, UCI, World Bank, government open data
Sampling strategies — simple, stratified, cluster, systematic

Basic Idea

What it is, why it matters, how it works

What is a data-collection strategy?

A data-collection strategy is the plan that decides what data to gather, how, from where and how much — to answer a specific question with adequate quality, representativeness and cost.

Primary vs Secondary

Aspect	Primary	Secondary
Source	Gathered fresh for the project	Pre-existing
Cost	Higher	Lower
Control over quality	High	Limited
Examples	Surveys, sensors, A/B tests, scraping	Kaggle, UCI, gov datasets, company DB

Strategies by source

1. Internal company databases. Transactional (OLTP), CRM, ERP, web/mobile event logs. Usually the richest, most reliable starting point.

2. Web scraping. Use BeautifulSoup, Scrapy, Selenium to extract data from web pages. Respect robots.txt and terms of service. Useful for prices, reviews, public listings.

3. APIs. Programmatic feeds — REST, GraphQL, gRPC. Examples: Twitter API, OpenWeather, Stripe, stock-market APIs. Reliable and structured, but usually rate-limited.

4. IoT / Sensors. Streaming data from devices — temperature, GPS, accelerometers. Need event-streaming infrastructure (Kafka, MQTT) and time-series storage.

5. Surveys and forms. Direct user input — Google Forms, SurveyMonkey, in-app questionnaires. Risk of selection bias and self-report errors.

6. Public datasets. Kaggle, UCI ML Repository, World Bank, Indian government open-data portal, Hugging Face Datasets. Good for benchmarking and prototyping.

7. Crowdsourcing. Amazon Mechanical Turk, Prolific — humans label data, common for supervised learning.

8. Synthetic data. When real data is scarce / private — generate via simulation or generative models (GANs, diffusion). Useful for rare-event modelling.
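
The robots.txt guidance in item 2 can be checked in code before any page is requested. The following is a minimal sketch, not a full scraper: the robots.txt body, the crawler name `study-bot` and the example.com URLs are all hypothetical, and it uses Python's standard `urllib.robotparser` rather than the scraping libraries named above. The rules are parsed from a string so the example needs no network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt rules from a string."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def may_fetch(parser: RobotFileParser, url: str, agent: str = "study-bot") -> bool:
    """Return True if the rules allow this (hypothetical) crawler to fetch the URL."""
    return parser.can_fetch(agent, url)

parser = build_parser(ROBOTS_TXT)
allowed = may_fetch(parser, "https://example.com/products/1")
blocked = may_fetch(parser, "https://example.com/private/report")
delay = parser.crawl_delay("study-bot")  # seconds to wait between requests, if declared

print(allowed, blocked, delay)
```

Passing robots.txt is only part of the check: the site's terms of service still have to be read by a person.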

Sampling strategies

When the full population is too large or expensive, sample wisely:

Simple random sample — every item has an equal probability of selection.
Stratified — sample within strata (age groups, regions) to preserve composition.
Cluster — randomly select whole groups (cities, schools), then include everyone within the chosen groups.
Systematic — every $k$-th item, starting from a random offset.
Convenience — easy access (risk: biased).
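
The four probability-based strategies above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the population, its `region` and `city` fields, and the function names are all hypothetical, and a fixed seed is used only to make the run repeatable.

```python
import random
from collections import defaultdict

def simple_random(items, n, rng):
    """Every item has the same chance of being chosen."""
    return rng.sample(items, n)

def systematic(items, k, rng):
    """Take every k-th item, starting from a random offset below k."""
    start = rng.randrange(k)
    return items[start::k]

def stratified(items, key, fraction, rng):
    """Sample the same fraction from each stratum to preserve composition."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        # Rounding means the total can differ slightly from fraction * len(items).
        take = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, take))
    return sample

def cluster(items, key, n_clusters, rng):
    """Pick whole groups at random, then keep every member of the chosen groups."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    chosen = rng.sample(sorted(groups), n_clusters)
    return [item for name in chosen for item in groups[name]]

rng = random.Random(42)
# Hypothetical population: 25 rural and 75 urban records spread over 5 cities.
population = [
    {"id": i, "region": "urban" if i % 4 else "rural", "city": f"city{i % 5}"}
    for i in range(100)
]

srs = simple_random(population, 10, rng)
sys_sample = systematic(population, 10, rng)
strat = stratified(population, lambda p: p["region"], 0.2, rng)
clu = cluster(population, lambda p: p["city"], 2, rng)
```

Here the stratified sample keeps the 1:3 rural-to-urban ratio of the population, which a small simple random sample is not guaranteed to do.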

Quality and ethics

Representativeness — does the sample reflect the target population?
Bias — is the data skewed toward a subset (e.g., only urban users)?
Consent and privacy — GDPR (EU), HIPAA (US health), India's IT Act 2000 and DPDP Act 2023.
Data provenance — source, time, transformations recorded.
Freshness — old data may not reflect current behaviour.
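
The provenance and freshness points above can be made concrete with a small record kept alongside each dataset. This is a sketch under assumptions: `ProvenanceRecord`, `describe` and `is_stale` are hypothetical names, the file name and contents are made up, and the 30-day threshold is arbitrary.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

@dataclass
class ProvenanceRecord:
    source: str            # where the data came from
    collected_at: str      # ISO-8601 UTC timestamp of collection
    sha256: str            # fingerprint of the raw bytes
    transformations: list = field(default_factory=list)

    def add_step(self, description: str) -> None:
        """Append a human-readable note about a transformation."""
        self.transformations.append(description)

def describe(raw_bytes: bytes, source: str) -> ProvenanceRecord:
    """Create a provenance record for freshly collected raw data."""
    return ProvenanceRecord(
        source=source,
        collected_at=datetime.now(timezone.utc).isoformat(),
        sha256=hashlib.sha256(raw_bytes).hexdigest(),
    )

def is_stale(record: ProvenanceRecord, max_age_days: int, now=None) -> bool:
    """Return True if the data is older than max_age_days."""
    now = now or datetime.now(timezone.utc)
    age = now - datetime.fromisoformat(record.collected_at)
    return age > timedelta(days=max_age_days)

# Hypothetical survey export used only for illustration.
record = describe(b"id,age\n1,34\n2,51\n", source="survey_2024.csv")
record.add_step("dropped rows with missing age")
print(record.source, record.sha256[:12], record.transformations)
print("stale after 30 days?", is_stale(record, 30))
```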

Common pitfalls

Survivorship bias (studying only the survivors → wrong conclusions).
Sampling bias (over-/under-sampling subgroups).
Measurement error (faulty sensors, ambiguous survey wording).
Late-arriving data (features recorded after the event being predicted leak the target into training).

A solid collection strategy upfront prevents painful re-collection downstream.
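
One way to guard against the late-arriving-data pitfall is to keep only the feature values that had already been recorded at prediction time. The sketch below assumes each feature row carries a `recorded_at` timestamp; the feature names, values and times are hypothetical.

```python
from datetime import datetime

def features_known_at(rows, prediction_time):
    """Keep only feature values recorded at or before prediction_time."""
    return {
        row["name"]: row["value"]
        for row in rows
        if row["recorded_at"] <= prediction_time
    }

# Hypothetical feature log for one order.
rows = [
    {"name": "cart_value", "value": 120.0, "recorded_at": datetime(2024, 3, 1, 10, 0)},
    # Recorded days after the prediction would have been made: using it leaks the outcome.
    {"name": "refund_issued", "value": 1, "recorded_at": datetime(2024, 3, 5, 9, 0)},
]

prediction_time = datetime(2024, 3, 1, 12, 0)
safe = features_known_at(rows, prediction_time)
print(safe)
```

Only `cart_value` survives the filter, because `refund_issued` was not yet known when the prediction would have been made.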

Mind Map

Visual structure of the concept

DATA-COLLECTION STRATEGIES
├── Primary vs Secondary
│   ├── Primary — gathered fresh
│   └── Secondary — reuse existing
├── Sources
│   ├── Internal DBs (CRM, ERP, logs)
│   ├── Web scraping (Scrapy, BS4)
│   ├── APIs (REST, GraphQL)
│   ├── IoT / sensors
│   ├── Surveys & forms
│   ├── Public datasets (Kaggle, UCI)
│   ├── Crowdsourcing (MTurk)
│   └── Synthetic (GAN, simulation)
├── Sampling
│   ├── Simple random
│   ├── Stratified
│   ├── Cluster
│   ├── Systematic
│   └── Convenience (biased)
└── Quality & ethics
    ├── Representativeness
    ├── Privacy / consent
    ├── Provenance
    └── Freshness

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Identify two sources of data collection.

Primary — surveys, sensors, web scraping, instrumentation.
Secondary — public datasets (Kaggle, UCI), government open data, internal databases.

Q2. What is stratified sampling?

A sampling strategy where the population is divided into homogeneous strata (e.g., age groups, regions) and samples are drawn proportionally from each. It preserves the composition of subgroups so the sample mirrors the population.

Q3. Name two ethical concerns in data collection.

Consent and privacy — collecting personal data requires informed consent and regulatory compliance (GDPR, HIPAA).
Bias / fairness — sampling that excludes certain demographics produces biased models.

Part B (20 marks)

Q. Describe different data-collection strategies used in data science. Compare primary and secondary sources. Discuss sampling techniques and ethical considerations.

Primary vs Secondary.

Aspect	Primary	Secondary
Definition	Gathered fresh for the project	Pre-existing, reused
Cost	High (effort, infrastructure)	Low (often free)
Quality control	Direct	Limited
Tailoring	Exactly fits the question	May not fit the question
Time	Slow	Fast
Examples	Surveys, sensors, A/B logs	Public datasets, company DBs

Strategies.

Internal databases — CRM, billing, ERP, web/mobile logs. Richest, most reliable.
Web scraping — Scrapy / BeautifulSoup / Selenium extract data from web pages (prices, reviews, listings). Must respect terms of service.
APIs — Twitter, OpenWeather, Stripe, government APIs — structured and reliable; rate-limited.
IoT / sensors — temperature, GPS, accelerometers; needs streaming infrastructure (Kafka, MQTT) and time-series storage (e.g., InfluxDB).
Surveys and forms — Google Forms, Typeform, in-app questionnaires. Risk: response bias, self-report errors.
Public datasets — Kaggle, UCI ML Repository, World Bank, India open-data portal, Hugging Face Datasets.
Crowdsourcing — MTurk, Prolific. Used for labelling.
Synthetic data — generated via simulation or GANs when real data is scarce / sensitive.
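
Because APIs are rate-limited, a collector usually has to retry with increasing delays. The following is a minimal sketch of exponential backoff, not any particular API's client: `RateLimited`, `call_with_backoff` and `fake_fetch` are hypothetical names, and the fake endpoint stands in for a real HTTP call that answers with a rate-limit error.

```python
import time

class RateLimited(Exception):
    """Hypothetical error raised when the API reports that the rate limit was hit."""

def call_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on RateLimited, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)

# Fake endpoint that is rate-limited on the first two calls (illustration only).
calls = {"n": 0}

def fake_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return {"temp_c": 21.5}

waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(fake_fetch, sleep=waits.append)
print(result, waits)
```

Injecting `sleep` keeps the example fast to run; a real collector would leave the default `time.sleep` in place.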

Sampling techniques.

Method	Idea	When to use
Simple random	Equal probability for each item	When population is homogeneous
Stratified	Sample within homogeneous strata	When subgroups differ; preserves composition
Cluster	Randomly select whole groups, then include everyone within them	Cost-efficient field surveys
Systematic	Every $k$-th item	Convenient for streams
Convenience	Easy-access subset	Quick prototype; high bias risk
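
The stratified and systematic rows can be written as formulas. The symbols here are assumptions introduced for illustration: $N$ is the population size, $n$ the desired sample size, $N_h$ the size of stratum $h$, and $n_h$ the number drawn from it under proportional allocation.

$$
n_h = n \cdot \frac{N_h}{N}, \qquad \sum_h n_h = n, \qquad k = \left\lfloor \frac{N}{n} \right\rfloor
$$

As an illustrative case, with $N = 1000$ (700 urban, 300 rural) and $n = 100$, proportional allocation gives $n_{\text{urban}} = 70$ and $n_{\text{rural}} = 30$, and a systematic sample would take every $k = 10$-th item.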

Ethical considerations.

Informed consent — subjects must know how their data is used.
Privacy and anonymisation — strip personally identifiable information (PII), k-anonymity, differential privacy.
Regulatory compliance — GDPR (EU), HIPAA (US health), CCPA (California), India's DPDP Act 2023.
Bias and representation — ensure subgroups are fairly represented; avoid survivorship and selection bias.
Provenance and audit — record source, time, transformations.
Security — protect data at rest and in transit.
Purpose limitation — use data only for declared purposes.
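
The anonymisation point above can be illustrated with a simple k-anonymity check: after stripping direct identifiers, every combination of quasi-identifier values should be shared by at least k records. This is a sketch under assumptions; the records, field names and the choice of quasi-identifiers are hypothetical, and it does not cover differential privacy.

```python
from collections import Counter

def strip_pii(rows, pii_fields):
    """Drop directly identifying fields from every record."""
    return [{k: v for k, v in row.items() if k not in pii_fields} for row in rows]

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values.

    Assumes rows is non-empty.
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

# Hypothetical patient records used only for illustration.
rows = [
    {"name": "A", "age_band": "30-39", "pin_prefix": "682", "diagnosis": "flu"},
    {"name": "B", "age_band": "30-39", "pin_prefix": "682", "diagnosis": "asthma"},
    {"name": "C", "age_band": "40-49", "pin_prefix": "695", "diagnosis": "flu"},
    {"name": "D", "age_band": "40-49", "pin_prefix": "695", "diagnosis": "diabetes"},
    {"name": "E", "age_band": "40-49", "pin_prefix": "695", "diagnosis": "flu"},
]

released = strip_pii(rows, {"name"})
k = k_anonymity(released, ["age_band", "pin_prefix"])
print("dataset is", k, "-anonymous on age_band + pin_prefix")
```

If the result is below the required k, the usual remedy is to generalise the quasi-identifiers further (wider age bands, shorter postcode prefixes) or suppress the small groups.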

Common pitfalls.

Survivorship bias — studying only the survivors leads to wrong conclusions (e.g., the WWII bomber analysis that examined only the aircraft that returned).
Sampling bias — e.g., internet surveys exclude offline populations.
Measurement error — faulty sensors or ambiguous survey wording.
Late-arriving data — using post-event features at training causes leakage.

A well-thought-out strategy upfront prevents painful re-collection downstream and produces models whose performance survives in production.
