PGD01C02
Module 2 · Data Collection and Pre-Processing

Data Collection Strategies

Core Titles
Key headlines and terms for quick recall
  • Primary collection — gather new data for the project
  • Secondary collection — reuse existing data
  • Web scraping — automated extraction from websites
  • APIs — structured data feeds (REST, GraphQL)
  • Sensors / IoT — instrumented data
  • Surveys & forms
  • Public datasets — Kaggle, UCI, World Bank, government open data
  • Sampling strategies — simple, stratified, cluster, systematic
Basic Idea
What it is, why it matters, how it works

What is a data-collection strategy?

A data-collection strategy is the plan that decides what data to gather, how, from where and how much, so that a specific question can be answered with adequate quality and representativeness at an acceptable cost.

Primary vs Secondary

| Aspect | Primary | Secondary |
| --- | --- | --- |
| Source | Gathered fresh for the project | Pre-existing |
| Cost | Higher | Lower |
| Control over quality | High | Limited |
| Examples | Surveys, sensors, A/B tests, scraping | Kaggle, UCI, gov datasets, company DB |

Strategies by source

1. Internal company databases. Transactional (OLTP), CRM, ERP, web/mobile event logs. Usually the richest, most reliable starting point.

2. Web scraping. Use BeautifulSoup, Scrapy, Selenium to extract data from web pages. Respect robots.txt and terms of service. Useful for prices, reviews, public listings.

3. APIs. Programmatic feeds (REST, GraphQL, gRPC). Examples: Twitter API, OpenWeather, Stripe, stock-market APIs. Reliable and structured, but rate-limited: the provider caps how many requests a client may send per time window.

4. IoT / Sensors. Streaming data from devices — temperature, GPS, accelerometers. Need event-streaming infrastructure (Kafka, MQTT) and time-series storage.

5. Surveys and forms. Direct user input — Google Forms, SurveyMonkey, in-app questionnaires. Risk of selection bias and self-report errors.

6. Public datasets. Kaggle, UCI ML Repository, World Bank, Indian government open-data portal, Hugging Face Datasets. Good for benchmarking and prototyping.

7. Crowdsourcing. Amazon Mechanical Turk, Prolific — humans label data, common for supervised learning.

8. Synthetic data. When real data is scarce or private, generate it via simulation or generative models (GANs, diffusion models). Useful for rare-event modelling.
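
As an illustration of the "respect robots.txt" advice for web scraping, here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt content, the bot name and the example.com URLs are all hypothetical, and the file is inlined so the sketch runs offline; a real scraper would load the site's own robots.txt and also read its terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without a network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

AGENT = "course-demo-bot"  # hypothetical user-agent name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer allow/deny queries."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask before fetching: public listings are allowed, /private/ is not.
allowed_listing = parser.can_fetch(AGENT, "https://example.com/listings/page1")
allowed_private = parser.can_fetch(AGENT, "https://example.com/private/data")

# Seconds the site asks crawlers to wait between requests (None if unspecified).
delay = parser.crawl_delay(AGENT)

print(allowed_listing, allowed_private, delay)
```

A scraper built on BeautifulSoup, Scrapy or Selenium could run a check like this before each request and sleep for the reported delay between requests.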

Sampling strategies

When the full population is too large or expensive, sample wisely:

  • Simple random sample: every item has an equal probability of selection.
  • Stratified: sample within each stratum (age groups, regions) to preserve the population's composition.
  • Cluster: randomly select whole groups (cities, schools), then include everyone within the selected groups.
  • Systematic: take every k-th item.
  • Convenience: use whatever is easy to access (risk: biased).
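
The first four strategies above can be sketched with Python's standard library. This is a minimal illustration, not a production sampler: the population of 100 records, the field names (region, city) and the function names are all assumptions made up for the example.

```python
import random
from collections import defaultdict


def simple_random_sample(items, n, rng):
    """Every item has the same chance of being picked."""
    return rng.sample(items, n)


def stratified_sample(items, key, fraction, rng):
    """Sample the same fraction from every stratum to preserve composition."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def cluster_sample(items, key, n_clusters, rng):
    """Pick whole clusters at random, then keep every member of those clusters."""
    clusters = defaultdict(list)
    for item in items:
        clusters[key(item)].append(item)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]


def systematic_sample(items, k, rng):
    """Start at a random offset below k, then take every k-th item."""
    start = rng.randrange(k)
    return items[start::k]


# Hypothetical population: 60 urban and 40 rural records spread over 5 cities.
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [
    {"id": i, "region": "urban" if i < 60 else "rural", "city": f"city{i % 5}"}
    for i in range(100)
]

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(population, lambda r: r["region"], 0.1, rng)
clus = cluster_sample(population, lambda r: r["city"], 2, rng)
syst = systematic_sample(population, 10, rng)

print(len(srs), len(strat), len(clus), len(syst))
```

With a 10% fraction, the stratified sample always contains 6 urban and 4 rural records, mirroring the 60/40 split, whereas a simple random sample of 10 only matches that split on average.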

Quality and ethics

  • Representativeness — does the sample reflect the target population?
  • Bias — is the data skewed toward a subset (e.g., only urban users)?
  • Consent and privacy: comply with regulations such as GDPR, HIPAA, and India's IT Act 2000 and DPDP Act 2023.
  • Data provenance — source, time, transformations recorded.
  • Freshness — old data may not reflect current behaviour.
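
One way to make provenance and freshness concrete is to store a small record next to each dataset. The sketch below is an assumption about what such a record could look like; the class name, the field names, the file name and the age thresholds are hypothetical, not a standard.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance entry: where the data came from and what was done to it."""
    source: str
    collected_at: datetime
    sha256: str  # fingerprint of the raw bytes, to detect silent changes
    transformations: List[str] = field(default_factory=list)

    def add_step(self, description: str) -> None:
        """Log one transformation applied to the data."""
        self.transformations.append(description)

    def is_stale(self, now: datetime, max_age: timedelta) -> bool:
        """True when the data is older than the allowed age."""
        return now - self.collected_at > max_age


def describe(raw: bytes, source: str, collected_at: datetime) -> ProvenanceRecord:
    """Create a provenance record for a raw payload."""
    return ProvenanceRecord(source, collected_at, hashlib.sha256(raw).hexdigest())


# Hypothetical survey export collected on 1 January 2024.
raw = b"id,age\n1,34\n2,\n3,51\n"
record = describe(raw, "survey_export.csv", datetime(2024, 1, 1, tzinfo=timezone.utc))
record.add_step("dropped rows with missing age")

now = datetime(2024, 3, 1, tzinfo=timezone.utc)  # 60 days after collection
stale_30 = record.is_stale(now, timedelta(days=30))
stale_90 = record.is_stale(now, timedelta(days=90))
print(stale_30, stale_90)
```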

Common pitfalls

  • Survivorship bias (studying only the survivors → wrong conclusions).
  • Sampling bias (over-/under-sampling subgroups).
  • Measurement error (faulty sensors, ambiguous survey wording).
  • Late-arriving data (features recorded after the event being predicted cause target leakage).
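
The late-arriving-data pitfall can be guarded against by keeping only the features that were already recorded at prediction time. The sketch below assumes a hypothetical churn-prediction setting; the event names, the recorded_at field and the cutoff date are invented for illustration.

```python
from datetime import datetime


def features_available_at(events, cutoff):
    """Keep only events recorded at or before the prediction cutoff."""
    return [e for e in events if e["recorded_at"] <= cutoff]


# Hypothetical customer events; the prediction is made on 1 March 2024.
cutoff = datetime(2024, 3, 1)
events = [
    {"name": "login", "recorded_at": datetime(2024, 2, 20)},
    {"name": "support_ticket", "recorded_at": datetime(2024, 2, 25)},
    # Recorded after the outcome: using it in training would leak the target.
    {"name": "cancellation_reason", "recorded_at": datetime(2024, 3, 5)},
]

usable = features_available_at(events, cutoff)
usable_names = [e["name"] for e in usable]
print(usable_names)
```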

A solid collection strategy upfront prevents painful re-collection downstream.

Mind Map
Visual structure of the concept
DATA-COLLECTION STRATEGIES
├── Primary vs Secondary
│   ├── Primary — gathered fresh
│   └── Secondary — reuse existing
├── Sources
│   ├── Internal DBs (CRM, ERP, logs)
│   ├── Web scraping (Scrapy, BS4)
│   ├── APIs (REST, GraphQL)
│   ├── IoT / sensors
│   ├── Surveys & forms
│   ├── Public datasets (Kaggle, UCI)
│   ├── Crowdsourcing (MTurk)
│   └── Synthetic (GAN, simulation)
├── Sampling
│   ├── Simple random
│   ├── Stratified
│   ├── Cluster
│   ├── Systematic
│   └── Convenience (biased)
└── Quality & ethics
    ├── Representativeness
    ├── Privacy / consent
    ├── Provenance
    └── Freshness
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Identify two sources of data collection.

  1. Primary — surveys, sensors, web scraping, instrumentation.
  2. Secondary — public datasets (Kaggle, UCI), government open data, internal databases.

Q2. What is stratified sampling?

A sampling strategy where the population is divided into homogeneous strata (e.g., age groups, regions) and samples are drawn proportionally from each. It preserves the composition of subgroups so the sample mirrors the population.

Q3. Name two ethical concerns in data collection.

  • Consent and privacy — collecting personal data requires informed consent and regulatory compliance (GDPR, HIPAA).
  • Bias / fairness — sampling that excludes certain demographics produces biased models.

Part B (20 marks)

Q. Describe different data-collection strategies used in data science. Compare primary and secondary sources. Discuss sampling techniques and ethical considerations.

Primary vs Secondary.

| Aspect | Primary | Secondary |
| --- | --- | --- |
| Definition | Gathered fresh for the project | Pre-existing, reused |
| Cost | High (effort, infrastructure) | Low (often free) |
| Quality control | Direct | Limited |
| Tailoring | Exactly fits questions | May not |
| Time | Slow | Fast |
| Examples | Surveys, sensors, A/B logs | Public datasets, company DBs |

Strategies.

  1. Internal databases: CRM, billing, ERP, web/mobile logs. Usually the richest and most reliable source.
  2. Web scraping: Scrapy, BeautifulSoup or Selenium extract data from web pages (prices, reviews, listings). Must respect robots.txt and terms of service.
  3. APIs: Twitter, OpenWeather, Stripe, government APIs. Structured and reliable, but rate-limited.
  4. IoT / sensors: temperature, GPS, accelerometers. Needs streaming infrastructure (Kafka, MQTT) and time-series storage (e.g., InfluxDB).
  5. Surveys and forms: Google Forms, Typeform, in-app questionnaires. Risks: response bias, self-report errors.
  6. Public datasets: Kaggle, UCI ML Repository, World Bank, India open-data portal, Hugging Face Datasets.
  7. Crowdsourcing: MTurk, Prolific. Used for labelling.
  8. Synthetic data: generated via simulation or GANs when real data is scarce or sensitive.
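
Because APIs are rate-limited, collection code usually retries with increasing waits rather than hammering the endpoint. The sketch below shows exponential backoff against a fake API so that it runs offline. The fetch callable, the (status, payload) return shape and the payload itself are assumptions for illustration; HTTP status 429 ("Too Many Requests") is the standard rate-limit signal.

```python
import time


def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each 429 response.

    fetch is a hypothetical callable returning (status_code, payload).
    """
    for attempt in range(max_retries + 1):
        status, payload = fetch()
        if status == 200:
            return payload
        if status != 429 or attempt == max_retries:
            raise RuntimeError(f"request failed with status {status}")
        sleep(base_delay * 2 ** attempt)  # waits: base, 2*base, 4*base, ...


def make_fake_api(failures):
    """Fake endpoint that is rate-limited for the first `failures` calls."""
    calls = {"n": 0}

    def fetch():
        calls["n"] += 1
        if calls["n"] <= failures:
            return 429, None
        return 200, {"temp_c": 21.5}  # hypothetical weather payload

    return fetch


# Record the waits instead of really sleeping, so the demo finishes instantly.
waits = []
result = fetch_with_backoff(make_fake_api(2), base_delay=0.5, sleep=waits.append)
print(result, waits)
```

Injecting the sleep function keeps the example fast and testable; real collection code would leave the default time.sleep in place and honour any retry hints the provider documents.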

Sampling techniques.

| Method | Idea | When to use |
| --- | --- | --- |
| Simple random | Equal probability for each item | When population is homogeneous |
| Stratified | Sample within homogeneous strata | When subgroups differ; preserves composition |
| Cluster | Sample groups, then everyone within | Cost-efficient field surveys |
| Systematic | Every k-th item | Convenient for streams |
| Convenience | Easy-access subset | Quick prototyping; high bias risk |
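
A short worked example may help with the stratified and systematic rows. The notation and the numbers below are assumptions introduced for illustration: N is the population size, N_h the size of stratum h, n the total sample size and n_h the number drawn from stratum h under proportional allocation.

```latex
% Proportional allocation: each stratum contributes in proportion to its size.
\[
  n_h = n \cdot \frac{N_h}{N}
\]
% Hypothetical numbers: N = 10000 (6000 urban, 4000 rural), n = 500.
\[
  n_{\text{urban}} = 500 \cdot \frac{6000}{10000} = 300, \qquad
  n_{\text{rural}} = 500 \cdot \frac{4000}{10000} = 200
\]
% Systematic sampling interval for the same N and n.
\[
  k = \left\lfloor \frac{N}{n} \right\rfloor
    = \left\lfloor \frac{10000}{500} \right\rfloor = 20
\]
```

So the stratified sample keeps the 60/40 urban-to-rural split, and the systematic sample takes every 20th record after a random start.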

Ethical considerations.

  1. Informed consent: subjects must know how their data is used.
  2. Privacy and anonymisation: strip personally identifiable information (PII) and apply techniques such as k-anonymity or differential privacy.
  3. Regulatory compliance: GDPR (EU), HIPAA (US health), CCPA (California), India's DPDP Act 2023.
  4. Bias and representation: ensure subgroups are fairly represented; avoid survivorship and selection bias.
  5. Provenance and audit: record source, time, transformations.
  6. Security: protect data at rest and in transit.
  7. Purpose limitation: use data only for declared purposes.
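
To illustrate the anonymisation point, here is a minimal sketch that strips direct identifiers and then measures k-anonymity: the size of the smallest group of records sharing the same quasi-identifier values. The records, the column names and the choice of quasi-identifiers are hypothetical; real releases need a careful, domain-specific choice of quasi-identifiers.

```python
from collections import Counter


def strip_pii(rows, pii_fields):
    """Return copies of the rows without the direct identifiers."""
    return [{k: v for k, v in row.items() if k not in pii_fields} for row in rows]


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group sharing one quasi-identifier combination."""
    if not rows:
        return 0
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical health-survey records.
records = [
    {"name": "A", "age_band": "30-39", "region": "North", "condition": "asthma"},
    {"name": "B", "age_band": "30-39", "region": "North", "condition": "none"},
    {"name": "C", "age_band": "40-49", "region": "South", "condition": "diabetes"},
    {"name": "D", "age_band": "40-49", "region": "South", "condition": "none"},
    {"name": "E", "age_band": "40-49", "region": "South", "condition": "asthma"},
]

released = strip_pii(records, pii_fields={"name"})
k = k_anonymity(released, quasi_identifiers=["age_band", "region"])
print(k)
```

Here every released record shares its age band and region with at least one other record, so the table is 2-anonymous but not 3-anonymous; coarser bands would raise k at the cost of detail.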

Common pitfalls.

  • Survivorship bias: studying only the survivors leads to wrong conclusions (e.g., judging WWII bomber armour only from the planes that returned).
  • Sampling bias: over- or under-sampling subgroups (e.g., Internet surveys exclude offline populations).
  • Measurement error: faulty sensors or ambiguous survey wording.
  • Late-arriving data: using features recorded after the event being predicted causes target leakage at training time.

A well-thought-out strategy upfront prevents painful re-collection downstream and produces models whose performance survives in production.