Data Collection Strategies
Core Titles
Key headlines and terms for quick recall
- Primary collection — gather new data for the project
- Secondary collection — reuse existing data
- Web scraping — automated extraction from websites
- APIs — structured data feeds (REST, GraphQL)
- Sensors / IoT — instrumented data
- Surveys & forms
- Public datasets — Kaggle, UCI, World Bank, government open data
- Sampling strategies — simple, stratified, cluster, systematic
Basic Idea
What it is, why it matters, how it works
What is a data-collection strategy?
A data-collection strategy is the plan that decides what data to gather, how, from where and how much — to answer a specific question with adequate quality, representativeness and cost.
Primary vs Secondary
| Aspect | Primary | Secondary |
|---|---|---|
| Source | Gathered fresh for the project | Pre-existing |
| Cost | Higher | Lower |
| Control over quality | High | Limited |
| Examples | Surveys, sensors, A/B tests, scraping | Kaggle, UCI, gov datasets, company DB |
Strategies by source
1. Internal company databases. Transactional (OLTP), CRM, ERP, web/mobile event logs. Usually the richest, most reliable starting point.
2. Web scraping. Use BeautifulSoup, Scrapy, Selenium to extract data from web pages. Respect robots.txt and terms of service. Useful for prices, reviews, public listings.
3. APIs. Programmatic feeds — REST, GraphQL, gRPC. Examples: Twitter API, OpenWeather, Stripe, stock-market APIs. Reliable and structured, but usually rate-limited.
4. IoT / Sensors. Streaming data from devices — temperature, GPS, accelerometers. Need event-streaming infrastructure (Kafka, MQTT) and time-series storage.
5. Surveys and forms. Direct user input — Google Forms, SurveyMonkey, in-app questionnaires. Risk of selection bias and self-report errors.
6. Public datasets. Kaggle, UCI ML Repository, World Bank, Indian government open-data portal, Hugging Face Datasets. Good for benchmarking and prototyping.
7. Crowdsourcing. Amazon Mechanical Turk, Prolific — humans label data, common for supervised learning.
8. Synthetic data. When real data is scarce / private — generate via simulation or generative models (GANs, diffusion). Useful for rare-event modelling.
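The advice in item 2 to respect robots.txt can be checked in code before any page is requested. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the bot name `example-research-bot`, and the `example.com` URLs are all hypothetical and only there for illustration. A real scraper would fetch the file from the target site and would also need to check the site's terms of service, which no parser can do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption); a real scraper would
# download this from https://<site>/robots.txt instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

AGENT = "example-research-bot"  # hypothetical user-agent name
parser = build_parser(ROBOTS_TXT)

# Ask whether this agent may fetch each URL under the parsed rules.
allowed = parser.can_fetch(AGENT, "https://example.com/products/1")
blocked = parser.can_fetch(AGENT, "https://example.com/private/report")

# Seconds to wait between requests, if the site declares a crawl delay.
delay = parser.crawl_delay(AGENT)

print(allowed, blocked, delay)
```

Checking `can_fetch` before each request and sleeping for the declared crawl delay keeps a scraper within the limits the site has published.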
Sampling strategies
When the full population is too large or expensive, sample wisely:
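- Simple random sample — every item has an equal probability of being selected.
- Stratified — divide the population into strata (age groups, regions) and sample within each to preserve composition.
- Cluster — randomly select whole groups (cities, schools), then include every member of each selected group.
- Systematic — take every k-th item after a random start.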
- Convenience — easy access (risk: biased).
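The four probability-based strategies above can be sketched with the standard library alone. This is an illustrative sketch, not a survey-grade implementation: the toy population (100 people, 60 urban and 40 rural, in 10 schools of 10), the function names, and the seed are all assumptions made for the example.

```python
import random
from collections import defaultdict

def simple_random(population, n, rng):
    """Every item has the same chance of selection."""
    return rng.sample(population, n)

def systematic(population, k, rng):
    """Take every k-th item after a random start in [0, k)."""
    start = rng.randrange(k)
    return population[start::k]

def stratified(population, key, fraction, rng):
    """Sample the same fraction inside each stratum to preserve composition."""
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample

def cluster(population, key, n_clusters, rng):
    """Pick whole clusters at random, then keep every member of each."""
    clusters = defaultdict(list)
    for item in population:
        clusters[key(item)].append(item)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for c in chosen for item in clusters[c]]

# Hypothetical toy population used only for this illustration.
population = [
    {"id": i, "region": "urban" if i < 60 else "rural", "school": i // 10}
    for i in range(100)
]
rng = random.Random(42)  # fixed seed so the run is repeatable

print(len(simple_random(population, 10, rng)))
print([p["id"] for p in systematic(population, 10, rng)])
print(len(stratified(population, lambda p: p["region"], 0.1, rng)))
print(len(cluster(population, lambda p: p["school"], 2, rng)))
```

With a 10% fraction, the stratified sample contains 6 urban and 4 rural people, mirroring the 60/40 split of the population, whereas a simple random sample of 10 only matches that split on average.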
Quality and ethics
- Representativeness — does the sample reflect the target population?
- Bias — is the data skewed toward a subset (e.g., only urban users)?
- Consent and privacy — GDPR (EU), HIPAA (US health data), India's IT Act 2000 and DPDP Act 2023.
- Data provenance — source, time, transformations recorded.
- Freshness — old data may not reflect current behaviour.
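Data provenance is easier to maintain when a small record is written at collection time. The sketch below shows one possible shape for such a record; the field names, the `provenance_record` helper, and the example URL are assumptions for illustration, not a standard schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(raw: bytes, source: str, transformations: list[str]) -> dict:
    """Describe where a raw payload came from and what was done to it."""
    return {
        "source": source,
        # UTC timestamp of collection, in ISO 8601 form
        "collected_at": datetime.now(timezone.utc).isoformat(),
        # Content hash: lets a later reader verify the file is unchanged
        "sha256": hashlib.sha256(raw).hexdigest(),
        "n_bytes": len(raw),
        "transformations": list(transformations),
    }

# Hypothetical payload and source, used only for this illustration.
raw = b"id,temp_c\n1,21.5\n2,22.0\n"
record = provenance_record(
    raw,
    source="https://example.com/sensors.csv",
    transformations=["dropped rows with null temp_c"],
)
print(json.dumps(record, indent=2))
```

Storing the record next to the data also supports the freshness check: the `collected_at` field shows how old a dataset is before it is reused.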
Common pitfalls
- Survivorship bias (studying only the survivors → wrong conclusions).
- Sampling bias (over-/under-sampling subgroups).
- Measurement error (faulty sensors, ambiguous survey wording).
- Late-arriving data (features recorded after the event being predicted leak the target into training).
A solid collection strategy upfront prevents painful re-collection downstream.
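One common guard against the late-arriving data pitfall is a point-in-time filter: when building features for a prediction made at a given moment, keep only what had been recorded before that moment. The sketch below assumes each event carries a `recorded_at` timestamp; the event names and dates are invented for illustration.

```python
from datetime import datetime

def features_as_of(events, cutoff):
    """Keep only events recorded strictly before the prediction cutoff."""
    return [e for e in events if e["recorded_at"] < cutoff]

# Hypothetical event log for one customer (illustrative values only).
events = [
    {"name": "signup", "recorded_at": datetime(2024, 1, 5)},
    {"name": "support_ticket", "recorded_at": datetime(2024, 2, 1)},
    # Recorded after the prediction date: using it would leak the outcome.
    {"name": "refund_issued", "recorded_at": datetime(2024, 3, 10)},
]

cutoff = datetime(2024, 3, 1)  # the moment the prediction is made
usable = features_as_of(events, cutoff)
print([e["name"] for e in usable])
```

Applying the same cutoff rule at training time and at serving time keeps the two feature sets consistent.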
Mind Map
Visual structure of the concept
DATA-COLLECTION STRATEGIES
├── Primary vs Secondary
│ ├── Primary — gathered fresh
│ └── Secondary — reuse existing
├── Sources
│ ├── Internal DBs (CRM, ERP, logs)
│ ├── Web scraping (Scrapy, BS4)
│ ├── APIs (REST, GraphQL)
│ ├── IoT / sensors
│ ├── Surveys & forms
│ ├── Public datasets (Kaggle, UCI)
│ ├── Crowdsourcing (MTurk)
│ └── Synthetic (GAN, simulation)
├── Sampling
│ ├── Simple random
│ ├── Stratified
│ ├── Cluster
│ ├── Systematic
│ └── Convenience (biased)
└── Quality & ethics
  ├── Representativeness
  ├── Privacy / consent
  ├── Provenance
  └── Freshness
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions
Part A (2 marks each)
Q1. Identify two sources of data collection.
- Primary — surveys, sensors, web scraping, instrumentation.
- Secondary — public datasets (Kaggle, UCI), government open data, internal databases.
Q2. What is stratified sampling? A sampling strategy where the population is divided into homogeneous strata (e.g., age groups, regions) and samples are drawn proportionally from each. It preserves the composition of subgroups so the sample mirrors the population.
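Drawing samples proportionally can be written as a formula. With total population size N, stratum size N_h, and total sample size n (symbols introduced here for illustration), proportional allocation gives each stratum the share below. The numbers in the worked example are made up.

```latex
n_h = n \cdot \frac{N_h}{N}, \qquad \sum_h n_h = n

% Illustrative example: N = 1000, n = 100, three strata
N_1 = 600 \;\Rightarrow\; n_1 = 100 \cdot \tfrac{600}{1000} = 60
N_2 = 300 \;\Rightarrow\; n_2 = 100 \cdot \tfrac{300}{1000} = 30
N_3 = 100 \;\Rightarrow\; n_3 = 100 \cdot \tfrac{100}{1000} = 10
```

The stratum sample sizes add up to 60 + 30 + 10 = 100, so the sample keeps the 60/30/10 composition of the population.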
Q3. Name two ethical concerns in data collection.
- Consent and privacy — collecting personal data requires informed consent and regulatory compliance (GDPR, HIPAA).
- Bias / fairness — sampling that excludes certain demographics produces biased models.
Part B (20 marks)
Q. Describe different data-collection strategies used in data science. Compare primary and secondary sources. Discuss sampling techniques and ethical considerations.
Primary vs Secondary.
| Aspect | Primary | Secondary |
|---|---|---|
| Definition | Gathered fresh for the project | Pre-existing, reused |
| Cost | High (effort, infrastructure) | Low (often free) |
| Quality control | Direct | Limited |
| Tailoring | Exactly fits questions | May not |
| Time | Slow | Fast |
| Examples | Surveys, sensors, A/B logs | Public datasets, company DBs |
Strategies.
- Internal databases — CRM, billing, ERP, web/mobile logs. Richest, most reliable.
- Web scraping — Scrapy / BeautifulSoup / Selenium extract data from web pages (prices, reviews, listings). Must respect terms of service.
- APIs — Twitter, OpenWeather, Stripe, government APIs — structured and reliable; rate-limited.
- IoT / sensors — temperature, GPS, accelerometers; needs streaming infrastructure (Kafka, MQTT) and time-series storage (e.g., InfluxDB).
- Surveys and forms — Google Forms, Typeform, in-app questionnaires. Risk: response bias, self-report errors.
- Public datasets — Kaggle, UCI ML Repository, World Bank, India open-data portal, Hugging Face Datasets.
- Crowdsourcing — MTurk, Prolific. Used for labelling.
- Synthetic data — generated via simulation or GANs when real data is scarce / sensitive.
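Because APIs are rate-limited, a collector typically retries with increasing pauses instead of failing on the first refusal. The following is a minimal sketch of exponential backoff. The `RateLimited` exception, the `call_with_backoff` helper, and the fake endpoint are all hypothetical; a real client would raise its own error type when the server answers with HTTP 429.

```python
import time

class RateLimited(Exception):
    """Hypothetical error a client raises when the API refuses a request."""

def call_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on RateLimited, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)

# Fake endpoint for illustration: refuses twice, then succeeds.
calls = {"n": 0}

def fake_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return {"temp_c": 21.5}

delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(fake_fetch, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter is a design choice that lets the retry logic be exercised without real waiting; production code would leave the default `time.sleep` and often add random jitter to the delay.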
Sampling techniques.
| Method | Idea | When to use |
|---|---|---|
| Simple random | Equal probability for each item | When population is homogeneous |
| Stratified | Sample within homogeneous strata | When subgroups differ; preserves composition |
| Cluster | Randomly select whole groups, then include everyone within them | Cost-efficient field surveys |
| Systematic | Every k-th item after a random start | Convenient for streams |
| Convenience | Easy-access subset | Quick prototyping; high bias risk |
Ethical considerations.
- Informed consent — subjects must know how their data is used.
- Privacy and anonymisation — strip personally identifiable information (PII), k-anonymity, differential privacy.
- Regulatory compliance — GDPR (EU), HIPAA (US health), CCPA (California), India's DPDP Act 2023.
- Bias and representation — ensure subgroups are fairly represented and guard against survivorship and selection bias.
- Provenance and audit — record source, time, transformations.
- Security — protect data at rest and in transit.
- Purpose limitation — use data only for declared purposes.
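The k-anonymity idea mentioned under privacy can be checked mechanically: a table is k-anonymous when every combination of quasi-identifier values is shared by at least k rows. The sketch below computes that k for a small table; the column names and rows are invented for illustration, and choosing which columns count as quasi-identifiers is a judgment the code cannot make.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifiers.

    Raises ValueError if rows is empty.
    """
    groups = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    return min(groups.values())

# Hypothetical records with direct identifiers already removed.
rows = [
    {"age_band": "20-29", "pin_prefix": "600", "diagnosis": "A"},
    {"age_band": "20-29", "pin_prefix": "600", "diagnosis": "B"},
    {"age_band": "30-39", "pin_prefix": "600", "diagnosis": "A"},
    {"age_band": "30-39", "pin_prefix": "600", "diagnosis": "C"},
    {"age_band": "30-39", "pin_prefix": "600", "diagnosis": "A"},
]

k = k_anonymity(rows, ["age_band", "pin_prefix"])
print(k)
```

Here the smallest group has two rows, so the table is 2-anonymous with respect to age band and PIN prefix. A result of 1 would mean at least one person is uniquely identifiable from those columns alone.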
Common pitfalls.
- Survivorship bias — studying only the survivors leads to wrong conclusions (classic example: analysing damage only on the WWII bombers that returned).
- Sampling bias — Internet surveys exclude offline populations.
- Measurement bias — faulty sensors or ambiguous survey wording.
- Late-arriving data — using post-event features at training causes leakage.
A well-thought-out strategy upfront prevents painful re-collection downstream and produces models whose performance survives in production.