PGD01C02
Module 1 · Introduction to Data Science

Stages in a Data Science Project

Core Titles
Key headlines and terms for quick recall
  • Data Science Life Cycle — end-to-end process
  • CRISP-DM: Business understanding → Data understanding → Data preparation → Modelling → Evaluation → Deployment
  • OSEMN: Obtain, Scrub, Explore, Model, iNterpret
  • Business understanding — define the problem & success metric
  • Data understanding & EDA
  • Data preparation / feature engineering
  • Modelling & training
  • Evaluation & validation
  • Deployment & monitoring (MLOps)
  • Iteration — repeat as data drifts
Basic Idea
What it is, why it matters, how it works

Why we need a process

A data-science project is not just "fit a model" — without structure, projects drown in unclear goals, dirty data, leaky validation, and models that never reach production. Two industry-standard frameworks structure the work: CRISP-DM and OSEMN.

CRISP-DM — Cross-Industry Standard Process for Data Mining

The most widely adopted framework. Six iterative phases:

1. Business Understanding.

  • Define the business problem in plain language: "Reduce customer churn by 5%."
  • Translate into a data problem: "Predict which customers will churn in the next 30 days."
  • Define success metric (e.g., AUC ≥ 0.80, ROI ≥ 3×).
  • Identify constraints (privacy, latency, budget).

2. Data Understanding.

  • Collect initial data — internal databases, third-party feeds, public sources.
  • Describe and explore (EDA): distributions, correlations, missing values.
  • Verify data quality — is it complete, recent, accurate?
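A minimal sketch of the "describe and explore" step, using only the Python standard library. The records and column names (tenure_months, monthly_spend, income) are hypothetical and chosen for illustration; in practice a data-frame library such as pandas would normally do this work.

```python
from statistics import mean, median

# Hypothetical toy records; None marks a missing value.
customers = [
    {"tenure_months": 3, "monthly_spend": 20.0, "income": 41000},
    {"tenure_months": 24, "monthly_spend": 55.0, "income": None},
    {"tenure_months": 12, "monthly_spend": 35.0, "income": 52000},
    {"tenure_months": 36, "monthly_spend": 80.0, "income": 67000},
]

def missing_rate(rows, column):
    """Fraction of rows where the column is missing (None)."""
    return sum(row[column] is None for row in rows) / len(rows)

def summarise(rows, column):
    """Min, median, mean and max over the non-missing values."""
    values = [row[column] for row in rows if row[column] is not None]
    return {"min": min(values), "median": median(values),
            "mean": mean(values), "max": max(values)}

def pearson(rows, x, y):
    """Pearson correlation over rows where both columns are present."""
    pairs = [(r[x], r[y]) for r in rows
             if r[x] is not None and r[y] is not None]
    mx = mean(p[0] for p in pairs)
    my = mean(p[1] for p in pairs)
    cov = sum((a - mx) * (b - my) for a, b in pairs)
    var_x = sum((a - mx) ** 2 for a, _ in pairs)
    var_y = sum((b - my) ** 2 for _, b in pairs)
    return cov / (var_x * var_y) ** 0.5

for col in ("tenure_months", "monthly_spend", "income"):
    print(col, missing_rate(customers, col), summarise(customers, col))
print("corr(tenure, spend) =",
      round(pearson(customers, "tenure_months", "monthly_spend"), 3))
```

The missing-value rate feeds directly into the quality check: a column with a high rate needs an imputation plan, or should be dropped, before modelling.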

3. Data Preparation. (often cited as 60–80% of project effort)

  • Clean — handle missing, outliers, duplicates.
  • Integrate — merge multiple sources.
  • Transform — scaling, encoding, feature engineering.
  • Reduce — feature selection, dimensionality reduction.
  • Split — train / validation / test.
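The order of "transform" and "split" matters in practice: scaling statistics should be learned from the training partition only, otherwise information from validation and test data leaks into training. The sketch below illustrates this with a single hypothetical numeric feature; the helper names and split fractions are assumptions, not part of CRISP-DM.

```python
import random
from statistics import mean, pstdev

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into train / validation / test partitions."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def fit_standardiser(values):
    """Learn mean and standard deviation from the training values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against a constant column
    return mu, sigma

def standardise(values, mu, sigma):
    """Apply previously fitted statistics to any partition."""
    return [(v - mu) / sigma for v in values]

# Hypothetical single numeric feature: 10, 20, ..., 100.
spend = [float(x) for x in range(10, 110, 10)]

train, val, test = train_val_test_split(spend)
mu, sigma = fit_standardiser(train)      # statistics come from train only
train_z = standardise(train, mu, sigma)
val_z = standardise(val, mu, sigma)      # reuse the train statistics
test_z = standardise(test, mu, sigma)
print(len(train), len(val), len(test))
```

A random shuffle suits data without a time dimension; for time-stamped data such as churn snapshots, a split by date is usually the safer choice.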

4. Modelling.

  • Select candidate algorithms (logistic regression, random forest, XGBoost, neural net).
  • Train each on training set.
  • Tune hyperparameters via cross-validation.
  • Compare performance.
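The train, cross-validate and compare loop can be sketched without any library. The two "candidate algorithms" below (a majority-class baseline and a one-feature threshold rule) are deliberately trivial stand-ins for logistic regression or XGBoost, and the data is hypothetical; a real project would typically rely on a library such as scikit-learn for this.

```python
from statistics import mean

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size

def fit_majority(xs, ys):
    """Baseline: always predict the most common training label."""
    label = int(sum(ys) * 2 >= len(ys))
    return lambda x: label

def fit_stump(xs, ys):
    """Pick the threshold on x that maximises training accuracy."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(xs)):
        acc = mean(int((x >= t) == bool(y)) for x, y in zip(xs, ys))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda x: int(x >= best_t)

def cross_val_accuracy(fit, xs, ys, k=5):
    """Mean validation accuracy over k folds; the model is refit per fold."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(mean(int(model(xs[i]) == ys[i]) for i in val_idx))
    return mean(scores)

# Hypothetical toy data: support-ticket counts and churn labels (1 = churned).
tickets = [0, 5, 1, 6, 2, 7, 0, 8, 1, 9]
churned = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

candidates = {"majority baseline": fit_majority, "threshold stump": fit_stump}
cv_scores = {name: cross_val_accuracy(fit, tickets, churned)
             for name, fit in candidates.items()}
best = max(cv_scores, key=cv_scores.get)
print(cv_scores, "->", best)
```

Keeping a simple baseline among the candidates makes the comparison meaningful: a complex model is only worth deploying if it clearly beats the baseline on validation folds.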

5. Evaluation.

  • Evaluate on held-out test set.
  • Check against business success criteria — not just statistical metrics.
  • Review unintended consequences (bias, fairness).
  • Decide: deploy / iterate / abandon.
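As a rough illustration of the held-out check, AUC can be computed directly from its definition: the probability that a randomly chosen positive is scored above a randomly chosen negative. The labels, scores and the two-way decision rule below are hypothetical; the 0.80 target is the example success metric from the Business Understanding phase, and a real decision would also weigh business and fairness criteria.

```python
AUC_TARGET = 0.80  # example success metric from the Business Understanding phase

def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a win (the Mann-Whitney U formulation).
    This pairwise version is O(n^2) and meant for small illustrations only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical held-out test labels (1 = churned) and model scores.
test_labels = [1, 0, 1, 0, 1, 0, 0, 1]
test_scores = [0.9, 0.2, 0.7, 0.4, 0.35, 0.1, 0.6, 0.8]

auc = roc_auc(test_labels, test_scores)
decision = "candidate for deployment" if auc >= AUC_TARGET else "iterate"
print(f"test AUC = {auc:.3f} -> {decision}")
```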

6. Deployment.

  • Productionise — wrap as REST API, batch job or embedded model.
  • Monitor — track data drift, model drift, business impact.
  • Schedule retraining.
  • Document and hand off.

Iteration. Arrows return from any phase to earlier ones — projects rarely flow strictly forward.

OSEMN — alternative mnemonic

Letter        Phase
Obtain        Gather the data
Scrub         Clean and preprocess
Explore       EDA
Model         Build and tune models
iN(terpret)   Communicate results

OSEMN focuses more on the analyst's day-to-day; CRISP-DM is more enterprise/process-focused.

Why iteration matters

  • Initial business question may need to be refined after EDA.
  • Modelling may reveal that more data is needed.
  • Deployment may expose distribution drift, requiring retraining.

Key success factors

  • Clear, measurable business problem.
  • Tight feedback loop with stakeholders.
  • Reproducible pipeline (version control, MLOps).
  • Validation that mimics production distribution.
  • Ethical / legal review where appropriate.
Mind Map
Visual structure of the concept
DATA SCIENCE PROJECT — CRISP-DM
├── 1. Business Understanding
│   ├── Define problem & success metric
│   └── Constraints (privacy, latency)
├── 2. Data Understanding
│   ├── Collect data
│   ├── EDA: distributions, correlations
│   └── Quality check
├── 3. Data Preparation (60–80% time)
│   ├── Clean (missing, outliers)
│   ├── Integrate (merge sources)
│   ├── Transform (scale, encode, FE)
│   ├── Reduce (feature selection)
│   └── Split (train/val/test)
├── 4. Modelling
│   ├── Candidate algorithms
│   ├── Train + tune (CV)
│   └── Compare
├── 5. Evaluation
│   ├── Held-out test
│   ├── Business metric, fairness
│   └── Deploy / iterate / abandon
├── 6. Deployment
│   ├── Productionise (API / batch)
│   ├── Monitor (drift)
│   └── Retrain schedule
└── ITERATE (arrows back to any stage)

OSEMN: Obtain → Scrub → Explore → Model → iNterpret
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Name the six phases of CRISP-DM.
A. Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment.

Q2. What does OSEMN stand for?
A. Obtain, Scrub, Explore, Model, iNterpret.

Q3. Why is the Business Understanding stage important?
A. Without a clear business problem and success metric, the rest of the project drifts. This stage translates a business question (e.g., "reduce churn") into a measurable data-science problem (e.g., "predict 30-day churn with AUC ≥ 0.80") and sets the evaluation criteria up-front.


Part B (20 marks)

Q. Explain the stages of a typical Data Science project using the CRISP-DM framework. Why is iteration important? Discuss with an example.

CRISP-DM stages.

Stage 1 — Business Understanding. Translate the business goal into a data problem. Define:

  • Problem: "Reduce 30-day customer churn from 12% to 8%."
  • Data-science framing: binary classification — predict churn probability per customer.
  • Success metric: AUC ≥ 0.80, top-decile precision ≥ 30%.
  • Constraints: must score nightly within 4 hours; under GDPR.

Stage 2 — Data Understanding. Pull data from CRM, billing, support tickets, mobile-app logs. Conduct EDA — distributions of tenure, plan type, last-login age. Spot quality issues — 8% of records have missing income.

Stage 3 — Data Preparation. ~60–80% of project time:

  • Impute missing income with median by region.
  • Engineer features: usage in last 7 / 30 / 90 days, support-ticket count, days since plan change.
  • One-hot encode categoricals; standardise numerics.
  • Split temporally: train on Jan–Sep, validate on Oct, test on Nov.
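A minimal sketch of two of the steps above: the temporal split and median-by-region imputation. The medians are fitted on the training months only, so nothing from October or November leaks into training. The records, field names and the year 2024 are hypothetical; only the month boundaries follow the example.

```python
from collections import defaultdict
from datetime import date
from statistics import median

VAL_START = date(2024, 10, 1)   # hypothetical year; months follow the example
TEST_START = date(2024, 11, 1)

# Hypothetical customer snapshots; None marks missing income.
records = [
    {"snapshot": date(2024, 3, 1), "region": "north", "income": 40000},
    {"snapshot": date(2024, 6, 1), "region": "north", "income": None},
    {"snapshot": date(2024, 8, 1), "region": "north", "income": 60000},
    {"snapshot": date(2024, 9, 1), "region": "south", "income": 30000},
    {"snapshot": date(2024, 10, 1), "region": "south", "income": None},
    {"snapshot": date(2024, 11, 1), "region": "north", "income": None},
]

def temporal_split(rows):
    """Train before October, validate on October, test from November on."""
    train = [r for r in rows if r["snapshot"] < VAL_START]
    val = [r for r in rows if VAL_START <= r["snapshot"] < TEST_START]
    test = [r for r in rows if r["snapshot"] >= TEST_START]
    return train, val, test

def fit_region_medians(train_rows):
    """Median income per region, computed from training rows only."""
    by_region = defaultdict(list)
    for r in train_rows:
        if r["income"] is not None:
            by_region[r["region"]].append(r["income"])
    return {region: median(vals) for region, vals in by_region.items()}

def impute_income(rows, medians):
    """Return copies of the rows with missing income filled in.

    Raises KeyError for a region never seen in training; a real pipeline
    would need a fallback such as the global training median.
    """
    return [dict(r, income=medians[r["region"]]) if r["income"] is None
            else dict(r) for r in rows]

train, val, test = temporal_split(records)
medians = fit_region_medians(train)
train_f, val_f, test_f = (impute_income(p, medians) for p in (train, val, test))
print(medians)
```

Splitting by time rather than at random mirrors how the model will be used: it is always trained on the past and scored on the future.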

Stage 4 — Modelling. Try logistic regression (baseline), Random Forest, XGBoost. Cross-validate within training data. Tune hyperparameters via grid search. XGBoost wins with CV AUC 0.84.

Stage 5 — Evaluation. Test on November holdout:

  • AUC 0.82 — meets target ✓
  • Top-decile precision 35% — exceeds target ✓
  • Audit for bias: no evidence of unfair treatment across age or gender groups ✓
  • Estimate business impact: retaining the top 5% of predicted churners saves ~$2 M/year.
  • Decision: deploy.
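Top-decile precision, used as a success metric above, is the share of actual churners among the 10% of customers with the highest predicted scores. A minimal sketch with hypothetical scores and labels (the function name is an assumption):

```python
def top_fraction_precision(labels, scores, fraction=0.10):
    """Share of true churners among the highest-scoring `fraction` of customers."""
    k = max(1, round(len(scores) * fraction))  # size of the targeted group
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k

# Hypothetical scores for 20 customers; indices 3, 10, 15 and 19 actually churned.
scores = [i / 20 for i in range(20)]
labels = [1 if i in {3, 10, 15, 19} else 0 for i in range(20)]

precision = top_fraction_precision(labels, scores)
base_rate = sum(labels) / len(labels)
print(f"top-decile precision = {precision:.2f}, "
      f"base rate = {base_rate:.2f}, lift = {precision / base_rate:.1f}x")
```

Comparing the result with the overall churn rate gives the lift, which is what links the metric to business impact: a retention campaign aimed at the top decile reaches proportionally more churners than one aimed at random customers.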

Stage 6 — Deployment.

  • Wrap the model as a REST API; a nightly job calls it and writes scores into the CRM.
  • Monitor: PSI (Population Stability Index) on input features and AUC on rolling labels, with alerts when drift is detected.
  • Schedule retraining monthly.
  • Document model card, lineage and ownership.
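The PSI used for drift monitoring compares the binned distribution of a feature at training time (expected share e_i per bin) with its distribution in production (actual share a_i): PSI = Σ (a_i − e_i) · ln(a_i / e_i). The sketch below assumes equal-width bins derived from the baseline sample and a small floor to avoid log(0); the bin count, floor and alert threshold are illustrative choices, not fixed rules.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back if the baseline is constant

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature values: training baseline vs. a shifted production sample.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]

ALERT_THRESHOLD = 0.25  # a commonly quoted rule of thumb, not a standard
print("no drift:", psi(baseline, baseline))
print("shifted :", round(psi(baseline, shifted), 2),
      "-> alert" if psi(baseline, shifted) > ALERT_THRESHOLD else "-> ok")
```

Because input features are available immediately while churn labels arrive weeks later, a PSI alert can trigger investigation or retraining before the rolling AUC has had time to drop.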

Why iteration is essential.

  • Discovery loops back. EDA in Stage 2 might reveal that we lack a key feature (e.g., billing complaints) — we return to data collection.
  • Modelling exposes data issues. Feature importance might flag a feature that actually leaks the target; return to preparation.
  • Evaluation may invalidate the framing. If even the best model achieves only AUC 0.62, the business question may be unanswerable from current data — back to Stage 1.
  • Production drift. Deployed model degrades as customer behaviour changes — periodic retraining returns the loop to Stages 3–4.
  • Stakeholder feedback. The business may rescope after seeing initial results, and the whole process repeats.

CRISP-DM explicitly draws arrows backward between stages — projects are spiral, not linear. A team that forgets to iterate ships a model that fits Q2 data and fails by Q4.