Stages in a Data Science Project
Core Titles
Key headlines and terms for quick recall
- Data Science Life Cycle — end-to-end process
- CRISP-DM: Business understanding → Data understanding → Data preparation → Modelling → Evaluation → Deployment
- OSEMN: Obtain, Scrub, Explore, Model, iNterpret
- Business understanding — define the problem & success metric
- Data understanding & EDA
- Data preparation / feature engineering
- Modelling & training
- Evaluation & validation
- Deployment & monitoring (MLOps)
- Iteration — repeat as data drifts
Basic Idea
What it is, why it matters, how it works
Why we need a process
A data-science project is not just "fit a model" — without structure, projects drown in unclear goals, dirty data, leaky validation, and models that never reach production. Two industry-standard frameworks structure the work: CRISP-DM and OSEMN.
CRISP-DM — Cross-Industry Standard Process for Data Mining
One of the most widely adopted frameworks. It has six iterative phases:
1. Business Understanding.
- Define the business problem in plain language: "Reduce customer churn by 5%."
- Translate into a data problem: "Predict which customers will churn in the next 30 days."
- Define success metric (e.g., AUC ≥ 0.80, ROI ≥ 3×).
- Identify constraints (privacy, latency, budget).
2. Data Understanding.
- Collect initial data — internal databases, third-party feeds, public sources.
- Describe and explore (EDA): distributions, correlations, missing values.
- Verify data quality — is it complete, recent, accurate?
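A first pass over data quality can be sketched in a few lines. This is a minimal illustration in plain Python, not a full EDA: the `records` list and its column names (`tenure_months`, `income`) are made-up toy data, and in practice a library such as pandas would usually do this work.

```python
from statistics import mean, median

# Hypothetical toy records; None marks a missing value.
records = [
    {"tenure_months": 3, "income": 42000},
    {"tenure_months": 14, "income": None},
    {"tenure_months": 27, "income": 58000},
    {"tenure_months": 8, "income": 39000},
    {"tenure_months": 40, "income": None},
]

def profile(rows, column):
    """Return basic quality and distribution statistics for one column."""
    values = [row[column] for row in rows]
    present = [v for v in values if v is not None]
    return {
        "missing_rate": 1 - len(present) / len(values),
        "min": min(present),
        "median": median(present),
        "mean": mean(present),
        "max": max(present),
    }

income_profile = profile(records, "income")
tenure_profile = profile(records, "tenure_months")
```

Reporting the missing rate next to the distribution summary keeps completeness and plausibility checks in one place, which is the point of the "describe, explore, verify" steps above.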
3. Data Preparation. (commonly estimated at 60–80% of project time)
- Clean — handle missing, outliers, duplicates.
- Integrate — merge multiple sources.
- Transform — scaling, encoding, feature engineering.
- Reduce — feature selection, dimensionality reduction.
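- Split — train / validation / test. Fit scaling, encoding and imputation on the training split only, so that no information from validation or test data leaks into the model.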
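One way to keep preparation free of leakage is to learn every transformation parameter from the training split and only apply it elsewhere. The sketch below shows this for standardisation; the helper names `fit_scaler` and `transform` and the numbers are illustrative assumptions, and plain Python is used only to keep the example self-contained.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn mean and standard deviation from the training split only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against a constant column
    return mu, sigma

def transform(values, mu, sigma):
    """Apply previously learned statistics; nothing is re-estimated here."""
    return [(v - mu) / sigma for v in values]

# Hypothetical feature values, already split.
train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [20.0, 22.0]

mu, sigma = fit_scaler(train)
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)  # reuses the training statistics
```

The test values are scaled with the training mean and deviation, so the test set never influences what the model sees during fitting.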
4. Modelling.
- Select candidate algorithms (logistic regression, random forest, XGBoost, neural net).
- Train each on training set.
- Tune hyperparameters via cross-validation.
- Compare performance.
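The tune-and-compare loop can be illustrated with a small k-fold cross-validation. The following is a sketch under stated assumptions: the model is a toy 1-D k-nearest-neighbours classifier, the data are invented, and the function names are hypothetical. A real project would normally rely on a library such as scikit-learn.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs; fold f validates on indices i with i % n_folds == f."""
    for fold in range(n_folds):
        val_idx = [i for i in range(n_samples) if i % n_folds == fold]
        train_idx = [i for i in range(n_samples) if i % n_folds != fold]
        yield train_idx, val_idx

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training points (1-D feature)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if votes * 2 > k else 0

def cv_accuracy(xs, ys, k, n_folds=4):
    """Mean validation accuracy of k-NN across the folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), n_folds):
        tx = [xs[i] for i in train_idx]
        ty = [ys[i] for i in train_idx]
        hits = sum(knn_predict(tx, ty, xs[i], k) == ys[i] for i in val_idx)
        scores.append(hits / len(val_idx))
    return sum(scores) / len(scores)

# Hypothetical training data: one feature, binary label.
xs = [0.1, 0.3, 0.4, 0.6, 0.9, 1.1, 1.3, 1.6, 1.8, 2.0, 2.2, 2.5]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]

candidates = [1, 3, 5]  # hyperparameter grid for k
results = {k: cv_accuracy(xs, ys, k) for k in candidates}
best_k = max(results, key=results.get)
```

Only the training data take part in this loop; the held-out test set stays untouched until the Evaluation phase.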
5. Evaluation.
- Evaluate on held-out test set.
- Check against business success criteria — not just statistical metrics.
- Review unintended consequences (bias, fairness).
- Decide: deploy / iterate / abandon.
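Checking a statistical metric against a pre-agreed target can be sketched as follows. AUC is computed here from its pairwise definition (the probability that a randomly chosen positive scores higher than a randomly chosen negative); the labels, scores and the `AUC_TARGET` constant are illustrative assumptions, with 0.80 taken from the example success metric in Business Understanding.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical held-out labels and model scores.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.5, 0.1, 0.8, 0.3]

AUC_TARGET = 0.80  # success criterion agreed up-front
auc = roc_auc(labels, scores)
meets_target = auc >= AUC_TARGET
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; library implementations use a rank-based computation instead. Passing this gate is necessary but not sufficient, since the business criteria and fairness review still apply.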
6. Deployment.
- Productionise — wrap as REST API, batch job or embedded model.
- Monitor — track data drift, model drift, business impact.
- Schedule retraining.
- Document and hand off.
Iteration. Arrows return from any phase to earlier ones — projects rarely flow strictly forward.
OSEMN — alternative mnemonic
| Letter | Phase |
|---|---|
| Obtain | Gather the data |
| Scrub | Clean and preprocess |
| Explore | EDA |
| Model | Build and tune models |
| iNterpret | Interpret and communicate results |
OSEMN focuses more on the analyst's day-to-day; CRISP-DM is more enterprise/process-focused.
Why iteration matters
- Initial business question may need to be refined after EDA.
- Modelling may reveal that more data is needed.
- Deployment may expose distribution drift, requiring retraining.
Key success factors
- Clear, measurable business problem.
- Tight feedback loop with stakeholders.
- Reproducible pipeline (version control, MLOps).
- Validation that mimics production distribution.
- Ethical / legal review where appropriate.
Mind Map
Visual structure of the concept
DATA SCIENCE PROJECT — CRISP-DM
├── 1. Business Understanding
│ ├── Define problem & success metric
│ └── Constraints (privacy, latency)
├── 2. Data Understanding
│ ├── Collect data
│ ├── EDA: distributions, correlations
│ └── Quality check
├── 3. Data Preparation (60–80% time)
│ ├── Clean (missing, outliers)
│ ├── Integrate (merge sources)
│ ├── Transform (scale, encode, FE)
│ ├── Reduce (feature selection)
│ └── Split (train/val/test)
├── 4. Modelling
│ ├── Candidate algorithms
│ ├── Train + tune (CV)
│ └── Compare
├── 5. Evaluation
│ ├── Held-out test
│ ├── Business metric, fairness
│ └── Deploy / iterate / abandon
├── 6. Deployment
│ ├── Productionise (API / batch)
│ ├── Monitor (drift)
│ └── Retrain schedule
└── ITERATE (arrows back to any stage)
OSEMN: Obtain → Scrub → Explore → Model → iNterpret
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions
Part A (2 marks each)
Q1. Name the six phases of CRISP-DM.
Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment.
Q2. What does OSEMN stand for?
Obtain, Scrub, Explore, Model, iNterpret.
Q3. Why is the Business Understanding stage important?
Because without a clear business problem and success metric, the rest of the project drifts. It translates a business question (e.g., "reduce churn") into a measurable data-science problem (e.g., "predict 30-day churn with AUC ≥ 0.80") and sets evaluation criteria up-front.
Part B (20 marks)
Q. Explain the stages of a typical Data Science project using the CRISP-DM framework. Why is iteration important? Discuss with an example.
CRISP-DM stages.
Stage 1 — Business Understanding. Translate the business goal into a data problem. Define:
- Problem: "Reduce 30-day customer churn from 12% to 8%."
- Data-science framing: binary classification — predict churn probability per customer.
- Success metric: AUC ≥ 0.80, top-decile precision ≥ 30%.
- Constraints: must score nightly within 4 hours; under GDPR.
Stage 2 — Data Understanding. Pull data from CRM, billing, support tickets, mobile-app logs. Conduct EDA — distributions of tenure, plan type, last-login age. Spot quality issues — 8% of records have missing income.
Stage 3 — Data Preparation. ~60–80% of project time:
- Impute missing income with median by region.
- Engineer features: usage in last 7 / 30 / 90 days, support-ticket count, days since plan change.
- One-hot encode categoricals; standardise numerics.
- Split temporally: train on Jan–Sep, validate on Oct, test on Nov.
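The temporal split in the last step can be sketched as a simple date filter. The row layout (`snapshot`, `customer_id`), the function name `temporal_split` and the year 2023 are assumptions made for illustration; only the Jan–Sep / Oct / Nov boundaries come from the example above.

```python
from datetime import date

def temporal_split(rows, train_end, val_end):
    """Partition rows by snapshot date: train <= train_end < validation <= val_end < test."""
    train = [r for r in rows if r["snapshot"] <= train_end]
    val = [r for r in rows if train_end < r["snapshot"] <= val_end]
    test = [r for r in rows if r["snapshot"] > val_end]
    return train, val, test

# Hypothetical monthly snapshots, January to November of an assumed year.
rows = [{"snapshot": date(2023, month, 1), "customer_id": month} for month in range(1, 12)]

train, val, test = temporal_split(
    rows,
    train_end=date(2023, 9, 30),   # train on Jan-Sep
    val_end=date(2023, 10, 31),    # validate on Oct, test on Nov
)
```

Splitting by time rather than at random means the model is always evaluated on data later than anything it was trained on, which mirrors how it will be used in production.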
Stage 4 — Modelling. Try logistic regression (baseline), Random Forest, XGBoost. Cross-validate within training data. Tune hyperparameters via grid search. XGBoost wins with CV AUC 0.84.
Stage 5 — Evaluation. Test on November holdout:
- AUC 0.82 — meets target ✓
- Top-decile precision 35% — exceeds target ✓
- Audit for bias: no material performance gap across age or gender groups ✓
- Estimate business impact: retaining top 5% predicted churners saves ~$2 M/year.
- Decision: deploy.
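Top-decile precision, one of the success metrics in this example, is the share of actual churners among the 10% of customers with the highest predicted scores. A minimal sketch follows; the function name and the twenty toy customers are illustrative assumptions, while the 30% target comes from Stage 1.

```python
import math

def top_decile_precision(labels, scores, fraction=0.10):
    """Share of true churners among the highest-scoring `fraction` of customers."""
    n_top = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_labels = [label for _, label in ranked[:n_top]]
    return sum(top_labels) / n_top

# Hypothetical holdout: 20 customers with increasing scores; three of them churned.
scores = [i / 20 for i in range(20)]
labels = [0] * 20
for churner in (10, 15, 19):
    labels[churner] = 1

precision = top_decile_precision(labels, scores)
meets_target = precision >= 0.30  # target from Stage 1
```

This metric matches how the model is used: a retention team can only contact a limited number of customers, so what matters is how many real churners sit at the top of the ranking.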
Stage 6 — Deployment.
- Wrap as a REST API; a nightly job scores customers and writes the results into the CRM.
- Monitor: Population Stability Index (PSI) on input features, AUC on rolling labels, and alerts when drift exceeds a set threshold.
- Schedule retraining monthly.
- Document model card, lineage and ownership.
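The PSI check used for input monitoring compares the binned distribution of a feature at training time with its recent distribution in production: PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the baseline and recent proportions in bin i. The sketch below is one possible implementation, assuming equal-width bins over the baseline range and a small `eps` floor for empty bins; the 0.25 alert level is a common rule of thumb, not something specified in these notes.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant baseline

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so values outside the baseline range fall into the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # Floor at eps so the logarithm is defined for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))

# Hypothetical feature samples: training baseline vs. a shifted production window.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]

PSI_ALERT = 0.25  # assumed rule-of-thumb threshold for significant drift
drift_alert = psi(baseline, shifted) > PSI_ALERT
```

Identical distributions give a PSI of zero, and the value grows as the recent data move away from the baseline, so a nightly job can compute it per feature and raise an alert or trigger retraining when the threshold is crossed.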
Why iteration is essential.
- Discovery loops back. EDA in Stage 2 might reveal that we lack a key feature (e.g., billing complaints) — we return to data collection.
- Modelling exposes data issues. Feature importance might highlight a feature that leaks the target — back to preparation.
- Evaluation may invalidate the framing. If even the best model achieves only AUC 0.62, the business question may be unanswerable from current data — back to Stage 1.
- Production drift. Deployed model degrades as customer behaviour changes — periodic retraining returns the loop to Stages 3–4.
- Stakeholder feedback. The business may rescope after seeing initial results — the whole process repeats.
CRISP-DM explicitly draws arrows backward between stages — projects are spiral, not linear. A team that forgets to iterate ships a model that fits Q2 data and fails by Q4.