PGDDSA Study · Semester 1

PGD01C02

Module 1 · Introduction to Data Science

Stages in a Data Science Project

Core Titles

Key headlines and terms for quick recall

Data Science Life Cycle — end-to-end process
CRISP-DM: Business → Data understanding → Prep → Modelling → Evaluation → Deployment
OSEMN: Obtain, Scrub, Explore, Model, iNterpret
Business understanding — define the problem & success metric
Data understanding & EDA
Data preparation / feature engineering
Modelling & training
Evaluation & validation
Deployment & monitoring (MLOps)
Iteration — repeat as data drifts

Basic Idea

What it is, why it matters, how it works

Why we need a process

A data-science project is not just "fit a model" — without structure, projects drown in unclear goals, dirty data, leaky validation, and models that never reach production. Two industry-standard frameworks structure the work: CRISP-DM and OSEMN.

CRISP-DM — Cross-Industry Standard Process for Data Mining

The most widely adopted framework. Six iterative phases:

1. Business Understanding.

Define the business problem in plain language: "Reduce customer churn by 5%."
Translate into a data problem: "Predict which customers will churn in the next 30 days."
Define success metric (e.g., AUC ≥ 0.80, ROI ≥ 3×).
Identify constraints (privacy, latency, budget).
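
One lightweight way to make the success metric binding is to write it down as data before any modelling starts. The sketch below is illustrative only: `SuccessCriteria` and `roi_multiple` are hypothetical names, ROI is assumed to mean benefit divided by cost (so 3.0 corresponds to "3×"), and the monetary figures are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriteria:
    """Targets agreed with stakeholders before modelling (hypothetical helper)."""
    min_auc: float
    min_roi_multiple: float

    def is_met(self, auc: float, roi_multiple: float) -> bool:
        # Both the statistical target and the business target must hold.
        return auc >= self.min_auc and roi_multiple >= self.min_roi_multiple


def roi_multiple(benefit: float, cost: float) -> float:
    """ROI as a simple multiple, benefit / cost (an assumed definition)."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return benefit / cost


criteria = SuccessCriteria(min_auc=0.80, min_roi_multiple=3.0)

# Made-up figures: a 4x return passes, a 2x return does not.
print(criteria.is_met(auc=0.82, roi_multiple=roi_multiple(400_000, 100_000)))
print(criteria.is_met(auc=0.82, roi_multiple=roi_multiple(200_000, 100_000)))
```

Fixing the thresholds in one place makes it harder to move the goalposts once model results come in.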

2. Data Understanding.

Collect initial data — internal databases, third-party feeds, public sources.
Describe and explore (EDA): distributions, correlations, missing values.
Verify data quality — is it complete, recent, accurate?
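
The three EDA checks named above (distributions, correlations, missing values) can be sketched without any third-party library. This is a minimal, dependency-free illustration on made-up columns; in practice a dataframe library would usually do this work, and the helper names here are hypothetical.

```python
import math
import statistics


def missing_rate(values):
    """Fraction of entries that are None."""
    return sum(v is None for v in values) / len(values)


def summarise(values):
    """Basic distribution statistics over the non-missing entries."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "missing_rate": missing_rate(values),
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
    }


def pearson(xs, ys):
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mx = statistics.fmean(x for x, _ in pairs)
    my = statistics.fmean(y for _, y in pairs)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x, _ in pairs)
    var_y = sum((y - my) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy columns, for illustration only.
tenure_months = [1, 2, 3, 4, 5]
income = [10, None, 30, 40, 50]

print(summarise(income))
print(pearson(tenure_months, income))
```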

3. Data Preparation. (often cited as 60–80% of project effort)

Clean — handle missing, outliers, duplicates.
Integrate — merge multiple sources.
Transform — scaling, encoding, feature engineering.
Reduce — feature selection, dimensionality reduction.
Split — train / validation / test.
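
One point the list above does not spell out is ordering: to keep validation honest, statistics used by transforms (means, category lists) are normally learned from the training split only and then applied unchanged to the other splits. The sketch below shows that ordering on made-up rows; all function names are hypothetical and the random split is only one possible strategy.

```python
import random
import statistics


def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of rows and cut it into train / validation / test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def fit_standardiser(train_values):
    """Learn mean and population std from training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against zero spread
    return lambda v: (v - mean) / std


def one_hot(value, categories):
    """Encode value as a 0/1 list over a fixed category order."""
    return [1 if value == c else 0 for c in categories]


# Made-up rows: (monthly_usage, plan)
rows = [(10, "basic"), (20, "pro"), (30, "basic"), (40, "pro"), (50, "basic"),
        (60, "pro"), (70, "basic"), (80, "pro"), (90, "basic"), (100, "pro")]

train, val, test = train_val_test_split(rows)
scale = fit_standardiser([usage for usage, _ in train])  # fitted on train only
plans = sorted({plan for _, plan in train})              # categories from train only

# The same fitted transforms are applied unchanged to the validation split.
prepared_val = [[scale(usage)] + one_hot(plan, plans) for usage, plan in val]
print(len(train), len(val), len(test))
print(prepared_val)
```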

4. Modelling.

Select candidate algorithms (logistic regression, random forest, XGBoost, neural net).
Train each on training set.
Tune hyperparameters via cross-validation.
Compare performance.
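
The train, cross-validate and compare loop can be sketched as follows. The two "models" are deliberately tiny stand-ins (a majority-class baseline and a one-feature threshold rule), not the algorithms listed above, and the data is made up. Folds are contiguous, so the rows are assumed to be in random order already.

```python
import statistics


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over n rows."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size


def cross_val_accuracy(fit, X, y, k=5):
    """Mean validation accuracy of the model family `fit` over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k):
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = [predict(X[i]) == y[i] for i in val_idx]
        scores.append(sum(hits) / len(hits))
    return statistics.fmean(scores)


def fit_majority(X, y):
    """Baseline: always predict the most common training label."""
    label = max(sorted(set(y)), key=y.count)
    return lambda x: label


def fit_stump(X, y):
    """Toy model: predict 1 when x >= t, with t chosen on the training fold."""
    def train_accuracy(t):
        return sum((x >= t) == (label == 1) for x, label in zip(X, y)) / len(X)
    best_t = max(sorted(set(X)), key=train_accuracy)
    return lambda x: int(x >= best_t)


# Made-up single-feature data: the label is 1 when the feature is 6 or more.
X = [3, 8, 1, 9, 5, 6, 2, 10, 4, 7]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

candidates = {"majority": fit_majority, "stump": fit_stump}
cv_scores = {name: cross_val_accuracy(fit, X, y) for name, fit in candidates.items()}
best = max(cv_scores, key=cv_scores.get)
print(cv_scores, best)
```

Keeping a trivial baseline among the candidates shows how much of the score is due to the model rather than the class balance.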

5. Evaluation.

Evaluate on held-out test set.
Check against business success criteria — not just statistical metrics.
Review unintended consequences (bias, fairness).
Decide: deploy / iterate / abandon.
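
As a rough illustration of the evaluation step, AUC can be computed directly from its definition (the probability that a randomly chosen positive is scored above a randomly chosen negative) and then mapped onto the three decisions. The `decide` rule and its `abandon_below` floor are assumptions made for this sketch, not part of CRISP-DM.

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def decide(test_auc, target_auc, fairness_ok, abandon_below=0.55):
    """Illustrative decision rule; the abandon_below floor is an arbitrary assumption."""
    if test_auc < abandon_below:
        return "abandon"
    if test_auc >= target_auc and fairness_ok:
        return "deploy"
    return "iterate"


# Made-up held-out labels and model scores.
y_test = [0, 0, 1, 1]
test_scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_test, test_scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
print(decide(auc, target_auc=0.80, fairness_ok=True))
```

The pairwise loop is quadratic in the number of rows, which is fine for a sketch but too slow for large test sets.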

6. Deployment.

Productionise — wrap as REST API, batch job or embedded model.
Monitor — track data drift, model drift, business impact.
Schedule retraining.
Document and hand off.
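
For the batch-job option, a minimal sketch is a function that reads customer rows, scores each one and writes the results out. Everything here is a placeholder: `predict_churn` is a logistic function with made-up coefficients standing in for a trained model, the column names are hypothetical, and in-memory files stand in for real storage.

```python
import csv
import io
import math


def predict_churn(tenure_months, tickets):
    """Placeholder model: logistic function with made-up coefficients."""
    z = -0.05 * tenure_months + 0.4 * tickets - 0.5
    return 1.0 / (1.0 + math.exp(-z))


def score_batch(in_file, out_file):
    """Read customer rows as CSV, write id and churn score as CSV, return row count."""
    reader = csv.DictReader(in_file)
    writer = csv.DictWriter(out_file, fieldnames=["customer_id", "churn_score"])
    writer.writeheader()
    count = 0
    for row in reader:
        score = predict_churn(float(row["tenure_months"]), float(row["tickets"]))
        writer.writerow({"customer_id": row["customer_id"],
                         "churn_score": f"{score:.4f}"})
        count += 1
    return count


# In-memory files stand in for the real input and output locations.
source = io.StringIO("customer_id,tenure_months,tickets\nA1,24,0\nB2,2,5\n")
sink = io.StringIO()
n_rows = score_batch(source, sink)
print(n_rows)
print(sink.getvalue())
```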

Iteration. Arrows return from any phase to earlier ones — projects rarely flow strictly forward.

OSEMN — alternative mnemonic

Step	What happens
Obtain	Gather the data
Scrub	Clean and preprocess
Explore	Exploratory data analysis (EDA)
Model	Build and tune models
iNterpret	Interpret and communicate results

OSEMN focuses more on the analyst's day-to-day; CRISP-DM is more enterprise/process-focused.

Why iteration matters

Initial business question may need to be refined after EDA.
Modelling may reveal that more data is needed.
Deployment may expose distribution drift, requiring retraining.

Key success factors

Clear, measurable business problem.
Tight feedback loop with stakeholders.
Reproducible pipeline (version control, MLOps).
Validation that mimics production distribution.
Ethical / legal review where appropriate.

Mind Map

Visual structure of the concept

DATA SCIENCE PROJECT — CRISP-DM
├── 1. Business Understanding
│   ├── Define problem & success metric
│   └── Constraints (privacy, latency)
├── 2. Data Understanding
│   ├── Collect data
│   ├── EDA: distributions, correlations
│   └── Quality check
├── 3. Data Preparation (60–80% time)
│   ├── Clean (missing, outliers)
│   ├── Integrate (merge sources)
│   ├── Transform (scale, encode, FE)
│   ├── Reduce (feature selection)
│   └── Split (train/val/test)
├── 4. Modelling
│   ├── Candidate algorithms
│   ├── Train + tune (CV)
│   └── Compare
├── 5. Evaluation
│   ├── Held-out test
│   ├── Business metric, fairness
│   └── Deploy / iterate / abandon
├── 6. Deployment
│   ├── Productionise (API / batch)
│   ├── Monitor (drift)
│   └── Retrain schedule
└── ITERATE (arrows back to any stage)

OSEMN: Obtain → Scrub → Explore → Model → iNterpret

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Name the six phases of CRISP-DM. Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment.

Q2. What does OSEMN stand for? Obtain, Scrub, Explore, Model, iNterpret.

Q3. Why is the Business Understanding stage important? Without a clear business problem and success metric, the rest of the project drifts. This stage translates a business question (e.g., "reduce churn") into a measurable data-science problem (e.g., "predict 30-day churn with AUC ≥ 0.80") and sets the evaluation criteria up front.

Part B (20 marks)

Q. Explain the stages of a typical Data Science project using the CRISP-DM framework. Why is iteration important? Discuss with an example.

CRISP-DM stages.

Stage 1 — Business Understanding. Translate the business goal into a data problem. Define:

Problem: "Reduce 30-day customer churn from 12% to 8%."
Data-science framing: binary classification — predict churn probability per customer.
Success metric: AUC ≥ 0.80, top-decile precision ≥ 30%.
Constraints: must score nightly within 4 hours; under GDPR.

Stage 2 — Data Understanding. Pull data from CRM, billing, support tickets, mobile-app logs. Conduct EDA — distributions of tenure, plan type, last-login age. Spot quality issues — 8% of records have missing income.

Stage 3 — Data Preparation. ~60–80% of project time:

Impute missing income with median by region.
Engineer features: usage in last 7 / 30 / 90 days, support-ticket count, days since plan change.
One-hot encode categoricals; standardise numerics.
Split temporally: train on Jan–Sep, validate on Oct, test on Nov.
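
The temporal split and the region-wise median imputation can be sketched together, with the medians learned from the training months only. The year, field names and rows below are made up for illustration, and the helper names are hypothetical.

```python
import statistics
from collections import defaultdict
from datetime import date


def temporal_split(rows, val_start, test_start):
    """Split by date: train < val_start <= validation < test_start <= test."""
    train = [r for r in rows if r["date"] < val_start]
    val = [r for r in rows if val_start <= r["date"] < test_start]
    test = [r for r in rows if r["date"] >= test_start]
    return train, val, test


def fit_region_medians(train_rows):
    """Median income per region, learned from the training rows only."""
    by_region = defaultdict(list)
    for r in train_rows:
        if r["income"] is not None:
            by_region[r["region"]].append(r["income"])
    return {region: statistics.median(vals) for region, vals in by_region.items()}


def impute_income(rows, medians):
    """Return copies of rows with missing income replaced by the region median.

    A region never seen in training raises KeyError here; a global median
    fallback would be one way to handle that case.
    """
    return [
        {**r, "income": medians[r["region"]] if r["income"] is None else r["income"]}
        for r in rows
    ]


# Made-up rows; the year is arbitrary.
rows = [
    {"date": date(2024, 3, 1), "region": "north", "income": 40},
    {"date": date(2024, 5, 1), "region": "north", "income": 60},
    {"date": date(2024, 6, 1), "region": "south", "income": 30},
    {"date": date(2024, 9, 30), "region": "north", "income": None},
    {"date": date(2024, 10, 15), "region": "south", "income": None},
    {"date": date(2024, 11, 20), "region": "north", "income": None},
]

train, val, test = temporal_split(rows, date(2024, 10, 1), date(2024, 11, 1))
medians = fit_region_medians(train)
print(medians)
print(impute_income(test, medians))
```

Splitting by time rather than at random keeps future information out of training, which matters when the model will score customers it has not yet seen.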

Stage 4 — Modelling. Try logistic regression (baseline), Random Forest, XGBoost. Cross-validate within training data. Tune hyperparameters via grid search. XGBoost wins with CV AUC 0.84.

Stage 5 — Evaluation. Test on November holdout:

AUC 0.82 — meets target ✓
Top-decile precision 35% — exceeds target ✓
Audit for bias: no material performance gap across age or gender groups ✓
Estimate business impact: retaining the top 5% of predicted churners saves ~$2 M/year.
Decision: deploy.
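
Top-decile precision, used in the checklist above, is the share of actual churners among the 10% of customers with the highest predicted scores. A minimal sketch on made-up data follows; `lift` is a hypothetical helper that divides that precision by the overall churn rate.

```python
def top_decile_precision(y_true, scores):
    """Share of actual churners among the 10% of customers with the highest scores."""
    n_top = max(1, len(scores) // 10)
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:n_top]
    return sum(label for _, label in top) / n_top


def lift(precision, base_rate):
    """How many times better than picking customers at random."""
    return precision / base_rate


# Made-up data: 20 customers, 4 of whom churn.
scores = [0.95, 0.90] + [0.50 - 0.01 * i for i in range(18)]
y_true = [1, 0] + [0] * 15 + [1] * 3

precision = top_decile_precision(y_true, scores)
base_rate = sum(y_true) / len(y_true)
print(precision, lift(precision, base_rate))

# With the figures from the worked example: 35% precision against a 12% churn rate.
print(round(lift(0.35, 0.12), 2))  # 2.92
```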

Stage 6 — Deployment.

Wrap the model as a REST API and call it from a nightly job that writes scores into the CRM.
Monitor: PSI (Population Stability Index) on input features, AUC on rolling labels, and alerts when drift exceeds a set threshold.
Schedule retraining monthly.
Document model card, lineage and ownership.
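
The PSI check on an input feature can be sketched as below. PSI sums (actual share − expected share) × ln(actual share / expected share) over bins. The bin edges, the sample values and the 0.25 alert threshold are assumptions for illustration; 0.25 is a commonly quoted rule of thumb rather than a fixed standard.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared, ascending bin edges."""
    def bin_shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Number of edges at or below v gives the bin index.
            counts[sum(v >= e for e in edges)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e_shares = bin_shares(expected)
    a_shares = bin_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))


def drift_alert(psi_value, threshold=0.25):
    """True when drift exceeds the chosen threshold (0.25 is an assumed default)."""
    return psi_value > threshold


# Made-up feature values: training-time sample vs. a shifted production sample.
training_sample = list(range(1, 11))
production_sample = list(range(5, 15))
edges = [3.5, 6.5]

print(psi(training_sample, training_sample, edges))  # 0.0 for identical samples
value = psi(training_sample, production_sample, edges)
print(value, drift_alert(value))
```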

Why iteration is essential.

Discovery loops back. EDA in Stage 2 might reveal that we lack a key feature (e.g., billing complaints) — we return to data collection.
Modelling exposes data issues. Feature importance might highlight a feature that actually leaks the target, which sends us back to preparation.
Evaluation may invalidate the framing. If even the best model achieves only AUC 0.62, the business question may be unanswerable from current data — back to Stage 1.
Production drift. Deployed model degrades as customer behaviour changes — periodic retraining returns the loop to Stages 3–4.
Stakeholder feedback. The business may rescope after seeing initial results, so the whole process repeats.

CRISP-DM explicitly draws arrows backward between stages: projects are spirals, not straight lines. A team that forgets to iterate ships a model that fits Q2 data and fails by Q4.
