PGD01C02
Module 2 · Data Collection and Pre-Processing

Data Discretization

Core Titles
Key headlines and terms for quick recall
  • Discretization — convert continuous variables into discrete intervals (bins)
  • Equal-width binning — fixed-width intervals
  • Equal-frequency binning — quantile-based, same count per bin
  • Cluster-based binning — k-means on feature
  • Decision-tree binning — supervised splits
  • ChiMerge — merges intervals using χ²
  • Use cases: decision trees, naive Bayes, business reporting, fairness audits
  • Concept-hierarchy generation — specific → general (e.g., year → decade)
Basic Idea
What it is, why it matters, how it works

What is discretization?

Discretization transforms a continuous numeric variable into a set of discrete intervals (bins). The continuous value x = 27.4 becomes a categorical bin like "25–30" or "young".

Why discretize?

  1. Some algorithms need discrete inputs — Naive Bayes, ID3 decision tree, association-rule mining (Apriori).
  2. Interpretability — bins (Low / Medium / High) are easier to explain than raw numbers.
  3. Robustness to outliers — bins squash extreme values.
  4. Capture non-linearity without explicit polynomial features.
  5. Business reporting — age groups, income brackets.
  6. Fairness audits — group analysis across demographic bins.

Binning techniques

1. Equal-width (uniform) binning. Divide the range [min, max] into k intervals of equal width w = (max − min) / k.

  • Simple, intuitive.
  • Sensitive to outliers (a single extreme value stretches the range and crowds most of the data into a few bins).
  • Bins may have very unequal populations.

Example. Age range 0–100 → 5 bins of width 20: 0–20, 20–40, 40–60, 60–80, 80–100.
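
The age example can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and it assumes half-open intervals [a, b) so that an age of exactly 20 falls in 20–40 rather than 0–20, with the top edge 100 kept in the last bin.

```python
def equal_width_edges(lo, hi, k):
    """Return the k + 1 edges of k equal-width bins spanning [lo, hi]."""
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


def equal_width_bin(x, lo, hi, k):
    """Return the 0-based bin index of x, treating bins as half-open [a, b)."""
    width = (hi - lo) / k
    index = int((x - lo) // width)
    return min(max(index, 0), k - 1)  # clamp so x == hi (or out-of-range x) stays valid


edges = equal_width_edges(0, 100, 5)   # [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]
ages = [3, 20, 27.4, 59.9, 100]
bins = [equal_width_bin(a, 0, 100, 5) for a in ages]  # [0, 1, 1, 2, 4]
```

In practice the cut function in pandas offers the same behaviour with more options; the hand-written version is shown only to make the arithmetic visible.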

2. Equal-frequency (quantile) binning. Choose bin edges so each bin has (approximately) the same number of items.

  • Robust to outliers.
  • Bins have different widths.
  • Used for percentiles, deciles, quartiles.

Example. Income binned into deciles — each contains 10% of customers.
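
One way to sketch equal-frequency binning is to rank the values and cut the ranks into k equal slices. The function name and the income figures below are hypothetical, and the rank-based approach is an assumption of this sketch: it guarantees balanced counts but may place tied values in different bins, which edge-based quantile methods avoid.

```python
def equal_frequency_bins(values, k):
    """Assign each value a 0-based bin so that bins hold (nearly) equal counts."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices in ascending value order
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n  # ranks 0..n-1 mapped onto bins 0..k-1
    return bins


incomes = [12, 15, 18, 22, 25, 31, 40, 55, 90, 400]  # hypothetical, long right tail
quintiles = equal_frequency_bins(incomes, 5)          # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```

Note that the outlier 400 simply shares the top bin with 90; it does not stretch the other bins, which is the robustness property listed above.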

3. Cluster-based binning. Run k-means on the single feature; bins = cluster intervals.

  • Finds natural groupings.
  • Computationally costlier.
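
A rough sketch of cluster-based binning on one feature follows. It runs plain Lloyd-style k-means and places each bin edge at the midpoint between neighbouring centroids. The function name, the deterministic starting centroids, and the sample ages are all assumptions made for illustration; library implementations typically use randomised or smarter initialisation.

```python
def kmeans_1d_edges(values, k, iterations=100):
    """Run k-means on one feature and return the k - 1 interior bin edges."""
    data = sorted(values)
    n = len(data)
    # Deterministic start (an assumption of this sketch): spread-out order statistics.
    centroids = [data[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Keep the old centroid if a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break  # assignments have stopped changing
        centroids = updated
    centroids.sort()
    # In one dimension, the boundary between neighbouring clusters is the centroid midpoint.
    return [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]


ages = [21, 22, 23, 24, 40, 41, 42, 43, 70, 71, 72, 73]  # hypothetical data
edges = kmeans_1d_edges(ages, 3)                          # [32.0, 56.5]
```

The edges fall in the gaps between the three natural groups, which neither equal-width nor equal-frequency binning is guaranteed to find.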

4. Decision-tree (supervised) binning. Train a single-feature decision tree against the target; use splits as bin edges.

  • Bins are tailored to maximise predictive power.
  • Risks overfitting; use cross-validation.
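
The idea can be illustrated with the smallest possible tree: a single split chosen to minimise weighted Gini impurity. This is only a depth-one sketch with hypothetical names and data; a real workflow would more likely fit a library decision tree with a depth or leaf-size limit and read several thresholds from it.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return the threshold that minimises weighted Gini impurity (a depth-1 tree)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold


tenure = [1, 2, 3, 4, 10, 12, 15, 20]     # hypothetical months of tenure
churned = [1, 1, 1, 0, 0, 0, 0, 0]        # hypothetical binary target
threshold = best_split(tenure, churned)   # 3.5
```

Applying the same search recursively to each side would yield further edges, which is where the overfitting risk noted above comes from.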

5. ChiMerge. Start with each unique value as a bin; iteratively merge adjacent bins whose χ² on the target is below a threshold.

  • Supervised, statistically grounded.
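
A compact sketch of the merge loop is shown below. It follows the common formulation in which the adjacent pair with the lowest χ² is merged first, and merging stops once every pair reaches the threshold. The function names and data are hypothetical, the threshold 2.706 is the χ² critical value for one degree of freedom at the 0.10 level, and skipping classes with zero expected count is a simplification of this sketch.

```python
def chi2_pair(a, b):
    """Chi-square statistic for two adjacent bins, given per-class counts a and b."""
    total = sum(a) + sum(b)
    stat = 0.0
    for row in (a, b):
        for j, observed in enumerate(row):
            expected = sum(row) * (a[j] + b[j]) / total
            if expected > 0:  # simplification: skip classes absent from both bins
                stat += (observed - expected) ** 2 / expected
    return stat


def chimerge(values, labels, threshold):
    """Merge adjacent bins until every neighbouring pair has chi-square >= threshold."""
    classes = sorted(set(labels))
    # One starting bin per distinct value: [lower edge, per-class counts].
    bins = []
    for v, y in sorted(zip(values, labels)):
        if not bins or bins[-1][0] != v:
            bins.append([v, [0] * len(classes)])
        bins[-1][1][classes.index(y)] += 1
    while len(bins) > 1:
        stats = [chi2_pair(bins[i][1], bins[i + 1][1]) for i in range(len(bins) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)  # most similar neighbours
        if stats[i] >= threshold:
            break
        merged = [p + q for p, q in zip(bins[i][1], bins[i + 1][1])]
        bins[i:i + 2] = [[bins[i][0], merged]]
    return [lower for lower, _ in bins]


x = [1, 2, 3, 4, 10, 12, 15, 20]   # hypothetical feature
y = [1, 1, 1, 0, 0, 0, 0, 0]       # hypothetical binary target
lower_edges = chimerge(x, y, threshold=2.706)   # [1, 4]
```

The result is two bins, one starting at 1 and one starting at 4, matching the point where the class distribution changes.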

Concept-hierarchy generation

A form of discretisation across multiple levels of abstraction:

  • Numeric: age 27 → 20s → adult.
  • Categorical: Bengaluru → Karnataka → India → Asia.
  • Date: 2024-11-15 → November 2024 → Q4 2024 → 2024.

Used in OLAP cubes (roll-up / drill-down) and data warehousing.
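
The numeric and date hierarchies above can be sketched as simple mapping functions. The function names are hypothetical, and the age of 18 used as the boundary for "adult" is an assumption; the source does not define it.

```python
from datetime import date

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def date_hierarchy(d):
    """Return a date at four levels: day, month, quarter, year."""
    quarter = (d.month - 1) // 3 + 1
    return [d.isoformat(), f"{MONTHS[d.month - 1]} {d.year}",
            f"Q{quarter} {d.year}", str(d.year)]


def age_hierarchy(age):
    """Return an age at three levels: exact, decade, life stage."""
    decade = f"{age // 10 * 10}s"
    stage = "adult" if age >= 18 else "minor"  # the cut-off of 18 is an assumption
    return [str(age), decade, stage]


levels = date_hierarchy(date(2024, 11, 15))
# ['2024-11-15', 'November 2024', 'Q4 2024', '2024']
age_levels = age_hierarchy(27)
# ['27', '20s', 'adult']
```

Rolling up then amounts to grouping records by a label further along the list, and drilling down by a label nearer the start.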

Caveats

  • Information loss. Discretisation discards precision.
  • Bin count k matters: too few bins lose signal, too many leave sparse, noisy bins.
  • Boundary effects — items just inside and just outside a boundary land in different bins despite being nearly identical.

Use cross-validation to choose binning strategy and k.

Mind Map
Visual structure of the concept
DATA DISCRETIZATION
├── Continuous → Discrete (bins)
├── Why?
│   ├── Needed for NB, Apriori, ID3
│   ├── Interpretability
│   ├── Outlier robustness
│   └── Business reporting
├── Methods
│   ├── Equal-width (uniform)
│   ├── Equal-frequency (quantile)
│   ├── Cluster-based (k-means)
│   ├── Decision-tree (supervised)
│   └── ChiMerge (χ²-based merge)
├── Concept hierarchy
│   ├── Numeric: 27 → 20s → adult
│   ├── Geo: city → state → country
│   └── Date: day → month → year
└── Caveats
    ├── Loses precision
    └── Choose k via CV
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. What is data discretization? The process of converting a continuous variable into a set of discrete intervals (bins) — e.g., age ∈ [0, 100] → groups <25, 25–40, 40–60, >60.

Q2. Differentiate equal-width and equal-frequency binning.

  • Equal-width — divide range into intervals of equal width; bins may have unequal populations; sensitive to outliers.
  • Equal-frequency (quantile) — choose bin edges so each bin holds the same count of items; bins have different widths; robust to outliers.

Q3. Why discretize a continuous variable? To suit algorithms that require discrete input (Naive Bayes, Apriori), improve interpretability, reduce outlier impact, capture non-linearity, and enable business-friendly reporting (e.g., income brackets).


Part B (20 marks)

Q. Explain data discretization techniques with examples. Discuss concept-hierarchy generation and its uses.

What and why.

Discretisation maps a continuous variable into discrete intervals. Useful for:

  1. Algorithms that need discrete input — Naive Bayes, ID3, association-rule mining (Apriori).
  2. Interpretability — "Low / Medium / High" reads better than 24.7.
  3. Robustness to outliers — bins absorb extreme values.
  4. Capturing non-linear effects without polynomial expansion.
  5. Business reporting and fairness audits.

Techniques.

1. Equal-width (uniform) binning. Split [min, max] into k intervals of width w = (max − min) / k. Simple but sensitive to outliers and may produce empty / overflowing bins.

Example. Ages 0–100 → 5 bins of width 20: 0–20, 20–40, 40–60, 60–80, 80–100.

2. Equal-frequency (quantile) binning. Choose boundaries so each bin contains the same number of items. Robust to outliers. Used for deciles, quartiles, percentile-based risk scoring.

Example. Customer income binned into deciles — each contains 10%.

3. Cluster-based binning. Run k-means on the single feature; bin = cluster interval. Finds natural groupings but computationally costlier.

4. Decision-tree (supervised) binning. Train a single-feature tree against the target; use the splits as bin edges. Predictive but risks overfitting; use cross-validation.

5. ChiMerge. Start with each distinct value as its own bin; merge adjacent bins whose χ² statistic against the target is below a threshold. Supervised and statistically grounded.
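
For an exam answer it may help to state the statistic explicitly. The notation below is an assumption (it does not appear in these notes): A_ij is the number of items of class j in interval i, R_i is the total count in interval i, C_j is the total count of class j across the two intervals, N is the overall count, and c is the number of classes.

```latex
\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{c} \frac{(A_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N}
```

A low value means the two intervals have similar class distributions and can be merged; the threshold is read from the χ² distribution with c − 1 degrees of freedom at the chosen significance level.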

Example workflow. For a churn model, discretise tenure into deciles, then pass the bins to a Naive Bayes classifier; this can improve calibration on long-tailed tenure distributions.

Concept-Hierarchy Generation.

A multi-level discretisation that captures abstraction:

Level          | Numeric  | Geographic | Date
Most specific  | Age 27   | Bengaluru  | 2024-11-15
Mid-level      | 25–30    | Karnataka  | Nov 2024
General        | "Adult"  | India      | Q4 2024
Most general   | -        | Asia       | 2024

Uses.

  • OLAP cubes — roll-up (aggregate) and drill-down across levels.
  • Data warehousing — pre-aggregated marts for fast reporting.
  • Decision rules at the right level — "promote in Asia" vs "promote in Bengaluru."
  • Privacy — generalisation enables k-anonymity (e.g., release "30-40" instead of exact age).
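
The privacy use can be illustrated with a small sketch: generalise exact ages to ranges, then check that every released value is shared by at least k records. The function names and records are hypothetical, and the check covers a single attribute only; k-anonymity in practice is assessed over the combination of all quasi-identifiers.

```python
from collections import Counter


def generalise_age(age, width=10):
    """Replace an exact integer age with a range label such as '30-39'."""
    lower = age // width * width
    return f"{lower}-{lower + width - 1}"


def is_k_anonymous(groups, k):
    """True if every generalised value is shared by at least k records."""
    return all(count >= k for count in Counter(groups).values())


ages = [31, 34, 38, 42, 45, 47]                # hypothetical records
released = [generalise_age(a) for a in ages]
# ['30-39', '30-39', '30-39', '40-49', '40-49', '40-49']
ok = is_k_anonymous(released, 3)               # True
```

With the exact ages every record is unique, so the same check fails even for k = 2; the generalised ranges pass for k = 3.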

Caveats.

  • Information loss — every discretisation discards precision.
  • Choice of k — too few bins lose signal; too many create noise. Use cross-validation.
  • Boundary effects — items just on either side of a boundary land in different buckets despite being similar.

A thoughtful discretisation strategy can improve interpretability and downstream model performance — particularly for tree-based and Bayesian methods.