PGDDSA Study · Semester 1

PGD01C02

Module 2 · Data Collection and Pre-Processing

Data Discretization

Core Titles

Key headlines and terms for quick recall

Discretization — convert continuous variables into discrete intervals (bins)
Equal-width binning — fixed-width intervals
Equal-frequency binning — quantile-based, same count per bin
Cluster-based binning — k-means on feature
Decision-tree binning — supervised splits
ChiMerge — merges intervals using $\chi^2$
Use cases: decision trees, naive Bayes, business reporting, fairness audits
Concept-hierarchy generation — specific → general (e.g., year → decade)

Basic Idea

What it is, why it matters, how it works

What is discretization?

Discretization transforms a continuous numeric variable into a set of discrete intervals (bins). The continuous value $x = 27.4$ becomes a categorical bin like "25–30" or "young".
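A minimal sketch of that mapping, assuming hypothetical cut points at 25, 30 and 35 (the names `EDGES`, `LABELS` and `to_bin` are illustrative only). Each value is looked up against the sorted edges and replaced by the label of its half-open interval.

```python
from bisect import bisect_right

# Hypothetical interior cut points and labels, chosen only for illustration.
EDGES = [25, 30, 35]
LABELS = ["<25", "25-30", "30-35", "35+"]  # one more label than edges

def to_bin(x: float) -> str:
    """Return the label of the half-open interval [lo, hi) containing x."""
    return LABELS[bisect_right(EDGES, x)]

print(to_bin(27.4))  # 25-30
```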

Why discretize?

Some algorithms need discrete inputs — Naive Bayes, ID3 decision tree, association-rule mining (Apriori).
Interpretability — bins (Low / Medium / High) are easier to explain than raw numbers.
Robustness to outliers — bins squash extreme values.
Capture non-linearity without explicit polynomial features.
Business reporting — age groups, income brackets.
Fairness audits — group analysis across demographic bins.

Binning techniques

1. Equal-width (uniform) binning. Divide the range $[\min, \max]$ into $k$ intervals of equal width $w = (\max - \min)/k$.

Simple, intuitive.
Sensitive to outliers (a single extreme value stretches the range, so most of the data crowds into a few bins).
Bins may have very unequal populations.

Example. Age range 0–100 → 5 bins of width 20: [0, 20), [20, 40), [40, 60), [60, 80), [80, 100]. Half-open intervals put each boundary value in exactly one bin.
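The age example can be sketched in plain Python as follows. The function names and the sample ages are assumptions made for illustration; the maximum value is placed in the last bin so that nothing falls outside the range.

```python
def equal_width_edges(values, k):
    """Return k+1 edges that split [min, max] into k equal-width intervals."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / k
    return [lo + i * w for i in range(k + 1)]

def equal_width_bin(x, lo, hi, k):
    """Index (0..k-1) of the bin containing x; the maximum goes in the last bin."""
    w = (hi - lo) / k
    return min(int((x - lo) / w), k - 1)

ages = [0, 12, 19, 20, 35, 59, 60, 99, 100]  # hypothetical sample
edges = equal_width_edges(ages, 5)
bins = [equal_width_bin(a, 0, 100, 5) for a in ages]
print(edges)  # [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]
print(bins)   # [0, 0, 0, 1, 1, 2, 3, 4, 4]
```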

2. Equal-frequency (quantile) binning. Choose bin edges so each bin has approximately the same number of items (exactly equal counts are not always possible when values are tied).

Robust to outliers.
Bins have different widths.
Used for percentiles, deciles, quartiles.

Example. Income binned into deciles — each contains 10% of customers.
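A rank-based sketch of equal-frequency binning, assuming a small hypothetical income list with a long tail. The rule "rank r goes to bin r·k // n" is one simple way to get near-equal counts; it is not the only quantile convention, and tied values may straddle a boundary.

```python
def equal_frequency_bins(values, k):
    """Assign each value a bin index 0..k-1 so bin sizes differ by at most one.

    Values are ranked, and rank r (0-based) goes to bin r * k // n.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins

incomes = [18, 22, 25, 31, 40, 47, 52, 90, 400, 2500]  # hypothetical, long-tailed
bins = equal_frequency_bins(incomes, 5)
print(bins)  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```

The two extreme incomes (400 and 2500) simply share the top bin, which is why this method is less affected by outliers than equal-width binning.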

3. Cluster-based binning. Run k-means on the single feature; bins = cluster intervals.

Finds natural groupings.
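Computationally costlier than equal-width or equal-frequency binning, and the result depends on the number of clusters and the initialisation.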
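A sketch of cluster-based binning using a hand-written one-dimensional k-means (Lloyd's algorithm). The initialisation at evenly spaced order statistics and the use of midpoints between centres as bin edges are assumptions made here for a compact example; library implementations may differ.

```python
def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on one feature; returns sorted cluster centres."""
    xs = sorted(values)
    n = len(xs)
    # Assumed initialisation: evenly spaced order statistics.
    centres = [float(xs[(2 * j + 1) * n // (2 * k)]) for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centres[j]))
            groups[nearest].append(x)
        # Empty clusters keep their previous centre.
        new = [sum(g) / len(g) if g else centres[j] for j, g in enumerate(groups)]
        if new == centres:
            break
        centres = new
    return sorted(centres)

def cluster_edges(centres):
    """Bin edges taken as midpoints between adjacent centres."""
    return [(a + b) / 2 for a, b in zip(centres, centres[1:])]

values = [1, 2, 3, 20, 21, 22, 70, 71, 72]  # hypothetical, three natural groups
centres = kmeans_1d(values, 3)
print(centres)                 # [2.0, 21.0, 71.0]
print(cluster_edges(centres))  # [11.5, 46.0]
```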

4. Decision-tree (supervised) binning. Train a single-feature decision tree against the target; use splits as bin edges.

Bins are tailored to maximise predictive power.
Risks overfitting; use cross-validation.
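A sketch of supervised binning with a depth-limited, single-feature tree. It assumes Gini impurity as the split criterion and hypothetical `max_depth` and `min_leaf` limits to curb overfitting; the data are invented for illustration.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys, min_leaf):
    """Threshold that minimises weighted Gini, or None if no split helps."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = (gini(ys), None)  # a split must beat the impurity of not splitting
    for i in range(min_leaf, n - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[0]:
            best = (score, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best[1]

def tree_bin_edges(xs, ys, max_depth=2, min_leaf=2):
    """Collect the split thresholds of a depth-limited single-feature tree."""
    if max_depth == 0 or len(xs) < 2 * min_leaf:
        return []
    t = best_split(xs, ys, min_leaf)
    if t is None:
        return []
    edges = [t]
    for keep in (lambda x: x < t, lambda x: x >= t):
        part = [(x, y) for x, y in zip(xs, ys) if keep(x)]
        px, py = [p[0] for p in part], [p[1] for p in part]
        edges += tree_bin_edges(px, py, max_depth - 1, min_leaf)
    return sorted(edges)

# Hypothetical feature and binary target.
xs = [1, 2, 3, 4, 10, 12, 14, 16, 30, 36, 40, 48]
ys = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
print(tree_bin_edges(xs, ys))  # [7.0, 23.0]
```

The resulting edges define three bins (below 7, 7 to 23, above 23), each pure in the target for this toy data.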

5. ChiMerge. Start with each unique value as its own bin; repeatedly merge the adjacent pair of bins with the lowest $\chi^2$ statistic against the target, stopping once every adjacent pair reaches a chosen threshold.

Supervised, statistically grounded.
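A compact ChiMerge sketch under stated assumptions: the threshold 2.706 is the $\chi^2$ critical value for 1 degree of freedom at the 0.10 level (two classes), cells with zero expected count are skipped rather than smoothed, and the data are hypothetical.

```python
from collections import Counter

def chi2_pair(a, b, classes):
    """Pearson chi-square statistic for the 2 x |classes| table of two adjacent bins."""
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:  # assumption: skip empty cells instead of smoothing
                stat += (row[c] - expected) ** 2 / expected
    return stat

def chimerge(xs, ys, threshold):
    """Merge adjacent bins until every adjacent pair has chi-square >= threshold.

    Returns the lower bound of each final bin.
    """
    classes = sorted(set(ys))
    counts = {}
    for x, y in zip(xs, ys):
        counts.setdefault(x, Counter())[y] += 1
    bins = [(x, counts[x]) for x in sorted(counts)]  # (lower bound, class counts)
    while len(bins) > 1:
        stats = [chi2_pair(bins[i][1], bins[i + 1][1], classes)
                 for i in range(len(bins) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= threshold:
            break
        bins[i:i + 2] = [(bins[i][0], bins[i][1] + bins[i + 1][1])]
    return [lo for lo, _ in bins]

# Hypothetical feature and binary target.
xs = [1, 2, 3, 4, 10, 12, 14, 16, 30, 36, 40, 48]
ys = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
print(chimerge(xs, ys, threshold=2.706))  # [1, 10, 30]
```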

Concept-hierarchy generation

A form of discretisation across multiple levels of abstraction:

Numeric: age 27 → 20s → adult.
Categorical: Bengaluru → Karnataka → India → Asia.
Date: 2024-11-15 → November 2024 → Q4 2024 → 2024.

Used in OLAP cubes (roll-up / drill-down) and data-warehousing.
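A small sketch of the date hierarchy and an OLAP-style roll-up. The `sales` rows and the function names are hypothetical; the point is that the same records can be aggregated at any level of the hierarchy.

```python
from collections import defaultdict
from datetime import date

def date_levels(d: date) -> dict:
    """Map one date to successively more general levels of a concept hierarchy."""
    quarter = (d.month - 1) // 3 + 1
    return {
        "day": d.isoformat(),
        "month": f"{d.year}-{d.month:02d}",
        "quarter": f"Q{quarter} {d.year}",
        "year": str(d.year),
    }

def roll_up(rows, level):
    """Aggregate (date, amount) rows at the requested hierarchy level."""
    totals = defaultdict(float)
    for d, amount in rows:
        totals[date_levels(d)[level]] += amount
    return dict(totals)

# Hypothetical sales rows, used only to illustrate roll-up.
sales = [(date(2024, 11, 15), 120.0), (date(2024, 11, 20), 80.0),
         (date(2024, 12, 1), 50.0), (date(2024, 3, 9), 40.0)]

print(date_levels(date(2024, 11, 15))["quarter"])  # Q4 2024
print(roll_up(sales, "month"))    # {'2024-11': 200.0, '2024-12': 50.0, '2024-03': 40.0}
print(roll_up(sales, "quarter"))  # {'Q4 2024': 250.0, 'Q1 2024': 40.0}
```

Drill-down is the reverse direction: calling `roll_up` with a more specific level such as `"day"`.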

Caveats

Information loss. Discretisation discards precision.
Bin count $k$ matters: too few bins hide real variation; too many leave sparse bins that mostly capture noise.
Boundary effects — values just either side of a bin edge fall into different bins even though they are nearly identical.

Use cross-validation to choose binning strategy and $k$.
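One way to sketch this: score each candidate $k$ by cross-validated error of a deliberately simple stand-in model that predicts the mean target of each equal-width bin. The model, the interleaved folds, the known range 0 to 100 and the synthetic step-shaped data are all assumptions for illustration; in practice the real downstream model would be scored instead.

```python
import random

def bin_index(x, lo, hi, k):
    """Equal-width bin index in 0..k-1 for x in [lo, hi]."""
    return min(int((x - lo) / (hi - lo) * k), k - 1)

def cv_mse(xs, ys, k, lo, hi, n_folds=5):
    """Cross-validated MSE of a 'predict the bin mean' model with k equal-width bins."""
    errors = []
    for fold in range(n_folds):
        rows = list(enumerate(zip(xs, ys)))
        train = [p for i, p in rows if i % n_folds != fold]
        valid = [p for i, p in rows if i % n_folds == fold]
        global_mean = sum(y for _, y in train) / len(train)
        sums, counts = [0.0] * k, [0] * k
        for x, y in train:
            b = bin_index(x, lo, hi, k)
            sums[b] += y
            counts[b] += 1
        # Bins with no training data fall back to the global mean.
        means = [s / c if c else global_mean for s, c in zip(sums, counts)]
        errors += [(y - means[bin_index(x, lo, hi, k)]) ** 2 for x, y in valid]
    return sum(errors) / len(errors)

# Hypothetical data: a step-shaped signal plus Gaussian noise.
rng = random.Random(0)
xs = [i * 0.5 for i in range(200)]  # 0.0 .. 99.5
ys = [10 * (x // 25) + rng.gauss(0, 1) for x in xs]

scores = {k: cv_mse(xs, ys, k, lo=0, hi=100) for k in (1, 2, 4, 8, 16, 64)}
best_k = min(scores, key=scores.get)
print(best_k, round(scores[best_k], 2))
```

With a single bin the model can only predict the overall mean, so its error is far larger than with four or more bins; very large $k$ tends to drift back up as each bin mean is estimated from fewer points.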

Mind Map

Visual structure of the concept

DATA DISCRETIZATION
├── Continuous → Discrete (bins)
├── Why?
│   ├── Needed for NB, Apriori, ID3
│   ├── Interpretability
│   ├── Outlier robustness
│   └── Business reporting
├── Methods
│   ├── Equal-width (uniform)
│   ├── Equal-frequency (quantile)
│   ├── Cluster-based (k-means)
│   ├── Decision-tree (supervised)
│   └── ChiMerge (χ²-based merge)
├── Concept hierarchy
│   ├── Numeric: 27 → 20s → adult
│   ├── Geo: city → state → country
│   └── Date: day → month → year
└── Caveats
    ├── Loses precision
    └── Choose k via CV

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. What is data discretization? The process of converting a continuous variable into a set of discrete intervals (bins) — e.g., age $\in [0, 100]$ → groups <25, 25–40, 40–60, >60.

Q2. Differentiate equal-width and equal-frequency binning.

Equal-width — divide range into intervals of equal width; bins may have unequal populations; sensitive to outliers.
Equal-frequency (quantile) — choose bin edges so each bin holds roughly the same count of items; bins have different widths; robust to outliers.

Q3. Why discretize a continuous variable? To suit algorithms that require discrete input (Naive Bayes, Apriori), improve interpretability, reduce outlier impact, capture non-linearity, and enable business-friendly reporting (e.g., income brackets).

Part B (20 marks)

Q. Explain data discretization techniques with examples. Discuss concept-hierarchy generation and its uses.

What and why.

Discretisation maps a continuous variable into discrete intervals. Useful for:

Algorithms that need discrete input — Naive Bayes, ID3, association-rule mining (Apriori).
Interpretability — "Low / Medium / High" reads better than 24.7.
Robustness to outliers — bins absorb extreme values.
Capturing non-linear effects without polynomial expansion.
Business reporting and fairness audits.

Techniques.

1. Equal-width (uniform) binning. Split $[\min, \max]$ into $k$ intervals of width $w = (\max - \min)/k$. Simple, but sensitive to outliers, and it may leave some bins empty while others are overcrowded.

Example. Ages 0–100 → 5 bins of width 20: [0, 20), [20, 40), [40, 60), [60, 80), [80, 100].

2. Equal-frequency (quantile) binning. Choose boundaries so each bin contains approximately the same number of items. Robust to outliers. Used for deciles, quartiles, and percentile-based risk scoring.

Example. Customer income binned into deciles — each contains 10%.

3. Cluster-based binning. Run k-means on the single feature; bin = cluster interval. Finds natural groupings but computationally costlier.

4. Decision-tree (supervised) binning. Train a single-feature tree against the target; use the splits as bin edges. Predictive but risks overfitting; use cross-validation.

5. ChiMerge. Start with each distinct value as its own bin; repeatedly merge the adjacent pair of bins with the lowest $\chi^2$ statistic against the target, stopping once every adjacent pair reaches a chosen threshold. Supervised and statistically grounded.

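Example workflow. For a churn model, discretise tenure into deciles, then pass the binned feature to a Naive Bayes classifier; on long-tailed tenure distributions this can give better-calibrated probabilities than a Gaussian likelihood fitted to the raw values.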
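A sketch of this workflow on synthetic data. The tenure values, the churn rule, and the helper names are hypothetical, and the classifier is a hand-written single-feature categorical Naive Bayes with Laplace smoothing; it illustrates the pipeline only and does not demonstrate any calibration gain.

```python
from bisect import bisect_right
from collections import Counter

def quantile_edges(values, k):
    """Interior cut points at ranks n/k, 2n/k, ... of the sorted training values."""
    xs = sorted(values)
    return [xs[i * len(xs) // k] for i in range(1, k)]

def fit_naive_bayes(bins, labels, n_bins, alpha=1.0):
    """Naive Bayes on one categorical feature, with Laplace smoothing alpha."""
    n = len(labels)
    class_count = Counter(labels)
    joint_count = Counter(zip(labels, bins))

    def predict_proba(b):
        # Prior times smoothed likelihood of bin b, then normalise over classes.
        scores = {
            c: (class_count[c] / n)
               * (joint_count[(c, b)] + alpha) / (class_count[c] + alpha * n_bins)
            for c in class_count
        }
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}

    return predict_proba

# Hypothetical training data: long-tailed tenure (months); short tenures churn.
tenure = [i * i / 50 for i in range(100)]
churn = [1 if t < 20 else 0 for t in tenure]

edges = quantile_edges(tenure, 10)  # k = 10 gives deciles
train_bins = [bisect_right(edges, t) for t in tenure]
model = fit_naive_bayes(train_bins, churn, n_bins=10)

for t in (5.0, 150.0):
    p = model(bisect_right(edges, t))[1]
    print(f"tenure={t}: P(churn) = {p:.2f}")
```

The edges are learned from the training data and reused unchanged for new customers, so that train-time and predict-time bins agree.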

Concept-Hierarchy Generation.

A multi-level discretisation that captures abstraction:

Level	Numeric	Geographic	Date
Most specific	Age 27	Bengaluru	2024-11-15
Mid-level	25–30	Karnataka	Nov 2024
General	"Adult"	India	Q4 2024
Most general	–	Asia	2024

Uses.

OLAP cubes — roll-up (aggregate) and drill-down across levels.
Data warehousing — pre-aggregated marts for fast reporting.
Decision rules at the right level — "promote in Asia" vs "promote in Bengaluru."
Privacy — generalisation enables k-anonymity (e.g., release "30–40" instead of the exact age).
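The privacy use can be sketched as a check that every combination of generalised values appears at least $k$ times. The records, the 10-year band width, and the hard-coded city-to-state mapping are hypothetical; non-overlapping bands such as "30-39" are used so each age has exactly one band.

```python
from collections import Counter

def generalise_age(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    lo = age // width * width
    return f"{lo}-{lo + width - 1}"

def is_k_anonymous(records, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    return all(count >= k for count in Counter(records).values())

# Hypothetical (age, city) quasi-identifiers.
raw = [(31, "Bengaluru"), (34, "Bengaluru"), (38, "Bengaluru"),
       (42, "Mysuru"), (45, "Mysuru"), (47, "Mysuru")]
# Climb one level in each hierarchy: age -> band, city -> state.
generalised = [(generalise_age(a), "Karnataka") for a, _ in raw]

print(is_k_anonymous(raw, 3))          # False
print(is_k_anonymous(generalised, 3))  # True
print(generalised[0])                  # ('30-39', 'Karnataka')
```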

Caveats.

Information loss — every discretisation discards precision.
Choice of $k$ — too few bins lose signal; too many create noise. Use cross-validation.
Boundary effects — items just on either side of a boundary land in different buckets despite being similar.

A thoughtful discretisation strategy can improve interpretability and downstream model performance — particularly for tree-based and Bayesian methods.
