PGDDSA Study · Semester 1

PGD01C02

Module 3 · Exploratory Data Analytics and Model Development

Descriptive Statistics: Mean, SD, Skewness, Kurtosis

Core Titles

Key headlines and terms for quick recall

Descriptive statistics — summarise data, no inference
Measures of central tendency: Mean, Median, Mode
Measures of spread: Range, IQR, Variance, Standard Deviation
Skewness — asymmetry of distribution
Kurtosis — tailedness / peakedness
Percentiles, Quartiles
Five-number summary (min, Q1, median, Q3, max)
Standard error, coefficient of variation $CV = \sigma/\mu$

Basic Idea

What it is, why it matters, how it works

What is descriptive statistics?

Descriptive statistics summarise the main features of a dataset — without making any inference about a larger population. They are the first lens applied during EDA.

Measures of Central Tendency

Mean — arithmetic average $\bar{x} = \dfrac{1}{n} \sum_{i=1}^n x_i$ .

Sensitive to outliers.

Median — middle value when sorted.

Robust to outliers; better for skewed data (income, house prices).

Mode — most frequent value.

Only sensible measure for nominal categorical data.
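As a minimal sketch of the three measures, the snippet below uses Python's standard `statistics` module. The salary figures are the ones from the worked example later in these notes; the `colours` list is a made-up nominal sample, included only to show that the mode is the one measure that still applies.

```python
import statistics

# Salaries in thousands (from the worked example in these notes).
salaries = [30, 32, 33, 35, 200]

mean_salary = statistics.mean(salaries)      # pulled upward by the 200k value
median_salary = statistics.median(salaries)  # middle value of the sorted list

# Hypothetical nominal data: mean and median are undefined, mode still works.
colours = ["red", "blue", "red", "green", "red", "blue"]
modal_colour = statistics.mode(colours)

print(mean_salary, median_salary, modal_colour)
```

The gap between the mean and the median here comes entirely from the single large salary, which is why the median is usually preferred for skewed data.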

Measures of Spread

Range = $\max - \min$ . Very sensitive to outliers.

Interquartile Range (IQR) = $Q3 - Q1$ . Robust; spans the middle 50%.

Variance $\sigma^2 = \dfrac{1}{n} \sum (x_i - \mu)^2$ (population) or $s^2 = \dfrac{1}{n-1} \sum (x_i - \bar{x})^2$ (sample; the $n-1$ divisor corrects the bias from estimating the mean). Units are squared.

Standard deviation $\sigma = \sqrt{\sigma^2}$ . In the same units as the data, so it is the most commonly reported measure of spread.

Coefficient of Variation $CV = \sigma / \bar{x}$ . Unit-less; compares variability across different scales. Meaningful only when the mean is positive and not close to zero.
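The spread measures above can be sketched with the standard library alone. The dataset is the small one used in the spread example of these notes; the choice of the "inclusive" quartile method is an assumption, since quartile conventions differ between tools.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # small dataset from the notes

value_range = max(data) - min(data)

# Quartiles by linear interpolation between order statistics ("inclusive").
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

pop_var = statistics.pvariance(data)    # divides by n
sample_var = statistics.variance(data)  # divides by n - 1
pop_std = statistics.pstdev(data)

cv = pop_std / statistics.mean(data)    # unit-less ratio

print(value_range, iqr, pop_var, sample_var, pop_std, cv)
```

Note that the population and sample variances differ noticeably on a sample this small; the difference shrinks as $n$ grows.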

Shape

Skewness measures asymmetry: $\text{Skew} = \frac{E[(X - \mu)^3]}{\sigma^3}.$

0 = symmetric (e.g. the normal distribution).
$> 0$ = right-skewed (tail on right; mean > median); income, lifespan.
$< 0$ = left-skewed (tail on left; mean < median); exam scores capped at 100.

Kurtosis measures tailedness: $\text{Kurt} = \frac{E[(X - \mu)^4]}{\sigma^4}.$ Excess kurtosis = Kurt − 3 (so a normal has 0).

$> 0$ leptokurtic — fat tails, more outliers (stock returns).
$< 0$ platykurtic — thin tails, flat top.
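A minimal sketch of the two shape formulas, computed directly from the moment definitions and treating the data as a whole population. The helper names and both sample lists are hypothetical. Library routines such as pandas' `skew()` and `kurt()` apply small-sample bias corrections, so their numbers will differ slightly from these.

```python
import statistics

def standardized_moment(values, k):
    """k-th standardized moment, treating `values` as a whole population."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum((x - mu) ** k for x in values) / (len(values) * sigma ** k)

def skewness(values):
    return standardized_moment(values, 3)

def excess_kurtosis(values):
    return standardized_moment(values, 4) - 3

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 3, 4, 50]  # hypothetical sample with one large value

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed))
```

The evenly spaced list has zero skewness and negative excess kurtosis (flat, thin-tailed), while the list with one large value has positive skewness.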

Quartiles and percentiles

Quartiles $Q1$, $Q2$ (= median), $Q3$ split the sorted data into four equal parts.
Percentiles $P_k$ — value below which $k\%$ of the observations fall.
Five-number summary $(\min, Q1, \text{median}, Q3, \max)$ — basis of the box plot.
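A short sketch of the five-number summary, using the income figures from the worked example in these notes. The function name is hypothetical, and the "inclusive" quartile method is an assumed convention; other conventions can return slightly different quartiles for the same data.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

incomes = [20, 24, 26, 30, 35, 40, 45, 60, 200]  # thousands, from the notes
summary = five_number_summary(incomes)
print(summary)
```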

Why these matter

Quickly profile a dataset.
Catch data-quality issues (impossible min/max, weird mode).
Decide on transformations (log if heavily skewed).
Drive choice of model (parametric vs non-parametric).
Justify outlier treatment.

pandas one-liner

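import pandas as pd  # df is assumed to be an existing pandas DataFrame
df.describe()  # count, mean, std (n - 1 divisor), min, 25%, 50%, 75%, max
df.skew(numeric_only=True)  # bias-corrected sample skewness per numeric column
df.kurt(numeric_only=True)  # excess kurtosis (a normal distribution gives about 0)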

Mind Map

Visual structure of the concept

DESCRIPTIVE STATISTICS
├── Central tendency
│   ├── Mean        (sensitive to outliers)
│   ├── Median      (robust)
│   └── Mode        (nominal)
├── Spread
│   ├── Range
│   ├── IQR  (Q3 − Q1)
│   ├── Variance σ²
│   ├── Std dev σ
│   └── Coef. of variation σ/μ
├── Shape
│   ├── Skewness
│   │   ├── > 0 right-skew (income)
│   │   └── < 0 left-skew (capped scores)
│   └── Kurtosis (raw value; normal = 3, excess = Kurt − 3)
│       ├── > 3 leptokurtic — fat tails
│       └── < 3 platykurtic — flat
├── Quartiles / Percentiles
└── 5-number summary (basis of box plot)

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Define mean, median and mode.

Mean — arithmetic average $\bar x = \dfrac{1}{n}\sum x_i$ .
Median — middle value when sorted; robust to outliers.
Mode — most frequent value; the only measure suitable for nominal data.

Q2. What is skewness?

A measure of asymmetry of a distribution: $\text{Skew} = E[(X - \mu)^3]/\sigma^3$ . Zero = symmetric, positive = right-skewed (long right tail), negative = left-skewed (long left tail).

Q3. What is the difference between variance and standard deviation?

Variance is the average squared deviation from the mean ( $\sigma^2$ ); standard deviation is its square root ( $\sigma$ ). Std-dev is in the original units of the data, making it the more interpretable measure.

Part B (20 marks)

Q. Discuss descriptive statistics with formulas and examples. Explain how mean, median, mode, variance, standard deviation, skewness and kurtosis describe a dataset.

Role of descriptive statistics.

They summarise a dataset's main features — centre, spread, shape — without inferring beyond it. The first lens applied during EDA.

1. Central tendency.

Measure	Formula	Property	Best for
Mean	$\bar x = \frac{1}{n} \sum x_i$	Uses every value	Symmetric data
Median	Middle value	Robust to outliers	Skewed data (income, prices)
Mode	Most frequent	For categorical	Nominal data

Example. Salaries: 30k, 32k, 33k, 35k, 200k → mean = 66k, median = 33k. Median better reflects "typical" salary.

2. Spread.

Measure	Formula	Robust?
Range	$\max - \min$	No
IQR	$Q3 - Q1$	Yes
Variance $\sigma^2$	$\frac{1}{n}\sum (x_i - \bar x)^2$	No
Std dev $\sigma$	$\sqrt{\sigma^2}$	No
CV	$\sigma/\bar x$	No (but scale-free)

Why use IQR? It ignores the extreme tails — useful for skewed distributions.

Example. In a dataset $(2, 4, 4, 4, 5, 5, 7, 9)$ : mean = 5, population variance = 4, population std = 2.
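The arithmetic behind this example, written out. The divisor is $n = 8$, i.e. the population formula; the sample formula with $n - 1 = 7$ would instead give $s^2 = 32/7 \approx 4.57$.

$$
\begin{aligned}
\bar{x} &= \frac{2+4+4+4+5+5+7+9}{8} = \frac{40}{8} = 5 \\
\sum (x_i - \bar{x})^2 &= 9+1+1+1+0+0+4+16 = 32 \\
\sigma^2 &= \frac{32}{8} = 4, \qquad \sigma = \sqrt{4} = 2
\end{aligned}
$$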

3. Shape — Skewness and Kurtosis.

Skewness measures asymmetry: $\text{Skew} = \frac{E[(X - \mu)^3]}{\sigma^3}$

Zero → symmetric (Normal).
Positive → right-skewed; long right tail; mean > median (income, time-to-failure).
Negative → left-skewed; long left tail; mean < median (exam scores capped at 100, time-to-success).

Kurtosis measures tailedness: $\text{Kurt} = \frac{E[(X - \mu)^4]}{\sigma^4}$

Excess kurtosis = Kurt − 3 (Normal has 0).
$> 0$ leptokurtic — fat tails, more outliers (stock returns).
$< 0$ platykurtic — thinner tails, flat top.

Why care? Many models (e.g. ordinary least-squares linear regression) assume roughly Gaussian noise for their standard inference. Heavy skew or excess kurtosis violates this; apply transformations or switch to robust models.

4. Quartiles, Percentiles and Five-number summary.

The five-number summary $(\min, Q1, \text{median}, Q3, \max)$ underpins the box plot and reveals outliers visually.

Putting it together — a worked example.

Suppose a study finds incomes (in thousands ₹): 20, 24, 26, 30, 35, 40, 45, 60, 200.

Mean = 53.3 — pulled by the outlier.
Median = 35 — robust, more representative.
Std dev ≈ 56 (sample formula, $n - 1$ divisor) — larger than the mean itself, inflated by the outlier.
Skewness > 0 — long right tail.
Box plot — shows ₹200 k as an outlier far beyond $Q3 + 1.5 \cdot IQR$ .
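The figures in this list can be checked with a few lines of standard-library Python. This is a sketch under one assumed quartile convention ("inclusive" interpolation); a different convention moves the fence somewhat, but 200 stays well outside it.

```python
import statistics

incomes = [20, 24, 26, 30, 35, 40, 45, 60, 200]  # thousands of rupees, from the notes

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
sample_std = statistics.stdev(incomes)  # n - 1 divisor

# Box-plot fences: points beyond 1.5 * IQR from the quartiles are flagged.
q1, _, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in incomes if x < lower_fence or x > upper_fence]
print(mean_income, median_income, sample_std, upper_fence, outliers)
```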

A data scientist seeing this:

Log-transforms the income to reduce skew.
Reports median, not mean.
Considers winsorising the outlier or using a robust model.
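The effect of the first and third actions can be illustrated on the same incomes. This is a rough sketch: `skewness` is a hypothetical helper using the population moment formula, and the one-sided clip at 73.5 (the $Q3 + 1.5 \cdot IQR$ fence under inclusive quartiles) is a simplified stand-in for winsorising, which more commonly clips both tails at chosen percentiles.

```python
import math
import statistics

def skewness(values):
    """Population skewness: third central moment divided by sigma cubed."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum((x - mu) ** 3 for x in values) / (len(values) * sigma ** 3)

incomes = [20, 24, 26, 30, 35, 40, 45, 60, 200]

# Natural-log transform: compresses large values more than small ones.
log_incomes = [math.log(x) for x in incomes]

# Simplified winsorising: clip anything above an assumed cap.
cap = 73.5
winsorised = [min(x, cap) for x in incomes]

print(skewness(incomes), skewness(log_incomes), skewness(winsorised))
```

Both treatments lower the skewness relative to the raw data. The log transform keeps the ordering and all information in the data, whereas clipping discards how extreme the outlier was.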

Descriptive statistics — though "basic" — drive every downstream modelling decision.
