PGD01C02
Module 3 · Exploratory Data Analytics and Model Development

Descriptive Statistics: Mean, SD, Skewness, Kurtosis

Core Titles
Key headlines and terms for quick recall
  • Descriptive statistics — summarise data, no inference
  • Measures of central tendency: Mean, Median, Mode
  • Measures of spread: Range, IQR, Variance, Standard Deviation
  • Skewness — asymmetry of distribution
  • Kurtosis — tailedness / peakedness
  • Percentiles, Quartiles
  • Five-number summary (min, Q1, median, Q3, max)
  • Standard error, coefficient of variation $CV = \sigma/\mu$
Basic Idea
What it is, why it matters, how it works

What is descriptive statistics?

Descriptive statistics summarise the main features of a dataset — without making any inference about a larger population. They are the first lens applied during EDA.

Measures of Central Tendency

Mean — arithmetic average $\bar{x} = \dfrac{1}{n} \sum_{i=1}^n x_i$.

  • Sensitive to outliers.

Median — middle value when sorted.

  • Robust to outliers; better for skewed data (income, house prices).

Mode — most frequent value.

  • Only sensible measure for nominal categorical data.
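
The three measures can be compared with a short sketch using Python's standard `statistics` module. The numbers and colour labels are hypothetical, chosen only to show how one outlier moves the mean while the median barely changes.

```python
import statistics

# Hypothetical values, used only to illustrate the definitions above.
values = [4, 5, 5, 6, 7]
with_outlier = values + [100]          # same data plus one extreme value

mean_clean = statistics.mean(values)            # 27 / 5
median_clean = statistics.median(values)        # middle of the sorted list
mean_outlier = statistics.mean(with_outlier)    # 127 / 6, pulled upwards
median_outlier = statistics.median(with_outlier)  # average of the two middle values

# Mode works on nominal (unordered categorical) data, where mean/median do not.
colours = ["red", "blue", "red", "green"]
most_common = statistics.mode(colours)

print(mean_clean, median_clean)
print(mean_outlier, median_outlier)
print(most_common)
```

The mean jumps from 5.4 to about 21.2 when the outlier is added, while the median only moves from 5 to 5.5.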

Measures of Spread

Range = $\max - \min$. Very sensitive to outliers.

Interquartile Range (IQR) = $Q3 - Q1$. Robust; spans the middle 50%.

Variance $\sigma^2 = \dfrac{1}{n} \sum (x_i - \bar{x})^2$ (population) or $s^2 = \dfrac{1}{n-1} \sum (x_i - \bar{x})^2$ (sample). Units are squared.

Standard deviation $\sigma = \sqrt{\sigma^2}$. In the same units as the data, so it is the most commonly reported measure of spread.

Coefficient of Variation $CV = \sigma / \bar{x}$. Unit-less; compares variability across different scales.
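
A minimal sketch of these spread measures, assuming the sample standard deviation and the "inclusive" quartile convention of `statistics.quantiles` (other tools may use slightly different quartile rules). The helper name `spread_summary` and both datasets are hypothetical.

```python
import statistics

def spread_summary(data):
    """Return range, IQR, sample std dev and CV for a list of numbers."""
    # "inclusive" interpolates between data points, treating min and max
    # as the 0th and 100th percentiles.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    sd = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    return {
        "range": max(data) - min(data),
        "iqr": q3 - q1,
        "std": sd,
        "cv": sd / statistics.mean(data),  # unit-less ratio
    }

clean = [10, 12, 13, 14, 15, 16, 18]
with_outlier = clean[:-1] + [180]   # largest value replaced by an outlier

clean_stats = spread_summary(clean)
outlier_stats = spread_summary(with_outlier)
print(clean_stats)
print(outlier_stats)
```

In this toy case the range grows from 8 to 170 and the standard deviation inflates, while the IQR stays at 3.0, which is what "robust" means in practice.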

Shape

Skewness measures asymmetry: $\text{Skew} = \dfrac{E[(X - \mu)^3]}{\sigma^3}$.

  • $0$ = symmetric (e.g. normal).
  • $> 0$ = right-skewed (tail on right; typically mean > median); income, time-to-failure.
  • $< 0$ = left-skewed (tail on left; typically mean < median); exam scores capped at 100.

Kurtosis measures tailedness: $\text{Kurt} = \dfrac{E[(X - \mu)^4]}{\sigma^4}$. Excess kurtosis = Kurt − 3 (so a normal has 0).

  • Excess $> 0$: leptokurtic — fat tails, more outliers (stock returns).
  • Excess $< 0$: platykurtic — thin tails, flat top.
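
The two shape formulas can be evaluated directly from central moments. This is a sketch of the plain moment-based (population) versions given above; the function name `moments_shape` and the data are hypothetical, and library routines such as pandas' `skew()` and `kurt()` apply small-sample bias corrections, so their values will differ somewhat on short series.

```python
def moments_shape(data):
    """Moment-based skewness and excess kurtosis (no bias correction)."""
    n = len(data)
    mu = sum(data) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((x - mu) ** 2 for x in data) / n
    m3 = sum((x - mu) ** 3 for x in data) / n
    m4 = sum((x - mu) ** 4 for x in data) / n
    skew = m3 / m2 ** 1.5          # E[(X - mu)^3] / sigma^3
    excess_kurt = m4 / m2 ** 2 - 3  # E[(X - mu)^4] / sigma^4, minus 3
    return skew, excess_kurt

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

skew_sym, kurt_sym = moments_shape(symmetric)
skew_right, kurt_right = moments_shape(right_skewed)
print(skew_sym, kurt_sym)
print(skew_right, kurt_right)
```

The evenly spaced list has zero skewness and negative excess kurtosis (flat, platykurtic), while the list with one large value has positive skewness.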

Quartiles and percentiles

  • Quartiles $Q1$, $Q2$ (= median), $Q3$ split the sorted data into four equal parts.
  • Percentiles $P_k$ — value below which $k\%$ of the data falls.
  • Five-number summary $(\min, Q1, \text{median}, Q3, \max)$ — basis of the box plot.
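
A short sketch of the five-number summary and a percentile, again assuming the "inclusive" interpolation rule of `statistics.quantiles`. Quartile conventions vary between textbooks and libraries, so small differences are expected. The data and the helper name are hypothetical.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, med, q3, max(data)

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]   # hypothetical, unsorted on purpose

summary = five_number_summary(data)

# With n=100 the function returns 99 cut points; index k-1 is the k-th percentile.
p90 = statistics.quantiles(data, n=100, method="inclusive")[89]

print(summary)
print(p90)
```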

Why these matter

  • Quickly profile a dataset.
  • Catch data-quality issues (impossible min/max, weird mode).
  • Decide on transformations (log if heavily skewed).
  • Drive choice of model (parametric vs non-parametric).
  • Justify outlier treatment.

pandas one-liner

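# df is assumed to be an existing pandas DataFrame
df.describe()  # numeric columns: count, mean, std (sample, ddof=1), min, 25%, 50%, 75%, max
df.skew(numeric_only=True), df.kurt(numeric_only=True)  # kurt() returns excess kurtosis (normal ≈ 0)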
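
For a runnable version of the calls above, here is a minimal sketch on a toy DataFrame. The column names and values are hypothetical and serve only to show what each call returns.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [23, 25, 31, 35, 41, 52],
    "income": [20, 24, 26, 30, 35, 200],
})

summary = df.describe()                        # one column of statistics per numeric column
skewness = df.skew(numeric_only=True)          # bias-corrected sample skewness
excess_kurtosis = df.kurt(numeric_only=True)   # excess kurtosis (normal is about 0)

print(summary)
print(skewness)
print(excess_kurtosis)
```

The `50%` row of `describe()` is the median, so comparing it with the `mean` row gives a quick hint of skew before any plot is drawn.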
Mind Map
Visual structure of the concept
DESCRIPTIVE STATISTICS
├── Central tendency
│   ├── Mean        (sensitive to outliers)
│   ├── Median      (robust)
│   └── Mode        (nominal)
├── Spread
│   ├── Range
│   ├── IQR  (Q3 − Q1)
│   ├── Variance σ²
│   ├── Std dev σ
│   └── Coef. of variation σ/μ
├── Shape
│   ├── Skewness
│   │   ├── > 0 right-skew (income)
│   │   └── < 0 left-skew (capped scores)
│   └── Kurtosis
│       ├── > 3 (excess > 0) leptokurtic — fat tails
│       └── < 3 (excess < 0) platykurtic — flat
├── Quartiles / Percentiles
└── 5-number summary (basis of box plot)
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Define mean, median and mode.

  • Mean — arithmetic average $\bar x = \dfrac{1}{n}\sum x_i$.
  • Median — middle value when sorted; robust to outliers.
  • Mode — most frequent value; the only measure suitable for nominal data.

Q2. What is skewness? A measure of asymmetry of a distribution: $\text{Skew} = E[(X - \mu)^3]/\sigma^3$. Zero = symmetric, positive = right-skewed (long right tail), negative = left-skewed (long left tail).

Q3. What is the difference between variance and standard deviation? Variance is the average squared deviation from the mean ($\sigma^2$); standard deviation is its square root ($\sigma$). Std-dev is in the original units of the data, making it the more interpretable measure.


Part B (20 marks)

Q. Discuss descriptive statistics with formulas and examples. Explain how mean, median, mode, variance, standard deviation, skewness and kurtosis describe a dataset.

Role of descriptive statistics.

They summarise a dataset's main features — centre, spread, shape — without inferring beyond it. The first lens applied during EDA.

1. Central tendency.

| Measure | Formula | Property | Best for |
| --- | --- | --- | --- |
| Mean | $\bar x = \frac{1}{n} \sum x_i$ | Uses every value | Symmetric data |
| Median | Middle value | Robust to outliers | Skewed data (income, prices) |
| Mode | Most frequent | For categorical | Nominal data |

Example. Salaries: 30k, 32k, 33k, 35k, 200k → mean = 66k, median = 33k. Median better reflects "typical" salary.

2. Spread.

| Measure | Formula | Robust? |
| --- | --- | --- |
| Range | $\max - \min$ | No |
| IQR | $Q3 - Q1$ | Yes |
| Variance $\sigma^2$ | $\frac{1}{n}\sum (x_i - \bar x)^2$ | No |
| Std dev $\sigma$ | $\sqrt{\sigma^2}$ | No |
| CV | $\sigma/\bar x$ | No (but scale-free) |

Why use IQR? It ignores the extreme tails — useful for skewed distributions.

Example. For the dataset $(2, 4, 4, 4, 5, 5, 7, 9)$: mean = 5, population variance = 4, population std dev = 2.
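
The figures in this example use the population formula (divide by $n$). A quick check with the standard `statistics` module, which also shows how the sample formula (divide by $n - 1$) gives a slightly larger value on the same data:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # dataset from the example above

mean = statistics.mean(data)

# Population versions: sum of squared deviations (32) divided by n = 8.
pop_var = statistics.pvariance(data)
pop_std = statistics.pstdev(data)

# Sample versions: same sum of squares divided by n - 1 = 7.
sample_var = statistics.variance(data)
sample_std = statistics.stdev(data)

print(mean, pop_var, pop_std)
print(sample_var, sample_std)
```

In an exam answer it is worth stating which of the two formulas is being used, since they differ noticeably for small $n$.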

3. Shape — Skewness and Kurtosis.

Skewness measures asymmetry: $\text{Skew} = \dfrac{E[(X - \mu)^3]}{\sigma^3}$

  • Zero → symmetric (Normal).
  • Positive → right-skewed; long right tail; mean > median (income, time-to-failure).
  • Negative → left-skewed; long left tail; mean < median (exam scores capped at 100, time-to-success).

Kurtosis measures tailedness: $\text{Kurt} = \dfrac{E[(X - \mu)^4]}{\sigma^4}$

  • Excess kurtosis = Kurt − 3 (Normal has 0).
  • Excess $> 0$: leptokurtic — fat tails, more outliers (stock returns).
  • Excess $< 0$: platykurtic — thinner tails, flat top.

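Why care? Many methods (for example, the standard inference for ordinary linear regression) assume roughly Gaussian errors. Heavy skew or excess kurtosis in the data is a warning that this assumption may fail, so consider a transformation, a different error distribution, or a robust model.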

4. Quartiles, Percentiles and Five-number summary.

The five-number summary $(\min, Q1, \text{median}, Q3, \max)$ underpins the box plot and reveals outliers visually.

Putting it together — a worked example.

Suppose a study finds incomes (in thousands ₹): 20, 24, 26, 30, 35, 40, 45, 60, 200.

  • Mean = 53.3 — pulled by the outlier.
  • Median = 35 — robust, more representative.
  • Std dev ≈ 56 (sample formula) — large, inflated by the outlier.
  • Skewness > 0 — long right tail.
  • Box plot — shows ₹200 k as an outlier far beyond $Q3 + 1.5 \cdot IQR$.

A data scientist seeing this:

  1. Log-transforms the income to reduce skew.
  2. Reports median, not mean.
  3. Considers winsorising the outlier or using a robust model.
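
The worked example can be reproduced with a short sketch. It assumes the sample standard deviation, the "inclusive" quartile rule of `statistics.quantiles`, and the uncorrected moment-based skewness formula. The helper name `skewness` is hypothetical, and other quartile or skewness conventions will give slightly different numbers.

```python
import math
import statistics

incomes = [20, 24, 26, 30, 35, 40, 45, 60, 200]   # thousands of rupees, from the text

mean = statistics.mean(incomes)       # 480 / 9, pulled up by the 200
median = statistics.median(incomes)   # middle value of the nine
sd = statistics.stdev(incomes)        # sample standard deviation

# Box-plot rule: flag points above Q3 + 1.5 * IQR.
q1, _, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = [x for x in incomes if x > upper_fence]

def skewness(data):
    """Moment-based skewness, E[(X - mu)^3] / sigma^3, without bias correction."""
    n = len(data)
    mu = sum(data) / n
    m2 = sum((x - mu) ** 2 for x in data) / n
    m3 = sum((x - mu) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

raw_skew = skewness(incomes)
log_skew = skewness([math.log(x) for x in incomes])   # step 1 of the list above

print(round(mean, 1), median, round(sd, 1))
print(upper_fence, outliers)
print(round(raw_skew, 2), round(log_skew, 2))
```

On this data the log transform lowers the skewness but does not remove it, which is why reporting the median and considering outlier treatment remain useful alongside the transformation.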

Descriptive statistics — though "basic" — drive every downstream modelling decision.