Descriptive Statistics: Mean, SD, Skewness, Kurtosis
Core Titles
Key headlines and terms for quick recall
- Descriptive statistics: summarise data, no inference
- Measures of central tendency: Mean, Median, Mode
- Measures of spread: Range, IQR, Variance, Standard Deviation
- Skewness — asymmetry of distribution
- Kurtosis — tailedness / peakedness
- Percentiles, Quartiles
- Five-number summary (min, Q1, median, Q3, max)
- Standard error, coefficient of variation
Basic Idea
What it is, why it matters, how it works
What is descriptive statistics?
Descriptive statistics summarise the main features of a dataset — without making any inference about a larger population. They are the first lens applied during EDA.
Measures of Central Tendency
Mean: arithmetic average, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
- Sensitive to outliers.
Median — middle value when sorted.
- Robust to outliers; better for skewed data (income, house prices).
Mode — most frequent value.
- Only sensible measure for nominal categorical data.
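A minimal sketch of the outlier sensitivity described above, using Python's standard `statistics` module. The numbers and the `colours` list are made-up illustrative values, not data from these notes.

```python
import statistics

# Hypothetical sample, then the same sample with one extreme value appended.
values = [4, 5, 5, 6, 7]
with_outlier = values + [100]

mean_before = statistics.mean(values)          # 27 / 5
mean_after = statistics.mean(with_outlier)     # 127 / 6
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

# Mode works on nominal labels, where mean and median are undefined.
colours = ["red", "blue", "red", "green"]
most_common = statistics.mode(colours)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
print(f"mode:   {most_common}")
```

One extreme value roughly quadruples the mean here, while the median moves by only half a unit.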
Measures of Spread
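Range $= x_{\max} - x_{\min}$. Very sensitive to outliers.
Interquartile Range (IQR) $= Q_3 - Q_1$. Robust; spans the middle 50%.
Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$ (population) or $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (sample). Units are squared.
Standard deviation $\sigma = \sqrt{\sigma^2}$. In the same units as the data, so it is the most commonly reported measure of spread.
Coefficient of Variation $\text{CV} = \sigma / \mu$. Unit-less; compares variability across different scales.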
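The spread measures above can be sketched with the standard library as follows. The `data` list is a hypothetical sample, and the quartiles use the `inclusive` interpolation method of `statistics.quantiles`; other quartile conventions can give a slightly different IQR.

```python
import statistics

data = [10, 12, 14, 16, 18]  # hypothetical sample, mean = 14

data_range = max(data) - min(data)

# Quartiles by linear interpolation between data points ("inclusive" method).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

pop_var = statistics.pvariance(data)   # divides by N
samp_var = statistics.variance(data)   # divides by n - 1
pop_sd = statistics.pstdev(data)       # square root of pop_var

# Coefficient of variation: standard deviation relative to the mean.
cv = pop_sd / statistics.mean(data)

print(data_range, iqr, pop_var, samp_var, round(pop_sd, 3), round(cv, 3))
```

The sample variance is larger than the population variance because it divides the same sum of squares by $n - 1$ instead of $N$.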
Shape
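Skewness measures asymmetry: $\gamma_1 = \frac{E[(X-\mu)^3]}{\sigma^3}$.
- $\gamma_1 = 0$: symmetric (e.g. normal).
- $\gamma_1 > 0$: right-skewed (tail on right; typically mean > median); income, time-to-failure.
- $\gamma_1 < 0$: left-skewed (tail on left; typically mean < median); exam scores capped at 100.
Kurtosis measures tailedness: $\text{Kurt} = \frac{E[(X-\mu)^4]}{\sigma^4}$. Excess kurtosis = Kurt − 3 (so a normal has 0).
- Excess kurtosis > 0 (Kurt > 3): leptokurtic, fat tails, more outliers (stock returns).
- Excess kurtosis < 0 (Kurt < 3): platykurtic, thin tails, flat top.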
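As a rough illustration of the shape measures, the sketch below computes skewness and excess kurtosis directly from central moments. The helper names `central_moment`, `skewness` and `excess_kurtosis` and both sample lists are hypothetical, introduced only for this example.

```python
def central_moment(xs, k):
    """k-th central moment, dividing by n (population form)."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** k for x in xs) / n

def skewness(xs):
    """Moment-based skewness: m3 / m2^(3/2)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]      # hypothetical, symmetric around 3
right_tailed = [1, 2, 3, 4, 20]  # hypothetical, one large value on the right

print(skewness(symmetric), skewness(right_tailed))
print(excess_kurtosis(symmetric))
```

These are the plain moment formulas. Library routines such as pandas' `skew()` and `kurt()` apply small-sample bias corrections, so their values will differ somewhat on short samples like these.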
Quartiles and percentiles
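- Quartiles $Q_1$, $Q_2$ (median), $Q_3$ split the sorted data into four equal parts.
- Percentiles: $P_k$ is the value below which $k\%$ of the data falls.
- Five-number summary (min, $Q_1$, median, $Q_3$, max): basis of the box plot.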
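A small sketch of the five-number summary and a percentile lookup, assuming the `inclusive` interpolation method of `statistics.quantiles`. The function name `five_number_summary` and the `scores` list are hypothetical.

```python
import statistics

def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max) using linear interpolation."""
    q1, q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return min(xs), q1, q2, q3, max(xs)

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data
summary = five_number_summary(scores)

# Percentiles are the 99 cut points that split the data into 100 parts;
# index k - 1 holds the k-th percentile.
percentiles = statistics.quantiles(scores, n=100, method="inclusive")
p90 = percentiles[89]

print(summary, p90)
```

Different tools interpolate quartiles differently, so small samples can give slightly different Q1 and Q3 depending on the convention used.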
Why these matter
- Quickly profile a dataset.
- Catch data-quality issues (impossible min/max, weird mode).
- Decide on transformations (log if heavily skewed).
- Drive choice of model (parametric vs non-parametric).
- Justify outlier treatment.
pandas one-liner
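df.describe()               # count, mean, std, min, 25%, 50%, 75%, max (numeric columns)
df.skew(numeric_only=True)  # skewness per numeric column
df.kurt(numeric_only=True)  # excess kurtosis per numeric column (normal ≈ 0)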
Mind Map
Visual structure of the concept
DESCRIPTIVE STATISTICS
├── Central tendency
│ ├── Mean (sensitive to outliers)
│ ├── Median (robust)
│ └── Mode (nominal)
├── Spread
│ ├── Range
│ ├── IQR (Q3 − Q1)
│ ├── Variance σ²
│ ├── Std dev σ
│ └── Coef. of variation σ/μ
├── Shape
│ ├── Skewness
│ │ ├── > 0 right-skew (income)
│ │ └── < 0 left-skew (capped scores)
│ └── Kurtosis
│ ├── > 3 leptokurtic — fat tails
│ └── < 3 platykurtic — flat
├── Quartiles / Percentiles
└── 5-number summary (basis of box plot)
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions
Part A (2 marks each)
Q1. Define mean, median and mode.
- Mean: arithmetic average, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
- Median — middle value when sorted; robust to outliers.
- Mode — most frequent value; the only measure suitable for nominal data.
Q2. What is skewness? A measure of asymmetry of a distribution: $\gamma_1 = \frac{E[(X-\mu)^3]}{\sigma^3}$. Zero = symmetric, positive = right-skewed (long right tail), negative = left-skewed (long left tail).
Q3. What is the difference between variance and standard deviation? Variance is the average squared deviation from the mean ($\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$); standard deviation is its square root ($\sigma = \sqrt{\sigma^2}$). Std-dev is in the original units of the data, making it the more interpretable measure.
Part B (20 marks)
Q. Discuss descriptive statistics with formulas and examples. Explain how mean, median, mode, variance, standard deviation, skewness and kurtosis describe a dataset.
Role of descriptive statistics.
They summarise a dataset's main features — centre, spread, shape — without inferring beyond it. The first lens applied during EDA.
1. Central tendency.
| Measure | Formula | Property | Best for |
|---|---|---|---|
| Mean | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ | Uses every value | Symmetric data |
| Median | Middle value | Robust to outliers | Skewed data (income, prices) |
| Mode | Most frequent | For categorical | Nominal data |
Example. Salaries: 30k, 32k, 33k, 35k, 200k → mean = 66k, median = 33k. Median better reflects "typical" salary.
2. Spread.
| Measure | Formula | Robust? |
|---|---|---|
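| Range | $x_{\max} - x_{\min}$ | No |
| IQR | $Q_3 - Q_1$ | Yes |
| Variance | $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$ | No |
| Std dev | $\sigma = \sqrt{\sigma^2}$ | No |
| CV | $\sigma / \mu$ | No (but scale-free) |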
Why use IQR? It ignores the extreme tails — useful for skewed distributions.
Example. For a dataset with mean = 5 and population variance = 4, the standard deviation is $\sqrt{4} = 2$.
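The figures in this example can be reproduced in code. The dataset below is an assumption: it is a hypothetical stand-in chosen because it gives exactly mean 5 and population variance 4, and it is not necessarily the data the example was built on.

```python
import statistics

# Assumed illustrative dataset (not taken from the notes).
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)            # 40 / 8
pop_var = statistics.pvariance(data)    # sum of squared deviations (32) / 8
pop_sd = statistics.pstdev(data)        # square root of pop_var

print(mean, pop_var, pop_sd)
```

Using the sample formulas (`statistics.variance`, `statistics.stdev`) on the same data would give slightly larger values, because they divide by $n - 1$.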
3. Shape — Skewness and Kurtosis.
Skewness measures asymmetry:
- Zero → symmetric (Normal).
- Positive → right-skewed; long right tail; mean > median (income, time-to-failure).
- Negative → left-skewed; long left tail; mean < median (exam scores capped at 100, time-to-success).
Kurtosis measures tailedness:
- Excess kurtosis = Kurt − 3 (Normal has 0).
- Excess kurtosis > 0: leptokurtic, fat tails, more outliers (stock returns).
- Excess kurtosis < 0: platykurtic, thinner tails, flat top.
Why care? Many models (e.g. ordinary linear regression) assume approximately Gaussian noise for their standard inference. Heavy skew or excess kurtosis violates this, so apply transformations or switch to robust models.
4. Quartiles, Percentiles and Five-number summary.
The five-number summary underpins the box plot and reveals outliers visually.
Putting it together — a worked example.
Suppose a study finds incomes (in thousands ₹): 20, 24, 26, 30, 35, 40, 45, 60, 200.
- Mean = 53.3 — pulled by the outlier.
- Median = 35 — robust, more representative.
- Sample std dev ≈ 56: larger than the mean itself, inflated by the single outlier.
- Skewness > 0 — long right tail.
- Box plot: shows ₹200k as an outlier far beyond the upper fence $Q_3 + 1.5 \times \text{IQR}$.
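The figures in this worked example can be checked with the standard library. The 1.5 × IQR upper fence is the common box plot convention and is assumed here, and the quartiles use the `inclusive` method; other quartile conventions give a slightly different fence.

```python
import statistics

incomes = [20, 24, 26, 30, 35, 40, 45, 60, 200]  # thousands of rupees

mean = statistics.mean(incomes)        # 480 / 9
median = statistics.median(incomes)    # 5th of 9 sorted values
sample_sd = statistics.stdev(incomes)  # divides by n - 1

# Box plot fence (assumed convention): points above Q3 + 1.5 * IQR are outliers.
q1, _, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in incomes if x > upper_fence]

print(f"mean={mean:.1f} median={median} sd={sample_sd:.1f}")
print(f"Q1={q1} Q3={q3} fence={upper_fence} outliers={outliers}")
```

Only the value 200 lies above the fence, which is why the mean sits well above the median for this sample.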
A data scientist seeing this:
- Log-transforms the income to reduce skew.
- Reports median, not mean.
- Considers winsorising the outlier or using a robust model.
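A rough sketch of the first and third remedies on the same income data. The `skewness` helper is a hypothetical moment-based implementation, and capping the top value at the next-largest observation is just one simple way to winsorise, chosen here for illustration.

```python
import math
import statistics

def skewness(xs):
    """Moment-based skewness: m3 / m2^(3/2), dividing by n."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

incomes = [20, 24, 26, 30, 35, 40, 45, 60, 200]

# Remedy 1: log transform compresses the long right tail.
logged = [math.log(x) for x in incomes]
skew_raw = skewness(incomes)
skew_log = skewness(logged)

# Remedy 3 (one simple variant): cap the largest value at the next-largest one.
cap = sorted(incomes)[-2]
winsorised = [min(x, cap) for x in incomes]

print(f"skew raw={skew_raw:.2f} log={skew_log:.2f}")
print(f"mean raw={statistics.mean(incomes):.1f} winsorised={statistics.mean(winsorised):.1f}")
```

The log transform lowers the skewness without removing it entirely, and winsorising pulls the mean back towards the median.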
Descriptive statistics, though "basic", inform most downstream modelling decisions.