PGD01C02
Module 3 · Exploratory Data Analytics and Model Development

Box Plots, Pivot Tables and Heat Maps

Core Titles
Key headlines and terms for quick recall
  • Box plot — five-number summary + outliers
  • Histogram — distribution of one numeric variable
  • Scatter plot — relationship between two numerics
  • Bar chart, Pie chart — categorical
  • Pivot table — multi-dim aggregate (Excel / pandas pivot_table)
  • Heatmap — colour-coded matrix
  • Correlation heatmap — pairwise correlations
  • Pair plot — scatter matrix of all pairs
Basic Idea
What it is, why it matters, how it works

Box plot

A graphical view of the five-number summary (min, $Q_1$, median, $Q_3$, max), with outliers flagged outside $[Q_1 - 1.5 \cdot \mathrm{IQR},\; Q_3 + 1.5 \cdot \mathrm{IQR}]$, where $\mathrm{IQR} = Q_3 - Q_1$ is the interquartile range.

   lower whisker      upper whisker
   |-------[====|====]-------|       •
           Q1 median Q3           outlier

Reveals.

  • Median and spread.
  • Skewness (box position within whiskers).
  • Outliers as individual points.
  • Comparison across groups (side-by-side box plots).
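
The quantities a box plot draws can be computed directly. The following is a minimal sketch using only Python's standard library; the function name box_plot_stats and the salary figures are hypothetical, and the "inclusive" quartile method is one of several conventions (plotting libraries may use a slightly different one).

```python
import statistics


def box_plot_stats(values):
    """Return quartiles, 1.5*IQR fences, whisker ends and outliers.

    Assumes at least two data points.
    """
    data = sorted(values)
    # "inclusive" is one quartile convention; others shift Q1/Q3 slightly.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inliers = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        # Whiskers stop at the most extreme points that are not outliers.
        "whisker_low": min(inliers),
        "whisker_high": max(inliers),
        "outliers": outliers,
    }


# Hypothetical salaries (in thousands); 120 is far from the rest.
salaries = [30, 32, 35, 36, 38, 40, 41, 43, 45, 120]
stats = box_plot_stats(salaries)
print(stats["q1"], stats["median"], stats["q3"])  # 35.25 39.0 42.5
print(stats["outliers"])                          # [120]
```

Here the upper fence is 42.5 + 1.5 × 7.25 = 53.375, so 120 is drawn as an individual point and the upper whisker stops at 45.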

Histogram

Bars of counts (or density) per bin of a numeric variable. Shape reveals modality, skewness, gaps.

Bin choice matters — too few hides structure; too many adds noise.
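
The effect of bin choice can be seen without plotting anything. This is a small sketch with a hypothetical helper, histogram_counts, that counts values in equal-width bins; the data are made up to contain two clusters.

```python
def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin, not a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical data with two clusters (around 2-3 and around 9-10).
data = [1, 2, 2, 3, 3, 3, 8, 9, 9, 9, 10, 11]

print(histogram_counts(data, 1))  # [12]            one bar hides everything
print(histogram_counts(data, 5))  # [3, 3, 0, 1, 5] the gap becomes visible
```

With one bin the two clusters are indistinguishable; with five bins the empty middle bin exposes the gap between them.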

Pivot table

A spreadsheet-style aggregate showing one variable's summary across two grouping variables.

Region \ Quarter    Q1       Q2       Q3       Q4
North               ₹2.1M    ₹2.4M    ₹2.8M    ₹3.1M
South               ₹1.8M    ₹2.0M    ₹2.5M    ₹2.7M
East                ₹1.5M    ₹1.6M    ₹1.9M    ₹2.0M
West                ₹2.2M    ₹2.3M    ₹2.5M    ₹2.6M

Use. Quickly slice business metrics by multiple dimensions. Available in Excel, Google Sheets, pandas (df.pivot_table).
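
Conceptually, a pivot table groups raw records by two keys and aggregates a third value. The sketch below shows that idea in plain Python; it is an illustration of the concept, not how Excel or pandas implement it, and the records and the pivot_sum helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical long-form records: (region, quarter, revenue in ₹ lakh).
sales = [
    ("North", "Q1", 12), ("North", "Q1", 9), ("North", "Q2", 24),
    ("South", "Q1", 18), ("South", "Q2", 11), ("South", "Q2", 9),
]


def pivot_sum(records):
    """Sum the value for every (row key, column key) combination."""
    table = defaultdict(lambda: defaultdict(int))
    for row_key, col_key, value in records:
        table[row_key][col_key] += value
    # Convert to plain dicts for easy printing and comparison.
    return {row: dict(cols) for row, cols in table.items()}


pivot = pivot_sum(sales)
print(pivot["North"])  # {'Q1': 21, 'Q2': 24}
print(pivot["South"])  # {'Q1': 18, 'Q2': 20}
```

Six raw rows collapse into a 2 × 2 grid: each cell is the total for one region in one quarter.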

Heatmap

Visualise a matrix with colour: each cell's colour encodes its value (for example, darker = higher on a single-hue palette).

Common uses.

  • Correlation heatmap — pairwise correlations between numeric features. Spot multicollinearity instantly.
  • Confusion matrix — classification performance.
  • Geographic — population, sales density on a map.
  • Time-of-day vs day-of-week — when do users log in?
  • Student vs subject performance — find weak subjects / students at a glance.
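
For the correlation use case, the matrix behind the heatmap is just the table of pairwise correlations. The following sketch assumes pandas is installed; the feature names and values are hypothetical, and the 0.9 cut-off is an arbitrary illustration rather than a standard rule.

```python
import pandas as pd

# Hypothetical features: area_sqft and area_sqm measure the same thing
# in different units, so they should be almost perfectly correlated.
df = pd.DataFrame({
    "area_sqft": [500, 750, 1000, 1250, 1500],
    "area_sqm": [46.5, 69.7, 92.9, 116.1, 139.4],
    "age_years": [30, 5, 18, 2, 25],
})

corr = df.corr()  # Pearson correlation by default

# List each feature pair once if its correlation is strong.
threshold = 0.9  # arbitrary cut-off chosen for this example
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)  # [('area_sqft', 'area_sqm')]
```

If seaborn is available, seaborn.heatmap(corr, annot=True, vmin=-1, vmax=1) would draw this matrix as a coloured grid, and the strongly correlated pair would stand out as a near-saturated cell.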

Pair plot (scatter matrix)

A grid of scatter plots for every pair of features, with histograms on the diagonal. In Python, seaborn's pairplot function is a common tool for this.

Use. Spot pairwise linear / non-linear relationships, clusters, outliers in a multi-feature dataset.

Other essential plots

  • Bar / column chart — categorical counts.
  • Pie chart — share of a whole (use sparingly).
  • Line plot — time series trends.
  • Violin plot — box + KDE; richer distribution view.
  • Q-Q plot — checks normality.
  • Geographic / choropleth — spatial data.

Tools

  • Python: matplotlib, seaborn, plotly.
  • R: ggplot2.
  • Commercial: Tableau, Power BI, Looker.

Why visualisation matters

  • Humans grok pictures faster than tables.
  • EDA depends heavily on plots.
  • Stakeholders absorb dashboards better than statistics.
  • Anomalies and trends often jump out visually before any statistic catches them.
Mind Map
Visual structure of the concept
EDA VISUALISATIONS
├── One variable
│   ├── Histogram (distribution shape)
│   ├── Box plot (5-number + outliers)
│   ├── Violin plot (box + density)
│   └── Bar (categorical counts)
├── Two variables
│   ├── Scatter (numeric–numeric)
│   ├── Grouped box (numeric–categorical)
│   └── Contingency / mosaic (cat–cat)
├── Aggregates
│   ├── Pivot table (cross-tabulation)
│   └── Heatmap (matrix colour map)
├── Many variables
│   ├── Pair plot (scatter matrix)
│   └── Correlation heatmap
└── Time / space
    ├── Line plot (trend)
    └── Choropleth map
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. What does a box plot show? A graphical summary of a dataset's five-number summary (minimum, $Q_1$, median, $Q_3$, maximum), with outliers flagged as individual points outside $[Q_1 - 1.5 \cdot \mathrm{IQR},\; Q_3 + 1.5 \cdot \mathrm{IQR}]$.

Q2. What is a pivot table? A spreadsheet-style aggregation that summarises one variable across two or more grouping variables — e.g., revenue by region × quarter — available in Excel, Google Sheets, and pandas df.pivot_table.

Q3. How can a heatmap help in evaluating student performance across subjects? A heatmap displays a students × subjects score matrix as a coloured grid. Cold columns reveal weak subjects across the cohort; cold rows reveal struggling students; clusters reveal patterns such as students strong in maths also doing well in physics.


Part B (20 marks)

Q. Describe the role of box plots, pivot tables and heatmaps in Exploratory Data Analysis with examples.

Box plot.

What it shows. The five-number summary visually:

  • Box spans $Q_1$ to $Q_3$.
  • Median marked inside.
  • Whiskers extend to the most extreme non-outlier points.
  • Outliers plotted individually below $Q_1 - 1.5 \cdot \mathrm{IQR}$ or above $Q_3 + 1.5 \cdot \mathrm{IQR}$.

Information revealed.

  • Centre and spread.
  • Skewness (box offset within whiskers).
  • Outliers.
  • Multiple groups side-by-side compare distributions across categories.

Example use. Plot salary grouped by department — instantly see whether sales pays more than engineering and whether anyone is an outlier.

Pivot table.

What it is. A multi-dimensional aggregate — for each combination of two grouping variables, compute a summary (sum, mean, count) of a third.

Region \ Quarter    Q1       Q2       Q3       Q4
North               ₹2.1M    ₹2.4M    ₹2.8M    ₹3.1M
South               ₹1.8M    ₹2.0M    ₹2.5M    ₹2.7M
East                ₹1.5M    ₹1.6M    ₹1.9M    ₹2.0M
West                ₹2.2M    ₹2.3M    ₹2.5M    ₹2.6M

Insights. Quickly spot the best-performing region overall (North), the quarter-on-quarter growth that peaks in Q4 in every region, and the weakest region (East).

pandas one-liner.

df.pivot_table(values="revenue", index="region", columns="quarter", aggfunc="sum")
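
To make the one-liner runnable, here is a minimal sketch that first rebuilds the Region × Quarter figures in long form (one row per region and quarter). The column names revenue, region and quarter come from the one-liner; the way df is constructed here is an assumption for illustration, and pandas must be installed.

```python
import pandas as pd

# Revenue in ₹ millions, copied from the Region x Quarter table.
revenue = {
    "North": [2.1, 2.4, 2.8, 3.1],
    "South": [1.8, 2.0, 2.5, 2.7],
    "East": [1.5, 1.6, 1.9, 2.0],
    "West": [2.2, 2.3, 2.5, 2.6],
}

# Long form: one row per (region, quarter) observation.
rows = [
    {"region": region, "quarter": f"Q{i}", "revenue": value}
    for region, values in revenue.items()
    for i, value in enumerate(values, start=1)
]
df = pd.DataFrame(rows)

table = df.pivot_table(
    values="revenue", index="region", columns="quarter", aggfunc="sum"
)
print(table)

# Row and column totals back up the insights read off the table.
region_totals = table.sum(axis=1)
best_region = region_totals.idxmax()       # 'North'
weakest_region = region_totals.idxmin()    # 'East'
best_quarter = table.sum(axis=0).idxmax()  # 'Q4'
```

Note that pivot_table sorts the row labels alphabetically (East, North, South, West), so the printed order differs from the table in the notes.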

Heatmap.

What it is. A 2-D colour-coded matrix where each cell's colour intensity encodes a numeric value.

Common uses.

  1. Correlation heatmap — pairwise correlations between numeric features. Identifies multicollinearity at a glance — important before linear regression.

  2. Confusion matrix — actual vs predicted classes. Shows where the classifier confuses categories.

  3. Geographic / spatial — population density, sales by city.

  4. Time × day — login activity by hour of day × day of week.

  5. Student × subject performance — instantly reveals weak subjects (cold columns), struggling students (cold rows), and learning patterns (clusters of strong subjects).
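
For the confusion-matrix use case, the grid being coloured is a table of counts. The sketch below builds one in plain Python; the class labels, the predictions and the confusion_matrix helper are all hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    pair_counts = Counter(zip(actual, predicted))
    return [[pair_counts[(a, p)] for p in labels] for a in labels]


labels = ["cat", "dog", "fox"]
actual = ["cat", "cat", "cat", "dog", "dog", "fox", "fox", "fox"]
predicted = ["cat", "cat", "dog", "dog", "dog", "fox", "dog", "fox"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
# cat [2, 1, 0]
# dog [0, 2, 0]
# fox [0, 1, 2]
```

The diagonal holds correct predictions. The off-diagonal counts in the "dog" column show that this made-up classifier sometimes mistakes cats and foxes for dogs, which is the kind of pattern a heatmap makes visible at a glance.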

Example — student performance heatmap.

Student \ Subject    Maths    Physics    Chem    Biology    English
Alice                92       88         85      78         70
Bob                  45       50         52      80         88
Carol                78       82         75      70         65
...

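Plot as a heatmap with a red ↔ blue gradient, where cold (blue) cells mark low scores. Bob's cold cluster in Maths, Physics and Chem is striking, and the top-right cell (Alice's English, her lowest score at 70) is visibly cooler than the rest of her row. A teacher might decide that Alice needs language coaching and Bob needs maths support.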
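
The "cold rows" and "cold columns" can also be checked numerically. This sketch uses only the three rows shown in the table (the table is truncated with "...", so any cohort-level conclusion here is purely illustrative); the variable names are hypothetical.

```python
subjects = ["Maths", "Physics", "Chem", "Biology", "English"]
scores = {
    "Alice": [92, 88, 85, 78, 70],
    "Bob": [45, 50, 52, 80, 88],
    "Carol": [78, 82, 75, 70, 65],
}

# Row means: a low value corresponds to a "cold row" (struggling student).
student_means = {name: sum(row) / len(row) for name, row in scores.items()}

# Column means: a low value corresponds to a "cold column" (weak subject).
subject_means = {
    subject: sum(row[i] for row in scores.values()) / len(scores)
    for i, subject in enumerate(subjects)
}

weakest_student = min(student_means, key=student_means.get)
weakest_subject = min(subject_means, key=subject_means.get)

# Coldest cell in each row: every student's weakest subject.
weakest_cell = {
    name: subjects[row.index(min(row))] for name, row in scores.items()
}

print(weakest_student)  # Bob
print(weakest_subject)  # Chem
print(weakest_cell)     # {'Alice': 'English', 'Bob': 'Maths', 'Carol': 'English'}
```

The numbers agree with what the colours suggest: Bob has the lowest average, and the coldest cells for Alice and Bob are English and Maths respectively.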

Why these tools in EDA.

  • Box plot — pinpoints outliers and compares distributions.
  • Pivot table — slices business metrics across dimensions.
  • Heatmap — turns dense matrices into instant patterns.

Together they form the core of fast, visual EDA — the foundation on which good models rest.