PGD01C01
Module 5 · Probability Theory

Expectation: Mean, Variance, Covariance, Conditional Expectation

Core Titles
Key headlines and terms for quick recall
  • Expectation: $E[X] = \sum_x x \, p(x)$ or $\int x f(x) \, dx$
  • Linearity: $E[aX + bY] = a E[X] + b E[Y]$
  • Variance: $\text{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$
  • Covariance: $\text{Cov}(X, Y) = E[XY] - E[X] E[Y]$
  • Conditional Expectation: $E[X | Y]$
  • Law of Total Expectation: $E[X] = E[E[X | Y]]$
  • Law of Total Variance: $\text{Var}(X) = E[\text{Var}(X | Y)] + \text{Var}(E[X | Y])$
Basic Idea
What it is, why it matters, how it works

Expectation (mean)

Centre of mass of a distribution: $E[X] = \begin{cases} \sum_x x \, p(x) & \text{discrete} \\ \int x f(x) \, dx & \text{continuous} \end{cases}$

Law of the Unconscious Statistician (LOTUS). For $Y = g(X)$: $E[g(X)] = \sum_x g(x) \, p(x)$ or $\int g(x) f(x) \, dx$.
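As a minimal sketch of the discrete formulas above, the snippet below computes $E[X]$ and $E[X^2]$ for a fair six-sided die. The die and the helper name `expectation` are illustrative assumptions, not part of the notes; exact fractions are used so the results can be checked by hand.

```python
from fractions import Fraction

# Fair six-sided die: p(x) = 1/6 for x = 1..6 (illustrative choice).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum over x of g(x) * p(x), i.e. LOTUS for a discrete pmf."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(pmf)                             # E[X]
second_moment = expectation(pmf, lambda x: x * x)   # E[X^2], no pmf of X^2 needed

print(mean, second_moment)
```

Note that $E[X^2]$ is obtained directly from the pmf of $X$; the distribution of $X^2$ is never constructed.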

Linearity (always holds)

$E[aX + bY + c] = aE[X] + bE[Y] + c$. No independence needed.
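To see that independence really is not needed, here is a small sketch with a strongly dependent pair: $X$ is a fair die roll and $Y = X^2$. The die and the constants `a`, `b`, `c` are assumptions chosen only for illustration.

```python
from fractions import Fraction

# X is a fair die roll; Y = X^2 is a deterministic function of X,
# so X and Y are as dependent as they can be.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
a, b, c = 2, 3, 1  # arbitrary illustrative constants

E_X = sum(x * p for x, p in pmf.items())
E_Y = sum(x * x * p for x, p in pmf.items())

# Left side: expectation of the combined variable, computed directly.
lhs = sum((a * x + b * x * x + c) * p for x, p in pmf.items())
# Right side: combine the individual expectations.
rhs = a * E_X + b * E_Y + c

print(lhs, rhs)
```

Both sides agree exactly even though $Y$ is fully determined by $X$.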

Variance

$\text{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$. Spread around the mean. Standard deviation $\sigma = \sqrt{\text{Var}(X)}$.

Properties.

  • $\text{Var}(aX + b) = a^2 \text{Var}(X)$
  • $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2 \text{Cov}(X, Y)$
  • If $X \perp Y$: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
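The first two properties can be checked exactly on a small example. The joint pmf below is a hypothetical one for two dependent binary variables, and the helpers `E` and `var` are illustrative names, so treat this as a sketch rather than a general-purpose routine.

```python
from fractions import Fraction

# Hypothetical joint pmf p(x, y) for two dependent binary variables.
joint = {
    (0, 0): Fraction(2, 5), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10), (1, 1): Fraction(2, 5),
}

def E(g):
    """E[g(X, Y)] = sum of g(x, y) * p(x, y) over the joint pmf."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def var(g):
    """Var(g(X, Y)) = E[g^2] - (E[g])^2."""
    return E(lambda x, y: g(x, y) ** 2) - E(g) ** 2

var_x = var(lambda x, y: x)
var_y = var(lambda x, y: y)
cov_xy = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)

var_affine = var(lambda x, y: 3 * x + 7)   # should equal 3^2 * Var(X)
var_sum = var(lambda x, y: x + y)          # should equal Var(X) + Var(Y) + 2 Cov(X, Y)

print(var_affine, 9 * var_x)
print(var_sum, var_x + var_y + 2 * cov_xy)
```

Because the covariance here is positive, $\text{Var}(X + Y)$ exceeds $\text{Var}(X) + \text{Var}(Y)$; the simplified rule in the last bullet would undercount.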

Covariance

$\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]$. Symmetric, bilinear. Correlation $\rho = \text{Cov}(X, Y) / (\sigma_X \sigma_Y) \in [-1, 1]$.
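The sketch below computes covariance both ways (centred definition and shortcut formula) and the correlation, for two assumed joint pmfs: a hypothetical dependent binary pair, and an exact linear relation $Y = -2X + 1$ on a fair die, where $\rho$ should be $-1$. Function names are illustrative.

```python
from fractions import Fraction
from math import sqrt

def moments(joint):
    """Return (Cov(X, Y), Var(X), Var(Y)) for a joint pmf {(x, y): p}."""
    ex = sum(x * p for (x, y), p in joint.items())
    ey = sum(y * p for (x, y), p in joint.items())
    exy = sum(x * y * p for (x, y), p in joint.items())
    vx = sum(x * x * p for (x, y), p in joint.items()) - ex ** 2
    vy = sum(y * y * p for (x, y), p in joint.items()) - ey ** 2
    return exy - ex * ey, vx, vy

def cov_centred(joint):
    """Cov(X, Y) from the definition E[(X - mu_X)(Y - mu_Y)]."""
    ex = sum(x * p for (x, y), p in joint.items())
    ey = sum(y * p for (x, y), p in joint.items())
    return sum((x - ex) * (y - ey) * p for (x, y), p in joint.items())

def correlation(joint):
    """rho = Cov(X, Y) / (sigma_X * sigma_Y); assumes both variances are positive."""
    cov, vx, vy = moments(joint)
    return float(cov) / sqrt(float(vx * vy))

# Hypothetical dependent binary pair.
binary = {
    (0, 0): Fraction(2, 5), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10), (1, 1): Fraction(2, 5),
}
# Exact linear relation Y = -2X + 1 with X a fair die roll.
linear = {(x, -2 * x + 1): Fraction(1, 6) for x in range(1, 7)}

print(moments(binary)[0], cov_centred(binary), correlation(binary))
print(correlation(linear))
```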

Conditional expectation

$E[X | Y = y] = \int x \, f_{X|Y}(x | y) \, dx$. A function of $y$. Evaluated at the random $Y$, it becomes the random variable $E[X | Y]$.

Law of Total Expectation (Adam's law)

$E[X] = E\big[ E[X | Y] \big]$. Average the conditional means over $Y$.

Law of Total Variance (Eve's law)

$\text{Var}(X) = E\big[ \text{Var}(X | Y) \big] + \text{Var}\big( E[X | Y] \big)$. Within-group variance + between-group variance.
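Both laws can be verified exactly on a tiny two-group example. The joint pmf below is hypothetical ($Y$ is a group label, $X$ a measurement), and the helper `cond_moment` is an illustrative name; this is a sketch for discrete variables only.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint pmf p(x, y): y is a group label, x a measurement.
joint = {
    (0, 0): Fraction(1, 4), (2, 0): Fraction(1, 4),
    (4, 1): Fraction(1, 4), (8, 1): Fraction(1, 4),
}

# Marginal p(y), obtained by summing the joint pmf over x.
p_y = defaultdict(Fraction)
for (x, y), p in joint.items():
    p_y[y] += p

def cond_moment(y, k):
    """E[X^k | Y = y] = sum_x x^k * p(x, y) / p(y)."""
    return sum(x ** k * p for (x, yy), p in joint.items() if yy == y) / p_y[y]

cond_mean = {y: cond_moment(y, 1) for y in p_y}
cond_var = {y: cond_moment(y, 2) - cond_mean[y] ** 2 for y in p_y}

# Adam's law: average the conditional means over Y.
E_X = sum(cond_mean[y] * p_y[y] for y in p_y)

# Eve's law: within-group part plus between-group part.
within = sum(cond_var[y] * p_y[y] for y in p_y)                       # E[Var(X | Y)]
between = sum(cond_mean[y] ** 2 * p_y[y] for y in p_y) - E_X ** 2     # Var(E[X | Y])

# Direct computation from the joint pmf, for comparison.
E_X_direct = sum(x * p for (x, _), p in joint.items())
Var_X_direct = sum(x * x * p for (x, _), p in joint.items()) - E_X_direct ** 2

print(E_X, E_X_direct)
print(within, between, Var_X_direct)
```

In this example the between-group term (25/4) dominates the within-group term (5/2), which reflects how far apart the two group means are.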

Why this matters in Data Science

Mean and variance summarise distributions. The bias/variance trade-off is expressed in terms of these quantities. Conditional expectation is the optimal predictor under squared loss, which is the foundation of regression.
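The optimal-predictor claim can be illustrated on a small discrete example. The joint pmf and the two competing predictors below are hypothetical choices; the sketch only compares a few candidates and does not prove optimality in general.

```python
from fractions import Fraction

# Hypothetical joint pmf p(x, y); x is the target, y the observed feature.
joint = {
    (0, 0): Fraction(1, 4), (2, 0): Fraction(1, 4),
    (4, 1): Fraction(1, 4), (8, 1): Fraction(1, 4),
}

def mse(predictor):
    """E[(X - h(Y))^2] for a predictor given as a dict y -> h(y)."""
    return sum((x - predictor[y]) ** 2 * p for (x, y), p in joint.items())

cond_mean = {0: Fraction(1), 1: Fraction(6)}        # E[X | Y = y], worked out by hand
shifted = {0: Fraction(2), 1: Fraction(5)}          # an arbitrary competing predictor
constant = {0: Fraction(7, 2), 1: Fraction(7, 2)}   # ignores Y and always predicts E[X]

print(mse(cond_mean), mse(shifted), mse(constant))
```

The conditional mean attains the smallest error of the three, and that error equals $E[\text{Var}(X | Y)]$; the constant predictor's error equals $\text{Var}(X)$.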

Mind Map
Visual structure of the concept
EXPECTATION & MOMENTS
├── E[X] centre of mass
├── LOTUS: E[g(X)] without finding dist of g(X)
├── Linearity (always): E[aX + bY] = aE[X] + bE[Y]
├── Var(X) = E[X²] − (E[X])²
│   ├── Var(aX + b) = a² Var(X)
│   └── Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
├── Cov(X, Y) = E[XY] − E[X]E[Y]
├── Conditional Expectation E[X|Y]
├── Total Expectation: E[X] = E[E[X|Y]]
└── Total Variance: Var(X) = E[Var(X|Y)] + Var(E[X|Y])
Exam Q&A
Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. State linearity of expectation. $E[aX + bY] = aE[X] + bE[Y]$, regardless of independence.

Q2. Define variance. $\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$.

Q3. Define covariance. $\text{Cov}(X, Y) = E[XY] - E[X] E[Y]$.

Q4. State the law of total expectation. $E[X] = E[E[X | Y]]$.


Part B (20 marks)

Q. Derive the formula $\text{Var}(X) = E[X^2] - (E[X])^2$. State and prove the linearity of expectation. State the laws of total expectation and total variance. Compute the mean and variance of $X \sim \text{Binomial}(n, p)$ using linearity.

Variance identity. $\text{Var}(X) = E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2$. Since $E[X] = \mu$: $\text{Var}(X) = E[X^2] - 2\mu^2 + \mu^2 = E[X^2] - \mu^2 = E[X^2] - (E[X])^2$. ∎

Linearity of expectation. Theorem. For RVs $X, Y$ and scalars $a, b$: $E[aX + bY] = aE[X] + bE[Y]$.

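Proof (continuous). With joint density $f(x, y)$: $E[aX + bY] = \iint (ax + by) f(x, y) \, dx \, dy = a \iint x f(x, y) \, dx \, dy + b \iint y f(x, y) \, dx \, dy = a E[X] + b E[Y]$. The last step integrates out the other variable: $\iint x f(x, y) \, dx \, dy = \int x f_X(x) \, dx = E[X]$, and likewise for $Y$.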

(Discrete case identical with sums.) No independence required. ∎

Total expectation. $E[X] = E[E[X | Y]]$.

Total variance. $\text{Var}(X) = E[\text{Var}(X | Y)] + \text{Var}(E[X | Y])$. (Within-group variance + between-group variance.)

Binomial mean and variance via linearity.

Write $X = X_1 + X_2 + \dots + X_n$ where $X_i \sim \text{Bernoulli}(p)$ are independent.

Mean. $E[X_i] = p$. By linearity: $E[X] = \sum_{i=1}^{n} E[X_i] = np$.

Variance. $\text{Var}(X_i) = p(1-p)$. By independence: $\text{Var}(X) = \sum_{i=1}^{n} \text{Var}(X_i) = np(1-p)$.

So $\text{Binomial}(n, p)$ has mean $np$ and variance $np(1-p)$. ✓
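As a cross-check of the result, the sketch below computes the mean and variance directly from the binomial pmf $\binom{n}{k} p^k (1-p)^{n-k}$ and compares them with $np$ and $np(1-p)$. The values $n = 10$, $p = 3/10$ and the helper name `binomial_pmf` are arbitrary illustrative choices.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k) for k = 0..n."""
    return {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}

n, p = 10, Fraction(3, 10)   # illustrative parameters
pmf = binomial_pmf(n, p)

mean = sum(k * q for k, q in pmf.items())
var = sum(k * k * q for k, q in pmf.items()) - mean ** 2

print(mean, n * p)
print(var, n * p * (1 - p))
```

The direct sum over the pmf agrees exactly with the formulas, though the Bernoulli decomposition reaches them with far less algebra.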

Sanity check. If $p = 0$ or $p = 1$, variance is 0 (deterministic). Variance is maximised at $p = 1/2$. Intuitively: the most "random" Bernoulli is the fair coin.
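The sanity check can be confirmed numerically by evaluating the Bernoulli variance $p(1-p)$ on a grid. The grid spacing of $1/20$ is an arbitrary assumption; this is a quick check, not a proof.

```python
from fractions import Fraction

# Evaluate Var(X_i) = p(1 - p) on an evenly spaced grid of p values in [0, 1].
grid = [Fraction(i, 20) for i in range(21)]
bernoulli_var = {p: p * (1 - p) for p in grid}

# The grid point with the largest variance.
best_p = max(bernoulli_var, key=bernoulli_var.get)

print(best_p, bernoulli_var[best_p])
print(bernoulli_var[Fraction(0)], bernoulli_var[Fraction(1)])
```

The maximum on the grid sits at $p = 1/2$ with per-trial variance $1/4$, so the binomial variance is at most $n/4$.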