PGDDSA Study · Semester 1

PGD24110101C

November 2024 — Solved

Mathematical and Statistical Foundation of Data Science

PGD01C01

2 hours 30 minutes

50 marks

End-Semester Examination, First Semester PG Diploma in Data Science and Analytics (2024 Admission).

Part A — Short Essay

Answer any five questions. Each question carries 2 marks. (5 × 2 = 10 Marks)

Q1. What does the Rank and Nullity theorem define for a matrix?

For an $m \times n$ matrix $A$ representing a linear map $A : \mathbb{R}^n \to \mathbb{R}^m$ , the rank-nullity theorem states:

$\text{rank}(A) + \text{nullity}(A) = n$

where $\text{rank}(A)$ is the dimension of the column space and $\text{nullity}(A) = \dim N(A)$ is the dimension of the null space $\{x : Ax = 0\}$ . In other words, the $n$ input dimensions split into those preserved by $A$ (the rank) and those collapsed to zero (the nullity).
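
A small numerical check of the theorem, as a sketch only. The matrix `A`, the helper `rank`, and the vector `null_vector` are illustrative choices, not part of the question.

```python
from fractions import Fraction

def rank(matrix):
    """Rank via Gaussian elimination over exact fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    r = 0  # number of pivots found so far
    for c in range(n_cols):
        # look for a row at or below r with a non-zero entry in column c
        pivot = next((i for i in range(r, n_rows) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, n_rows):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r

# Hypothetical 3 x 3 example: the second row is twice the first
A = [[1, 2, 3],
     [2, 4, 6],
     [1, 0, 1]]
n = len(A[0])
rank_A = rank(A)
nullity_A = n - rank_A            # what the theorem predicts

# One explicit null-space vector, so the nullity is at least 1
null_vector = [1, 1, -1]
image = [sum(a * x for a, x in zip(row, null_vector)) for row in A]
print(rank_A, nullity_A, image)
```

Here the rank is 2 and the null space is spanned by a single vector, so rank + nullity = 2 + 1 = 3 = n.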

Q2. What are denumerable sets?

A set $A$ is denumerable (countably infinite) if there exists a bijection $f : \mathbb{N} \to A$ , i.e., its elements can be listed as $a_1, a_2, a_3, \dots$ with every element appearing exactly once. The sets $\mathbb{N}, \mathbb{Z}$ and $\mathbb{Q}$ are all denumerable, while $\mathbb{R}$ is uncountable (Cantor's diagonal argument).

Q3. Find the symmetric equation of the line through $A(1, 2, 3)$ and $B(4, 0, -1)$ .

Direction vector: $\vec{d} = B - A = (3, -2, -4)$ .

Using point $A$ : $\boxed{\dfrac{x - 1}{3} = \dfrac{y - 2}{-2} = \dfrac{z - 3}{-4}}.$

Q4. What defines an Eulerian circuit in graph theory?

An Eulerian circuit is a closed walk in a connected graph that traverses every edge exactly once and returns to the starting vertex. By Euler's theorem, a connected graph has an Eulerian circuit if and only if every vertex has even degree.
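
Euler's criterion can be checked mechanically. The sketch below assumes an undirected graph given as an edge list; the function name `has_eulerian_circuit` and the two test graphs are hypothetical, and treating the empty graph as having no circuit is a convention chosen here.

```python
def has_eulerian_circuit(edges):
    """Euler's criterion: all vertices with edges are connected and have even degree."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    if not adjacency:
        return False                      # convention: no edges, no circuit
    if any(len(nbrs) % 2 for nbrs in adjacency.values()):
        return False                      # some vertex has odd degree
    # iterative depth-first search to confirm connectivity
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adjacency)

square = [(1, 2), (2, 3), (3, 4), (4, 1)]      # every degree is 2
square_with_diagonal = square + [(1, 3)]       # vertices 1 and 3 now have degree 3
print(has_eulerian_circuit(square), has_eulerian_circuit(square_with_diagonal))
```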

Q5. How many ways can a committee of 3 men and 2 women be selected from 8 men and 6 women?

Choose 3 of 8 men and 2 of 6 women independently:

$\binom{8}{3} \cdot \binom{6}{2} = 56 \cdot 15 = \boxed{840} \text{ ways.}$
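
The count can be confirmed both by the formula and by brute-force enumeration. This is a sketch; the variable names are illustrative.

```python
from itertools import combinations
from math import comb

men_choices = comb(8, 3)          # ways to pick 3 of 8 men
women_choices = comb(6, 2)        # ways to pick 2 of 6 women
committees = men_choices * women_choices

# Brute force: enumerate every (men group, women group) pair
enumerated = sum(
    1
    for _ in combinations(range(8), 3)
    for _ in combinations(range(6), 2)
)
print(men_choices, women_choices, committees, enumerated)
```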

Q6. What is meant by Independent Events? How can you determine if two events $A$ and $B$ are independent?

Two events $A$ and $B$ are independent if the occurrence of one does not affect the probability of the other, i.e.,

$P(A \cap B) = P(A) \cdot P(B).$

To test independence: compute $P(A)$ , $P(B)$ and $P(A \cap B)$ from the data and check whether the product equals the joint probability. Equivalently, when $P(B) > 0$ , check $P(A \mid B) = P(A)$ .
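
As an illustration of the product test, consider one roll of a fair die. The events below are hypothetical examples chosen for this sketch and are not part of the question.

```python
from fractions import Fraction

outcomes = range(1, 7)            # one roll of a fair six-sided die

def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    favourable = sum(1 for w in outcomes if event(w))
    return Fraction(favourable, len(outcomes))

def independent(e1, e2):
    """Product test: P(E1 and E2) == P(E1) * P(E2)."""
    joint = prob(lambda w: e1(w) and e2(w))
    return joint == prob(e1) * prob(e2)

def is_even(w):       # A = {2, 4, 6}
    return w % 2 == 0

def at_most_4(w):     # B = {1, 2, 3, 4}
    return w <= 4

def at_most_3(w):     # C = {1, 2, 3}
    return w <= 3

print(independent(is_even, at_most_4))   # P(A and B) = 1/3 = 1/2 * 2/3
print(independent(is_even, at_most_3))   # P(A and C) = 1/6, but 1/2 * 1/2 = 1/4
```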

Q7. Define a Tree.

A tree is a connected, acyclic, undirected graph. Equivalent characterisations on $n$ vertices: it has exactly $n - 1$ edges, there is a unique path between any two vertices, removing any edge disconnects it, and adding any edge creates a cycle.

Part B — Long Essay

Answer any two questions. Each question carries 20 marks. (2 × 20 = 40 Marks)

Q8.

(a) What are the different types of random variables that can be generated? Explain with their PDF or PMF. (10 marks)

A random variable (RV) $X$ is a function $X : \Omega \to \mathbb{R}$ that assigns a real number to each outcome of a random experiment. RVs fall into two main types — discrete and continuous — each with several standard families.

1. Discrete Random Variables. Take countably many values; described by a Probability Mass Function (PMF) $p(x) = P(X = x)$ with $\sum_x p(x) = 1$ .

Distribution	PMF	Mean	Variance	Typical use
Bernoulli $(p)$	$P(X=1) = p, \; P(X=0) = 1 - p$	$p$	$p(1-p)$	single trial / yes-no
Binomial $(n, p)$	$\binom{n}{k} p^k (1-p)^{n-k}$	$np$	$np(1-p)$	# successes in $n$ trials
Poisson $(\lambda)$	$e^{-\lambda}\lambda^k / k!$	$\lambda$	$\lambda$	rare event counts
Geometric $(p)$	$(1-p)^{k-1} p$	$1/p$	$(1-p)/p^2$	trials until first success

2. Continuous Random Variables. Take values in an interval of $\mathbb{R}$ ; described by a Probability Density Function (PDF) $f(x) \ge 0$ with $\int_{-\infty}^\infty f(x)\,dx = 1$ . Note $P(X = c) = 0$ for any single point; probabilities come from intervals: $P(a \le X \le b) = \int_a^b f(x)\,dx$ .

Distribution	PDF	Mean	Variance	Typical use
Uniform $(a, b)$	$\dfrac{1}{b - a}$ on $[a, b]$	$\dfrac{a+b}{2}$	$\dfrac{(b-a)^2}{12}$	equal-likelihood model
Exponential $(\lambda)$	$\lambda e^{-\lambda x}, \; x \ge 0$	$1/\lambda$	$1/\lambda^2$	waiting times, lifetimes
Normal $(\mu, \sigma^2)$	$\dfrac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$	$\mu$	$\sigma^2$	natural variation, errors
Gamma $(\alpha, \beta)$	$\dfrac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \; x > 0$	$\alpha / \beta$	$\alpha / \beta^2$	sum of $\alpha$ exponentials (integer $\alpha$ )

Generation. Discrete RVs are typically generated by sampling $U \sim \text{Uniform}(0, 1)$ and bucketing $U$ against the cumulative distribution function (the running sum of the PMF). Continuous RVs are generated by inverse-transform sampling $X = F^{-1}(U)$ when the CDF is invertible (e.g., Exponential $(\lambda)$ via $X = -\ln(1 - U)/\lambda$ ), the Box–Muller transform for normals, or rejection sampling for complex shapes.
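
The two generation recipes can be sketched with Python's standard `random` module. The function names, the seed, and the parameter values (rate 2.0, success probability 0.3) are assumptions made for illustration.

```python
import math
import random

def sample_exponential(rate, rng):
    """Inverse-transform sampling: X = -ln(1 - U) / rate."""
    u = rng.random()                      # U in [0, 1)
    return -math.log(1.0 - u) / rate

def sample_discrete(values, probs, rng):
    """Bucket U against the running cumulative probabilities."""
    u = rng.random()
    cumulative = 0.0
    for value, p in zip(values, probs):
        cumulative += p
        if u < cumulative:
            return value
    return values[-1]                     # guard against floating-point round-off

rng = random.Random(0)

exp_draws = [sample_exponential(2.0, rng) for _ in range(100_000)]
exp_mean = sum(exp_draws) / len(exp_draws)        # should be near 1/rate = 0.5

coin_draws = [sample_discrete([0, 1], [0.7, 0.3], rng) for _ in range(100_000)]
frac_ones = sum(coin_draws) / len(coin_draws)     # should be near p = 0.3

print(round(exp_mean, 3), round(frac_ones, 3))
```

The sample mean and the fraction of ones should land close to the table values $1/\lambda$ and $p$ for the Exponential and Bernoulli rows.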

(b) How does rejection sampling facilitate sampling from a complex probability distribution? Explain the algorithm. (10 marks)

Motivation. When the target density $f(x)$ is hard to invert (so inverse-transform sampling cannot be used) but can be evaluated point-wise, rejection sampling lets us sample from $f$ using an easier proposal distribution.

Setup.

Target density $f(x)$ — what we want to sample from.
Proposal density $g(x)$ — easy to sample (e.g., uniform, normal).
Envelope constant $M$ such that $M \cdot g(x) \ge f(x)$ for all $x$ .

Algorithm.

loop forever:
    1. Draw a candidate X from g
    2. Draw U from Uniform(0, 1)
    3. If U ≤ f(X) / (M · g(X)):  return X     (accept)
       else:                       continue    (reject and try again)

Why it works.

$P(\text{accept and } X \le x) = \int_{-\infty}^{x} g(y) \cdot \dfrac{f(y)}{M\, g(y)} \, dy = \dfrac{1}{M} \int_{-\infty}^{x} f(y)\,dy = \dfrac{F(x)}{M}$ .

Hence $P(\text{accept}) = 1/M$ and the conditional CDF of the accepted samples is $F(x)$ — so accepted samples follow $f$ . ∎

Efficiency and trade-off.

Acceptance rate is $1/M$ . A tight envelope ( $g$ hugging $f$ ) gives small $M$ and high acceptance.
A loose envelope wastes many candidate draws.
The support of $g$ must cover the support of $f$ (that is, $g(x) > 0$ wherever $f(x) > 0$ ), and ideally $g$ has a similar shape to $f$ .

Example. Sampling from the half-normal density $f(x) = \sqrt{2/\pi}\, e^{-x^2/2}$ on $[0, \infty)$ with the exponential proposal $g(x) = e^{-x}$ : the ratio $f(x)/g(x)$ is maximised at $x = 1$ , which gives the envelope constant $M = \sqrt{2e/\pi} \approx 1.32$ and an acceptance rate of $1/M \approx 76\%$ .
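
The half-normal example can be run directly. This is a minimal sketch using the standard `random` module; the function names, the seed, and the sample size are assumptions made for illustration.

```python
import math
import random

M = math.sqrt(2 * math.e / math.pi)       # envelope constant, about 1.32

def target_pdf(x):
    """Half-normal density on [0, inf): sqrt(2/pi) * exp(-x^2 / 2)."""
    return math.sqrt(2 / math.pi) * math.exp(-x * x / 2)

def proposal_pdf(x):
    """Exponential(1) density on [0, inf)."""
    return math.exp(-x)

def rejection_sample(rng):
    """Return (accepted sample, number of candidates drawn)."""
    tries = 0
    while True:
        tries += 1
        x = rng.expovariate(1.0)          # candidate from g
        u = rng.random()                  # U ~ Uniform(0, 1)
        if u <= target_pdf(x) / (M * proposal_pdf(x)):
            return x, tries               # accept; otherwise loop and retry

rng = random.Random(42)
results = [rejection_sample(rng) for _ in range(50_000)]
samples = [x for x, _ in results]

acceptance_rate = len(results) / sum(t for _, t in results)
sample_mean = sum(samples) / len(samples)  # half-normal mean is sqrt(2/pi)

print(round(acceptance_rate, 3), round(1 / M, 3))
print(round(sample_mean, 3), round(math.sqrt(2 / math.pi), 3))
```

The empirical acceptance rate should sit near $1/M$ , and the sample mean near $\sqrt{2/\pi} \approx 0.80$ , the mean of the half-normal distribution.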

Applications: Bayesian inference (sampling posteriors), Monte Carlo integration, and generative simulation. The same accept/reject idea also appears in MCMC methods such as Metropolis–Hastings.

Q9.

(a) State the Principle of Inclusion-Exclusion for three sets. Class survey problem. (10 marks)

Statement. For any three finite sets $A, B, C$ :

$|A \cup B \cup C| = |A| + |B| + |C| - |A \cap B| - |A \cap C| - |B \cap C| + |A \cap B \cap C|$

Class survey problem.

Let $M, P, C$ be the sets of students who like Mathematics, Physics, Chemistry respectively.

Given: $|M| = 60$ , $|P| = 45$ , $|C| = 50$ , $|M \cap P| = 25$ , $|M \cap C| = 20$ , $|P \cap C| = 15$ , $|M \cap P \cap C| = 10$ . Total students = 120.

Students who like at least one subject: $|M \cup P \cup C| = 60 + 45 + 50 - 25 - 20 - 15 + 10 = \boxed{105}.$

So 105 students like at least one subject (and $120 - 105 = 15$ like none of the three).

Students who like only Mathematics = $|M| - |M \cap P| - |M \cap C| + |M \cap P \cap C|$ $= 60 - 25 - 20 + 10 = \boxed{25}.$

Venn diagram (regions and counts).

Region	Count
Only Math	25
Only Physics	45 − 25 − 15 + 10 = 15
Only Chemistry	50 − 20 − 15 + 10 = 25
Math ∩ Physics only (not Chem)	25 − 10 = 15
Math ∩ Chemistry only	20 − 10 = 10
Physics ∩ Chemistry only	15 − 10 = 5
All three	10
None	15
Total	120 ✓

Sketch: three overlapping circles labeled M, P, C; centre region (all three) = 10; pairwise-only regions M∩P = 15, M∩C = 10, P∩C = 5; single-subject regions M = 25, P = 15, C = 25; outside the three circles = 15.
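
The region counts can be re-derived from the given figures as a consistency check. This is a sketch; the variable names simply mirror the sets in the problem.

```python
# Survey counts given in the problem
total = 120
M, P, C = 60, 45, 50
MP, MC, PC = 25, 20, 15
MPC = 10

# Inclusion-exclusion for three sets
at_least_one = M + P + C - MP - MC - PC + MPC
none = total - at_least_one

# The seven disjoint regions of the Venn diagram
regions = {
    "only M": M - MP - MC + MPC,
    "only P": P - MP - PC + MPC,
    "only C": C - MC - PC + MPC,
    "M and P only": MP - MPC,
    "M and C only": MC - MPC,
    "P and C only": PC - MPC,
    "all three": MPC,
}

for name, count in regions.items():
    print(f"{name}: {count}")
print("at least one:", at_least_one, "none:", none)
```

The seven regions are disjoint, so their counts must add up to the inclusion-exclusion total.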

(b) Illustrate different graph terminologies and traversals. (10 marks)

Graph terminologies.

Vertex (node) and edge — fundamental units. $G = (V, E)$ .
Degree $\deg(v)$ — number of edges incident to $v$ . Handshaking lemma: $\sum \deg(v) = 2|E|$ .
Simple graph — no self-loops, no parallel edges.
Multigraph / Pseudograph — allows parallel edges (and loops).
Directed graph (digraph) — edges with direction; vertices have in-degree and out-degree.
Weighted graph — edges carry weights (costs, distances).
Complete graph $K_n$ — every pair joined, $\binom{n}{2}$ edges.
Bipartite graph — vertex set splits as $V_1 \cup V_2$ with edges only between $V_1$ and $V_2$ .
Walk — sequence of vertices/edges (repetition allowed).
Trail — walk with no repeated edge.
Path — walk with no repeated vertex.
Cycle/Circuit — closed path / closed trail.
Connected — every pair of vertices joined by a path.
Tree — connected, acyclic graph with $n - 1$ edges.
Subgraph, Spanning subgraph, Spanning tree.
Planar graph — drawable without edge crossings.

Graph traversals.

1. Breadth-First Search (BFS). Explore neighbours layer by layer using a queue.

BFS(G, s):
    mark s as visited, enqueue s
    while queue not empty:
        u = dequeue
        for each neighbour v of u:
            if v not visited:
                mark v, enqueue v

Complexity $O(V + E)$ . Used for shortest path in unweighted graphs, level-order processing, web crawling.

2. Depth-First Search (DFS). Explore as deep as possible before backtracking, using recursion or a stack.

DFS(G, u):
    mark u as visited
    for each neighbour v of u:
        if v not visited:
            DFS(G, v)

Complexity $O(V + E)$ . Used for cycle detection, topological sort, connected components, articulation points.

Example on a small graph with vertices $\{1,2,3,4,5\}$ and edges $\{(1,2),(1,3),(2,4),(3,4),(4,5)\}$ :

BFS from 1: $1 \to 2 \to 3 \to 4 \to 5$ .
DFS from 1: $1 \to 2 \to 4 \to 3 \to 5$ (one possible order).
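
Both traversals can be run on the example graph. The sketch below assumes an undirected graph stored as adjacency lists with neighbours visited in increasing order; that ordering is an assumption, and a different order would give a different but equally valid DFS sequence.

```python
from collections import deque

edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]

# Build undirected adjacency lists, neighbours sorted for a deterministic order
graph = {}
for u, v in edges:
    graph.setdefault(u, []).append(v)
    graph.setdefault(v, []).append(u)
for neighbours in graph.values():
    neighbours.sort()

def bfs(graph, start):
    """Visit vertices layer by layer using a FIFO queue."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in graph[u]:
            if v not in visited:
                visited.add(v)
                queue.append(v)
    return order

def dfs(graph, start):
    """Go as deep as possible before backtracking (recursive version)."""
    visited = set()
    order = []

    def visit(u):
        visited.add(u)
        order.append(u)
        for v in graph[u]:
            if v not in visited:
                visit(v)

    visit(start)
    return order

print(bfs(graph, 1))
print(dfs(graph, 1))
```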

Other traversals.

Eulerian traversal — visits every edge once (Hierholzer's algorithm).
Hamiltonian traversal — visits every vertex once (NP-hard in general).
Dijkstra's — shortest paths in weighted graphs with non-negative weights.

Q10.

(a) What do eigenvalues and eigenvectors contribute? Find eigenvalues and eigenvectors of $A = \begin{pmatrix} 5 & 4 \\ 2 & 3 \end{pmatrix}$ . (10 marks)

Contribution of eigenvalues / eigenvectors.

For a square matrix $A$ , a scalar $\lambda$ and non-zero vector $v$ satisfying $Av = \lambda v$ form an eigenpair. Eigenvectors are directions that $A$ only scales — direction unchanged. Their importance:

Reveal intrinsic structure of a linear transformation: principal directions of stretch (largest $\lambda$ ) and compression (smallest).
Diagonalisation $A = P D P^{-1}$ — turns matrix powers and ODE solutions into simple scalar exponentials: $A^k = P D^k P^{-1}$ .
PCA — eigenvectors of the covariance matrix are the principal components; eigenvalues are the variances along them.
Spectral clustering, PageRank — leading eigenvectors of graph Laplacians or stochastic matrices reveal community structure and importance.
Stability analysis of Markov chains, dynamical systems, neural-network optimisation.

Computation for $A = \begin{pmatrix} 5 & 4 \\ 2 & 3 \end{pmatrix}$ .

Step 1 — Characteristic polynomial. $\det(A - \lambda I) = (5 - \lambda)(3 - \lambda) - 8 = \lambda^2 - 8\lambda + 7$ .

Factor: $(\lambda - 7)(\lambda - 1) = 0$ .

$\boxed{\lambda_1 = 7, \quad \lambda_2 = 1}.$

(Check: sum = 8 = trace ✓, product = 7 = det ✓.)

Step 2 — Eigenvector for $\lambda_1 = 7$ . Solve $(A - 7I) v = 0$ : $\begin{pmatrix} -2 & 4 \\ 2 & -4 \end{pmatrix} v = 0 \Rightarrow -2 x + 4 y = 0 \Rightarrow x = 2 y$ . Take $y = 1$ : $\boxed{v_1 = (2, 1)^T}$ .

Step 3 — Eigenvector for $\lambda_2 = 1$ . Solve $(A - I) v = 0$ : $\begin{pmatrix} 4 & 4 \\ 2 & 2 \end{pmatrix} v = 0 \Rightarrow 4 x + 4 y = 0 \Rightarrow x = -y$ . Take $y = 1$ : $\boxed{v_2 = (-1, 1)^T}$ .

Diagonalisation. $P = \begin{pmatrix} 2 & -1 \\ 1 & 1 \end{pmatrix}, \quad D = \begin{pmatrix} 7 & 0 \\ 0 & 1 \end{pmatrix}, \quad A = P D P^{-1}.$
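
The hand computation can be verified with exact arithmetic. This sketch uses only the standard library; the helpers `matvec` and `matmul` are illustrative, and the integer square root is exact here only because the discriminant happens to be a perfect square.

```python
import math
from fractions import Fraction

A = [[5, 4], [2, 3]]

# Eigenvalues of a 2 x 2 matrix from lambda^2 - trace * lambda + det = 0
trace = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
root = math.isqrt(trace * trace - 4 * det)     # exact: the discriminant is 36
eigenvalues = [(trace + root) // 2, (trace - root) // 2]

def matvec(m, v):
    return [sum(a * x for a, x in zip(row, v)) for row in m]

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

v1, v2 = [2, 1], [-1, 1]
print(matvec(A, v1), matvec(A, v2))            # should equal 7 * v1 and 1 * v2

# Diagonalisation check: P D P^{-1} should reproduce A
P = [[2, -1], [1, 1]]
D = [[7, 0], [0, 1]]
det_P = P[0][0] * P[1][1] - P[0][1] * P[1][0]
P_inv = [[Fraction(P[1][1], det_P), Fraction(-P[0][1], det_P)],
         [Fraction(-P[1][0], det_P), Fraction(P[0][0], det_P)]]
reconstructed = matmul(matmul(P, D), P_inv)
print(reconstructed == A)
```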

(b) How are inner products and similarities computed between vectors? (5 marks)

Inner product (dot product). For $x, y \in \mathbb{R}^n$ : $\langle x, y \rangle = x^T y = \sum_{i=1}^n x_i y_i.$

It induces the norm $\|x\| = \sqrt{\langle x, x \rangle}$ and measures alignment via the angle: $\cos \theta = \frac{\langle x, y \rangle}{\|x\| \, \|y\|}.$

Cosine similarity $\dfrac{\langle x, y \rangle}{\|x\|\|y\|} \in [-1, 1]$ — the standard similarity measure for text vectors, embeddings, and recommender systems. Value 1 ⇒ identical direction; 0 ⇒ orthogonal; −1 ⇒ opposite.

Two vectors are orthogonal iff $\langle x, y \rangle = 0$ . By Cauchy–Schwarz, $|\langle x, y \rangle| \le \|x\| \|y\|$ .
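
These formulas translate directly into code. The sketch below uses plain Python lists; the function names and the example vectors are illustrative.

```python
import math

def dot(x, y):
    """Inner product <x, y> = sum_i x_i * y_i."""
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    """Euclidean norm induced by the inner product."""
    return math.sqrt(dot(x, x))

def cosine_similarity(x, y):
    """cos(theta) = <x, y> / (||x|| ||y||); undefined for zero vectors."""
    return dot(x, y) / (norm(x) * norm(y))

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
orthogonal = cosine_similarity([1, 0], [0, 1])
opposite = cosine_similarity([1, 2], [-1, -2])
print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0 and −1, matching the three cases described above.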

(c) Discuss different metrics employed to measure the distance between matrices. (5 marks)

For matrices $A, B \in \mathbb{R}^{m \times n}$ , common distances are:

Metric	Formula	Interpretation
Frobenius	$\|A - B\|_F = \sqrt{\sum_{i,j} (a_{ij} - b_{ij})^2}$	Entry-wise $L_2$ ; equals $\sqrt{\text{tr}((A-B)^T (A-B))}$ . Most common.
Spectral (operator $L_2$ )	$\|A - B\|_2 = \sigma_{\max}(A - B)$	Largest singular value of the difference. Worst-case stretching.
Nuclear / Trace norm	$\|A - B\|_* = \sum_i \sigma_i(A - B)$	Sum of singular values. Used in low-rank recovery.
Manhattan (element $L_1$ )	$\sum_{i,j} |a_{ij} - b_{ij}|$	Sum of absolute entry-wise differences.
Max norm / $L_\infty$	$\max_{i,j} |a_{ij} - b_{ij}|$	Largest absolute entry-wise difference.

When to use which. Frobenius is the default for least-squares matrix problems and PCA reconstruction error. Spectral norm appears in stability analysis. Nuclear norm is a convex surrogate for matrix rank (used in matrix completion, as in the Netflix Prize problem). The entry-wise $L_1$ distance is less dominated by a few large entry errors than Frobenius, while the max norm reports only the single worst entry error.
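
The five distances can be compared on a small example. The sketch below is restricted to $2 \times 2$ matrices so that the singular values have a closed form; the matrices `A` and `B` and all helper names are hypothetical. For general sizes a library routine such as NumPy's `numpy.linalg.norm` would normally be used instead.

```python
import math

def difference(a, b):
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def frobenius(d):
    return math.sqrt(sum(x * x for row in d for x in row))

def manhattan(d):
    return sum(abs(x) for row in d for x in row)

def max_abs(d):
    return max(abs(x) for row in d for x in row)

def singular_values_2x2(d):
    """Square roots of the eigenvalues of d^T d (2 x 2 matrices only)."""
    a = d[0][0] ** 2 + d[1][0] ** 2
    b = d[0][0] * d[0][1] + d[1][0] * d[1][1]
    c = d[0][1] ** 2 + d[1][1] ** 2
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(max(tr * tr - 4 * det, 0.0))
    return math.sqrt((tr + disc) / 2), math.sqrt(max((tr - disc) / 2, 0.0))

A = [[1, 2], [3, 4]]
B = [[1, 0], [0, 4]]
D = difference(A, B)                  # [[0, 2], [3, 0]]

sigma_max, sigma_min = singular_values_2x2(D)
distances = {
    "frobenius": frobenius(D),
    "spectral": sigma_max,
    "nuclear": sigma_max + sigma_min,
    "manhattan": manhattan(D),
    "max": max_abs(D),
}
print(distances)
```

For this pair the Frobenius distance is $\sqrt{13}$ , which equals $\sqrt{\sigma_1^2 + \sigma_2^2}$ with singular values 3 and 2.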

Q11.

(a) Construct equations and analyse lines and plane. (10 marks)

Line A. Passes through $A(1, 0, 2)$ with direction $\vec{d}_A = (2, -1, 3)$ .

Vector form: $\vec{r}(t) = (1, 0, 2) + t(2, -1, 3)$ .

Parametric: $x = 1 + 2t, \; y = -t, \; z = 2 + 3t$ .

Symmetric: $\dfrac{x - 1}{2} = \dfrac{y}{-1} = \dfrac{z - 2}{3}$ .

Line B. Through $B(3, 1, 4)$ and $C(5, 0, 7)$ . Direction $\vec{d}_B = C - B = (2, -1, 3)$ .

Symmetric: $\dfrac{x - 3}{2} = \dfrac{y - 1}{-1} = \dfrac{z - 4}{3}$ .

Comparing Line A and Line B.

Direction vectors $\vec{d}_A = \vec{d}_B = (2, -1, 3)$ ⇒ the lines are parallel (or coincident).

Check whether Line A's point $A(1, 0, 2)$ lies on Line B by plugging into Line B's symmetric form: $\dfrac{1 - 3}{2} = -1, \quad \dfrac{0 - 1}{-1} = 1, \quad \dfrac{2 - 4}{3} = -\dfrac{2}{3}$ .

The three ratios are not equal, so $A$ does not lie on Line B.

Conclusion: Lines A and B are parallel but distinct (not intersecting, not the same line).
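
The same comparison can be done with cross products, which avoids dividing by direction components. This is a sketch; the helper names `cross`, `sub`, `is_parallel`, and `on_line` are illustrative.

```python
def cross(u, v):
    """Cross product of two 3-D vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def is_parallel(u, v):
    return cross(u, v) == (0, 0, 0)

def on_line(point, base, direction):
    """A point lies on the line iff (point - base) is parallel to the direction."""
    return is_parallel(sub(point, base), direction)

A, d_A = (1, 0, 2), (2, -1, 3)        # Line A: point and direction
B, C = (3, 1, 4), (5, 0, 7)           # Line B: two points
d_B = sub(C, B)

parallel = is_parallel(d_A, d_B)
coincident = parallel and on_line(A, B, d_B)
print(d_B, parallel, coincident)
```

The directions are parallel, but the point $A$ is not on Line B, so the lines are parallel and distinct.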

Plane through $P(3, -1, 1), Q(4, 1, 3), R(2, 0, 2)$ .

Edge vectors: $\vec{PQ} = (1, 2, 2), \; \vec{PR} = (-1, 1, 1)$ .

Normal $\vec{n} = \vec{PQ} \times \vec{PR}$ : $\vec{n} = \begin{vmatrix} \hat{i} & \hat{j} & \hat{k} \\ 1 & 2 & 2 \\ -1 & 1 & 1 \end{vmatrix} = \hat{i}(2 - 2) - \hat{j}(1 + 2) + \hat{k}(1 + 2) = (0, -3, 3)$ .

Simplify: $\vec{n} = (0, -1, 1)$ .

Plane equation (using $P(3, -1, 1)$ ): $0(x - 3) - 1(y + 1) + 1(z - 1) = 0$ , which simplifies to $\boxed{y - z + 2 = 0 \quad \text{or equivalently} \quad -y + z = 2}.$ Check: $Q$ gives $1 - 3 + 2 = 0$ and $R$ gives $0 - 2 + 2 = 0$ .
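
The normal vector and the plane equation can be checked numerically. This is a sketch; the helper names are illustrative, and the plane is written as $\vec{n} \cdot \vec{x} = d$ with the unsimplified normal.

```python
def cross(u, v):
    """Cross product of two 3-D vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

P, Q, R = (3, -1, 1), (4, 1, 3), (2, 0, 2)

PQ, PR = sub(Q, P), sub(R, P)
normal = cross(PQ, PR)                # unsimplified normal vector
d = dot(normal, P)                    # plane: normal . x = d

# Every defining point must satisfy the plane equation
residuals = [dot(normal, X) - d for X in (P, Q, R)]
print(PQ, PR, normal, d, residuals)
```

The result is $-3y + 3z = 6$ , which is $-y + z = 2$ after dividing by 3.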

(b) Explain Bayes theorem in detail. Solve the defect-test problem. (10 marks)

Bayes' Theorem — statement. For events $A$ and $B$ with $P(B) > 0$ :

$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}.$

If $\{B_1, \dots, B_n\}$ partitions $\Omega$ : $P(B_k \mid A) = \frac{P(A \mid B_k) P(B_k)}{\sum_{i=1}^n P(A \mid B_i) P(B_i)}.$

Why it matters. Bayes' theorem inverts conditioning — it turns a likelihood $P(B \mid A)$ (often easy to measure: "given the disease, the test reads positive 98% of the time") into a posterior $P(A \mid B)$ (often what we actually want: "given a positive test, what's the chance the patient has the disease"). The prior $P(A)$ injects base-rate information.

Proof. From the definition of conditional probability:

$P(A \cap B) = P(A \mid B) P(B)$
$P(A \cap B) = P(B \mid A) P(A)$

Equating: $P(A \mid B) P(B) = P(B \mid A) P(A)$ , hence $P(A \mid B) = P(B \mid A) P(A) / P(B)$ . ∎

Applications. Naive Bayes classifier (text, spam), Bayesian networks, medical diagnosis, A/B testing, Bayesian inference, Kalman filters.

Defect-test problem.

Let:

$D$ = "item is defective", $D^c$ = "item is non-defective".
$T$ = "test reports defect (positive)".

Given.

Prior: $P(D) = 1/200 = 0.005, \quad P(D^c) = 0.995$ .
Sensitivity: $P(T \mid D) = 0.98$ .
False-positive rate: $P(T \mid D^c) = 0.02$ .

Step 1 — Total probability of a positive test. $P(T) = P(T \mid D) P(D) + P(T \mid D^c) P(D^c)$ $= 0.98 \times 0.005 + 0.02 \times 0.995$ $= 0.0049 + 0.0199 = 0.0248$ .

Step 2 — Apply Bayes. $P(D \mid T) = \frac{P(T \mid D) P(D)}{P(T)} = \frac{0.0049}{0.0248} \approx 0.1976.$

$\boxed{P(D \mid T) \approx 19.76\%.}$

Interpretation. Even though the test has 98% sensitivity and only a 2% false-positive rate, a positive result means the item is actually defective only about 1 time in 5. This is the base-rate effect: when the underlying condition is rare, false positives (0.0199) outnumber true positives (0.0049), and ignoring this is the classic base-rate fallacy. It is also why screening tests for rare diseases are often followed by confirmatory tests.
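
The calculation generalises to any prior and test characteristics. This is a sketch; the function name `posterior_given_positive` and its parameter names are illustrative.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """Return (P(D | T), P(T)) using Bayes' theorem and total probability."""
    true_positive = sensitivity * prior
    false_positive = false_positive_rate * (1 - prior)
    evidence = true_positive + false_positive     # P(T)
    return true_positive / evidence, evidence

p_defect_given_positive, p_positive = posterior_given_positive(
    prior=1 / 200, sensitivity=0.98, false_positive_rate=0.02
)
print(round(p_positive, 4), round(p_defect_given_positive, 4))

# With a common defect (prior 0.5) the same test is far more convincing
p_common, _ = posterior_given_positive(0.5, 0.98, 0.02)
print(round(p_common, 4))
```

Raising the prior from 0.005 to 0.5 lifts the posterior from about 0.20 to 0.98, which shows how strongly the base rate drives the answer.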