PGDDSA Study · Semester 1

PGD24110103A

November 2024 — Solved

Machine Learning for Data Science

PGD01C03

2 hours 30 minutes

50 marks

Solved

End-Semester Examination, First Semester PG Diploma in Data Science and Analytics (2024 Admission).

Part A — Short Essay

Answer any five questions. Each question carries 2 marks. (5 × 2 = 10 Marks)

Q1. Define machine learning. What are the different types of machine learning?

Machine Learning is a branch of AI that builds systems which learn patterns from data to make predictions or decisions, without being explicitly programmed for the task (Tom Mitchell, 1997).

Types:

Supervised — labelled data; learn $f : X \to Y$ (classification, regression).
Unsupervised — unlabelled data; find structure (clustering, dimensionality reduction).
Reinforcement — agent interacts with an environment, learns from rewards (Q-learning, policy gradients).
(Semi-supervised, Self-supervised — hybrids that mix labelled and unlabelled data.)

Q2. What is dimensionality reduction?

Dimensionality reduction is the process of representing high-dimensional data $\mathbb{R}^n$ in a lower-dimensional space $\mathbb{R}^k$ ( $k \ll n$ ) while preserving as much information as possible. It combats the curse of dimensionality, removes redundancy, speeds up training and aids visualisation.

Examples: PCA, LDA, t-SNE, UMAP, autoencoders.

Q3. Define F1-score.

The F1-score is the harmonic mean of precision and recall:

$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$

It ranges 0 to 1 and gives a single number balancing false positives and false negatives — useful for imbalanced classification where accuracy is misleading.

Q4. What is the difference between precision and recall?

Precision = $\dfrac{TP}{TP + FP}$ — of items the model predicted positive, how many are truly positive. "How careful?"
Recall = $\dfrac{TP}{TP + FN}$ — of items truly positive, how many did the model find. "How complete?"

Precision matters when false positives are costly (spam filter). Recall matters when false negatives are costly (cancer screening).

Q5. What contributes to support and confidence?

In association rule mining for a rule $A \Rightarrow B$ :

Support = $\dfrac{\text{count}(A \cup B)}{N}$ — fraction of transactions containing both $A$ and $B$ . Measures the rule's frequency / popularity.
Confidence = $\dfrac{\text{count}(A \cup B)}{\text{count}(A)} = P(B \mid A)$ — fraction of $A$ -transactions that also contain $B$ . Measures the rule's reliability.

Q6. What makes RNNs suitable for processing sequential data?

Recurrent Neural Networks maintain a hidden state $h_t$ updated at each time step from the previous state and current input: $h_t = f(W h_{t-1} + U x_t)$ . This recurrence gives them memory of past inputs, allowing them to model dependencies that depend on order — exactly what sequential data (text, speech, time-series) requires. Variants like LSTM and GRU add gating to handle long-range dependencies.

Q7. Why is the activation function important in a perceptron?

The activation function introduces non-linearity into the perceptron's output. Without it, a stack of layers would collapse into a single linear map, no matter how deep — incapable of learning non-linear patterns like XOR. Non-linear activations (sigmoid, tanh, ReLU) allow neural networks to approximate any continuous function (universal approximation theorem).

Part B — Long Essay

Answer any two questions. Each question carries 20 marks. (2 × 20 = 40 Marks)

Q8.

(a) Analyse the problem of email spam classification using ML. Explain the stages of data preprocessing. (10 marks)

Problem framing. Given an email $e$ , classify it as spam (1) or ham (0). This is a binary supervised classification problem. Performance is judged by precision (avoid throwing away genuine email), recall (catch spam), and the F1-score.

Typical pipeline. Email text + metadata → preprocessing → feature extraction → train a classifier (Naive Bayes, Logistic Regression, SVM or Transformer) → evaluate on held-out emails → deploy in mail-server filter.

Stages of data preprocessing for spam classification.

1. Data collection. Gather a labelled corpus — e.g., the Enron dataset, SpamAssassin, or in-house user reports. Include both spam and ham, ideally balanced or class-weighted.

2. Lowercasing. Convert all text to lowercase so "FREE" and "free" map to one token.

3. Removing HTML, URLs, email headers. Spam often contains HTML tags and tracking links. Use regex or a parser (BeautifulSoup) to strip them — but optionally keep a count of links as a feature, because spam tends to have many.

4. Tokenisation. Split text into tokens (words, sub-words). Use NLTK, SpaCy or transformer tokenisers.

5. Removing punctuation, digits, special chars. Or replace them with placeholder tokens like <NUM> if numeric patterns matter.

6. Stop-word removal. Drop high-frequency, low-information words ("the", "and"). Optional — sometimes stop-words help spam detection.

7. Stemming / Lemmatisation. Reduce words to roots ("running" → "run") so morphological variants share weight.

8. Handle imbalance. Often more ham than spam. Use class weights, oversample spam (SMOTE), or undersample ham.

9. Feature engineering.

Bag-of-Words / TF-IDF vectorisation.
n-grams ("free money" is more telling than the unigrams).
Meta-features: number of links, exclamation marks, ALL-CAPS ratio, length, sender domain, presence of attachments.

10. Train / test split. Stratified split to preserve spam ratio. Optionally use temporal split (older → train, newer → test) because spam content drifts.

11. Scaling / normalisation. Apply standard scaling to numeric meta-features if mixing with TF-IDF.

Modelling and evaluation. Train Multinomial Naive Bayes (a classical, fast baseline), Logistic Regression with L1, SVM with linear kernel, or fine-tune a small Transformer. Report precision, recall, F1, ROC-AUC. Iterate by examining false positives and false negatives.

(b) Various error calculation techniques in ML models (10 marks)

Regression error metrics.

Metric	Formula	Notes
MAE (Mean Absolute Error)	$\dfrac{1}{n}\sum	y_i - \hat y_i
MSE (Mean Squared Error)	$\dfrac{1}{n}\sum (y_i - \hat y_i)^2$	Penalises large errors; differentiable.
RMSE	$\sqrt{MSE}$	Same units as $y$ . Most reported.
MAPE	$\dfrac{100}{n}\sum \left	\dfrac{y_i - \hat y_i}{y_i}\right
$R^2$	$1 - \dfrac{SS_{res}}{SS_{tot}}$	Fraction of variance explained; 1 = perfect, 0 = mean baseline.
Adjusted $R^2$	Penalises adding non-informative predictors.

Classification error metrics.

Built on the confusion matrix of TP, TN, FP, FN.

Metric	Formula	Meaning
Accuracy	$\dfrac{TP + TN}{N}$	Overall correct rate. Misleading under imbalance.
Precision	$\dfrac{TP}{TP + FP}$	Of predicted positives, how many real.
Recall (Sensitivity)	$\dfrac{TP}{TP + FN}$	Of real positives, how many found.
Specificity	$\dfrac{TN}{TN + FP}$	True-negative rate.
F1	$\dfrac{2 \cdot P \cdot R}{P + R}$	Balanced harmonic mean.
F $\beta$	weights recall vs precision	Use F $_2$ when recall matters more.
ROC-AUC	Area under TPR vs FPR curve	Threshold-independent ranking quality.
PR-AUC	Area under Precision-Recall curve	Better than ROC under heavy imbalance.
Log loss	$-\dfrac{1}{n}\sum [y \log \hat p + (1-y) \log(1-\hat p)]$	Penalises confident-wrong predictions.

Probabilistic / Information-theoretic.

Cross-entropy — multiclass log loss.
KL divergence — distance between predicted and true distributions.
Brier score — mean squared error on predicted probabilities.

Generalisation diagnostics.

Train vs validation gap — wide gap = overfitting.
Learning curves — error vs training size to diagnose bias vs variance.
Cross-validation mean and standard deviation.

Domain-specific.

Object detection: IoU, mAP.
Translation: BLEU, METEOR.
Recommender: NDCG, MAP, MRR.

Choosing the right metric. Tie the metric to the business cost: in fraud detection optimise recall + PR-AUC because FN is expensive; in spam optimise precision; in forecasting often RMSE. Wrong metric = wrong model wins.

Q9.

(a) Heart-disease prediction model using supervised learning — algorithm choice, feature selection, training, evaluation. (10 marks)

Problem. Given a patient's demographic, clinical and lab data, predict probability of heart disease (binary classification). Standard dataset: UCI Cleveland.

Step 1 — Data understanding. Features include age, sex, chest pain type, resting blood pressure, cholesterol, fasting blood sugar, resting ECG, max heart rate, exercise-induced angina, ST depression, slope, # major vessels, thalassemia.

Step 2 — Preprocessing.

Impute missing values (median for numeric, mode for categorical, KNN imputation for nuanced cases).
One-hot encode categorical variables.
Standardise numeric features (mean 0, std 1) — crucial for distance / gradient models.
Stratified train/test split (e.g., 80/20).

Step 3 — Feature selection. Methods used together:

Domain knowledge — cardiologists' guidance on which markers matter (cholesterol, ST depression, age).
Univariate filter — Chi-square / mutual information against target.
Embedded methods — L1 (Lasso) logistic regression zeros out unimportant features.
Wrapper / RFE — Recursive Feature Elimination using a model's coefficients.
Model importance — Random Forest / XGBoost feature importance.

Keep ~8–10 most informative features to keep model interpretable.

Step 4 — Algorithm choice. Compare candidates:

Algorithm	Pros	Cons
Logistic Regression	Interpretable, gives probabilities, fast	Linear only
Decision Tree	Interpretable, handles non-linear	Overfits
Random Forest	Robust, captures interactions, good baseline	Less interpretable
Gradient Boosting (XGBoost)	Often best accuracy	Tuning effort
SVM (RBF)	Strong with small data	Slow on big data, no probs by default
KNN	Simple	Sensitive to scaling, slow at predict time

For clinical deployment, Logistic Regression or Random Forest is often chosen for the balance of accuracy and interpretability (doctors want to know why).

Step 5 — Training.

Define loss (binary cross-entropy / Gini for trees).
Fit on training set; tune hyperparameters via grid/random search inside stratified k-fold CV (k = 5).
Address class imbalance with class_weight='balanced' or SMOTE oversampling.

Step 6 — Evaluation. Heart disease is a high-cost FN problem — missing a sick patient is dangerous. Use:

Recall (Sensitivity) as the headline metric.
PR-AUC, ROC-AUC for threshold-free quality.
F1 as a balanced summary.
Confusion matrix at chosen threshold.
Calibration plot so predicted probabilities are honest.

Tune the decision threshold to achieve, say, 90% recall while keeping precision acceptable.

Step 7 — Interpret & deploy. SHAP values reveal each feature's contribution to a specific patient's risk score, which clinicians can audit. Deploy as a clinician decision-support tool — never autonomous diagnosis.

(b) Concept of hyperplanes in SVM (10 marks)

Hyperplane. In $\mathbb{R}^n$ , a hyperplane is the $(n-1)$ -dimensional set $H = \{x : w^T x + b = 0\}$ with $w \in \mathbb{R}^n \setminus \{0\}$ . The vector $w$ is normal (perpendicular) to $H$ .

$\mathbb{R}^2$ : a line.
$\mathbb{R}^3$ : a plane.
$\mathbb{R}^n$ : an $(n-1)$ -flat.

It divides space into two half-spaces $H^+ = \{x : w^T x + b \ge 0\}$ and $H^- = \{x : w^T x + b \le 0\}$ .

SVM idea. For binary classification with $y_i \in \{-1, +1\}$ , find the hyperplane that separates the classes with the largest margin. Among infinitely many separating hyperplanes, the one with the biggest distance to the nearest points generalises best.

Geometric margin. The distance from a point $x_i$ to hyperplane $w^T x + b = 0$ is $\dfrac{|w^T x_i + b|}{\|w\|}$ . If we rescale $w$ and $b$ so that the closest points satisfy $|w^T x_i + b| = 1$ , then the margin (distance between the two parallel boundaries $w^T x + b = \pm 1$ ) is

$\text{margin} = \frac{2}{\|w\|}.$

Hard-margin SVM (linearly separable).

Maximise margin = minimise $\|w\|$ , with correct classification: $\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 \;\; \forall i.$

Soft-margin SVM (non-separable data). Introduce slack $\xi_i \ge 0$ allowing some misclassification, controlled by penalty $C$ : $\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i, \quad y_i(w^T x_i + b) \ge 1 - \xi_i.$ $C$ large ⇒ low tolerance, narrow margin. $C$ small ⇒ tolerant, wider margin.

Support vectors. The data points satisfying $y_i (w^T x_i + b) = 1$ (or violating with $\xi_i > 0$ ) are support vectors. They lie on the margin boundaries and entirely determine the optimal hyperplane — removing any other point doesn't change the solution. This makes SVM memory-efficient at prediction time.

Kernel trick — non-linear hyperplanes. When classes are not linearly separable in the input space, map to a higher-dim feature space via $\phi(x)$ and find a linear hyperplane there — which corresponds to a non-linear boundary in the original space. Computing $\phi(x)$ explicitly is expensive, so we use kernels $K(x, x') = \langle \phi(x), \phi(x') \rangle$ :

Linear: $K = x^T x'$
Polynomial: $K = (x^T x' + 1)^d$
RBF / Gaussian: $K = \exp(-\gamma \|x - x'\|^2)$
Sigmoid: $K = \tanh(\kappa x^T x' + c)$

Decision function. New point classified by $\hat y = \text{sign}(w^T x + b)$ , with confidence proportional to distance from the hyperplane.

Q10.

(a) ECLAT algorithm for frequent itemset mining — compare with Apriori, illustrate with example. (10 marks)

Frequent Itemset Mining problem. Given a transaction database $T$ and minimum support $\sigma$ , find all itemsets $I$ whose support $\text{sup}(I) \ge \sigma$ . Used in market-basket analysis, web-click pattern mining, recommendation.

ECLAT — Equivalence CLAss Transformation (Zaki, 1997). Mines frequent itemsets using a depth-first search on a vertical (tid-list) representation of data.

Key idea — vertical layout. Instead of storing each transaction as a row of items (horizontal), store each item as the set of transaction IDs (TIDs) containing it:

Item	TID-set
Bread	{1, 2, 4, 5}
Milk	{1, 2, 3, 5}
Eggs	{2, 4, 5}

The support of an itemset $\{X, Y\}$ is then the size of the intersection of the TID-sets — pure set operations, no DB scan.

Algorithm.

ECLAT(prefix, items_with_tidlists, σ):
    for i in items_with_tidlists:
        if |tidlist(i)| ≥ σ:
            output (prefix ∪ {i}, |tidlist(i)|)
            new_items = []
            for j after i in items_with_tidlists:
                t = tidlist(i) ∩ tidlist(j)
                if |t| ≥ σ:
                    new_items.append((j, t))
            ECLAT(prefix ∪ {i}, new_items, σ)

Comparison with Apriori.

Aspect	Apriori	ECLAT
Data layout	Horizontal	Vertical (tid-lists)
Search strategy	Breadth-first, level-wise	Depth-first
Support computation	Repeated DB scans	Set intersections
Candidate generation	Combine k-itemsets to form (k+1) candidates, prune	Intersect tid-lists of items sharing a prefix
Memory	Smaller per pass (candidates only)	Larger — must hold tid-lists
Speed on dense data	Slow — many scans	Fast — fewer scans
Speed on sparse, very large data	Acceptable	Can blow up memory
Disk-friendly	Yes	Less so

Example. Suppose 5 transactions:

TID	Items
1	Bread, Milk
2	Bread, Milk, Eggs
3	Milk
4	Bread, Eggs
5	Bread, Milk, Eggs

Minimum support $\sigma = 2$ .

Vertical layout: B = {1,2,4,5}, M = {1,2,3,5}, E = {2,4,5}.

Singletons (all ≥ 2): sup(B)=4, sup(M)=4, sup(E)=3.

Pairs by intersection:

B ∩ M = {1,2,5}, sup = 3 ✓
B ∩ E = {2,4,5}, sup = 3 ✓
M ∩ E = {2,5}, sup = 2 ✓

Triple: B ∩ M ∩ E = {2,5}, sup = 2 ✓.

All frequent itemsets: {B}, {M}, {E}, {B,M}, {B,E}, {M,E}, {B,M,E} — found with no extra DB scans, just three set intersections per branch.

(b) Upper Confidence Bound (UCB) for the exploration–exploitation dilemma (10 marks)

Exploration–exploitation dilemma. In multi-armed bandits and reinforcement learning, at each step we must choose between:

Exploit — play the action with the highest current estimated reward.
Explore — try other actions in case a less-tested one is actually better.

Always exploiting locks in to early estimates that may be wrong. Always exploring never collects reward. We need a principled balance.

UCB idea. Be optimistic in the face of uncertainty — for each arm, compute an upper confidence bound on its true reward and pick the arm with the highest bound. This naturally explores arms we haven't tried much (their bound is high because uncertainty is high) and exploits arms we trust (their estimate is high and uncertainty is low).

UCB1 algorithm (Auer et al., 2002). For each round $t$ pick:

$a_t = \arg\max_a \left[ \bar{r}_a + \sqrt{\frac{2 \ln t}{N_a}} \right]$

where $\bar{r}_a$ is the average reward observed from arm $a$ so far and $N_a$ is the number of times $a$ has been played.

First term $\bar{r}_a$ — exploitation.
Second term $\sqrt{2 \ln t / N_a}$ — exploration bonus; large when $N_a$ is small (arm under-tested) and slowly grows with $t$ .

Why it works. Hoeffding's inequality shows the true mean of arm $a$ lies below $\bar{r}_a + \sqrt{2 \ln t / N_a}$ with high probability. So the chosen arm is either:

the truly best arm (we exploit correctly), or
an under-explored arm whose uncertainty pulled it up (we explore — and will tighten its bound over time).

Regret bound. UCB1 achieves $O(\log t)$ regret — provably optimal up to constants.

Practical worked example. Three arms, after some plays:

Arm	Plays $N_a$	Avg reward $\bar r_a$	Bonus $\sqrt{2\ln 100 / N_a}$	UCB
1	60	0.70	0.39	1.09
2	30	0.60	0.55	1.15
3	10	0.50	0.96	1.46

Even though Arm 3 currently has the lowest average reward, its high uncertainty gives it the highest UCB → UCB explores it next.

Variants and uses.

UCB2, UCB-V (variance-aware), KL-UCB (tighter bounds).
Used in A/B testing, online recommender systems, ad selection, hyperparameter search, and as the value back-up in Monte Carlo Tree Search (e.g., AlphaGo).
Reinforcement-learning extensions include UCRL for MDPs.

Q11.

(a) How do Artificial Neural Networks (ANNs) learn from data? Explain the role of backpropagation. (10 marks)

Architecture. An ANN is a layered graph of artificial neurons. Each neuron in layer $\ell$ computes: $z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}, \quad a^{(\ell)} = \sigma(z^{(\ell)}),$ where $\sigma$ is a non-linear activation (ReLU, sigmoid, tanh). The full network is a composition of these layers.

Learning is optimisation. Define a loss function $L(\hat y, y)$ that measures how wrong the network's prediction is — e.g., MSE for regression, cross-entropy for classification. Training means finding weights $\theta = \{W^{(\ell)}, b^{(\ell)}\}$ that minimise the average loss over the training set: $\theta^* = \arg\min_\theta \; \frac{1}{N} \sum_{i=1}^N L\big(\hat y(x_i; \theta), y_i\big).$

This is done by gradient descent — repeatedly nudge $\theta$ in the direction of steepest descent of $L$ .

The training loop.

Forward pass — feed input $x$ through layers to compute $\hat y$ and loss $L$ .
Backward pass (backpropagation) — compute $\partial L / \partial \theta$ for every weight.
Update — $\theta \leftarrow \theta - \eta \, \partial L / \partial \theta$ ( $\eta$ = learning rate).
Repeat over mini-batches for many epochs.

Backpropagation.

Goal. Compute $\dfrac{\partial L}{\partial W^{(\ell)}}$ and $\dfrac{\partial L}{\partial b^{(\ell)}}$ for every layer efficiently.

Idea. Apply the chain rule layer by layer, starting at the output and moving backward. Reuse intermediate results saved from the forward pass.

Equations (for a feed-forward net with loss $L$ at the output layer $L$ ):

Output error: $\delta^{(L)} = \nabla_a L \odot \sigma'(z^{(L)})$ .
Propagate: $\delta^{(\ell)} = \big( W^{(\ell+1)} \big)^T \delta^{(\ell+1)} \odot \sigma'(z^{(\ell)})$ .
Gradients: $\frac{\partial L}{\partial W^{(\ell)}} = \delta^{(\ell)} \big( a^{(\ell-1)} \big)^T, \quad \frac{\partial L}{\partial b^{(\ell)}} = \delta^{(\ell)}.$

Why it's brilliant. Without backprop, computing gradients for an $L$ -layer network with $n$ weights would cost $O(n^2)$ per training step. Backprop reduces it to $O(n)$ by sharing computation through the chain rule — making deep networks feasible.

Why it's important.

Backpropagation is the engine of modern deep learning. Every CNN, RNN, Transformer, GPT-style model trains by gradient descent + backprop.
It allows networks with millions–billions of parameters to be trained.
It is generic — any differentiable computation graph can be trained by backprop (automatic differentiation in PyTorch/TensorFlow makes this nearly free).

Practical concerns.

Vanishing / exploding gradients with deep nets — addressed by ReLU, batch normalisation, residual connections, gradient clipping.
Mini-batch / Stochastic Gradient Descent for scalability.
Optimisers — Adam, RMSProp, AdaGrad — accelerate and stabilise the update step.

(b) Basic structure and working principle of a perceptron in a neural network. (10 marks)

History. The Perceptron is the simplest artificial neuron, introduced by Frank Rosenblatt (1958) — the building block of modern neural networks.

Structure.

A perceptron takes $n$ real inputs $x_1, x_2, \dots, x_n$ , each multiplied by a learned weight $w_i$ , summed, plus a bias $b$ , and passed through an activation function:

  x1 ─w1─┐
  x2 ─w2─┤   z = Σ wᵢxᵢ + b      a = σ(z) → output
  ...    ┼─→ (──────────────)──→ ( σ )──→
  xn ─wn─┘                     bias b

Components.

Inputs $x_1, \dots, x_n$ — features of one data sample.
Weights $w_1, \dots, w_n$ — learned during training; encode importance and sign of each feature.
Bias $b$ — shifts the decision boundary, allowing it to not pass through origin.
Weighted sum (pre-activation): $z = \sum_{i=1}^n w_i x_i + b = w^T x + b$ .
Activation function $\sigma$ — originally a step function $\sigma(z) = 1 \text{ if } z \ge 0 \text{ else } 0$ . Modern versions use sigmoid, tanh, or ReLU.
Output $\hat y = \sigma(z)$ .

Geometric interpretation. $w^T x + b = 0$ defines a hyperplane that splits feature space into two regions. The perceptron classifies a point by which side of the hyperplane it lies on.

Working principle — learning.

The perceptron learning rule updates weights based on errors. For each training example $(x_i, y_i)$ with $y_i \in \{0, 1\}$ :

Compute prediction $\hat y_i = \sigma(w^T x_i + b)$ .
Update: $w \leftarrow w + \eta (y_i - \hat y_i) x_i$ $b \leftarrow b + \eta (y_i - \hat y_i)$
Repeat over the dataset until convergence (or max epochs).

If the data is linearly separable, the Perceptron Convergence Theorem (Rosenblatt, Novikoff) guarantees this rule will find a separating hyperplane in finitely many steps.

Limitation — XOR problem. A single perceptron can only learn linearly separable patterns. It famously fails on the XOR function — there is no straight line that separates $\{(0,1),(1,0)\}$ from $\{(0,0),(1,1)\}$ . This limitation, pointed out by Minsky and Papert (1969), caused the "AI winter" until multi-layer perceptrons (MLPs) with non-linear activations and backpropagation were developed to overcome it.

Importance today.

The perceptron is the conceptual atom of every neural network — a deep network is just many perceptrons stacked and connected.
Logistic regression is a perceptron with sigmoid activation and cross-entropy loss.
Understanding the perceptron clarifies linear classification, decision boundaries, weight updates, and the geometry of learning.