November 2024 — Solved
Machine Learning for Data Science
End-Semester Examination, First Semester PG Diploma in Data Science and Analytics (2024 Admission).
Part A — Short Essay
Answer any five questions. Each question carries 2 marks. (5 × 2 = 10 Marks)
Q1. Define machine learning. What are the different types of machine learning?
Machine Learning is a branch of AI that builds systems which learn patterns from data to make predictions or decisions, without being explicitly programmed for the task (Tom Mitchell, 1997).
Types:
- Supervised — labelled data; learn (classification, regression).
- Unsupervised — unlabelled data; find structure (clustering, dimensionality reduction).
- Reinforcement — agent interacts with an environment, learns from rewards (Q-learning, policy gradients).
- (Semi-supervised, Self-supervised — hybrids that mix labelled and unlabelled data.)
Q2. What is dimensionality reduction?
Dimensionality reduction is the process of representing high-dimensional data in a lower-dimensional space () while preserving as much information as possible. It combats the curse of dimensionality, removes redundancy, speeds up training and aids visualisation.
Examples: PCA, LDA, t-SNE, UMAP, autoencoders.
Q3. Define F1-score.
The F1-score is the harmonic mean of precision and recall:
It ranges 0 to 1 and gives a single number balancing false positives and false negatives — useful for imbalanced classification where accuracy is misleading.
Q4. What is the difference between precision and recall?
- Precision = — of items the model predicted positive, how many are truly positive. "How careful?"
- Recall = — of items truly positive, how many did the model find. "How complete?"
Precision matters when false positives are costly (spam filter). Recall matters when false negatives are costly (cancer screening).
Q5. What contributes to support and confidence?
In association rule mining for a rule :
- Support = — fraction of transactions containing both and . Measures the rule's frequency / popularity.
- Confidence = — fraction of -transactions that also contain . Measures the rule's reliability.
Q6. What makes RNNs suitable for processing sequential data?
Recurrent Neural Networks maintain a hidden state updated at each time step from the previous state and current input: . This recurrence gives them memory of past inputs, allowing them to model dependencies that depend on order — exactly what sequential data (text, speech, time-series) requires. Variants like LSTM and GRU add gating to handle long-range dependencies.
Q7. Why is the activation function important in a perceptron?
The activation function introduces non-linearity into the perceptron's output. Without it, a stack of layers would collapse into a single linear map, no matter how deep — incapable of learning non-linear patterns like XOR. Non-linear activations (sigmoid, tanh, ReLU) allow neural networks to approximate any continuous function (universal approximation theorem).
Part B — Long Essay
Answer any two questions. Each question carries 20 marks. (2 × 20 = 40 Marks)
Q8.
(a) Analyse the problem of email spam classification using ML. Explain the stages of data preprocessing. (10 marks)
Problem framing. Given an email , classify it as spam (1) or ham (0). This is a binary supervised classification problem. Performance is judged by precision (avoid throwing away genuine email), recall (catch spam), and the F1-score.
Typical pipeline. Email text + metadata → preprocessing → feature extraction → train a classifier (Naive Bayes, Logistic Regression, SVM or Transformer) → evaluate on held-out emails → deploy in mail-server filter.
Stages of data preprocessing for spam classification.
1. Data collection. Gather a labelled corpus — e.g., the Enron dataset, SpamAssassin, or in-house user reports. Include both spam and ham, ideally balanced or class-weighted.
2. Lowercasing. Convert all text to lowercase so "FREE" and "free" map to one token.
3. Removing HTML, URLs, email headers. Spam often contains HTML tags and tracking links. Use regex or a parser (BeautifulSoup) to strip them — but optionally keep a count of links as a feature, because spam tends to have many.
4. Tokenisation. Split text into tokens (words, sub-words). Use NLTK, SpaCy or transformer tokenisers.
5. Removing punctuation, digits, special chars. Or replace them with placeholder tokens like <NUM> if numeric patterns matter.
6. Stop-word removal. Drop high-frequency, low-information words ("the", "and"). Optional — sometimes stop-words help spam detection.
7. Stemming / Lemmatisation. Reduce words to roots ("running" → "run") so morphological variants share weight.
8. Handle imbalance. Often more ham than spam. Use class weights, oversample spam (SMOTE), or undersample ham.
9. Feature engineering.
- Bag-of-Words / TF-IDF vectorisation.
- n-grams ("free money" is more telling than the unigrams).
- Meta-features: number of links, exclamation marks, ALL-CAPS ratio, length, sender domain, presence of attachments.
10. Train / test split. Stratified split to preserve spam ratio. Optionally use temporal split (older → train, newer → test) because spam content drifts.
11. Scaling / normalisation. Apply standard scaling to numeric meta-features if mixing with TF-IDF.
Modelling and evaluation. Train Multinomial Naive Bayes (a classical, fast baseline), Logistic Regression with L1, SVM with linear kernel, or fine-tune a small Transformer. Report precision, recall, F1, ROC-AUC. Iterate by examining false positives and false negatives.
(b) Various error calculation techniques in ML models (10 marks)
Regression error metrics.
| Metric | Formula | Notes |
|---|---|---|
| MAE (Mean Absolute Error) | $\dfrac{1}{n}\sum | y_i - \hat y_i |
| MSE (Mean Squared Error) | Penalises large errors; differentiable. | |
| RMSE | Same units as . Most reported. | |
| MAPE | $\dfrac{100}{n}\sum \left | \dfrac{y_i - \hat y_i}{y_i}\right |
| Fraction of variance explained; 1 = perfect, 0 = mean baseline. | ||
| Adjusted | Penalises adding non-informative predictors. |
Classification error metrics.
Built on the confusion matrix of TP, TN, FP, FN.
| Metric | Formula | Meaning |
|---|---|---|
| Accuracy | Overall correct rate. Misleading under imbalance. | |
| Precision | Of predicted positives, how many real. | |
| Recall (Sensitivity) | Of real positives, how many found. | |
| Specificity | True-negative rate. | |
| F1 | Balanced harmonic mean. | |
| F | weights recall vs precision | Use F when recall matters more. |
| ROC-AUC | Area under TPR vs FPR curve | Threshold-independent ranking quality. |
| PR-AUC | Area under Precision-Recall curve | Better than ROC under heavy imbalance. |
| Log loss | Penalises confident-wrong predictions. |
Probabilistic / Information-theoretic.
- Cross-entropy — multiclass log loss.
- KL divergence — distance between predicted and true distributions.
- Brier score — mean squared error on predicted probabilities.
Generalisation diagnostics.
- Train vs validation gap — wide gap = overfitting.
- Learning curves — error vs training size to diagnose bias vs variance.
- Cross-validation mean and standard deviation.
Domain-specific.
- Object detection: IoU, mAP.
- Translation: BLEU, METEOR.
- Recommender: NDCG, MAP, MRR.
Choosing the right metric. Tie the metric to the business cost: in fraud detection optimise recall + PR-AUC because FN is expensive; in spam optimise precision; in forecasting often RMSE. Wrong metric = wrong model wins.
Q9.
(a) Heart-disease prediction model using supervised learning — algorithm choice, feature selection, training, evaluation. (10 marks)
Problem. Given a patient's demographic, clinical and lab data, predict probability of heart disease (binary classification). Standard dataset: UCI Cleveland.
Step 1 — Data understanding. Features include age, sex, chest pain type, resting blood pressure, cholesterol, fasting blood sugar, resting ECG, max heart rate, exercise-induced angina, ST depression, slope, # major vessels, thalassemia.
Step 2 — Preprocessing.
- Impute missing values (median for numeric, mode for categorical, KNN imputation for nuanced cases).
- One-hot encode categorical variables.
- Standardise numeric features (mean 0, std 1) — crucial for distance / gradient models.
- Stratified train/test split (e.g., 80/20).
Step 3 — Feature selection. Methods used together:
- Domain knowledge — cardiologists' guidance on which markers matter (cholesterol, ST depression, age).
- Univariate filter — Chi-square / mutual information against target.
- Embedded methods — L1 (Lasso) logistic regression zeros out unimportant features.
- Wrapper / RFE — Recursive Feature Elimination using a model's coefficients.
- Model importance — Random Forest / XGBoost feature importance.
Keep ~8–10 most informative features to keep model interpretable.
Step 4 — Algorithm choice. Compare candidates:
| Algorithm | Pros | Cons |
|---|---|---|
| Logistic Regression | Interpretable, gives probabilities, fast | Linear only |
| Decision Tree | Interpretable, handles non-linear | Overfits |
| Random Forest | Robust, captures interactions, good baseline | Less interpretable |
| Gradient Boosting (XGBoost) | Often best accuracy | Tuning effort |
| SVM (RBF) | Strong with small data | Slow on big data, no probs by default |
| KNN | Simple | Sensitive to scaling, slow at predict time |
For clinical deployment, Logistic Regression or Random Forest is often chosen for the balance of accuracy and interpretability (doctors want to know why).
Step 5 — Training.
- Define loss (binary cross-entropy / Gini for trees).
- Fit on training set; tune hyperparameters via grid/random search inside stratified k-fold CV (k = 5).
- Address class imbalance with
class_weight='balanced'or SMOTE oversampling.
Step 6 — Evaluation. Heart disease is a high-cost FN problem — missing a sick patient is dangerous. Use:
- Recall (Sensitivity) as the headline metric.
- PR-AUC, ROC-AUC for threshold-free quality.
- F1 as a balanced summary.
- Confusion matrix at chosen threshold.
- Calibration plot so predicted probabilities are honest.
Tune the decision threshold to achieve, say, 90% recall while keeping precision acceptable.
Step 7 — Interpret & deploy. SHAP values reveal each feature's contribution to a specific patient's risk score, which clinicians can audit. Deploy as a clinician decision-support tool — never autonomous diagnosis.
(b) Concept of hyperplanes in SVM (10 marks)
Hyperplane. In , a hyperplane is the -dimensional set with . The vector is normal (perpendicular) to .
- : a line.
- : a plane.
- : an -flat.
It divides space into two half-spaces and .
SVM idea. For binary classification with , find the hyperplane that separates the classes with the largest margin. Among infinitely many separating hyperplanes, the one with the biggest distance to the nearest points generalises best.
Geometric margin. The distance from a point to hyperplane is . If we rescale and so that the closest points satisfy , then the margin (distance between the two parallel boundaries ) is
Hard-margin SVM (linearly separable).
Maximise margin = minimise , with correct classification:
Soft-margin SVM (non-separable data). Introduce slack allowing some misclassification, controlled by penalty : large ⇒ low tolerance, narrow margin. small ⇒ tolerant, wider margin.
Support vectors. The data points satisfying (or violating with ) are support vectors. They lie on the margin boundaries and entirely determine the optimal hyperplane — removing any other point doesn't change the solution. This makes SVM memory-efficient at prediction time.
Kernel trick — non-linear hyperplanes. When classes are not linearly separable in the input space, map to a higher-dim feature space via and find a linear hyperplane there — which corresponds to a non-linear boundary in the original space. Computing explicitly is expensive, so we use kernels :
- Linear:
- Polynomial:
- RBF / Gaussian:
- Sigmoid:
Decision function. New point classified by , with confidence proportional to distance from the hyperplane.
Q10.
(a) ECLAT algorithm for frequent itemset mining — compare with Apriori, illustrate with example. (10 marks)
Frequent Itemset Mining problem. Given a transaction database and minimum support , find all itemsets whose support . Used in market-basket analysis, web-click pattern mining, recommendation.
ECLAT — Equivalence CLAss Transformation (Zaki, 1997). Mines frequent itemsets using a depth-first search on a vertical (tid-list) representation of data.
Key idea — vertical layout. Instead of storing each transaction as a row of items (horizontal), store each item as the set of transaction IDs (TIDs) containing it:
| Item | TID-set |
|---|---|
| Bread | {1, 2, 4, 5} |
| Milk | {1, 2, 3, 5} |
| Eggs | {2, 4, 5} |
The support of an itemset is then the size of the intersection of the TID-sets — pure set operations, no DB scan.
Algorithm.
ECLAT(prefix, items_with_tidlists, σ):
for i in items_with_tidlists:
if |tidlist(i)| ≥ σ:
output (prefix ∪ {i}, |tidlist(i)|)
new_items = []
for j after i in items_with_tidlists:
t = tidlist(i) ∩ tidlist(j)
if |t| ≥ σ:
new_items.append((j, t))
ECLAT(prefix ∪ {i}, new_items, σ)
Comparison with Apriori.
| Aspect | Apriori | ECLAT |
|---|---|---|
| Data layout | Horizontal | Vertical (tid-lists) |
| Search strategy | Breadth-first, level-wise | Depth-first |
| Support computation | Repeated DB scans | Set intersections |
| Candidate generation | Combine k-itemsets to form (k+1) candidates, prune | Intersect tid-lists of items sharing a prefix |
| Memory | Smaller per pass (candidates only) | Larger — must hold tid-lists |
| Speed on dense data | Slow — many scans | Fast — fewer scans |
| Speed on sparse, very large data | Acceptable | Can blow up memory |
| Disk-friendly | Yes | Less so |
Example. Suppose 5 transactions:
| TID | Items |
|---|---|
| 1 | Bread, Milk |
| 2 | Bread, Milk, Eggs |
| 3 | Milk |
| 4 | Bread, Eggs |
| 5 | Bread, Milk, Eggs |
Minimum support .
Vertical layout: B = {1,2,4,5}, M = {1,2,3,5}, E = {2,4,5}.
Singletons (all ≥ 2): sup(B)=4, sup(M)=4, sup(E)=3.
Pairs by intersection:
- B ∩ M = {1,2,5}, sup = 3 ✓
- B ∩ E = {2,4,5}, sup = 3 ✓
- M ∩ E = {2,5}, sup = 2 ✓
Triple: B ∩ M ∩ E = {2,5}, sup = 2 ✓.
All frequent itemsets: {B}, {M}, {E}, {B,M}, {B,E}, {M,E}, {B,M,E} — found with no extra DB scans, just three set intersections per branch.
(b) Upper Confidence Bound (UCB) for the exploration–exploitation dilemma (10 marks)
Exploration–exploitation dilemma. In multi-armed bandits and reinforcement learning, at each step we must choose between:
- Exploit — play the action with the highest current estimated reward.
- Explore — try other actions in case a less-tested one is actually better.
Always exploiting locks in to early estimates that may be wrong. Always exploring never collects reward. We need a principled balance.
UCB idea. Be optimistic in the face of uncertainty — for each arm, compute an upper confidence bound on its true reward and pick the arm with the highest bound. This naturally explores arms we haven't tried much (their bound is high because uncertainty is high) and exploits arms we trust (their estimate is high and uncertainty is low).
UCB1 algorithm (Auer et al., 2002). For each round pick:
where is the average reward observed from arm so far and is the number of times has been played.
- First term — exploitation.
- Second term — exploration bonus; large when is small (arm under-tested) and slowly grows with .
Why it works. Hoeffding's inequality shows the true mean of arm lies below with high probability. So the chosen arm is either:
- the truly best arm (we exploit correctly), or
- an under-explored arm whose uncertainty pulled it up (we explore — and will tighten its bound over time).
Regret bound. UCB1 achieves regret — provably optimal up to constants.
Practical worked example. Three arms, after some plays:
| Arm | Plays | Avg reward | Bonus | UCB |
|---|---|---|---|---|
| 1 | 60 | 0.70 | 0.39 | 1.09 |
| 2 | 30 | 0.60 | 0.55 | 1.15 |
| 3 | 10 | 0.50 | 0.96 | 1.46 |
Even though Arm 3 currently has the lowest average reward, its high uncertainty gives it the highest UCB → UCB explores it next.
Variants and uses.
- UCB2, UCB-V (variance-aware), KL-UCB (tighter bounds).
- Used in A/B testing, online recommender systems, ad selection, hyperparameter search, and as the value back-up in Monte Carlo Tree Search (e.g., AlphaGo).
- Reinforcement-learning extensions include UCRL for MDPs.
Q11.
(a) How do Artificial Neural Networks (ANNs) learn from data? Explain the role of backpropagation. (10 marks)
Architecture. An ANN is a layered graph of artificial neurons. Each neuron in layer computes: where is a non-linear activation (ReLU, sigmoid, tanh). The full network is a composition of these layers.
Learning is optimisation. Define a loss function that measures how wrong the network's prediction is — e.g., MSE for regression, cross-entropy for classification. Training means finding weights that minimise the average loss over the training set:
This is done by gradient descent — repeatedly nudge in the direction of steepest descent of .
The training loop.
- Forward pass — feed input through layers to compute and loss .
- Backward pass (backpropagation) — compute for every weight.
- Update — ( = learning rate).
- Repeat over mini-batches for many epochs.
Backpropagation.
Goal. Compute and for every layer efficiently.
Idea. Apply the chain rule layer by layer, starting at the output and moving backward. Reuse intermediate results saved from the forward pass.
Equations (for a feed-forward net with loss at the output layer ):
-
Output error: .
-
Propagate: .
-
Gradients:
Why it's brilliant. Without backprop, computing gradients for an -layer network with weights would cost per training step. Backprop reduces it to by sharing computation through the chain rule — making deep networks feasible.
Why it's important.
- Backpropagation is the engine of modern deep learning. Every CNN, RNN, Transformer, GPT-style model trains by gradient descent + backprop.
- It allows networks with millions–billions of parameters to be trained.
- It is generic — any differentiable computation graph can be trained by backprop (automatic differentiation in PyTorch/TensorFlow makes this nearly free).
Practical concerns.
- Vanishing / exploding gradients with deep nets — addressed by ReLU, batch normalisation, residual connections, gradient clipping.
- Mini-batch / Stochastic Gradient Descent for scalability.
- Optimisers — Adam, RMSProp, AdaGrad — accelerate and stabilise the update step.
(b) Basic structure and working principle of a perceptron in a neural network. (10 marks)
History. The Perceptron is the simplest artificial neuron, introduced by Frank Rosenblatt (1958) — the building block of modern neural networks.
Structure.
A perceptron takes real inputs , each multiplied by a learned weight , summed, plus a bias , and passed through an activation function:
x1 ─w1─┐
x2 ─w2─┤ z = Σ wᵢxᵢ + b a = σ(z) → output
... ┼─→ (──────────────)──→ ( σ )──→
xn ─wn─┘ bias b
Components.
- Inputs — features of one data sample.
- Weights — learned during training; encode importance and sign of each feature.
- Bias — shifts the decision boundary, allowing it to not pass through origin.
- Weighted sum (pre-activation): .
- Activation function — originally a step function . Modern versions use sigmoid, tanh, or ReLU.
- Output .
Geometric interpretation. defines a hyperplane that splits feature space into two regions. The perceptron classifies a point by which side of the hyperplane it lies on.
Working principle — learning.
The perceptron learning rule updates weights based on errors. For each training example with :
- Compute prediction .
- Update:
- Repeat over the dataset until convergence (or max epochs).
If the data is linearly separable, the Perceptron Convergence Theorem (Rosenblatt, Novikoff) guarantees this rule will find a separating hyperplane in finitely many steps.
Limitation — XOR problem. A single perceptron can only learn linearly separable patterns. It famously fails on the XOR function — there is no straight line that separates from . This limitation, pointed out by Minsky and Papert (1969), caused the "AI winter" until multi-layer perceptrons (MLPs) with non-linear activations and backpropagation were developed to overcome it.
Importance today.
- The perceptron is the conceptual atom of every neural network — a deep network is just many perceptrons stacked and connected.
- Logistic regression is a perceptron with sigmoid activation and cross-entropy loss.
- Understanding the perceptron clarifies linear classification, decision boundaries, weight updates, and the geometry of learning.