PGDDSA Study · Semester 1

PGD24110104A

November 2024 — Solved

Data Analytics and Prediction

PGD01C04

2 hours 30 minutes

50 marks

Solved

End-Semester Examination, First Semester PG Diploma in Data Science and Analytics (2024 Admission).

Part A — Short Essay

Answer any five questions. Each question carries 2 marks. (5 × 2 = 10 Marks)

Q1. Mention two methods of handling missing data.

Deletion — drop rows (listwise) or columns with missing entries. Quick, but loses data and can introduce bias.
Imputation — fill missing values with the mean / median / mode of the column, or use a model (KNN imputation, MICE, regression imputation) when the missingness is informative.

Q2. Define regression in machine learning.

Regression is a supervised learning task that predicts a continuous numeric output $y \in \mathbb{R}$ from input features $X$ . The model learns a function $\hat y = f(X; \theta)$ by minimising a loss such as MSE: $L = \dfrac{1}{n}\sum (y_i - \hat y_i)^2$ . Examples: linear regression, polynomial regression, ridge/lasso, SVR, decision-tree regression.

Q3. What is Linear Discriminant Analysis (LDA)?

LDA is a supervised classification + dimensionality reduction technique that finds a linear combination of features that maximises between-class variance and minimises within-class variance. It projects data onto a lower-dimensional space where classes are best separated, assuming each class is Gaussian with a common covariance matrix.

Q4. What is a confusion matrix?

A confusion matrix is a $c \times c$ table comparing true labels against predicted labels for a $c$ -class classification problem. For binary classification:

	Predicted +	Predicted −
Actual +	TP	FN
Actual −	FP	TN

From it we compute accuracy, precision, recall, F1, specificity and the ROC curve.

Q5. Define bagging.

Bagging (Bootstrap AGGregatING, Breiman 1996) is an ensemble technique: train $B$ base learners on bootstrap samples drawn with replacement from the training set, then average their predictions (regression) or take majority vote (classification). It reduces variance and stabilises high-variance learners like decision trees. Random Forest is bagging applied to decorrelated decision trees.

Q6. What is a regression tree?

A regression tree is a decision tree whose leaves predict a continuous numeric value — usually the mean of the training targets in that leaf. The tree recursively splits the feature space to minimise variance (typically minimising RSS or MSE within each node), so similar targets end up in the same leaf.

Q7. Define classification accuracy.

Classification accuracy is the fraction of predictions that match the true labels:

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{\text{correct predictions}}{\text{total predictions}}.$

Easy to compute but misleading under class imbalance — a model that always predicts the majority class can score very high.

Part B — Long Essay

Answer any two questions. Each question carries 20 marks. (2 × 20 = 40 Marks)

Q8.

(a) Different data preprocessing techniques used in machine learning. (10 marks)

Raw data is rarely model-ready. Preprocessing typically consumes 60–80% of an ML project. The main techniques:

1. Data cleaning.

Missing values — deletion, mean/median/mode imputation, KNN-imputation, MICE, model-based imputation, or learning a missingness indicator.
Outliers — z-score / IQR / isolation-forest detection; cap (winsorise), remove, or transform.
Duplicates and noise — drop exact duplicates, fuzzy dedupe, smoothing for noisy sensors.

2. Data integration. Combine multiple sources, resolve schema conflicts, merge on common keys, handle conflicting records.

3. Data transformation.

Scaling
- Standardisation (z-score): $x' = (x - \mu)/\sigma$ — assumes/produces zero mean, unit variance. Required for k-NN, k-means, SVM, gradient methods.
- Min-max: $x' = (x - \min)/(\max - \min)$ — maps to $[0, 1]$ . Useful for neural nets.
- Robust scaling — uses median and IQR; resistant to outliers.
Skew correction — log, square-root, Box–Cox, Yeo–Johnson.
Encoding categorical variables — one-hot, ordinal, target/mean encoding, hashing.
Datetime features — extract day, week, month, hour, is_weekend, holiday flag.
Text features — tokenisation, TF-IDF, embeddings (Word2Vec, BERT).

4. Data reduction.

Feature selection — variance threshold, chi-square, mutual information, L1 (Lasso), Recursive Feature Elimination.
Dimensionality reduction — PCA, LDA, t-SNE, UMAP, autoencoders.
Sampling / aggregation when data is large.

5. Discretisation / binning. Convert continuous features into discrete intervals — uniform, quantile, or model-based binning. Useful for tree models, naive Bayes, fairness analysis.

6. Class-imbalance handling.

Oversampling (SMOTE, ADASYN).
Undersampling majority class.
Class weights / focal loss.

7. Data splitting. Train/test split, stratified, time-based, or group-based — done before fitting scalers / imputers to avoid leakage.

8. Pipelining. Wrap all preprocessing steps + model in a single sklearn / TFX / MLflow pipeline so the same transformations are applied at training and inference, eliminating leakage and drift.

(b) Evaluate a healthcare dataset for overfitting and suggest tuning approaches. (10 marks)

Scenario. Suppose we trained a Random Forest on a small (≈2,000 patients) healthcare dataset to predict 30-day readmission. Training F1 = 0.95, validation F1 = 0.71. The 24-point gap signals overfitting.

Diagnosis — confirming overfitting.

Learning curves — plot train and validation F1 as we vary training-set size. Persistent wide gap = high variance / overfitting.
Train vs CV scores — large gap between training and 5-fold CV mean.
Sensitivity check — perturb random seed; if scores swing widely, model is unstable.
Feature importance audit — if a single feature like patient_id or admission_date dominates, it's leakage.

Common causes in healthcare data.

Small sample size relative to many features (high $p/n$ ).
Highly correlated lab measurements / leakage from features available post-outcome.
Class imbalance — readmissions are rare.
Heterogeneous patient subgroups.

Tuning approaches to reduce overfitting.

1. Reduce model complexity.

Random Forest: lower max_depth (e.g., 6–10), raise min_samples_leaf (≥ 20), lower max_features.
Decision Tree: post-prune with ccp_alpha.
Neural net: fewer layers / units, add dropout.

2. Regularisation. L1/L2 in linear / logistic regression; weight decay in deep models. For trees, regularisation appears as min-samples / pruning.

3. More data / better data.

Combine with public datasets (MIMIC-III).
Data augmentation — SMOTE for the minority readmission class; sampling with replacement for under-represented subgroups.
Feature engineering that consolidates redundant labs into clinically meaningful indices.

4. Robust validation.

Stratified k-fold (k = 5 or 10) to preserve readmission rate.
Group k-fold by patient ID — never let one patient appear in both train and validation.
Nested CV for unbiased hyperparameter tuning.

5. Hyperparameter search. Use grid / random / Bayesian search over n_estimators, max_depth, min_samples_leaf, max_features, class_weight. Optimise CV-mean F1 (or PR-AUC for imbalanced) — not training accuracy.

6. Early stopping in gradient-boosted models (XGBoost / LightGBM): stop adding trees when validation logloss plateaus.

7. Ensembling and stacking — average multiple model families to absorb variance.

8. Audit and remove leakage features. Any feature constructed after the outcome (e.g., follow-up visit indicator) must be removed.

9. Calibration. Use Platt scaling or isotonic regression so predicted probabilities are honest — crucial in clinical decision support.

Expected outcome. After applying these, training F1 typically drops a few points while validation F1 rises — closing the gap to roughly within 5 points indicates a healthier model that generalises.

Q9.

(a) Design a predictive system using KNN and evaluate how the choice of K impacts performance. (10 marks)

K-Nearest Neighbors — overview. KNN is a simple lazy supervised algorithm. To predict for a new point $x$ :

Compute distance $d(x, x_i)$ to every training point.
Pick the $K$ closest.
Classification: majority vote among the $K$ labels (with ties broken by distance or random).
Regression: average (optionally weighted by inverse distance) the $K$ target values.

It's instance-based — no model is "trained" in the conventional sense; the training set itself is the model.

Design — predictive system pipeline.

Step 1 — Define the problem. Example: predict whether a customer churns (binary classification) using account features.

Step 2 — Data preprocessing.

Handle missing values (median / mode imputation).
Encode categoricals (one-hot).
Scale features — critical because KNN uses raw Euclidean distance. Standardise or min-max so no feature dominates.
Remove or down-weight irrelevant features (KNN is sensitive to noise).

Step 3 — Choose distance metric.

Euclidean for continuous features.
Manhattan for high-dim sparse.
Cosine for text vectors / embeddings.
Hamming for categorical.
Mahalanobis to account for feature correlations.

Step 4 — Choose $K$ (see below).

Step 5 — Speed up neighbour search for big data — use KD-tree, Ball-tree, or approximate methods (HNSW, Annoy, FAISS).

Step 6 — Evaluate with stratified k-fold CV using accuracy / F1 / ROC-AUC.

Step 7 — Deploy with a vector index for low-latency lookup.

Impact of $K$ on performance.

$K$	Behaviour	Bias	Variance
$K = 1$	Nearest neighbour only	Low	Very high — overfits noise, jagged boundary
Small $K$ (3–5)	Fine-grained boundary	Low	High
Medium $K$ (10–30)	Smooth boundary	Moderate	Moderate
Large $K$	Predictions ≈ majority class	High	Low — underfits, near-flat boundary
$K = N$	Always predict global mean / mode	Maximal bias	Zero variance

Behaviour summary.

Small $K$ ⇒ low bias, high variance, sensitive to outliers — fits idiosyncrasies of training data ("overfits").
Large $K$ ⇒ high bias, low variance — smooths the decision boundary, ignores local structure ("underfits").

Choosing $K$ .

Plot CV accuracy vs $K$ and select the "elbow" — the smallest $K$ before performance stops improving meaningfully.
Prefer odd $K$ in binary classification to avoid ties.
A rule of thumb: $K \approx \sqrt{N}$ as a starting point.
Use distance-weighted voting to make KNN less sensitive to $K$ .

Worked illustration. On Iris (150 samples):

$K = 1$ → 96% CV accuracy but bouncy decision boundary.
$K = 5$ → 97% CV accuracy, smoother boundary.
$K = 50$ → drops to 88% because boundary is over-smoothed.

Practical caveats.

KNN suffers from the curse of dimensionality — distances lose meaning above ~20 features. Apply PCA / feature selection first.
Prediction is slow ( $O(N)$ per query) without an index.
Sensitive to feature scaling and outliers — always scale.

(b) Compare Support Vector Regression (SVR) and Decision Tree Regression. (10 marks)

Both are non-linear regression algorithms but with very different mechanisms.

Support Vector Regression (SVR).

Builds on SVM. Fits a function $\hat y = w^T x + b$ that lies within an $\varepsilon$ -tube around the data; errors inside the tube are ignored.
Optimisation: $\min \frac{1}{2}\|w\|^2 + C \sum (\xi_i + \xi_i^*), \quad \text{subject to } |y_i - (w^T x_i + b)| \le \varepsilon + \xi_i.$
Kernel trick ( $K(x, x')$ ) — linear, polynomial, RBF — lets SVR fit non-linear functions.
Solution depends only on support vectors — points outside the tube.

Decision Tree Regression (DTR).

Recursively splits feature space on the feature/threshold pair that most reduces target variance (MSE/MAE).
Leaves predict the mean target of the training points falling there.
The result is a piecewise-constant function.
Can be regularised by limiting depth, min-samples-per-leaf, or via pruning.

Comparison.

Aspect	SVR	Decision Tree
Function class	Smooth (depending on kernel)	Piecewise constant (step function)
Handles non-linearity	Via kernels	Naturally via recursive splits
Sensitivity to feature scale	High — must scale	None — invariant to monotonic transforms
Hyperparameters	$C$ , $\varepsilon$ , kernel, $\gamma$	max_depth, min_samples_leaf, criterion
Interpretability	Low (especially with non-linear kernel)	High (can be visualised)
Handling missing values	Needs imputation	Many implementations handle natively
Categorical features	Need encoding	Native support (some libraries)
Outlier sensitivity	Moderate — $\varepsilon$ -tube helps	Sensitive — can split aggressively on outliers
Training cost	$O(N^2)$ to $O(N^3)$ — slow on big data	$O(N \log N)$ per feature
Prediction speed	Fast	Very fast
Extrapolation	Smooth extrapolation possible	Cannot extrapolate beyond training-range targets
Memory	Stores support vectors	Stores tree structure
Best with	Small-to-medium, scaled, smooth data	Tabular data, heterogeneous features
Variance / bias	High bias, low variance (regularised)	Low bias, high variance (often overfits)

Practical advice. For modern tabular regression problems, gradient-boosted decision trees (XGBoost, LightGBM) combine the strengths of DTR with bagging/boosting and almost always outperform raw SVR or single trees. SVR is still preferred when data is small, smooth, and well-scaled — and when you want a mathematically clean model with a clear notion of "support vectors."

Q10.

(a) Apply LDA and PLS-DA on a real-world biological dataset and compare their classification performance. (10 marks)

Setting — biological example. A common benchmark: classify cancer subtypes (e.g., ALL vs AML leukaemia) from gene-expression microarrays — 7129 features, ~72 samples. The dataset has $p \gg n$ (far more features than samples), strong feature correlations, and small sample size.

LDA (Linear Discriminant Analysis).

Idea. Project data onto the linear axis that maximises the Fisher ratio $J(w) = \frac{w^T S_B w}{w^T S_W w},$ where $S_B$ is between-class scatter and $S_W$ is within-class scatter.

Assumptions. Each class is Gaussian with a common covariance matrix; classes differ in means.

Procedure.

Standardise features.
Compute $S_W$ and $S_B$ .
Solve $S_W^{-1} S_B w = \lambda w$ — the leading eigenvectors are the discriminant directions.
Project data, then classify (often using Gaussian likelihood or a downstream linear classifier).

Problem with high-dim biological data. $S_W$ is singular when $p > n$ , so LDA breaks down. Standard fixes: regularise ( $S_W + \alpha I$ ), use shrinkage LDA, or first reduce dimensionality via PCA.

PLS-DA (Partial Least Squares Discriminant Analysis).

Idea. Find latent components that maximise covariance between predictors $X$ and the (encoded) class label $Y$ . Unlike PCA — which maximises variance of $X$ alone — PLS uses $Y$ , so it is supervised and tailored for class separation.

Procedure.

Standardise features.
Encode $Y$ as dummy class indicators.
Iteratively extract latent components $T_k$ that maximise $\text{cov}(t, u)$ where $t = X w$ and $u = Y c$ .
Use the latent components as predictors in a linear classifier.

Strengths in $p \gg n$ biology.

Works gracefully when $p > n$ .
Handles multicollinear features (genes coexpressed).
Variable Importance in Projection (VIP) scores rank features — biologically interpretable.

Application & comparison on leukaemia dataset.

Aspect	LDA	PLS-DA
Suitability when $p > n$	Poor (needs regularisation or PCA)	Excellent
Multicollinearity	Sensitive	Robust (compresses correlated features)
Supervised?	Yes	Yes
Latent components	Maximise class separation	Maximise covariance with $Y$
Interpretability	Discriminant weights	VIP scores
Typical leukaemia accuracy	~85–90% (with regularisation)	~95% (clean separation)
Risk of overfitting	High without regularisation	Cross-validate # of components
Software	sklearn (with shrinkage), MASS	mixOmics (R), sklearn cross_decomposition

Validation. Use stratified k-fold CV (or LOOCV given small $n$ ). Report accuracy, balanced accuracy, F1, ROC-AUC and number of latent components. PLS-DA typically wins on raw gene-expression matrices; LDA wins when applied after PCA/feature selection that resolves the $p > n$ issue.

(b) Key assumptions of discriminant analysis. (10 marks)

Discriminant Analysis (specifically LDA and QDA) is a generative classifier grounded in modelling $P(X \mid Y)$ as Gaussian. Its statistical foundation rests on the following assumptions:

1. Gaussian class conditionals. For each class $k$ , the feature vector $X$ given $Y = k$ follows a multivariate normal: $X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma_k).$ Violation → discriminant analysis is biased. Skewed or multi-modal features should be transformed (log, Box–Cox) or replaced by a more flexible classifier.

2. Equality of covariance matrices (LDA only). $\Sigma_1 = \Sigma_2 = \dots = \Sigma_k = \Sigma.$ Under this assumption the decision boundary is linear. If covariances differ, use Quadratic Discriminant Analysis (QDA) which estimates a separate $\Sigma_k$ per class and produces quadratic boundaries.

3. Independent observations. Each sample is drawn independently from its class distribution. Time-series correlation, repeated measurements on the same subject, or hierarchical structure breaks this assumption.

4. Sufficient sample size per class. Estimating $\Sigma$ requires roughly $> p$ samples per class. With $p > n$ (high-dim genomics, text, image features), $\Sigma$ is singular. Fixes: regularised / shrinkage LDA, PCA before LDA, or PLS-DA.

5. No severe multicollinearity. Highly correlated predictors make $\Sigma$ near-singular and inflate the variance of estimated discriminant weights. Address by feature selection or shrinkage.

6. Linear separability (for LDA). Boundary is linear ⇒ LDA struggles with non-linearly separable data. Use QDA, kernel discriminants, or non-linear classifiers (SVM-RBF, GBT) when classes curve into each other.

7. Approximately balanced or known priors. LDA uses class priors $\pi_k$ in the Bayes formulation. If they are misestimated (e.g., training set ratio doesn't match production), recalibrate priors at prediction time.

8. Continuous, scaled features. LDA assumes continuous predictors. Mix-in of binary/categorical features should be one-hot encoded carefully; non-binary categoricals violate Gaussianity.

Diagnostics.

Mardia's test / Henze–Zirkler for multivariate normality.
Box's M test for equality of covariances (sensitive to non-normality, so use cautiously).
Residual plots of class means.
CV gap between train and test as overfitting check.

When assumptions are violated.

Modest violations: LDA is famously robust — often still performs competitively.
Severe violations: try QDA, RDA (regularised), Naive Bayes, logistic regression, kernel methods, or tree ensembles.

Q11.

(a) Design a system that predicts credit-card fraud and analyse it using all performance metrics in classification. (10 marks)

Problem. Real-time binary classification: legitimate (0) vs fraudulent (1) transaction. Severe class imbalance (typically 0.1–0.2% fraud). FN is very costly (money lost); FP is annoying (customer friction).

System design.

1. Data ingestion. Stream transactions through Kafka / Kinesis to a feature store, with batch + real-time pipelines.

2. Feature engineering.

Transaction-level: amount, merchant category, country, currency, channel.
Cardholder-level: age of account, average historical spend, std-dev.
Temporal: time-since-last-transaction, transactions-in-last-hour, hour-of-day, weekday.
Velocity: distance / time between consecutive transactions ("impossible-travel").
Graph features: shared device, shared IP with known fraud rings.

3. Preprocessing. One-hot encode categoricals, scale numeric, log-transform skewed (amount), handle missing.

4. Imbalance handling.

Class weights (inverse frequency).
SMOTE / ADASYN oversampling.
Anomaly detection as an auxiliary signal.

5. Model selection.

Baseline: Logistic Regression with L1.
Strong: Random Forest, XGBoost / LightGBM, neural net with autoencoder features.
Production stacks often combine a fast rules engine (for known patterns) and an ML model.

6. Training and tuning.

Stratified k-fold CV.
Optimise PR-AUC (not ROC-AUC) due to imbalance.
Hyperparameter search; calibrate probabilities with isotonic regression.

7. Deployment. Low-latency scoring (<50 ms), fallback rules, A/B-tested threshold, decision routing (auto-approve, step-up authentication, block).

8. Monitoring. Watch data drift, fraud rate, precision@K daily; retrain on schedule (rapid fraud-pattern drift).

Performance metrics — exhaustive analysis.

Let TP, TN, FP, FN come from the confusion matrix at chosen threshold.

Metric	Formula	In fraud context
Accuracy	$(TP+TN)/N$	Misleading — 99.8% baseline by predicting "not fraud"
Precision	$TP/(TP+FP)$	Of flagged transactions, how many are real fraud — drives customer friction
Recall (Sensitivity, TPR)	$TP/(TP+FN)$	Of real fraud, how many caught — drives money saved
Specificity (TNR)	$TN/(TN+FP)$	Of legitimate transactions, how many correctly let through
FPR	$FP/(FP+TN)$	Fraction of good customers wrongly blocked
FNR	$FN/(FN+TP)$	Fraud missed
F1	$2 P R / (P + R)$	Balanced precision–recall
F $\beta$ (e.g., $\beta = 2$ )	weights recall higher	Use $F_2$ to prioritise catching fraud
ROC-AUC	TPR vs FPR area	Threshold-free, but optimistic under imbalance
PR-AUC	Precision vs Recall area	More informative under heavy imbalance — preferred
MCC	Matthews Correlation Coef.	Single balanced number on imbalanced data
Log loss	Cross-entropy	Penalises confidently wrong predictions
Brier score	MSE on probabilities	Calibration quality
Cost-weighted loss	$\$ saved	Final business metric — combines cost per FN (lost money) and FP (friction)

Threshold tuning. Sweep the threshold $\tau$ along the PR curve to choose where business cost is minimised. Fraud teams often pick threshold to maintain a target precision (e.g., 80%) while maximising recall.

Diagnostic plots. Confusion matrix at chosen threshold, PR curve, ROC curve, calibration curve, lift / gain charts (precision in top-K%).

(b) Differentiate between class predictions and class probabilities. (10 marks)

Class predictions. A classifier outputs a hard label $\hat y \in \{0, 1, \dots, K-1\}$ — the predicted class for the input. For binary classification, this is typically obtained by:

$\hat y = \begin{cases} 1 & \text{if } P(y = 1 \mid x) \ge \tau \\ 0 & \text{otherwise} \end{cases}$

where $\tau$ is a decision threshold (default 0.5). Class predictions answer "which class?"

Class probabilities. The classifier outputs a probability vector $\hat p = (p_0, p_1, \dots, p_{K-1})$ with $\sum p_k = 1$ , giving the model's confidence that the input belongs to each class. Class probabilities answer "how confident, and in which class?"

Comparison.

Aspect	Class prediction	Class probability
Output type	Discrete label	Real number in [0, 1]
Information	Single answer	Full distribution
Threshold needed	Yes (e.g., 0.5)	No
Confidence shown	No	Yes
Loss function	0–1 loss, hinge	Cross-entropy, log loss, Brier
Useful metrics	Accuracy, F1, precision/recall	ROC-AUC, PR-AUC, log loss, calibration
Allows threshold tuning	–	Yes — set $\tau$ per business needs
Used in cost-sensitive decisions	Limited	Essential
Stackable / mixable with other models	Hard	Easy (averaging, blending)
Algorithms that natively output it	Most	Logistic, NB, NN; trees/SVM need calibration

Why probabilities are valuable.

Threshold flexibility. Move $\tau$ to trade precision and recall — e.g., raise $\tau$ in fraud to reduce false positives, lower $\tau$ in cancer screening to catch every case.
Cost-sensitive decisions. Combine probability with cost matrix to choose the action that minimises expected cost (block, manual review, auto-approve).
Ranking. ROC-AUC and PR-AUC measure how well the model ranks positives over negatives — independent of any single threshold.
Confidence calibration. Reliable probabilities let downstream systems weight predictions; e.g., "act when $P > 0.9$ , escalate when $0.5 < P < 0.9$ ".
Probabilistic ensembling. Averaging class probabilities of multiple models almost always beats hard-vote ensembling.
Uncertainty quantification. Low max-probability suggests the model is uncertain — useful for active learning and human-in-the-loop systems.

Calibration. Some classifiers (SVMs, raw trees) output uncalibrated scores. Apply Platt scaling or isotonic regression on a held-out set to convert scores into well-calibrated probabilities — essential for using $P(y \mid x)$ as a true probability.

When hard labels suffice. When a clear binary decision must be taken and probabilities won't influence downstream action (e.g., automated email folder placement, simple A/B split). Even here, retaining probabilities for monitoring and audit is recommended.