PGD24110104A

November 2024 — Solved

Data Analytics and Prediction
PGD01C04
2 hours 30 minutes
50 marks
Solved

End-Semester Examination, First Semester PG Diploma in Data Science and Analytics (2024 Admission).

Part A — Short Essay

Answer any five questions. Each question carries 2 marks. (5 × 2 = 10 Marks)


Q1. Mention two methods of handling missing data.

  1. Deletion — drop rows (listwise) or columns with missing entries. Quick, but loses data and can introduce bias.
  2. Imputation — fill missing values with the mean / median / mode of the column, or use a model (KNN imputation, MICE, regression imputation) when the missingness is informative.

Q2. Define regression in machine learning.

Regression is a supervised learning task that predicts a continuous numeric output yRy \in \mathbb{R} from input features XX. The model learns a function y^=f(X;θ)\hat y = f(X; \theta) by minimising a loss such as MSE: L=1n(yiy^i)2L = \dfrac{1}{n}\sum (y_i - \hat y_i)^2. Examples: linear regression, polynomial regression, ridge/lasso, SVR, decision-tree regression.


Q3. What is Linear Discriminant Analysis (LDA)?

LDA is a supervised classification + dimensionality reduction technique that finds a linear combination of features that maximises between-class variance and minimises within-class variance. It projects data onto a lower-dimensional space where classes are best separated, assuming each class is Gaussian with a common covariance matrix.


Q4. What is a confusion matrix?

A confusion matrix is a c×cc \times c table comparing true labels against predicted labels for a cc-class classification problem. For binary classification:

Predicted +Predicted −
Actual +TPFN
Actual −FPTN

From it we compute accuracy, precision, recall, F1, specificity and the ROC curve.


Q5. Define bagging.

Bagging (Bootstrap AGGregatING, Breiman 1996) is an ensemble technique: train BB base learners on bootstrap samples drawn with replacement from the training set, then average their predictions (regression) or take majority vote (classification). It reduces variance and stabilises high-variance learners like decision trees. Random Forest is bagging applied to decorrelated decision trees.


Q6. What is a regression tree?

A regression tree is a decision tree whose leaves predict a continuous numeric value — usually the mean of the training targets in that leaf. The tree recursively splits the feature space to minimise variance (typically minimising RSS or MSE within each node), so similar targets end up in the same leaf.


Q7. Define classification accuracy.

Classification accuracy is the fraction of predictions that match the true labels:

Accuracy=TP+TNTP+TN+FP+FN=correct predictionstotal predictions.\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{\text{correct predictions}}{\text{total predictions}}.

Easy to compute but misleading under class imbalance — a model that always predicts the majority class can score very high.


Part B — Long Essay

Answer any two questions. Each question carries 20 marks. (2 × 20 = 40 Marks)


Q8.

(a) Different data preprocessing techniques used in machine learning. (10 marks)

Raw data is rarely model-ready. Preprocessing typically consumes 60–80% of an ML project. The main techniques:

1. Data cleaning.

  • Missing values — deletion, mean/median/mode imputation, KNN-imputation, MICE, model-based imputation, or learning a missingness indicator.
  • Outliers — z-score / IQR / isolation-forest detection; cap (winsorise), remove, or transform.
  • Duplicates and noise — drop exact duplicates, fuzzy dedupe, smoothing for noisy sensors.

2. Data integration. Combine multiple sources, resolve schema conflicts, merge on common keys, handle conflicting records.

3. Data transformation.

  • Scaling
    • Standardisation (z-score): x=(xμ)/σx' = (x - \mu)/\sigma — assumes/produces zero mean, unit variance. Required for k-NN, k-means, SVM, gradient methods.
    • Min-max: x=(xmin)/(maxmin)x' = (x - \min)/(\max - \min) — maps to [0,1][0, 1]. Useful for neural nets.
    • Robust scaling — uses median and IQR; resistant to outliers.
  • Skew correction — log, square-root, Box–Cox, Yeo–Johnson.
  • Encoding categorical variables — one-hot, ordinal, target/mean encoding, hashing.
  • Datetime features — extract day, week, month, hour, is_weekend, holiday flag.
  • Text features — tokenisation, TF-IDF, embeddings (Word2Vec, BERT).

4. Data reduction.

  • Feature selection — variance threshold, chi-square, mutual information, L1 (Lasso), Recursive Feature Elimination.
  • Dimensionality reduction — PCA, LDA, t-SNE, UMAP, autoencoders.
  • Sampling / aggregation when data is large.

5. Discretisation / binning. Convert continuous features into discrete intervals — uniform, quantile, or model-based binning. Useful for tree models, naive Bayes, fairness analysis.

6. Class-imbalance handling.

  • Oversampling (SMOTE, ADASYN).
  • Undersampling majority class.
  • Class weights / focal loss.

7. Data splitting. Train/test split, stratified, time-based, or group-based — done before fitting scalers / imputers to avoid leakage.

8. Pipelining. Wrap all preprocessing steps + model in a single sklearn / TFX / MLflow pipeline so the same transformations are applied at training and inference, eliminating leakage and drift.

(b) Evaluate a healthcare dataset for overfitting and suggest tuning approaches. (10 marks)

Scenario. Suppose we trained a Random Forest on a small (≈2,000 patients) healthcare dataset to predict 30-day readmission. Training F1 = 0.95, validation F1 = 0.71. The 24-point gap signals overfitting.

Diagnosis — confirming overfitting.

  1. Learning curves — plot train and validation F1 as we vary training-set size. Persistent wide gap = high variance / overfitting.
  2. Train vs CV scores — large gap between training and 5-fold CV mean.
  3. Sensitivity check — perturb random seed; if scores swing widely, model is unstable.
  4. Feature importance audit — if a single feature like patient_id or admission_date dominates, it's leakage.

Common causes in healthcare data.

  • Small sample size relative to many features (high p/np/n).
  • Highly correlated lab measurements / leakage from features available post-outcome.
  • Class imbalance — readmissions are rare.
  • Heterogeneous patient subgroups.

Tuning approaches to reduce overfitting.

1. Reduce model complexity.

  • Random Forest: lower max_depth (e.g., 6–10), raise min_samples_leaf (≥ 20), lower max_features.
  • Decision Tree: post-prune with ccp_alpha.
  • Neural net: fewer layers / units, add dropout.

2. Regularisation. L1/L2 in linear / logistic regression; weight decay in deep models. For trees, regularisation appears as min-samples / pruning.

3. More data / better data.

  • Combine with public datasets (MIMIC-III).
  • Data augmentation — SMOTE for the minority readmission class; sampling with replacement for under-represented subgroups.
  • Feature engineering that consolidates redundant labs into clinically meaningful indices.

4. Robust validation.

  • Stratified k-fold (k = 5 or 10) to preserve readmission rate.
  • Group k-fold by patient ID — never let one patient appear in both train and validation.
  • Nested CV for unbiased hyperparameter tuning.

5. Hyperparameter search. Use grid / random / Bayesian search over n_estimators, max_depth, min_samples_leaf, max_features, class_weight. Optimise CV-mean F1 (or PR-AUC for imbalanced) — not training accuracy.

6. Early stopping in gradient-boosted models (XGBoost / LightGBM): stop adding trees when validation logloss plateaus.

7. Ensembling and stacking — average multiple model families to absorb variance.

8. Audit and remove leakage features. Any feature constructed after the outcome (e.g., follow-up visit indicator) must be removed.

9. Calibration. Use Platt scaling or isotonic regression so predicted probabilities are honest — crucial in clinical decision support.

Expected outcome. After applying these, training F1 typically drops a few points while validation F1 rises — closing the gap to roughly within 5 points indicates a healthier model that generalises.


Q9.

(a) Design a predictive system using KNN and evaluate how the choice of K impacts performance. (10 marks)

K-Nearest Neighbors — overview. KNN is a simple lazy supervised algorithm. To predict for a new point xx:

  1. Compute distance d(x,xi)d(x, x_i) to every training point.
  2. Pick the KK closest.
  3. Classification: majority vote among the KK labels (with ties broken by distance or random).
  4. Regression: average (optionally weighted by inverse distance) the KK target values.

It's instance-based — no model is "trained" in the conventional sense; the training set itself is the model.

Design — predictive system pipeline.

Step 1 — Define the problem. Example: predict whether a customer churns (binary classification) using account features.

Step 2 — Data preprocessing.

  • Handle missing values (median / mode imputation).
  • Encode categoricals (one-hot).
  • Scale featurescritical because KNN uses raw Euclidean distance. Standardise or min-max so no feature dominates.
  • Remove or down-weight irrelevant features (KNN is sensitive to noise).

Step 3 — Choose distance metric.

  • Euclidean for continuous features.
  • Manhattan for high-dim sparse.
  • Cosine for text vectors / embeddings.
  • Hamming for categorical.
  • Mahalanobis to account for feature correlations.

Step 4 — Choose KK (see below).

Step 5 — Speed up neighbour search for big data — use KD-tree, Ball-tree, or approximate methods (HNSW, Annoy, FAISS).

Step 6 — Evaluate with stratified k-fold CV using accuracy / F1 / ROC-AUC.

Step 7 — Deploy with a vector index for low-latency lookup.

Impact of KK on performance.

KKBehaviourBiasVariance
K=1K = 1Nearest neighbour onlyLowVery high — overfits noise, jagged boundary
Small KK (3–5)Fine-grained boundaryLowHigh
Medium KK (10–30)Smooth boundaryModerateModerate
Large KKPredictions ≈ majority classHighLow — underfits, near-flat boundary
K=NK = NAlways predict global mean / modeMaximal biasZero variance

Behaviour summary.

  • Small KK ⇒ low bias, high variance, sensitive to outliers — fits idiosyncrasies of training data ("overfits").
  • Large KK ⇒ high bias, low variance — smooths the decision boundary, ignores local structure ("underfits").

Choosing KK.

  • Plot CV accuracy vs KK and select the "elbow" — the smallest KK before performance stops improving meaningfully.
  • Prefer odd KK in binary classification to avoid ties.
  • A rule of thumb: KNK \approx \sqrt{N} as a starting point.
  • Use distance-weighted voting to make KNN less sensitive to KK.

Worked illustration. On Iris (150 samples):

  • K=1K = 1 → 96% CV accuracy but bouncy decision boundary.
  • K=5K = 5 → 97% CV accuracy, smoother boundary.
  • K=50K = 50 → drops to 88% because boundary is over-smoothed.

Practical caveats.

  • KNN suffers from the curse of dimensionality — distances lose meaning above ~20 features. Apply PCA / feature selection first.
  • Prediction is slow (O(N)O(N) per query) without an index.
  • Sensitive to feature scaling and outliers — always scale.

(b) Compare Support Vector Regression (SVR) and Decision Tree Regression. (10 marks)

Both are non-linear regression algorithms but with very different mechanisms.

Support Vector Regression (SVR).

  • Builds on SVM. Fits a function y^=wTx+b\hat y = w^T x + b that lies within an ε\varepsilon-tube around the data; errors inside the tube are ignored.
  • Optimisation: min12w2+C(ξi+ξi),subject to yi(wTxi+b)ε+ξi.\min \frac{1}{2}\|w\|^2 + C \sum (\xi_i + \xi_i^*), \quad \text{subject to } |y_i - (w^T x_i + b)| \le \varepsilon + \xi_i.
  • Kernel trick (K(x,x)K(x, x')) — linear, polynomial, RBF — lets SVR fit non-linear functions.
  • Solution depends only on support vectors — points outside the tube.

Decision Tree Regression (DTR).

  • Recursively splits feature space on the feature/threshold pair that most reduces target variance (MSE/MAE).
  • Leaves predict the mean target of the training points falling there.
  • The result is a piecewise-constant function.
  • Can be regularised by limiting depth, min-samples-per-leaf, or via pruning.

Comparison.

AspectSVRDecision Tree
Function classSmooth (depending on kernel)Piecewise constant (step function)
Handles non-linearityVia kernelsNaturally via recursive splits
Sensitivity to feature scaleHigh — must scaleNone — invariant to monotonic transforms
HyperparametersCC, ε\varepsilon, kernel, γ\gammamax_depth, min_samples_leaf, criterion
InterpretabilityLow (especially with non-linear kernel)High (can be visualised)
Handling missing valuesNeeds imputationMany implementations handle natively
Categorical featuresNeed encodingNative support (some libraries)
Outlier sensitivityModerate — ε\varepsilon-tube helpsSensitive — can split aggressively on outliers
Training costO(N2)O(N^2) to O(N3)O(N^3) — slow on big dataO(NlogN)O(N \log N) per feature
Prediction speedFastVery fast
ExtrapolationSmooth extrapolation possibleCannot extrapolate beyond training-range targets
MemoryStores support vectorsStores tree structure
Best withSmall-to-medium, scaled, smooth dataTabular data, heterogeneous features
Variance / biasHigh bias, low variance (regularised)Low bias, high variance (often overfits)

Practical advice. For modern tabular regression problems, gradient-boosted decision trees (XGBoost, LightGBM) combine the strengths of DTR with bagging/boosting and almost always outperform raw SVR or single trees. SVR is still preferred when data is small, smooth, and well-scaled — and when you want a mathematically clean model with a clear notion of "support vectors."


Q10.

(a) Apply LDA and PLS-DA on a real-world biological dataset and compare their classification performance. (10 marks)

Setting — biological example. A common benchmark: classify cancer subtypes (e.g., ALL vs AML leukaemia) from gene-expression microarrays — 7129 features, ~72 samples. The dataset has pnp \gg n (far more features than samples), strong feature correlations, and small sample size.

LDA (Linear Discriminant Analysis).

Idea. Project data onto the linear axis that maximises the Fisher ratio J(w)=wTSBwwTSWw,J(w) = \frac{w^T S_B w}{w^T S_W w}, where SBS_B is between-class scatter and SWS_W is within-class scatter.

Assumptions. Each class is Gaussian with a common covariance matrix; classes differ in means.

Procedure.

  1. Standardise features.
  2. Compute SWS_W and SBS_B.
  3. Solve SW1SBw=λwS_W^{-1} S_B w = \lambda w — the leading eigenvectors are the discriminant directions.
  4. Project data, then classify (often using Gaussian likelihood or a downstream linear classifier).

Problem with high-dim biological data. SWS_W is singular when p>np > n, so LDA breaks down. Standard fixes: regularise (SW+αIS_W + \alpha I), use shrinkage LDA, or first reduce dimensionality via PCA.

PLS-DA (Partial Least Squares Discriminant Analysis).

Idea. Find latent components that maximise covariance between predictors XX and the (encoded) class label YY. Unlike PCA — which maximises variance of XX alone — PLS uses YY, so it is supervised and tailored for class separation.

Procedure.

  1. Standardise features.
  2. Encode YY as dummy class indicators.
  3. Iteratively extract latent components TkT_k that maximise cov(t,u)\text{cov}(t, u) where t=Xwt = X w and u=Ycu = Y c.
  4. Use the latent components as predictors in a linear classifier.

Strengths in pnp \gg n biology.

  • Works gracefully when p>np > n.
  • Handles multicollinear features (genes coexpressed).
  • Variable Importance in Projection (VIP) scores rank features — biologically interpretable.

Application & comparison on leukaemia dataset.

AspectLDAPLS-DA
Suitability when p>np > nPoor (needs regularisation or PCA)Excellent
MulticollinearitySensitiveRobust (compresses correlated features)
Supervised?YesYes
Latent componentsMaximise class separationMaximise covariance with YY
InterpretabilityDiscriminant weightsVIP scores
Typical leukaemia accuracy~85–90% (with regularisation)~95% (clean separation)
Risk of overfittingHigh without regularisationCross-validate # of components
Softwaresklearn (with shrinkage), MASSmixOmics (R), sklearn cross_decomposition

Validation. Use stratified k-fold CV (or LOOCV given small nn). Report accuracy, balanced accuracy, F1, ROC-AUC and number of latent components. PLS-DA typically wins on raw gene-expression matrices; LDA wins when applied after PCA/feature selection that resolves the p>np > n issue.

(b) Key assumptions of discriminant analysis. (10 marks)

Discriminant Analysis (specifically LDA and QDA) is a generative classifier grounded in modelling P(XY)P(X \mid Y) as Gaussian. Its statistical foundation rests on the following assumptions:

1. Gaussian class conditionals. For each class kk, the feature vector XX given Y=kY = k follows a multivariate normal: XY=kN(μk,Σk).X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma_k). Violation → discriminant analysis is biased. Skewed or multi-modal features should be transformed (log, Box–Cox) or replaced by a more flexible classifier.

2. Equality of covariance matrices (LDA only). Σ1=Σ2==Σk=Σ.\Sigma_1 = \Sigma_2 = \dots = \Sigma_k = \Sigma. Under this assumption the decision boundary is linear. If covariances differ, use Quadratic Discriminant Analysis (QDA) which estimates a separate Σk\Sigma_k per class and produces quadratic boundaries.

3. Independent observations. Each sample is drawn independently from its class distribution. Time-series correlation, repeated measurements on the same subject, or hierarchical structure breaks this assumption.

4. Sufficient sample size per class. Estimating Σ\Sigma requires roughly >p> p samples per class. With p>np > n (high-dim genomics, text, image features), Σ\Sigma is singular. Fixes: regularised / shrinkage LDA, PCA before LDA, or PLS-DA.

5. No severe multicollinearity. Highly correlated predictors make Σ\Sigma near-singular and inflate the variance of estimated discriminant weights. Address by feature selection or shrinkage.

6. Linear separability (for LDA). Boundary is linear ⇒ LDA struggles with non-linearly separable data. Use QDA, kernel discriminants, or non-linear classifiers (SVM-RBF, GBT) when classes curve into each other.

7. Approximately balanced or known priors. LDA uses class priors πk\pi_k in the Bayes formulation. If they are misestimated (e.g., training set ratio doesn't match production), recalibrate priors at prediction time.

8. Continuous, scaled features. LDA assumes continuous predictors. Mix-in of binary/categorical features should be one-hot encoded carefully; non-binary categoricals violate Gaussianity.

Diagnostics.

  • Mardia's test / Henze–Zirkler for multivariate normality.
  • Box's M test for equality of covariances (sensitive to non-normality, so use cautiously).
  • Residual plots of class means.
  • CV gap between train and test as overfitting check.

When assumptions are violated.

  • Modest violations: LDA is famously robust — often still performs competitively.
  • Severe violations: try QDA, RDA (regularised), Naive Bayes, logistic regression, kernel methods, or tree ensembles.

Q11.

(a) Design a system that predicts credit-card fraud and analyse it using all performance metrics in classification. (10 marks)

Problem. Real-time binary classification: legitimate (0) vs fraudulent (1) transaction. Severe class imbalance (typically 0.1–0.2% fraud). FN is very costly (money lost); FP is annoying (customer friction).

System design.

1. Data ingestion. Stream transactions through Kafka / Kinesis to a feature store, with batch + real-time pipelines.

2. Feature engineering.

  • Transaction-level: amount, merchant category, country, currency, channel.
  • Cardholder-level: age of account, average historical spend, std-dev.
  • Temporal: time-since-last-transaction, transactions-in-last-hour, hour-of-day, weekday.
  • Velocity: distance / time between consecutive transactions ("impossible-travel").
  • Graph features: shared device, shared IP with known fraud rings.

3. Preprocessing. One-hot encode categoricals, scale numeric, log-transform skewed (amount), handle missing.

4. Imbalance handling.

  • Class weights (inverse frequency).
  • SMOTE / ADASYN oversampling.
  • Anomaly detection as an auxiliary signal.

5. Model selection.

  • Baseline: Logistic Regression with L1.
  • Strong: Random Forest, XGBoost / LightGBM, neural net with autoencoder features.
  • Production stacks often combine a fast rules engine (for known patterns) and an ML model.

6. Training and tuning.

  • Stratified k-fold CV.
  • Optimise PR-AUC (not ROC-AUC) due to imbalance.
  • Hyperparameter search; calibrate probabilities with isotonic regression.

7. Deployment. Low-latency scoring (<50 ms), fallback rules, A/B-tested threshold, decision routing (auto-approve, step-up authentication, block).

8. Monitoring. Watch data drift, fraud rate, precision@K daily; retrain on schedule (rapid fraud-pattern drift).

Performance metrics — exhaustive analysis.

Let TP, TN, FP, FN come from the confusion matrix at chosen threshold.

MetricFormulaIn fraud context
Accuracy(TP+TN)/N(TP+TN)/NMisleading — 99.8% baseline by predicting "not fraud"
PrecisionTP/(TP+FP)TP/(TP+FP)Of flagged transactions, how many are real fraud — drives customer friction
Recall (Sensitivity, TPR)TP/(TP+FN)TP/(TP+FN)Of real fraud, how many caught — drives money saved
Specificity (TNR)TN/(TN+FP)TN/(TN+FP)Of legitimate transactions, how many correctly let through
FPRFP/(FP+TN)FP/(FP+TN)Fraction of good customers wrongly blocked
FNRFN/(FN+TP)FN/(FN+TP)Fraud missed
F12PR/(P+R)2 P R / (P + R)Balanced precision–recall
Fβ\beta (e.g., β=2\beta = 2)weights recall higherUse F2F_2 to prioritise catching fraud
ROC-AUCTPR vs FPR areaThreshold-free, but optimistic under imbalance
PR-AUCPrecision vs Recall areaMore informative under heavy imbalance — preferred
MCCMatthews Correlation Coef.Single balanced number on imbalanced data
Log lossCross-entropyPenalises confidently wrong predictions
Brier scoreMSE on probabilitiesCalibration quality
Cost-weighted loss\ savedFinal business metric — combines cost per FN (lost money) and FP (friction)

Threshold tuning. Sweep the threshold τ\tau along the PR curve to choose where business cost is minimised. Fraud teams often pick threshold to maintain a target precision (e.g., 80%) while maximising recall.

Diagnostic plots. Confusion matrix at chosen threshold, PR curve, ROC curve, calibration curve, lift / gain charts (precision in top-K%).

(b) Differentiate between class predictions and class probabilities. (10 marks)

Class predictions. A classifier outputs a hard label y^{0,1,,K1}\hat y \in \{0, 1, \dots, K-1\} — the predicted class for the input. For binary classification, this is typically obtained by:

y^={1if P(y=1x)τ0otherwise\hat y = \begin{cases} 1 & \text{if } P(y = 1 \mid x) \ge \tau \\ 0 & \text{otherwise} \end{cases}

where τ\tau is a decision threshold (default 0.5). Class predictions answer "which class?"

Class probabilities. The classifier outputs a probability vector p^=(p0,p1,,pK1)\hat p = (p_0, p_1, \dots, p_{K-1}) with pk=1\sum p_k = 1, giving the model's confidence that the input belongs to each class. Class probabilities answer "how confident, and in which class?"

Comparison.

AspectClass predictionClass probability
Output typeDiscrete labelReal number in [0, 1]
InformationSingle answerFull distribution
Threshold neededYes (e.g., 0.5)No
Confidence shownNoYes
Loss function0–1 loss, hingeCross-entropy, log loss, Brier
Useful metricsAccuracy, F1, precision/recallROC-AUC, PR-AUC, log loss, calibration
Allows threshold tuningYes — set τ\tau per business needs
Used in cost-sensitive decisionsLimitedEssential
Stackable / mixable with other modelsHardEasy (averaging, blending)
Algorithms that natively output itMostLogistic, NB, NN; trees/SVM need calibration

Why probabilities are valuable.

  1. Threshold flexibility. Move τ\tau to trade precision and recall — e.g., raise τ\tau in fraud to reduce false positives, lower τ\tau in cancer screening to catch every case.
  2. Cost-sensitive decisions. Combine probability with cost matrix to choose the action that minimises expected cost (block, manual review, auto-approve).
  3. Ranking. ROC-AUC and PR-AUC measure how well the model ranks positives over negatives — independent of any single threshold.
  4. Confidence calibration. Reliable probabilities let downstream systems weight predictions; e.g., "act when P>0.9P > 0.9, escalate when 0.5<P<0.90.5 < P < 0.9".
  5. Probabilistic ensembling. Averaging class probabilities of multiple models almost always beats hard-vote ensembling.
  6. Uncertainty quantification. Low max-probability suggests the model is uncertain — useful for active learning and human-in-the-loop systems.

Calibration. Some classifiers (SVMs, raw trees) output uncalibrated scores. Apply Platt scaling or isotonic regression on a held-out set to convert scores into well-calibrated probabilities — essential for using P(yx)P(y \mid x) as a true probability.

When hard labels suffice. When a clear binary decision must be taken and probabilities won't influence downstream action (e.g., automated email folder placement, simple A/B split). Even here, retaining probabilities for monitoring and audit is recommended.