November 2024 — Solved
Data Analytics and Prediction
End-Semester Examination, First Semester PG Diploma in Data Science and Analytics (2024 Admission).
Part A — Short Essay
Answer any five questions. Each question carries 2 marks. (5 × 2 = 10 Marks)
Q1. Mention two methods of handling missing data.
- Deletion — drop rows (listwise) or columns with missing entries. Quick, but loses data and can introduce bias.
- Imputation — fill missing values with the mean / median / mode of the column, or use a model (KNN imputation, MICE, regression imputation) when the missingness is informative.
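The two approaches can be compared on a toy column. This is a minimal sketch using only the standard library; the `impute` helper and the `ages` data are hypothetical names introduced for illustration.

```python
from statistics import mean, median

def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 40, None, 100]

dropped = [v for v in ages if v is not None]   # deletion keeps 4 of 6 rows
mean_filled = impute(ages, "mean")             # fill value is pulled up by 100
median_filled = impute(ages, "median")         # median is less affected by 100
```

Deletion shrinks the sample, while the mean fill is sensitive to the single large value; the median fill is usually the safer default for skewed columns.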
Q2. Define regression in machine learning.
Regression is a supervised learning task that predicts a continuous numeric output $y$ from input features $\mathbf{x}$. The model learns a function $f(\mathbf{x})$ by minimising a loss such as MSE: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(\mathbf{x}_i)\bigr)^2$. Examples: linear regression, polynomial regression, ridge/lasso, SVR, decision-tree regression.
Q3. What is Linear Discriminant Analysis (LDA)?
LDA is a supervised classification + dimensionality reduction technique that finds a linear combination of features that maximises between-class variance and minimises within-class variance. It projects data onto a lower-dimensional space where classes are best separated, assuming each class is Gaussian with a common covariance matrix.
Q4. What is a confusion matrix?
A confusion matrix is a table comparing true labels against predicted labels for a $K$-class classification problem. For binary classification:
| | Predicted + | Predicted − |
|---|---|---|
| Actual + | TP | FN |
| Actual − | FP | TN |
From it we compute accuracy, precision, recall, F1, specificity and the ROC curve.
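As a rough illustration, the four cells can be counted directly from paired label lists. The function name `confusion_counts` and the toy labels are assumptions made for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for a binary problem."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1          # actual +, predicted +
        elif t == positive:
            fn += 1          # actual +, predicted -
        elif p == positive:
            fp += 1          # actual -, predicted +
        else:
            tn += 1          # actual -, predicted -
    return tp, fn, fp, tn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
```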
Q5. Define bagging.
Bagging (Bootstrap AGGregatING, Breiman 1996) is an ensemble technique: train $B$ base learners on bootstrap samples drawn with replacement from the training set, then average their predictions (regression) or take a majority vote (classification). It reduces variance and stabilises high-variance learners like decision trees. Random Forest is bagging applied to decision trees that are decorrelated by considering only a random subset of features at each split.
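The bootstrap-then-aggregate idea can be sketched with a deliberately trivial base learner. Everything here is an assumption for illustration: the "learner" simply predicts the median of its bootstrap sample, and `bagged_predict` is a hypothetical helper, not a library function.

```python
import random
from statistics import mean, median

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_predict(data, fit, n_learners=25, seed=0):
    """Fit one base learner per bootstrap sample and average the predictions."""
    rng = random.Random(seed)
    predictions = [fit(bootstrap_sample(data, rng)) for _ in range(n_learners)]
    return mean(predictions), predictions

targets = [2.0, 4.0, 6.0, 8.0, 100.0]   # toy regression targets with one outlier

# Toy base learner: predict the median of whatever sample it was given.
bagged, individual = bagged_predict(targets, fit=median)
```

Individual learners vary with the sample they happened to draw; the averaged prediction always lies between the most extreme individual predictions, which is the variance-reduction effect in miniature.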
Q6. What is a regression tree?
A regression tree is a decision tree whose leaves predict a continuous numeric value — usually the mean of the training targets in that leaf. The tree recursively splits the feature space to minimise variance (typically minimising RSS or MSE within each node), so similar targets end up in the same leaf.
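A single split on one feature shows the mechanism. The sketch below scans candidate thresholds and keeps the one with the lowest combined RSS; `best_split` and the toy data are illustrative assumptions, and a real tree would repeat this recursively over all features.

```python
def rss(values):
    """Residual sum of squares around the mean of `values`."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_rss) of the single split that minimises RSS."""
    pairs = sorted(zip(x, y))
    best = (None, rss(y))                      # baseline: no split at all
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                           # cannot split between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        total = rss(left) + rss(right)
        if total < best[1]:
            best = (threshold, total)
    return best

x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, total_rss = best_split(x, y)        # splits between x = 3 and x = 10
```

Each resulting leaf would predict the mean of its targets (6.0 on the left, 21.0 on the right in this toy case).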
Q7. Define classification accuracy.
Classification accuracy is the fraction of predictions that match the true labels:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Easy to compute but misleading under class imbalance — a model that always predicts the majority class can score very high.
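The imbalance problem is easy to reproduce. In this sketch the class ratio (1% positives) is an assumed toy value, and the "model" is just a constant prediction.

```python
y_true = [1] * 10 + [0] * 990      # 1% positives
y_pred = [0] * 1000                # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)
```

Accuracy comes out at 0.99 even though recall is zero, so not a single positive case is found.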
Part B — Long Essay
Answer any two questions. Each question carries 20 marks. (2 × 20 = 40 Marks)
Q8.
(a) Different data preprocessing techniques used in machine learning. (10 marks)
Raw data is rarely model-ready. Preprocessing is often estimated to take the majority of the effort in an ML project (figures of 60–80% are commonly quoted). The main techniques:
1. Data cleaning.
- Missing values — deletion, mean/median/mode imputation, KNN-imputation, MICE, model-based imputation, or learning a missingness indicator.
- Outliers — z-score / IQR / isolation-forest detection; cap (winsorise), remove, or transform.
- Duplicates and noise — drop exact duplicates, fuzzy dedupe, smoothing for noisy sensors.
2. Data integration. Combine multiple sources, resolve schema conflicts, merge on common keys, handle conflicting records.
3. Data transformation.
- Scaling
  - Standardisation (z-score): $z = (x - \mu)/\sigma$, which produces zero mean and unit variance. Important for k-NN, k-means, SVM and gradient-based methods.
  - Min-max: $x' = (x - x_{\min})/(x_{\max} - x_{\min})$, which maps values to $[0, 1]$. Useful for neural nets.
  - Robust scaling: uses median and IQR; resistant to outliers.
- Skew correction — log, square-root, Box–Cox, Yeo–Johnson.
- Encoding categorical variables — one-hot, ordinal, target/mean encoding, hashing.
- Datetime features — extract day, week, month, hour, is_weekend, holiday flag.
- Text features — tokenisation, TF-IDF, embeddings (Word2Vec, BERT).
4. Data reduction.
- Feature selection — variance threshold, chi-square, mutual information, L1 (Lasso), Recursive Feature Elimination.
- Dimensionality reduction — PCA, LDA, t-SNE, UMAP, autoencoders.
- Sampling / aggregation when data is large.
5. Discretisation / binning. Convert continuous features into discrete intervals — uniform, quantile, or model-based binning. Useful for tree models, naive Bayes, fairness analysis.
6. Class-imbalance handling.
- Oversampling (SMOTE, ADASYN).
- Undersampling majority class.
- Class weights / focal loss.
7. Data splitting. Train/test split, stratified, time-based, or group-based — done before fitting scalers / imputers to avoid leakage.
8. Pipelining. Wrap all preprocessing steps + model in a single sklearn / TFX / MLflow pipeline so the same transformations are applied at training and inference, eliminating leakage and drift.
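The standardisation and min-max scaling steps listed above, together with the rule of fitting on the training split only, can be sketched as follows. The helper names and the toy columns are assumptions for illustration; in practice a pipeline object would carry the fitted statistics.

```python
from statistics import mean, pstdev

def fit_standardiser(train):
    """Learn mean and standard deviation from the training column only."""
    mu, sigma = mean(train), pstdev(train)
    return mu, (sigma if sigma > 0 else 1.0)   # guard against constant columns

def standardise(values, mu, sigma):
    return [(v - mu) / sigma for v in values]

def min_max(values, lo, hi):
    return [(v - lo) / (hi - lo) for v in values]

train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

mu, sigma = fit_standardiser(train)
train_z = standardise(train, mu, sigma)
test_z = standardise(test, mu, sigma)          # reuses the training statistics
train_mm = min_max(train, min(train), max(train))
```

Because the test column is transformed with the training mean and standard deviation, no information from the test split leaks into the fitted scaler.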
(b) Evaluate a healthcare dataset for overfitting and suggest tuning approaches. (10 marks)
Scenario. Suppose we trained a Random Forest on a small (≈2,000 patients) healthcare dataset to predict 30-day readmission. Training F1 = 0.95, validation F1 = 0.71. The 24-point gap signals overfitting.
Diagnosis — confirming overfitting.
- Learning curves — plot train and validation F1 as we vary training-set size. Persistent wide gap = high variance / overfitting.
- Train vs CV scores — large gap between training and 5-fold CV mean.
- Sensitivity check — perturb random seed; if scores swing widely, model is unstable.
- Feature importance audit — if a single feature like `patient_id` or `admission_date` dominates, it's leakage.
Common causes in healthcare data.
- Small sample size relative to many features (high $p/n$ ratio).
- Highly correlated lab measurements / leakage from features available post-outcome.
- Class imbalance — readmissions are rare.
- Heterogeneous patient subgroups.
Tuning approaches to reduce overfitting.
1. Reduce model complexity.
- Random Forest: lower `max_depth` (e.g., 6–10), raise `min_samples_leaf` (≥ 20), lower `max_features`.
- Decision Tree: post-prune with `ccp_alpha`.
- Neural net: fewer layers / units, add dropout.
2. Regularisation. L1/L2 in linear / logistic regression; weight decay in deep models. For trees, regularisation appears as min-samples / pruning.
3. More data / better data.
- Combine with public datasets (MIMIC-III).
- Data augmentation — SMOTE for the minority readmission class; sampling with replacement for under-represented subgroups.
- Feature engineering that consolidates redundant labs into clinically meaningful indices.
4. Robust validation.
- Stratified k-fold (k = 5 or 10) to preserve readmission rate.
- Group k-fold by patient ID — never let one patient appear in both train and validation.
- Nested CV for unbiased hyperparameter tuning.
5. Hyperparameter search. Use grid / random / Bayesian search over n_estimators, max_depth, min_samples_leaf, max_features, class_weight. Optimise CV-mean F1 (or PR-AUC for imbalanced) — not training accuracy.
6. Early stopping in gradient-boosted models (XGBoost / LightGBM): stop adding trees when validation logloss plateaus.
7. Ensembling and stacking — average multiple model families to absorb variance.
8. Audit and remove leakage features. Any feature constructed after the outcome (e.g., follow-up visit indicator) must be removed.
9. Calibration. Use Platt scaling or isotonic regression so predicted probabilities are honest — crucial in clinical decision support.
Expected outcome. After applying these, training F1 typically drops a few points while validation F1 rises — closing the gap to roughly within 5 points indicates a healthier model that generalises.
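The group-wise validation rule (one patient never appears on both sides of a split) can be sketched without any library. The `group_k_fold` generator and the patient IDs below are hypothetical; it assigns whole groups to folds round-robin and is only meant to show the idea.

```python
from collections import defaultdict

def group_k_fold(groups, k=3):
    """Yield (train_idx, val_idx) so that no group spans both sides."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    # Assign whole groups to folds round-robin, largest groups first.
    ordered = sorted(by_group, key=lambda g: (-len(by_group[g]), str(g)))
    folds = [[] for _ in range(k)]
    for i, g in enumerate(ordered):
        folds[i % k].extend(by_group[g])
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for j in range(k) if j != f for i in folds[j])
        yield train, val

# One row per admission; several patients have repeat admissions.
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5"]
splits = list(group_k_fold(patient_ids, k=3))
```

A train/validation F1 gap measured on splits like these is a more honest overfitting signal than one measured on row-level random splits, where repeat admissions of the same patient can sit on both sides.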
Q9.
(a) Design a predictive system using KNN and evaluate how the choice of K impacts performance. (10 marks)
K-Nearest Neighbors — overview. KNN is a simple lazy supervised algorithm. To predict for a new point $\mathbf{x}$:
- Compute the distance from $\mathbf{x}$ to every training point.
- Pick the $K$ closest.
- Classification: majority vote among the $K$ labels (with ties broken by distance or at random).
- Regression: average (optionally weighted by inverse distance) the $K$ target values.
It's instance-based — no model is "trained" in the conventional sense; the training set itself is the model.
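The prediction steps above can be sketched in a few lines. This is a brute-force illustration with Euclidean distance; `knn_predict` and the toy churn data are assumed names, and features are taken to be already scaled.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_y = ["stay", "stay", "stay", "churn", "churn", "churn"]

near_low = knn_predict(train_X, train_y, (1.8, 1.6), k=3)
near_high = knn_predict(train_X, train_y, (6.4, 6.2), k=3)
```

Note that nothing is fitted: every call scans the stored training set, which is why prediction cost grows with the number of training points.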
Design — predictive system pipeline.
Step 1 — Define the problem. Example: predict whether a customer churns (binary classification) using account features.
Step 2 — Data preprocessing.
- Handle missing values (median / mode imputation).
- Encode categoricals (one-hot).
- Scale features — critical because KNN uses raw Euclidean distance. Standardise or min-max so no feature dominates.
- Remove or down-weight irrelevant features (KNN is sensitive to noise).
Step 3 — Choose distance metric.
- Euclidean for continuous features.
- Manhattan for high-dim sparse.
- Cosine for text vectors / embeddings.
- Hamming for categorical.
- Mahalanobis to account for feature correlations.
Step 4 — Choose $K$ (see below).
Step 5 — Speed up neighbour search for big data — use KD-tree, Ball-tree, or approximate methods (HNSW, Annoy, FAISS).
Step 6 — Evaluate with stratified k-fold CV using accuracy / F1 / ROC-AUC.
Step 7 — Deploy with a vector index for low-latency lookup.
Impact of $K$ on performance.
| $K$ | Behaviour | Bias | Variance |
|---|---|---|---|
| $K = 1$ | Nearest neighbour only | Low | Very high — overfits noise, jagged boundary |
| Small (3–5) | Fine-grained boundary | Low | High |
| Medium (10–30) | Smooth boundary | Moderate | Moderate |
| Large | Predictions ≈ majority class | High | Low — underfits, near-flat boundary |
| $K = n$ | Always predict global mean / mode | Maximal bias | Zero variance |
Behaviour summary.
- Small $K$ ⇒ low bias, high variance, sensitive to outliers — fits idiosyncrasies of training data ("overfits").
- Large $K$ ⇒ high bias, low variance — smooths the decision boundary, ignores local structure ("underfits").
Choosing $K$.
- Plot CV accuracy vs $K$ and select the "elbow" — the smallest $K$ beyond which performance stops improving meaningfully.
- Prefer odd $K$ in binary classification to avoid ties.
- A rule of thumb: $K \approx \sqrt{n}$ as a starting point.
- Use distance-weighted voting to make KNN less sensitive to $K$.
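Selecting by cross-validated accuracy can be illustrated with leave-one-out CV on a tiny one-dimensional dataset. The data (two clusters plus one noisy point) and the helper names are assumptions chosen so that the effect is visible; real selection would use a proper CV grid.

```python
from collections import Counter

def knn_label(points, query, k):
    """Majority label among the k points nearest to `query` (1-D feature)."""
    nearest = sorted(points, key=lambda p: abs(p[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loocv_accuracy(points, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (x, label) in enumerate(points):
        rest = points[:i] + points[i + 1:]
        hits += knn_label(rest, x, k) == label
    return hits / len(points)

points = [(float(x), "A") for x in range(1, 6)]        # cluster A at 1..5
points += [(float(x), "B") for x in range(11, 16)]     # cluster B at 11..15
points.append((3.5, "B"))                              # one noisy point inside A

scores = {k: loocv_accuracy(points, k) for k in (1, 3, 9)}
best_k = max(scores, key=scores.get)
```

On this toy set the smallest value is hurt by the noisy point, the largest value is swamped by the other class, and the middle value scores best, which mirrors the bias-variance pattern described above.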
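Worked illustration. On Iris (150 samples), results typically follow this pattern:
- Very small $K$ → roughly 96% CV accuracy but a bouncy decision boundary.
- Moderate $K$ → roughly 97% CV accuracy, smoother boundary.
- Very large $K$ → accuracy drops to roughly 88% because the boundary is over-smoothed.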
Practical caveats.
- KNN suffers from the curse of dimensionality — distances lose meaning above ~20 features. Apply PCA / feature selection first.
- Prediction is slow ($O(nd)$ per query for $n$ training points and $d$ features) without an index.
- Sensitive to feature scaling and outliers — always scale.
(b) Compare Support Vector Regression (SVR) and Decision Tree Regression. (10 marks)
Both are non-linear regression algorithms but with very different mechanisms.
Support Vector Regression (SVR).
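- Builds on SVM. Fits a function $f(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x}) + b$ that lies within an $\varepsilon$-tube around the data; errors inside the tube are ignored.
- Optimisation: $\min_{\mathbf{w}, b, \xi, \xi^*} \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{n}(\xi_i + \xi_i^*)$ subject to $y_i - f(\mathbf{x}_i) \le \varepsilon + \xi_i$, $f(\mathbf{x}_i) - y_i \le \varepsilon + \xi_i^*$ and $\xi_i, \xi_i^* \ge 0$.
- Kernel trick ($k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$) — linear, polynomial, RBF — lets SVR fit non-linear functions.
- Solution depends only on support vectors — points on or outside the tube boundary.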
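The tube idea corresponds to the epsilon-insensitive loss, which can be written directly. The function name and toy values are assumptions for this sketch; it evaluates the loss for given predictions rather than fitting an SVR.

```python
def epsilon_insensitive(y_true, y_pred, epsilon=0.5):
    """Per-point SVR loss: zero inside the tube, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.2, 2.0, 4.0, 2.0]

losses = epsilon_insensitive(y_true, y_pred, epsilon=0.5)
outside_tube = [i for i, loss in enumerate(losses) if loss > 0]
```

Only the points with non-zero loss contribute slack to the objective, so they are the ones that shape the fitted function.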
Decision Tree Regression (DTR).
- Recursively splits feature space on the feature/threshold pair that most reduces target variance (MSE/MAE).
- Leaves predict the mean target of the training points falling there.
- The result is a piecewise-constant function.
- Can be regularised by limiting depth, min-samples-per-leaf, or via pruning.
Comparison.
| Aspect | SVR | Decision Tree |
|---|---|---|
| Function class | Smooth (depending on kernel) | Piecewise constant (step function) |
| Handles non-linearity | Via kernels | Naturally via recursive splits |
| Sensitivity to feature scale | High — must scale | None — invariant to monotonic transforms |
| Hyperparameters | $C$, $\varepsilon$, kernel, $\gamma$ | max_depth, min_samples_leaf, criterion |
| Interpretability | Low (especially with non-linear kernel) | High (can be visualised) |
| Handling missing values | Needs imputation | Many implementations handle natively |
| Categorical features | Need encoding | Native support (some libraries) |
| Outlier sensitivity | Moderate — $\varepsilon$-tube helps | Sensitive — can split aggressively on outliers |
| Training cost | $O(n^2)$ to $O(n^3)$ — slow on big data | $O(n \log n)$ per feature |
| Prediction speed | Fast | Very fast |
| Extrapolation | Smooth extrapolation possible | Cannot extrapolate beyond training-range targets |
| Memory | Stores support vectors | Stores tree structure |
| Best with | Small-to-medium, scaled, smooth data | Tabular data, heterogeneous features |
| Variance / bias | High bias, low variance (regularised) | Low bias, high variance (often overfits) |
Practical advice. For modern tabular regression problems, gradient-boosted decision trees (XGBoost, LightGBM) build on DTR by combining many shallow trees through boosting, and they often outperform raw SVR or single trees. SVR is still preferred when data is small, smooth, and well-scaled — and when you want a mathematically clean model with a clear notion of "support vectors."
Q10.
(a) Apply LDA and PLS-DA on a real-world biological dataset and compare their classification performance. (10 marks)
Setting — biological example. A common benchmark: classify cancer subtypes (e.g., ALL vs AML leukaemia) from gene-expression microarrays — 7129 features, ~72 samples. The dataset has $p \gg n$ (far more features than samples), strong feature correlations, and small sample size.
LDA (Linear Discriminant Analysis).
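Idea. Project data onto the linear axis $\mathbf{w}$ that maximises the Fisher ratio $J(\mathbf{w}) = \frac{\mathbf{w}^\top S_B \mathbf{w}}{\mathbf{w}^\top S_W \mathbf{w}}$, where $S_B$ is between-class scatter and $S_W$ is within-class scatter.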
Assumptions. Each class is Gaussian with a common covariance matrix; classes differ in means.
Procedure.
- Standardise features.
- Compute $S_B$ and $S_W$.
- Solve $S_W^{-1} S_B \mathbf{w} = \lambda \mathbf{w}$ — the leading eigenvectors are the discriminant directions.
- Project data, then classify (often using Gaussian likelihood or a downstream linear classifier).
Problem with high-dim biological data. $S_W$ is singular when $p > n$, so LDA breaks down. Standard fixes: regularise ($S_W + \alpha I$), use shrinkage LDA, or first reduce dimensionality via PCA.
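For two classes the leading discriminant direction has a closed form, proportional to $S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_0)$, where $\mathbf{m}_0$ and $\mathbf{m}_1$ are the class means. The sketch below works this out for two features with a hand-written 2x2 inverse; the function names, the optional `ridge` term (a stand-in for the regularisation fix mentioned above) and the toy points are all assumptions for illustration.

```python
def mean_vec(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def scatter(rows, m):
    """2x2 scatter matrix: sum of (x - m)(x - m)^T over the rows."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s

def fisher_direction(class0, class1, ridge=0.0):
    """w proportional to (S_W + ridge * I)^-1 (m1 - m0) for two 2-D classes."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    a = s0[0][0] + s1[0][0] + ridge
    b = s0[0][1] + s1[0][1]
    d = s0[1][1] + s1[1][1] + ridge
    det = a * d - b * b
    if det == 0:
        raise ValueError("S_W is singular; increase `ridge`")
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # Inverse of the symmetric matrix [[a, b], [b, d]] applied to diff.
    return [(d * diff[0] - b * diff[1]) / det, (-b * diff[0] + a * diff[1]) / det]

class0 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
class1 = [(6.0, 5.0), (7.0, 8.0), (8.0, 6.0), (7.0, 7.0)]

w = fisher_direction(class0, class1)
proj0 = [w[0] * x + w[1] * y for x, y in class0]   # 1-D projections, class 0
proj1 = [w[0] * x + w[1] * y for x, y in class1]   # 1-D projections, class 1
```

On this toy data the projected classes do not overlap, so a single threshold on the projection separates them. With thousands of genes and a few dozen samples the determinant check would fail, which is the singularity problem in concrete form.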
PLS-DA (Partial Least Squares Discriminant Analysis).
Idea. Find latent components that maximise covariance between predictors $X$ and the (encoded) class label $Y$. Unlike PCA — which maximises variance of $X$ alone — PLS uses $Y$, so it is supervised and tailored for class separation.
Procedure.
- Standardise features.
- Encode as dummy class indicators.
- Iteratively extract latent components that maximise $\operatorname{cov}(\mathbf{t}, \mathbf{u})$, where $\mathbf{t} = X\mathbf{w}$ and $\mathbf{u} = Y\mathbf{c}$.
- Use the latent components as predictors in a linear classifier.
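For a single 0/1 response, the first PLS weight vector is proportional to $X^\top y$ computed on centred data, so the first latent component can be sketched by hand. The function `first_pls_component` and the toy matrix are assumptions for illustration; features are centred here but not rescaled, and later components (which require deflation) are omitted.

```python
import math

def first_pls_component(X, y):
    """First PLS weight vector w and scores t = Xw for a single 0/1 response."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Each entry of X^T y is proportional to cov(feature_j, y).
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0:
        raise ValueError("no feature covaries with the response")
    w = [v / norm for v in w]
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    return w, t

# Toy data: feature 0 tracks the class, features 1 and 2 are mostly noise.
X = [[1.0, 5.0, 0.3], [2.0, 4.0, 0.1], [8.0, 5.0, 0.2], [9.0, 4.0, 0.4]]
y = [0, 0, 1, 1]
w, t = first_pls_component(X, y)
```

The weight vector concentrates on the feature that covaries with the label, and the scores `t` already separate the two classes; this weighting is also what VIP-style feature rankings build on.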
Strengths in biology.
- Works gracefully when $p \gg n$.
- Handles multicollinear features (genes coexpressed).
- Variable Importance in Projection (VIP) scores rank features — biologically interpretable.
Application & comparison on leukaemia dataset.
| Aspect | LDA | PLS-DA |
|---|---|---|
| Suitability when $p \gg n$ | Poor (needs regularisation or PCA) | Excellent |
| Multicollinearity | Sensitive | Robust (compresses correlated features) |
| Supervised? | Yes | Yes |
| Latent components | Maximise class separation | Maximise covariance with $Y$ |
| Interpretability | Discriminant weights | VIP scores |
| Typical leukaemia accuracy | ~85–90% (with regularisation) | ~95% (clean separation) |
| Risk of overfitting | High without regularisation | Cross-validate # of components |
| Software | sklearn (with shrinkage), MASS | mixOmics (R), sklearn cross_decomposition |
Validation. Use stratified k-fold CV (or LOOCV given small $n$). Report accuracy, balanced accuracy, F1, ROC-AUC and number of latent components. PLS-DA typically wins on raw gene-expression matrices; LDA wins when applied after PCA/feature selection that resolves the $p \gg n$ issue.
(b) Key assumptions of discriminant analysis. (10 marks)
Discriminant Analysis (specifically LDA and QDA) is a generative classifier grounded in modelling $p(\mathbf{x} \mid y = k)$ as Gaussian. Its statistical foundation rests on the following assumptions:
1. Gaussian class conditionals. For each class $k$, the feature vector $\mathbf{x}$ given $y = k$ follows a multivariate normal: $\mathbf{x} \mid y = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \Sigma_k)$. Violation → discriminant analysis is biased. Skewed or multi-modal features should be transformed (log, Box–Cox) or replaced by a more flexible classifier.
2. Equality of covariance matrices (LDA only). $\Sigma_1 = \Sigma_2 = \dots = \Sigma_K = \Sigma$. Under this assumption the decision boundary is linear. If covariances differ, use Quadratic Discriminant Analysis (QDA) which estimates a separate $\Sigma_k$ per class and produces quadratic boundaries.
3. Independent observations. Each sample is drawn independently from its class distribution. Time-series correlation, repeated measurements on the same subject, or hierarchical structure breaks this assumption.
4. Sufficient sample size per class. Estimating the covariance matrix $\Sigma$ reliably requires more samples than features, ideally several times more in each class. With $p > n$ (high-dim genomics, text, image features), the estimate of $\Sigma$ is singular. Fixes: regularised / shrinkage LDA, PCA before LDA, or PLS-DA.
5. No severe multicollinearity. Highly correlated predictors make $\Sigma$ near-singular and inflate the variance of estimated discriminant weights. Address by feature selection or shrinkage.
6. Linear separability (for LDA). Boundary is linear ⇒ LDA struggles with non-linearly separable data. Use QDA, kernel discriminants, or non-linear classifiers (SVM-RBF, GBT) when classes curve into each other.
7. Approximately balanced or known priors. LDA uses class priors $\pi_k$ in the Bayes formulation. If they are misestimated (e.g., training set ratio doesn't match production), recalibrate priors at prediction time.
8. Continuous, scaled features. LDA assumes continuous predictors. Mix-in of binary/categorical features should be one-hot encoded carefully; non-binary categoricals violate Gaussianity.
Diagnostics.
- Mardia's test / Henze–Zirkler for multivariate normality.
- Box's M test for equality of covariances (sensitive to non-normality, so use cautiously).
- Residual plots of class means.
- CV gap between train and test as overfitting check.
When assumptions are violated.
- Modest violations: LDA is famously robust — often still performs competitively.
- Severe violations: try QDA, RDA (regularised), Naive Bayes, logistic regression, kernel methods, or tree ensembles.
Q11.
(a) Design a system that predicts credit-card fraud and analyse it using all performance metrics in classification. (10 marks)
Problem. Real-time binary classification: legitimate (0) vs fraudulent (1) transaction. Severe class imbalance (typically 0.1–0.2% fraud). FN is very costly (money lost); FP is annoying (customer friction).
System design.
1. Data ingestion. Stream transactions through Kafka / Kinesis to a feature store, with batch + real-time pipelines.
2. Feature engineering.
- Transaction-level: amount, merchant category, country, currency, channel.
- Cardholder-level: age of account, average historical spend, std-dev.
- Temporal: time-since-last-transaction, transactions-in-last-hour, hour-of-day, weekday.
- Velocity: distance / time between consecutive transactions ("impossible-travel").
- Graph features: shared device, shared IP with known fraud rings.
3. Preprocessing. One-hot encode categoricals, scale numeric, log-transform skewed (amount), handle missing.
4. Imbalance handling.
- Class weights (inverse frequency).
- SMOTE / ADASYN oversampling.
- Anomaly detection as an auxiliary signal.
5. Model selection.
- Baseline: Logistic Regression with L1.
- Strong: Random Forest, XGBoost / LightGBM, neural net with autoencoder features.
- Production stacks often combine a fast rules engine (for known patterns) and an ML model.
6. Training and tuning.
- Stratified k-fold CV.
- Optimise PR-AUC (not ROC-AUC) due to imbalance.
- Hyperparameter search; calibrate probabilities with isotonic regression.
7. Deployment. Low-latency scoring (<50 ms), fallback rules, A/B-tested threshold, decision routing (auto-approve, step-up authentication, block).
8. Monitoring. Watch data drift, fraud rate, precision@K daily; retrain on schedule (rapid fraud-pattern drift).
Performance metrics — exhaustive analysis.
Let TP, TN, FP, FN come from the confusion matrix at chosen threshold.
| Metric | Formula | In fraud context |
|---|---|---|
| Accuracy | $(TP + TN)/(TP + TN + FP + FN)$ | Misleading — 99.8% baseline by predicting "not fraud" |
| Precision | $TP/(TP + FP)$ | Of flagged transactions, how many are real fraud — drives customer friction |
| Recall (Sensitivity, TPR) | $TP/(TP + FN)$ | Of real fraud, how many caught — drives money saved |
| Specificity (TNR) | $TN/(TN + FP)$ | Of legitimate transactions, how many correctly let through |
| FPR | $FP/(FP + TN)$ | Fraction of good customers wrongly blocked |
| FNR | $FN/(FN + TP)$ | Fraud missed |
| F1 | $2PR/(P + R)$, with $P$ = precision and $R$ = recall | Balanced precision–recall |
| $F_\beta$ (e.g., $\beta = 2$) | $(1 + \beta^2)PR/(\beta^2 P + R)$; $\beta > 1$ weights recall higher | Use to prioritise catching fraud |
| ROC-AUC | TPR vs FPR area | Threshold-free, but optimistic under imbalance |
| PR-AUC | Precision vs Recall area | More informative under heavy imbalance — preferred |
| MCC | Matthews Correlation Coef. | Single balanced number on imbalanced data |
| Log loss | Cross-entropy | Penalises confidently wrong predictions |
| Brier score | MSE on probabilities | Calibration quality |
| Cost-weighted loss | $c_{FN} \cdot FN + c_{FP} \cdot FP$, or money saved | Final business metric — combines cost per FN (lost money) and FP (friction) |
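The threshold-dependent rows of the table can be computed from the four counts. This is a sketch: `classification_metrics` is a hypothetical helper, and the counts are invented to resemble a heavily imbalanced fraud day (20 frauds in 10,000 transactions).

```python
import math

def classification_metrics(tp, fp, fn, tn, beta=2.0):
    """Threshold-dependent metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_beta": f_beta,
        "mcc": mcc,
    }

# Hypothetical counts: 10,000 transactions, 20 of them fraudulent.
m = classification_metrics(tp=15, fp=30, fn=5, tn=9950)
```

Accuracy and specificity both exceed 0.99 here, yet only one in three flagged transactions is real fraud; the recall-weighted F score and MCC expose that gap.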
Threshold tuning. Sweep the threshold along the PR curve to choose where business cost is minimised. Fraud teams often pick threshold to maintain a target precision (e.g., 80%) while maximising recall.
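The "target precision, maximise recall" rule can be sketched as a sweep over candidate thresholds. The scores and labels are toy values and `pick_threshold` is a hypothetical helper; a production version would sweep on a held-out set.

```python
def pick_threshold(scores, labels, target_precision=0.8):
    """Threshold with the highest recall among those meeting the precision target."""
    best = None
    total_pos = sum(labels)
    for thr in sorted(set(scores), reverse=True):
        flagged = [l for s, l in zip(scores, labels) if s >= thr]
        tp = sum(flagged)
        precision = tp / len(flagged)
        recall = tp / total_pos
        if precision >= target_precision and (best is None or recall > best[2]):
            best = (thr, precision, recall)
    return best   # None if no threshold reaches the target precision

scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]     # 1 = fraud

thr, prec, rec = pick_threshold(scores, labels, target_precision=0.8)
```

Lowering the threshold further would catch the last fraud case but push precision below the target, which is exactly the trade-off the business has to price.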
Diagnostic plots. Confusion matrix at chosen threshold, PR curve, ROC curve, calibration curve, lift / gain charts (precision in top-K%).
(b) Differentiate between class predictions and class probabilities. (10 marks)
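Class predictions. A classifier outputs a hard label $\hat{y}$ — the predicted class for the input. For binary classification, this is typically obtained by:

$$\hat{y} = \begin{cases} 1 & \text{if } P(y = 1 \mid \mathbf{x}) \ge \tau \\ 0 & \text{otherwise} \end{cases}$$

where $\tau$ is a decision threshold (default 0.5). Class predictions answer "which class?"
Class probabilities. The classifier outputs a probability vector $\hat{\mathbf{p}} = (\hat{p}_1, \dots, \hat{p}_K)$ with $\hat{p}_k \ge 0$ and $\sum_k \hat{p}_k = 1$, giving the model's confidence that the input belongs to each class. Class probabilities answer "how confident, and in which class?"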
Comparison.
| Aspect | Class prediction | Class probability |
|---|---|---|
| Output type | Discrete label | Real number in [0, 1] |
| Information | Single answer | Full distribution |
| Threshold needed | Yes (e.g., 0.5) | No |
| Confidence shown | No | Yes |
| Loss function | 0–1 loss, hinge | Cross-entropy, log loss, Brier |
| Useful metrics | Accuracy, F1, precision/recall | ROC-AUC, PR-AUC, log loss, calibration |
| Allows threshold tuning | – | Yes — set per business needs |
| Used in cost-sensitive decisions | Limited | Essential |
| Stackable / mixable with other models | Hard | Easy (averaging, blending) |
| Algorithms that natively output it | Most | Logistic, NB, NN; trees/SVM need calibration |
Why probabilities are valuable.
- Threshold flexibility. Move the threshold $\tau$ to trade precision and recall — e.g., raise $\tau$ in fraud to reduce false positives, lower $\tau$ in cancer screening to miss as few cases as possible.
- Cost-sensitive decisions. Combine probability with cost matrix to choose the action that minimises expected cost (block, manual review, auto-approve).
- Ranking. ROC-AUC and PR-AUC measure how well the model ranks positives over negatives — independent of any single threshold.
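- Confidence calibration. Reliable probabilities let downstream systems weight predictions; e.g., act automatically when the predicted probability is above a high cut-off and escalate to a human when it falls in an intermediate band.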
- Probabilistic ensembling. Averaging the class probabilities of multiple models often beats hard-vote ensembling, because it keeps each model's confidence.
- Uncertainty quantification. Low max-probability suggests the model is uncertain — useful for active learning and human-in-the-loop systems.
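The cost-sensitive use of probabilities can be made concrete with an expected-cost rule. The cost matrix below is entirely hypothetical, as is the `cheapest_action` helper; the point is only that three different actions emerge from probabilities that a 0.5 threshold would partly collapse.

```python
def cheapest_action(p_fraud, costs):
    """Pick the action with the lowest expected cost given P(fraud)."""
    expected = {
        action: p_fraud * c["fraud"] + (1 - p_fraud) * c["legit"]
        for action, c in costs.items()
    }
    return min(expected, key=expected.get), expected

# Hypothetical cost matrix (currency units per transaction).
costs = {
    "approve": {"fraud": 500.0, "legit": 0.0},
    "review":  {"fraud": 20.0,  "legit": 20.0},
    "block":   {"fraud": 0.0,   "legit": 50.0},
}

probabilities = (0.01, 0.10, 0.90)
decisions = {p: cheapest_action(p, costs)[0] for p in probabilities}
hard_labels = {p: int(p >= 0.5) for p in probabilities}
```

The hard labels treat the 1% and 10% cases identically, while the expected-cost rule approves the first and sends the second to manual review.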
Calibration. Some classifiers (SVMs, raw trees) output uncalibrated scores. Apply Platt scaling or isotonic regression on a held-out set to convert scores into well-calibrated probabilities — essential for using the model output $\hat{p}$ as a true probability.
When hard labels suffice. When a clear binary decision must be taken and probabilities won't influence downstream action (e.g., automated email folder placement, simple A/B split). Even here, retaining probabilities for monitoring and audit is recommended.