PGDDSA Study · Semester 1

PGD24110102C

November 2024 — Solved

Foundation of Data Science and Algorithm Design

PGD01C02

2 hours 30 minutes

50 marks

End-Semester Examination, First Semester PG Diploma in Data Science and Analytics (2024 Admission).

Part A — Short Essay

Answer any five questions. Each question carries 2 marks. (5 × 2 = 10 Marks)

Q1. Identify two sources of data collection in data science.

Primary sources — data collected directly for the project, e.g., surveys, sensors, experiments, web-scraped data, A/B test logs, IoT devices.
Secondary sources — pre-existing data reused for analysis, e.g., public datasets (Kaggle, UCI, World Bank), company databases, government open-data portals, APIs.

Q2. What is one reason for using data integration?

To produce a unified, consistent view of data drawn from multiple heterogeneous sources (different databases, file formats, business units). It removes redundancy, resolves schema and naming conflicts, and enables holistic analytics that no single source could support.

Q3. How can a heatmap help in evaluating student performance across subjects?

A heatmap displays a students × subjects matrix of scores as a colour grid (for example, warm or dark shades for higher marks and cool or light shades for lower marks). It reveals at a glance:

Weak subjects as cold columns (low average across students).
Struggling students as cold rows.
Patterns and clusters — e.g., students who do well in maths tend to do well in physics.

Q4. A residual plot shows a U-shaped pattern. What does this imply about the regression model?

A U-shape (or inverted U) in the residual plot is a clear sign of non-linearity — the linear model is underfitting a curved relationship. The fix is to add polynomial / interaction terms, apply a non-linear transformation, or switch to a non-linear model (decision tree, kernel regression).
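The effect can be reproduced with a small sketch. The data below are hypothetical (a pure quadratic, chosen only for illustration): a straight line is fitted by least squares and the residuals are printed in order of x.

```python
# Hypothetical data: y follows a curve (y = x^2), but we fit a straight line.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept for a simple linear fit.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

# Residual = observed value minus fitted value.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print([round(r, 2) for r in residuals])  # [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
```

The residuals are positive at both ends and negative in the middle, which is the U-shape described above.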

Q5. What is memoization in dynamic programming?

Memoization is a top-down dynamic programming technique that caches the result of each recursive subproblem the first time it is computed and returns the cached value on subsequent calls. It avoids re-solving overlapping subproblems and turns exponential recursion (e.g., naive Fibonacci $O(2^n)$ ) into linear time $O(n)$ .
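A minimal sketch of the idea, using Python's standard `functools.lru_cache` as the cache (the function names `fib_naive` and `fib_memo` are illustrative choices, not part of the question):

```python
from functools import lru_cache


def fib_naive(n: int) -> int:
    """Plain recursion: recomputes the same subproblems, exponential time."""
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)


@lru_cache(maxsize=None)
def fib_memo(n: int) -> int:
    """Same recursion, but each result is cached after its first computation."""
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)


print(fib_memo(50))  # 12586269025
```

Calling `fib_naive(50)` would take impractically long, while `fib_memo(50)` solves each of the 51 subproblems once.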

Q6. Name any two algorithm design techniques.

Divide and Conquer — split the problem into smaller independent sub-problems, solve recursively, combine results (e.g., merge sort, quicksort).
Dynamic Programming — solve overlapping sub-problems once and store results in a table (e.g., longest common subsequence, 0/1 knapsack).

(Other valid answers: Greedy, Backtracking, Branch and Bound, Brute Force.)

Q7. Define a linear data structure with one example.

A linear data structure organises elements in a sequential order so that each element (except the first and last) has exactly one predecessor and one successor; elements are accessed in linear order.

Example: an array — fixed-size sequence of elements stored in contiguous memory, accessible by index in $O(1)$ . (Others: linked list, stack, queue.)

Part B — Long Essay

Answer any two questions. Each question carries 20 marks. (2 × 20 = 40 Marks)

Q8.

(a) How does a Data Analyst contribute to a company's decision-making process through specific tasks and tools? (10 marks)

A Data Analyst turns raw business data into actionable insight. Their contribution to decision-making is end-to-end:

1. Data collection and integration. Pull data from operational databases, CRM, ERP, web analytics, and APIs into a single warehouse. Tools: SQL, Python (pandas), Apache Airflow, dbt.

2. Data cleaning and preparation. Handle missing values, deduplicate, standardise formats, detect outliers. Tools: pandas, OpenRefine, Excel Power Query.

3. Exploratory Data Analysis (EDA). Compute descriptive statistics, draw histograms, scatter plots, correlations to surface trends and anomalies. Tools: pandas, matplotlib, seaborn, Jupyter.

4. Dashboard and report building. Translate metrics into interactive visualisations that leadership can monitor in real time. Tools: Tableau, Power BI, Looker, Looker Studio (formerly Google Data Studio).

5. KPI definition and monitoring. Define quantitative goals (revenue, churn, conversion) and track them across periods. Send alerts when KPIs deviate.

6. Ad-hoc business questions. Answer specific stakeholder queries (e.g., "Which products drove last month's revenue dip?") with quick analyses.

7. Hypothesis testing and A/B-test analysis. Help marketing and product teams measure whether a change had a real effect. Tools: SciPy, R, statsmodels.

8. Insight communication and storytelling. Translate numbers into recommendations using slides, written reports, and stakeholder presentations.

Impact on decision-making. A good analyst converts gut decisions into data-driven decisions: pricing changes, marketing budgets, product roadmap priorities, hiring plans, supply-chain re-routing — all grounded in evidence rather than intuition. The analyst is the bridge between raw data and the people making choices.

(b) Applications of data science in healthcare and retail (10 marks)

Healthcare — two examples.

Disease prediction and diagnosis. ML models trained on patient records (demographics, labs, imaging) predict the probability of conditions such as diabetes, heart disease or cancer. Concrete example: Convolutional Neural Networks (CNNs) trained on chest X-rays have been reported to detect pneumonia with accuracy comparable to radiologists on benchmark datasets. Organisations working on medical-imaging AI include Google Health, Aidoc and Zebra Medical Vision.
Personalised treatment / precision medicine. Genomic data combined with clinical histories enables tailored treatment plans. Concrete example: IBM Watson for Oncology was built to suggest cancer treatment options by comparing a patient's clinical profile with expert-curated guidance and published evidence, although independent evaluations of its recommendations were mixed. Pharmacogenomics adjusts drug choice and dosage based on a patient's genetic profile.

(Additional healthcare uses: hospital re-admission prediction, drug discovery via molecule modelling, ICU early-warning systems, wearable-device monitoring.)

Retail — two examples.

Recommendation systems. Collaborative-filtering and deep-learning models personalise product suggestions per customer. Concrete example: Amazon's "Customers who bought this also bought" engine is credited, in a widely cited industry estimate, with driving about 35% of the company's sales. Netflix has estimated that its recommender saves around USD 1 billion per year by reducing churn.
Demand forecasting and inventory optimisation. Time-series models (ARIMA, Prophet, LSTM) predict future demand by product, store and season — minimising stockouts and overstocks. Concrete example: Walmart uses ML demand forecasts to plan inventory across 4,700 stores. Zara optimises restock cycles using sales data refreshed twice per week.

(Additional retail uses: dynamic pricing, customer-lifetime-value modelling, churn prediction, market-basket analysis with association rules, fraud detection on payments.)

Q9.

(a) Role of data preprocessing in a data science project, including key steps. (10 marks)

Why preprocessing matters. Raw data is rarely clean. Surveys commonly report that data scientists spend 60–80% of project time on preprocessing, and the effort is justified because it has an outsized impact on model quality: garbage in, garbage out. Good preprocessing:

Removes noise and inconsistencies that degrade accuracy.
Ensures features are on comparable scales (essential for gradient-based and distance-based models).
Encodes domain knowledge through transformations and engineered features.
Reduces dimensionality so models train faster and generalise better.

Key steps.

1. Data cleaning.

Missing values — drop rows, impute with mean/median/mode, model-based imputation (KNN, MICE).
Outliers — detect with z-score, IQR or isolation forests; cap, remove, or transform.
Duplicates — drop exact duplicates and fuzzy duplicates.
Typos / inconsistent formats — standardise units, date formats, capitalisation.
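Two of the cleaning operations above, median imputation and the IQR outlier rule, can be sketched with the standard library alone. The `scores` column is hypothetical, and in practice the same steps would usually be done with pandas.

```python
from statistics import median, quantiles

# Hypothetical column of exam scores; None marks a missing value.
scores = [52, 61, None, 58, 64, 300, 55, None, 60]

# Median imputation: fill gaps with the median of the observed values.
observed = [s for s in scores if s is not None]
fill = median(observed)
imputed = [fill if s is None else s for s in scores]

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = quantiles(imputed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [s for s in imputed if s < low or s > high]

print(fill, outliers)  # 60 [300]
```

The median is used rather than the mean because the extreme value 300 would otherwise pull the imputed value upward.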

2. Data integration. Merge data from multiple sources, resolve schema conflicts, harmonise keys (e.g., customer ID across CRM and billing).

3. Data transformation.

Scaling — standardisation $(x - \mu)/\sigma$ or min-max $(x - \min)/(\max - \min)$ . Critical for k-NN, k-means, SVM, neural nets.
Encoding categorical variables — one-hot, ordinal, target encoding.
Skew correction — log, square-root, Box-Cox.
Date features — extract day-of-week, month, hour, holiday flag.
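The two scaling formulas can be checked on a small hypothetical feature. This sketch assumes the population standard deviation for $\sigma$; some tools use the sample standard deviation instead.

```python
from statistics import mean, pstdev

# Hypothetical feature values (for example, monthly income in thousands).
x = [20, 30, 40, 50, 60]

# Standardisation: (x - mu) / sigma, using the population standard deviation.
mu, sigma = mean(x), pstdev(x)
standardised = [(v - mu) / sigma for v in x]

# Min-max scaling: (x - min) / (max - min), which maps values into [0, 1].
lo, hi = min(x), max(x)
min_max = [(v - lo) / (hi - lo) for v in x]

print([round(v, 3) for v in standardised])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
print(min_max)                              # [0.0, 0.25, 0.5, 0.75, 1.0]
```

After standardisation the feature has mean 0 and standard deviation 1; after min-max scaling it lies in the interval [0, 1].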

4. Data reduction.

Feature selection — variance threshold, correlation filtering, mutual information, LASSO.
Dimensionality reduction: PCA for model inputs; t-SNE and UMAP mainly for visualising high-dimensional data.
Sampling / aggregation — when data is too large.

5. Data discretisation. Bin continuous variables into intervals — useful for decision trees, naive Bayes, fairness analysis.

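6. Train/test split. Hold out an unseen test set (e.g., 80/20). Although listed last here, the split must happen before fitting any data-dependent step (imputation, scaling, encoding, feature selection). Fit those steps on the training set only and apply them unchanged to the test set to prevent leakage.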
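A minimal sketch of a leakage-free split, assuming a hypothetical single feature column and an 80/20 split with a fixed seed. The scaling statistics come from the training rows only and are then reused on the test rows.

```python
import random
from statistics import mean, pstdev

# Hypothetical feature column with 10 rows.
values = [12.0, 15.0, 11.0, 30.0, 18.0, 14.0, 16.0, 13.0, 17.0, 19.0]

# 1) Split first (80/20), with a fixed seed so the sketch is reproducible.
rng = random.Random(0)
indices = list(range(len(values)))
rng.shuffle(indices)
cut = int(0.8 * len(indices))
train = [values[i] for i in indices[:cut]]
test = [values[i] for i in indices[cut:]]

# 2) Fit the scaler on the training rows only.
mu, sigma = mean(train), pstdev(train)

# 3) Reuse the training statistics on the test rows; never refit on test data.
train_scaled = [(v - mu) / sigma for v in train]
test_scaled = [(v - mu) / sigma for v in test]

print(len(train), len(test))  # 8 2
```

Computing `mu` and `sigma` on all ten rows before splitting would let information from the test rows influence the training features, which is the leakage the split is meant to prevent.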

End-to-end impact. Good preprocessing tightens the feedback loop between data and model: cleaner inputs → faster training, higher accuracy, more robust generalisation, and explanations stakeholders trust.

(b) Compare and contrast ETL and ELT in cloud-based architecture. When would you use each? (10 marks)

ETL — Extract, Transform, Load.

Extract raw data from sources.
Transform it (cleansing, aggregation, joins) on a separate compute engine.
Load the cleaned data into the target warehouse.

ELT — Extract, Load, Transform.

Extract raw data from sources.
Load raw data straight into the cloud warehouse / data lake.
Transform inside the warehouse using its own compute (SQL / dbt).
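The difference in ordering can be sketched with an in-memory SQLite database standing in for the cloud warehouse. The table names and the `raw_orders` records are hypothetical; a real pipeline would use a warehouse such as Snowflake or BigQuery and a dedicated loader.

```python
import sqlite3

# Hypothetical raw records extracted from a source system.
raw_orders = [("A1", " alice ", "120.50"), ("A2", "BOB", "80"), ("A3", "alice", "19.50")]

warehouse = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# ETL: transform in application code first, then load only the cleaned rows.
cleaned = [(oid, name.strip().lower(), float(amount)) for oid, name, amount in raw_orders]
warehouse.execute("CREATE TABLE orders_etl (order_id TEXT, customer TEXT, amount REAL)")
warehouse.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)

# ELT: load the raw rows untouched, then transform with SQL inside the warehouse.
warehouse.execute("CREATE TABLE orders_raw (order_id TEXT, customer TEXT, amount TEXT)")
warehouse.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", raw_orders)
warehouse.execute(
    """
    CREATE TABLE orders_elt AS
    SELECT order_id,
           LOWER(TRIM(customer)) AS customer,
           CAST(amount AS REAL)  AS amount
    FROM orders_raw
    """
)

query = "SELECT customer, SUM(amount) FROM {} GROUP BY customer ORDER BY customer"
etl_totals = warehouse.execute(query.format("orders_etl")).fetchall()
elt_totals = warehouse.execute(query.format("orders_elt")).fetchall()
print(etl_totals)  # [('alice', 140.0), ('bob', 80.0)]
print(elt_totals)  # [('alice', 140.0), ('bob', 80.0)]
```

Both routes give the same cleaned result. Only the ELT route keeps `orders_raw` in the warehouse, so the transformation can be rewritten and re-run later without going back to the source.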

Comparison table.

Aspect	ETL	ELT
Transform location	External staging engine	Inside warehouse
Storage of raw data	Usually not kept in the warehouse	Retained in the warehouse or data lake
Compute engine	Dedicated ETL tool	Warehouse compute (Snowflake, BigQuery, Redshift)
Speed for very large data	Slower (network bottleneck)	Faster (in-place SQL)
Schema	Predefined, rigid	Schema-on-read, flexible
Reprocessing	Re-extract from source	Re-run SQL on stored raw data
Tooling	Informatica, Talend, SSIS	dbt, Snowflake, BigQuery
Best for	Structured, sensitive data (compliance)	Cloud-scale data, evolving schemas
Latency	Higher (batch)	Can be near real-time

When to use ETL.

Sensitive data must be cleansed / anonymised before it touches the warehouse (compliance: GDPR, HIPAA).
Source data formats are exotic and need heavy parsing best done in code.
Target warehouse has expensive or limited compute.
Legacy on-premise environments without elastic cloud compute.

When to use ELT.

You're on a cloud warehouse (Snowflake, BigQuery, Redshift, Databricks) with elastic, cheap compute.
You want to retain raw data for re-processing as analytics needs evolve.
Data is high volume / semi-structured (JSON, logs, events).
You want analysts to transform with SQL/dbt rather than depending on engineering ETL pipelines.

Modern trend. ELT has become the default in modern cloud data stacks (Fivetran → Snowflake → dbt → BI), because cloud warehouses make in-warehouse transformation cheap and reversible. ETL still dominates regulated, on-premise, or pre-cleansing scenarios.

Q10.

(a) Explain how regression analysis can be used in agriculture to predict crop yield based on rainfall and temperature. (10 marks)

Goal. Predict crop yield $Y$ (tonnes / hectare) from environmental inputs — rainfall $X_1$ (mm) and temperature $X_2$ (°C).

Why regression? Yield depends on continuous environmental factors. Regression learns a function $\hat{Y} = f(X_1, X_2)$ from historical seasons and uses it to forecast yield for new conditions.

Step 1 — Data collection. Multiple seasons of paired observations: monthly rainfall, average temperature, soil quality, seed variety, and the observed yield. Include weather-station data and government agricultural records.

Step 2 — EDA. Scatter plots of yield vs each predictor to check the shape of each relationship (linear, monotonic or curved). Compute correlations; check for outlier seasons (drought / flood).

Step 3 — Choose model.

Simple linear regression (one predictor): $\hat{Y} = \beta_0 + \beta_1 X_1$ .

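Multiple linear regression: $\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2$ . (Equivalently, the observed yield is $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$ , where $\varepsilon$ is the random error.)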

Polynomial / non-linear — to capture that yield often follows an inverted-U with rainfall (too little ⇒ drought; too much ⇒ flooding): $\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \beta_4 X_2^2 + \beta_5 X_1 X_2.$

Step 4 — Estimate coefficients by ordinary least squares: $\hat{\beta} = (X^T X)^{-1} X^T Y$ .
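The normal-equation formula can be sketched in plain Python. In the formula, $X$ is the design matrix: a column of ones for the intercept plus one column per predictor. The seasons below are synthetic and noise-free, generated from illustrative coefficients (1.2, 0.005, -0.08) so that the fit can be checked. They are not real agricultural data, and the helper `solve` is a hypothetical stand-in for a linear-algebra library.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


# Synthetic seasons (assumption: generated from the illustrative coefficients).
rainfall = [600, 750, 900, 820, 680, 1000]   # X1, mm
temperature = [24, 27, 25, 29, 26, 23]       # X2, degrees C
yield_t_ha = [1.2 + 0.005 * r - 0.08 * t for r, t in zip(rainfall, temperature)]

# Design matrix: a column of ones (intercept) plus the two predictors.
X = [[1.0, r, t] for r, t in zip(rainfall, temperature)]

# Normal equations: (X^T X) beta = X^T Y.
xtx = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
xty = [sum(row[i] * y for row, y in zip(X, yield_t_ha)) for i in range(3)]
beta = solve(xtx, xty)
print([round(b, 4) for b in beta])
```

Because the synthetic data contain no noise, the fit recovers approximately the generating coefficients. With real data, a library routine such as a least-squares solver is preferable to forming $X^T X$ explicitly, for numerical stability.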

Step 5 — Evaluate.

$R^2$ = fraction of yield variance explained.
RMSE on held-out seasons.
Residual plots: should show no systematic pattern; a U-shape means the model has missed a non-linearity.

Step 6 — Diagnostics and interpret. Suppose we obtain $\hat{Y} = 1.2 + 0.005\,X_1 - 0.08\,X_2$ . Interpretation:

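Holding temperature fixed, each extra mm of rainfall raises predicted yield by 0.005 t/ha.
Holding rainfall fixed, each extra °C of average temperature lowers predicted yield by 0.08 t/ha (heat stress).
The intercept, 1.2 t/ha, is the prediction at zero rainfall and 0 °C; it lies outside any realistic range and should not be read literally.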
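As a worked illustration with hypothetical inputs (800 mm of rainfall and an average temperature of 25 °C, chosen only to show the arithmetic), the equation from Step 6 gives:

$\hat{Y} = 1.2 + 0.005(800) - 0.08(25) = 1.2 + 4.0 - 2.0 = 3.2$ t/ha.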

Step 7 — Deploy. Feed forecasted weather into the model to predict next season's yield. Use the predictions to plan procurement, insurance, government subsidies and crop-rotation strategies.

Limitations and extensions.

Linear model misses extreme-weather effects — use Gradient Boosted Trees (XGBoost) or Random Forests for non-linear interactions.
Add lagged rainfall, soil moisture, NDVI from satellite imagery.
Validate across multiple geographies — yield drivers differ by region.

Real-world impact. Research bodies such as ICRISAT and the Indian Agricultural Research Institute use yield-prediction models to advise farmers on sowing dates and irrigation; yield gains on the order of 5–15% have been claimed, though results vary by crop and region.

(b) Describe the process of Exploratory Data Analysis (EDA) and its importance in data science. (10 marks)

What is EDA? Exploratory Data Analysis, a term popularised by John Tukey's 1977 book of the same name, is the practice of summarising, visualising and probing data to understand its structure, distributions, anomalies and relationships before any modelling.

Process.

1. Understand the data.

Inspect shape (df.shape), data types, first/last rows.
Read the data dictionary / source documentation.

2. Descriptive statistics.

For numeric columns: mean, median, std, min, max, percentiles (df.describe()).
For categorical columns: counts, mode, cardinality (df["col"].value_counts(), df["col"].mode(), df["col"].nunique()).

3. Univariate analysis — distribution of each variable.

Numeric: histogram, KDE, box plot — reveals skewness, multimodality, outliers.
Categorical: bar charts of frequencies.

4. Bivariate / multivariate analysis — relationships.

Numeric vs numeric: scatter plot, correlation matrix, pair plot.
Numeric vs categorical: box plot, violin plot.
Categorical vs categorical: contingency table, mosaic plot.

5. Missing-value analysis. Identify missing patterns (df.isnull().sum(), missingno matrix). Understand whether missingness is random or systematic — guides imputation.

6. Outlier detection. Box plots, z-score, IQR rule, isolation forests, scatter inspections.

7. Feature engineering hypotheses. Spot non-linear shapes, interactions, useful transformations (log, polynomial). Form hypotheses to be tested in modelling.

8. Summarise findings. Document distributions, key relationships, surprises and decisions for downstream modelling.
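The first few steps can be sketched without any third-party library, to show what calls such as `df.shape`, `df.isnull().sum()`, `df.describe()` and `value_counts()` report. The `rows` records are hypothetical, and with pandas installed each step is a one-liner.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical records; None marks a missing value.
rows = [
    {"student": "S1", "stream": "science", "marks": 78},
    {"student": "S2", "stream": "commerce", "marks": 64},
    {"student": "S3", "stream": "science", "marks": None},
    {"student": "S4", "stream": "arts", "marks": 91},
    {"student": "S5", "stream": "science", "marks": 55},
]

# Step 1: shape as (rows, columns), comparable to df.shape.
shape = (len(rows), len(rows[0]))

# Step 5: missing values per column, comparable to df.isnull().sum().
missing = {col: sum(r[col] is None for r in rows) for col in rows[0]}

# Step 2: descriptive statistics for a numeric column (missing values skipped).
marks = [r["marks"] for r in rows if r["marks"] is not None]
summary = {
    "count": len(marks),
    "mean": mean(marks),
    "median": median(marks),
    "std": round(stdev(marks), 2),
    "min": min(marks),
    "max": max(marks),
}

# Step 2: frequency counts for a categorical column.
stream_counts = Counter(r["stream"] for r in rows)

print(shape, missing, summary, stream_counts.most_common(1), sep="\n")
```

Even on five rows the summary surfaces the points EDA is meant to catch: one missing mark, and a stream distribution dominated by a single category.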

Importance / Why EDA matters.

Catches data-quality issues early — wrong units, duplicate IDs, encoding errors — before they pollute the model.
Drives feature engineering — visual patterns suggest the right transformations and interactions.
Informs model choice — linear models suit linear relationships; trees handle interactions; class imbalance hints at the need for SMOTE / class weights.
Validates assumptions — many models assume normality, independence, or homoscedasticity. EDA reveals violations.
Builds intuition — modellers must understand the domain before predicting. EDA is how data scientists learn the data.
Communicates story — exploratory charts are also the first material shared with stakeholders.

Standard tools. Python (pandas, matplotlib, seaborn, plotly), R (ggplot2, dplyr), commercial (Tableau, Power BI), automated EDA libraries (ydata-profiling, formerly pandas-profiling; Sweetviz).

Q11.

(a) Why use cross-validation, and what are the different types of cross-validation? (10 marks)

Why cross-validation (CV)?

A single train/test split has problems:

The score is noisy — a different random split could give a different result.
For small datasets the test set is too small to be reliable.
We can't use all the data both for training and evaluation.

Cross-validation systematically rotates the role of train and test across the data so that every example is used for both at some point — giving a more stable, less biased performance estimate.

Specifically it helps to:

Estimate generalisation error more accurately.
Compare candidate models fairly.
Tune hyperparameters (the gold standard inside grid/random/Bayesian search).
Detect overfitting — large gap between train and CV scores.
Make better use of small datasets — every row contributes to both training and evaluation.

Types of cross-validation.

1. k-Fold Cross-Validation. Split the data into $k$ roughly equal folds. Train on $k - 1$ folds and validate on the held-out fold. Repeat $k$ times so that each fold serves as the validation set once, then average the scores. Typical choices are $k = 5$ or $k = 10$ . This is the most common general-purpose form of CV.

2. Stratified k-Fold. Same as k-fold but preserves the class distribution in each fold. Essential for imbalanced classification so every fold has both classes.

3. Leave-One-Out (LOOCV). Special case with $k = n$ . Train on $n - 1$ rows, validate on 1. Repeat $n$ times. Nearly unbiased, but expensive and often high-variance; use for very small datasets.

4. Leave-P-Out (LPOCV). Hold out every possible subset of $p$ samples as the validation set in turn, which requires $\binom{n}{p}$ model fits. For $p > 1$ this is even more expensive than LOOCV.

5. Repeated k-Fold. Run k-fold multiple times with different random splits and average. Reduces variance of the estimate.

6. Time-Series CV (rolling / expanding window). Respects temporal order: train on $[0, t]$ , validate on $[t+1, t+h]$ , then slide the window forward. Never let future data leak into training.

7. Group / Leave-One-Group-Out CV. When data has groups (patients, users) that must not span train and test (to avoid leakage), use groups as splitting units.

8. Nested CV. Outer loop for performance estimation; inner loop for hyperparameter tuning. Prevents the bias caused by tuning on the same data used to score.

Choice. Use stratified k-fold for classification, k-fold for regression, time-series CV for sequential data, group CV when units repeat, and nested CV when the same data must be used both to tune hyperparameters and to report performance.
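Two of the splitting schemes above, plain k-fold and expanding-window time-series CV, can be sketched as index generators. The function names are hypothetical, and the k-fold version assumes the rows have already been shuffled; in practice a library splitter would normally be used.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


def expanding_window_indices(n, initial, horizon):
    """Yield (train_idx, val_idx) pairs that always validate on later rows."""
    end = initial
    while end + horizon <= n:
        yield list(range(0, end)), list(range(end, end + horizon))
        end += horizon


folds = list(k_fold_indices(10, 3))
for train, val in folds:
    print("k-fold val:", val)

for train, val in expanding_window_indices(10, initial=4, horizon=2):
    print("time-series train up to", train[-1], "val:", val)
```

In the k-fold output every row index appears in exactly one validation fold. In the time-series output each validation window starts after the last training index, so no future rows leak into training.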

(b) What is overfitting, and what techniques are used to reduce it? (10 marks)

Definition. Overfitting occurs when a model learns the training data — including its noise and idiosyncrasies — so closely that it fails to generalise to new, unseen data. Symptom: low training error but high test/validation error.

It is the opposite of underfitting, where the model is too simple to capture the underlying pattern (high error on both train and test).

Why it happens.

Model is too complex relative to dataset size (deep tree, many parameters).
Too few training samples.
Noisy or mislabeled training data.
Training too long (especially in neural networks).
Data leakage from test into train.

Techniques to reduce overfitting.

1. More training data. Often the most reliable cure. More data dilutes noise and gives the model more variety to generalise from. Data augmentation (rotations, crops for images; back-translation for text) increases the effective amount of data at low cost.

2. Simpler model / smaller capacity. Use fewer parameters, shallower trees, lower polynomial degree. Choose model complexity that matches data size.

3. Regularisation.

L2 (Ridge) — penalises $\lambda \sum w_i^2$ . Shrinks all weights toward zero, distributing influence.
L1 (Lasso) — penalises $\lambda \sum |w_i|$ . Drives many weights to exactly zero (feature selection).
Elastic Net — combines L1 and L2.
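The contrast between the two penalties shows up even with one weight. This sketch assumes a single feature, no intercept, and the objectives $\sum (y - wx)^2 + \lambda w^2$ (ridge) and $\sum (y - wx)^2 + \lambda |w|$ (lasso), for which closed-form minimisers exist. The data are hypothetical.

```python
# Hypothetical one-feature data with no intercept: y is roughly 2 * x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

sxx = sum(v * v for v in x)
sxy = sum(a * b for a, b in zip(x, y))


def ridge_weight(lam):
    """Minimiser of sum((y - w*x)^2) + lam * w^2."""
    return sxy / (sxx + lam)


def lasso_weight(lam):
    """Minimiser of sum((y - w*x)^2) + lam * |w| (soft-thresholding)."""
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx


for lam in (0, 10, 100, 200):
    print(lam, round(ridge_weight(lam), 3), round(lasso_weight(lam), 3))
```

As $\lambda$ grows, the ridge weight shrinks toward zero but never reaches it. The lasso weight becomes exactly zero once $\lambda / 2$ exceeds $|\sum xy|$ (about 59.7 here), which is why L1 acts as feature selection.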

4. Cross-validation. Does not reduce overfitting by itself, but exposes it: comparing train and CV error lets you pick the model or hyperparameters with low CV error, not just low train error.

5. Early stopping. For iterative models (gradient boosting, neural nets), stop training once validation error stops improving, that is, at the point where training error is still falling but validation error has begun to climb.
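A minimal sketch of the stopping rule, assuming a hypothetical list of per-epoch validation losses and a `patience` parameter (the number of non-improving epochs tolerated before stopping). Real frameworks provide their own callbacks for this.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch, stopping once validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # stop training; keep the weights from best_epoch
    return best_epoch


# Hypothetical validation losses: improving, then rising as overfitting sets in.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stopping_epoch(val_losses))  # 3
```

Training stops after epoch 5, the second consecutive epoch without improvement, and the model from epoch 3 is kept.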

6. Dropout (neural networks). Randomly "drop" neurons during each training step, forcing redundancy and preventing co-adaptation.

7. Pruning (decision trees). Cut back branches that don't improve validation error. Set max_depth, min_samples_leaf, etc.

8. Ensembling. Bagging (Random Forest), boosting and stacking combine many learners so that their individual errors partly cancel; the ensemble usually generalises better than a single deep tree.

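9. Weight decay and batch normalisation (deep networks). Weight decay applies an L2-style penalty through the optimiser; batch normalisation mainly stabilises training but often has a mild regularising side effect.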

10. Feature selection / dimensionality reduction. Remove noisy or redundant features (correlation filter, mutual information, PCA, autoencoders).

11. Proper validation strategy. Stratified or group splits to avoid leakage; held-out final test set never touched during tuning.

Diagnosis. Plot learning curves — train vs validation error over training-set size or epochs. A widening gap = overfitting; high train and validation error = underfitting; both low and close = well-fit.