PGD24110103A

November 2024 — Solved

Machine Learning for Data Science
PGD01C03
2 hours 30 minutes
50 marks
Solved

End-Semester Examination, First Semester PG Diploma in Data Science and Analytics (2024 Admission).

Part A — Short Essay

Answer any five questions. Each question carries 2 marks. (5 × 2 = 10 Marks)


Q1. Define machine learning. What are the different types of machine learning?

Machine Learning is a branch of AI that builds systems which learn patterns from data to make predictions or decisions, without being explicitly programmed for the task (Tom Mitchell, 1997).

Types:

  1. Supervised — labelled data; learn f:XYf : X \to Y (classification, regression).
  2. Unsupervised — unlabelled data; find structure (clustering, dimensionality reduction).
  3. Reinforcement — agent interacts with an environment, learns from rewards (Q-learning, policy gradients).
  4. (Semi-supervised, Self-supervised — hybrids that mix labelled and unlabelled data.)

Q2. What is dimensionality reduction?

Dimensionality reduction is the process of representing high-dimensional data Rn\mathbb{R}^n in a lower-dimensional space Rk\mathbb{R}^k (knk \ll n) while preserving as much information as possible. It combats the curse of dimensionality, removes redundancy, speeds up training and aids visualisation.

Examples: PCA, LDA, t-SNE, UMAP, autoencoders.


Q3. Define F1-score.

The F1-score is the harmonic mean of precision and recall:

F1=2precisionrecallprecision+recall.F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.

It ranges 0 to 1 and gives a single number balancing false positives and false negatives — useful for imbalanced classification where accuracy is misleading.


Q4. What is the difference between precision and recall?

  • Precision = TPTP+FP\dfrac{TP}{TP + FP} — of items the model predicted positive, how many are truly positive. "How careful?"
  • Recall = TPTP+FN\dfrac{TP}{TP + FN} — of items truly positive, how many did the model find. "How complete?"

Precision matters when false positives are costly (spam filter). Recall matters when false negatives are costly (cancer screening).


Q5. What contributes to support and confidence?

In association rule mining for a rule ABA \Rightarrow B:

  • Support = count(AB)N\dfrac{\text{count}(A \cup B)}{N} — fraction of transactions containing both AA and BB. Measures the rule's frequency / popularity.
  • Confidence = count(AB)count(A)=P(BA)\dfrac{\text{count}(A \cup B)}{\text{count}(A)} = P(B \mid A) — fraction of AA-transactions that also contain BB. Measures the rule's reliability.

Q6. What makes RNNs suitable for processing sequential data?

Recurrent Neural Networks maintain a hidden state hth_t updated at each time step from the previous state and current input: ht=f(Wht1+Uxt)h_t = f(W h_{t-1} + U x_t). This recurrence gives them memory of past inputs, allowing them to model dependencies that depend on order — exactly what sequential data (text, speech, time-series) requires. Variants like LSTM and GRU add gating to handle long-range dependencies.


Q7. Why is the activation function important in a perceptron?

The activation function introduces non-linearity into the perceptron's output. Without it, a stack of layers would collapse into a single linear map, no matter how deep — incapable of learning non-linear patterns like XOR. Non-linear activations (sigmoid, tanh, ReLU) allow neural networks to approximate any continuous function (universal approximation theorem).


Part B — Long Essay

Answer any two questions. Each question carries 20 marks. (2 × 20 = 40 Marks)


Q8.

(a) Analyse the problem of email spam classification using ML. Explain the stages of data preprocessing. (10 marks)

Problem framing. Given an email ee, classify it as spam (1) or ham (0). This is a binary supervised classification problem. Performance is judged by precision (avoid throwing away genuine email), recall (catch spam), and the F1-score.

Typical pipeline. Email text + metadata → preprocessing → feature extraction → train a classifier (Naive Bayes, Logistic Regression, SVM or Transformer) → evaluate on held-out emails → deploy in mail-server filter.

Stages of data preprocessing for spam classification.

1. Data collection. Gather a labelled corpus — e.g., the Enron dataset, SpamAssassin, or in-house user reports. Include both spam and ham, ideally balanced or class-weighted.

2. Lowercasing. Convert all text to lowercase so "FREE" and "free" map to one token.

3. Removing HTML, URLs, email headers. Spam often contains HTML tags and tracking links. Use regex or a parser (BeautifulSoup) to strip them — but optionally keep a count of links as a feature, because spam tends to have many.

4. Tokenisation. Split text into tokens (words, sub-words). Use NLTK, SpaCy or transformer tokenisers.

5. Removing punctuation, digits, special chars. Or replace them with placeholder tokens like <NUM> if numeric patterns matter.

6. Stop-word removal. Drop high-frequency, low-information words ("the", "and"). Optional — sometimes stop-words help spam detection.

7. Stemming / Lemmatisation. Reduce words to roots ("running" → "run") so morphological variants share weight.

8. Handle imbalance. Often more ham than spam. Use class weights, oversample spam (SMOTE), or undersample ham.

9. Feature engineering.

  • Bag-of-Words / TF-IDF vectorisation.
  • n-grams ("free money" is more telling than the unigrams).
  • Meta-features: number of links, exclamation marks, ALL-CAPS ratio, length, sender domain, presence of attachments.

10. Train / test split. Stratified split to preserve spam ratio. Optionally use temporal split (older → train, newer → test) because spam content drifts.

11. Scaling / normalisation. Apply standard scaling to numeric meta-features if mixing with TF-IDF.

Modelling and evaluation. Train Multinomial Naive Bayes (a classical, fast baseline), Logistic Regression with L1, SVM with linear kernel, or fine-tune a small Transformer. Report precision, recall, F1, ROC-AUC. Iterate by examining false positives and false negatives.

(b) Various error calculation techniques in ML models (10 marks)

Regression error metrics.

MetricFormulaNotes
MAE (Mean Absolute Error)$\dfrac{1}{n}\sumy_i - \hat y_i
MSE (Mean Squared Error)1n(yiy^i)2\dfrac{1}{n}\sum (y_i - \hat y_i)^2Penalises large errors; differentiable.
RMSEMSE\sqrt{MSE}Same units as yy. Most reported.
MAPE$\dfrac{100}{n}\sum \left\dfrac{y_i - \hat y_i}{y_i}\right
R2R^21SSresSStot1 - \dfrac{SS_{res}}{SS_{tot}}Fraction of variance explained; 1 = perfect, 0 = mean baseline.
Adjusted R2R^2Penalises adding non-informative predictors.

Classification error metrics.

Built on the confusion matrix of TP, TN, FP, FN.

MetricFormulaMeaning
AccuracyTP+TNN\dfrac{TP + TN}{N}Overall correct rate. Misleading under imbalance.
PrecisionTPTP+FP\dfrac{TP}{TP + FP}Of predicted positives, how many real.
Recall (Sensitivity)TPTP+FN\dfrac{TP}{TP + FN}Of real positives, how many found.
SpecificityTNTN+FP\dfrac{TN}{TN + FP}True-negative rate.
F12PRP+R\dfrac{2 \cdot P \cdot R}{P + R}Balanced harmonic mean.
Fβ\betaweights recall vs precisionUse F2_2 when recall matters more.
ROC-AUCArea under TPR vs FPR curveThreshold-independent ranking quality.
PR-AUCArea under Precision-Recall curveBetter than ROC under heavy imbalance.
Log loss1n[ylogp^+(1y)log(1p^)]-\dfrac{1}{n}\sum [y \log \hat p + (1-y) \log(1-\hat p)]Penalises confident-wrong predictions.

Probabilistic / Information-theoretic.

  • Cross-entropy — multiclass log loss.
  • KL divergence — distance between predicted and true distributions.
  • Brier score — mean squared error on predicted probabilities.

Generalisation diagnostics.

  • Train vs validation gap — wide gap = overfitting.
  • Learning curves — error vs training size to diagnose bias vs variance.
  • Cross-validation mean and standard deviation.

Domain-specific.

  • Object detection: IoU, mAP.
  • Translation: BLEU, METEOR.
  • Recommender: NDCG, MAP, MRR.

Choosing the right metric. Tie the metric to the business cost: in fraud detection optimise recall + PR-AUC because FN is expensive; in spam optimise precision; in forecasting often RMSE. Wrong metric = wrong model wins.


Q9.

(a) Heart-disease prediction model using supervised learning — algorithm choice, feature selection, training, evaluation. (10 marks)

Problem. Given a patient's demographic, clinical and lab data, predict probability of heart disease (binary classification). Standard dataset: UCI Cleveland.

Step 1 — Data understanding. Features include age, sex, chest pain type, resting blood pressure, cholesterol, fasting blood sugar, resting ECG, max heart rate, exercise-induced angina, ST depression, slope, # major vessels, thalassemia.

Step 2 — Preprocessing.

  • Impute missing values (median for numeric, mode for categorical, KNN imputation for nuanced cases).
  • One-hot encode categorical variables.
  • Standardise numeric features (mean 0, std 1) — crucial for distance / gradient models.
  • Stratified train/test split (e.g., 80/20).

Step 3 — Feature selection. Methods used together:

  • Domain knowledge — cardiologists' guidance on which markers matter (cholesterol, ST depression, age).
  • Univariate filter — Chi-square / mutual information against target.
  • Embedded methods — L1 (Lasso) logistic regression zeros out unimportant features.
  • Wrapper / RFE — Recursive Feature Elimination using a model's coefficients.
  • Model importance — Random Forest / XGBoost feature importance.

Keep ~8–10 most informative features to keep model interpretable.

Step 4 — Algorithm choice. Compare candidates:

AlgorithmProsCons
Logistic RegressionInterpretable, gives probabilities, fastLinear only
Decision TreeInterpretable, handles non-linearOverfits
Random ForestRobust, captures interactions, good baselineLess interpretable
Gradient Boosting (XGBoost)Often best accuracyTuning effort
SVM (RBF)Strong with small dataSlow on big data, no probs by default
KNNSimpleSensitive to scaling, slow at predict time

For clinical deployment, Logistic Regression or Random Forest is often chosen for the balance of accuracy and interpretability (doctors want to know why).

Step 5 — Training.

  • Define loss (binary cross-entropy / Gini for trees).
  • Fit on training set; tune hyperparameters via grid/random search inside stratified k-fold CV (k = 5).
  • Address class imbalance with class_weight='balanced' or SMOTE oversampling.

Step 6 — Evaluation. Heart disease is a high-cost FN problem — missing a sick patient is dangerous. Use:

  • Recall (Sensitivity) as the headline metric.
  • PR-AUC, ROC-AUC for threshold-free quality.
  • F1 as a balanced summary.
  • Confusion matrix at chosen threshold.
  • Calibration plot so predicted probabilities are honest.

Tune the decision threshold to achieve, say, 90% recall while keeping precision acceptable.

Step 7 — Interpret & deploy. SHAP values reveal each feature's contribution to a specific patient's risk score, which clinicians can audit. Deploy as a clinician decision-support tool — never autonomous diagnosis.

(b) Concept of hyperplanes in SVM (10 marks)

Hyperplane. In Rn\mathbb{R}^n, a hyperplane is the (n1)(n-1)-dimensional set H={x:wTx+b=0}H = \{x : w^T x + b = 0\} with wRn{0}w \in \mathbb{R}^n \setminus \{0\}. The vector ww is normal (perpendicular) to HH.

  • R2\mathbb{R}^2: a line.
  • R3\mathbb{R}^3: a plane.
  • Rn\mathbb{R}^n: an (n1)(n-1)-flat.

It divides space into two half-spaces H+={x:wTx+b0}H^+ = \{x : w^T x + b \ge 0\} and H={x:wTx+b0}H^- = \{x : w^T x + b \le 0\}.

SVM idea. For binary classification with yi{1,+1}y_i \in \{-1, +1\}, find the hyperplane that separates the classes with the largest margin. Among infinitely many separating hyperplanes, the one with the biggest distance to the nearest points generalises best.

Geometric margin. The distance from a point xix_i to hyperplane wTx+b=0w^T x + b = 0 is wTxi+bw\dfrac{|w^T x_i + b|}{\|w\|}. If we rescale ww and bb so that the closest points satisfy wTxi+b=1|w^T x_i + b| = 1, then the margin (distance between the two parallel boundaries wTx+b=±1w^T x + b = \pm 1) is

margin=2w.\text{margin} = \frac{2}{\|w\|}.

Hard-margin SVM (linearly separable).

Maximise margin = minimise w\|w\|, with correct classification: minw,b  12w2subject toyi(wTxi+b)1    i.\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 \;\; \forall i.

Soft-margin SVM (non-separable data). Introduce slack ξi0\xi_i \ge 0 allowing some misclassification, controlled by penalty CC: minw,b,ξ  12w2+Ciξi,yi(wTxi+b)1ξi.\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i, \quad y_i(w^T x_i + b) \ge 1 - \xi_i. CC large ⇒ low tolerance, narrow margin. CC small ⇒ tolerant, wider margin.

Support vectors. The data points satisfying yi(wTxi+b)=1y_i (w^T x_i + b) = 1 (or violating with ξi>0\xi_i > 0) are support vectors. They lie on the margin boundaries and entirely determine the optimal hyperplane — removing any other point doesn't change the solution. This makes SVM memory-efficient at prediction time.

Kernel trick — non-linear hyperplanes. When classes are not linearly separable in the input space, map to a higher-dim feature space via ϕ(x)\phi(x) and find a linear hyperplane there — which corresponds to a non-linear boundary in the original space. Computing ϕ(x)\phi(x) explicitly is expensive, so we use kernels K(x,x)=ϕ(x),ϕ(x)K(x, x') = \langle \phi(x), \phi(x') \rangle:

  • Linear: K=xTxK = x^T x'
  • Polynomial: K=(xTx+1)dK = (x^T x' + 1)^d
  • RBF / Gaussian: K=exp(γxx2)K = \exp(-\gamma \|x - x'\|^2)
  • Sigmoid: K=tanh(κxTx+c)K = \tanh(\kappa x^T x' + c)

Decision function. New point classified by y^=sign(wTx+b)\hat y = \text{sign}(w^T x + b), with confidence proportional to distance from the hyperplane.


Q10.

(a) ECLAT algorithm for frequent itemset mining — compare with Apriori, illustrate with example. (10 marks)

Frequent Itemset Mining problem. Given a transaction database TT and minimum support σ\sigma, find all itemsets II whose support sup(I)σ\text{sup}(I) \ge \sigma. Used in market-basket analysis, web-click pattern mining, recommendation.

ECLAT — Equivalence CLAss Transformation (Zaki, 1997). Mines frequent itemsets using a depth-first search on a vertical (tid-list) representation of data.

Key idea — vertical layout. Instead of storing each transaction as a row of items (horizontal), store each item as the set of transaction IDs (TIDs) containing it:

ItemTID-set
Bread{1, 2, 4, 5}
Milk{1, 2, 3, 5}
Eggs{2, 4, 5}

The support of an itemset {X,Y}\{X, Y\} is then the size of the intersection of the TID-sets — pure set operations, no DB scan.

Algorithm.

ECLAT(prefix, items_with_tidlists, σ):
    for i in items_with_tidlists:
        if |tidlist(i)| ≥ σ:
            output (prefix ∪ {i}, |tidlist(i)|)
            new_items = []
            for j after i in items_with_tidlists:
                t = tidlist(i) ∩ tidlist(j)
                if |t| ≥ σ:
                    new_items.append((j, t))
            ECLAT(prefix ∪ {i}, new_items, σ)

Comparison with Apriori.

AspectAprioriECLAT
Data layoutHorizontalVertical (tid-lists)
Search strategyBreadth-first, level-wiseDepth-first
Support computationRepeated DB scansSet intersections
Candidate generationCombine k-itemsets to form (k+1) candidates, pruneIntersect tid-lists of items sharing a prefix
MemorySmaller per pass (candidates only)Larger — must hold tid-lists
Speed on dense dataSlow — many scansFast — fewer scans
Speed on sparse, very large dataAcceptableCan blow up memory
Disk-friendlyYesLess so

Example. Suppose 5 transactions:

TIDItems
1Bread, Milk
2Bread, Milk, Eggs
3Milk
4Bread, Eggs
5Bread, Milk, Eggs

Minimum support σ=2\sigma = 2.

Vertical layout: B = {1,2,4,5}, M = {1,2,3,5}, E = {2,4,5}.

Singletons (all ≥ 2): sup(B)=4, sup(M)=4, sup(E)=3.

Pairs by intersection:

  • B ∩ M = {1,2,5}, sup = 3 ✓
  • B ∩ E = {2,4,5}, sup = 3 ✓
  • M ∩ E = {2,5}, sup = 2 ✓

Triple: B ∩ M ∩ E = {2,5}, sup = 2 ✓.

All frequent itemsets: {B}, {M}, {E}, {B,M}, {B,E}, {M,E}, {B,M,E} — found with no extra DB scans, just three set intersections per branch.

(b) Upper Confidence Bound (UCB) for the exploration–exploitation dilemma (10 marks)

Exploration–exploitation dilemma. In multi-armed bandits and reinforcement learning, at each step we must choose between:

  • Exploit — play the action with the highest current estimated reward.
  • Explore — try other actions in case a less-tested one is actually better.

Always exploiting locks in to early estimates that may be wrong. Always exploring never collects reward. We need a principled balance.

UCB idea. Be optimistic in the face of uncertainty — for each arm, compute an upper confidence bound on its true reward and pick the arm with the highest bound. This naturally explores arms we haven't tried much (their bound is high because uncertainty is high) and exploits arms we trust (their estimate is high and uncertainty is low).

UCB1 algorithm (Auer et al., 2002). For each round tt pick:

at=argmaxa[rˉa+2lntNa]a_t = \arg\max_a \left[ \bar{r}_a + \sqrt{\frac{2 \ln t}{N_a}} \right]

where rˉa\bar{r}_a is the average reward observed from arm aa so far and NaN_a is the number of times aa has been played.

  • First term rˉa\bar{r}_a — exploitation.
  • Second term 2lnt/Na\sqrt{2 \ln t / N_a} — exploration bonus; large when NaN_a is small (arm under-tested) and slowly grows with tt.

Why it works. Hoeffding's inequality shows the true mean of arm aa lies below rˉa+2lnt/Na\bar{r}_a + \sqrt{2 \ln t / N_a} with high probability. So the chosen arm is either:

  • the truly best arm (we exploit correctly), or
  • an under-explored arm whose uncertainty pulled it up (we explore — and will tighten its bound over time).

Regret bound. UCB1 achieves O(logt)O(\log t) regret — provably optimal up to constants.

Practical worked example. Three arms, after some plays:

ArmPlays NaN_aAvg reward rˉa\bar r_aBonus 2ln100/Na\sqrt{2\ln 100 / N_a}UCB
1600.700.391.09
2300.600.551.15
3100.500.961.46

Even though Arm 3 currently has the lowest average reward, its high uncertainty gives it the highest UCB → UCB explores it next.

Variants and uses.

  • UCB2, UCB-V (variance-aware), KL-UCB (tighter bounds).
  • Used in A/B testing, online recommender systems, ad selection, hyperparameter search, and as the value back-up in Monte Carlo Tree Search (e.g., AlphaGo).
  • Reinforcement-learning extensions include UCRL for MDPs.

Q11.

(a) How do Artificial Neural Networks (ANNs) learn from data? Explain the role of backpropagation. (10 marks)

Architecture. An ANN is a layered graph of artificial neurons. Each neuron in layer \ell computes: z()=W()a(1)+b(),a()=σ(z()),z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}, \quad a^{(\ell)} = \sigma(z^{(\ell)}), where σ\sigma is a non-linear activation (ReLU, sigmoid, tanh). The full network is a composition of these layers.

Learning is optimisation. Define a loss function L(y^,y)L(\hat y, y) that measures how wrong the network's prediction is — e.g., MSE for regression, cross-entropy for classification. Training means finding weights θ={W(),b()}\theta = \{W^{(\ell)}, b^{(\ell)}\} that minimise the average loss over the training set: θ=argminθ  1Ni=1NL(y^(xi;θ),yi).\theta^* = \arg\min_\theta \; \frac{1}{N} \sum_{i=1}^N L\big(\hat y(x_i; \theta), y_i\big).

This is done by gradient descent — repeatedly nudge θ\theta in the direction of steepest descent of LL.

The training loop.

  1. Forward pass — feed input xx through layers to compute y^\hat y and loss LL.
  2. Backward pass (backpropagation) — compute L/θ\partial L / \partial \theta for every weight.
  3. UpdateθθηL/θ\theta \leftarrow \theta - \eta \, \partial L / \partial \theta (η\eta = learning rate).
  4. Repeat over mini-batches for many epochs.

Backpropagation.

Goal. Compute LW()\dfrac{\partial L}{\partial W^{(\ell)}} and Lb()\dfrac{\partial L}{\partial b^{(\ell)}} for every layer efficiently.

Idea. Apply the chain rule layer by layer, starting at the output and moving backward. Reuse intermediate results saved from the forward pass.

Equations (for a feed-forward net with loss LL at the output layer LL):

  1. Output error: δ(L)=aLσ(z(L))\delta^{(L)} = \nabla_a L \odot \sigma'(z^{(L)}).

  2. Propagate: δ()=(W(+1))Tδ(+1)σ(z())\delta^{(\ell)} = \big( W^{(\ell+1)} \big)^T \delta^{(\ell+1)} \odot \sigma'(z^{(\ell)}).

  3. Gradients: LW()=δ()(a(1))T,Lb()=δ().\frac{\partial L}{\partial W^{(\ell)}} = \delta^{(\ell)} \big( a^{(\ell-1)} \big)^T, \quad \frac{\partial L}{\partial b^{(\ell)}} = \delta^{(\ell)}.

Why it's brilliant. Without backprop, computing gradients for an LL-layer network with nn weights would cost O(n2)O(n^2) per training step. Backprop reduces it to O(n)O(n) by sharing computation through the chain rule — making deep networks feasible.

Why it's important.

  • Backpropagation is the engine of modern deep learning. Every CNN, RNN, Transformer, GPT-style model trains by gradient descent + backprop.
  • It allows networks with millions–billions of parameters to be trained.
  • It is generic — any differentiable computation graph can be trained by backprop (automatic differentiation in PyTorch/TensorFlow makes this nearly free).

Practical concerns.

  • Vanishing / exploding gradients with deep nets — addressed by ReLU, batch normalisation, residual connections, gradient clipping.
  • Mini-batch / Stochastic Gradient Descent for scalability.
  • Optimisers — Adam, RMSProp, AdaGrad — accelerate and stabilise the update step.

(b) Basic structure and working principle of a perceptron in a neural network. (10 marks)

History. The Perceptron is the simplest artificial neuron, introduced by Frank Rosenblatt (1958) — the building block of modern neural networks.

Structure.

A perceptron takes nn real inputs x1,x2,,xnx_1, x_2, \dots, x_n, each multiplied by a learned weight wiw_i, summed, plus a bias bb, and passed through an activation function:

  x1 ─w1─┐
  x2 ─w2─┤   z = Σ wᵢxᵢ + b      a = σ(z) → output
  ...    ┼─→ (──────────────)──→ ( σ )──→
  xn ─wn─┘                     bias b

Components.

  • Inputs x1,,xnx_1, \dots, x_n — features of one data sample.
  • Weights w1,,wnw_1, \dots, w_n — learned during training; encode importance and sign of each feature.
  • Bias bb — shifts the decision boundary, allowing it to not pass through origin.
  • Weighted sum (pre-activation): z=i=1nwixi+b=wTx+bz = \sum_{i=1}^n w_i x_i + b = w^T x + b.
  • Activation function σ\sigma — originally a step function σ(z)=1 if z0 else 0\sigma(z) = 1 \text{ if } z \ge 0 \text{ else } 0. Modern versions use sigmoid, tanh, or ReLU.
  • Output y^=σ(z)\hat y = \sigma(z).

Geometric interpretation. wTx+b=0w^T x + b = 0 defines a hyperplane that splits feature space into two regions. The perceptron classifies a point by which side of the hyperplane it lies on.

Working principle — learning.

The perceptron learning rule updates weights based on errors. For each training example (xi,yi)(x_i, y_i) with yi{0,1}y_i \in \{0, 1\}:

  1. Compute prediction y^i=σ(wTxi+b)\hat y_i = \sigma(w^T x_i + b).
  2. Update: ww+η(yiy^i)xiw \leftarrow w + \eta (y_i - \hat y_i) x_i bb+η(yiy^i)b \leftarrow b + \eta (y_i - \hat y_i)
  3. Repeat over the dataset until convergence (or max epochs).

If the data is linearly separable, the Perceptron Convergence Theorem (Rosenblatt, Novikoff) guarantees this rule will find a separating hyperplane in finitely many steps.

Limitation — XOR problem. A single perceptron can only learn linearly separable patterns. It famously fails on the XOR function — there is no straight line that separates {(0,1),(1,0)}\{(0,1),(1,0)\} from {(0,0),(1,1)}\{(0,0),(1,1)\}. This limitation, pointed out by Minsky and Papert (1969), caused the "AI winter" until multi-layer perceptrons (MLPs) with non-linear activations and backpropagation were developed to overcome it.

Importance today.

  • The perceptron is the conceptual atom of every neural network — a deep network is just many perceptrons stacked and connected.
  • Logistic regression is a perceptron with sigmoid activation and cross-entropy loss.
  • Understanding the perceptron clarifies linear classification, decision boundaries, weight updates, and the geometry of learning.