PGDDSA Study · Semester 1

PGD01C02

Module 1 · Introduction to Data Science

What is Data Science and its Evolution

Core Titles

Key headlines and terms for quick recall

Data Science — interdisciplinary field for extracting knowledge from data
Pillars: Mathematics & Statistics + Computer Science + Domain knowledge
Evolution: Statistics → BI → Data Mining → Big Data → Data Science → AI/ML
Drivers: data abundance, cheap compute, modern algorithms, business need
Related disciplines: Statistics, Machine Learning, Big Data Analytics, AI
Fourth Industrial Revolution — data-driven decision making

Basic Idea

What it is, why it matters, how it works

What is Data Science?

Data Science is the interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and applies that knowledge across a broad range of application areas.

It combines:

Mathematics & Statistics — for modelling, inference, uncertainty.
Computer Science — for algorithms, data structures, scalable systems.
Domain knowledge — to ask the right questions and interpret results.

A data scientist's goal: turn raw data into actionable decisions.

Evolution — how we got here

Era	What	Drivers
1960s–80s — Classical statistics	Hypothesis testing, regression on small samples	Manual data, academic use
1980s–90s — Business Intelligence (BI)	Dashboards, OLAP, reporting from databases	Relational DBs, decision support
1990s–2000s — Data Mining / KDD	Pattern discovery from large databases	Cheap storage, retail/credit data
2000s — Web & Big Data	Hadoop, MapReduce, NoSQL	Web scale, social media
2010s — Data Science	Combined stats, code, viz, ML	Cheap compute, open-source (R, Python), Kaggle
2015 → — AI / Deep Learning	GPUs, neural nets, foundation models	TensorFlow / PyTorch, GPT-style LLMs
2020 → — MLOps & Generative AI	Production ML pipelines, LLMs	Cloud, transformer revolution

Why now — what made data science explode

Data abundance — IoT, mobile, web logs, sensors produce data at zettabyte scale.
Compute — multi-core CPUs, GPUs, cloud (AWS/Azure/GCP) became cheap.
Algorithms — open-source ML (scikit-learn, TensorFlow, PyTorch) democratised research.
Storage — distributed file systems and cloud object stores (HDFS, S3) made it cheap to retain almost everything.
Business pressure — competition demands data-driven decisions.

Data Science vs related fields

Data Analytics — descriptive: what happened? (subset of DS)
Machine Learning — algorithms that learn from data (tool of DS)
Big Data — handling volume, velocity, variety (infrastructure for DS)
AI — broader goal: systems that think/act intelligently (DS feeds AI)
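
A minimal Python sketch of the first two distinctions, assuming some hypothetical monthly sales figures (the data and the helper names describe and fit_line are illustrative, not from the course material). Descriptive analytics summarises what already happened; a tiny least-squares fit stands in for machine learning by learning a trend from the data and predicting an unseen month.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative data only).
months = [1, 2, 3, 4, 5, 6]
sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]


def describe(values):
    """Descriptive analytics: summarise what already happened."""
    return {"mean": mean(values), "min": min(values), "max": max(values)}


def fit_line(xs, ys):
    """ML in miniature: learn a slope and intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


summary = describe(sales)                   # what happened?
slope, intercept = fit_line(months, sales)  # learn the trend from data
forecast_month_7 = slope * 7 + intercept    # predict a month not yet seen

print(summary)
print(round(forecast_month_7, 2))
```

The summary only looks backwards, while the fitted line generalises to month 7. That is the sense in which analytics is a descriptive subset and ML is a predictive tool of data science.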

Why it matters

Data science now powers many widely used products and services: Google search, Netflix recommendations, fraud detection at banks, MRI diagnosis at hospitals, route optimisation at delivery companies, dynamic pricing at airlines.

Mind Map

Visual structure of the concept

DATA SCIENCE
├── Definition
│   └── Extract knowledge from data
├── Pillars
│   ├── Mathematics & Statistics
│   ├── Computer Science / programming
│   └── Domain expertise
├── Evolution timeline
│   ├── Classical Stats (1960s)
│   ├── BI & OLAP (1980s)
│   ├── Data Mining / KDD (1990s)
│   ├── Big Data & Hadoop (2000s)
│   ├── Modern Data Science (2010s)
│   └── AI / LLM era (2020s)
├── Why now
│   ├── Data abundance
│   ├── Cheap compute (GPU, cloud)
│   ├── Open-source ML
│   └── Business pressure
└── Related fields
    ├── ML — algorithms
    ├── Big Data — infrastructure
    ├── AI — broader goal
    └── Analytics — descriptive subset

Exam Q&A

Part A (2 marks) and Part B (20 marks) style questions

Part A (2 marks each)

Q1. Define Data Science. The interdisciplinary field that uses scientific methods, processes and algorithms to extract knowledge and insights from structured and unstructured data, combining statistics, computer science and domain expertise.

Q2. List the three core pillars of Data Science. Mathematics & Statistics, Computer Science, and Domain Knowledge.

Q3. How does Data Science differ from Machine Learning? Data Science is the broader field of extracting insight from data, spanning data wrangling, visualization, statistics and ML; Machine Learning is one family of techniques within it: algorithms that learn patterns from data.

Part B (20 marks)

Q. Trace the evolution of Data Science. Explain the factors that have driven its rapid growth in recent years and discuss its key pillars with examples.

Evolution.

Era	Key development
1960s–80s	Classical statistics on small samples (regression, ANOVA).
1980s	Business Intelligence — dashboards, OLAP, reporting from RDBMS.
1990s	Data Mining / KDD — pattern discovery from large databases (Apriori, decision trees, clustering).
2000s	Big Data — Hadoop (2006), MapReduce, NoSQL, distributed storage. Web 2.0 generated unprecedented data.
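2010s	Data Science formalised as a profession. Hal Varian predicted in 2009 that the "sexy job" of the next ten years would be statisticians; Davenport and Patil's 2012 Harvard Business Review article called data scientist "the sexiest job of the 21st century." R / Python ecosystems (pandas, scikit-learn), Kaggle competitions, and the job title "Data Scientist" became established.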
2015 →	Deep learning goes mainstream: AlexNet (2012) had already shown what deep CNNs could do, and the release of TensorFlow (2015) and PyTorch (2016) made the tooling widely accessible.
2020 →	MLOps, LLMs, foundation models (BERT, GPT, Stable Diffusion).

Drivers of rapid growth.

Data abundance. A widely quoted estimate from the 2010s puts global data creation at about 2.5 quintillion bytes per day, generated by IoT sensors, social media, mobile devices and transactions.
Cheap compute. GPUs (originally for graphics) revolutionised neural-network training; cloud computing made on-demand scaling affordable.
Storage. Distributed file systems and cloud object stores (HDFS, S3, Azure Blob) made it possible to keep "all" data for later analysis.
Open-source algorithms. scikit-learn, TensorFlow, PyTorch, Hugging Face freed cutting-edge methods from academia.
Business need. Competitive markets demand data-driven decisions — Amazon's pricing, Uber's routing, Netflix's recommendations are existential, not optional.
Talent and education. PG programs, MOOCs, bootcamps, communities (Kaggle, Stack Overflow).

Three pillars with examples.

Mathematics & Statistics. Probability, linear algebra, calculus and hypothesis testing underpin most ML algorithms. Example: a Naive Bayes spam classifier uses Bayes' theorem; PCA uses eigenvectors of the covariance matrix.
Computer Science. Algorithms, data structures, databases, distributed systems, programming. Example: a Netflix-style recommendation system might train models with distributed processing on Spark and serve results at low latency from a store such as Cassandra.
Domain knowledge. Without understanding the field — medicine, finance, retail — a data scientist can build statistically valid but practically useless models. Example: a fraud-detection model must understand legitimate transaction patterns (online vs in-store, holiday seasonality) to set meaningful thresholds.
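
The two statistical examples named under the first pillar can be sketched in plain Python. This is only an illustrative sketch: the word probabilities, the toy data points and the function names are assumptions, and power iteration is swapped in as a simple way to find the leading eigenvector of the covariance matrix (real projects would normally rely on a linear-algebra library).

```python
import math


def spam_posterior(words, p_spam, p_word_given_spam, p_word_given_ham):
    """P(spam | words) via Bayes' theorem with the naive independence assumption."""
    # Work in log space so products of many small probabilities do not underflow.
    log_spam = math.log(p_spam)
    log_ham = math.log(1.0 - p_spam)
    for word in words:
        log_spam += math.log(p_word_given_spam[word])
        log_ham += math.log(p_word_given_ham[word])
    joint_spam = math.exp(log_spam)
    joint_ham = math.exp(log_ham)
    # Normalise the two joint probabilities into a posterior.
    return joint_spam / (joint_spam + joint_ham)


def first_principal_component(rows, iterations=200):
    """Leading eigenvector of the sample covariance matrix, by power iteration.

    Assumes the data has non-zero variance; constant data would divide by zero.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0] * d  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical word likelihoods for a two-word message (illustrative numbers).
posterior = spam_posterior(
    ["free", "meeting"],
    p_spam=0.4,
    p_word_given_spam={"free": 0.5, "meeting": 0.1},
    p_word_given_ham={"free": 0.05, "meeting": 0.4},
)

# Toy points lying exactly on the line y = 2x.
direction = first_principal_component([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])

print(round(posterior, 3))
print([round(x, 3) for x in direction])
```

With these illustrative numbers the posterior is 0.02 / (0.02 + 0.012) = 0.625, and the principal direction is proportional to (1, 2), the line along which the toy data varies.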

Why all three? A pure statistician might overlook algorithmic efficiency and scalability. A pure programmer might overfit a model or misjudge statistical uncertainty. A pure domain expert lacks the modelling toolkit. The data scientist sits at the intersection of the three, which is why the role is valuable and difficult to fill.

Data Science Roles