Data science

Map of content for data science — the end-to-end workflow from raw sensor readings to a trained model that makes predictions. The path: tools → collection → labelling and ethics → storage → big data → visualization → cleaning → features → dimensionality reduction → modelling → evaluation → pipelines.

Python tooling

The core libraries the rest of the workflow is built on.

Python dictionary — key-value mapping, Python’s workhorse container.
NumPy arrays — homogeneous n-dimensional arrays, the foundation of numerical Python.
NumPy array slicing — basic slicing semantics.
NumPy advanced indexing — boolean masks and integer-array indexing.
NumPy arithmetic and comparison operations — vectorized element-wise math.
Pandas — labelled tabular data on top of NumPy.
Pandas DataFrame — the 2D labelled table.
Pandas Series — the 1D labelled array.
scikit-learn — the ML library; fit/predict everything.
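
As a rough illustration of the slicing, indexing, and vectorized-arithmetic entries above, here is a minimal sketch. The `readings` values are made up for the example and do not come from the notes.

```python
import numpy as np

readings = np.array([0.2, 1.5, -0.7, 3.1, 2.4, -1.2])  # made-up sensor values

# Basic slicing: start:stop:step, returns a view on the same memory.
first_three = readings[:3]
every_other = readings[::2]

# Boolean mask: keep only the positive readings (returns a copy).
positive = readings[readings > 0]

# Integer-array indexing: pick arbitrary positions in any order.
picked = readings[[4, 0]]

# Vectorized arithmetic: applied element-wise, no Python loop.
scaled = readings * 2.0
```

Basic slices share memory with the original array, while mask and integer-array indexing return copies. That difference matters when the result is modified in place.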

Data collection

Where raw data comes from: sensors and the field.

Data collection — overview of the data-gathering stage.
Lab vs in-the-wild data — controlled vs realistic collection.
Sensor — the general concept.
IMU — inertial measurement unit, combining accelerometer and gyroscope.
Accelerometer — measures linear acceleration.
Gyroscope — measures angular velocity.
Magnetometer — measures magnetic field.
Hall effect — the physical principle behind many magnetometers.
EEG — electroencephalography, brain electrical activity.
ECG — electrocardiography, the textbook’s running example.
Sensor fusion — combining multiple sensors for better estimates.
Web scraping — pulling data from web pages.
Metadata — data about the data.

Labelling and ethics

Turning observations into supervised training data, responsibly.

Label noise — when labels are wrong.
Crowdsourcing labels — distributed human labelling.
Majority voting (labelling) — consolidating disagreeing labels.
Automated labelling — labels from heuristics or models.
Active learning — letting the model choose what to label next.
Informed consent — the ethical baseline for human-subject data.
GDPR — EU data protection.
HIPAA — US health data protection.
PIPEDA — Canadian data protection.
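
The majority-voting entry above can be sketched in a few lines. The item ids and labels are hypothetical, and returning `None` on a tie is one possible policy rather than a standard one.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label, or None when there is no clear winner."""
    counts = Counter(labels).most_common()
    if not counts:
        return None  # nobody labelled this item
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: flag the item for review rather than guess
    return counts[0][0]

# Hypothetical annotations: item id -> labels from different annotators.
annotations = {
    "beat_001": ["normal", "normal", "abnormal"],
    "beat_002": ["abnormal", "abnormal", "abnormal"],
    "beat_003": ["normal", "abnormal"],
}
consensus = {item: majority_vote(votes) for item, votes in annotations.items()}
```

Using an odd number of annotators per item avoids ties for binary labels.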

Data formats

How data is stored on disk.

CSV — flat, human-readable tables.
JSON — nested, semi-structured records.
HDF5 — hierarchical binary format for large numerical datasets.
HDF5 group — directory-like container inside an HDF5 file.
HDF5 dataset — array-like leaf inside an HDF5 file.
h5py — the Python interface to HDF5.
gzip — general-purpose compression.
lzf — fast, HDF5-friendly compression.
szip — scientific-data compression.
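
To make the CSV, JSON, and gzip entries above concrete, here is a small round-trip sketch using only the standard library. The record fields (`subject`, `heart_rate`) are illustrative assumptions. HDF5 is left out because it needs the third-party h5py package.

```python
import csv
import gzip
import io
import json

# Hypothetical records; the field names are illustrative only.
records = [
    {"subject": "s01", "heart_rate": 62},
    {"subject": "s02", "heart_rate": 71},
]

# CSV: one flat row per record; every value is read back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["subject", "heart_rate"])
writer.writeheader()
writer.writerows(records)
csv_rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))

# JSON: nesting and numeric types survive the round trip.
json_text = json.dumps({"study": "demo", "records": records})
json_back = json.loads(json_text)

# gzip: lossless general-purpose compression of the serialized bytes.
raw = json_text.encode("utf-8")
restored = gzip.decompress(gzip.compress(raw))
```

CSV loses type information (the heart rate comes back as the string `"62"`), which is one reason typed binary formats are attractive for large numerical datasets.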

Relational databases

Structured storage with query languages.

Database management system — the system that runs queries.
Relational database — tables, rows, columns.
SQLite — the embedded RDBMS used as the worked example.
Entity-relationship diagram — modelling the schema.
Relational schema — the table-level design.
Primary key — unique row identifier.
Foreign key — reference to another table’s primary key.
ON DELETE CASCADE — referential-integrity behaviour.
SQL constraint — rules enforced at the column or table level.
Database transaction — atomic groups of operations.
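
The key, cascade, and transaction entries above fit together as in the following sketch, which uses Python's built-in `sqlite3` module. The `subject` and `recording` tables are a hypothetical schema chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

# Hypothetical schema: one subject has many recordings.
conn.execute("CREATE TABLE subject (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE recording (
           id INTEGER PRIMARY KEY,
           subject_id INTEGER NOT NULL
               REFERENCES subject(id) ON DELETE CASCADE,
           sensor TEXT NOT NULL
       )"""
)

# `with conn` groups the statements into one transaction:
# commit on success, rollback if an exception is raised.
with conn:
    conn.execute("INSERT INTO subject (id, name) VALUES (1, 'Ada')")
    conn.executemany(
        "INSERT INTO recording (subject_id, sensor) VALUES (?, ?)",
        [(1, "ECG"), (1, "IMU")],
    )

before = conn.execute("SELECT COUNT(*) FROM recording").fetchone()[0]

with conn:
    conn.execute("DELETE FROM subject WHERE id = 1")  # cascades to recording

after = conn.execute("SELECT COUNT(*) FROM recording").fetchone()[0]
```

Deleting the parent row removes its two recordings as well, so `before` is 2 and `after` is 0.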

SQL

The query language.

SQL — the language overview.
SQL DDL — data definition (CREATE, ALTER, DROP).
SQL DML — data manipulation (INSERT, UPDATE, DELETE).
SQL DCL — data control (GRANT, REVOKE).
SQL TCL — transaction control (COMMIT, ROLLBACK).
SELECT statement — querying rows.
WHERE clause — row filtering.
LIKE operator — pattern matching on strings.
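
A minimal sketch of how the statement families above look in practice, again through `sqlite3`. The `sensor` table and its rows are made up for the example. DCL is omitted because SQLite has no GRANT or REVOKE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the table (hypothetical schema).
conn.execute("CREATE TABLE sensor (name TEXT, kind TEXT, rate_hz INTEGER)")

# DML: insert rows, using ? placeholders instead of string formatting.
conn.executemany(
    "INSERT INTO sensor VALUES (?, ?, ?)",
    [
        ("acc_wrist", "accelerometer", 100),
        ("acc_ankle", "accelerometer", 50),
        ("gyro_wrist", "gyroscope", 100),
    ],
)
conn.commit()  # TCL: make the inserts permanent

# SELECT + WHERE + LIKE: % matches any run of characters.
rows = conn.execute(
    "SELECT name FROM sensor WHERE name LIKE ? AND rate_hz >= ? ORDER BY name",
    ("acc%", 100),
).fetchall()
```

Only `acc_wrist` satisfies both conditions. Passing values as parameters rather than building the query string by hand also protects against SQL injection.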

SQL data types

How columns are typed.

CHAR (SQL) — fixed-length strings.
VARCHAR (SQL) — variable-length strings.
ENUM (SQL) — fixed set of string values.
DECIMAL (SQL) — exact decimal numbers.
FLOAT (SQL) — approximate floating-point.
SQL integer types — TINYINT through BIGINT.
SQL date and time types — DATE, TIME, DATETIME, TIMESTAMP.
BLOB (SQL) — binary large object.

Big data

When the data outgrows one machine.

Big data — the concept and the three Vs.
Cluster (computing) — many machines acting as one.
Node (computing) — a single machine in a cluster.
Apache Hadoop — the open-source big-data framework.
HDFS — Hadoop’s distributed file system.
NameNode — HDFS’s metadata server.
DataNode — HDFS’s block storage server.
Replication factor — how many copies of each block.
MapReduce — the distributed programming model.
YARN — Hadoop’s cluster resource manager.
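
The MapReduce programming model can be imitated on one machine to show its three stages. The following is a single-process sketch of the idea, not Hadoop's API. The function names and input splits are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input splits; on a cluster each would live on a different node.
splits = ["big data big clusters", "data nodes store data"]
mapped = [pair for split in splits for pair in map_phase(split)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because each map call sees only its own split and each reduce call sees only one key, both stages can run in parallel across nodes.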

Visualization

Communicating what’s in the data.

Matplotlib — the library.
Pyplot — Matplotlib’s stateful interface.
Matplotlib Figure — the top-level container.
Matplotlib Axes — a single plotting region.
Matplotlib GridSpec — flexible subplot layouts.
Matplotlib coordinate transforms — data, axes, figure, display coordinates.
Matplotlib colormap — mapping scalars to colours.
Matplotlib tick locators — controlling axis tick placement.
Principles of effective visualization — design rules for clarity.

Chart types

Choosing the right chart for the question.

Bar chart — categorical comparison.
Line graph — trends over an ordered axis.
Dual-axis chart — two y-scales on one plot.
Area chart — line chart with filled region.
Stacked bar chart — part-to-whole over categories.
Pie chart — proportions of a whole.
Scatter plot — two-variable relationships.
Bubble chart — scatter plot with a size dimension.
Heat map — matrix of values as a colour grid.

Missing data

Handling holes in the table.

Missing data — the problem and why it bites.
Deletion (missing data) — drop the offending rows or columns.
Imputation — fill the holes instead of dropping.
Zero-replacement imputation — fill with zero (often wrong).
Sample-and-hold imputation — carry the last observed value forward.
Linear interpolation — straight-line fill between neighbours.
Non-linear interpolation — splines and higher-order fills.
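
Two of the imputation strategies above, sample-and-hold and linear interpolation, can be sketched in plain Python. The sketch assumes evenly spaced samples and marks missing values with `None`. The `series` values are made up.

```python
def sample_and_hold(values):
    """Replace each None with the most recent observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None until the first observation
    return filled

def linear_interpolate(values):
    """Fill interior runs of None on a straight line between neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled  # leading and trailing gaps are left as None

series = [1.0, None, None, 4.0, None, 6.0]  # made-up, evenly spaced samples
held = sample_and_hold(series)
interpolated = linear_interpolate(series)
```

Sample-and-hold produces a staircase, while interpolation follows the trend between neighbours. Which is appropriate depends on how the underlying quantity behaves between samples.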

Noise and filtering

Separating signal from artifact.

Noise (signal) — unwanted variation in a measurement.
Artifact (signal) — non-random corruption from a known source.
Low-frequency noise — drift, baseline wander.
High-frequency noise — fast fluctuations on top of the signal.
Moving-average filter — the simplest low-pass filter.
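
A moving-average filter replaces each sample with the mean of a short run of neighbours. The sketch below keeps only positions where the window fits entirely inside the signal. The `noisy` values are made up.

```python
def moving_average(signal, window):
    """Mean of each run of `window` consecutive samples."""
    if window < 1 or window > len(signal):
        raise ValueError("window must be between 1 and len(signal)")
    return [
        sum(signal[i:i + window]) / window
        for i in range(len(signal) - window + 1)
    ]

# Made-up signal: a slow ramp with alternating high-frequency jitter.
noisy = [0.0, 2.0, 1.0, 3.0, 2.0, 4.0, 3.0]
smoothed = moving_average(noisy, window=2)
```

The jitter cancels inside each window and the underlying ramp remains. A wider window smooths more but also blurs fast changes in the real signal.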

Scaling

Putting features on comparable axes.

Normalization — rescaling features to a common range.
StandardScaler — zero-mean, unit-variance scaling.
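
The two scalings above differ only in what they subtract and divide by. This pure-Python sketch shows the arithmetic, using the population standard deviation as scikit-learn's `StandardScaler` does. The `heights_cm` column is made up.

```python
import statistics

def min_max_scale(values):
    """Rescale to [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumes std > 0
    return [(v - mean) / std for v in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # made-up feature column
scaled = min_max_scale(heights_cm)
z_scores = standardize(heights_cm)
```

In a real workflow the minimum, maximum, mean, and standard deviation should be computed on the training set only and then reused on validation and test data, to avoid data leakage.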

Feature extraction

Turning raw windows of data into model inputs.

Feature extraction — the general concept.
Window (feature extraction) — the unit of computation, fixed or sliding.
Pandas rolling — the implementation engine for windowed features.
Mean (statistical) — first moment.
Standard deviation — spread.
Variance (statistical) — squared spread.
Skewness — asymmetry of the distribution.
Kurtosis — tail heaviness.
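
The windowing and moment entries above combine as in this sketch, which slides a fixed-size window over a signal and computes the statistics per window. It uses population (divide-by-n) moments and non-excess kurtosis. Pandas uses different defaults (sample standard deviation, excess kurtosis), so its numbers will not match exactly. The `signal` values are made up.

```python
import math

def window_features(window):
    """Mean, variance, std, skewness and kurtosis of one window.

    Population moments; kurtosis is the non-excess form, so a normal
    distribution would score about 3. Assumes the window is not constant.
    """
    n = len(window)
    mean = sum(window) / n
    m2 = sum((x - mean) ** 2 for x in window) / n
    m3 = sum((x - mean) ** 3 for x in window) / n
    m4 = sum((x - mean) ** 4 for x in window) / n
    return {
        "mean": mean,
        "variance": m2,
        "std": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }

def sliding_windows(signal, size, step):
    """Yield fixed-size windows; step < size makes them overlap."""
    for start in range(0, len(signal) - size + 1, step):
        yield signal[start:start + size]

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # made-up samples
features = [window_features(w) for w in sliding_windows(signal, size=4, step=2)]
```

Each window becomes one row of features, so a long raw signal turns into a table a model can consume.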

Dimensionality reduction

Fewer features, similar information.

Curse of dimensionality — why high-dimensional spaces are hostile.
Dimensionality reduction — the general problem.
Principal Component Analysis — linear projection onto maximum-variance directions.
t-SNE — non-linear embedding for visualization.

Machine learning foundations

The three paradigms.

Supervised learning — learn from labelled examples.
Unsupervised learning — find structure without labels.
Reinforcement learning — learn from rewards.

Regression

Predicting continuous targets.

Regression — the general problem.
Linear regression — straight-line fit.
Polynomial regression — higher-degree fits via feature expansion.
Loss function — the objective being minimized.
Mean squared error — the standard regression loss.
Gradient — direction of steepest increase.
Gradient descent — the optimization algorithm.
Learning rate — step size for gradient descent.
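
The loss, gradient, and learning-rate entries above come together in gradient descent on a straight-line fit. This is a minimal sketch. The data, learning rate, and step count are assumptions chosen so the toy problem converges.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w, b, xs, ys):
    """Partial derivatives of the MSE with respect to w and b."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Repeatedly step against the gradient, starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = gradient(w, b, xs, ys)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # made-up data on the line y = 2x + 1
w, b = fit_line(xs, ys)
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small needs many more steps to reach the same fit.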

Classification

Predicting discrete labels.

Classification (ML) — the general problem.
Sigmoid function — the squashing function for binary probabilities.
Logistic regression — linear model with sigmoid output.
Binary cross-entropy — the standard classification loss.
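
The three pieces above fit together as follows: a linear score goes through the sigmoid to become a probability, and binary cross-entropy measures how well those probabilities match the labels. The weights and inputs in this sketch are made up, and the clipping constant `eps` is an assumption to keep the logarithm finite.

```python
import math

def sigmoid(z):
    """Map any real number to a probability, without overflowing."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, x):
    """Logistic regression: sigmoid of a linear combination of the features."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

p = predict_proba([2.0, -1.0], 0.0, [1.0, 2.0])  # linear score is 0
uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
```

A score of zero maps to probability 0.5, and predictions that are confident and correct give a lower loss than uninformed ones.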

Model evaluation

Telling whether a model actually works.

Training set — what the model learns from.
Validation set — what tuning decisions are made on.
Test set — the held-out final evaluation.
Train-test split — partitioning the data.
K-fold cross-validation — averaging over multiple splits.
Hyperparameter — tuning knob set outside training.
Data leakage — when test info bleeds into training; the silent killer.
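
K-fold cross-validation amounts to bookkeeping over indices: every sample is held out exactly once. The sketch below generates the splits only. The function name and seed are assumptions, and scikit-learn's own splitter may order folds differently.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so splits are reproducible
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
```

Any preprocessing that learns from the data, such as scaling, should be fitted inside each fold on the training indices only. Otherwise information from the held-out samples leaks into training.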

Classification metrics

Once a model predicts labels, scoring it.

Confusion matrix — the four-cell summary of binary predictions.
True positive — correctly predicted positive.
True negative — correctly predicted negative.
False positive — predicted positive, actually negative.
False negative — predicted negative, actually positive.
Accuracy (ML) — fraction correct overall.
Precision (ML) — of predicted positives, how many were right.
Recall (sensitivity) — of actual positives, how many were caught.
Specificity (TNR) — of actual negatives, how many were rejected.
False positive rate (FPR) — 1 − specificity.
F1 score — harmonic mean of precision and recall.
Decision threshold — where probability becomes a label.
ROC curve — TPR vs FPR over all thresholds.
AUC — area under the ROC curve.
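
Most of the metrics above are ratios of the four confusion-matrix cells. This sketch computes them for made-up labels and probabilities at a threshold of 0.5. It assumes every denominator is nonzero.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Derive the standard ratios; assumes no denominator is zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "fpr": 1.0 - specificity,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Made-up labels and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1]
threshold = 0.5
y_pred = [1 if p >= threshold else 0 for p in y_prob]

counts = confusion_counts(y_true, y_pred)
scores = classification_metrics(*counts)
```

Moving the threshold trades recall against the false positive rate. Sweeping it over all values and plotting the resulting pairs traces out the ROC curve.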

Pipelines

Composing the workflow.

scikit-learn pipeline — chained preprocessing and modelling steps.
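
A minimal sketch of a scikit-learn pipeline that chains scaling and a classifier. The synthetic dataset, step names, and split sizes are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real feature table.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),      # fitted on the training data only
    ("model", LogisticRegression()),  # trained on the scaled features
])
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
```

Because the scaler is fitted inside `pipeline.fit`, its statistics come from the training split alone, which is how pipelines help guard against data leakage.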

Connects to Data structures (NumPy arrays and Pandas DataFrames are concrete data structures; SQL tables and hash-table indices sit underneath) and to Differential equations (gradient descent is the forward-Euler discretization of the gradient-flow ODE $\dot{w} = -\nabla J(w)$, with the learning rate as the step size, and the logistic curve is the solution to the logistic ODE). Linear algebra underpins PCA and every linear model here.

Idriss Rami — Notes
