Map of content for data science — the end-to-end workflow from raw sensor readings to a trained model that makes predictions. The path: tools → collection → labelling and ethics → storage → big data → visualization → cleaning → features → dimensionality reduction → modelling → evaluation → pipelines.
Python tooling
The core libraries the rest of the workflow is built on.
- Python dictionary — key-value mapping, Python’s workhorse container.
- NumPy arrays — homogeneous n-dimensional arrays, the foundation of numerical Python.
- NumPy array slicing — basic slicing semantics.
- NumPy advanced indexing — boolean masks and integer-array indexing.
- NumPy arithmetic and comparison operations — vectorized element-wise math.
- Pandas — labelled tabular data on top of NumPy.
- Pandas DataFrame — the 2D labelled table.
- Pandas Series — the 1D labelled array.
- scikit-learn — the ML library; a consistent fit/predict interface across estimators.
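A minimal sketch of the containers listed above, assuming NumPy and Pandas are installed. The readings, the `accel_x` column name, and the 0.8 cut-off are made up for illustration.

```python
import numpy as np
import pandas as pd

# Python dictionary: key-value metadata for a hypothetical recording.
sample_info = {"sensor": "imu", "rate_hz": 50}

# Hypothetical sensor readings; the values are invented for this example.
readings = np.array([0.1, 0.5, 0.9, 1.3, 1.7])

first_three = readings[:3]          # basic slicing: elements 0..2 (a view)
high = readings[readings > 0.8]     # boolean mask: matching elements (a copy)
picked = readings[[0, 4]]           # integer-array indexing: elements 0 and 4
doubled = readings * 2              # vectorized element-wise arithmetic

# Pandas wraps the same data with labels.
series = pd.Series(readings, name="accel_x")                         # 1D labelled array
frame = pd.DataFrame({"accel_x": readings, "high": readings > 0.8})  # 2D labelled table
```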
Data collection
Where raw data comes from: sensors and the field.
- Data collection — overview of the data-gathering stage.
- Lab vs in-the-wild data — controlled vs realistic collection.
- Sensor — the general concept.
- IMU — inertial measurement unit, combining accelerometer and gyroscope (and often a magnetometer).
- Accelerometer — measures linear acceleration.
- Gyroscope — measures angular velocity.
- Magnetometer — measures magnetic field.
- Hall effect — the physical principle behind many magnetometers.
- EEG — electroencephalography, brain electrical activity.
- ECG — electrocardiography, the textbook’s running example.
- Sensor fusion — combining multiple sensors for better estimates.
- Web scraping — pulling data from web pages.
- Metadata — data about the data.
Labelling and ethics
Turning observations into supervised training data, responsibly.
- Label noise — when labels are wrong.
- Crowdsourcing labels — distributed human labelling.
- Majority voting (labelling) — consolidating disagreeing labels.
- Automated labelling — labels from heuristics or models.
- Active learning — letting the model choose what to label next.
- Informed consent — the ethical baseline for human-subject data.
- GDPR — EU data protection.
- HIPAA — US health data protection.
- PIPEDA — Canadian data protection.
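Majority voting from the list above can be sketched in a few lines. The helper name `majority_vote`, the clip IDs, and the tie policy (return `None` so the item can be sent back for relabelling) are assumptions for illustration, not a prescribed method.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label, or None if there is no single winner."""
    counts = Counter(labels).most_common()
    if not counts:
        return None  # nothing was labelled
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie for first place: leave unresolved
    return counts[0][0]

# Hypothetical crowdsourced annotations, one list of votes per item.
annotations = {
    "clip_1": ["walk", "walk", "run"],
    "clip_2": ["sit", "stand"],
}
consensus = {item: majority_vote(votes) for item, votes in annotations.items()}
```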
Data formats
How data is stored on disk.
- CSV — flat, human-readable tables.
- JSON — nested, semi-structured records.
- HDF5 — hierarchical binary format for large numerical datasets.
- HDF5 group — directory-like container inside an HDF5 file.
- HDF5 dataset — array-like leaf inside an HDF5 file.
- h5py — the Python interface to HDF5.
- gzip — general-purpose compression.
- lzf — fast, HDF5-friendly compression.
- szip — scientific-data compression.
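To make the flat-versus-nested distinction concrete, here is a small sketch that writes the same hypothetical records as CSV and as JSON, then gzip-compresses the JSON. It uses only the standard library, so HDF5 and h5py are not shown. The field names `t` and `ecg_mv` and the subject ID are invented.

```python
import csv
import gzip
import io
import json

# Hypothetical samples: time in seconds, ECG amplitude in millivolts.
records = [{"t": 0.0, "ecg_mv": 0.12}, {"t": 0.004, "ecg_mv": 0.15}]

# CSV: one flat table, header row plus one line per record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["t", "ecg_mv"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# JSON: records can be nested under extra structure such as a subject ID.
json_text = json.dumps({"subject": "s01", "samples": records})

# gzip: general-purpose compression applied to the serialized bytes.
compressed = gzip.compress(json_text.encode("utf-8"))
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
```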
Relational databases
Structured storage with query languages.
- Database management system — the system that runs queries.
- Relational database — tables, rows, columns.
- SQLite — the embedded RDBMS used as the worked example.
- Entity-relationship diagram — modelling the schema.
- Relational schema — the table-level design.
- Primary key — unique row identifier.
- Foreign key — reference to another table’s primary key.
- ON DELETE CASCADE — referential-integrity behaviour.
- SQL constraint — rules the database enforces on columns or whole tables (NOT NULL, UNIQUE, CHECK).
- Database transaction — atomic groups of operations.
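The key, cascade, and transaction concepts above can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `subject` and `recording` tables are hypothetical, and the database lives in memory. Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

# Hypothetical schema: each recording belongs to one subject.
conn.execute("CREATE TABLE subject (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE recording (
        id INTEGER PRIMARY KEY,
        subject_id INTEGER NOT NULL REFERENCES subject(id) ON DELETE CASCADE,
        sensor TEXT NOT NULL
    )
""")

# "with conn" wraps a transaction: commit on success, rollback on exception.
with conn:
    conn.execute("INSERT INTO subject (id, name) VALUES (1, 'anon-01')")
    conn.execute("INSERT INTO recording (subject_id, sensor) VALUES (1, 'ecg')")
    conn.execute("INSERT INTO recording (subject_id, sensor) VALUES (1, 'imu')")

# A foreign key pointing at a missing subject is rejected.
fk_rejected = False
try:
    with conn:
        conn.execute("INSERT INTO recording (subject_id, sensor) VALUES (99, 'eeg')")
except sqlite3.IntegrityError:
    fk_rejected = True

# Deleting the subject cascades to its recordings.
with conn:
    conn.execute("DELETE FROM subject WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM recording").fetchone()[0]
```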
SQL
The query language.
- SQL — the language overview.
- SQL DDL — data definition (CREATE, ALTER, DROP).
- SQL DML — data manipulation (INSERT, UPDATE, DELETE).
- SQL DCL — data control (GRANT, REVOKE).
- SQL TCL — transaction control (COMMIT, ROLLBACK).
- SELECT statement — querying rows.
- WHERE clause — row filtering.
- LIKE operator — pattern matching on strings.
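A short sketch of the statement categories above, again run through `sqlite3`. The `recording` table and its rows are hypothetical. DCL is not shown because SQLite has no GRANT or REVOKE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the table.
conn.execute(
    "CREATE TABLE recording (id INTEGER PRIMARY KEY, sensor TEXT, rate_hz INTEGER)"
)

# DML: insert rows, using placeholders rather than string formatting.
conn.executemany(
    "INSERT INTO recording (sensor, rate_hz) VALUES (?, ?)",
    [("ecg_chest", 250), ("imu_wrist", 50), ("imu_ankle", 100)],
)

# TCL: make the inserts permanent.
conn.commit()

# SELECT with WHERE and LIKE: '%' matches any run of characters.
rows = conn.execute(
    "SELECT sensor, rate_hz FROM recording "
    "WHERE sensor LIKE ? AND rate_hz >= ? ORDER BY sensor",
    ("imu%", 50),
).fetchall()
```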
SQL data types
How columns are typed.
- CHAR (SQL) — fixed-length strings.
- VARCHAR (SQL) — variable-length strings.
- ENUM (SQL) — fixed set of string values.
- DECIMAL (SQL) — exact decimal numbers.
- FLOAT (SQL) — approximate floating-point.
- SQL integer types — TINYINT through BIGINT.
- SQL date and time types — DATE, TIME, DATETIME, TIMESTAMP.
- BLOB (SQL) — binary large object.
Big data
When the data outgrows one machine.
- Big data — the concept and the three Vs.
- Cluster (computing) — many machines acting as one.
- Node (computing) — a single machine in a cluster.
- Apache Hadoop — the open-source big-data framework.
- HDFS — Hadoop’s distributed file system.
- NameNode — HDFS’s metadata server.
- DataNode — HDFS’s block storage server.
- Replication factor — how many copies of each block.
- MapReduce — the distributed programming model.
- YARN — Hadoop’s cluster resource manager.
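The MapReduce model is easiest to see on a word count. The sketch below runs on one machine in plain Python, so it only imitates the map, shuffle, and reduce phases; it does not use Hadoop. The function names and the three input "blocks" are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_phase(block):
    """Emit a (word, 1) pair for every word in one input block."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

# Stand-ins for blocks that would be stored on different nodes.
blocks = ["ecg imu ecg", "imu eeg", "ecg"]

mapped = chain.from_iterable(map_phase(block) for block in blocks)
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```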
Visualization
Communicating what’s in the data.
- Matplotlib — the library.
- Pyplot — Matplotlib’s stateful interface.
- Matplotlib Figure — the top-level container.
- Matplotlib Axes — a single plotting region.
- Matplotlib GridSpec — flexible subplot layouts.
- Matplotlib coordinate transforms — data, axes, figure, display coordinates.
- Matplotlib colormap — mapping scalars to colours.
- Matplotlib tick locators — controlling axis tick placement.
- Principles of effective visualization — design rules for clarity.
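A minimal sketch of how Figure, Axes, and a colormap fit together, assuming Matplotlib and NumPy are installed. The 5 Hz sine wave is synthetic, and the non-interactive `Agg` backend is chosen only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no window is opened
import matplotlib.pyplot as plt
import numpy as np

# Synthetic signal for illustration only.
t = np.linspace(0, 1, 200)
signal = np.sin(2 * np.pi * 5 * t)

# One Figure (top-level container) holding two Axes (plotting regions).
fig, (ax_line, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

ax_line.plot(t, signal)
ax_line.set_xlabel("time (s)")
ax_line.set_ylabel("amplitude")
ax_line.set_title("Line graph")

# The colormap maps each scalar amplitude to a colour.
points = ax_scatter.scatter(t, signal, c=signal, cmap="viridis", s=8)
ax_scatter.set_title("Scatter plot")
fig.colorbar(points, ax=ax_scatter, label="amplitude")

fig.tight_layout()
```

Calling `fig.savefig(...)` with a path of your choice would write the result to disk.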
Chart types
Choosing the right chart for the question.
- Bar chart — categorical comparison.
- Line graph — trends over an ordered axis.
- Dual-axis chart — two y-scales on one plot.
- Area chart — line chart with filled region.
- Stacked bar chart — part-to-whole over categories.
- Pie chart — proportions of a whole.
- Scatter plot — two-variable relationships.
- Bubble chart — scatter plot with a size dimension.
- Heat map — matrix of values as a colour grid.
Missing data
Handling holes in the table.
- Missing data — the problem and why it bites.
- Deletion (missing data) — drop the offending rows or columns.
- Imputation — fill the holes instead of dropping.
- Zero-replacement imputation — fill with zero (often wrong).
- Sample-and-hold imputation — carry the last observed value forward.
- Linear interpolation — straight-line fill between neighbours.
- Non-linear interpolation — splines and higher-order fills.
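The strategies above can be compared side by side on a toy series, assuming Pandas and NumPy are available. The values are invented. Non-linear interpolation is left out because Pandas delegates spline methods to SciPy.

```python
import numpy as np
import pandas as pd

# Toy series with two consecutive missing readings.
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])

dropped = s.dropna()                      # deletion: remove the missing rows
zero_filled = s.fillna(0.0)               # zero replacement
held = s.ffill()                          # sample-and-hold: carry last value forward
linear = s.interpolate(method="linear")   # straight line between neighbours
```

On this series, zero replacement produces a dip to 0 that was never measured, which is why it is often the wrong choice.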
Noise and filtering
Separating signal from artifact.
- Noise (signal) — unwanted variation in a measurement.
- Artifact (signal) — non-random corruption from a known source.
- Low-frequency noise — drift, baseline wander.
- High-frequency noise — fast fluctuations on top of the signal.
- Moving-average filter — the simplest low-pass filter.
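A moving-average filter can be sketched as a convolution with a flat kernel. The helper name `moving_average`, the window of 3, and the sample values are assumptions for illustration. `mode="valid"` keeps only positions where the window fits entirely, so the output is shorter than the input.

```python
import numpy as np

def moving_average(x, window):
    """Average each run of `window` consecutive samples."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

# Invented noisy samples around a rising trend.
noisy = np.array([1.0, 3.0, 2.0, 4.0, 3.0, 5.0])
smooth = moving_average(noisy, window=3)
```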
Scaling
Putting features on comparable axes.
- Normalization — rescaling features to a common range.
- StandardScaler — zero-mean, unit-variance scaling.
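Both rescalings reduce to one line of NumPy each. The sketch below uses a made-up two-feature matrix. The second formula uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` computes.

```python
import numpy as np

# Two hypothetical features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Min-max normalization: each column rescaled to [0, 1].
min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: each column shifted to zero mean and unit variance.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)
```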
Feature extraction
Turning raw windows of data into model inputs.
- Feature extraction — the general concept.
- Window (feature extraction) — the unit of computation, fixed or sliding.
- Pandas rolling — the implementation engine for windowed features.
- Mean (statistical) — first moment.
- Standard deviation — spread.
- Variance (statistical) — squared spread.
- Skewness — asymmetry of the distribution.
- Kurtosis — tail heaviness.
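Windowed statistics like those above can be computed with Pandas `rolling`. This is a sketch with an invented signal and an assumed window of 4 samples; a real window length depends on the sampling rate and the activity being captured.

```python
import pandas as pd

# Invented signal; each window covers 4 consecutive samples.
signal = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])
windows = signal.rolling(window=4)

features = pd.DataFrame({
    "mean": windows.mean(),
    "std": windows.std(),    # sample standard deviation (ddof=1)
    "var": windows.var(),    # sample variance (ddof=1)
    "skew": windows.skew(),
    "kurt": windows.kurt(),
})

# The first rows are NaN because the window is not yet full.
complete = features.dropna()
```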
Dimensionality reduction
Fewer features, similar information.
- Curse of dimensionality — why high-dimensional spaces are hostile.
- Dimensionality reduction — the general problem.
- Principal Component Analysis — linear projection onto maximum-variance directions.
- t-SNE — non-linear embedding for visualization.
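PCA can be sketched directly from its definition: centre the data, take the singular value decomposition, and project onto the leading right singular vectors. The synthetic 3D data below is an assumption, built so that most of the variance lies along one direction. t-SNE is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two strongly correlated columns plus a small-noise column.
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2 * latent, rng.normal(scale=0.1, size=(200, 1))])

# Step 1: centre each feature.
X_centered = X - X.mean(axis=0)

# Step 2: SVD; rows of Vt are the principal directions, ordered by variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: variance captured by each component.
explained_variance = S**2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Step 4: project onto the first two components.
X_reduced = X_centered @ Vt[:2].T
```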
Machine learning foundations
The three paradigms.
- Supervised learning — learn from labelled examples.
- Unsupervised learning — find structure without labels.
- Reinforcement learning — learn from rewards.
Regression
Predicting continuous targets.
- Regression — the general problem.
- Linear regression — straight-line fit.
- Polynomial regression — higher-degree fits via feature expansion.
- Loss function — the objective being minimized.
- Mean squared error — the standard regression loss.
- Gradient — direction of steepest increase.
- Gradient descent — the optimization algorithm.
- Learning rate — step size for gradient descent.
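The pieces above fit together as follows: the loss is mean squared error, its gradient gives the direction of steepest increase, and gradient descent steps the other way, scaled by the learning rate. This is a minimal sketch on noise-free toy data generated from y = 2x + 1; the learning rate of 0.05 and the 2000 iterations are assumptions chosen so that this small problem converges.

```python
import numpy as np

# Toy data from a known line, so the fit can be checked.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0          # initial parameters
learning_rate = 0.05     # step size (assumed; too large a value diverges)

for _ in range(2000):
    error = (w * x + b) - y
    # Gradient of MSE = mean(error**2) with respect to w and b.
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Step against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

mse = np.mean((w * x + b - y) ** 2)
```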
Classification
Predicting discrete labels.
- Classification (ML) — the general problem.
- Sigmoid function — the squashing function for binary probabilities.
- Logistic regression — linear model with sigmoid output.
- Binary cross-entropy — the standard loss for binary classification.
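A sketch of how the three pieces connect: logistic regression passes a linear score through the sigmoid to get a probability, and binary cross-entropy scores that probability against the true label. The weights `w = 1.5` and `b = -3.0` and the data are hypothetical, and no training is performed here. The plain `math.exp` call would overflow for extremely negative inputs, so treat this as illustrative rather than numerically robust.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Logistic regression with one feature: sigmoid of a linear score."""
    return sigmoid(w * x + b)

def binary_cross_entropy(y_true, probs, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical one-feature data and hand-picked weights.
xs = [0.0, 1.0, 3.0, 4.0]
labels = [0, 0, 1, 1]
probs = [predict_proba(x, w=1.5, b=-3.0) for x in xs]
loss = binary_cross_entropy(labels, probs)
```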
Model evaluation
Telling whether a model actually works.
- Training set — what the model learns from.
- Validation set — what tuning decisions are made on.
- Test set — the held-out final evaluation.
- Train-test split — partitioning the data.
- K-fold cross-validation — averaging over multiple splits.
- Hyperparameter — tuning knob set outside training.
- Data leakage — when test info bleeds into training; the silent killer.
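K-fold splitting and leakage avoidance can be sketched together. The generator name `k_fold_indices`, the toy matrix, and the choice of 5 folds are assumptions. The point to notice is that the scaling statistics are computed from the training fold only and then reused on the validation fold; computing them on all rows would leak validation information into training.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

# Toy feature matrix: 10 samples, 2 features.
X = np.arange(20, dtype=float).reshape(10, 2)
splits = list(k_fold_indices(len(X), k=5))

scaled_folds = []
for train_idx, val_idx in splits:
    # Fit the scaling on training rows only, then apply it to both parts.
    mean = X[train_idx].mean(axis=0)
    std = X[train_idx].std(axis=0)
    X_train = (X[train_idx] - mean) / std
    X_val = (X[val_idx] - mean) / std
    scaled_folds.append((X_train, X_val))
```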
Classification metrics
Once a model predicts labels, scoring it.
- Confusion matrix — the four-cell summary of binary predictions.
- True positive — correctly predicted positive.
- True negative — correctly predicted negative.
- False positive — predicted positive, actually negative.
- False negative — predicted negative, actually positive.
- Accuracy (ML) — fraction correct overall.
- Precision (ML) — of predicted positives, how many were right.
- Recall (sensitivity) — of actual positives, how many were caught.
- Specificity (TNR) — of actual negatives, how many were rejected.
- False positive rate (FPR) — 1 − specificity.
- F1 score — harmonic mean of precision and recall.
- Decision threshold — where probability becomes a label.
- ROC curve — TPR vs FPR over all thresholds.
- AUC — area under the ROC curve.
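The metrics above all derive from the four confusion-matrix counts. The sketch below uses invented labels and scores and an assumed decision threshold of 0.5. ROC and AUC are not computed; repeating the calculation at every threshold and plotting recall against false positive rate would trace the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# Invented ground truth and model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1, 0.3, 0.7]

threshold = 0.5  # decision threshold: score >= threshold becomes label 1
y_pred = [1 if s >= threshold else 0 for s in scores]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)          # sensitivity, true positive rate
specificity = tn / (tn + fp)     # true negative rate
fpr = fp / (fp + tn)             # equals 1 - specificity
f1 = 2 * precision * recall / (precision + recall)
```

The divisions assume each denominator is non-zero, which holds for this data but needs a guard in general.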
Pipelines
Composing the workflow.
- scikit-learn pipeline — chained preprocessing and modelling steps.
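A minimal sketch of a scikit-learn pipeline, assuming scikit-learn is installed. The data is synthetic (from `make_classification`), and the step names `scale` and `clf` are arbitrary labels chosen here. Because the scaler sits inside the pipeline, it is fitted on the training split only when `fit` is called.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain preprocessing and modelling into one estimator.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

model.fit(X_train, y_train)                   # fits the scaler, then the classifier
test_accuracy = model.score(X_test, y_test)   # accuracy on the held-out split
```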
Connects to Data structures (NumPy arrays and Pandas DataFrames are concrete data structures; SQL tables and hash-table indices sit underneath) and to Differential equations (gradient descent is a discrete approximation to the gradient-flow ODE, and the logistic curve is the solution to the logistic ODE). Linear algebra underpins PCA and every linear model here.
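The two differential-equation links can be written out explicitly. The symbols are notation chosen here, not taken from the linked notes: \theta for the parameters, L for the loss, \eta for the learning rate, and \sigma for the sigmoid. The logistic ODE is shown in its unit-rate, unit-capacity form.

```latex
% Gradient flow, and its forward-Euler discretization with step size \eta:
\frac{d\theta}{dt} = -\nabla L(\theta)
\quad\Longrightarrow\quad
\theta_{k+1} = \theta_k - \eta \, \nabla L(\theta_k)

% Logistic ODE with initial condition y(0) = 1/2, and its solution:
\frac{dy}{dt} = y\,(1 - y), \qquad y(0) = \tfrac{1}{2}
\quad\Longrightarrow\quad
y(t) = \frac{1}{1 + e^{-t}} = \sigma(t)
```

The first line is the gradient descent update, with the learning rate playing the role of the time step. The second follows from the identity \sigma'(t) = \sigma(t)(1 - \sigma(t)).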