Map of content for data science — the end-to-end workflow from raw sensor readings to a trained model that makes predictions. The path: tools → collection → labelling and ethics → storage → big data → visualization → cleaning → features → dimensionality reduction → modelling → evaluation → pipelines.

Python tooling

The core libraries the rest of the workflow is built on.

Data collection

Where raw data comes from: sensors and the field.

  • Data collection — overview of the data-gathering stage.
  • Lab vs in-the-wild data — controlled vs realistic collection.
  • Sensor — the general concept.
  • IMU — inertial measurement unit, combining an accelerometer and a gyroscope, often with a magnetometer.
  • Accelerometer — measures linear acceleration.
  • Gyroscope — measures angular velocity.
  • Magnetometer — measures magnetic field.
  • Hall effect — the physical principle behind Hall-effect magnetometers, one common sensor type.
  • EEG — electroencephalography, brain electrical activity.
  • ECG — electrocardiography, the textbook’s running example.
  • Sensor fusion — combining multiple sensors for better estimates.
  • Web scraping — pulling data from web pages.
  • Metadata — data about the data.
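
As a rough illustration of the sensor fusion entry above, here is a minimal sketch of inverse-variance weighting, one simple way to combine independent estimates of the same quantity. The function name `fuse` and the heading readings are hypothetical, and the method assumes the sensor errors are independent with known variances; practical systems often use a complementary or Kalman filter instead.

```python
def fuse(estimates):
    """Inverse-variance weighted fusion of independent estimates.

    `estimates` is a list of (value, variance) pairs.
    Returns (fused_value, fused_variance).
    """
    # Less noisy sensors (smaller variance) get larger weights.
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / total
    # The fused variance is smaller than any single input variance.
    return value, 1.0 / total


# Hypothetical heading readings in degrees: (value, variance).
readings = [(90.0, 4.0), (94.0, 1.0)]
fused_value, fused_variance = fuse(readings)
```

The fused value sits closer to the more precise reading, and the fused variance is lower than that of either sensor alone, which is the sense in which fusion gives "better estimates".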

Labelling and ethics

Turning observations into supervised training data, responsibly.

Data formats

How data is stored on disk.

  • CSV — flat, human-readable tables.
  • JSON — nested, semi-structured records.
  • HDF5 — hierarchical binary format for large numerical datasets.
  • HDF5 group — directory-like container inside an HDF5 file.
  • HDF5 dataset — array-like leaf inside an HDF5 file.
  • h5py — the Python interface to HDF5.
  • gzip — general-purpose compression.
  • lzf — fast, HDF5-friendly compression.
  • szip — scientific-data compression.
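
To make the contrast between the first formats concrete, the following sketch writes the same records as nested JSON and as flat CSV, then compresses the CSV with gzip. It uses only the Python standard library; the records and field names are made up for illustration, and HDF5 is left out because it needs the third-party h5py package.

```python
import csv
import gzip
import io
import json

# Hypothetical records: two sensor readings with nested axis values.
records = [
    {"t": 0.00, "sensor": "imu", "accel": {"x": 0.1, "y": 0.0, "z": 9.8}},
    {"t": 0.01, "sensor": "imu", "accel": {"x": 0.2, "y": 0.1, "z": 9.7}},
]

# JSON keeps the nesting as-is.
json_text = json.dumps(records)

# CSV is flat, so nested fields are flattened into columns first.
flat = [
    {
        "t": r["t"],
        "sensor": r["sensor"],
        "accel_x": r["accel"]["x"],
        "accel_y": r["accel"]["y"],
        "accel_z": r["accel"]["z"],
    }
    for r in records
]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat[0].keys()))
writer.writeheader()
writer.writerows(flat)
csv_text = buffer.getvalue()

# gzip is general-purpose: it compresses any bytes and round-trips losslessly.
compressed = gzip.compress(csv_text.encode("utf-8"))
restored = gzip.decompress(compressed).decode("utf-8")

# Reading CSV back yields strings; numeric types must be restored by the caller.
rows = list(csv.DictReader(io.StringIO(restored)))
```

Note that JSON round-trips the nested structure and numeric types, whereas every value read back from CSV is a string.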

Relational databases

Structured storage with query languages.

SQL

The query language.

SQL data types

How columns are typed.

Big data

When the data outgrows one machine.

Visualization

Communicating what’s in the data.

Chart types

Choosing the right chart for the question.

Missing data

Handling holes in the table.

Noise and filtering

Separating signal from artifact.

Scaling

Putting features on comparable axes.

Feature extraction

Turning raw windows of data into model inputs.

Dimensionality reduction

Fewer features, similar information.

Machine learning foundations

The three paradigms: supervised, unsupervised, and reinforcement learning.

Regression

Predicting continuous targets.

Classification

Predicting discrete labels.

Model evaluation

Telling whether a model actually works.

Classification metrics

Scoring a model once it predicts labels.

Pipelines

Composing the workflow.


Connects to Data structures (NumPy arrays and Pandas DataFrames are concrete data structures; SQL tables and hash-table indices sit underneath) and to Differential equations (gradient descent is a discrete, forward-Euler approximation to the gradient-flow ODE, and the logistic curve is the solution to the logistic ODE). Linear algebra underpins PCA and every linear model here.
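
The two differential-equation links can be checked numerically. The sketch below is illustrative only and assumes a toy objective f(x) = x²/2, whose gradient flow dx/dt = -x has the exact solution x(t) = x0·e^(-t). Gradient descent with step size h is the forward-Euler update x ← x - h·f'(x), so its result at a fixed time horizon should approach the exact solution as h shrinks. The second part checks by finite differences that the logistic curve satisfies dy/dt = y(1 - y). All names here are hypothetical.

```python
import math


def gradient_descent(grad, x0, step, n_steps):
    """Plain gradient descent: repeated forward-Euler steps on dx/dt = -grad(x)."""
    x = x0
    for _ in range(n_steps):
        x = x - step * grad(x)
    return x


def grad(x):
    """Gradient of the toy objective f(x) = x**2 / 2."""
    return x


x0, horizon = 1.0, 1.0
exact = x0 * math.exp(-horizon)  # gradient-flow solution at t = horizon

# Smaller steps track the continuous flow more closely.
errors = []
for step in (0.1, 0.01, 0.001):
    n_steps = round(horizon / step)
    approx = gradient_descent(grad, x0, step, n_steps)
    errors.append(abs(approx - exact))


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


# Central-difference check that the logistic curve solves dy/dt = y * (1 - y).
t, eps = 0.5, 1e-6
derivative = (logistic(t + eps) - logistic(t - eps)) / (2 * eps)
rhs = logistic(t) * (1.0 - logistic(t))
```

The entries of `errors` shrink as the step size does, and `derivative` agrees with `rhs` up to finite-difference error.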