Metadata is information about a measurement, separate from the measurement itself: timestamps, sensor identifiers, units, location, the conditions under which the recording was made. A photo of a person is more useful if we also know when it was taken, where, with what camera, at what resolution. An ECG recording is more useful if we know which patient it came from, which leads were used, the patient’s position, and the sampling rate. The information about the measurement is part of the dataset just as the measurement itself is.
What metadata typically contains depends on the kind of data:
- Image files carry EXIF metadata: camera model, lens, exposure settings, GPS coordinates, timestamp.
- Files on disk carry filesystem metadata: modification time, owner, permissions, and (on filesystems that record it) creation time.
- Sensor recordings carry per-sample timestamps (or the start time and sampling rate, from which timestamps can be reconstructed), the type and model of the sensor, where on the body or vehicle it was mounted, its orientation, the units of the recorded values, and a unique identifier for the device.
- Social-media data carries the time of the post, its location (when the user shared one), any hashtags, and the device or app used to post.
- IoT-sensor data carries the device ID, location, operating temperature range, and update frequency.
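The timestamp reconstruction mentioned for sensor recordings follows from simple arithmetic: sample i was taken at the start time plus i divided by the sampling rate. Below is a minimal Python sketch of that calculation. The function name `reconstruct_timestamps`, the 250 Hz rate, and the start time are illustrative assumptions, not values from any particular device or file format.

```python
from datetime import datetime, timedelta, timezone


def reconstruct_timestamps(start_time, sampling_rate_hz, n_samples):
    """Return one timestamp per sample: t_i = start_time + i / sampling_rate_hz."""
    if sampling_rate_hz <= 0:
        raise ValueError("sampling_rate_hz must be positive")
    # Compute each offset from the sample index instead of adding a step
    # repeatedly, so rounding error does not accumulate over long recordings.
    return [
        start_time + timedelta(seconds=i / sampling_rate_hz)
        for i in range(n_samples)
    ]


# Hypothetical metadata for a recording sampled at 250 Hz.
start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
timestamps = reconstruct_timestamps(start, sampling_rate_hz=250.0, n_samples=4)

for t in timestamps:
    print(t.isoformat())
```

This only works if the sampling rate in the metadata is accurate and no samples were dropped, which is one reason some recordings store per-sample timestamps instead.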
Without metadata, a dataset becomes useless surprisingly quickly. A stack of ECG recordings with no patient identifiers, no timestamps, and no information about lead placement is a stack of waveforms — not a clinical dataset. The waveforms aren’t wrong; they’re just unanchored, and there’s no way to interpret them.
Sometimes the metadata is itself the label. If we’re building a system that predicts a smartphone’s orientation from other signals, the reading from the device’s own orientation sensor is the ground truth, so no human needs to label anything. If we’re predicting tomorrow’s weather, the observation recorded the following day is the ground truth for today’s prediction.
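The weather case can be sketched in a few lines of Python: the date attached to each observation decides which later observation serves as its label. The dates and temperatures below are made up for illustration, and `next_day_pairs` is a hypothetical helper rather than part of any library.

```python
from datetime import date, timedelta

# Hypothetical daily observations: (observation date, mean temperature in °C).
observations = [
    (date(2024, 3, 1), 8.5),
    (date(2024, 3, 2), 9.1),
    (date(2024, 3, 3), 7.8),
    (date(2024, 3, 5), 10.2),  # 4 March is missing from the record
]


def next_day_pairs(obs):
    """Pair each day's value with the value recorded on the next calendar day."""
    by_date = dict(obs)
    pairs = []
    for day, value in sorted(obs):
        label = by_date.get(day + timedelta(days=1))
        if label is not None:  # skip days that have no next-day observation
            pairs.append((day, value, label))
    return pairs


pairs = next_day_pairs(observations)
for day, value, label in pairs:
    print(day.isoformat(), value, label)
```

Looking labels up by date, rather than pairing each row with the next row, keeps the gap on 4 March from producing a pair that spans two days. Without the dates, that gap would be invisible.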