Deletion (missing data)

Deletion is the simplest strategy for handling missing data: discard the rows (or columns) that contain missing values and proceed with what’s left. It’s fast (no computation beyond filtering) and trivially correct in one narrow sense: the resulting dataset has no missing values by construction.

In pandas:

df.dropna()                       # drop rows with any NaN
df.dropna(subset=['col1'])        # drop rows where col1 is NaN
df.dropna(axis=1)                 # drop columns with any NaN
df.dropna(thresh=5)               # keep rows with at least 5 non-NaN values
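
To make the four calls concrete, here is a small sketch on a toy frame. The column names and values are invented for illustration, and `thresh=2` is used instead of `thresh=5` because the toy frame only has three columns.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration: 4 rows, 3 columns.
df = pd.DataFrame({
    "col1": [1.0, np.nan, 3.0, 4.0],
    "col2": [10.0, 20.0, np.nan, 40.0],
    "col3": [100.0, 200.0, 300.0, 400.0],
})

rows_any = df.dropna()                  # rows 0 and 3 have no NaN and survive
rows_col1 = df.dropna(subset=["col1"])  # only row 1 (NaN in col1) is dropped
cols_any = df.dropna(axis=1)            # only col3 has no NaN and survives
rows_thresh = df.dropna(thresh=2)       # every row has >= 2 non-NaN values, so all stay

print(rows_any.shape, rows_col1.shape, cols_any.shape, rows_thresh.shape)
print(df.shape)  # dropna returns a new frame; df itself is unchanged
```

Note that `dropna` returns a new DataFrame by default, so the result has to be assigned to a name to have any effect.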

Deletion has two real costs.

Loss of data. Every deleted row is a training example we don’t get to learn from. When data is abundant and missing values are few, the loss is negligible. When missing values are scattered across many rows, dropping every row that contains one can discard most of the data, and deletion amounts to throwing the dataset away.
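
A rough simulation shows how quickly scattered missingness adds up. It assumes, purely for illustration, that each cell is missing independently with the same probability; real data rarely behaves this neatly. Under that assumption the share of fully observed rows is about (1 - p)^k for k columns, so 10% missing cells across 10 columns leaves only around a third of the rows.

```python
import numpy as np
import pandas as pd

# Assumed setup (not from any real dataset): 10 columns, each cell
# missing independently with probability 0.10.
rng = np.random.default_rng(0)
n_rows, n_cols, p_missing = 10_000, 10, 0.10

values = rng.normal(size=(n_rows, n_cols))
values[rng.random((n_rows, n_cols)) < p_missing] = np.nan
df = pd.DataFrame(values, columns=[f"f{i}" for i in range(n_cols)])

cell_missing = df.isna().to_numpy().mean()   # fraction of missing cells
rows_kept = len(df.dropna()) / len(df)       # fraction of rows that survive
expected_kept = (1 - p_missing) ** n_cols    # about 0.35 under independence

print(f"missing cells: {cell_missing:.1%}")
print(f"rows kept:     {rows_kept:.1%} (expected about {expected_kept:.1%})")
```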

Desynchronization of paired channels. If we have an ECG channel and an EEG channel sampled in lockstep, and we delete only the ECG samples that are missing, the two channels are no longer aligned in time. Downstream code that assumes a regular grid breaks. The fix is to delete paired samples, dropping the ECG sample and the EEG sample at the same timestamp together, but that compounds the data-loss cost.
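
The difference between the two kinds of deletion can be sketched with a toy recording. The channel values and the 4 ms sampling step are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical recording: two channels on a shared grid, one sample every 4 ms.
t_ms = pd.Index(np.arange(6) * 4, name="t_ms")
signals = pd.DataFrame(
    {
        "ecg": [0.1, np.nan, 0.3, 0.4, np.nan, 0.6],
        "eeg": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5],
    },
    index=t_ms,
)

# Per-channel deletion: the ECG array shrinks, the EEG array does not,
# so position i no longer refers to the same instant in both.
ecg_only = signals["ecg"].dropna().to_numpy()
eeg_all = signals["eeg"].to_numpy()
print(len(ecg_only), len(eeg_all))

# Paired deletion: drop the whole row wherever ECG is missing, so the
# EEG sample at the same timestamp goes with it.
paired = signals.dropna(subset=["ecg"])
print(paired)
print(np.diff(paired.index))  # spacing between surviving timestamps
```

After paired deletion the two channels agree on every remaining timestamp, but the spacing between timestamps is no longer constant, so code that needs a strictly regular grid still has to account for the gaps.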

Deletion is the right choice when:

A column is mostly missing, so dropping the whole column is fine because there’s not enough left to impute from.
The dataset is large and missing values are rare, so losing 0.1% of rows is invisible.
The downstream task is tolerant of variable sample sizes.
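
One way to check the first two conditions before deleting anything is to measure what deletion would cost. The helper below is a hypothetical sketch, not a pandas function; the name `deletion_report` and the 50% column threshold are arbitrary choices for illustration, and it assumes a non-empty frame.

```python
import numpy as np
import pandas as pd

def deletion_report(df, col_threshold=0.5):
    """Illustrative helper: list mostly-missing columns and the row loss
    that dropna() would cause once those columns are removed."""
    col_missing = df.isna().mean()  # fraction of missing values per column
    sparse_cols = col_missing[col_missing > col_threshold].index.tolist()
    remaining = df.drop(columns=sparse_cols)
    row_loss = 1 - len(remaining.dropna()) / len(remaining)
    return sparse_cols, row_loss

# Toy data, invented for illustration.
df = pd.DataFrame({
    "a": np.arange(10, dtype=float),            # complete
    "b": [np.nan] * 8 + [1.0, 2.0],             # 80% missing
    "c": [1.0, np.nan] + [3.0] * 8,             # 10% missing
})

sparse_cols, row_loss = deletion_report(df)
print("columns to drop whole:", sparse_cols)
print(f"rows lost afterwards:  {row_loss:.0%}")
```

If the reported row loss is small, row deletion is probably harmless; if it is large, that is a signal to look at imputation instead.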

Otherwise, imputation preserves the dataset’s structure and is usually the better default.

Idriss Rami — Notes
