Imputation

Imputation replaces missing values with estimates instead of discarding the rows that contain them. Its main advantage over deletion is that it preserves the structure of the dataset and keeps paired channels aligned in time. The cost is computation: imputation uses memory, drains battery on portable devices, and can introduce latency in real-time systems.

Several imputation methods are in common use. They differ in how much of the surrounding signal they use to estimate the missing value:

Zero-replacement imputation — every missing value gets replaced with 0. Trivial to implement, almost always wrong.
Sample-and-hold imputation — repeat the most recent valid value. Better than zero; good when the signal varies slowly.
Linear interpolation — draw a straight line between the two neighboring valid samples and read the missing value off the line. Accurate when the signal varies smoothly.
Non-linear interpolation — fit a curve (polynomial, spline) through several surrounding samples. Better for naturally curved signals.
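
To make the difference between the four methods concrete, here is a minimal sketch using NumPy on a made-up signal. The arrays `t` and `x` are illustrative assumptions (samples of `t**2` with one gap), and the quadratic fit stands in for "non-linear interpolation" in general; a spline would be another reasonable choice.

```python
import numpy as np

# Hypothetical signal sampled at t = 0..4; the sample at t = 2 is missing.
# The underlying signal is t**2, so the true missing value is 4.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x = np.array([0.0, 1.0, np.nan, 9.0, 16.0])
missing = np.isnan(x)

# Zero replacement: every gap becomes 0.
zero_filled = np.where(missing, 0.0, x)

# Sample-and-hold: for each position, find the index of the most recent
# valid sample and copy its value (assumes the first sample is valid).
last_valid = np.maximum.accumulate(np.where(missing, 0, np.arange(len(x))))
hold_filled = x[last_valid]

# Linear interpolation: straight line between the neighboring valid samples.
linear_filled = np.interp(t, t[~missing], x[~missing])

# Non-linear interpolation: fit a quadratic through the valid samples
# and evaluate it at every time point.
coeffs = np.polyfit(t[~missing], x[~missing], deg=2)
poly_filled = np.where(missing, np.polyval(coeffs, t), x)

print(zero_filled[2], hold_filled[2], linear_filled[2], poly_filled[2])
```

For this signal the gap is filled with 0 (zero), 1 (hold), 5 (linear), and approximately 4 (quadratic). Only the quadratic fit recovers the true value, and only because the made-up signal happens to be exactly quadratic.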

More sophisticated methods exist but are typically deferred to later courses:

EM (expectation-maximization) imputation treats missing values as latent variables and iteratively estimates them alongside model parameters.
kNN imputation finds the $k$ most similar rows in the dataset and averages their values for the missing column.
Machine-learning-based imputation trains a model to predict missing values from observed ones.
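
As a rough illustration of the kNN idea, the sketch below fills each gap with the average of the $k$ nearest complete rows. The function name `knn_impute` and the toy array `rows` are hypothetical, and restricting the donors to fully observed rows is a simplifying assumption; library implementations handle partially observed neighbors and distance weighting more carefully.

```python
import numpy as np

def knn_impute(data, k=2):
    """Fill NaNs in each row with the mean of the k nearest complete rows."""
    data = np.asarray(data, dtype=float)
    filled = data.copy()
    # Simplification: only rows with no missing values can act as neighbors.
    complete = data[~np.isnan(data).any(axis=1)]
    for i, row in enumerate(data):
        gaps = np.isnan(row)
        if not gaps.any():
            continue
        # Euclidean distance, measured only on the columns this row observed.
        diffs = complete[:, ~gaps] - row[~gaps]
        dists = np.sqrt((diffs ** 2).sum(axis=1))
        nearest = complete[np.argsort(dists)[:k]]
        # Average the neighbors' values in the missing columns.
        filled[i, gaps] = nearest[:, gaps].mean(axis=0)
    return filled

# Toy data: the last row is missing its second column.
rows = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [9.0, 90.0],
    [1.5, np.nan],
])
result = knn_impute(rows, k=2)
print(result[3])
```

The two nearest rows to `[1.5, nan]` are the first two, so the gap is filled with the mean of 10 and 20, which is 15.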

For most Introduction to Data Science work, the basic four (zero, sample-and-hold, linear, non-linear) cover what’s needed.

In pandas, imputation is done with fillna, ffill, and interpolate. Each call returns a new DataFrame and leaves df unchanged:

df.fillna(0)                                  # zero-replacement
df.ffill()                                    # sample-and-hold (forward fill); fillna(method='ffill') is deprecated in recent pandas
df.interpolate(method='linear')               # linear interpolation
df.interpolate(method='cubic')                # cubic interpolation (requires SciPy)
df.fillna(value={'col1': 0, 'col2': 1.5})     # different fill per column
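
The calls above can be tried on a small frame to see what each one produces. The DataFrame below is invented for illustration, and the sketch uses `df.ffill()` for the forward fill, which is the spelling recent pandas versions prefer. Cubic interpolation is left out here only because it needs SciPy installed.

```python
import numpy as np
import pandas as pd

# Hypothetical two-channel recording with gaps in both channels.
df = pd.DataFrame({
    "col1": [1.0, np.nan, 3.0, np.nan, 5.0],
    "col2": [2.0, 4.0, np.nan, 8.0, 10.0],
})

# Record where the gaps were before filling, so imputed values stay traceable.
was_missing = df.isna()

zero = df.fillna(0)                                   # zero-replacement
held = df.ffill()                                     # sample-and-hold
linear = df.interpolate(method="linear")              # linear interpolation
per_column = df.fillna(value={"col1": 0, "col2": 1.5})  # per-column fill

print(linear)
```

Keeping `was_missing` around is a design choice rather than a requirement: it lets later analysis distinguish measured samples from imputed ones.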

The choice between methods depends partly on what we know about the signal. For slowly varying signals at high sample rates, sample-and-hold is usually fine. For ECG and EEG signals with natural curvature, non-linear interpolation tends to be more accurate. For categorical data (encoded as integers), interpolation produces nonsensical values like 1.7 between categories 1 and 2, so a category-level filling strategy (filling with the mode, or treating missing as its own category) is more appropriate.
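
The categorical case can be sketched in pandas as follows. The series `labels` and the sentinel code `-1` for "missing" are assumptions made for this example; any code that does not collide with a real category would do.

```python
import numpy as np
import pandas as pd

# Hypothetical integer-coded categories with one missing entry.
labels = pd.Series([1, np.nan, 2, 2, 3])

# Interpolating treats the codes as numbers and invents a value
# halfway between categories 1 and 2.
interpolated = labels.interpolate()

# Category-level alternatives:
mode_filled = labels.fillna(labels.mode().iloc[0])   # most frequent category
as_own_category = labels.fillna(-1)                  # -1 marks "missing"

print(interpolated[1], mode_filled[1], as_own_category[1])
```

Here interpolation yields 1.5, which is not a valid category, while the mode fill yields 2 and the sentinel fill yields -1.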

Idriss Rami — Notes
