K-fold cross-validation is an evaluation procedure well suited to small datasets. Instead of using one fixed validation set, split the training portion into k equal pieces (folds), then train and evaluate k times. Each time, one fold is the validation set and the remaining k - 1 folds are used for training.
A single train/validate split can mislead on a small dataset. Get unlucky in the random split and the validation set might not be representative, so your hyperparameter choices come out miscalibrated. Cross-validation averages over multiple splits, which gives a steadier estimate.
The picture
With k = 5 and folds labelled 1 through 5:
- Run 1: validate on fold 1, train on folds 2-5.
- Run 2: validate on fold 2, train on folds 1, 3, 4, 5.
- Run 3: validate on fold 3, train on folds 1, 2, 4, 5.
- Run 4: validate on fold 4, train on folds 1, 2, 3, 5.
- Run 5: validate on fold 5, train on folds 1, 2, 3, 4.
By the end, every fold has been used once for validation and four times as part of training. Together, the five validation scores give a more reliable estimate than any single train/validate split. Report the mean and standard deviation of the scores across folds: the mean is the central estimate, the standard deviation quantifies the uncertainty.
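The five runs can be written out as an explicit loop. This is a minimal sketch, not a prescribed implementation: the synthetic data from make_classification, the choice of logistic regression, and the variable names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, for illustration only).
X, y = make_classification(n_samples=100, n_features=8, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
times_validated = np.zeros(len(X), dtype=int)

for run, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Build a fresh model for each run so nothing carries over between folds.
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X[train_idx], y[train_idx])
    # Accuracy on the single held-out fold.
    scores.append(model.score(X[val_idx], y[val_idx]))
    times_validated[val_idx] += 1
    print(f"Run {run}: train on {len(train_idx)}, validate on {len(val_idx)}")

scores = np.array(scores)
print(f"mean {scores.mean():.3f}, std {scores.std():.3f}")
```

The times_validated counter is only there to check the bookkeeping: after the loop, every sample should have been in a validation fold exactly once.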
After cross-validation
Once cross-validation has picked a configuration:
- With a separate test set held out, retrain the final model on all the non-test data using the chosen configuration, then evaluate it once on the test set.
- With no separate test set (the dataset was too small to spare one), report the cross-validation average and standard deviation as the performance estimate.
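The first case (a held-out test set) might look like the sketch below. The synthetic data, the 80/20 split, and the small grid of C values are hypothetical choices made for illustration; only the order of the steps is the point.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Set the test set aside before any model selection happens.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical candidate configurations: regularisation strengths.
candidates = [0.01, 0.1, 1.0, 10.0]
cv_means = {}
for C in candidates:
    clf = make_pipeline(StandardScaler(), LogisticRegression(C=C))
    # Cross-validate on the non-test data only.
    cv_means[C] = cross_val_score(clf, X_dev, y_dev, cv=5, scoring="accuracy").mean()

best_C = max(cv_means, key=cv_means.get)

# Retrain on all the non-test data, then touch the test set exactly once.
final_model = make_pipeline(StandardScaler(), LogisticRegression(C=best_C))
final_model.fit(X_dev, y_dev)
test_accuracy = final_model.score(X_test, y_test)
print(f"chosen C={best_C}, test accuracy {test_accuracy:.3f}")
```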
Why train on all the remaining folds in each iteration: the model learns from every example except those in the held-out fold, which squeezes the most learning out of a small dataset. The cost is computational: training k times takes roughly k times as long as training once.
In scikit-learn
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# X, y: feature matrix and labels, assumed to be loaded already.
# Scaling sits inside the pipeline, so it is refit on each training split.
clf = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(f'{scores.mean():.3f} ± {scores.std():.3f}')
cross_val_score(...) performs k-fold CV (5-fold when cv is left at its default) and returns one score per fold. The scoring= parameter chooses the metric: 'accuracy', 'roc_auc', 'f1', or any other predefined scorer name; a custom metric from sklearn.metrics can be wrapped with make_scorer.
StratifiedKFold preserves class proportions in each fold, which you want for imbalanced classification. GroupKFold keeps related groups (e.g. samples from the same patient) entirely in either training or validation, never splitting them, for when group leakage is a concern.
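Both splitters can be checked on toy data. In the sketch below, the 80/20 class imbalance, the placeholder features, and the 20 hypothetical patient IDs are assumptions chosen so the behaviour is easy to inspect. Note also that cross_val_score already uses stratified folds when given an integer cv and a classifier on binary or multiclass labels; passing a splitter explicitly is how you control shuffling or grouping.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

# Toy imbalanced labels: 80 negatives, 20 positives (illustrative assumption).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)  # placeholder features

# StratifiedKFold: each validation fold keeps the 80/20 class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print("positives per validation fold:", positives_per_fold)

# GroupKFold: 20 hypothetical patients with 5 samples each.
groups = np.repeat(np.arange(20), 5)
gkf = GroupKFold(n_splits=5)
overlaps = []
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    # Count patients that appear on both sides of the split.
    shared = set(groups[train_idx]) & set(groups[val_idx])
    overlaps.append(len(shared))
print("patients shared between train and validation:", overlaps)
```

Either splitter can be passed as the cv= argument of cross_val_score; GroupKFold additionally needs the groups= argument.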
For hyperparameter tuning combined with cross-validation, GridSearchCV and RandomizedSearchCV automate the full procedure: try each candidate configuration (every grid point, or a random sample of the search space), run cross-validation on it, and pick the best.
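A minimal GridSearchCV sketch follows. The synthetic data and the grid of C values are hypothetical; the parameter name logisticregression__C follows from the way make_pipeline names each step after its lowercased class name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, for illustration only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Hypothetical grid: four regularisation strengths for the final step.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# Each grid point is scored with 5-fold cross-validation.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best params:", search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```

With the default refit=True, search.best_estimator_ is already retrained on all the data passed to fit, so it can be evaluated directly on a held-out test set if one exists.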