Loss function

A loss function measures how badly a model fits the data. A small loss means a good fit; a large loss means a bad fit. Training a model amounts to finding parameters that minimize the loss; the usual procedure for doing this is Gradient descent.

The loss takes the model’s parameters $w$ as input and returns a single number summarizing prediction error across the training set. Different tasks use different losses:

Mean squared error (MSE) for Regression. Squared deviation of predicted from true values, averaged over the training set.
Binary cross-entropy for binary classification. Penalizes confidently-wrong predictions much more than uncertain ones.
Categorical cross-entropy for multi-class classification. Generalization of binary cross-entropy to $K > 2$ classes.
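Hinge loss for support vector machines. Zero for examples classified correctly with a margin of at least 1, growing linearly otherwise.
0-1 loss (count of mistakes) is the most intuitive loss, but it is piecewise constant: its gradient is zero wherever it is defined, so it gives gradient descent no direction to move in. Instead we use surrogate losses (cross-entropy, hinge) that vary continuously with the parameters and give useful gradients. Cross-entropy is differentiable everywhere; hinge is differentiable everywhere except at its kink.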
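The losses listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the `eps` clipping constant, and the label conventions (0/1 for binary cross-entropy, -1/+1 for hinge, class indices for categorical cross-entropy) are assumptions made for the example.

```python
import math

def mse(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood; labels in {0, 1}, p_pred = P(label = 1)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never occurs
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

def categorical_cross_entropy(y_true, probs, eps=1e-12):
    """y_true holds class indices; each row of probs is a probability vector."""
    total = sum(-math.log(max(row[t], eps)) for t, row in zip(y_true, probs))
    return total / len(y_true)

def hinge(y_true, scores):
    """Average hinge loss; labels in {-1, +1}, scores are raw model outputs."""
    return sum(max(0.0, 1.0 - t * s) for t, s in zip(y_true, scores)) / len(y_true)

def zero_one(y_true, y_pred):
    """Count of mistakes."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p)
```

Note that `zero_one` compares hard labels, so a small change in the model's parameters usually leaves it unchanged, while the other four respond smoothly to changes in the predictions.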

The Introduction to Data Science textbook writes the loss as $J(w)$; other sources sometimes use $L(w)$. The arrow $\leftarrow$ in gradient descent’s update rule means assignment, the same as = in a Python program.

Why squaring (for MSE)

Two reasons. No sign cancellation: errors above and below the true value cannot cancel each other out, because every squared error is non-negative. Disproportionate penalty for large errors: an error of 4 contributes 16, an error of 2 contributes 4, an error of 1 contributes 1. The model is strongly pushed to avoid big mistakes, even at the cost of accepting more small ones.
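Both effects are easy to check numerically. The sketch below uses made-up error values and hypothetical helper names; it only illustrates the arithmetic.

```python
def mean_signed_error(errors):
    """Average of raw errors; positive and negative errors can cancel."""
    return sum(errors) / len(errors)

def mean_squared_error(errors):
    """Average of squared errors; every term is non-negative."""
    return sum(e ** 2 for e in errors) / len(errors)

# Reason 1: opposite errors cancel without squaring, but not with it.
opposite = [2.0, -2.0]
print(mean_signed_error(opposite))   # 0.0
print(mean_squared_error(opposite))  # 4.0

# Reason 2: the contribution grows with the square of the error.
print([e ** 2 for e in [1, 2, 4]])   # [1, 4, 16]

# Same total absolute error (4), very different MSE.
one_big = [4.0, 0.0, 0.0, 0.0]
many_small = [1.0, 1.0, 1.0, 1.0]
print(mean_squared_error(one_big))     # 4.0
print(mean_squared_error(many_small))  # 1.0
```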

Why cross-entropy (for classification)

Cross-entropy penalizes confident mistakes much more harshly than uncertain ones. A model that hedges (predicting probability 0.5 on the wrong class) suffers less than a model that confidently insists on the wrong answer (predicting 0.99 on the wrong class). This is intuitively the right behaviour: confidence should be earned.
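To put numbers on this, the per-example cross-entropy (with the natural logarithm) is the negative log of the probability the model assigned to the true class. In the sketch below, putting 0.99 on the wrong class of a binary problem is taken to mean 0.01 on the true class; the function name is hypothetical.

```python
import math

def true_class_loss(p_true):
    """Cross-entropy for one example: -ln of the probability given to the true class."""
    return -math.log(p_true)

hedging = true_class_loss(0.5)           # model is unsure
confident_wrong = true_class_loss(0.01)  # model put 0.99 on the wrong class

print(round(hedging, 3))          # 0.693
print(round(confident_wrong, 3))  # 4.605
```

The confident mistake costs more than six times as much as the hedged one, and the loss grows without bound as the probability on the true class approaches 0.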

The training problem

Given a model with parameters $w$ and a loss function $J(w)$ measured on training data, training is the optimization problem

$w^{*} = \arg\min_{w} J(w)$

For Linear regression with MSE, a closed-form solution exists (the normal equations). Most other models have no closed form, so Gradient descent iterates step by step toward a minimum.
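As a small illustration of both routes, the sketch below fits a one-parameter model $y \approx w x$ (no intercept, an assumption made to keep the example short) with MSE. The data, the function names, the learning rate, and the step count are all invented for the example. The line inside the loop is the update rule $w \leftarrow w - \eta \, dJ/dw$ written as a Python assignment.

```python
def mse(w, xs, ys):
    """J(w): mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, xs, ys):
    """dJ/dw for the loss above."""
    return 2.0 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def closed_form(xs, ys):
    """Setting dJ/dw = 0 and solving gives w* = sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def gradient_descent(xs, ys, w=0.0, learning_rate=0.05, steps=100):
    """Repeat the update w <- w - learning_rate * dJ/dw."""
    for _ in range(steps):
        w = w - learning_rate * mse_gradient(w, xs, ys)
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w_exact = closed_form(xs, ys)
w_gd = gradient_descent(xs, ys)
print(w_exact, w_gd)  # the two values agree to many decimal places
```

The learning rate matters: on this data a value that is too large makes the iterates overshoot and diverge instead of settling at the minimum.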

A convex loss surface (bowl-shaped, so every local minimum is also a global minimum) is well-behaved for gradient descent. The loss surface for Logistic regression with Binary cross-entropy is convex; the loss surface for a deep neural network is generally not, and gradient descent can stall near saddle points or settle in bad local minima. Modern training algorithms (Adam, SGD with momentum) add momentum and, in Adam’s case, per-parameter step sizes. These often help in practice but do not guarantee reaching the global minimum.
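The effect of non-convexity shows up even in one dimension. The sketch below runs plain gradient descent on a toy function with two valleys, $J(w) = w^4 - 3w^2 + w$, chosen purely for illustration (it is not a neural network loss); the learning rate and step count are likewise assumptions.

```python
def loss(w):
    """A non-convex toy loss with two local minima."""
    return w ** 4 - 3 * w ** 2 + w

def grad(w):
    """Derivative of the toy loss."""
    return 4 * w ** 3 - 6 * w + 1

def gradient_descent(w, learning_rate=0.01, steps=2000):
    """Plain gradient descent from the starting point w."""
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

w_left = gradient_descent(-2.0)   # ends in the left valley, near w = -1.30
w_right = gradient_descent(2.0)   # ends in the right valley, near w = 1.13

print(w_left, loss(w_left))
print(w_right, loss(w_right))
```

Both runs stop where the gradient is essentially zero, but the run started at 2.0 ends with a higher loss than the run started at -2.0: where plain gradient descent ends up depends on where it starts.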

Idriss Rami — Notes
