A loss function measures how badly a model fits the data. A small loss means a good fit; a large loss means a bad fit. Training a model amounts to finding parameters that minimize the loss; the usual algorithm for doing this is gradient descent.

The loss takes the model’s parameters as input and returns a single number summarizing prediction error across the training set. Different tasks use different losses:

  • Mean squared error (MSE) for regression. The squared deviation of predicted from true values, averaged over the training set.
  • Binary cross-entropy for binary classification. Penalizes confidently wrong predictions much more than uncertain ones.
  • Categorical cross-entropy for multi-class classification. Generalization of binary cross-entropy to more than two classes.
  • Hinge loss for support vector machines.
  • 0-1 loss (count of mistakes) is the most intuitive loss, but it is piecewise constant: its gradient is zero wherever it is defined, so it gives gradient descent nothing to follow. Instead we use surrogate losses (cross-entropy, hinge) that vary continuously with the model’s output and provide usable gradients.
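
The losses listed above can be sketched in plain Python. This is a minimal illustration rather than a library implementation: the function names and the eps clipping value are assumptions made for the example. Binary cross-entropy is assumed to take labels in {0, 1} and predicted probabilities; hinge loss is assumed to take labels in {-1, +1} and raw scores.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of squared deviations."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood for labels in {0, 1}.

    p_pred holds predicted probabilities of class 1; eps (an assumed
    value) keeps log() away from zero.
    """
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

def categorical_cross_entropy(true_classes, prob_rows, eps=1e-12):
    """Average of -log(probability assigned to the true class index)."""
    total = 0.0
    for c, row in zip(true_classes, prob_rows):
        total += -math.log(max(row[c], eps))
    return total / len(true_classes)

def hinge(y_true, scores):
    """Average hinge loss for labels in {-1, +1} and raw scores."""
    return sum(max(0.0, 1.0 - t * s) for t, s in zip(y_true, scores)) / len(y_true)

def zero_one(y_true, y_pred):
    """Count of mistakes."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p)
```

Each function returns a single number for a whole batch of examples, which is what lets an optimizer compare one parameter setting against another.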

The Introduction to Data Science textbook uses one symbol for the loss; other sources sometimes use a different one. The arrow in gradient descent’s update rule means assignment, the same as = in a Python program.
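
In code, the arrow becomes an ordinary assignment. Here is a minimal sketch of a single update, assuming a one-parameter loss (theta - 3)^2, a starting value of 0, and a learning rate of 0.1. All three are invented for illustration and do not come from the textbook.

```python
def grad(theta):
    # Derivative of the assumed toy loss (theta - 3)**2.
    return 2.0 * (theta - 3.0)

theta = 0.0          # initial parameter value (assumed)
learning_rate = 0.1  # step size (assumed)

# The update rule "theta <- theta - learning_rate * grad(theta)",
# written with Python's "=": the right-hand side is evaluated with
# the old theta, then the result overwrites theta.
theta = theta - learning_rate * grad(theta)

print(theta)
```

Read as an equation, the line would be contradictory; read as an assignment, it replaces the old value with the new one.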

Why squaring (for MSE)

Two reasons. No sign cancellation: errors above and below the true value don’t cancel out, because both contribute positively to the loss. Disproportionate penalty for large errors: an error of 4 contributes 16, an error of 2 contributes 4, an error of 1 contributes 1. The model is pushed to avoid big mistakes even at the cost of accepting more small ones.
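
Both effects can be checked with a few lines of arithmetic. The two error lists below are made up for illustration: one has a single large error, the other has four small errors of mixed sign.

```python
def mean_squared(errors):
    return sum(e ** 2 for e in errors) / len(errors)

def mean_absolute(errors):
    return sum(abs(e) for e in errors) / len(errors)

one_big = [4, 0, 0, 0]       # a single large error
many_small = [1, -1, 1, -1]  # several small errors, above and below

# Averaging raw errors lets positive and negative ones cancel.
print(sum(many_small) / len(many_small))                   # 0.0

# Both lists have the same mean absolute error...
print(mean_absolute(one_big), mean_absolute(many_small))   # 1.0 1.0

# ...but squaring charges the single large error four times as much.
print(mean_squared(one_big), mean_squared(many_small))     # 4.0 1.0
```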

Why cross-entropy (for classification)

Cross-entropy penalizes confident mistakes much more harshly than uncertain ones. A model that hedges (predicting probability 0.5 on the wrong class) suffers less than a model that confidently insists on the wrong answer (predicting 0.99 on the wrong class). This is intuitively the right behaviour: confidence should be earned.
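
The two cases can be put into numbers. This sketch assumes binary classification and the natural logarithm, so the loss for one example is minus the log of the probability given to the true class; the helper name is hypothetical.

```python
import math

def loss_for_true_class(p_true):
    """Cross-entropy of one example: -ln(probability given to the true class)."""
    return -math.log(p_true)

# Putting 0.5 on the wrong class leaves 0.5 on the true class.
hedged = loss_for_true_class(0.5)

# Putting 0.99 on the wrong class leaves only 0.01 on the true class.
confident_wrong = loss_for_true_class(0.01)

print(round(hedged, 3), round(confident_wrong, 3))  # 0.693 4.605
```

The confident mistake costs more than six times as much as the hedge, and the penalty grows without bound as the probability on the true class approaches zero.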

The training problem

Given a model with parameters $\theta$ and a loss function $L(\theta)$ measured on training data, training is the optimization problem

$$\theta^{*} = \arg\min_{\theta} L(\theta)$$

For linear regression with MSE, a closed-form solution exists. For most other models there is none, and gradient descent iterates step by step toward a minimum.
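
The two routes can be compared on a tiny example. The sketch below assumes a single feature plus an intercept and four made-up data points lying on y = 2x + 1; the learning rate and step count are also assumptions, chosen so that the iteration converges on this data.

```python
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
n = len(xs)

# Closed form for one feature plus intercept (ordinary least squares).
x_mean = sum(xs) / n
y_mean = sum(ys) / n
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)
w_closed = s_xy / s_xx
b_closed = y_mean - w_closed * x_mean

# Gradient descent on the same MSE loss, starting from zero.
w, b = 0.0, 0.0
learning_rate = 0.05  # assumed; too large a value would diverge
for _ in range(2000):
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    grad_b = 2.0 / n * sum(residuals)
    w = w - learning_rate * grad_w
    b = b - learning_rate * grad_b

print(w_closed, b_closed)
print(round(w, 6), round(b, 6))
```

The closed form gets the answer in one calculation, while gradient descent approaches the same slope and intercept over many small steps.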

A loss surface that is convex (bowl-shaped, so every local minimum is also a global minimum) is well-behaved for gradient descent. The loss surface for logistic regression with binary cross-entropy is convex; the loss surface for a deep neural network is generally not, and gradient descent can stall near saddle points or settle in bad local minima. Modern training algorithms (Adam, SGD with momentum) use momentum and, in Adam’s case, per-parameter adaptive step sizes, which often help the optimizer move past such regions.
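
The difference between convex and non-convex surfaces shows up even in one dimension. Both functions below are toy stand-ins invented for illustration, not the loss of any real model, and the optimizer is plain gradient descent without momentum.

```python
def descend(grad, w, learning_rate=0.01, steps=2000):
    """Plain gradient descent (no momentum) on a one-parameter function."""
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# Convex toy loss: f(w) = (w - 1)^2, with a single minimum at w = 1.
def convex_grad(w):
    return 2.0 * (w - 1.0)

# Non-convex toy loss: g(w) = w^4 - 3w^2 + w, with two separate minima.
def nonconvex_loss(w):
    return w ** 4 - 3.0 * w ** 2 + w

def nonconvex_grad(w):
    return 4.0 * w ** 3 - 6.0 * w + 1.0

# Convex: both starting points reach the same minimum.
convex_left = descend(convex_grad, -2.0)
convex_right = descend(convex_grad, 2.0)

# Non-convex: the starting point decides which minimum is found.
left = descend(nonconvex_grad, -2.0)
right = descend(nonconvex_grad, 2.0)

print(round(convex_left, 3), round(convex_right, 3))
print(round(left, 3), round(right, 3))
print(nonconvex_loss(left) < nonconvex_loss(right))  # True
```

On the non-convex function, the run started on the right settles in the shallower minimum and never finds the deeper one on the left. This is the kind of outcome the phrase "bad local minimum" refers to.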