Sigmoid function

The sigmoid function (also called the logistic function) is

$g (x) = \frac{1}{1 + e ^{- x}}$

Image: The logistic sigmoid function, public domain

It maps the entire real line to the interval $(0, 1)$ in an S-shape. The limits:

As $x \to -\infty$: $e^{-x} \to \infty$, so $g(x) \to 0$.
As $x \to +\infty$: $e^{-x} \to 0$, so $g(x) \to 1$.
At $x = 0$: $e^{0} = 1$, so $g(0) = 1/2$.

The shape is nearly flat near 0 for very negative inputs, rises steeply through 0.5 at the origin, and levels off near 1 for very positive inputs.
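The limits above can be checked numerically. The following is a minimal sketch, not a reference implementation: the function name `sigmoid` and the branch on the sign of `x` are choices made here for illustration. The branch is needed because evaluating $e^{-x}$ directly overflows for very negative $x$.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid g(x) = 1 / (1 + e^{-x}), evaluated without overflow."""
    if x >= 0:
        # Here e^{-x} <= 1, so the textbook form is safe.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, e^{-x} could overflow; use the equivalent e^x / (1 + e^x).
    ex = math.exp(x)
    return ex / (1.0 + ex)

# Sample the curve far to the left, at the origin, and far to the right.
for x in (-1000.0, -5.0, 0.0, 5.0, 1000.0):
    print(x, sigmoid(x))
```

Mathematically the output never reaches 0 or 1, but in floating point the extreme inputs above round to exactly `0.0` and `1.0`, which matters if the result is later passed to a logarithm.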

Why this matters for classification

Because the sigmoid’s output is always in the open interval $(0, 1)$, we can interpret it as a probability estimate. If $g(z) = 0.7$, the model is predicting that class 1 has probability 0.7 under the model’s assumptions. Whether that estimate is well calibrated (whether examples the model rates 0.7 actually turn out to be class 1 about 70% of the time) depends on whether the model’s linearity assumption matches the data and on the training procedure. Logistic regression with cross-entropy on well-specified data tends to produce reasonably calibrated probabilities; more flexible models (deep networks, boosted trees) are often miscalibrated, with deep networks in particular tending toward overconfidence, and may need post-hoc calibration. The takeaway: a sigmoid output is a probability estimate, not literally “the probability.”

The sigmoid takes a linear combination of features that can be any real number:

$z = w_{0} + w_{1} x_{1} + w_{2} x_{2} + \dots + w_{m} x_{m}$

and squashes it into a valid probability. The S-shape means that $g(z)$ is most sensitive to changes in $z$ near zero, where its slope peaks at $1/4$ (the region around the decision boundary), while large positive or negative $z$ produce nearly constant outputs close to 1 or 0 (the model is confident).
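As a rough illustration of the linear-combination-then-squash step, the sketch below computes $z$ and $g(z)$ for a two-feature model. The weights and feature values are hypothetical numbers chosen only so the arithmetic is easy to follow, and `predict_proba` is a name invented for this example.

```python
import math

def sigmoid(z: float) -> float:
    """Overflow-safe logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, features):
    """weights = [w0, w1, ..., wm] (w0 is the intercept); features = [x1, ..., xm]."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)

# Hypothetical weights: w0 = -1.0, w1 = 2.0, w2 = 0.5.
weights = [-1.0, 2.0, 0.5]

# The same +0.1 step in x1 (a +0.2 step in z), taken near z = 0 and near z = 9.
near = predict_proba(weights, [0.6, 0.0]) - predict_proba(weights, [0.5, 0.0])
far = predict_proba(weights, [5.1, 0.0]) - predict_proba(weights, [5.0, 0.0])

print("change in g(z) near z = 0:", near)
print("change in g(z) near z = 9:", far)
```

The first difference is a few percentage points; the second is orders of magnitude smaller, which is the saturation behaviour described above.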

Derivative

A useful property: the derivative of the sigmoid has a clean form in terms of the sigmoid itself:

$g'(x) = g(x) (1 - g(x))$

This makes the gradient of logistic regression’s binary cross-entropy loss very simple: by the chain rule, the derivative of the loss with respect to $z$ reduces to $g(z) - y$, where $y \in \{0, 1\}$ is the true label. It’s one of the reasons the sigmoid + cross-entropy combination is the canonical pair for classification.
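Both claims can be sanity-checked with finite differences. The sketch below compares the closed form $g(x)(1 - g(x))$ with a central-difference estimate, and then does the same for the per-example cross-entropy loss, whose derivative with respect to $z$ is the standard result $g(z) - y$ for a label $y \in \{0, 1\}$. The helper names (`sigmoid_grad`, `bce`, `central_diff`) and the test points are assumptions made for this example.

```python
import math

def sigmoid(x: float) -> float:
    """Overflow-safe logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

def sigmoid_grad(x: float) -> float:
    """Closed-form derivative g'(x) = g(x) * (1 - g(x))."""
    g = sigmoid(x)
    return g * (1.0 - g)

def bce(z: float, y: int) -> float:
    """Binary cross-entropy for one example with score z and label y in {0, 1}.
    Intended for moderate z only; log(0) is not guarded against here."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def central_diff(f, x: float, h: float = 1e-6) -> float:
    """Numerical derivative estimate (f(x + h) - f(x - h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Closed form vs. numerical estimate for the sigmoid itself.
for x in (-2.0, 0.0, 1.5):
    print(x, sigmoid_grad(x), central_diff(sigmoid, x))

# Gradient of the loss with respect to z: g(z) - y vs. numerical estimate.
z, y = 0.8, 1
print(sigmoid(z) - y, central_diff(lambda t: bce(t, y), z))
```

In each printed pair the two numbers should agree to several decimal places; the residual difference comes from the finite step size `h`.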

Alternatives

The sigmoid is one of several activation functions that show up in machine learning:

tanh: $\tanh(x)$, similar S-shape but maps to $(-1, 1)$ instead of $(0, 1)$. Equivalent to a shifted and scaled sigmoid: $\tanh(x) = 2 g(2x) - 1$.
ReLU: $\max(0, x)$, zero for negative inputs and linear for positive ones. Standard in modern neural networks because its gradient is 1 for every positive input, which mitigates the vanishing gradient problem the sigmoid has for large $|x|$.
softmax: generalizes the sigmoid to multi-class classification. For $K$ classes, $\operatorname{softmax}(z)_{k} = e^{z_{k}} / \sum_{j=1}^{K} e^{z_{j}}$. Produces a probability distribution over the $K$ classes; for $K = 2$ it reduces to the sigmoid applied to the difference of the two scores.
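The relationships between these functions and the sigmoid can be checked numerically. The sketch below verifies the identity $\tanh(x) = 2 g(2x) - 1$ and that a two-class softmax matches the sigmoid of the score difference. The function names and sample inputs are illustrative assumptions, and subtracting the maximum score inside `softmax` is a common stability trick rather than part of the definition.

```python
import math

def sigmoid(x: float) -> float:
    """Overflow-safe logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

def relu(x: float) -> float:
    """max(0, x): zero for negative inputs, identity for positive ones."""
    return max(0.0, x)

def softmax(z):
    """Softmax over a list of scores; shifting by max(z) leaves the result
    unchanged but keeps the exponentials from overflowing."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# tanh as a shifted and scaled sigmoid.
x = 0.7
print(math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

# Two-class softmax vs. sigmoid of the score difference.
z = [0.3, 1.9]
print(softmax(z)[1], sigmoid(z[1] - z[0]))

# ReLU on a negative and a positive input.
print(relu(-3.0), relu(2.5))
```

Each of the first two printed pairs should match up to floating-point rounding.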

In binary classification with logistic regression, the sigmoid is the standard choice. For multi-class problems, softmax replaces it. For hidden layers of neural networks, ReLU dominates in modern practice.

Idriss Rami — Notes
