
A visual exploration of the softmax function

The softmax function maps vectors to the probability simplex, making it a useful and ubiquitous function in machine learning. Formally, for $x \in \mathbb{R}^n$,

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}, \qquad i = 1, \dots, n.$$
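To make the definition concrete, here is a minimal NumPy sketch of the softmax; the function name and the max-subtraction trick for numerical stability are my own choices here, not taken from the linked gist.

```python
import numpy as np

def softmax(x):
    """Map a vector in R^n to the probability simplex."""
    x = np.asarray(x, dtype=float)
    shifted = x - x.max()  # avoids overflow; harmless, by the translation invariance discussed below
    exp_x = np.exp(shifted)
    return exp_x / exp_x.sum()

print(softmax([0.0, 1.0]))  # approximately [0.269, 0.731]
```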

To get an intuition for its properties, we'll depict the effect of the softmax function for $n = 2$. First, for inputs coming from an equally-spaced grid (depicted in red), we plot the softmax outputs (depicted in green):
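The gist linked at the end contains the actual plotting code; a simplified matplotlib stand-in (with a grid range of my own choosing) might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import softmax

# Equally-spaced grid of 2-d inputs and their softmax outputs.
grid = np.linspace(-2.0, 2.0, 9)
inputs = np.array([(a, b) for a in grid for b in grid])
outputs = softmax(inputs, axis=1)  # row-wise softmax

plt.scatter(inputs[:, 0], inputs[:, 1], c="red", s=12, label="inputs")
plt.scatter(outputs[:, 0], outputs[:, 1], c="green", s=12, label="softmax outputs")
plt.gca().set_aspect("equal")
plt.legend()
plt.show()
```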

The output points all lie on the line segment connecting (x=0, y=1) and (x=1, y=0). This is because, given 2-dimensional inputs, the softmax maps to the 1-dimensional simplex, the set of all points in $[0, 1]^2$ that sum to 1. Thus, the softmax function is not one-to-one (injective): multiple inputs map to the same output. For example, all points along the line $y = x$ are mapped to $(0.5, 0.5)$.
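A quick numeric check of the non-injectivity, using scipy.special.softmax and a few inputs chosen for illustration:

```python
import numpy as np
from scipy.special import softmax

# Distinct points along the line y = x all collapse to the same output.
for x in [[0.0, 0.0], [1.0, 1.0], [-3.0, -3.0]]:
    print(x, "->", softmax(np.array(x)))  # each prints [0.5, 0.5]
```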

More interestingly, for any output (green dot), the inputs (red dots) that map to it are all located along 45-degree lines, i.e. lines parallel to the line $y = x$. This is not an accident: the softmax function is translation-invariant. Formally, for any $c \in \mathbb{R}$:

$$\mathrm{softmax}(x + c\mathbf{1}) = \mathrm{softmax}(x).$$
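This property is easy to verify numerically; the particular input and shifts below are arbitrary examples of mine:

```python
import numpy as np
from scipy.special import softmax

x = np.array([0.3, 1.7])
for c in [-5.0, 0.0, 2.5]:
    print(softmax(x + c))  # the same output, roughly [0.198, 0.802], for every shift c
```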

Next, we plot the effect of the softmax function on points already lying within the probability simplex. For inputs (in red) equally spaced between (x=0, y=1) and (x=1, y=0), we plot their softmax outputs (in green):

What this reveals is that the softmax shrinks inputs towards the point $(0.5, 0.5)$. Except when the input already represents a uniform distribution over all classes, the softmax function is not idempotent.
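A small numeric illustration of this shrinking (and of the one fixed point), with inputs I picked for the example:

```python
import numpy as np
from scipy.special import softmax

p = np.array([0.9, 0.1])              # already on the simplex
print(softmax(p))                      # ~[0.690, 0.310]: pulled towards (0.5, 0.5)
print(softmax(softmax(p)))             # ~[0.594, 0.406]: applying it again shrinks further
print(softmax(np.array([0.5, 0.5])))   # [0.5, 0.5]: the uniform point is a fixed point
```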

We can illustrate this further by plotting equally-spaced softmax outputs (as colored points on the simplex), and then drawing curves with matching colors to depict the set of inputs which map to a particular output. Due to translation invariance, these curves will actually be lines, as depicted below.

Notice that it becomes increasingly difficult to produce high-confidence outputs. For a particular softmax output $(p_1, p_2)$, the corresponding set of inputs is the set of all points of the form $(\log p_1 + c, \; \log p_2 + c)$ for $c \in \mathbb{R}$. Even though the outputs are equally-spaced (0.1, 0.9), (0.2, 0.8), and so on, the inputs are increasingly distant when the probability mass is concentrated on either the first or the second class.
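One way to quantify this: to produce the output $(p, 1-p)$, the gap between the two inputs must be $\log\frac{p}{1-p}$, which grows without bound as $p$ approaches 0 or 1. A small sketch of that calculation:

```python
import numpy as np

# Input gap x1 - x2 needed so that softmax((x1, x2)) = (p, 1 - p).
for p in [0.5, 0.6, 0.7, 0.8, 0.9, 0.99]:
    gap = np.log(p / (1 - p))
    print(f"p = {p:.2f}  ->  required input gap = {gap:.2f}")
# The gap grows from 0.00 at p = 0.5 to ~2.20 at p = 0.9 and ~4.60 at p = 0.99.
```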

Suppose that we wanted to make the softmax function a bit more idempotent-ish. One way to do this is by using the temperature-annealed softmax (Hinton et al., 2015):

$$\mathrm{softmax}_T(x)_i = \frac{e^{x_i / T}}{\sum_{j=1}^n e^{x_j / T}},$$

where $T$ can be thought of as a temperature. Typically, one uses a cooled-down softmax with $T < 1$, which leads to greater certainty in the output predictions. This is especially useful in inducing a loss function that focuses on the smaller number of hard examples, rather than on the much larger number of easy examples (Wang & Liu, 2021).
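A minimal sketch of the temperature-annealed softmax, following the formula above (my own illustrative implementation, not the gist's code):

```python
import numpy as np

def softmax_T(x, T=1.0):
    """Temperature-annealed softmax: the ordinary softmax applied to x / T."""
    x = np.asarray(x, dtype=float) / T
    x = x - x.max()  # numerical stability
    exp_x = np.exp(x)
    return exp_x / exp_x.sum()

x = [0.0, 1.0]
print(softmax_T(x, T=1.0))   # ~[0.269, 0.731]
print(softmax_T(x, T=0.5))   # ~[0.119, 0.881]: cooler temperature, sharper output
```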

To visualize this effect, we repeat the previous plot, but now with a cooled-down temperature $T < 1$:

We see that we can now “get by” with smaller changes in the inputs to the softmax. But the temperature-annealed softmax is still translation-invariant.
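In terms of the earlier gap calculation, producing the output $(p, 1-p)$ now requires an input gap of only $T \log\frac{p}{1-p}$, so cooling the temperature shrinks the required gap proportionally. A quick check, with temperatures chosen for illustration:

```python
import numpy as np

# Input gap needed for a (0.99, 0.01) output at different temperatures.
for T in [1.0, 0.5, 0.25]:
    gap = T * np.log(0.99 / 0.01)
    print(f"T = {T:.2f}  ->  gap needed: {gap:.2f}")
# T = 1.00 needs ~4.60, T = 0.50 needs ~2.30, T = 0.25 needs ~1.15.
```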

Code for these plots is here: https://gist.github.com/calvinmccarter/cae597d89722aae9d8864b39ca6b7ba5

Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).

Wang, Feng, and Huaping Liu. "Understanding the behaviour of contrastive loss." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021).