A visual exploration of the softmax function
The softmax function maps a vector of real numbers $x$ to a probability distribution:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}.$$
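As a concrete reference, here is a minimal NumPy sketch of this definition (my own illustration, separate from the gist linked at the end):

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability;
    # this leaves the output unchanged (see translation invariance below).
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(softmax(np.array([0.0, 1.0])))  # ~[0.269, 0.731]
```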
To get an intuition for its properties, we'll depict the effect of the softmax function for 2-dimensional inputs.
The output points all lie on the line segment connecting (x=0, y=1) and (x=1, y=0). This is because, given 2-dimensional inputs, the softmax maps to the 1-dimensional simplex, the set of all points in $\mathbb{R}^2$ with non-negative coordinates that sum to 1.
More interestingly, for any output (green dot), the inputs (red dots) that map to it are all located along 45-degree lines, i.e. lines parallel to the line $y = x$. This is because the softmax is translation-invariant: adding the same constant to every input leaves the output unchanged.
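Both properties are easy to check numerically; the following sketch (with the softmax helper repeated so it runs on its own) verifies them for a random input:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=2)  # an arbitrary 2-dimensional input
p = softmax(x)

# The output lies on the 1-dimensional simplex: non-negative, summing to 1.
assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)

# Shifting the input along the 45-degree direction (adding c to both
# coordinates) does not change the output.
for c in (-3.0, 0.5, 10.0):
    assert np.allclose(softmax(x + c), p)
```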
Next, we plot the effect of the softmax function on points already lying within the probability simplex. For inputs (in red) equally spaced between (x=0, y=1) and (x=1, y=0), we plot the corresponding outputs (in green).
What this reveals is that the softmax shrinks inputs towards the point (x=0.5, y=0.5), the center of the simplex.
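A quick numerical illustration of this shrinkage, under the same setup of inputs already on the simplex:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Inputs already on the simplex: points (t, 1 - t) for t in [0, 1].
for t in np.linspace(0.0, 1.0, 5):
    x = np.array([t, 1.0 - t])
    print(x, "->", softmax(x))
# Each output lands closer to (0.5, 0.5) than its input:
# e.g. (0, 1) -> ~(0.269, 0.731) and (0.25, 0.75) -> ~(0.378, 0.622).
```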
We can illustrate this further by plotting equally-spaced softmax outputs (as colored points in the simplex), together with inputs (in matching colors) that map to each of them.
Notice that it becomes increasingly difficult to produce high-confidence outputs. For a particular softmax output $(p, 1-p)$, the inputs that map to it must be separated by a gap of $\log\frac{p}{1-p}$, which grows without bound as $p$ approaches 1.
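This required gap can be computed in closed form; the short sketch below (my own, not from the post's gist) prints how quickly it grows:

```python
import numpy as np

# For a target output (p, 1 - p), any input (x1, x2) mapping there must
# satisfy x1 - x2 = log(p / (1 - p)), which diverges as p -> 1.
for p in (0.6, 0.9, 0.99, 0.999):
    gap = np.log(p / (1.0 - p))
    print(f"output ({p}, {1 - p:.3f}) requires input gap {gap:.2f}")
```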
Suppose that we wanted to make the softmax function a bit more idempotent-ish. One way to do this is by using the temperature-annealed softmax (Hinton et al., 2015):

$$\mathrm{softmax}_T(x)_i = \frac{e^{x_i / T}}{\sum_j e^{x_j / T}},$$

where $T > 0$ is the temperature.
To visualize this effect, we repeat the previous plot, but now with a temperature $T$ below 1.
We see that we can now “get by” with smaller changes in the inputs to the softmax. But the temperature-annealed softmax is still translation-invariant.
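A sketch of the temperature-annealed softmax, with arbitrary example temperatures (not necessarily the values used in the plot):

```python
import numpy as np

def softmax_t(x, T=1.0):
    # Temperature-annealed softmax: divide inputs by T before exponentiating.
    e = np.exp((x - np.max(x)) / T)
    return e / e.sum()

x = np.array([0.25, 0.75])
for T in (1.0, 0.5, 0.1):
    print(T, softmax_t(x, T))
# Smaller T sharpens the output: the same input gap of 0.5 yields a more
# confident distribution, so less extreme inputs now suffice.
# Translation invariance still holds: softmax_t(x + c, T) == softmax_t(x, T).
```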
Code for these plots is here: https://gist.github.com/calvinmccarter/cae597d89722aae9d8864b39ca6b7ba5
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).
Wang, Feng, and Huaping Liu. “Understanding the behaviour of contrastive loss.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021).