
Computing Log-Sum-Exp

This post is about a computational trick that everyone should know, but that doesn’t tend to be explicitly taught in machine learning courses.  Imagine that we have a set of N values, \{x_n\}_{n=1}^N, and we want to compute the quantity

(1)   \begin{align*} z = \log \sum_{n=1}^N \exp\{x_n\}. \end{align*}

This comes up all the time when you want to parameterize a multinomial distribution using a softmax, e.g., when doing logistic regression with more than two unordered categories.  If you want to compute the log likelihood, you’ll find such an expression in the normalization constant.  Computing this naively can be a recipe for disaster, due to underflow or overflow, depending on the scale of the x_n.

Consider a simple example, with the vector [0\;1\;0].  This is straightforward, and we get about 1.55.  Now what about [1000\;1001\;1000]?  This seems like it should be just as easy, but instead our computer gives us back \infty.  If we try [-1000\;-999\;-1000] we get -\infty.  What’s happening here?  In a typical 64-bit double, \exp\{1000\}=\infty and \exp\{-1000\}=0, due to overflow and underflow, respectively.  Even though the log would bring the numbers back to a reasonable scale with infinite precision, that doesn’t help on a real computer with ordinary floating-point arithmetic.  What to do?
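A quick sketch in NumPy makes the failure concrete (the function name naive_logsumexp is mine, just for illustration):

```python
import numpy as np

def naive_logsumexp(x):
    # Direct translation of log(sum(exp(x))): prone to overflow/underflow.
    return np.log(np.sum(np.exp(x)))

print(naive_logsumexp(np.array([0.0, 1.0, 0.0])))         # about 1.5514
print(naive_logsumexp(np.array([1000.0, 1001.0, 1000.0])))   # inf: exp(1000) overflows
print(naive_logsumexp(np.array([-1000.0, -999.0, -1000.0]))) # -inf: exp(-1000) underflows to 0
```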

The trick to resolve this issue is pretty simple and exploits the identity:

(2)   \begin{align*} \log\sum_{n=1}^N\exp\{x_n\} &= a + \log\sum_{n=1}^N\exp\{x_n-a\} \end{align*}
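The identity is easy to verify: factoring \exp\{a\} out of the sum gives

```latex
\begin{align*}
a + \log\sum_{n=1}^N \exp\{x_n - a\}
  &= a + \log\left( \exp\{-a\} \sum_{n=1}^N \exp\{x_n\} \right) \\
  &= a - a + \log\sum_{n=1}^N \exp\{x_n\}
   = \log\sum_{n=1}^N \exp\{x_n\}.
\end{align*}
```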

for any a.  This means you are free to shift the exponentiated values up and down however you like.  A typical choice is

(3)   \begin{align*} a = \max_n x_n, \end{align*}

which ensures that the largest value you exponentiate is exactly zero.  You therefore cannot overflow, and even if every other term underflows to zero, the sum is at least \exp\{0\}=1, so the logarithm remains well behaved.
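Putting the pieces together, a minimal stable implementation might look like the sketch below (in practice you would reach for a library routine such as scipy.special.logsumexp, which does the same thing):

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x))) via the max-shift trick."""
    x = np.asarray(x, dtype=float)
    a = np.max(x)                  # shift so the largest exponent is zero
    if np.isinf(a):                # all entries -inf: the sum is 0, log is -inf
        return a
    return a + np.log(np.sum(np.exp(x - a)))

print(logsumexp([1000.0, 1001.0, 1000.0]))  # about 1001.5514, no overflow
```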

Posted in Computation.


One Response


  1. sitaram says

    Nice explanation!  It would be good to give a pointer to the proof of equation 2, and also to show its use in the forward algorithm for HMMs.  Here is the second one for HMM users:
