When implementing algorithms for inference and learning with probabilistic models, it commonly comes up that one needs to sample from a discrete distribution. That is, from a multinomial distribution with parameter , such that and . A somewhat more common occurrence is that we have a where , but we don’t know the normalization constant. That is, our is only proportional to the multinomial parameter . We want to rapidly generate a variate according to , given , something easily done with (Matlab) code such as this (paraphrased from Tom Minka‘s Lightspeed Toolbox):

cdf = cumsum(phi);

samp_k = sum(cdf < rand()*cdf(end)) + 1;

This is nice and simple, but you'll notice that it has time complexity for setup (computing the CDF) and time complexity per sample. The per-sample complexity could probably be reduced to with a better data structure for finding the threshold. It turns out, however, that we can do better and get for the sampling, while still being .

One such method is due to Kronmal and Peterson (1979) and is called the *alias* method. It is also described in the wonderful book by Devroye (1986). George Dahl, Hugo Larochelle and I used this method in our ICML paper on learning undirected n-gram models with large vocabularies.

The basic idea is to observe that any discrete distribution over outcomes can be turned into a *uniform* distribution over (possibly degenerate) binary outcomes. Since sampling from a uniform distribution can be done in constant time, it is easy to sample once you've computed an appropriate mixture. The setup cost is linear in . You can convince yourself that such a mixture exists using induction. First, if , it is clearly easy. For , find and . We know that , so use these two to create a binary mixture between outcomes and where this component now owns all of the probability mass for but only of the mass for . Having done this, we now have a new discrete distribution with outcomes, which we can iterate until there is only one outcome.

In practice, this can be turned into a linear time algorithm in the following way (from Devroye (1986)), implemented in Python:

import numpy as np import numpy.random as npr def alias_setup(probs): K = len(probs) q = np.zeros(K) J = np.zeros(K, dtype=np.int) # Sort the data into the outcomes with probabilities # that are larger and smaller than 1/K. smaller = [] larger = [] for kk, prob in enumerate(probs): q[kk] = K*prob if q[kk] < 1.0: smaller.append(kk) else: larger.append(kk) # Loop though and create little binary mixtures that # appropriately allocate the larger outcomes over the # overall uniform mixture. while len(smaller) > 0 and len(larger) > 0: small = smaller.pop() large = larger.pop() J[small] = large q[large] = q[large] - (1.0 - q[small]) if q[large] < 1.0: smaller.append(large) else: larger.append(large) return J, q def alias_draw(J, q): K = len(J) # Draw from the overall uniform mixture. kk = int(np.floor(npr.rand()*K)) # Draw from the binary mixture, either keeping the # small one, or choosing the associated larger one. if npr.rand() < q[kk]: return kk else: return J[kk] K = 5 N = 1000 # Get a random probability vector. probs = npr.dirichlet(np.ones(K), 1).ravel() # Construct the table. J, q = alias_setup(probs) # Generate variates. X = np.zeros(N) for nn in xrange(N): X[nn] = alias_draw(J, q)

- R. A. Kronmal and A. V. Peterson. On the alias method for generating random variables from a discrete distribution. The American Statistician, 33(4):214-218, 1979.
- L. Devroye. Non-Uniform Random Random Variate Generation. Springer-Verlag, New York, 1986. (freely available!)

This is a really nice clever method.

Just one observation about the python implementation above: when updating the probability mass (line 29), wouldn’t it be better to rearrange it to:

q[large] = (q[large] + q[small]) – 1.0 or q[large] = (q[large] – 1.0) + q[small]

To minimise the rounding error?

(see http://www.keithschwarz.com/darts-dice-coins/ section: “A Practical Version of Vose’s Algorithm”)