Skip to content

The Alias Method: Efficient Sampling with Many Discrete Outcomes

When implementing algorithms for inference and learning with probabilistic models, it commonly comes up that one needs to sample from a discrete distribution. That is, from a multinomial distribution with parameter \boldsymbol{\pi}\in\mathbb{R}^K, such that \pi_k\geq 0 and \sum_{k}\pi_k=1. A somewhat more common occurrence is that we have a \boldsymbol{\phi}\in\mathbb{R}^K where \phi_k\geq 0, but we don’t know the normalization constant. That is, our \boldsymbol{\phi} is only proportional to the multinomial parameter \boldsymbol{\pi}. We want to rapidly generate a variate according to \boldsymbol{\pi}, given \boldsymbol{\pi}, something easily done with (Matlab) code such as this (paraphrased from Tom Minka‘s Lightspeed Toolbox):

cdf = cumsum(phi);
samp_k = sum(cdf < rand()*cdf(end)) + 1;

This is nice and simple, but you'll notice that it has \mathcal{O}(K) time complexity for setup (computing the CDF) and \mathcal{O}(K) time complexity per sample. The per-sample complexity could probably be reduced to \mathcal{O}(\log K) with a better data structure for finding the threshold. It turns out, however, that we can do better and get \mathcal{O}(1) for the sampling, while still being \mathcal{O}(K).

One such method is due to Kronmal and Peterson (1979) and is called the alias method. It is also described in the wonderful book by Devroye (1986). George Dahl, Hugo Larochelle and I used this method in our ICML paper on learning undirected n-gram models with large vocabularies.

The basic idea is to observe that any discrete distribution over K outcomes can be turned into a uniform distribution over (possibly degenerate) binary outcomes. Since sampling from a uniform distribution can be done in constant time, it is easy to sample once you've computed an appropriate mixture. The setup cost is linear in K. You can convince yourself that such a mixture exists using induction. First, if K=1, it is clearly easy. For K>1, find k_{\sf min} = \arg\min_k \pi_k and k_{\sf max} = \arg\max_k \pi_k. We know that \pi_{k_{\sf min}}\leq 1/K, so use these two to create a binary mixture between outcomes k_{\sf min} and k_{\sf max} where this component now owns all of the probability mass for k_{\sf min} but only 1/K - \pi_{k_{\sf min}} of the mass for k_{\sf max}. Having done this, we now have a new discrete distribution with K-1 outcomes, which we can iterate until there is only one outcome.

In practice, this can be turned into a linear time algorithm in the following way (from Devroye (1986)), implemented in Python:

import numpy        as np
import numpy.random as npr

def alias_setup(probs):
    K       = len(probs)
    q       = np.zeros(K)
    J       = np.zeros(K,

    # Sort the data into the outcomes with probabilities
    # that are larger and smaller than 1/K.
    smaller = []
    larger  = []
    for kk, prob in enumerate(probs):
        q[kk] = K*prob
        if q[kk] < 1.0:

    # Loop though and create little binary mixtures that
    # appropriately allocate the larger outcomes over the
    # overall uniform mixture.
    while len(smaller) > 0 and len(larger) > 0:
        small = smaller.pop()
        large = larger.pop()

        J[small] = large
        q[large] = q[large] - (1.0 - q[small])

        if q[large] < 1.0:

    return J, q

def alias_draw(J, q):
    K  = len(J)

    # Draw from the overall uniform mixture.
    kk = int(np.floor(npr.rand()*K))

    # Draw from the binary mixture, either keeping the
    # small one, or choosing the associated larger one.
    if npr.rand() < q[kk]:
        return kk
        return J[kk]

K = 5
N = 1000

# Get a random probability vector.
probs = npr.dirichlet(np.ones(K), 1).ravel()

# Construct the table.
J, q = alias_setup(probs)

# Generate variates.
X = np.zeros(N)
for nn in xrange(N):
    X[nn] = alias_draw(J, q)
  • R. A. Kronmal and A. V. Peterson. On the alias method for generating random variables from a discrete distribution. The American Statistician, 33(4):214-218, 1979.
  • L. Devroye. Non-Uniform Random Random Variate Generation. Springer-Verlag, New York, 1986. (freely available!)

Posted in Computation, Statistics.

Tagged with , .

One Response

Stay in touch with the conversation, subscribe to the RSS feed for comments on this post.

  1. emlyn says

    This is a really nice clever method.

    Just one observation about the python implementation above: when updating the probability mass (line 29), wouldn’t it be better to rearrange it to:
    q[large] = (q[large] + q[small]) – 1.0 or q[large] = (q[large] – 1.0) + q[small]
    To minimise the rounding error?
    (see section: “A Practical Version of Vose’s Algorithm”)

You must be logged in to post a comment.