This is the second of two posts based on a testing tutorial I’m writing with David Duvenaud.

In my last post, I talked about checking the MCMC updates using unit tests. Most of the mistakes I’ve caught in my own code were ones I caught with unit tests. (Naturally, I have no idea about the ones I haven’t caught.) But no matter how thoroughly we unit test, there are still subtle bugs that slip through the cracks. Integration testing is a more global approach, and tests the overall behavior of the software, which depends on the interaction of multiple components.

When testing samplers, we’re interested in testing two things: whether our algorithm is mathematically correct, and whether it is converging to a correct solution. These two goals are somewhat independent of each other: a mathematically correct algorithm can get stuck and fail to find a good solution, and (counterintuitively) a mathematically incorrect algorithm can often find the correct solution or at least get close. This post is about checking mathematical correctness.

The gold standard for testing MCMC algorithms is the Geweke test, as described in Geweke’s paper “Getting it right: joint distribution tests of posterior simulators.” [1] The basic idea is simple: suppose you have a generative model over parameters and data , and you want to test an MCMC sampler for the posterior . There are two different ways to sample from the joint distribution over and . First, you can forward sample, i.e. sample from , and then sample from . Second, you can start from a forward sample, and run a Markov chain where you alternate between

- Updating using one of your MCMC transition operators, which should preserve the distribution , and
- resampling the data from the distribution .

Since each of these operations preserves the stationary distribution, each step of this chain is a perfect sample from the joint distribution .

If your sampler is correct, then each of these two procedures should yield samples from exactly the same distribution. You can test this by comparing a variety of statistics of the samples, e.g. the mean of , the maximum absolute value of , and so on. No matter what statistics you choose, they should be indistinguishable between the two distributions.

The beauty of this approach is that it amplifies subtle bugs in the sampler. For instance, suppose we have a bug in our sampler which makes the posterior noise variance 1% too large. This would be nearly impossible to detect by simply running your model on real data and seeing if the results look plausible. Now consider what the Geweke test would do. After the parameters are sampled, the data will be generated with a 1% larger noise variance. When the parameters are resampled, the noise variance will be 1% too large on top of that, or 2%. Pretty soon, the noise variance will explode.

In his paper, Geweke uses frequentist hypothesis tests to compare the distributions. This can be difficult, however, since you’d need to account for dependencies between the samples in order to determine significance. A simpler approach is to make a P-P plot of the statistics. If it’s very close to a straight line, the test “passes”:

If it’s far off, you have a bug:

If the results are noisy and unclear, generate more forward samples and run the chains for longer:

(While these figures are synthetic, they’re representative of outputs I’ve gotten in the past.) If you wind up with the third case, and you can’t run the sampler any longer because it’s already too slow, try reducing the number of data points. Not only does this speed up each iteration, but it also makes the chains mix faster — the more data points there are, the stronger the coupling between and .

There’s a major drawback of the Geweke test, unfortunately: it gives you no indication of where the bug might be. All you can do is keep staring at your code till you find the bug. Therefore, there’s no point in running this test until all your unit tests already pass. This meshes with the general test-driven development methodology, where you get your unit tests to pass before the integration tests.

[1] J. Geweke. Getting it right: joint distribution tests of posterior simulators. JASA, 2004.

Samples from the Markov chain (second routine) are correlated. How is it that Markov chain produces “perfect” samples from the joint?

Hi Roger — I remember you telling me about this test and it sounds pretty interesting though I missed this post until now.

Here’s another posterior inference testing approach, that I’ve used before, though only for small models:

http://www.stat.cmu.edu/~acthomas/724/Cook.pdf

(Cook, Gelman, Rubin 2006, Validation of Software for Bayesian Models Using Posterior Quantiles)

It doesn’t have the magnification effect, but it applies more generally than MCMC (though that paper casts it overly narrowly, only talking about for sampling methods). You have to do many simulations: Simulate theta’ ~ fixedprior, then x’ ~ theta’, then run your inference algorithm to compute a posterior CDF of theta’ on the p(theta|x,fixedprior) distribution. Over many data simulations, this should be uniform. Or put another way, frequentist coverage is correct: e.g. 50% intervals contain the true theta’ exactly 50% of the time.

Like Geweke, the Cook et al. paper presents frequentist tests for this, but I think more useful is that they plot the quantile values (or rather, z-score transformations of them, which make it easy to see extreme cases). This probably can’t get subtle differences like your 1% overestimate example.

However, I’ve found one nice thing: you can make P-P plots (um, or “QQ plots”?) to get some idea of whether the posteriors tend to be too wide or too narrow, which correspond to S-shapes on a P-P plot. Or it’s easier to check just with CI coverage (e.g. if your 50% intervals trap theta’ 70% of the time, your posterior variance is too wide).

I think biased estimates correspond to humps above or below on that P-P plot, though I’d have to think through it more. There also might be certain types of checks you can do with point estimates too — your posterior means should be unbiased, maybe, so you can check whether you tend to be too high or too low? I’m less sure about this.