

The empirical science of machine learning: evaluating RBMs

In the sciences, researchers are recognized not just for their discoveries about the physical world, but also for developing the technologies that make these discoveries possible. Countless Nobel Prizes have been awarded for advances such as genetic manipulation techniques and X-ray crystallography, which allow new questions to be asked and answered.

Machine learning is also an empirical field: we investigate in which situations a given algorithm or technique is likely to perform well. This is generally a very difficult question to study, so we might expect to hear a lot about advances which help us do empirical work better. Unfortunately, as a community we’re disproportionately focused on advances which directly improve performance on some task. This is the first in a series of posts highlighting threads of research which I think have considerable scientific value, in that they help us do empirical research better.

The first thread of research I want to talk about is the work of Ruslan Salakhutdinov and Iain Murray on evaluating restricted Boltzmann machines (RBMs). RBMs are one of the main building blocks in the field of deep learning, where the goal is to automatically learn high-level feature representations of a given kind of data, such as images or human speech. I’ll use the running example of modeling images of handwritten digits, using the MNIST dataset, since this has long been a standard benchmark in the field. An RBM is an undirected graphical model whose graph is bipartite, with two layers of units:

The bottom layer, or “visible” layer, has a set of units representing the inputs. For instance, a 28×28 pixel image would have 784 visible units, one for each pixel. The top layer, or “hidden” layer, contains a set of unobserved units which are supposed to capture higher-level features, such as strokes.

RBMs are generative models, in that they represent a distribution over the input data. Since the model is undirected, this distribution takes the form of a Boltzmann distribution, with variables {\bf v} and {\bf h} and parameters ({\bf a}, {\bf b}, {\bf W}):

    \[ p({\bf v}, {\bf h}) = \frac{1}{Z} \exp \left( {\bf a}^T {\bf v} + {\bf b}^T {\bf h} + {\bf v}^T {\bf W} {\bf h} \right) \]

The parameters ({\bf a}, {\bf b}, {\bf W}) are typically trained using an approximation to maximum likelihood, such as contrastive divergence. The term Z in the denominator, called the partition function, is the sum of unnormalized joint probabilities over all possible joint configurations ({\bf v}, {\bf h}). Unfortunately, this is generally intractable to compute. This means we can’t even answer what’s seemingly the most basic question about a generative model: what is the probability of a given data point? This is problematic, since one of our main ways of comparing generative models is to measure the probability they assign to a held-out test set. Because of this, researchers had to use more heuristic evaluation methods, such as generating (approximate) samples from the distribution or trying to interpret the learned representations.
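
To make these quantities concrete, here is a minimal NumPy sketch (my own illustration, not code from the papers discussed) of the unnormalized log-probability of a visible vector, with the hidden units summed out analytically. The missing normalizer, log Z, is exactly the intractable part.

```python
import numpy as np

def unnorm_log_prob_v(v, a, b, W):
    """Unnormalized log p(v) for a binary RBM, with the hidden units summed out:
    log sum_h exp(a^T v + b^T h + v^T W h) = a^T v + sum_j log(1 + exp(b_j + (v^T W)_j)).
    Shapes: v, a are (V,); b is (H,); W is (V, H)."""
    return v @ a + np.sum(np.logaddexp(0.0, b + v @ W))

# The normalized log-likelihood is unnorm_log_prob_v(v, a, b, W) - log(Z), where
# log(Z) is the log-sum-exp of unnorm_log_prob_v over all 2^V possible visible
# vectors; enumerating those is hopeless for V = 784 pixels, hence the need for AIS.
```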

Ruslan Salakhutdinov and Iain Murray’s 2008 paper, On the quantitative analysis of deep belief networks, gave a way of accurately evaluating the held-out likelihood of the model. They used a sampling algorithm called annealed importance sampling (AIS), a powerful method for estimating partition functions. The idea behind AIS is to start with a simple, tractable distribution, i.e., one for which we can draw exact samples and exactly compute the partition function. For a model with binary variables, a common choice would be a uniform distribution. AIS gradually “anneals” towards the distribution of interest by moving through a series of intermediate distributions, taking a Metropolis-Hastings step with respect to each of the intermediate distributions. With proper bookkeeping, this yields an unbiased estimator of the partition function of the model of interest.
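
Here is a simplified sketch of how AIS can be applied to an RBM. It assumes intermediate distributions obtained by scaling the RBM’s parameters by a temperature beta (so beta = 0 gives the uniform distribution and beta = 1 gives the target RBM), with one Gibbs sweep per intermediate distribution; the actual paper uses a somewhat more elaborate base-rate model, so treat this as an illustration of the bookkeeping rather than a faithful reimplementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_f(v, a, b, W, beta):
    """Unnormalized log-marginal over v of the beta-scaled RBM (hidden units summed out)."""
    return beta * (v @ a) + np.sum(np.logaddexp(0.0, beta * (b + v @ W)), axis=-1)

def gibbs_sweep(v, a, b, W, beta):
    """One Gibbs sweep (h given v, then v given h) for the beta-scaled RBM."""
    h = (rng.random(b.shape) < sigmoid(beta * (b + v @ W))).astype(float)
    return (rng.random(a.shape) < sigmoid(beta * (a + W @ h))).astype(float)

def ais_log_z(a, b, W, n_betas=1000, n_runs=100):
    """AIS estimate of log Z for the RBM with parameters (a, b, W)."""
    V, H = W.shape
    betas = np.linspace(0.0, 1.0, n_betas)
    log_w = np.zeros(n_runs)
    # At beta = 0 the scaled RBM is uniform over (v, h), so log Z_0 = (V + H) log 2
    # and exact samples of v are just coin flips.
    vs = (rng.random((n_runs, V)) < 0.5).astype(float)
    for beta_prev, beta in zip(betas[:-1], betas[1:]):
        log_w += log_f(vs, a, b, W, beta) - log_f(vs, a, b, W, beta_prev)
        vs = np.stack([gibbs_sweep(v, a, b, W, beta) for v in vs])
    log_z0 = (V + H) * np.log(2.0)
    # Z is approximately Z_0 times the mean importance weight; work in log space for stability.
    return log_z0 + np.logaddexp.reduce(log_w) - np.log(n_runs)
```

Plugging the resulting estimate of log Z into the unnormalized log-probability above gives an estimate of the held-out log-likelihood.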

There’s a fly in the ointment, though: AIS may or may not give accurate estimates, and the accuracy is difficult to determine. It can fail for one of two reasons: first, in order to give low variance estimates, it requires that each of the intermediate distributions be very similar to its predecessor. This implies that if the initial distribution is very far from the target one, a very large number of intermediate distributions are needed to ensure that the steps are small. Second, the sampler may fail to explore an important mode of the distribution; for instance, with RBMs, there’s often a mode where all the pixels are off, and Gibbs samplers may take a long time to find this mode. In general, it is very difficult to prove that a sampler has explored every mode of a distribution.

What happens if the AIS sampler is inaccurate?  In this case, it’s not too hard to show that AIS generally underestimates the partition function. Since the partition function appears in the denominator of the joint probability, underestimating the partition function means overestimating the likelihood — the estimator can be arbitrarily optimistic when it fails. This means if the RBM appears to get a high likelihood score, this could be because it’s a good generative model, or because the AIS sampler is bad!

Because of this concern, using AIS to estimate RBM likelihoods requires careful validation. One form of validation was to test the method on small RBMs containing only 25 hidden units. While learning a good model of handwritten digits requires about 500 hidden units, the broad outlines of the distribution can be captured by a model with 25 hidden units. Such RBMs are tractable: both sampling and partition function evaluation can be performed exactly by exhaustive enumeration of all 2^{25} joint assignments to the hidden units. They showed that AIS gave accurate estimates of the partition function for these small RBMs. This isn’t proof by itself, though, since samplers may simply mix faster for small RBMs, and the good results might not generalize to large RBMs.
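
For reference, here is roughly what that exact computation looks like, under the same (a, b, W) parameterization used in the sketches above: the visible units are summed out in closed form, so only the 2^H hidden configurations need to be enumerated, which is feasible for H = 25.

```python
import itertools
import numpy as np

def exact_log_z(a, b, W):
    """Exact log partition function for an RBM with few hidden units, obtained by
    enumerating all 2^H hidden configurations and summing out the visibles in
    closed form: Z = sum_h exp(b^T h) * prod_i (1 + exp(a_i + (W h)_i))."""
    V, H = W.shape
    log_terms = []
    for bits in itertools.product([0.0, 1.0], repeat=H):
        h = np.array(bits)
        log_terms.append(h @ b + np.sum(np.logaddexp(0.0, a + W @ h)))
    return np.logaddexp.reduce(log_terms)
```

Checking that an AIS estimate like the one sketched earlier agrees with this exact value on small models is the essence of this form of validation.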

Another form of validation was to use AIS in two different ways and check that the results were consistent. By using one RBM as the initial distribution and annealing to a different RBM, AIS yields an estimate of the ratio of the two RBMs’ partition functions. This ratio can be compared with the ratio of the individual partition function estimates obtained for each RBM separately. The estimator passed this test: the ratios obtained from both methods were consistent.

Hugo Larochelle and Iain Murray provided an additional form of validation in their later paper, The neural autoregressive distribution estimator. In particular, an alternative method for evaluating the probability of an image is to make the RBM attempt to predict it sequentially, one pixel at a time. The product of the conditional probability terms gives the likelihood of the full image. While these conditional probabilities can’t be computed exactly, the fact that the sequential estimator needs to make actual predictions prevents it from “cheating” in the same way that AIS can cheat.

The main contribution of the paper was the Neural Autoregressive Distribution Estimator (NADE), a probabilistic model inspired by the sequential estimation procedure. Unlike with RBMs, the likelihood could be evaluated exactly under this model. While we wouldn’t expect NADE to get precisely the same likelihood score as an RBM, it should be in the same ballpark because the model architecture is so similar.
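
To illustrate why the likelihood is exact under NADE, here is a minimal sketch of its log-likelihood computation under one common parameterization (the variable names are mine, not the paper’s notation): each pixel’s conditional probability depends only on the preceding pixels through a shared hidden layer, so the full log-likelihood is just a sum of Bernoulli log-probabilities, with no partition function anywhere.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nade_log_likelihood(v, W, V, b, c):
    """Exact log p(v) under a NADE model with D visible and H hidden units.
    Assumed shapes for this sketch: v, b are (D,); c is (H,); W is the (H, D)
    input-to-hidden weight matrix; V is the (D, H) hidden-to-output weight matrix."""
    D = len(v)
    act = c.copy()   # running hidden pre-activation from the pixels seen so far
    log_p = 0.0
    for d in range(D):
        h = sigmoid(act)                    # hidden state given v_1, ..., v_{d-1}
        p_d = sigmoid(b[d] + V[d] @ h)      # P(v_d = 1 | v_1, ..., v_{d-1})
        log_p += v[d] * np.log(p_d) + (1 - v[d]) * np.log(1 - p_d)
        act += W[:, d] * v[d]               # fold pixel d into the pre-activation
    return log_p
```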

At the end of the day, these studies yielded two different likelihood estimates: RBMs with AIS (possibly optimistic), and NADE (exact likelihood for a different model). Both methods yielded very similar estimates of the held-out likelihood, and this combined evidence provides a high degree of reassurance that the estimates are reasonable.

So what eventually came from this ability to estimate the model likelihood? There were some surprising conclusions which have influenced subsequent work in deep learning:

  1. Contrastive divergence (CD), the algorithm which was nearly universally used to train RBMs up to that point, turned out to be a bad way to learn a density model. This approximate training criterion allowed the model to place spurious modes with high probability far away from the data, thereby drastically lowering the likelihood it assigned to the actual data. Better training algorithms like persistent contrastive divergence (PCD) have since been developed to overcome this problem (see the sketch after this list).
  2. While deep belief nets (DBNs) achieved a higher likelihood than RBMs trained with CD, this turned out to be an artifact of the training algorithm. Deep belief nets were able to improve the likelihood by repairing some of the mistakes made by CD training. However, RBMs trained with a better algorithm modeled the data distribution just as well as DBNs. (This result is particular to MNIST, which is regarded as a simple dataset; for more complex datasets like NORB, there appears to still be an advantage for deeper models.)
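
As referenced in item 1, here is a minimal single-example sketch of a PCD update for a binary RBM, again using the (a, b, W) parameterization from earlier (my illustration, not the authors’ code). The only difference from CD is that the negative-phase Gibbs chain persists across parameter updates rather than being re-initialized at the data, which lets it wander far from the data and suppress spurious modes there.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_update(v_data, v_chain, a, b, W, lr=0.01):
    """One persistent contrastive divergence update for a binary RBM.
    Updates a, b, W in place and returns the new state of the persistent chain.
    A bare-bones sketch: a practical implementation would use minibatches,
    momentum, and weight decay."""
    # Positive phase: expected hidden activations given the training example.
    h_data = sigmoid(b + v_data @ W)
    # Negative phase: advance the persistent chain by one Gibbs sweep.
    h_sample = (rng.random(b.shape) < sigmoid(b + v_chain @ W)).astype(float)
    v_chain = (rng.random(a.shape) < sigmoid(a + W @ h_sample)).astype(float)
    h_model = sigmoid(b + v_chain @ W)
    # Stochastic gradient of the log-likelihood: data statistics minus model statistics.
    W += lr * (np.outer(v_data, h_data) - np.outer(v_chain, h_model))
    a += lr * (v_data - v_chain)
    b += lr * (h_data - h_model)
    return v_chain  # pass this back in on the next call
```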

In short, this thread of research on evaluating RBMs is a perfect example of machine learning as an empirical science. The researchers went through nontrivial engineering efforts to build a tool for analyzing models and algorithms. They carefully validated the reliability of this tool through a convergence of several sources of evidence. They applied this tool to rigorously analyze several questions in deep learning which previously could only be talked about heuristically. I think it’s worth thinking more about how to perform this style of research, where we build better tools for analyzing scientific questions in our field.

 
