Skip to content

Machine Learning Glossary

I often have a hard time understanding the terminology in machine learning, even after almost three years in the field. For example, what is a Deep Belief Network? I attended a whole summer school on Deep Learning, but I’m still not quite sure. I decided to take a leap of faith and assume this is not just because the Deep Belief Networks in my brain are not functioning properly (although I am sure this is a factor). So, I created a Machine Learning Glossary to try to define some of these terms. The glossary can be found here. I have tried to write in an unpretentious style, defining things systematically and leaving no “exercises to the reader”. I also have a form for readers to request new definitions. Continued…

Posted in Machine Learning.

Optimal Spatial Prediction with Kriging

Suppose we are modeling a spatial process (for instance, the amount of rainfall around the world, the distribution of natural resources, or the population density of an endangered species). We’ve measured the latent function Z at some locations {\bf s}_1, \ldots, {\bf s}_N, and we’d like to predict the function’s value at some new location {\bf s}_0. Kriging is a technique for extrapolating our measurements to arbitrary locations. For an in-depth discussion, see Cressie and Wikle (2011). Here I derive Kriging in a simplified case.

I will assume that Z is an intrinsically stationary process. In other words, there exists some semivariogram \gamma({\bf h}) such that

    \[ \text{var}[Z({\bf s}+{\bf h}) - Z({\bf s})] = 2\gamma({\bf h}) . \]

Furthermore, I will assume that the process is isotropic, (i.e. that \gamma({\bf h}) is a function only of ||h||). As Andy described here, the existence of a covariance function implies intrinsic stationarity. In addition, I will assume that the process has a constant mean, \mathbb E[Z({\bf s})] = \mu. We would like to estimate Z({\bf s}) with a linear combination of our current observations. Our estimator will be Continued…

Posted in Machine Learning, Statistics.

Tagged with , .

Fisher information

I first heard about Fisher information in a statistics class, where it was given in terms of the following formulas, which I still find a bit mysterious and hard to reason about:

    \begin{align*} {\bf F}_\theta &= {\mathbb E}_x[\nabla_\theta \log p(x;\theta) (\nabla_\theta \log p(x;\theta))^T] \\ &= {\rm Cov}_x[ \nabla_\theta \log p(x;\theta) ] \\ &= {\mathbb E}_x[ -\nabla^2_\theta \log p(x; \theta) ]. \end{align*}

It was motivated in terms of computing confidence intervals for your maximum likelihood estimates. But this sounds a bit limited, especially in machine learning, where we’re trying to make predictions, not present someone with a set of parameters. It doesn’t really explain why Fisher information seems so ubiquitous in our field: natural gradient, Fisher kernels, Jeffreys priors, and so on.

This is how Fisher information is generally presented in machine learning textbooks. But I would choose a different starting point: Fisher information is the second derivative of KL divergence. Continued…

Posted in Statistics.

The Gumbel-Max Trick for Discrete Distributions

It often comes up in neural networks, generalized linear models, topic models and many other probabilistic models that one wishes to parameterize a discrete distribution in terms of an unconstrained vector of numbers, i.e., a vector that is not confined to the simplex, might be negative, etc. A very common way to address this is to use the “softmax” transformation:

    \begin{align*} \pi_k &= \frac{\exp\{x_k\}}{\sum_{k'=1}^K\exp\{x_{k'}\}} \end{align*}

where the x_k are unconstrained in \mathbb{R}, but the \pi_k live on the simplex, i.e., \pi_k \geq 0 and \sum_{k}\pi_k=1. The x_k parameterize a discrete distribution (not uniquely) and we can generate data by performing the softmax transformation and then doing the usual thing to draw from a discrete distribution. Interestingly, it turns out that there is an alternative way to arrive at such discrete samples, that doesn’t actually require constructing the discrete distribution.


Posted in Computation, Probability.

Pseudo-marginal MCMC

This post gives a brief introduction to the pseudo-marginal approach to MCMC. A very nice explanation, with examples, is available here. Frequently, we are given a density function \pi({\bf x}), with {\bf x} \in \mathcal X, and we use Markov chain Monte Carlo (MCMC) to generate samples from the corresponding probability distribution. For simplicity, suppose we are performing Metropolis-Hastings with a spherical proposal distribution. Then, we move from the current state {\bf x} to a proposed state {\bf x}' with probability \min(1, \pi({\bf x}')/\pi({\bf x})) .

But what if we cannot evaluate \pi({\bf x}) exactly? Such a situation might arise if we are given a joint density function \pi({\bf x}, {\bf z}), with {\bf z} \in \mathcal Z, and we must marginalize out {\bf z} in order to compute \pi({\bf x}). In this situation, we may only be able to approximate

    \[ \pi({\bf x}) = \int \pi({\bf x},{\bf z}) \, \mathrm d{\bf z} ,\]


Posted in Computation, Probability, Statistics.

Tagged with , , , , .

Chernoff’s bound

If you have some randomness in your life, chances are that you want to try Chernoff’s bound. The most common way to understand randomness is a 2-step combo: find the average behavior, and show that the reality is unlikely to differ too much from the expectation (via Chernoff’s bound or its cousins).

My favorite form of Chernoff’s bound is: for X_1, ..., X_n independent binary random variables, and X=\sum_{i=1}^n X_i, \mu = E[X] and \delta > 2e-1, then

    \[ P(X > (1+\delta)\mu) < 2^{-\delta \mu}. \]

Note that X_i's are not necessarily identically distributed, they just have to be independent. In practice, we often care about significant deviation from the mean, so \delta is typically larger than 2e-1.

In the standard applications, the stochastic system has size N and an event of interest, X, has expectation \mu=O(\log N). The probability that any one event deviates significantly is inverse polynomial in the size of the whole system

    \[ P(X>(1+\delta)\mu) < \frac{1}{N^{\delta}}. \]

This is important since the total number of events is polynomial in N for many settings. By simple Union Bound,

    \[ P(\mbox{any } X>(1+\delta)\mu) = O(\frac{1}{N}). \]

So the chance of any deviation in any event is small!



Posted in Probability.

Upcoming Conferences

We’re just about to hit conference season, so I thought I would post a public service announcement identifying various upcoming events for folks into machine learning and Bayesian modeling.

Posted in Machine Learning.

Variational Inference (part 1)

I will dedicate the next few posts to variational inference methods as a way to organize my own understanding – this first one will be pretty basic.

The goal of variational inference is to approximate an intractable probability distribution, p, with a tractable one, q, in a way that makes them as ‘close’ as possible. Let’s unpack that statement a bit.


Posted in Machine Learning, Probability, Statistics, Uncategorized.

Geometric means of distributions

Annealed importance sampling [1] is a widely used algorithm for inference in probabilistic models, as well as computing partition functions. I’m not going to talk about AIS itself here, but rather one aspect of it: geometric means of probability distributions, and how they (mis-)behave.


Posted in Machine Learning.

Learning Theory: What Next?

In my previous post, I wrote about why I find learning theory to be a worthwhile endeavor. In this post I want to discuss a few open/under-explored areas in learning theory that I think are particularly interesting. If you know of progress in these areas, please email me or post in the comments. Continued…

Posted in Machine Learning.