I often have a hard time understanding the terminology in machine learning, even after almost three years in the field. For example, what is a Deep Belief Network? I attended a whole summer school on Deep Learning, but I’m still not quite sure. I decided to take a leap of faith and assume this is not just because the Deep Belief Networks in my brain are not functioning properly (although I am sure this is a factor). So, I created a Machine Learning Glossary to try to define some of these terms. The glossary can be found here. I have tried to write in an unpretentious style, defining things systematically and leaving no “exercises to the reader”. I also have a form for readers to request new definitions.
Suppose we are modeling a spatial process (for instance, the amount of rainfall around the world, the distribution of natural resources, or the population density of an endangered species). We’ve measured the latent function $Z$ at some locations $s_1, \ldots, s_n$, and we’d like to predict the function’s value at some new location $s_0$. Kriging is a technique for extrapolating our measurements to arbitrary locations. For an in-depth discussion, see Cressie and Wikle (2011). Here I derive Kriging in a simplified case.
I will assume that $Z$ is an intrinsically stationary process. In other words, there exists some semivariogram $\gamma$ such that
$$\mathrm{Var}[Z(s + h) - Z(s)] = 2\gamma(h).$$
Furthermore, I will assume that the process is isotropic (i.e., that $\gamma(h)$ is a function only of $\|h\|$). As Andy described here, the existence of a covariance function implies intrinsic stationarity. In addition, I will assume that the process has a constant mean, $\mu$. We would like to estimate $Z(s_0)$ with a linear combination of our current observations. Our estimator will be
$$\hat{Z}(s_0) = \sum_{i=1}^{n} \lambda_i Z(s_i).$$
– April 21, 2013
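I first heard about Fisher information in a statistics class, where it was given in terms of the following formulas, which I still find a bit mysterious and hard to reason about:
$$I(\theta) = \mathbb{E}_{x \sim p(x;\theta)}\left[\left(\frac{\partial}{\partial\theta} \log p(x;\theta)\right)^2\right] = -\mathbb{E}_{x \sim p(x;\theta)}\left[\frac{\partial^2}{\partial\theta^2} \log p(x;\theta)\right]$$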
It was motivated in terms of computing confidence intervals for your maximum likelihood estimates. But this sounds a bit limited, especially in machine learning, where we’re trying to make predictions, not present someone with a set of parameters. It doesn’t really explain why Fisher information seems so ubiquitous in our field: natural gradient, Fisher kernels, Jeffreys priors, and so on.
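To make the definition less mysterious, here is a minimal sketch that checks the two standard expressions for Fisher information (the expected squared score and the negative expected second derivative of the log-likelihood) against each other. The Bernoulli model and the function name `bernoulli_fisher_two_ways` are my own choices for illustration, not something from the statistics class; for a Bernoulli with parameter $\theta$, both expressions should equal $1/(\theta(1-\theta))$.

```python
def bernoulli_fisher_two_ways(theta):
    """Return (E[score^2], -E[d^2/dtheta^2 log p]) for a Bernoulli(theta) model.

    Hypothetical helper for illustration. The expectations are computed
    exactly by summing over the two outcomes x in {0, 1}.
    """
    expected_sq_score = 0.0
    neg_expected_second = 0.0
    for x in (0, 1):
        prob = theta if x == 1 else 1.0 - theta
        # log p(x; theta) = x * log(theta) + (1 - x) * log(1 - theta)
        score = x / theta - (1 - x) / (1.0 - theta)
        second = -x / theta**2 - (1 - x) / (1.0 - theta) ** 2
        expected_sq_score += prob * score**2
        neg_expected_second += prob * (-second)
    return expected_sq_score, neg_expected_second


theta = 0.3
via_score, via_second = bernoulli_fisher_two_ways(theta)
closed_form = 1.0 / (theta * (1.0 - theta))
print(via_score, via_second, closed_form)
```

The information blows up as $\theta$ approaches 0 or 1, which matches the intuition that a nearly deterministic coin is very sensitive to small changes in its parameter.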
Posted in Statistics.
– April 8, 2013
It often comes up in neural networks, generalized linear models, topic models and many other probabilistic models that one wishes to parameterize a discrete distribution in terms of an unconstrained vector of numbers, i.e., a vector that is not confined to the simplex, might be negative, etc. A very common way to address this is to use the “softmax” transformation:
$$\pi_k = \frac{\exp(x_k)}{\sum_{j=1}^{K} \exp(x_j)}$$
where the $x_k$ are unconstrained in $\mathbb{R}$, but the $\pi_k$ live on the simplex, i.e., $\pi_k \geq 0$ and $\sum_{k=1}^{K} \pi_k = 1$. The $x_k$ parameterize a discrete distribution (not uniquely) and we can generate data by performing the softmax transformation and then doing the usual thing to draw from a discrete distribution. Interestingly, it turns out that there is an alternative way to arrive at such discrete samples, that doesn’t actually require constructing the discrete distribution.
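Here is a small sketch of both routes. The first function does “the usual thing”: apply the softmax, then invert the cumulative distribution. The second uses the Gumbel-max trick, which is my assumption about the alternative this excerpt alludes to (the excerpt itself does not name it): add independent Gumbel(0, 1) noise to each unconstrained value and take the argmax. All function names are hypothetical, and only the standard library is used.

```python
import math
import random


def softmax(xs):
    """Map unconstrained reals to a point on the simplex."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_via_softmax(xs, rng):
    """Build the discrete distribution, then invert its CDF."""
    u = rng.random()
    cumulative = 0.0
    probs = softmax(xs)
    for k, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return k
    return len(probs) - 1  # guard against floating-point round-off


def sample_via_gumbel_max(xs, rng):
    """Assumed alternative: argmax of x_k plus Gumbel(0, 1) noise."""
    def gumbel():
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        return -math.log(-math.log(u))

    noisy = [x + gumbel() for x in xs]
    return max(range(len(xs)), key=lambda k: noisy[k])


rng = random.Random(0)
xs = [1.0, -0.5, 2.0]
num_draws = 20000
probs = softmax(xs)
freq_softmax = [0.0] * len(xs)
freq_gumbel = [0.0] * len(xs)
for _ in range(num_draws):
    freq_softmax[sample_via_softmax(xs, rng)] += 1.0 / num_draws
    freq_gumbel[sample_via_gumbel_max(xs, rng)] += 1.0 / num_draws
print(probs, freq_softmax, freq_gumbel)
```

Both sets of empirical frequencies should be close to the softmax probabilities, even though the second sampler never normalizes anything.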
– April 6, 2013
This post gives a brief introduction to the pseudo-marginal approach to MCMC. A very nice explanation, with examples, is available here. Frequently, we are given a density function $\pi(x)$, with $x \in \mathcal{X}$, and we use Markov chain Monte Carlo (MCMC) to generate samples from the corresponding probability distribution. For simplicity, suppose we are performing Metropolis-Hastings with a spherical proposal distribution. Then, we move from the current state $x$ to a proposed state $x'$ with probability $\min\left(1, \pi(x')/\pi(x)\right)$.
But what if we cannot evaluate $\pi(x)$ exactly? Such a situation might arise if we are given a joint density function $\pi(x, z)$, with $z \in \mathcal{Z}$, and we must marginalize out $z$ in order to compute $\pi(x)$. In this situation, we may only be able to approximate
$$\pi(x) = \int \pi(x, z) \, dz.$$
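As a rough sketch of the idea, the code below runs Metropolis-Hastings with a symmetric Gaussian proposal but replaces the exact marginal density with an unbiased Monte Carlo estimate. The toy model is my own assumption, chosen because the answer is known: $z \sim N(0, 1)$ and $x \mid z \sim N(z, 1)$, so the true marginal of $x$ is $N(0, 2)$. The function names are hypothetical. The one detail that matters is that the estimate for the current state is stored and reused rather than recomputed at every step.

```python
import math
import random


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def estimate_marginal(x, num_samples, rng):
    """Unbiased estimate of the marginal density of x in the toy model.

    Draws z ~ N(0, 1) and averages the conditional density N(x; z, 1).
    """
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        total += normal_pdf(x, z, 1.0)
    return total / num_samples


def pseudo_marginal_mh(num_steps, num_samples, step_size, rng):
    x = 0.0
    density_hat = estimate_marginal(x, num_samples, rng)
    chain = []
    for _ in range(num_steps):
        x_new = x + rng.gauss(0.0, step_size)  # symmetric proposal
        density_hat_new = estimate_marginal(x_new, num_samples, rng)
        # Accept with probability min(1, ratio of the *estimates*).
        if rng.random() < density_hat_new / density_hat:
            # Keep the estimate together with the state; do not refresh it.
            x, density_hat = x_new, density_hat_new
        chain.append(x)
    return chain


rng = random.Random(0)
chain = pseudo_marginal_mh(num_steps=20000, num_samples=10, step_size=1.5, rng=rng)
chain_mean = sum(chain) / len(chain)
chain_var = sum((v - chain_mean) ** 2 for v in chain) / len(chain)
print(chain_mean, chain_var)  # should be roughly 0 and 2
```

With noisier estimates (fewer inner samples) the chain still targets the right distribution, but it tends to get stuck whenever the current estimate happens to be an overestimate.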
– March 31, 2013
If you have some randomness in your life, chances are that you want to try Chernoff’s bound. The most common way to understand randomness is a 2-step combo: find the average behavior, and show that the reality is unlikely to differ too much from the expectation (via Chernoff’s bound or its cousins).
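My favorite form of Chernoff’s bound is: for $X_1, \ldots, X_n$ independent binary random variables, and $X = \sum_{i=1}^{n} X_i$, $\mu = E[X]$, and $\delta > 2e - 1$, then
$$P(X > (1+\delta)\mu) < 2^{-\delta\mu}.$$
Note that the $X_i$ are not necessarily identically distributed; they just have to be independent. In practice, we often care about significant deviation from the mean, so $\delta$ is typically larger than $2e - 1$.
In the standard applications, the stochastic system has size $n$ and an event of interest, $X$, has expectation $\mu$. The probability that any one event deviates significantly is inverse polynomial in the size of the whole system whenever $\mu$ grows like $\log n$; for example, with $\mu = c \log_2 n$,
$$P(X > (1+\delta)\mu) < 2^{-\delta c \log_2 n} = n^{-c\delta}.$$
This is important since the total number of events is polynomial in $n$ for many settings, say $n^k$. By a simple union bound,
$$P(\text{some event deviates}) < n^k \cdot n^{-c\delta} = n^{k - c\delta}.$$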
So the chance of any deviation in any event is small!
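A quick simulation can make this concrete. The sketch below assumes the common multiplicative form of the bound, $P(X > (1+\delta)\mu) < 2^{-\delta\mu}$ for $\delta > 2e - 1$ (my assumption about which form is meant), and compares it with the empirical tail of a sum of independent, non-identically distributed binary variables. The probabilities and function names are made up for illustration.

```python
import math
import random


def chernoff_bound(mu, delta):
    """Assumed form: P(X > (1 + delta) * mu) < 2 ** (-delta * mu)."""
    if delta <= 2.0 * math.e - 1.0:
        raise ValueError("this form of the bound needs delta > 2e - 1")
    return 2.0 ** (-delta * mu)


def empirical_tail(probs, delta, num_trials, rng):
    """Estimate P(X > (1 + delta) * mu) where X sums independent Bernoullis."""
    mu = sum(probs)
    threshold = (1.0 + delta) * mu
    exceed = 0
    for _ in range(num_trials):
        x = sum(1 for p in probs if rng.random() < p)
        if x > threshold:
            exceed += 1
    return exceed / num_trials


rng = random.Random(0)
# Independent but not identically distributed success probabilities.
probs = [0.02 * (1 + (i % 5)) for i in range(20)]
mu = sum(probs)
delta = 5.0
bound = chernoff_bound(mu, delta)
observed = empirical_tail(probs, delta, num_trials=20000, rng=rng)
print(mu, bound, observed)
```

In this setting the observed tail frequency sits far below the bound, which is typical: Chernoff-style bounds trade tightness for generality.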
Posted in Probability.
– March 25, 2013
We’re just about to hit conference season, so I thought I would post a public service announcement identifying various upcoming events for folks into machine learning and Bayesian modeling.
- International Conference on Artificial Intelligence and Statistics (AISTATS) in Scottsdale, AZ: April 29 – May 1, 2013
- New England Machine Learning Day at Microsoft Research New England: May 1, 2013
- First International Conference on Learning Representations (ICLR) in Scottsdale, AZ: May 2-4, 2013
- Conference on Bayesian Nonparametrics, Amsterdam: June 10-14, 2013
- Conference on Learning Theory (COLT), Princeton, NJ: June 12-14, 2013
- International Conference on Machine Learning (ICML), Atlanta, GA: June 16-21, 2013
- Conference on Uncertainty in Artificial Intelligence (UAI), Bellevue, WA: July 11-15, 2013
- Joint Statistical Meetings (JSM), Montreal: August 3-8, 2013
- International Joint Conference on Artificial Intelligence (IJCAI), Beijing: August 3-9, 2013
- Neural Information Processing Systems (NIPS), Lake Tahoe, NV: December 5-10, 2013
- MCMSki IV, Chamonix Mont-Blanc: January 6-8, 2014
Posted in Machine Learning.
– March 24, 2013
I will dedicate the next few posts to variational inference methods as a way to organize my own understanding – this first one will be pretty basic.
The goal of variational inference is to approximate an intractable probability distribution, $p$, with a tractable one, $q$, in a way that makes them as ‘close’ as possible. Let’s unpack that statement a bit.
– March 22, 2013
Annealed importance sampling is a widely used algorithm for inference in probabilistic models, as well as computing partition functions. I’m not going to talk about AIS itself here, but rather one aspect of it: geometric means of probability distributions, and how they (mis-)behave.
Posted in Machine Learning.
– March 18, 2013
In my previous post, I wrote about why I find learning theory to be a worthwhile endeavor. In this post I want to discuss a few open/under-explored areas in learning theory that I think are particularly interesting. If you know of progress in these areas, please email me or post in the comments.
Posted in Machine Learning.
– March 15, 2013