### NIPS 2006

#### Posted by David Corfield

In a week’s time I shall be in Vancouver attending the NIPS 2006 conference. NIPS stands for Neural Information Processing Systems. I’m looking forward to meeting some of the people whose work I’ve been reading over the past twenty months. Later in the week I shall be speaking up in Whistler at a workshop called ‘Learning when test and training inputs have different distributions’, and hopefully fitting in some skiing.

In a way you could say all of our use of experience to make predictions encounters the problem addressed by the workshop. If we include time as one of the input variables, then our experience or ‘training sample’ has been gathered in the past, and we hope to apply it to situations in the future. Or from our experience gathered *here*, we expect certain things to happen *there*. How is it, though, that sometimes you know time, space, or some other variable, don’t matter, whereas other times you know they do?

As concerns machine learning more generally, at last, I’m beginning to get a picture of what I think needs to be done. We have a beautiful picture when performing inference with finite-dimensional parametric models. Take a space $X$ over which there is a measure. Now consider the Banach space, $B$, of all densities on $X$, and consider a linear map $A$ from $B$ to $R^n$ which measures features of a density, perhaps its moments. Now, we also need a map from $B$ to $R$ which measures the Kullback-Leibler distance to a fixed reference density, and finally a map from $R^n$ to $R$ which is the indicator function of a point in $R^n$, so that it takes the value 0 at that point, and is infinite elsewhere. We then have a function from $B$ to $R$ formed by the sum of the two paths.
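Written out, with $q$ for the reference density and $m$ for the designated point of $R^n$ (both symbols are introduced here purely for illustration), the function in question is roughly:

$F(p) = \mathrm{KL}(p \,\|\, q) + \iota_{m}(Ap), \qquad \iota_{m}(y) = 0 \text{ if } y = m, \; +\infty \text{ otherwise}.$

Minimising $F$ over $B$ then amounts to finding the density closest to $q$ among those whose features $Ap$ equal $m$.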

Now construct the Legendre-Fenchel dual situation, adding the dual maps $(R^{n})^* \to B^* \to R$ and $(R^{n})^* \to R$. The optimised values of these dual functions are equal, given certain conditions. When the smoke has cleared you see that the density which matches the moments of the empirical distribution and which is ‘closest’ to the reference density corresponds to the member of an *exponential family* which matches the empirical moments. Projecting from the reference density to the manifold of distributions with the same moments as the empirical distribution is equivalent to projecting from the empirical distribution to an exponential family, these being subspaces of the space of all distributions which are always perpendicular to the subspaces of fixed moments. Many famous families of distributions - Gaussian, exponential, Binomial, Poisson - are exponential.
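A toy numerical sketch may make the duality concrete. Everything specific here is an assumption chosen for illustration: a finite space $\{0, \dots, 10\}$, a uniform reference density, a single feature (the mean), and a target moment of 3. The code finds the exponential-family member matching that moment, checks that the primal and dual values agree, and compares it with another density having the same mean.

```python
import numpy as np

# Hypothetical toy setup: X = {0, 1, ..., 10}, uniform reference density,
# and a single feature phi(x) = x, so A(p) is just the mean of p.
xs = np.arange(11, dtype=float)
reference = np.full(xs.size, 1.0 / xs.size)
target_mean = 3.0  # stand-in for an empirical moment


def tilted(theta):
    """Exponential-family member p_theta(x) proportional to reference(x) * exp(theta * x)."""
    logits = np.log(reference) + theta * xs
    logits -= logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def solve_dual(target, lo=-20.0, hi=20.0, iters=200):
    """Bisection on theta; the mean of p_theta is increasing in theta."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tilted(mid) @ xs < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete densities."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


theta_star = solve_dual(target_mean)
p_star = tilted(theta_star)

# Primal value: divergence from the reference of the moment-matching density.
primal_value = kl(p_star, reference)

# Dual value: theta * m - log Z(theta), Z(theta) = sum_x reference(x) * exp(theta * x).
log_Z = np.log(np.sum(reference * np.exp(theta_star * xs)))
dual_value = theta_star * target_mean - log_Z

# A competitor with the same total mass and the same mean: the perturbation
# (1, -2, 1, 0, ...) sums to zero and has zero first moment.
direction = np.zeros(xs.size)
direction[:3] = [1.0, -2.0, 1.0]
competitor = p_star + 0.01 * direction

print("mean of p_star:      ", p_star @ xs)
print("primal, dual values: ", primal_value, dual_value)
print("KL of competitor:    ", kl(competitor, reference))
```

The competitor satisfies the same moment constraint but sits further from the reference density, which is what the projection picture would lead one to expect.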

Last time I mentioned attempts to extend this analysis to infinite-dimensional exponential families. This is, in general, ill-posed: we are asking a finite amount of data to select a point in an infinite-dimensional space. But one can *regularize* the situation by requiring only approximate moment matching - the map from $R^n$ to $R$ picking out a region around a designated value. The Legendre-Fenchel dual of this is to put a prior on the exponential family, the solution then picking the maximum a posteriori distribution.
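As a sketch of how this looks in symbols, take the region to be a ball of radius $\epsilon$ around a designated value $m$, and write $q$ for the reference density and $\phi$ for the feature map underlying $A$ (all of this notation is my own shorthand for illustration, and the usual regularity conditions are assumed). The primal problem is

$\min_{p} \; \mathrm{KL}(p \,\|\, q) \quad \text{subject to} \quad \| Ap - m \| \le \epsilon,$

and its dual takes the form

$\max_{\theta} \; \langle \theta, m \rangle - \log \int q(x) \, e^{\langle \theta, \phi(x) \rangle} \, dx - \epsilon \, \| \theta \|_{*},$

where $\| \cdot \|_{*}$ is the dual norm. The first two terms are a log-likelihood for the exponential family $p_\theta \propto q \, e^{\langle \theta, \phi \rangle}$, while the penalty $\epsilon \| \theta \|_{*}$ can be read as the negative logarithm of a prior on $\theta$, so that the maximiser behaves like a maximum a posteriori estimate. Setting $\epsilon = 0$ recovers exact moment matching.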

To capture much of the work going on in machine learning we need to do all this in the space of conditional densities. If we’re looking to make predictions of output given input, there’s no need to model the input distribution. We just need a *discriminative* model to capture the functional relationship between inputs and outputs. And even in the situation to be discussed at the workshop, if we have the true distribution in our family, our solution will converge to it as data increases.

Now it looks like all of these ingredients - conditional densities, Legendre-Fenchel duality, infinite-dimensional exponential families - can be put together such that Gaussian process classification and Gaussian process regression are specific cases. However, Gaussian process people, as true Bayesians, aren’t happy with this, preferring the mean of the posterior to its mode. But all is not lost. Something I didn’t mention last time was Bayesian information geometry. What is the best estimator, i.e., before I see any data, what is the best I can do? Well let $\tau$ be our estimator, taking any finite set of data to a probability distribution. Then define the generalization error:

$E(\tau) = \int_p P(p) \int_z P(z|p)D(p, \tau(z))$,

taking $D(p, .)$ to be the cost of misspecifying the true distribution $p$. What we want is an estimator which minimises this error. Snoussi explains (p. 6) how this best estimator sends $z$ to

$\int p P(p|z)$.

Even if we need to work with a limited space of models for computational ease, we merely project this posterior distribution onto that space.
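Here is a small numerical check of that claim under assumptions of my own choosing: the candidate distributions are three coins with hypothetical biases, the prior is uniform, and $D(p, q)$ is taken to be the Kullback-Leibler divergence with the true distribution in the first slot. Since the generalization error is an average over data sets $z$, it is enough to minimise the posterior expected cost for each $z$ separately, which is what the brute-force search below does.

```python
import numpy as np

# Hypothetical toy model class: three candidate coin biases with a uniform prior.
biases = np.array([0.2, 0.5, 0.8])
prior = np.array([1 / 3, 1 / 3, 1 / 3])


def posterior(heads, tails):
    """P(p | z) for data z consisting of `heads` heads and `tails` tails."""
    likelihood = biases**heads * (1 - biases) ** tails
    unnorm = prior * likelihood
    return unnorm / unnorm.sum()


def kl_bernoulli(p, q):
    """D(p, q): KL divergence from Bernoulli(p) to Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))


def posterior_expected_cost(q, post):
    """Sum over candidates p of P(p | z) * D(p, q)."""
    return float(np.sum(post * kl_bernoulli(biases, q)))


post = posterior(heads=7, tails=3)

# The estimate "integral of p P(p | z)": here a Bernoulli with this bias.
posterior_mean = float(post @ biases)

# Brute-force search over q on a grid with spacing 1e-4.
grid = np.linspace(0.01, 0.99, 9801)
costs = [posterior_expected_cost(q, post) for q in grid]
best_q = float(grid[int(np.argmin(costs))])

print("posterior over biases:", post)
print("posterior mean:       ", posterior_mean)
print("grid minimiser:       ", best_q)
```

The grid minimiser agrees with the posterior mean up to the grid spacing. Note that the resulting estimate need not be one of the three candidate coins, which is why a projection back onto the model space may be wanted.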

Now, as he goes on to explain, there is plenty of scope for variation, using different distances between densities as costs, and then further using a weighted sum of the generalization error of the reference prior and the divergence of the prior from the Jeffreys prior (p. 10).

I can’t see anything standing in the way of transporting all of this over to conditional spaces. But whether it could tell us what to do when we know our models misspecify the true distribution, and when the training data follows a different distribution to the test data, is a different matter.

## Re: NIPS 2006

Learning seems to be hard!

As a curious blogger I am reminded of the difficulties neuroscientists have had in building neural network models which mimic our ability to learn symbols, as explained on Developing Intelligence. But by incorporating complexity from biologically-plausible simulations of neural computation, a system with generalized and symbol-like processing was recently built.

It doesn’t seem to become overtrained, but deals well with novel data sets:

(See

http://develintel.blogspot.com/2006/10/generalization-and-symbolic-processing.html

and

http://www.pnas.org/cgi/content/full/102/20/7338)

A no-brainer hint that it can be done. :-)