Kernels in Machine Learning III
Posted by David Corfield
The use of kernel methods in machine learning often goes by the name nonparametric statistics. Sometimes this gets taken as though it’s the opposite of finite parametric statistics, but that’s not quite right. A typical piece of parametric statistics has a model like the two-parameter family of normal distributions, and the parameters of this model are then estimated from a sample.
What happens in the case of kernel methods in machine learning is that a model from a possibly infinite-dimensional family is specified by a function on the (finite) collection of training points. In other words you are necessarily restricted to a subfamily of models of dimension the size of the training set. So, in contrast to the usual parametric case, you are dealing with a data-dependent finite dimensional subfamily of models.
Now we can see why regularisation is important. Fitting a normal distribution is choosing a point in a two-dimensional manifold within the space of all distributions on . The choice is made of the member of the family whose first and second moments match those of the empirical distribution. Now in the nonparametric case, before we know the training data, we are dealing with an infinite-dimensional space dense within the space of all distributions. Where in the Normal distribution case we have two constraints to match the two dimensions of the family of models, now we are able to satisfy as many constraints as we like, and must end by matching the empirical distribution if we opt simply for the maximum likelihood model.
Instead of this so-called maximum likelihood estimation, what we can do instead is impose a bias so that we end up in our family by favouring ‘smaller’ functions on the training data. An intuitive way to think of this is by considering the Gaussian radial basis function as kernel on a Euclidean space :
This is known as a ‘heat kernel’ in Euclidean space. Gaussian process classification with this kernel amounts to selecting negative and positive ‘temperatures’ on the training points, setting the temperature of the rest of to zero, and allowing heat to flow for a designated time. After this time, points in will be classified according to the sign of their temperature, the magnitude of the temperature determing the certainty of the classification.
Now if we’re allowed to place any temperature values at the training points, we can select very hot ones for those labelled ‘’ and very cold ones for those labelled ‘’, so that they’re still hot or cold after heat flow has taken place. This is a classic case of overfitting, and we should have little confidence that our estimates on test points will be accurate. If on the other hand we manage to select modest temperatures in such a way that after the heat flow has taken place the classifier is accurate on training data, we should have greater confidence.