## June 25, 2007

### Kernels in Machine Learning I

#### Posted by David Corfield

I keep feeling small twinges of familiarity between some of what goes on here at the Café and what I read in the machine learning literature. I’ll jot down a sketch of what’s going on and see if I can get the connection clearer in my head.

One of the most important developments in machine learning over the past 15 years has been the introduction of kernel methods. A kernel on a set $X$ is a function $K: X \times X \to \mathbb{R}$, which is symmetric and positive definite, in the sense that for any $N \geq 1$ and any $x_1,..., x_N \in X$, the matrix $K_{ij} = K(x_i, x_j)$ is positive definite, i.e., $\sum_{i, j} c_i c_j K_{ij} \geq 0$ for all $c_1, ..., c_N \in \mathbb{R}$. (Complex-valued kernels are possible.)

Another way of looking at this situation is to reformulate it as a mapping $\phi : X \to H$, where $H$ is a reproducing kernel Hilbert space, a function space in which pointwise evaluation is a continuous linear functional.

The ‘kernel’ and ‘feature map to Hilbert space’ stories relate to each other as follows:

$K(x, .) = \phi (x)$.

The reproducing aspect of $H$ means that

$\langle f(.), K(x, .) \rangle = f(x)$,

and this is continuous. So we have

$K(x, y) = \langle \phi (x), \phi(y) \rangle$.

$H$ is the span of the set of functions $\{K(x, .)| x \in X\}$, and, under certain conditions, when we find the $f \in H$ for which a functional is minimised, it takes the form

$f(x) = \sum_{i = 1}^m c_i K(x_i, x)$.

Many linear algebra techniques in $H$ just involve the inner product, and so can be conducted as a form of nonlinear algebra back in $X$, using the kernel trick.

Consider binary classification where we look to form a classifier which accurately finds the labels in $\{\pm 1\}$ of a sample from a distribution over $X \times \{\pm 1\}$ on the basis of their $X$ values, given a set of labelled training data. The support vector machine approach to this task looks to find the hyperplane in $H$ which best separates the images of data points $\phi (x_i)$, so that those with the same label $(y_i)$ fall in the same half space. The control for overfitting comes from finding such a hyperplane that classifies the training data accurately with the largest perpendicular distance to the nearest of them (the support vectors). Note that there’s a ‘Bayesian’ version of kernel methods which uses Gaussian processes.

The class of kernels on $X$ is closed under addition, multiplication by a positive scalar, multiplication, and pointwise limits. What else do we know about the structure on this class?

Kernels characterise the similarity of the input set. But how do they get chosen? Often a Gaussian radial-basis function (RBF) kernel is selected:

$K(x, y) = e^{-\|x - y\|^2 / \sigma^2}$.

But this can lead to an odd notion of similarity. Imagine handwritten digits, pixellated 30 $\times$ 30 so that their grey scale values form a vector in $\mathbb{R}^{900}$. The image of a ‘1’ shifted a couple of pixels to the right will appear very different in this representation. The classifying algorithm is not fed any information to let it know that it ought to classify it as the same digit. If it is trained on sufficient data this might well not matter since there will be a training example ‘close enough’ to allow for correct classification.

One way around the problem is to train a support vector machine with the given data, find the support vectors, then use translations of them as new ‘virtual’ data. This does have the effect of producing a more accurate classifier, but is seen by some as ad hoc. Ideally, the expected invariances would be encapsulated in the kernel.

But why not try to learn the kernel? This could be done by defining a finite parametric family of kernels and optimising for some associated quantity (or forming some Bayesian posterior). But a more interesting suggestion is to imitate the non-parametric nature of kernel methods at the level of kernel selection itself. More on this next time.

Posted at June 25, 2007 9:38 AM UTC

TrackBack URL for this Entry:   http://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/1332

### Re: Kernels in Machine Learning I

Given $\langle f(\cdot), K(x,\cdot) \rangle = f(x)$ it seems I may roughly think of your kernel $K(x,\cdot)$ as a $\delta$-functional, $\delta_x$ but with respect to non-standard scalar products on spaces of functions.

I don’t understand yet how these kernels are used in the machine learning context which you have in mind.

In the example you mention, is the idea that some algorithm is supposed to determine a gray scale value given just an array of pixels? But it’s supposed to somehow “learn” this without us simply defining the gray scale value to be the average value of the pixels. Right?

So what is that learning procedure like, here? How does the kernel appear in this learning process?

Posted by: urs on June 25, 2007 6:46 PM | Permalink | Reply to this

### Re: Kernels in Machine Learning I

In most conventional SVM techniques, the situation is that given a concrete kernel $K$ and a set of test data, one can mechanically construct the optimal classifier using that kernel and data set, which is a weight $c_i$ for each $x_i$. This classifier is then the sign of $f(y)=\sum_{x_i} c_i K(y,x_i)$. (As mentioned in my post below, a lot of the $c_i$ turn out to be 0.)

What David C is interested in is (I think) how one might use prior knowledge beyond just the training set to guide - and ideally mechanise - the choice of $K$.

Posted by: dave tweed on June 25, 2007 7:07 PM | Permalink | Reply to this

### Re: Kernels in Machine Learning I

Sorry Urs, I wasn’t very clear. So you are given a training set of labelled data points. Each data point is a 900 place vector which corresponds to the grey-scale values of the image of a handwritten digit. Let’s keep to binary classification, and say we’re trying to distinguish between 0s and 1s. Each of the training points is labelled 0 or 1. The training set is taken to have come from the same distribution as the data we will eventually test our classifier on.

Now we’re trying to find a function which when given a 900 place vector outputs 0 or 1. The SVM technique is to map the 900 dimensional space into a possibly infinite-dimensional space, such as a function space. There it will look for a hyperplane which separates the images of the 0s from the 1s.

As you can imagine, the candidate hyperplanes form a very rich set. Unless we can ‘regularize’ the procedure it would be like fitting polynomials of any degree to a finite data set, i.e., a case of over-fitting.

SVM regularisation corresponds to finding a hyperplane such that the distance to the closest data points is a large as possible. If all the training data is separated accurately with a large margin, we expect it to continue well on future data.

Now, we don’t do the calculation in the image Hilbert space. Any inner products of images we need there can be carried out in terms of the kernel. Hyperplanes in the image space correspond to possibly complicated curves back in domain space.

As the other David said, what you end up with is a (reasonably sparse) classifier

$f(x) = \sum_{i = 0}^m c_i K(x, x_i),$

for a (reasonably small number) of training data points $x_i$. Some threshold value for $f(x)$ determines whether $x$, a test point, is classified as a 0 or a 1.

If you think of $K$ as some measure of similarity in the $X$ space, it’s like a handful of already labelled training points are voting on a new test point. You’re quite like this one which was a 1, but you’re much more like this one which was a 0, therefore we’ll classify you as 0.

Posted by: David Corfield on June 25, 2007 9:29 PM | Permalink | Reply to this

### Re: Kernels in Machine Learning I

Some comments (that you probably already know): Firstly, geometric invariance is probably best tackled outside of the classifier since we have good models for geometry and so can convert things to canonical form, or for more complicated transformations form the tangent space around an exemplar in measuring distances. However, the basic problem you outline still applies in, eg, the difference between my handwritten digits when I’m relaxed as opposed to on a train as opposed to rushed. There’s commonality in my writing that can’t easily be analytically be factored out and yet the learning method ought ideally to be invariant to it.

Another implicit point is that what makes the SVM practical is that it’s only the support vectors (those points that lie on the margin) that actually affect the classification. (Of course, the whole test data-set determines which ones those vectors are.) However, it still feels weird to me that the entirety of the data set isn’t taken as “partially relevant” when classifying a new example.

Going into a possibly newer thought, one of the reasons I keep trying to actually use groupoids for something is that in learning problems different items in the test data set items may well have different allowable symmetries. As a silly example, when walking the arm position is pretty much free whilst when running the arms pretty much have to be in phase with the legs, so the “walking symmetries” is bigger than the “running symmetries”. But I can’t see any actual calculation or algorithm one could do that would actually take advantage of a groupoid structure.

Anyway, there are some shallow comments :-) Be interesting to see the following parts.

Posted by: dave tweed on June 25, 2007 6:55 PM | Permalink | Reply to this

### Re: Kernels in Machine Learning I

However, it still feels weird to me that the entirety of the data set isn’t taken as “partially relevant” when classifying a new example.

The sparsity of the SVM’s solution comes from the form of its loss function, the so-called hinge loss (p. 4). Bayesians prefer the negative log probability. This uses all data points, as in full solution for Gaussian process classification. Approximations can be made in a controlled way, e.g., via the informative vector machine.

This tutorial explains the trade-off to be made very clearly. Allowing every training point to have its say is expensive.

Posted by: David Corfield on June 25, 2007 9:06 PM | Permalink | Reply to this

### Re: Kernels in Machine Learning I

David C said “Allowing every training point to have its say is expensive” and pointed out about other loss functions in other SVM approaches.

Thanks for the pointer David; I hadn’t kept up with that stuff. The computational expense issue is true, although in the low-dimensional cases where you can use it I much prefer using kernel density estimation-based classifiers and then optimising the living daylights out of the code, just because I’m more comfortable that any bizarre “structure” in the classes is theoretically retrievable in the limit of sufficiently large numbers of samples. (Yeah, I know I said a while ago I don’t like analysing classifiers using their behaviour at infinite limits; I’m a hypocrite :-) ). This brushes over that KDE is more like a generative classifier than a discriminative classifier.

But as is well known, KDE in particular tends to conk out from the curse of dimensionality as the number of dimensions of your data space increases, ie, you need exponentially more training-data samples to adequately cover the space to a reasonable density (regardless of computational time issues, which also strike). That’s another reason I keep occasionally trying to really grok groupoids and figure out how to actually do something with them: it might be a start at understanding how to naturally form equivalences in high dimensional data spaces.

(I know you probably know all this stuff; I’m just being a bit more explicit just for casual readers.)

Posted by: dave tweed on June 26, 2007 10:51 AM | Permalink | Reply to this

### Re: Kernels in Machine Learning I

What are your ideas about using groupoids? Is it that they detect local symmetries? E.g., we wouldn’t look at the full group of rotations acting on a digit, but we would want to identify images which are small rotations of a central one.

Hmm. How can we get the ‘Tale of Groupoidification’ to work here?

Posted by: David Corfield on June 26, 2007 5:23 PM | Permalink | Reply to this

### Re: Kernels in Machine Learning I

I generally work with general vision/learning problems, so take the following as illustrative of vague thoughts about general stuff rather than how I’d actually approach characters classification.

Let’s widen from digits to general things mathematicians write. One thing I do is that I sometimes put a bar through $z$s (to disambiguate with 2’s), but sometimes I don’t. The action of putting a bar through the $z$ is a symmetry in the sense that it’s a transformation that leaves “being a $z$” unchanged, and it’s invertible. Now suppose we’re trying to classify one of my letters that looks a lot like a “non-bar-ed” $z$ in the training set of my writing, except it’s bar-ed. This ought to be strong evidence that it’s a $z$. A similar argument ought to apply with (bar,nonbar) reversed. Some people bar 7’s to disambiguate from 1’s, so the same transformation ought to apply 7s. But it doesn’t apply to any other characters: you can’t link a $+$ with a $|$ by an “un-bar” tranformation! In principle, there could be lots of different “local symmetries” applicable to different groups of things. And in principle you might be able to apply more than one (sequentially?) to an object.

So representing these rules seems like what a groupoid does. But ideally you’d want to fuse these groupoid ideas with your classifier technique rather than use them as discrete rules. But I’ve no idea how. And that’s before thinking about how to learn them…

Those are my vague thoughts anyhow.

Posted by: dave tweed on June 26, 2007 6:02 PM | Permalink | Reply to this

### Re: Kernels in Machine Learning I

How can we get the “Tale of Groupoidification” to work here?

I was hoping to see a connection here, but then realized I didn’t understand your setup well enough.

But since the kernel lives over $X \times X$, and pretty much behaves like an ordinary matrix, it might be suggestive to think about replacing $\array{ &&X \times X \\ & \swarrow && \searrow \\ X &&&& X }$ more generally with a span of groupoids. But that’s just a wild guess. I have no idea how to relate this to the actual problem at hand.

Posted by: urs on June 26, 2007 8:53 PM | Permalink | Reply to this

### Re: Kernels in Machine Learning I

Urs wrote “since the kernel lives over X×X, and pretty much behaves like an ordinary matrix”. I might be misunderstanding what you’re saying Urs, but I think the type of kernels
given as examples here (at end of section) show the kind of thing that gets used. One of the differences from what’s discussed in the TWFs about “the Tale” is that the kernel is a essentially a general functional metric between any two points in the data space (although probably not satsifying the axioms of a metric space) giving a “quantitative” idea of how different they are.

Posted by: dave tweed on June 27, 2007 10:13 PM | Permalink | Reply to this

### Re: Kernels in Machine Learning I

From the kernel you can define a distance

$d(x, y) = \sqrt {K(x, x) - 2K(x, y) + K(y, y)}.$

I think I’ve found a kernel used by machine learning people very close to “the Tale”. More later.

Posted by: David Corfield on June 28, 2007 8:21 AM | Permalink | Reply to this

### Re: Kernels in Machine Learning I

All I had in mind is that the kernel is a function on $X \times X$ which is to be thought of as giving the entries of a square matrix with $|X|^2$ entries, as far as I understood.

Think of a vector similarly as a function on $X$. Think of this function as a 0-bundle.

Then multiplying the matrix with the vector amounts to

- pulling the function which gives the vector back along one of the legs of the correspondence space that I have drawn

- “tensor” that pullback with the function on $X \times X$ (the kernel), i.e. multiply pointwise

- then “push forward” the resulting function on $X \times X$ along the remaining leg (by summing up the numbers in the “fibers” with respect to the remaining edge).

That’s a fancy way of expressing matrix multiplication using correspondence spaces.

It is not exactly what John talks about in the Tale, but my impression is that the hope is that this is closely related.

Posted by: urs on June 28, 2007 10:21 AM | Permalink | Reply to this
Read the post Kernels in Machine Learning II
Weblog: The n-Category Café
Excerpt: More about kernels in machine learning.
Tracked: June 28, 2007 9:26 PM
Read the post Category Theory in Machine Learning
Weblog: The n-Category Café
Excerpt: Does category theory have a future in machine learning?
Tracked: September 5, 2007 4:04 PM

Post a New Comment