### Kernels in Machine Learning I

#### Posted by David Corfield

I keep feeling small twinges of familiarity between some of what goes on here at the Café and what I read in the machine learning literature. I’ll jot down a sketch of what’s going on and see if I can get the connection clearer in my head.

One of the most important developments in machine learning over the past 15 years has been the introduction of *kernel* methods. A kernel on a set $X$ is a function $K: X \times X \to \mathbb{R}$, which is symmetric and positive definite, in the sense that for any $N \geq 1$ and any $x_1,..., x_N \in X$, the matrix $K_{ij} = K(x_i, x_j)$ is positive definite, i.e., $\sum_{i, j} c_i c_j K_{ij} \geq 0$ for all $c_1, ..., c_N \in \mathbb{R}$. (Complex-valued kernels are possible.)

Another way of looking at this situation is to reformulate it as a mapping $\phi : X \to H$, where $H$ is a reproducing kernel Hilbert space, a function space in which pointwise evaluation is a continuous linear functional.

The ‘kernel’ and ‘feature map to Hilbert space’ stories relate to each other as follows:

$K(x, .) = \phi (x)$.

The reproducing aspect of $H$ means that

$\langle f(.), K(x, .) \rangle = f(x)$,

and this is continuous. So we have

$K(x, y) = \langle \phi (x), \phi(y) \rangle$.

$H$ is the span of the set of functions $\{K(x, .)| x \in X\}$, and, under certain conditions, when we find the $f \in H$ for which a functional is minimised, it takes the form

$f(x) = \sum_{i = 1}^m c_i K(x_i, x)$.

Many linear algebra techniques in $H$ just involve the inner product, and so can be conducted as a form of nonlinear algebra back in $X$, using the kernel trick.

Consider binary classification where we look to form a classifier which accurately finds the labels in $\{\pm 1\}$ of a sample from a distribution over $X \times \{\pm 1\}$ on the basis of their $X$ values, given a set of labelled training data. The support vector machine approach to this task looks to find the hyperplane in $H$ which best separates the images of data points $\phi (x_i)$, so that those with the same label $(y_i)$ fall in the same half space. The control for overfitting comes from finding such a hyperplane that classifies the training data accurately with the largest perpendicular distance to the nearest of them (the support vectors). Note that there’s a ‘Bayesian’ version of kernel methods which uses Gaussian processes.

The class of kernels on $X$ is closed under addition, multiplication by a positive scalar, multiplication, and pointwise limits. What else do we know about the structure on this class?

Kernels characterise the similarity of the input set. But how do they get chosen? Often a Gaussian radial-basis function (RBF) kernel is selected:

$K(x, y) = e^{-\|x - y\|^2 / \sigma^2}$.

But this can lead to an odd notion of similarity. Imagine handwritten digits, pixellated 30 $\times$ 30 so that their grey scale values form a vector in $\mathbb{R}^{900}$. The image of a ‘1’ shifted a couple of pixels to the right will appear very different in this representation. The classifying algorithm is not fed any information to let it know that it ought to classify it as the same digit. If it is trained on sufficient data this might well not matter since there will be a training example ‘close enough’ to allow for correct classification.

One way around the problem is to train a support vector machine with the given data, find the support vectors, then use translations of them as new ‘virtual’ data. This does have the effect of producing a more accurate classifier, but is seen by some as *ad hoc*. Ideally, the expected invariances would be encapsulated in the kernel.

But why not try to learn the kernel? This could be done by defining a finite parametric family of kernels and optimising for some associated quantity (or forming some Bayesian posterior). But a more interesting suggestion is to imitate the non-parametric nature of kernel methods at the level of kernel selection itself. More on this next time.

## Re: Kernels in Machine Learning I

Given $\langle f(\cdot), K(x,\cdot) \rangle = f(x)$ it seems I may roughly think of your kernel $K(x,\cdot)$ as a $\delta$-functional, $\delta_x$ but with respect to non-standard scalar products on spaces of functions.

I don’t understand yet how these kernels are used in the machine learning context which you have in mind.

In the example you mention, is the idea that some algorithm is supposed to determine a gray scale value given just an array of pixels? But it’s supposed to somehow “learn” this without us simply defining the gray scale value to be the average value of the pixels. Right?

So what is that learning procedure like, here? How does the kernel appear in this learning process?