### Kernels in Machine Learning II

#### Posted by David Corfield

We’ve had a slight whiff of the idea that groupoids and groupoidification might have something to do with the kernel methods we discussed last time. It would surely help if we understood better what a kernel is. First off, why is it called a ‘kernel’?

This must relate to the kernels appearing in integral transforms,

$(T g)(u) = \int_T K(t, u) g(t) dt,$

where $g$, a function on the $T$ domain, is mapped to $T g$, a function on the $U$ domain. Examples include those used in the Fourier and Laplace transforms. Do these kernels get viewed as kinds of matrix?

In our case these two domains are the same and $K$ is symmetric. Our trained classifier has the form:

$f(x) = \sum_{i = 1}^m c_i K(x_i, x)$

So here $g(t) = \sum_{i = 1}^m c_i \delta (t - x_i)$, a weighted sum of delta functions at the training points. The kernel has allowed the transform of a weighted set of points in $X$, or, more precisely, a weighted sum of delta functions at those points, to a function on $X$.
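As a concrete sketch of such a classifier (with made-up points $x_i$ and weights $c_i$, and a standard Gaussian RBF chosen for $K$ since no particular kernel is fixed above; in practice the $c_i$ would come out of training, e.g. an SVM):

```python
import numpy as np

# Hypothetical training points x_i and weights c_i (made up for illustration).
x_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
c = np.array([1.0, -0.5, 0.3])

def rbf_kernel(x, y, gamma=1.0):
    """A common symmetric kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def f(x):
    """The trained classifier f(x) = sum_i c_i K(x_i, x)."""
    return sum(c_i * rbf_kernel(x_i, x) for c_i, x_i in zip(c, x_train))

print(f(np.array([0.5, 0.5])))
```

The classifier is just a $K$-weighted combination of the training points, which is what makes the integral-transform reading natural.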

Perhaps we can get a little closer to the beginning of John’s ‘Tale of groupoidification’ by considering the ‘bag of words’ kernel. This applies to a set of documents. For each member of a set of words we count the number of its occurrences in the document. The map we called $\phi$ in the last post sends a document to a vector whose entries are the counts of the occurrences of some list of words. Then the *bag of words* kernel forms the dot product between these vectors. So

$K(x, y) = \sum_i (\text{\# occurrences of } i\text{th word in } x) \cdot (\text{\# occurrences of } i\text{th word in } y).$

This has the look of a span of sets $S: X \rightarrow X$ (not sure how to do that broken arrow), where $S$ is the set of pairs of word tokens of the same type and their positions in a document. E.g., ([computer, word #56, document 42], [computer, word #91, document 65]) witnesses the presence of ‘computer’ in documents 42 and 65. The size of the subset of $S$ above this pair of documents is the (42, 65) entry of the kernel matrix.
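To see that the two descriptions agree, here is a toy check (documents and word list made up for illustration): the kernel computed as a dot product of count vectors equals the number of pairs of same-type word tokens, i.e. the size of the fibre of the span over a pair of documents.

```python
from collections import Counter

# Two toy documents (made up for illustration).
doc_x = "the computer ran the program".split()
doc_y = "a program on a computer".split()
vocab = ["computer", "program", "ran"]  # the chosen word list

# Kernel as a dot product of occurrence-count vectors.
cx, cy = Counter(doc_x), Counter(doc_y)
k_dot = sum(cx[w] * cy[w] for w in vocab)

# Kernel as the size of the fibre of the span over (doc_x, doc_y):
# count pairs of word tokens of the same type, one from each document.
k_span = sum(
    1
    for wx in doc_x
    for wy in doc_y
    if wx == wy and wx in vocab
)

print(k_dot, k_span)  # prints: 2 2
```

The double sum over token pairs is exactly the count of elements of $S$ sitting above this pair of documents.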

You choose your set of words to be informative, e.g., avoid ‘and’ and ‘the’. Note that this kernel literally treats your documents as bags of words, or better multisets. It doesn’t care about the word order. Classifiers can work effectively even after information is thrown away, just as in the case of the digit classifier we spoke about which would work as effectively if all the 30 $\times$ 30 images were scrambled by the same permutation of its 900 pixels.
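A quick numerical check of that invariance (with random vectors standing in for the flattened $30 \times 30$ images, since the actual digit data isn't to hand): a linear kernel is unchanged when the *same* permutation scrambles both inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical 900-pixel images (random stand-ins for flattened 30x30 digits).
u = rng.random(900)
v = rng.random(900)

# One fixed permutation applied to *both* images.
perm = rng.permutation(900)

k_before = np.dot(u, v)
k_after = np.dot(u[perm], v[perm])

print(np.isclose(k_before, k_after))  # prints: True
```

The kernel only compares matched coordinates, so any information carried by the ordering of the pixels (or the words) is invisible to it.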

It would be fun if we could find a machine learning kernel which involved spans of groupoids rather than spans of sets.

## Re: Kernels in Machine Learning II

Interesting stuff.

Could you elaborate on your example? I was never that sharp, and the fact it’s a Friday means you’ll have to spell things out. In particular, I’m not sure if you’re attempting to rate the “class compatibility” of two documents or of two sets of documents.

Regarding your question about kernels: if you consider the discretised version of the Fourier transform, the DFT, and its various similar transforms like wavelets, then the full discretised function (i.e. $g(1)$, $g(2)$, …, $g(N)$) is related to the transformed discretised function ($Tg(1)$, …, $Tg(N)$) by a matrix multiplication (as is obvious when you realise each “output” function value is a weighted linear combination of the original function values). From this viewpoint the various “fast” algorithms correspond to “factorisations” of this matrix. However, I suspect that’s probably not the kind of matrix view of a kernel you’re looking for.
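That matrix view can be made concrete in a few lines (signal made up for illustration; the kernel matrix below uses NumPy’s DFT sign convention, $K_{ut} = e^{-2\pi i u t / N}$):

```python
import numpy as np

# A made-up sampled signal g(1), ..., g(N).
N = 8
g = np.arange(N, dtype=float)

# The DFT as an explicit N x N kernel matrix: K[u, t] = exp(-2*pi*i*u*t/N).
t = np.arange(N)
K = np.exp(-2j * np.pi * np.outer(t, t) / N)

Tg_matrix = K @ g          # the transform as a plain matrix multiplication
Tg_fft = np.fft.fft(g)     # the "fast" factorised computation

print(np.allclose(Tg_matrix, Tg_fft))  # prints: True
```

The FFT computes exactly this matrix–vector product, just organised as a product of sparse factors.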

Final question: my only real knowledge of the spans-of-groupoids stuff is from the TWFs (in particular week 248). Is there any online paper that deals with this stuff (in particular, how relations transform into numeric valuations)? John has hinted that this will be explained in coming TWFs.