June 28, 2007

Kernels in Machine Learning II

Posted by David Corfield

We’ve had a slight whiff of the idea that groupoids and groupoidication might have something to do with the kernel methods we discussed last time. It would surely help if we understood better what a kernel is. First off, why is it called a ‘kernel’?

This must relate to the kernels appearing in integral transforms,

$(T g)(u) = \int_T K(t, u) g(t) dt,$

where $g$, a function on the $T$ domain, is mapped to $T g$, a function on the $U$ domain. Examples include those used in the Fourier and Laplace transforms. Do these kernels get viewed as kinds of matrix?

In our case these two domains are the same and $K$ is symmetric. Our trained classifier has the form:

$f(x) = \sum_{i = 1}^m c_i K(x_i, x)$

So the $g(t)$ is $\sum_{i = 1}^m c_i \delta (x_i, x)$. The kernel has allowed the transform of a weighted set of points in $X$, or, more precisely, a weighted sum of delta functions at those points, to a function on $X$.

Perhaps we can get a little closer to the beginning of John’s ‘Tale of groupoidification’ by considering the ‘bag of words’ kernel. This applies to a set of documents. For each member of a set of words we count the number of its occurrences in the document. The map we called $\phi$ in the last post sends a document to a vector whose entries are the counts of the occurrences of some list of words. Then the bag of words kernel forms the dot product between these vectors. So

$K(x, y) = \sum_i (#occurrences of i th word in x) \cdot (#occurrences of i th word in y).$

This has the look of a span of sets $S: X \rightarrow X$ (not sure how to do that broken arrow), where $S$ is the set of pairs of word tokens of the same type and their positions in a document. E.g., ([computer, word #56, document 42], [computer, word #91, document 65]) witnesses the presence of ‘computer’ in documents 42 and 65. The size of the subset of $S$ above this pair of documents is the (42, 65) entry of the kernel matrix.

You choose your set of words to be informative, e.g., avoid ‘and’ and ‘the’. Note that this kernel literally treats your documents as bags of words, or better multisets. It doesn’t care about the word order. Classifiers can work effectively even after information is thrown away, just as in the case of the digit classifier we spoke about which would work as effectively if all the 30 $\times$ 30 images were scrambled by the same permutation of its 900 pixels.

It would be fun if we could find a machine learning kernel which involved spans of groupoids rather than spans of sets.

Posted at June 28, 2007 9:12 PM UTC

TrackBack URL for this Entry:   http://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/1342

Re: Kernels in Machine Learning II

Interesting stuff.

Could you elaborate on your example; I was never that sharp and the fact it’s a Friday means you’ll have to spell things out. In particular, I’m not sure if attempting to you’re rate the “class compatibility” of two documents or two sets of documents?

Regarding your question about kernels, if you consider the discretised version of the fourier transform the DFT and its various similar transforms like wavelets, then the full discretised function (ie $g(1)$, $g(2)$, …, $g(N)$) is related to the transform discretised function ($Tg(1)$, …, $Tg(N)$) by a matrix mulitplication (as is obvious when you realise each “output” function value is a weighted linear combination of the original function values.) From this viewpoint the various “fast” algorithms correspond to “factorisations” of this matrix. However, I suspect that’s probably not the kind of matrix view of a kernel you’re looking for.

Final question: my only real knowledge of the span of groupoid stuff if the TWFs (in particular week 248 for the spans of groupoid stuff. Is there any online paper that deals with this stuff (in particular how relations transform into numeric valuations)? John is hinting that this will be explained about in coming TWFs.

Posted by: dave tweed on June 29, 2007 10:57 AM | Permalink | Reply to this

Re: Kernels in Machine Learning II

Could you elaborate on your example

Let’s define a simple task where this kernel might be used. We want a classifier which when fed a web-page from a news provider can accurately tell us which section it’s from: business, sport, politics, etc. Let’s keep things very simple and consider the two class situation - business vs. sports.

Now with some fixed dictionary of words and a collection of business page examples and sports page examples we construct a classifier

$f(x) = \sum_i c_i K(x, x_i),$

where the $x_i$ are training examples.

When given document $x$, we calculate $f(x)$ and classify $x$ as business or sport depending on whether $f(x) \geq b$, for some threshold $b$.

You can imagine that the bag-of-words kernel evaluated on two sports pages chosen at random will have a higher value than if evaluated on a sports page and a business page, so long as a good dictionary is chosen. Having a linear combination of kernel evaluations will make this more robust, as some sports pages will happen to have a large number of financial terms. This may even happen by fluke if there are football players called, say, Broker and Price.

In practice, there’s usually a weighting modifying the kernel I gave, so that words appearing in many training examples are not counted as so important.

I suspect that’s probably not the kind of matrix view of a kernel you’re looking for.

No, that’s very relevant. So how much will John’s Tale be able to tell us about something like the DFT? If we’re trying to understand the category theoretic structure of quantum mechanics here, it would not be surprising if it did.

Is there any online paper that deals with this stuff?

I guess we’ll have to wait until the next quantum gravity seminar gets written up. The idea that relations are matrices of truth values, and that matrices with values in other rigs can pick up a generalised form of relation, such as cost between two points, has been discussed, e.g., here.

Posted by: David Corfield on June 29, 2007 12:22 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

I’m still not completely sure I’m getting your example. In particular, I don’t see why your kernel entry (42,65) is related to the size of $S$.

Let me say what I’d have expected, and you can highlight where I’m getting confused. You want to classify a new document $x$ and you’ve got a set $D=\{x_i\}$ of training examples, each of which is an individual document. Your bag of words kernel applies between pairs of documents (as $K(x,y)$). Lets say you’ve got a list of words and your 25th word is “bread” and your 60th word is “loaf”. Then your $K_{25,60}$ entry is (proportional to?) the frequency that the $x$ and $y$ are in the same class when $x$ contains “bread” and $y$ “loaf”. The diagonal elements $K_{ii}$ are then when your word in both documents is the same. The total set of documents then applies by summing over all the examples in your training set.

Presuambly, in the span of sets view you’d have something like (“bread”,(word 85 of $x$ is “bread”,word 235 of $y$ is “loaf”),”loaf”). The key point I’m trying to get my head around is that you seem to be suggesting the actual “witness” is the same in all the kernels, whereas it seems to me like the actual witness ought to vary with the examples – although the relationship that it’s witnessing remains the same.

I think it’s partly also that the conventional view of a kernel is that it’s a device for measuring the distance between vectors of “attributes” in a way that’s appropriate for the attributes, irrespective of the class the data points come from, whereas it seems the generalised kernels “know something” about the classes as well as the attributes. (I’m not saying anything’s wrong, just that I don’t understand.)

Posted by: dave tweed on July 2, 2007 12:38 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

Partly confusing phrasing above in para 2, sentence 3 should be: “Your bag of words kernel applies between two documents…”.

Posted by: dave tweed on July 2, 2007 12:46 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

I’m not sure whether what you describe is used. The ‘bag of words’ approach to document classification is certainly not the most refined, but it works surprisingly well. It really sees the document as a bag or multiset, as though it were generated by some multinomial distribution with no information from the adjacencies of words.

For a given dictionary the entries of the feature vector corresponding to a given document are the word frequencies. E.g., let the dictionary be (happy, hand, dear, elephant), then if $a$ = ‘Happy Birthday’, $\phi(a) = (4, 0, 1, 0)$. Then $K(a, a) = \phi(a) \cdot \phi(a) = 17$, recording pairs like (‘happy’ in third line, ‘happy’ in the second line). If $b$ = ‘I wanna hold your hand’, then $\phi(b) = (0, lots, 0, 0)$. Then $K(a, b) = \phi(a) \cdot \phi(b) = 0$. Whereas if $c$ = ‘Dear Prudence’, $\phi(c) = (0, 0, lots, 0)$, so $K(a, c) = \phi(a) \cdot \phi(c) = lots$, corresponding to pairs (‘dear’ in third line of Happy Birthday, ‘dear’ in $n^{th}$ line of ‘Dear Prudence’).

The span records all pairs of tokens of the same word in two documents.

Using kernel methods in machine learning, the learning task relies only on $K(x_i, x_j)$ with $x_i$ and $x_j$ ranging over the training documents. You’re trying to find a vector $(c_i)$ such that all the entries of the transformed vector $\sum_{j} K(x_i, x_j) c_j$ which are above some threshold correspond to documents of one type, and those below the other. With such a vector in hand you have a classifier to apply to new documents.

Posted by: David Corfield on July 2, 2007 8:03 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

(Firstly, if we’ve got anyone else reading, David C is using as examples the full song texts with the names give in his examples.) FWIW, I’ve talked read a lot about the theory of the SVM and used some prepackaged libraries of them, so I’m confused in the way that what you’re saying doesn’t fit with what I understand rather than I haven’t seen any of this before.

I think we’re saying the basically the same things about the conventional machine learning. The basic setup with the type of kernel you’re talking about is that when comparing two documents $X$ and $Y$ your kernel function $K(x,y)=x^T\mathbf{K}y$ where $x$ and $y$ are the word-counts vectors in $\mathbf{R}^n$, where $\mathbf{K}$ is the matrix of kernel entries.

If you then say you’re going to set off diagonal elements of the kernel matrix to zero (because you don’t have any model about that) and decide you can model “the frequency that the $x$ and $y$ are in the same class when $x$ and $y$ both contain word $w$” by the the constant value $1$ then the kernel matrix $\mathbf{K}$ is just the diagonal identity matrix. The same kernel matrix is then used comparing the data item against all the elements of the training data and these values are summed. (Granted for this particular kernel you can simplify much more than in the general case.) This seems to agree with what you’re saying.

However, I start to get confused when this is generalised to the spans of sets or groupoids. I still don’t get why the (42,65) entry of the kernel matrix is as stated in the original blog entry. AFAICS, the kernel is independent of the actual data sets used.

Aha: I wonder if what you’re calling the kernel matrix is something completely different from what I’ve been calling the kernel matrix above?

Posted by: dave tweed on July 2, 2007 9:02 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

The kernel (at least as my colleagues use the term) is just a map from $X \times X$ to $\mathbb{R}$, for some set $X$. In a later post I’ll talk about the idea of learning the kernel, but in most applications this is fixed. If $X$ is some population of documents, a common (but primitive) kernel is the one I’ve described, where for some preselected dictionary we have a ‘bag of words’ kernel:

$K(x, y) = \sum_i (# occurrences of i^{th} word in dictionary in document x) \times$ $(# occurrences of i^{th} word in dictionary in document y)$

So $K(42, 65)$ is picking up all the instances of the same word appearing in the documents 42 and 65. The union of all these instances for all pairs of documents in $X$ is the span.

This is like John Baez’s original example where the span was made up of the set of triples (French man, English woman, Russian who had these as favourite), mapping to French men and English women. Now we have (document 1, document 2, word in dictionary, where it appeared in document 1, where it appeared in document 2), with the two mappings to $X$, the documents, from the first two components.

I also noted that as far as kernel algorithms go, when training you never need to evaluate the kernel anywhere else than at the labelled documents appearing in the training data.

Posted by: David Corfield on July 2, 2007 10:57 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

Firstly, ignore what I was saying about forming relations between different words: I just saw you refer to a non-zero (42,65) element and, given my misunderstanding of what “kernel matrix” meant tried to come up with a meaning for it.

Secondly, a big apology: googling reveals that “kernel matrix” is a standard name for the matrix of kernel results between pairs of items in the training set. For some reason, this terminology just didn’t register with me in the papers I read (which were primarily vision applications papers).

Just to be clear: each element in the kernel matrix is itself “the size of” a valid span of sets? (The $S$ in your original entry.) You’re saying that the union of all the kernel matrix entry “spans of sets” is a bigger span of sets and is what you’re interested in; I’m just checking individual entires are also a span of sets.

Posted by: dave tweed on July 3, 2007 10:21 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

Ugh: when I say “$S$” above I mean what you refer to as “size of the subset of $S$ above this pair of documents is the (42, 65) entry of the kernel matrix”. Ie, for a particular pair of documents, a span of sets is a set with entries like ([“computer”,word 32],[“computer”,word 76]). Viewed just as a relation it’s a bit dull as it always relates a word to itself, but with the varying witnesses it’s still a valid span of sets.

Posted by: dave tweed on July 3, 2007 10:29 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

Regarding your final paragraph, I would put it this way. Each entry of the kernel matrix is the cardinality of a subset of a set $S$ which, with its two projections to the set of documents, forms a span. The subset of $S$ corresponding to the (42, 65) entry are instances of the same word appearing in documents 42 and 65.

I guess you could imagine a more elaborate measurement of the similarity of documents where a witness to similarity is an instance of, say, ‘beef’ in one document and ‘pork’ in the other.

I’m looking forward to the next TWF.

Posted by: David Corfield on July 4, 2007 12:59 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

David wrote:

We’ve had a slight whiff of the idea that groupoids and groupoidication might have something to do with the kernel methods we discussed last time.

Sounds right!

You just explained how integral kernels arise naturally from ‘witnessed relations’ — that is, spans of sets. Each ‘witness’ to some relation between $x \in X$ and $y \in Y$ contributes a term to the integral kernel $K(x,y)$.

But, if you were studying machine learning in the presence of symmetry, you might want invariant witnessed relations. For example: if seeing some scene makes you infer something about the location of some object, seeing a translated version of that scene should cause you to make the same sort of inference, only translated!

Now, in week248 and subsequent Weeks, I explained how invariant witnessed relations are spans of groupoids. And, decategorifying these gives matrices — or in more highbrow contexts, ‘integral kernels’!

In fact, in a half-written This Week’s Finds I’m already starting to talk about some famous integral kernels that come from invariant witnessed relations.

For example, the Radon transform comes from a span involving the space $X$ of points in $\mathbb{R}^n$ and the space $Y$ of hyperplanes in $\mathbb{R}^n$. The Euclidean group $G$ acts on both these spaces, and there’s an obvious span of $G$-spaces describing the invariant relation

the point $x$ lies on the hyperplane $y$.

Namely:

                     S
/ \
/   \
/     \
/       \
v         v
X           Y


where $S$ is the set of pairs $(x,y)$ consisting of a point $x$ lying on a hyperplane $y$.

(In this particular case, there’s either 0 or 1 witnesses to the fact that $x$ lies on $y$, so our ‘invariant witnessed relation’ is really nothing but an invariant relation. It’s a sort of degenerate example.)

Given this span of $G$-spaces, and using the fact that $X$, $Y$ and $S$ are equipped with $G$-invariant measures having some nice properties, we instantly get the integral kernel for the Radon transform! I’ll explain how in the actual article.

But, the cool part is that this process of getting the kernel is nothing but a slightly souped-up version of decategorifying a span of groupoids to get a matrix of numbers.

‘Slightly souped-up’, because an integral kernel is really a matrix where the indices take continuous rather than discrete values, so the usual sums involved in matrix multiplication get replaced by integrals. That’s why our spaces $X$, $Y$ and $S$ must have measures for the game to work.

I hope this is clear. I think there really is something to what you’re saying.

Posted by: John Baez on June 30, 2007 7:46 AM | Permalink | Reply to this

Re: Kernels in Machine Learning II

This is one definition of the Radon transform in 2 dimensions, equation 1.2 from Peter Toft’s PhD:

$\hat{g} (p, t) = \int_{- \infty}^{+ \infty} \int_{- \infty}^{+ \infty} g(x, y) \delta (y - p x - \tau) d x d y$

That delta function arises from the simple incidence relation you mention.

Hmm, is there a similar way to think about the Fourier transform?

Posted by: David Corfield on July 1, 2007 1:19 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

David wrote:

That delta function arises from the simple incidence relation you mention.

Right. We’re thinking about points and lines in $\mathbb{R}^2$. Given a function $g$ on the space of points, its Radon transform $\hat g$ is a function on the space of lines. To evaluate $\hat g$ on a given line we simply integrate $g$ over the set of all points incident to that line.

This idea only relies on the incidence relation between points and lines and our choice of a measure on the set of all points incident to a given line. So, we could replace ‘point’ and ‘line’ by any types of figure in any Klein geometry, as long as we’ve chosen the incidence relation and the necessary measure.

Hmm, is there a similar way to think about the Fourier transform?

It doesn’t fit the pattern I just described. Instead, something more general is going on. Instead of a yes-or-no incidence relation between vectors $x$ and covectors $k$, the kernel here is a phase $exp(i k (x))$.

I already know something about categorifying the Fourier transform, but now you’re making me want to think of it from some ‘generalized incidence relation’ viewpoint. I’m not sure what I mean by that, but I’m reminded of the ‘phased sets’ we were talking about a while ago.

Posted by: John Baez on July 4, 2007 2:03 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

Is it time for a crash course in geometric probability?

Posted by: David Corfield on July 4, 2007 3:27 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

The kernel in the Fourier transform is all about characters’ values on group elements, n’est-ce pas? I can see that this seems rather different from the set or groupoid of witnesses approach of a span.

Fourier (Pontryagin) kinds of situation are only rarely dealing with two copies of the same set, i.e., self-dual groups.

But it’s interesting that both kinds of kernels are used in machine learning.

Posted by: David Corfield on July 4, 2007 7:34 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

For future reference, a paper about positive definite kernels on homogeneous spaces

Posted by: David Corfield on July 5, 2007 12:36 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

This is really interesting discussion, even though I’m not completely following all the details. I’m reasonably familiar with the use of kernels in machine learning but am still having trouble relating it to groupoids.

Some of what is being said here reminds me of something else that confused me recently: a tutorial given by Risi Kondor on Group Theoretic Methods for Machine Learning at ICML 2007 last month. He’s written a paper advocating the use of group theory to describe symmetries in training data and showed, in the case of digit recognition, how to construct invariant features from representations of the rotation and translation groups on images. He also shows that you can then derive a kernel that also captures these invariants using properties of bispectra over images.

It’s fascinating stuff but I don’t pretend to understand all the details. Maybe the regulars here will find it interesting and/or illuminating?

Posted by: Mark on July 4, 2007 5:21 AM | Permalink | Reply to this

Re: Kernels in Machine Learning II

I am leading up to another paper by Kondor (with Jebara) on hyperkernels. He seems a very creative thinker. Somewhat similar to the one you mention is a paper which uses the representation theory of the symmetric groups to help track the identity of moving objects. A good example of the need for this is an air traffic controller tracking planes on a radar.

I’ll be meeting up with Persi Diaconis in Delphi at the end of this month, as will John Baez. He’s the author of Group Representation in Probability and Statistics.

I get the feeling that there’s huge scope for machine learning to learn from less visited areas of mathematics, like noncommutative harmonic analysis.

Posted by: David Corfield on July 4, 2007 1:21 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

By the way, I have no idea why the word “kernel” is used both to mean two completely different things: “function of two variables defining an integral transform” and “subspace on which a linear operator vanishes”. It’s unfortunate, since it makes phrases like “the kernel of the operator” hopelessly ambiguous in certain contexts.

Is “kernel” a translation of German “Kern”? Does this ambiguity also occur in German?

Posted by: John Baez on June 30, 2007 4:22 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

Is “kernel” a translation of German “Kern”?

I don’t know for sure which came first, but, yes, a “kernel” in math is called a Kern in German.

Does this ambiguity also occur in German?

Yes, it occurs in exactly the same way. Unfortunately.

When I was taught linear algebra, the professor felt like apologizing for introducing the abbreviation $\mathrm{ker}(f)$ for the kernel of a linear map $f$, since in German it saves but a single letter. He said: “But think of it as abbreviating the English kernel, then it saves three letters!”

One of the rare occurences where an English word is longer than its German equivalent. ;-)

Posted by: urs on July 2, 2007 2:04 PM | Permalink | Reply to this

Re: Kernels in Machine Learning II

A very interesting on-line book by Sigurdur Helgason, The Radon transform:

Given two homogeneous spaces $G/K$ and $G/K$ of the same group $G$, the Radon transform $u \to \hat {u}$ maps functions $u$ on the first space to functions $\hat {u}$ on the second space. For $\xi \in G/H, \hat{u}(\xi)$ is defined as the (natural) integral of $u$ over the set of points $x \in G/K$ which are incident to $\xi$ in the sense of Chern. (p. 5)

Page 63 shows you the relevant span of $G$-sets.

Posted by: David Corfield on July 5, 2007 2:35 PM | Permalink | Reply to this
Read the post Hierarchy and Emergence
Weblog: The n-Category Café
Excerpt: Emergence in hierarchies
Tracked: July 18, 2008 2:13 PM