Skip to the Main Content

Note:These pages make extensive use of the latest XHTML and CSS Standards. They ought to look great in any standards-compliant modern browser. Unfortunately, they will probably look horrible in older browsers, like Netscape 4.x and IE 4.x. Moreover, many posts use MathML, which is, currently only supported in Mozilla. My best suggestion (and you will thank me when surfing an ever-increasing number of sites on the web which have been crafted to use the new standards) is to upgrade to the latest version of your browser. If that's not possible, consider moving to the Standards-compliant and open-source Mozilla browser.

December 13, 2006

Back from NIPS 2006

Posted by David Corfield

Aside from thinking up the second n-Café millennium prize, there wasn’t much time for higher-dimensional algebra last week while I attended the NIPS 2006 machine learning conference/workshop in Vancouver/Whistler. To find a role even for ordinary category theory in machine learning presents an interesting challenge. Most of what I’ve seen relates to neural networks, e.g., here.

Now, it looked as though neural networks had had their day, superseded by kernel methods, both support vector machines (SVMs) and Gaussian processes (GPs), as used by most of the people I work with in Tübingen. But there’s been a recent revival of interest in neural networks with the discovery of a new way to train deep networks, as Yoshua Bengio described.

Up in Whistler, days were like sandwiches, workshops providing the bread to the skiing filler. On Friday, I attended the Causality and Feature Selection workshop. Bayesian networks featured prominently here, which I happen to know quite a lot about through my friend Jon Williamson.

On Saturday, I attended the related Learning when test and training inputs have different distributions workshop. The topic of this workshop is one where background knowledge is indispensable. Where the test and training distributions coincide, given enough training data I may expect my algorithm to converge to its optimal performance. I can’t rely on this if my algorithm has to learn from data unrepresentative of what it will eventually be used for. And yet such inference is very much a part of human reasoning. I may never have stepped foot in Africa before, nor ever been to a rainforest, nor ever seen a creature with ears that shape, with that colour fur, but if I see something suckling its young I can feel confident about a huge amount of facts relating to its anatomy, and physiology.

So how to encode sufficient background knowledge to allow transferral of inference to a broader domain. To do so in too domain-specific a way risks the autonomy prized in machine learning, which plays to a computer’s strengths, rapid calculation and string-searching. An illustration I like to give here is when searching for the word kangaroo in a long text, the human is very interested to learn that one of the chapter titles is ‘Australian Marsupials’, sensing that this will reduce the search space, whereas a computer can happily ignore this and just go string-searching. So, researchers are looking to feed in a less specific form of background knowledge, perhaps ideally just a representative scheme which could allow the learning of this background.

We were encouraged to give talks which would provoke discussion rather than just turn the workshop into a mini-conference. My questions to the audience were whether they felt satisfied with the ways found so far to encode background knowledge, and further whether they felt that the frequentists (statistical learning theorists, Vapnikians, SVM users) had been more imaginative to date than the Bayesians (GP users) in doing so. Frequentists present naturally answered ‘yes’ to the latter question. If true, this might be due to the hampering effect of forcing background knowledge to be encoded as a probability distribution, or it might just be through lack of effort. Here are some examples of background knowledge:

Clustering: we can expect that the input variables of data points bearing the same label will lie close to one another. Imagine a data set consisting of two interspersed crescent shapes. One point in crescent 1 is labelled red, one in crescent 2 is labelled blue. Intuitively you think to colour each crescent the same as its labelled member. But if you relied only on the labelled data, commonly used algorithms would just drive a classifying line through the perpendicular bisector of the line joining them. To use the unlabelled data you must feed in an assumption that points with the same label will tend to cluster. Vapnik’s transductive inference managed to achieve this. Lawrence and Jordan recently added the notion of the null category to the GP set-up to the same end for the Bayesians.

Invariance under small changes: we can expect small translations, rotations, and thinning or thickening of lines of images of numbers to represent the same number. Frequentists got here first by finding support vectors (most informative inputs), then forming virtual support vectors by making slight modifications to the originals. This was copied in the GP setting.

We heard of further methods devised by frequentists at the workshop:

Robustness under adversarial deletion: we may expect that informative inputs are robust enough that even if an adversary does their best to put us off with a given number of feature deletions, we’ll still be able to classify. It would be a bad symbol system if the removal of a single pixel from the image of a letter made it look like another. On the other hand, not many auditory features of ‘nineteen’ need be removed before it sounds like ‘ninety’, and letters are notoriously hard to hear over the telephone, such as ‘F’ and ‘S’.

Structural correspondence: sentences taken from a medical journal may be very different from sentences taken from the Wall Street Journal, but still we may expect that words close to certain ‘pivots’ will be the same parts of speech. So a word appearing after ‘a’ and before ‘required by’ will most likely be a noun in both financial and medical contexts. If we have tagged a large corpus of financial text, then an algorithm which has learned to classify these tags well on the basis of the selected pivots, should continue to work well on untagged medical texts. Shai Ben-David talked about generalization guarantees relating to this Structural Correspondence Learning. Keen as ever to get category theory in on the scene, I note that this work resonated with the discussion we had on analogies as spans.

Posted at December 13, 2006 9:57 PM UTC

TrackBack URL for this Entry:   http://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/1074

10 Comments & 1 Trackback

Re: Back from NIPS 2006

find a role even for ordinary category theory in machine learning presents an interesting challenge. Most of what I’ve seen relates to neural networks, e.g., here.

Interesting. In the abstract behind that link it says, among other things:

The categorical semantic model described here explains the learning process as the derivation of colimits and limits in a concept category.

Is there a quick way to explain the main idea behind this, not requiring to read the entire document?

Posted by: urs on December 14, 2006 2:42 PM | Permalink | Reply to this

Re: Back from NIPS 2006

I should find out more about this approach. As I say in the post, the machine learning people I work with use kernel methods, rather than neural networks, so I haven’t found out much about the latter. Some of the work on Gaussian processes arose from the realisation that they correspond to a neural network whose hidden layer of nodes has become infinite. It has been suggested, however, that with this move, the possibility of understanding how a network represents concepts is lost. This is discussed in the useful 4 page Introduction to Carl Rasmussen and Chris Williams’ Gaussian Processes for Machine Learning.

More category theoretic modelling of biological systems has been carried out by Andrée Ehresmann and Jean-Paul Vanbremeersch in their Memory Evolutive Systems. Nils Baas, someone very closely involved in the interests of the Café, has teamed up with them contributing his theory of hyperstructures.

Posted by: David Corfield on December 15, 2006 9:11 AM | Permalink | Reply to this

Re: Back from NIPS 2006

I’m wondering whether the mathematics of machine learning fits into a “causal/conceptual ordering” scheme. (See your comment of October 23, 2006.) It seems that in machine learning one starts from “effects” (i.e., human learning) and attempts to find “causes” (mathematical models). The mathematics is judged, in part, by the appropriateness of its “effects”, at least, by the ability of a model to mimic the “effects” through adjustment of parameters, etc. This observation might apply to mathematical modeling in general. (I’m assuming that by “causal/conceptual ordering” you are referring to a mode of reasoning that starts from first principles and infers the rest of mathematics from these principles. So the direction of reasoning is from “cause” to “effect” rather than from “effect” to “cause”.)
dennis

Posted by: Dennis on December 14, 2006 3:17 PM | Permalink | Reply to this

Re: Back from NIPS 2006

Yes, in all sciences there’s a to-and-fro search for first principles. If you wanted a one sentence summary of Lakatos’s philosophy of mathematics, you could say that it was the insistance that mathematical reasoning be seen to be bidirectional. You spot a pattern, try to find a derivation for it, sharpened your concepts, find new facts, look for new principles, etc.

I don’t think machine learning has quite found a widely accepted principled basis yet, even though two very famous mathematicians have contributed: Steve Smale (also here) and David Mumford (I posted about him here). For a discussion of the immaturity of machine learning, see here. But then maybe mathematics too has hardly begun.

Posted by: David Corfield on December 15, 2006 9:45 AM | Permalink | Reply to this

Re: Back from NIPS 2006

You wrote: “You spot a pattern, try to find a derivation for it, sharpened your concepts, find new facts, look for new principles, etc.”

It seems these days that an arena of mathematics can spring up rapidly from deriving patterns or modelling. Not too long ago computer graphics carried a fairly lightweight mathematical toolkit. Now, undergraduate mathematical textbooks in the subject are readily available. I understand considerable mathematical sophistication has entered the field as well. This mathematics may well become, or perhaps already constitutes, a sophisticated applied field of mathematics in its own right – influenced by the needs of the computer graphics industry but nonetheless situated within the framework of typical mathematical deductive systems.

Posted by: Dennis on December 20, 2006 7:53 PM | Permalink | Reply to this

Re: Back from NIPS 2006

David wrote:

For a discussion of the immaturity of machine learning, see here. But then maybe mathematics too has hardly begun.

I wouldn’t be surprised if it turns out that at every stage — from the beginning of life in the Universe to its very end — the process of learning, and learning how to learn, feels like it has just begun.

This is the flip side of Kevin Kelly’s notion that the Singularity is always near.

Posted by: John Baez on December 21, 2006 1:09 AM | Permalink | Reply to this

Re: Back from NIPS 2006

NIPS is a promising endeavor.

Perhaps this work should be integrated with that of Nobel 2000 Medicine / Physiology laureate Eric R Kandel:

‘The Molecular Biology of Memory Storage: A Dialogue between Genes and Synapses’ (lecture).

Posted by: Doug on December 15, 2006 2:04 PM | Permalink | Reply to this

Bayesian reasoning and natural selection

A bit of a digression, but not completely unrelated:

I spent yesterday talking to my friend Chris Lee. He’s working on very general concepts of data analysis, pattern recognition and information theory as applied to genomics. I keep thinking he’d have a lot of fun talking to David Corfield if I could ever get them together…

From my diary:

Right now Chris is trying to understand natural selection from an information-theoretic standpoint. At what rate is information passed from the environment to the genome by the process of natural selection? How do we define the concepts here precisely enough so we can actually measure this information flow?

Chris pointed out an interesting analogy between natural selection and Bayesian inference.

The analogy is mathematically precise, and fascinating. In rough terms, it says that the process of natural selection resembles the process of Bayesian inference. A population of organisms can be thought of as having various ‘hypotheses’ about how to survive — each hypothesis corresponding to a different allele. (Roughly, an allele is one of several alternative versions of a gene.) In each successive generation, the process of natural selection modifies the proportion of organisms having each hypothesis, according to Bayes’ law!

Now let’s be more precise:

Bayes’ law says if we start with a ‘prior probability’ for some hypothesis to be true, divide it by the probability that some observation is made, then multiply by the ‘conditional probability’ that this observation will be made given that the hypothesis is true, we’ll get the ‘posterior probability’ that the hypothesis is true given that the observation is made.

Formally, the exact same equation shows up in population genetics! In fact, Chris showed it to me - it’s equation 9.2 on page 30 of this book:

  • R. Bürger, The Mathematical Theory of Selection, Recombination and Mutation, section I.9: Selection at a single locus, Wiley, 2000.

But, now all the terms in the equation have different meanings!

Now, instead of a ‘prior probability’ for a hypothesis to be true, we have the frequency of occurence of some allele in some generation of a population. Instead of the probability that we make some observation, we have the expected number of offspring of an organism. Instead of the ‘conditional probability’ of making the observation, we have the expected number of offspring of an organism given that it has this allele. And, instead of the ‘posterior probability’ of our hypothesis, we have the frequency of occurence of that allele in the next generation.

(Here we are assuming, for simplicity, an asexually reproducing haploid population - that is, one with just a single set of chromosomes.)

This is a great idea - Chris felt sure someone must have already had it. A natural context would be research on genetic programming, a machine learning technique that uses an evolutionary algorithm to optimize a population of computer programs according to a fitness landscape determined by their ability to perform a given task. Since there has also been a lot of work on Bayesian approaches to machine learning, surely someone has noticed their mathematical relationship?

Posted by: John Baez on December 21, 2006 12:52 AM | Permalink | Reply to this

Re: Back from NIPS 2006

Chapter 19 (Why have Sex?) of David Mackay’s machine learning book is good reading on this subject!

It’s an information theoretic argument, but that’s just the other side of the coin, and Mackay is firmly in Bayesian camp. There’s a nice argument for the superiority of sexual over asexual recombination in terms of information transmitted from one generation to the next

Posted by: Allan E on December 21, 2006 2:09 AM | Permalink | Reply to this

Re: Back from NIPS 2006

If natural selection is analogous to learning, maybe sex is analogous to having discussions in which ideas combine!

Posted by: John Baez on December 23, 2006 4:48 PM | Permalink | Reply to this
Read the post Common Applications
Weblog: The n-Category Café
Excerpt: Why does the same piece of mathematics find many applications?
Tracked: December 21, 2006 3:31 PM

Post a New Comment