

September 3, 2004

Building a Better Mousetrap

Anyone who’s followed this weblog for any length of time knows that a fair amount of energy went into ensuring that it produces valid XHTML+MathML. Starting with an off-the-shelf CMS like MovableType or WordPress and “bullet-proofing” it, so that you can safely serve XHTML+MathML with the correct MIME-type, is a tedious process.

I wrote a series of tedious posts (I, II, III, IV, V) recounting my own experience with MovableType.

So my ears definitely pricked up when Henri Sivonen, in the comments to one of Anne van Kesteren’s posts, argued that a truly XML-based CMS would make this task easier.

At their core, CMSs in general (or weblogging systems in particular) are very simple. They receive content (posts, comments, trackbacks, …), which they run through some templates to produce pages.

It’s these templates that Henri wants to improve. “Tag-soup” languages, like PHP, operate on strings to produce other strings (which might or might not turn out to be well-formed XML). Maybe your templates are correct, and will turn valid input into valid output (out of the box, both MovableType and WordPress mostly do so), or maybe you’ve attempted to customize them, and they now produce invalid poo.

Henri’s templating system acts on document trees. Well-formedness is built in. Only at the very end is the output run through an XML serializer to produce a web page. Naïvely, it sounds a little like XSLT, which allows you to specify rules for transforming one XML document into another.
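To make the contrast with string templating concrete, here is a minimal sketch in Java (my own illustration using the standard DOM and JAXP APIs, not Henri’s engine) of what operating on document trees buys you: nesting and escaping are handled by the tree API, so the serialized page is well-formed by construction.

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import java.io.StringWriter;

    public class TreeTemplate {
        static final String XHTML = "http://www.w3.org/1999/xhtml";

        // Build the page as a tree, not as a string: nesting and escaping are
        // handled by the DOM API, so the result cannot be ill-formed.
        public static String render(String title, String body) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                                                 .newDocumentBuilder().newDocument();
            Element html = (Element) doc.appendChild(doc.createElementNS(XHTML, "html"));
            Element head = (Element) html.appendChild(doc.createElementNS(XHTML, "head"));
            head.appendChild(doc.createElementNS(XHTML, "title")).setTextContent(title);
            Element bodyEl = (Element) html.appendChild(doc.createElementNS(XHTML, "body"));
            // Any "<" or "&" in the text ends up escaped on serialization,
            // instead of silently turning into broken markup.
            bodyEl.appendChild(doc.createElementNS(XHTML, "p")).setTextContent(body);

            // Only at the very end is the tree run through an XML serializer.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(render("Building a Better Mousetrap", "5 < 7 & counting"));
        }
    }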

Henri and his collaborators have a paper describing the system, and Henri has just released some source code.

Personally, I don’t think that it’s the templating system that’s the real Achilles’ heel of today’s CMSs. I, rather dismissively, summarized that part of the story with a

Before we plunge in, it would be wise to make sure your pages are valid XHTML 1.1.

in my writeup. The real problem is dealing with dodgy content. Much of it (comments, trackbacks, etc.) comes from sources who neither know nor care about the exigencies of producing valid XHTML. Henri’s stance seems to be, “If it doesn’t parse to valid XML, reject it.” Which probably means you’re not going to have a whole lot of content on your Sivonen-powered weblog. Gently guiding users toward producing valid content is the hard problem that no one has really addressed.

Still, aside from a few folks like yours truly who’ve cobbled something together, Henri’s system (which involves more than just the templating engine that he’s released) is the only one which validates the content on the way into the database. Two years after starting this weblog, I’m still waiting to see content-validation offered as even an optional feature on any mainstream CMS.
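For what it’s worth, the “validate on the way in” idea doesn’t require much machinery. Here is a rough sketch (my own, in Java, and checking only well-formedness; a real system would layer schema validation, e.g. with Jing, on top) of a gate that refuses any comment which doesn’t parse:

    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;
    import java.io.StringReader;

    public class CommentGate {
        // Wrap the comment fragment in a dummy XHTML <div> and attempt a
        // namespace-aware parse. Content that does not parse never reaches
        // the database. (This checks well-formedness only; checking validity
        // would mean running the same events through a RELAX NG validator.)
        public static boolean accept(String commentFragment) {
            String wrapped = "<div xmlns=\"http://www.w3.org/1999/xhtml\">"
                           + commentFragment + "</div>";
            try {
                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true);
                factory.newSAXParser()
                       .parse(new InputSource(new StringReader(wrapped)), new DefaultHandler());
                return true;
            } catch (Exception e) {   // SAXException, IOException, parser configuration errors
                return false;
            }
        }

        public static void main(String[] args) {
            System.out.println(accept("<p>fine</p>"));          // true
            System.out.println(accept("<p>unclosed <b>tag"));   // false
        }
    }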

Posted by distler at September 3, 2004 10:20 AM


2 Comments & 0 Trackbacks

Re: Building a Better Mousetrap

I should mention that even though the template engine was my area, the whole system was not written by me. We had a team of six people working on the project.

Henri’s stance seems to be, “If it doesn’t parse to valid XML, reject it.” Which probably means you’re not going to have a whole lot of content on your Sivonen-powered weblog.

That I don’t have a “Sivonen-powered” blog is not due to inability to mold input into a suitable format but due to the lack of a blogging back end.

The input format does not need to be XHTML as long as the input format can be converted into XHTML before it flows any further into the system. This can be done either using a foo2xhtml converter followed by an XML parser, or using a parser that parses foo but emits SAX events as if it were an XML parser parsing XHTML. In particular, dealing with HTML soup is a solved problem. John Cowan has done the heavy lifting and has written the aptly-named TagSoup, which parses tag soup but appears to the application as an XML parser parsing XHTML.
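As a concrete sketch of that second approach (my own minimal example, assuming TagSoup’s org.ccil.cowan.tagsoup.Parser is on the classpath): the parser is an ordinary SAX XMLReader, so JAXP’s identity transform can serialize the XHTML that comes out the other side.

    import org.ccil.cowan.tagsoup.Parser;            // John Cowan's TagSoup
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.sax.SAXSource;
    import javax.xml.transform.stream.StreamResult;
    import java.io.StringReader;
    import java.io.StringWriter;

    public class SoupToXhtml {
        // TagSoup parses arbitrary tag soup but, to the application, looks like
        // an XML parser parsing XHTML: it is an ordinary SAX XMLReader.
        public static String clean(String soup) throws Exception {
            XMLReader tagSoup = new Parser();
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            StringWriter out = new StringWriter();
            identity.transform(
                new SAXSource(tagSoup, new InputSource(new StringReader(soup))),
                new StreamResult(out));
            return out.toString();
        }

        public static void main(String[] args) throws Exception {
            // Unclosed tags and unescaped ampersands come out well-formed.
            System.out.println(clean("<p>fish & chips <b>unclosed"));
        }
    }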

Two years after starting this weblog, I’m still waiting to see content-validation offered as even an optional feature on any mainstream CMS.

Assuming reasonable effort, correctness requires a library infrastructure that comes with built-in correctness. I think there is a great disconnect between mainstream hosting and correct infrastructure. In order to be able to do correct things with XML, a good XML infrastructure is needed. A good XML infrastructure, in turn, requires a good Unicode infrastructure.

Java has both. Also, thanks to the culture of avoiding JNI extensions, the threshold of throwing in new third-party jars is relatively low compared to environments where introducing a new dependency means you have to figure out how to compile the wrappers for a given C library on a given platform.

We didn’t need to write a SAX pipeline framework or an XML serializer. David Brownell had already written those. We didn’t need to write a tree object model for XML. There were several to choose from. We didn’t need to write a validator. James Clark had already written Jing. We didn’t need to write an HTML parser. John Cowan had already written TagSoup. We didn’t need to develop elaborate mapping tables for creating URLs from titles. Instead, we used ICU4J to convert the title into NFKD and then filtered out the URL-unsafe characters.
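The title-to-URL trick deserves spelling out, since it is so much simpler than maintaining mapping tables. A rough sketch using ICU4J (my own; the NFKD step is the one described above, while the particular filtering rules below are only a guess at what “URL-unsafe” should mean):

    import com.ibm.icu.text.Normalizer2;

    public class TitleSlug {
        // NFKD decomposition splits accented letters into a base letter plus
        // combining marks; dropping the non-ASCII leftovers gives a plain-ASCII
        // approximation of the title, which is then reduced to URL-safe characters.
        public static String slugify(String title) {
            String decomposed = Normalizer2.getNFKDInstance().normalize(title);
            String ascii = decomposed.replaceAll("[^\\p{ASCII}]", "");
            return ascii.toLowerCase()
                        .replaceAll("[^a-z0-9]+", "-")
                        .replaceAll("^-+|-+$", "");
        }

        public static void main(String[] args) {
            System.out.println(slugify("Naïve Mousetraps & Crème Brûlée"));
            // prints: naive-mousetraps-creme-brulee
        }
    }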

However, Java hosting is not mainstream, although Java is mainstream on internal enterprise servers. PHP4 and Perl 5 (with Perl as the runner-up) seem to be the two mainstream languages as far as affordable hosting goes. As a result, in order to develop a mainstream CMS, it makes sense to use PHP4. The XML and Unicode infrastructure situation with PHP4 is dreadful, so it’s natural that features that require a strong XML and Unicode infrastructure don’t show up in mainstream PHP-based systems. CPAN makes the situation with Perl better, but language-level Unicode support was added to Perl relatively recently, so the non-Unicode legacy is still having an effect.

However, experience suggests it is possible to do some input checking in a PHP-based system. Why even that level of checking is not mainstream cannot be explained by the lack of cool libraries.

Posted by: Henri Sivonen on September 4, 2004 8:13 AM | Permalink | Reply to this

Re: Building a Better Mousetrap

In the garage door industry, they are constantly trying to build the better mousetrap and then getting stepped on by the larger corporations. At least in Java and XML, or other code, people don’t pull the code right out from under you so you can’t finish or create the better mousetrap or code or what have you.

Posted by: garage door opener on September 10, 2004 9:26 PM | Permalink | Reply to this
