November 25, 2006

Bulletholes

This blog, along with the String Coffee Table and the n-Category Café, is served as application/xhtml+xml to compatible browsers. All three therefore need to be well-formed at all times. Otherwise, visitors will see a “yellow screen-of-death” instead of the desired content.
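As a rough illustration of what serving to “compatible browsers” can mean, here is a minimal sketch in Python (not the language these blogs actually run on). It assumes, and this is only an assumption, that compatibility is judged from the request's Accept header; the function name pick_content_type is hypothetical.

```python
def pick_content_type(accept_header: str) -> str:
    """Return application/xhtml+xml only when the client explicitly lists it.

    Assumption: compatibility is inferred from the HTTP Accept header.
    A wildcard such as */* is deliberately NOT treated as support.
    """
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        q = 1.0  # default quality value when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        if media_type == "application/xhtml+xml" and q > 0:
            return "application/xhtml+xml"
    # Everyone else gets the same markup labelled as plain HTML.
    return "text/html"
```

The conservative fallback matters here: a client that receives application/xhtml+xml will refuse to render anything that is not well-formed, which is exactly why the rest of this post worries about well-formedness.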

In order to ensure well-formedness, user input is validated before it can be posted. A local copy of the W3C Validator is hooked into the “preview” function for comments and entries. And, in the case of comments, we rigorously enforce that comments validate before they can be posted.

That sounds great in theory. And, in practice, it seems to have worked quite well. One might even be forgiven for complacently thinking the arrangement bulletproof.

But then Henri Sivonen came along[1] to point out that one has been living in a fool’s paradise. The W3C Validator fails even to enforce well-formedness. Actually, the fault is not in the software written by the W3C, but in the onsgmls SGML parser, which has only limited support for XML.

Far from being bulletproof, it was quite trivial to introduce non-well-formed content onto these blogs. That none of the previous six thousand or so comments have done so can be attributed either to dumb luck, or to the essential goodness of humanity. Needless to say, neither can be counted upon.

So, as a quick and dirty hack, if the W3C Validator says your comment is valid, I run it through a real XML parser, just to be sure. It seems a bit redundant, and the XML parser bails at the first well-formedness error (so it could take several passes to catch all the well-formedness errors missed by the W3C Validator). A better solution would be for someone to fix OpenSP 1.5.2, to ensure that onsgmls actually checks for well-formedness when operating in XML mode.
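To make the “bails at the first error” behaviour concrete, here is a minimal sketch of such a second pass. It is written in Python with the standard library's expat bindings rather than the Perl parser used on this blog, and the function name first_wellformedness_error is hypothetical. It assumes the input is a complete XML document, not a bare comment fragment.

```python
import xml.parsers.expat


def first_wellformedness_error(markup: str):
    """Return (line, column, message) for the first error, or None if well-formed.

    Like any conforming XML parser, expat stops at the first fatal error,
    so a document with several problems needs several fix-and-retry passes.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this chunk as the final one.
        parser.Parse(markup, True)
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        return (err.lineno, err.offset, message)
    return None
```

Calling it on a document with an unclosed element, such as `<p><b>bad</p>`, returns a tuple describing the first problem only; fixing that and calling again would reveal the next one, if any.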

Update (11/27/2006):

It seems to me that there are only about 3 people in the world using it, but I might as well release an updated version of the MTValidate plugin.

Version 0.4 of the plugin incorporates a new configuration option in /plugins/validator/config/validator.conf. Setting

XHTML_Check  = 1

runs ostensibly “valid” comments through a real XML parser, ensuring that they really are well-formed. To use this option, you’ll need the XML::LibXML Perl Module.
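The gating logic can be sketched as follows. This is an illustration in Python, not the plugin's Perl; the helper names read_flag, is_well_formed and accept_comment are hypothetical, and the config reader only understands simple `Key = Value` lines, which is an assumption about the file's layout.

```python
import xml.parsers.expat


def read_flag(config_text: str, name: str, default: bool = False) -> bool:
    """Read a 0/1 option from 'Key = Value' lines (assumed config layout)."""
    for line in config_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip() == "1"
    return default


def is_well_formed(markup: str) -> bool:
    """True if a real XML parser (expat here) accepts the whole document."""
    try:
        xml.parsers.expat.ParserCreate().Parse(markup, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


def accept_comment(validator_says_valid: bool, markup: str, config_text: str) -> bool:
    """Run the second pass only on 'valid' input, and only when enabled."""
    if not validator_says_valid:
        return False
    if read_flag(config_text, "XHTML_Check"):
        return is_well_formed(markup)
    return True
```

With the option off, the validator's verdict stands on its own; with it on, a document the validator passed can still be rejected by the XML parser.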

The new version also incorporates yet more user-friendly error messages from version 0.74 of the W3C Validator.


[1] In response to a bit of flamebait from Anne van Kesteren.

Posted by distler at November 25, 2006 2:33 AM

TrackBack URL for this Entry:   http://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/1053

12 Comments & 1 Trackback

Re: Bulletholes

There are several other limitations as well

Posted by: Lachlan Hunt on November 25, 2006 10:55 AM

More limitations

In light of that long list, “fixing” OpenSP might not be an option. Perhaps it would be better to use a real XML parser for validating XML documents.

Posted by: Jacques Distler on November 25, 2006 11:21 AM

Re: Bulletholes

“It seems to me that there are only about 3 people in the world using it”

Who is the 3rd?

Posted by: Zack on November 28, 2006 1:18 PM

Only the lonely

Brian Koberlein uses it. Phil Ringnalda used to use it (until, for some reason, he converted to WordPress, and then stopped blogging). And I know of at least one non-public blog that uses it.

Maybe there are more; but they’ll have to step forward and say so.

Posted by: Jacques Distler on November 28, 2006 1:43 PM

An ex-blogger

“stopped blogging”

He’s not dead, he’s resting. Remarkable bird, the Norwegian Blue, isn’t it, eh? Beautiful plumage.

Posted by: Phil Ringnalda on November 28, 2006 11:04 PM

Re: An ex-blogger

(Customer): He’s not pinin’! He’s passed on! This parrot is no more! He has ceased to be! He’s expired and gone to meet his maker! He’s a stiff! Bereft of life, he rests in peace! If you hadn’t nailed him to the perch he’d be pushing up the daisies! His metabolic processes are now history! He’s off the twig! He’s kicked the bucket, he’s shuffled off his mortal coil, run down the curtain and joined the bleedin’ choir invisible!! This is an ex-parrot!!

(Shop Owner): Well, I’d better replace it, then. (he takes a quick peek behind the counter) Sorry squire, I’ve had a look ‘round the back of the shop, and uh, we’re right out of parrots.

We miss you.

Posted by: Jacques Distler on November 28, 2006 11:42 PM

Re: Bulletholes

Dear Jacques,

this is remotely on-topic because it is also about formatting issues etc.

I believe that the proposed 2007 notation for the preprint numbers, now announced on the arxiv.org main page, is ugly and will cause completely unnecessary havoc.

I suggest that the current scheme is kept, and when you exceed 999 in a given archive, you continue with a00 up to z99 - which are 2600 new papers you can add every month, making the total number of papers per month up to 3600 in every archive.

If you exceed 3600, you can add new papers from aaa to zzz which is 26^3 extra papers (over 15,000?).

Best
Lubos

Posted by: Lubos Motl on December 2, 2006 4:27 PM

Yes, it’s off-topic, but …

Yes, I agree with you that removing, from the identifier, any vestige which would identify the subject-area of a preprint is a mistake.

Even if you want an arXiv-wide numbering scheme, there’s no reason why you couldn’t have identifiers that looked like math.AG/0705.0127 or hep-th/0705.0128. Note that these, though consecutively-received preprints, could be retrieved from a URL structure based on the above stems, e.g. http://arxiv.org/abs/math.AG/0705.0127, just like the current system.

If it’s any consolation to you, they didn’t consult with me about it.

Posted by: Jacques Distler on December 2, 2006 11:34 PM

Re: Yes, it’s off-topic, but …

I see. You have thought about it more than I did. ;-) You may contact them if they didn’t contact you. Perhaps. Have a nice evening, Lubos

Posted by: Lubos Motl on December 3, 2006 8:45 PM

Re: Bulletholes

Why didn’t you join the qa-dev group, propose your patches, and participate in the discussion with other developers?

public-qa-dev@w3.org is the publicly archived list of the qa-dev effort. The archive is available for everyone to consult, and anyone can subscribe, but please contact the coordinators if you wish to participate in the qa-dev effort.

Posted by: karl Dubost, W3C on December 5, 2006 1:22 AM

Patches

MTValidate is already a fork of the W3C Validator, so the patch won’t be directly applicable to your code. But, at least, it should give you an idea …

--- mtvalidate-0.3/MTValidate.pl        2004-06-24 17:49:47.000000000 -0600
+++ mtvalidate-0.4/MTValidate.pl        2006-11-28 10:08:39.000000000 -0600
@@ -1,5 +1,5 @@
-# MTValidate 0.3
-# $Date: 2004/06/20 14:18:33 $
+# MTValidate 0.4
+# $Date: 2006/11/26 14:18:33 $
 
 # by Jacques Distler <distler@golem.ph.utexas.edu>
 # original by Alexei Kosut <akosut@cs.stanford.edu>
@@ -108,6 +105,7 @@
        -DefaultConfig    => { 
                               SGML_Parser       => '/usr/bin/onsgmls',
                              SGML_Library      => "$vdir/sgml-lib",
+                             XHTML_Check       => 0,
                             },
       );
     my %cfg = Config::General->new(%config_opts)->getall();
@@ -587,6 +585,30 @@
    $T->param(context => "entry");
 }
 
+# Hack to overcome shortcomings of onsgmls:
+# Reparse with an XML parser and see...
+if ($File->{'Is Valid'} && !($#{$File->{Warnings}} >= 0) && $CFG->{XHTML_Check}) {
+   use XML::LibXML;
+   my $parser = new XML::LibXML;
+   $parser->line_numbers(1);
+   eval {
+      for (@{$File->{Content}}) {
+        $parser->parse_chunk( $_ . "\n" );
+      }
+      $parser->parse_chunk("", 1);
+   };
+   if ($@) {
+        my $messages = $@;
+
+        # Line-number indicators begin on a new line
+        $messages =~ s{[^\x0d\x0a](:\d+:)}{\n$1}g;
+
+        # Strip Perl line numbers from error message.
+        $messages =~ s{[^\x0d\x0a]+[\x0d\x0a]$}{};
+
+        &add_warning($File, "Error:",  "<pre>" . encode_entities(decode_entities($messages)) . "</pre>" );
+   }
+}
 
 if ($File->{'Is Valid'} && !($#{$File->{Warnings}} >= 0) ) {
     $T->param(VALID => MTV_TRUE);

That’s it.

I’m sure a modicum of effort could produce a more polished result, but this was good enough for my purposes.

Posted by: Jacques Distler on December 5, 2006 2:15 AM

Re: Bulletholes

Finally, there is no fully functional, all-in-one validator.

The only real way is to make sure that you know the markup languages 99.99% before you start doing anything, and that you have a clear idea of what you’re going to do.

Posted by: Shimon on December 18, 2006 10:44 PM
Read the post Validating Comments
Weblog: Musings
Excerpt: Run comments through the W3C Validator before posting. The first in a series of "How-To" articles on MovableType.
Tracked: January 13, 2007 2:21 PM