Confession

January 9, 2005

For the past 10 months, I’ve been serving up ill-formed XML, and nobody’s complained. Probably, no one even noticed.

How did I manage to keep up the charade for so long? By encoding the goop and passing it along as a string.

My Atom feed contained a

<content type="application/xhtml+xml" mode="escaped">

element for each entry. The content of that element would have been perfectly at home inside an XHTML+MathML document, whose DTD defines over 2000 named entities. But inside an Atom document, where those named entities are not defined, the result would have been ill-formed crap. Only the five “safe” ones (&amp;, &lt;, &gt;, &apos; and &quot;) are allowed. So I engaged in the last refuge of a scoundrel: mode="escaped". The result was a nominally well-formed Atom feed, but that was only 'cuz I hid all the bad stuff.
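To see why mode="escaped" hides the problem, here is a minimal Perl sketch of the escaping step. It assumes nothing about how the feed is actually generated, and xml_escape is a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that are markup-significant in element content,
# so that an XHTML fragment travels as plain character data.
# xml_escape is a hypothetical helper, not part of any Atom toolkit.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first, or the later escapes get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# The undefined &nbsp; becomes &amp;nbsp;, which any XML parser accepts.
print xml_escape('<p>x &nbsp; y</p>'), "\n";
```

After escaping, the XML parser never sees an entity reference other than the predefined ones; the undefined names only resurface when a consumer un-escapes the string.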

To do things right, I’d need to make sure that all the named entities (both the common ones, like &nbsp; and &copy;, and the esoteric ones, like &conint;) were converted to numeric character references (or UTF-8 characters, if that’s the character-encoding of the feed). There are tools to do that for the entities defined in the XHTML DTD, but nobody’d done it for the much larger number of named entities in the XHTML+MathML DTD. So I wrote an MT plugin to do the conversion.

The plugin adds a global filter

<MTEntryBody numeric_entities="1">

Non-MT users may be interested in the standalone Perl module version:

use MathML::Entities;

$conv2refs = name2numbered($string); # numeric character refs
$conv2utf8 = name2utf8($string); # utf-8 characters
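What such a conversion does can be sketched in plain Perl with no module at all. The table below is a three-entry illustration, and %CODEPOINT, %PREDEFINED and to_numeric are made-up names; this is not the plugin's actual code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative three-entry table; a real converter needs the whole
# XHTML+MathML entity set. All names here are hypothetical.
my %CODEPOINT = (
    nbsp   => 0x00A0,
    copy   => 0x00A9,
    conint => 0x222E,
);

# The five predefined XML entities must be left alone.
my %PREDEFINED = map { $_ => 1 } qw(amp lt gt apos quot);

# Replace each known named entity with a hexadecimal character reference.
# Unknown names are passed through unchanged.
sub to_numeric {
    my ($text) = @_;
    $text =~ s{&([A-Za-z][A-Za-z0-9]*);}{
        my $name = $1;
        (!$PREDEFINED{$name} && exists $CODEPOINT{$name})
            ? sprintf('&#x%04X;', $CODEPOINT{$name})
            : "&$name;"
    }ge;
    return $text;
}

print to_numeric('a&nbsp;&lt;&nbsp;b &copy; &conint;'), "\n";
```

Leaving unknown names untouched is a design choice for the sketch: it keeps the filter safe to run twice, at the cost of letting an undefined entity slip through.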

With the filter in place, I could change that fateful line in my Atom feed to

<content type="application/xhtml+xml" mode="xml">

A MathML-aware Feed Reader would, of course, make this effort worthwhile.

The peril of named entities doesn’t just apply to Atom feeds. It applies, as well, to web pages served as application/xhtml+xml. If I were to serve these XHTML+MathML pages to non-MathML-aware XHTML User Agents, like Camino or recent versions of Opera or Safari, the result would be a fatal parsing error. The same thing would happen if you used a custom DTD on an ordinary XHTML page.

Why? Because, when the browser encounters an unfamiliar DOCTYPE Declaration, only the five “safe” named entities are allowed. A single &nbsp; in your document and … poof! … a yellow screen of death.

Personally, I avoid the issue by only serving these pages as application/xhtml+xml to MathML-aware browsers. Other browsers would not know what to do with the MathML anyway, so I just send them text/html. Most people prefer to decide what MIME type to send, based on the browser’s Accept header.
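The Accept-header approach can be sketched roughly as follows. This is an illustration only, not this site's configuration; choose_mime_type is a made-up helper, and it deliberately ignores wildcards such as */*.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return application/xhtml+xml only when the client
# lists that type explicitly with a nonzero q-value; otherwise text/html.
sub choose_mime_type {
    my ($accept) = @_;
    $accept = '' unless defined $accept;
    for my $range (split /,/, $accept) {
        $range =~ s/^\s+|\s+$//g;                      # trim whitespace
        my ($type, @params) = split /\s*;\s*/, $range;
        next unless defined $type && lc($type) eq 'application/xhtml+xml';
        my $q = 1;                                     # default quality
        for my $param (@params) {
            $q = $1 if $param =~ /^q\s*=\s*([01](?:\.[0-9]{0,3})?)$/i;
        }
        return 'application/xhtml+xml' if $q > 0;
    }
    return 'text/html';
}

print choose_mime_type($ENV{HTTP_ACCEPT}), "\n";
```

Note that this only tells you the client claims to handle XHTML; it says nothing about MathML support, which is why sniffing for MathML-aware browsers is a different (and stricter) test.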

If you’re just using XHTML, then existing tools will handle the necessary conversions for you. But if you are dabbling in MathML, you need something a little stronger.

Aside: In eliminating the duplicates between the entities defined in the MathML and XHTML DTDs, I discovered a mistake in the MathML DTD. &Upsilon; is mapped to &#x03D2; (ϒ, which is &Upsi;) instead of the correct &#x03A5; (Υ, as in the XHTML DTD). And don’t get me started on the fact that &#x03C6; = &straightphi; is the curly one (φ), whereas &#x03D5; = &phi; is the straight one (ϕ). I ended up issuing a revision to itex2MML, so that, whatever the Unicode garbage, the TeX codes would be converted to the expected glyphs (except in Safari, which has it backwards, which is to say straight … umh…).

Update:

Unfortunately, NetNewswire 2.0beta does not really know what to do with Atom feeds whose <content type="application/xhtml+xml" mode="xml">. It insists on decoding everything (as if they had been sent with mode="escaped"). Which made this post utterly illegible. Oh, well … that was fun while it lasted. (This is now fixed in the latest betas.)

Update (1/10/2005):

Perl Module updated with corrections and additions from MathML 2.0, 2nd Edition.

Some people may wonder why I don’t provide an inverse mapping from numeric character references to MathML named entities. The reason is that the original mapping is neither 1-1 nor onto. There are multiple MathML named entities that map to the same Unicode character. And some MathML named entities map to a sequence of two or more Unicode characters.
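A small sketch makes the point concrete. The four entries below are illustrative, with code points taken from the current W3C entity tables (an assumption; the MathML 2.0 DTD may differ in detail), and %EXPANSION and %names_for are made-up names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative subset: entity name => list of code points it expands to.
my %EXPANSION = (
    rarr          => [0x2192],
    rightarrow    => [0x2192],
    RightArrow    => [0x2192],
    NotEqualTilde => [0x2242, 0x0338],    # expands to two characters
);

# Group the names by what they expand to.
my %names_for;
for my $name (sort keys %EXPANSION) {
    my $key = join ' ', map { sprintf 'U+%04X', $_ } @{ $EXPANSION{$name} };
    push @{ $names_for{$key} }, $name;
}

# Any key with several names cannot be inverted unambiguously, and a
# multi-character key has no single numeric character reference at all.
for my $key (sort keys %names_for) {
    printf "%-14s <- %s\n", $key, join(', ', @{ $names_for{$key} });
}
```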

Update (1/11/2005):

Standalone Perl Module released. As with any Perl module,

perl Makefile.PL
make
make test
make install

should be all that’s needed.

Posted by distler at January 9, 2005 12:29 AM

TrackBack URL for this Entry: https://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/493

12 Comments & 1 Trackback

Re: Confession

I think we discussed this a while ago, not? For example in June 2004 :-)

Nice that you are not relying on a DTD anymore though.

Posted by: Anne on January 9, 2005 4:20 AM

Necessity is …

I think we discussed this a while ago, not?

Discussing it is one thing. Implementing it is another.

The XHTML+MathML DTD defines ~~2223 named entities (2007 distinct ones)~~ 2336 named entities (2121 distinct ones) [MathML 2.0, 2nd Edition] which could legally appear somewhere on these pages. Someone needed to create a tool to filter them.

I didn’t feel any particular compelling need until recently.

Posted by: Jacques Distler on January 9, 2005 11:05 AM

Re: Confession

For the past 10 months, I’ve been serving up ill-formed XML, and nobody’s complained. Probably, no one even noticed.

Since the feed XML document itself was not ill-formed, that seems a bit sensational.

To do things right, I’d need to make sure that all the named entities (both the common ones, like &nbsp; and &copy;, and the esoteric ones, like &conint;) were converted to numeric character references (or UTF-8 characters, if that’s the character-encoding of the feed).

Doing so is a good idea even if the XML document is not DTDless.

My feed is generated from my HTML front page by a cron job, but the character data lives as unescaped UTF-16 in memory before serializing, so I avoid the entity issue.

There are tools to do that for the entities defined in the XHTML DTD, but nobody’d done it for the much larger number of named entities in the XHTML+MathML DTD.

The need for a special tool could be avoided by using a SAX parser that resolves external entities and sends the ContentHandler events to a SAX serializer.

A MathML-aware Feed Reader would, of course, make this effort worthwhile.

The most probable way of fixing this would be making Sage emit XHTML with the appropriate DOM gymnastics. Unfortunately, scrubbing the potential platypuses is a major pain on the DOM level and would be easier on the SAX level.

Personally, I avoid the issue by only serving these pages as application/xhtml+xml to MathML-aware browsers.

In theory, a UA might be MathML-aware but still not processing the DTD. There are XHTML-aware UAs that do not process the XHTML DTDs.

Why? Because, when the browser encounters an unfamiliar DOCTYPE Declaration, only the five “safe” named entities are allowed. A single &nbsp; in your document and … poof! … a yellow screen of death.

For the record, spec-wise the result does not need to be that drastic. If the XML processor does not attempt to process the DTD, it is allowed to merely report skipped entities to the app which, in turn, could display some useless placeholders instead of halting. However, in Mozilla the XML processor is fooled into believing it processes the DTD even when it is actually handed a zero-length stream as the DTD.

Update: Unfortunately, NetNewswire 2.0beta does not really know what to do with Atom feeds whose <content type="application/xhtml+xml" mode="xml">. It insists on decoding everything (as if they had been sent with mode="escaped"). Which made this post utterly illegible. Oh, well … that was fun while it lasted.

At present, the main arguments for using Atom are vaporware arguments.

Posted by: Henri Sivonen on January 9, 2005 5:13 AM

Fair is fair

Since the feed XML document itself was not ill-formed, that seems a bit sensational.

Since I recently criticized NOAA for doing the same thing, this seems only fair.

In theory, a UA might be MathML-aware but still not processing the DTD.

In practice, there aren’t enough distinct MathML-aware UAs to worry about this.

Doing so is a good idea even if the XML document is not DTDless.

Currently, I’m only filtering my Atom feed. Ultimately, I may do the same thing with my XHTML pages as well.

For the record, spec-wise the result does not need to be that drastic. If the XML processor does not attempt to process the DTD, it is allowed to merely report skipped entities to the app which, in turn, could display some useless placeholders instead of halting. However, in Mozilla the XML processor is fooled into believing it processes the DTD even when it is actually handed a zero-length stream as the DTD.

Seems to be true of every other XML-capable browser I’ve tried, too. Which makes sense: nobody wants to display useless placeholders in place of common XHTML named entities.

At present, the main arguments for using Atom are vaporware arguments.

While I could always achieve the same effect with namespaced extensions to RSS 2.0, there’s little likelihood that feedreaders would actually implement those extensions.

As this little experiment shows, it’s too early to declare Atom syndication a resounding success, but there are considerable signs of progress.

Posted by: Jacques Distler on January 9, 2005 10:39 AM

Re: Fair is fair

Seems to be true of every other XML-capable browser I’ve tried, too. Which makes sense: nobody wants to display useless placeholders in place of common XHTML named entities.

Over here, Safari 1.2.4 displays placeholders and Opera 7.5.1 silently ignores skipped entity references. (Test case)

As this little experiment shows, it’s too early to declare Atom syndication a resounding success, but there are considerable signs of progress.

I have unsubscribed from atom-syntax twice, because there was a lot of discussion over details (plus the RDF permathread) but very little spec progress. As it stands, the major selling points (thoroughly specified; sorts out the “entity-encoded HTML” mess) are vaporware. I will reassess the situation when there is an actual RFC.

Posted by: Henri Sivonen on January 11, 2005 6:18 AM

Yellow Screen of Death

Over here, Safari 1.2.4 displays placeholders and Opera 7.5.1 silently ignores skipped entity references.

Previous versions of Safari used to give a yellow screen of death when visiting this page. Version 1.2.4 renders the page (even when we fake the User Agent to be Netscape 7), but what it does is escape the named entities (α → &alpha;). I guess that’s what you meant by “placeholders.” Actually, it’s similar to the behaviour of Safari’s tag-soup parser: it, too, escapes unfamiliar named entities.

Opera, as you say, just skips them.

All in all, much friendlier behaviour than I’ve come to expect.

Posted by: Jacques Distler on January 11, 2005 8:40 AM

Re: Confession

FYI: there is a program that does the name→numeric mapping… at least it claims it does. I have not tried it. See http://www.orcca.on.ca/MathML/software.html

I think your “aside” is mistaken about the mapping of Upsilon in MathML 2.0. It was wrong in MathML 1 (as were many other entries), but it was corrected in MathML 2. I agree that Unicode’s change to phi and epsilon [you forgot that one] created problems.

Posted by: Neil Soiffer on January 10, 2005 2:19 PM

Re: Confession

FYI: there is a program that does the name→numeric mapping… at least it claims it does. I have not tried it. See http://www.orcca.on.ca/MathML/software.html

That only converts the 251 HTML named entities (and is not as useful or versatile as the Perl modules I linked to above).

My module handles all 2121 named entities in the XHTML+MathML DTD [MathML 2.0, 2nd Edition].

I think your “aside” is mistaken about the mapping of Upsilon in MathML 2.0. It was wrong in MathML 1 (as were many other entries), but it was corrected in MathML 2. I agree that Unicode’s change to phi and epsilon [you forgot that one] created problems.

My “aside” referred to the MathML 2.0 DTD included with the W3C Validator (in the particular case of Upsilon, scroll down, as it is declared twice). Indeed, these particular problems (among many others) are corrected in MathML 2.0, 2nd Edition.

I don’t know what mappings are used by various User Agents (your plugin, different vintages of Mozilla/Firefox), so, for consistency, I probably should apply this filter to all my MathML output.

Posted by: Jacques Distler on January 10, 2005 3:00 PM

Re: Confession

I don’t know what mappings are used by various User Agents (your plugin, different vintages of Mozilla/Firefox), so, for consistency, I probably should apply this filter to all my MathML output.

Do you ever use entities that map to astral characters? What happens with those is interesting. My unverified guess is that in Mozilla the gfx math hacks won’t apply and you need a font with proper Unicode mappings for those characters.

Posted by: Henri Sivonen on January 11, 2005 5:56 AM

Astral

MathML uses lots of plane1D characters:

Fraktur letters (\mathfr{}): $\mathfr{A},\mathfr{a},\mathfr{K},\mathfr{k}$
Blackboard bold letters (\mathbb{}): $\mathbb{A},\mathbb{a},\mathbb{K},\mathbb{k}$
Calligraphic letters (\mathcal{}): $\mathcal{A},\mathcal{a},\mathcal{K},\mathcal{k}$

Mozilla (Mac) fails to render any of them correctly.

And, yes, I use them frequently in my posts (which sucks).

Posted by: Jacques Distler on January 11, 2005 8:10 AM

Re: Astral

When you feed those entities to Gecko, they are expanded to PUA characters instead of astral characters. The gfx is supposed to map these to code points in known math fonts. This ugliness was introduced when Mozilla’s strings were officially UCS-2 strings. Now surrogate pairs are supported for Chinese, so this stuff should, IMO, be cleaned up as well.

The Mac gfx is broken in many ways. I still think the real characters are worth testing as they might actually reach ATSUI and render given a proper font.
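For reference, the surrogate-pair arithmetic mentioned above is short. A sketch in Perl, using U+1D504 (MATHEMATICAL FRAKTUR CAPITAL A) as the example; utf16_surrogates is a made-up helper name, and nothing here reflects how Gecko itself is implemented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a code point above U+FFFF into its UTF-16 surrogate pair.
# utf16_surrogates is a hypothetical helper name.
sub utf16_surrogates {
    my ($cp) = @_;
    die "not an astral code point\n" if $cp < 0x10000 || $cp > 0x10FFFF;
    my $offset = $cp - 0x10000;            # 20 significant bits
    return (
        0xD800 + ($offset >> 10),          # high surrogate: top 10 bits
        0xDC00 + ($offset & 0x3FF),        # low surrogate: bottom 10 bits
    );
}

# prints "U+1D504 -> D835 DD04"
printf "U+%X -> %04X %04X\n", 0x1D504, utf16_surrogates(0x1D504);
```

A string type limited to 16-bit units without surrogate support has no way to hold such a character, which is the constraint that PUA remapping works around.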

Posted by: Henri Sivonen on January 11, 2005 12:35 PM

Re: Astral

For various boring reasons that I don’t want to get into at the moment, this blog is served as iso-8859-1, not utf-8. However, it’s easy enough to take one of my MathML-heavy pages, run it through the following stream filter

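#!/usr/bin/perl
use strict;
use warnings;
use MathML::Entities;

# Encode everything written to STDOUT as UTF-8.
binmode STDOUT, ":utf8";

# Read the page line by line (from files named on the command line, or
# from STDIN) and replace named entities with the characters themselves.
while (<>) {
  print name2utf8($_);
}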

and then serve the resulting page as utf-8.

This didn’t do a thing to fix the plane1D characters. At least it looked no worse than the original.

Posted by: Jacques Distler on January 11, 2005 1:41 PM

Read the post get_html_translation_table(ÜBER)
Weblog: Evan Nemerson's Blog
Excerpt: I'm creating a plugin for Serendipity which translates named entities (such as &copy;) into their numeric equivalents (e.g. &#169;). What I've come up with is an array of 2701 entities (compared to the 100 from PHP's get_html_translation_table(HTML_
Tracked: January 13, 2005 9:30 PM
