Confession
For the past 10 months, I’ve been serving up ill-formed XML, and nobody’s complained. Probably, no one even noticed.
How did I manage to keep up the charade for so long? By encoding the goop and passing it along as a string.
My Atom feed contained a
<content type="application/xhtml+xml" mode="escaped">
element for each entry. The content of that element would have been perfectly at home inside an XHTML+MathML document, whose DTD defines over 2000 named entities. But inside an Atom document, where those named entities are not defined, the result would have been ill-formed crap. Only the five “safe” ones (&amp;, &lt;, &gt;, &apos; and &quot;) are allowed. So I engaged in the last refuge of a scoundrel: mode="escaped". The result was a nominally well-formed Atom feed, but that was only 'cuz I hid all the bad stuff.
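To see how the charade works, here is a minimal Python sketch. It is not taken from the feed templates: the sample body and the helper name escaped_content are made up for illustration. Escaping the markup turns every & into &amp;, so an XML parser sees only a string and never has to resolve &nbsp; or &le;.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical entry body using named entities that Atom does not define.
body = "<p>x&nbsp;&le;&nbsp;y</p>"

def escaped_content(xhtml):
    """Wrap an XHTML fragment as an escaped string inside a content element."""
    return ('<content type="application/xhtml+xml" mode="escaped">%s</content>'
            % escape(xhtml))

element = ET.fromstring(escaped_content(body))

# The parser is happy, but all it got was text: the goop is still in there.
print(element.text)
```

The feed parses, yet the undefined entities have merely been smuggled through as character data.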
To do things right, I’d need to make sure that all the named entities (both the common ones, like &nbsp; and &copy;, and the esoteric ones, like the one for ∮) were converted to numeric character references (or UTF-8 characters, if that’s the character-encoding of the feed). There are tools to do that for the entities defined in the XHTML DTD, but nobody’d done it for the much larger number of named entities in the XHTML+MathML DTD. So I wrote an MT plugin to do the conversion.
The plugin adds a global filter
<MTEntryBody numeric_entities="1">
Non-MT users may be interested in the standalone Perl module version:
use MathML::Entities;
$conv2refs = name2numbered($string); # numeric character refs
$conv2utf8 = name2utf8($string); # utf-8 characters
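For readers without Perl, the same idea can be sketched in Python. This is not the module's implementation: it assumes Python's built-in HTML5 entity table (html.entities.html5) as a stand-in for the XHTML+MathML tables, and the function names name_to_numbered and name_to_utf8 are hypothetical. The HTML5 table differs from the MathML DTD in places, so treat the output as illustrative.

```python
import re
from html.entities import html5

# The five entities predefined by XML itself; these must stay as they are.
SAFE = {"amp", "lt", "gt", "apos", "quot"}

# Matches named references such as &nbsp; but not numeric ones such as &#xA0;
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def _convert(text, render):
    """Replace each known named entity with render(characters)."""
    def replace(match):
        name = match.group(1)
        chars = html5.get(name + ";")
        if name in SAFE or chars is None:
            return match.group(0)  # leave safe or unknown names untouched
        return render(chars)
    return NAMED_ENTITY.sub(replace, text)

def name_to_numbered(text):
    """Named entities -> hexadecimal numeric character references."""
    return _convert(text, lambda chars: "".join("&#x%X;" % ord(c) for c in chars))

def name_to_utf8(text):
    """Named entities -> the characters themselves."""
    return _convert(text, lambda chars: chars)

print(name_to_numbered("x&nbsp;&le;&nbsp;y &amp; &copy;"))
print(name_to_utf8("x&nbsp;&le;&nbsp;y &amp; &copy;"))
```

Unknown names are passed through unchanged rather than dropped, so a typo in an entity name still surfaces as a parse error downstream instead of silently vanishing.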
With the filter in place, I could change that fateful line in my Atom feed to
<content type="application/xhtml+xml" mode="xml">
A MathML-aware Feed Reader would, of course, make this effort worthwhile.
The peril of named entities doesn’t just apply to Atom feeds. It applies, as well, to web pages served as application/xhtml+xml
. If I were to serve these XHTML+MathML pages to non-MathML-aware XHTML User Agents, like Camino or recent versions of Opera or Safari, the result would be a fatal parsing error. The same thing would happen if you used a custom DTD on an ordinary XHTML page.
Why? Because, when the browser encounters an unfamiliar DOCTYPE Declaration, only the five “safe” named entities are allowed. A single &nbsp; in your document and … poof! … a yellow screen of death.
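The same failure is easy to reproduce outside a browser. A minimal Python sketch, assuming the standard library's expat-based parser as a stand-in for a browser's XML parser (the helper name is_well_formed is made up here):

```python
import xml.etree.ElementTree as ET

def is_well_formed(document):
    """Return True if the string parses as XML, False on a parse error."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True

# No DTD is supplied, so only the five predefined entities are known.
with_named = "<p>x&nbsp;y</p>"
with_numeric = "<p>x&#xA0;y</p>"

print(is_well_formed(with_named), is_well_formed(with_numeric))
```

The numeric character reference needs no declaration, which is why converting to it sidesteps the problem.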
Personally, I avoid the issue by only serving these pages as application/xhtml+xml
to MathML-aware browsers. Other browsers would not know what to do with the MathML anyway, so I just send them text/html
. Most people prefer to decide what MIME type to send, based on the browser’s Accept
header.
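For comparison, the Accept-based approach might look roughly like the Python sketch below. It is not what this site does, the function name pick_mime_type is hypothetical, and it deliberately ignores wildcards such as */* so that only an explicit claim of XHTML support counts.

```python
def pick_mime_type(accept_header):
    """Choose a MIME type for an XHTML page from an HTTP Accept header."""
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        media_type = fields[0].strip().lower()
        quality = 1.0  # default when no q parameter is given
        for field in fields[1:]:
            key, _, value = field.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q value as a refusal
        if media_type == "application/xhtml+xml" and quality > 0:
            return "application/xhtml+xml"
    return "text/html"

print(pick_mime_type("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
print(pick_mime_type("text/html, */*"))
```

Note that this tells you only whether the browser claims to parse XHTML, not whether it can render MathML, which is why it is not enough for pages like these.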
If you’re just using XHTML, then existing tools will handle the necessary conversions for you. But if you are dabbling in MathML, you need something a little stronger.
Aside: In eliminating the duplicates between the entities defined in the MathML and XHTML DTDs, I discovered a mistake in the MathML DTD. &Upsilon; is mapped to &#x03D2; (ϒ, which is &upsih;) instead of the correct &#x03A5;
(Υ, as in the XHTML DTD). And don’t get me started on the fact that φ
= ϕ
is the curly one (φ), whereas ϕ
= ϕ
is the straight one (ϕ). I ended up issuing a revision to itex2MML, so that — whatever the Unicode garbage — the TeX codes would be converted to the expected glyphs (except in Safari, which has it backwards, which is to say straight … umh…).
Update:
Unfortunately, NetNewsWire 2.0beta does not really know what to do with Atom feeds whose <content type="application/xhtml+xml" mode="xml">. It insists on decoding everything (as if they had been sent with mode="escaped"). Which made this post utterly illegible. Oh, well … that was fun while it lasted. (This is now fixed in the latest betas.)
Update (1/10/2005):
Perl Module updated with corrections and additions from MathML 2.0, 2nd Edition.
Some people may wonder why I don’t provide an inverse mapping from numeric character references to MathML named entities. The reason is that the original mapping is neither 1-1 nor onto. There are multiple MathML named entities that map to the same Unicode character. And some MathML named entities map to a sequence of two or more Unicode characters.
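Both obstacles can be seen in a few lines of Python. This sketch assumes the standard library's HTML5 entity table (html.entities.html5) as a stand-in for the MathML tables, so the exact names it reports may differ from those in the Perl module; the helper names are made up for illustration.

```python
from collections import defaultdict
from html.entities import html5

def build_inverse():
    """Map each character sequence to every entity name that produces it."""
    inverse = defaultdict(list)
    for name, chars in html5.items():
        if name.endswith(";"):  # skip legacy keys listed without a semicolon
            inverse[chars].append(name[:-1])
    return inverse

def multi_character_names():
    """Entity names that expand to more than one Unicode character."""
    return sorted(name[:-1] for name, chars in html5.items()
                  if name.endswith(";") and len(chars) > 1)

inverse = build_inverse()

# Several names share one character, so there is no unique name to map back to.
print(sorted(inverse["\u222e"]))

# Some names expand to a sequence of characters, not a single code point.
print(multi_character_names()[:5])
```

Any inverse would have to pick one name arbitrarily in the first case and match whole character sequences in the second.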
Update (1/11/2005):
Standalone Perl Module released. As with any Perl module,
perl Makefile.PL
make
make test
make install
should be all that’s needed.
Re: Confession
I think we discussed this a while ago, not? For example in June 2004 :-)
Nice that you are not relying on a DTD anymore though.