July 1, 2004

Trackbacks and MTStripControlChars

A little interlude in the physics reportage. There’s been some more controversy on the subject of Trackbacks.

A bit of background. The Trackback protocol does not discuss the issue of character encodings. Since it proceeds via an HTTP POST, in the absence of any charset declaration, it ought to be assumed that the charset is ISO-8859-1. But, in point of fact, it could be anything.

The obvious long-term solution is for the Trackback Specification to demand that a charset be declared (explicitly or implicitly) and for implementations (like MovableType) to handle the requisite transcoding to/from your blog’s native charset.
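To make that concrete, here is a minimal Python sketch of what "handle the requisite transcoding" could look like. It is not MovableType's code; the function names `declared_charset` and `transcode` are hypothetical, and it assumes the POST body has already been percent-decoded into raw bytes. It falls back to ISO-8859-1 whenever no usable charset is declared.

```python
from email.message import Message


def declared_charset(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` when the header carries no charset.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(failobj=default)


def transcode(raw, content_type, native="utf-8"):
    """Decode `raw` using the declared charset and re-encode it in the
    blog's native charset (hypothetical helper, for illustration only)."""
    charset = declared_charset(content_type)
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name, or bytes that do not match the declaration:
        # fall back to ISO-8859-1, which can decode any byte sequence.
        text = raw.decode("iso-8859-1")
    # Characters the native charset cannot represent become numeric
    # character references instead of raising an error.
    return text.encode(native, errors="xmlcharrefreplace")
```

For example, `transcode(b"caf\xe9", "application/x-www-form-urlencoded")` treats the undeclared body as ISO-8859-1 and returns the UTF-8 bytes for "café". Note that transcoding alone does nothing about control characters; those pass straight through.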

But we ain’t there yet¹. Right now, you just have to guess at the trackback’s charset, and try to deal intelligently with the result.

Over a year ago, I wrote a plugin to ensure that data (like a trackback) which is purportedly ISO-8859-1 is really valid. Sam Ruby points out that I did an incomplete job of it. There were still some invalid characters that I accepted. That is, as they say, … unacceptable.

So I’ve revised MTStripControlChars to be really bulletproof.
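For readers who want a feel for the kind of filtering involved, here is a rough Python sketch. It is not the plugin's source, and the post does not spell out exactly which characters the plugin removes; this version assumes the offenders are the C0 control characters other than tab, line feed and carriage return, plus DEL and the C1 range 0x80-0x9F. The name `strip_control_chars` is made up for the example.

```python
import re

# Assumed set of unwanted bytes in purportedly ISO-8859-1 data:
#   0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F  (C0 controls except TAB, LF, CR)
#   0x7F                              (DEL)
#   0x80-0x9F                         (C1 controls)
_INVALID = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")


def strip_control_chars(data: bytes) -> bytes:
    """Remove control characters from ISO-8859-1 encoded bytes."""
    return _INVALID.sub(b"", data)
```

Bytes in the 0x80-0x9F range often show up when Windows-1252 text (curly quotes and the like) is mislabelled as ISO-8859-1. This sketch simply drops them; mapping them to the intended characters would be a different design choice.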


¹ After waiting around for six months, I finally implemented my own solution. This doesn’t obviate the need for MTStripControlChars, but it does mean that I don’t have to bone-headedly pretend that all trackbacks are ISO-8859-1.

Posted by distler at July 1, 2004 2:57 AM

TrackBack URL for this Entry:   http://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/391

2 Comments & 0 Trackbacks

Re: Trackbacks and MTStripControlChars

And even now, you can’t be sure, since an XML parser doesn’t have to support ISO-8859-1. However, it is required to support UTF-8 and UTF-16, so maybe one of those is a better choice.

Posted by: Anne on July 1, 2004 2:41 AM

Re: Trackbacks and MTStripControlChars

XML parsers are not required to support ISO-8859-1. However, all the ones you will ever encounter do. ISO-8859-1 is the default charset for HTTP. It would be stupid not to support it.

Lack of support for ISO-8859-1 is not an issue. Failure to filter out invalid characters (in whatever your chosen charset) is.

You are living in a fool’s paradise if you think that switching to UTF-8 absolves you of responsibility for filtering out invalid characters.
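A small Python sketch of the point, under the same assumptions about which control characters count as invalid (the helper name `clean_utf8` is hypothetical): purportedly UTF-8 input can contain malformed byte sequences, and even well-formed UTF-8 can carry control characters, so both still have to be dealt with.

```python
def clean_utf8(raw: bytes) -> str:
    """Decode purportedly UTF-8 bytes and drop control characters."""
    # Malformed byte sequences become U+FFFD instead of raising an error.
    text = raw.decode("utf-8", errors="replace")

    def is_allowed(ch: str) -> bool:
        if ch in "\t\n\r":
            return True
        code = ord(ch)
        # Reject C0 controls, DEL, and C1 controls (assumed invalid set).
        return not (code < 0x20 or 0x7F <= code <= 0x9F)

    return "".join(ch for ch in text if is_allowed(ch))
```

Here `clean_utf8(b"ok\x00\xc2\x85\xff!")` drops the NUL and the perfectly well-formed U+0085, and turns the stray 0xFF byte into a replacement character.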

In the case of Trackbacks, it is really correct (in the absence of the as-yet-totally-unsupported use of explicit charset declarations) to assume that they are ISO-8859-1.

That is totally independent of the choice of native charset for your blog.

Posted by: Jacques Distler on July 1, 2004 3:00 AM