There be Dragons
i18n is hard. Don’t let anyone tell you any different.
Some people snigger at the fact that this blog (still) uses iso-8859-1 encoding. “Use utf-8, man! It’s the solution to all your troubles.” Oh yeah? Let me tell you a story.
Zack Ajmal has a MovableType blog. He’s using the very latest version of MovableType (3.2), on an enlightened host with very recent versions of Perl (5.8.4) and MySQL (4.1.14). He sometimes posts some mathematics, so his blog is XHTML+MathML, served with the correct application/xhtml+xml MIME type. He’s implemented comment validation, and many of the other bells 'n whistles you’ve seen here at Musings that make doing this practical. But he also posts a lot in Urdu (a right-to-left language), so he has to grapple with BiDi and internationalization issues that I don’t much have to worry about.
Naturally, his blog’s encoding is utf-8.
Although using XHTML+MathML named entities is perfectly valid, Zack knows that sending named entities over the wire is dangerous. A non-validating XML parser, which is not specifically XHTML+MathML-aware, will choke on any but the “safe” 5 (&apos;, &quot;, &amp;, &lt; and &gt;). So Zack uses my NumericEntities plugin (a wrapper around MathML::Entities) to convert them into numeric character references, which any XML parser can handle.
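To make the idea concrete, here is a minimal sketch in Python of converting named entities to numeric character references while leaving the five predefined XML entities alone. It is not the NumericEntities plugin or MathML::Entities (both are Perl); the function name `to_numeric` is hypothetical, and the lookup table is Python's HTML5 entity list, which I am assuming is a reasonable stand-in for the XHTML+MathML entity set.

```python
import re
from html.entities import html5  # maps names like "alpha;" to their characters

# The five entities every XML parser understands; these are left untouched.
SAFE = {"amp", "lt", "gt", "quot", "apos"}

def to_numeric(text):
    """Replace known named entities with hexadecimal character references."""
    def repl(match):
        name = match.group(1)
        if name in SAFE:
            return match.group(0)
        chars = html5.get(name + ";")
        if chars is None:
            # Unknown name: leave it alone rather than guess.
            return match.group(0)
        # Some names expand to more than one code point, so emit one
        # reference per character.
        return "".join(f"&#x{ord(c):X};" for c in chars)
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)
```

For example, `to_numeric("&alpha; &lt; &beta;")` leaves `&lt;` as it is and rewrites the two Greek letters as `&#x3B1;` and `&#x3B2;`.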
“Hmmm…,” said Zack, “since I’m using utf-8, why don’t I use the option in Distler’s plugin to convert named entities to utf-8 characters?” So he did. And, while the named entities were correctly converted to utf-8, the rest of his blog suddenly looked like gibberish.
Naturally, he wrote me to tell me that it appeared that my plugin was broken. But the truth was that the problem lay elsewhere.
Appearances to the contrary, the rest of the code-path was not really utf-8-safe. Even when you set the PublishCharset of your blog to utf-8, MovableType doesn’t actually mark its strings with Perl’s UTF-8 flag. So they’re really strings of bytes, which are treated internally by Perl as if they were iso-8859-1.
The proximate cause of his problem arose when one of these phony “iso-8859-1” strings (the rest of his page) was concatenated with an actual utf-8 string (the output of my plugin in utf-8 mode): Perl “converted” the former string to utf-8 before concatenating them, hopelessly spooging the result.
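The effect is easy to reproduce outside Perl. The sketch below is a Python imitation of what happens, not MovableType's or Perl's actual code: the function names and the sample strings are made up for illustration. A byte string that already holds utf-8 is read byte by byte as iso-8859-1, joined to a real Unicode string, and encoded again, so every non-ASCII byte of the original page ends up encoded twice.

```python
def naive_concat(page_bytes, plugin_text):
    """Imitate the implicit upgrade: treat each byte as an iso-8859-1 character."""
    upgraded = page_bytes.decode("iso-8859-1")
    return (upgraded + plugin_text).encode("utf-8")

def correct_concat(page_bytes, plugin_text):
    """Decode the page as the utf-8 it really is before joining."""
    return (page_bytes.decode("utf-8") + plugin_text).encode("utf-8")

# Hypothetical sample data: a four-letter Urdu word already stored as
# utf-8 bytes, and one plugin-produced character (U+2211, a summation sign).
page_bytes = "\u0627\u0631\u062f\u0648".encode("utf-8")
plugin_text = "\u2211"

expected = page_bytes + plugin_text.encode("utf-8")
mangled = naive_concat(page_bytes, plugin_text)

print(len(expected), len(mangled))
print(correct_concat(page_bytes, plugin_text) == expected)
```

In the mangled output the plugin's character survives intact while the Urdu text doubles in size, which matches what Zack saw: correct entities, gibberish everywhere else.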
But the problem didn’t end with MovableType. Older versions of MySQL are equally Unicode-unaware. Even after his hosting provider upgraded to MySQL 4.1.14, which supports utf-8 text, the tables in his MT database were still encoded as “iso-8859-1”.
And, even after he patched MovableType and converted the tables in the MySQL database to use utf-8, the Perl module, DBD::mysql, for interfacing between them, remains blissfully Unicode-unaware.
You can read all about Zack’s (ongoing) travails. But my main message is: if anyone tells you, “i18n is easy, just use utf-8!” … go ahead and smack them.
Re: There be Dragons
I18n would be easy if everyone did use UTF-8. But they don’t, so it ain’t.