## November 2, 2005

### There be Dragons

i18n is hard. Don’t let anyone tell you any different.

Some people snigger at the fact that this blog (still) uses iso-8859-1 encoding. “Use utf-8, man! It’s the solution to all your troubles.” Oh yeah? Let me tell you a story.

Zack Ajmal has a MovableType blog. He’s using the very latest version of MovableType (3.2), on an enlightened host with very recent versions of Perl (5.8.4) and MySQL (4.1.14). He sometimes posts some mathematics, so his blog is XHTML+MathML, served with the correct application/xhtml+xml MIME type. He’s implemented comment validation, and many of the other bells 'n whistles you’ve seen here at Musings that make doing this practical. But he also posts a lot in Urdu (a right-to-left language), so he has to grapple with BiDi and Internationalization issues that I don’t much have to worry about.

Naturally, his blog’s encoding is utf-8.

Although using XHTML+MathML named entities is perfectly valid, Zack knows that sending named entities over the wire is dangerous. A non-validating XML parser, which is not specifically XHTML+MathML-aware, will choke on any but the “safe” 5 (&apos;, &quot;, &amp;, &lt; and &gt;). So Zack uses my NumericEntities plugin (a wrapper around MathML::Entities) to convert them into something safe.
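To make the idea concrete, here is a minimal sketch of named-to-numeric entity conversion. It is not the NumericEntities plugin (which is Perl, built on MathML::Entities); it is a Python illustration that assumes the standard library's HTML 4 entity table (`html.entities.name2codepoint`) as a stand-in for the much larger XHTML+MathML set. The function name `to_numeric_entities` is hypothetical.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser understands; leave these alone.
SAFE = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named entity reference such as &alpha; or &rarr;
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_entities(text: str) -> str:
    """Replace known named entities with hexadecimal character references.

    Assumption: the HTML 4 table stands in for the XHTML+MathML entity set.
    Unknown names are left untouched rather than guessed at.
    """
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in SAFE or name not in name2codepoint:
            return match.group(0)
        return "&#x{:X};".format(name2codepoint[name])

    return ENTITY.sub(replace, text)


print(to_numeric_entities("&alpha; &amp; &rarr;"))
```

Numeric references survive any XML parser, validating or not, and they are pure ASCII, so this mode sidesteps the encoding question entirely.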

“Hmmm…,” said Zack, “since I’m using utf-8, why don’t I use the option in Distler’s plugin to convert named entities to utf-8 characters?” So he did. And, while the named entities were correctly converted to utf-8, the rest of his blog suddenly looked like gibberish.

Naturally, he wrote me to tell me that it appeared that my plugin was broken. But the truth was that the problem lay elsewhere.

Appearances to the contrary, the rest of the code-path was not really utf-8-safe. Even when you set the PublishCharset of your blog to utf-8, MovableType doesn’t actually mark its strings with Perl’s UTF-8 flag. So they’re really strings of bytes, which are treated internally by Perl as if they were iso-8859-1.

The proximate cause of his problem arose when one of these phony “iso-8859-1” strings (the rest of his page) was concatenated with an actual utf-8 string (the output of my plugin in utf-8 mode): Perl “converted” the former string to utf-8 before concatenating them, hopelessly spooging the result.
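The failure can be imitated outside Perl. The sketch below is an illustration in Python, not MovableType's code: Python 3 refuses to concatenate bytes and text implicitly, so the Latin-1 "upgrade" that Perl performs silently is written out by hand. The function names are hypothetical.

```python
def perl_style_concat(unflagged: bytes, flagged: str) -> bytes:
    """Imitate joining an unflagged byte string to a real character string.

    The bytes are already UTF-8, but each byte is read as one iso-8859-1
    character before joining, so the page text ends up encoded twice.
    """
    upgraded = unflagged.decode("latin-1")  # the bogus "conversion"
    return (upgraded + flagged).encode("utf-8")


def correct_concat(unflagged: bytes, flagged: str) -> bytes:
    """Decode the bytes as what they really are (UTF-8) before joining."""
    return (unflagged.decode("utf-8") + flagged).encode("utf-8")


page = "é".encode("utf-8")  # two bytes: C3 A9, stored without any flag
plugin_output = "\u03b1"     # a genuine character string (Greek alpha)

print(perl_style_concat(page, plugin_output))  # page part is doubly encoded
print(correct_concat(page, plugin_output))     # page part is intact
```

The remedy is the same in either language: decode at the boundary, so that every string entering the concatenation is a character string.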

But the problem didn’t end with MovableType. Older versions of MySQL are equally Unicode-unaware. Even after his hosting provider upgraded to MySQL 4.1.14, which supports utf-8 text, the tables in his MT database were still encoded as “iso-8859-1”.

And, even after he patched MovableType and converted the tables in the MySQL database to use utf-8, the Perl module, DBD::mysql, for interfacing between them, remains blissfully Unicode-unaware.

You can read all about Zack’s (ongoing) travails. But my main message is: if anyone tells you, “i18n is easy, just use utf-8!” … go ahead and smack them.

Posted by distler at November 2, 2005 1:48 AM

TrackBack URL for this Entry:   http://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/671

### Re: There be Dragons

I18n would be easy if everyone did use UTF-8. But they don’t, so it ain’t.

Posted by: Aristotle Pagaltzis on November 2, 2005 7:01 AM

### Re: There be Dragons

Aristotle, I could never say it better.

Unicode is good when you first ACKNOWLEDGE that it exists and must work, and then support it properly.

It is reasonable to expect that every component of every computer system right now absolutely ignores or messes up Unicode simply because nobody cares.

Posted by: Julik on November 2, 2005 7:20 AM
Read the post Today's interesting little things...
Weblog: It's equal but it's different
Excerpt: [♬ To the sound of Ultraje a Rigor's Acústico… ♬] Well then: today, after fighting with gnuplot long enough to get my graphs done my way, there was even a little time left over to post some cool links (some...
Tracked: November 2, 2005 8:03 AM
Read the post Dragons be gone
Weblog: Sam Ruby
Excerpt: Luckily, I’m outside of arms reach. You see, my weblog is 100% valid XHTML 1.1, encoded as utf-8. Truth be told, however, it also would be considered as 100% valid XHTML 1.1, encoded as iso-8859-1 (roman), iso-8859-5 (cyrillic), win-1252 (Micro
Tracked: November 2, 2005 1:43 PM

### Re: There be Dragons

And Sam’s doubly-lucky to be out of reach, since he’s currently not-well-formeding (deforming?) your comment feed with a named character entity reference for his right angled single quote.

Posted by: Phil Ringnalda on November 2, 2005 5:34 PM

### Re: There be Dragons

Where’s my stick?

While “valid RSS” is almost an oxymoron, I should have applied my filter to trackback pings in my comment feed as well.

Fixed.

Posted by: Jacques Distler on November 2, 2005 7:54 PM

### Re: There be Dragons

> But my main message is: if anyone tells you, “i18n is easy, just use utf-8!” … go ahead and smack them.

I tend to tell people that i18n is easy as long as you pass around opaque strings, annotate those strings with language metadata, leave the rendering code for someone else, use UTF-8 in NFC on the wire and the native Unicode strings of your programming language in memory. :-)

Worked for me with Java, ICU4J, Hibernate and Postgres with UTF-16 in memory and UTF-8 on the wire.

The conclusion I draw from your post is not that UTF-8 was the problem but that the superfluous ISO-8859-1 to UTF-8 conversion was the problem. Another problem I see is MySQL.

Posted by: Henri Sivonen on November 5, 2005 3:18 AM

### Re: There be Dragons

> The conclusion I draw from your post is not that UTF-8 was the problem but that the superfluous ISO-8859-1 to UTF-8 conversion was the problem.

Umh, I don’t think I agree, though probably the fault is in my own lack of clarity. Perhaps Zack’s narrative is clearer.

Though Perl has fine support for native Unicode strings, MovableType doesn’t use it. UTF-8 is handled internally as an opaque string of bytes. This has all sorts of obvious drawbacks. But its weakness is immediately exposed when you try to do some UTF-8-aware string manipulations (Zack’s Urdu date localization, my named-entity conversion, …).

> use UTF-8 in NFC on the wire and the native Unicode strings of your programming language in memory.

We agree 100%. Doing the first, but not the second, is just plain trouble.

Posted by: Jacques Distler on November 5, 2005 10:40 AM

### Re: There be Dragons

Well, I agree as well, but until now I have only worked with Ruby and PHP, where there is no distinction between a UTF-aware string and a normal string. Personally, I think it’s wrong to distinguish the two instead of distinguishing the methods that work with bytes from the methods that work with chars. But that’s a different story altogether.

Posted by: Julik on November 8, 2005 12:34 PM