

July 8, 2012

Astral Pain

Anyone who’s followed this blog, since the early days, has read more than one instance of my complaints about crappy support for Unicode in common programming tools. I’m sad to report that, even in 2012, doing Unicode is (still) harder than it looks.

Heterotic Beast is my math-enabled Forum software. It runs on Rails 3.1.6 and Ruby 1.9.3, so you’d think that all would be good. Which was why I was surprised that this post was ill-formed.

For faster performance, Heterotic Beast caches the rendered (X)HTML of each post in the database. Sure enough, the cached XHTML was truncated just before the “𝒜”, a character which, in Unicode, lies in Plane-1 (U+1D49C). Evidently, there was a problem storing characters outside the BMP.

Now, Rails3, by default, creates MySQL database tables with the ‘utf8’ encoding. Since UTF-8 covers all 17 Unicode planes, you might think that would be sufficient. You would be wrong. MySQL’s utf8 encoding only covers the BMP. It stores at most 3 bytes per character, so it can’t handle 4-byte characters at all.
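To make the failure concrete, here is a minimal Ruby sketch. The helper names are hypothetical; they are not part of Rails or Heterotic Beast. It checks that “𝒜” lies outside the BMP and needs 4 bytes in UTF-8, and it imitates the symptom described above by cutting a string just before its first astral character. It only mimics what I observed; it says nothing about how MySQL implements its utf8 encoding internally.

```ruby
# Hypothetical helpers for illustration only.
BMP_MAX = 0xFFFF # highest code point in the Basic Multilingual Plane

# True if the single character lies in an astral plane (Plane 1 or above).
def astral?(char)
  char.ord > BMP_MAX
end

# Character index of the first astral character, or nil if there is none.
def first_astral_index(text)
  text.each_char.find_index { |c| astral?(c) }
end

# Imitates the observed symptom: everything from the first
# astral character onward is silently dropped.
def truncate_like_mysql_utf8(text)
  idx = first_astral_index(text)
  idx ? text[0, idx] : text
end

script_a = "\u{1D49C}" # MATHEMATICAL SCRIPT CAPITAL A
post     = "Let #{script_a} be an algebra"

puts script_a.bytesize                      # 4 bytes in UTF-8
puts astral?(script_a)                      # true
puts truncate_like_mysql_utf8(post).inspect # "Let "
```

A check like `first_astral_index` could be run over existing posts to find out which ones would be affected before converting anything.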

Fortunately, MySQL 5.5.3 (released in March 2010) introduced a new encoding, ‘utf8mb4’, which actually, y’know, supports Unicode.

ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

did the trick. Now the posts, in the database, didn’t get truncated at the first astral plane character. Unfortunately, instead of astral plane characters, the database entries contained garbage characters. Obviously, Rails had no idea that I had switched encodings in the database. I needed to say so, explicitly, in config/database.yml:

production:
    adapter: mysql2
    host: 127.0.0.1
    database: beast
    username: ...
    password: ...
    encoding: utf8mb4
    port: 3306
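Since a missing or misspelled encoding line is easy to overlook, a quick sanity check can help. The following is only a sketch, assuming the file layout shown above: the method name and the inline sample are hypothetical, and the username and password keys are left out of the sample on purpose. It uses nothing beyond Ruby's standard YAML library.

```ruby
require 'yaml'

# Inline stand-in for config/database.yml (credentials omitted on purpose).
SAMPLE_CONFIG = <<-EOS
production:
    adapter: mysql2
    host: 127.0.0.1
    database: beast
    encoding: utf8mb4
    port: 3306
EOS

# Hypothetical helper: returns the encoding configured for an environment,
# or nil if that environment has no encoding key.
def configured_encoding(yaml_text, env = 'production')
  settings = YAML.load(yaml_text).fetch(env)
  settings['encoding']
end

encoding = configured_encoding(SAMPLE_CONFIG)
if encoding == 'utf8mb4'
  puts "production uses utf8mb4"
else
  warn "production uses #{encoding.inspect}; astral characters may be mangled"
end
```

In a real application one would pass the contents of config/database.yml (for instance via File.read) instead of the inline sample.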

Ah, if only life were so simple. The release version of the mysql2 gem doesn’t support the utf8mb4 encoding. Fortunately (as of December, 2011), the development version does. So

gem 'mysql2',  :git => 'http://github.com/brianmario/mysql2.git'

(finally!) makes everything work as it should.
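One practical footnote on the ALTER TABLE step: it has to be run once per table. Here is a small Ruby sketch that generates the statements. The method name and the table names (posts, topics) are hypothetical placeholders, not the actual Heterotic Beast schema, and the statements still have to be run by hand against the database.

```ruby
# Hypothetical helper: builds the conversion statement for one table.
# Only plain identifiers are accepted, so nothing odd ends up in the SQL.
def convert_to_utf8mb4_sql(table_name)
  unless table_name =~ /\A[A-Za-z0-9_]+\z/
    raise ArgumentError, "unexpected table name: #{table_name.inspect}"
  end
  "ALTER TABLE `#{table_name}` CONVERT TO CHARACTER SET utf8mb4 " \
    "COLLATE utf8mb4_unicode_ci;"
end

# Placeholder table names; substitute the real ones.
%w[posts topics].each do |table|
  puts convert_to_utf8mb4_sql(table)
end
```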

Remarkably, even after a decade of such pain, Unicode is, in 2012, still “cutting edge.”

Update:

Tom Christiansen has a nice summary of the state of Unicode support in various languages (as opposed to databases, or database drivers, which was the issue here). Fittingly, his slides make heavy use of emoji characters from Unicode 6. So, if you didn’t already know that most languages’ Unicode support is a 💩, you’ll need to use Safari (or, alternatively, install the Symbola font) to view them properly.
Posted by distler at July 8, 2012 11:28 AM


19 Comments & 0 Trackbacks

Re: Astral Pain

I’m not sure what you’re quoting “Cutting Edge” from, but neither Ruby, nor Rails, nor MySQL could be considered, on any level, “Cutting Edge.”

Unless you had Cobol or Coldfusion in mind, in which case, sure, if people use it now, it’s relatively “cutting edge.”

Posted by: Christopher Brown on July 10, 2012 2:37 PM | Permalink | Reply to this

Re: Astral Pain

I’m not sure what you’re quoting “Cutting Edge” from, but neither Ruby, nor Rails, nor MySQL could be considered, on any level, “Cutting Edge.”

Did you notice the release date (3/2010) of the first version of MySQL to handle Unicode correctly?

Did you notice the release date of the first version of the Ruby MySQL bindings, which support utf8mb4? (No, of course you didn’t, because they haven’t even been released yet.)

If using unreleased, development code, because no Unicode-capable release version is available doesn’t count as cutting edge, then what does?

Posted by: Jacques Distler on July 10, 2012 2:55 PM | Permalink | PGP Sig | Reply to this

Re: Astral Pain

MySQL is not cutting edge by definition.

That is because they don’t implement most features good dbms’s do, and because most advanced features they implement are botched, such as fake utf8, crappy join performance, not-ACID compliant innodb triggers, etc.

If you like cutting edge, or indeed just software that works, please try postgresql, it has sql compliance and features on par with oracle (it’s the only dbms that does in fact), it’s much faster than mysql if you actually use any advanced feature, it has full-text search, utf8 full support, ACID compliance, and so much more -

No, mysql is not cutting edge, and it got much worse since oracle cancelled the 6.0 that could have been one of the first decent innodb versions.

Posted by: Ludovic Urbain on July 11, 2012 2:21 AM | Permalink | Reply to this

Re: Astral Pain

I guess grandparent means that not much is to be expected in terms of UTF-8 support in the technologies you mention. Which I disagree with. All too often, UTF-8 is an afterthought, implemented really awkwardly. Part Anglocentrism, part intellectual laziness, part flawed design.

Posted by: B Dirks on July 10, 2012 3:04 PM | Permalink | Reply to this

Re: Astral Pain

It was most astonishing for me, that it was actually rather nice to do Codepoints.net in PHP (I thought long of this before I started the project). Since that language has basically no idea of encodings you can simply implement a set of “mangle bytes” wrappers on top, feed them what PHP calls “string” and get finest UTF-8 up to the Supplementary Private Use B plane.

It was, in fact, a more pleasing experience than my latest projects with Python and Perl, both of which were quite quick to deal out Unicode errors.

Posted by: Manuel Strehl on July 10, 2012 3:08 PM | Permalink | Reply to this

Re: Astral Pain

I stopped taking this seriously at “I was surprised when MySQL mangled my data”, as though MySQL has ever handled *any* form of data correctly :P

Posted by: Shish on July 10, 2012 3:43 PM | Permalink | Reply to this

Re: Astral Pain

This isn’t about Unicode so much as about the brokenness of Rails, its toy ORM and the culture / community around it.

Posted by: abc on July 10, 2012 5:19 PM | Permalink | Reply to this

Troll!

At the risk of feeding an evident troll

Aside from the bit about

Now, Rails3, by default, creates MySQL database tables with the ‘utf8’ encoding.

what makes you think that this has anything to do with Rails?

Posted by: Jacques Distler on July 10, 2012 6:04 PM | Permalink | PGP Sig | Reply to this

Re: Troll!

How about the part where it “had no idea” you had switched encodings? Unlike, say, SQLAlchemy which would have checked the first time it queried the table.

Posted by: gggk on July 10, 2012 7:00 PM | Permalink | Reply to this

Re: Troll!

You re-check the encoding every time you make an SQL query? That sounds real efficient.

Having a default encoding (utf8) and a config file option, if you want to use a different encoding, seems like a pretty reasonable alternative.

That was not, by any stretch of the imagination, the source of my difficulties.

Posted by: Jacques Distler on July 10, 2012 8:10 PM | Permalink | PGP Sig | Reply to this

Re: Troll!

“the first time it queried the table.” That means once. This is the correct way to handle things.

I don’t understand why you are defending Rails here, particularly its ORM. It is simply not production quality software.

Posted by: a on July 11, 2012 12:06 PM | Permalink | Reply to this

Re: Troll!

And I don’t know why you persist in bashing the Rails ORM, as it is completely irrelevant to the topic at hand.

Those who are bashing MySQL are, at least, on-topic. You, however, can’t seem to get it together enough to say something vaguely relevant.

Sad …

Posted by: Jacques Distler on July 11, 2012 12:12 PM | Permalink | PGP Sig | Reply to this

Re: Astral Pain

My apologies; I don’t do web development, so there may be a good reason … but for data like cached HTML, which need not be sorted or compared, why not store the encoded version as binary data?

Posted by: Chris Dellin on July 10, 2012 6:24 PM | Permalink | Reply to this

Re: Astral Pain

For cached (X)HTML, I suppose not. But the astral plane character(s) could just as easily have been in the ‘source’ text of the post; and the consequences of silent data loss would have been much more annoying in that case.

Posted by: Jacques Distler on July 10, 2012 8:16 PM | Permalink | PGP Sig | Reply to this

Re: Astral Pain

Well, Unicode is what happens when a large, officious, bureaucratic committee sets out to implement world peace, nuclear disarmament, and women’s rights via a character set. We end up with monstrosities like a “title case” for Roman transliterations of Cyrillic characters, the same character in different stylized fonts in the same character set, security concerns about various look-alike characters that are in fact different, (like some of the aforementioned Russian transliterated ones,) all the various doodles that were on the committee’s mind when they designed this thing, the Lilliputian fusses about a “Byte Order Mark”, which causes all manner of trouble by its presence or absence, handling mal-formed and non-normal utf8 byte sequences, recognizing regular expressions, the inability to quickly find a position in a string, and on and on and on.

Posted by: Justin on July 10, 2012 7:58 PM | Permalink | Reply to this

Re: Astral Pain

And you didn’t even mention Han Unification!

Posted by: Jacques Distler on July 10, 2012 8:24 PM | Permalink | PGP Sig | Reply to this

Re: Astral Pain

As a Japanese who knows the pre-Unicode days, I say Unicode is, however horrible, still a big step forward. At least I don’t have to choose which of the competing Japanese encodings I should use (Shift_JIS, EUC-JP or ISO-2022-JP) anymore.

And yes, I feel sorry and responsible to the world for the introduction of emojis which lie completely outside of BMP.

And a link to the obligatory Unicode cry for help.

Posted by: Yuji Tac. on July 11, 2012 1:09 AM | Permalink | Reply to this

Re: Astral Pain

It’s really hard to take Tom Christiansen seriously if he doesn’t even bother to add .NET to his comparison.

Posted by: Mike on July 28, 2012 5:45 AM | Permalink | Reply to this

Re: Astral Pain

It seems there is a growing group of people who’ve noticed the problems with Unicode. What we need is a unified push towards UTF-32 across all languages and OSes.

There are only 3 OSes: Windows, OS X and Linux.
Only about 9 languages have a significant number of developers using them:

Python
C#
Java
Ruby
Scala
Haskell
Erlang
C
C++

Universal UTF-32 for all!

Posted by: Raahul on November 30, 2012 4:41 AM | Permalink | Reply to this
