## February 17, 2005

The last straw was when I received a Korean trackback, encoded in euc-kr.

The Trackback Specification makes no mention of character encodings, and MovableType’s original implementation was blissfully ignorant of any such notion. The sender of a Trackback ping sent a string of bytes (which represented a string of characters in the charset of his blog) and the recipient dutifully published that string of bytes on his blog. If the recipient’s charset happened not to be the same as that of the sender, well, then, the result was gibberish.

The most recent versions of MovableType convey the sender’s charset in the HTTP headers of the Trackback. But the recipient doesn’t actually do anything with the information.

As a result, I had a slowly increasing number of gibberish Trackbacks on my blog, with no end in sight.

If you want something done right …

The first order of business is to realize that — somewhere along the line — we need to transcode the Trackback from the sender’s character encoding to the recipient’s. We can do this either before saving the Trackback to the database or after, when we go to build the actual blog pages.

It sounds tempting to do it once and for all and get it over with. But …

Perhaps the sender didn’t specify an encoding. Or perhaps he did, but specified it incorrectly (that Korean blog is supposedly utf-8, but the Trackback was euc-kr). Once we’ve transcoded and stored the result in the database, it’s pretty hard to recover. Better to store the original in the database, along with any charset information we may have received, and do the transcoding later. If we need to, we can add/correct the charset information and rebuild.

So the first order of business is to add a new column to the mt_tbping table:

    --- lib/MT/TBPing.pm.orig       Tue Feb 15 15:25:30 2005
    +++ lib/MT/TBPing.pm    Tue Feb 15 18:13:11 2005
    @@ -11,7 +11,7 @@
     __PACKAGE__->install_properties({
         columns => [
             'id', 'blog_id', 'tb_id', 'title', 'excerpt', 'source_url', 'ip',
    -        'blog_name',
    +        'blog_name', 'tb_charset',
         ],
         indexes => {
             created_on => 1,
    --- schemas/mysql.dump.orig     Wed Aug 18 19:39:33 2004
    +++ schemas/mysql.dump  Tue Feb 15 18:15:25 2005
    @@ -262,6 +266,7 @@
         tbping_source_url varchar(255),
         tbping_ip varchar(15) not null,
         tbping_blog_name varchar(255),
    +    tbping_tb_charset varchar(255),
         tbping_created_on datetime not null,
         tbping_modified_on timestamp not null,
         tbping_created_by integer,

The next order of business is to capture the character encoding specified in the HTTP headers and store it in the database. While we’re at it, I’m not sure why MovableType decides to truncate utf-8 strings at a fixed number of bytes, rather than a fixed number of characters. That seems like a recipe for disaster, so I commented it out.

    --- lib/MT/App/Trackback.pm.orig        Mon Jan 24 18:40:31 2005
    +++ lib/MT/App/Trackback.pm     Wed Feb 16 03:50:07 2005
    @@ -219,7 +219,7 @@
         my($title, $excerpt, $url, $blog_name) = map scalar $q->param($_),
                                                 qw( title excerpt url blog_name);

    -    no_utf8($tb_id, $title, $excerpt, $url, $blog_name);
    +#    no_utf8($tb_id, $title, $excerpt, $url, $blog_name);

         return $app->_response(Error=>$app->translate("Need a Source URL (url)."))
             unless $url;
    @@ -247,6 +247,9 @@
         $ping->tb_id($tb_id);
         $ping->source_url($url);
         $ping->ip($app->remote_ip || '');
    +    if ($ENV{'CONTENT_TYPE'} =~ /[Cc]harset=([a-zA-Z0-9-]+)/) {
    +       $ping->tb_charset($1);
    +    }
         if ($excerpt) {
             if (length($excerpt) > 255) {
                 $excerpt = substr($excerpt, 0, 252) . '...';
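
The byte-versus-character point can be illustrated with a small sketch. It assumes the excerpt arrives as a UTF-8 byte string; `truncate_utf8_chars` is a hypothetical helper, not anything in MovableType, and it uses only the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of MT): truncate a UTF-8 byte string to at
# most $max characters, rather than bytes, appending '...' if anything was cut.
sub truncate_utf8_chars {
    my ($bytes, $max) = @_;
    my $chars = decode('UTF-8', $bytes);          # bytes -> characters
    if (length($chars) > $max) {
        $chars = substr($chars, 0, $max - 3) . '...';
    }
    return encode('UTF-8', $chars);               # characters -> bytes
}

# One ASCII letter followed by 200 Cyrillic letters: 201 characters, 401 bytes.
my $excerpt = encode('UTF-8', 'A' . ("\x{044F}" x 200));

# Byte-wise truncation, as in the context lines of the patch: the cut at
# byte 252 lands in the middle of a two-byte character.
my $mt_style = substr($excerpt, 0, 252) . '...';
my $valid = eval {
    decode('UTF-8', $mt_style, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print "byte-truncated excerpt is ", ($valid ? "valid" : "not valid"), " UTF-8\n";

# Character-wise truncation leaves this 201-character excerpt untouched.
my $safe = truncate_utf8_chars($excerpt, 255);
print "character-truncated excerpt unchanged: ", ($safe eq $excerpt ? "yes" : "no"), "\n";
```

Counting characters only makes sense once the charset is known, so a helper like this would belong at page-build time, after transcoding, rather than at the point where the ping is received.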

Now, at this point, I could have made an enhancement. It’s very unlikely that a random string of bytes is valid utf-8. So, even if the Trackback Headers do not specify an encoding, it’s possible to test whether the Trackback could be utf-8 and set the charset accordingly. E.g.:

    } elsif ( _is_utf8($title . $blog_name . $excerpt) ) {
       $ping->tb_charset('utf-8');
    }

    ...

    sub _is_utf8 {
      $_ = shift;
      m/^(
         [\x09\x0A\x0D\x20-\x7E]            # ASCII
       | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
       |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
       | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
       |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
       |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
       | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
       |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
      )*$/x;
    }

For the time being, I’ve decided to forgo automated charset-guessing. If I get a lot of utf-8-encoded Trackbacks without a charset declaration, I’ll reconsider that.
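
For comparison, here is a sketch of the same kind of test done with the core Encode module instead of a hand-written regex. `looks_like_utf8` is a hypothetical name. It relies on the strict `UTF-8` decoder dying on malformed input when `FB_CROAK` is set. It is not an exact equivalent of the regex: for example, it accepts control characters that the regex rejects.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical alternative to a hand-written regex: let Encode's strict
# 'UTF-8' decoder decide whether a byte string is well-formed.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return 0 unless defined $bytes;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC stops
    # decode() from modifying $bytes in place.
    my $ok = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

for my $sample ("caf\xC3\xA9", "caf\xE9", "plain ASCII") {
    printf "%-12s %s\n",
        join('', map { sprintf '%02X', ord } split //, substr($sample, -2)),
        looks_like_utf8($sample) ? "could be utf-8" : "not utf-8";
}
```

Plain ASCII passes either test, which is harmless: ASCII text reads the same under utf-8, iso-8859-1 and most other charsets a Trackback is likely to declare.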

Finally, we need to ensure that things get transcoded when the pages are built. The magic happens in _transcode_text(). We use Text::Iconv to convert from the original encoding to utf-8 and then we use Encode (if necessary) to convert from utf-8 to the blog’s native encoding.

    --- lib/MT/Template/Context.pm.orig     Tue Feb 15 21:12:33 2005
    +++ lib/MT/Template/Context.pm  Wed Feb 16 01:43:52 2005
    @@ -26,6 +26,9 @@
     @EXPORT = qw( FALSE );

     use vars qw( %Global_handlers %Global_filters );
    +
    +my $publish_charset = _hdlr_publish_charset();
    +
     sub add_tag {
         my $class = shift;
         my($name, $code) = @_;
    @@ -2426,7 +2432,8 @@
         sanitize_on($_[1]);
         my $ping = $_[0]->stash('ping')
             or return $_[0]->_no_ping_error('MTPingTitle');
    -    defined $ping->title ? $ping->title : '';
    +    my $title = defined $ping->title ? $ping->title : '';
    +    return _transcode_text($ping->tb_charset, $title);
     }
     sub _hdlr_ping_url {
         sanitize_on($_[1]);
    @@ -2438,7 +2445,8 @@
         sanitize_on($_[1]);
         my $ping = $_[0]->stash('ping')
             or return $_[0]->_no_ping_error('MTPingExcerpt');
    -    defined $ping->excerpt ? $ping->excerpt : '';
    +    my $excerpt = defined $ping->excerpt ? $ping->excerpt : '';
    +    return _transcode_text($ping->tb_charset, $excerpt);
     }
     sub _hdlr_ping_ip {
         my $ping = $_[0]->stash('ping')
    @@ -2449,7 +2457,20 @@
         sanitize_on($_[1]);
         my $ping = $_[0]->stash('ping')
             or return $_[0]->_no_ping_error('MTPingBlogName');
    -    defined $ping->blog_name ? $ping->blog_name : '';
    +    my $blog_name = defined $ping->blog_name ? $ping->blog_name : '';
    +    return _transcode_text($ping->tb_charset, $blog_name);
    +}
    +
    +sub _transcode_text {
    +    my ($text_charset, $text) = @_;
    +    require Text::Iconv;
    +    use Encode;
    +    if (defined $text_charset && $text_charset ne $publish_charset ) {
    +        $text = Text::Iconv->new($text_charset, 'utf-8')->convert($text) unless $text_charset eq 'utf-8';
    +        $text = encode($publish_charset, decode('utf-8', $text), Encode::FB_XMLCREF) unless $publish_charset eq 'utf-8';
    +    }
    +    $text =~ s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w+);)/&amp;/g;
    +    return $text;
     }

     package MTPlugins::SubCategories;
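
As a standalone illustration of the transcoding step, here is a minimal sketch that uses only the core Encode module in place of the Text::Iconv-plus-Encode combination in the patch. `transcode_bytes` is a hypothetical name, the publish charset is passed in as an argument rather than read from MT, and the ampersand escaping done by `_transcode_text()` is left out. Iconv may well recognise charset names that Encode does not, so this is a simplification rather than a drop-in replacement.

```perl
use strict;
use warnings;
use Encode qw(decode encode find_encoding);

# Hypothetical stand-in for _transcode_text(): convert stored bytes from the
# charset recorded with the ping to the charset the blog publishes in.
sub transcode_bytes {
    my ($from_charset, $bytes, $publish_charset) = @_;

    # No charset recorded, or one Encode does not recognise: leave the bytes alone.
    return $bytes unless defined $from_charset && find_encoding($from_charset);

    my $chars = decode($from_charset, $bytes);    # malformed input becomes U+FFFD
    # Characters the target charset cannot represent become numeric
    # character references, as with FB_XMLCREF in the patch.
    return encode($publish_charset, $chars, Encode::FB_XMLCREF());
}

my $cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2";          # a six-letter Russian word in cp1251
my $utf8   = transcode_bytes('cp1251', $cp1251, 'utf-8');
my $latin1 = transcode_bytes('cp1251', $cp1251, 'iso-8859-1');

print length($utf8), " bytes as utf-8\n";         # each Cyrillic letter takes two bytes
print "$latin1\n";                                # character references only
```

Because the stored bytes are never modified, a wrong `tb_charset` value can be corrected in the database and the page rebuilt, which is the point of transcoding on output.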

I then went back into the database and defined charsets for all the “bad” Trackbacks, rebuilt a few pages and …

So, have at it! Trackback this entry and we’ll see what breaks.

#### Update:

Fixed an inadvertently-dropped bit of patch code for lib/MT/Template/Context.pm above.

#### Update (2/18/2005):

For those benighted souls who think that “Use utf-8.” is the solution to all i18n issues, here’s some data to think about. 93% of the Trackbacks here are plain ASCII. It really doesn’t matter whether (or what) encoding you declare for them. The remaining ones are about equally divided between iso-8859-1, like this Icelandic Trackback, and utf-8, like this Japanese one. And, even after you’ve gotten the encoding right, there are still serious bidi issues to be resolved, as in this Urdu Trackback.

#### Note:

I should have said that Sam Ruby has been doing pretty much the same thing for half a year now. He transcodes incoming Trackbacks on receipt (and he “auto-detects” utf-8). As you might expect, this occasionally fails.

Posted by distler at February 17, 2005 1:04 PM

TrackBack URL for this Entry:   http://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/514

This is quite interesting, in part because I am now making a plug-in which will automatically decode trackbacks for Russian blogs (where you can generally have only 2 encodings - cp1251 and utf8).

As MT has a hook for that, it looks like you don’t have to modify the source. However, the expression used for matching UTF-8 does not always work for me - I send in sample strings in UTF-8 and they don’t get flagged. The data contains lots of non-Latin text, so it should not look like ASCII. I am still scratching my head over this. Do you have any ideas on how to solve it?

Posted by: Julik on February 25, 2005 11:55 PM | Permalink | Reply to this

### Safety First

If you are willing to do the transcoding at the time of receipt (and store the transcoded result in the database), then you ought to be able to do it with a plugin.

A couple of things held me up.

1. Newer trackback implementations send the charset information in the HTTP headers, but I was not able to see how a plugin could access that information without modifying the MT source code.
2. Older trackback implementations send no charset information at all. So, at some point, you’re going to need to guess(/determine) their encodings.

You can try to guess the encoding, but, as that Korean trackback showed me, I don’t think you can do it reliably. The code to detect utf-8 does work (I tested it with the trackbacks on my blog), but there really isn’t a way to detect other encodings reliably (if at all).

Perhaps my approach is overly conservative but, after weighing the alternatives, I decided that keeping the original data on hand, so that I could — as a last resort — set the encoding manually, was the safest approach.

Posted by: Jacques Distler on February 26, 2005 12:25 AM | Permalink | PGP Sig | Reply to this

### Tests

> I send in sample strings in UTF-8 and they don’t get flagged.

I should point out that MT’s procedure of truncating a utf-8 string at 252 bytes is unlikely to yield a string which is valid utf-8.

Perhaps that’s why you are having problems.

Posted by: Jacques Distler on February 26, 2005 1:35 AM | Permalink | PGP Sig | Reply to this

### Not only, the whole thing is flawed

Well, in principle Russian letters do not overlap between UTF-8 and CP-1251, except for ONE letter which is used very seldom (following the spelling rules, only in names of people and companies), so the detection should ‘just work’ in this case.

What I wanted to achieve is a ‘drop-in’ implementation (so that you can just install the plugin and never think about it again). It would also be nice to have a button in the ‘Plugin actions’ panel like “MT-TrackDecode: decode selected pings” or such (which would let you preview the decoding, choose a charset and finally convert the pings to your PublishCharset). But first things first.

Unfortunately this is my first experience with Perl, and the whole ‘characters-bytes’ thing in recent versions drives me nuts. In PHP it is much simpler…. For example, I am curious whether you can efficiently ‘unpack’ what MT gives you back into strings from bytes, instead of hacking on MT’s module.

What is really needed (IMO) is:

• A clarification in the TrackBack spec stating that all trackbacks should be sent in UTF-8, with a charset header
• MT should probably use Encode::from_to when sending a ping, to convert the outgoing ping into UTF-8 from the publish charset of the blog
• Maybe there should be a way to specify a ‘fallback’ character set for incoming pings whose strings are not matched as UTF-8 (for instance, it would be ISO for you, CP-1251 for most Russian blogs, a Japanese charset for Japanese users etc. - but I digress)
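
A minimal sketch of what the second suggestion could look like, assuming the ping fields are held in a plain hash of byte strings. `ping_fields_to_utf8` and the hash layout are hypothetical, not MT’s actual sending code. `Encode::from_to` converts a byte string in place.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper: convert the fields of an outgoing ping, in place, from
# the blog's publish charset to utf-8, and return a matching Content-Type.
sub ping_fields_to_utf8 {
    my ($publish_charset, $fields) = @_;
    for my $key (keys %$fields) {
        next unless defined $fields->{$key};
        # from_to() rewrites the byte string in place.
        from_to($fields->{$key}, $publish_charset, 'utf-8');
    }
    return 'application/x-www-form-urlencoded; charset=utf-8';
}

my %ping = (
    title     => "\xCF\xF0\xE8\xE2\xE5\xF2",      # a six-letter Russian word in cp1251
    blog_name => 'julik',
);
my $content_type = ping_fields_to_utf8('cp1251', \%ping);
print "$content_type\n";
print length($ping{title}), " bytes in title after conversion\n";
```

ASCII-only fields such as `blog_name` here come out unchanged, which matches the observation that most Trackbacks are unaffected by the choice of charset.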

Someone has to try to raise these issues in mt-dev. I don’t know if anybody will listen to me there, though. What is most annoying is that (being put off by these issues) Russian bloggers just disable trackbacks after installing MovableType, and they never get used. And that’s what I would like to change.

Even more: a community of Russian bloggers is currently writing an extension of the standard TrackBack, which explicitly specifies that a ping sent in invalid UTF-8 should be rejected right away. Maybe this is also a good solution; however, I want to accept and send pings to people who use the same language as I do and have blogs in this language. Besides, isn’t it nice if I send you a ping and Russian-speaking visitors of your blog can read it along with your entry?

P.S. This is what happens when you implement web-services without agreeing on character sets first (i.e. without Unicode)

Posted by: Julik on February 26, 2005 10:43 AM | Permalink | Reply to this

### Spooged

Yes, the Trackback specification is spooged because it makes no mention of charset encodings.

I agree that declaring an encoding should be REQUIRED. I agree that there should be a default charset, if none is declared. But you cannot make that default charset blog-dependent. That way lies madness. The default charset should be iso-8859-1 (the default charset of HTTP) or utf-8 (the default charset of XML).

I disagree that you should REQUIRE that one particular charset (even a “nice” one like utf-8) be used. It doesn’t eliminate the need for transcoding. In fact, it can only increase the number of transcoding operations required (e.g., two CP-1251 blogs exchanging a trackback would require two transcodings, instead of zero), for little actual benefit.

I’ve filed a bug report with SixApart, pointing to this implementation. We’ll see what happens.

While I sympathise with the desire to create a plugin which would fix this problem of trackbacks, I don’t think you can implement a good solution using a plugin. Rather than implement a bad (or, at least, inadequate) solution, I think we should hack the MT source code, and hope that 6A eventually adopts our solution.

As to PHP versus Perl, Perl 5.8’s Unicode implementation is infinitely better than PHP4’s. (I haven’t explored PHP5’s Unicode support much; I understand it’s better.)

Feel free to send a CP-1251 encoded Trackback to this entry, and see how my software handles it.

Posted by: Jacques Distler on February 26, 2005 11:33 AM | Permalink | PGP Sig | Reply to this
Weblog: julik-nl-probe-22:21:04
Excerpt: ÐŸÑ€Ð¾Ð±ÑƒÐµÐ¼ Ð·Ð´ÐµÑÑŒ... Ð¡ Ð¿Ñ€Ð¸Ñ†ÐµÐ¿Ð¾Ð¼: - ÐšÐ°Ðº Ð´Ð¾Ð»Ð³Ð¾ Ð¿Ñ€Ð¾Ð´ÐµÑ€Ð¶Ð¸Ñ‚ÑÑ Ð¼Ð°Ð³Ð¸Ñ? ÐŸÐ¾Ð½Ð°Ñ‡Ð°Ð»Ñƒ Ð½Ð¸ÐºÑ‚Ð¾ Ð½Ðµ Ð¾Ñ‚Ð²ÐµÑ‚Ð¸Ð» Ð½Ð° Ð²Ð¾Ð¿Ñ€Ð¾Ñ Ð Ð¾Ð»Ð°Ð½Ð´Ð°, Ð¿Ð¾ÑÑ‚Ð¾Ð¼Ñƒ Ð¾Ð½ Ð·Ð°Ð´Ð°Ð» ÐµÐ³Ð¾ Ð²Ð½Ð¾Ð²ÑŒ, Ð½Ð° ÑÑ‚Ð¾Ñ‚ ...
Tracked: February 26, 2005 3:21 PM
Read the post MPF: UTF-8 Russian trackback test (Charset set manually to utf-8 by JD.)
Weblog: julik-nl-probe-22:21:04
Excerpt: Пробуем здесь... С прицепом: - Как долго продержится магия? Поначалу никто не ответил на вопрос Роланда, поэтому он задал его вновь, на этот ...
Tracked: February 26, 2005 3:21 PM
Weblog: julik-nl-probe-22:21:04
Excerpt: Ïðîáóåì çäåñü... Ñ ïðèöåïîì: - Êàê äîëãî ïðîäåðæèòñÿ ìàãèÿ? Ïîíà÷àëó íèêòî íå îòâåòèë íà âîïðîñ Ðîëàíäà, ïîýòîìó îí çàäàë åãî âíîâü, íà ýòîò ðàç ïîäíÿâ ãëàçà íà äâóõ ìýííè, êîòîðûå ñèäåëè íàïðîòèâ íåãî â ãîñòèíîé äîìà îòöà Êàëëàãýíà, Õåí÷åêà è...
Tracked: February 26, 2005 3:22 PM
Read the post MPF: WINDOWS-1251 Russian trackback test (Charset set manually to windows-1251 by JD.)
Weblog: julik-nl-probe-22:21:04
Excerpt: Пробуем здесь... С прицепом: - Как долго продержится магия? Поначалу никто не ответил на вопрос Роланда, поэтому он задал его вновь, на этот раз подняв глаза на двух мэнни, которые сидели напротив него в гостиной дома отца Каллагэна, Хенчека и...
Tracked: February 26, 2005 3:22 PM
Weblog: julik-nl-probe-00:04:19
Excerpt: Пробуем здесь... С прицепом: - Как долго продержится магия? Поначалу никто не ответил на вопрос Роланда, поэтому он задал его вновь, на этот ...
Tracked: February 26, 2005 5:04 PM
Weblog: julik-nl-probe-00:04:19
Excerpt: Пробуем здесь... С прицепом: - Как долго продержится магия? Поначалу никто не ответил на вопрос Роланда, поэтому он задал его вновь, на этот раз подняв глаза на двух мэнни, которые сидели напротив него в гостиной дома отца Каллагэна, Хенчека и...
Tracked: February 26, 2005 5:04 PM

There we have it, 6 “identical” Russian trackbacks:

• Three encoded in utf-8, 3 in windows-1251.
• The two sent with the charset header are transcoded automagically.
• Of the four sent without charset headers, two had the tb_charset parameter set manually after they were received and stored in the database.
• The remaining two are gibberish (as one would expect).

Now, remember, I don’t try to autodetect utf-8 (though I could) and there’s no way I could auto-detect windows-1251. However, it’s trivial (a couple of mouse-clicks) to correct the lack of encoding information after the fact.

Posted by: Jacques Distler on February 26, 2005 6:03 PM | Permalink | PGP Sig | Reply to this

### Well this is trackback spam our way!

URL and MPF in the beginning stand for multipart-form-data and x-www-urlencoded (how the ping is being sent). I think I am encoding the latter correctly in my test pinger so…. strange.

My test pinger now sends 3 pings in 3 flavors:

In each of UTF-8, KOI-8 and CP-1251, as multipart-form-data and as URL-encoded (with charset header and without). That’s what the starting letters stand for, actually.

Btw, I easily got access to ENV from inside my plugin; strange that you couldn’t. I will send you the plugin shortly. I think I will include something called “TrackDecodeFallbackCharset”: a charset which will be used if the incoming ping is not flagged as UTF-8 and no header is provided, so that the user can set this for himself in mt.cfg.

About mouseclicks - do you get additional edit fields in MT when you add properties to its objects?

Posted by: Julik on February 26, 2005 6:21 PM | Permalink | Reply to this

### Thanks for all the plugins

My memory is a little hazy on the environment variable issue. In the end, what I wanted to do — store the trackbacks in their original form, and transcode them on output, at least until we were absolutely sure we’d determined the charset correctly — required hacking the source code. Once I realized that, the other bits — which could have been done via an MT 3.x plugin — just seemed easier to implement directly in MT, rather than via a plugin.

A future enhancement is to provide a utility so that, once you are sure you’ve determined the trackback’s charset correctly, you can — with the click of a mouse — transcode it to your blog’s PublishCharset once and for all.

> About mouseclicks - do you get additional edit fields in MT when you add properties to its objects?

No, though that’s another enhancement I’d like to add. Right now, I rely on a GUI MySQL client to edit the tbping table directly.

Posted by: Jacques Distler on February 27, 2005 3:23 AM | Permalink | PGP Sig | Reply to this

### Re: Thanks for all the plugins

Well, there it mostly is - the TrackDecode. Feel free to try it out.

It works by first scanning the header, then matching the string as utf-8 and finally doing a ‘fallback’ to a predefined character set (I use windows-1251 but you can also specify your own). Of course it is not that safe but in this case I decided to compromise the integrity of the ping in favor of plug&play. Usually you cannot get both when you automagically guess the charsets.

Plugin is drag&drop.
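
The three-step guess described above (header, then utf-8 match, then fallback) can be sketched as follows. This is not the TrackDecode source. `guess_charset` and its arguments are hypothetical, and the utf-8 test uses Encode’s strict decoder rather than a regex.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sketch of a header -> utf-8 test -> fallback cascade.
# $content_type may be undef; $fallback is whatever the blog owner configured.
sub guess_charset {
    my ($content_type, $bytes, $fallback) = @_;

    # 1. Trust a charset parameter in the Content-Type header, if there is one.
    if (defined $content_type
        && $content_type =~ /charset\s*=\s*"?([A-Za-z0-9_.:-]+)/i) {
        return lc $1;
    }

    # 2. Otherwise, accept the bytes as utf-8 if they decode cleanly.
    my $is_utf8 = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return 'utf-8' if $is_utf8;

    # 3. Otherwise, assume the configured fallback charset.
    return $fallback;
}

print guess_charset('application/x-www-form-urlencoded; charset=EUC-KR', '', 'windows-1251'), "\n";
print guess_charset(undef, "\xD0\x9F\xD1\x80", 'windows-1251'), "\n";   # well-formed utf-8
print guess_charset(undef, "\xCF\xF0\xE8", 'windows-1251'), "\n";       # not utf-8
```

Step 1 is only as good as the header: a ping that declares utf-8 but actually carries euc-kr bytes, like the Korean one in the post, would be labelled incorrectly by any cascade of this kind.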

Posted by: Julik on February 27, 2005 5:36 PM | Permalink | Reply to this