Internationalization

April 24, 2004

Say you want to tag some text on a web page as being in a language other than the main language of the page (English, in the case of this blog). In HTML 4, you would slap a <span lang="fr"></span> around it. In XHTML 1.1, the lang attribute is gone, and you’d write

<span xml:lang="fr">ma vie en rose</span>

instead.

And therein lies a small problem. No matter how you set your Sanitize Spec in the blog preferences, MovableType will strip out the xml:lang attribute from any sanitized text like, say, the comments on your blog. It can’t handle attributes with colons in them.

Fortunately, the fix for this is easy.

--- lib/MT/Sanitize.pm.orig     Fri Apr 23 08:40:27 2004
+++ lib/MT/Sanitize.pm  Fri Apr 23 08:41:42 2004
@@ -98,7 +98,7 @@
                 (exists $tag_attr->{$name} && $tag_attr->{$name} eq '/')) {
                 if ($inside) {
                     my @attrs;
-                    while ($inside =~ m/(\w+)\s*=\s*(['"])(.*?)\2/gs) {
+                    while ($inside =~ m/([:\w]+)\s*=\s*(['"])(.*?)\2/gs) {
                         my $att = lc($1);
                         if ($ok_tags->{'*'}{$att} ||
                            (ref $ok_tags->{$name} && $ok_tags->{$name}{$att})) {
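To see what that one-character-class change buys, here is a rough Python transliteration of the two attribute-matching patterns. It's only a sketch: the pattern strings are copied from the patch, but `OLD_ATTR`, `NEW_ATTR` and `parse_attrs` are names made up for illustration, and the rest of MT's sanitizer isn't reproduced.

```python
import re

# Patterns transliterated from the patch; the Python names are hypothetical.
OLD_ATTR = re.compile(r"""(\w+)\s*=\s*(['"])(.*?)\2""", re.S)
NEW_ATTR = re.compile(r"""([:\w]+)\s*=\s*(['"])(.*?)\2""", re.S)

def parse_attrs(pattern, inside):
    """Return (name, value) pairs found inside a tag, lowercasing names as the Perl does."""
    return [(m.group(1).lower(), m.group(3)) for m in pattern.finditer(inside)]

inside = ' dir="rtl" xml:lang="he"'
print(parse_attrs(OLD_ATTR, inside))  # the second name comes back as "lang", not "xml:lang"
print(parse_attrs(NEW_ATTR, inside))  # the full "xml:lang" name survives
```

With the old pattern the attribute name never matches an `xml:lang` entry in the Sanitize Spec, because the part before the colon is never captured.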

That takes care of easy languages, like French. But say you want to comment in Hebrew. Hebrew’s a Right-to-Left language. If you want to use a phrase in Hebrew in the midst of an English paragraph, you’d paste the Hebrew text into a <bdo dir="rtl" xml:lang="he"></bdo>.

“<bdo>” stands for “BiDirectional Override”, which temporarily reverses the direction of the text. If you want an entire paragraph in Hebrew, you’d paste the text into a <p dir="rtl" xml:lang="he"></p>.

[Update (5/11/2004): According to the W3C Draft on Handling Bi-Directional Text, you can mostly get away without using the <bdo> element, thanks to the Unicode Bi-Directional Algorithm and the super-secret character entities, &rlm; (Right-to-Left Mark) and &lrm; (Left-to-Right Mark), which let you control how neutral characters, like punctuation marks are treated. E.g. compare 1705 רחוב בן יהודה. (typed straight) with “‏1705 רחוב בן יהודה.‏” (uses some astutely-placed &rlm;s). Note: Safari screws this stuff up pretty badly; there are serious bugs in WebCore’s bidi implementation. There are also useful documents on Specifying the Language of Content and the ever-popular subject of Character Encodings (via Phil). ]
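The reason the marks help is that punctuation and digits carry no strong direction of their own, so the bidi algorithm resolves them from their neighbours. A quick way to inspect this is Python's `unicodedata` module. This is only a sketch for poking at character classes (the `bidi_classes` helper is a made-up name), not an implementation of the algorithm itself.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, the character behind &lrm;
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, the character behind &rlm;

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

print(bidi_classes("\u05d4"))   # a Hebrew letter: strong right-to-left ("R")
print(bidi_classes("1."))       # a digit and a full stop: weak classes ("EN", "CS")
print(bidi_classes(RLM + LRM))  # the marks are invisible but strong ("R", "L")
```

Placing an invisible strong character next to a full stop gives the algorithm something definite to resolve against.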

All these tags and attributes are allowed in the comments on this blog. The only bad news is with respect to Charsets. This blog uses ISO-8859-1. That handles Western European languages just fine, but doesn’t know anything about non-European languages. So if you enter

<span dir="rtl" xml:lang="he">הבנתי</span>

into the Comment Form and click “PREVIEW”, your browser will convert the text to numeric entities

<span dir="rtl" xml:lang="he">&#1492;&#1489;&#1504;&#1514;&#1497;</span>

which will display correctly, but which is not exactly the easiest thing to edit.
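The same conversion can be imitated in Python with the `xmlcharrefreplace` error handler. This only illustrates the mechanism, not what any particular browser does internally, and `to_latin1_with_entities` is a made-up helper name.

```python
import html

def to_latin1_with_entities(text):
    """Encode as ISO-8859-1, turning unencodable characters into numeric entities."""
    raw = text.encode("iso-8859-1", errors="xmlcharrefreplace")
    return raw.decode("iso-8859-1")

hebrew = "\u05d4\u05d1\u05e0\u05ea\u05d9"  # the Hebrew word from the example
encoded = to_latin1_with_entities(hebrew)
print(encoded)                           # &#1492;&#1489;&#1504;&#1514;&#1497;
print(html.unescape(encoded) == hebrew)  # True: the entities round-trip
```

Characters that ISO-8859-1 does cover pass through untouched, which is why Western European text never shows the problem.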

If I converted to UTF-8, presumably, this problem would be solved. Unfortunately, the last time I tried it, the interaction between UTF-8 and MT’s Comment Form was such a horror story that I’m loath to try it again.

Posted by distler at April 24, 2004 1:46 AM

TrackBack URL for this Entry: https://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/352

18 Comments & 2 Trackbacks

International keyboard input test

אני מבין עכשו.

That was input using a Hebrew keyboard layout under MacOSX. Clicking on “PREVIEW” converted the Hebrew text to numeric entities. Before conversion, it looked like

.אני מבין עכשו

After, it looked like

אני מבין עכשו.

Posted by: Jacques Distler on April 25, 2004 1:22 AM

Re: Internationalization

This is just a test. I’ll explain in a bit:

Iñtërnâtiônàlizætiøn

“Smart quotes”

Posted by: Sam Ruby on April 25, 2004 10:12 AM

Re: Internationalization

Another test:

Iñtërnâtiônàlizætiøn

Posted by: Sam Ruby on April 25, 2004 11:28 AM

A test with Japanese

日本語波動ですか。やっぱり、ブロッグの国際化もいいですねえ。

Posted by: Abiola Lapite on April 25, 2004 6:17 PM

Re: A test with Japanese

Actually, that wasn’t quite what I had in mind; I’ll try again.
日本語波動ですか。やっぱり、ブロッグの国際化もいいですねえ。

Posted by: Abiola Lapite on April 25, 2004 6:24 PM

Re: A test with Japanese

Sorry for yet another repeat. I used the wrong language tag to indicate Japanese.
日本語波動ですか。やっぱり、ブロッグの国際化もいいですねえ。
Seems to work just fine. Then again, it did even without being wrapped by the tag.

Posted by: Abiola Lapite on April 25, 2004 6:28 PM

Why Semantic XHTML is good

As a string of characters, it works just fine. But if you identify the language, it becomes more than just a string of characters. Screen readers can read it, and if the blog owner comes along and decides that all Japanese text should be displayed in an attractive shade of green, then it will be.

(This demonstration would have been a bit more impressive had you signed your comment. Still, I promise that all I did was add

*:lang(ja) {color:#693;}

to my CSS file.)

Posted by: Jacques Distler on April 25, 2004 11:55 PM

Re: Internationalization

I have reported it as a bug for MT 3 beta. Hopefully they will correct it in MT 3.

Posted by: Srijith on April 25, 2004 9:18 PM

MT Bugs

Well …

While we’re at it, let’s note that <p> and <dl> are block-level elements that should be “protected” from being wrapped in <p>...</p> by MovableType’s “Convert Linebreaks” filter.

--- lib/MT/Util.pm.orig Sun Apr 25 00:41:20 2004
+++ lib/MT/Util.pm      Sun Apr 25 00:43:59 2004
@@ -179,7 +179,7 @@
     $str ||= '';
     my @paras = split /\r?\n\r?\n/, $str;
     for my $p (@paras) {
-        if ($p !~ m/^<(?:table|ol|ul|pre|select|form|blockquote|div|q)/) {
+        if ($p !~ m/^<(?:table|ol|ul|dl|p|pre|select|form|blockquote|div|q)/) {
             $p =~ s!\r?\n!<br />\n!g;
             $p = "<p>$p</p>";
         }

There are other block-level elements which could presumably be added to the list; these are the ones I allow in my comments.
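For readers who don't speak Perl, the patched logic can be sketched in Python roughly as follows. The protected-element list is copied from the patch. The function name `convert_breaks`, and the choice to rejoin paragraphs with blank lines, are assumptions made for illustration rather than a transcription of MT's code.

```python
import re

# Elements that must not be wrapped in <p>...</p> (list taken from the patch).
PROTECTED = re.compile(r"^<(?:table|ol|ul|dl|p|pre|select|form|blockquote|div|q)")

def convert_breaks(text):
    """Wrap plain paragraphs in <p> tags, leaving block-level markup alone."""
    paras = re.split(r"\r?\n\r?\n", text or "")
    out = []
    for p in paras:
        if not PROTECTED.match(p):
            # Single newlines inside a paragraph become explicit line breaks.
            p = re.sub(r"\r?\n", "<br />\n", p)
            p = "<p>" + p + "</p>"
        out.append(p)
    return "\n\n".join(out)

sample = '<p dir="rtl" xml:lang="he">...</p>\n\nplain\ntext'
print(convert_breaks(sample))
```

Without `p` in the list, the first paragraph of `sample` would come out as a `<p>` nested inside another `<p>`, which is invalid.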

If you want to set an entire paragraph in another language (especially, if it’s a right-to-left one), you want to wrap it in a paragraph tag, as in the example above.

Posted by: Jacques Distler on April 25, 2004 10:57 PM

Signed comment

And, finally, let’s try a little Iñtërnâtiônàlizætiøn.

Posted by: Jacques Distler on June 4, 2004 10:00 AM

Read the post I18n trackback test
Weblog: Sam Ruby
Excerpt: Iñtërnâtiônàlizætiøn
Tracked: February 16, 2005 3:52 AM

Read the post ٹریک بیک اور یونیکوڈ
Weblog: Procrastination
Excerpt: ‮موویبل ٹائپ ‪(Movable Type)‬ میں ٹریک بیک ‪(Trackbacks)‬ ایک عجب مسئلہ ہیں۔ کیونکہ ٹریک بیک یہ نہیں بتاتا کہ وہ کس زبان یا ‪encoding‬...‬
Tracked: February 16, 2005 11:28 PM

Re: Internationalization

I am just forcing a rebuild of the page with this comment as I sent a trackback with Unicode control characters for text direction.

Posted by: Zack on February 17, 2005 2:31 PM

Re: Internationalization

I solved the bidi issue but another has cropped up. The numeric character codes are counted as 8 characters and the semicolon at the end of the last one was truncated. This truncating to n characters is not a very good idea in my opinion. It would be better to truncate to m words with a basic check of the number of characters to guard against a continuous character stream or something.

Posted by: Zack on February 19, 2005 8:06 PM

Truncations

Yes, truncating at “n” characters, for any “n”, could mean that you end up chopping off the excerpt in the middle of an entity, before the terminating “;”.

In your case, since you’re using utf-8, you could always enter the Unicode characters directly, saving several characters for each entity in your excerpt.

I’m happy to entertain a more “intelligent” truncation algorithm, if anyone cares to propose some code …

Posted by: Jacques Distler on February 19, 2005 9:02 PM

Re: Truncations

Actually, you were lucky with your truncation. It truncated in the English text portion. If the truncation had happened in an Urdu word, you would have been left with an active RLO (Right to left override) character which would have mirrored all the regular English words from there to the end of the page.

Regarding a better truncating solution, I am using:

$excerpt = first_n_words($excerpt, 40) . '...';

The only time when this fails is when a spammy trackback has a long string of characters without any whitespace in the excerpt. I think that can be handled with other trackback spam.

you could always enter the Unicode characters directly,

The only way I know to do that in WinXP is to use Alt + numeric keypad. But my laptop doesn’t have a numeric keypad.

Posted by: Zack on February 19, 2005 10:48 PM

Re: Truncations

If the truncation had happened in an Urdu word, you would have been left with an active RLO (Right to left override) character which would have mirrored all the regular English words from there to the end of the page.

From there, till the end of the paragraph, but ugly, nonetheless.

The only solution I can see is for me to append an explicit LRO at the end of the excerpt (just in case).

Regarding a better truncating solution, I am using:

$excerpt = first_n_words($excerpt, 40) . '...';

The only time when this fails is when a spammy trackback has a long string of characters without any whitespace in the excerpt.

That’s a bit more æsthetic, but “dangerous”. Better would be to truncate at the last whitespace before the nth character.

you could always enter the Unicode characters directly,

The only way I know to do that in WinXP is to use Alt + numeric keypad. But my laptop doesn’t have a numeric keypad.

MacOSX, in addition to various international keyboard layouts offers a floating palette, from which you can select any Unicode character you want and click ‘insert’. Surely, there’s a similar utility for XP?
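As for truncating at the last whitespace before the nth character, a minimal Python sketch might look like this. It is one possible reading of the idea, not code from MT. The name `truncate_at_whitespace` is hypothetical, and the fallback for whitespace-free input is an added assumption.

```python
import re

def truncate_at_whitespace(text, n, ellipsis="..."):
    """Cut text to at most n characters, backing off to the last whitespace."""
    if len(text) <= n:
        return text
    head = text[:n]
    if not text[n].isspace():
        # The cut fell inside a word or an entity: back off to the last whitespace.
        pos = max((i for i, ch in enumerate(head) if ch.isspace()), default=-1)
        if pos >= 0:
            head = head[:pos]
        else:
            # No whitespace at all (say, a spammy unbroken string): keep the hard
            # cut, but drop a dangling unterminated entity such as "&#1492".
            head = re.sub(r"&#?\w*$", "", head)
    return head.rstrip() + ellipsis

print(truncate_at_whitespace("abc &#1492;&#1489; def", 10))  # abc...
print(truncate_at_whitespace("&#1492;&#1489;&#1504;", 10))   # &#1492;...
```

Since entities contain no whitespace, cutting at whitespace can never split one. Only the hard-cut fallback needs the extra check, and it will also trim a trailing bare ampersand followed by letters, which seems an acceptable price.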

Posted by: Jacques Distler on February 19, 2005 11:14 PM

Re: Truncations

The only solution I can see is for me to append an explicit LRO at the end of the excerpt (just in case).

That would work in your case but not mine. A more elegant solution is to check the number of all such nested control characters and append the appropriate number of PDF (Pop directional format) characters.

Better would be to truncate at the last whitespace before the nth character.

Correct, but in that case it would be good form to handle entities properly as well. I’ll probably write something like that soon.
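The PDF-balancing idea can be sketched in a few lines of Python. This assumes that only the explicit embedding and override characters (LRE, RLE, LRO, RLO) need closing, and `close_bidi_controls` is a hypothetical name.

```python
# Explicit directional formatting characters that open a nesting level.
OPENERS = {
    "\u202a",  # LRE, left-to-right embedding
    "\u202b",  # RLE, right-to-left embedding
    "\u202d",  # LRO, left-to-right override
    "\u202e",  # RLO, right-to-left override
}
PDF = "\u202c"  # pop directional formatting

def close_bidi_controls(text):
    """Append one PDF for every embedding or override still open at the end."""
    depth = 0
    for ch in text:
        if ch in OPENERS:
            depth += 1
        elif ch == PDF and depth > 0:
            depth -= 1
    return text + PDF * depth

truncated = "\u202eabc \u202adef"  # an RLO and an LRE, both left open by a cut
print(close_bidi_controls(truncated).count(PDF))  # 2
```

Run on an excerpt after truncation, this leaves already balanced text unchanged and closes whatever the cut left dangling.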

Posted by: Zack on February 20, 2005 1:20 AM

Re: Internationalization

I have a question about using lang="tr" on an HTML form. The form’s standard language is English, but from time to time the users need to input data using a Turkish character set. The form is then automatically emailed when they hit submit. When I input the Turkish characters via IE they look fine, but when the text is wrapped up into an email the Turkish characters are gone and replaced with ASCII.

Posted by: Greg on March 29, 2007 9:38 AM

Re: Internationalization

The lang attribute has nothing to do with your issue. You are interested in character encodings.

For an HTML form, the encoding is set either by the encoding of the page or by the accept-charset attribute of the form.

From the sound of it, the encoding of the form is set correctly, and your web application is receiving the form correctly encoded (utf-8? cp-1254?). On the other hand, when the data is emailed out, that, too, would require setting the encoding (in the Content-Type header of the email). And I presume that is not being done correctly.

Not knowing any more about your application, that’s all I can say.
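For what it's worth, here is roughly what setting the encoding of an outgoing message looks like with Python's standard `email` package. It is a sketch only: nothing is known about the application in question, and the addresses and the `build_message` helper are invented for illustration.

```python
from email.message import EmailMessage

def build_message(body, sender, recipient, subject):
    """Build a plain-text message whose Content-Type header declares UTF-8."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # Passing charset here is what records the encoding in the Content-Type header.
    msg.set_content(body, charset="utf-8")
    return msg

msg = build_message(
    "\u0130stanbul, \u00e7ay, \u015feker\n",  # Turkish text with non-ASCII letters
    "form@example.com",
    "owner@example.com",
    "Form submission",
)
print(msg["Content-Type"])
print(msg.get_content_charset())
```

If the mail is instead assembled by hand without a charset declaration, the receiving client has to guess, and the Turkish letters are the first casualties.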

Posted by: Jacques Distler on March 29, 2007 10:29 AM

Musings
