## April 14, 2004

### MTStripControlChars

Probably the top usability complaint with the comment system on this blog is that if you copy and paste some text into your comment, the smart quotes in the pasted text get turned into garbage characters.

This is, alas, a problem common to just about anyone who doesn’t use Windows-1252 encoding as the Charset of their website (does anyone really do that?). The only difference is that my software won’t let you post a comment with those illegal characters in it.

Since this is such a common problem, I long ago wrote a plugin to filter out these garbage characters from Trackbacks, RSS feeds I syndicate, my Technorati Cosmos, etc. I took a pretty brutal approach. The plugin defined a new global tag attribute, strip_controlchars. Adding the attribute to any MT variable substitution tag,

<$MTEntryBody strip_controlchars="1"$>

simply drops characters 0x80 to 0x9F from the content.
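As a rough illustration of this "brutal" mode, here is a minimal Python sketch. It is not the plugin's actual code (the plugin is a Perl file); the function name `strip_controlchars` is borrowed from the tag attribute purely for readability, and the sample input is made up.

```python
import re

# The C1 control range of ISO-8859-1: code points 0x80 through 0x9F.
C1_CONTROLS = re.compile(r"[\x80-\x9f]")

def strip_controlchars(text: str) -> str:
    """Drop every character in the range 0x80-0x9F (illustrative only)."""
    return C1_CONTROLS.sub("", text)

# Bytes as they might arrive on an ISO-8859-1 blog: 0x93 and 0x94 are the
# Windows-1252 curly double quotes, 0xE9 is a legitimate ISO-8859-1 "é".
raw = b"\x93Hello\x94 caf\xe9"
text = raw.decode("iso-8859-1")

cleaned = strip_controlchars(text)
print(cleaned)
```

The curly quotes vanish, while the legitimate ISO-8859-1 character is left alone.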

Inspired by Sam Ruby’s excellent Survival Guide, I recently decided that a more sophisticated approach was desirable. In the new version of the MTStripControlChars plugin,

<$MTEntryBody strip_controlchars="1"$>

works as before, but

<$MTEntryBody strip_controlchars="2"$>

translates the (would-be) Windows-1252 characters into the corresponding Unicode numeric entities.
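A sketch of what such a translation could look like, in Python rather than the plugin's Perl. The name `remap_controlchars` is hypothetical, and dropping the five byte values that Windows-1252 leaves unassigned is an assumption of this sketch, not something the post specifies.

```python
def remap_controlchars(text: str) -> str:
    """Turn would-be Windows-1252 characters (0x80-0x9F) into numeric entities."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F:
            try:
                # Reinterpret the code point as a Windows-1252 byte.
                fixed = bytes([cp]).decode("cp1252")
            except UnicodeDecodeError:
                # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in
                # Python's cp1252 codec; this sketch simply drops them.
                continue
            out.append(f"&#x{ord(fixed):04X};")
        else:
            out.append(ch)
    return "".join(out)

# 0x93 / 0x94 are the Windows-1252 curly double quotes.
fixed_text = remap_controlchars("\x93Hi\x94")
print(fixed_text)
```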

Incorporating this filter into my Comment Preview template, we now automagically fix those mangled smart quotes.

Hopefully, this will lead to a more pleasant user experience.

Update (6/23/2004): See this comment for information on using this plugin with UTF-8 encoded blogs. If your blog uses UTF-8 encoding, then Trackbacks (or legacy data in your database) which contain these (now strictly illegal) characters are still a problem. Unfortunately, UTF-8 is a multi-byte encoding scheme, so fixing the problem is not as simple as stripping-out these “bad” bytes.

Update (7/1/2004): As Sam Ruby points out, there are other “illegal” characters in addition to the above. I’ve updated the plugin to strip those out too. Again, the plugin is currently really only useful if your blog charset is ISO-8859-1.

Posted by distler at April 14, 2004 11:19 PM

TrackBack URL for this Entry:   http://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/347

### Re: MTStripControlChars

Yes, fixing as opposed to stripping is excellent.

Posted by: Matt on April 15, 2004 1:14 AM | Permalink | Reply to this

### Re: MTStripControlChars

I actually had to implement something like this in PHP for the root page of my web site, which is a rudimentary RSS aggregator of sorts; one of the weblogs that it syndicates was invalidating my page by using Windows-1252 “smart quotes.” It didn’t occur to me to implement the same for my MT weblog (I’ve been happily using your original MTStripControlChars plugin), so thanks again, Jacques.

I do worry that the assumption that the code points Hex80 to Hex9F are always intended as Windows-1252 characters will eventually prove to be false, but I don’t know enough about the issue to make an educated comment, and so far I’ve had no problems.

Posted by: jacob on April 15, 2004 11:08 AM | Permalink | Reply to this

### Re: MTStripControlChars

As this comment form does not define an accept-charset value, user agents may interpret the default for this as the character encoding used to transmit the form itself. Which, for this weblog, is iso-8859-1.

The 0x80 thru 0x9F range is defined as control characters in iso-8859-1.

Posted by: Sam Ruby on April 15, 2004 6:38 PM | Permalink | Reply to this

### Re: MTStripControlChars

As this comment form does not define an accept-charset value, user agents may interpret the default for this as the character encoding used to transmit the form itself.

That makes perfect sense, as — ultimately — the comment is to appear on the blog individual archive page, which uses the same charset encoding as the form.

Does the problem occur because the form is POSTed with the “wrong” charset (and hence could be cured with an appropriate Accept-charset header)? Or does the problem occur with the charset used internally for copy/paste?

I was under the impression that it was the latter, which is why we are remapping characters server-side.

The 0x80 thru 0x9F range is defined as control characters in iso-8859-1.

Yes, thank you for that clarification.

I should have said that everything I’ve said assumes ISO-8859-1 is the charset in use (true for the vast majority of MT blogs). There will, doubtless, be other character mis-mappings if, say, you are using UTF-8 for your blog.

But I don’t have any experience with what breaks when commenters copy 'n paste on a UTF-8 encoded blog.

Posted by: Jacques Distler on April 15, 2004 8:25 PM | Permalink | PGP Sig | Reply to this

### Re: MTStripControlChars

But I don’t have any experience with what breaks when commenters copy ‘n paste on a UTF-8 encoded blog.

The short answer is that MTStripControlChars is not needed on utf-8 encoded blogs. At least IE and Moz both do the right thing.

Posted by: Sam Ruby on April 16, 2004 1:56 PM | Permalink | Reply to this

### UTF-8 and MT

The short answer is that MTStripControlChars is not needed on utf-8 encoded blogs.

MovableType has, apparently, other problems with UTF-8. These are irrelevant if you’re serving the comment-form as text/html. But Yuan-Chung’s solution (which, apparently, works) made me dizzy.

Posted by: Jacques Distler on April 16, 2004 8:59 PM | Permalink | PGP Sig | Reply to this

### MTStripControlChars and UTF-8

The short answer is that MTStripControlChars is not needed on utf-8 encoded blogs. At least IE and Moz both do the right thing.

Just to clarify, they do the right thing with new comments.

If you already have comments with 0x80 through 0x9F in your database (or receive trackbacks with them), switching to UTF-8 will not solve your problem; it will actually make things worse.

Whereas these byte sequences are defined as control characters in ISO-8859-1, they don’t correspond to anything in UTF-8.

You need, more than ever, to filter them out, or remap them to something sensible. Hence this plugin is still useful for UTF-8 users. It won’t munge any valid UTF-8 content, but will take care of the invalid stuff you, almost inevitably, will have to deal with.

Posted by: Jacques Distler on June 12, 2004 11:40 AM | Permalink | PGP Sig | Reply to this

### Re: MTStripControlChars and UTF-8

Whereas these byte sequences are defined as control characters in ISO-8859-1, they don’t correspond to anything in UTF-8.

Whoops! That’s not strictly true. There are no UTF-8 characters encoded as a single byte in the range 0x80 – 0x9F. But there certainly are multi-byte UTF-8 characters which contain bytes in this range.

While I pursue a more robust solution for UTF-8, it would be prudent to disable MTStripControlChars if your blog uses UTF-8 encoding.
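To make the danger concrete, here is a small Python sketch (not the plugin's code) of what a naive byte-level filter does to perfectly valid UTF-8. The helper `is_valid_utf8` is a hypothetical name introduced only for this illustration.

```python
import re

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# U+2019 (right single quote) is three bytes in UTF-8: E2 80 99.
utf8_bytes = "\u2019".encode("utf-8")

# Naive filter: delete every byte in the range 0x80-0x9F.
stripped = re.sub(rb"[\x80-\x9f]", b"", utf8_bytes)

print(utf8_bytes, is_valid_utf8(utf8_bytes))
print(stripped, is_valid_utf8(stripped))
```

The filter removes two of the three bytes and leaves a lone lead byte behind, so valid content becomes invalid.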

Posted by: Jacques Distler on June 23, 2004 11:36 AM | Permalink | PGP Sig | Reply to this

### accept-charset

As this comment form does not define an accept-charset value …

Just so no one thinks me a shirker, I added an accept-charset attribute to this form. Specifically, I set

<form  method="post" accept-charset ="<MTPublishCharset>" ... >

This has absolutely no effect on the munging of smartquotes, at least not when POSTing from Mozilla.

(If you’re a masochist, and want to know why that’s the case, read bug 228779.)

Posted by: Jacques Distler on April 16, 2004 12:29 AM | Permalink | PGP Sig | Reply to this
Read the post StripControlChars
Weblog: Movable Type Plugin Directory
Excerpt: Updated to 0.3 to handle additional characters....
Tracked: June 3, 2004 5:31 AM

### Keeping a clean database

Do you have any interest in turning this into a 3.0 plugin that hooks comment-save and ping-save, to strip things out before they hit the database (I know your comments are cleaned through the preview, but most people don’t force preview), or should I steal it, er, fork it, um, expand on it myself?

Posted by: Phil Ringnalda on June 3, 2004 10:53 PM | Permalink | PGP Sig | Reply to this

### Re: Keeping a clean database

An excellent idea.

I haven’t explored the new plugin callbacks yet. This sounds like an excellent application.

But I’m not so interested in a 3.0-only plugin, if that’s at all avoidable. Fortunately, you are also the leading expert on writing plugins for 3.0 which are backwards-compatible with 2.x.

So, if you want to collaborate on this, I’d quite like to do it.

Posted by: Jacques Distler on June 3, 2004 11:19 PM | Permalink | PGP Sig | Reply to this

### Steal from the best

Heh. Not so much an expert as a good thief.

But, yes, going both ways shouldn’t be a problem. If the eval says we’re in 3.0, add the plugin slug and add a callback for comment save, ping save (and entry save? hmm) to a sub that just calls strip_controlchars with everything that might be ugly, title and body at least, and then either way add the conditional.

I’ve got the weekend off, I should be able to send you something with several boneheaded bugs sometime Saturday.

Posted by: Phil Ringnalda on June 4, 2004 12:32 AM | Permalink | PGP Sig | Reply to this

### Re: Steal from the best

Hey, have either of you made any progress on this? I, for one, am very interested in using it on my blog! Please keep me updated! Thanks :)

Posted by: Joshua Kaufman on June 15, 2004 7:14 PM | Permalink | Reply to this

### Re: Steal from the best

Could either Phil or Jacques let me know what happened with the “3.0 plugin that hooks comment-save and ping-save, to strip things out before they hit the database” as mentioned above in Phil’s comment? Has anyone worked on it or has it been canned? Thanks!

Posted by: Joshua Kaufman on August 10, 2004 2:06 AM | Permalink | Reply to this

### No progress

I haven’t thought about it any further.

For comments, comment-validation is the only real way to ensure that only good stuff gets into the database. The save-hook on the MTStripControlChars plugin is superfluous. Indeed, the plugin’s only there as a courtesy, to automatically fix things which would otherwise be flagged as errors.

For trackbacks, the problem is much more tricky, and MTStripControlChars is, at best, a band-aid solution (and that, only for ISO-8859-1 blogs). One could use it to “fix” stuff going into the database, but — for the present — I’d prefer to leave the stuff in the database untouched, until someone figures out a better solution.

Posted by: Jacques Distler on August 10, 2004 2:19 AM | Permalink | PGP Sig | Reply to this

### Re: Steal from the best

I keep thinking “I should remember to think about whether we really can do anything useful (without nuking things like PGP-signed comments)” but thinking about thinking hardly counts as progress, I’m afraid.

Posted by: Phil Ringnalda on August 10, 2004 10:53 AM | Permalink | PGP Sig | Reply to this

### Re: MTStripControlChars

Why does MTStripControl… convert smartquotes and the like to their corresponding decimal codes rather than hex? This generated invalid code when I ran it through the W3C validator. Switching the .pl file conversions to hex fixed that.

Has anyone thought about using this plugin to convert all pasted-in special characters, like non-English characters?

Posted by: DK on September 19, 2004 8:08 AM | Permalink | Reply to this

### Re: MTStripControlChars

re. last comment from me: I think I meant hex when I said dec.

Posted by: DK on September 19, 2004 8:16 AM | Permalink | Reply to this

### Hex versus Decimal

You mean &#x201C; versus &#8220;? They are both valid representations of ‘ “ ’ and neither will give you a lick of trouble.
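A quick way to convince yourself, sketched in Python using the standard library's `html.unescape` (nothing here is specific to MT or the plugin):

```python
import html

hex_ref = "&#x201C;"   # hexadecimal character reference
dec_ref = "&#8220;"    # decimal character reference

hex_char = html.unescape(hex_ref)
dec_char = html.unescape(dec_ref)

# Both references name the same code point, U+201C.
print(hex_char == dec_char, hex(ord(hex_char)), int("201C", 16))
```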

As noted, MTStripControlChars doesn’t really work with UTF-8 encoded blogs (something is needed, even for UTF-8, to filter out invalid data, but I haven’t settled on the best approach). Or you could be having troubles with another, unrelated, feature of text filters.

In recent versions of Perl, the order of execution of text filters (which look like attributes to MT tags, e.g.

<$MTCommentPreviewSubject strip_controlchars="2" remove_html="1" smarty_pants="2"$>

) is random. In some instances, this can cause problems. To force a particular order of execution, you can use the MTBlock plugin:

<MTBlock smarty_pants="2"><$MTCommentPreviewSubject strip_controlchars="2" remove_html="1"$></MTBlock>

as discussed here.

Has anyone thought about using this plugin to convert all pasted-in special characters, like non-English characters?

Any characters entered into the comment form which are not in your declared charset will be automatically converted to numeric entities by the browser when the form is submitted. Or, at least, that’s what the browser is supposed to do (aside from this common screwup between ISO-8859-1 and Windows-1252, but aside from that …).
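Python's `xmlcharrefreplace` error handler does something analogous, so it can serve as a rough model of the intended browser behaviour. This is only an analogy; the sample comment text is made up, and real browsers may differ in detail.

```python
# A comment containing curly quotes (not in ISO-8859-1) and "ï" (which is).
comment = "\u201cna\u00efve\u201d"

# Encode to the blog's declared charset; anything unrepresentable
# becomes a decimal numeric character reference.
encoded = comment.encode("iso-8859-1", errors="xmlcharrefreplace")

print(encoded)
```

Characters that exist in the declared charset pass through as single bytes; only the ones outside it are turned into entities.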

Posted by: Jacques Distler on September 19, 2004 10:53 AM | Permalink | PGP Sig | Reply to this

### Re: MTStripControlChars

Thanks. Yes, I meant &#x201C; versus &#8220; – the former works fine but generates errors in the W3C validator.

Posted by: DK on September 21, 2004 11:03 AM | Permalink | Reply to this

### Huh?

Both hexadecimal and decimal character references are valid (X)HTML, and neither causes problems with the W3C Validator. If you have a page with hexadecimal character references which cause problems with the W3C Validator, please provide a link.

This page has hexadecimal character references on it and it validates just fine.

Posted by: Jacques Distler on September 21, 2004 11:15 AM | Permalink | PGP Sig | Reply to this

### Re: MTStripControlChars

Your plugin works fine for the MTEntriesBody tag but doesn’t work for the Comments, i.e.
$MTCommentBody$ strip_controlchars=”2” doesn’t work.

p.s. I have Greater Than and Less Than signs around the above code, but seems to mess up posting. I tried adding around it.

Suggestions?

Posted by: Tom Keating on April 6, 2005 1:09 PM | Permalink | Reply to this

Works in the comments over here.

If you want it to work in comment previews, you need to set the attribute on <MTCommentPreviewBody>

p.s. I have Greater Than and Less Than signs around the above code, but seems to mess up posting. I tried adding around it.

Since I allow a quite extensive subset of HTML in your comments, I don’t do any escaping for you. If you want “<”, you need to escape it yourself, by typing “&lt;” and so forth.

Posted by: Jacques Distler on April 6, 2005 4:55 PM | Permalink | PGP Sig | Reply to this
Weblog: VoIP Blog - VoIP News, Gadgets
Excerpt: I’ve been using MTGoogleSearch for Related Entries on my MovableType blog - and unfortunately some of the related entries have UTF-8 characters in the URL titles which changes my webpage’s default iso-8859-1 encoding to UTF-8. If at least one...
Tracked: April 20, 2005 10:36 AM

### Re: MTStripControlChars

Hi,

I’m trying to use this plugin with an installation of mt 2.61. I am using it in replace mode (option “2”). It does not seem to replace the characters. Is there any chance the character set I would like replaced is not the set the app is looking for? Or, am I just using it wrong?
To see what I am trying to do you can look at www dot egrigg9000 dot com slash mtlink .

My favorites:
Iâ€™d instead of I’d
â€œPro instead of “Pro
Whyâ€ instead of Why”
happenâ€¦ instead of happen:
â€“ instead of – Pro

I am not sure if this app is designed to replace characters for incoming posts only, or if it can apply to the entire archive via change template and rebuild. I have tried both.

-EG

Posted by: Elizabeth Grigg on May 8, 2005 10:53 AM | Permalink | Reply to this

### Re: MTStripControlChars

The short answer is that, somehow, your posts are being input using the UTF-8 charset, whereas your templates declare that your blog is written in the ISO-8859-1 charset. In between, you are filtering the output through my plugin.

The purpose of my plugin is to convert Windows-1252 (which Microsoft, hence most other browser makers, pretends is ISO-8859-1) to actual ISO-8859-1. It’s neither useful, nor necessary with UTF-8.

At the cost of, perhaps, boring you to tears, let me explain what happened to the “ ’ ” character you typed.

In Unicode, the “ ’ ” character is designated as U+2019. In UTF-8, that character is represented by a string of 3 bytes, 0xE28099. (I’m using hexadecimal notation, so each byte is represented by two hexadecimal digits, ‘E2’, ‘80’ and ‘99’.)

Let’s see what happens to each of those three bytes when you run them through my plugin and display the result as ISO-8859-1.

1. ISO-8859-1 is a single-byte charset. The byte 0xE2 is the character “â”.
2. The byte 0x80 is a control character. My plugin converts it to &#x20AC;, which is the character “€”.
3. The byte 0x99 is also a control character, and my plugin converts it to &#x2122;, which is the character, “™”.

Now, where my plugin would have been useful is if you had been using Windows-1252 (aka, what Microsoft pretends is ISO-8859-1). In Windows-1252, “ ’ ” is the single byte, 0x92, which is yet another control character in real ISO-8859-1. My plugin would have converted it to &#x2019;, which is the real way to write “ ’ ” in ISO-8859-1.
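The byte-by-byte walk-through above can be reproduced in a few lines of Python. Decoding with the `cp1252` codec is used here as a stand-in for "remap 0x80 to 0x9F as Windows-1252, leave everything else as ISO-8859-1"; that equivalence is an assumption of this sketch, which is not the plugin's actual code.

```python
# U+2019 as it sits in a UTF-8 encoded post: three bytes, E2 80 99.
utf8_bytes = "\u2019".encode("utf-8")

# Treat each byte as its own single-byte character, with the
# 0x80-0x9F range read as Windows-1252.
mangled = utf8_bytes.decode("cp1252")
print(mangled)

# The case the plugin is meant for: a lone Windows-1252 byte 0x92.
repaired = b"\x92".decode("cp1252")
entity = f"&#x{ord(repaired):04X};"
print(entity)
```

The first result is the familiar three-character garbage; the second is the single entity the plugin is meant to produce.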

The bottom line for you is: use the same charset to publish your blog as you used to compose it. Use either UTF-8 or ISO-8859-1, but don’t mix them.

Clear?

Posted by: Jacques Distler on May 8, 2005 1:22 PM | Permalink | PGP Sig | Reply to this

### Re: MTStripControlChars

How do I run and install this plugin? I have uploaded the file to the plugins directory of movable type (v. 3.17) and I have added the MTEntryBody strip_controlchars=”2” to my template. But nothing happens.

Am I missing something?

Posted by: ted on June 30, 2005 6:49 PM | Permalink | Reply to this

### Re: MTStripControlChars

Same here…MT 3.17. Does this plugin not work with latest build of MT?

Posted by: Lance on August 8, 2005 1:59 AM | Permalink | Reply to this

### Yes, it does

It certainly does work with 3.17, as you can see on this blog.

It is not helpful with UTF-8. It is really intended for use with ISO-8859-1. But it “works” anyway.

Posted by: Jacques Distler on August 8, 2005 2:09 AM | Permalink | PGP Sig | Reply to this
Read the post Useful plug-ins
Weblog: Digicraft
Excerpt: Movble Type Plugins Amputator If you want your HTML or XHTML to display smoothly across browsers and platforms, you need to use a little finesse with non-ASCII characters such as curly quotes, or “ä” or “™.” If you view...
Tracked: October 16, 2005 8:45 PM
Read the post Useful plug-ins
Weblog: Digicraft
Excerpt: Movble Type Plugins Amputator If you want your HTML or XHTML to display smoothly across browsers and platforms, you need to use a little finesse with non-ASCII characters such as curly quotes, or “ä” or “™.” If you view...
Tracked: November 8, 2005 7:43 PM
Read the post MovableType Garbage Characters Problem
Weblog: VoIP & Gadgets Blog
Excerpt: I found a solution to garbage characters showing up in my blog. The solution is to download the MTStripControlChars plugin which essentially translates the (would-be) Windows-1252 characters into the corresponding Unicode numeric entities. This fixed m...
Tracked: January 19, 2006 2:19 PM
Read the post In other sysadmin news...
Weblog: Juxtaposition
Excerpt: I also finally started using the excellent PromoteThis plugin to create the nice links to digg. Also, I had to...
Tracked: March 12, 2007 1:07 AM
