MTStripControlChars
Probably the top usability complaint with the comment system on this blog is that if you copy and paste some text into your comment, the smart quotes in the pasted text get turned into garbage characters.
This is, alas, a problem common to just about anyone who doesn’t use Windows-1252 encoding as the Charset of their website (does anyone really do that?). The only difference is that my software won’t let you post a comment with those illegal characters in it.
Since this is such a common problem, I long ago wrote a plugin to filter out these garbage characters from Trackbacks, RSS feeds I syndicate, my Technorati Cosmos, etc. I took a pretty brutal approach. The plugin defined a new global tag attribute, strip_controlchars. Adding the attribute to any MT variable substitution tag,
<$MTEntryBody strip_controlchars="1"$>
simply drops characters 0x80 to 0x9F from the content.
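To make the "brutal" approach concrete, here is a minimal Python sketch of the same idea. The plugin itself is not written in Python, and the function name `strip_controlchars` is borrowed from the tag attribute purely for illustration. It assumes the content has already been decoded as ISO-8859-1, so that bytes 0x80 to 0x9F show up as the code points U+0080 to U+009F.

```python
import re

# U+0080..U+009F: the range ISO-8859-1 reserves for control characters.
# Text pasted from a Windows-1252 source puts smart quotes and dashes here.
_C1_RANGE = re.compile(r"[\x80-\x9f]")


def strip_controlchars(text: str) -> str:
    """Drop every character in the 0x80-0x9F range (hypothetical helper)."""
    return _C1_RANGE.sub("", text)
```

Characters outside that range, such as an accented "é" at 0xE9, pass through untouched. The smart quotes are simply lost.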
Inspired by Sam Ruby’s excellent Survival Guide, I recently decided that a more sophisticated approach was desirable. In the new version of the MTStripControlChars plugin,
<$MTEntryBody strip_controlchars="1"$>
works as before, but
<$MTEntryBody strip_controlchars="2"$>
translates the (would-be) Windows-1252 characters into the corresponding Unicode numeric entities.
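A rough Python sketch of this translating mode follows. It is an illustration under stated assumptions, not the plugin's actual code: the name `fix_controlchars` is made up, the entities are emitted in decimal, and the five bytes that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are simply dropped.

```python
import re

_C1_RANGE = re.compile(r"[\x80-\x9f]")


def _to_entity(match: re.Match) -> str:
    """Reinterpret one 0x80-0x9F character as a Windows-1252 byte."""
    byte = bytes([ord(match.group(0))])
    try:
        char = byte.decode("cp1252")
    except UnicodeDecodeError:
        # Undefined in Windows-1252; dropping it is an assumption of this sketch.
        return ""
    return "&#%d;" % ord(char)


def fix_controlchars(text: str) -> str:
    """Replace would-be Windows-1252 characters with numeric entities."""
    return _C1_RANGE.sub(_to_entity, text)
```

For example, `fix_controlchars("\x93Pro\x94")` gives `&#8220;Pro&#8221;`, the entities for the left and right double quotation marks.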
Incorporating this filter into my Comment Preview template, we now automagically fix those mangled smart quotes.
Hopefully, this will lead to a more pleasant user experience.
Update (6/23/2004): See this comment for information on using this plugin with UTF-8 encoded blogs. If your blog uses UTF-8 encoding, then Trackbacks (or legacy data in your database) which contain these (now strictly illegal) characters are still a problem. Unfortunately, UTF-8 is a multi-byte encoding scheme, so fixing the problem is not as simple as stripping out these “bad” bytes.
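To see why byte-stripping fails under UTF-8, consider the short Python sketch below. The function names are hypothetical, and the "decode first" variant is only one plausible workaround, not necessarily the one described in the comment mentioned above. A legitimate left double quote (U+201C) is encoded in UTF-8 as the bytes E2 80 9C, two of which fall in the 0x80 to 0x9F range.

```python
def strip_c1_bytes(data: bytes) -> bytes:
    """Naive approach: drop every byte in 0x80..0x9F."""
    return bytes(b for b in data if not 0x80 <= b <= 0x9F)


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def strip_c1_codepoints(data: bytes) -> bytes:
    """Decode first, then drop U+0080..U+009F as characters, not bytes."""
    text = data.decode("utf-8")
    kept = "".join(ch for ch in text if not 0x80 <= ord(ch) <= 0x9F)
    return kept.encode("utf-8")


# A properly encoded pair of smart quotes: E2 80 9C 48 69 E2 80 9D
quoted = "\u201cHi\u201d".encode("utf-8")
```

Here `strip_c1_bytes(quoted)` leaves a lone E2 lead byte behind, so the result is no longer valid UTF-8, while `strip_c1_codepoints(quoted)` returns the input unchanged because no C1 code points are present.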
Update (7/1/2004): As Sam Ruby points out, there are other “illegal” characters in addition to the above. I’ve updated the plugin to strip those out too. Again, the plugin is currently really only useful if your blog charset is ISO-8859-1.
Posted by distler at April 14, 2004 11:19 PM
TrackBack URL for this Entry: http://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/347
Re: MTStripControlChars
Yes, fixing as opposed to stripping is excellent.
Re: MTStripControlChars
I actually had to implement something like this in PHP for the root page of my web site, which is a rudimentary RSS aggregator of sorts; one of the weblogs that it syndicates was invalidating my page by using Windows-1252 “smart quotes.” It didn’t occur to me to implement the same for my MT weblog (I’ve been happily using your original MTStripControlChars plugin), so thanks again, Jacques.
I do worry that the assumption that the code points Hex80 to Hex9F are always intended as Windows-1252 characters will eventually prove to be false, but I don’t know enough about the issue to make an educated comment, and so far I’ve had no problems.
Re: MTStripControlChars
As this comment form does not define an accept-charset value, user agents may interpret the default for this as the character encoding used to transmit the form itself. Which, for this weblog, is iso-8859-1.
The 0x80 thru 0x9F range is defined as control characters in iso-8859-1.
StripControlChars
Weblog: Movable Type Plugin Directory
Excerpt: Updated to 0.3 to handle additional characters....
Tracked: June 3, 2004 5:31 AM
Keeping a clean database
Do you have any interest in turning this into a 3.0 plugin that hooks comment-save and ping-save, to strip things out before they hit the database (I know your comments are cleaned through the preview, but most people don’t force preview), or should I steal it, er, fork it, um, expand on it myself?
Re: MTStripControlChars
Why does MTStripControl… convert smartquotes and the like to their corresponding decimal codes rather than hex? This generated invalid code when I ran it through the W3C validator. Switching the .pl file conversions to hex fixed that.
Has anyone thought about using this plugin to convert all pasted-in special characters, like non-English characters?
Re: MTStripControlChars
re. last comment from me: I think I meant hex when I said dec.
Re: MTStripControlChars
Thanks. Yes, I meant “ versus “ – the former works fine but generates errors in the W3C validator.
Re: MTStripControlChars
Your plugin works fine for the MTEntriesBody tag but doesn’t work for the Comments, i.e.
$MTCommentBody$ strip_controlchars=”2” doesn’t work.
p.s. I have Greater Than and Less Than signs around the above code, but seems to mess up posting. I tried adding around it.
Suggestions?
Damn MTGoogleSearch!
Weblog: VoIP Blog - VoIP News, Gadgets
Excerpt: I’ve been using MTGoogleSearch for Related Entries on my MovableType blog - and unfortunately some of the related entries have UTF-8 characters in the URL titles which changes my webpage’s default iso-8859-1 encoding to UTF-8. If at least one...
Tracked: April 20, 2005 10:36 AM
Re: MTStripControlChars
Hi,
I’m trying to use this plugin with an installation of mt 2.61. I am using it in replace mode (option “2”). It does not seem to replace the characters. Is there any chance the character set I would like replaced is not the set the app is looking for? Or, am I just using it wrong?
To see what I am trying to do you can look at www dot egrigg9000 dot com slash mtlink .
My favorites:
I’d instead of I’d
“Pro instead of “Pro
Why†instead of Why”
happen… instead of happen:
– instead of – Pro
I am not sure if this app is designed to replace characters for incoming posts only, or if it can apply to the entire archive via change template and rebuild. I have tried both.
Any advice helpful. Thank you.
-EG
Re: MTStripControlChars
How do I run and install this plugin? I have uploaded the file to the plugins directory of movable type (v. 3.17) and I have added the MTEntryBody strip_controlchars=”2” to my template. But nothing happens.
Am I missing something?
Re: MTStripControlChars
Same here…MT 3.17. Does this plugin not work with latest build of MT?
Useful plug-ins
Weblog: Digicraft
Excerpt: Movble Type Plugins Amputator If you want your HTML or XHTML to display smoothly across browsers and platforms, you need to use a little finesse with non-ASCII characters such as curly quotes, or “ä” or “™.” If you view...
Tracked: October 16, 2005 8:45 PM
Useful plug-ins
Weblog: Digicraft
Excerpt: Movble Type Plugins Amputator If you want your HTML or XHTML to display smoothly across browsers and platforms, you need to use a little finesse with non-ASCII characters such as curly quotes, or “ä” or “™.” If you view...
Tracked: November 8, 2005 7:43 PM
MovableType Garbage Characters Problem
Weblog: VoIP & Gadgets Blog
Excerpt: I found a solution to garbage characters showing up in my blog. The solution is to download the MTStripControlChars plugin which essentially translates the (would-be) Windows-1252 characters into the corresponding Unicode numeric entities. This fixed m...
Tracked: January 19, 2006 2:19 PM
In other sysadmin news...
Weblog: Juxtaposition
Excerpt: I also finally started using the excellent PromoteThis plugin to create the nice links to digg. Also, I had to...
Tracked: March 12, 2007 1:07 AM