MTStripControlChars
Probably the top usability complaint with the comment system on this blog is that if you copy and paste some text into your comment, the smart quotes in the pasted text get turned into garbage characters.
This is, alas, a problem common to just about anyone who doesn’t use Windows-1252 encoding as the Charset of their website (does anyone really do that?). The only difference is that my software won’t let you post a comment with those illegal characters in it.
Since this is such a common problem, I long ago wrote a plugin to filter out these garbage characters from Trackbacks, RSS feeds I syndicate, my Technorati Cosmos, etc. I took a pretty brutal approach. The plugin defined a new global tag attribute, strip_controlchars. Adding the attribute to any MT variable substitution tag,
<$MTEntryBody strip_controlchars="1"$>
simply drops characters 0x80 to 0x9F from the content.
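For readers who want to see the effect in byte terms, here is a rough Python sketch of what that stripping amounts to. It is not the plugin's own code; the function name strip_c1_bytes is hypothetical, and the input is assumed to be a byte string in a single-byte charset such as ISO-8859-1.

```python
def strip_c1_bytes(data: bytes) -> bytes:
    """Drop every byte in the range 0x80-0x9F.

    Assumption: `data` is in a single-byte charset (e.g. ISO-8859-1),
    so each byte stands for exactly one character.
    """
    return bytes(b for b in data if not 0x80 <= b <= 0x9F)


# A pasted Windows-1252 "smart" apostrophe arrives as the single byte 0x92.
print(strip_c1_bytes(b"It\x92s here"))  # the apostrophe is simply gone
```

The approach is brutal precisely because the offending character disappears rather than being repaired; legitimate ISO-8859-1 characters such as 0xE9 (é) pass through untouched.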
Inspired by Sam Ruby’s excellent Survival Guide, I recently decided that a more sophisticated approach was desirable. In the new version of the MTStripControlChars plugin,
<$MTEntryBody strip_controlchars="1"$>
works as before, but
<$MTEntryBody strip_controlchars="2"$>
translates the (would-be) Windows-1252 characters into the corresponding Unicode numeric entities.
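A minimal Python sketch of that translation might look like the following. Again, this is an illustration and not the plugin's source: the name cp1252_to_entities is hypothetical, the input is assumed to be ISO-8859-1 bytes, and dropping the five byte values that Windows-1252 leaves undefined is my assumption about a sensible fallback.

```python
def cp1252_to_entities(data: bytes) -> str:
    """Decode ISO-8859-1 bytes, rewriting 0x80-0x9F as numeric entities.

    Bytes in 0x80-0x9F are read as Windows-1252 and emitted as &#NNNN;.
    """
    out = []
    for b in data:
        if 0x80 <= b <= 0x9F:
            try:
                ch = bytes([b]).decode("cp1252")
            except UnicodeDecodeError:
                # 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no Windows-1252
                # meaning; dropping them is an assumption of this sketch.
                continue
            out.append(f"&#{ord(ch)};")
        else:
            # ISO-8859-1 maps each byte directly to the same code point.
            out.append(chr(b))
    return "".join(out)


# 0x93 and 0x94 are the Windows-1252 curly double quotes.
print(cp1252_to_entities(b"\x93fixed\x94"))
```

Under these assumptions the curly quotes come out as &#8220; and &#8221;, which any browser renders correctly regardless of the page's declared charset.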
Incorporating this filter into my Comment Preview template, we now automagically fix those mangled smart quotes.
Hopefully, this will lead to a more pleasant user experience.
Update (6/23/2004): See this comment for information on using this plugin with UTF-8 encoded blogs. If your blog uses UTF-8 encoding, then Trackbacks (or legacy data in your database) which contain these (now strictly illegal) characters are still a problem. Unfortunately, UTF-8 is a multi-byte encoding scheme, so fixing the problem is not as simple as stripping out these “bad” bytes.
Update (7/1/2004): As Sam Ruby points out, there are other “illegal” characters in addition to the above. I’ve updated the plugin to strip those out too. Again, the plugin is currently really only useful if your blog charset is ISO-8859-1.
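To make the UTF-8 difficulty concrete, the Python sketch below shows byte-level stripping wrecking a perfectly valid character, followed by one possible repair that works on decoded code points instead. Both function names are hypothetical, and the code-point approach is only my illustration of the idea; it is not necessarily what the linked comment recommends.

```python
def strip_c1_bytes(data: bytes) -> bytes:
    """Byte-level stripping, as appropriate for a single-byte charset."""
    return bytes(b for b in data if not 0x80 <= b <= 0x9F)


# A correct right single quote (U+2019) is three bytes in UTF-8: E2 80 99.
# Two of them fall in the "bad" range, so stripping leaves a lone lead byte.
valid = "\u2019".encode("utf-8")
damaged = strip_c1_bytes(valid)  # no longer valid UTF-8


def fix_c1_codepoints(text: str) -> str:
    """Repair code points U+0080-U+009F in already-decoded text.

    Assumption: such code points are Windows-1252 bytes that were
    mislabelled as ISO-8859-1 somewhere upstream.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F:
            try:
                out.append(bytes([cp]).decode("cp1252"))
            except UnicodeDecodeError:
                pass  # undefined in Windows-1252: dropped (an assumption)
        else:
            out.append(ch)
    return "".join(out)


print(fix_c1_codepoints("It\u0092s"))
```

The point of the second function is that the test has to be made per character, after decoding, rather than per byte.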
Posted by distler at April 14, 2004 11:19 PM
Re: MTStripControlChars
Yes, fixing as opposed to stripping is excellent.