April 14, 2004

MTStripControlChars

Probably the top usability complaint with the comment system on this blog is that if you copy and paste some text into your comment, the smart quotes in the pasted text get turned into garbage characters.

This is, alas, a problem common to just about anyone who doesn’t use Windows-1252 encoding as the Charset of their website (does anyone really do that?). The only difference is that my software won’t let you post a comment with those illegal characters in it.

Since this is such a common problem, I long ago wrote a plugin to filter out these garbage characters from Trackbacks, RSS feeds I syndicate, my Technorati Cosmos, etc. I took a pretty brutal approach. The plugin defined a new global tag attribute, strip_controlchars. Adding the attribute to any MT variable substitution tag,

<$MTEntryBody strip_controlchars="1"$>

simply drops characters 0x80 to 0x9F from the content.
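As a rough illustration of that brutal approach, here is a Python sketch (not the plugin's actual Perl code; the function name `strip_controlchars` is borrowed from the attribute purely for this example):

```python
def strip_controlchars(data: bytes) -> bytes:
    """Drop every byte in the range 0x80-0x9F (hypothetical helper)."""
    return bytes(b for b in data if not 0x80 <= b <= 0x9F)


# Windows-1252 "smart quotes" (0x93, 0x94) pasted into an ISO-8859-1 page.
pasted = b"\x93smart quotes\x94 and caf\xe9"
cleaned = strip_controlchars(pasted)
print(cleaned.decode("iso-8859-1"))  # smart quotes and café
```

The quotes vanish entirely, while legitimate ISO-8859-1 characters such as 0xE9 are left alone.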

Inspired by Sam Ruby’s excellent Survival Guide, I recently decided that a more sophisticated approach was desirable. In the new version of the MTStripControlChars plugin,

<$MTEntryBody strip_controlchars="1"$>

works as before, but

<$MTEntryBody strip_controlchars="2"$>

translates the (would-be) Windows-1252 characters into the corresponding Unicode numeric entities.
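A minimal Python sketch of the idea (again, not the plugin's Perl source; `remap_controlchars` is a made-up name, and dropping the five bytes that Windows-1252 leaves unassigned is an assumption of this example):

```python
import re

C1_RANGE = re.compile(rb"[\x80-\x9f]")


def remap_controlchars(data: bytes) -> bytes:
    """Replace bytes 0x80-0x9F with the numeric entity of the
    Windows-1252 character they were presumably meant to be."""

    def to_entity(match):
        byte = match.group(0)
        try:
            char = byte.decode("cp1252")
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no Windows-1252 meaning.
            # Assumption: drop them, as the strip mode would.
            return b""
        return b"&#x%04X;" % ord(char)

    return C1_RANGE.sub(to_entity, data)


print(remap_controlchars(b"\x93Pro\x94"))  # b'&#x201C;Pro&#x201D;'
```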

Incorporating this filter into my Comment Preview template, we now automagically fix those mangled smart quotes.

Hopefully, this will lead to a more pleasant user experience.

Update (6/23/2004): See this comment for information on using this plugin with UTF-8 encoded blogs. If your blog uses UTF-8 encoding, then Trackbacks (or legacy data in your database) which contain these (now strictly illegal) characters are still a problem. Unfortunately, UTF-8 is a multi-byte encoding scheme, so fixing the problem is not as simple as stripping out these “bad” bytes.

Update (7/1/2004): As Sam Ruby points out, there are other “illegal” characters in addition to the above. I’ve updated the plugin to strip those out too. Again, the plugin is currently really only useful if your blog charset is ISO-8859-1.

Posted by distler at April 14, 2004 11:19 PM

TrackBack URL for this Entry:   http://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/347

29 Comments & 6 Trackbacks

Re: MTStripControlChars

Yes, fixing as opposed to stripping is excellent.

Posted by: Matt on April 15, 2004 1:14 AM | Permalink | Reply to this

Re: MTStripControlChars

I actually had to implement something like this in PHP for the root page of my web site, which is a rudimentary RSS aggregator of sorts; one of the weblogs that it syndicates was invalidating my page by using Windows-1252 “smart quotes.” It didn’t occur to me to implement the same for my MT weblog (I’ve been happily using your original MTStripControlChars plugin), so thanks again, Jacques.

I do worry that the assumption that the code points Hex80 to Hex9F are always intended as Windows-1252 characters will eventually prove to be false, but I don’t know enough about the issue to make an educated comment, and so far I’ve had no problems.

Posted by: jacob on April 15, 2004 11:08 AM | Permalink | Reply to this

Re: MTStripControlChars

As this comment form does not define an accept-charset value, user agents may interpret the default for this as the character encoding used to transmit the form itself. Which, for this weblog, is iso-8859-1.

The 0x80 thru 0x9F range is defined as control characters in iso-8859-1.
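A quick way to see the difference, sketched in Python with the standard codecs (the byte 0x93 is just a sample value):

```python
import unicodedata

byte = b"\x93"
as_latin1 = byte.decode("iso-8859-1")
as_cp1252 = byte.decode("cp1252")

# In ISO-8859-1 the byte is a control character (category Cc).
print(f"U+{ord(as_latin1):04X}", unicodedata.category(as_latin1))  # U+0093 Cc
# In Windows-1252 the same byte is a left double quotation mark.
print(f"U+{ord(as_cp1252):04X}", unicodedata.category(as_cp1252))  # U+201C Pi
```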

Posted by: Sam Ruby on April 15, 2004 6:38 PM | Permalink | Reply to this

Re: MTStripControlChars

As this comment form does not define an accept-charset value, user agents may interpret the default for this as the character encoding used to transmit the form itself.

That makes perfect sense, as — ultimately — the comment is to appear on the blog individual archive page, which uses the same charset encoding as the form.

Does the problem occur because the form is POSTed with the “wrong” charset (and hence could be cured with an appropriate Accept-charset header)? Or does the problem occur with the charset used internally for copy/paste?

I was under the impression that it was the latter, which is why we are remapping characters server-side.

The 0x80 thru 0x9F range is defined as control characters in iso-8859-1.

Yes, thank you for that clarification.

I should have said that everything I’ve said assumes ISO-8859-1 is the charset in use (true for the vast majority of MT blogs). There will, doubtless, be other character mis-mappings if, say, you are using UTF-8 for your blog.

But I don’t have any experience with what breaks when commenters copy 'n paste on a UTF-8 encoded blog.

Posted by: Jacques Distler on April 15, 2004 8:25 PM | Permalink | PGP Sig | Reply to this

Re: MTStripControlChars

But I don’t have any experience with what breaks when commenters copy ‘n paste on a UTF-8 encoded blog.

The short answer is that MTStripControlChars is not needed on utf-8 encoded blogs. At least IE and Moz both do the right thing.

Posted by: Sam Ruby on April 16, 2004 1:56 PM | Permalink | Reply to this

UTF-8 and MT

The short answer is that MTStripControlChars is not needed on utf-8 encoded blogs.

MovableType has, apparently, other problems with UTF-8. These are irrelevant if you’re serving the comment-form as text/html. But Yuan-Chung’s solution (which, apparently, works) made me dizzy.

Posted by: Jacques Distler on April 16, 2004 8:59 PM | Permalink | PGP Sig | Reply to this

MTStripControlChars and UTF-8

The short answer is that MTStripControlChars is not needed on utf-8 encoded blogs. At least IE and Moz both do the right thing.

Just to clarify, they do the right thing with new comments.

If you already have comments with 0x80 through 0x9F in your database (or receive trackbacks with them), switching to UTF-8 will not solve your problem; it will actually make things worse.

Whereas these byte sequences are defined as control characters in ISO-8859-1, they don’t correspond to anything in UTF-8.

You need, more than ever, to filter them out, or remap them to something sensible. Hence this plugin is still useful for UTF-8 users. It won’t munge any valid UTF-8 content, but will take care of the invalid stuff you, almost inevitably, will have to deal with.

Posted by: Jacques Distler on June 12, 2004 11:40 AM | Permalink | PGP Sig | Reply to this

Re: MTStripControlChars and UTF-8

Whereas these byte sequences are defined as control characters in ISO-8859-1, they don’t correspond to anything in UTF-8.

Whoops! That’s not strictly true. No single-byte UTF-8 character lies in the range 0x80 – 0x9F. But multi-byte UTF-8 characters certainly do contain bytes in this range.

While I pursue a more robust solution for UTF-8, it would be prudent to disable MTStripControlChars if your blog uses UTF-8 encoding.
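To make the danger concrete, here is a small Python sketch (the sample character U+2019 is my choice for illustration): bytes in the range 0x80 to 0x9F turn up inside perfectly valid multi-byte UTF-8 sequences, so dropping them corrupts good data.

```python
text = "\u2019"  # right single quotation mark
encoded = text.encode("utf-8")
print(encoded.hex())  # e28099

# Bytes of this valid sequence that a naive 0x80-0x9F filter would remove.
in_range = [hex(b) for b in encoded if 0x80 <= b <= 0x9F]
print(in_range)  # ['0x80', '0x99']

stripped = bytes(b for b in encoded if not 0x80 <= b <= 0x9F)
try:
    stripped.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
print(still_valid)  # False: the leftover lead byte 0xE2 is not valid on its own
```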

Posted by: Jacques Distler on June 23, 2004 11:36 AM | Permalink | PGP Sig | Reply to this

accept-charset

As this comment form does not define an accept-charset value …

Just so no one thinks me a shirker, I added an accept-charset attribute to this form. Specifically, I set

<form  method="post" accept-charset ="<MTPublishCharset>" ... >

This has absolutely no effect on the munging of smartquotes, at least not when POSTing from Mozilla.

(If you’re a masochist, and want to know why that’s the case, read bug 228779.)

Posted by: Jacques Distler on April 16, 2004 12:29 AM | Permalink | PGP Sig | Reply to this
Read the post StripControlChars
Weblog: Movable Type Plugin Directory
Excerpt: Updated to 0.3 to handle additional characters....
Tracked: June 3, 2004 5:31 AM

Keeping a clean database

Do you have any interest in turning this into a 3.0 plugin that hooks comment-save and ping-save, to strip things out before they hit the database (I know your comments are cleaned through the preview, but most people don’t force preview), or should I steal it, er, fork it, um, expand on it myself?

Posted by: Phil Ringnalda on June 3, 2004 10:53 PM | Permalink | PGP Sig | Reply to this

Re: Keeping a clean database

An excellent idea.

I haven’t explored the new plugin callbacks yet. This sounds like an excellent application.

But I’m not so interested in a 3.0-only plugin, if that’s at all avoidable. Fortunately, you are also the leading expert on writing plugins for 3.0 which are backwards-compatible with 2.x.

So, if you want to collaborate on this, I’d quite like to do it.

Posted by: Jacques Distler on June 3, 2004 11:19 PM | Permalink | PGP Sig | Reply to this

Steal from the best

Heh. Not so much an expert as a good thief.

But, yes, going both ways shouldn’t be a problem. If the eval says we’re in 3.0, add the plugin slug and add a callback for comment save, ping save (and entry save? hmm) to a sub that just calls strip_controlchars with everything that might be ugly, title and body at least, and then either way add the conditional.

I’ve got the weekend off, I should be able to send you something with several boneheaded bugs sometime Saturday.

Posted by: Phil Ringnalda on June 4, 2004 12:32 AM | Permalink | PGP Sig | Reply to this

Re: Steal from the best

Hey, have either of you made any progress on this? I, for one, am very interested in using it on my blog! Please keep me updated! Thanks :)

Posted by: Joshua Kaufman on June 15, 2004 7:14 PM | Permalink | Reply to this

Re: Steal from the best

Could either Phil or Jacques let me know what happened with the “3.0 plugin that hooks comment-save and ping-save, to strip things out before they hit the database” as mentioned above in Phil’s comment? Has anyone worked on it or has it been canned? Thanks!

Posted by: Joshua Kaufman on August 10, 2004 2:06 AM | Permalink | Reply to this

No progress

I haven’t thought about it any further.

For comments, comment-validation is the only real way to ensure that only good stuff gets into the database. The save-hook on the MTStripControlChars plugin is superfluous. Indeed, the plugin’s only there as a courtesy, to automatically fix things which would otherwise be flagged as errors.

For trackbacks, the problem is much more tricky, and MTStripControlChars is, at best, a band-aid solution (and that, only for ISO-8859-1 blogs). One could use it to “fix” stuff going into the database, but — for the present — I’d prefer to leave the stuff in the database untouched, until someone figures out a better solution.

Posted by: Jacques Distler on August 10, 2004 2:19 AM | Permalink | PGP Sig | Reply to this

Re: Steal from the best

I keep thinking “I should remember to think about whether we really can do anything useful (without nuking things like PGP-signed comments)” but thinking about thinking hardly counts as progress, I’m afraid.

Posted by: Phil Ringnalda on August 10, 2004 10:53 AM | Permalink | PGP Sig | Reply to this

Re: MTStripControlChars

Why does MTStripControl… convert smartquotes and the like to their corresponding decimal codes rather than hex? This generated invalid code when I ran it through the W3C validator. Switching the .pl file conversions to hex fixed that.

Has anyone thought about using this plugin to convert all pasted-in special characters, like non-English characters?

Posted by: DK on September 19, 2004 8:08 AM | Permalink | Reply to this

Re: MTStripControlChars

re. last comment from me: I think I meant hex when I said dec.

Posted by: DK on September 19, 2004 8:16 AM | Permalink | Reply to this

Hex versus Decimal

You mean &#x201C; versus &#8220;? They are both valid representations of ‘ “ ’ and neither will give you a lick of trouble.
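A quick check with Python's standard `html` module (used here only as a convenient reference decoder) shows the two forms resolve to the same character:

```python
import html

hex_ref = html.unescape("&#x201C;")
dec_ref = html.unescape("&#8220;")

print(hex_ref == dec_ref == "\u201c")  # True
print(0x201C)  # 8220, the same code point written in decimal
```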

As noted, MTStripControlChars doesn’t really work with UTF-8 encoded blogs (something is needed, even for UTF-8, to filter out invalid data, but I haven’t settled on the best approach). Or you could be having troubles with another, unrelated, feature of text filters.

In recent versions of Perl, the order of execution of text filters (which look like attributes to MT tags, e.g.

<$MTCommentPreviewSubject strip_controlchars="2" remove_html="1" smarty_pants="2"$>

) is random. In some instances, this can cause problems. To force a particular order of execution, you can use the MTBlock plugin:

<MTBlock smarty_pants="2"><$MTCommentPreviewSubject strip_controlchars="2" remove_html="1"$></MTBlock>

as discussed here.

Has anyone thought about using this plugin to convert all pasted-in special characters, like non-English characters?

Any characters entered into the comment form which are not in your declared charset will be automatically converted to numeric entities by the browser when the form is submitted. Or, at least, that’s what the browser is supposed to do (aside from this common screwup between ISO-8859-1 and Windows-1252, but aside from that …).
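The intended browser behaviour can be mimicked in Python with the `xmlcharrefreplace` error handler. This is only an analogy for what a well-behaved browser does on submit, not a description of any browser's internals, and the sample comment text is invented:

```python
# Curly quotes are not in ISO-8859-1; the accented e (U+00E9) is.
comment = "\u201cPro\u201d caf\u00e9"

submitted = comment.encode("iso-8859-1", errors="xmlcharrefreplace")
print(submitted)  # b'&#8220;Pro&#8221; caf\xe9'
```

Characters the charset can represent pass through as single bytes; everything else becomes a decimal character reference.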

Posted by: Jacques Distler on September 19, 2004 10:53 AM | Permalink | PGP Sig | Reply to this

Re: MTStripControlChars

Thanks. Yes, I meant &#x201C; versus &#8220; – the former works fine but generates errors in the W3C validator.

Posted by: DK on September 21, 2004 11:03 AM | Permalink | Reply to this

Huh?

Both hexadecimal and decimal character references are valid (X)HTML, and neither causes problems with the W3C Validator. If you have a page with hexadecimal character references which cause problems with the W3C Validator, please provide a link.

This page has hexadecimal character references on it and it validates just fine.

Posted by: Jacques Distler on September 21, 2004 11:15 AM | Permalink | PGP Sig | Reply to this

Re: MTStripControlChars

Your plugin works fine for the MTEntriesBody tag but doesn’t work for the Comments, i.e.
$MTCommentBody$ strip_controlchars=”2” doesn’t work.

p.s. I have Greater Than and Less Than signs around the above code, but seems to mess up posting. I tried adding around it.

Suggestions?

Posted by: Tom Keating on April 6, 2005 1:09 PM | Permalink | Reply to this

Comments

Works in the comments over here.

If you want it to work in comment previews, you need to set the attribute on <MTCommentPreviewBody>

p.s. I have Greater Than and Less Than signs around the above code, but seems to mess up posting. I tried adding around it.

Since I allow a quite extensive subset of HTML in your comments, I don’t do any escaping for you. If you want “<”, you need to escape it yourself, by typing “&lt;” and so forth.

Posted by: Jacques Distler on April 6, 2005 4:55 PM | Permalink | PGP Sig | Reply to this
Read the post Damn MTGoogleSearch!
Weblog: VoIP Blog - VoIP News, Gadgets
Excerpt: I’ve been using MTGoogleSearch for Related Entries on my MovableType blog - and unfortunately some of the related entries have UTF-8 characters in the URL titles which changes my webpage’s default iso-8859-1 encoding to UTF-8. If at least one...
Tracked: April 20, 2005 10:36 AM

Re: MTStripControlChars

Hi,

I’m trying to use this plugin with an installation of mt 2.61. I am using it in replace mode (option “2”). It does not seem to replace the characters. Is there any chance the character set I would like replaced is not the set the app is looking for? Or, am I just using it wrong?
To see what I am trying to do you can look at www dot egrigg9000 dot com slash mtlink .

My favorites:
I’d instead of I’d
“Pro instead of “Pro
Why†instead of Why”
happen… instead of happen:
– instead of – Pro

I am not sure if this app is designed to replace characters for incoming posts only, or if it can apply to the entire archive via change template and rebuild. I have tried both.

Any advice helpful. Thank you.

-EG

Posted by: Elizabeth Grigg on May 8, 2005 10:53 AM | Permalink | Reply to this

Re: MTStripControlChars

The short answer is that, somehow, your posts are being input using the UTF-8 charset, whereas your templates declare that your blog is written in the ISO-8859-1 charset. In between, you are filtering the output through my plugin.

The purpose of my plugin is to convert Windows-1252 (which Microsoft, hence most other browser makers, pretends is ISO-8859-1) to actual ISO-8859-1. It’s neither useful, nor necessary with UTF-8.

At the cost of, perhaps, boring you to tears, let me explain what happened to the “ ’ ” character you typed.

In Unicode, the “ ’ ” character is designated as U+2019. In UTF-8, that character is represented by a string of 3 bytes, 0xE28099. (I’m using hexadecimal notation, so each byte is represented by two hexadecimal digits, ‘E2’, ‘80’ and ‘99’.)

Let’s see what happens to each of those three bytes when you run them through my plugin and display the result as ISO-8859-1.

  1. ISO-8859-1 is a single-byte charset. The byte 0xE2 is the character “â”.
  2. The byte 0x80 is a control character. My plugin converts it to &#x20AC;, which is the character “€”.
  3. The byte 0x99 is also a control character, and my plugin converts it to &#x2122;, which is the character, “™”.

Now, where my plugin would have been useful is if you had been using Windows-1252 (aka, what Microsoft pretends is ISO-8859-1). In Windows-1252, “ ’ ” is the single byte, 0x92, which is yet another control character in real ISO-8859-1. My plugin would have converted it to &#x2019;, which is the real way to write “ ’ ” in ISO-8859-1.
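The walk-through above can be reproduced with a short Python sketch. The helper `remap_c1` is a hypothetical stand-in for the plugin's mode "2", not its real code, and silently dropping bytes that Windows-1252 leaves unassigned is an assumption:

```python
import html


def remap_c1(data: bytes) -> str:
    """Read bytes as ISO-8859-1, turning 0x80-0x9F into the numeric
    entity of the corresponding Windows-1252 character."""
    out = []
    for b in data:
        if 0x80 <= b <= 0x9F:
            # Assumption: bytes unassigned in Windows-1252 are dropped.
            ch = bytes([b]).decode("cp1252", errors="ignore")
            out.append("&#x%04X;" % ord(ch) if ch else "")
        else:
            out.append(chr(b))  # ISO-8859-1 maps byte N to code point N
    return "".join(out)


# UTF-8 input on an ISO-8859-1 blog: three bytes become three characters.
utf8_bytes = "\u2019".encode("utf-8")  # b'\xe2\x80\x99'
markup = remap_c1(utf8_bytes)
print(markup)                 # â&#x20AC;&#x2122;
print(html.unescape(markup))  # â€™

# Windows-1252 input: the single byte 0x92 becomes the intended character.
print(remap_c1(b"\x92"))      # &#x2019;
```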

The bottom line for you is: use the same charset to publish your blog as you used to compose it. Use either UTF-8 or ISO-8859-1, but don’t mix them.

Clear?

Posted by: Jacques Distler on May 8, 2005 1:22 PM | Permalink | PGP Sig | Reply to this

Re: MTStripControlChars

How do I run and install this plugin? I have uploaded the file to the plugins directory of movable type (v. 3.17) and I have added the MTEntryBody strip_controlchars=”2” to my template. But nothing happens.

Am I missing something?

Posted by: ted on June 30, 2005 6:49 PM | Permalink | Reply to this

Re: MTStripControlChars

Same here…MT 3.17. Does this plugin not work with latest build of MT?

Posted by: Lance on August 8, 2005 1:59 AM | Permalink | Reply to this

Yes, it does

It certainly does work with 3.17, as you can see on this blog.

It is not helpful with UTF-8. It is really intended for use with ISO-8859-1. But it “works” anyway.

Posted by: Jacques Distler on August 8, 2005 2:09 AM | Permalink | PGP Sig | Reply to this
Read the post Useful plug-ins
Weblog: Digicraft
Excerpt: Movble Type Plugins Amputator If you want your HTML or XHTML to display smoothly across browsers and platforms, you need to use a little finesse with non-ASCII characters such as curly quotes, or “ä” or “™.” If you view...
Tracked: October 16, 2005 8:45 PM
Read the post MovableType Garbage Characters Problem
Weblog: VoIP & Gadgets Blog
Excerpt: I found a solution to garbage characters showing up in my blog. The solution is to download the MTStripControlChars plugin which essentially translates the (would-be) Windows-1252 characters into the corresponding Unicode numeric entities. This fixed m...
Tracked: January 19, 2006 2:19 PM
Read the post In other sysadmin news...
Weblog: Juxtaposition
Excerpt: I also finally started using the excellent PromoteThis plugin to create the nice links to digg. Also, I had to...
Tracked: March 12, 2007 1:07 AM

