Instiki Updates
The first Release Candidate of Rails 2.1 is out. To celebrate, I updated Instiki to Rails 2.1RC1. Inscrutably, Rails reports its version number as 2.0.991. Presumably, this will change with the final release of 2.1.
Update (6/2/2008):
And it did. Instiki has been updated to run on the release version of Rails 2.1.I’ve been looking hard at ways to speed up Instiki. There are many things which could be faster, including Ruby itself. Ruby 1.9, I’m told, is considerably faster than 1.8.6. But, for the moment, the biggest offender is one that I’m partially responsible for: the HTML5lib Sanitizer.
Take the Markdown+itex source for this page and run it through the following command
Benchmark.bmbm(20) do |x| x.report("Maruku:") {10.times do; Maruku.new( string.delete("\r").to_utf8, {:math_enabled => true, :math_numbered => ['\\[','\\begin{equation}']} ).to_html; end} x.report("Maruku + Sanitizer:") {10.times do; sanitize_xhtml( Maruku.new(string.delete("\r").to_utf8, {:math_enabled => true, :math_numbered => ['\\[','\\begin{equation}']} ).to_html); end} x.report("Using REXML Trees:") {10.times do; sanitize_rexml( Maruku.new(string.delete("\r").to_utf8, {:math_enabled => true, :math_numbered => ['\\[','\\begin{equation}']} ).to_html_tree); end} end
The result, on golem, is
Rehearsal ------------------------------------------------------ Maruku: 4.210000 0.050000 4.260000 ( 4.304850) Maruku + Sanitizer: 14.190000 0.120000 14.310000 ( 14.593481) Using REXML Trees: 11.250000 0.090000 11.340000 ( 11.565092) -------------------------------------------- total: 29.910000sec
user system total real Maruku: 4.140000 0.030000 4.170000 ( 4.177362) Maruku + Sanitizer: 14.470000 0.130000 14.600000 ( 15.536027) Using REXML Trees: 11.710000 0.090000 11.800000 ( 12.040623)
Maruku takes around 0.4 seconds/iteration to parse the input, convert the itex to MathML, and generate the XHTML+MathML output for the page. Kinda slow, but acceptable. But running the HTML5lib Sanitizer on the result more than triples the total time required1.
That’s pretty horrendous, considering that we’re not talking about a terribly complicated page. I did a little refactoring in Instiki, so that the Sanitizer is run only once (instead of, stupidly, twice) for each page in Instiki. But, still, this is major suckage.
A little profiling
Profiler__::start_profile sanitize_xhtml(Maruku.new(string.delete("\r").to_utf8, {:math_enabled => true, :math_numbered => ['\\[','\\begin{equation}']}).to_html) Profiler__::stop_profile Profiler__::print_profile($stderr)
reveals the problem:
% cumulative self self total time seconds seconds calls ms/call ms/call name 13.76 27.69 27.69 36393 0.76 1.29 HTML5::HTMLInputStream#char 8.41 44.62 16.93 34262 0.49 0.88 Array#include? 7.81 60.34 15.72 472223 0.03 0.03 String#== 5.03 70.46 10.12 15166 0.67 8.42 Array#each 4.43 79.37 8.91 4945 1.80 9.22 HTML5::HTMLInputStream#chars_until 2.78 84.96 5.59 1 5590.00 157270.00 HTML5::HTMLTokenizer#each 2.39 89.77 4.81 99118 0.05 0.07 Hash#[] 1.97 93.73 3.96 3038 1.30 2.17 REXML::Parsers::BaseParser#pull 1.93 97.62 3.89 99360 0.04 0.04 Fixnum#+ 1.88 101.41 3.79 31660 0.12 0.21 Range#=== 1.78 104.99 3.58 3368 1.06 12.70 HTML5::HTMLSanitizeModule.sanitize_token 1.57 108.14 3.15 4352 0.72 14.11 REXML::Formatters::Default#write 1.38 110.91 2.77 64174 0.04 0.04 Fixnum#<=> 1.37 113.67 2.76 66604 0.04 0.04 Module#=== 1.32 116.33 2.66 77423 0.03 0.03 Array#empty? 1.27 118.88 2.55 8518 0.30 0.86 REXML::Element#root 1.12 121.13 2.25 42074 0.05 0.06 String#gsub 1.06 123.26 2.13 4468 0.48 4.72 HTML5::HTMLTokenizer#tag_name_state ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
The HTML5lib Tokenizer (and, in particular, HTML5::HTMLInputStream
) is astonishingly inefficient. I was secretly hoping Ryan would tackle this. I’m really not eager to dive into fixing the Tokenizer myself. Perhaps I should just backport all the improvements from HTML5lib to my old sanitizer, and be done with the matter…
Update (5/22/2008): All that’s old is new again
So I dusted off my old sanitizer, backported all the improvements Sam and I had made to the HTML5lib version, and checked to make sure that it passed all (well, all but one) of the unit tests for the HTML5lib sanitizer.
I didn’t ditch HTML5lib entirely. I still use it to parse and sanitize the content of <nowiki>
tags (which allow you to enter raw (X)HTML, completely bypassing any Markdown, itex2MML or wiki processing). I suppose there are other, faster, HTML parsers though, to be fair, probably the most popular one(s) are not pure Ruby. HTML5lib may be slow, but it is Ruby-native and it is thorough. And <nowiki>
blocks tend to be rare and small.
The output of Maruku comes from serializing a REXML tree, so is theoretically always well-formed. Thus a lower-brow, but more efficient, sanitizer should work fine.
Here’s the result of
parsed_tree = Maruku.new(string.delete("\r").to_utf8, {:math_enabled => true, :math_numbered => ['\\[','\\begin{equation}']} ).to_html_tree parsed = parsed_tree.to_s Benchmark.bmbm(20) do |x| x.report("Maruku:") {10.times do; Maruku.new( string.delete("\r").to_utf8, {:math_enabled => true, :math_numbered => ['\\[','\\begin{equation}']} ).to_html; end} x.report("HTML5 Sanitizer:") {10.times do; sanitize_xhtml(parsed); end} x.report("(using REXML tree):") {10.times do; sanitize_rexml(parsed_tree); end} x.report("Instiki Sanitizer:") {10.times do; xhtml_sanitize(parsed); end} end
As you can see, it beats the pants off HTML5lib, even when you hand the latter a pre-parsed REXML tree.
Rehearsal ------------------------------------------------------ Maruku: 4.080000 0.060000 4.140000 ( 4.165061) HTML5 Sanitizer: 10.370000 0.060000 10.430000 ( 10.509896) (using REXML tree): 7.130000 0.040000 7.170000 ( 7.433864) Instiki Sanitizer: 2.900000 0.020000 2.920000 ( 2.913333) -------------------------------------------- total: 24.660000sec
user system total real Maruku: 4.300000 0.050000 4.350000 ( 4.348464) HTML5 Sanitizer: 10.790000 0.070000 10.860000 ( 11.248804) (using REXML tree): 6.990000 0.020000 7.010000 ( 7.016154) Instiki Sanitizer: 2.830000 0.010000 2.840000 ( 2.853488)
But, more important than benchmarks is real-world performance. The new/old sanitizer cuts in half the total time required to render the Instiki Atom feed.
1 The last line is kinda interesting. Rather than serializing Maruku’s output to a string, and then having HTML5lib reparse it, it’s possible to pass a REXML tree to HTML5lib. Skipping that parsing stage saves about 30%.
Re: Instiki Updates
Yeah, performance is a bitch with html5lib. We’re waiting for someone to make a C/C++ implementation with appropriate language bindings for all the various higher level languages such as Ruby and Python.
At this point the only HTML5 parser with good performance is the Java based one written by Henri Sivonen.