## May 19, 2008

The first Release Candidate of Rails 2.1 is out. To celebrate, I updated Instiki to Rails 2.1RC1. Inscrutably, Rails reports its version number as 2.0.991. Presumably, this will change with the final release of 2.1.

#### Update (6/2/2008):

And it did. Instiki has been updated to run on the release version of Rails 2.1.

I’ve been looking hard at ways to speed up Instiki. There are many things which could be faster, including Ruby itself. Ruby 1.9, I’m told, is considerably faster than 1.8.6. But, for the moment, the biggest offender is one that I’m partially responsible for: the HTML5lib Sanitizer.

Take the Markdown+itex source for this page and run it through the following command

Benchmark.bmbm(20) do |x|
x.report("Maruku:") {10.times do; Maruku.new(
string.delete("\r").to_utf8, {:math_enabled => true,
:math_numbered => ['\\[','\']}
).to_html; end}
x.report("Maruku + Sanitizer:") {10.times do; sanitize_xhtml(
Maruku.new(string.delete("\r").to_utf8,
{:math_enabled => true,
:math_numbered => ['\\[','\']}
).to_html); end}
x.report("Using REXML Trees:") {10.times do; sanitize_rexml(
Maruku.new(string.delete("\r").to_utf8,
{:math_enabled => true,
:math_numbered => ['\\[','\']}
).to_html_tree); end}
end

The result, on golem, is

Rehearsal ------------------------------------------------------
Maruku:              4.210000   0.050000   4.260000 (  4.304850)
Maruku + Sanitizer: 14.190000   0.120000  14.310000 ( 14.593481)
Using REXML Trees:  11.250000   0.090000  11.340000 ( 11.565092)
-------------------------------------------- total: 29.910000sec
user     system      total        real
Maruku:              4.140000   0.030000   4.170000 (  4.177362)
Maruku + Sanitizer: 14.470000   0.130000  14.600000 ( 15.536027)
Using REXML Trees:  11.710000   0.090000  11.800000 ( 12.040623)

Maruku takes around 0.4 seconds/iteration to parse the input, convert the itex to MathML, and generate the XHTML+MathML output for the page. Kinda slow, but acceptable. But running the HTML5lib Sanitizer on the result more than triples the total time required1.

That’s pretty horrendous, considering that we’re not talking about a terribly complicated page. I did a little refactoring in Instiki, so that the Sanitizer is run only once (instead of, stupidly, twice) for each page in Instiki. But, still, this is major suckage.

A little profiling

Profiler__::start_profile
sanitize_xhtml(Maruku.new(string.delete("\r").to_utf8,
{:math_enabled => true,
:math_numbered => ['\\[','\']}).to_html)
Profiler__::stop_profile
Profiler__::print_profile(\$stderr)

reveals the problem:

  %   cumulative   self              self     total
time   seconds   seconds    calls  ms/call  ms/call  name
13.76    27.69     27.69    36393     0.76     1.29  HTML5::HTMLInputStream#char
8.41    44.62     16.93    34262     0.49     0.88  Array#include?
7.81    60.34     15.72   472223     0.03     0.03  String#==
5.03    70.46     10.12    15166     0.67     8.42  Array#each
4.43    79.37      8.91     4945     1.80     9.22  HTML5::HTMLInputStream#chars_until
2.78    84.96      5.59        1  5590.00 157270.00  HTML5::HTMLTokenizer#each
2.39    89.77      4.81    99118     0.05     0.07  Hash#[]
1.97    93.73      3.96     3038     1.30     2.17  REXML::Parsers::BaseParser#pull
1.93    97.62      3.89    99360     0.04     0.04  Fixnum#+
1.88   101.41      3.79    31660     0.12     0.21  Range#===
1.78   104.99      3.58     3368     1.06    12.70  HTML5::HTMLSanitizeModule.sanitize_token
1.57   108.14      3.15     4352     0.72    14.11  REXML::Formatters::Default#write
1.38   110.91      2.77    64174     0.04     0.04  Fixnum#<=>
1.37   113.67      2.76    66604     0.04     0.04  Module#===
1.32   116.33      2.66    77423     0.03     0.03  Array#empty?
1.27   118.88      2.55     8518     0.30     0.86  REXML::Element#root
1.12   121.13      2.25    42074     0.05     0.06  String#gsub
1.06   123.26      2.13     4468     0.48     4.72  HTML5::HTMLTokenizer#tag_name_state
⋮       ⋮         ⋮        ⋮      ⋮        ⋮        ⋮

The HTML5lib Tokenizer (and, in particular, HTML5::HTMLInputStream) is astonishingly inefficient. I was secretly hoping Ryan would tackle this. I’m really not eager to dive into fixing the Tokenizer myself. Perhaps I should just backport all the improvements from HTML5lib to my old sanitizer, and be done with the matter…

#### Update (5/22/2008): All that’s old is new again

So I dusted off my old sanitizer, backported all the improvements Sam and I had made to the HTML5lib version, and checked to make sure that it passed all (well, all but one) of the unit tests for the HTML5lib sanitizer.

I didn’t ditch HTML5lib entirely. I still use it to parse and sanitize the content of <nowiki> tags (which allow you to enter raw (X)HTML, completely bypassing any Markdown, itex2MML or wiki processing). I suppose there are other, faster, HTML parsers though, to be fair, probably the most popular one(s) are not pure Ruby. HTML5lib may be slow, but it is Ruby-native and it is thorough. And <nowiki> blocks tend to be rare and small.

The output of Maruku comes from serializing a REXML tree, so is theoretically always well-formed. Thus a lower-brow, but more efficient, sanitizer should work fine.

Here’s the result of


parsed_tree = Maruku.new(string.delete("\r").to_utf8,
{:math_enabled => true,
:math_numbered => ['\\[','\']}
).to_html_tree
parsed = parsed_tree.to_s
Benchmark.bmbm(20) do |x|
x.report("Maruku:") {10.times do; Maruku.new(
string.delete("\r").to_utf8, {:math_enabled => true,
:math_numbered => ['\\[','\']}
).to_html; end}
x.report("HTML5 Sanitizer:") {10.times do;
sanitize_xhtml(parsed); end}
x.report("(using REXML tree):") {10.times do;
sanitize_rexml(parsed_tree); end}
x.report("Instiki Sanitizer:")  {10.times do;
xhtml_sanitize(parsed); end}
end

As you can see, it beats the pants off HTML5lib, even when you hand the latter a pre-parsed REXML tree.

Rehearsal ------------------------------------------------------
Maruku:              4.080000   0.060000   4.140000 (  4.165061)
HTML5 Sanitizer:    10.370000   0.060000  10.430000 ( 10.509896)
(using REXML tree):  7.130000   0.040000   7.170000 (  7.433864)
Instiki Sanitizer:   2.900000   0.020000   2.920000 (  2.913333)
-------------------------------------------- total: 24.660000sec
user     system      total        real
Maruku:              4.300000   0.050000   4.350000 (  4.348464)
HTML5 Sanitizer:    10.790000   0.070000  10.860000 ( 11.248804)
(using REXML tree):  6.990000   0.020000   7.010000 (  7.016154)
Instiki Sanitizer:   2.830000   0.010000   2.840000 (  2.853488)

But, more important than benchmarks is real-world performance. The new/old sanitizer cuts in half the total time required to render the Instiki Atom feed.

1 The last line is kinda interesting. Rather than serializing Maruku’s output to a string, and then having HTML5lib reparse it, it’s possible to pass a REXML tree to HTML5lib. Skipping that parsing stage saves about 30%.

Posted by distler at May 19, 2008 10:56 AM

TrackBack URL for this Entry:   http://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/1689

Yeah, performance is a bitch with html5lib. We’re waiting for someone to make a C/C++ implementation with appropriate language bindings for all the various higher level languages such as Ruby and Python.

At this point the only HTML5 parser with good performance is the Java based one written by Henri Sivonen.

Posted by: Anne van Kesteren on May 19, 2008 1:44 PM | Permalink | Reply to this

Yeah, it’s pretty well known that html5lib performance sucks for this very reason. It is possible that I’m just unusually bad at this but I don’t think it’s trivial (in the normal English sense of the word) to make it faster subject to the constraints that:
a) We remain pure Python (/Ruby)
b) We continue to pass all the tests, including generating the correct parse errors
c) We continue to track line and column numbers
Given this fact, and the fact that even a naive low-level-language implementation seems to be about 1-2 orders of magnitude faster than html5lib, the most efficient solution, as Anne says is to wait for a C(++) implementation with appropriate bindings. Unfortunately I don’t have nearly enough time to work on such a think at the moment; fortunately someone else is working on a lxml-compatible C based implementation as a summer project (possibly Google Summer of Code, I forget).

Posted by: jgraham on May 19, 2008 2:48 PM | Permalink | Reply to this

Given this fact, and the fact that even a naive low-level-language implementation seems to be about 1-2 orders of magnitude faster than html5lib, the most efficient solution, as Anne says is to wait for a C(++) implementation with appropriate bindings.

In order to enable interesting things purely on the Java side, I recently refactored the tokenizer of the Validator.nu HTML parser in a way that also makes it suitable for line-by-line porting to C++.

I’m refactoring the tree builder similarly—primarily to make SVG camel case fixing zero-cost in performance and to make HTML elements work faster, too.

I believe that after this refactoring, it would be less work to write a project specific Java to C++ source-to-source translator than to write a performing and correct HTML5 parser in C++ from scratch. (Assuming that you take an AST-building Java parser as an off-the-shelf component and write an AST visitor that emits C++ only for the language features used by the Validator.nu HTML parser.)

Posted by: Henri Sivonen on May 19, 2008 3:33 PM | Permalink | Reply to this

### Recoding in C

Hpricot achieves pretty decent performance with a mostly native Ruby implementation (the scanner is written in C).

The thing that bugs me is that, even walking a pre-parsed REXML tree, HTML5lib performs poorly.

Posted by: Jacques Distler on May 21, 2008 2:42 AM | Permalink | PGP Sig | Reply to this

### Off-topic: Atom feed

Hello Jacques, is your summary+full content feed on http://golem.ph.utexas.edu/~distler/blog/atom10.xml quite OK?

Posted by: Phil Wilson on May 21, 2008 6:32 PM | Permalink | Reply to this

### Re: Off-topic: Atom feed

Looks OK to me. Are you having trouble with it?

Posted by: Jacques Distler on May 21, 2008 7:48 PM | Permalink | PGP Sig | Reply to this

Post a New Comment