# nlab

50 posts, 3 voices

I’ve now upgraded the nLab server and reinstalled everything. In particular, I pulled a fresh copy of instiki from the repository and made only two changes:

1. Using mysql as the database
2. Disabled the DNSBL spam check

In particular, I’m not mucking about with the logs just yet. We’ll see how this goes for now.

Now, should I enable the daily check on the instiki source code, or shall I keep that off for the time being and update manually?

Let’s keep it “off” for the time-being; I don’t have any plans to change anything for the next week, at least.

Do you have any other scripts/cron-jobs running?

I have a script that exports the day’s revisions to a set of bzr repositories (this is to make it easier for people to keep incremental backups of their webs). I haven’t reinstalled it as a cron job yet, and again I can run it manually.

Root has its usual cron jobs (at least, it does now: in the installed image then it doesn’t run them but lets anacron do it, which is fine except that on a server, anacron isn’t running; however, cron just checks for the existence of anacron, not whether or not it is running). But that’s all.

Okay, so after a day’s trading at the nLab, here are the closing prices. This is with nothing disabled.

atom_with_content: 192 requests, 176 taking less than 10s to process, 16 taking 50s to process.

list: 6 requests, 2 taking less than 10s to process, 4 taking 5 minutes to process.

Total number of requests: 4223, breakdown of time taken to process (rounded down to nearest 10s):

  0: 4142
1: 47
2: 8
3: 3
4: 1
5: 17
6: 1
7: 1
30: 1
31: 2
32: 1

So those atom_with_content and list are a major part of the requests that take longer than 30s to deliver.

But it didn’t crash, or become otherwise unresponsive. Which is better than what you were doing before your “clean” install.

Some actions, like list, are $O\left(N\right)$. With a Wiki your size, they will take a long time to complete1. If you want prevent people from calling the list action (or other $O\left(N\right)$ actions), I suggest blocking/redirecting it at the Apache level.

I assume you know how to use mod_rewrite to good effect.

But it appears that the nLab operates acceptably, even when you allow such actions.

1. Incorporating will_paginate (which is used by Heterotic Beast, for just this reason) in those actions will make them $O\left(1\right)$, again.

I was blocking them at the Apache level. That seemed to be what you were complaining about! Maybe I misunderstood.

I can understand list being so big, but atom_with_content is limited to the last … whatever … revisions, so that shouldn’t be so big, should it?

(I think that to get a better picture from the logs then I need to figure out a way to distinguish those that got cached from those that needed a serious operation. But as I’m back on “vanilla” instiki, my hacks to allow this have been taken out.)

Incidentally, the nLab had crashed when I came in this morning, but it’s been running fairly stably all day. So that might just have been first night jitters (or the rampaging googlebot that came through at about midnight).

atom_with_content renders the last 15 updated pages. So it should take approximately 15 times as long as it takes to render 1 page.

Incidentally, the nLab had crashed when I came in this morning…

Ouch. OK. Now I am puzzled.

Is there a process-limit that gets exceeded on your VPS? (Seems unlikely: if there were too many active processes, you wouldn’t be able to login.) Could it be that Passenger is killing off a long-running response, because it thinks Instiki is “inactive”. If so, try setting PassengerPoolIdleTime to 0.

The default (300) clearly won’t work for you if

1. A list request (say) takes more than 300s to complete.
2. That’s the only request received during the 5-minute period in question. You receive, on average, 3 requests/minute, so it’s not inconceivable that you go for long stretches with no requests at all.

I’ve set that one that kills off processes after 20 requests (PassengerMaxRequests), and I’ve set it up to allow 10 concurrent processes (PassengerMaxPoolSize).

I wouldn’t count this morning’s crash as anything special. More likely just teething problems.

I wouldn’t count this morning’s crash as anything special.

You seem to be inured to the idea of Instiki crashing. I am not. It shouldn’t crash, and there’s something wrong if it does. I’m not even convinced that my PassengerPoolIdleTime theory explains the phenomenon. Let’s see if it can go a week without hiccup.

It’s even possible that it didn’t crash this morning. It was very slow loading the main page so I restarted it. I couldn’t see from the logs anything special, though.

I’m not inured to instiki crashing! I would love to get to the bottom of it, but it’s hard to know what to monitor, and how to monitor it (especially as it may have been my monitoring procedures that contributed to its crashing).

As you say, let’s see how long it can last.

By the way, how many log files does it keep? Does it just keep renumbering them each time it rotates them? If so, I’d better move them out of the log directory each day.

It keeps 25 files, of size 1MB each.

Both of these numbers are configurable.

Yes, I was thinking that 1Mb was a bit low. When the googlebot hit last night then it was creating a new log file every 20 minutes or so!

Right, so on the crashing then I’m just watching the new system and waiting to see what it does and how it responds. I’ll keep a hold of all the logs for statistical purposes (not that I’ve any real idea about statistics …).

One thing that I am pretty sure that slows down a page load is if the page has a lot of wikilinks on it. I don’t know how it checks all the links, but is there some way that that could be speeded up? Urs has some pages with loads and loads of links, and there’s talk of having some pages where everything possible is linked (with CSS to lessen the visual impact).

One thing that I am pretty sure that slows down a page load is if the page has a lot of wikilinks on it. I don’t know how it checks all the links, but is there some way that that could be speeded up?

Could you compare (by deleting the cached page and reloading ) how long it takes to build a wikilink-heavy page, versus a “normal” one?

I’m particularly interested in the database-lookup times. As I said via email, the WikiReferences model uses a lot of raw SQL queries (which, therefore, do not benefit from ActiveRecord caching). But I am a little skeptical that is the cause of much of a slowdown.

Okay, let’s take http://ncatlab.org/nlab/show/smooth%20infinity-groupoid%20--%20structures which, according to grep, has the order of 409 wikilinks. (Actually, it has 409 hits for the string [[.) On first load, it was already cached and produced the following figures: 169ms first time, 62ms second. I got 0.026s for “time to first byte” and 0.663s for “from click to complete display”.

Now I delete it from the cache, and try again. A cup-of-tea later, and I get the following: 74074ms (View 72773, DB: 1276). On the receiving end, I get 82s and 87s for the delivery times. Second time, similar figures.

The time this gets a bit annoying is when editing a page, since then it has to regenerate it each time. That’s a fair wait if you’ve only changed a couple of spelling mistakes.

The “print” view wasn’t much faster: 71303ms.

On golem (the iMac on my desk), rendering a copy of that page takes

Completed in 27588ms (View: 27356, DB: 192)

Note that it does spend (what I consider to be) a significant amount of time querying the database, but it is totally dwarfed by the rendering time. (I don’t know why yours is spending an order of magnitude longer querying the database. Seems like something’s very wrong, there, even though the conclusion is the same.)

The page, of course, contains a number of markup errors, like

... [[transversal map]s ...

which sent Maruku into convulsions. Surprisingly, correcting those errors did not appreciably affect the rendering time (the above-reported time is after making the relevant markup corrections).

On my laptop, a typical time was

Completed in 48775ms (View: 48635, DB: 77)

SQlite3 is faster than MySQL, but the machine itself is significantly slower than the iMac.

Of those 49 seconds, spent rendering the page, 43 of them were spent in Maruku (for obscure reasons, Maruku has to be run twice, so really, we’re talking about 22 seconds to process the 175KB source).

Maruku doesn’t particularly care about the number of WikiLinks, so that has nothing to do with why it takes so long render this particular page.

Of the remaining 6 seconds, 4 seconds were spent in the Instiki Sanitizer. I don’t think there’s much to optimize there.

The remaining 2 seconds were, largely, spent in the Chunk-Handler – the thing that processes Wikilinks (and, presumably, cares about how many of them there are). 2 seconds is still a long time, but it’s not surprising. Doing on the order of ${10}^{3}$ RegExp substitutions (5360 chunk-masking and 686 chunk-unmasking operations, to be precise) on a 175KB string, takes significant time. Using Regexps to process long strings sucks.

I have looked at various optimizations of the Chunk-Handler code, but nothing I can do will contribute much to the speedup of rendering this page, which is dominated by Maruku.

Now, if one of your nLab folks were to volunteer to write a PEG grammar for Maruku’s extended Markdown syntax, …

#### Update:

Now, if one of your nLab folks were to volunteer to write a PEG grammar for Maruku’s extended Markdown syntax, …

Since I’m not gonna hold my breath for that to happen, I decided to spend some time (alas, more than I expected) making Maruku faster. The new rendering times for that page are

Completed in 13228ms (View: 13024, DB: 198)

on golem and

Completed in 21666ms (View: 20979, DB: 83)

on my laptop. Roughly a factor of 2 improvement in the total rendering time, in both cases.

Still not great, but it’s the best that I am going to achieve.

Now that I see the break-down, I agree with you that it’s a waste of time optimising the wikilinks for now. That’s an astonishing amount of time for maruku to take! I wonder how much of that is itex; I guess I can test that for myself by running maruku on a few files on my machine here, with and without itex.

I could ask on the nForum about volunteers for writing a PEG grammar. That sounds like a good, specific task that someone might just be willing to do. Shall I ask?

Do you have a clear idea of which bits of maruku’s syntax are missing, or would the task involve determining that?

Pretty much everything beyond the standard Markdown syntax needs to be written, though some folks in the peg-markdown network seem to have included Michel Fortin’s Markdown-Extra extensions (albeit, along with some of their own, incompatible, extensions).

Presumably, this fork of peg-markdown would be directly linked to the itex2MML functions which process inline and display equations (ie, it would not use the Ruby bindings provided by the itextomml gem). So there’s a little bit more to do than write the peg grammar. But not a lot more …

As to whether you want to ask someone to do this, I’ve already explained my desire to replace Maruku (for licensing reasons). Here’s another motivation, from efficiency. Instiki’s performance does genuinely suck, in this instance.

The question is, are they (your nlab colleagues) willing to do anything about improving it?

Feel free to point them to this discussion, and to the previous one on Markdown alternatives.

Did a lot of profiling this weekend, and produced a few tweaks to Maruku’s parsing, which speeded it up a little.

Unfortunately, the main discovery was that (with that test page as input), 3/4 of Maruku’s time is spent in the #to_html output method; only 1/4 is spent in parsing the original input. Thus, my efforts, which maybe improved the parsing speed by 5%, contributed at best a 1% speedup in the total Instiki processing time, i.e something you would never notice.

I hope that one of your guys finds formal grammars sufficiently “categorical” to be worthy of a small bit of their attention.

Bother. I can’t count. I saw that the last log file was numbered 23 and assumed that that meant I hadn’t missed any log files. Sadly not. So my logs aren’t complete. Ah well. Incidentally, why name them numerically? Why not name them by a date-and-time stamp? Then they wouldn’t have to all be renamed when the next one is created.

Thinking about Recently Revised and All Pages, you suggested (somewhere) taking them out of the sweeper as a way of stopping them being regenerated every time a page is edited (I don’t know if this was one of you “If you’re going to do something crazy, here’s a way of limiting how crazy you’re going to be” suggestions or if you thought this was actually a good idea). Then I’d have to manually regenerate them every, say, hour by deleting the cached copy so that the next hit recreated it. It occurred to me that the same cron job that deleted the cache could also hit the page in the server to force the regeneration. It then occurred to me that it was silly having that go via the webserver when it was on the same machine as the program.

With my current knowledge of instiki, what I thought of for getting round this was to have an instiki process running invoked “from the command line” and listening on, say, port 2500. I block that port from all outside traffic and use it only for localhost. Then the cron job hits that port, avoiding the webserver. The alternative would be to have it so that I could call instiki nlab/recentlyrevised on the commandline, whereupon instiki fires up, renders the page, and disappears in to the ether again.

Thinking about Recently Revised and All Pages, you suggested (somewhere) taking them out of the sweeper as a way of stopping them being regenerated every time a page is edited (I don’t know if this was one of you “If you’re going to do something crazy, here’s a way of limiting how crazy you’re going to be” suggestions or if you thought this was actually a good idea).

The former. You’re trading off workload on the server for stale data. Since computers are supposed to serve humans, rather than the other way around, the question is: does this improve the user experience?

Say you implement the above suggestion. On the one hand, the user always (or almost always, depending on implementation) receives the cached page, i.e. gets a quick response. On the other hand, the data is invariably stale.

Leaving these alone, the user is guaranteed to receive fresh data, but there could be a significant delay, if the page has to be regenerated. What percentage of requests for these pages hit the cache?

A better solution is to pull in the will_paginate gem, and paginate the data returned. That makes the request $O\left(1\right)$, again, instead of $O\left(N\right)$. So the user gets both a quick response AND up-to-date data.

As to moving away from Maruku, to some peg-markdown-based clone, note that

1. This will benefit Heterotic Beast as well.
2. The task I’m asking of your guys is not “programming,” per se. It involves writing a formal PEG grammar for Maruku’s extended Markdown syntax (starting with the existing PEG grammar, already in peg-markdown).
3. On the other hand, if there were someone handy in C, that would be most appreciated, too, because I am crappy at C. Hooking in itex2MML would take mere minutes for a competent C-programmer.
4. I imagine that a such a fork of peg-markdown (as it’s written in C), would be useful in your other projects, as well.

(Not so much a bug, but also not a feature request, so gets put here.)

Is there a reason why the italic style in theorems is done via:

.num_theorem * {
font-style: italic;
}

rather than:

.num_theorem {
font-style: italic;
}

The * makes it very hard to override the italic selection. For example, something like:

<p style="font-style: normal;"><a href="somewhere">link</a></p>

will come up as italic, I think. Actually, I can test it here:

###### Theorem

Yes, it did. So the normal CSS inheritance is effectively bypassed by the * selector. To override it, I need to put another * selector in place but I can only do that in the web CSS and not in the style on a div.

I’m generally reluctant to modify stuff that you’ve put in place! Is there a reason for the *, or can it be dropped?

Works for me.

###### Theorem

This is in italics. This is normal. And this is italic, again.

This whole paragraph is normal.

This paragraph is in italics.

was generated by

+-- {: .num_theorem}
###### Theorem ######
This is in italics. *This is normal.*{: style="font-style: normal;"} And this is italic, again.

This whole paragraph is normal.
{: style="font-style: normal"}

This paragraph is in italics.
=--

Aside from the kludgy bit about applying styles to spans, I have no idea what your issue is.

It’s about inheritance. Let me do my example again.

+-- {: .num_theorem}
###### Theorem ######
This is italic text.  *This is normal, but [this is not](#top).*{: style="font-style: normal;"}.
=--
###### Theorem

This is italic text. This is normal, but this is not..

The link should inherit the font-style: normal; from the span but it doesn’t because it matches the CSS rule .num_theorem *.

Ah. I see. I was deceived by your example

###### Theorem

This is italic text. But this is not.

+-- {: .num_theorem}
###### Theorem ######
This is italic text.  But [this is not](#top){: style="font-style: normal;"}.
=--

In general, the more specific CSS selector wins, and there was some circumstance where inheritance from .num_theorem was not sufficient. I needed the more specific .num_theorem *. But, for the life of me, I can’t recall the details.

If you can’t remember, then I’ll experiment with taking it out and see who complains, and about what.

Urs is complaining about slow times again …

Looking at the logs, I see a heck of a log of “Expired fragment”s. Do these add to a page’s rendering time? Or does the cache sweeper act non-synchronously (or whatever it’s called)?

For example, in the logs I see 815 lines saying “Expired fragment:” and the request that it seems to be a part of (can’t be totally sure due to several processes running simultaneously) takes about 6s to render. A more systematic search of the logs shows it’s not guaranteed, but often the case that the following request is slow - and the times when it isn’t could be due to it being a parallel process (would need a more sophisticated search).

The “Expired fragment” log-entries all have times (in ms) attached to them, so you would be in a better position than I to tell how long they took to execute.

But, yes, when you ask for new features (like source views for every revision), that means more pages to expire, when an edited page is saved.

Okay, so I just did a crude time count. I do have some lists totalling in the order of seconds. I get one at 20s, three at 10s, and about 30 over 1s. (Usual caveats that I can’t tell that all of these are from the same request. They occur concurrently in the logs.)