Software Forum

distler Moderator 123 posts

edited almost 14 years ago

On my machine, each Expired fragment takes 0.1-0.2 ms. So to get 10-20 s, you’re talking about expiring $\sim 10^{5}$ fragments?!

Wow! That’s impressive.

posted almost 14 years ago

Andrew Stacey 118 posts

On occasion an expiration takes longer than 0.2ms, and when the number is in the thousands then the likelihood of that happening increases considerably, so $10^{5}$ is an overestimate. The largest number that I see in my slightly refined test is about 6000. That takes about 3s normally. In this log run, I had one taking 7s with 1400 expirations.
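A rough check of these figures, using only the timings quoted in this thread: at 0.1-0.2 ms per fragment, 10-20 s would correspond to between

$$
\frac{10\,\mathrm{s}}{0.2\,\mathrm{ms}} = 5\times 10^{4}
\quad\text{and}\quad
\frac{20\,\mathrm{s}}{0.1\,\mathrm{ms}} = 2\times 10^{5}
$$

fragments. The observed runs instead work out to

$$
\frac{3\,\mathrm{s}}{6000} = 0.5\,\mathrm{ms}
\qquad\text{and}\qquad
\frac{7\,\mathrm{s}}{1400} = 5\,\mathrm{ms}
$$

per expiration, which is why far fewer than $10^{5}$ expirations are enough to produce a multi-second delay.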

What are the rules for which pages need fragments expiring when a page is saved? I’m getting a heck of a lot of expired fragments in the logs. Looking a little further, the above figures are underestimates because they don’t take into account the fact that the logs might be split over several files, or be separated by the logs for other requests.

I have one log file consisting of 16876 lines. 15581 of them are ‘Expired fragment’s. There appear to be quite a lot of duplicates as well: in that lot, `views/nlab/list/people` gets expired 28 times in a row. It seems to be the `list` and `recently_revised` ones that get expired several times in a row.
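For anyone wanting to repeat this kind of count, here is a minimal sketch of tallying expirations and consecutive repeats. It assumes each relevant log line contains the text `Expired fragment: <key>`; the exact wording of the real log lines may differ, and the sample lines are made up.

```python
import re
from collections import Counter
from itertools import groupby

# Assumed log format; adjust the pattern to match the real log lines.
EXPIRED = re.compile(r"Expired fragment:\s*(\S+)")

def expired_keys(lines):
    """Return the fragment keys from 'Expired fragment' lines, in log order."""
    keys = []
    for line in lines:
        match = EXPIRED.search(line)
        if match:
            keys.append(match.group(1))
    return keys

def longest_runs(keys):
    """Map each key to the length of its longest consecutive run."""
    runs = Counter()
    for key, group in groupby(keys):
        runs[key] = max(runs[key], sum(1 for _ in group))
    return runs

# Made-up sample lines, for illustration only.
sample = [
    "Expired fragment: views/nlab/list/people",
    "Expired fragment: views/nlab/list/people",
    "Expired fragment: views/nlab/list/people",
    "some unrelated log line",
    "Expired fragment: views/nlab/recently_revised",
    "Expired fragment: views/nlab/list/people",
]

keys = expired_keys(sample)
totals = Counter(keys)      # how often each fragment was expired overall
runs = longest_runs(keys)   # longest back-to-back repeat per fragment
```

Note that unrelated lines are dropped before runs are measured, so a run here means consecutive expirations with no other expiration in between.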

posted almost 14 years ago

distler Moderator 123 posts

It seems to be the list and recently_revised ones that get expired several times in a row.

That would be a consequence of

1. When a page is saved, expire all pages that reference that page.
2. When you expire a page, also expire the corresponding “index pages” (`list`, `recently-revised`, atom feeds).

The first is further complicated by the facility for renaming pages. That means we need to expire all the pages that refer to the old page and all the pages that refer to the new page.

I guess that could be optimized better for the case where the page doesn’t change names, as we don’t have to expire the same pages twice. I think the current procedure was motivated by complaints (from y’all) that, in some circumstances, pages were not being expired when they should.

posted almost 14 years ago

Andrew Stacey 118 posts

I want it all (hey-yeah)
I want it all (hey-yeah)
I want it all
And I want it now!

Given the number of links that Urs has on some of the nLab pages then I think this might well be a case for optimisation. If the page name doesn’t change, surely then you don’t have to expire any of the pages that refer to it? So it’s not a “don’t have to expire twice”, it’s a “don’t have to expire once”, isn’t it? Or am I missing something.

posted almost 14 years ago

distler Moderator 123 posts

Given the number of links that Urs has on some of the nLab pages…

It’s the number of inbound links that matters, but yeah.

If the page name doesn’t change, surely then you don’t have to expire any of the pages that refer to it?

You do, for a newly-created page… but not, I agree, for a revision of an existing page. I was, somewhat crudely, not distinguishing between those cases. It occurs to me that I can use an `after_create` hook to distinguish between those cases.
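The distinction being drawn here can be sketched as follows. This is only an illustration of the rules as described in this thread, not Instiki's actual code (which is Ruby); the function name, the `referrers` mapping, and the example page names are all hypothetical.

```python
def pages_to_expire(name, referrers, is_new, old_name=None):
    """Return the set of page names whose cached fragments should be expired.

    referrers: dict mapping a page name to the set of pages linking to it
    is_new:    True for a newly created page, False for a revision
    old_name:  previous name if this save renamed the page, else None
    """
    expire = {name}
    renamed = old_name is not None and old_name != name
    if is_new or renamed:
        # New or renamed page: pages that link to `name` must be re-rendered.
        expire |= referrers.get(name, set())
    if renamed:
        # Pages that linked to the old name are affected as well.
        expire.add(old_name)
        expire |= referrers.get(old_name, set())
    # Using a set means no page is expired twice within one save.
    return expire

# Hypothetical link graph: B and C link to A; C and D link to Old.
referrers = {"A": {"B", "C"}, "Old": {"C", "D"}}

revision = pages_to_expire("A", referrers, is_new=False)
creation = pages_to_expire("A", referrers, is_new=True)
rename = pages_to_expire("A", referrers, is_new=False, old_name="Old")
```

With this graph, a plain revision expires only the page itself, while creation and renaming pull in the inbound links. The index pages (list, recently-revised, atom feeds) would then be expired once per save rather than once per expired page.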

Thanks.

posted almost 14 years ago

Andrew Stacey 118 posts

No, no! Thank you. Let’s see if this makes Urs a little happier!

posted almost 14 years ago

Andrew Stacey 118 posts

Back on the CSS thing. I’m going to experiment with taking it out on my course wiki (safer than on the nLab). Since it’s in the main instiki file I guess I have to take it out system-wide (though I could put it back on a per-web basis, I guess). What’s the safest way to do that given that this is a file in the VCS? Should I comment out the line, or delete it?

(I want to avoid - as much as possible - breaking things when I do a `bzr pull`)

posted almost 14 years ago

admin Administrator 63 posts

I suppose there is a marginally higher probability that a ‘merge’ will succeed with some lines commented-out, instead of deleted.

But I expect that it’s a small effect; hardly worth obsessing over.

posted 13 years ago

Andrew Stacey 118 posts

This isn’t nLab-specific, but it’s neither a bug nor a feature request: more of a “How do I?”.

There’s an effect that I’d like to put on a page (or a family of pages). It’s achievable in CSS using some fancy pseudo-classes, but some browsers don’t support it (notably mobile browsers) so I was pondering a javascript solution. Essentially, it would just modify some CSS properties of certain elements (selected by class) when a link was clicked upon.

The details aren’t particularly important. What I want to know is whether or not there is an easy way to add a bit of javascript to a page. I suppose it could be added to all pages, but then it would be better if it were only all pages in a particular web. Something a bit like the stylesheet tweaks, but for javascript.

posted 13 years ago

admin Administrator 63 posts

Sorry. There isn’t a way to localize the Javascript.

You could, however, add some site-wide Javascript, which attaches an event listener to some element(s), based on the request-URL.

posted 13 years ago

Andrew Stacey 118 posts

Okay, I’ll take a look at that. Is it obvious which file to add it to, or should I create a new file and add it to the page template?

posted 13 years ago

Andrew Stacey 118 posts

The azimuth project just got a massive spam hit, 317 pages in total. To deal with that, I ended up working on the database level. What I did was to try to simulate “rollbacks”: copy the data from the last decent copy and paste it as a new row in the “revisions” table. That seemed the safest approach.

But it did get me thinking about the database and specifically the “revisions” table. Two questions:

1. If I simply remove a row from the `revisions` table, does instiki get confused?
2. If the timestamps are a bit out of order, does instiki get confused? Or is the revision id the One True Order on the revisions table?

posted 13 years ago

Andrew Stacey 118 posts

Just noticed that you got hit by the same spammer. I think that instiki.org also got hit, but then it’s hard to tell with that site anymore.

Seems as though this spammer has gone for every instiki installation under the sun!

posted 13 years ago

admin Administrator 63 posts

1. If I simply remove a row from the revisions table, does instiki get confused?

The `revised_at` field in the `pages` table should match the `revised_at` field of the last `revision` of that page.

2. If the timestamps are a bit out of order, does instiki get confused? Or is the revision id the One True Order on the revisions table?

The history of a page is reconstructed by sorting on the `revised_at` date. The revision `id` is irrelevant to that.
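To illustrate why the timestamps matter for a hand-made “rollback”, here is a small sketch that takes the description above at face value. The rows, dates, and the `note` column are invented for the example; only `id` and `revised_at` come from the discussion.

```python
from datetime import datetime

# Hypothetical rows: (id, revised_at, note).
revisions = [
    (101, datetime(2011, 1, 1, 9, 0), "last good revision"),
    (102, datetime(2011, 1, 2, 9, 0), "spam"),
    # A copy of the good revision pasted in as a new row, but back-dated.
    (103, datetime(2011, 1, 1, 12, 0), "restored copy"),
]

# History is ordered by revised_at; the id plays no part.
history = sorted(revisions, key=lambda row: row[1])
current = history[-1]
```

Here the restored copy has the highest `id`, yet the spam revision is still the most recent one by `revised_at`, so it would remain the current version. A pasted-in row needs a `revised_at` later than the revision it is meant to supersede.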

posted 13 years ago

Andrew Stacey 118 posts

Ah, I’d better fix that first one then. I’ll keep the second in mind for next time this happens and choose my dates more precisely.

posted 13 years ago

Andrew Stacey 118 posts

Errr … the `pages` table doesn’t have a `revised_at` field. It has an `updated_at` field. But this seems to get updated whenever the record gets changed. So if I start editing a page then the `pages` table gains a `locked_by` and a `locked_at` entry, and the `updated_at` entry is set to the same as the `locked_at` entry. Then when I cancel editing, the `locked_at` entry is set to NULL (the `locked_by` isn’t, though that’s probably not an issue) but the `updated_at` entry is left as it is. So the `updated_at` entry in the `pages` table does not necessarily point to the timestamp of the latest revision.

posted 13 years ago

Andrew Stacey 118 posts

Any thoughts on the following idea?

From time to time, the nLab gets a whole host of spiders and other bots crawling all over it. While I understand that they’re part of what makes the internet work, they can be a bit annoying and slow down the server for everyone else. So I thought of channelling requests a little more cleverly than I currently do. At the moment, I use a global queue in passenger which is fine until all the slots get a slow request. So what I thought was to have a semi-global queue with slow requests (like feeds and lists) and bots being handled by a few dedicated processes, normal requests by some others, and maybe a “priority” list as well. Since passenger doesn’t do this itself (it either has global queue or individual queues) I think that what I’d have to do is to have three virtual versions of the nLab, at least as far as apache and passenger are concerned. Then apache would examine the request and classify it according to which type it was and send it to the right version of the nLab. Passenger wouldn’t know that these are the same so would have a global queue for each, and that way requests get segregated and so don’t hold up others in other segments. The way that I’d have three virtual versions is simply with symlinks in the filesystem: “nlab”, “nlabPriority”, and “nlabSlow” would all be symlinks to the same instiki installation.

Can you see any immediate problems with that? As far as Instiki is concerned, it’s just like being run under passenger as there will be multiple instances of instiki running concurrently, which is what already happens. So that shouldn’t be affected. Apache, also, eats this sort of thing for breakfast, and passenger can cope with different programs as well. So I don’t see an immediate flaw.

(Of course, it may be that this won’t solve the blockage, but it’s less drastic than moving servers which is the other option.)
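The classification step could look roughly like the sketch below. In practice it would live in the Apache configuration rather than in Python; this only illustrates the decision logic. The user-agent markers and the action names treated as slow are assumptions, and since no criteria for the “priority” queue are stated, `nlabPriority` is left out.

```python
# Assumed substrings that identify a bot in the User-Agent header.
BOT_MARKERS = ("bot", "spider", "crawler")

# Assumed names for the slow "feeds and lists" actions.
SLOW_ACTIONS = {
    "list",
    "recently_revised",
    "atom_with_headlines",
    "atom_with_content",
}

def choose_instance(action, user_agent):
    """Pick which symlinked copy of the installation should serve a request."""
    agent = user_agent.lower()
    if any(marker in agent for marker in BOT_MARKERS):
        return "nlabSlow"
    if action in SLOW_ACTIONS:
        return "nlabSlow"
    return "nlab"

bot_hit = choose_instance("show", "Mozilla/5.0 (compatible; Googlebot/2.1)")
feed_hit = choose_instance("atom_with_content", "Mozilla/5.0 Firefox")
normal_hit = choose_instance("show", "Mozilla/5.0 Firefox")
```

Substring matching on the user agent is crude and will miss bots that do not identify themselves, so it should be read as a starting point only.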

posted 13 years ago

distler Moderator 123 posts

Is it obvious why the spiders aren’t just hitting the cache (in which case, they should not slow down the system at all)?

Are they asking for all revisions of some page (or whatever), which would entail a large percentage of cache-misses?

I ask, just because it seems to me that, if they are operating correctly, spiders shouldn’t lead to an undue slowdown. Maybe I’ve been remiss about

<meta name="robots" content="noindex,nofollow" />

directives.

In any case, is it clear that your 3-queue scheme is better than having one queue, with a larger number of worker processes? (I.e., do these spiders insist on making multiple simultaneous connections, or do they access the nlab serially?)

posted 13 years ago

Andrew Stacey 118 posts

Okay, so looking through the week’s log for bots (bot, spider, crawler), I get 33,517 hits (actual time period: 11th December 6:25am to 16th December 11:27am, so that’s an average of a little over 4 hits per minute). These break down as follows:

25690: show
2953: new
1345: history
1290: edit
1029: source
395: cancel_edit
81: files
67: atom_with_headlines
53: recently_revised
15: save
13: atom_with_content

There’s a few that I’ve missed out in between - there are clearly some bad links to the nlab.

I’d say that only `show` should show in that list. `source` could, but I don’t really see why. The `save`s are a bit worrying - I’m going to check those!

Next is to analyse how those are distributed.
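As a quick check of the arithmetic in the figures above, the following sketch recomputes the totals and the hit rate from the numbers quoted in this post. Nothing is assumed beyond those numbers.

```python
from datetime import timedelta

total_hits = 33517

by_action = {
    "show": 25690,
    "new": 2953,
    "history": 1345,
    "edit": 1290,
    "source": 1029,
    "cancel_edit": 395,
    "files": 81,
    "atom_with_headlines": 67,
    "recently_revised": 53,
    "save": 15,
    "atom_with_content": 13,
}

listed = sum(by_action.values())         # hits covered by the breakdown
unlisted = total_hits - listed           # hits left out of the breakdown
non_show = listed - by_action["show"]    # listed hits that are not `show`

# 11th December 6:25am to 16th December 11:27am is 5 days, 5 hours, 2 minutes.
window = timedelta(days=5, hours=5, minutes=2)
minutes = window.total_seconds() / 60
hits_per_minute = total_hits / minutes   # a little over 4, as stated

print(listed, unlisted, non_show, round(hits_per_minute, 2))
```

The breakdown accounts for 32931 of the 33517 hits, leaving 586 in the categories that were left out, and 7241 of the listed hits are for something other than `show`.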

posted 13 years ago

Andrew Stacey 118 posts

The “save”s were all due to one bot and none actually made it to the database.

