## April 13, 2005

### The arXiv in your pocket.

#### Posted by Robert M.

As pointed out on several other blogs, Joanna Karczmarek has been testing the waters with a downloadable version of the eprint arXiv. For the last few months, anyone with BitTorrent installed on their machine has been able to download all of 2004’s papers in one go. But, as of yesterday, the whole shebang is up for grabs.

You can grab the .torrent file either here or here. If you decide to download the arXiv, please let the BitTorrent software continue to run, even after your download has finished. This makes the process faster and easier for everyone. In fact, it would be great if you just let the software run, in the background, for a few weeks. After all, if you’re on a university connection, you probably won’t even notice the 20 kB/s or so of bandwidth it uses.

By the way, the whole thing is only 7.4 GB. That’s roughly a third of the smallest iPod on the market. So yes, you can carry the arXiv around in your pocket. The fun part, though, is how to index all of this data. It would be so boring if we just used the standard SLAC-type searches that we’re all used to. If you’re a Windows user, you might want to check out Google Desktop. In a few weeks, Mac users who upgrade to Tiger can use Spotlight. Linux users have a few options as well, such as Beagle.

Update: As more people have started downloading the arXiv, the download speeds have really picked up. My download finished earlier today. One of the first things I noticed as I looked through the papers is that certain papers seem to be missing. My desktop environment generates small thumbnails of pdf files in place of icons, and I noticed that a few of the papers weren’t pdf files at all, but html with .pdf extensions.

Upon closer inspection, the html content of these files turns out to be the usual blurb that the arXiv offers up when it tries to convert a paper’s source to pdf. Clearly, these are the papers where the scripts failed to generate pdf. For instance, go check out hep-th/9108012 and try to grab a pdf version of that paper. After a few moments, the arXiv will return an error message, stating that it can’t generate a pdf file due to “incomplete or corrupted files”.
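Spotting these impostor files can be automated. A minimal sketch, assuming the mirror has been unpacked under a local `arxiv/` directory (the two sample files below are stand-ins created for the demonstration, with hep-th/9108011 playing the role of a healthy paper): a genuine pdf file begins with the magic bytes `%PDF`, while the arXiv's error blurbs are html.

```shell
# Stand-in mirror layout, created just for this sketch. A healthy paper
# starts with the magic bytes "%PDF"; the arXiv's error blurbs are html.
mkdir -p arxiv/hep-th
printf '%%PDF-1.3 healthy paper body' > arxiv/hep-th/9108011.pdf
printf '<html>could not generate pdf</html>' > arxiv/hep-th/9108012.pdf

# Flag every .pdf whose first four bytes are not "%PDF"
find arxiv -name '*.pdf' | while read -r f; do
  head -c 4 "$f" | grep -q '^%PDF' || echo "impostor: $f"
done
# prints: impostor: arxiv/hep-th/9108012.pdf
```

The same loop, pointed at a real copy of the archive, would give a worklist of every paper that needs to be regenerated from source.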

Not to worry, though. It seems as if the sources for some of these missing papers will produce valid pdf files with a minimum of fuss. For instance, if you download the source for hep-th/9108012 and pdflatex it, you’ll get a few errors. But you’ll also get a pdf version of the paper.

Posted at April 13, 2005 5:51 AM UTC

TrackBack URL for this Entry:   http://golem.ph.utexas.edu/cgi-bin/MT-3.0/dxy-tb.fcgi/552

### Re: The arXiv in your pocket.

There is an interesting race going on between the development of digital memory and that of electronic information bandwidth - and the way this affects our use of both of them.

On the one hand we can rejoice that mass-storage devices are becoming so small and cheap that electronic memory tends to behave as if infinite for practical purposes. So why not download all of arXiv and do all literature research with Google Desktop - offline?

On the other hand internet access is getting so fast and ubiquitous that one can come to the opposite conclusion: Why should I care about downloading archives of anything at all, when I can always access them remotely just as easily?

My personal computer recently suffered a hard disk failure of some sort, and a couple of files were lost or damaged. Some of these files are of almost existential importance to me. But I can relax, since all my files are mirrored on one or two servers that I access remotely.

In fact, it is the other way round: my permanent directories live on those servers, and my personal computer mirrors what I need on a day-to-day basis.

Why shouldn’t I be quite happy without the arXiv downloaded onto my hard drive - much less onto my MP3 player? What can I do with Google Desktop that I cannot do with Google? I routinely use Google to find my own files on the net.

Of course I can see that it is a cool thing to carry hep-th around on my iPod, with people asking me:

‘What is that higher-dimensional super-current monster group that you are listening to?!?!’

;-)

Posted by: Urs Schreiber on April 13, 2005 4:43 PM | Permalink | Reply to this

### Re: The arXiv in your pocket.

Well, of course there are lots of benefits to having a collective store of data in a centralized location, that we can all access via SLAC, or google, or whatever. But there are plenty of interesting reasons that we might want a local copy, as well.

If I’m a Windows user, I might be excited to use Google’s facilities, via Google Desktop, to run natural-language searches over a specific body of knowledge. I don’t have to sort out the extraneous hits that come up when I use either Google or Google Scholar. Similarly for Mac users running the upcoming OS X Tiger.

On the other hand, it would be a shame if people became too accustomed to working with their local copy, and ended up missing relevant work published since their last update. But I doubt that will be a problem. I see this as having more to do with access when you don’t have broadband, or you are on a plane. No one is going to stop using the arXiv because they have a local copy of its contents up to a certain point. It’s just going to be a useful reference to have around.

This past week I was collecting references for an upcoming paper, and my internet access went out. The local Comcast guys were doing some sort of “upgrade” that left me with practically no service. I found all of the papers I needed in my brand new, local copy of the arXiv. I could have just found them the next day, using my connection at work, but finding them right then and there via my indexed copy of the arXiv’s contents was certainly more convenient.

Posted by: Bob McNees on April 15, 2005 8:06 PM | Permalink | Reply to this

### P2P Rsync

> On the other hand, it would be a shame if people became too accustomed to working with their local copy, and end up missing relevant work since their last update.

Stale data strikes me as a big concern.

Papers which have been revised, new papers which have been added … I am reluctant to devote 8 GB to a body of data that is fairly rapidly going to become stale.

What we need is a good P2P protocol for doing updates to a large body of data. Sort of a P2P version of `rsync`, if you will.

I wanna be able to say to my P2P network, “I have a copy of the archive, up-to-date as of April 10, 2005. Please send me everything that’s been updated since then.”

Without something like that, I’m gonna look at my local copy of hep-th, two years from now, and wonder, “Why did I waste 8 GB on that?”

Posted by: Jacques Distler on April 15, 2005 8:37 PM | Permalink | PGP Sig | Reply to this

### Re: P2P Rsync

Stale data should be a big concern, and whether this question gets resolved depends in large part on whether Joanna gets quality feedback on her efforts. If this is a success, and is actually useful to people, then I think we definitely need to address the question of how we fold new data into this massive, local repository.

Jacques, I think you and I once had a conversation about using CVS to work on papers. That was a bit tongue in cheek (using CVS for collaboration), but it seems relevant for this question. We need a way of incorporating autonomous, unscheduled changes to the “arXiv head” into our own personal copies, much as a developer uses `cvs update` to bring recent changes into a local copy of a program’s source code.

Of course, what we really need is a means of distributing periodic updates that doesn’t require any changes to the way the arXiv is currently operated. I know that the arXiv keeps very detailed changelogs for all of the papers, so if Paul Ginsparg approves of what Joanna is doing, I’m sure he could very easily give her periodic updates. Compared to the time we’re willing to spend downloading the 7.5 GB arXiv, a few weeks’ or a month’s worth of updates would probably be trivial. In fact, it might not even require BitTorrent to distribute effectively.

The arXiv archive, as it is distributed now, has a well-laid-out directory structure. It would be easy to repackage a week’s or a month’s worth of updates in that same format, so that untarring the file in the appropriate place simply overwrites stale material with fresh, updated copies.
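The mechanics of that overlay idea come down to one `tar` invocation on each side. A sketch with stand-in file names invented for the demonstration (the update tarball mirrors the archive's directory layout):

```shell
# "arxiv/" mimics the pocket mirror's layout; "updates/" holds a week of
# revised and new papers in the same layout. All names here are stand-ins.
mkdir -p arxiv/hep-th updates/hep-th
echo 'stale version'   > arxiv/hep-th/0504001.pdf
echo 'fresh version'   > updates/hep-th/0504001.pdf
echo 'brand new paper' > updates/hep-th/0504002.pdf

tar -C updates -cf update.tar .   # what the maintainer would distribute
tar -C arxiv -xf update.tar       # untar in place: overwrites stale files,
                                  # adds new ones, leaves the rest alone
cat arxiv/hep-th/0504001.pdf      # now reads "fresh version"
```

Unlike a full re-download, the tarball only has to carry the papers that actually changed that week.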

Posted by: Bob McNees on April 15, 2005 9:16 PM | Permalink | Reply to this

### Re: P2P Rsync

> Jacques, I think you and I once had a conversation about using CVS to work on papers. That was a bit tongue in cheek (using CVS for collaboration), but it seems relevant for this question.

Overkill for a paper, perhaps, but an excellent idea for collaboratively working on a book (say).

But it’s not what we need here. We don’t need version control (and the ability to roll back changes to previous versions).

What we need is something more like `rsync`. But we need something that acts in a P2P fashion.

> a few week’s or a month’s worth of updates would probably be trivial. In fact, it might not even require bittorrent to effectively distribute.

Well, I know Joanna doesn’t have the bandwidth to host such a thing: distributing periodic updates of every new or revised paper to all users of the Arxiv in Your Pocket.

BitTorrent cuts her bandwidth requirements down to a manageable level. But it is most definitely not well-adapted to distributing updates.

The Linux people are onto this problem. After all, distributing Linux ISO images was one of BitTorrent’s first big applications. But say you are interested in keeping up with the daily, weekly, or monthly progress of some Linux distribution. You don’t want to download a whole new ISO image just to get the few files that have changed.

If you Google around, you’ll see that a lot of Linux folks are thinking about exactly the problem we’re talking about. One outcome of that thinking is Jigdo, which is sorta on the right track, if not exactly what would work for us.

Posted by: Jacques Distler on April 16, 2005 4:59 AM | Permalink | PGP Sig | Reply to this

### Re: P2P Rsync

> Jacques, I think you and I once had a conversation about using CVS to work on papers. That was a bit tongue in cheek (using CVS for collaboration), but it seems relevant for this question.
>
> Overkill for a paper, perhaps, but an excellent idea for collaboratively working on a book (say).

I was very recently thinking that, for collaborating on a large (e.g. $\gtrsim 60$ pages) paper, CVS-like software would be quite useful.

Say there’s a big calculation, and I correct a sign here and a factor of $1/2$ there. Communicating such small changes to collaborators, and having them find and cross-check each one, causes a lot of overhead that suitable version-control software would reduce.
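The mechanism that CVS wraps can be sketched with plain `diff` and `patch`: the author sends only the delta, and the collaborator’s copy picks up exactly the sign fix and nothing else. The file names and the one-line “paper” below are stand-ins for the sketch.

```shell
# paper-orig.tex is the version the collaborator already has;
# paper.tex carries the author's sign fix. All names are stand-ins.
printf 'The coupling is +g/2.\n' > paper-orig.tex
printf 'The coupling is -g/2.\n' > paper.tex

# diff exits 1 when the files differ, hence the "|| true"
diff -u paper-orig.tex paper.tex > signfix.patch || true

# The collaborator applies just the delta to their own copy:
cp paper-orig.tex collab.tex
patch collab.tex < signfix.patch
```

With a version-control system on top, the diffing, distribution, and bookkeeping of which fixes each collaborator has already applied all happen automatically.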

Posted by: Urs on April 16, 2005 1:10 PM | Permalink | Reply to this