I'm trying to work out if it's possible to get the index searching performance I want using a disk-backed store, or whether I need to focus on optimising the indexes to fit totally in memory. The problem is that the optimisation strategies are somewhat different:

Storing indexes on disk: - Increase redundancy, trading space for better locality of reference.

Storing indexes in memory: - Reduce redundancy, trading locality of reference for fitting more data in memory.

Of course reference locality for caching does have an effect in memory as well as disk, but it's more a factor of ~10 (i.e. ~10 times slower to fetch a word from main memory than from l1 cache) whereas a disk seek compared to a sequential read is more like a factor of ~100000.

Typical server machines seem to have the order of ~100 times more disk than memory so whatever happens the redundancy increase needs to be less than a factor of 100 or there's little point.

Sequential disk reads are ~10 times slower than random access memory reads (cache misses) and ~100 times slower than sequential (L1) cache reads, so even if there are no seeks this is still going to be 10-100 times slower. But the payoff is potentually a lot more indexed data per server node.

N.B. these numbers are very approximate orders of magnitude. Assumptions are from the below links, along with my own measurements using hdparm, bonnie++, seeker.c and a some of my own code. Please let me know if I'm wildly out with any of these!

http://www.linuxinsight.com/howfastisyourdisk.html http://homepage.virgin.net/roy.longbottom/randmem%20results.htm