Searching arrays in X86 assembler with a bloom filter

I'm currently writing what amounts to an olap database in factor in my spare time, which is a pretty interesting project. Actually spare time is limited now that I have a child but factor is affording me a pretty good productivity compared to python and scheme so it's probably net-par compared to what I was doing a couple of years ago. I should probably write a bit more about that some time.

Anyway I recently made the bizarro discovery that in the world of factor for ease of deployment it makes more sense to write low level code in assembler rather than C. This is because factor effectively has a built in macro assembler DSL for each of the chipsets it supports, whereas shipping C code requires having a C compiler setup on the target machine which is a pita for windows platforms. Also X86 is fast becoming the only server processor that matters. Crazy world.

Ideally I'd not have to write any low-level code at all and I'd do it all in factor. To be honest factor is pretty bloody fast, and could conceivably compete well with raw assembler in the not-too-distant future given enough tweaking of the code generator. For the time being though C and assembler rule the performance roost in this environment.

Ok, so the interesting problem I have is:

I have a set of 32bit integers, and I want to search a large unsorted 32bit array accumulating the array indices of values matching those from the set. To solve this in factor I've been putting the input set in a hashtable and then sequentially looking up each element from the array, which algorithmically is pretty optimal O(n+m) but the ever present constant is pretty high. This takes about 4 seconds on a processor pegged at 800mhz for a 48MB array.

Ideally I want this performance down to under 100ms. This is going to be very tight because a 48mb file contains ~12M elements, so if I'm aiming for 100ms that means about 6 cycles per element on an 800mhz chip. Probably a bit too ambitious.

I'm thinking I'll create a bloom filter with a simple hashing function (like e.g. 'mod' ) and iterate the large array in assembler checking the filter for each element and then escaping back to factor to confirm and accumulate matches.

Incidentally, I've noticed that my motivation for writing about my programming ideas drops after I've implemented them, so this time I thought I'd experiment with using my blog to think through the idea before I attempt it. Can anybody see a better way to solve this problem given the constraints? I'll post again with the results...

Hashing Searching Sketching

For the computer scientists in the room:

Google techtalks rock, and this one about recent advances in retrieval techniques was especially interesting to me.

In particular it includes a remarkable recent observation for the improvement of hash tables: Hash table loading and performance can be dramatically improved simply by using more than one hash algorithm in the lookup.

I.e.

  • instead of doing one hash lookup, you hash the item with two functions (to get two results) and look in both buckets - one of them will contain the element. (i.e. constant lookup time)
  • On insert you hash twice and put the element in the bucket with the smallest number of elements.
  • A further optimisation: moving elements to their pairs to optimise loading allows you to get a constant maximum bucket length of 2. That means you don't even need to use a linked list for each bucket, you can preallocate a fixed array of (2 * number of nodes) - i.e. 2 elements for each bucket. With this scheme you can store ~1.7n values in n buckets with very low probablility of overflow.
  • Finally, simply keep a seperate (very small) overflow buffer for the rare cases where there's overflow.

The techtalk is definitely worth a watch - I found the speaker entertaining and easy to follow.

Actually, having spent a bit of time watching videos recently I've got to wonder if the video presentation is the future of university-level computer science education. I get so much more out of watching technical videos and then looking stuff up in wikipedia than I ever did out of going to undergrad lectures.