Hashing Searching Sketching

For the computer scientists in the room:

Google techtalks rock, and this one about recent advances in retrieval techniques was especially interesting to me.

In particular it includes a remarkable recent observation for the improvement of hash tables: Hash table loading and performance can be dramatically improved simply by using more than one hash algorithm in the lookup.

I.e.

  • instead of doing one hash lookup, you hash the item with two functions (to get two results) and look in both buckets - one of them will contain the element. (i.e. constant lookup time)
  • On insert you hash twice and put the element in the bucket with the smallest number of elements.
  • A further optimisation: moving elements to their pairs to optimise loading allows you to get a constant maximum bucket length of 2. That means you don't even need to use a linked list for each bucket, you can preallocate a fixed array of (2 * number of nodes) - i.e. 2 elements for each bucket. With this scheme you can store ~1.7n values in n buckets with very low probablility of overflow.
  • Finally, simply keep a seperate (very small) overflow buffer for the rare cases where there's overflow.

The techtalk is definitely worth a watch - I found the speaker entertaining and easy to follow.

Actually, having spent a bit of time watching videos recently I've got to wonder if the video presentation is the future of university-level computer science education. I get so much more out of watching technical videos and then looking stuff up in wikipedia than I ever did out of going to undergrad lectures.