Hashing Searching Sketching

For the computer scientists in the room:

Google techtalks rock, and this one about recent advances in retrieval techniques was especially interesting to me.

In particular it includes a remarkable recent observation for the improvement of hash tables: Hash table loading and performance can be dramatically improved simply by using more than one hash algorithm in the lookup.

I.e.

instead of doing one hash lookup, you hash the item with two functions (to get two results) and look in both buckets - one of them will contain the element. (i.e. constant lookup time)
On insert you hash twice and put the element in the bucket with the smallest number of elements.
A further optimisation: moving elements to their pairs to optimise loading allows you to get a constant maximum bucket length of 2. That means you don't even need to use a linked list for each bucket, you can preallocate a fixed array of (2 * number of nodes) - i.e. 2 elements for each bucket. With this scheme you can store ~1.7n values in n buckets with very low probablility of overflow.
Finally, simply keep a seperate (very small) overflow buffer for the rare cases where there's overflow.

The techtalk is definitely worth a watch - I found the speaker entertaining and easy to follow.

Actually, having spent a bit of time watching videos recently I've got to wonder if the video presentation is the future of university-level computer science education. I get so much more out of watching technical videos and then looking stuff up in wikipedia than I ever did out of going to undergrad lectures.