Indexes, Hashes & Compression

The new triplestore is coming along. It can do substring text searches (using a suffix array) and has a basic relational query engine. It doesn't optimise the query plans yet, but if you enter the queries in a good order (most selective clauses first) then you get good performance.
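To make the substring search idea concrete, here is a minimal sketch of a suffix array lookup. It is not the triplestore's actual code: the function names are made up for illustration and the construction is the naive sort-all-suffixes approach, which is fine for a demo but far too slow for a real index.

```python
def build_suffix_array(text):
    # Naive construction: sort every suffix start position by the suffix text.
    # Real implementations use far faster algorithms; this is for illustration.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, sa, pattern):
    # All suffixes that start with `pattern` sit in one contiguous slice of
    # the suffix array, so two binary searches locate every match.
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Return match positions in text order.
    return sorted(sa[start:lo])

text = "banana"
sa = build_suffix_array(text)
print(find_all(text, sa, "ana"))  # [1, 3]
```

Each lookup costs O(m log n) character comparisons for a pattern of length m, regardless of how many documents the text spans.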

A few things have changed in my thinking since my last post about indexing. Using hashes to represent tokens is really useful when joining datasets from different nodes in a cluster (no coordination overhead), but I now think they're not such a good idea for laying out the actual triple indexes in memory (or on disk).
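As a rough illustration of the "no coordination" point, the sketch below derives a 64-bit identifier from the token itself. The choice of SHA-1, the 8-byte truncation and the name `token_id` are all assumptions made for the example, not a description of what the triplestore does.

```python
import hashlib

def token_id(token: str) -> int:
    # Assumption for illustration: use the first 8 bytes of a SHA-1 digest
    # as a 64-bit identifier. Any stable hash would serve the same purpose.
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# Two nodes that never talk to each other encode the same token.
id_on_node_a = token_id("http://example.org/alice")
id_on_node_b = token_id("http://example.org/alice")

# The identifiers agree, so rows from the two nodes can be joined on the
# hashed value directly, without consulting a shared dictionary.
print(id_on_node_a == id_on_node_b)  # True
```

The price is that the identifiers carry no ordering or locality: two tokens that are related in the data end up at unrelated points in the 64-bit space.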

My reasoning is:

To get the performance I want (100-row results from relational queries in about half a second), I'll either have to keep the entire set of indexes in memory or, at the very least, limit disk activity to a small number of sequential reads. Disk seeks are on the order of ~10ms each, so 50 of them use up the whole half-second budget and I'm shot. If I end up aiming for the all-in-memory approach then I want to cram as many triples into memory as possible. If I do use disk reads then locality of reference will be fundamental to the layout of the indexes.
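The seek budget is simple arithmetic, spelled out here using the figures above (roughly 10ms per seek, a half-second target). The constant and function names are just for illustration.

```python
SEEK_MS = 10      # approximate cost of one disk seek, from the estimate above
BUDGET_MS = 500   # target: results in about half a second

def seek_cost_ms(seeks, seek_ms=SEEK_MS):
    # Total time spent seeking, ignoring transfer and CPU time.
    return seeks * seek_ms

def max_seeks(budget_ms=BUDGET_MS, seek_ms=SEEK_MS):
    # Upper bound on seeks before the budget is gone entirely.
    return budget_ms // seek_ms

print(seek_cost_ms(50))  # 500
print(max_seeks())       # 50
```

In practice the usable number is lower, since this leaves no time at all for reading the data or running the joins.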

Either way, I'm going to need to use compression on the indexes to achieve optimal storage or read efficiency. The problem with hashes is that they introduce a lot of randomness into the mix: hashed identifiers are scattered across the whole ID space, so the gaps between neighbouring values in a sorted index are large and irregular, which reduces the ability to do delta compression (and then run-length encoding of the deltas). I suspect that controlling the allocation of identifiers could also be very useful in optimising locality of reference. All this is theoretical at the moment as I haven't actually implemented any index compression, but I hope to do so soon.
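Since none of this is implemented yet, the following is only a back-of-the-envelope sketch of why hashed identifiers hurt delta compression. It delta-encodes a sorted list of IDs, run-length encodes the deltas, and counts the bytes needed if every number is stored as a 7-bits-per-byte varint. The varint choice, the SHA-1 based `hash_id` and the example.org tokens are all assumptions for the demo.

```python
import hashlib
from itertools import groupby

def deltas(sorted_ids):
    # Replace each ID with its distance from the previous one (first from 0).
    out, prev = [], 0
    for x in sorted_ids:
        out.append(x - prev)
        prev = x
    return out

def run_length(values):
    # Collapse runs of equal values into (value, count) pairs.
    return [(v, len(list(g))) for v, g in groupby(values)]

def varint_len(n):
    # Bytes needed to store n with 7 payload bits per byte.
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length

def encoded_size(sorted_ids):
    # Size in bytes of the delta + run-length + varint encoding.
    runs = run_length(deltas(sorted_ids))
    return sum(varint_len(v) + varint_len(c) for v, c in runs)

def hash_id(token):
    # Hypothetical 64-bit hashed identifier (first 8 bytes of SHA-1).
    return int.from_bytes(hashlib.sha1(token.encode("utf-8")).digest()[:8], "big")

tokens = [f"http://example.org/item/{i}" for i in range(1000)]
sequential = list(range(1, 1001))             # IDs handed out in order
hashed = sorted(hash_id(t) for t in tokens)   # IDs derived from hashes

print("sequential:", encoded_size(sequential), "bytes")  # one run of delta 1: 3 bytes
print("hashed:    ", encoded_size(hashed), "bytes")
```

Densely allocated IDs collapse to a single run, while the hashed IDs need several bytes per entry because every delta is a large, effectively random number. Real data will fall somewhere between these two extremes, but the direction of the effect is the point.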