I wrote a bit about representing structured data in the last post. Here are some ideas for how I plan to index the data.

Indexing graphs as subject ranges

In indexing triples I need to provide indexed lookups for all six possible triple query patterns:

s->po sp->o p->os po->s o->sp os->p

(s=subject p=property/predicate o=object)

Most mature triplestores also index a fourth query element, 'graph' or 'context'. I intend to support this query type without expanding the index by using a trick: in my triples format the subjects are auto-generated and local to the graph, so I can choose them to be sequential and effectively reuse them as graph indexes - e.g. subjects between 193 and 11255 belong to graph 2, and so on. So, for example, the 'os->p' index can also support an 'og->sp' query pattern by restricting the range of subject matches to only those in the appropriate graph's range, as in the sketch below.
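To make the trick concrete, here's a minimal Python sketch (the os_index layout, the graph_ranges table and every value in them are hypothetical, purely for illustration):

import bisect

# Hypothetical os->p index: for each object, a list of
# (subject, predicate) pairs kept sorted by subject.
os_index = {
    'o42': [(5, 'p1'), (200, 'p2'), (9000, 'p1'), (20000, 'p3')],
}

# Each graph owns a contiguous block of sequential subject IDs,
# e.g. subjects 193..11255 belong to graph 2.
graph_ranges = {2: (193, 11255)}

def og_to_sp(o, g):
    """Answer an og->sp query with the os->p index: seek to the
    start of graph g's subject range, stop on leaving it."""
    lo, hi = graph_ranges[g]
    rows = os_index[o]
    i = bisect.bisect_left(rows, (lo,))       # first subject >= lo
    while i < len(rows) and rows[i][0] <= hi:
        yield rows[i]                         # (subject, predicate)
        i += 1

Called as list(og_to_sp('o42', 2)) this returns [(200, 'p2'), (9000, 'p1')] - only the subjects that fall inside graph 2's range.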

Cache locality and triple indexes

I mentioned in my last post that I intended to use memory-mapped sorted arrays for indexing. Nick Johnson left a comment on that post (thanks Nick!) alerting me to the better cache locality of n-ary trees (where n is the number of elements that fit in a disk block) compared with binary searches over flat arrays.

This is a substantial improvement. For example, for a 1 million element array of 32bit values an n-ary tree can do each cold search in just 2 page faults (log1024(1M) ≈ 2, assuming a block size of 4096 bytes - i.e. 1024 elements per block). A binary search on a cold 1M element flat array would yield something more like ~11 faults: about log2(1M) = 20 probes, the first ~10 each landing on a different page and the last ~10 all falling within the same block. This prompted me to do some reading on cache locality.
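For anyone who wants to check the arithmetic, here's the back-of-envelope version in Python (the 4096-byte blocks and 4-byte values are assumptions, not measurements):

import math

n = 1_000_000            # elements in the index
per_block = 4096 // 4    # 1024 four-byte values per 4KB block

# n-ary tree: one page fault per tree level on a cold search
nary_faults = math.ceil(math.log(n, per_block))               # -> 2

# binary search: ~log2(n) probes in total, but once the live
# range has shrunk to a single block the remaining probes all
# hit the same page
probes = math.ceil(math.log2(n))                              # -> 20
binary_faults = probes - math.ceil(math.log2(per_block)) + 1  # -> 11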

As it turns out (unless I'm missing something) the approach I actually had in mind should already be pretty good cache-locality-wise. The plan is to exploit the even distribution of the hashes to reduce the searching overhead, hopefully amortizing to constant lookup (and fault) times.

To illustrate this, consider the o->ps and op->s lookup patterns. I plan to index these through a 3 level hierarchy of sorted arrays, o->p->s: i.e. a sorted array of objects, each of which points to a child array of predicates, each of which points to an array of subjects.


[o1 o2 o3 o4 o5 o6 o7 ...] (object array)
          /          
         /           
   ...][p1 p2 p3 p4][..    (o4 predicate array)
          /
         /
  ...][s1 s2 s3 s4][....     (o4 p2 subject array)
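In code, a toy in-memory version of the hierarchy might look like this (the real thing would be memory-mapped arrays rather than Python lists, and all the hash values are made up; the find() helper is sketched below, after the position formula):

# Each level is a sorted array: the two upper levels hold
# (hash, child-array) pairs, the bottom level plain subject hashes.
object_array = [
    (0x1A2B0000, [                        # an object's predicate array
        (0x03F00000, [0x0011, 0x5C00]),   # a predicate's subject array
        (0x77880000, [0x2222]),
    ]),
    (0x9C400000, [
        (0x11110000, [0x0044, 0x8800]),
    ]),
]

def op_to_s(o_hash, p_hash):
    """op->s: two probes down the hierarchy; the subject array
    at the bottom is the whole answer."""
    return find(find(object_array, o_hash), p_hash)

def o_to_ps(o_hash):
    """o->ps: one probe for the object, then walk everything
    beneath it."""
    for p_hash, subjects in find(object_array, o_hash):
        for s_hash in subjects:
            yield p_hash, s_hash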

The trick here is that each array is a sorted array of unique hashes, which because of the randomness of the hash should be evenly spread over the search space.

That means that an object with hash 'h' should be approximately in the position:

(h / hash-range) * numelements.

E.g. if the hash range is unsigned 32bit (0-4294967295), a value of 2147483648 should sit right in the middle of the array. The search tries that estimated position first, then uses linear probing to locate the actual value. I'm hoping this will result in 3 page faults to locate the first matching triple (one per level of the hierarchy) regardless of the size of the data. Because it doesn't implicitly rely on any block size it should also respond well to L1/L2 caching (unless I'm missing something!).
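Here's the find() helper used in the hierarchy sketch above - an interpolation probe rather than a binary search, assuming sorted (hash, child) pairs with unique, evenly distributed 32bit hashes:

HASH_RANGE = 2**32    # unsigned 32bit hash space

def find(arr, h):
    """Jump straight to the position hash h 'should' occupy if
    hashes are spread evenly, then probe linearly to correct
    the estimate."""
    if not arr:
        raise KeyError(h)                 # empty index level
    n = len(arr)
    i = min(h * n // HASH_RANGE, n - 1)   # (h / hash-range) * numelements
    while i + 1 < n and arr[i][0] < h:    # estimate was low: probe forwards
        i += 1
    while i > 0 and arr[i][0] > h:        # estimate was high: probe backwards
        i -= 1
    if arr[i][0] != h:
        raise KeyError(h)                 # hash not in this index
    return arr[i][1]

If the estimate is good, the linear probe stays on a single page per level, which is where the hoped-for three-faults-per-lookup figure comes from.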

Hash lengths

As mentioned in the previous post, I'm planning on internally identifying each symbol with a 64bit hash, along the same lines as 3store. However, I'm currently thinking that I'll only use the first 32 bits in the top 2 levels of the lookup hierarchy. This will make the indexes denser, which I think should help with L1/L2 cache locality when probing for a match. Of course the tradeoff is collisions - with 32bit prefixes the birthday bound predicts the first duplicate after only ~65k symbols - so I'll put the latter 32 bits in the 3rd level data arrays so that false matches can be filtered out before joining.
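A quick sketch of the split (the function name is hypothetical):

def split_hash(h64):
    """Split a 64bit symbol hash: the high 32 bits key the dense
    object/predicate index levels; the low 32 bits are stored in
    the 3rd level data arrays to filter collisions before joining."""
    hi = h64 >> 32            # index key for the top two levels
    lo = h64 & 0xFFFFFFFF     # disambiguator kept with the data
    return hi, lo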

N.B. I have no empirical performance data to back up any of these ideas, so this is all speculation at the moment (and likely to change as I gain experience). I'd appreciate hearing from anybody who can see where I've overlooked something, or who has more optimal ideas for storing static triple data.