Feed on
Posts
Comments

I wrote a bit about representing structured data in the last post. Here’s some ideas for how I plan to index the data.

Indexing graphs as subject ranges

In indexing triples I need to provide indexed lookups for all 6 of the possible triple query patterns:

s->po
sp->o
p->os
po->s
o->sp
os->p

(s=subject p=property/predicate o=object)

Most mature triplestores also index a 4th query element ‘graph’ or ‘context’. I intend to support this query type without expanding the index by using a trick: In my triples format the fact that the subjects are auto-generated and local to the graph means I can choose them to be sequential and effectively re-use them as graph indexes - e.g. subjects between 193 and 11255 belong to graph 2 etc.. So for example the ‘os->p’ indexing can also support a ‘og->sp’ query pattern by restricting the range of subject matches to only those in the appropriate range.

Cache locality and triple indexes

I mentioned in my last post that I indended to use memory mapped sorted arrays for indexing. Nick Johnson left a comment on my last post (thanks Nick!) alerting me to the better cache locality properties of n-ary trees (where n is the number of elements that fit in a disk block) rather than using binary searching flat arrays.

This is an order of magnitude improvement. For example, for a 1 million element array of 32bit values you can do each cold search in just 3 page faults ( log1024(1M)=3, assuming block size of 4096 - i.e. 1024 elements per block).
A binary search on a cold 1M element flat array would yield something more like ~22 faults (I’m assuming the last 10 lookups would happen in the same block). This prompted me to do some reading on cache coherency.

As it turns out (unless I’m missing something) the actual approach I had in mind should be pretty optimal cache locality wise. The plan is to exploit the even randomness of the hashes to reduce the searching overhead, hopefully amortizing to constant lookup (and fault) times.

To illustrate this, consider the o->ps and op->s lookup patterns. I plan to index these through a 3 level hierarchy of sorted arrays o->p->s:
I.e. a sorted array of objects, each which points to a child array of predicates, each of which points to subjects.


[o1 o2 o3 o4 o5 o6 o7 ...] (object array)
          /
         /
   ...][p1 p2 p3 p4][..    (o4 predicate array)
          /
         /
  ...][s1 s2 s3 s4][....     (o4 p2 subject array)

The trick here is that each array is a sorted array of unique hashes, which because of the randomness of the hash should be evenly spread over the search space.

That means that an object with hash ‘h’ should be approximately in the position:

(h / hash-range) * numelements.

E.g. if the hash range is unsigned 32bit (0-4294967295), a 2147483648 value is in the middle of the array. The search would try this position first, and then use linear probing to locate the value. I’m hoping that this will result in 3 page faults to locate the first matching triple regardless of the size of the data. Because it doesn’t implicitly rely on any block size it should also respond well to L1/L2 caching (unless I’m missing something!).

Hash lengths

As mentioned in the previous post I’m planning on internally identifying each symbol with a 64bit hash, along the same lines as 3store. However I’m currently thinking that I’ll only use the first 32 bits in the top 2 lookup indexes. This will make the indexes denser which I think should help with L1/L2 cache locality when probing for a match. Of course the tradeoff is that there will be a lot of duplicate hashes - to account for these I’ll put the latter 32bits in the 3rd level data arrays so that they can be filtered before joining.

N.B. I have no empirical performance data to back up any of these ideas, so this is all speculation at the moment (and likely to change as I gain in experience). I’d appreciate it if anybody can see where I’ve overlooked something or has any more optimal ideas for storing static triple data.

Viewing 4 Comments

    • ^
    • v
    Your subject-id-range context trick doesn't sound like it has the same capabilities as other context systems. Does yours have a way to store 's1 p1 o1' in one context and 's1 p2 o2' in another? Or do you have to make some other statement to link the ids of the first s1 and the second s1?
    • ^
    • v
    Good point about the even distribution of the hashes. I'll have to remember that one. :)

    Didn't you say that at least the subject will be a sequential identifier, though, and so not susceptible to that optimisation?

    How many actual indexes will you need to efficiently support that set of queries, given your heirarchial index structure? Only 3?
    • ^
    • v
    Hi Drew,

    The latter. Subject identifiers aren't exposed to the client so there's no way to make statements using them specificially. Instead to join data from two subjects in different graphs you must use identity by discription (i.e. the subject that has these property values..) and the person/agent doing the query must know about them.
    Internally the subject IDs can be in the 'object' position to support things like containment. E.g. the XML:

    <pre>
    <person>
    <name>Phil Dawes</name>
    <email>phil@example.com</email>
    <knows>
    <person>
    <name>Steve</name>
    <email>s@example.com</email>
    </person>
    </knows>
    </person>
    </pre>

    Internally indexed as:
    <pre>
    #1 name "Phil Dawes"
    #1 tag Person
    #1 knows #2
    #2 name Steve
    #2 email steve@example.com
    #2 tag Person
    </pre>

    but externally you can't refer to them. Does that make sense?
    • ^
    • v
    Hi Nick,

    Opaque subject identifiers are even easier to index because they can be picked to be sequential in the index. I.e. subject 3 is at position 3.

    Re. number of indexes: I think I'll need at least the following.

    s->p->o
    p->o->s
    o->s->p

    So 3 index hierarchies for searches. The subject-id-in-the-object-position mentioned above is a special case, and will probably require its own (relatively small) index o->sp.
close Reblog this comment
blog comments powered by Disqus

generic acomplia purchase cialis overnight delivery cheap acomplia online buy generic clomid buy cialis low price viagra without prescription where to buy cialis lowest price levitra where to buy propecia cheap cialis from canada lasix no prescription viagra without rx cheap accutane tablets viagra online without prescription viagra no rx buying cialis online zithromax viagra in uk free cialis cialis us where to buy acomplia find cialis online buy viagra lowest price accutane prescription buy cheap accutane online cialis buy buy generic cialis online acomplia order propecia online lowest price synthroid synthroid without a prescription synthroid online buy propecia online cheap levitra online where to buy levitra cialis online review synthroid prices cialis generic cialis buy drug buy viagra on line viagra pharmacy cialis for order price of levitra zithromax online where to buy synthroid soma generic generic clomid propecia online stores viagra cheap drug cheap generic soma cialis cheap zithromax online cheap order accutane online purchase zithromax online purchase viagra online buy cheap clomid cheap generic propecia zithromax pharmacy online pharmacy cialis cheapest acomplia cost of cialis no prescription viagra free viagra purchase lasix online cialis from india viagra from india order discount cialis soma online stores find no rx cialis cialis no rx required find viagra without prescription approved cialis pharmacy lasix discount