BTriples - a model for aggregating structured data

Things have settled down a bit after the birth of baby #2 and I'm starting to get a bit of time to program again: about an hour a night. That means I'm thinking a lot about indexing structured data again.

Here are my most up-to-date thoughts on a model for representing aggregated structured data which I'm tentatively calling 'BTriples'. I'm writing this down mainly so I can refer to it in future writing.

The purpose of BTriples is to be an internal model for an OLAP database, such that it can represent structured data from a variety of popular formats (JSON, XML, CSV, relational) and can index and query across heterogeneous data sources.

A good candidate for such a model would appear to be RDF, but it falls short on a couple of counts for my requirements:

  • The first issue is that in order to represent vanilla data as RDF there's a certain amount of manual mapping that needs to be done. You need to come up with a URI scheme for your imported data, and you then need to do some schema and ontology work so that the data can be semantically joined with other RDF data. This manual import overhead removes the ability to do one-click database imports, which is something I'd like to achieve with my database tool.

  • The second issue is that the RDF model has strict semantic constraints that are difficult to manage over a large set of disconnected parties. Specifically the RDF model says that "URI references have the same meaning whenever they occur". This 'same meaning' is difficult to enforce without central control and makes RDF brittle in the face of merging data from globally disconnected teams.

TagTriples was my first attempt at creating a simplified RDF-like model, but it suffers from the problem that it can't represent anonymous nodes. This makes importing tree structures like XML or JSON a tricky exercise as you need to have some way to generate branch node labels from data that has none. When I was designing tagtriples I was also thinking in terms of an interchange format (like RDF). I no longer think creating an interchange format is important - the world already has plenty of these.

Btriples is basically my attempt at fixing the problems with tagtriples. The format is triple based like RDF and so I borrow a bunch of the terms from the RDF model.

BTriples Specification

The Btriples universe consists of a set of distinct graphs (think: documents). Each graph consists of an ordered set of statements. A statement is intended to convey some information about a subject. Each statement has three parts: a subject, a predicate (or property) and an object.

  • A subject identity is anonymous and is local to the graph. This means you can't refer to it outside the graph. (This is similar to a 'blank node' in RDF).
  • A predicate is a literal symbol (e.g. a string or a number).
  • An object is either a literal symbol or an internal reference to a subject in the same graph.

Example (logical) statements:

  // row data
#1 name "Phil Dawes"
#1 "hair colour" Brown
#1 plays "French Horn"

  // array
#2 elem "Item 1"
#2 elem "Item 2"
#2 elem "Item 3"
#2 elem "Item 4"

  // tree
#3 type feed
#3 entry #4
#4 title "BTriples - a model for aggregating structured data"
#4 content "blah blah ..RDF... blah"

That's it.
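
To make that concrete, here's a minimal sketch of how the model could be held in code (names like Graph and new_subject are mine, purely for illustration - BTriples itself defines no API):

# Minimal sketch of the BTriples model (illustrative only, not a spec).
# A graph holds an ordered list of (subject, predicate, object) statements.
# Subjects are opaque ints allocated by the graph; objects are either literal
# symbols or internal references to subjects in the same graph.

class Graph:
    def __init__(self):
        self.statements = []        # ordered (subject, predicate, object) statements
        self._next_subject = 0

    def new_subject(self):
        """Allocate a fresh anonymous subject, local to this graph."""
        self._next_subject += 1
        return self._next_subject

    def add(self, subject, predicate, obj):
        self.statements.append((subject, predicate, obj))

# the 'tree' example from above
g = Graph()
feed = g.new_subject()
entry = g.new_subject()
g.add(feed, "type", "feed")
g.add(feed, "entry", entry)     # object is an internal subject reference
g.add(entry, "title", "BTriples - a model for aggregating structured data")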

Notes:

  • Btriples is not an interchange format. I have deliberately not defined a serialization of BTriples.

  • BTriples graphs are disconnected: Btriples does not define a method for them to refer to each other.

  • Perhaps the biggest departure from RDF is that there are no formal semantics in Btriples. The btriples model cannot tell you if a subject in one graph denotes the same thing as a subject in another.

  • Also the semantic meaning of symbols is not defined by BTriples and is up to the user to decide. Two identical symbols do not necessarily 'mean' the same thing.

  • The statements in a BTriples graph are *ordered*, so you can get data out in the same order it went in.

  • I'm not crazy about the BTriples name. Maybe I'll change it.

Indexing dilemma - in memory or on disk?

I'm trying to work out if it's possible to get the index searching performance I want using a disk-backed store, or whether I need to focus on optimising the indexes to fit totally in memory. The problem is that the optimisation strategies are somewhat different:

Storing indexes on disk: increase redundancy, trading space for better locality of reference.

Storing indexes in memory: reduce redundancy, trading locality of reference for fitting more data in memory.

Of course reference locality for caching does have an effect in memory as well as on disk, but there it's more like a factor of ~10 (i.e. ~10 times slower to fetch a word from main memory than from L1 cache), whereas a disk seek compared to a sequential read is more like a factor of ~100000.

Typical server machines seem to have on the order of ~100 times more disk than memory, so whatever happens the redundancy increase needs to be less than a factor of 100 or there's little point.

Sequential disk reads are ~10 times slower than random access memory reads (cache misses) and ~100 times slower than sequential (L1) cache reads, so even if there are no seeks this is still going to be 10-100 times slower. But the payoff is potentially a lot more indexed data per server node.


N.B. these numbers are very approximate orders of magnitude. Assumptions are from the links below, along with my own measurements using hdparm, bonnie++, seeker.c and some of my own code. Please let me know if I'm wildly out with any of these!

http://www.linuxinsight.com/howfastisyourdisk.html
http://homepage.virgin.net/roy.longbottom/randmem%20results.htm

Indexes, Hashes & Compression

The new triplestore is coming along. It can do substring text searches (using a suffix array) and has a basic relational query engine. It doesn't optimise query plans yet, but if you write the query clauses in a good order (most selective first) then you get good performance.
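
For the curious, here's a toy sketch of the suffix-array part on a plain string (very much simplified - the real index has to map hits back to symbols and triples, but the core lookup is the same idea):

# Toy sketch of substring search via a suffix array (simplified; a real index
# would not materialise the suffixes and would map hits back to triples).
import bisect

def build_suffix_array(text):
    # naive O(n^2 log n) construction - fine for a sketch
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_substring(text, sa, query):
    """Return the text offsets where query occurs."""
    suffixes = [text[i:] for i in sa]           # materialised only for clarity
    lo = bisect.bisect_left(suffixes, query)    # first suffix >= query
    hi = lo
    while hi < len(sa) and suffixes[hi].startswith(query):
        hi += 1                                 # all matches are contiguous
    return sorted(sa[lo:hi])

text = "phil dawes plays french horn"
sa = build_suffix_array(text)
print(find_substring(text, sa, "horn"))         # [24]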

A few things have changed in my thinking since my last post about indexing. Although using hashes to represent tokens is really useful when joining datasets from different nodes in a cluster (no coordination overhead), I'm now thinking that they're not such a good idea when laying out the actual triple indexes in memory (or on disk).

My reasoning is:

In order to get the performance I want (100-row results from relational queries in ~half a second) I'm either going to have to keep the entire set of indexes in memory, or at the very least minimise the disk activity to a small number of sequential reads. Disk seeks are on the order of ~10ms, so 50 of them and I'm shot. If I end up aiming for the all-in-memory approach then I want to cram as many triples into memory as possible. If I do use disk reads then locality of reference will be fundamental to the layout of the indexes.

Either way, I'm going to need to use compression on the indexes to achieve optimal storage or read efficiency. The problem with hashes is that they introduce a lot of randomness into the mix which reduces the ability to do delta compression (and then run length encoding of the deltas). I suspect that controlling allocation of identifiers could also be very useful in optimising locality of reference. All this is theoretical at the moment as I haven't actually implemented any index compression, but I hope to do this soon.
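
To make the compression point concrete, here's a rough sketch of delta-encoding a sorted id list into varints (nothing like this is implemented yet, and I've used varints here rather than run-length encoding - it just shows why dense, deliberately allocated identifiers compress far better than random hashes):

# Sketch: delta + varint encoding of a sorted id list. Densely allocated ids
# produce tiny gaps (1 byte each); random 64-bit hashes produce huge gaps.
import random

def encode_varint(n, out):
    while True:
        byte = n & 0x7f
        n >>= 7
        if n:
            out.append(byte | 0x80)     # more bytes follow
        else:
            out.append(byte)
            return

def encode_sorted_ids(ids):
    out = bytearray()
    prev = 0
    for i in ids:
        encode_varint(i - prev, out)    # store the gap, not the absolute value
        prev = i
    return bytes(out)

sequential = list(range(1000, 2000))                            # controlled allocation
hashed = sorted(random.getrandbits(64) for _ in range(1000))    # random 64-bit hashes
print(len(encode_sorted_ids(sequential)))   # ~1000 bytes
print(len(encode_sorted_ids(hashed)))       # ~8000 bytes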

Some ideas for static triple indexing

I wrote a bit about representing structured data in the last post. Here are some ideas for how I plan to index the data.

Indexing graphs as subject ranges

In indexing triples I need to provide indexed lookups for all 6 of the possible triple query patterns:

s->po sp->o p->os po->s o->sp os->p

(s=subject p=property/predicate o=object)

Most mature triplestores also index a 4th query element, 'graph' or 'context'. I intend to support this query type without expanding the index by using a trick: in my triples format the fact that subjects are auto-generated and local to the graph means I can choose them to be sequential and effectively re-use them as graph indexes - e.g. subjects between 193 and 11255 belong to graph 2, and so on. So, for example, the 'os->p' index can also support an 'og->sp' query pattern by restricting the range of subject matches to only those in the appropriate range.
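
Sketched in code, the idea is just a range restriction (the ranges and index layout below are made up for illustration):

# Sketch of the graph-as-subject-range trick: graphs are assigned contiguous
# subject ranges at load time, so a 'graph' constraint is a subject-range filter.

# graph id -> (first subject, last subject), assigned sequentially on import
GRAPH_RANGES = {1: (1, 192), 2: (193, 11255), 3: (11256, 20000)}

def og_to_sp(o_index, obj, graph):
    """'og->sp' query: (subject, predicate) pairs for an object within one graph."""
    lo, hi = GRAPH_RANGES[graph]
    return [(s, p) for (s, p) in o_index.get(obj, []) if lo <= s <= hi]

# toy object -> sorted [(subject, predicate)] index
o_index = {"feed": [(5, "type"), (200, "type"), (11300, "type")]}
print(og_to_sp(o_index, "feed", 2))     # [(200, 'type')]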

Cache locality and triple indexes

I mentioned in my last post that I intended to use memory-mapped sorted arrays for indexing. Nick Johnson left a comment on my last post (thanks Nick!) alerting me to the better cache locality properties of n-ary trees (where n is the number of elements that fit in a disk block) compared with binary-searching flat arrays.

This is a big improvement. For example, for a 1 million element array of 32bit values a cold search costs only 2 page faults (log1024(1M) ≈ 2, assuming a block size of 4096 bytes - i.e. 1024 elements per block). A binary search on a cold 1M element flat array would be more like ~11 faults (the first ~10 probes each land on a different page; the last ~10 all happen within the same block). This prompted me to do some reading on cache locality.

As it turns out (unless I'm missing something) the actual approach I had in mind should be pretty optimal cache locality wise. The plan is to exploit the even randomness of the hashes to reduce the searching overhead, hopefully amortizing to constant lookup (and fault) times.

To illustrate this, consider the o->ps and op->s lookup patterns. I plan to index these through a 3-level hierarchy of sorted arrays, o->p->s: i.e. a sorted array of objects, each of which points to a child array of predicates, each of which points to an array of subjects.


[o1 o2 o3 o4 o5 o6 o7 ...] (object array)
          /          
         /           
   ...][p1 p2 p3 p4][..    (o4 predicate array)
          /
         /
  ...][s1 s2 s3 s4][....     (o4 p2 subject array)

The trick here is that each array is a sorted array of unique hashes, which because of the randomness of the hash should be evenly spread over the search space.

That means that an object with hash 'h' should be approximately in the position:

(h / hash-range) * numelements.

E.g. if the hash range is unsigned 32bit (0-4294967295), a 2147483648 value is in the middle of the array. The search would try this position first, and then use linear probing to locate the value. I'm hoping that this will result in 3 page faults to locate the first matching triple regardless of the size of the data. Because it doesn't implicitly rely on any block size it should also respond well to L1/L2 caching (unless I'm missing something!).
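
In code, the probe looks something like this (a sketch over a plain list; the real thing would run over a memory-mapped array of fixed-width values):

# Sketch: locate a hash in a sorted array of evenly-distributed unique hashes
# by jumping to the estimated position and then probing linearly.
import random

HASH_RANGE = 2**32

def find_hash(arr, h):
    """Return the index of h in the sorted, non-empty array arr, or -1 if absent."""
    pos = min((h * len(arr)) // HASH_RANGE, len(arr) - 1)   # estimated position
    if arr[pos] < h:
        while pos < len(arr) and arr[pos] < h:      # probe forwards
            pos += 1
    else:
        while pos > 0 and arr[pos - 1] >= h:        # probe backwards
            pos -= 1
    return pos if pos < len(arr) and arr[pos] == h else -1

arr = sorted(random.sample(range(HASH_RANGE), 100000))
assert find_hash(arr, arr[42]) == 42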

Hash lengths

As mentioned in the previous post I'm planning on internally identifying each symbol with a 64bit hash, along the same lines as 3store. However I'm currently thinking that I'll only use the first 32 bits in the top 2 lookup indexes. This will make the indexes denser which I think should help with L1/L2 cache locality when probing for a match. Of course the tradeoff is that there will be a lot of duplicate hashes - to account for these I'll put the latter 32bits in the 3rd level data arrays so that they can be filtered before joining.
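
A rough sketch of the split (speculative, like everything else in this post):

# Sketch: index on the top 32 bits of the 64-bit symbol hash (denser upper
# levels), and keep the low 32 bits with the third-level entries so that
# collisions can be filtered out before joining.

def split_hash(h64):
    return h64 >> 32, h64 & 0xffffffff      # (index key, disambiguator)

def low_bits_match(candidate_low32, wanted_h64):
    return candidate_low32 == (wanted_h64 & 0xffffffff)

h = 0x1234abcd9f00ff00
key, low = split_hash(h)
assert key == 0x1234abcd and low_bits_match(low, h)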

N.B. I have no empirical performance data to back up any of these ideas, so this is all speculation at the moment (and likely to change as I gain in experience). I'd appreciate it if anybody can see where I've overlooked something or has any more optimal ideas for storing static triple data.

Indexing structured data (again)

Now that I'm up and running and starting to get productive on Gambit-C, I've turned my attention back to indexing structured data.

I've modified the tagtriples idea a bit to reflect my experience on importing data in other formats. I still think the most effective approach is not to try and define an interchange format, but instead to create an indexing scheme that can index data from lots of different structured data formats/models.

So the end goal is a system that can easily aggregate and index static structured data from a variety of systems and formats, and provide relational joining across graphs/datasets. Oh, and it's got to scale.

The data model I'm working with is similar to my tagtriples format except each subject is now system generated, local to a particular graph/dataset, and opaque (i.e. not exposed to the user). RDF aficionados will notice this is similar to BNodes - think of it as RDF triples where the subject can only be a BNode and properties and objects are values.


#24221 name "Phil Dawes"
#24221 age 31
#24221 likes cheese

I've adopted this approach mainly to ease translation from other formats into this one. Automating the creation of good human-readable identifiers for subjects is often impossible when translating data from other formats that don't contain a natural subject identifier.
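
For example, importing a JSON-ish tree becomes completely mechanical: every nested object or array just gets a fresh system-generated subject (a sketch - the predicate name "elem" and the function names are mine, not part of any format):

# Sketch: importing a JSON-ish tree into (subject, predicate, object)
# statements. Nested structures get fresh opaque subjects; no human-readable
# identifiers need to be invented during import.
import itertools

_ids = itertools.count(1)

def import_tree(value, statements):
    """Return a literal, or a freshly allocated subject for a nested structure."""
    if isinstance(value, dict):
        subj = next(_ids)
        for key, child in value.items():
            statements.append((subj, key, import_tree(child, statements)))
        return subj
    if isinstance(value, list):
        subj = next(_ids)
        for child in value:
            statements.append((subj, "elem", import_tree(child, statements)))
        return subj
    return value                    # plain literal symbol

stmts = []
import_tree({"name": "Phil Dawes", "age": 31, "likes": ["cheese"]}, stmts)
for s in stmts:
    print(s)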

Having opaque subjects also means that identity is only precise within a graph; joins across graphs must be achieved by matching properties and objects. This coincides with my ideas about global identity - i.e. that a universal identity scheme is very difficult (/impossible) to implement and coordinate usefully on a large scale. I think the best way of handling identity on a large scale is through a combination of shared context and description using terms from already-understood local vocabularies (e.g. English, email addresses, URLs, post codes etc.).

What this means in practice is that the person/agent querying the data must know something of the terms and structures used in the datasets to be able to join them. It also means that they must include additional clauses in the query to provide enough disambiguation to uniquely identify terms in the dataset (in the absence of URIs). E.g. I want the "Phil Dawes" with the email address 'phil@example.com'.

In practice this disambiguation tends to happen anyway even with RDF data (e.g. witness smushing of FOAF data). This is simply because coordinating artificial identifiers on a large scale is so difficult that people naturally revert to matching commonly understood terms from pre-existing vocabularies. (hence my previous posts about URIs being a bad tradeoff and not providing much value for the problems they create).

Anyway, from the indexing point of view I've ditched a relational database as a backend and am now working with memory mapped files. This provides more flexibility for indexing strategies. I'm going to begin by using sorted arrays for triple indexes as I'm working with relatively static data and this sounds like a good tradeoff of space to search speed. More about ideas on indexing in another post.

From the point of view of internal symbol identifiers, I'm leaning towards using hashes internally to represent symbols rather than incrementally allocated ids. This is similar to the approach adopted by 3store (PDF).
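
For illustration, deriving a 64-bit symbol hash could be as simple as truncating a cryptographic digest (a sketch - not necessarily the exact hash function 3store uses):

# Sketch: map a symbol to a 64-bit identifier by truncating a digest.
import hashlib

def symbol_id(symbol):
    digest = hashlib.md5(symbol.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")    # first 64 bits

# two nodes hashing the same symbol independently get the same id,
# with no coordination between them
assert symbol_id("Phil Dawes") == symbol_id("Phil Dawes")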

N.B. Personally I don't think it's a good tradeoff for 3store. 3store isn't (wasn't?) a federated store, so coordinating incrementally allocated IDs is easy and would halve the index space (32 bits for an ID vs 64 bits for a hash). Index size vs memory is the biggest performance bottleneck in my experience.

For a distributed parallel database however, hashes pay dividends. You don't need to worry about allocating unique IDs between parallel processes, and more importantly you don't need to coordinate IDs between nodes for data joining.

Anyway, I've got some ideas about distributed indexing so hopefully I'll get a chance to write about this soon.

The path from specificity to usefulness

I tried to comment on Seth's post, but I think the comments on his blog are a bit broken at the moment (the captcha question wasn't rendering, so I couldn't answer it!). I guess I'll trackback instead:

The path from specificity to usefulness that Seth describes was exactly the trip I took attempting to implement semantic web approaches at work to assist with managing IT operations info (which was stored in various silos). I started off with RDF and built a store which used some OWL rules to connect the data from the various sources. This proved cumbersome and difficult - other people found it quite a hurdle trying to understand RDF, and the same URIs got used to mean subtly different things (e.g. IT application vs project).

After a year and a half of evolving the system the best solution ended up being to just index triples of words. Vaguer than URIs, but easier to harvest and match from databases. Since humans write the queries, it turned out that the vagueness wasn't a problem at all. Universal Specificity (such as is required by URIs/RDF) just doesn't seem to scale very well in my experience.

Dark side of the semantic web

I haven't said anything much about semantic web stuff for a while as I've been occupied with other things. However Jim Hendler's 'Tales from the Dark Side' piece in IEEE Intelligent Systems rekindled an old interest. In short: I still think the RDF people have got it wrong with URIs, and so far nobody's convinced me otherwise.

My (same old) argument: URIs are bad for large-scale interoperability. The alternative: just use words and symbols occurring in real life, and use the context inherent in the communication to disambiguate meaning.

The interesting thing about the Hendler piece is that it pretty much walks through the arguments I make for dropping URIs, but then avoids the conclusion:

"If you and I decide that we will use the term "http://www.cs.rpi.edu/~hendler/elephant" to designate some particular entity, then it really doesn't matter what the other blind men think it is, they won't be confused when they use the natural language term "Elephant" which is not even close, lexigraphically, to the longer term you and I are using. And if they choose to use their own URI, "http://www.other.blind.guys.org/elephant" it won't get confused with ours. The trick comes, of course, as we try to make these things more interoperable. It would be nice if someone from outside could figure out, and even better assert in some machine-readable way, that these two URIs were really designating the same thing - or different things, or different parts of the same thing, or ... ooops, notice how quickly we're on that slippery slope."

And this neatly sums up the situation with URIs. The low chance of collision represents a tradeoff: You get a high level of semantic precision - it's extremely unlikely that two parties will use the same URI to mean two totally unconnected things. You also get a very low level of semantic interoperability: it's equally unlikely that two unconnected parties will use the same URI to denote (even parts of!) the same thing.

Now I think the precision part is overrated - disambiguation of natural language terms can be tractably (and often trivially) achieved using contextual cues. However interoperability of data from unconnected sources is really hard, and that's why I think this is a bad tradeoff.

Anyway, the crux of the Hendler piece is that for all the high level work going on in Semantic Web land (ontology languages, description logic), it's currently simple interoperability mechanisms that gain most traction and add the most value: 'a little semantics goes a long way'.

The piece implies (afaics) that this is where effort should be directed, and cites the example of matching FOAF data using email addresses as an illustration of the potential success of this approach. The matching heuristic is: if two FOAF resources are describing people with the same email address, they're very likely to be about the same person.

My experience concurs with the 'a little semantics goes a long way' sentiment, but personally I think FOAF has succeeded (for some measure of success) not because of RDF but in spite of it. I'd argue that the only reason the email matching works on a large scale is because email addresses are already concrete symbols grounded in the real world. FOAF didn't create them, it just provided a context for their use. FOAF's formal semantics certainly didn't create this interoperability - the largest body of FOAF data is scraped from LiveJournal's databases, where the users creating the data have little concept of the ramifications of the 'http://xmlns.com/foaf/0.1/mbox' property.

If FOAF had to rely on artificial URIs as the sole means for identifying people it would struggle to gain any traction in the messy real world of the web.

However on the flip side I think FOAF would work just as well (and gain a lot more traction) if its underlying model didn't employ URIs at all and instead just used triples of words/symbols. Semantic web software would still be able to identify and index FOAF data: i.e. the symbol 'FOAF' is pretty unambiguous on its own, but even if it wasn't the juxtaposition of the symbol FOAF with properties like 'mbox', 'surname' etc.. would suffice for pretty accurate disambiguation.

Quad store performance solutions?

I couldn't find a way to comment on Benjamin's post, so I've stuck it here:

What indexing are you using? My tagtriples store schema is basically a table with 4 ids which joins to an (ID,String) table. When it used to be an RDF store this held both literals and URIs.

I found the key to getting good performance with this arrangement was ensuring that the MySQL query engine had enough optimised B-tree indexes to make a good guess at query execution order. I eventually went for 6 compound indexes on the triples table (ignore the separate graphs table - I ditched this soon after).
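
For illustration, the arrangement was along these lines (sketched here with sqlite so the snippet is self-contained - the real store was MySQL, and the table/column names are made up rather than my actual schema):

# Sketch: a 4-id quad table joined to an (id, string) symbol table, plus six
# compound indexes so the query engine can pick a good execution order.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE symbols (id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE quads (graph INTEGER, subject INTEGER, predicate INTEGER, object INTEGER);
    CREATE INDEX idx_spo ON quads (subject, predicate, object);
    CREATE INDEX idx_sop ON quads (subject, object, predicate);
    CREATE INDEX idx_pso ON quads (predicate, subject, object);
    CREATE INDEX idx_pos ON quads (predicate, object, subject);
    CREATE INDEX idx_osp ON quads (object, subject, predicate);
    CREATE INDEX idx_ops ON quads (object, predicate, subject);
""")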

Hope this helps (and sorry if I'm mis-assuming something or have mis-understood the problem)

microqueries

I've recently been experimenting with ways to provide simpler structured searching/querying to 'normal' web users (i.e. not techies). Sparql/SQL querying doesn't cut it here - we need something simpler.

One approach I've been trying is allowing simple query constraints in with the text search facility. Using its proximity-searching capability, JAM*VAT then finds a collection of symbols that match all the constraints in close proximity.

E.g. the search: "danny ayers >2005-10-25 <2005-10-27"

..brings back subjects linked to the words 'danny' and 'ayers' and any dates between 2005-10-25 and 2005-10-27 - in this case it finds blog posts made between these dates.

N.B. you may need to tweak the dates a bit in the above example if you're reading this later than october 2005.
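
A sketch of how a query string like that might be split into word terms and date-range constraints (purely illustrative - not JAM*VAT's actual parser):

# Sketch: split a microquery into word terms and date-range constraints.
import datetime

def parse_microquery(q):
    words, after, before = [], None, None
    for tok in q.split():
        if tok.startswith(">"):
            after = datetime.date.fromisoformat(tok[1:])
        elif tok.startswith("<"):
            before = datetime.date.fromisoformat(tok[1:])
        else:
            words.append(tok)
    return words, after, before

print(parse_microquery("danny ayers >2005-10-25 <2005-10-27"))
# (['danny', 'ayers'], datetime.date(2005, 10, 25), datetime.date(2005, 10, 27))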

More JAM*VAT query features

Using a relational database as a triplestore backend has a number of advantages - one of which is leveraging features of the backend SQL support with very little effort.

I've recently added a whole bunch of functionality to ttql (the experimental query language that JAM*VAT uses for querying). These include:

  • SQL (MySQL) numeric and string functions (e.g. CONCAT, COUNT, SUM, MIN/MAX etc.)
  • OrderBy
  • GroupBy
  • Limit

(this in addition to optional blocks and indexed substring comparisons).

Also, numeric and date comparisons are now indexed (dates are converted from various formats to a common representation at the query parsing stage).

Oh - and I've made select queries NON-distinct by default. I realised that since tagtriples was not logical-set based (the order matters), the same statement can quite legitimately crop up more than once in a graph. It thus follows that it ought to be possible to get all the occurrences out via a structured query, and so I've added the DISTINCT keyword and made the queries non-distinct by default.

I really must get round to documenting this on the query page; if you can't wait (yeah right!), check out the test_ttql.py file for the query unittests.