Things have settled down a bit after the birth of baby #2 and I'm starting to get some time to program again: about an hour a night. That means I'm back to thinking a lot about indexing structured data.

Here are my most up-to-date thoughts on a model for representing aggregated structured data which I'm tentatively calling 'BTriples'. I'm writing this down mainly so I can refer to it in future writing.

BTriples is intended as the internal model for an OLAP database: it should represent structured data from a variety of popular formats (JSON, XML, CSV, relational) so that the database can index and query across heterogeneous data sources.

A good candidate for such a model would appear to be RDF, but it falls short of my requirements on a couple of counts:

  • The first issue is that in order to represent vanilla data as RDF there's a certain amount of manual mapping that needs to be done. You need to come up with a URI scheme for your imported data, and you then need to do some schema and ontology work so that the data can be semantically joined with other RDF data. This manual import overhead removes the ability to do one-click database imports, which is something I'd like to achieve with my database tool.

  • The second issue is that the RDF model has strict semantic constraints that are difficult to manage across a large set of disconnected parties. Specifically, the RDF model says that "URI references have the same meaning whenever they occur". This 'same meaning' is difficult to enforce without central control, which makes RDF brittle when merging data from globally disconnected teams.

TagTriples was my first attempt at creating a simplified RDF-like model, but it suffers from the problem that it can't represent anonymous nodes. This makes importing tree structures like XML or JSON a tricky exercise, as you need some way to generate branch node labels from data that has none. When I was designing TagTriples I was also thinking in terms of an interchange format (like RDF). I no longer think creating an interchange format is important - the world already has plenty of these.

BTriples is basically my attempt at fixing the problems with TagTriples. The model is triple-based like RDF, so I borrow a bunch of terms from the RDF model.

BTriples Specification

The BTriples universe consists of a set of distinct graphs (think: documents). Each graph consists of an ordered set of statements. A statement is intended to convey some information about a subject. Each statement has three parts: a subject, a predicate (or property) and an object.

  • A subject identity is anonymous and is local to the graph. This means you can't refer to it from outside the graph. (This is similar to a 'blank node' in RDF.)
  • A predicate is a literal symbol (e.g. a string or a number).
  • An object is either a literal symbol or an internal reference to a subject in the same graph.
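
As a rough illustration of these three rules, here is a minimal Python sketch. The names `Graph`, `Ref`, `new_subject` and `add` are assumptions made up for this example and are not part of BTriples; the point is only that subjects are anonymous, graph-local handles, that predicates must be literal symbols, and that statements are kept in order.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Symbol = Union[str, int, float]  # literal symbols (strings, numbers)


@dataclass(frozen=True)
class Ref:
    """Anonymous subject identity, valid only inside the graph that made it."""
    graph_id: int
    index: int


Statement = Tuple[Ref, Symbol, Union[Symbol, Ref]]


class Graph:
    """One graph: an ordered collection of statements (hypothetical API)."""

    def __init__(self) -> None:
        self.statements: List[Statement] = []
        self._next = 1

    def new_subject(self) -> Ref:
        # Subjects carry no name, only a graph-local counter value.
        ref = Ref(id(self), self._next)
        self._next += 1
        return ref

    def _owns(self, ref: Ref) -> bool:
        return ref.graph_id == id(self)

    def add(self, subject: Ref, predicate: Symbol,
            obj: Union[Symbol, Ref]) -> None:
        if not isinstance(subject, Ref) or not self._owns(subject):
            raise ValueError("subject must be a reference local to this graph")
        if isinstance(predicate, Ref):
            raise ValueError("predicate must be a literal symbol")
        if isinstance(obj, Ref) and not self._owns(obj):
            raise ValueError("object references must stay inside the graph")
        self.statements.append((subject, predicate, obj))  # order is kept


g = Graph()
feed, entry = g.new_subject(), g.new_subject()
g.add(feed, "type", "feed")
g.add(feed, "entry", entry)  # object is an internal reference
```

Using `id(self)` to tie a reference to its graph is just a convenient shortcut for the sketch; a real store would presumably use its own graph identifiers.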

Example (logical) statements:

// row data
#1 name "Phil Dawes"
#1 "hair colour" Brown
#1 plays "French Horn"

// array
#2 elem "Item 1"
#2 elem "Item 2"
#2 elem "Item 3"
#2 elem "Item 4"

// tree
#3 type feed
#3 entry #4
#4 title "BTriples - a model for aggregating structured data"
#4 content "blah blah ..RDF... blah"

That's it.
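
To show why anonymous subjects make tree imports easy, here is a small sketch of a JSON importer in Python. It is only one possible mapping under stated assumptions: `import_json` and `Ref` are hypothetical names, each JSON object or array becomes a fresh anonymous subject, object keys become predicates, and array items reuse the `elem` predicate from the array example above.

```python
import json
from typing import Any, List, NamedTuple, Tuple


class Ref(NamedTuple):
    """Graph-local subject reference (the #n in the examples)."""
    n: int


def import_json(text: str) -> List[Tuple[Ref, Any, Any]]:
    """Turn a JSON document into an ordered list of statements."""
    statements: List[Tuple[Ref, Any, Any]] = []
    counter = 0

    def new_subject() -> Ref:
        nonlocal counter
        counter += 1
        return Ref(counter)

    def emit(subject: Ref, predicate: Any, value: Any) -> None:
        if isinstance(value, (dict, list)):
            # Branch node: no label needed, just a new anonymous subject.
            child = new_subject()
            statements.append((subject, predicate, child))
            walk(child, value)
        else:
            statements.append((subject, predicate, value))

    def walk(subject: Ref, node: Any) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                emit(subject, key, value)
        else:  # list
            for item in node:
                emit(subject, "elem", item)

    data = json.loads(text)
    if not isinstance(data, (dict, list)):
        # Assumption for this sketch: the top level is an object or array.
        raise ValueError("top-level JSON value must be an object or array")
    walk(new_subject(), data)
    return statements


doc = '{"type": "feed", "entry": {"title": "BTriples", "tags": ["rdf", "olap"]}}'
for s, p, o in import_json(doc):
    print(s, p, o)
```

No URI scheme or schema work is needed here: branch nodes get a subject number and nothing else, which is what makes a one-click import plausible.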

Notes:

  • BTriples is not an interchange format. I have deliberately not defined a serialization of BTriples.

  • BTriples graphs are disconnected: BTriples does not define a way for one graph to refer to another.

  • Perhaps the biggest departure from RDF is that there are no formal semantics in BTriples. The BTriples model cannot tell you if a subject in one graph denotes the same thing as a subject in another.

  • Also, the meaning of symbols is not defined by BTriples and is up to the user to decide. Two identical symbols do not necessarily 'mean' the same thing.

  • The statements in a BTriples graph are *ordered*, so you can get data out in the same order it went in.

  • I'm not crazy about the BTriples name. Maybe I'll change it.
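
The notes on disconnected graphs and statement order can be illustrated with a tiny Python sketch. Everything here is an assumption for the sake of the example: the `universe` dict, its keys (which are only handles for telling graphs apart) and the `objects` helper are not part of BTriples. Subject `1` appears in both graphs but the two are unrelated, and results come back in the order the statements went in.

```python
from typing import Any, Dict, List, Tuple

Statement = Tuple[int, Any, Any]  # (graph-local subject id, predicate, object)

# Hypothetical store: the keys are only handles for telling graphs apart.
universe: Dict[str, List[Statement]] = {
    "people": [
        (1, "name", "Phil Dawes"),
        (1, "plays", "French Horn"),
    ],
    "items": [
        (1, "elem", "Item 1"),
        (1, "elem", "Item 2"),
        (1, "elem", "Item 3"),
    ],
}


def objects(graph: str, subject: int, predicate: Any) -> List[Any]:
    """Objects of matching statements, in the order they were added."""
    return [o for s, p, o in universe[graph] if s == subject and p == predicate]


print(objects("items", 1, "elem"))   # ['Item 1', 'Item 2', 'Item 3']
print(objects("people", 1, "elem"))  # []
```

A lookup always has to name the graph, because a subject id on its own says nothing outside the graph it belongs to.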