Indexing structured data (again)

Now that I'm up and running and starting to get productive on Gambit-C, I've turned my attention back to indexing structured data.

I've modified the tagtriples idea a bit to reflect my experience on importing data in other formats. I still think the most effective approach is not to try and define an interchange format, but instead to create an indexing scheme that can index data from lots of different structured data formats/models.

So the end goal is a system that can easily aggregate and index static structured data from a variety of systems and formats, and provide relational joining across graphs/datasets. Oh, and it's got to scale.

The data model I'm working with is similar to my tagtriples format except each subject is now system generated, local to the a particular graph/dataset, and opaque (i.e not exposed to the user). RDF officiandos will notice this is similiar to BNodes - think of it as RDF triples where the subject can only be a BNode and properties and objects are values.


#24221 name "Phil Dawes"
#24221 age 31
#24221 likes cheese

I've adopted this approach mainly to ease translation from other formats into this one. Automating the creation of good human-readable identifiers for subjects is often impossible when translating data from other formats that don't contain a natural subject identifier.

Having opaque subjects also means that the concept of identity internal to the graph is specific, but joins across graphs must be achieved by matching properties and objects. This coincides with my ideas about global identity - i.e. that a universal identity scheme is very difficult (/impossible) to implement and coordinate usefully on a large scale. I think the best way of handling identity on a large scale is through a combination of shared context and description using terms from already understood local vocabularies (e.g. english, email addresses, URLs, post codes etc..).

What this means in practice is that the person/agent querying the data must know something of the terms and structures used in the datasets to be able to join them. It also means that they must include additional clauses in the query to provide enough disambiguation to uniquely identifiy terms in the dataset (in the absence of URIs). E.g. I want the "Phil Dawes" with an email address 'phil@example.com'.

In practice this disambiguation tends to happen anyway even with RDF data (e.g. witness smushing of FOAF data). This is simply because coordinating artificial identifiers on a large scale is so difficult that people naturally revert to matching commonly understood terms from pre-existing vocabularies. (hence my previous posts about URIs being a bad tradeoff and not providing much value for the problems they create).

Anyway, from the indexing point of view I've ditched a relational database as a backend and am now working with memory mapped files. This provides more flexibility for indexing strategies. I'm going to begin by using sorted arrays for triple indexes as I'm working with relatively static data and this sounds like a good tradeoff of space to search speed. More about ideas on indexing in another post.

From the point of view of internal symbol identifiers, I'm leaning towards using hashes internally to represent symbols rather than incrementally allocated ids. This is similar to the approach adopted by 3store(PDF).

N.B. Personally I don't think it's a good tradeoff for 3store. 3store isn't (wasn't?) a federated store and so coordinating incremental allocated IDs is easy and would halve the index space (32bits for an ID vs 64 bits for a hash). Index size vs memory is the biggest performance bottleneck in my experience.

For a distributed parallel database however, hashes pay dividends. You don't need to worry about allocating unique IDs between parallel processes, and more importantly you don't need to coordinate IDs between nodes for data joining.

Anyway, I've got some ideas about distributed indexing so hopefully I'll get a chance to write about this soon.