I've been thinking a bit about scaling triplestores recently. My MySQL-based tagtriples store is working well at work, but it's ultimately limited by the amount of memory you can cram into a single machine. I've recently become seduced by the idea of putting all the bank's data (world's data?) into one massive triplestore, and that really requires clustering somehow.

So how can you utilise a cluster of machines to manage triple-oriented data?

AFAICS the simplest approach would be to just spread the triple indexes evenly across the machines in the cluster. An agent coordinating a query would then need to break the query into individual triple patterns, send each one out in turn to the members of the cluster holding the appropriate piece of index, and join the results with those of the previous requests.
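Something like this sketch, say. None of it is real tagtriples code (Shard, match and run_query are all invented for illustration), but it shows the pattern-at-a-time join loop I have in mind:

```python
class Shard:
    """One cluster node's slice of the triple index (a toy, in-memory stand-in)."""
    def __init__(self, triples):
        self.triples = triples

    def match(self, pattern, bindings):
        """Yield extended binding dicts for one triple pattern.
        Terms starting with '?' are variables; anything else must
        match the stored triple exactly."""
        for triple in self.triples:
            new = dict(bindings)
            for term, value in zip(pattern, triple):
                if isinstance(term, str) and term.startswith('?'):
                    if new.get(term, value) != value:
                        break          # conflicts with an earlier binding
                    new[term] = value
                elif term != value:
                    break              # constant term doesn't match
            else:
                yield new

def run_query(patterns, shards):
    """Pattern-at-a-time execution: each triple pattern costs (at least)
    one network roundtrip. The local calls here stand in for batched
    requests to whichever nodes hold the relevant index slice."""
    bindings = [{}]
    for pattern in patterns:           # assume already in optimised join order
        bindings = [b2 for b in bindings
                    for shard in shards
                    for b2 in shard.match(pattern, b)]
    return bindings

# e.g. a two-pattern query needing one join:
shards = [Shard([('alice', 'works_at', 'acme'),
                 ('acme', 'located_in', 'london')])]
print(run_query([('?p', 'works_at', '?org'),
                 ('?org', 'located_in', '?city')], shards))
# -> [{'?p': 'alice', '?org': 'acme', '?city': 'london'}]
```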

Assuming that the network is the overriding performance bottleneck (and that the query join order is optimised appropriately), this would result in query performance scaling roughly linearly with the number of pattern joins in the query: a query of n triple patterns needs n-1 joins, and hence roughly n roundtrips.
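To put some (entirely invented) numbers on that, here's a back-of-envelope cost model. Every constant is an illustrative placeholder rather than a measurement:

```python
def estimated_latency_ms(num_patterns, roundtrip_ms=1.0,
                         avg_intermediate_results=1000, per_result_ms=0.001):
    """Crude latency model for the pattern-at-a-time scheme above:
    one network roundtrip per triple pattern, plus the cost of shipping
    intermediate bindings back to the coordinator between joins."""
    roundtrips = num_patterns * roundtrip_ms
    transfers = (num_patterns - 1) * avg_intermediate_results * per_result_ms
    return roundtrips + transfers

# Doubling the pattern count roughly doubles the latency:
print(estimated_latency_ms(4))   # 7.0
print(estimated_latency_ms(8))   # 15.0
```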

One optimisation to this scheme would be to split the data indexes across the cluster nodes according to data source/provenance (i.e. by triple 'graph'). Experience with tagtriples shows that a large proportion of query joins are intra-graph*, so by doing this you could limit the roundtrips to just the inter-graph joins (see the sketch below).

(* Probably more so with tagtriples than with RDF, since tagtriples doesn't have precise identifiers, so you need more joins to ensure precision)
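Something like the sketch below, reusing the toy Shard class from earlier. shard_for_graph and patterns_by_graph are invented names, and I'm assuming the coordinator can tell which graph each pattern targets, which is a big simplification:

```python
def merge(b1, b2):
    """Join two binding dicts; return None if a shared variable conflicts."""
    merged = dict(b1)
    for var, val in b2.items():
        if merged.get(var, val) != val:
            return None
        merged[var] = val
    return merged

def shard_for_graph(graph, shards):
    """Keep each graph's entire index slice on a single node (placed by
    hash here, so placement is only stable within one process; a real
    system would want a persistent mapping)."""
    return shards[hash(graph) % len(shards)]

def run_partitioned_query(patterns_by_graph, shards):
    """One roundtrip per graph: each node joins its own graph's patterns
    locally, and only the per-graph result sets cross the network for
    the coordinator to finish the inter-graph joins."""
    per_graph = []
    for graph, patterns in patterns_by_graph.items():
        node = shard_for_graph(graph, shards)
        bindings = [{}]
        for pattern in patterns:       # this loop runs locally on `node`
            bindings = [b2 for b in bindings
                        for b2 in node.match(pattern, b)]
        per_graph.append(bindings)
    # Inter-graph joins at the coordinator: merge the per-graph results,
    # discarding combinations whose shared variables conflict.
    results = [{}]
    for bindings in per_graph:
        results = [m for r in results for b in bindings
                   if (m := merge(r, b)) is not None]
    return results
```

With that layout, a query touching two graphs costs two roundtrips regardless of how many intra-graph joins it contains.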