Optimising mysql tables for rdf storage

I've made some modifications to the table layout used by my experimental smushing store. In particular, I've ditched the hashes table, and lumped node hashing in with the literals table. This improves smushing performance as it reduces the number of tables to update (the current biggest overhead when smushing resources on ifp).

Here's the new layout:

CREATE TABLE `triples` ( `graph` integer NOT NULL default '0', `subject` integer NOT NULL default '0', `predicate` integer NOT NULL default '0', `object` integer NOT NULL default '0', `literal` tinyint(1) NOT NULL default '0', `inferred` tinyint(1) NOT NULL default '0', UNIQUE KEY `spog` (`subject`,`predicate`,`object`,`graph`), KEY `posg` (`predicate`,`object`,`subject`,`graph`), KEY `ospg` (`object`,`subject`,`predicate`,`graph`), KEY `gspo` (`graph`,`subject`,`predicate`,`object`), KEY `gpos` (`graph`,`predicate`,`object`,`subject`), KEY `gosp` (`graph`,`object`,`subject`,`predicate`) ) TYPE=MyISAM CREATE TABLE nodes ( id integer NOT NULL AUTO_INCREMENT, hash bigint(20) NOT NULL, value text, PRIMARY KEY (hash), KEY (id) ) TYPE=MyISAM CREATE TABLE graphs ( id integer NOT NULL, UNIQUE KEY (id) ) TYPE=MyISAM

N.B. I added a lot more indexing to the triples table a while back. This pretty much copies the kowari indexing from Paul's post in May. (note that he wrongly assumes that mysql doesnt support use of compound indexes in his post. I'm sure somebody must have corrected him by now).

The cool thing about this is that all the relational queries can now be done by mysql solely in the indexes (no need crossref with the data file), which speeds things up a bit. Haven't done any profiling with really large datasets yet, but performs well with a couple of million statements or so.

Most queries (once optimised by the mysql query engine) boil down to:

  1. Translate the literals/uris into IDs via the nodes table (using an md5 hash to rapidly look up the values).
  2. Do the triple pattern joins on the triples table (all within the indexes)
  3. translate the result ids back into values (uris or literals) by joining back to the nodes table.

Ideally I would have got mysql to index the 'value' column in the nodes table directly, and lost the md5 hash column altogether. Unfortunately mysql doesnt appear to have hash index functionality for text/blobs. (It indexes the first n bytes of the field - this doesnt work very well with URIs, and means I can't put a UNIQUE constraint on the index.)

Any comments/suggestions for improvement gratefully received

More import optimisation

Claire's out tonight, so another evening spent on bulk rdf importing. Have managed to get the original 120705 statement dataset import down to 77.6 seconds - that's ~1500 triples a second!

The extra speed was mainly due to removing the need for database URI to id lookups by taking an in-memory copy of the hashes table. The problem wasn't really the lookups (which were cached), but rather the need to check each new URI to see if it's already been used (which thwarts the cache each time). I suspect that Bloom (or DeGan) filters would be really useful here, but I plumped for a python dictionary of 64bit md5 hashes (3store style) since it was easy for me to code quickly.

Anyway, working on proprietary data is not much use for benchmarking purposes, so I ran the new import code over the wordnet 1.6 rdf files. Managed to get all 473589 statements into my store in 315 seconds! - still within the 1500 per second mark.

For anyone wanting to compare with other stores, the order of import matters - I imported in the following order:

  • wordnet_hyponyms-20010201.rdf.xml
  • wordnet_nouns-20010201.rdf.xml
  • wordnet_glossary-20010201.rdf.xml
  • wordnet_similar-20010201.rdf.xml

which appeared to produce the fastest results, although I've not looked into why. Import includes indexing, removing duplicates, doing some FC inferences (although not many).

Oh yeah - I used my work laptop to do the test - is a powerbook g4 with half a gig of ram running gentoo linux. Mysql v4.0.20.

importing statements at speed

Spent the evening working on bulk assert speed for my ifp smushing store. Managed to get a 120000 statement import down from 410 seconds to 109 seconds - that's over 1000 triples per second. Pretty good considering the rdflib rdf/xml parsing takes 59 seconds on its own - I wonder how fast this would be with raptor...

The trick was to lock the database, generate all data into files (allocating ids in memory), then bulk import using 'LOAD DATA INFILE'.

Actually this has lead me to wonder whether there's an rdf store profiling toolkit I should be using to test this - the performance comparison documents I've read use proprietary data, which means I can't use them to compare my store. There was a post a while back with some example (copyright-free) large data files, but I can't find it now. If anybody remembers and knows where they are, please leave a comment.

Query issue when merging schemas

At work in our team we're just hitting the merging-schemas problem.

We have an alerting system that uses rdf to describe the alert schedules for different types of event. It describes people and groups they belong to, in addition to their shift rotas etc..

On discovery of an event (e.g. system failure), the system consults the rdf store to decide which people to SMS and email about the alert.

The problem is that the data in the store was created before we standardised a convention for minting URIs. It contains data common to other systems (e.g. people details are also stored in the global company ldap directory, which also has an RDF export), but uses a different schema.

Now that my ifpstore handles equivalentProperty inferencing, we will be in a position to remove the people data from the alerter store, and have the alerter service get data from the global directory instead (which is more likely to be up to date).

The problem is that the alerter service will want results to the query in a vocabulary it understands - the question is, how do we do this whilst still managing the uri-triple-bloat problem discussed in my previous post?

I'm thinking about a system that allows the client to specify graphs of vocabularies that it would prefer to have the results in. When choosing URIs for logical resources, the system could use this to pick more appropriate ones for the client.

uri bloat in queries to smushed data

One of the things I've found when dealing with IFP merged data is that you get lots of URIs for the same logical resource.

A good example of this is foaf:Person 'Dan Brickley', who is mentioned in lots of peoples foaf files. Because Dan doesnt attribute a URI to himself (and why should he?), quite a few people have invented their own, usually by using rdf:id in their documents.

Another place where you get multiple uris per 'logical' resource is when merging schemas where there are exact-match terms (i.e. smushing with owl:equivilentProperty) - you end up with the same 'logical' property with multiple URIs.

The problem with querying rdf stores of this aggregated smushed data is that the multiple URIs cause triple bloat in the results. E.g. the query:

construct * where {?dan foaf:mbox <mailto :danbri@w3.org>. ?dan ?prop ?obj.}

returns n*r statements, where r is the number of property/object pairs for dan, and n is the number of URIs denoting dan.

Unfortunately, rather than:

there are lots of resources with mbox danbri@w3.org (all with identical properties)

the client more commonly wants:

there is one logical resource 'dan-brickley', with multiple URIs denoting him

(at least in the software I've been involved with)

In my smushing store I'm experimenting with a scheme which returns one result per 'logical resource' (picking a URI to describe it at random), and then provides an api for fetching the equivilent URIs. When returning results in an rdf graph, it picks one URI, and then puts owl:sameIndividualAs or owl:equivilentProperty statements at the end of the graph.

graphs, inferencing and security

One of the reasons I need graphs in my ifpstore is that at work I have an application that needs to ensure it is getting its security information from the right source. It uses a store aggregated from multiple sources for its data, and so without graph-matching, any of the other sources could theoretically add a ('joeBloggs memberof superusergroup') triple to the mix and subvert the security.

The problem is that my store also supports SIA, IFP and subproperty/class inferencing. This opens up a whole new can of worms, because (again theoretically) one of the sources could add a triple that affects the inference (e.g. 'isMemberOf rdfs:subPropertyOf isAdministratorOf' springs to mind).

The best solution to this I can think of is to have a special trusted graph set aside to hold triples used for inferencing. The import mechanism would then copy trusted schema statements that affect inferencing to this graph.

The approach would add another constraint to each back-chained inference (i.e. statement has to be in the trusted graph).

Hmmm.. an alternative spin would be to have a seperate table for the trusted inference statements. Am not sure which would be quicker (because caching plays such a large part as the size of the store scales up) - would have to profile with a large store.

IFP Store gets BRQL (sort of)

Have implemented a basic 'parser' using dodgy regexes to split up the components of the query. It supports SOURCE and OPTIONAL, and inlines the constraints with the patterns. Works reasonably well if you stick to the basic format. Hopefully I'll get round to uploading a demo later today.

Have also added graph management support to the veudas gui.

IFP store to be bundled with veudas

I think I'm going to bundle my new store as part of my veudas project. It saves me having to start up a new sf project, and currently veudas is pretty tied to the ifpstore's features (although theoretically you could plug in a different store with the same features).

Unsmushing IFP ideas

Currently my smushing-store design doesnt handle unsmushing. This isn't a big problem for my needs currently, but conversations with Steve Harris indicated that it's something required quite frequently by his customers.

Given that its not a common thing for me, I need a solution that doesnt affect query speed (e.g. so don't want to use a sameAs indirection table for each URI component of each triple) - ideally I want to load any penalty on the unsmush action itself.

Am thinking of 3 extra columns (s,p,o) on the triples table, unindexed, with the original assertions in them (before smushing). An unassert of an SIA of IFP would revert the affected triples back to their original asserted values, then reapply the smush.

Actually, having consulted Steve, putting the original assertions in another table (fk'd to the triples table) would probably be more performant, since it would reduce the size of the all-important triples table, and enable better caching by mysql.

IFP smushing store design

My IFP store design is a copy of the 3store design, but using 32bit ids instead of hashes in the triples table. This enables a single resource to have multiple URIs, which makes IFP smushing much simpler.

ifpstore currently uses 3 tables:

triples table: - subject 32bit int - predicate 32bit int - object 32bit int - graph 32bit int - literal boolean - inferred boolean indexed on: (subject,predicate,object), (predicate), (object), (graph)

hashes table: - hash 64 bit md5 hash - id 32 bit int indexed on hash

literals table: -id 32 bit int -literal text indexed on id

The hashes table uses the same technique as Steve Harris' 3store - first 64 bits of an md5 hash. The query engine pre-hashes URIs and literals used in the query, and the sql call uses this table to convert hashes to integer ids to be used in the triples table.

The literals table enables conversion of id back to uri or literal text. This table generates sequential ids via auto_increment (which is the main reason for lumping URIs and literals in the same table).