Optimising mysql tables for rdf storage
I've made some modifications to the table layout used by my experimental smushing store. In particular, I've ditched the hashes table, and lumped node hashing in with the literals table. This improves smushing performance as it reduces the number of tables to update (the current biggest overhead when smushing resources on ifp).
Here's the new layout:
CREATE TABLE `triples` ( `graph` integer NOT NULL default '0', `subject` integer NOT NULL default '0', `predicate` integer NOT NULL default '0', `object` integer NOT NULL default '0', `literal` tinyint(1) NOT NULL default '0', `inferred` tinyint(1) NOT NULL default '0', UNIQUE KEY `spog` (`subject`,`predicate`,`object`,`graph`), KEY `posg` (`predicate`,`object`,`subject`,`graph`), KEY `ospg` (`object`,`subject`,`predicate`,`graph`), KEY `gspo` (`graph`,`subject`,`predicate`,`object`), KEY `gpos` (`graph`,`predicate`,`object`,`subject`), KEY `gosp` (`graph`,`object`,`subject`,`predicate`) ) TYPE=MyISAM CREATE TABLE nodes ( id integer NOT NULL AUTO_INCREMENT, hash bigint(20) NOT NULL, value text, PRIMARY KEY (hash), KEY (id) ) TYPE=MyISAM CREATE TABLE graphs ( id integer NOT NULL, UNIQUE KEY (id) ) TYPE=MyISAM
N.B. I added a lot more indexing to the triples table a while back. This pretty much copies the kowari indexing from Paul's post in May. (note that he wrongly assumes that mysql doesnt support use of compound indexes in his post. I'm sure somebody must have corrected him by now).
The cool thing about this is that all the relational queries can now be done by mysql solely in the indexes (no need crossref with the data file), which speeds things up a bit. Haven't done any profiling with really large datasets yet, but performs well with a couple of million statements or so.
Most queries (once optimised by the mysql query engine) boil down to:
- Translate the literals/uris into IDs via the nodes table (using an md5 hash to rapidly look up the values).
- Do the triple pattern joins on the triples table (all within the indexes)
- translate the result ids back into values (uris or literals) by joining back to the nodes table.
Ideally I would have got mysql to index the 'value' column in the nodes table directly, and lost the md5 hash column altogether. Unfortunately mysql doesnt appear to have hash index functionality for text/blobs. (It indexes the first n bytes of the field - this doesnt work very well with URIs, and means I can't put a UNIQUE constraint on the index.)
Any comments/suggestions for improvement gratefully received