Quad store performance solutions?

I couldn't find a way to comment on Benjamin's post, so I've stuck it here:

What indexing are you using? My tagtriples store schema is basically a table with 4 ids which joins to an (ID,String) table. When it used to be an RDF store this held both literals and URIs.

I found the key to getting good performance with this arrangement was ensuring that the mysql query engine had enough optimised btree indexes to make a good guess at query execution order. I eventually went for 6 compound indexes on the triples table (ignore the separate graphs table - I ditched this soon after).

Hope this helps (and sorry if I'm mis-assuming something or have misunderstood the problem).

Implementing graphs as triple ranges

Had an implementation thought on the train this morning:

Instead of implementing a graph as a single id in the 4th column of the quads table, how about implementing it as a range of triple ids. i.e. Each triple is given a sequential id in the 4th column when asserted, and the graphs are stored as (startid, endid).

Advantages:

  • subgraphs can also be captured (useful for 'quoting')
  • the order of the graph is preserved (useful for retaining graph order for e.g. signing, and handling ordered collections without requiring an ordered collection type)
  • the number of statements in a graph is implicitly stored (endid - startid)

Disadvantages:

  • Can't use a UNIQUE(s,p,o,g) index to assert that the same statement doesn't occur twice in a graph. (Is this a problem? Not really - can put up with duplicates.)
  • Graphs can only be asserted one at a time. (Or could lease a block of ids or segregate the id space.)
  • Store size limited by the length of the id field. (Actually I suspect other factors will limit scalability before this, and it could be made 64 bit.)
  • Queries involving SOURCE become more tricky to implement internally. (WHERE triple.graph >= g.start AND triple.graph <= g.end)
  • Can't add triples to an existing graph (well, not without re-allocating ids, or maybe using a cluster field in the graph table or something).

Hmmm... Subgraphs and preserved ordering are compelling advantages - this is sounding like a good tradeoff to me...
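
To make the idea concrete, here's a minimal sketch of the range scheme (the table and column names are my own invention, not a settled design):

# Sketch of the range scheme - names are hypothetical, not a settled design.
# Each triple gets the next sequential id in its 4th column as it is
# asserted; a graph is just the (startid, endid) range it was asserted under.

RANGE_SCHEMA = """
CREATE TABLE graphs (
    uri     integer NOT NULL,  -- node id of the graph URI
    startid integer NOT NULL,  -- first triple id asserted into the graph
    endid   integer NOT NULL   -- last triple id asserted into the graph
)
"""

# A SOURCE-style lookup becomes a range test rather than an equality join,
# and the statement count falls straight out of the two ids.
SOURCE_LOOKUP = """
SELECT t.subject, t.predicate, t.object
FROM triples t, graphs g
WHERE g.uri = %s
  AND t.graph BETWEEN g.startid AND g.endid
"""

Any sub-range within (startid, endid) would work as a quotable subgraph in the same way.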

Suffix array performance problems

Hit some bad performance hotspots at work yesterday with the new suffix stuff.

The main problem occurs when you do text searches with multiple words, some of which are very popular. The query optimiser frequently makes a bad decision, which boils down to it being unable to see the fan-out of the popular word, because the explosion happens two joins away (suffixes -> suffixlinks -> triples). The biggest problem case was the word 'drkw' (i.e. the name of the company), which occurs in few suffixes (resulting in a good match on the index) but translates to many thousands of literals once linked through to the triplestore. E.g. in a search for 'drkw dam', the optimiser favours doing 'drkw' first, since it has fewer suffix index matches, whereas 'dam' would resolve to just a few hundred eventual literals.
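
For illustration, the troublesome query has roughly this shape (hand-sketched against the suffix schema from the post below, not the actual generated SQL):

# Rough shape of a two-word search ('drkw dam') against the
# suffixes/suffixlinks/triples schema. The optimiser picks its join order
# from the suffix-index cardinalities, which for 'drkw' hide a huge
# fan-out two joins later.

BAD_PLAN_SHAPE = """
SELECT ...
FROM suffixes s1, suffixlinks l1, suffixes s2, suffixlinks l2, triples t
WHERE s1.suffix LIKE 'drkw%' AND l1.suffix = s1.id AND l1.len >= 4
  AND s2.suffix LIKE 'dam%'  AND l2.suffix = s2.id AND l2.len >= 3
  AND l1.literal = t.object
  AND l2.literal = t.object
"""

# One blunt workaround (an assumption on my part, not something the store
# does today) is mysql's STRAIGHT_JOIN, which joins tables in the order
# written, so the genuinely rarer word can be listed first by hand.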

Claire's gone to see a friend for a couple of hours, so have got some time to play with this...

RDF Literal searching using a suffix array

Text searches were becoming a big bottleneck in the knowledge base deployment at work. The grep solution allows speedup of pure text searches (such as a label search), but doesn't work so well when the text search is part of a bigger query.

Ideally I would have used mysql's fulltext indexing capability to index the literals; the problem is that this just indexes full words, and a lot of the searches require substring matches within the data (e.g. when searching for server names, the client often knows the server number 'e.g. 859', but not the full code 'ln3p0859app').

I had some hacking time over the xmas break, and I got round to implementing a basic suffix array within the veudas db schema that maps word suffixes to literals via a links table. This enables the suffixes to be indexed (I index the first 4 characters at the moment) for rapid matching of substrings to literals. The tables are like this:

CREATE TABLE suffixes (
    id integer NOT NULL AUTO_INCREMENT,
    suffix text,
    KEY (suffix(4)),
    UNIQUE KEY (id)
) TYPE=MyISAM;

CREATE TABLE suffixlinks (
    suffix integer,
    len integer,
    literal integer,
    KEY (suffix,len)
) TYPE=MyISAM;

The len field in suffixlinks is used to reduce the number of suffixes: one stored suffix can serve several literals, with len recording how many of its leading characters are valid for each link. E.g. the suffix 'aced' matches the word 'places' for 3 letters ('ace'), and the word 'placed' for all 4. The generated SQL snippet for a text search (in this case for the substring '859') linking up to the core triples table looks something like this:

SELECT ... WHERE ...
  AND suffixes.suffix LIKE '859%'
  AND suffixes.id = suffixlinks.suffix
  AND suffixlinks.len >= 3
  AND suffixlinks.literal = triples.object

In order to reduce the number of suffixes I split the literals into words before generating suffixes from them. This means that text searches only match over words rather than whole literals (e.g. a search for 'phil dawes' matches 'dawes,phil' and 'phil dawes' and 'philip dawes'), but this is fine for my knowledge base application.
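
Here's a rough sketch of the suffix generation step (my reconstruction of the scheme, not the actual veudas code):

# Split each literal into words and emit every suffix of every word.
# My reconstruction of the scheme - not the actual veudas code.

def word_suffixes(literal):
    for word in literal.lower().split():
        for i in range(len(word)):
            yield word[i:]

# 'placed' yields 'placed', 'laced', 'aced', 'ced', 'ed', 'd'.
# When 'places' later yields 'aces', the 'aced'/'places' example above
# suggests the store can reuse the existing 'aced' row with len=3 rather
# than inserting a new suffix row.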

I would be interested in knowing if there is a better way to index text substrings in a relational database - I couldn't find much information about this on the web.

Veudas and veudastore packages containing this implementation can be downloaded from the veudas sourceforge project page.

veudastore standalone rdf store

I keep meaning to perfect these packages, but it doesn't happen, so here's another 'release-early' package. Veudastore is a standalone version of the mysql RDF backend store that runs the veudas knowledge manager. It can be downloaded from the veudas sf project page. From the readme:

INSTALL:
--------
- Install mysql and create an empty database

USAGE:
------
import veudastore
ctx = veudastore.init()
ctx.setDatabaseConnectParams(USER, PASSWD, DB, HOST)
ctx.createTables()
ctx.importFromURISource(sourceuri, graphuri)
q = """
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix wnet: <http://xmlns.com/wordnet/1.6/>
select ?p where (?p foaf:nick "danbri")
                (?p rdf:type wnet:Person)
"""
cols, rows = ctx.querySPARQL(q)
for p, in rows:
    print p

Literal searching - grep vs tablescan

One of the most useful bits of veudas' functionality is being able to search the knowledgebase for labels (and subproperties) containing a particular string. Unfortunately this is also the slowest operation. Currently veudas' ifp store does a table scan over the nodes table for regex searches - very slow (mysql doesn't support a general regex index).

I'm currently only handling a few million triples in my store, so I'm not ready to invest a large amount of time in writing e.g. a suffix tree implementation. I do need a performance boost though, since some text searches take a number of seconds. Given this, I thought I'd compare database performance with flat-file searching using the Wordnet data (222439 literals). Here's the comparison between (1) a mysql table scan and (2) grepping a flat file.

1) put all the literals in 1 table and do a regex table scan

select id from labelcache where value LIKE "%football%"
> 146 matches in 0.32 seconds

2) put all the literals in a file and do a grep and pull results into python

mysql -D ifptest1 -u root -e 'select id,value from nodes where literal = "1"' > nodes

and then...

import os, time

t = time.time()
f = os.popen("grep -i football nodes")
ids = [l[:l.index('\t')] for l in f]
print len(ids), "matches in", time.time() - t, "seconds"
> 146 matches in 0.042 seconds

I found this quite surprising - grepping a file and piping to python is an order of magnitude faster than mysql doing a table scan(!)

The rest of the query (once the results have been reduced to ids) takes 0.02 seconds. It looks like grepping a file may be a quick and easy way of getting a ~x10 performance boost (at least in the short term).
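
The glue step is simple enough; a sketch (my assumption of how the ids feed back into the relational part of the query, not the actual veudas code):

import os

# Grep the flat nodes file (id<TAB>value per line, as dumped above) and
# feed the matching ids back into the SQL query as an IN list.

def grep_ids(term, nodesfile="nodes"):
    f = os.popen("grep -i %s %s" % (term, nodesfile))
    return [line.split('\t')[0] for line in f]

ids = grep_ids("football")
sql = "SELECT ... WHERE triples.object IN (%s)" % ",".join(ids)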

Optimising mysql tables for rdf storage

I've made some modifications to the table layout used by my experimental smushing store. In particular, I've ditched the hashes table, and lumped node hashing in with the literals table. This improves smushing performance as it reduces the number of tables to update (the current biggest overhead when smushing resources on ifp).

Here's the new layout:

CREATE TABLE `triples` (
    `graph` integer NOT NULL default '0',
    `subject` integer NOT NULL default '0',
    `predicate` integer NOT NULL default '0',
    `object` integer NOT NULL default '0',
    `literal` tinyint(1) NOT NULL default '0',
    `inferred` tinyint(1) NOT NULL default '0',
    UNIQUE KEY `spog` (`subject`,`predicate`,`object`,`graph`),
    KEY `posg` (`predicate`,`object`,`subject`,`graph`),
    KEY `ospg` (`object`,`subject`,`predicate`,`graph`),
    KEY `gspo` (`graph`,`subject`,`predicate`,`object`),
    KEY `gpos` (`graph`,`predicate`,`object`,`subject`),
    KEY `gosp` (`graph`,`object`,`subject`,`predicate`)
) TYPE=MyISAM;

CREATE TABLE nodes (
    id integer NOT NULL AUTO_INCREMENT,
    hash bigint(20) NOT NULL,
    value text,
    PRIMARY KEY (hash),
    KEY (id)
) TYPE=MyISAM;

CREATE TABLE graphs (
    id integer NOT NULL,
    UNIQUE KEY (id)
) TYPE=MyISAM;

N.B. I added a lot more indexing to the triples table a while back. This pretty much copies the kowari indexing from Paul's post in May. (Note that in his post he wrongly assumes mysql doesn't support compound indexes. I'm sure somebody must have corrected him by now.)

The cool thing about this is that all the relational queries can now be answered by mysql solely from the indexes (no need to cross-reference with the data file), which speeds things up a bit. I haven't done any profiling with really large datasets yet, but it performs well with a couple of million statements or so.

Most queries (once optimised by the mysql query engine) boil down to:

  1. Translate the literals/uris into ids via the nodes table (using an md5 hash to rapidly look up the values).
  2. Do the triple pattern joins on the triples table (all within the indexes).
  3. Translate the result ids back into values (uris or literals) by joining back to the nodes table.
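
As a concrete illustration, the 'danbri' query from the veudastore readme might reduce to something like this (hand-written in the shape described above, not the store's actual generated SQL):

# Steps 1-3 by hand for: select ?p where (?p foaf:nick "danbri")
#                                        (?p rdf:type wnet:Person)
# The %(...)s ids are resolved first via nodes.hash (step 1); the joins
# in the middle never leave the triples indexes (step 2); the final join
# back to nodes recovers the value (step 3).

QUERY_SHAPE = """
SELECT n.value
FROM triples t1, triples t2, nodes n
WHERE t1.predicate = %(foaf_nick)s
  AND t1.object    = %(lit_danbri)s
  AND t2.subject   = t1.subject
  AND t2.predicate = %(rdf_type)s
  AND t2.object    = %(wnet_person)s
  AND n.id = t1.subject
"""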

Ideally I would have got mysql to index the 'value' column in the nodes table directly and lost the md5 hash column altogether. Unfortunately mysql doesn't appear to have hash index functionality for text/blobs. (It indexes the first n bytes of the field - this doesn't work very well with URIs, and means I can't put a UNIQUE constraint on the index.)

Any comments/suggestions for improvement gratefully received.

More import optimisation

Claire's out tonight, so another evening spent on bulk rdf importing. Have managed to get the original 120705 statement dataset import down to 77.6 seconds - that's ~1500 triples a second!

The extra speed was mainly due to removing the need for database URI-to-id lookups by taking an in-memory copy of the hashes table. The problem wasn't really the lookups (which were cached), but rather the need to check each new URI to see if it had already been used (which thwarts the cache each time). I suspect that Bloom (or DeGan) filters would be really useful here, but I plumped for a python dictionary of 64bit md5 hashes (3store style) since it was easy for me to code quickly.
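
A minimal sketch of the dictionary (my reconstruction of the idea, not the actual import code):

import hashlib
import struct

# In-memory duplicate check, 3store style: key on 64 bits of the md5
# digest so a few million URIs fit comfortably in memory.

def hash64(value):
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return struct.unpack(">Q", digest[:8])[0]  # first 8 bytes, big-endian

seen = {}  # hash64 -> node id, primed once from the nodes table

def lookup_or_allocate(value, next_id):
    h = hash64(value)
    if h in seen:
        return seen[h], False  # URI already in the store: no INSERT needed
    seen[h] = next_id
    return next_id, True       # genuinely new: caller INSERTs the node row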

Anyway, working on proprietary data is not much use for benchmarking purposes, so I ran the new import code over the wordnet 1.6 rdf files. Managed to get all 473589 statements into my store in 315 seconds! - still within the 1500 per second mark.

For anyone wanting to compare with other stores, the order of import matters - I imported in the following order:

  • wordnet_hyponyms-20010201.rdf.xml
  • wordnet_nouns-20010201.rdf.xml
  • wordnet_glossary-20010201.rdf.xml
  • wordnet_similar-20010201.rdf.xml

which appeared to produce the fastest results, although I've not looked into why. The import includes indexing, removing duplicates, and doing some forward-chaining (FC) inferences (although not many).

Oh yeah - I used my work laptop to do the test - it's a powerbook g4 with half a gig of ram running gentoo linux. Mysql v4.0.20.