More import optimisation
Claire's out tonight, so another evening spent on bulk RDF importing. Have managed to get the original 120705-statement dataset import down to 77.6 seconds - that's ~1500 triples a second!
The extra speed was mainly due to removing the need for database URI-to-id lookups by taking an in-memory copy of the hashes table. The problem wasn't really the lookups (which were cached), but rather the need to check each new URI to see if it's already been used (which thwarts the cache each time). I suspect that Bloom (or DeGan) filters would be really useful here, but I plumped for a Python dictionary of 64-bit MD5 hashes (3store style) since it was easy for me to code quickly.
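Roughly the idea, as a minimal Python sketch - the class and its names are mine rather than the actual import code, and taking the first 8 bytes of the MD5 digest is an assumption about the exact truncation:

```python
import hashlib

class UriIdCache:
    """In-memory copy of the hashes table: maps 64-bit MD5 URI hashes to
    database ids, so checking whether a URI is new never touches the DB."""

    def __init__(self, rows=()):
        # rows: (hash, id) pairs pulled once from the hashes table
        self.ids = dict(rows)

    @staticmethod
    def hash_uri(uri):
        # 64-bit hash: the first 8 bytes of the URI's MD5 digest (assumed truncation)
        return int.from_bytes(hashlib.md5(uri.encode("utf-8")).digest()[:8], "big")

    def id_for(self, uri, next_id):
        # Return (id, is_new), allocating next_id only if the URI hasn't been seen
        h = self.hash_uri(uri)
        if h in self.ids:
            return self.ids[h], False
        self.ids[h] = next_id
        return next_id, True
```

The point is that the whole hashes table fits in memory, so the "have I seen this URI before?" check becomes a dictionary lookup rather than a round trip to MySQL.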
Anyway, working on proprietary data is not much use for benchmarking purposes, so I ran the new import code over the WordNet 1.6 RDF files. Managed to get all 473589 statements into my store in 315 seconds - still around the 1500-per-second mark!
For anyone wanting to compare with other stores, the order of import matters - I imported in the following order:
- wordnet_hyponyms-20010201.rdf.xml
- wordnet_nouns-20010201.rdf.xml
- wordnet_glossary-20010201.rdf.xml
- wordnet_similar-20010201.rdf.xml
which appeared to produce the fastest results, although I've not looked into why. The import includes indexing, removing duplicates, and doing some forward-chaining (FC) inferences (although not many).
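If you want to reproduce the timing, here's a rough sketch of how a run could be clocked - import_file is a hypothetical stand-in for whatever call loads one RDF/XML file into your store and returns the number of statements it added:

```python
import time

WORDNET_FILES = [
    "wordnet_hyponyms-20010201.rdf.xml",
    "wordnet_nouns-20010201.rdf.xml",
    "wordnet_glossary-20010201.rdf.xml",
    "wordnet_similar-20010201.rdf.xml",
]

def time_import(import_file):
    # Time the whole run and report overall statements-per-second
    start = time.time()
    total = sum(import_file(path) for path in WORDNET_FILES)
    elapsed = time.time() - start
    print("%d statements in %.1fs (%.0f statements/s)" % (total, elapsed, total / elapsed))
```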
Oh yeah - I used my work laptop to do the test - it's a PowerBook G4 with half a gig of RAM running Gentoo Linux. MySQL v4.0.20.