Literal searching - grep vs tablescan

One of the most useful bits of veudas' functionality is being able to search the knowledgebase for labels (and subproperties) containing a particular string. Unfortunately this is also the slowest operation. Currently veudas' ifp store does a table scan over the nodes table for regex searches - very slow (mysql doesnt support a general regex index).

I'm currently only handling a few million triples in my store, so I'm not ready to invest a large amount of time writing e.g. a suffix tree implementation. I do need a performance boost though, since some text searches take a number of seconds; Given this, I thought I'd compare database performance with flat-file searching using the Wordnet data (222439 literals). Here's the comparison data between (1) mysql table scan and (2) grepping a flat file.

1) put all the literals in 1 table and do a regex table scan

select id from labelcache where value LIKE "%football%" . > 146 matches in 0.32 seconds

2) put all the literals in a file and do a grep and pull results into python

mysql -D ifptest1 -u root -e 'select id,value from nodes where literal = "1"' > nodes

and then...

t = time.time() f = os.popen("grep -i football nodes") ids = [l[:l.index('t')] for l in f] print len(ids),"matches in",time.time() - t,"seconds" . > 146 matches in 0.042 seconds

I found this quite surprising - grepping a file and piping to python is an order of magnitude faster than mysql doing a table scan(!)

The rest of the query (once the results have been reduced to ids) takes 0.02 seconds. It looks like grepping a file may be a quick and easy way of getting a ~x10 performance boost (at least in the short term).