RDF Literal searching using a suffix array
Text searches were becoming a big bottleneck in the knowledge base deployment at work. The grep solution allows speedup of pure text searches (such as a label search), but doesn't work so well when the text search is part of a bigger query.
Ideally I would have used mysql's fulltext indexing capability to index the literals; the problem is that this just indexes full words, and a lot of the searches require substring matches within the data (e.g. when searching for server names, the client often knows the server number 'e.g. 859', but not the full code 'ln3p0859app').
I had some hacking time over the xmas break, and I got round to implementing a basic suffix array within the veudas db schema that maps word suffixes to literals via a links table. This enables the suffixes to be indexed (I index the first 4 characters at the moment) for rapid matching of substrings to literals. The tables are like this:
CREATE TABLE suffixes ( id integer NOT NULL AUTO_INCREMENT, suffix text, KEY (suffix(4)), UNIQUE KEY (id) ) TYPE=MyISAM CREATE TABLE suffixlinks ( suffix integer, len integer, literal integer, KEY (suffix,len) ) TYPE=MyISAM
The length field in the suffixlinks is used in order to reduce the number of suffixes. E.g. the suffix 'aced' matches the word 'places' for 3 letters, and the word 'placed' for 4. The generated SQL snippet for a text search (in this case for substring '859') linking up to the core triples table looks something like this:
SELECT ...WHERE... AND suffixes.suffix LIKE '859%' AND suffixes.id = suffixlnk.suffix AND suffixlinks.len >= 3 AND suffixlinks.literal = triples.object
In order to reduce the number of suffixes I split the literals into words before generating suffixes from them. This means that text searches only match over words rather than whole literals (e.g. a search for 'phil dawes' matches 'dawes,phil' and 'phil dawes' and 'philip dawes'), but this is fine for my knowledge base application.
I would be interested in knowing if there is a better way to index text substrings in a relational database - I couldn't find much information about this on the web.
Veudas and veudastore packages containing this implementation can be downloaded from the veudas sourceforge project page.