Quad store performance solutions?

I couldn't find a way to comment on Benjamin's post, so I've stuck it here:

What indexing are you using? My tagtriples store schema is basically a table with 4 ids which joins to an (ID,String) table. When it used to be an RDF store this held both literals and URIs.

I found the key to getting good performance with this arrangement was ensuring that the MySQL query engine had enough optimised btree indexes to make a good guess at query execution order. I eventually went for 6 compound indexes on the triples table (ignore the separate graphs table - I ditched this soon after).
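A minimal sketch of that kind of schema, for illustration only. It uses SQLite (via Python's sqlite3) rather than MySQL so that it runs anywhere, and the table names, column names and the choice of the six index orderings (every permutation of subject, predicate and object) are my assumptions, not the actual tagtriples schema:

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- (ID, String) lookup table; names here are hypothetical
    CREATE TABLE symbols (id INTEGER PRIMARY KEY, value TEXT NOT NULL UNIQUE);
    -- statement table holding four ids that each join to symbols.id
    CREATE TABLE triples (
        graph_id     INTEGER NOT NULL REFERENCES symbols(id),
        subject_id   INTEGER NOT NULL REFERENCES symbols(id),
        predicate_id INTEGER NOT NULL REFERENCES symbols(id),
        object_id    INTEGER NOT NULL REFERENCES symbols(id)
    );
""")

# Assumption: the six compound indexes are the six orderings of (s, p, o),
# so the planner has a usable index whichever columns a query constrains.
columns = ("subject_id", "predicate_id", "object_id")
index_names = []
for ordering in itertools.permutations(columns):
    name = "idx_" + "_".join(col[0] for col in ordering)  # e.g. idx_s_p_o
    conn.execute(f"CREATE INDEX {name} ON triples ({', '.join(ordering)})")
    index_names.append(name)
```

Which orderings pay for themselves depends on the query mix, so treat the list as a starting point to prune rather than a recommendation.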

Hope this helps (and sorry if I've assumed something wrongly or have misunderstood the problem).

Namespaces v Context

Two ways to enable the disambiguation of data:

  • ensure that each identifier is namespaced so that it can't collide with any other, or
  • allow identifier names to collide, but also capture context to enable the consumer to disambiguate

RDF relies on the former approach, English the latter.

N.B. An advantage of the latter approach is that it allows the meaning of identifiers to deviate a little with context, whereas namespaces imply a rigid adherence to meaning (which may or may not be practical).

microqueries

I've recently been experimenting with ways to provide simpler structured searching/querying to 'normal' web users (i.e. not techies). Sparql/SQL querying doesn't cut it here - we need something simpler.

One approach I've been trying is allowing simple query constraints alongside the words in a text search. Using its proximity searching capability, JAM*VAT then finds a collection of symbols that match all the constraints in close proximity.

E.g. the search: "danny ayers >2005-10-25 <2005-10-27"

..brings back subjects linked to the words 'danny' and 'ayers' and any dates between 2005-10-25 and 2005-10-27 - in this case it finds blog posts made between these dates.

N.B. you may need to tweak the dates a bit in the above example if you're reading this later than October 2005.
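To make the idea concrete, here is a rough sketch of how such a search string could be split into plain words and date constraints. It is not the JAM*VAT parser; the function name, the regular expression and the returned structure are all assumptions made for illustration:

```python
import re
from datetime import date

# Assumption: a constraint is '<' or '>' followed directly by a YYYY-MM-DD date.
CONSTRAINT = re.compile(r"^([<>])(\d{4})-(\d{2})-(\d{2})$")

def parse_microquery(query):
    """Split a search string into plain words and (operator, date) constraints."""
    words, constraints = [], []
    for token in query.split():
        match = CONSTRAINT.match(token)
        if match:
            op, year, month, day = match.groups()
            constraints.append((op, date(int(year), int(month), int(day))))
        else:
            words.append(token.lower())
    return words, constraints

words, constraints = parse_microquery("danny ayers >2005-10-25 <2005-10-27")
# words       -> ['danny', 'ayers']
# constraints -> [('>', date(2005, 10, 25)), ('<', date(2005, 10, 27))]
```

The words would then go to the text search and the constraints to the indexed comparisons, with the proximity search joining the two.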

More JAM*VAT query features

Using a relational database as a triplestore backend has a number of advantages - one of which is leveraging features of the backend SQL support with very little effort.

I've recently added a whole bunch of functionality to ttql (the experimental query language that JAM*VAT uses for querying). These include:

  • SQL (mysql) numeric and string functions (e.g. CONCAT, COUNT, SUM, MIN/MAX etc..)
  • OrderBy
  • GroupBy
  • Limit

(this in addition to optional blocks and indexed substring comparisons).

Also, numeric and date comparisons are now indexed (dates are converted from various formats to a common representation at the query parsing stage).

Oh - and I've made select queries NON-distinct by default. I realised that since tagtriples is not logical-set based (the order matters), the same statement can quite legitimately crop up more than once in a graph. It thus follows that it ought to be possible to get all the occurrences out via a structured query, and so I've added the DISTINCT keyword and made the queries non-distinct by default.

I really must get round to documenting this on the query page; if you can't wait (yeah right!), check out the test_ttql.py file for the query unittests.
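The non-distinct default described above is easy to see at the SQL level. This is only a sketch in plain SQL through Python's sqlite3, not ttql, and the table and column names are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statements (subject TEXT, predicate TEXT, object TEXT)")
rows = [
    ("post1", "tag", "rdf"),
    ("post1", "tag", "rdf"),     # the same statement occurring a second time
    ("post1", "tag", "python"),
]
conn.executemany("INSERT INTO statements VALUES (?, ?, ?)", rows)

# Default behaviour: every occurrence comes back.
all_rows = conn.execute(
    "SELECT subject, predicate, object FROM statements").fetchall()

# Opt-in behaviour: DISTINCT collapses repeated statements.
distinct_rows = conn.execute(
    "SELECT DISTINCT subject, predicate, object FROM statements").fetchall()
```

Here `all_rows` has three entries and `distinct_rows` has two.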

Alternative to the Semantic Web?

Danny! Here's another alternative for you to consider:- tagtriples. (I must be sounding like a stuck record blathering on about tagtriples all the time, but hear me out...)

I started tagtriples as an attempt to find the simplest subset of RDF that wouldn't lose any of the merging pixie-dust features. (RDF was proving just too complicated to get critical mass at my work, and required up-front agreement to get data to merge). Fresh from the folksonomy buzz I tried replacing URIs with sets of tags used in combination, then realised that tags could be modelled as statements with predicate 'tag' and ended up with tagtriples.

So instead of a universal ID to define the thing, you rely on combinations of symbols and statements. This opens up more possibilities for magical pixie-dust merging as this emphasis leverages existing symbols grounded in real life (email addresses, phone numbers, names etc..) in combination to join data. (btw, note that even FOAF RDF drops URIs at the point where it needs to do the pixie-dust merging stuff).

But the really cool thing is: when you lose the URIs, scraping data from other general formats becomes simpler and, to a certain degree, actually automatable. You just recreate the structure of the data in triples, and then use the symbols from the source data as node identifiers. You don't even need to be precise to get a representation that can yield useful results (especially if you're doing sparql-style queries).

BTW, I've written importers for XML, RDF and CSV, and recently for the hCard and hCalendar microformats. At work we have a turn-key database and LDAP exporter. I'm now wondering whether it's possible to mine some quality of semantic statements out of general 'semantically-oriented' XHTML. Maybe at least enough to do some structured querying on. (really must post and ask the microformats list about this)
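As an illustration of how little work the scraping step needs, here is a sketch that turns CSV rows into triples using the source data's own symbols as node identifiers. It is not the actual JAM*VAT CSV importer; the function name and the idea of nominating one column as the subject are assumptions:

```python
import csv
import io

def csv_to_triples(text, key_column):
    """Turn each CSV row into (subject, predicate, object) triples.

    The value in key_column becomes the subject symbol and the column
    headers become predicates. No URIs are minted anywhere.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = row[key_column]
        for column, value in row.items():
            if column != key_column and value:
                triples.append((subject, column, value))
    return triples

sample = "name,email,phone\nAlice Example,alice@example.org,555-0100\n"
triples = csv_to_triples(sample, key_column="email")
# -> [('alice@example.org', 'name', 'Alice Example'),
#     ('alice@example.org', 'phone', '555-0100')]
```

Because the email address is an existing real-world symbol, any other source that mentions it will join up with these statements without further mapping.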

Other benefits:

  • clusters of symbols are amenable to proximity searching techniques - i.e. find a cluster of statements containing these symbols. This is a powerful way of finding microcontent structures in the data mush.

  • Sparql style structured querying becomes simpler - no more namespaces to remember! Combinations of statement patterns in the query easily restrict the set of possible matches to the point where you don't need namespaces to ensure precision.

Ok, so here's the final pitch: With all the scraping, searching and browsing stuff you can do, neither the author nor the consumer needs to actually know anything about tagtriples.

I think that's really cool: The user can import data from existing formats and models, search for items across the merged data (using google-style text searches that work over symbols in close proximity), and browse the data structures without caring that there's some triples-tags-and-graphs model behind it.

That's the most powerful bit. This stuff is useful without gaining any sort of adoption critical mass. (E.g. at work I installed JAM*VAT, emptied some databases and LDAP stores into it and suddenly people can search and browse across the merged indexed data, traversing where symbol and statement combinations mesh).

So waddaya think? Existing formats + a generic model to aggregate and interpret the semantic data: An RDF alternative contender?

Proximity text searching

It's just occurred to me that I never posted about the proximity search capability that I built into JAM*VAT about 3 months ago.

It works by looking for symbols in close proximity. E.g. searching for 'Danny Ayers Blog' yields an answer 'raw', even though the word isn't in the search string. This is because the 'raw' symbol is connected to all of the above terms through the statements in the store.

(N.B. "Danny Ayers Channel" gives a more precise match, because the imported data is an rss10 channel. However the proximity text search is most useful when you don't know the exact vocabulary of the data you're querying.)

The implementation works by simply executing text searches to find statements and then using the resulting statements to filter searches of the remaining terms. It then runs a simple ranking algorithm on the results. The text searches run very fast because of the internal suffix array implementation (and work just as well with substrings).
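A much-simplified sketch of the idea, for illustration. It scans an in-memory list of statements with substring matching instead of using a suffix array, and the function name, the sample data and the scoring rule are all assumptions rather than the JAM*VAT code:

```python
from collections import Counter

def proximity_search(statements, query):
    """Return symbols connected to every search term, best-connected first."""
    terms = query.lower().split()
    candidates = None
    scores = Counter()
    for term in terms:
        neighbours = set()
        for statement in statements:
            # Text search: does any symbol in this statement contain the term?
            if any(term in symbol.lower() for symbol in statement):
                # The other symbols in a matching statement are "in proximity".
                for symbol in statement:
                    if term not in symbol.lower():
                        neighbours.add(symbol)
                        scores[symbol] += 1
        # Each term's results filter the candidates found so far.
        candidates = neighbours if candidates is None else candidates & neighbours
    # Naive ranking: more linking statements first, then alphabetical.
    return sorted(candidates or [], key=lambda s: (-scores[s], s))

statements = [
    ("raw", "type", "blog"),
    ("raw", "author", "Danny Ayers"),
    ("planet", "type", "blog"),
    ("planet", "author", "someone else"),
]
results = proximity_search(statements, "Danny Ayers Blog")
# -> ['raw']
```

'raw' is returned even though it is not in the search string, because it is the only symbol that shares a statement with all three terms.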

This works pretty well at work where we have an application management store with ~1.5 million triples. The installation returns sub-second queries for things like 'bond trader server'. ('bond trader' doesn't actually exist btw - that was an example).

Indexing dates and numbers in a large triplestore

JAM*VAT is now mature enough that it handles relational operations over large amounts of aggregated structured data quickly and scalably, and also provides very fast regex text search operations (due to its inbuilt suffix array implementation).

However one area where it doesn't perform very well is in handling dates and numbers. E.g. if you aggregated 10000 RSS feeds into it and then asked for posts made between 9am and 11am this morning, performance would be poor regardless of your hardware. The reason for this is that JAM*VAT doesn't currently index symbols other than for text searching.

The best way I can think to add this capability is to augment the symbol 'string' with a double precision floating point version (where the string can be 'cast' to a numeric). Of course this numeric value won't always be precise (because it's floating point), but AFAICS that won't matter because the value is going to be used for comparative indexing only (e.g. for > or < comparisons, and for numeric ordering).

So for example the backend SQL for a (symbol > 1000.000000000001) comparison would be:

 WHERE  sym.numeric_value > 1000.000000000001 
 AND    CAST(sym.text_value AS DECIMAL(65,30)) > 1000.000000000001

..which should allow the MySQL query processor to use the index to narrow down the options (in conjunction with other indexed parts of the query) before applying more accurate numerical comparisons to a CAST text symbol (i.e. cast to a precise fixed point decimal).
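Here is a runnable sketch of the two-stage comparison, under some loudly stated assumptions: it uses SQLite rather than MySQL, so the exact check is done in Python with the decimal module instead of a SQL CAST, and the helper names and sample symbols are invented. The table and column names follow the SQL fragment above:

```python
import sqlite3
from decimal import Decimal

def to_float(text):
    """Return the symbol as a double, or None if it isn't numeric."""
    try:
        return float(text)
    except ValueError:
        return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sym (text_value TEXT, numeric_value REAL)")
conn.execute("CREATE INDEX idx_sym_numeric ON sym (numeric_value)")
for text in ["999", "1000.000000000000001", "1000.5", "2005", "danny"]:
    conn.execute("INSERT INTO sym VALUES (?, ?)", (text, to_float(text)))

def symbols_greater_than(conn, threshold_text):
    threshold = Decimal(threshold_text)
    # Stage 1: coarse, indexable filter on the floating point copy.
    rows = conn.execute(
        "SELECT text_value FROM sym WHERE numeric_value >= ?",
        (float(threshold),))
    # Stage 2: exact comparison on the original text of the survivors.
    return sorted(t for (t,) in rows if Decimal(t) > threshold)

matches = symbols_greater_than(conn, "1000")
# -> ['1000.000000000000001', '1000.5', '2005']
```

One design point worth flagging: the coarse stage here uses >= rather than >. Two different decimal strings can round to the same double (1000.000000000000001 becomes exactly 1000.0), so a strict comparison in the indexed stage could throw away rows that the exact stage would have kept.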

For dates I'm assuming that I'll be able to do date-to-number transformations (e.g. seconds past the epoch) prior to insertion into the database or in a query.
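A small sketch of that date-to-number step. The list of accepted formats and the decision to treat every date as UTC are assumptions for the example; a real importer would need to handle time zones and many more formats:

```python
import calendar
from datetime import datetime

# Assumption: only these two formats are recognised, and all dates are UTC.
FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S")

def to_epoch_seconds(text):
    """Return seconds past the epoch for a date string, or None if unparsable."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        return calendar.timegm(parsed.timetuple())
    return None

start = to_epoch_seconds("2005-10-25")
end = to_epoch_seconds("2005-10-27T00:00:00")
```

The resulting integers can be stored in the same indexed numeric column as ordinary numbers, so a date range becomes a pair of plain > and < comparisons.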

All this is in theory - haven't written the code yet. Can anybody see a better way? (or a problem with this approach?)

JAM*VAT 0.8.3 is out!

I'm quite excited about this release - it includes new POST functionality that accepts HTTP-POSTed content interpreted via mimetype. The upshot of which is that people can cut-n-paste xml chunks into JAM*VAT (which is a compelling way to demonstrate the technology).

You can try it via the online demo - click on the 'Post Data' link at the top, then cut-n-paste your xml or RDF data into the box. You should then be able to text-search it and browse via the jamvat interface.

Another super-cool feature of the HTTP-POST stuff is that it supports 'application/x-www-form-urlencoded' data - i.e. you can create a web form which posts directly to JAM*VAT. E.g. try the following:

[Web form with two fields: Name and Favourite Food]

Once you've submitted, try searching for the name on the search page, or finding your graph on the graphs page.
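The same kind of submission can be prepared from a script. This sketch only builds the request and does not send it; the URL is a placeholder, not a real JAM*VAT endpoint, and the field values are invented:

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

# The same two fields as the form above, with made-up values.
fields = {"Name": "Alice Example", "Favourite Food": "cheese on toast"}
body = urlencode(fields).encode("ascii")

request = Request(
    "http://jamvat.example/post",  # hypothetical URL, substitute your install
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    method="POST",
)
# Passing `request` to urllib.request.urlopen() would send it.
```

The body is the same name=value encoding a browser produces when the form is submitted.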

Other features:

  • fastcgi support
  • Improved XML and RDF translation
  • various bug fixes

Get the opensource software from the tagtriples sf project site.

Serendipity will build the semantic web

Here's a barrier to successfully using RDF URIs for identifying things collaboratively: you need to know the URI before you can use it.

If two parties create URIs for the same thing in separation, the chances of them minting the same URI are pretty much nil. This is especially true with temporal separation - you can't possibly find out the URI scheme an authority uses if they haven't created it yet.

I hit this problem deploying RDF at work and looked to OWL as a bridging solution. owl:sameAs and owl:InverseFunctionalProperty allow you to semantically connect URIs after the original data has been written. Unfortunately I was forced to concede that I couldn't make this approach work in practice.

The problem wasn't performance - a combination of back-chained inferencing and regular smushing passes to collapse the URIs into logical 'meanings' works reasonably well in this respect.
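For readers who haven't met smushing, here is a bare-bones sketch of one pass: subjects that share a value for an inverse functional property are collapsed onto one canonical identifier. It is a generic union-find illustration, not the code I ran at work, and the URIs and the choice of foaf:mbox as the property are just example data:

```python
def smush(statements, inverse_functional):
    """Map each subject to a canonical subject, merging on shared IFP values."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lexically smaller identifier as the canonical one.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    seen = {}  # (predicate, object) -> first subject seen with that value
    for subject, predicate, obj in statements:
        find(subject)
        if predicate in inverse_functional:
            key = (predicate, obj)
            if key in seen:
                union(seen[key], subject)
            else:
                seen[key] = subject
    return {node: find(node) for node in parent}

statements = [
    ("http://a.example/people#alice", "foaf:mbox", "mailto:alice@example.org"),
    ("http://b.example/staff/42", "foaf:mbox", "mailto:alice@example.org"),
    ("http://b.example/staff/43", "foaf:mbox", "mailto:bob@example.org"),
]
canonical = smush(statements, {"foaf:mbox"})
```

The first two subjects end up mapped to the same identifier and the third stays on its own.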

In the end the thing that killed it was complexity: An explosion of statements and indirection that was brittle and required constant [central] management in order to present a coherent picture to users (some of whom are of course creating data dependent on the inferred stuff). And this is on a small scale - merging a few RDF database exports, <1M (non-inferred) triples.

And I think this is the crux of the whole semweb problem - for a semantic web to emerge, the whole thing needs to link up and work in a decentralized world. Publishers working in separation need to have a fighting chance of having their data link up and add value, or they won't bother publishing.

As a semantic web enabler, URIs are fundamentally broken in this respect - they don't reuse existing grounding, and they don't take advantage of shared (or overlapping) context.

The recent folksonomy phenomenon has shown us that it is possible for serendipitous linking to happen on a large scale. This is achieved by leveraging existing real-world semantic grounding in shared (and well known) terms, and then requiring that clients do their own work in using context to disambiguate terms.

I think this idea has legs - it flips the problem into one that can be decentralized.

I.e. instead of having lots of unconnected data that must be painstakingly merged centrally [which incidentally is what's going on now when we attempt to convert other data to RDF, and when we create owl mapping statements], you have the opposite problem: lots of over-linked data which the consumer must disambiguate (and choose which links to follow) based on an operating context.

In practice, this after-the-fact link disambiguation turns out to be a much simpler problem (at my work at least). Simple tag text-matching turns out to be an excellent disambiguation tool, quickly collapsing the set of possible links to a manageable subset whose size you can vary based on your accuracy and coverage requirements.

And the big bonus is that the aggregation can be automated. The JAM*VAT data aggregator deployed at work is collecting and merging data without any human intervention or source post-processing. This is because the authors of the data are using symbols already grounded in the context of the company. E.g. they're using server names to refer to servers, LDAP UIDs to refer to employees and customers, and application names to refer to applications. Thus everything links up even though the data is created in separation (usually generated from stovepipe databases created long before any notion of an integrated data-web).

Of course there is lots of ambiguity - application names are used to denote application databases and teams. DNS names are used to denote both servers and network router ports. However, once you know what you're looking for it's easy to disambiguate - 'bondtrader (database)', 'ln32babc22 (server)' etc..
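A toy sketch of that kind of tag text-matching. The function name, the data layout (each candidate is a symbol plus the tags found near it) and the `min_matches` knob are assumptions made for the example, not the JAM*VAT implementation:

```python
def disambiguate(candidates, context_tags, min_matches=1):
    """Keep candidates whose tags overlap the consumer's context tags.

    Raising min_matches trades coverage for accuracy.
    """
    wanted = {tag.lower() for tag in context_tags}
    kept = []
    for symbol, tags in candidates:
        overlap = wanted & {tag.lower() for tag in tags}
        if len(overlap) >= min_matches:
            kept.append((len(overlap), symbol, sorted(tags)))
    # Best overlap first, then alphabetical by symbol.
    kept.sort(key=lambda item: (-item[0], item[1]))
    return [(symbol, tags) for _, symbol, tags in kept]

candidates = [
    ("bondtrader", {"database"}),
    ("bondtrader", {"team"}),
    ("ln32babc22", {"server"}),
    ("ln32babc22", {"router", "port"}),
]
databases = disambiguate(candidates, ["database"])
# -> [('bondtrader', ['database'])]
```

The colliding names are all kept in the store; it is the consumer's context ('database', 'server' and so on) that picks out the meaning wanted at query time.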

Having been through this exercise on a small scale, my conclusion is this: If there is going to be a global semantic web of interconnected data, it will emerge from these principles (reuse of existing symbol-grounding, decentralized publishing, automatic serendipitous data merging) rather than through a carefully maintained web of precise identities and links.