Here's a barrier to successfully using RDF URIs for identifying things collaboratively: you need to know the URI before you can use it.

If two parties create URIs for the same thing in separation, the chances of them minting the same URI are pretty much nil. This is especially true with temporal separation - you can't possibly find out the URI scheme an authority uses if they haven't invented it yet.

I hit this problem deploying RDF at work and looked to OWL as a bridging solution. owl:sameAs and owl:InverseFunctionalProperty allow you to semantically connect URIs after the original data has been written. Unfortunately I was forced to concede that I couldn't make this approach work in practice.
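For the record, here's the shape of the bridging statements I mean - a minimal rdflib sketch, where the example.com/example.net namespaces and the employee URIs are made up for illustration:

    from rdflib import Graph
    from rdflib.namespace import OWL

    g = Graph()
    g.parse(data="""
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix hr:   <http://example.com/hr/> .
    @prefix ops:  <http://example.net/ops/> .

    # Two authorities minted different URIs for the same employee;
    # this statement bridges them after the fact.
    hr:emp4711  owl:sameAs  ops:jsmith .

    # foaf:mbox is inverse-functional: any two URIs sharing a
    # mailbox can be inferred to denote the same person.
    foaf:mbox  a  owl:InverseFunctionalProperty .
    """, format="turtle")

    for s, o in g.subject_objects(OWL.sameAs):
        print(s, "owl:sameAs", o)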

The problem wasn't performance - a combination of backward-chained inferencing and regular smushing passes to collapse equivalent URIs into logical 'meanings' works reasonably well in this respect.
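A smushing pass itself is simple enough to sketch. This is a toy union-find version (nothing like the production code) that rewrites triples onto one canonical URI per owl:sameAs cluster:

    OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"

    def smush(triples):
        """Collapse URIs connected by owl:sameAs onto canonical URIs."""
        parent = {}  # union-find forest: URI -> representative

        def find(u):
            parent.setdefault(u, u)
            while parent[u] != u:
                parent[u] = parent[parent[u]]  # path compression
                u = parent[u]
            return u

        def union(a, b):
            ra, rb = find(a), find(b)
            if ra > rb:           # deterministic canonical choice:
                ra, rb = rb, ra   # the lexically smaller URI wins
            parent[rb] = ra

        for s, p, o in triples:
            if p == OWL_SAMEAS:
                union(s, o)

        # Rewrite every remaining triple onto the canonical URIs,
        # dropping the now-redundant sameAs statements.
        return {(find(s), p, find(o))
                for s, p, o in triples if p != OWL_SAMEAS}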

In the end the thing that killed it was complexity: an explosion of statements and indirection that was brittle and required constant [central] management in order to present a coherent picture to users (some of whom are, of course, creating data that depends on the inferred stuff). And this was on a small scale - merging a few RDF database exports, <1M (non-inferred) triples.

And I think this is the crux of the whole semweb problem - for a semantic web to emerge, the whole thing needs to link up and work in a decentralized world. Publishers working in separation need a fighting chance of having their data link up and add value, or they won't bother publishing.

As semantic web enablers, URIs are fundamentally broken in this respect - they don't reuse existing grounding, and they don't take advantage of shared (or overlapping) context.

The recent folksonomy phenomenon has shown us that it is possible for serendipitous linking to happen on a large scale. This is achieved by leveraging existing real-world semantic grounding in shared (and well-known) terms, and then requiring clients to do their own work, using context to disambiguate terms.

I think this idea has legs - it flips the problem into one that can be decentralized.

I.e. instead of having lots of unconnected data that must be painstakingly merged centrally [which, incidentally, is what's going on now when we attempt to convert other data to RDF, and when we create OWL mapping statements], you have the opposite problem: lots of over-linked data which the consumer must disambiguate (and choose which links to follow) based on an operating context.

In practice, this after-the-fact link disambiguation turns out to be a much simpler problem (at my work, at least). Simple tag text-matching proves an excellent disambiguation tool, quickly collapsing the set of possible links to a manageable subset whose size you can tune based on your accuracy and coverage requirements.
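As an illustration - a toy version, not the real thing, with the candidate URIs and tag sets invented - the core of the idea is just an overlap test between a candidate's tags and the consumer's operating context:

    def disambiguate(candidates, context_tags, min_overlap=1):
        """Filter the possible referents of an ambiguous symbol down
        to those whose tags overlap the consumer's operating context.

        candidates:   mapping of referent URI -> set of tag strings
        context_tags: what the consumer is looking for, e.g.
                      {"server", "london"} when resolving host names
        """
        context = {t.lower() for t in context_tags}
        scored = [(len({t.lower() for t in tags} & context), uri)
                  for uri, tags in candidates.items()]
        # Raising min_overlap trades coverage for accuracy;
        # lowering it trades the other way.
        return [uri for overlap, uri in sorted(scored, reverse=True)
                if overlap >= min_overlap]

    # e.g.:
    # disambiguate({"app:bondtrader": {"application", "trading"},
    #               "db:bondtrader":  {"database", "trading"}},
    #              {"database"})            # -> ["db:bondtrader"]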

And the big bonus is that the aggregation can be automated. The JAM*VAT data aggregator deployed at work is collecting and merging data without any human intervention or source post-processing. This is because the authors of the data are using symbols already grounded in the context of the company. E.g. they're using server names to refer to servers, LDAP UIDs to refer to employees and customers, and application names to refer to applications. Thus everything links up even though the data is created in separation (usually generated from stovepipe databases built long before any notion of an integrated data-web).
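A stripped-down illustration of why this works (the exports and field names here are invented, not the real JAM*VAT feeds): rows from independently built systems link up automatically because they already name things with the same company-wide symbols:

    from collections import defaultdict

    # Hypothetical exports from two stovepipe systems, created in
    # separation but grounded in the same company-wide symbols.
    monitoring = [{"server": "ln32babc22", "cpu_load": 0.71}]
    deployment = [{"server": "ln32babc22", "app": "bondtrader",
                   "owner": "jsmith"}]  # jsmith: an LDAP UID

    def aggregate(*sources):
        """Merge independent exports with no post-processing: rows
        link up simply because they reuse the same server names."""
        by_symbol = defaultdict(dict)
        for rows in sources:
            for row in rows:
                # The server name is the shared grounding; everything
                # said about it, from any source, lands in one record.
                by_symbol[row["server"]].update(row)
        return dict(by_symbol)

    print(aggregate(monitoring, deployment))
    # {'ln32babc22': {'server': 'ln32babc22', 'cpu_load': 0.71,
    #                 'app': 'bondtrader', 'owner': 'jsmith'}}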

Of course there is lots of ambiguity - application names are used to denote application databases and teams, and DNS names are used to denote both servers and network router ports. However, once you know what you're looking for, it's easy to disambiguate: 'bondtrader (database)', 'ln32babc22 (server)', etc.
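That disambiguation step can be as dumb as a lookup keyed on (name, kind) - again a toy sketch with invented identifiers:

    # One well-known name can denote several kinds of thing.
    directory = {
        ("bondtrader", "application"): "app-042",
        ("bondtrader", "database"):    "db-017",
        ("ln32babc22", "server"):      "host-993",
        ("ln32babc22", "router-port"): "port-112",
    }

    def resolve(name, kind):
        """Resolve an ambiguous name once the consumer has said
        what kind of thing it is after."""
        return directory.get((name, kind))

    resolve("bondtrader", "database")  # -> "db-017"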

Having been through this exercise on a small scale, my conclusion is this: if there is going to be a global semantic web of interconnected data, it will emerge from these principles (reuse of existing symbol-grounding, decentralized publishing, automatic serendipitous data merging) rather than through a carefully maintained web of precise identities and links.