In determining the meaning of tokens used in communication, there are two widely used approaches to disambiguation that I'll characterise as 'namespacing' and 'context'.

When humans communicate amongst themselves they use the context of the communication to narrow down the range of possible meanings of the terms used in the exchange, and human language doesn't employ namespaces at all. On the other hand, computer identifier schemes typically use namespaces to prevent term clashes, and don't use context at all.

The mechanisms operate differently:

Namespaces:

  • Every use of a namespaced term refers to the same concept (or, if it doesn't, this is considered an error)
  • Deterministic

Context:

  • The concept denoted by the term depends on the context in which it is used
  • Statistical (is that the right word?)
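
To make the distinction concrete, here's a rough sketch in Python (the terms and senses are invented for illustration) of the two mechanisms: a namespaced lookup where a qualified term always maps to one concept, and a context-based lookup where the same bare term resolves differently depending on the other terms around it.

```python
# Namespacing: a qualified (namespace, term) pair always denotes one concept.
NAMESPACED = {
    ("geography", "bank"): "land alongside a river",
    ("finance", "bank"): "institution that holds deposits",
}

def resolve_namespaced(namespace, term):
    # Deterministic: the same qualified term always yields the same concept;
    # two conflicting registrations of the same pair would simply be an error.
    return NAMESPACED[(namespace, term)]

# Context: the bare term is disambiguated by the terms around it.
SENSES = {
    "bank": {
        "land alongside a river": {"river", "water", "erosion"},
        "institution that holds deposits": {"loan", "account", "interest"},
    },
}

def resolve_by_context(term, context_terms):
    # Statistical: pick the sense whose typical neighbours best overlap
    # with the terms actually seen in this exchange.
    senses = SENSES[term]
    return max(senses, key=lambda sense: len(senses[sense] & set(context_terms)))

print(resolve_namespaced("finance", "bank"))
print(resolve_by_context("bank", {"river", "fishing", "water"}))
```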

My thinking is that namespaces work well in a closed environment because coordination overhead is low and deterministic programs are easy to write. Namespaced schemes do, however, require a management mechanism to ensure that each use of the same term denotes exactly the same thing. This works well if the terms are grounded in the system - e.g. on the www a URL is used to fetch a document, and thus its use as an identifier for that document is grounded.

However, the semantic web is an open environment with little grounding, which means that holistic term coordination and management isn't practical. Thus web-scale semantic web systems need to employ some degree of context-based disambiguation anyway - i.e. the system can't globally merge statements together without considering issues of provenance and consistency. I wrote about this issue here; at present this consideration is usually handled manually by the person operating the RDF store or software, but as these systems grow and scale, more of these issues will need to be addressed by software.
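
As a rough illustration of what I mean by not merging globally, here's a small Python sketch (the store layout and example triples are invented, not from any particular RDF store): triples stay partitioned by source document, so provenance and consistency can be examined per source rather than being papered over by a single merged graph.

```python
from collections import defaultdict

# Keep triples partitioned by source document instead of one merged graph.
store = defaultdict(set)

def add(source, triple):
    store[source].add(triple)

def sources_asserting(triple):
    # Provenance question: which documents actually make this claim?
    return [src for src, triples in store.items() if triple in triples]

add("doc-a", ("Mercury", "is a", "planet"))
add("doc-b", ("Mercury", "is a", "chemical element"))

# Whether the two sources are even talking about the same 'Mercury' is a
# consistency question the consuming client has to settle from context;
# blindly merging the two graphs would paper over it.
print(sources_asserting(("Mercury", "is a", "planet")))   # ['doc-a']
```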

Note that it is important to distinguish between this and the issue of trust in the content of the communication - here I am purely talking about interpreting the meaning of the communication, specifically measuring term consistency between documents from disconnected sources.

Now, if you take this inevitable use of context at web scale as given, my question is: could the semantic web bootstrap and scale better with a system that disambiguated entirely based on context and didn't employ namespaces at all (i.e. like human language communication)?

So I've been thinking along the lines of a scheme where literals and bnodes are used in place of URIs in RDF documents. Vocabularies use literal terms in place of URIs, and combinations of terms are used to infer meaning in aggregate.
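
As a rough sketch of what such a document might look like (an invented example in plain Python tuples, rather than any concrete serialisation), vocabulary terms are bare string literals and the things described are bnodes:

```python
import uuid

def bnode():
    # A blank node: identifies a thing only within this document.
    return "_:" + uuid.uuid4().hex[:8]

book = bnode()
author = bnode()

# Triples use bare literal terms where traditional RDF would use URIs.
doc = [
    (book, "type", "book"),
    (book, "title", "On the Origin of Species"),
    (book, "written by", author),
    (author, "name", "Charles Darwin"),
    (author, "born", "1809"),
]

# Meaning is inferred from the terms in aggregate: 'title' appearing
# alongside 'written by' and 'type' = 'book' suggests the bibliographic
# sense, without any namespace declaring it.
```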

Non-determinism issues aside, this approach does have a central advantage: it removes the coordination and bootstrap overhead associated with the use of namespaced identifiers, and in particular with issues peculiar to URIs:

  • artificial namespaces mean there's little serendipitous term matching between disconnected, uncoordinated clients
  • pre-existing identifier schemes are commonly not valid URIs, making reuse difficult
  • URIs introduce unnecessary term ownership, authority and lifecycle issues
  • other URI-specific issues add to cognitive overhead: hash vs slash, whether a URI denotes a document or the thing it describes

One particular advantage of the literals-in-combination approach is that data can be lifted from existing sources without the need to invent and translate identifiers into URI schemes. Currently, translating data into traditional RDF involves two challenges:

  • converting the structure of the data into a triple graph
  • translating the identifiers into a URI scheme

Whereas the former is a one-shot deal for each data format, the latter frequently requires manual input for each document and is IMO the single biggest hurdle to putting data onto the semantic web.
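
To illustrate, here's a small Python sketch of lifting a row of tabular data straight into literal-term triples (the CSV sample is invented): the structural conversion is mechanical, and because column names and cell values are used as terms directly, the second step of inventing and mapping URIs disappears.

```python
import csv, io

# Invented sample data - imagine a CSV exported from some existing system.
raw = """isbn,title,author
0451526538,The Adventures of Tom Sawyer,Mark Twain
0140430725,North and South,Elizabeth Gaskell
"""

def lift(csv_text):
    triples = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        subject = f"_:row{i}"  # one bnode per row
        for column, value in row.items():
            # Column names and cell values become terms as-is: the
            # structural conversion is all there is, no URI minting step.
            triples.append((subject, column, value))
    return triples

for triple in lift(raw):
    print(triple)
```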

Of course, the downside of the approach is that software consuming the data needs to take a non-deterministic approach to term meaning. There is no globally correct answer to 'does this term in this document mean the same as this one?' - instead it is a function of both the context in which the documents were written and the requirements of the querying client. Unfortunately I suspect that as people try to get traditional W3C semweb technologies to scale up in web-scale environments, they're going to find themselves in the same non-deterministic boat.
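
As one naive sketch of how a consuming client might tackle that question (the similarity measure and example documents are mine, purely for illustration): compare the contexts - the co-occurring terms - in which the term appears in each document, and let the client pick the threshold its application needs.

```python
def context_of(term, triples):
    # The 'context' of a term: the other terms it co-occurs with,
    # ignoring bnode labels, which only have meaning within a document.
    neighbours = set()
    for triple in triples:
        if term in triple:
            neighbours.update(t for t in triple if not t.startswith("_:"))
    neighbours.discard(term)
    return neighbours

def same_meaning_score(term, doc_a, doc_b):
    # Jaccard overlap of the term's contexts in the two documents; the
    # querying client decides what score counts as 'the same' for its needs.
    a, b = context_of(term, doc_a), context_of(term, doc_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

doc_a = [("_:1", "title", "Dubliners"), ("_:1", "author", "James Joyce")]
doc_b = [("_:7", "title", "Dubliners"), ("_:7", "publisher", "Grant Richards")]
doc_c = [("_:9", "title", "Baron"), ("_:9", "granted to", "_:8")]

print(same_meaning_score("title", doc_a, doc_b))  # contexts overlap -> high
print(same_meaning_score("title", doc_a, doc_c))  # no overlap -> 0.0
```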

I'm experimenting with a literals and bnodes approach in my own software and will post updates to my blog.