Global identifier schemes don’t scale

When designing large information systems to hold data from a wide range of sources (e.g. a large company inventory or knowledge base), a common approach is to employ a global identifier scheme so that entities can be referenced unambiguously across the system. A really large-scale example of this approach is the W3C Semantic Web effort, which identifies entities with URIs - Uniform Resource Identifiers.

However, despite widespread attempts at deployment, I don't think that the global identifier approach works very well in practice, and it is especially sub-optimal when used in large decentralized (uncoordinated) systems. The reason is that without painstaking coordination, shared identifiers tend to take on meaning which is specific to the context in which they are used. The result is that the identifiers are no longer 'globally' unambiguous.

To illustrate this, let's say I create the global identifier id:PhilDawes to mean me, and I then create the statement:

id:PhilDawes weight 10stone

On more careful analysis it becomes apparent that I'm actually not creating information about the general 'me', but rather a specific 'me' that existed on 7th March 2006 (i.e. when I asserted the information). If I use the same identifier to make similar statements about my weight at other points in my life then the merged information will be inconsistent (a person can't have more than one weight) - this is because the id:PhilDawes being described is actually different in each case.

In effect it means that my identifier isn't really global, or at least if it is I am not using it consistently. You can't merge the information about id:PhilDawes without considering the context under which the data was created to see if the thing being described is actually logically consistent in each case. The identifier is effectively local to the context of the communication.
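To make the merge problem concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tuple layout, the second source with its '11stone' value, and the idea of a SINGLE_VALUED set of properties are mine, not part of RDF or any real toolkit. It merges two sets of statements that share an identifier and reports where a single-valued property ends up with more than one value.

```python
from collections import defaultdict

# Hypothetical statements from two sources that both reuse the same
# "global" identifier. The second value is invented for illustration.
source_a = [("id:PhilDawes", "weight", "10stone")]
source_b = [("id:PhilDawes", "weight", "11stone")]

# Assumption: properties that can only hold one value per subject.
SINGLE_VALUED = {"weight"}


def merge(*sources):
    """Naively merge statements, ignoring the context each came from."""
    merged = defaultdict(set)
    for statements in sources:
        for subject, prop, value in statements:
            merged[(subject, prop)].add(value)
    return merged


def conflicts(merged):
    """Return the (subject, property) pairs that hold more than one value."""
    return {
        key: values
        for key, values in merged.items()
        if key[1] in SINGLE_VALUED and len(values) > 1
    }


print(conflicts(merge(source_a, source_b)))
```

Each source is consistent on its own; the conflict only appears once the context is thrown away during the merge.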

Of course this could be considered a matter of precision - instead of making a single unqualified statement, I should have qualified it with enough information to ensure that the id:PhilDawes being described is the more abstract id:PhilDawes that I intended rather than the specific one, e.g.

'On 7th March 2006, id:PhilDawes weight 10stone'

Of course I may need to be even more specific than this in order to ensure consistency - maybe:

'On 7th March 2006 at 09:22:35, id:PhilDawes weight 10stone (without clothes on)'

The problem is that people don't write data like this - shared assumptions are desirable between communicating parties: they reduce the required communication bandwidth. This means that in a large decentralized system you can expect plenty of ambiguity as people share identifiers; the upshot is that the only realistic course of action is to always consider the context when evaluating what the identifier is referring to.
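As a rough sketch of what 'considering the context' might look like, the Python below attaches an optional date to each statement. The Statement class, the as_of field and the compare function are hypothetical names of my own, and the 11stone values and the 2007 date are invented. The point is only that when the author leaves the context implicit, a reader cannot tell a real contradiction from two statements about different moments.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Statement:
    subject: str
    prop: str
    value: str
    # None means the author left the context implicit (the usual case).
    as_of: Optional[date] = None


def compare(a: Statement, b: Statement) -> str:
    """Classify a pair of statements as 'ok', 'conflict' or 'unknown'."""
    if (a.subject, a.prop) != (b.subject, b.prop) or a.value == b.value:
        return "ok"
    if a.as_of is None or b.as_of is None:
        # Different values, but the context was assumed rather than stated.
        return "unknown"
    return "conflict" if a.as_of == b.as_of else "ok"


qualified = Statement("id:PhilDawes", "weight", "10stone", date(2006, 3, 7))
later = Statement("id:PhilDawes", "weight", "11stone", date(2007, 3, 7))
unqualified = Statement("id:PhilDawes", "weight", "11stone")

print(compare(qualified, later))        # different dates, so no clash
print(compare(qualified, unqualified))  # missing context, cannot decide
```

In an uncoordinated system I would expect most pairs to land in the 'unknown' bucket, which is exactly where context has to be evaluated by other means.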

So what does this mean for large decentralized systems that employ global identifier schemes? Well, provided that you always consider the context when evaluating data, you should be able to minimise the consistency issues. Except that that kinda defeats the point of using global identifiers in the first place.

Earlier I said that I thought the use of global identifiers was especially sub-optimal for decentralized systems. Here are some downsides to employing a global scheme in such an environment:

1) Identifiers need to be sufficiently large and namespaced to avoid collision

2) Because no one unambiguous system of identification exists, there is a bootstrap problem: identifiers must be 'invented' and communicated for each thing described.

3) Having to find, choose and use identifiers with minimal risk of ambiguity requires effort and represents a high barrier to entry for people creating data. High enough that people often invent their own ids rather than reusing others'.

And some effects of these downsides:

(1) makes the serialized data pretty cumbersome for localized data exchange - this fact plagues RDF protocols and results in simpler context-specific systems being used. Both (2) and (3) massively reduce the chance of serendipitous data crossover in an uncoordinated system (the sort of network-effect magic that makes tagging systems useful and popular). (2) also ensures that there's no reliable way of automating the import of external data into the system, short of creating a completely new (unambiguous) set of identifiers (which won't then merge with anything). This inevitably means that importing data into the system becomes a manual job.

Finally, the ground-up architecture of 'no ambiguity' usually means that there's no way of reconciling or disambiguating different uses of the same identifier when they do arise (without resorting to some out-of-band solution). For example in RDF there's no way to compare or reason about two differing 'usages' of the same URI inside the system. This means conflict resolution cannot be articulated within the system.

So what's the alternative? Local identifiers scoped within the context of the communication (or document or database or whatever). Well actually that's not an alternative, it's what you really had in the first place with the 'global' scheme, but now you're not alluding to the idea of universal consistency.

So now, free from the illusory shackles of unambiguous global identity, you can reuse terms from existing identity schemes such as database primary keys, zip codes and common language without worrying about global-scope collision. Automated generic importing of external data (e.g. databases, XML etc.) becomes possible, and information serendipitously combines without manual intervention.
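Here is one way this could look, as a small Python sketch rather than a real design. The Store class, the context names 'hr_db' and 'inventory_db' and the sample rows are all invented for illustration. Rows are imported with their existing primary keys as local identifiers, each fact remembers the context it came from, and a lookup on a shared term returns every use of it along with that context.

```python
from collections import defaultdict


class Store:
    """Hypothetical store: identifiers are local, scoped by their context."""

    def __init__(self):
        self.facts = []                   # (context, local_id, prop, value)
        self.contexts = defaultdict(set)  # local_id -> contexts that use it

    def import_rows(self, context, rows, key):
        """Import dict rows as-is, using the existing key column as the id."""
        for row in rows:
            local_id = str(row[key])
            self.contexts[local_id].add(context)
            for prop, value in row.items():
                if prop != key:
                    self.facts.append((context, local_id, prop, value))

    def lookup(self, local_id):
        """Return every fact using this term, whatever context it came from."""
        return [fact for fact in self.facts if fact[1] == local_id]


store = Store()
# Two invented sources that happen to share the primary key 42.
store.import_rows("hr_db", [{"id": 42, "name": "Phil"}], key="id")
store.import_rows("inventory_db", [{"id": 42, "item": "stapler"}], key="id")

for fact in store.lookup("42"):
    print(fact)
```

Nothing had to be renamed on the way in, and the two uses of '42' sit side by side without being silently merged. Deciding whether they refer to the same thing is left to the client, which can now at least see both contexts.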

Of course clients must evaluate context and provenance before using data, but in an uncoordinated system like the semantic web (or company knowledge base) they had to do that anyway. The added advantage of allowing reuse of existing schemes is that the broader deployment of common language yields opportunities for statistical analysis over the data, which can be used as a tool to assist context evaluation (e.g. see spam filtering, PageRank etc.).

So, to sum up: I think that the idea of universal consistency in a large decentralized system is an illusion, and alluding to it with a global identification scheme imposes unnecessary shackles on the growth and adoption of the system. In short: local scope will happen anyway - you're better off embracing it.