I've been struggling to find workable strategies for overcoming the symbol-ambiguity problem, both in RDF (which can easily suffer from context skew) and in my TagTriples scheme, which just uses symbols/words instead of URIs to identify things and is thus more susceptible to ambiguity.

Had a bit of a revelation moment on the train yesterday:

Hypothesis: For KM applications, disambiguation of symbols (and URIs) into their contextual meanings does not necessarily need to be done up-front, either at the authoring or at the aggregation stage.

This is based on observations of the three principal information-consumption use cases at work: Structured-Querying, Searching and Browsing.

I noticed that:

People authoring structured queries (think SQL, SPARQL) know a lot about the structure of the information they want to query, and are easily able to derive the meaning of symbols from their context even when there is ambiguity. This is because they (1) are human, and (2) understand the domain they are querying within.

But more importantly, structured query languages then give them the expressive power to capture this symbol-disambiguation reasoning by including context and scope patterns in the query.

E.g. there may be two 'BondTraders' as the result of aggregating from multiple contexts (maybe one is a software project and the other is the deployed production software itself). However, only one of them is running on a cluster of machines spread across two datacentres, so a query that includes that constraint factors out the other sense of the term.
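To make the BondTrader example concrete, here is a minimal Python sketch of a query that carries its own disambiguating context. Everything beyond 'BondTrader' is an assumption made up for illustration: the source document names, the properties (type, leadDeveloper, runsOn, spans), the cluster and datacentre names, and the choice to store each statement alongside the document it came from. It is not how TagTriples or any RDF store actually works, just one way the idea might look.

```python
# Hypothetical aggregated statements: (source_document, subject, property, value).
# Both senses of 'BondTrader' are left in the data, undisambiguated.
TRIPLES = [
    ("projects.txt", "BondTrader", "type", "Project"),
    ("projects.txt", "BondTrader", "leadDeveloper", "alice"),
    ("deploy.txt", "BondTrader", "type", "Application"),
    ("deploy.txt", "BondTrader", "runsOn", "cluster7"),
    ("deploy.txt", "cluster7", "spans", "datacentreA"),
    ("deploy.txt", "cluster7", "spans", "datacentreB"),
]


def sources_running_on_multi_site_cluster(symbol, triples):
    """Return the source documents in which `symbol` runs on a cluster
    that spans at least two datacentres."""
    matches = set()
    for src, subj, prop, cluster in triples:
        if subj == symbol and prop == "runsOn":
            # Context pattern: stay within the same source document and
            # collect the datacentres this cluster is spread across.
            sites = {
                value
                for s, sj, p, value in triples
                if s == src and sj == cluster and p == "spans"
            }
            if len(sites) >= 2:
                matches.add(src)
    return matches


result = sources_running_on_multi_site_cluster("BondTrader", TRIPLES)
print(result)
```

The query never asks which BondTrader is meant. The extra pattern (runs on a cluster spanning two datacentres) only matches the deployed-software sense, so the project sense simply drops out of the result.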

This has an important ramification - it means that, potentially, for structured query clients the ambiguity can be left in the aggregated data.

Ok - so no need for the aggregator to disambiguate up-front for structured-query clients. This leaves clients who are searching and browsing.

The searching and browsing activities are interesting because they often occur when the client (human or software) is hoping to discover new information. The client doesn't necessarily know the structure of the data, and maybe doesn't even fully understand the domain of the information in which it is searching and navigating. The client is therefore poorly equipped to differentiate the meanings of ambiguous symbols.

However, the other characteristic of this activity is that it usually involves a narrow-band, iterative interface.

E.g. 'Searching' clients (think Google) are characterized by using a simple interface, maybe iteratively, to narrow down the search space until they can latch onto a small number of results that are close enough to their goal.

'Browsing' clients (think web browser, URIQA) retrieve (at most) a handful of nodes at once, executing small iterative traversal queries as they navigate carefully through new data and structure.

In both of these modes, I'd speculate that the knowledge chunks retrieved are small enough for the computer to apply disambiguation analysis strategies in real time.

E.g. the statement

BondTrader productionServer FooBahServer12

..could be statistically disambiguated in real time by looking at the scope of the document supplying this information. If most of the subjects having property 'productionServer' map to external symbols that have 'type Application', then it follows that this BondTrader is probably also an application. So the link from BondTrader can point to a chunk of aggregated BondTrader information from documents/graphs describing applicationy things (rather than projecty things).
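A rough Python sketch of that statistical step, under some loud assumptions: the other subjects (RiskEngine, PricingGrid, Intranet), their servers, the 'Website' type and the simple majority vote are all invented for illustration. Only BondTrader, productionServer, FooBahServer12 and 'type Application' come from the example above, and a real implementation might weigh the evidence quite differently.

```python
from collections import Counter


def infer_type(symbol, prop, statements, known_types):
    """Guess the type of `symbol` by majority vote over the other subjects
    in the same document scope that share the property `prop`.

    Returns (best_type, share_of_votes), or (None, 0.0) with no evidence.
    """
    votes = Counter(
        known_types[subj]
        for subj, p, _value in statements
        if p == prop and subj != symbol and subj in known_types
    )
    if not votes:
        return None, 0.0
    best, count = votes.most_common(1)[0]
    return best, count / sum(votes.values())


# Hypothetical statements found in the scope of the supplying document.
statements = [
    ("BondTrader", "productionServer", "FooBahServer12"),
    ("RiskEngine", "productionServer", "FooBahServer03"),
    ("PricingGrid", "productionServer", "FooBahServer07"),
    ("Intranet", "productionServer", "FooBahServer01"),
]

# Hypothetical 'type' statements already known for the external symbols.
known_types = {
    "RiskEngine": "Application",
    "PricingGrid": "Application",
    "Intranet": "Website",
}

sense, confidence = infer_type("BondTrader", "productionServer", statements, known_types)
print(sense, round(confidence, 2))
```

With this made-up data, two of the three other subjects that have a productionServer are applications, so BondTrader is guessed to be an application too, with a vote share that a client could treat as a rough confidence.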

Moreover, the interface is hands-on and iterative (even for software clients), potentially allowing the client to tune the disambiguation on the fly - adjusting the error rate and trading off false positives against false negatives in the process.
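One way to picture that tuning is a confidence threshold the client can move between iterations. The Python sketch below is purely illustrative: the sense names, the scores and the threshold mechanism are my assumptions, not a description of any existing tool.

```python
def choose_senses(scores, threshold):
    """Keep every candidate sense whose score is at least `threshold`,
    best first. A high threshold risks dropping the right sense (false
    negatives); a low one lets wrong senses through (false positives)."""
    kept = [(sense, score) for sense, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for the two senses of 'BondTrader'.
scores = {"Application": 0.67, "Project": 0.33}

strict = choose_senses(scores, 0.5)  # client only wants the likely sense
loose = choose_senses(scores, 0.2)   # client widens the net on the next pass
print(strict)
print(loose)
```

A client that finds the strict result too narrow can simply re-run with a lower threshold, which is the kind of cheap iteration a search or browse interface already encourages.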

So to conclude: It looks to me, contrary to my earlier intuition, that symbol/context disambiguation can be delayed until the data is closer to the client. This is good news because it means the information can be aggregated without processing and transformation (and thus without the potential signal loss/skewing that comes with them), and also that the disambiguation process can be iterative, dynamic, and tailored to the client context.