URI bloat in queries to smushed data

One thing I've found when dealing with IFP-merged data (data smushed on inverse functional properties such as foaf:mbox) is that you end up with lots of URIs for the same logical resource.

A good example of this is the foaf:Person 'Dan Brickley', who is mentioned in lots of people's FOAF files. Because Dan doesn't assign a URI to himself (and why should he?), quite a few people have invented their own, usually by using rdf:ID in their documents.

Another place where you get multiple URIs per 'logical' resource is when merging schemas that contain exact-match terms (i.e. smushing with owl:equivalentProperty): you end up with the same 'logical' property under multiple URIs.

The problem with querying RDF stores of this aggregated, smushed data is that the multiple URIs cause triple bloat in the results. E.g. the query:

construct * where {?dan foaf:mbox <mailto:danbri@w3.org>. ?dan ?prop ?obj.}

returns n*r statements, where r is the number of property/object pairs for Dan, and n is the number of URIs denoting him.
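To make the n*r count concrete, here is a rough Python sketch that models the store as a plain set of (subject, property, object) tuples. The three URIs, the property values, and the describe_by_mbox helper are all made up for illustration; this is not how any particular store implements the query.

```python
# Hypothetical URIs that different FOAF authors invented for the same person.
dan_uris = [
    "http://example.org/a#dan",
    "http://example.org/b#danbri",
    "http://example.org/c#DanBrickley",
]

# Illustrative property/object pairs known for Dan after smushing.
dan_pairs = [
    ("foaf:mbox", "mailto:danbri@w3.org"),
    ("foaf:name", "Dan Brickley"),
    ("rdf:type", "foaf:Person"),
]

# Assumption: the smushing store copies every pair onto every equivalent URI.
store = {(uri, prop, obj) for uri in dan_uris for prop, obj in dan_pairs}


def describe_by_mbox(triples, mbox):
    """Return every statement about any subject that has the given mbox."""
    subjects = {s for s, p, o in triples if p == "foaf:mbox" and o == mbox}
    return {(s, p, o) for s, p, o in triples if s in subjects}


result = describe_by_mbox(store, "mailto:danbri@w3.org")
print(len(result))  # n * r = 3 URIs * 3 pairs = 9 statements
```

With only three URIs and three pairs the result is already nine statements, six of which tell the client nothing new.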

Unfortunately, rather than:

there are lots of resources with mbox danbri@w3.org (all with identical properties)

the client more commonly wants:

there is one logical resource 'dan-brickley', with multiple URIs denoting him

(at least in the software I've been involved with)

In my smushing store I'm experimenting with a scheme which returns one result per 'logical resource' (picking a URI to describe it at random), and then provides an API for fetching the equivalent URIs. When returning results in an RDF graph, it picks one URI, and then puts owl:sameIndividualAs or owl:equivalentProperty statements at the end of the graph.
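The graph-returning half of that scheme could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the collapse function, the example URIs, and the input shapes are hypothetical, it handles individuals only (equivalent properties would get the same treatment with owl:equivalentProperty), and it picks the lexically smallest URI rather than a random one so that the output is reproducible.

```python
SAME_AS = "owl:sameIndividualAs"


def collapse(triples, equivalents):
    """Rewrite triples onto one URI per logical resource.

    triples:     list of (subject, property, object) tuples
    equivalents: list of sets, each holding the URIs of one logical resource
    Returns the de-duplicated triples followed by the equivalence statements.
    """
    canonical = {}
    for group in equivalents:
        chosen = min(group)  # deterministic stand-in for the random pick
        for uri in group:
            canonical[uri] = chosen

    body, seen = [], set()
    for s, p, o in triples:
        # Map subject and object onto their canonical URI where one is known.
        t = (canonical.get(s, s), p, canonical.get(o, o))
        if t not in seen:
            seen.add(t)
            body.append(t)

    # Equivalence statements go at the end of the graph.
    tail = [
        (canonical[uri], SAME_AS, uri)
        for uri in sorted(canonical)
        if canonical[uri] != uri
    ]
    return body + tail


# Made-up URIs for the same person, each carrying the same three pairs.
uris = {
    "http://example.org/a#dan",
    "http://example.org/b#danbri",
    "http://example.org/c#DanBrickley",
}
pairs = [
    ("foaf:mbox", "mailto:danbri@w3.org"),
    ("foaf:name", "Dan Brickley"),
    ("rdf:type", "foaf:Person"),
]
bloated = [(u, p, o) for u in sorted(uris) for p, o in pairs]

graph = collapse(bloated, [uris])
for statement in graph:
    print(statement)
```

For this input the nine bloated statements shrink to three statements about the chosen URI plus two owl:sameIndividualAs statements, i.e. r + (n - 1) instead of n*r.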