The path from specificity to usefulness

I tried to comment on Seth's post, but I think the comments on his blog are a bit broken at the moment (the captcha question wasn't rendering, so I couldn't answer it!). I guess I'll trackback instead:

The path from specificity to usefulness that Seth describes was exactly the trip I took when attempting to implement semantic web approaches at work to help manage IT operations info (which was stored in various silos). I started off with RDF and built a store which used some OWL rules to connect the data from the various sources. This proved cumbersome and difficult - other people found RDF quite a hurdle to understand, and the same URIs ended up being used to mean subtly different things (e.g. IT application vs project).

After a year and a half of evolving the system, the best solution ended up being to just index triples of words. Vaguer than URIs, but easier to harvest and match from databases. Since humans write the queries, it turned out that the vagueness wasn't a problem at all. Universal Specificity (such as URIs/RDF require) just doesn't seem to scale very well in my experience.
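
To make that concrete, here's a minimal sketch (in Python, with made-up data and class names - not the system I built at work) of what "indexing triples of words" might look like: plain lowercased words as subject/predicate/object, and a loose word-based lookup instead of exact URI matching.

```python
from collections import defaultdict

class WordTripleStore:
    """Toy store of (subject, predicate, object) word triples - no URIs."""

    def __init__(self):
        self.triples = []              # list of word triples
        self.index = defaultdict(set)  # word -> positions of triples containing it

    def add(self, subject, predicate, obj):
        pos = len(self.triples)
        triple = (subject.lower(), predicate.lower(), obj.lower())
        self.triples.append(triple)
        for word in triple:
            self.index[word].add(pos)

    def search(self, *words):
        """Triples containing all of the given words, in any position."""
        hits = None
        for word in words:
            positions = self.index.get(word.lower(), set())
            hits = positions if hits is None else hits & positions
        return [self.triples[p] for p in sorted(hits or set())]

store = WordTripleStore()
store.add("payroll", "runs on", "server42")    # harvested from one database
store.add("server42", "located in", "london")  # harvested from another
print(store.search("payroll", "server42"))     # vague, but fine when a human asks
```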

How useful is structured data?

My recent look at microformats has led me to think more about the levels of grey between being able to fully interpret (understand) data and not being able to interpret it at all. Microformats are currently very binary in this regard - either the software knows the microformat and is able to interpret it, or it doesn't. This is at odds with other data formats, including XML and RDF, which can convey structure even if the software doesn't fully understand the schema and vocabulary in use.

Some (local, approx) definitions:

Graph
A coarse-grained 'chunk' of data. E.g. a document on the web.
Structure
The scoping of the data: parent/child relationships, links, which bits are properties and which bits are values.
Schema
A set of restrictions which explicitly constrain the values and structure that can be used in the data (without requiring understanding of the actual meaning). E.g. XML Schema, RELAX NG, WSDL.
Vocabulary
The 'meaning' of properties, e.g. 'what does "name" mean?'. Usually articulated relative to other properties (e.g. in OWL).

Here are some 'levels' of data understanding, and some things software can usefully do at each level:

1) The software is unable to interpret the meaning of the data, and also unable to interpret the structure

The data can still be broken into a sea of atomic bits, and those bits indexed to enable searching. E.g. a straight text index on an HTML document doesn't attempt to interpret the structure of the data - it merely indexes the occurrences of the words in the sea of text. Consumers that understand both structure and meaning can then retrieve the graphs (documents) that contain certain words.
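
A crude sketch of level 1 (the documents and filenames are invented for illustration): a plain inverted index mapping words to the graphs they occur in, with no structure and no meaning involved.

```python
# Level 1 sketch: no structure, no meaning - just words mapped to documents.
from collections import defaultdict

documents = {
    "doc1.html": "acme payroll application owned by the finance team",
    "doc2.html": "server42 hosts the payroll application in london",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.split():
        index[word].add(doc_id)

# A consumer that *does* understand the meaning can still fetch the right graphs:
print(sorted(index["payroll"]))  # ['doc1.html', 'doc2.html']
```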

2) The software is unable to interpret the meaning of the data, but can interpret its structure

The data structure can be held and indexed (even though the meaning isn't understood at any level). It can be aggregated and presented ready-indexed (e.g. via a structured query interface) to some agent or program that does understand more of the meaning. The software can break the data into logical chunks that are more granular than the graphs input into it. The software can perform transformations on the data, present it marked up in a different way, and perform statistical analysis to look for trends and similarities in structure/vocabulary with other graphs.
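
Here's what level 2 might look like in practice - a sketch that walks arbitrary XML (the element names are made up) purely structurally, collects (path, value) pairs, and answers a structured query without knowing what any of the names mean.

```python
# Level 2 sketch: index the structure of arbitrary XML without understanding
# the vocabulary, then offer a structured query over the (path, value) pairs.
import xml.etree.ElementTree as ET

xml_data = """
<people>
  <person><name>Alice</name><phone>555-1234</phone></person>
  <person><name>Bob</name><phone>555-9876</phone></person>
</people>
"""

def flatten(element, path=""):
    """Yield (path, text) pairs for every element, purely structurally."""
    here = f"{path}/{element.tag}"
    if element.text and element.text.strip():
        yield here, element.text.strip()
    for child in element:
        yield from flatten(child, here)

pairs = list(flatten(ET.fromstring(xml_data)))
# Structured query: "values at paths ending .../person/name",
# with no idea what 'name' actually means.
print([value for path, value in pairs if path.endswith("/person/name")])
```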

3) The software is able to interpret some of the meaning (knows some of the vocabulary used), but not all of it.

It can perform structured queries, operations and transformations based on the bits of vocabulary it does understand. It can present this 'understood' data, along with the structured data it doesn't understand, to an agent/program/human that may be able to interpret more of the vocabulary.
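
A small sketch of level 3, with a hypothetical record and a deliberately tiny known vocabulary: act on the properties the program understands, and pass the rest through still structured.

```python
# Level 3 sketch: the software knows part of the vocabulary (here, hypothetically,
# 'name' and 'email') and acts on it, passing the rest through untouched.

KNOWN_VOCAB = {"name", "email"}  # the fragment of vocabulary this program understands

record = {
    "name": "Alice",
    "email": "alice@example.org",
    "shoe-size": "37",            # structured, but meaningless to this program
    "x-internal-flag": "true",
}

understood = {k: v for k, v in record.items() if k in KNOWN_VOCAB}
passed_through = {k: v for k, v in record.items() if k not in KNOWN_VOCAB}

print("can act on:", understood)                 # e.g. render a contact card
print("hand to a smarter agent:", passed_through)
```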

4) The software is able to interpret the meaning of the data.

Then at this point it's probably human.

Structure in Microformats

Have spent some spare time looking at microformats recently (and more importantly, writing a microformats parser).

The main thing that troubles me is that microformats have no explicit way of conveying the structure of the data. This scuppers the idea of a general microformats importer (which I would obviously like for JAM*VAT, amongst other things).

There are three ways a metadata scheme can convey structure:

1) In the data itself (e.g. RDF, XML, OPML)
2) In a separate schema (e.g. ASN.1)
3) Out of band (i.e. documented somewhere, but not 'discoverable' by the parser)

Microformats currently use the third - the structure of the data needs to be known by the parser in advance, since there's no reliable way of deducing it from the data. This is a conscious decision on the part of the microformats community - they don't want to go down the schema-language rathole. However, it does have a few negative effects:

  • Schema design needs to be centralised, or at least well publicised, since each new schema must be adopted and implemented by the parser writers
  • You can't use existing parsers to parse new formats

I think the latter effect means that niche microformats are unlikely to emerge, since writing a range of parsers for the important languages is a big job.

Alternative to the Semantic Web?

Danny! Here's another alternative for you to consider: tagtriples. (I must sound like a stuck record blathering on about tagtriples all the time, but hear me out...)

I started tagtriples as an attempt to find the simplest subset of RDF that wouldn't lose any of the merging pixie-dust features. (RDF was proving just too complicated to get critical mass at my work, and required up-front agreement to get data to merge). Fresh from the folksonomy buzz I tried replacing URIs with sets of tags used in combination, then realised that tags could be modelled as statements with predicate 'tag' and ended up with tagtriples.

So instead of a universal ID to identify the thing, you rely on combinations of symbols and statements. This opens up more possibilities for magical pixie-dust merging, because it leverages existing symbols grounded in real life (email addresses, phone numbers, names etc.) in combination to join data. (Btw, note that even FOAF RDF drops URIs at the point where it needs to do the pixie-dust merging stuff.)
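
Here's a toy sketch of that symbol-combination merging (the data is invented, and this isn't the JAM*VAT implementation): two sources describe the same person with no shared identifier, but a grounded symbol - the email address - is enough to join them.

```python
# Toy sketch of merging on grounded symbols rather than URIs.

hr_export = [
    ("alice smith", "email", "alice@example.org"),
    ("alice smith", "department", "finance"),
]
ldap_export = [
    ("asmith", "email", "alice@example.org"),
    ("asmith", "phone", "555-1234"),
]

def merge_on_symbol(triples_a, triples_b, predicate="email"):
    """Join subjects from two sources that share an object for the given predicate."""
    values_a = {o: s for s, p, o in triples_a if p == predicate}
    merged = []
    for s, p, o in triples_b:
        if p == predicate and o in values_a:
            merged.append((values_a[o], "same-as", s))
    return merged

print(merge_on_symbol(hr_export, ldap_export))
# [('alice smith', 'same-as', 'asmith')] - merged without anyone minting a URI
```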

But the really cool thing is: when you lose the URIs, scraping data from other general formats becomes simpler and, to a certain degree, actually automatable. You just recreate the structure of the data in triples, and then use the symbols from the source data as node identifiers. You don't even need to be precise to get a representation that can yield useful results (especially if you're doing SPARQL-style queries).
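
As a rough illustration (the column names and rows are invented), here's how mechanical that scraping can be for something like CSV: recreate the row structure as triples and use a symbol already present in the data as the node identifier.

```python
# Sketch of scraping CSV into triples, using a source symbol as the node id.
import csv, io

csv_data = """name,phone,office
Alice,555-1234,London
Bob,555-9876,Paris
"""

triples = []
for row in csv.DictReader(io.StringIO(csv_data)):
    node = row["name"]  # a symbol from the data itself, not a minted URI
    for column, value in row.items():
        if column != "name" and value:
            triples.append((node, column, value))

print(triples)
# [('Alice', 'phone', '555-1234'), ('Alice', 'office', 'London'),
#  ('Bob', 'phone', '555-9876'), ('Bob', 'office', 'Paris')]
```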

BTW, I've written importers for XML, RDF, CSV, and recently for the hCard and hCalendar microformats. At work we have a turn-key database and LDAP exporter. I'm now wondering whether it's possible to mine semantic statements of any useful quality out of general 'semantically-oriented' XHTML - maybe at least enough to do some structured querying on. (I really must post and ask the microformats list about this.)

Other benefits:

  • clusters of symbols are amenable to proximity searching techniques - i.e. find a cluster of statements containing these symbols. This is a powerful way of finding microcontent structures in the data mush.

  • SPARQL-style structured querying becomes simpler - no more namespaces to remember! Combinations of statement patterns in the query easily restrict the set of possible matches to the point where you don't need namespaces to ensure precision (sketched below).
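
To show what "combinations of statement patterns" means, here's a minimal pattern-matching query over plain-symbol triples (the data, and the '?x' variable syntax, are just for illustration - this isn't the tagtriples query engine):

```python
# Sketch of a SPARQL-flavoured pattern query over plain-symbol triples:
# no namespaces, just patterns that jointly narrow down the matches.

triples = [
    ("Alice", "phone", "555-1234"),
    ("Alice", "office", "London"),
    ("Bob",   "phone", "555-9876"),
    ("Bob",   "office", "Paris"),
]

def match(pattern, bindings, triples):
    """Yield extended bindings for one (s, p, o) pattern; '?x' terms are variables."""
    for triple in triples:
        new = dict(bindings)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield new

def query(patterns, triples):
    results = [{}]
    for pattern in patterns:
        results = [b for bindings in results for b in match(pattern, bindings, triples)]
    return results

# "Who has an office in London, and what is their phone number?"
print(query([("?who", "office", "London"), ("?who", "phone", "?phone")], triples))
# [{'?who': 'Alice', '?phone': '555-1234'}]
```

Two patterns sharing the '?who' variable are enough to pin the answer down, with no namespaces in sight.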

OK, so here's the final pitch: with all the scraping, searching and browsing stuff you can do, neither the author nor the consumer needs to actually know anything about tagtriples.

I think that's really cool: The user can import data from existing formats and models, search for items across the merged data (using google-style text searches that work over symbols in close proximity), and browse the data structures without caring that there's some triples-tags-and-graphs model behind it.

That's the most powerful bit. This stuff is useful without gaining any sort of adoption critical mass. (E.g. at work I installed JAM*VAT, emptied some databases and LDAP stores into it, and suddenly people could search and browse across the merged indexed data, traversing where symbol and statement combinations mesh.)

So waddaya think? Existing formats + a generic model to aggregate and interpret the semantic data: An RDF alternative contender?