On context and merging data

I was having a discussion with somebody at a conference recently about merging data on a large scale, and how the differing 'contexts' under which the data is created make global merging difficult. He asked me to define what I meant by the 'context' of the data. It's a woolly term I'd been using for a while without ever actually attempting to nail down what I meant by it. Here's my best stab:

Context of the data: the set of implicit assumptions shared between users of the dataset.

To illustrate this, consider the act of designing a database for an application:

In order to be successful, the database must model (i.e. be consistent with) a portion of the real-life domain sufficient to fulfil the requirements of the application. The trick of the data modeller is to create a database that's able to do this whilst being simple enough to manage within the constraints of the project. (When you consider that exactly modelling the real-world domain means essentially creating a detail-complete clone of it, it's obvious that you need to simplify to some degree.)

The act of making the database 'simple' effectively factors out bits of the domain which don't vary within the scope of the application. These factored-out bits are in effect the context: the implicit assumptions made and shared by the users of that data model (usually the application code) but missing from the explicit data model itself.

This is obviously a tradeoff - you trade explicit completeness for simplicity. Picking the right abstractions and simplifying assumptions can pay big dividends when creating and maintaining the data and developing applications that use it. This motivation means that the database modeller has the application in mind when creating those abstractions. The application of the data drives the data model.

So now to get to the point:

Merging data is hard because

  • Each dataset is built on a different set of simplifying assumptions and abstractions
  • The degree to which these abstractions and assumptions are consistent depends on the application you are trying to fulfill with the merged data.

The second point is not immediately obvious: because of point (1) you can't just merge all the data (regardless of transformations) - it's not globally consistent. This means you have to choose which data elements to merge. Moreover, this choice and the transformations you use depend on the context of the application you are trying to fulfill, and differ between applications.

Which all means that you can't expect to create one super database for your company that is an amalgam of all the current databases, and then build applications around it. At least not without spending lots of money trying to build it, and then subsequently spending more money struggling to modify the database to fit all the applications simultaneously.

A better approach (IMO) is to defer the merging activity as late as possible, and as close as possible to the application itself. I.e. to effectively have a logical super-database per application. I think building a robust system to facilitate this requires dispensing with any notion of global consistency entirely. (But if I start writing about that I'll never get this post finished - I'll save it for another time).
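To make the per-application idea a bit more concrete, here's a rough Python sketch. Everything in it is invented for illustration: the two order lists, the exchange rate and the function names. The point is only that each application makes a different subset of the hidden assumptions explicit when it merges.

```python
# Hypothetical data: each source leaves out what its own application
# took for granted (its context).
UK_ORDERS = [{"customer": "acme", "amount": 100.0}]  # implicit: amounts are in GBP
US_ORDERS = [{"customer": "ACME", "amount": 150.0}]  # implicit: amounts are in USD

GBP_PER_USD = 0.5  # made-up rate, purely for illustration


def revenue_view(uk_orders, us_orders, gbp_per_usd):
    """Merge for a revenue report: currency matters, customer identity doesn't."""
    total = sum(order["amount"] for order in uk_orders)
    total += sum(order["amount"] * gbp_per_usd for order in us_orders)
    return total


def customer_view(uk_orders, us_orders):
    """Merge for a mailing list: customer identity matters, amounts don't.

    Treating names as case-insensitive is itself an assumption that only
    this application needs to make.
    """
    return sorted({order["customer"].lower() for order in uk_orders + us_orders})


total_in_gbp = revenue_view(UK_ORDERS, US_ORDERS, GBP_PER_USD)
customers = customer_view(UK_ORDERS, US_ORDERS)
```

Neither view needs the other's transformation, so neither forces its assumptions onto a shared schema.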

Stateless and Stateful RSS Aggregators

For the purpose of future discussions, I'd like to distinguish between stateful and stateless RSS aggregation.

A stateless aggregator is one which just consumes and represents the current information in an RSS feed (or set of RSS feeds). It doesn't remember items that were previously on an RSS feed, and so is effectively just rendering the current state of the RSS XML data.

A stateful aggregator remembers any items that it picks up over time.
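The difference can be sketched in a few lines of Python. This is only an illustration: feed items are plain dicts with a 'guid' key (my assumption, no real RSS parsing is done), and the function and class names are made up.

```python
def render_stateless(feed_items):
    """Stateless: show only what the feed contains right now."""
    return list(feed_items)


class StatefulAggregator:
    """Stateful: remember every item ever seen, keyed by its guid."""

    def __init__(self):
        self._seen = {}  # dicts keep insertion order, so oldest items come first

    def update(self, feed_items):
        for item in feed_items:
            # Keep the first copy of each item; later fetches don't overwrite it.
            self._seen.setdefault(item["guid"], item)

    def items(self):
        return list(self._seen.values())


# Two fetches of the same (hypothetical) feed; item 'a' has dropped off by the second.
first_fetch = [{"guid": "a", "title": "First"}, {"guid": "b", "title": "Second"}]
second_fetch = [{"guid": "b", "title": "Second"}, {"guid": "c", "title": "Third"}]

aggregator = StatefulAggregator()
aggregator.update(first_fetch)
aggregator.update(second_fetch)

stateless_titles = [item["title"] for item in render_stateless(second_fetch)]
stateful_titles = [item["title"] for item in aggregator.items()]
```

After the second fetch the stateless view has lost 'First', while the stateful one still holds all three items.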

The Firefox live bookmark technology is an example of a stateless aggregator, as is the Planet aggregator software (which we use at work to aggregate our blogs into a single web page and feed).

Technorati, IceRocket and desktop RSS aggregators like RSSBandit and Thunderbird are examples of stateful aggregators.

Testing the ‘*’ tag

This is a test to see if the '*' tag works as a wordpress category (see Sean's comment)....

... Well, I can't add it using the new ajax category thingy in wordpress 2.0, but adding it on the category page works. We'll have to see what the RSS looks like...

External Internal Blogging Trial

Following up on my original DRKW 'External Internal blogging'* post, I'm trialing the idea of aggregating employees' public posts into the DRKW internal blogosphere. For the time being, any posts I tag with 'workfriendly' will end up injected into the internal DRKW posts feed.

Unfortunately I haven't been able to sort out comments yet - ideally comments to tagged posts would also be aggregated into the main comments feed*.

If any other DRKW employees with a public blog would like to be included, drop me a mail (or leave a comment on this post).

* behind DRKW firewall

Silver Spoon

My sister bought me a copy of 'The Silver Spoon' for xmas, and I think it's fab. The book is laid out in order of ingredients (e.g. there's a section on leeks, one on sprouts, one on salmon etc.), so it's easy to find recipes that you've got the ingredients for. Also, most of the recipes are short and specific to that ingredient, so rather than relying on a single large recipe for the entire meal I can pick 2 or 3 and put them together on one plate.

On Intelligence

Just finished reading 'On Intelligence' by Jeff Hawkins, which I really enjoyed.

Hawkins is the entrepreneur responsible for inventing the PalmPilot and the Handspring Treo, among other things, but his primary interest is brains and discovering how they work. To further this interest he has spent a load of his entrepreneurial cash creating a neuroscience research institute, and this book is the result of his research. Put simply, it is an attempt at a general algorithm for how the neocortex (the 'intelligent' bit of the human brain) works.

I read this book after seeing Aaron Swartz's enthusiastic post, which raves about both Hawkins and the book. (The post includes a crude description of the basic algorithm, so I won't duplicate that here.)

Django openid auth - first stab

I've been experimenting with adding openid authentication to django. I couldn't find another software package to do this (although I did see this, which implies there is some other code out there). Anyway - here's mine so far.

The main problem I've hit is that the username column in the django authentication db schema (v0.90) only has 30 characters, so I can't use the openid url as the username.

Instead I'm currently using the first 30 chars of an md5 hash of the url, which sucks. I probably need to create a new openid auth model which holds the openid url and adds a view for getting new users to create a unique username (or something). Or maybe I should contact the django developers about expanding this?... hmm..
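For reference, the workaround amounts to something like the following Python sketch. The function name and the constant are mine, not anything from django or an openid library; only the 30-character limit and the md5 truncation come from the description above.

```python
import hashlib

USERNAME_MAX_LENGTH = 30  # width of the username column mentioned above


def username_from_openid(openid_url):
    """Derive a fixed-length username from an openid url.

    An md5 hex digest is 32 characters long, so this keeps the first 30.
    """
    digest = hashlib.md5(openid_url.encode("utf-8")).hexdigest()
    return digest[:USERNAME_MAX_LENGTH]


# Hypothetical url, for illustration only.
example_username = username_from_openid("http://example.com/alice")
```

The result is stable for a given url, but it's an opaque string that means nothing to the user, which is why a separate model holding the full url looks like the better route.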

Django media-serving dev webserver

I've been porting my jamvat software to the django platform. Django looks cool, and I hope it will:

  • Get jamvat running on the windows platform
  • Provide some UI, connection pooling, security and debugging goodies
  • Remove some setup documentation burden (since it's a 'django app')

Anyway, I was immediately hit by the problem that the django dev webserver doesn't serve static files (i.e. js, images, css etc.), except for urls in the built-in admin server. I can't work out why, and I can't see why people aren't complaining about this - I can only assume that it wants you to use a 'proper' webserver to serve these files even while you're developing.

I couldn't find a solution to this, so have written a hack that does the same thing as django does for the admin server files. Just set 'MEDIA_ROOT' and add a 'DEV_MEDIA_PREFIX' to your project settings, then run the script instead of 'django-admin.py runserver'.
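The core of any hack like this is mapping a request url under the media prefix onto a file under the media root. Here's a standalone Python sketch of just that step. It is not the script itself and uses no django code; the function name and the example paths are made up.

```python
import os


def resolve_media_path(url_path, media_root, prefix):
    """Map a request path under `prefix` to a file path under `media_root`.

    Returns None when the url is outside the prefix, or when it tries to
    escape the media directory (e.g. via '..').
    """
    if not url_path.startswith(prefix):
        return None
    relative = url_path[len(prefix):].lstrip("/")
    root = os.path.abspath(media_root)
    candidate = os.path.abspath(os.path.join(root, relative))
    # Only accept paths that are still inside the media root after normalising.
    if candidate != root and not candidate.startswith(root + os.sep):
        return None
    return candidate


# Hypothetical settings values, for illustration only.
css_file = resolve_media_path("/media/css/site.css", "/srv/media", "/media/")
outside = resolve_media_path("/media/../etc/passwd", "/srv/media", "/media/")
```

A dev server would then read the returned file and send it back, and fall through to the normal url handling whenever the function returns None.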

How useful is structured data?

My recent look at microformats has led me to think more about the levels of grey between being able to fully interpret (understand) data, and not being able to interpret it at all. Microformats are currently very binary in this regard - either the software knows the microformat and is able to interpret it, or it doesn't. This is at odds with other data formats, including XML and RDF, which can convey structure even if the software doesn't fully understand the schema and vocabulary in use.

Some (local, approx) definitions:

Graph
A coarse-grained 'chunk' of data. E.g. a document on the web.
Structure
The scoping of data, parent/child relationships, links, which bits are properties and which bits are values
Schema
A set of restrictions which explicitly constrain the values and structure that can be used in the data (without requiring understanding of the actual meaning). E.g. XMLSchema, RelaxNG, WSDL
Vocabulary
The 'meaning' of properties. e.g. 'what does "name" mean?'. Usually articulated relative to other properties. (e.g. OWL)

Here are some 'levels' of data understanding, and some things software can usefully do at each level:

1) The software is unable to interpret the meaning of the data, and also unable to interpret the structure

The data can still be broken into a sea of atomic bits, and those bits indexed to enable searching. E.g. a straight text index on an html document doesn't attempt to interpret the structure of the data - it merely indexes the occurrences of the words in the sea of text. Consumers that understand both structure and meaning can then retrieve the graphs (documents) that contain certain words.
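A minimal Python sketch of this kind of index, assuming documents arrive as a dict of id to raw text (the sample documents and names are invented). Note that it happily indexes markup as words too, since it makes no attempt to interpret structure.

```python
import re
from collections import defaultdict


def build_text_index(documents):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


# Hypothetical documents, for illustration only.
documents = {
    "a": "<p>Merging data is hard</p>",
    "b": "<p>Structured data is useful</p>",
}
index = build_text_index(documents)
docs_about_data = index["data"]
docs_about_merging = index["merging"]
```

The index can only answer 'which graphs contain this word?'; working out what the word is doing there is left to whoever retrieves the document.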

2) The software is unable to interpret the meaning of the data, but can interpret its structure

The data structure can be held and indexed (even though the meaning isn't understood at any level). It can be aggregated and presented ready-indexed (e.g. via a structured query interface) to some agent or program that does understand more of the meaning. The software can break the data into logical chunks that are more granular than the graphs input into it. The software can perform transformations on the data, present it marked up in a different way, and perform statistical analysis to look for trends and similarities in structure/vocabulary with other graphs.
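As an illustration of this level, here's a small Python sketch that indexes an XML document purely by structure, using the standard library's ElementTree. The sample document and the function name are made up; the code never needs to know what 'name' or 'city' mean.

```python
import xml.etree.ElementTree as ET


def index_structure(xml_text):
    """Return (path, text) pairs for every element that carries text."""
    root = ET.fromstring(xml_text)
    entries = []

    def walk(element, path):
        here = f"{path}/{element.tag}"
        text = (element.text or "").strip()
        if text:
            entries.append((here, text))
        for child in element:
            walk(child, here)

    walk(root, "")
    return entries


# Hypothetical document, for illustration only.
sample = "<person><name>Ada</name><address><city>London</city></address></person>"
entries = index_structure(sample)
```

The resulting path/value pairs are more granular than the input graph and could sit behind a structured query interface for an agent that does understand the vocabulary.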

3) The software is able to interpret some of the meaning (knows some of the vocabulary used), but not all of it.

It can perform structured queries, operations and transformations based on the bits of vocabulary it does understand. It can present this 'understood' data, along with the structured data it doesn't understand, to an agent/program/human that may be able to interpret more of the vocabulary.
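In Python terms this might look like the sketch below, which assumes records are flat dicts of property to value. The vocabulary set, the record and the function name are all hypothetical.

```python
# Hypothetical: the only properties this program knows the meaning of.
KNOWN_VOCABULARY = {"name", "email"}


def split_by_understanding(record, known=KNOWN_VOCABULARY):
    """Separate the properties we can interpret from those we just pass on."""
    understood = {key: value for key, value in record.items() if key in known}
    passed_through = {key: value for key, value in record.items() if key not in known}
    return understood, passed_through


record = {"name": "Ada", "email": "ada@example.com", "shoe-size": "6"}
understood, passed_through = split_by_understanding(record)
```

The program can act on the understood part (say, send a mail) while handing the rest on, still structured, to something that knows more.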

4) The software is able to interpret the meaning of the data.

Then at this point it's probably human.

Structure in Microformats

Have spent some spare time looking at microformats recently (and more importantly, writing a microformats parser).

The main thing that troubles me is that microformats have no explicit way of conveying the structure of the data. This scuppers the idea of a general microformats importer (which I would obviously like for JAM*VAT, amongst other things).

There are three ways a metadata scheme can convey structure:

1) In the data itself (e.g. RDF, XML, OPML)
2) In a separate schema (e.g. ASN.1)
3) Out of band (i.e. documented somewhere, but not 'discoverable' by the parser)

Microformats currently use the third - the structure of the data needs to be pre-known by the parser, since there's no reliable way of deducing it from the data. This is a conscious decision on the part of the microformats community - they don't want to go down the schema-language rathole. However, it does have a few negative effects:

  • Schema design needs to be centralised (or at least well publicised), since each new schema must be adopted and implemented by the parser writers
  • You can't use existing parsers to parse new formats

I think the latter effect means that niche microformats are unlikely to emerge, since writing a range of parsers for the important languages is a big job.
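Here's a rough Python sketch of what that pre-knowledge looks like in a parser. It is not my actual parser: it uses the standard library's HTMLParser, knows only a tiny subset of hCard (root class 'vcard', properties 'fn', 'org' and 'url'), assumes well-nested markup, and the 'hwidget' format in the example is invented.

```python
from html.parser import HTMLParser

# Out-of-band knowledge: somebody had to read each spec and type this in.
KNOWN_FORMATS = {"vcard": {"fn", "org", "url"}}

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag
ROOT = object()  # stack marker for the element that opened the current item


class MicroformatParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []    # one dict per recognised root element
        self._item = None  # the item currently being filled, if any
        self._stack = []   # one marker per open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        marker = None
        if self._item is None:
            root = next((c for c in sorted(classes) if c in KNOWN_FORMATS), None)
            if root is not None:
                self._item = {"format": root, "properties": {}}
                marker = ROOT
        else:
            known = KNOWN_FORMATS[self._item["format"]]
            prop = next((c for c in sorted(classes) if c in known), None)
            if prop is not None:
                self._item["properties"][prop] = ""
                marker = prop
        self._stack.append(marker)

    def handle_data(self, data):
        # Text belongs to the innermost open property inside the current item.
        for marker in reversed(self._stack):
            if marker is ROOT:
                return
            if marker is not None:
                self._item["properties"][marker] += data
                return

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        if self._stack.pop() is ROOT:
            self.items.append(self._item)
            self._item = None


def parse_microformats(html):
    parser = MicroformatParser()
    parser.feed(html)
    parser.close()
    return parser.items


page = (
    '<div class="vcard"><span class="fn">Ada Lovelace</span>, '
    '<span class="org">Analytical Engines</span></div>'
    '<div class="hwidget"><span class="colour">blue</span></div>'
)
items = parse_microformats(page)
```

The hCard comes out as structured properties, but the 'hwidget' block is silently ignored: nothing in the markup says which class names are roots, which are properties, or how they nest, so supporting it means editing KNOWN_FORMATS in every parser.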