I was having a discussion with somebody at a conference recently about merging data on a large scale, and how the differing 'contexts' under which the data is created make global merging difficult. He asked me to define what I meant by the 'context' of the data. It's a woolly term I'd been using for a while without ever actually nailing down what I meant. Here's my best stab:

Context of the data: the set of implicit assumptions shared between users of the dataset.

To illustrate this, consider the act of designing a database for an application:

In order to be successful, the database must model (i.e. be consistent with) a portion of the real-life domain, well enough to be useful in fulfilling the requirements of the application. The trick of the data modeller is to create a database that's able to do this whilst being simple enough to manage within the constraints of the project. (When you consider that exactly modelling the real-world domain means essentially creating a detail-complete clone of it, it's obvious that you need to simplify to some degree.)

The act of making the database 'simple' effectively factors out bits of the domain which don't vary within the scope of the application. These factored-out bits are in effect the context: the implicit assumptions made and shared by the users of that data model (usually the application code) but missing from the explicit data model itself.
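To make that concrete, here's a tiny, invented example (the applications, field names and values are hypothetical, not drawn from any real system): two applications record the same kind of order, and each one's data is 'simple' precisely because of what it leaves implicit.

```python
# Hypothetical example: two order records from two different applications.
# Each schema is "simple" because it omits whatever doesn't vary within
# that application -- and what's omitted is the context.

# App A sells only in the UK, so currency is never stored: it's an
# implicit assumption shared by the application code and its users.
order_from_app_a = {"order_id": 101, "amount": 25.00}  # implicitly GBP, VAT included

# App B trades internationally, so currency is explicit -- but its
# amounts are net of tax, another assumption that lives only in the
# application code, not in the data itself.
order_from_app_b = {"order_id": "B-77", "amount": 25.00, "currency": "USD"}  # implicitly pre-tax

# Both rows look merge-compatible; nothing in the data itself tells you
# that the two 25.00s mean different things.
```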

This is obviously a tradeoff - you trade explicit completeness for simplicity. Picking the right abstractions and simplifying assumptions can pay big dividends when creating and maintaining the data, and when developing applications that use it. That motivation means the database modeller has the application in mind when choosing those abstractions. The application of the data drives the data model.

So now to get to the point:

Merging data is hard because:

  • Each dataset is built on a different set of simplifying assumptions and abstractions.
  • The degree to which those assumptions and abstractions are consistent with each other depends on the application you are trying to fulfill with the merged data.

The second point is not immediately obvious: because of the first point, you can't just merge all the data (regardless of transformations) - it's not globally consistent. This means you have to choose which data elements to merge. Moreover, this choice and the transformations you use depend on the context of the application you are trying to fulfill, and differ between applications.
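Continuing the invented order example from above, here's a minimal sketch of that dependence (the exchange rate, tax rate and target applications are all made up for illustration): merging means re-introducing the context each source factored out, and the 'right' transformation follows the consuming application.

```python
# Invented rates, purely for illustration.
GBP_PER_USD = 0.79
UK_VAT_RATE = 0.20

def merge_for_revenue_report(a_order, b_order):
    """A hypothetical finance app wants gross amounts in GBP."""
    return [
        {"order_id": a_order["order_id"], "amount_gbp": a_order["amount"]},
        {"order_id": b_order["order_id"],
         "amount_gbp": b_order["amount"] * GBP_PER_USD * (1 + UK_VAT_RATE)},
    ]

def merge_for_fraud_detection(a_order, b_order):
    """A hypothetical fraud app only cares about order identity and rough
    size, so it deliberately ignores tax -- a different transformation."""
    return [
        {"order_id": str(a_order["order_id"]), "approx_usd": a_order["amount"] / GBP_PER_USD},
        {"order_id": str(b_order["order_id"]), "approx_usd": b_order["amount"]},
    ]

# Same two source rows, two different merged datasets -- because the choice
# of elements and transformations is driven by the target application.
```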

Which all means that you can't expect to create one super-database for your company that is an amalgam of all the current databases, and then build applications around it. At least not without spending lots of money trying to build it, and then spending even more struggling to modify it to fit all the applications simultaneously.

A better approach (IMO) is to defer the merging activity as late as possible, and as close as possible to the application itself - that is, to have, in effect, a logical super-database per application. I think building a robust system to facilitate this requires dispensing with any notion of global consistency entirely. (But if I start writing about that I'll never get this post finished - I'll save it for another time.)
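Purely as a rough sketch of the shape of that idea (reusing the invented records and merge functions from the snippets above, and glossing over everything a robust system would actually need), a 'logical super-database per application' might amount to each application applying its own merge at read time:

```python
# Hypothetical sketch: each application owns its own merge step and applies
# it lazily at read time, giving a per-application logical view rather than
# one shared, pre-merged physical database.

def per_app_view(sources, merge_fn):
    """Build a read function that fetches the raw source rows and then
    applies this application's own merge/transformation."""
    def read():
        return merge_fn(*[fetch() for fetch in sources])
    return read

raw_sources = [lambda: order_from_app_a, lambda: order_from_app_b]

# Two applications, two different merged views over the same raw data.
revenue_view = per_app_view(raw_sources, merge_for_revenue_report)
fraud_view = per_app_view(raw_sources, merge_for_fraud_detection)

print(revenue_view())  # gross GBP amounts, for the finance app
print(fraud_view())    # rough USD sizes, for the fraud app
```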