BTriples - a model for aggregating structured data

Things have settled down a bit after the birth of baby #2 and I'm starting to get a bit of time to program again: about an hour a night. That means I'm thinking a lot about indexing structured data again.

Here are my most up-to-date thoughts on a model for representing aggregated structured data which I'm tentatively calling 'BTriples'. I'm writing this down mainly so I can refer to it in future writing.

The purpose of BTriples is to serve as an internal model for an OLAP database, so that it can represent structured data from a variety of popular formats (JSON, XML, CSV, relational) and can index and query across heterogeneous data sources.

A good candidate for such a model would appear to be RDF, but it falls short on a couple of counts for my requirements:

  • The first issue is that in order to represent vanilla data as RDF there's a certain amount of manual mapping that needs to be done. You need to come up with a URI scheme for your imported data, and you then need to do some schema and ontology work so that the data can be semantically joined with other RDF data. This manual import overhead removes the ability to do one-click database imports, which is something I'd like to achieve with my database tool.

  • The second issue is that the RDF model has strict semantic constraints that are difficult to manage over a large set of disconnected parties. Specifically the RDF model says that "URI references have the same meaning whenever they occur". This 'same meaning' is difficult to enforce without central control and makes RDF brittle in the face of merging data from globally disconnected teams.

TagTriples was my first attempt at creating a simplified RDF-like model, but it suffers from the problem that it can't represent anonymous nodes. This makes importing tree structures like XML or JSON a tricky exercise, as you need to have some way to generate branch node labels from data that has none. When I was designing TagTriples I was also thinking in terms of an interchange format (like RDF). I no longer think creating an interchange format is important - the world already has plenty of these.

BTriples is basically my attempt at fixing the problems with TagTriples. The model is triple-based like RDF, so I borrow a number of terms from the RDF model.

BTriples Specification

The BTriples universe consists of a set of distinct graphs (think: documents). Each graph consists of an ordered list of statements. A statement is intended to convey some information about a subject. Each statement has three parts: a subject, a predicate (or property) and an object.

  • A subject's identity is anonymous and local to the graph. This means you can't refer to it from outside the graph. (This is similar to a 'blank node' in RDF.)
  • A predicate is a literal symbol (e.g. a string or a number).
  • An object is either a literal symbol or an internal reference to a subject in the same graph.

Example (logical) statements:

  // row data
#1 name "Phil Dawes"
#1 "hair colour" Brown
#1 plays "French Horn"

  // array
#2 elem "Item 1"
#2 elem "Item 2"
#2 elem "Item 3"
#2 elem "Item 4"

  // tree
#3 type feed
#3 entry #4
#4 title "BTriples - a model for aggregating structured data"
#4 content "blah blah ..RDF... blah"

That's it.
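
To make the model concrete, here's a minimal sketch of it in Python. This is illustrative only - the Graph class and its method names are my own invention, not part of any specification.

# Minimal sketch of the BTriples model (illustrative only).

class Graph:
    # A BTriples graph: an ordered list of (subject, predicate, object)
    # statements. Subjects are anonymous ids, local to this graph.

    def __init__(self):
        self.statements = []   # insertion order is preserved
        self._next_id = 0

    def new_subject(self):
        # Mint a fresh anonymous subject, local to this graph.
        self._next_id += 1
        return self._next_id

    def add(self, subject, predicate, obj):
        # predicate is a literal symbol; obj is a literal symbol or
        # another subject id within this graph.
        self.statements.append((subject, predicate, obj))

# The 'tree' example from above:
g = Graph()
feed = g.new_subject()
entry = g.new_subject()
g.add(feed, "type", "feed")
g.add(feed, "entry", entry)    # internal reference to another subject
g.add(entry, "title", "BTriples - a model for aggregating structured data")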

Notes:

  • BTriples is not an interchange format. I have deliberately not defined a serialization of BTriples.

  • BTriples graphs are disconnected: BTriples does not define a method for them to refer to each other.

  • Perhaps the biggest departure from RDF is that there are no formal semantics in BTriples. The BTriples model cannot tell you whether a subject in one graph denotes the same thing as a subject in another.

  • Also, the semantic meaning of symbols is not defined by BTriples; it is up to the user to decide. Two identical symbols do not necessarily 'mean' the same thing.

  • The statements in a BTriples graph are *ordered*, so you can get data out in the same order it went in.

  • I'm not crazy about the BTriples name. Maybe I'll change it.

Searching arrays in X86 assembler with a bloom filter pt 3

Continuing on from yesterday evening, I had a bit of time tonight in front of the telly, so I implemented the rest of the fast-search functionality in Factor using the assembler code from the previous post.

The first step was to create the simple bloom filter to send to the assembler code. To keep the assembler fast and simple I'm just setting one bit for each element in the input sequence. I also hardcoded the size of the filter for simplicity (more on that later).

Factor has a bit-array type that backs onto a block of memory so writing 'prepare-filter' was easy:


: set-bit ( n filter -- ) t -rot set-nth ; inline

: (prepare-filter) ( filter seq -- )
  [ [ 1048576 mod ] dip set-bit ] curry each ; 

: prepare-filter ( seq -- filter )
  1048576 <bit-array> [ (prepare-filter) ] keep ;

I also needed some code to filter out false positives found by the assembler filter. To do this I pre-create a hash from the input set and then pass each result to a 'collect-element-if-correct' word:


: seq>hash ( seq -- hash ) [ dup ] H{ } map>assoc ; inline

: check-elt ( ptr i -- v/f )
  alien-unsigned-cell itemshash get at ;

: collect-element-if-correct ( alien i -- )
  tuck check-elt [ 2array , ] [ drop ] if* ;

Incidentally, somebody left a comment on my last post saying that the BT (bit-test) instruction was a performance hog when testing directly against memory, so I modified the code to load the relevant bits into a register and BT against that instead. He was right - on the simple test I got a speedup from 170ms to around 105ms! I replaced the single BT instruction from the last post with the following:


  tmpreg "val" operand MOV
  tmpreg 5 SHR                    ! tmp contains dword offset into filter
  tmpreg 2 SHL                    ! clear bottom 2 bits and make it a byte offset
  tmpreg "filter" operand tmpreg [+] MOV   ! tmp contains filter dword
  "val" operand tmpreg BT

which disassembled with gas looks like this:


mov    %edx,%edi
shr    $0x5,%edi
shl    $0x2,%edi
mov    (%eax,%edi,1),%edi
bt     %edx,%edi

I'd actually intended to use bit shifts to eliminate the BT instruction altogether, but it seems x86 bit shifts can only take an immediate operand (or the CL register) as the shift count, so I can't easily shift dynamically by the contents of an arbitrary register. There's probably another way - are there any x86 assembler experts out there with hints?

The only other performance-affecting variable was the size of the bit array used as the filter. This was where I was expecting to come unstuck: I sometimes have thousands of elements in the input set, and so would need a pretty big bit array to keep the number of false positives low enough to get good performance. On the other hand, I expected a large bit array to hammer the CPU caches (as access is pretty random) and kill performance.

As it happens, the performance with a 1Kbit filter seems to be about the same as with one 4Mbits wide (i.e. half a megabyte). I'm not sure I fully understand why this is, as I'm pretty sure the L1 data cache on a Core 2 Duo is 32KB. Probably all the action takes place in the L2 cache.

Anyway, I'm getting ~107ms to check an entire 48MB array (12M elements) for matches against a set of 2836 elements, using a 1Mbit filter, on a CPU pegged to 800MHz. At 1800MHz it takes about 48ms. For my requirements that pretty much validates the idea and the approach.

Searching arrays in X86 assembler with a bloom filter

I'm currently writing what amounts to an OLAP database in Factor in my spare time, which is a pretty interesting project. Spare time is limited now that I have a child, but Factor is affording me pretty good productivity compared to Python and Scheme, so it probably nets out about even with what I was doing a couple of years ago. I should probably write a bit more about that some time.

Anyway, I recently made the bizarro discovery that in the world of Factor, for ease of deployment it makes more sense to write low-level code in assembler rather than C. This is because Factor effectively has a built-in macro-assembler DSL for each of the chipsets it supports, whereas shipping C code requires having a C compiler set up on the target machine, which is a pain on Windows platforms. Also, x86 is fast becoming the only server processor that matters. Crazy world.

Ideally I'd not have to write any low-level code at all and I'd do it all in Factor. To be honest Factor is pretty bloody fast, and could conceivably compete well with raw assembler in the not-too-distant future given enough tweaking of the code generator. For the time being, though, C and assembler rule the performance roost in this environment.

Ok, so the interesting problem I have is:

I have a set of 32-bit integers, and I want to search a large unsorted 32-bit array, accumulating the array indices of values matching those in the set. To solve this in Factor I've been putting the input set in a hashtable and then sequentially looking up each element of the array. Algorithmically that's pretty much optimal at O(n+m), but the ever-present constant factor is high: it takes about 4 seconds for a 48MB array on a processor pegged at 800MHz.
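
For reference, here's the baseline approach sketched in Python (the names are made up; the real code is Factor):

def find_matches_baseline(big_array, wanted):
    # Put the input set in a hash set and test every element of the
    # large array against it: O(n + m), but with a high constant factor.
    wanted_set = set(wanted)
    return [i for i, v in enumerate(big_array) if v in wanted_set]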

Ideally I want this down to under 100ms. That's going to be very tight: a 48MB file contains ~12M elements, so aiming for 100ms means a budget of about 6 cycles per element on an 800MHz chip. Probably a bit too ambitious.

I'm thinking I'll create a bloom filter with a simple hash function (e.g. 'mod'), iterate over the large array in assembler checking the filter for each element, and then escape back to Factor to confirm and accumulate the matches.
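
Here's a sketch of that plan in Python. This is illustrative only - the names and the filter size are placeholders I've picked for the example, and the scanning loop is the part that would actually be written in assembler.

FILTER_BITS = 1 << 20   # placeholder filter size; not decided yet

def build_filter(wanted):
    # One bit per input element, using mod as the cheap hash.
    bits = bytearray(FILTER_BITS // 8)
    for v in wanted:
        h = v % FILTER_BITS
        bits[h >> 3] |= 1 << (h & 7)
    return bits

def find_matches(big_array, wanted):
    # Scan the array against the filter (this loop is the assembler part),
    # then confirm candidates against a hash set to reject the bloom
    # filter's false positives.
    bits = build_filter(wanted)
    wanted_set = set(wanted)
    matches = []
    for i, v in enumerate(big_array):
        h = v % FILTER_BITS
        if bits[h >> 3] & (1 << (h & 7)):   # cheap filter test
            if v in wanted_set:             # confirm: drop false positives
                matches.append(i)
    return matches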

Incidentally, I've noticed that my motivation for writing about my programming ideas drops after I've implemented them, so this time I thought I'd experiment with using my blog to think through the idea before I attempt it. Can anybody see a better way to solve this problem given the constraints? I'll post again with the results...

Namespacing & Context - ramifications for the semantic web

In determining the meaning of tokens used in communication, there are two widely used approaches to disambiguation, which I'll characterise as 'namespacing' and 'context'.

When humans communicate amongst themselves they use the context of the communication to narrow down the range of possible meanings of terms used in the exchange, and human language doesn't employ namespaces at all. On the other hand computer identifier schemes typically use namespaces to prevent term clash, and don't use context at all.

The mechanisms operate differently:

Namespaces:

  • Every use of a namespaced term refers to the same concept (or at least, if it doesn't, it's considered an error).
  • Deterministic

Context:

  • The concept denoted by the term depends on the context in which it is used
  • Statistical (is that the right word?)

My thinking is that namespaces work well in a closed environment because coordination overhead is low and deterministic programs are easy to write. Namespaced schemes do, however, require a management mechanism to ensure that each use of the same term denotes exactly the same thing. This works well if the terms are grounded in the system - e.g. on the web a URL is used to fetch a document, and thus its use as an identifier for that document is grounded.

However, the semantic web is an open environment with little grounding, which means that holistic term coordination and management isn't practical. Thus web-scale semantic web systems need to employ some degree of context-based disambiguation anyway - i.e. the system can't globally merge statements without considering issues of provenance and consistency. I wrote about this issue here; at present this is usually handled manually by the person operating the RDF store or software, but as these systems grow and scale, more of it will need to be addressed by the software itself.

Note that it is important to distinguish between this and the issue of trust in the content of the communication - here I am purely talking about interpreting the meaning of the communication, specifically measuring term consistency between documents from disconnected sources.

Now, if you take this inevitable use of context at web scale as given, my question is: could the semantic web bootstrap and scale better with a system that disambiguated entirely based on context and didn't employ namespaces at all (i.e. like human language communication)?

So I've been thinking along the lines of a scheme where literals and bnodes are used in place of URIs in RDF documents. Vocabularies use literal terms in place of URIs, and combinations of terms are used to infer meaning in aggregate.
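
To make that concrete, here's a toy illustration in Python tuples (the URIs, terms and data are invented for the example): the same statements written first with URI identifiers, then with plain literal terms and a local bnode.

# The same facts, first in the usual URI style, then with literals and a
# local bnode. All identifiers below are made up for illustration.

uri_style = [
    ("http://example.org/people#phil", "http://xmlns.com/foaf/0.1/name", "Phil Dawes"),
    ("http://example.org/people#phil", "http://example.org/vocab#plays", "French Horn"),
]

literal_style = [
    ("_:b1", "name", "Phil Dawes"),
    ("_:b1", "plays", "French Horn"),
]

# With the literal style, whether "name" here means the same as "name" in
# another document is left for the consuming software to judge from context.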

Non-determinism issues aside, this approach has a central advantage: it removes the coordination and bootstrap overhead associated with the use of namespaced identifiers, and in particular the issues peculiar to URIs:

  • artificial namespaces mean there's little term match serendipity between disconnected uncoordinated clients
  • pre-existing identifier schemes are commonly not valid URIs, making reuse difficult
  • URIs introduce unnecessary term ownership, authority and lifecycle issues
  • other URI-specific issues add to cognitive overhead: hash vs slash, and whether a URI denotes a document or the thing it describes

One particular advantage of the literals-in-combination approach is that data can be lifted from existing sources without having to invent URI schemes and translate identifiers into them. Currently, translating data into traditional RDF involves two challenges:

  • converting the structure of the data into a triple graph
  • translating the identifiers into a URI scheme

Whereas the former is a one-shot deal for each data format, the latter frequently requires manual input for each document and is IMO the single biggest hurdle to putting data onto the semantic web.

Of course, the downside of the approach is that software consuming the data needs to take a non-deterministic approach to term meaning. There is no globally correct answer to 'does this term in this document mean the same as that one?' - instead it is a function both of the context in which the documents were written and of the requirements of the querying client. Unfortunately, I suspect that as people try to get traditional W3C semweb technologies to scale up in web-scale environments, they're going to find themselves in the same non-deterministic boat.

I'm experimenting with a literals and bnodes approach in my own software and will post updates to my blog.

How realistic is using OWL for semweb data integration?

I like listening to the Talis podcasts because they motivate me to think about semantic web issues. Unfortunately I usually spend the entire session muttering to myself because I disagree with so much of what is said.

The issue for me is that speakers often paint a rosy view of a merged data world where 'if only' people would adopt RDF and share their ontologies, systems would be able to communicate and share data. OWL is commonly painted in a broad-brushed way as the mechanism that would then enable semantic web interoperability. I have my doubts - deterministic ontologies get complicated and brittle very quickly.

Here's a test:

If Atom were an RDF format (i.e. the same data structure, just in RDF), could OWL realistically be used to allow an RSS 1.0 app to interpret Atom data?

URIs are syntactically universal, not semantically universal

Thanks to all who commented on my previous post; it made me rethink and clarify my position on the problems with scaling Semantic Web technologies. I've boiled it down to this:

Semantic Web clients beware: URIs are syntactically universal, not semantically universal.

The rationale for this is that although it is practically impossible for two disconnected parties to 'mint' the same syntactic URI, it is much more likely that two connected parties will use the same URI to refer to two similar but differing concepts.

The conceptual difference may just be missing qualification: e.g. one document referring to somebody in 1995 and another in 2001. Or it could be a genuine misunderstanding: for example, our Unix server team includes the location of a server as part of its identity - when a server is moved to a different datacentre it effectively becomes a completely different server. Other teams, assuming they understand the identity scheme, get a surprise when the linked Unix server data disappears.

Now this isn't a problem with URIs per se: any other universal identifier scheme would exhibit the same trait. It is however a problem for Semantic Web software which commonly treats URIs as semantically universal without qualification. The bottom line is that before merging RDF graphs one must first manually compare the contexts of each source document to confirm that the URIs in them are mutually compatible.

W3C Semantic Web = Global Ontology after all?

I only just read Jim Hendler's piece from last month "shirkying my responsibility", in which he states that the W3C Semantic Web vision was never about a global shared ontology at all:

"Get it - we are opposing the idea of everyone sharing common concepts."

This seems odd to me, because if that is the case and all communication on the semantic web is local then why is the basic system of identity the URI, a global identifier scheme?

On the contrary, I suspect that the W3C Semantic Web is predicated on global agreement: that all RDF documents containing a URI should use it to identify the same concept, otherwise the whole RDF inference stack breaks. A global ontology that's defined in lots of inter-connected pieces scattered around the web is still a global ontology.

Indexing dilemma - in memory or on disk?

I'm trying to work out if it's possible to get the index searching performance I want using a disk-backed store, or whether I need to focus on optimising the indexes to fit totally in memory. The problem is that the optimisation strategies are somewhat different:

Storing indexes on disk: increase redundancy, trading space for better locality of reference.

Storing indexes in memory: reduce redundancy, trading locality of reference for fitting more data in memory.

Of course, locality of reference matters for caching in memory as well as on disk, but there it's more like a factor of ~10 (i.e. ~10 times slower to fetch a word from main memory than from L1 cache), whereas a disk seek compared to a sequential read is more like a factor of ~100,000.

Typical server machines seem to have on the order of ~100 times more disk than memory, so whatever happens, the redundancy increase needs to be less than a factor of 100 or there's little point.

Sequential disk reads are ~10 times slower than random-access memory reads (cache misses) and ~100 times slower than sequential (L1) cache reads, so even with no seeks, disk is still going to be 10-100 times slower. But the payoff is potentially a lot more indexed data per server node.


N.B. These numbers are very approximate orders of magnitude. Assumptions are from the links below, along with my own measurements using hdparm, bonnie++, seeker.c and some of my own code. Please let me know if I'm wildly out with any of these!

http://www.linuxinsight.com/howfastisyourdisk.html
http://homepage.virgin.net/roy.longbottom/randmem%20results.htm

Indexes, Hashes & Compression

The new triplestore is coming along. It can do substring text searches (using a suffix array) and has a basic relational query engine. It doesn't optimise the query plans yet, but if you enter the queries in a good order (most selective clauses first) then you get good performance.

A few things have changed in my thinking since my last post about indexing. Although using hashes to represent tokens is really useful when joining datasets from different nodes in a cluster (no coordination overhead), I'm now thinking that they're not such a good idea when laying out the actual triple indexes in memory (or on disk).

My reasoning is:

In order to get the performance I want (100-row results from relational queries in about half a second), I'm either going to have to keep the entire set of indexes in memory, or at the very least minimise disk activity to a small number of sequential reads. Disk seeks are on the order of ~10ms, so 50 of them and I'm shot. If I end up aiming for the all-in-memory approach, then I want to cram as many triples into memory as possible. If I do use disk reads, then locality of reference will be fundamental to the layout of the indexes.

Either way, I'm going to need to use compression on the indexes to achieve optimal storage or read efficiency. The problem with hashes is that they introduce a lot of randomness into the mix, which reduces the scope for delta compression (and then run-length encoding of the deltas). I suspect that controlling the allocation of identifiers could also be very useful in optimising locality of reference. All this is theoretical at the moment as I haven't actually implemented any index compression, but I hope to do so soon.
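
Here's a toy sketch in Python of what I mean (illustrative only): delta-encoding a sorted list of densely allocated ids gives lots of small, repeated gaps that run-length encode well, whereas hash-like ids produce large, essentially random deltas that don't.

def delta_encode(sorted_ids):
    # Replace each id with the gap from its predecessor.
    prev = 0
    deltas = []
    for x in sorted_ids:
        deltas.append(x - prev)
        prev = x
    return deltas

def run_length_encode(values):
    # Collapse runs of identical values into [value, count] pairs.
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# Densely allocated ids: deltas are mostly 1 and collapse nicely.
dense = delta_encode([100, 101, 102, 103, 107, 108])
print(run_length_encode(dense))   # [[100, 1], [1, 3], [4, 1], [1, 1]]

# Hash-like ids: deltas are large and irregular, so RLE gains nothing.
hashed = delta_encode(sorted([912831, 33217, 551200, 774411]))
print(run_length_encode(hashed))  # four distinct large deltas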

Apollo and the Web

Patrick Logan seems convinced that having a proprietary runtime doesn't matter because Apollo apps can still interact with data on the web.

This doesn't sound like a great deal to me. For example if I'm unable to run an app because the vendor doesn't support my OS or device, then being able to fetch blobs of app-specific data from URLs is of little comfort to me.

(or maybe I'm missing something?)