Apollo and the Web

May 17 2007

Patrick Logan seems convinced that the having a proprietary runtime doesn't matter because apollo apps can still interact with data on the web.

This doesn't sound like a great deal to me. For example if I'm unable to run an app because the vendor doesn't support my OS or device, then being able to fetch blobs of app-specific data from URLs is of little comfort to me.

(or maybe I'm missing something?)

On how search engines work

May 17 2007

This link showed up on programming-reddit today. I've been reading Tim Bray's ongoing blog for quite a while, but this piece from 2003 predates my discovery of it. And that's a big shame because it is quite simply the most brilliantly readable overview to the whole topic of 'large scale search' that I've ever come across. Nuff said.

Check it out people!

London Haskell User Group

May 02 2007

Neil Bartlett (who used to work at Dresdner Kleinwort, my employer) dropped me a mail to mention that he's started a London Haskell User Group, and that the inaugural meeting on ~~29th~~ 23rd May will include a talk by none other than Simon Peyton Jones of GHC and Software Transactional Memory fame!

Definitely one for all those interested in functional programming and concurrency in general (and these days what forward-looking software engineer isn't?). More details can be found on the LondonHug page.

Some ideas for static triple indexing

Apr 30 2007

I wrote a bit about representing structured data in the last post. Here's some ideas for how I plan to index the data.

Indexing graphs as subject ranges

In indexing triples I need to provide indexed lookups for all 6 of the possible triple query patterns:

s->po sp->o p->os po->s o->sp os->p

(s=subject p=property/predicate o=object)

Most mature triplestores also index a 4th query element 'graph' or 'context'. I intend to support this query type without expanding the index by using a trick: In my triples format the fact that the subjects are auto-generated and local to the graph means I can choose them to be sequential and effectively re-use them as graph indexes - e.g. subjects between 193 and 11255 belong to graph 2 etc.. So for example the 'os->p' indexing can also support a 'og->sp' query pattern by restricting the range of subject matches to only those in the appropriate range.

Cache locality and triple indexes

I mentioned in my last post that I indended to use memory mapped sorted arrays for indexing. Nick Johnson left a comment on my last post (thanks Nick!) alerting me to the better cache locality properties of n-ary trees (where n is the number of elements that fit in a disk block) rather than using binary searching flat arrays.

This is an order of magnitude improvement. For example, for a 1 million element array of 32bit values you can do each cold search in just 3 page faults ( log1024(1M)=3, assuming block size of 4096 - i.e. 1024 elements per block). A binary search on a cold 1M element flat array would yield something more like ~22 faults (I'm assuming the last 10 lookups would happen in the same block). This prompted me to do some reading on cache coherency.

As it turns out (unless I'm missing something) the actual approach I had in mind should be pretty optimal cache locality wise. The plan is to exploit the even randomness of the hashes to reduce the searching overhead, hopefully amortizing to constant lookup (and fault) times.

To illustrate this, consider the o->ps and op->s lookup patterns. I plan to index these through a 3 level hierarchy of sorted arrays o->p->s: I.e. a sorted array of objects, each which points to a child array of predicates, each of which points to subjects.


[o1 o2 o3 o4 o5 o6 o7 ...] (object array)
          /          
         /           
   ...][p1 p2 p3 p4][..    (o4 predicate array)
          /
         /
  ...][s1 s2 s3 s4][....     (o4 p2 subject array)

The trick here is that each array is a sorted array of unique hashes, which because of the randomness of the hash should be evenly spread over the search space.

That means that an object with hash 'h' should be approximately in the position:

(h / hash-range) * numelements.

E.g. if the hash range is unsigned 32bit (0-4294967295), a 2147483648 value is in the middle of the array. The search would try this position first, and then use linear probing to locate the value. I'm hoping that this will result in 3 page faults to locate the first matching triple regardless of the size of the data. Because it doesn't implicitly rely on any block size it should also respond well to L1/L2 caching (unless I'm missing something!).

Hash lengths

As mentioned in the previous post I'm planning on internally identifying each symbol with a 64bit hash, along the same lines as 3store. However I'm currently thinking that I'll only use the first 32 bits in the top 2 lookup indexes. This will make the indexes denser which I think should help with L1/L2 cache locality when probing for a match. Of course the tradeoff is that there will be a lot of duplicate hashes - to account for these I'll put the latter 32bits in the 3rd level data arrays so that they can be filtered before joining.

N.B. I have no empirical performance data to back up any of these ideas, so this is all speculation at the moment (and likely to change as I gain in experience). I'd appreciate it if anybody can see where I've overlooked something or has any more optimal ideas for storing static triple data.

Indexing structured data (again)

Apr 29 2007

Now that I'm up and running and starting to get productive on Gambit-C, I've turned my attention back to indexing structured data.

I've modified the tagtriples idea a bit to reflect my experience on importing data in other formats. I still think the most effective approach is not to try and define an interchange format, but instead to create an indexing scheme that can index data from lots of different structured data formats/models.

So the end goal is a system that can easily aggregate and index static structured data from a variety of systems and formats, and provide relational joining across graphs/datasets. Oh, and it's got to scale.

The data model I'm working with is similar to my tagtriples format except each subject is now system generated, local to the a particular graph/dataset, and opaque (i.e not exposed to the user). RDF officiandos will notice this is similiar to BNodes - think of it as RDF triples where the subject can only be a BNode and properties and objects are values.


#24221 name "Phil Dawes"
#24221 age 31
#24221 likes cheese

I've adopted this approach mainly to ease translation from other formats into this one. Automating the creation of good human-readable identifiers for subjects is often impossible when translating data from other formats that don't contain a natural subject identifier.

Having opaque subjects also means that the concept of identity internal to the graph is specific, but joins across graphs must be achieved by matching properties and objects. This coincides with my ideas about global identity - i.e. that a universal identity scheme is very difficult (/impossible) to implement and coordinate usefully on a large scale. I think the best way of handling identity on a large scale is through a combination of shared context and description using terms from already understood local vocabularies (e.g. english, email addresses, URLs, post codes etc..).

What this means in practice is that the person/agent querying the data must know something of the terms and structures used in the datasets to be able to join them. It also means that they must include additional clauses in the query to provide enough disambiguation to uniquely identifiy terms in the dataset (in the absence of URIs). E.g. I want the "Phil Dawes" with an email address 'phil@example.com'.

In practice this disambiguation tends to happen anyway even with RDF data (e.g. witness smushing of FOAF data). This is simply because coordinating artificial identifiers on a large scale is so difficult that people naturally revert to matching commonly understood terms from pre-existing vocabularies. (hence my previous posts about URIs being a bad tradeoff and not providing much value for the problems they create).

Anyway, from the indexing point of view I've ditched a relational database as a backend and am now working with memory mapped files. This provides more flexibility for indexing strategies. I'm going to begin by using sorted arrays for triple indexes as I'm working with relatively static data and this sounds like a good tradeoff of space to search speed. More about ideas on indexing in another post.

From the point of view of internal symbol identifiers, I'm leaning towards using hashes internally to represent symbols rather than incrementally allocated ids. This is similar to the approach adopted by 3store(PDF).

N.B. Personally I don't think it's a good tradeoff for 3store. 3store isn't (wasn't?) a federated store and so coordinating incremental allocated IDs is easy and would halve the index space (32bits for an ID vs 64 bits for a hash). Index size vs memory is the biggest performance bottleneck in my experience.

For a distributed parallel database however, hashes pay dividends. You don't need to worry about allocating unique IDs between parallel processes, and more importantly you don't need to coordinate IDs between nodes for data joining.

Anyway, I've got some ideas about distributed indexing so hopefully I'll get a chance to write about this soon.

Gambit-C namespaces

Apr 21 2007

One of the first issues I had when evaluating scheme as a possible replacement for Python as my hobby language was its apparent lack of module/library/namespace system. How do people possibly build big programs? I wondered.

Now it turns out most (all?) scheme implementations have features to deal with modules and libraries. Gambit's is particularly nebulous in that it doesn't appear to be documented anywhere. Anyway, here's how it appears to work. I'm sure somebody will correct me if I've got something wrong:

Gambit has the 'namespace' primitive, with which you can declare that certain definitions belong in certain namespaces. Here's an example:


> (namespace ("f#" foo) ("b#" bar baz))

This means (AFAICS): "any further use/definition of the term 'foo' will reference the f# namespace and any use of bah/baz will reference the b# namespace".

e.g.


> (namespace ("f#" foo) ("b#" bar baz))

> (define (foo) "I am foo")  ; defines f#foo

> (foo)
"I am foo"

> (f#foo)
"I am foo"

> (define (bar) "I am bar")

> (b#bar)
"I am bar"

This is cool because it allows you to retroactively fit namespaces to scheme code loaded from other files. E.g. If mod1.scm and mod2.scm both defined a procedure 'foo', you could use namespaces to allow both to be used in the same environment thus:


> (namespace ("m1#" foo))
> (load "mod1")    ; contains: (define (foo) "I am mod1's foo")

> (namespace ("m2#" foo))
> (load "mod2")    ; contains: (define (foo) "I am mod2's foo")

> (m1#foo)
"I am mod1's foo"

> (m2#foo)
"I am mod2's foo"

Job done. Now I haven't really used gambit namespaces much, so I not in a position to provide a good comparison with other approaches, however the feature does seem in keeping with the rest of the scheme language. By that I mean rather than a large set of fully blown rich language features you get a small bunch of simple but very extensible primitives with which to build your own language.

An good example of building a big system over these small primitives is Christian Jaeger's chjmodule library where he has used namespaces along with 'load' and 'compile-file' (and of course macros) to build a more industrial strength module system. This includes an 'import' keyword that loads from a search path and a procedure to recursively compile and import modules. Some example code from the README:



$ gsc -i -e '(load "chjmodule")' -

> (import 'srfi-1)
> fold
#
> (fold + 0 '(1 2 3))
6
> (build-recursively/import 'regex-case)
            ; recompiles regex.scm (a dependency) if necessary,
            ; then (re)compiles regex-case.scm if necessary and imports it.
> (regex-case "http://www.xxx.yy" ("http://(.+)" (_ url) url) (else url))
"www.xxx.yy"
> *module-search-path* 
("."
 "mod"
 "gambit"
 "~/gambit"
 "~~"
 "..")

Sweet. I'm guessing it'll also be possible to build the r6rs library syntax in gambit scheme the same way.

Some hardcore Gambit-C features

Apr 21 2007

Somebody asked me about gambit-c the other day, and why I was using that as opposed to some other language or runtime for my own-time coding stuff. Despite the scheme language being all cool, the thing that really made his eyes light up was the C features in gambit (it is called gambit-c for a reason). Here's some cool stuff you can do with gambit:

Scheme compiles to native machine code
You can embed C code directly into gambit scheme files.

(c-include)

(c-lambda)

You can compile and load C into a running scheme process


(define (cload f)
  (compile-file f)
  (load f))

You can compile the whole thing into a native binary.

(*) N.B. although uncommon outside the lisp world, these compilation features are actually pretty common in lisp/scheme implementations. E.g. I think chicken, bigloo and SBCL provide simililar things.

Getting the hang of the lisp style

Mar 29 2007

Finally, it's starting to feel like I'm getting the hang of the lisp style. The key for me seems to be in re-reading the contents of srfi-1 every time I think I need a loop.

lisp
scheme

SQL Injection Cheat Sheet

Mar 17 2007

I found this SQL Injection Cheat-Sheet today (thanks reddit!). A must-read for anybody writing (or hacking into) web applications with a database.

Hashing Searching Sketching

Mar 08 2007

For the computer scientists in the room:

Google techtalks rock, and this one about recent advances in retrieval techniques was especially interesting to me.

In particular it includes a remarkable recent observation for the improvement of hash tables: Hash table loading and performance can be dramatically improved simply by using more than one hash algorithm in the lookup.

I.e.

instead of doing one hash lookup, you hash the item with two functions (to get two results) and look in both buckets - one of them will contain the element. (i.e. constant lookup time)
On insert you hash twice and put the element in the bucket with the smallest number of elements.
A further optimisation: moving elements to their pairs to optimise loading allows you to get a constant maximum bucket length of 2. That means you don't even need to use a linked list for each bucket, you can preallocate a fixed array of (2 * number of nodes) - i.e. 2 elements for each bucket. With this scheme you can store ~1.7n values in n buckets with very low probablility of overflow.
Finally, simply keep a seperate (very small) overflow buffer for the rare cases where there's overflow.

The techtalk is definitely worth a watch - I found the speaker entertaining and easy to follow.

Actually, having spent a bit of time watching videos recently I've got to wonder if the video presentation is the future of university-level computer science education. I get so much more out of watching technical videos and then looking stuff up in wikipedia than I ever did out of going to undergrad lectures.