Idea for a global interpreter lock optimized for low contention

Slava is thinking about adding a global interpreter lock to Factor, much like the Python GIL, as a step on the path to a fully multithreaded VM. This would allow Factor code to run while blocking FFI calls (e.g. C library calls) execute. As part of this, each FFI call would need to release the lock before calling into C code and acquire it again before returning to Factor.

One of the problems with adding a GIL is that it penalizes the common single-threaded case. This got me thinking about how a mutex lock could be implemented that is optimized for the low-contention case, so as to minimize the performance impact on single-threaded apps. Here's the best approach I've come up with so far:

You need:
  1. a spinlock (implemented inline via cpu primitives)
  2. a boolean to represent the GIL
  3. an int to represent contention on the GIL
  4. an OS semaphore (win32 semaphore or pthread condition)

The goal is to arrive at a situation where, if there is no contention, FFI calls just use inline assembler primitives to acquire/release the spinlock and flip the GIL; otherwise they fall back on the additional overhead of library calls to an OS semaphore.

Acquiring the Lock

The idea is that if there's no contention then acquiring the lock is just a case of obtaining the spinlock and flipping the GIL boolean.

(code is half-baked pseudocode, sorry!)

- acquire spinlock
- if GIL == false
   - GIL=true     // acquire GIL
   - release spinlock
   DONE!
- else:
   - increment contention-counter
   - release spinlock
   - goto LOOP

LOOP:                 // contention counter has been incremented by this thread
- acquire spinlock
- if GIL == false
   - GIL=true     // acquire GIL
   - decrement contention-counter
   - release spinlock
   DONE!
- else
   - release spinlock
   - wait on semaphore
   - goto LOOP

Releasing the Lock

- acquire spinlock
- release GIL
- read contention counter
- release spinlock
- if contention counter is non-zero:
     notify semaphore

This is just an idea at this stage. There was no working wifi on my train home today, so I haven't done any decent research on this, and I haven't done any empirical testing yet. Also, I'm not a multithreading expert, so if there's a glaring error in the idea or in the implementation I'd be very pleased if somebody could point it out - thanks!

Adding atomic CAS instruction support to Factor's compiler

I said in the last post that I'd write a bit about adding a new machine instruction to the Factor compiler once I'd got round to actually doing it, so here it is.

If you've been following my blog you'll know that I wanted to utilise multiple CPU cores for a personal database project. Unfortunately Factor doesn't have an OS-threaded runtime yet, so in order to work around this I modified the Factor VM code to allow multiple VMs to run simultaneously on separate threads.

I'm now writing a concurrent queue so that messages can be passed internally between the running VMs. To implement my queue I wanted some fast atomic primitives, so I set about adding a compare-and-swap (CAS) instruction to Factor's compiler.

To implement CAS on x86 you basically need the CMPXCHG and LOCK instructions, so my first job was to get these added to Factor's x86 assembler DSL. I located the machine-code opcodes in the Intel manuals and added the CMPXCHG and LOCK instructions to the x86 assembler vocabulary thus:

basis/cpu/x86/assembler/assembler.factor:

: CMPXCHG ( dst src -- ) { HEX: 0f HEX: B1 } 2-operand ;

: LOCK ( -- ) HEX: f0 , ;

With the new x86 instructions in place I was all set to add a new low-level IR 'virtual' instruction to Factor's compiler. There are basically three steps to this:

  1. Declare the new low-level IR instruction along with the number and types of registers it requires
  2. Tell the code generator where to dispatch the calls to generate the 'real' cpu machine code for the instruction
  3. Write methods that emit both X86 and PPC versions of the machine code for the instruction.

I'll explain each step below:

Step one: Declaring the new instruction

Thanks to the recent addition of a low-level instruction DSL, adding a new instruction to the compiler backend is just a case of declaring the instruction name and the argument registers it requires:

basis/compiler/cfg/instructions/instructions.factor:

INSN: ##atomic-compare-exchange
      def: dst/int-rep
      use: ptr/int-rep old/int-rep new/int-rep
      temp: temp/int-rep ;

##atomic-compare-exchange is my new virtual CAS instruction that receives 3 input arguments: a pointer to a word in memory, an expected 'old' value and a new value.

The implementation of ##atomic-compare-exchange will do the following: compare the value pointed to by the ptr register to the value in old and, if they're equal, replace it with the value in new; otherwise leave it as it is. Finally, put the value that was read from memory into the destination register dst (so the swap succeeded exactly when dst equals old). In case you haven't guessed, this is pretty much exactly what CMPXCHG does on x86.
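To pin down the intended semantics, here's the same contract sketched with C++ atomics (illustrative only - this shows what the virtual instruction should compute, it isn't Factor compiler code):

```cpp
#include <atomic>

// dst = compare_swap(ptr, old, new): if *ptr == old then *ptr = new;
// either way, return the value that was read from memory, so the swap
// succeeded exactly when the return value equals old - as with CMPXCHG/EAX.
long compare_swap(std::atomic<long>& ptr, long old_val, long new_val) {
    long expected = old_val;
    ptr.compare_exchange_strong(expected, new_val);
    return expected;  // on failure, compare_exchange_strong stores the actual value here
}
```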

At compile time Factor's register allocator allocates the real (e.g. x86) cpu registers and passes them to our code generator.

As an added complication in this particular case, the x86 CMPXCHG instruction uses the EAX/RAX register as an implicit argument, and unfortunately Factor's code generation doesn't support tying particular CPU registers to parameters yet (though Slava assures me it will soon). To work around this I'm asking the compiler to pass me an extra 'temp' register, which I can use if any of the others happens to be EAX/RAX.

With the low-level instruction declared I now want to get some interactive testing going, so I write a high-level word called 'compare-swap' and a compiler intrinsic which uses the ##atomic-compare-exchange instruction. (See the previous post for an overview of Factor compiler intrinsics.) The point of this is that we can dump out and check the low-level IR instructions emitted by the compiler. Here's the compare-swap word and compiler intrinsic:

: compare-swap ( ptr old new -- ? )
    2drop "compare-swap needs to be compiled" throw ;

: emit-compare-swap ( node -- )
    drop
    3inputs
    ^^atomic-compare-exchange
    ds-push ;

\ compare-swap [ emit-compare-swap ] "intrinsic" set-word-prop

Now we can use the 'test-mr.' debugger word to see the new instruction in action. We'll just pass in a bunch of junk arguments for now so we can see what the low-level MR instructions look like in context:

( scratchpad ) USE: compiler.cfg.debugger
( scratchpad ) [ ALIEN: 20 1 2 compare-swap ] test-mr.
=== word: ( gensym ), label: ( gensym )

_label 0 
_label 1 
##load-reference RAX ALIEN: 20 
##load-immediate RCX 8                            ! 1 << 3
##load-immediate RDX 16                           ! 2 << 3
##atomic-compare-exchange RAX RAX RCX RDX RBX     ! Oops! - RAX allocated as a register
##inc-d 1 
##replace RAX D 0 
_label 2 
##return 
_spill-area-size 0

In this example you can see that the register allocator has allocated RAX as both the destination register and one of the input registers, so our X86 implementation of ##atomic-compare-exchange will need to work around that.

Step two: Wiring up the code generator

Ok, now that we have low-level IR working, the next step is to tell the compiler how to generate the real CPU machine code for the new instruction. There's a convention that all machine-code-emitting words start with a '%', so I'm going to create a generic word %atomic-compare-exchange with method implementations for each CPU. Here's the generic word declaration:

/basis/cpu/architecture/architecture.factor:

HOOK: %atomic-compare-exchange cpu ( dst cptr old new temp -- )

N.B. Factor has a generic dispatch mechanism called 'HOOK:' which dispatches polymorphically based on the value of a variable at compile time. In this case the variable is cpu, which is set to a singleton representing the target architecture (x86.32, x86.64, ppc), so essentially this generic word is polymorphic based on CPU architecture.

Now I tell the compiler to use this generic word for code generation using the CODEGEN: DSL word (again, see Slava's post on the new compiler DSL words):

/basis/compiler/codegen/codegen.factor:

CODEGEN: ##atomic-compare-exchange %atomic-compare-exchange

Step three: Doing the x86 code generation

All that's left now is to implement %atomic-compare-exchange for our cpu architectures. Below is my method implementation for x86. To make the example more straightforward I've omitted the code that works around the implicit EAX/RAX register, which I abstracted into a 'with-protected-accumulator' combinator.

/basis/cpu/x86/x86.factor:

:: (%atomic-compare-exchange) ( dst cptr old new -- )
    accumulator-reg old MOV
    LOCK
    cptr [] new CMPXCHG
    dst accumulator-reg MOV ;

! CMPXCHG implicitly uses EAX/RAX (accumulator) so need to remove
! EAX from arguments and protect it from being stomped
M: x86 %atomic-compare-exchange ( dst cptr old new temp -- )
    [ (%atomic-compare-exchange) ] with-protected-accumulator ;

The (%atomic-compare-exchange) word contains the actual machine-code generation: you can see I simply output four lines of assembler using the Factor x86 assembler DSL and the registers passed to me by the compiler. (N.B. 'accumulator-reg' is my helper word that returns EAX or RAX depending on whether the arch is 32- or 64-bit.)

Now that the x86 implementation is written we can check the output machine code with the disassemble word (which uses either the Udis86 library or GDB under the hood to do the disassembling):

( scratchpad ) [ ALIEN: 20 1 2 compare-swap ] disassemble
00007f85690413a0: 48b8fe3ac566857f0000  mov rax, 0x7f8566c53afe
00007f85690413aa: 48b90800000000000000  mov rcx, 0x8
00007f85690413b4: 48ba1000000000000000  mov rdx, 0x10
00007f85690413be: 4889c3                mov rbx, rax
00007f85690413c1: 4889c8                mov rax, rcx
00007f85690413c4: f0480fb113            lock cmpxchg [rbx], rdx
00007f85690413c9: 4889c0                mov rax, rax
00007f85690413cc: 4983c608              add r14, 0x8
00007f85690413d0: 498906                mov [r14], rax
00007f85690413d3: c3                    ret

The disassembler output verifies that the cmpxchg instruction is being compiled correctly. You can also see that I'm doing some juggling with the rax register to manage using it as an implicit argument to cmpxchg.

Hopefully that gives a good overview of how to get new low-level instructions added to Factor's compiler, and also illustrates how machine-code generation works in Factor.

Hand-coding multi-platform assembler using Factor compiler intrinsics

Disclaimer: I'm not a Factor compiler expert and am just getting to grips with compiler intrinsics so some of this might be a bit iffy.

'Compiler Intrinsics' is a mechanism by which you can insert your low-level implementation of a subroutine into the compiler output. This is useful in a couple of scenarios:

  • if the compiler doesn't support the desired functionality - e.g. it does something hardware-specific that Factor can't express yet
  • if the subroutine is performance critical and the compiler isn't generating the most efficient code

The old way of doing compiler intrinsics in Factor was to hand-code some assembler using one of Factor's assembler DSLs (PPC or X86) and then attach it to an existing word as a word-property along with an argument type pattern. When the compiler compiled calls to the word it would compare the input parameters to the pattern and on match would insert the assembler directly into the generated code.

Since my last post about Factor's compiler over a year ago Slava has pretty much re-written the whole thing. It now has two intermediate stages:

The first, frontend stage transforms the Factor code into an intermediate representation called 'high-level IR'. This is basically a decomposition of Factor code into primitive word-calls and control nodes through various optimization passes. It's very similar to the dataflow IR in the original Factor compiler that I described in the previous blog post.

The second, backend stage is the new bit. It converts the high-level IR into low-level IR, which is basically a platform-independent assembler language. An optimization stage then runs and CPU registers are allocated, resulting in 'machine IR' (abbreviated to 'MR' in the debug tools). The real machine-code generation is then done from this MR.

The new way of doing compiler intrinsics allows you to insert low-level IR code at the beginning of the 'backend' stage. Differences to the old way include:

  • You now code using the platform independent instructions defined in compiler.cfg.instructions
  • Instructions operate on virtual registers. There are an infinite number of those
  • Subroutine arguments don't appear in registers. Instead you manually insert code to get them in and out of the data stack using ds-push, ds-pop
  • You still have to box and unbox values manually (just as before)
  • There's an optimization stage that runs after you've emitted the low level IR instructions from your compiler intrinsic

As a really simple example here's a word which is going to add 35 to the fixnum on the top of the stack and push the result. To make sure that we're executing the intrinsic assembler I'll give it a default implementation that throws an error.

: add-35 ( n -- n' ) 
    drop "shouldn't call this" throw  ;

Incidentally, here are the MR instructions generated from this default implementation:

( scratchpad ) USE: compiler.cfg.debugger
( scratchpad ) \ add-35 test-mr.
=== word: add-35, label: add-35

_label 0 
_prologue T{ stack-frame { total-size 32 } } 
_label 1 
##load-reference RAX "shouldn't call this" 
##replace RAX D 0 
_label 2 
##call M\ object throw 
_label 3 
##no-tco 
_spill-area-size 0

A couple of things to notice:

  • The instructions are prefixed with ##. E.g. ##load-reference, ##replace

  • This MR output is displayed after cpu register allocation has been done: RAX is an x86.64 register. Also D is a pseudo-register that points to the data stack. If you look at the disassembled machine code (just below the callstack juggling) you can see that D actually becomes R14:

( scratchpad ) \ add-35 disassemble
00007f6d98780ce0: 49b8e00c78986d7f0000  mov r8, 0x7f6d98780ce0 (add-35)
00007f6d98780cea: 6820000000            push dword 0x20
00007f6d98780cef: 4150                  push r8
00007f6d98780cf1: 4883ec08              sub rsp, 0x8
00007f6d98780cf5: 48b8e6866ca76d7f0000  mov rax, 0x7f6da76c86e6
00007f6d98780cff: 498906                mov [r14], rax
00007f6d98780d02: e859a385ff            call 0x7f6d97fdb060

Ok, so instead of an implementation that throws an error I want to insert my own instructions into the output. I can do this by attaching some low-level-IR emitting code to the word using the "intrinsic" word property:

: emit-add-35 ( node -- )
    drop              ! don't need to inspect the compiler node
    ds-pop            ! insert instruction to pop value off the stack
    ^^untag-fixnum    ! insert code to untag the value in the register
    35 ^^add-imm      ! insert instruction to add 35 to it (add-imm = add immediate)
    ^^tag-fixnum      ! insert code to tag the result
    ds-push ;         ! insert code to push the result onto the data stack

\ add-35 [ emit-add-35 ] "intrinsic" set-word-prop

The emit-add-35 word just pops a value off the stack, untags (unboxes) it, adds 35, then tags the result and pushes it back. A couple of points:

  • 'Hats' - The ^^ form of instructions are the same as the ## form, except that after emitting the instruction the ^^ form returns the (new) destination register so that it can be used by the next instruction.

  • 'tag/untag' - Factor aligns all its heap data to the nearest 8 byte boundary, which leaves the bottom 3 bits of each pointer free for runtime type identification (RTTI). These 3 RTTI bits are called the 'tag', and in the case of a fixnum the tag is '000' and the other bits store the actual value rather than a pointer to the value. So instead of unboxing fixnums we simply untag them, which equates to shifting them 3 bits to the right.

  • node parameter - You'll notice that the emit-add-35 word takes a node parameter. This parameter is a structure passed by the compiler and contains information about the inferred types and value ranges of the arguments at compile time. This is handy if you're dispatching based on type or you want to decide whether to include overflow logic. In this example I'm doing neither, so I discard it.
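The tag/untag arithmetic described above is easy to sketch in C++ (illustrative only: 8-byte alignment leaves 3 tag bits, and the fixnum tag is 000, so tagging and untagging are just shifts by 3):

```cpp
#include <cstdint>

int64_t tag_fixnum(int64_t n)   { return n << 3; }  // value in the upper bits, tag 000 below
int64_t untag_fixnum(int64_t c) { return c >> 3; }  // arithmetic shift right by 3
```

This is exactly the ##sar-imm / ##shl-imm pair that shows up in the MR dump further down.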

Now that the add-35 word has a compiler intrinsic we can see the emitted code by compiling it within a quotation (code-block) and displaying the MR:

( scratchpad ) [ add-35 ] test-mr.
=== word: ( gensym ), label: ( gensym )

_label 0 
_label 1 
##peek RAX D 0                     ! - load value from stack
##sar-imm RAX RAX 3                ! - untag
##add-imm RAX RAX 35               ! - add 35
##shl-imm RAX RAX 3                ! - tag
##replace RAX D 0                  ! - replace top stack elem with result
_label 2 
##return 
_spill-area-size 0

I've annotated this output but you could probably guess what it was doing anyway.

I mentioned earlier that a backend optimizer stage runs after the intrinsic word is called. To illustrate this here's a compilation of the add-35 word with a supplied constant argument:

( scratchpad ) [ 4 add-35 ] test-mr.
=== word: ( gensym ), label: ( gensym )

_label 0 
_label 1 
##load-immediate RAX 312 
##inc-d 1 
##replace RAX D 0 
_label 2 
##return 
_spill-area-size 0

You can see that the Factor compiler dispensed with our hand-coded add instruction and instead just stuck the fixnum-tagged result in the RAX register. It could do this because it was able to perform the evaluation and tagging at compile time ( 312 = (35 + 4)<<3 ). Here's the resulting x86 assembler:

( scratchpad ) [ 4 add-35 ] disassemble
00007feac680e0c0: 48b83801000000000000  mov rax, 0x138
00007feac680e0ca: 4983c608              add r14, 0x8
00007feac680e0ce: 498906                mov [r14], rax
00007feac680e0d1: c3                    ret

So that leaves the question: How do I code actual X86 assembler into a subroutine?

To do that you need to create a new low-level instruction tuple and emit your X86 assembler from a generate-insn method on that instruction. This is a lot easier than it sounds thanks to the INSN: and CODEGEN: words.

I've got to add some CAS instructions soon so I'll probably write a bit about it then.

Making a C codebase reentrant by turning it into a big C++ object

Over the last couple of months I've been spending my spare time making the Factor VM codebase reentrant. The Factor VM is a mostly-C codebase with global variables holding the runtime state, and I wanted to be able to run multiple VMs in a single process. I thought I'd document the approach I used to make the C portions reentrant because it's one of those things that's obvious and easy in hindsight, but took me a few abortive attempts and some wasted time to find the best way.

The output of this process is one big vm object with all the global variables and functions in it. I originally spent some time trying to refactor the vm codebase into an OO model, but this turned out to be a really subjective exercise and I ended up thinking I'd do more harm than good attempting it. Ultimately I opted for the one-big-vm-object approach, with the proviso that it can be refactored into an object model later if that's deemed a good idea.

Anyway, here's the recipe for putting all the variables and functions into the object. The guiding principle of the technique is to have a working, running version at all times:

  1. create an empty vm class and singleton instance
  2. move the functions into the class one by one, leaving a forwarding function (The forwarding function calls the method through the singleton pointer, meaning all the existing refs to the function still work)
  3. once all the functions are converted to methods, remove the forwarding functions
  4. then move the global variables into the class
  5. finally remove the singleton pointer

The reason for moving the variables at the end is that once the functions are in the object it doesn't matter whether variables are local to the object or global: the code referring to them in the functions doesn't change. This means you can incrementally move variables in and out (for testing) and everything builds OK at each step.
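Here's a tiny C++ sketch of steps 1-2 (the names are made up for illustration - the real Factor VM functions are different):

```cpp
// Step 1: an (initially empty) vm class and a singleton instance.
struct factor_vm {
    int gc_count = 0;                    // a former global variable (moved in step 4)
    void primitive_gc() { ++gc_count; }  // a former free function (moved in step 2)
};

static factor_vm the_vm;
static factor_vm* vm = &the_vm;  // the singleton pointer (removed in step 5)

// Step 2: the forwarding function - every existing call site of
// primitive_gc() keeps compiling unchanged while the body lives in the class.
void primitive_gc() { vm->primitive_gc(); }
```

Once every function forwards like this, the forwarders can be deleted and the call sites updated in one mechanical pass (step 3).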

I should mention that it really helps if you've got a good editor with macro support. I wielded emacs' macro and register features to pretty much automate the whole thing, which is a godsend if you've only got about an hour a night to spend on hacking. (I have kids).

Obviously there was a lot more to making the vm reentrant than simply putting all the C code in an object, but doing that really broke the back of the work and motivated me to press on with the assembler and compiler changes. Hopefully I'll get around to writing something about the vm internals soon.

Factor makes you write better code

I program in Python, JavaScript and Factor on a roughly daily basis. My experience is that I can write functions/methods quicker in Python and JavaScript than I can in Factor, but that my Factor code ends up being of considerably higher quality. By higher quality I mean that it's better factored and easier to pull apart and change. In this post I'm making the claim that Factor forces me to write better code, and I'm going to illustrate this with an example.

(I also use Perl, Ruby, Scheme and Java, but not nearly as often.)

I've recently been writing a trading simulator in my spare time so that I can test my trading ideas on historical data. As part of this project I've written some of the same functionality in both JavaScript and Factor, and this experience gave me a good basis from which to compare the languages.

The example I'm going to use to illustrate the comparison is: Coding a simple moving average (SMA) function.

A simple moving average involves stepping along an array of numbers, generating at each step the average (mean) of the last p elements of the sequence (where p is the period). The output of the function is the sequence/array of averages.

E.g. an sma with period 4 on a six element array:

sma([0,1,2,3,4,5],4) => [0,0,0,1.5,2.5,3.5]

(I padded the start of the array with zeros in the JavaScript version.)

For the JavaScript implementation I built the SMA as two nested for loops, with the inner loop summing the last p elements at each turn. This isn't the most efficient way of computing a moving average, but it's what I thought of and implemented first:


function sma (arr,period) {
    var out = [];
    // fill initial space with zeros
    for(var i=0;i<period -1;i++) { out.push(0);}  
    // fill rest with averages
    for (var i=period-1; i<arr.length;i++) {
        var sum = 0;
        for (var j=i-(period-1); j<=i; j++){
           sum += arr[j];
        }
        out.push(sum / period) ;   
    }
    return out;
}

When I went to code the Factor version, the idea of coding up nested loops made my head hurt. Factor's stack-based approach effectively means serial access to state - you have to shuffle the right variables into the right order at the right time. This makes it very hard to write functions that manage more than ~3-4 variables at a time.

JavaScript by comparison has random access to local variables* and my JavaScript version uses 'arr', 'out', 'i', 'j', 'period' and 'sum', not to mention a bunch of unnamed temporaries like 'arr.length', 'arr[j]', 'period-1' etc.

Shuffling all these variables manually on a stack while mentally keeping tabs on the order and position of each variable is a pretty tough challenge. I suspect the resultant code would be the sort of thing only a compiler could love.

So faced with this problem I used my traditional Factor problem-hammer, which is to step away from the screen, walk around a bit and ask myself the question: 'What abstraction could there be that would make this easier?'

I came up with 'map-window' which implements a sliding window across the input sequence and applies a block of code to each subset in turn. The code to implement the moving average is then:


[ mean ] map-window

Which is clearly a much cleaner implementation of SMA.

Before I continue I should mention that I could also have written the map-window abstraction in JavaScript (JavaScript has good higher-order function facilities), but the point of this post is that Factor forced me to come up with the approach.

Once I'd had the 'map-window' idea I could easily see how to compute the moving average. I also had an idea of how I could build map-window using 'head' and 'tail', or at least I had enough of an idea to motivate my trying it.

Ok, so here's my full implementation for comparison with the JavaScript:


: window ( seq start window-width -- subseq )
    [ 1+ head ] dip short tail* ; 

: map-i ( seq quot: ( seq i -- elt ) -- seq' )
    [ dup length ] dip with map ; inline

: map-window ( seq window-width quot -- seq )
    '[ _ window @ ] map-i ; inline

: sma ( seq period -- seq' ) 
    [ mean ] map-window ;
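For comparison, here's the same windowing logic sketched in C++ (a sketch, not a literal translation of the Factor combinators): window i is the last `period` elements ending at index i, truncated at the start of the sequence just as `short tail*` truncates the window above. Note this differs slightly from the zero-padded JavaScript version:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

std::vector<double> sma(const std::vector<double>& seq, std::size_t period) {
    std::vector<double> out;
    for (std::size_t i = 0; i < seq.size(); ++i) {
        // the window: up to `period` elements ending at index i
        std::size_t start = (i + 1 > period) ? i + 1 - period : 0;
        double sum = std::accumulate(seq.begin() + start, seq.begin() + i + 1, 0.0);
        out.push_back(sum / double(i + 1 - start));  // mean of the window
    }
    return out;
}
```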

To my eyes the Factor implementation is quite a bit more complex than the JavaScript one, at least consumed in its entirety. This might be because the concept of a for loop is deeply ingrained in my brain, whereas the Factor implementation invents both map-i and map-window to build sma.

However, the individual parts of the Factor implementation are both generic and composable, and once you know what each bit does, the whole thing pretty elegantly describes itself.

A big advantage of all this abstraction is that when you discover an implementation pattern occurring more than once, the chances are that the pattern is already factored out to some extent and is ripe for reuse with very little modification. I find this makes refactoring quicker and easier than with Python and keeps the codebase relatively lean. This in turn means that the codebase doesn't drag as much as it gets bigger. The tradeoff is that I spend more time upfront finding and creating abstractions in the first place.

Of course, if the right abstractions already exist then coding performance improves dramatically - e.g. if map-window had already existed then sma would have been a slam dunk. I'd assume that as the Factor library improves, the likelihood of this happening will increase, maybe at the expense of more time required to learn the core vocabularies. Programming in Factor is already more about the libraries than the native language and I'd imagine this trend will continue, especially when you consider that in a lot of cases the libraries implement the core language.

Aside: I was surprised to discover last year that genuinely new and important stack-language abstractions like 'fry' and the cleave/spread combinators were only just being conceived, despite Factor being quite a few years old and stack languages in general being many decades old. When you consider that very few languages actually 'invent' new features, this makes Factor quite an interesting language in itself. Also interesting is that, apart from a small bootstrapping core, the Factor language is actually implemented in libraries, meaning that anybody can build and experiment with new language constructs.

Anyway, I'm diverging from the subject so I ought to sum up. The takeaway is: whereas other languages provide the ability to create good abstractions, Factor pretty much forces you to create good abstractions, because it is so bloody difficult to write any code without them.

--

Update: during writing this post I realised that what I'm doing with map-window is actually very similar to an abstraction in the Factor library called <clumps>, which constructs a virtual array of overlapping subsequences. That's the nature of Factor programming: you keep finding that somebody else has built a similar abstraction to yours, and it would have saved you a ton of time if only you'd realised!

  • Factor actually has support for efficient lexical local variables via the 'locals' vocabulary (library), which is a pretty impressive feat. However I only tend to use this when the problem I'm solving doesn't factor well (or sometimes temporarily out of desperation when I can't come up with the right abstraction).

Really simple HTML templating in Factor

I'm currently building an early release of the webapp database project I've been working on in my spare time. The app is, predictably, written in Factor, which has a neat deployment mechanism: you create an image file containing just the compiled code that you need to run the app, and then ship that for each platform.

This is a nice solution, except I found that using the built-in 'fhtml' templating language in Factor requires the whole parser-compiler runtime, which ramps up the image footprint of my app from 3MB to about 20MB - D'oh!

So instead I've written a very simple web templating language which doesn't require the Factor parser at all:


DEFER: parse-text

: display-variable ( elem -- )
  dup callable? [ call ] [ write ] if ;

: parse-variable ( -- )
  "}" read-until [ get display-variable parse-text ]
                 [ drop "missing closing }" throw ] if ;

: maybe-variable ( -- )
  read1 dup CHAR: { = [ drop parse-variable ] [ write1 parse-text ] if ;

: parse-text ( -- )
  "$" read-until [ write maybe-variable ] [ write ] if ;

: eval-template ( path -- )
  utf8 <file-reader> [ parse-text ] with-input-stream ;

To run it you just call 'eval-template' with the path to the template file.

The syntax for variables within the template is similar to that of Korn shell scripts, e.g. '${MYVAR}'. On processing the template, variable names are simply looked up in the current Factor namespace. If the result is a quotation it gets invoked; otherwise the value of the variable is output as a string.
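A minimal C++ rendering of the same idea (a sketch only: a std::map stands in for Factor's namespace stack, unknown variables expand to nothing, and I've skipped the quotation-calling case):

```cpp
#include <map>
#include <stdexcept>
#include <string>

std::string eval_template(const std::string& in,
                          const std::map<std::string, std::string>& vars) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '$' && i + 1 < in.size() && in[i + 1] == '{') {
            std::size_t close = in.find('}', i + 2);
            if (close == std::string::npos)
                throw std::runtime_error("missing closing }");
            auto it = vars.find(in.substr(i + 2, close - i - 2));
            if (it != vars.end()) out += it->second;  // look the variable up
            i = close;  // skip past the '}'
        } else {
            out += in[i];  // plain text is copied through
        }
    }
    return out;
}
```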

The namespace thing is cool because in Factor you can 'bind' a hashtable to the top of the namespace stack before invoking eval-template, which removes the need to pass tables around and makes the lookup code very neat.

You'll also notice that the template input is bound to stdin using the with-input-stream word prior to processing, which means I also don't have to pass the template stream around. Similarly, the code expects the output stream to be bound, so there's no passing that around either. Dynamic variables are a really neat Factor feature and it eludes me why they don't appear in more mainstream languages.

Finally, something I've learnt about HTML templating languages over the last couple of years:

HTML templates are best when the HTML boilerplate is clean and free of control statements (loops, conditionals), but they begin to suck as the page gets sophisticated about its rendering; usually when you hit tables and nested lists.

However, the opposite seems to be true of HTML DSLs in Lisp and Factor: they're great when there's plenty of logic, but become overkill when you're mostly writing static HTML, especially header and footer boilerplate.

The best compromise I've found is to drive the page from the HTML template, but call into real code to render tables, lists and conditional content. This also has the nice side effect of keeping the template language trivial.

Searching arrays in X86 assembler with a bloom filter pt 3

Continuing on from yesterday evening, I had a bit of time tonight in front of the telly, so I implemented the rest of the fast-search functionality in Factor using the assembler code from the previous post.

The first step was to create the simple bloom filter for passing to the assembler code. To keep the assembler fast and simple I'm just setting one bit for each element in the input sequence. Also, I hardcoded the size of the filter for simplicity (more on that later).

Factor has a bit-array type that backs onto a block of memory, so writing 'prepare-filter' was easy:


: set-bit ( n filter -- ) t -rot set-nth ; inline

: (prepare-filter) ( seq filter -- )
  [ [ 1048576 mod ] dip set-bit ] curry each ;

: prepare-filter ( seq -- filter )
  1048576 <bit-array> [ (prepare-filter) ] keep ;
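For readers who don't speak Factor, here's a hedged Python equivalent of the same filter: one bit per input element, the hash is just mod, and the size is hardcoded to 1048576 bits as above.

```python
FILTER_BITS = 1048576  # hardcoded filter size, matching the Factor code

def prepare_filter(seq):
    # One bit set per input element; since the hash is just mod,
    # false positives are possible and must be confirmed later.
    filter_bits = bytearray(FILTER_BITS // 8)
    for v in seq:
        n = v % FILTER_BITS
        filter_bits[n >> 3] |= 1 << (n & 7)
    return filter_bits

def maybe_member(filter_bits, v):
    # True means "possibly in the set", False means "definitely not".
    n = v % FILTER_BITS
    return bool(filter_bits[n >> 3] & (1 << (n & 7)))
```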

I also needed some code to filter out false positives found by the assembler filter. To do this I pre-create a hash from the input set and then pass each result to a 'collect-element-if-correct' word:


: seq>hash ( seq -- hash ) [ dup ] H{ } map>assoc ; inline

: check-elt ( ptr i -- v/f )
  alien-unsigned-cell itemshash get at ;

: collect-element-if-correct ( alien i -- )
  tuck check-elt [ 2array , ] [ drop ] if* ;

Incidentally, somebody left a comment on my last post saying that the bt (bit-test) instruction was a performance hog when testing directly against memory, so I modified the code to load the relevant bits into a register and bt against that instead. He was right - on the simple test I got a speedup from 170ms to around 105ms! I replaced the single bt instruction from the last post with the following:


  tmpreg "val" operand MOV
  tmpreg 5 SHR                    ! tmp contains dword offset into filter
  tmpreg 2 SHL                    ! clear bottom 2 bits and make it a byte offset
  tmpreg "filter" operand tmpreg [+] MOV   ! tmp contains filter dword
  "val" operand tmpreg BT

which disassembled with gas looks like this:


mov    %edx,%edi
shr    $0x5,%edi
shl    $0x2,%edi
mov    (%eax,%edi,1),%edi
bt     %edx,%edi

I'd actually intended to use bit shifts to eliminate the BT instruction altogether, but it seems x86 shift instructions can only take an immediate or the CL register as the shift count, so I can't dynamically shift by the contents of an arbitrary register. There's probably another way - are there any x86 assembler experts out there with hints?
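To make the semantics of that five-instruction sequence concrete, here's the same bit test written out as plain shift-and-mask arithmetic (a Python sketch; val is assumed to be already reduced mod the filter size):

```python
def bit_test(filter_words, val):
    # mov/shr 5/shl 2: pick the 32-bit word holding bit number 'val'
    dword = filter_words[val >> 5]
    # bt %edx,%edi only looks at the low 5 bits of the bit offset,
    # so this shift-and-mask is equivalent to the BT instruction
    return (dword >> (val & 31)) & 1
```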

The only other performance affecting variable was the size of the bitarray used as the filter. This was where I was expecting to come unstuck: I sometimes have 1000s of elements in the input set and so would need a pretty big bitarray to keep the number of false positives low enough to get good performance. On the other hand I expected a large bitarray would hammer the cpu caches (as access is pretty random) and kill performance.

As it happens the performance on a 1Kbit filter seems to be about the same as one 4Mbits wide (i.e. half a megabyte). I'm not sure I fully understand why this is, as I'm pretty sure the L1 data cache on a core2duo is 32kB. Probably all the action takes place in the L2 cache.

Anyway, I'm getting ~107ms to check an entire 48MB array (12M elements) for matches against a set of 2836 elements using a 1Mbit filter on a CPU pegged to 800mhz. At 1800mhz it takes about 48ms. For my requirements that pretty much validates the idea and the approach.

Searching arrays in X86 assembler with a bloom filter pt 2

This is a continuation of the previous post

Thanks to the commenters who suggested using other data structures rather than an unordered array. The reason I'm going for linear search is that it allows me to search an index that's sorted for other purposes. The fewer indexes I use, the less data has to be in memory and the larger the datasets I can work with. In short, fast-enough linear searching means more data in memory.

I'm pretty sure I'm not going to get near the 100ms at 800mhz I was aiming for yesterday. I wrote the assembler code this evening and did some preliminary tests.

The idea is that I load each element in the array, mod the value to the size of the bloom filter and then see if the corresponding bit is set in the filter, escaping out if so. X86 assembler has a BT instruction that tests whether an individual bit in memory is set. That's got to be faster than doing all the loading, ANDing and bit shifting myself.

Unfortunately the factor X86 assembler doesn't support the BT instruction since it's just designed for generating machinecode for the compiler. A quick look at the intel manuals gave me the opcodes I needed to add it though:


: BT ( operand operand -- ) { HEX: 0F HEX: A3 } 2-operand ;

The assembler loop that does the work looks like this in factor's X86 DSL:


: %idx-matching-filter ( -- )
  "loop" define-label "end" define-label

  "filter" get dup %unbox-byte-array               ! unbox input params
  "ptr" get dup %unbox-alien
  "i" operand %untag-fixnum
  "val" operand %untag-fixnum        ! val contains length at this point

  ds-reg [] "val" operand MOV        ! free up val register for tmp use

"loop" resolve-label                      
  "val" operand "ptr" operand "i" operand [+] MOV  
  "val" operand 1023 AND   ! val mod 1024 
  "val" operand "filter" operand [] BT
  "end" get JB

  "i" operand 4 ADD
  "i" operand ds-reg [] CMP
  "loop" get JNE

"end" resolve-label
   "i" operand %tag-fixnum
;

Notice that it's full of '"foo" operand' clauses - factor maps these to registers (e.g. EAX, EBX etc.) according to function parameters. This is really handy and in this case saves doing the stack->register mapping manually. For completeness, here's the gas disassembly, which includes unpacking registers from the stack and unboxing integers:


( scratchpad ) \ idx-matching-filter disassemble
Dump of assembler code from 0xb17028e0 to 0xb1702927:
0xb17028e0: mov    $0xb17028e0,%ebx
0xb17028e5: mov    -0xc(%esi),%eax
0xb17028e8: mov    -0x8(%esi),%ecx
0xb17028eb: mov    -0x4(%esi),%edx
0xb17028ee: mov    (%esi),%ebp
0xb17028f0: lea    0x5(%eax),%eax
0xb17028f3: mov    0x9(%ecx),%ecx
0xb17028f6: sar    $0x3,%edx
0xb17028f9: sar    $0x3,%ebp
0xb17028fc: mov    %ebp,(%esi)
0xb17028fe: mov    (%ecx,%edx,1),%ebp
0xb1702901: and    $0x3ff,%ebp
0xb1702907: bt     %ebp,(%eax)
0xb170290a: jb     0xb170291b
0xb1702910: add    $0x4,%edx
0xb1702913: cmp    (%esi),%edx
0xb1702915: jne    0xb17028fe
0xb170291b: shl    $0x3,%edx
0xb170291e: mov    %edx,-0xc(%esi)
0xb1702921: sub    $0xc,%esi
0xb1702924: ret    

I've got to go to bed soon so I just tried the best case of having an empty filter (and thus no matches): I was able to get 171ms for the 48MB array on a processor pegged at 800mhz, which isn't the sub-100ms I was hoping for but is still pretty good. I'll maybe experiment with real data and filter sizes tomorrow if I get any spare time in the evening.

Searching arrays in X86 assembler with a bloom filter

I'm currently writing what amounts to an olap database in factor in my spare time, which is a pretty interesting project. Actually, spare time is limited now that I have a child, but factor is affording me pretty good productivity compared to python and scheme, so it's probably net-par compared to what I was doing a couple of years ago. I should probably write a bit more about that some time.

Anyway, I recently made the bizarro discovery that in the world of factor, for ease of deployment, it makes more sense to write low level code in assembler rather than C. This is because factor effectively has a built-in macro assembler DSL for each of the chipsets it supports, whereas shipping C code requires having a C compiler set up on the target machine, which is a pita on windows platforms. Also X86 is fast becoming the only server processor that matters. Crazy world.

Ideally I'd not have to write any low-level code at all and I'd do it all in factor. To be honest factor is pretty bloody fast, and could conceivably compete well with raw assembler in the not-too-distant future given enough tweaking of the code generator. For the time being though C and assembler rule the performance roost in this environment.

Ok, so the interesting problem I have is:

I have a set of 32bit integers, and I want to search a large unsorted 32bit array accumulating the array indices of values matching those from the set. To solve this in factor I've been putting the input set in a hashtable and then sequentially looking up each element from the array, which algorithmically is pretty optimal O(n+m) but the ever present constant is pretty high. This takes about 4 seconds on a processor pegged at 800mhz for a 48MB array.
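The current approach boils down to this (a Python sketch; in reality the array is a raw 48MB block of 32-bit values rather than a list):

```python
def match_indices(array, wanted):
    wanted = set(wanted)   # the input-set hashtable
    # O(n + m): one hash lookup per array element - optimal
    # algorithmically, but the constant factor is high
    return [i for i, v in enumerate(array) if v in wanted]
```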

Ideally I want this performance down to under 100ms. This is going to be very tight because a 48MB file contains ~12M elements, so if I'm aiming for 100ms that means about 6 cycles per element on an 800mhz chip. Probably a bit too ambitious.
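Sanity-checking that budget:

```python
elements = 48 * 2**20 // 4    # 48MB of 32-bit values: ~12.6M elements
budget = 800_000_000 // 10    # 100ms at 800MHz: 80M cycles
print(budget / elements)      # ~6.4 cycles per element
```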

I'm thinking I'll create a bloom filter with a simple hashing function (like e.g. 'mod' ) and iterate the large array in assembler checking the filter for each element and then escaping back to factor to confirm and accumulate matches.
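That plan, end to end, as a hedged Python sketch (the filter size here is a placeholder; the inner loop is the part that would move to assembler):

```python
def find_matches(array, wanted):
    FILTER_BITS = 1024                  # filter size is tunable
    filt = [False] * FILTER_BITS
    for v in wanted:                    # simple hash: just mod
        filt[v % FILTER_BITS] = True
    wanted = set(wanted)
    matches = []
    for i, v in enumerate(array):       # this loop -> assembler
        if filt[v % FILTER_BITS]:       # cheap filter check
            if v in wanted:             # escape back to confirm,
                matches.append(i)       # rejecting false positives
    return matches
```

Note how 1029 below collides with 5 in the filter (1029 mod 1024 = 5) but is rejected by the confirmation step.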

Incidentally, I've noticed that my motivation for writing about my programming ideas drops after I've implemented them, so this time I thought I'd experiment with using my blog to think through the idea before I attempt it. Can anybody see a better way to solve this problem given the constraints? I'll post again with the results...

Nesting REPLs in factor

Factor ships with a GUI IDE, but I prefer to use the command line listener REPL within an emacs buffer for my coding. The advantage is that it's more tightly integrated with emacs - e.g. I can send blocks of code to the repl with keypresses. The downside is the lack of a debugger.

Today I discovered that factor REPLs can be nested, rather like creating subshells in unix, using the 'listener' word. This is really handy because it means you can drop into a listener at any point in your code. E.g.

"mydir" [ listener ] with-directory

...opens a nested repl with "mydir" being the current working directory. Exit the repl with 'bye' (or ctrl-d) and execution returns to the parent listener in the parent directory.

This trick can be employed from any point so it's really handy for debugging state-machine style nested code. Here's a trivial example to illustrate the point:

3 [ listener ] map
  --- starts a new listener ---
drop "hello" bye
  --- starts a new listener ---
bye
  --- starts a new listener ---
drop "goodbye" bye
  --- back to the parent ---
.
==> { "hello" 1 "goodbye" }

Cool.