Adding atomic CAS instruction support to Factor's compiler

I said in the last post that I'd write a bit about adding a new machine instruction to the factor compiler once I'd got round to actually doing it, so here it is:

If you've been following my blog you'll know that I wanted to utilise multiple cpu cores for a personal database project. Unfortunately Factor doesn't have an os-threaded runtime yet and so in order to work around this I modified the factor vm code to allow multiple vms to run simultaneously on separate threads.

I'm now writing a concurrent queue so that messages can be passed internally between the running vms. To implement my queue I wanted some fast atomic primitives so I set about adding a Compare-and-swap (CAS) instruction to Factor's compiler.

To implement CAS on x86 you basically need the CMPXCHG and LOCK instructions, so my first job was to get these added to Factor's x86 assembler DSL. I located the machine-code opcodes in the intel manuals and added the CMPXCHG and LOCK instructions to the x86 assembler vocabulary thus:

basis/cpu/x86/assembler/assembler.factor:

: CMPXCHG ( dst src -- ) { HEX: 0f HEX: B1 } 2-operand ;

: LOCK ( -- ) HEX: f0 , ;

With the new x86 instructions in place I was all set to add a new low-level IR 'virtual' instruction to factor's compiler. There are basically 3 steps to this:

Declare the new low-level IR instruction along with the number and types of registers it requires
Tell the code generator where to dispatch the calls to generate the 'real' cpu machine code for the instruction
Write methods that emit both X86 and PPC versions of the machine code for the instruction.

I’ll explain each step below:

Step one: Declaring the new instruction

Thanks to the recent addition of a low-level instruction DSL, adding a new instruction to the compiler backend is just a case of declaring the instruction name and the argument registers it requires:

basis/compiler/cfg/instructions/instructions.factor:

INSN: ##atomic-compare-exchange
      def: dst/int-rep
      use: ptr/int-rep old/int-rep new/int-rep
      temp: temp/int-rep ;

##atomic-compare-exchange is my new virtual CAS instruction that receives 3 input arguments: a pointer to a word in memory, an expected 'old' value and a new value.

The implementation of ##atomic-compare-exchange will do the following: compare the value pointed to by the ptr register to the value in old and if it's equal replace it with the value in new, otherwise leaves it as it is. Finally, put the resulting value in the destination register dst. In case you haven't guessed, this is pretty much exactly what CMPXCHG does on x86.

At compile time Factor's register allocator allocates the real (e.g. x86) cpu registers and passes them to our code generator.

Unfortunately as an added complication in this particular case the x86 instruction CMPXCHG uses the EAX/RAX register as an implicit argument, and unfortunately Factor's code generation doesn't support tying particular cpu registers to parameters yet (though Slava assures me it will soon). To work around this I'm asking the compiler to pass me an extra 'temp' register so we can use this if any of the others happens to be EAX/RAX.

With the low-level instruction declared I now want to get some interactive testing going, so I write a high level word called 'compare-swap' and a compiler intrinsic which uses the ##atomic-compare-exchange instruction. (See the previous post for an overview of factor compiler intrinsics). The point of this is that we can dump out and check the low level IR instructions emitted by the compiler. Here's the compare-swap word and compiler intrinsic:

: compare-swap ( ptr old new -- ? )
    2drop "compare-swap needs to be compiled" throw ;

: emit-compare-swap ( node -- )
    drop
    3inputs
    ^^atomic-compare-exchange
    ds-push ;

\ compare-swap [ emit-compare-swap ] "intrinsic" set-word-prop

Now we can use the 'test-mr.' debugger word to see the new instruction in action. We'll just pass in a bunch of junk arguments for now so we can see what the low-level MR instructions look like in context:

( scratchpad ) USE: compiler.cfg.debugger
( scratchpad ) [ ALIEN: 20 1 2 compare-swap ] test-mr.
=== word: ( gensym ), label: ( gensym )

_label 0 
_label 1 
##load-reference RAX ALIEN: 20 
##load-immediate RCX 8                            ! 1 << 3
##load-immediate RDX 16                           ! 2 << 3
##atomic-compare-exchange RAX RAX RCX RDX RBX     ! Oops! - RAX allocated as a register
##inc-d 1 
##replace RAX D 0 
_label 2 
##return 
_spill-area-size 0

In this example you can see that the register allocator has allocated RAX as both the destination register and one of the input registers, so our X86 implementation of ##atomic-compare-exchange will need to work around that.

Step two: Wiring up the code generator

Ok, now that we have low level IR working the next step is to tell the compiler how to generate the real cpu machine-code for the new instruction. There's a convention that all machine code emitting words start with a '%' so I'm going to create a generic word %atomic-compare-exchange with method implementations for each CPU. Here's the generic word declaration:

/basis/cpu/architecture/architecture.factor:

HOOK: %atomic-compare-exchange cpu ( dst cptr old new temp -- )

N.B. Factor has a generic dispatch mechanism called 'HOOK:' which dispatches polymorphically based on the value of a variable at compile time. In this case it's the cpu variable which is set to a singleton representing the target architecture (x86.32, x86.64, ppc), so essentially this generic word is polymophic based on CPU architecture.

Now I tell the compiler to used this generic word for code generation using the CODEGEN: DSL word. (again, see Slava's post on the new compiler DSL words):

/basis/compiler/codegen/codegen.factor:

CODEGEN: ##atomic-compare-exchange %atomic-compare-exchange

Step three: Doing the x86 code generation

All that's left now is to implement %atomic-compare-exchange for our cpu architectures. Below is my method implementation for x86. To make the example more straightforward I've omitted the code that works around the implicit EAX/RAX register, which I abstracted into a 'with-protected-accumulator' combinator.

/basis/cpu/x86/x86.factor:

:: (%atomic-compare-exchange) ( dst cptr old new -- )
    accumulator-reg old MOV
    LOCK
    cptr [] new CMPXCHG
    dst accumulator-reg MOV ;

! CMPXCHG implicitly uses EAX/RAX (accumulator) so need to remove
! EAX from arguments and protect it from being stomped
M: x86 %atomic-compare-exchange ( dst cptr old new temp -- )
    [ (%atomic-compare-exchange) ] with-protected-accumulator ;

The (%atomic-compare-exchange) word contains the actual machine code generation: You can see I simply output 4 lines of assembler using the factor x86 assembler DSL and the registers passed to me by the compiler. (N.B. 'accumulator-reg' is my helper word that returns EAX or RAX depending on whether the arch is 32 or 64 bit)

Now that the x86 implementation is written we can check the output machine code with the disassemble word (which uses either the Udis86 library or GDB under the hood to do the disassembling):

( scratchpad ) [ ALIEN: 20 1 2 compare-swap ] disassemble
00007f85690413a0: 48b8fe3ac566857f0000  mov rax, 0x7f8566c53afe
00007f85690413aa: 48b90800000000000000  mov rcx, 0x8
00007f85690413b4: 48ba1000000000000000  mov rdx, 0x10
00007f85690413be: 4889c3                mov rbx, rax
00007f85690413c1: 4889c8                mov rax, rcx
00007f85690413c4: f0480fb113            lock cmpxchg [rbx], rdx
00007f85690413c9: 4889c0                mov rax, rax
00007f85690413cc: 4983c608              add r14, 0x8
00007f85690413d0: 498906                mov [r14], rax
00007f85690413d3: c3                    ret

The disassembler output verifies that the cmpxchg instruction is being compiled correctly. You can also see that I'm doing some juggling with the rax register to manage using it as an implicit argument to cmpxchg.

Hopefully that gives a good overview of how to get new low-level instructions added to Factor's compiler, and also illustrates how machine-code generation works in Factor.