Hand-coding multi-platform assembler using Factor compiler intrinsics

Disclaimer: I'm not a Factor compiler expert and am just getting to grips with compiler intrinsics so some of this might be a bit iffy.

'Compiler Intrinsics' is a mechanism by which you can insert your low-level implementation of a subroutine into the compiler output. This is useful in a couple of scenarios:

if the compiler doesn't support the desired functionality - e.g. it does something hardwarey that Factor can't do yet
if the subroutine is performance critical and the compiler isn't generating the most efficient code

The old way of doing compiler intrinsics in Factor was to hand-code some assembler using one of Factor's assembler DSLs (PPC or X86) and then attach it to an existing word as a word-property along with an argument type pattern. When the compiler compiled calls to the word it would compare the input parameters to the pattern and on match would insert the assembler directly into the generated code.

Since my last post about Factor's compiler over a year ago, Slava has pretty much rewritten the whole thing. It now has two intermediate stages:

The first frontend stage transforms the Factor code into an intermediate representation called 'high level IR'. This is basically a decomposition of Factor code into primitive word-calls and control nodes through various optimization passes. This is very similar to the dataflow IR in the original Factor compiler that I described in the previous blog post.

The second backend stage is the new bit. It converts the high-level IR into low-level IR, which is basically a platform independent assembler language. An optimization stage then runs and CPU registers are allocated, resulting in 'machine IR' (abbreviated to 'MR' in the debug tools). The real machine code generation is then done from this MR.

The new way of doing compiler intrinsics allows you to insert low-level IR code at the beginning of the 'backend' stage. Differences to the old way include:

You now code using the platform independent instructions defined in compiler.cfg.instructions
Instructions operate on virtual registers, of which there is an unlimited supply
Subroutine arguments don't appear in registers. Instead you manually insert code to get them in and out of the data stack using ds-pop and ds-push
You still have to box and unbox values manually (just as before)
There's an optimization stage that runs after you've emitted the low level IR instructions from your compiler intrinsic

As a really simple example here's a word which is going to add 35 to the fixnum on the top of the stack and push the result. To make sure that we're executing the intrinsic assembler I'll give it a default implementation that throws an error.

: add-35 ( n -- n' ) 
    drop "shouldn't call this" throw  ;

Incidentally, here are the MR instructions generated from this default implementation:

( scratchpad ) USE: compiler.cfg.debugger
( scratchpad ) \ add-35 test-mr.
=== word: add-35, label: add-35

_label 0 
_prologue T{ stack-frame { total-size 32 } } 
_label 1 
##load-reference RAX "shouldn't call this" 
##replace RAX D 0 
_label 2 
##call M\ object throw 
_label 3 
##no-tco 
_spill-area-size 0

A couple of things to notice:

The instructions are prefixed with ##. E.g. ##load-reference, ##replace
This MR output is displayed after CPU register allocation has been done: RAX is an x86-64 register. Also, D is a pseudo-register that points to the data stack. If you look at the disassembled machine code (just below the callstack juggling) you can see that D actually becomes R14:

( scratchpad ) \ add-35 disassemble
00007f6d98780ce0: 49b8e00c78986d7f0000  mov r8, 0x7f6d98780ce0 (add-35)
00007f6d98780cea: 6820000000            push dword 0x20
00007f6d98780cef: 4150                  push r8
00007f6d98780cf1: 4883ec08              sub rsp, 0x8
00007f6d98780cf5: 48b8e6866ca76d7f0000  mov rax, 0x7f6da76c86e6
00007f6d98780cff: 498906                mov [r14], rax
00007f6d98780d02: e859a385ff            call 0x7f6d97fdb060

Ok, so instead of an implementation that throws an error I want to insert my own instructions into the output. I can do this by attaching some low-level-IR emitting code to the word using the "intrinsic" word property:

: emit-add-35 ( node -- )
    drop              ! don't need to inspect the compiler node
    ds-pop            ! insert instruction to pop value off the stack
    ^^untag-fixnum    ! insert code to untag the value in the register
    35 ^^add-imm      ! insert instruction to add 35 to it (add-imm = add immediate)
    ^^tag-fixnum      ! insert code to tag the result
    ds-push ;         ! insert code to push the result onto the data stack

\ add-35 [ emit-add-35 ] "intrinsic" set-word-prop

The emit-add-35 word just pops a value off the stack, untags (unboxes) it, adds 35 to it, tags the result and pushes it back onto the stack. A couple of points:

'Hats' - The ^^ forms of instructions are the same as the ## forms, except that after emitting the instruction the ^^ form returns the (new) destination register so that it can be used by the next instruction.
'tag/untag' - Factor aligns all its heap data to 8-byte boundaries, which leaves the bottom 3 bits of each pointer free for runtime type identification (RTTI). These 3 RTTI bits are called the 'tag'. In the case of a fixnum the tag is '000' and the other bits store the actual value rather than a pointer to the value. So instead of unboxing fixnums we simply untag them, which equates to shifting them 3 bits to the right.
node parameter - You'll notice that the emit-add-35 word takes a node parameter. This parameter is a structure passed by the compiler and contains information about the inferred types and value-ranges of the arguments at compile time. This is handy if you're dispatching based on type or you want to decide whether to include overflow logic. In this example I'm doing neither so I discard it
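
To make the tag/untag arithmetic concrete, here is a minimal sketch in plain Factor that models it with shifts. It assumes the 3 tag bits and the '000' fixnum tag described above. The words model-tag and model-untag are hypothetical names made up for this illustration; they are not the compiler's own words.

```factor
USING: math prettyprint ;
IN: scratchpad

! Hypothetical helpers modelling a 3-bit fixnum tag of 000.
: model-tag ( n -- tagged ) 3 shift ;      ! shift left by 3 bits
: model-untag ( tagged -- n ) -3 shift ;   ! shift right by 3 bits

5 model-tag .        ! prints 40
40 model-untag .     ! prints 5
```

Because the tag is all zeros, tagging a fixnum is the same as multiplying it by 8.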

Now that the add-35 word has a compiler intrinsic we can see the emitted code by compiling it within a quotation (code-block) and displaying the MR:

( scratchpad ) [ add-35 ] test-mr.
=== word: ( gensym ), label: ( gensym )

_label 0 
_label 1 
##peek RAX D 0                     ! - load value from stack
##sar-imm RAX RAX 3                ! - untag
##add-imm RAX RAX 35               ! - add 35
##shl-imm RAX RAX 3                ! - tag
##replace RAX D 0                  ! - replace top stack elem with result
_label 2 
##return 
_spill-area-size 0

I've annotated this output but you could probably guess what it was doing anyway.

I mentioned earlier that a backend optimizer stage runs after the intrinsic word is called. To illustrate this here's a compilation of the add-35 word with a supplied constant argument:

( scratchpad ) [ 4 add-35 ] test-mr.
=== word: ( gensym ), label: ( gensym )

_label 0 
_label 1 
##load-immediate RAX 312 
##inc-d 1 
##replace RAX D 0 
_label 2 
##return 
_spill-area-size 0

You can see that the Factor compiler dispensed with our hand-coded add instruction and instead just stuck the fixnum-tagged result in the RAX register. It did this because it could perform the evaluation and tagging at compile time (312 = (35 + 4) << 3). Here's the resulting X86 assembler:

( scratchpad ) [ 4 add-35 ] disassemble
00007feac680e0c0: 48b83801000000000000  mov rax, 0x138
00007feac680e0ca: 4983c608              add r14, 0x8
00007feac680e0ce: 498906                mov [r14], rax
00007feac680e0d1: c3                    ret
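
The folded constant can be checked by hand in the listener. This is just a small sketch of the arithmetic, assuming the 3 tag bits described earlier. It does not go through the compiler at all. Note that 312 is 0x138 in hexadecimal, which is the immediate loaded into rax above.

```factor
USING: math prettyprint ;

4 35 + 3 shift .     ! prints 312, i.e. the fixnum 39 with a 000 tag
312 -3 shift .       ! prints 39
```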

So that leaves the question: How do I code actual X86 assembler into a subroutine?

To do that you need to create a new low-level instruction tuple and emit your X86 assembler from a generate-insn method on that instruction. This is a lot easier than it sounds thanks to the INSN: and CODEGEN: words.

I've got to add some CAS instructions soon so I'll probably write a bit about it then.