I wrote this post partly as an advocacy piece and partly to put down a bunch of things I've learnt about the factor compiler over the last few weeks. I should point out that I'm no expert in this area, so there are probably inaccuracies and omissions - hopefully Slava or one of the factor gurus will point them out. With that in mind, check this out:

Factor has an optimising compiler which generates machine code as you type code into the REPL. If you have gdb on your system you can see this in action by firing up factor and using the tools.disassembler vocab:


( scratchpad ) : hello "hello" print ;           ! defines the word hello
( scratchpad ) USING: tools.disassembler ;
( scratchpad ) \ hello disassemble

Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".
[Thread debugging using libthread_db enabled]
[New Thread -1213614400 (LWP 25688)]
0xffffe410 in __kernel_vsyscall ()
Dump of assembler code from 0xb0f694b0 to 0xb0f694c7:
0xb0f694b0: mov    $0xb0f694b0,%ecx
0xb0f694b5: mov    0xb0f694e4,%ebx
0xb0f694bb: mov    %ebx,0x4(%esi)
0xb0f694be: add    $0x4,%esi
0xb0f694c1: jmp    0xb123e7d0
...
End of assembler dump.

This is very cool in itself, but for me the real beauty of the factor compiler is its very modular design: small pieces that you can pull apart and tinker with in isolation. This makes the compiler accessible both as a learning tool and to anyone wanting to generate highly optimized code for tight loops.

The three stages of the compiler are

  1. Parsing the code and generating a 'dataflow' abstract syntax tree. (Also called 'IR' - intermediate representation)
  2. Optimizing the dataflow tree
  3. Generating machine code from the dataflow tree

I'll dig into each of these steps in order:

Stage 1: Parsing factor code to dataflow IR

The first step parses factor code into a dataflow data structure. You can run this step yourself and inspect the results using the dataflow word:


USE: inference
[ "hello" print ] dataflow pprint

=> T{
    #push
    T{
        node
        f
        f
        f
        V{ T{ value f "hello" 673850 f } }
        f
        f
        f
        f
        f
        f
        T{
            #call
            T{
                node
                f
                print
                V{ T{ value f "hello" 673850 f } }
                V{ }
                f
                f
                f
                f
                f
                f
                T{
                    #return
                    T{ node f f V{ } f f f f f f f f f }
                }
                f
            }
        }
        f
    }
}

Inspecting this data structure manually is pretty cumbersome, but fortunately there's some dataflow inspection functionality in the 'optimizer.debugger' vocab. The dataflow>quot word renders a dataflow structure back into quotations (code blocks) that you can print and inspect. I use it here to define some words for dataflow pretty-printing:


USE: optimizer.debugger
: print-dataflow ( node -- ) f dataflow>quot pprint nl ;
: print-annotated-dataflow ( node -- ) t dataflow>quot pprint nl ;

So now we can turn quotations into dataflow graphs and back again:


[ "hello" print ] dataflow print-dataflow
=> [ "hello" print ]

(N.B. there are already words in the optimizer.debugger vocab for displaying optimized dataflows, but for this post I wanted to be able to print dataflows prior to optimisation)

This also works for pre-defined words using 'word-dataflow':


: print-hello "hello" print ;
USE: generator
\ print-hello word-dataflow print-dataflow
=> [ "hello" print ]

In most cases the output quotation will be the same as the input quotation; however, there are a couple of expansions that happen at this stage. The first is that words marked 'inline' are inlined directly into the dataflow:


: inlinedword "this" "is" "an" "inlined" "word" ; inline
[ inlinedword ] dataflow print-dataflow
=> [ "this" "is" "an" "inlined" "word" ]

Also any compiler-transforms (macros) are evaluated at this stage.


USE: shuffle
[ 1 2 2 ndup ] dataflow print-dataflow      ! ndup is a macro
=> [ 1 2 2 drop 2 drop >r dup r> swap 2 drop >r dup r> swap ]

Stage 2: Dataflow Optimisation

Here's where the fun starts. You can get a feel for how this stage works by looking at 'optimizer.factor'. Here it is in its entirety:


! Copyright (C) 2006, 2008 Slava Pestov.
! See http://factorcode.org/license.txt for BSD license.
USING: kernel namespaces optimizer.backend optimizer.def-use
optimizer.known-words optimizer.math optimizer.control
optimizer.inlining inference.class ;
IN: optimizer

: optimize-1 ( node -- newnode ? )
    [
        H{ } clone class-substitutions set
        H{ } clone literal-substitutions set
        H{ } clone value-substitutions set
        dup compute-def-use
        kill-values
        dup detect-loops
        dup infer-classes
        optimizer-changed off
        optimize-nodes
        optimizer-changed get
    ] with-scope ;

: optimize ( node -- newnode )
    optimize-1 [ optimize ] when ;
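
Reading the source, 'optimize-1' runs a single optimization pass and leaves a changed? flag alongside the new graph. You can run a pass by hand from the listener and inspect that flag (a sketch, reusing the print-dataflow word from earlier; on this input the first pass already does the interesting work):


USE: optimizer
[ 2 3 + ] dataflow optimize-1    ! leaves the new node and a changed? flag
.                                ! => t (this pass changed the graph)
print-dataflow                   ! prints the single-pass result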

'optimize' iteratively calls 'optimize-1' until nothing changes in the output graph - i.e. it has reached a fixed point and no more optimizations can be performed. If you dig into the words used by optimize-1 (try executing them individually and inspecting the dataflow result) you'll find that it performs a number of inferences and optimizations:

  • It tracks the types (classes) of stack elements created within the code block
  • It inlines specific generic word implementations (methods) when it can deduce the class of the instance on the stack
  • It prunes unused literals and flushable words (this is actually more useful than it sounds, since other optimisations can generate unused code - see the sketch just after this list)
  • It performs branch analysis, marking tail calls in loops and pruning branches that can't be executed
  • It evaluates 'foldable' words at compile time if the values of their arguments are known
  • It executes any custom inference code attached to words, allowing words to evaluate their results at compile time if their inputs are known
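
The literal-pruning point is easy to see in miniature (a sketch: the pushed string is never used, so the push/drop pair should be removed):


[ "never used" drop 5 ] dataflow optimize print-dataflow
=> [ 5 ]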

Examples:

evaluating foldable words at compile time

'+' is a foldable word (see help for '+'), so the optimizer evaluates it at compile time if the values of both arguments are known. Here's the dataflow before and after optimization:


[ 2 3 + ] dataflow print-dataflow            ! before optimization
=> [ 2 3 + ]

[ 2 3 + ] dataflow optimize print-dataflow   ! after optimization
=> [ 5 ]
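
Folding only fires when all the inputs are known; if either argument comes from the caller's stack the addition survives untouched (a quick sketch):


[ 2 + ] dataflow optimize print-dataflow
=> [ 2 + ]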

type inference and inlining generic word implementations

To illustrate this we first create two tuples (classes) with constructors, and a generic word:


TUPLE: classa ;
C: <classa> classa 
TUPLE: classb ;
C: <classb> classb 

GENERIC: dosomething ( obj -- val ) 

Now we create an implementation of the generic word specialised for each class:


M: classa dosomething drop "something for class a" ;
M: classb dosomething drop "something for class b" ;

Finally, some code which calls 'dosomething' with an instance of 'classa' on the stack, before and after optimization:


[ <classa> dosomething ] dataflow print-dataflow            ! before optimization
=> [ \ classa drop \ classa 2 <tuple-boa> dosomething ]

[ <classa> dosomething ] dataflow optimize print-dataflow    ! after optimization
=> [ \ classa 2 <tuple-boa> drop "something for class a" ]

It's a little messy because of the inlined tuple creation, but you can see that prior to optimization 'dosomething' is a word call in the dataflow, and afterwards the optimizer has inlined the implementation of 'dosomething' specialized on 'classa'. (If you look closely you can also see an example of pruning literals here: the superfluous '\ classa drop' at the start of the first dataflow is gone after optimization.)

This is easier to see if I cheat a bit and use factor's 'declare' word, which declares that elements on the top of the stack are instances of specific classes. So this quotation assumes that the top stack element before it is called is of type 'classa':


[ { classa } declare dosomething ] dataflow optimize print-dataflow
=> [ drop "something for class a" ]

N.B. you wouldn't normally use 'declare' in user code, but it can be really handy for optimizing performance-sensitive tight loops where the results of an external word call are known to the programmer but not to the compiler.
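
For example (a sketch - 'get-thing' here is just a stand-in for an external call whose result class the compiler can't infer), wrapping the call with a declaration lets the generic dispatch in the loop body be inlined away:


! stand-in for an external word - pretend its output class is opaque to the compiler
: get-thing ( -- obj ) <classa> ;

: fast-loop ( -- )
    100 [ get-thing { classa } declare dosomething drop ] times ;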

conditional folding

The compiler optimizes out conditional branches when it can deduce the outcome of the conditional at compile time:


[ 1 0 =  [ "do if true" ] [ "do if false" ] if ] dataflow optimize print-dataflow
=> [ "do if false" ]

1 isn't equal to 0, so the optimizer reduces the whole block to the contents of the false quotation. This is a simple example, but it turns out to be really powerful in performance-sensitive code (e.g. tight loops): you can use a generic library function whose behaviour depends on a conditional, specialize it with a hardcoded 'f', and the compiler will optimize the conditional branch right out of the resulting code. You get the elegance of the generic combinator with the speed of a hand-coded loop.
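
Here's that pattern in miniature (a sketch - 'maybe-log' is a made-up word; the flag is hardcoded at the call site, so after inlining the whole conditional should fold away):


: maybe-log ( str verbose? -- ) [ print ] [ drop ] if ; inline
[ "msg" t maybe-log ] dataflow optimize print-dataflow
=> [ "msg" print ]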

Stage 3: Machine Code Generation

Code generation is implemented by the 'generate' word, which iterates through the nodes calling 'generate-node' on each:


: generate-nodes ( node -- )
    [ node@ generate-node ] iterate-nodes end-basic-block ;

: generate ( node word label -- )
    [
        init-generate-nodes
        [ generate-nodes ] with-node-iterator
    ] with-generator ;

'generate-node' is a generic word with specialized implementations for each type of dataflow node.

As described in this post from Slava's excellent Factor blog, there are a number of dataflow node types, the important ones being:

  • #push - push literals on the data stack
  • #shuffle - permute the elements of the data or call stack
  • #call - call a word
  • #label - an inlined recursive block (loop, etc.)
  • #if - conditional with two child nodes
  • #dispatch - jump table with multiple nodes; jumps to the node indexed by a number on the data stack

The generate-node implementations invoke lower level words in the 'architecture' vocabulary, which in turn are generic words that write out small pieces of machine code specialized for each CPU architecture.

The machine code generation is particularly cool and easy to follow because factor has an assembler DSL for each cpu architecture it supports. The assembler words mirror the instructions and registers of the target cpu and evaluate to the corresponding machine code.

You can even try this out in the REPL using 'make' to collect the results into an array. I'm on x86 so I load the cpu.x86.assembler vocabulary:


USE: cpu.x86.assembler

[ EAX 35 MOV ] { } make .   ! postfix assembler evaluates to machine code!

=> { 184 35 0 0 0 }

The assembler DSLs make code generation easy to follow because you can see the assembler in the generation code and then check it against the disassembled machine code using the 'disassemble' word we used at the start of the post. When following the code it helps to know which registers are used for what purpose. I found this information in assembler files in the factor VM source - I'm on x86, so for me the defines are in the 'cpu-x86.32.S' file:


#define ARG0 %eax
#define ARG1 %edx
#define XT_REG %ecx
#define STACK_REG %esp
#define DS_REG %esi
#define RETURN_REG %eax

#define CELL_SIZE 4

#define PUSH_NONVOLATILE \
    push %ebx ; \
    push %ebp

#define POP_NONVOLATILE \
    pop %ebp ; \
    pop %ebx

register CELL ds asm("esi");
register CELL rs asm("edi");

So let's check this against some generated code for a really basic word that just pushes the number 42 onto the stack:


: myfunc 42 ;
\ myfunc disassemble
=>
Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".
[Thread debugging using libthread_db enabled]
[New Thread -1213696320 (LWP 32499)]
0xffffe410 in __kernel_vsyscall ()
Dump of assembler code from 0xb136b230 to 0xb136b247:
0xb136b230: mov    $0xb136b230,%ecx
0xb136b235: mov    $0x150,%ebx
0xb136b23a: mov    %ebx,0x4(%esi)
0xb136b23d: add    $0x4,%esi
0xb136b240: ret 

The first assembler line puts the address of the word into XT_REG, which is %ecx. For some reason the start of each word puts its address into this register - I'm not quite sure why. The second line puts the number 42 into the ebx register. Note that factor uses the first 3 bits of a value ('cell') to store its type (called a tag - see layouts.h); in this case it's a fixnum, whose tag is 000. 42 shifted left 3 bits is 336, which in hex is 0x150. The third line puts the number onto the stack, and the fourth updates the stack pointer to point to the new top of the stack.
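
You can check the tagged encoding from the listener (a quick sketch; >hex lives in the math.parser vocab):


USE: math.parser
42 3 shift .           ! => 336 - 42 with three zero tag bits appended
42 3 shift >hex print  ! => 150 - the 0x150 we saw in the disassembly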

code generation optimizations

Factor has another couple of tricks up its sleeve during the code generation stages:

optimizing shuffle words

The first is that stack shuffle words (e.g. dup, swap, tuck, etc.) don't get translated into machine code. Instead the compiler has a compile-time 'phantom stack' which records the positions of items on the stack. When it generates the machine code, values are accessed from the stack out of order (the runtime stack is, after all, a random-access piece of memory). This makes stack shuffling words and the retain stack effectively 'free' within a code block. A #merge node in the dataflow signifies a code boundary (usually before a subroutine call) and causes the compiler to output instructions that synchronise the physical runtime stack with the phantom stack.
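
You can convince yourself of this with the disassembler (a sketch): a word that does nothing but shuffle compiles down to a handful of loads and stores at the block boundary, rather than a separate instruction sequence per shuffle word:


: shuffle-test ( a b c -- c b a ) swap rot ;
\ shuffle-test disassemble    ! expect just a few movs, not code for each shuffle word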

word intrinsics

The second trick is word 'intrinsics'. Word intrinsics are essentially blocks of open-coded assembler that are output in place of a subroutine call. They are associated with the word via a 'word-property', which is a nifty feature of factor that allows meta information to be attached to each word. For example the 'fixnum+fast' word has intrinsics which you can see using 'word-prop':


\ fixnum+fast "intrinsics" word-prop pprint
=>
{
    {
        [ "x" operand "y" operand ADD ]
        H{
            { +output+ { "x" } }
            { +input+ { { f "x" } { [ small-tagged? ] "y" } } }
        }
    }
    {
        [ "x" operand "y" operand ADD ]
        H{
            { +output+ { "x" } }
            { +input+ { { f "x" } { f "y" } } }
        }
    }
}

This tells the compiler to inline the x86 ADD instruction instead of making a subroutine call to the implementation of fixnum+fast. You can add assembler intrinsics to existing words with 'define-intrinsics'; here's a description from the help for the define-intrinsics word:

Defines a set of assembly intrinsics for the word. When a call to the word is being compiled, each intrinsic is tested in turn; the first applicable one will be called to generate machine code. If no suitable intrinsic is found, a simple call to the word is compiled instead.
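
As a rough sketch of what a call to it looks like (the shape is an assumption inferred from the word-prop output above, and 'my-fixnum+' is a hypothetical word):


! hypothetical: attach a register-register ADD intrinsic to my-fixnum+
\ my-fixnum+
{
    {
        [ "x" operand "y" operand ADD ]
        H{
            { +input+ { { f "x" } { f "y" } } }
            { +output+ { "x" } }
        }
    }
} define-intrinsics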

What I particularly like about this feature is that it neatly provides the ability to specialize a highly optimized implementation for a particular hardware set, and then fall back gracefully on other architectures.

--

That concludes my ad-hoc tour of the factor compiler. I've skipped over a number of things and no doubt there are bits I haven't discovered yet and some inaccuracies, but I hope I've supplied enough information to spark interest in this excellent compiler.

As I mentioned in a previous post I got interested in factor as a direct result of tinkering with jonesforth, which takes you through the entire forth bootstrap process starting with raw assembly. I've been delighted to find that factor retains a lot of the 'right-down-to-the-metal' accessibility of its low level cousin.