Another Racer progress update

Racer has come on a bit since my last post. Here's a quick rundown of the main changes:

  • Method resolution got better. Racer now hunts around traits for default method implementations, so e.g. completing a method on a Vec lists all the methods available, including 'is_exists()' and other methods implemented in the sub-traits (see above). Racer can also resolve the type of 'self', so it can match methods and fields when you're writing method implementations.
  • More completion information: Racer now outputs some contextual info which gets included in matches - see the image above.
    (N.B. unfortunately there's a bug in my emacs bindings somewhere in the interaction with company mode which means that the extra stuff only crops up after you move the cursor. I'm not really sure why this happens; I'll try and fix it at some point.)
  • Module searching got better: Racer does a more accurate job of following 'extern crate' and nested modules. It turned out that modules didn't quite work the way I originally thought they did.
  • Racer now recognises explicit types in let expressions, e.g.
    let v : Vec<uint> = ...
    This gives you a way to 'tell' Racer the type if it isn't smart enough to deduce it.
  • Internally Racer now uses iterators everywhere instead of callback functions. (I mentioned callbacks as something I didn't like about the design in a previous post).

    I originally thought iterators would solve the 'don't do the whole search if I only need the first match' problem. Unfortunately writing lazy iterators in Rust is non-trivial when the thing you're iterating is nested and recursive (like a code search). For now I've replaced the callbacks with a simpler eager solution: 'collect the results in a vector and return vec.move_iter()'. (There's a sketch of this pattern just after this list.)

    N.B. I did manage to cobble together a lazy iterator in one place by chaining together a bunch of procs in the central 'resolve_name' function; this has improved Racer's performance. I might do the same trick for sub-trait searches.
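Here's a minimal sketch of the eager approach in today's Rust ('move_iter()' has since become 'into_iter()'); the 'Match' type and the search logic here are stand-ins, not Racer's actual code:

    struct Match { name: String }

    fn search_scope(src: &str, query: &str) -> std::vec::IntoIter<Match> {
        let mut results = Vec::new();
        for line in src.lines() {
            if line.contains(query) {
                results.push(Match { name: line.trim().to_string() });
            }
        }
        // Eager: the whole search runs up front; the caller merely gets an
        // iterator over the already-collected results.
        results.into_iter()
    }

The caller can still stop consuming early, but the search work has already been done - which is exactly the limitation described above.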

Basically Racer is now pretty good at navigating and completing anything that does not involve lambdas, destructuring or generics. I'm planning to work on generics next.

I should say a big thanks to the people who committed patches to Racer in the last couple of months, in particular rsw0x who posted a couple of fixes and has also been working on the vim plugin, Jorge Aparicio who got Racer integrated with the Travis build system and Michael Gehring who diligently kept Racer compiling in the face of breaking Rust changes.

While on the subject of contributors: I've been feeling the need for somewhere to discuss Racer recently, in particular for things that I don't have much experience with - e.g. Windows support, vim + other IDE plugins. I've created a Google group, racer-discuss; please join if you're interested in Racer.

Racer progress update (+ some vim support!)

It's been a few weeks since I posted about Racer, my rust code-autocompletion project, so here's another update.

First off, thanks to Michael Gehring who has been regularly submitting patches to bring Racer up to date with breaking Rust master changes. I have a very limited amount of time to work on Racer so I really appreciate this, thanks Michael.

The main news is that I've cobbled together a Racer vim plugin to go with the emacs one. I don't use vim much myself and I was kinda hoping somebody with expertise would show up and do this but I guess racer isn't sufficiently advanced for people to be interested yet. I'm a bit of a vim n00b so there's probably a much better way of structuring the plugin - any feedback or help would be gratefully received.

The other big improvement from last time is that racer can now perform some code completion of fields and methods (in addition to the static lookups I talked about last time). It turns out that type deduction is pretty involved. As an illustration here are the steps racer took in the screenshot above:

  1. In the screenshot the completion is being performed on fields of the 'v' variable. The plan is to find the type of 'v' and then locate all the fields and methods connected with that type beginning with 'pu'.
  2. Racer starts by searching backwards in the current scope, and parses the statement 'let v = Vec::new();'. It needs to evaluate the type of the right-hand side of the expression so that it can know the type of 'v'. 'Vec::new()' parses to an 'ExprCall' AST type with path 'Vec::new'.
  3. Racer splits the path 'Vec::new' and starts searching for 'Vec'. 'Vec' is not defined in the screenshot and there are no 'use' or 'extern crate' statements in the file, so racer takes a look in libstd/prelude.rs. (The prelude module is implicitly included in all rust files).
  4. libstd/prelude.rs contains the line:
           ...
           #[doc(no_inline)] pub use vec::Vec;
           ...
    
    So now Racer needs to find the path 'vec::Vec'. It splits the path and starts searching for 'vec'.
  5. 'vec' isn't defined anywhere in the prelude module. The next place to search is in the crate root, since paths in use statements are implicitly relative to the crate root. Racer hunts for the crate root using some heuristics and finds libstd/lib.rs.
  6. libstd/lib.rs (the std lib crate root) contains the line:
           ...
           pub use core_collections::vec;
           ...
    
    So Racer needs to find 'core_collections::vec'. It starts searching for 'core_collections' in the same file.
  7. The std lib crate root also contains the line:
           ...
           extern crate core_collections = "collections";
           ...
    
    Which means in the std lib crate 'core_collections' is an alias for the collections crate. Racer searches for the collections crate root and finds it in libcollections/lib.rs.
  8. The collections crate root contains:
           ...
           pub mod vec;
           ...
    
    Which means the source for the vec module should be in either vec.rs or vec/mod.rs under the libcollections directory. Racer finds vec.rs and searches for 'Vec::new' in it. It splits the path and searches for 'Vec'.
  9. The vec module contains (amongst other things):
           impl<T> Vec<T> {
              ...
              ...
           }
    
    Which means Racer has located an impl block for Vec. It now searches for 'new' in the impl scope and finds:
        #[inline]
        pub fn new() -> Vec<T> {
            Vec { len: 0, cap: 0, ptr: 0 as *mut T }
        }
    
  10. Rather than parse the whole method, Racer extracts the type signature of new() and parses that: 'new()' returns a 'Vec<T>' type. Racer now has the type of the right-hand side of the 'let v = Vec::new();' statement from step (2), and so knows that the type of the 'v' variable is 'Vec<T>'. All it needs to do now is look for fields and methods belonging to the Vec type. It starts by looking for the definition of the Vec type.
  11. Racer locates the Vec struct in the vec module:
    pub struct Vec<T> {
        len: uint,
        cap: uint,
        ptr: *mut T
    }
    
  12. This is starting to get a bit boring so I'll cut the rest short: Racer looks for public fields in the struct (there are none), and then hunts for traits implemented for Vec and lists the methods (which it identifies as functions with 'self' as the first argument) starting with 'pu'. Phew!
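To make the walkthrough concrete, here's roughly the code being completed (my reconstruction, since the screenshot isn't reproduced here):

    fn main() {
        let v = Vec::new();
        v.pu    // <- completion requested here: fields/methods starting 'pu'
    }

After the twelve steps above, methods like 'push' end up in the candidate list.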

Despite this elaborate field-completing capability, Racer's type inference facilities are very far from er.. complete. At this point in time it is easy to confuse Racer with generic types, destructuring, lambdas and many other constructs that pop up all the time in Rust code.

My plan now is to grind through each of these in some sort of order based on usage, probably starting with generics. My hope is that by incrementally adding smarts Racer will gradually become more and more useful, and will have evolved into a fully-fledged IDE library by the time Rust hits 1.0.

Incidentally, this Thursday (26th June) is the first Rust London meetup, which looks like it is going to be awesome. (I'm not involved with organising it, just excited to be going!)

Racer progress update (Code autocompletion for Rust)

I've recently become a fan of the Rust Language. I think memory safety without garbage collection is an important innovation in the field and I'm excited by the prospect of using a safer alternative to C++ for fast low level code in my job.

Unfortunately the language is currently in heavy development, with major features changing every week or so. This is exciting but makes it unsuitable for production code at the moment. Instead I've started a project in my evenings and weekends to help me get up to speed with the language.

'Racer' is intended to bring code completion facilities to Rust editors and IDEs. It works as a standalone program that takes source code coordinates (filename, character position) and outputs completions. Eventually I hope it will also provide other search and editing capabilities. At present though it provides auto-complete of functions, structs + enums, and also does 'find-definition' for jumping around code.

It works like this:

  1. User presses <tab> or whatever to complete some code
  2. Editor writes out a temporary copy of the edited text and invokes Racer on it
  3. Racer searches the source code and prints the matching completion options
  4. The editor then reads the output and displays nice drop down list or whatever
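As a rough sketch of steps 2-3 from the editor's side (the exact command-line arguments here are illustrative, not a documented interface):

    use std::process::Command;

    // Hypothetical editor-side glue: run racer against the temp copy of the
    // buffer and capture whatever completions it prints to stdout.
    fn complete(tmpfile: &str, line: u32, col: u32) -> String {
        let output = Command::new("racer")
            .arg("complete")
            .arg(line.to_string())
            .arg(col.to_string())
            .arg(tmpfile)
            .output()
            .expect("failed to run racer");
        String::from_utf8_lossy(&output.stdout).into_owned()
    }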

The only editor support bundled with Racer at the moment is emacs. I also heard that Eric West has written support for the Atom editor. It appears that a lot of the core Rust developers use vi so I'm privately hoping that somebody will come along and add some vi support too.

Under the hood Racer works by parsing the local expression to be completed and then traversing the source code, doing text searches to narrow scope and parsing individual statements and blocks with libsyntax to perform type inference.

Internals

The code is very 'work in progress' (i.e. a bit crappy), mainly because I'm still getting to grips with how best to do things in Rust. There are some good bits and some less good bits:

Good:
  • Racer has a comment and string scrubbing iterator that works well. This is important for text-searching code without getting tricked by doc strings and comments. (The idea is sketched just after this list.)
  • It also has a statement iterator that steps through blocks of code quickly without requiring the parser (which is relatively expensive). This means statements can be text searched and then sent to libsyntax for full-blown parsing if they look like a match.
  • I'm now up to speed with Rust's parser package libsyntax and use it to parse various statements and expressions. The Rust language has changed a bunch over the last year and libsyntax still has a bunch of old idioms and GC based code so I feel like this is an achievement in itself.
  • I'm quite pleased with the decision to invoke racer as a command for each search rather than running a daemon that the editor talks to. It makes things easier to test from the command line and I hope will make integration with editors easier too. The downside is that I'll have to make code searching fast enough to work without building an up front in-memory model of the code. If it gets too slow I'll probably consider writing a code cache file rather than running Racer as a daemon.
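As promised above, here's a minimal sketch of the scrubbing idea (mine, not Racer's actual implementation - it handles only line comments and plain string literals):

    fn scrub(src: &str) -> String {
        let bytes = src.as_bytes();
        let mut out = bytes.to_vec();
        let mut i = 0;
        while i < bytes.len() {
            match bytes[i] {
                b'"' => {
                    i += 1; // keep the opening quote
                    while i < bytes.len() && bytes[i] != b'"' {
                        if bytes[i] == b'\\' && i + 1 < bytes.len() {
                            out[i] = b' '; // blank the backslash...
                            i += 1;        // ...then fall through to blank the escaped byte
                        }
                        out[i] = b' ';
                        i += 1;
                    }
                    i += 1; // skip the closing quote
                }
                b'/' if i + 1 < bytes.len() && bytes[i + 1] == b'/' => {
                    while i < bytes.len() && bytes[i] != b'\n' {
                        out[i] = b' ';
                        i += 1;
                    }
                }
                _ => i += 1,
            }
        }
        String::from_utf8(out).expect("only ASCII bytes were blanked")
    }

Blanking with spaces rather than deleting keeps every byte offset identical to the original source, which matters when matches get reported back as file positions.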
Less good:
  • Internally the various searching functions should really return iterators so that the calling client can stop consuming whenever it wants. Instead I'm passing a callback arg to the methods which they then invoke when they find something, e.g.
    pub fn complete_from_file(src: &str, 
                              filepath: &Path, 
                              position: uint, 
                              callback: &mut |Match|) { 
       ... 
    }
    
    complete_from_file("foo", "myfile.rs", 3, |match| {
         // do something with each match
    });
    
    This means the whole search happens regardless of whether you just want the first match. The main reason for doing this is that iterators are somewhat less easy to write for my brain. I'm hoping that yield functionality shows up to help at some point. In the meantime converting Racer's search functions into iterators is on the todo list.
  • Lots of debug!() calls. I'm using logging as my primary debug tool. This is a habit I've got into in practically every language I use, even with Java + IntelliJ, which has a great debugging environment. Rust's debug story isn't bad actually, and I've used GDB a bit to debug stuff in Rust.
  • I'm still not completely comfortable with Option types in Rust; for my brain they still generate a little friction every time I need to think about unpacking a result. In Racer the majority case is to just ignore the None type (because we are searching so 'None' usually means nothing showed up). My current go-to method of doing this in code is .map():
    do_something_that_might_work().map(|res| {
           do_something_with_result(res);
    });
    
    I feel I should be using .iter() instead:
    for res in do_something_that_might_work().iter() {
           do_something_with_result(res);
    }
    
    but I can't look at 'for' without thinking 'loop'. I find this particularly unclear for methods that are named as if they could return plural results (e.g. StrSlice::find_str(), which looks like it could return multiple results, but in fact stops at the first one and returns an Option<uint>). (There's a postscript on this just after this list.)
  • Code positions are referenced by byte offset from the start of the file. Unfortunately some days I prefer calling this the 'point' (like emacs does), and some days it's 'pos'. I really should just pick one.
  • Actually the biggest problem with Racer at the moment is that I'm tracking Rust master and am often a step behind new language changes. This means the Racer build is frequently broken with respect to the latest Rust nightly. I'm not sure how best to address this; maybe I'll release a version that tracks the next Rust release when it gets done.
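Postscript on the Option-unpacking gripe above: Rust has since grown 'if let' syntax, which expresses the 'only do something in the Some case' pattern directly (a sketch in later Rust, not something available when this was written):

    fn demo(maybe: Option<u32>) {
        // Reads as a conditional rather than a loop.
        if let Some(res) = maybe {
            println!("got {}", res);
        }
    }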

That's enough brain dumping for now. Racer's completion is pretty immature at the moment but it is still somewhat useful. I personally use the find-definition feature a lot for jumping around code to look up function arguments and I'd miss coding in Rust without it. The next phase for Racer is beefing up the type inference to better track types and handle methods on traits.

Clojure + emacs/slime without the magic package install stuff

Sometimes I just want to install stuff into emacs manually so I know what's going on. Here's a minimal setup to use emacs slime and connect to a swank server with a repl.

I stick everything in the /opt directory. I'm using clojure 1.3.0-alpha3.

  1. git clone git://git.boinkor.net/slime.git
  2. git clone https://github.com/technomancy/swank-clojure.git
  3. setup emacs:

    (add-to-list 'load-path "/opt/clojure-mode")
    (require 'clojure-mode)
    
    (add-to-list 'load-path "/opt/slime")
    
    (eval-after-load "slime"
      '(progn
         (slime-setup '(slime-repl))))
    (require 'slime)

    (That last 'eval-after-load' bit is to enable the slime-repl. Recent versions of slime come without it enabled by default)

  4. Go to your project directory and start the swank server:

    java -cp "lib/*:.:/opt/swank-clojure/src:/opt/clojure-1.3.0-alpha3/clojure.jar" clojure.main -e "(use 'swank.swank)(apply -main *command-line-args*)" /dev/null "$@"

  5. In emacs, connect to swank server:
    M-x slime-connect

The future is tree shaped

Have been thinking and reading more about parallelism recently. This set of slides from Guy Steele distilled a lot for me.

In order to realise parallel hardware performance we need to optimize our programs for computational bandwidth rather than latency. In terms of programming this means deprecating accumulation (cons, fold, streams, sequences) and favouring divide-and-conquer. This suggests a move to trees as the fundamental abstract building block for data.

Making tests less brittle

I do most of my interesting programming at home currently. Limited time in the evenings means that development is very stop-starty: projects get periods of mad enthusiasm and then are dropped for a few months while I concentrate on something else.

In this context I've found that having a large suite of automated tests can be a double-edged sword: usually when I come back to a project after a few months I have slightly different goals, and so want to change the codebase, often drastically.

If the tests are tightly coupled to the implementation this adds significant drag, and is sometimes enough to cause me to delete a ton of code and start again. Worse than that, occasionally I just stop running tests and hack.

So over the years I've come to spend an increasing amount of effort isolating tests to make them less brittle in the face of change. This is not an exact science, but there's often some way to rejig a test so that it achieves most of the same purpose without being quite so tightly coupled to the implementation. I guess for my purposes I'm advocating mostly blackbox over whitebox testing at some level. Here are a couple of examples:

System test data in 'native' formats

I like to code 'top-down' as it keeps me focused. I usually start with some data that I want presenting or transforming or storing or mining or whatever. I write some small system tests and code down from there.

I learnt this tip the hard way: it's very advantageous to have test data in a native external format that's not likely to change.

On my first data-aggregation project I had most of the tests using an internal triples format of my own design, which meant that when I changed my mind a few months later I had a ton of testdata and tests to change. I ended up deleting a lot of the code and starting again.

The second time I picked up the project I converted all the inline testdata to CSV and JSON and made the tests run an implicit import transform before invoking the top-level functions. The tests became slightly more complex but also less brittle and I'm now much less likely to delete them.
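A minimal sketch of the pattern (the language and names here are mine, purely for illustration):

    #[derive(Debug, PartialEq)]
    struct Record { id: u64, name: String }

    // The implicit import transform: fixtures stay in stable CSV, and tests
    // convert them to internal structures through the normal import path.
    fn import_csv(csv: &str) -> Vec<Record> {
        csv.lines()
            .map(|line| {
                let (id, name) = line.split_once(',').expect("expected 'id,name'");
                Record {
                    id: id.trim().parse().expect("numeric id"),
                    name: name.trim().to_string(),
                }
            })
            .collect()
    }

    #[test]
    fn imports_two_records() {
        // The fixture is plain CSV, so it survives internal format changes.
        let data = import_csv("1,apple\n2,banana");
        assert_eq!(data[1], Record { id: 2, name: "banana".to_string() });
    }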

Inputs/outputs as language primitives

Inevitably as a codebase gets bigger I find that adding top-level blackbox tests isn't enough to drive development and that I need whitebox tests at a unit-level to help with algorithmically intense parts of the project. These tests increase motivation and speed up my coding but unfortunately are a lot more brittle during change and tend to be the ones that get deleted first when I come back to a project.

To combat this I often find it's worth refactoring important algorithms into functions that take language-primitive arguments (e.g. ints, lists etc..), separate from the object graph of the application.

  • A totally contrived illustration: replace

        Foo::do-something-clever-with-Bar-Objects( objects )

    with:

        do-something-clever-with-id->name-pairs( id/name-pairs )

    and have 'Foo' callers unpack the Bar objects into a list of id,name pairs before calling the function.

The tests checking 'do-something-clever' functionality are now less coupled to the internal object graph and are passing only the data required to fulfill the operation.
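In Rust-flavoured code (the names are invented; the point is the interface, not the algorithm), the 'after' version might look like:

    use std::collections::HashMap;

    struct Bar {
        id: u64,
        name: String,
        // ...plus a web of other domain state the algorithm doesn't need
    }

    // The algorithm sees only language primitives, so tests can drive it
    // without constructing the application's object graph.
    fn ids_with_duplicate_names(pairs: &[(u64, String)]) -> Vec<u64> {
        let mut seen: HashMap<&str, u64> = HashMap::new();
        let mut dups = Vec::new();
        for (id, name) in pairs {
            if seen.insert(name.as_str(), *id).is_some() {
                dups.push(*id);
            }
        }
        dups
    }

    // Callers unpack their domain objects at the boundary:
    fn check_bars(bars: &[Bar]) -> Vec<u64> {
        let pairs: Vec<(u64, String)> =
            bars.iter().map(|b| (b.id, b.name.clone())).collect();
        ids_with_duplicate_names(&pairs)
    }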

Now this is obviously a tradeoff: the additional unpacking may add overhead (though sometimes it doesn't). It might make the function interface unnecessarily complicated. Sometimes the tradeoff works well, sometimes it doesn't, but I always at least consider trying to separate out domain objects from an algorithmically intense function. Often the algorithm is central to the application but the layout and interaction of the object graph is contrived.

--

It might be that I'm missing some important piece of the testing puzzle - I've mostly coded test-first for as long as I can remember but I've always had a mixed relationship with the outcome. Hopefully there's a silver bullet somewhere that I just haven't been told about yet.

Idea for a global interpreter lock optimized for low contention

Slava is thinking about adding a global interpreter lock to Factor, much like the Python GIL, as a step on the path to a full multithreaded VM. This would allow factor code to run while blocking FFI calls (e.g. C library calls) execute. As part of this each FFI call would need to release the lock before calling into C code and obtain it again afterwards before returning to Factor.

One of the problems with adding a GIL is that it penalizes the common single-threaded case. This got me thinking about how a mutex lock could be implemented to optimize for the low-contention case, so as to minimize the performance impact on single-threaded apps. Here's the best approach I've come up with so far:

You need:
  1. a spinlock (implemented inline via cpu primitives)
  2. a boolean to represent the GIL
  3. an int to represent contention on the GIL
  4. an OS semaphore (win32 semaphore or pthread condition)

The goal is to arrive at a situation where, if there is no contention, FFI calls just use inline assembler primitives to acquire/release the spinlock and flip the GIL; otherwise they fall back on the additional overhead of library calls to an OS semaphore.

Acquiring the Lock

The idea is that if there's no contention then acquiring the lock is just a case of obtaining the spinlock and flipping the GIL boolean.

(code is half-baked pseudocode, sorry!)

- acquire spinlock
- if GIL == false
   - GIL=true     // acquire GIL
   - release spinlock
   DONE!
- else:
   - increment contention-counter
   - release spinlock
   - goto LOOP

LOOP:                 // contention counter has been incremented by this thread
- acquire spinlock
- if GIL == false
   - GIL=true     // acquire GIL
   - decrement contention-counter
   - release spinlock
   DONE!
- else
   - release spinlock
   - wait on semaphore
   - goto LOOP

Releasing the Lock

- acquire spinlock
- release GIL
- read contention counter
- release spinlock
- if contention counter is non-zero:
     notify semaphore
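To make the scheme concrete, here's a rough sketch in Rust (illustrative only, not VM code): an atomic boolean serves as the spinlock, and a Mutex+Condvar pair stands in for the OS semaphore, counting posts so that wakeups aren't lost.

    use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
    use std::sync::{Condvar, Mutex};

    pub struct Gil {
        spin: AtomicBool,              // the inline spinlock
        gil_held: AtomicBool,          // the GIL flag; only changed while 'spin' is held
        contention: AtomicUsize,       // waiters that have registered interest
        sem: (Mutex<usize>, Condvar),  // counting semaphore standing in for the OS one
    }

    impl Gil {
        pub fn new() -> Gil {
            Gil {
                spin: AtomicBool::new(false),
                gil_held: AtomicBool::new(false),
                contention: AtomicUsize::new(0),
                sem: (Mutex::new(0), Condvar::new()),
            }
        }

        fn lock_spin(&self) {
            while self
                .spin
                .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_err()
            {
                std::hint::spin_loop();
            }
        }

        fn unlock_spin(&self) {
            self.spin.store(false, Ordering::Release);
        }

        pub fn acquire(&self) {
            // Fast path: no contention, so this is just the spinlock plus two flag ops.
            self.lock_spin();
            if !self.gil_held.load(Ordering::Relaxed) {
                self.gil_held.store(true, Ordering::Relaxed);
                self.unlock_spin();
                return;
            }
            self.contention.fetch_add(1, Ordering::Relaxed);
            self.unlock_spin();
            loop {
                self.lock_spin();
                if !self.gil_held.load(Ordering::Relaxed) {
                    self.gil_held.store(true, Ordering::Relaxed);
                    self.contention.fetch_sub(1, Ordering::Relaxed);
                    self.unlock_spin();
                    return;
                }
                self.unlock_spin();
                // Slow path: block until a releasing thread posts the semaphore.
                let mut posts = self.sem.0.lock().unwrap();
                while *posts == 0 {
                    posts = self.sem.1.wait(posts).unwrap();
                }
                *posts -= 1;
            }
        }

        pub fn release(&self) {
            self.lock_spin();
            self.gil_held.store(false, Ordering::Relaxed);
            let contended = self.contention.load(Ordering::Relaxed) != 0;
            self.unlock_spin();
            if contended {
                // Only touch the OS-level primitive when someone is actually waiting.
                let mut posts = self.sem.0.lock().unwrap();
                *posts += 1;
                self.sem.1.notify_one();
            }
        }
    }

The uncontended acquire and release never touch the Mutex or Condvar, which is the whole point of the design.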

This is just an idea at this stage. There was no working wifi on my train home today so I haven't done any decent research on this, and I haven't done any empirical testing yet. Also I'm not a multithreading expert, so if there's a glaring error in the idea or in the implementation then I'd be very pleased if somebody could point it out - thanks!

Adding atomic CAS instruction support to Factor's compiler

I said in the last post that I'd write a bit about adding a new machine instruction to the factor compiler once I'd got round to actually doing it, so here it is:

If you've been following my blog you'll know that I wanted to utilise multiple cpu cores for a personal database project. Unfortunately Factor doesn't have an os-threaded runtime yet and so in order to work around this I modified the factor vm code to allow multiple vms to run simultaneously on separate threads.

I'm now writing a concurrent queue so that messages can be passed internally between the running vms. To implement my queue I wanted some fast atomic primitives so I set about adding a Compare-and-swap (CAS) instruction to Factor's compiler.

To implement CAS on x86 you basically need the CMPXCHG and LOCK instructions, so my first job was to get these added to Factor's x86 assembler DSL. I located the machine-code opcodes in the intel manuals and added the CMPXCHG and LOCK instructions to the x86 assembler vocabulary thus:

basis/cpu/x86/assembler/assembler.factor:

: CMPXCHG ( dst src -- ) { HEX: 0f HEX: B1 } 2-operand ;

: LOCK ( -- ) HEX: f0 , ;

With the new x86 instructions in place I was all set to add a new low-level IR 'virtual' instruction to factor's compiler. There are basically 3 steps to this:

  1. Declare the new low-level IR instruction along with the number and types of registers it requires
  2. Tell the code generator where to dispatch the calls to generate the 'real' cpu machine code for the instruction
  3. Write methods that emit both X86 and PPC versions of the machine code for the instruction.

I'll explain each step below:

Step one: Declaring the new instruction

Thanks to the recent addition of a low-level instruction DSL, adding a new instruction to the compiler backend is just a case of declaring the instruction name and the argument registers it requires:

basis/compiler/cfg/instructions/instructions.factor:

INSN: ##atomic-compare-exchange
      def: dst/int-rep
      use: ptr/int-rep old/int-rep new/int-rep
      temp: temp/int-rep ;

##atomic-compare-exchange is my new virtual CAS instruction that receives 3 input arguments: a pointer to a word in memory, an expected 'old' value and a new value.

The implementation of ##atomic-compare-exchange will do the following: compare the value pointed to by the ptr register with the value in old and, if they're equal, replace it with the value in new; otherwise leave it as it is. Finally, put the value read from memory into the destination register dst. In case you haven't guessed, this is pretty much exactly what CMPXCHG does on x86.
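For anyone more familiar with Rust's std atomics, the semantics are the same as this little function (my analogy, not part of Factor):

    use std::sync::atomic::{AtomicUsize, Ordering};

    // Compare *ptr with `old`; store `new` if they match; either way, return
    // the value that was actually read - that's what lands in dst.
    fn atomic_compare_exchange(ptr: &AtomicUsize, old: usize, new: usize) -> usize {
        match ptr.compare_exchange(old, new, Ordering::SeqCst, Ordering::SeqCst) {
            Ok(prev) => prev,  // swap happened (prev == old)
            Err(prev) => prev, // no swap; prev is the current value
        }
    }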

At compile time Factor's register allocator allocates the real (e.g. x86) cpu registers and passes them to our code generator.

Unfortunately, as an added complication in this particular case, the x86 instruction CMPXCHG uses the EAX/RAX register as an implicit argument, and Factor's code generation doesn't yet support tying particular cpu registers to parameters (though Slava assures me it will soon). To work around this I'm asking the compiler to pass me an extra 'temp' register so we can use that if any of the others happens to be EAX/RAX.

With the low-level instruction declared I now want to get some interactive testing going, so I write a high level word called 'compare-swap' and a compiler intrinsic which uses the ##atomic-compare-exchange instruction. (See the previous post for an overview of factor compiler intrinsics). The point of this is that we can dump out and check the low level IR instructions emitted by the compiler. Here's the compare-swap word and compiler intrinsic:

: compare-swap ( ptr old new -- ? )
    2drop "compare-swap needs to be compiled" throw ;

: emit-compare-swap ( node -- )
    drop
    3inputs
    ^^atomic-compare-exchange
    ds-push ;

\ compare-swap [ emit-compare-swap ] "intrinsic" set-word-prop

Now we can use the 'test-mr.' debugger word to see the new instruction in action. We'll just pass in a bunch of junk arguments for now so we can see what the low-level MR instructions look like in context:

( scratchpad ) USE: compiler.cfg.debugger
( scratchpad ) [ ALIEN: 20 1 2 compare-swap ] test-mr.
=== word: ( gensym ), label: ( gensym )

_label 0 
_label 1 
##load-reference RAX ALIEN: 20 
##load-immediate RCX 8                            ! 1 << 3
##load-immediate RDX 16                           ! 2 << 3
##atomic-compare-exchange RAX RAX RCX RDX RBX     ! Oops! - RAX allocated as a register
##inc-d 1 
##replace RAX D 0 
_label 2 
##return 
_spill-area-size 0

In this example you can see that the register allocator has allocated RAX as both the destination register and one of the input registers, so our X86 implementation of ##atomic-compare-exchange will need to work around that.

Step two: Wiring up the code generator

Ok, now that we have low level IR working the next step is to tell the compiler how to generate the real cpu machine-code for the new instruction. There's a convention that all machine code emitting words start with a '%' so I'm going to create a generic word %atomic-compare-exchange with method implementations for each CPU. Here's the generic word declaration:

/basis/cpu/architecture/architecture.factor:

HOOK: %atomic-compare-exchange cpu ( dst cptr old new temp -- )

N.B. Factor has a generic dispatch mechanism called 'HOOK:' which dispatches polymorphically based on the value of a variable at compile time. In this case it's the cpu variable, which is set to a singleton representing the target architecture (x86.32, x86.64, ppc), so essentially this generic word is polymorphic based on CPU architecture.

Now I tell the compiler to use this generic word for code generation using the CODEGEN: DSL word (again, see Slava's post on the new compiler DSL words):

/basis/compiler/codegen/codegen.factor:

CODEGEN: ##atomic-compare-exchange %atomic-compare-exchange

Step three: Doing the x86 code generation

All that's left now is to implement %atomic-compare-exchange for our cpu architectures. Below is my method implementation for x86. To make the example more straightforward I've omitted the code that works around the implicit EAX/RAX register, which I abstracted into a 'with-protected-accumulator' combinator.

/basis/cpu/x86/x86.factor:

:: (%atomic-compare-exchange) ( dst cptr old new -- )
    accumulator-reg old MOV
    LOCK
    cptr [] new CMPXCHG
    dst accumulator-reg MOV ;

! CMPXCHG implicitly uses EAX/RAX (accumulator) so need to remove
! EAX from arguments and protect it from being stomped
M: x86 %atomic-compare-exchange ( dst cptr old new temp -- )
    [ (%atomic-compare-exchange) ] with-protected-accumulator ;

The (%atomic-compare-exchange) word contains the actual machine code generation: You can see I simply output 4 lines of assembler using the factor x86 assembler DSL and the registers passed to me by the compiler. (N.B. 'accumulator-reg' is my helper word that returns EAX or RAX depending on whether the arch is 32 or 64 bit)

Now that the x86 implementation is written we can check the output machine code with the disassemble word (which uses either the Udis86 library or GDB under the hood to do the disassembling):

( scratchpad ) [ ALIEN: 20 1 2 compare-swap ] disassemble
00007f85690413a0: 48b8fe3ac566857f0000  mov rax, 0x7f8566c53afe
00007f85690413aa: 48b90800000000000000  mov rcx, 0x8
00007f85690413b4: 48ba1000000000000000  mov rdx, 0x10
00007f85690413be: 4889c3                mov rbx, rax
00007f85690413c1: 4889c8                mov rax, rcx
00007f85690413c4: f0480fb113            lock cmpxchg [rbx], rdx
00007f85690413c9: 4889c0                mov rax, rax
00007f85690413cc: 4983c608              add r14, 0x8
00007f85690413d0: 498906                mov [r14], rax
00007f85690413d3: c3                    ret

The disassembler output verifies that the cmpxchg instruction is being compiled correctly. You can also see that I'm doing some juggling with the rax register to manage using it as an implicit argument to cmpxchg.

Hopefully that gives a good overview of how to get new low-level instructions added to Factor's compiler, and also illustrates how machine-code generation works in Factor.

Hand-coding multi-platform assembler using Factor compiler intrinsics

Disclaimer: I'm not a Factor compiler expert and am just getting to grips with compiler intrinsics so some of this might be a bit iffy.

'Compiler Intrinsics' is a mechanism by which you can insert your low-level implementation of a subroutine into the compiler output. This is useful in a couple of scenarios:

  • if the compiler doesn't support the desired functionality - e.g. it does something hardwarey that Factor can't do yet
  • if the subroutine is performance critical and the compiler isn't generating the most efficient code

The old way of doing compiler intrinsics in Factor was to hand-code some assembler using one of Factor's assembler DSLs (PPC or X86) and then attach it to an existing word as a word-property along with an argument type pattern. When the compiler compiled calls to the word it would compare the input parameters to the pattern and on match would insert the assembler directly into the generated code.

Since my last post about Factor's compiler over a year ago Slava has pretty much re-written the whole thing. It now has two intermediate stages:

The first, frontend stage transforms the factor code into an intermediate representation called 'high level IR'. This is basically a decomposition of factor code into primitive word-calls and control nodes through various optimization passes. It is very similar to the dataflow IR in the original Factor compiler that I described in the previous blog post.

The second backend stage is the new bit. It converts the high-level IR into low-level IR, which is basically a platform independent assembler language. An optimization stage then runs and cpu registers are allocated resulting in 'machine IR' (abbreviated to 'MR' in the debug tools). The real machine code generation is then done from this MR.

The new way of doing compiler intrinsics allows you to insert low-level IR code at the beginning of the 'backend' stage. Differences to the old way include:

  • You now code using the platform independent instructions defined in compiler.cfg.instructions
  • Instructions operate on virtual registers, of which there's an unlimited supply
  • Subroutine arguments don't appear in registers. Instead you manually insert code to get them in and out of the data stack using ds-push, ds-pop
  • You still have to box and unbox values manually (just as before)
  • There's an optimization stage that runs after you've emitted the low level IR instructions from your compiler intrinsic

As a really simple example here's a word which is going to add 35 to the fixnum on the top of the stack and push the result. To make sure that we're executing the intrinsic assembler I'll give it a default implementation that throws an error.

: add-35 ( n -- n' ) 
    drop "shouldn't call this" throw  ;

Incidentally, here are the MR instructions generated from this default implementation:

( scratchpad ) USE: compiler.cfg.debugger
( scratchpad ) \ add-35 test-mr.
=== word: add-35, label: add-35

_label 0 
_prologue T{ stack-frame { total-size 32 } } 
_label 1 
##load-reference RAX "shouldn't call this" 
##replace RAX D 0 
_label 2 
##call M\ object throw 
_label 3 
##no-tco 
_spill-area-size 0

A couple of things to notice:

  • The instructions are prefixed with ##. E.g. ##load-reference, ##replace

  • This MR output is displayed after cpu register allocation has been done: RAX is an x86.64 register. Also D is a pseudo-register that points to the data stack. If you look at the disassembled machine code (just below the callstack juggling) you can see that D actually becomes R14:

( scratchpad ) \ add-35 disassemble
00007f6d98780ce0: 49b8e00c78986d7f0000  mov r8, 0x7f6d98780ce0 (add-35)
00007f6d98780cea: 6820000000            push dword 0x20
00007f6d98780cef: 4150                  push r8
00007f6d98780cf1: 4883ec08              sub rsp, 0x8
00007f6d98780cf5: 48b8e6866ca76d7f0000  mov rax, 0x7f6da76c86e6
00007f6d98780cff: 498906                mov [r14], rax
00007f6d98780d02: e859a385ff            call 0x7f6d97fdb060

Ok, so instead of an implementation that throws an error I want to insert my own instructions into the output. I can do this by attaching some low-level-IR emitting code to the word using the "intrinsic" word property:

: emit-add-35 ( node -- )
    drop              ! don't need to inspect the compiler node
    ds-pop            ! insert instruction to pop value off the stack
    ^^untag-fixnum    ! insert code to untag the value in the register
    35 ^^add-imm      ! insert instruction to add 35 to it (add-imm = add immediate)
    ^^tag-fixnum      ! insert code to tag the result
    ds-push ;         ! insert code to push the result onto the data stack

\ add-35 [ emit-add-35 ] "intrinsic" set-word-prop

The emit-add-35 word just pops a value off the stack, untags (unboxes) it, adds 35 to it and tags the result. A couple of points:

  • 'Hats' - The ^^ form of an instruction is the same as the ## form, except that after emitting the instruction the ^^ form returns the (new) destination register so that it can be used by the next instruction.

  • 'tag/untag' - Factor aligns all its heap data to the nearest 8 byte boundary, which leaves the bottom 3 bits of each pointer free for runtime type identification (RTTI). These 3 RTTI bits are called the 'tag', and in the case of a fixnum the tag is '000' and the other bits store the actual value rather than a pointer to the value. So instead of unboxing fixnums we simply untag them, which equates to shifting them 3 bits to the right. (There's a toy illustration just after this list.)

  • node parameter - You'll notice that the emit-add-35 word takes a node parameter. This parameter is a structure passed by the compiler and contains information about the inferred types and value-ranges of the arguments at compile time. This is handy if you're dispatching based on type or you want to decide whether to include overflow logic. In this example I'm doing neither, so I discard it.
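The toy illustration of the tagging scheme (written in Rust purely for illustration; it's not Factor code):

    // Fixnums carry their value in the upper bits; the low 3 RTTI bits are
    // the tag, which for fixnums is 000.
    fn tag_fixnum(n: i64) -> i64 {
        n << 3 // corresponds to the ##shl-imm in the listing below
    }

    fn untag_fixnum(tagged: i64) -> i64 {
        tagged >> 3 // arithmetic shift, matching ##sar-imm
    }

    fn main() {
        assert_eq!(tag_fixnum(35 + 4), 312); // the constant-folded 312 seen later
        assert_eq!(untag_fixnum(312), 39);
    }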

Now that the add-35 word has a compiler intrinsic we can see the emitted code by compiling it within a quotation (code-block) and displaying the MR:

( scratchpad ) [ add-35 ] test-mr.
=== word: ( gensym ), label: ( gensym )

_label 0 
_label 1 
##peek RAX D 0                     ! - load value from stack
##sar-imm RAX RAX 3                ! - untag
##add-imm RAX RAX 35               ! - add 35
##shl-imm RAX RAX 3                ! - tag
##replace RAX D 0                  ! - replace top stack elem with result
_label 2 
##return 
_spill-area-size 0

I've annotated this output but you could probably guess what it was doing anyway.

I mentioned earlier that a backend optimizer stage runs after the intrinsic word is called. To illustrate this here's a compilation of the add-35 word with a supplied constant argument:

( scratchpad ) [ 4 add-35 ] test-mr.
=== word: ( gensym ), label: ( gensym )

_label 0 
_label 1 
##load-immediate RAX 312 
##inc-d 1 
##replace RAX D 0 
_label 2 
##return 
_spill-area-size 0

You can see that the Factor compiler dispensed with our hand-coded add instruction and instead just stuck the fixnum-tagged result in the RAX register. It did this because it could perform the evaluation and boxing at compile time. ( 312 = (35 + 4)<<3 ). Here's the resulting X86 assembler:

( scratchpad ) [ 4 add-35 ] disassemble
00007feac680e0c0: 48b83801000000000000  mov rax, 0x138
00007feac680e0ca: 4983c608              add r14, 0x8
00007feac680e0ce: 498906                mov [r14], rax
00007feac680e0d1: c3                    ret

So that leaves the question: How do I code actual X86 assembler into a subroutine?

To do that you need to create a new low-level instruction tuple and emit your X86 assembler from a generate-insn method on that instruction. This is a lot easier than it sounds thanks to the INSN: and CODEGEN: words.

I've got to add some CAS instructions soon so I'll probably write a bit about it then.

Making a C codebase reentrant by turning it into a big C++ object

Over the last couple of months I've been spending my spare time working at making the factor vm codebase reentrant. The Factor vm is a mostly C codebase with global variables holding the runtime state, and I wanted to be able to run multiple vms in a single process. I thought I'd document the approach I used to make the C portions reentrant because it's one of those things that's obvious and easy in hindsight but took me a few abortive attempts and some wasted time to find the best way.

The output of this process is one big vm object with all the global variables and functions in it. I originally spent some time trying to refactor the vm codebase into an OO model, but this turned out to be a really subjective operation and I ended up thinking I'd do more harm than good attempting it. Ultimately I opted for the one-big-vm-object approach, with the proviso that it can be refactored into an object model later if that's deemed a good idea.

Anyway, here's the recipe for putting all the variables and functions into the object. The point of the technique is to keep a working, running build at every step:

  1. create an empty vm class and singleton instance
  2. move the functions into the class one by one, leaving a forwarding function (The forwarding function calls the method through the singleton pointer, meaning all the existing refs to the function still work)
  3. once all the functions are converted to methods, remove the forwarding functions
  4. then move the global variables into the class
  5. finally remove the singleton pointer

The reason for moving the variables at the end is that once the functions are in the object it doesn't matter whether the variables are local to the object or global: the code referring to them in the functions doesn't change. This means you can incrementally move variables in and out (for testing) and everything builds OK at each step.

I should mention that it really helps if you've got a good editor with macro support. I wielded emacs' macro and register features to pretty much automate the whole thing, which is a godsend if you've only got about an hour a night to spend on hacking. (I have kids).

Obviously there was a lot more to making the vm reentrant than simply putting all the C code in an object, but doing that really broke the back of the work and motivated me to press on with the assembler and compiler changes. Hopefully I'll get around to writing something about the vm internals soon.