Idea for a global interpreter lock optimized for low contention

Slava is thinking about adding a global interpreter lock to Factor, much like the Python GIL, as a step on the path to a full multithreaded VM. This would allow factor code to run while blocking FFI calls (e.g. C library calls) execute. As part of this each FFI call would need to release the lock before calling into C code and obtain it again afterwards before returning to Factor.

One of the problems with adding a GIL is that it penalizes the common single-threaded case. This got me thinking about how a mutex lock could be implemented that optimized performance for the low contention case so as to minimize the performance impact to single threaded apps. Here's the best approach I've come up with so far:

You need:

a spinlock (implemented inline via cpu primitives)
a boolean to represent the GIL
an int to represent contention on the GIL
an OS semaphore (win32 semaphore or pthread condition)

The goal is to arrive at a situation where if there is no contention then FFI calls just use inline assembler primitives to acquire/release the spinlock and flip the GIL, otherwise they fall back on the additional overhead of library calls to an OS semaphore.

Acquiring the Lock

The idea is that if there's no contention then acquiring the lock is just a case of obtaining the spinlock and flipping the GIL boolean.

(code is half-baked pseudocode, sorry!)

- acquire spinlock
- if GIL == false
   - GIL=true     // acquire GIL
   - release spinlock
   DONE!
- else:
   - increment contention-counter
   - release spinlock
   - goto LOOP

LOOP:                 // contention counter has been incremented by this thread
- acquire spinlock
- if GIL == false
   - GIL=true     // acquire GIL
   - decrement contention-counter
   - release spinlock
   DONE!
- else
   - release spinlock
   - wait on semaphore
   - goto LOOP

Releasing the lock

- acquire spinlock
- release GIL
- read contention counter
- release spinlock
- if contention counter is non-zero:
     notify semaphore

This is just an idea at this stage. There was no working wifi on my train home today so haven't done any decent research on this, and I haven't done any empirical testing yet. Also I'm not a multithreading expert so if there's a glaring error in the idea or in the implementation then I'd be very pleased if somebody could point it out - thanks!