Continuing on from yesterday evening, I had a bit of time tonight in front of the telly, so I implemented the rest of the fast-search functionality in Factor using the assembler code from the previous post.

The first step was to create the simple bloom filter to send to the assembler code. To keep the assembler fast and simple I'm just setting one bit for each element in the input sequence, and I hardcoded the size of the filter for simplicity (more on that later).

Factor has a bit-array type that backs onto a block of memory, so writing 'prepare-filter' was easy:


: set-bit ( n filter -- ) t -rot set-nth ; inline

! for each element, set bit (element mod filter size) in the filter
: (prepare-filter) ( seq filter -- )
  [ [ 1048576 mod ] dip set-bit ] curry each ;

: prepare-filter ( seq -- filter )
  1048576 <bit-array> [ (prepare-filter) ] keep ;
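
As a quick sanity check, usage looks something like this (the input values here are just made up for illustration; 1048579 wraps around to the same bit as 3):

{ 3 17 1048579 } prepare-filter    ! 1048576-bit bit-array with bits 3 and 17 set
17 swap nth .                      ! prints t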

I also needed some code to filter out false positives found by the assembler filter. To do this I pre-create a hash from the input set and then pass each result to a 'collect-element-if-correct' word:


: seq>hash ( seq -- hash ) [ dup ] H{ } map>assoc ; inline

! itemshash is a variable holding the hash built from the input set
: check-elt ( ptr i -- v/f )
  alien-unsigned-cell itemshash get at ;

! collects { offset value } pairs; , requires an enclosing make
: collect-element-if-correct ( alien i -- )
  tuck check-elt [ 2array , ] [ drop ] if* ;
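
For completeness, here's a rough sketch of how these pieces could hang together. The word 'collect-matches', its stack effect and the USING: line are my own assumptions rather than code from this post; the point is just that the hash goes into the itemshash variable and the { offset value } pairs get collected inside a make:

USING: kernel make namespaces sequences ;

SYMBOL: itemshash    ! assumed: holds the hash built by seq>hash

! offsets: candidate byte offsets reported by the assembler filter (assumed)
: collect-matches ( offsets alien items -- matches )
  seq>hash itemshash set
  [ swap [ collect-element-if-correct ] with each ] { } make ;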

Incidentally, somebody left a comment on my last post saying that the bt (bit-test) instruction is a performance hog when testing directly against memory, so I modified the code to load the relevant dword of the filter into a register and bt against that instead. He was right: on the simple test I got a speedup from 170ms to around 105ms! I replaced the single bt instruction from the last post with the following:


  tmpreg "val" operand MOV
  tmpreg 5 SHR                             ! tmp = dword index into the filter (val / 32)
  tmpreg 2 SHL                             ! scale that up to a byte offset (index * 4)
  tmpreg "filter" operand tmpreg [+] MOV   ! load the relevant filter dword into tmp
  tmpreg "val" operand BT                  ! test bit (val mod 32) of the loaded dword

which disassembles to the following (AT&T syntax):


mov    %edx,%edi
shr    $0x5,%edi
shl    $0x2,%edi
mov    (%eax,%edi,1),%edi
bt     %edx,%edi
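
At a higher level the filter probe amounts to the usual bloom-filter membership test, which against the Factor bit-array would look something like this (the word name is mine, for illustration):

! t means the value *might* be in the set; f means it definitely isn't
: filter-member? ( val filter -- ? )
  [ 1048576 mod ] dip nth ;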

I'd actually intended to use bit shifts to eliminate the BT instruction altogether, but it seems a variable shift count has to live in the CL register in x86 assembler, so I can't just shift by whichever register the value happens to be in. There's probably another way - are there any x86 assembler experts out there with hints?

The only other performance-affecting variable was the size of the bit-array used as the filter. This was where I was expecting to come unstuck: I sometimes have thousands of elements in the input set, so I'd need a pretty big bit-array to keep the number of false positives low enough to get good performance. On the other hand, I expected a large bit-array would hammer the CPU caches (as access is pretty random) and kill performance.
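
To put rough numbers on that (back of envelope, assuming the elements land on distinct bits): with one bit set per element, the chance that a random non-member probe hits a set bit is roughly n/m, so the ~2800-element set used in the timings below stays well under 1% false positives with a 1Mbit filter:

2836 1048576 /f .    ! ~0.0027, i.e. about 0.27% of probes are false positives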

As it happens, the performance with a 1Kbit filter seems to be about the same as with one 4Mbits wide (i.e. half a megabyte). I'm not sure I fully understand why this is, as I'm pretty sure the L1 data cache on a Core 2 Duo is 32KB. Probably all the action takes place in the L2 cache.

Anyway, I'm getting ~107ms to check an entire 48MB array (12M elements) for matches against a set of 2836 elements using a 1Mbit filter on a CPU pegged to 800MHz. At 1800MHz it takes about 48ms. For my requirements that pretty much validates the idea and the approach.