Published Oct 11 2007 by Phil Dawes
I've been playing with Factor for a couple of weeks. I'm finding that it takes me quite a bit longer to write stuff with factor than with other languages, but the process is enjoyable and I get the feeling that I'm learning something useful each time. The question is: will programming speed improve with experience. And more to the point, eventually, will I be able to write code faster in factor than in gambit scheme?
Here's a task I've been trying to accomplish the past couple of evenings: converting tabular data into triples. I've written this functionality before in both python and scheme before so it's a good exercise for developing factor experience.
First, here's the code in python:
def row_to_triples(i,cols,row):
return [ (i,col,cell) for col,cell in zip(cols,row) ]
def tabular_to_triples(startid,cols,rows):
triples = []
for i,row in zip(range(startid,startid+len(rows)),rows):
triples += row_to_triples(i, cols, row)
return triples
Which works like this:
>> print tabular_to_triples(0,
["col1","col2","col3"],
[["a","b","c"],["e","f","g"]])
[(0, 'col1', 'a'), (0, 'col2', 'b'), (0, 'col3', 'c'), (1, 'col1', 'e'), (1, 'col2', 'f'), (1, 'col3', 'g')]
Here's the equivalent in gambit scheme using srfi-42 eager comprehensions:
(include "srfi-42.scm")
(define (tabular->triples startid rows cols)
(define (row->triples id cols row)
(list-ec (:parallel (:list i cols) (:list j row))
(list id i j)))
(append-ec (:list row (index i) rows)
(row->triples (+ i startid) cols row)))
Which yields:
>> (tabular->triples 0
'((a b c)(d e f))
'(col1 col2 col3))
((0 col1 a) (0 col2 b) (0 col3 c) (1 col1 d) (1 col2 e) (1 col3 f))
ASIDE: If you're evaluating gambit scheme (or scheme in general) be sure to check out srfi-42 and also Alex Shinn's ml-style pattern matching module from http://synthcode.com/scheme/ . Gambit scheme out-of-the-box is like abstraction assembly language - you need tools layered over the top for real world use.
Ok, so on to factor. My first attempt was:
: row>triples ( cols row rowid -- triples )
[ -rot 3array ] curry 2map ;
: process-row ( rowid cols row -- rowid cols triples )
rot 1 + swapd ! inc rownumber
[ swapd row>triples ] 2keep swap rot ;
: tabular>triples ( start-rowid cols rows -- triples )
[ process-row ] map concat 2nip ;
Which works like this:
>> { "c1" "c2" "c3" } { { "a" "b" "c" } { "d" "e" "f" } } 0 tabular>triples
>> .
{
{ 1 "c1" "a" }
{ 1 "c2" "b" }
{ 1 "c3" "c" }
{ 2 "c1" "d" }
{ 2 "c2" "e" }
{ 2 "c3" "f" }
}
This is reasonably compact, but looks horrible to me and is pretty difficult to follow due to all the stack shuffling. I'm starting to learn that the way to eliminate stack shuffling is to use combinators, and try to factor out the common code.
I did a bit of functional de-composition with the hope that creating words for the pieces would yield clearer code. The pieces of functionality needed were:
- creating a triple from 3 elements on the stack. Handled by 3array
- enumerating a variable whilst mapping through a sequence
- holding a variable whilst mapping through a sequence. Handled by 'curry map'
- mapping two sequences in parallel (columns and rows). Handled by 2map
- concatenating sequences (rows) of triples together: concat
Actually once I'd worked out the which words I wanted I found that a most of them already existed. In fact all of them except 'enumerating a variable whilst mapping through a sequence', so I wrote 'map-with-counter' to provide this. It prepends enumerating code to the map quotation before calling map, then cleans up the index variable at the end:
: map-with-counter ( start seq quot -- newseq )
[ [ dup 1+ swap ] dip ] swap compose map nip ;
And so few attempts later and I've got:
: row>triples ( rowid cols row -- triples )
[ 3array ] curry** 2map ;
: tabular>triples ( start-rowid cols rows -- triples )
[ row>triples ] curry* map-with-counter concat ;
Which I'm quite pleased with. However in addition to map-with-counter I've had to create a new partial application combinator: curry, which is general but doesn't exist in the standard library. A sure sign that I'm doing something wrong, or at least differently to other factor developers.
And while I was writing curry I ended up creating same new stack shufflers:
: rotd ( a b c d -- b c a d ) >r rot r> ;
: -rotd ( a b c d -- c a b d ) >r -rot r> ;
: dupdd ( a b c -- a a b c ) >r dupd r> ;
! partial application of quot based on 3rd item of stack
! see curry, curry*
: curry** ( param obj obj quot -- obj obj curry )
rotd [ -rotd call ] 2curry ; inline
The underlying theme to these extra words is dealing with the 3rd element in the stack and below. So does that mean that seasoned factor developers tend to switch to some other mechanism when there's more than 3 stack variables in play? Sounds like a question for the mailing list...
[ Tags:
General
factor
programming
]
Published Oct 02 2007 by Phil Dawes
I got round to writing my first factor module tonight: a csv parser.
Actually there's already a csv parser written in factor by Daniel Ehrenberg, but it's been removed from the latest factor releases. I found a copy of that code here but I don't know for how long it'll stay there.
Unfortunately I had two problems with this existing module: The first was that it ran pretty slowly (1M of csv took ~5 seconds to parse on my laptop) mainly because my copy of factor wouldn't compile the state-parser module that it depends on so it ran un-optimized. The second was that I needed a parser that could parse a row at a time for reading huge csv files in chunks. I took that as an opportunity to write my own.
The code performs ok (~500ms for 1M csv) and parses all the examples on the wikipedia csv page, but I can't help feeling that I've written it in a similar style to what I would have done if I were using scheme. If anybody has any hints on ways to make the code smaller or faster or more elegant then I'd be delighted.
USING: kernel sequences io namespaces combinators ;
IN: csvparser
DEFER: quoted-field
: not-quoted-field ( -- endchar )
",\"\n\s\t" read-until #! "
dup
{ { CHAR: \s [ drop % not-quoted-field ] } ! skip whitespace
{ CHAR: \t [ drop % not-quoted-field ] }
{ CHAR: , [ swap % ] }
{ CHAR: " [ drop drop quoted-field ] } ! "
{ CHAR: \n [ swap % ] }
{ f [ swap % ] } ! eof
} case ;
: maybe-escaped-quote ( -- endchar )
read1
dup
{ { CHAR: " [ , quoted-field ] } ! " is an escaped quote
{ CHAR: \s [ drop not-quoted-field ] }
{ CHAR: \t [ drop not-quoted-field ] }
[ drop ]
} case ;
: quoted-field ( -- endchar )
"\"" read-until ! "
drop % maybe-escaped-quote ;
: field ( -- string sep )
[ not-quoted-field ] "" make swap ;
: (row) ( -- sep )
field swap ,
dup CHAR: , = [ drop (row) ] when ;
: row ( -- array[string] eof? )
[ (row) ] { } make swap ;
: (csv) ( -- )
row swap , [ (csv) ] when ;
: csv-row ( stream -- row )
[ row drop ] with-stream ;
: csv ( stream -- rows )
[ [ (csv) ] { } make ] with-stream ;
If anybody's interested the module (inc tests and doc) is here.
[ Tags:
factor
programming
]
Published Sep 28 2007 by Phil Dawes
Last year I embarked on somwhat of a journey to find a better language for my home projects after getting a bit frustrated by python's lack of blocks and general cruftyness. After a couple of months of trying various different things I settled on Gambit Scheme for my spare-time data indexing project. A minimal core language with uniform syntax and macros. Lots of potentual for adapting and building language features that make sense to me. Sorted.
Then last week I got round to reading Richard Jones's minimal forth code. A language compiler and runtime in a couple of pages of well documented X86 assembler; small enough to read and understand the whole system in a bus journey.
Like Scheme, Forth has the ability to construct new language features, allowing user code to get between the parser and the evaluator to modify the language itself. Also, like Scheme, it has a uniform syntax - words separated by spaces, which makes re-writing code on the fly practical. Both these features mean that a fully fledged programming environment can be bootstrapped up from a minimal core. And the hook is: that minimal core is so much smaller for Forth than it is for Scheme.
One thing lead to another and I got interested in stack languages. Now I'm looking at Factor and wondering...
Factor has ticks in all the right boxes: minimal core, machine code compiler, macros, continuations, lightweight threads and message passing concurrency. It's also the tersest language I've seen - code to do something always seems a fraction of the size I expect it to be. On the other hand it's so radically different to anything else I've programmed in and it makes my head hurt. Could it be a contender to knock Scheme off the top spot?
[ Tags:
factor
forth
programming
scheme
workfriendly
]
Published Sep 17 2007 by Phil Dawes
JQuery is a javascript library providing a browser API in the functional style. In particular it supplies a CSS style selector language which makes the code for manipulating DOM objects simple and terse.
Simon Willison does a great job of introducing the library and explaining why it's so cool. For a taster, here's jquery line which loads html into a div with id 'intro'.
$('div#intro').load('/some/fragment.html');
In particular I found that using CSS selectors to locate and manipulate DOM objects coupled with jquery's terseness meant it was just as easy to attach events to browser from the javascript code than by adding 'onclick' handlers in the HTML. This nicely seperates the HTML from the javascript and lets you think about how the page will behave on devices without AJAX and javascript capability.
[ Tags:
ajax
browser
javascript
workfriendly
]
Published Jul 30 2007 by Phil Dawes
I'm trying to work out if it's possible to get the index searching performance I want using a disk-backed store, or whether I need to focus on optimising the indexes to fit totally in memory. The problem is that the optimisation strategies are somewhat different:
Storing indexes on disk:
- Increase redundancy, trading space for better locality of reference.
Storing indexes in memory:
- Reduce redundancy, trading locality of reference for fitting more data in memory.
Of course reference locality for caching does have an effect in memory as well as disk, but it's more a factor of ~10 (i.e. ~10 times slower to fetch a word from main memory than from l1 cache) whereas a disk seek compared to a sequential read is more like a factor of ~100000.
Typical server machines seem to have the order of ~100 times more disk than memory so whatever happens the redundancy increase needs to be less than a factor of 100 or there's little point.
Sequential disk reads are ~10 times slower than random access memory reads (cache misses) and ~100 times slower than sequential (L1) cache reads, so even if there are no seeks this is still going to be 10-100 times slower. But the payoff is potentually a lot more indexed data per server node.
N.B. these numbers are very approximate orders of magnitude. Assumptions are from the below links, along with my own measurements using hdparm, bonnie++, seeker.c and a some of my own code. Please let me know if I'm wildly out with any of these!
http://www.linuxinsight.com/howfastisyourdisk.html
http://homepage.virgin.net/roy.longbottom/randmem%20results.htm
[ Tags:
semantic-web
tagtriples
]
Published Jul 26 2007 by Phil Dawes
The new triplestore is coming along. It can do substring text searches (using a suffix array) and has a basic relational query engine. It doesn't optimise the query plans yet, but if you enter the queries in a good order (most selective clauses first) then you get good performance.
A few things have changed in my thinking since my last post about indexing. Although using hashes to represent tokens is really useful when joining datasets from different nodes in a cluster (no coordination overhead), I'm now thinking that they're not such a good idea for when laying out the actual triple indexes in memory (or on disk).
My reasoning is:
In order to get the performance I want (100 row results from relational queries in ~half second) I'm either going to have to keep the entire set of indexes in memory, or at the very least minimise the disk activity to a small number of sequential reads. Disk seeks are the order of ~10ms so 50 of them and I'm shot. If I end up aiming for the all-in-memory approach then I want to cram as many triples in to memory as possible. If I do use disk reads then locality of reference will be fundamental to the layout of the indexes.
Either way, I'm going to need to use compression on the indexes to achieve optimal storage or read efficiency. The problem with hashes is that they introduce a lot of randomness into the mix which reduces the ability to do delta compression (and then run length encoding of the deltas). I suspect that controlling allocation of identifiers could also be very useful in optimising locality of reference. All this is theoretical at the moment as I haven't actually implemented any index compression, but I hope to do this soon.
[ Tags:
data
indexing
semantic-web
tagtriples
]
Published Jul 25 2007 by Phil Dawes
Gambit scheme lacks a profiler that can profile scheme with embedded C code. (There's statprof, but unfortunately it doesn't profile embeded C). I needed to do this pretty desperately for my triple indexing stuff so I've written a simple macro which takes wall clock timings of functions and accumulates them.
You replace 'define' with 'define-timed' in the functions you want profiled, and then the time spent in each function is accumulated in a global hash table. It's not pretty, but it's simple.
The macro needs some supporting code (which I keep in a seperate module so that it can be compiled in order to minimise overhead). :
(define *timerhash* (make-table))
;;; call this before running the code
(define (reset-timer) (set! *timerhash* (make-table)))
;;; adds the time spent in the thunk to an entry in the hashtable
(define (accum-time name thunk)
(let* ((timebefore (real-time))
(res (thunk)))
(table-set! *timerhash* name
(let ((current (table-ref *timerhash* name '(0 0))))
(list (+ (car current) (- (real-time) timebefore))
(+ (cadr current) 1))))
res))
;;; call this afterwards to get the times
(define (get-times)
(map (lambda (e) (list (first e)
(* 1000 (second e))
(third e)))
(sort-list (table->list *timerhash*)
(lambda (a b) (> (second a)
(second b))))))
And here's the macro. It basically just wraps the function body in a call to 'accum-time':
(define-syntax define-timed
(syntax-rules ()
((define-timed (name . args) body ...)
(define (name . args)
(accum-time 'name (lambda ()
body ...))))))
Here's some example output. The first number is the accumulated milliseconds spent in the function, and the second is the number of times it was called.
((handle-query-clause 2182.3294162750244 8)
(do-substring-search-query 1678.9898872375488 1)
(run-query 1678.9379119873047 1)
(main 1678.929090499878 1)
(handle-substr-search 705.5349349975586 2)
(convert-ids-to-symbols 400.3608226776123 1)
(handle-s_o-clause 192.81506538391113 2)
(lookup-o->s 192.52514839172363 2)
(lookup-2-lvl-index 192.48294830322266 2)
(join 163.8491153717041 8)
(hash-join 163.6500358581543 5)
(fetch-int32-results 118.37220191955566 655)
(project 100.7072925567627 3)
(concat 77.42834091186523 11793)
(handle-p->so-clause 55.348873138427734 2)
(bounds-matching-1intersect 51.54275894165039 2)
(lookup-p->2-lvl-index 50.98700523376465 2)
(search-32-32-index 42.65856742858887 94)
(cells-at-positions 35.561561584472656 11697)
(fetch-32-32-results 1.171112060546875 94)
(extract-column .2582073211669922 4)
(get-join-columns .05888938903808594 5)
(text-search-to-structured-query .032901763916015625 1)
(columns .0324249267578125 10)
(column-position .031232833862304688 8)
(rows .011920928955078125 4))
Hmmm.. I really need to get something to html pretty print the scheme code.
[ Tags:
General
gambit
programming
scheme
]
Published Jul 25 2007 by Phil Dawes
[ Tags:
General
]
Published Jul 23 2007 by Phil Dawes
I've been struggling to motivate myself recently. I spent a lot of my spare time in my twenties programming free stuff after work with no trouble at all, but now in my thirties the energy and enthusiasm seems a little harder to come by. Plus Claire and I are expecting a baby any day now and the last few months have played havoc with sleeping patterns (nothing like what's to come I'm sure!).
Still, it's definitely possible to crank code on your own projects when you're knackered, you just have to be a bit more structured about how you generate motivation and enthusiasm from a body that just wants to watch telly. Note that we haven't had the baby yet (any day now!) so all of this may be completely moot once we hit the real deal, but I figured it might be useful to somebody and I haven't blogged for a while so...
The key to programming motivation for me seems to be all about engineering a low barrier to getting started. I really enjoy spare-time software development, but if I don't get up to speed quickly then it's easy to drift off to doing other things like reading reddit or going through my rss feeds. Here's some things I do to keep my eyes on the prize:
Keep a project diary of ideas and thoughts.
I usually have a text file called 'IDEAS' in the base directory of my projects. In it I keep a diary of thoughts and ideas dated in reverse chronological order (like a blog). Sometimes I noodle in it - write questions I want answered, jot while I think through problems. It is not intended to be read by anybody except me so things are often un-finished, written in shorthand and I rarely write explanations.
The diary serves a number of important purposes in addition to being a low motivation way to get going:
- It's a handy reminder of 'where I'm at' with the project if I leave it for a few days, important for getting up to speed quickly.
- Reading it later serves to motivate myself and galvanises me into action.
- I can 'park' ideas that might be useful later
- It keeps me thinking about the bigger picture. I often change direction with a project as a result of thinking something through in my diary textfile.
Do 'test driven' development.
Test driven development is a great technique generally, but I find it especially useful when I'm too tired to be motivated to program. That also applies when the problem is too big for an obviously good solution to fit in my head and I'm exhausted just thinking about it.
When say 'test driven', I mean really driven. The implementation code written really is just enough to pass the tests. In the early stages that means hard-coded results..:
- Write an automated test. Watch it fail. Hard-code the implementation so that the test passes (e.g. hardcode the result of the implementation function). Watch it pass.
- Write a second test. Watch it fail. Hardcode the implementation so that both tests pass. (maybe by using an 'if input-1 then return this, otherwise return that')
- Maybe do a bit of tweaking/refactoring so that it's not quite so hard-coded
- Write another test ... etc..
The point is that by the time you end up evolving the meat of the functionality, you've already crystalised a lot of what the implementation needs to do and can tackle it in very small baby steps. Small chunks are good for me when I'm tired because I don't need to hold too much in my head at once. The tests passing provides continual positive feedback which is a good motivator and often that gets me out of a programming rut.
Write poorly performing code and optimize later
It's very easy for me to fall into the trap of optimising code early, especially because I'm keeping a project diary and spending considerably more time thinking about the project than actually implementing it. Unfortunately although it's fun to come up with strategies to optimise for performance, coding that way tends to make the implementation code stagnant very quickly and that makes the whole experience very demotivating. To combat this I try to get into the habit of defering performance optimsations by noting them down in the diary instead, and then just concentrate on optimizing for simplicity and code size.
N.B. It's worth bearing in mind that profiling and optimizing your code later is really good fun - I have no problem being enthusiastic about making existing code go faster, especially if I have a good set of profiling data to get my teeth into. Also I find that it's much more time efficient to build something simply and then profile and optimize later. That's really important if you're doing this on scraps of spare time. I can't emphasise this enough: much more time efficient.
Any more?
So that's it. I'd be really interested to hear of any other spare-time-programming motivation tips that people use. With the baby coming I'm going to need all the help I can get!
[ Tags:
General
programming
]
Published May 23 2007 by Phil Dawes
One of the first things I wrote when I was in the 'nesting'* phase of learning gambit scheme was a unittest DSL. Part of this was that I wanted an excuse to use r5rs syntax-rules macros, but the real motivation was that I'd been seduced by the idea of using tests for documentation ala Nat Pryce's 'Protest'. Here's what I came up with: (example):
(define-tests bus-tests
(drive-to-next-stop ; name of fn/class/symbol being tested
("takes bus to next stop"
(drive-to-next-stop bus)
(assert-equal 'next-stop (bus 'position)))
("doesn't stop off at chipshop on the way"
; test code to detect chipshop hasn't been stopped at
))
(pick-up-passengers
("picks up passengers from the bus stop"
test code )
("doesn't leave passengers at the stop"
test code )
("waits for old lady running to the stop"
test code )))
The point is that it's easy to write some lisp to traverse this code and generate documentation from it.
I found when using protest in python that the documentation angle reinforced some healthy habits: When you write tests you naturally think 'how would this look to another person?' 'how can I document the behaviour of this?' which encourages more complete testing. Also when you look at generated documentation it's easy to see which bits you aren't testing because the documentation is missing (which then encourages you to write more tests).
The implementation is a bit clunky and makes use of gambit exceptions as a way of terminating tests early because of assert failures (which is a bit rubbish). What probably should be happening is that the outer macro should be re-writing each assert as an 'if' or something to conditionally execute the rest of the test. (which would portable to other scheme implementations)
To be honest I knocked this up as fast as I could so that I could move onto writing other things (I'm developing a data aggregation and indexing tool), but the point of this post is more to convey the idea than the implementation.
That said, the crufty (currently gambit specific) implementation is here - hope this is interesting/useful to somebody.
* nesting as in 'building a nest'
[ Tags:
gambit
scheme
]