JAM*VAT is now mature enough that it handles relational operations over large amounts of aggregated structured data quickly and scalably, and also provides very fast regex text search operations (due to its inbuilt suffix array implementation).

However one area where it doesn't perform very well is in handling dates and numbers. E.g if you aggregated 10000 RSS feeds into it and then asked for posts made between 9am and 11am this morning, performance would be poor regardless of your hardware. The reason for this is that jamvat doesn't currently index symbols other than for text searching.

The best way I can think to add this capability is to augment the symbol 'string' with a double precision floating point version (where the string can be 'cast' to a numeric). Of course this numeric value won't always be precise (because it's floating point), but AFAICS that won't matter because the value is going to be used for comparitive indexing only (e.g. for > or < comparisons, and for numeric ordering).

So for example the backend SQL for a (symbol > 1000.000000000000001) comparison would be:

 WHERE  sym.numeric_value > 1000.000000000001 
 AND    CAST(sym.text_value AS DECIMAL) > 1000.000000000001

..which should allow the mysql query processor to use the index to narrow down the options (in conjunction with other indexed parts of the query) before applying more accurate numerical comparisons to a CASTed text symbol (i.e. casted to a precise fixed point decimal).

For dates I'm assuming that I'll be able to do date-to-number transformations (e.g. seconds past the epoch) prior to insertion into the database or in a query.

All this is in theory - haven't written the code yet. Can anybody see a better way? (or a problem with this approach?)