The Wide Finder project is coming to an end, as it quickly becomes a
micro-optimization race.
Unsurprisingly, the JoCaml solution is much faster than those written in
interpreted (or even JIT-compiled, like Erlang) languages. As I write this, the
last JoCaml implementation Tim Bray tested is
two times faster
than the second entry, a 345-line, multi-process (not multi-thread) Erlang
program with a specialized pattern matcher. I've made a comparable
(i.e., without regexps) JoCaml version that is about three times faster than
its Erlang counterpart (and still shorter).
I have to say that I've been pleasantly surprised by Erlang's performance,
though; the HiPE system is doing a good job. Nevertheless, it still reaches
only a third of the raw speed JoCaml achieves.
The Wide Finder implementations tested by Tim can be classified in four
groups, according to the languages used (read "languages" as "language
implementations"):
interpreted languages (Gawk, Groovy, Perl, PHP, Ruby...)
interpreted languages with built-in (i.e., fast, and usually implemented in C) functionality appropriate for the problem at hand; in this case, sublinear string search (Python)
JIT-compiled languages (Erlang; Java is glaringly missing)
compiled languages
The good thing about the last two categories is that they allow you to
implement the missing functionality yourself while still achieving good speed.
With an interpreted language, either you are lucky and there's an efficient
built-in routine that does what you need, or you have to write an extension:
the two orders of magnitude of interpretation overhead paid up-front are hard
to recover with algorithmic improvements. In other words, if you're using a
fast language, you're not stuck with the data structures and algorithms
integrated in the language; you can make your own if needed.
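For instance, the kind of sublinear string search Python gets for free can be
written directly in OCaml in a few lines. Here is a minimal
Boyer-Moore-Horspool sketch; the function name and details are mine, not taken
from any actual Wide Finder entry:

```ocaml
(* Boyer-Moore-Horspool substring search: on a mismatch it skips ahead by
   up to (String.length pat) bytes, which is what makes the scan sublinear
   on typical log-file data. *)
let horspool_find pat text =
  let m = String.length pat and n = String.length text in
  if m = 0 then Some 0
  else begin
    (* shift table: how far we may safely skip, indexed by the byte
       aligned with the last pattern position *)
    let shift = Array.make 256 m in
    for i = 0 to m - 2 do
      shift.(Char.code pat.[i]) <- m - 1 - i
    done;
    let rec scan i =
      if i > n - m then None
      else if String.sub text i m = pat then Some i
      else scan (i + shift.(Char.code text.[i + m - 1]))
    in
    scan 0
  end
```

The `String.sub` comparison allocates and could be replaced by a byte-by-byte
loop, but even this naive version shows the point: the algorithm lives in your
code, not behind the interpreter's C curtain.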
Bob's code weighs in at 800 lines after removing unneeded PCRE code (vs. 150 +
150, counting the Bigstring module, for the fastest and bulkiest JoCaml
version), uses a specialized pattern matcher with "several assumptions about
Tim Bray's example regex hardcoded in", and employs a flyweight substring
representation that precludes "chunked" mmaping (the whole file is mapped at
once)... In a few words, it's a fair amount of code that has gone through
considerable micro-optimization.
When I timed it on a dual-core 2GHz Core 2 Duo box, it was about 11% faster
than the JoCaml program. A 5-line change to the latter to use a specialized
hash table, as done by the C++ program, reversed the relation: the JoCaml
version is now 1% faster than the C++ one.
This margin is smaller than the measurement error, and, a fortiori, the
variance introduced by the platform, the environment, or the compiler.
Further micro-optimizations are possible in both the JoCaml and the C++
programs (there are a few more opportunities in the JoCaml one, though). At any
rate, it seems fair to say that the C++ and the JoCaml versions are about as
fast. (A factor of 3 is half an order of magnitude; 10% is negligible.)
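I haven't reproduced the actual 5-line patch here, but a specialized hash
table along these lines illustrates the idea: a fixed-capacity,
open-addressing counter for string keys, with the hash computed inline instead
of going through the generic `Hashtbl` machinery. The sizing, the hash
function, and the names are my own assumptions:

```ocaml
(* A sketch of a specialized counting hash table: open addressing with
   linear probing over flat arrays, no boxing beyond the key strings. *)
module Counter = struct
  let size = 1 lsl 16                 (* power of two: mask instead of mod *)
  let keys = Array.make size ""
  let counts = Array.make size 0

  (* djb2-style hash, folded into the table size as we go *)
  let hash s =
    let h = ref 5381 in
    String.iter (fun c -> h := (!h * 33 + Char.code c) land (size - 1)) s;
    !h

  (* the table is assumed large enough never to fill up *)
  let incr_key s =
    let rec probe i =
      if counts.(i) = 0 then begin keys.(i) <- s; counts.(i) <- 1 end
      else if keys.(i) = s then counts.(i) <- counts.(i) + 1
      else probe ((i + 1) land (size - 1))
    in
    probe (hash s)

  let count s =
    let rec probe i =
      if counts.(i) = 0 then 0
      else if keys.(i) = s then counts.(i)
      else probe ((i + 1) land (size - 1))
    in
    probe (hash s)
end
```

Nothing deep going on: it simply trades generality (resizing, arbitrary key
types, deletion) for fewer indirections on the hot path, which is exactly the
kind of constant-factor work this benchmark rewards.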
The main advantage of the JoCaml version is that, unlike the C++ one (based on
threads), it takes only a one-line change to make it run on multiple machines.
Some conclusions
I've drawn some conclusions from Tim's results and my reading of several Wide
Finders written in Python, Ruby, Erlang, C++ and, of course, OCaml and JoCaml:
Languages and implementations
Mentioning several programming languages is risky; tiresome verbal strife
often follows. These direct observations, however, shouldn't be controversial:
Erlang's HiPE system works fairly well
Python, Ruby and friends can perform decently if most of the time is spent in library functions written in C
the OCaml substrate of JoCaml can generate code about as fast as hand-tuned C++ in many (most?) cases*1
we cannot dismiss constant factors. JoCaml code written by an individual in an afternoon can perform better than Erlang code refined over the course of one month by Erlang experts*2 on a per-core basis, while scaling just as well.