Yesterday's goal was to beat the fastest
"Wide Finder"
log analyzer program. I have been successful; my best effort is nearly 3 times
faster than the current number 1 in Tim Bray's table
on the two-core machine I tested it on. Not bad considering that I'm
essentially fighting against highly optimized standard libraries written in C
(string searching and regexp matching).
I took
Ilmari Heikkinen's OCaml implementation
(referred to as wf-kig below) and improved it progressively, adding multicore
support using JoCaml's join-calculus in the last step (the concurrent core, a mere 12 lines, is explained below). Here follows a summary of the optimizations, but let's see the numbers first. These are my results on a
Core 2 Duo 2GHz, 2GB DDR2 667MHz machine (hot cache)*1:
program
real(s)
user(s)
sys(s)
wf-kig
1.39
1.104
0.284
wf
1.116
0.834
0.282
wf-block
0.762
0.472
0.286
wf-mmap
0.584
0.447
0.136
wf-mmap-multicore
0.367
?
?
wf-2.py
2.057
1.777
0.278
wf-6.py (2)
1.012
?
?
wf-6.py (1)
1.737
?
?
wf-6.py is the current leader in Tim
Bray's classification. It is multicore-enabled; the above table lists the times
I got with one and two processes (unsurprisingly, further processes don't make
it any faster because the test machine has got only two cores). wf-2.py is the
basic Python implementation wf-6.py was evolved from and is about the same size
as wf-kig.ml or wf.ml. My optimizations mostly mirror
those applied to wf-2.py in order
to create wf-6.py.
wf-kig.ml processes the input one line at a time, matching each one against
the "GET /ongoing/When/\d\d\dx..." regexp. The first optimization I did was
splitting that regexp matching into two parts:
find the "GET /ongoing/When/" substring in the line
try to match the remaining regexp at that position
This allows me to use a sublinear string search algorithm for the first
part. I took the Boyer-Moore implementation I had written for ICFP07,
achieving a 25% speed boost (this is "wf" in the above table). Note that
wf-2.py already uses sublinear searching, so it is equivalent to "wf" modulo
the implementation language.
Block processing
The profiler indicated nearly half of the time was spent in the function that
reads a single line from disk. I changed my code to read 50MB chunks and scan
them in one go, with some care to take the incomplete line at the end of each
chunk and copy it at the beginning of the next one. This brings a 46% speed
gain in wf-block relative to the previous version ("wf").
Avoiding IO
In order to avoid useless copying between kernel and user space, I switched to
a memory-mapped solution. I wrote a reusable Bigstring module that allows to
use mmap'ed memory as an OCaml string and adapted the Boyer-Moore string
search implementation. wf-mmap.ml itself takes less code than wf-block.ml, and
is about 30% faster.
There is still some potential for optimization in the single-core case because
I have to copy the parts of the line to be matched against the regexp to a
buffer, owing to a limitation in the interface of the regexp library.
Of course, the ultimate optimization would be to specialize the code for
the particular regexp we're using.
At this point, wf-mmap processes 200MB in 0.584s, and is nearly twice faster
on a single core than wf-6.py using two. However, Tim's machine has got 8
cores, so I went for a multicore-enable version.
Distributed programming with JoCaml
I adapted wf-mmap.ml to use several processes managed using JoCaml's join
calculus. wf-mmap-multicore processes 200MB in 0.367s when using 2 processes,
considerably faster than wf-6.py.
The join calculus is really neat, it allows you to think about concurrent
processes at a higher level with fewer chances to deadlock or corrupt your
data.
This is the code that controls the workers (if you have some time follow the
introduction to concurrent and distributed programming with JoCaml,
I can't recommend it enough); it differs a bit from what you'll find in the
actual sources, because I just thought of this more elegant way as I was
writing these lines: