You might not have noticed it, but every page on eigenclass.org lists the most
popular referrers. I often find interesting things in the Referer field, but
unfortunately the interesting entries are hard to spot (especially for an
occasional visitor) amid inaccessible pages (Bloglines, Google Reader, other
online RSS aggregators...) and, as of late, referrer spam.
I'm now filtering referrer URLs as I get them, but I also wanted to purge the
historical data contained in the "referrer database". Unsurprisingly, I
wrote a script for that.
Filtering referrers entails a fair bit of network traffic, to fetch the
referring URLs and verify that they can be accessed and seem legitimate.
Performing these checks serially would take forever, since each one spends
most of its time waiting on the network (establishing the connection, issuing
the HTTP request, waiting for the data, timeouts, ...), and it wouldn't use my
bandwidth efficiently.
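To make a single check concrete, here's roughly what it could look like (a
minimal sketch: the name legitimate_referrer? and the 10-second timeout are
made up for illustration, and "seems legitimate" is reduced here to a
successful HTTP status):

  require 'net/http'
  require 'uri'
  require 'timeout'

  # Hypothetical single check: true if the URL can be fetched and answers
  # with a 2xx status within 10 seconds.
  def legitimate_referrer?(url)
    uri = URI.parse(url)
    Timeout.timeout(10) do
      response = Net::HTTP.get_response(uri)
      response.is_a?(Net::HTTPSuccess)
    end
  rescue Timeout::Error, StandardError
    false
  end
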
The obvious solution is to perform several operations in parallel to maximize
bandwidth usage.
Pooling handlers
The idea is to create a PoolingExecutor object that assigns tasks to a
bounded number of handlers and runs each one in a separate thread. Since
we're not CPU-bound, this lets us make the most of some limited resource
(bandwidth in this case, but it could also be DB connections, etc.) while
avoiding overloading it.
The API is:
  executor = PoolingExecutor.new do |handlers|
    NUM_HANDLERS.times do
      handlers << SomeHandler.new(stuff)
    end
  end

  # later
  # each task is run in a different thread, but the num of simultaneous
  # threads is bounded
  executor.run do |handler|
    # perform task with the handler
    # e.g.
    foo(handler.process(stuff))
  end

  executor.run do |handler|
    # ....
  end

  executor.wait_for_all  # ensure all the tasks scheduled with executor are
                         # finished
  # ....
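The executor itself isn't shown here, but a minimal sketch matching that API
can be built on the standard library's thread-safe Queue (error handling and
shutdown are simplified; this is one possible implementation, not necessarily
the original):

  require 'thread'  # provides Queue on older Rubies; built in on modern ones

  class PoolingExecutor
    def initialize
      @handlers = Queue.new   # idle handlers; its size bounds concurrency
      @threads  = Queue.new   # threads spawned by #run, joined in #wait_for_all
      yield @handlers         # the caller fills the pool with handlers
    end

    # Waits until a handler is free, then runs the block with it in a new
    # thread, so at most as many tasks run at once as there are handlers.
    def run
      handler = @handlers.pop # blocks while all handlers are busy
      @threads << Thread.new do
        begin
          yield handler
        ensure
          @handlers << handler  # return the handler to the pool
        end
      end
    end

    # Blocks until every task scheduled so far has finished.
    def wait_for_all
      @threads.pop.join until @threads.empty?
    end
  end

Keeping the handlers in a queue does double duty: it hands a free handler to
each task, and as a side effect makes run block once all handlers are busy,
which is precisely the bound on simultaneous connections we wanted.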