After writing a "pooling executor" that assigns
tasks to a number of handlers in parallel, filtering referrer URLs somewhat
efficiently becomes easier. Checking multiple referrers concurrently helps
maximize bandwidth usage, which is quite important when you have to fetch
~8000 pages. Were this done serially, the process could easily take a couple
hours or more.
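For concreteness, here is a minimal sketch of the kind of pooling executor
this refers to: a fixed set of worker threads pulling tasks off a shared
queue. It is written from the description above, not taken from the actual
script; the names (PoolingExecutor, submit, reachable?) and the trivial
per-URL check are my own stand-ins.

  require 'thread'
  require 'net/http'
  require 'uri'

  # Hypothetical pooling executor: N worker threads drain a shared task queue,
  # so one slow HTTP response doesn't serialize the whole run.
  class PoolingExecutor
    def initialize(num_workers)
      @queue   = Queue.new
      @workers = Array.new(num_workers) do
        Thread.new do
          # Queue#pop blocks until a task arrives; a nil sentinel stops the worker.
          while (task = @queue.pop)
            task.call
          end
        end
      end
    end

    # Enqueue a block to be run by one of the workers.
    def submit(&task)
      @queue << task
    end

    # Push one sentinel per worker, then wait for all of them to finish.
    def shutdown
      @workers.size.times { @queue << nil }
      @workers.each { |w| w.join }
    end
  end

  # Stand-in check, not the real filtering logic: a referrer "passes"
  # if its page can be fetched at all.
  def reachable?(url)
    Net::HTTP.get(URI.parse(url))
    true
  rescue StandardError
    false
  end

  pool    = PoolingExecutor.new(10)
  results = Queue.new   # thread-safe sink for the verdicts
  %w[http://example.com/a http://example.com/b].each do |url|
    pool.submit { results << [url, reachable?(url)] }
  end
  pool.shutdown
  puts results.pop.inspect until results.empty?

The queue does double duty here: it feeds tasks to the workers, and the nil
sentinels pushed at shutdown tell each worker to exit once the backlog is
drained.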
The script described below helped me remove over 93% of the referrer URLs,
going from 13595 hits (7909 unique URLs) to 933 (653 unique).
Task description
My HTTP referrers (and the corresponding hits) are stored as serialized hashes
in a number of files, marshalled with TMarshal,
AMarshal's elder (yet simpler) sibling: