This post originated from an RSS feed registered with Java Buzz
by Erik C. Thauvin.
Original Post: Crawlers Detection in Java
Feed Title: Erik's Weblog
Feed URL: http://erik.thauvin.net/blog/feed.jsp?cat=Java
Feed Description: The Truth is Out There!
As I was testing my link redirector Servlet for the linkblog, Rick asked what I was doing about search engine crawlers. I told him I was inspecting the user-agent on all requests and excluding anything containing the words bot, crawler or spider, which I knew was hardly enough.
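For illustration, that keyword check could look something like the sketch below. This is not the code from the Servlet; the class name `NaiveCrawlerCheck` and its method are hypothetical, and the match is assumed to be case-insensitive.

```java
import java.util.Locale;

// Hypothetical helper (not the original Servlet code): flags a request as a
// crawler if its user-agent contains one of three keywords.
final class NaiveCrawlerCheck {
    private static final String[] KEYWORDS = {"bot", "crawler", "spider"};

    private NaiveCrawlerCheck() {
    }

    // Returns true if the user-agent contains "bot", "crawler" or "spider",
    // ignoring case. A missing user-agent is treated as "not a crawler".
    static boolean isCrawler(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String keyword : KEYWORDS) {
            if (ua.contains(keyword)) {
                return true;
            }
        }
        return false;
    }
}
```

In a Servlet the argument would typically come from `request.getHeader("User-Agent")`. The weakness is easy to see: a crawler that identifies itself as something like `Yahoo! Slurp` or `ia_archiver` contains none of the three keywords and slips through.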
I was ready to live with it, when I suddenly remembered that AWStats, my favorite logfile analyzer, does a pretty good job of keeping track of robots/spiders. It actually includes a Perl module with around 400 regexp user-agent matches for all sorts of known robots, spiders and crawlers.
I converted the AWStats lookup data into a Java class, Robots, which I used in my Servlet.
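The general shape of such a class might be something like the following sketch. It is not the actual Robots class: the name `RobotsSketch`, the `isRobot` method and the handful of patterns are assumptions made up for illustration, standing in for the roughly 400 expressions taken from AWStats.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: a tiny stand-in for a regexp-based robot lookup.
// The patterns below are examples, not the AWStats data.
final class RobotsSketch {
    private static final Pattern[] PATTERNS = {
        Pattern.compile("googlebot", Pattern.CASE_INSENSITIVE),
        Pattern.compile("slurp", Pattern.CASE_INSENSITIVE),
        Pattern.compile("ia_archiver", Pattern.CASE_INSENSITIVE),
        Pattern.compile("crawl", Pattern.CASE_INSENSITIVE),
        Pattern.compile("spider", Pattern.CASE_INSENSITIVE)
    };

    private RobotsSketch() {
    }

    // Returns true if any pattern is found anywhere in the user-agent.
    // A missing user-agent is treated as "not a robot".
    static boolean isRobot(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        for (Pattern pattern : PATTERNS) {
            if (pattern.matcher(userAgent).find()) {
                return true;
            }
        }
        return false;
    }
}
```

Compiling the patterns once into a static array keeps the per-request cost down to a linear scan, which matters when the list grows to a few hundred entries.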