This post originated from an RSS feed registered with Java Buzz
by Nick Lothian.
Original Post: Spam Blog Crisis
Feed Title: BadMagicNumber
Feed URL: http://feeds.feedburner.com/Badmagicnumber
Feed Description: Java, Development and Me
Tim Bray says there is a spam blog emergency occurring right now. I tend to agree. I'd like to see the search terms he is using to get that many splogs, though.
Removing spam blog results from a result list sorted by time is difficult because you can't rely on PageRank-like algorithms: a brand-new post hasn't had time to accumulate inbound links. Email spam filters are probably a better model, although the auto-generated splogs that I suspect Tim is suffering from are hard to detect using Bayesian-type algorithms. OTOH, my de-spammed version of Google's blog search just uses heuristics based on the URL of each item, and it does okay for many searches. Compare my version of a search for "cancer" with the raw version. At the time of writing, my version removes 26 spammy results to get the first 10 non-spammy ones.
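To give a rough idea of what filtering on the URL alone can look like, here is a minimal sketch in Java. The class name `SplogUrlFilter`, the two rules (too many hyphens or too many digits in the host name) and their thresholds are assumptions made up for illustration; they are not the rules the de-spammed search actually uses.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class SplogUrlFilter {

    // Hypothetical thresholds, chosen only for this example.
    private static final int MAX_HYPHENS_IN_HOST = 2;
    private static final int MAX_DIGITS_IN_HOST = 3;

    /** Returns true if the URL's host looks machine-generated under the assumed rules. */
    public static boolean looksSpammy(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return true; // unparseable URLs are treated as suspect
        }
        if (host == null) {
            return true; // no recognisable host name, also treated as suspect
        }
        int hyphens = 0;
        int digits = 0;
        for (char c : host.toCharArray()) {
            if (c == '-') {
                hyphens++;
            } else if (Character.isDigit(c)) {
                digits++;
            }
        }
        return hyphens > MAX_HYPHENS_IN_HOST || digits > MAX_DIGITS_IN_HOST;
    }

    /** Keeps the first `wanted` non-spammy URLs, preserving the incoming (time-sorted) order. */
    public static List<String> firstClean(List<String> urls, int wanted) {
        List<String> clean = new ArrayList<>();
        for (String url : urls) {
            if (clean.size() == wanted) {
                break;
            }
            if (!looksSpammy(url)) {
                clean.add(url);
            }
        }
        return clean;
    }
}
```

Because the check only looks at the URL string, it needs no link graph and no training data, which is what makes this kind of rule usable on time-sorted results. The trade-off is that rules like these are easy for spammers to work around and will occasionally flag a legitimate blog.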