This post originated from an RSS feed registered with Ruby Buzz
by James Britt.
Original Post: Using Mechanize for HTML Scraping
Feed Title: James Britt: Ruby Development
Feed URL: http://feeds.feedburner.com/JamesBritt-Home
Feed Description: James Britt: Playing with better toys
There was some discussion on ruby-talk recently about HTML screen scraping, and some questions about using Michael Neumann's WWW::Mechanize library.
I mentioned that I was using that library to grab multiple CafePress pages and extract product data to assemble the rubystuff.com Web site. Someone asked if I could post my code as an example, which seemed a reasonable idea.
The code was never meant to be more than a way to save me from doing more work than absolutely necessary, but it turned out to be a pretty good, small-but-instructive example of what one can do with Mechanize. I cleaned up and refactored a few things, added a narrative, and have put the results up here.
I'm not so sure that this will remain the final home for the article/example, though I'm thinking of using the Neurogami site as a repository for all my writing and code libraries. I have things on jamesbritt.com, rubyxml.com, the Linux Journal site, and maybe elsewhere, plus Ruby code hosted in almost as many different places, so some one-stop shopping might make it easier to keep track of things.