This post originated from an RSS feed registered with Ruby Buzz
by Red Handed.
Original Post: Ripping Up Wikipedia, Subjugating It
Feed Title: RedHanded
Feed URL: http://redhanded.hobix.com/index.xml
Feed Description: sneaking Ruby through the system
Shane Vitarana has an article on his blog that covers taking apart Wikipedia pages with Hpricot. His commenters also point out the Wikipedia API, which is a fantastic way of getting at the raw data—in YAML even! Still, it gives you the raw Wiki text, untransformed. Hence Hpricot.
Anyway, I just want to make a couple suggestions for his script. These aren’t a big deal, just some shortcuts in Hpricot, which aren’t documented yet. Let’s start getting them known, you know?
First, a simple one: the Element#attributes hash can be read and written through Element#[].
# change /wiki/ links to point to the full Wikipedia path
(content/:a).each do |link|
  unless link['href'].nil?
    if link['href'] =~ %r!^/wiki/!
      link['href'].sub!('/wiki/', 'http://en.wikipedia.org/wiki/')
    end
  end
end
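If it helps to picture the shortcut, here's a simplified pure-Ruby stand-in (not Hpricot's actual implementation, just the shape of it): Element#[] delegates straight to the attributes hash, and Element#[]= writes it back.

```ruby
# Simplified stand-in for the idea behind Element#[] --
# a toy class, not Hpricot's real code.
class FakeElement
  attr_reader :attributes

  def initialize(attributes = {})
    @attributes = attributes
  end

  # elem['href'] reads the attributes hash...
  def [](name)
    attributes[name]
  end

  # ...and elem['href'] = '...' writes it.
  def []=(name, value)
    attributes[name] = value
  end
end

link = FakeElement.new('href' => '/wiki/Ruby')
link['href'] = 'http://en.wikipedia.org/wiki/Ruby'
```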
In the section where Shane loops through a bunch of CSS selectors and removes everything that matches, I think it’s quicker to join the separate CSS selectors with commas, to form one large selector which Hpricot can find in a single pass.
# remove unnecessary content and edit links
(content/items_to_remove.join(',')).remove
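To make the join concrete, here's what it produces with a hypothetical items_to_remove list (these selectors are stand-ins, not the ones from Shane's actual script):

```ruby
# Hypothetical selector list -- stand-ins for illustration,
# not the selectors from Shane's script.
items_to_remove = ['#siteSub', 'span.editsection', 'table#toc']

# Joining with commas yields one selector string that Hpricot
# can match against the document in a single pass.
selector = items_to_remove.join(',')
# selector => "#siteSub,span.editsection,table#toc"
```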
And lastly, I’ve checked in a new swap method, which replaces an element with an HTML fragment.
# replace links to create new entries with plain text
(content/"a.new").each do |link|
  link.swap(link['title'])
end
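If you're curious what swap amounts to underneath, the idea is roughly: splice the replacement in where the old node sat among its siblings. Here's a rough pure-Ruby sketch of those semantics (a toy model, not Hpricot's implementation):

```ruby
# Toy model of swap's semantics: replace a node in-place among
# its siblings. Not Hpricot's actual code, just the idea.
def swap(siblings, node, replacement)
  idx = siblings.index(node)
  siblings[idx, 1] = [replacement]
  siblings
end

siblings = ['before', '<a class="new" title="Foo">Foo</a>', 'after']
swap(siblings, '<a class="new" title="Foo">Foo</a>', 'Foo')
# siblings => ['before', 'Foo', 'after']
```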
So, yeah. Little things. Hey, are there any other methods you’re itching for?