This post originated from an RSS feed registered with Ruby Buzz
by Red Handed.
Original Post: Ripping Up Wikipedia, Subjugating It
Feed Title: RedHanded
Feed URL: http://redhanded.hobix.com/index.xml
Feed Description: sneaking Ruby through the system
Shane Vitarana has an article on his blog that covers taking apart Wikipedia pages with Hpricot. His commenters also point out the Wikipedia API, which is a fantastic way of getting at the raw data—in YAML even! Still, it gives you the raw Wiki text, untransformed. Hence Hpricot.
Anyway, I just want to make a couple suggestions for his script. These aren’t a big deal, just some shortcuts in Hpricot, which aren’t documented yet. Let’s start getting them known, you know?
First, a simple one: the Element#attributes hash can be read and written through Element#[].
# change /wiki/ links to point to the full Wikipedia path
(content/:a).each do |link|
  unless link['href'].nil?
    if link['href'] =~ %r!^/wiki/!
      link['href'].sub!('/wiki/', 'http://en.wikipedia.org/wiki/')
    end
  end
end
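If it helps to picture the shortcut, here's a simplified pure-Ruby stand-in (not Hpricot's actual implementation, just the shape of it): Element#[] delegates straight to the attributes hash, and Element#[]= writes it back.

```ruby
# Simplified stand-in for the idea behind Element#[] --
# a toy class, not Hpricot's real code.
class FakeElement
  attr_reader :attributes

  def initialize(attributes = {})
    @attributes = attributes
  end

  # elem['href'] reads the attributes hash...
  def [](name)
    attributes[name]
  end

  # ...and elem['href'] = '...' writes it.
  def []=(name, value)
    attributes[name] = value
  end
end

link = FakeElement.new('href' => '/wiki/Ruby')
link['href'] = 'http://en.wikipedia.org/wiki/Ruby'
```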
In the section where Shane loops through a bunch of CSS selectors and removes everything that matches, I think it’s quicker to join the separate CSS selectors with commas, to form one large selector which Hpricot can find in a single pass.
# remove unnecessary content and edit links
(content/items_to_remove.join(',')).remove
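To make the join concrete, here's what it produces with a hypothetical items_to_remove list (these selectors are stand-ins, not the ones from Shane's actual script):

```ruby
# Hypothetical selector list -- stand-ins for illustration,
# not the selectors from Shane's script.
items_to_remove = ['#siteSub', 'span.editsection', 'table#toc']

# Joining with commas yields one selector string that Hpricot
# can match against the document in a single pass.
selector = items_to_remove.join(',')
# selector => "#siteSub,span.editsection,table#toc"
```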
And lastly, I’ve checked in a new swap method, which replaces an element with an HTML fragment.
# replace links to create new entries with plain text
(content/"a.new").each do |link|
  link.swap(link['title'])
end
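If you're curious what swap amounts to underneath, the idea is roughly: splice the replacement in where the old node sat among its siblings. Here's a rough pure-Ruby sketch of those semantics (a toy model, not Hpricot's implementation):

```ruby
# Toy model of swap's semantics: replace a node in-place among
# its siblings. Not Hpricot's actual code, just the idea.
def swap(siblings, node, replacement)
  idx = siblings.index(node)
  siblings[idx, 1] = [replacement]
  siblings
end

siblings = ['before', '<a class="new" title="Foo">Foo</a>', 'after']
swap(siblings, '<a class="new" title="Foo">Foo</a>', 'Foo')
# siblings => ['before', 'Foo', 'after']
```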
So, yeah. Little things. Hey, are there any other methods you’re itching for?