The Artima Developer Community
Ruby Buzz Forum
Ripping Up Wikipedia, Subjugating It

Red Handed

Red Handed is a Ruby-focused group blog.
Ripping Up Wikipedia, Subjugating It Posted: Oct 4, 2006 11:12 AM
This post originated from an RSS feed registered with Ruby Buzz by Red Handed.
Original Post: Ripping Up Wikipedia, Subjugating It
Feed Title: RedHanded
Feed URL: http://redhanded.hobix.com/index.xml
Feed Description: sneaking Ruby through the system

There’s an article that covers taking apart Wikipedia pages with Hpricot on the blog of Shane Vitarana. His commenters also point out the Wikipedia API, which is a fantastic way of getting at the raw data—in YAML even! Still, it gives you the raw Wiki text, untransformed. Hence Hpricot.

Anyway, I just want to make a couple suggestions for his script. These aren’t a big deal, just some shortcuts in Hpricot, which aren’t documented yet. Let’s start getting them known, you know?

First, a simple one: the Element#attributes hash can be used through Element#[].

 #change /wiki/ links to point to full wikipedia path
 (content/:a).each do |link|
   unless link['href'].nil?
     if link['href'] =~ %r!^/wiki/!
       link['href'].sub!('/wiki/', 'http://en.wikipedia.org/wiki/')
     end
   end
 end
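
To make the shortcut concrete without needing Hpricot installed, here is a rough sketch of the idea. TinyElement, WIKI_ROOT and absolutize are made-up names for illustration, not Hpricot's classes or methods; the only point is that [] reads straight from the attributes hash, and the link fix-up reads the same either way.

```ruby
# Hypothetical stand-in for an Hpricot element; not the library's real class.
class TinyElement
  attr_reader :attributes

  def initialize(attributes = {})
    @attributes = attributes
  end

  # Read an attribute without spelling out `attributes` each time.
  def [](name)
    @attributes[name]
  end

  # Write an attribute the same way.
  def []=(name, value)
    @attributes[name] = value
  end
end

WIKI_ROOT = 'http://en.wikipedia.org'

# Returns an absolute URL for a /wiki/ path; any other href comes back unchanged.
def absolutize(href)
  return href unless href && href.start_with?('/wiki/')
  WIKI_ROOT + href
end

# Sample links (assumed data): one relative, one already absolute, one with no href.
links = [
  TinyElement.new('href' => '/wiki/Ruby'),
  TinyElement.new('href' => 'http://example.com/'),
  TinyElement.new('name' => 'anchor-only')
]

# Only touch links that actually carry an href.
links.each { |link| link['href'] = absolutize(link['href']) if link['href'] }
```

Assigning a fresh string instead of calling sub! on the stored one is just a choice made in this sketch, so it does not depend on the attribute value being mutable in place.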

In the section where Shane loops through a bunch of CSS selectors and removes everything that matches, I think it’s quicker to join the separate selectors with commas, forming one large selector which Hpricot can find in a single pass.

 #remove unnecessary content and edit links
 (content/items_to_remove.join(',')).remove
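
For a feel of what that joined selector looks like, here is a tiny sketch. The selectors in items_to_remove are placeholders picked for illustration; the real list lives in Shane's script.

```ruby
# Hypothetical selectors; the real list lives in Shane's script.
items_to_remove = ['div.editsection', '#siteSub', 'table.toc']

# A comma in CSS groups selectors ("match any of these"),
# so one string can stand in for the whole list.
combined = items_to_remove.join(',')

puts combined
```

The string in combined is what gets handed to the / search in place of one search per selector.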

And lastly, I’ve checked in a new swap method which replaces an element with an HTML fragment.

 #replace links to create new entries with plain text
 (content/"a.new").each do |link|
   link.swap(link['title'])
 end
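
If it helps to picture what swapping does to the tree, here is a loose sketch with a made-up TinyNode class. It is not Hpricot's implementation and it only handles plain text, where the real swap takes an HTML fragment; it just shows a node stepping out of its parent's child list and the replacement taking its slot.

```ruby
# Hypothetical mini tree; Hpricot's own node classes are not used here.
class TinyNode
  attr_accessor :parent
  attr_reader :children, :text

  def initialize(text, children = [])
    @text = text
    @children = children
    children.each { |child| child.parent = self }
  end

  # Replace this node, in its parent's child list, with a plain-text node.
  def swap(replacement_text)
    node = TinyNode.new(replacement_text)
    node.parent = parent
    index = parent.children.index(self)
    parent.children[index] = node
    node
  end
end

# Assumed sample data: a paragraph holding a "new entry" link between two text nodes.
link = TinyNode.new('<a class="new" title="Some Page">Some Page</a>')
para = TinyNode.new('p', [TinyNode.new('See '), link, TinyNode.new('.')])

# Swap the link for its title text, as the loop above does for each a.new element.
link.swap('Some Page')

flattened = para.children.map(&:text).join
```

After the swap, the paragraph's children are three plain-text nodes and the link markup is gone.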

So, yeah. Little things. Hey, are there any other methods you’re itching for?

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use