Extracting Tech Memeorandum's blog list
Adam Green, ruby.darwinianweb.com
I know I'm supposed to be working on my RubyRiver tutorial, but I got distracted playing with ideas for mashups I want to demo at Mashup Camp. One type of mashup I want to work on merges people's names and blog URLs with various search engines. Tech Memeorandum aggregates posts from a great set of blogs, so I'm going to use that site as the starting source for my people mashup data. I'll explain the full project on my mashup blog.

I'm going to try to maintain the pattern of posting the Ruby code for anything I work on here on the Ruby blog. That way I can publish complete source code listings without scaring away the non-programmers who read my other blogs.

The idea of this code is that it reads the home page of Tech Memeorandum, extracts the links to blogs, and saves them as an XML file. The XML file will live at a permanent location. Right now the file is not updating, but once I get the whole system running, it will refresh automatically. Hopefully, others will use it as the basis for their own mashups.
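For reference, here is the shape of the file the script writes. The values are just the placeholders from the citation formats the script parses, not real data:

<?xml version="1.0" encoding="utf-8" ?>
<tmblogs>
  <blog>
    <author>First Last</author>
    <title>Blog Name</title>
    <htmlUrl>http://url/</htmlUrl>
  </blog>
</tmblogs>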
tmparse.rb
#! /usr/bin/ruby
# tmparse.rb
# Extract the blog citations from the home page of
# http://tech.memeorandum.com.
#
# Copyright (C) 2006 Adam Green
# http://ruby.darwinianweb.com, adam AT darwinianweb DOT com
# This program is distributed under the same license as Ruby.
#
# Each blog is identified in the page with the following entry:
# <CITE>First Last / <A HREF="http://url/">Blog Name</A>:</CITE>
# If there is no author's name, the citation is:
# <CITE> <A HREF="http://url/">Blog Name</A>:</CITE>
# Get the page's text.
require "open-uri"
page = open("http://tech.memeorandum.com/")
pagetext = page.read
page.close
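# A note for anyone running this on a newer Ruby: the bare open() call on
# a URL was deprecated in Ruby 2.7 and removed in 3.0; on those versions
# the equivalent fetch is:
#   pagetext = URI.open("http://tech.memeorandum.com/") { |page| page.read }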
# Convert the ellipsis entity used by TM, since it gives XML parsers fits.
pagetext = pagetext.gsub("…", "...")
# Pull out all the citations.
citelist = pagetext.scan(/<cite>.*?<\/cite>/i)
# Build a hash with them.
sortlist = {}
citelist.each do |citation|
  # Only use citations with URLs.
  if citation.match(/a href/i)
    # Find the blog's URL inside the HREF attribute.
    htmlurlstart = citation.index('HREF="') + 6
    htmlurlend = citation.index('">') - 1
    htmlurl = citation[htmlurlstart..htmlurlend]
    # The title is the link text between the "> and the </A>.
    titlestart = htmlurlend + 3
    titleend = citation.index('</A>') - 1
    title = citation[titlestart..titleend]
    # Does the citation include an author? If the first "/" comes before
    # the link, it is the "First Last /" separator; otherwise the first
    # "/" is just part of the URL's http://.
    if citation.index("/") < citation.index("<A HREF")
      authorstart = 6
      authorend = citation.index("/") - 2
      author = citation[authorstart..authorend]
      author = author.strip
      author = author.squeeze(" ")
      sortkey = author
    else
      author = ""
      sortkey = title
    end
    # Build the hash, so it can be sorted on author or title.
    sortlist[sortkey.upcase] = {
      "author" => author,
      "htmlurl" => htmlurl,
      "title" => title
    }
  end
end
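# At this point each hash entry, keyed by the upcased sort key, looks
# like this (placeholder values from the citation format shown at top):
#   sortlist["FIRST LAST"] = { "author"  => "First Last",
#                              "htmlurl" => "http://url/",
#                              "title"   => "Blog Name" }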
# Write the sorted list out to an XML file.
xmlfile = File.new("../../projects/tmblogs/tmblogs.xml", "w")
xmlfile.puts('<?xml version="1.0" encoding="utf-8" ?>')
xmlfile.puts('<tmblogs>')
# Hash#sort returns an array of [key, value] pairs ordered by key.
sortarray = sortlist.sort
sortarray.each do |item|
  info = item[1]
  xmlfile.puts('  <blog>')
  xmlfile.puts('    <author>' + info["author"] + '</author>')
  xmlfile.puts('    <title>' + info["title"] + '</title>')
  xmlfile.puts('    <htmlUrl>' + info["htmlurl"] + '</htmlUrl>')
  xmlfile.puts('  </blog>')
end
xmlfile.puts('</tmblogs>')
xmlfile.close
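One caveat about the output: the author and title strings go into the XML verbatim, so a blog name containing & or < would break the file for XML parsers, just like the ellipsis entity does. A minimal sketch of a fix, using CGI.escapeHTML from Ruby's standard library (the xml_escape wrapper is my name for it, not part of the script above):

require "cgi"

# Escape &, <, >, and quotes so arbitrary blog names stay valid XML.
def xml_escape(text)
  CGI.escapeHTML(text)
end

puts xml_escape("AT&T Blog <beta>")   # => AT&amp;T Blog &lt;beta&gt;

Each info value would then be passed through this helper before being concatenated into the puts calls.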