The Artima Developer Community
Ruby Buzz Forum
Extracting Tech Memeorandum's blog list

Adam Green

Posts: 102
Nickname: darwinian
Registered: Dec, 2005

Adam Green is the author of Ruby.Darwinianweb.com
Extracting Tech Memeorandum's blog list Posted: Feb 3, 2006 11:17 AM

This post originated from an RSS feed registered with Ruby Buzz by Adam Green.
Original Post: Extracting Tech Memeorandum's blog list
Feed Title: ruby.darwinianweb.com
Feed URL: http://www.nemesis-one.com/rss.xml
Feed Description: Adam Green's Ruby development site

I know I'm supposed to be working on my RubyRiver tutorial, but I got distracted playing with ideas for mashups I want to demo at Mashup Camp. One type of mashup I want to work on is merging people's names and blog URLs with various search engines. Tech Memeorandum aggregates posts from a great set of blogs, so I'm going to use that site as the starting source for my people mashup data. I'll explain the full project on my mashup blog.

I'm going to try to maintain the pattern of posting the Ruby code for anything I work on here on the Ruby blog. That way I can publish complete source code listings without scaring away the non-programmers who read my other blogs.

The idea of this code is that it reads the home page of Tech Memeorandum, extracts the links to blogs, and saves them as an XML file. The XML file will be permanently located at this location. Right now this XML file is not updating, but once I get the whole system running, it will automatically refresh. Hopefully, others will use it as the basis for their own mashups.
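For anyone wanting to build on the file, here is a rough sketch of how another script might read it. It assumes the layout that tmparse.rb writes: a tmblogs root holding blog elements, each with author, title, and htmlUrl children. The sample data and the read_blogs name are made up for illustration, and REXML is only one of several parsers that would do the job.

```ruby
require "rexml/document"

# Invented sample in the layout tmparse.rb writes; not real Tech Memeorandum data.
SAMPLE_XML = <<~XML
  <?xml version="1.0" encoding="utf-8" ?>
  <tmblogs>
   <blog>
   <author>First Last</author>
   <title>Blog Name</title>
   <htmlUrl>http://url/</htmlUrl>
   </blog>
   <blog>
   <author></author>
   <title>Another Blog</title>
   <htmlUrl>http://example.com/</htmlUrl>
   </blog>
  </tmblogs>
XML

# Hypothetical helper: turn the XML text into an array of hashes.
def read_blogs(xml_text)
  doc = REXML::Document.new(xml_text)
  blogs = []
  doc.elements.each("tmblogs/blog") do |blog|
    blogs << {
      # Element#text is nil for an empty element, so to_s gives "".
      "author"  => blog.elements["author"].text.to_s,
      "title"   => blog.elements["title"].text.to_s,
      "htmlurl" => blog.elements["htmlUrl"].text.to_s
    }
  end
  blogs
end

read_blogs(SAMPLE_XML).each do |blog|
  name = blog["author"].empty? ? blog["title"] : blog["author"]
  puts "#{name}: #{blog["htmlurl"]}"
end
```

To read the real file, pass File.read on its path to read_blogs in place of SAMPLE_XML.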

tmparse.rb

#!/usr/bin/ruby
# tmparse.rb
# Extract the blog citations from the home page of
# http://tech.memeorandum.com.
#
# Copyright (C) 2006 Adam Green
# http://ruby.darwinianweb.com, adam AT darwinianweb DOT com
# This program is distributed under the same license as Ruby.
#
# Each blog is identified in the page with the following entry:
# <CITE>First Last / <A HREF="http://url/">Blog Name</A>:</CITE>
# If there is no author's name, the citation is:
# <CITE> <A HREF="http://url/">Blog Name</A>:</CITE>

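# Get the page's text.
# Kernel#open no longer fetches URLs on Ruby 3.0 and later, so call URI.open.
# The block form closes the connection once the page has been read.
require "open-uri"
pagetext = URI.open("http://tech.memeorandum.com/") { |page| page.read }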

# Convert the ellipsis entity used by TM, since it gives XML parsers fits.
pagetext = pagetext.gsub("&hellip;", "...")

# Pull out all the citations.
citelist = pagetext.scan(/<cite>.*?<\/cite>/i)

# Build a hash with them.
sortlist = {}
citelist.each do |citation|
  # Only use citations with URLs.
  if citation.match(/a href/i)
    # The URL sits between the first =" and the first ">.
    htmlurlstart = citation.index('="') + 2
    htmlurlend = citation.index('">') - 1
    htmlurl = citation[htmlurlstart..htmlurlend]

    # The blog name runs from just after "> up to the closing </A>.
    # The tag searches ignore case, like the citation scan above.
    titlestart = htmlurlend + 3
    titleend = citation.index(/<\/a>/i) - 1
    title = citation[titlestart..titleend]

    # Does the citation include an author?
    if citation.index("/") < citation.index(/<a href/i)
      # Skip the six characters of <CITE> and stop before the " /" separator.
      authorstart = 6
      authorend = citation.index("/") - 2
      author = citation[authorstart..authorend].strip.squeeze(" ")
      sortkey = author
    else
      author = ""
      sortkey = title
    end

    # Build the hash, so it can be sorted on author or title.
    sortlist[sortkey.upcase] = { "author" => author,
                                 "htmlurl" => htmlurl,
                                 "title" => title }
  end
end

# Write the sorted list out to an XML file.
# The block form of File.open closes the file even if a write fails.
# Values are written as found in the page, without extra XML escaping.
File.open("../../projects/tmblogs/tmblogs.xml", "w") do |xmlfile|
  xmlfile.puts('<?xml version="1.0" encoding="utf-8" ?>')
  xmlfile.puts('<tmblogs>')

  # Hash#sort returns an array of [key, value] pairs.
  sortlist.sort.each do |_sortkey, info|
    xmlfile.puts(' <blog>')
    xmlfile.puts(' <author>' + info["author"] + '</author>')
    xmlfile.puts(' <title>' + info["title"] + '</title>')
    xmlfile.puts(' <htmlUrl>' + info["htmlurl"] + '</htmlUrl>')
    xmlfile.puts(' </blog>')
  end
  xmlfile.puts('</tmblogs>')
end
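
The index arithmetic above depends on the exact layout of each citation. As an alternative sketch, not what tmparse.rb does, a single case-insensitive regular expression can pull the author, URL, and blog name out in one step. The CITATION_PATTERN and parse_citation names are invented here, and the pattern assumes the two citation forms described in the header comments.

```ruby
# Hypothetical alternative to the index arithmetic in tmparse.rb.
# Group 1: optional author before " / ", group 2: URL, group 3: blog name.
CITATION_PATTERN = %r{<cite>\s*(?:(.*?)\s*/\s*)?<a\s+href="([^"]*)"[^>]*>(.*?)</a>}im

# Returns a hash shaped like the sortlist entries, or nil when the
# citation has no link.
def parse_citation(citation)
  match = CITATION_PATTERN.match(citation)
  return nil unless match

  {
    # The author group is nil when the citation has no author.
    "author"  => match[1].to_s.strip.squeeze(" "),
    "htmlurl" => match[2],
    "title"   => match[3].strip
  }
end

# The two citation forms from the header comments.
samples = [
  '<CITE>First Last / <A HREF="http://url/">Blog Name</A>:</CITE>',
  '<CITE> <A HREF="http://url/">Blog Name</A>:</CITE>'
]
samples.each { |citation| p parse_citation(citation) }
```

Because the pattern ignores case, it works whether the page writes its tags as CITE and A HREF or in lowercase.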
