This post originated from an RSS feed registered with Python Buzz
by Phillip Pearson.
Original Post: Parsing RSS-Data
Feed Title: Second p0st
Feed URL: http://www.myelin.co.nz/post/rss.xml
Feed Description: Tech notes and web hackery from the guy that brought you bzero, Python Community Server, the Blogging Ecosystem and the Internet Topic Exchange
As a companion to Les Orchard's RSS-Data versus namespace examples, here's some Python code that will parse the RSS-Data version:
# Python 2 code: urllib.urlopen and xmlrpclib were reorganized in Python 3
import re, urllib, xmlrpclib
from pprint import pprint
# read les's example
html = urllib.urlopen('http://www.decafbad.com/blog/tech/rss_data_versus_namespace.html').read()
# turn the html-quoted example back into xml
for entity, char in (('lt', '<'), ('gt', '>'), ('amp', '&')):
    html = html.replace('&%s;' % entity, char)
# rip out the rss-data bit and get rid of the namespace
xml = re.search(r'(\<sdl\:data\>.*\</sdl\:data\>)', html, re.S).group(1)
xml = xml.replace('sdl:', '')
# feed it through xmlrpclib
p, u = xmlrpclib.getparser()
p.feed(xml)
p.close()
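# and we have the data!
# (u._stack is a private attribute of the unmarshaller; u.close() would raise
# ResponseError here because the fragment is not a complete XML-RPC message)
pprint(u._stack[0])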
Here's what you get when you run it:
phil@icefloe:~/projects/rss-data$ python test.py
{'Asin': '0439139597',
 'Authors': ['J. K. Rowling', 'Mary GrandPr'],
 'Availability': 'Usually ships within 24 hours',
 'Catalog': 'Book',
 'ImageUrlLarge': 'http://images.amazon.com/images/P/0439139597.01.LZZZZZZZ.jpg',
 'ImageUrlMedium': 'http://images.amazon.com/images/P/0439139597.01.MZZZZZZZ.jpg',
 'ImageUrlSmall': 'http://images.amazon.com/images/P/0439139597.01.THUMBZZZ.jpg',
 'ListPrice': '$25.95',
 'Manufacturer': 'Scholastic',
 'OurPrice': '$18.16',
 'ProductName': '\n Harry Potter and the Goblet of Fire (Book 4)\n ',
 'ReleaseDate': <DateTime 2000-07-08T00:00:00 at 818012c>,
 'UsedPrice': '$3.97',
 'url': 'http://www.amazon.com/exec/obidos/ASIN/0439139597/0xdecafbad-20'}
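For anyone on Python 3, here's a rough sketch of the same trick using only public calls. It assumes the RSS-Data payload is a struct in XML-RPC serialization sitting directly inside sdl:data. The QUOTED sample and the parse_rss_data name are made up for illustration; they are not Les's real example. Instead of reaching into u._stack, it wraps the fragment in a params element so that xmlrpc.client.loads will accept it:

```python
import html
import re
import xmlrpc.client

# Hypothetical HTML-quoted payload. The element layout is an assumption,
# not a copy of the example on Les Orchard's page.
QUOTED = (
    "&lt;sdl:data&gt;&lt;sdl:struct&gt;"
    "&lt;sdl:member&gt;&lt;sdl:name&gt;Catalog&lt;/sdl:name&gt;"
    "&lt;sdl:value&gt;&lt;sdl:string&gt;Book&lt;/sdl:string&gt;&lt;/sdl:value&gt;"
    "&lt;/sdl:member&gt;"
    "&lt;sdl:member&gt;&lt;sdl:name&gt;Authors&lt;/sdl:name&gt;"
    "&lt;sdl:value&gt;&lt;sdl:array&gt;&lt;sdl:data&gt;"
    "&lt;sdl:value&gt;&lt;sdl:string&gt;A&lt;/sdl:string&gt;&lt;/sdl:value&gt;"
    "&lt;sdl:value&gt;&lt;sdl:string&gt;B&lt;/sdl:string&gt;&lt;/sdl:value&gt;"
    "&lt;/sdl:data&gt;&lt;/sdl:array&gt;&lt;/sdl:value&gt;"
    "&lt;/sdl:member&gt;"
    "&lt;/sdl:struct&gt;&lt;/sdl:data&gt;"
)


def parse_rss_data(page):
    """Return the Python value encoded in the first sdl:data element of page."""
    # turn the html-quoted example back into xml
    text = html.unescape(page)
    # grab everything inside the outermost sdl:data element (greedy match)
    match = re.search(r"<sdl:data>(.*)</sdl:data>", text, re.S)
    if match is None:
        raise ValueError("no <sdl:data> element found")
    # get rid of the namespace prefix
    inner = match.group(1).replace("sdl:", "")
    # wrap the fragment so it looks like a one-parameter XML-RPC message
    wrapped = "<params><param><value>%s</value></param></params>" % inner
    (value,), _method = xmlrpc.client.loads(wrapped)
    return value


result = parse_rss_data(QUOTED)
print(result)
```

The wrapper is only there because loads expects to see a params element before it will hand back a result; the decoding of structs, arrays and strings is done by the same unmarshaller that getparser returns.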