This post originated from an RSS feed registered with Ruby Buzz
by Jared Richardson.
Original Post: Converting from Rexml to libxml
Feed Title: 6th Sense Analytics
Feed URL: http://www.6thsenseanalytics.com/?feed=rss
Feed Description: The 6th Sense Analytics corporate blog
Once your Rails application starts to get serious, you start measuring its performance. And you immediately realize that Rexml is slow. Real slow. It's still fast enough for most applications, but when you start thinking on the enterprise level, you don't use Rexml... you look for alternatives. And when you start searching and asking people, everyone seems to say "Use Ruby libxml." So I did.
Ruby libxml is just a Ruby wrapper around the popular XML parser libxml, which is written in C and is really fast. Unfortunately, Rexml and Ruby libxml aren't interchangeable. Each has its own quirks and interfaces, so I thought I'd document how to change over your codebase from Rexml to libxml. (Hint hint... wouldn't it be cool if someone wrote a wrapper so that Ruby libxml could be used with the same interface as Rexml? Switching over would mean changing a require statement... who wants to be famous? Go write it!)
The install has a number of dependencies, so let me point you at the Install Page. Read it, and install the required libraries (libm, libz/zlib, libiconv, and libxml2) before you attempt to install the gem.
Once you've got it all installed and running, move to your Rails code.
Switch out the imports first.
require 'rexml/document' (plus include REXML, if you use it) becomes require 'libxml'.
Then start changing code.
doc = REXML::Document.new raw_content
becomes
xp = XML::Parser.new()
xp.string = raw_content
doc = xp.parse
Cycling through a document with Rexml looks like this:
doc.elements.each("/rss/channel/item") { |item|
uri = URI.parse(item.elements['link'].text)
}
becomes
doc.find('//rss/channel/item').each do |item|
  link = item.find('link')[0].child.to_s
  uri = URI.parse( link )
end
Also
rating = item.attributes['my-rating']
becomes
rating = item.property('my-rating')
And finally,
title = item.elements['title'].text
becomes
title = item.find('title')[0].child.to_s
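To see the pieces above working together, here's a minimal side-by-side sketch that pulls the title, link host, and rating out of a tiny feed with both libraries. The sample feed and the method names items_with_rexml and items_with_libxml are made up for illustration. The libxml half assumes a recent libxml-ruby release and uses LibXML::XML::Parser.string, content, and item['...'] rather than the older calls shown in this post, so treat it as one possible spelling, not the only one. The require is guarded so the script still runs when the gem isn't installed.

```ruby
require 'rexml/document'
require 'uri'

# libxml-ruby is a C extension gem; carry on without it if it is missing.
begin
  require 'libxml'
  HAVE_LIBXML = true
rescue LoadError
  HAVE_LIBXML = false
end

# Made-up sample data, only for this sketch.
FEED = <<~FEED_XML
  <rss>
    <channel>
      <item my-rating="5">
        <title>First post</title>
        <link>http://example.com/first</link>
      </item>
      <item my-rating="3">
        <title>Second post</title>
        <link>http://example.org/second</link>
      </item>
    </channel>
  </rss>
FEED_XML

# REXML version, using the calls shown in the post.
def items_with_rexml(raw)
  doc = REXML::Document.new(raw)
  items = []
  doc.elements.each('/rss/channel/item') do |item|
    items << {
      title:  item.elements['title'].text,
      host:   URI.parse(item.elements['link'].text).host,
      rating: item.attributes['my-rating']
    }
  end
  items
end

# libxml-ruby version (assumes a recent release of the gem).
def items_with_libxml(raw)
  doc = LibXML::XML::Parser.string(raw).parse
  doc.find('/rss/channel/item').map do |item|
    {
      title:  item.find('title').first.content,
      host:   URI.parse(item.find('link').first.content).host,
      rating: item['my-rating']
    }
  end
end
```

Calling items_with_rexml(FEED) gives you an array of hashes, and when HAVE_LIBXML is true, items_with_libxml(FEED) should give you the same data back.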
The libxml code is a bit uglier, but it parsed our XML ten times faster. An order of magnitude. I'll endure a little code ugliness for that.
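If you want to check the speed difference on your own data, a rough timing harness might look like the sketch below. It uses Ruby's standard Benchmark module. The helpers sample_feed and time_parsers are hypothetical names, the feed is synthetic, and the libxml call assumes a recent libxml-ruby release. It only times parsing, so your numbers will depend on your documents and on how much XPath work you do afterwards.

```ruby
require 'benchmark'
require 'rexml/document'

begin
  require 'libxml'
rescue LoadError
  # libxml-ruby is not installed; only REXML will be timed.
end

# Build a synthetic feed with n items (made-up data, for timing only).
def sample_feed(n)
  items = (1..n).map do |i|
    "<item><title>Post #{i}</title><link>http://example.com/#{i}</link></item>"
  end
  "<rss><channel>#{items.join}</channel></rss>"
end

# Returns a hash of parser name => seconds spent parsing raw `runs` times.
def time_parsers(raw, runs: 5)
  timings = {}
  timings['rexml'] = Benchmark.realtime do
    runs.times { REXML::Document.new(raw) }
  end
  if defined?(LibXML)
    timings['libxml'] = Benchmark.realtime do
      runs.times { LibXML::XML::Parser.string(raw).parse }
    end
  end
  timings
end
```

Something like time_parsers(sample_feed(1000)).each { |name, secs| puts "#{name}: #{secs}" } prints one line per parser that was available.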
As a final warning, the Darwin Ports package is very old (nearly a year), so you can't use a lot of the newer features if you use Ports to manage your Apple boxes.
The older version doesn't have first() or [] on find results, so code like
item.find('link').first.child
becomes
item.find('link').to_a[0].child
and
item.find('description')[0]
turns into
item.find('description').to_a[0]
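If you have to support both the old and the new gem, one option is to hide the difference behind a small helper. This is only a sketch, and first_node is a hypothetical name, not something libxml provides. It relies on nothing more than the result object responding to first or to_a.

```ruby
# Returns the first node of a find() result, or nil when there is none.
# Uses first when the result object has it, and falls back to to_a[0]
# for older versions that do not.
def first_node(result)
  return result.first if result.respond_to?(:first)

  result.to_a[0]
end
```

With that in place, item.find('link').first.child can be written as first_node(item.find('link')).child on either version.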
I hope this helps get you up and running quickly. I also hope someone with some free time writes a libxml wrapper that copies the Rexml interfaces so that switching over is as easy as swapping out the requires. Anyone?
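To give a feel for what such a wrapper could look like, here's a very small sketch. It is not a Rexml-compatible interface, just a thin facade over the handful of calls used in this post. XmlFacade and its backends are hypothetical names, the libxml half assumes a recent libxml-ruby release, and the facade quietly uses Rexml when the gem can't be loaded.

```ruby
require 'rexml/document'

begin
  require 'libxml'
rescue LoadError
  # libxml-ruby is not installed; the REXML backend is used instead.
end

# XmlFacade is a hypothetical name; it is not part of REXML or libxml-ruby.
module XmlFacade
  module RexmlBackend
    def self.parse(raw)
      REXML::Document.new(raw)
    end

    def self.each(node, xpath, &block)
      node.elements.each(xpath, &block)
    end

    def self.text(node, xpath)
      found = node.elements[xpath]
      found && found.text
    end

    def self.attr(node, name)
      node.attributes[name]
    end
  end

  module LibxmlBackend
    def self.parse(raw)
      LibXML::XML::Parser.string(raw).parse
    end

    def self.each(node, xpath, &block)
      node.find(xpath).each(&block)
    end

    def self.text(node, xpath)
      found = node.find(xpath).first
      found && found.content
    end

    def self.attr(node, name)
      node[name]
    end
  end

  # Pick libxml-ruby when it loaded, REXML otherwise.
  def self.backend
    defined?(LibXML) ? LibxmlBackend : RexmlBackend
  end
end
```

Application code would then call xml = XmlFacade.backend once and stick to xml.parse, xml.each, xml.text, and xml.attr, so swapping parsers never touches the feed-handling code.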