You think XPath is easier in JavaScript than in Ruby when it comes to invalid HTML? I’ve heard this in a lot of correspondence over the past week. Because JavaScript has the DOM, right?
Use HTree+REXML. HTree cleans and REXML peppers and gobbles. Here’s a hairy, little method that will save some pain:
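require 'htree'
require 'rexml/document'
require 'open-uri'

# Fetch a page, let HTree repair the markup, and hand back the
# <html> element as a REXML::Document.
def read_xhtml_from( uri )
  # On Ruby 3.0+, open-uri no longer patches Kernel#open; use URI.open( uri ).
  open( uri ) { |f| HTree.parse f }.each_child do |child|
    # Skip children that don't carry a tag name.
    next unless child.respond_to? :qualified_name
    if child.qualified_name == 'html'
      # Serialize the cleaned tree into a string REXML can parse.
      doc = ""; child.display_xml( doc )
      return REXML::Document.new( doc )
    end
  end
end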
Okay, so. How to use it? That nice REXML way you’re already used to.
html = read_xhtml_from "http://redhanded.hobix.com/"
html.each_element( "//div[@class='entryFooter']" ) do |e|
  puts e.text( "./a[starts-with(@href, 'http://redhanded.hobix.com/')]" )
end
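If you want to poke at the XPath half without a network connection or HTree installed, here is a minimal sketch that runs the same two queries against an inline string. The `SAMPLE_XHTML` markup and the `footer_link_texts` helper are made up for illustration; they stand in for whatever `read_xhtml_from` would hand back and are not part of the method above.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the cleaned-up markup HTree would produce.
SAMPLE_XHTML = <<~XHTML
  <html>
    <body>
      <div class="entryFooter">
        <a href="http://redhanded.hobix.com/inspect/example.html">permalink</a>
      </div>
      <div class="entryFooter">
        <a href="http://example.com/elsewhere.html">offsite</a>
      </div>
    </body>
  </html>
XHTML

# Hypothetical helper: collect the text of footer links whose href
# starts with +prefix+. Assumes +prefix+ contains no single quotes,
# since it is interpolated straight into the XPath expression.
def footer_link_texts( xhtml, prefix )
  doc = REXML::Document.new( xhtml )
  texts = []
  doc.each_element( "//div[@class='entryFooter']" ) do |e|
    # Element#text returns nil when the path matches nothing.
    texts << e.text( "./a[starts-with(@href, '#{prefix}')]" )
  end
  texts.compact
end

puts footer_link_texts( SAMPLE_XHTML, "http://redhanded.hobix.com/" )
```

The `compact` is there because a footer with no matching link yields `nil` from `e.text`, which the original loop would print as a blank line.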