Ruby Buzz Forum - No, XPath on Messy HTML is Just as Easy in Ruby



Red Handed

Posts: 1158
Nickname: redhanded
Registered: Dec, 2004

Red Handed is a Ruby-focused group blog.

No, XPath on Messy HTML is Just as Easy in Ruby

Posted: Aug 26, 2005 1:37 PM

This post originated from an RSS feed registered with Ruby Buzz by Red Handed.
Original Post: No, XPath on Messy HTML is Just as Easy in Ruby
Feed Title: RedHanded
Feed URL: http://redhanded.hobix.com/index.xml
Feed Description: sneaking Ruby through the system

You think XPath is easier in JavaScript than in Ruby when it comes to invalid HTML? I’ve heard this in a lot of correspondence over the past week. Because JavaScript has the DOM, right?

Use HTree+REXML. HTree cleans and REXML peppers and gobbles. Here’s a hairy, little method that will save some pain:

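 require 'htree'
 require 'rexml/document'
 require 'open-uri'

 # Fetch a page, let HTree tidy the markup, and hand back the
 # <html> element as a REXML::Document (nil if there isn't one).
 def read_xhtml_from( uri )
   # URI.open needs Ruby 2.5+; older Rubies use plain open( uri )
   URI.open( uri ) { |f| HTree.parse f }.each_child do |child|
     # skip text, comments, doctypes: only elements have a name
     next unless child.respond_to? :qualified_name
     if child.qualified_name == 'html'
       doc = ""; child.display_xml( doc )
       return REXML::Document.new( doc )
     end
   end
   nil
 end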

Okay, so. How to use it? That nice REXML way you’re already used to.

 html = read_xhtml_from "http://redhanded.hobix.com/" 
 html.each_element( "//div[@class='entryFooter']" ) do |e|
   puts e.text( "./a[starts-with(@href, 'http://redhanded.hobix.com/')]" )
 end
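
If you'd like to poke at that query without hitting the network, here is a minimal sketch that runs the same two XPath expressions over an inline string. The SAMPLE markup and the footer_link_texts helper are made up for illustration (they are not part of the original post), and the string is already well-formed, so HTree is skipped entirely.

```ruby
require 'rexml/document'

# Hypothetical stand-in for what read_xhtml_from would hand back:
# already well-formed markup, so nothing is fetched or cleaned here.
SAMPLE = <<XHTML
<html><body>
  <div class="entry">
    <p>Some post.</p>
    <div class="entryFooter">
      <a href="http://redhanded.hobix.com/cult/first.html">first permalink</a>
      <a href="http://example.com/elsewhere">off-site</a>
    </div>
  </div>
  <div class="entryFooter">
    <a href="http://example.com/only-offsite">nothing local</a>
  </div>
</body></html>
XHTML

# Illustrative helper: collect the text of the first on-site link
# found in each entryFooter div.
def footer_link_texts( html )
  texts = []
  html.each_element( "//div[@class='entryFooter']" ) do |e|
    # Element#text with a path returns nil when nothing matches
    texts << e.text( "./a[starts-with(@href, 'http://redhanded.hobix.com/')]" )
  end
  texts
end

html = REXML::Document.new( SAMPLE )
p footer_link_texts( html )
```

A footer with no link back to the site yields nil rather than an empty string, so tack on a compact (or skip nils before puts) if blank lines in the output bother you.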
