This post originated from an RSS feed registered with Ruby Buzz
by Red Handed.
Original Post: Okay, Give Hpricot 0.2 a Go
Feed Title: RedHanded
Feed URL: http://redhanded.hobix.com/index.xml
Feed Description: sneaking Ruby through the system
So the Hpricot parser is basically complete. There’s still lots of fiddling ahead: it doesn’t handle JavaScript whatsoever and it’s not yet as flexible as HTree. However, it does fix a lot of HTML that RubyfulSoup and the htmltools won’t.
Here’s a benchmark parsing the Boing Boing home page fifty times. It’s a good page to test because it’s big and there are some bogus end tags, old-style tables and break tags.
                   user     system      total         real
hpricot:      10.515625   0.000000  10.515625 ( 10.610571)
htree:        56.609375   0.023438  56.632812 ( 57.096530)
rubyfulsoup:  29.289062   0.046875  29.335938 ( 29.586510)
mechanize:   148.132812   1.101562 149.234375 (150.621922)
The mechanize benchmark parses and converts to a REXML document, since mechanize itself only gives you links, form elements, nothing complex. So this may be unfair.
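If you want to run this kind of comparison yourself, here’s a minimal sketch using Ruby’s standard Benchmark module. It is not the script behind the numbers above. The time_parsers helper, the tag_counter stand-in and the sample page are all made up for illustration, so the sketch runs without any gems installed. To time a real parser, swap in a lambda that calls that library on your HTML.

```ruby
require 'benchmark'

# Hypothetical stand-in "parser" so the sketch runs with no gems installed.
# It only counts opening tags; replace it with a call into a real library.
tag_counter = lambda { |html| html.scan(/<[a-zA-Z][^>]*>/).size }

# Hypothetical helper: runs each parser `runs` times over the same HTML
# and returns a hash of label => Benchmark::Tms.
def time_parsers(html, parsers, runs = 50)
  parsers.each_with_object({}) do |(label, parser), results|
    results[label] = Benchmark.measure { runs.times { parser.call(html) } }
  end
end

# Small, deliberately sloppy sample page (stray end tag, old-style break tag).
sample_html = '<html><body><p class="posted">one</p></b><br>' \
              '<p class="posted">two</p></body></html>'

results = time_parsers(sample_html, { 'tag_counter' => tag_counter })

# Print user/system/total/real columns, one row per parser.
puts ' ' * 13 + Benchmark::CAPTION
results.each do |label, tms|
  puts "#{label}:".ljust(13) + tms.format(Benchmark::FORMAT)
end
```

Feeding every parser the same string the same number of times keeps the rows comparable. Any extra work a given lambda does (like the REXML conversion in the mechanize row) gets counted against that parser.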
I didn’t include scrapi because, although it parses the page, it fails some of my other tests. For example, when using a selector to find all p.posted elements, I get back only one element with scrapi, while the others all report back sixty elements. So, I’ll post a benchmark when I understand what I’m doing wrong.