This post originated from an RSS feed registered with Java Buzz
by Nick Lothian.
Original Post: Using XPath on real-world HTML documents
Feed Title: BadMagicNumber
Feed URL: http://feeds.feedburner.com/Badmagicnumber
Feed Description: Java, Development and Me
The article on the Server Side about Web-Harvest reminded me of one of my favourite things: using XPath to extract data from HTML documents.
Obviously, XPath won't work normally on most real-world webpages, because they aren't well-formed XML. However, the magic of TagSoup gives you a SAX parser that will work on ugly HTML. XOM can build a document from that SAX stream, and you can then run XPath against it.
Here's the magic invocation to make TagSoup, XOM and XPath all work together:
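import nu.xom.Builder;
import nu.xom.Document;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;
XMLReader tagsoup = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
tagsoup.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
Builder bob = new Builder(tagsoup);
Document doc = bob.build(url);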
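One thing to watch for once you have the document: as far as I know, TagSoup by default reports every element in the XHTML namespace (http://www.w3.org/1999/xhtml), so an unprefixed expression like //a/@href matches nothing. With XOM that should mean passing a namespace context, along the lines of doc.query("//html:a/@href", new XPathContext("html", "http://www.w3.org/1999/xhtml")). The sketch below shows the same effect using only the JDK's javax.xml.xpath API in place of XOM, and a hand-written, well-formed XHTML string standing in for TagSoup's output, so it runs without extra jars. The class name, method name, "html" prefix and sample page are all made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: JDK XPath stands in for XOM, and a literal XHTML
// string stands in for the namespaced output TagSoup is assumed to produce.
public class XhtmlXPathSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Evaluates an XPath expression and returns the string value of each matched node. */
    public static List<String> query(String xhtml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // without this, namespaces are not tracked at all
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the (arbitrary) prefix "html" to the XHTML namespace.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "html" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                        ? Collections.singletonList("html").iterator()
                        : Collections.<String>emptyIterator();
            }
        });

        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<a href=\"http://example.com/a\">A</a>"
                + "<p><a href=\"http://example.com/b\">B</a></p>"
                + "</body></html>";
        System.out.println(query(page, "//html:a/@href")); // prefixed: finds both links
        System.out.println(query(page, "//a/@href"));      // unprefixed: finds nothing
    }
}
```

If your queries against a TagSoup-built document come back empty, a missing namespace prefix is the first thing I'd check.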