Java Buzz Forum - Using XPath on real-world HTML documents



Nick Lothian

Posts: 397
Nickname: nicklothia
Registered: Jun, 2003

Nick Lothian is Java Developer & Team Leader

Using XPath on real-world HTML documents

Posted: Sep 11, 2006 1:47 AM

This post originated from an RSS feed registered with Java Buzz by Nick Lothian.
Original Post: Using XPath on real-world HTML documents
Feed Title: BadMagicNumber
Feed URL: http://feeds.feedburner.com/Badmagicnumber
Feed Description: Java, Development and Me

The article on the Server Side about Web-Harvest reminded me of one of my favourite things: using XPath to extract data from HTML documents.

Obviously, XPath won't work normally on most real-world webpages, because they aren't well-formed XML. However, the magic of TagSoup gives you a SAX parser that will work on ugly HTML. You can then build a document from that SAX stream and run XPath against it.

Here's the magic invocation to make TagSoup, XOM and XPath all work together:

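import nu.xom.Builder;
import nu.xom.Document;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

XMLReader tagsoup = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
tagsoup.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
Builder bob = new Builder(tagsoup);
Document doc = bob.build(url);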

which then allows you to do things like:

XPathContext context = new XPathContext("html", "http://www.w3.org/1999/xhtml");
Nodes table = doc.query("//html:table[@class='forumline']", context);
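The html: prefix in that query matters: TagSoup puts the elements it emits in the XHTML namespace, so an unprefixed //table will not match them. Here is a rough sketch of that behaviour using only the JDK's own DOM and XPath classes rather than TagSoup and XOM, so it runs without extra jars. The class name, the helper methods and the sample markup are all made up for illustration; the sample is just a well-formed stand-in for what a cleaned-up page might look like.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespacedXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical sample: a stand-in for a page after an HTML cleaner has run.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<table class=\"forumline\"><tr><td>post</td></tr></table>"
        + "<table class=\"other\"><tr><td>nav</td></tr></table>"
        + "</body></html>";

    // Parse with namespace awareness switched on, otherwise prefixes mean nothing.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Count the nodes an expression selects, with "html" bound to the XHTML namespace.
    static int count(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "html" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                    ? Collections.singletonList("html").iterator()
                    : Collections.<String>emptyIterator();
            }
        });
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        // Prefixed query finds the table in the XHTML namespace.
        System.out.println(count(doc, "//html:table[@class='forumline']")); // prints 1
        // Unprefixed query looks for elements in no namespace and finds nothing.
        System.out.println(count(doc, "//table[@class='forumline']"));      // prints 0
    }
}
```

The anonymous NamespaceContext here plays the same role as XOM's XPathContext in the snippet above: it only tells the XPath engine which namespace URI the html prefix stands for.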

Cool, huh?
