The Artima Developer Community
Sponsored Link

Java Buzz Forum
Using XPath on real-world HTML documents

0 replies on 1 page.

Welcome Guest
  Sign In

Go back to the topic listing  Back to Topic List Click to reply to this topic  Reply to this Topic Click to search messages in this forum  Search Forum Click for a threaded view of the topic  Threaded View   
Previous Topic   Next Topic
Flat View: This topic has 0 replies on 1 page
Nick Lothian

Posts: 397
Nickname: nicklothia
Registered: Jun, 2003

Nick Lothian is Java Developer & Team Leader
Using XPath on real-world HTML documents Posted: Sep 11, 2006 1:47 AM
Reply to this message Reply

This post originated from an RSS feed registered with Java Buzz by Nick Lothian.
Original Post: Using XPath on real-world HTML documents
Feed Title: BadMagicNumber
Feed URL: http://feeds.feedburner.com/Badmagicnumber
Feed Description: Java, Development and Me
Latest Java Buzz Posts
Latest Java Buzz Posts by Nick Lothian
Latest Posts From BadMagicNumber

Advertisement

The article on the Server Side about Web-Harvest reminded me on one of my favourite things: using XPath to extract data from HTML documents.

Obviously, XPath won't work normall on most real-world webpages, because they aren't valid XML. However, the magic of TagSoup gives you a SAX-parser that will work on ugly HTML. You can then use XPath against that SAX stream.

Here's the magic innvocation to make TagSoup, XOM and XPath all work together:

XMLReader tagsoup = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
tagsoup.setFeature("http://xml.org/sax/features/namespace-prefixes",true);
Builder bob = new Builder(tagsoup);
Document doc = bob.build(url);
which then allows you to do things like:
XPathContext context = new XPathContext("html", "http://www.w3.org/1999/xhtml");
Nodes table = doc.query("//html:table[@class='forumline']", context);

Cool, hu?

Read: Using XPath on real-world HTML documents

Topic: What Makes a Web Scripting Language Successful Previous Topic   Next Topic Topic: Articles Coming

Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use