Summary
Web-Harvest is an open-source screen-scraping tool that helps extract data from Web sites. It uses XSLT, XQuery, and regular expressions, and provides a configurable set of pipelines that process the raw HTML data of a Web site.
The open-source Web-Harvest data extraction tool project announced its initial public release. The Java-based tool allows developers to programmatically extract data from existing Web sites.
While screen-scraping evokes memories of DOS applications, the technique was also the primary means of programmatically extracting information from Web sites in the pre-RSS and Web-services days. Indeed, many large sites, such as eBay, developed their programmatic APIs in response to the increasing number of screen-scraping applications harvesting information from their Web pages.
Web-Harvest uses XSLT, XQuery, and regular expressions to aid data extraction, or screen scraping, from HTML and XML-based Web sites. It provides a set of configurable pipelines to process each page:
Every extraction procedure in Web-Harvest is user-defined through XML-based configuration files. Each configuration file describes a sequence of processors, each executing a common task, in order to accomplish the final goal. Processors execute in the form of a pipeline; thus, the output of one processor execution is the input to the next.
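To illustrate the pipeline idea, a configuration might chain an HTTP download into an HTML-to-XML cleanup step and then an XPath query. The sketch below uses processor element names from Web-Harvest's documentation (http, html-to-xml, xpath, var-def), but the URL and XPath expression are placeholders; treat it as an illustrative example rather than a tested configuration:

```xml
<config charset="UTF-8">
    <!-- Store the result of the inner pipeline in a variable -->
    <var-def name="pageTitle">
        <!-- 3. Query the cleaned-up XML with an XPath expression -->
        <xpath expression="//title/text()">
            <!-- 2. Convert the raw HTML into well-formed XML -->
            <html-to-xml>
                <!-- 1. Download the raw HTML page -->
                <http url="http://www.example.com/"/>
            </html-to-xml>
        </xpath>
    </var-def>
</config>
```

Reading the nesting from the inside out mirrors the pipeline: each processor's output becomes the input of the processor that encloses it.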
While RSS provides a standard format for exposing a site's data in a structured manner, a site's RSS feed may still publish only a subset of the available data. Screen scraping, which applies knowledge of a Web page's layout to extract just the needed data from the page, can still come in handy. What Web sites do you obtain data from in that manner?