Java Answers Forum - how to strip html tags using java


4 replies on 1 page. Most recent reply: Jul 23, 2002 7:59 PM by Somik Raha


kumar

Posts: 3
Nickname: kumar
Registered: Mar, 2002

how to strip html tags using java

Posted: Mar 21, 2002 5:15 AM

I am trying to create a web spider in Java. I can connect to any URL and retrieve the contents of the page, HTML tags included.

How can I strip the HTML tags from the page, and how can I capture the anchor tags so that I can store their URLs?

Please help.

kumar

Posts: 3
Nickname: kumar
Registered: Mar, 2002

Re: how to strip html tags using java

Posted: Mar 21, 2002 9:43 AM

I am still waiting for guidance on how to strip HTML tags.

Please let me know as soon as you can.

Matt Gerrans

Posts: 1153
Nickname: matt
Registered: Feb, 2002

Re: how to strip html tags using java

Posted: Mar 21, 2002 9:57 AM

I don't know of any specific HTML processing tools for Java, but they probably exist. I would start by looking at http://java.sun.com; for instance, JAXP does XML processing and may be acceptable for the HTML processing you need.
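To make the JAXP suggestion concrete, here is a minimal sketch. One caveat: JAXP's parsers expect well-formed XML, so this route only works when the page is valid XHTML, and ordinary tag-soup HTML will be rejected. The class name XhtmlLinks, its method names, and the sample markup are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: extracts text and links from well-formed XHTML only.
class XhtmlLinks {

    // Parses the markup into a DOM tree; rejects anything that is not well-formed XML.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not well-formed XML: " + e.getMessage(), e);
        }
    }

    // Collects the href attribute of every <a> element that has one.
    static List<String> hrefs(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {
                links.add(a.getAttribute("href"));
            }
        }
        return links;
    }

    // Returns the character data of the whole document with all tags removed.
    static String text(Document doc) {
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p>See <a href=\"http://www.example.com/\">Example</a> now</p></body></html>");
        System.out.println(text(doc));
        System.out.println(hrefs(doc));
        try {
            parse("<p>unclosed<br></p>"); // typical HTML, but not well-formed XML
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

For pages found in the wild, a parser that tolerates malformed markup is usually the safer choice.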

Also, you can check out JavaCC (http://www.webgain.com/products/java_cc/), which can be used to parse any grammar. It even comes with an HTML-processing sample.

Finally, check out the article "The Swing HTML Parser" at http://java.sun.com/products/jfc/tsc/articles/bookmarks/index.html, which mentions "An example provided shows how to use the standard HTML parser..."
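As a rough illustration of the Swing parser route for the tag-stripping half of the question, the sketch below keeps only the character data that the parser reports through handleText. The class name TagStripper and the sample markup are hypothetical, and joining text runs with a single space is a simplification chosen here, not something the article prescribes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text of a page and discards the tags.
class TagStripper extends HTMLEditorKit.ParserCallback {

    private final StringBuilder text = new StringBuilder();

    // Called by the parser for each run of character data between tags.
    @Override
    public void handleText(char[] data, int position) {
        text.append(data).append(' ');
    }

    // Parses the HTML and returns its text with runs of whitespace collapsed.
    // Note: because runs are joined with a space, "<b>bo</b>ld" comes out as "bo ld".
    static String strip(String html) {
        TagStripper callback = new TagStripper();
        try {
            // true = ignore charset declarations inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(strip(page));
    }
}
```

For a spider, the same callback could be fed an InputStreamReader wrapped around the connection's input stream instead of a StringReader.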

Charles Bell

Posts: 519
Nickname: charles
Registered: Feb, 2002

Re: how to strip html tags using java

Posted: Mar 21, 2002 1:58 PM

In your code you could use lines such as the following, together with the classes below them:


//debug (a boolean) and getHttpURLConnection(String) belong to the calling class and are not shown here
HttpURLConnection.setFollowRedirects(true);
if (debug) System.out.println("Using normal connection");
HttpURLConnection connection = getHttpURLConnection(urlstring);
if (connection == null){
	System.out.println("Connection is null");
}else{
	System.out.println("Established connection");
	//HREFExtractor creates its own parser, so no separate HTMLParser is needed here
	HREFExtractor hrefextractor = new HREFExtractor();
	hrefextractor.parseHttpURLConnection(connection);
	Vector links = hrefextractor.getHREFAttributes();
}

*************************


import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URLConnection;
import java.util.Vector;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

class HTMLParser extends HTMLEditorKit{
	//HTMLEditorKit.getParser() has protected access
	//thus must subclass HTMLEditorKit to get a parser object
	public HTMLEditorKit.Parser getParser(){
		return super.getParser();
	}
}

class HREFExtractor extends HTMLEditorKit.ParserCallback{

	//set to true to trace each link as it is found
	static boolean debug = false;

	HTMLParser htmlparser;
	Vector<String> v;
	HTMLEditorKit.Parser parser;

	HREFExtractor(){
		v = new Vector<String>();
		htmlparser = new HTMLParser();
		parser = htmlparser.getParser();
	}

	/**	Called by the parser for every opening tag; records the href of each anchor.
	*/
	public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position){
		if (tag == HTML.Tag.A){
			if (debug) System.out.println("Found link element: " + tag);
			//the parser keys known attributes by HTML.Attribute constants
			Object link = attributes.getAttribute(HTML.Attribute.HREF);
			if (link != null){
				v.add(link.toString());
				if (debug) System.out.println("adding: " + link);
			}
		}
	}

	/**	Parses a document on the web with a URL connection.
	*/
	public void parseURLConnection(URLConnection connection){
		try (InputStream is = connection.getInputStream();
				InputStreamReader isr = new InputStreamReader(is)){
			parser.parse(isr,this,false);
		}catch (IOException ioexception){
			showErrorMessage("IOException: " + ioexception.getMessage());
		}
	}

	/**	Parses a document on the web with an HTTP URL connection.
	*/
	public void parseHttpURLConnection(HttpURLConnection connection){
		try (InputStream is = connection.getInputStream();
				InputStreamReader isr = new InputStreamReader(is)){
			parser.parse(isr,this,true);
			//last argument set true to avoid ChangedCharSetException
		}catch (ChangedCharSetException ccse){
			showErrorMessage(ccse.getMessage() + " CharSetSpec: " + ccse.getCharSetSpec());
		}catch (IOException ioexception){
			showErrorMessage("IOException: " + ioexception.getMessage());
		}
	}

	/**	Stand-in for the original error helper, which was not part of the post;
	*	this version simply prints to standard error.
	*/
	private void showErrorMessage(String message){
		System.err.println(message);
	}

	public Vector<String> getHREFAttributes(){
		return v;
	}

}
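The href values collected this way are whatever the page author wrote, so many of them will be relative. A spider that stores URLs will probably want them absolute. The sketch below shows one way to do that with java.net.URI; the class name LinkResolver, the method resolveAll, and the example.com addresses are hypothetical and not part of the code above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns raw href values into absolute URLs.
class LinkResolver {

    // Resolves each href against the URL of the page it was found on.
    // Values that are not legal URIs (for example, ones containing spaces) are skipped.
    static List<String> resolveAll(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> absolute = new ArrayList<>();
        for (String href : hrefs) {
            try {
                absolute.add(base.resolve(href.trim()).toString());
            } catch (IllegalArgumentException e) {
                // malformed link: ignore it rather than abort the crawl
            }
        }
        return absolute;
    }

    public static void main(String[] args) {
        List<String> found = Arrays.asList(
                "thread.jsp?id=1", "../index.html", "http://other.example.org/a", "bad link");
        for (String url : resolveAll("http://www.example.com/forums/flat.jsp", found)) {
            System.out.println(url);
        }
    }
}
```

Skipping malformed links silently is a design choice for this sketch; a real spider might prefer to log them.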

Somik Raha

Posts: 2
Nickname: somik
Registered: Jul, 2002

Re: how to strip html tags using java

Posted: Jul 23, 2002 7:59 PM

Hi,
This is really easy with HTMLParser (http://htmlparser.sourceforge.net), an open source HTML parsing library in Java.
It's much faster and better designed than the Swing parser, and it handles dirty HTML.
Your application with HTMLParser will only be a few lines long.

Cheers,
Somik
