The Artima Developer Community

Java Answers Forum
how to strip html tags using java

4 replies on 1 page. Most recent reply: Jul 23, 2002 7:59 PM by Somik Raha

kumar

Posts: 3
Nickname: kumar
Registered: Mar, 2002

how to strip html tags using java Posted: Mar 21, 2002 5:15 AM
I am trying to create a web spider using Java. I am able to connect to any URL and retrieve the contents of the page, along with its HTML tags.

How can I strip the HTML tags from the page, and how can I capture the anchor tags so that I can store their URLs?

Please help.


kumar

Posts: 3
Nickname: kumar
Registered: Mar, 2002

Re: how to strip html tags using java Posted: Mar 21, 2002 9:43 AM
I am still waiting for guidance on how to strip HTML tags.

Please let me know as soon as possible.

Matt Gerrans

Posts: 1153
Nickname: matt
Registered: Feb, 2002

Re: how to strip html tags using java Posted: Mar 21, 2002 9:57 AM
I don't know of any specific HTML processing tools for Java, but they probably exist. I would start by looking at http://java.sun.com; for instance, JAXP does XML processing and may be acceptable for the HTML processing you need, provided the pages are well-formed.

Also, you can check out JavaCC (http://www.webgain.com/products/java_cc/), which can be used to parse any grammar. It even comes with an HTML-processing sample.

Finally, check out this article "The Swing HTML Parser" at http://java.sun.com/products/jfc/tsc/articles/bookmarks/index.html which mentions "An example provided shows how to use the standard HTML parser..."
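As a rough illustration of the tag-stripping half of the question, here is a minimal sketch built on the standard Swing parser. It assumes the page has already been read into a String; the class name TagStripper and the method stripTags are made up for this example and do not come from the article. Each text chunk is trimmed and the chunks are joined with single spaces, so the original spacing is not preserved exactly.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects only the text nodes the Swing parser reports.
public class TagStripper {

    public static String stripTags(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags; the tags themselves are never appended.
                String chunk = new String(data).trim();
                if (chunk.isEmpty()) {
                    return;
                }
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(chunk);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // true = ignore charset declarations in the page, since the input is already decoded.
            new ParserDelegator().parse(reader, callback, true);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(stripTags(html));
    }
}
```

ParserDelegator is public, so this route avoids subclassing HTMLEditorKit just to reach its protected getParser() method.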

Charles Bell

Posts: 519
Nickname: charles
Registered: Feb, 2002

Re: how to strip html tags using java Posted: Mar 21, 2002 1:58 PM
In your code you could use lines such as these, together with the classes below:

// debug, urlstring and getHttpURLConnection() are defined elsewhere in my program (not shown).
HttpURLConnection.setFollowRedirects(true);
if (debug) System.out.println("Using normal connection");
HttpURLConnection connection = getHttpURLConnection(urlstring);
if (connection == null) {
    System.out.println("Connection is null");
} else {
    System.out.println("Established connection");
    // HREFExtractor creates its own parser, so no separate HTMLParser is needed here.
    HREFExtractor hrefextractor = new HREFExtractor();
    hrefextractor.parseHttpURLConnection(connection);
    Vector links = hrefextractor.getHREFAttributes();
}


*************************

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URLConnection;
import java.util.Vector;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

class HTMLParser extends HTMLEditorKit {
    // HTMLEditorKit.getParser() has protected access,
    // so HTMLEditorKit must be subclassed to get a parser object.
    public HTMLEditorKit.Parser getParser() {
        return super.getParser();
    }
}

class HREFExtractor extends HTMLEditorKit.ParserCallback {

    // Stand-in for the debug flag, which lives elsewhere in the full program.
    static boolean debug = false;

    HTMLParser htmlparser;
    Vector<String> v;               // href values, in document order
    HTMLEditorKit.Parser parser;

    HREFExtractor() {
        v = new Vector<String>();
        htmlparser = new HTMLParser();
        parser = htmlparser.getParser();
    }

    // Stand-in for the error-reporting helper from the full program.
    static void showErrorMessage(String message) {
        System.err.println(message);
    }

    /** Called by the parser for every start tag; records the href of each anchor. */
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag == HTML.Tag.A) {
            if (debug) System.out.println("Found link element: " + tag);
            // The parser stores the href attribute under the key HTML.Attribute.HREF.
            Object link = attributes.getAttribute(HTML.Attribute.HREF);
            if (link != null) {
                v.add(link.toString());
                if (debug) System.out.println("adding: " + link);
            }
        }
    }

    /** Parses a document on the web with a URL connection.
     */
    public void parseURLConnection(URLConnection connection) {
        try (InputStream is = connection.getInputStream();
             InputStreamReader isr = new InputStreamReader(is)) {
            // Last argument false: a charset declaration in the page can raise
            // ChangedCharSetException, which is an IOException and is caught below.
            parser.parse(isr, this, false);
        } catch (IOException ioexception) {
            showErrorMessage("IOException: " + ioexception.getMessage());
        }
    }

    /** Parses a document on the web with an HTTP URL connection,
     *  ignoring any charset declaration inside the page.
     */
    public void parseHttpURLConnection(HttpURLConnection connection) {
        try (InputStream is = connection.getInputStream();
             InputStreamReader isr = new InputStreamReader(is)) {
            // Last argument set to true to avoid ChangedCharSetException.
            parser.parse(isr, this, true);
        } catch (ChangedCharSetException ccse) {
            showErrorMessage(ccse.getMessage() + " CharSetSpec: " + ccse.getCharSetSpec());
        } catch (IOException ioexception) {
            showErrorMessage("IOException: " + ioexception.getMessage());
        }
    }

    public Vector<String> getHREFAttributes() {
        return v;
    }

}
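For a spider, the collected href values usually need to be turned into absolute URLs before they can be fetched. The following is only a sketch of one way to do that, not part of the classes above: LinkCollector, collect, and the example.com addresses are invented for illustration, and the page is assumed to be available as a String along with the URL it was loaded from. It uses the public ParserDelegator class and resolves each href against the page URL with java.net.URI.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: gathers anchor hrefs and resolves them against the page URL.
public class LinkCollector extends HTMLEditorKit.ParserCallback {

    private final URI base;
    private final List<String> links = new ArrayList<String>();

    public LinkCollector(URI base) {
        this.base = base;
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag != HTML.Tag.A) {
            return;
        }
        Object href = attributes.getAttribute(HTML.Attribute.HREF);
        if (href == null) {
            return; // e.g. <a name="..."> has no href
        }
        try {
            // Relative links become absolute; absolute links are returned unchanged.
            links.add(base.resolve(href.toString().trim()).toString());
        } catch (IllegalArgumentException malformed) {
            // Skip href values that are not valid URIs.
        }
    }

    public static List<String> collect(String html, String pageUrl) throws IOException {
        LinkCollector collector = new LinkCollector(URI.create(pageUrl));
        new ParserDelegator().parse(new StringReader(html), collector, true);
        return collector.links;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><a href=\"other.html\">one</a>"
                + " <a href=\"/top.html\">two</a></body></html>";
        for (String link : collect(html, "http://example.com/dir/page.html")) {
            System.out.println(link);
        }
    }
}
```

Resolving at collection time keeps the stored list directly usable as the spider's fetch queue. Filtering out mailto: or javascript: links is left to the caller.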

Somik Raha

Posts: 2
Nickname: somik
Registered: Jul, 2002

Re: how to strip html tags using java Posted: Jul 23, 2002 7:59 PM
Hi,
This is really easy with HTMLParser (http://htmlparser.sourceforge.net), an open source HTML parsing library in Java.
It's much faster and better designed than the Swing parser, and it handles dirty HTML.
Your application with HTMLParser will only be a few lines long.

Cheers,
Somik


Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use