The Artima Developer Community
Java Performance
Integrating HTML validation to my site building process
by Jack Shirazi
July 14, 2004
Summary
I decided it was time to validate the HTML on my site, but wanted an integrated solution that would flag problems during the build process.

I generate my website using a local servlet container and JSP pages that convert text source to HTML pages, then I upload all the pages to the server. Inspired by reading Cleaning Your Web Pages with HTML Tidy, I decided it was about time I had my HTML validated. But I wanted to do it as an integral part of the build process, not as an afterthought. That way, if HTML errors crept into the pages for whatever reason, they would be flagged immediately. It turned out to be extremely easy to do.

First off, I am already building my pages locally using a Java program which connects to my local servlet container, asks for each page, then stores it locally. This gives me a dynamic page display process for building my pages, with all the power and flexibility of servlets and JSPs. The result is a set of static pages which I can upload to my internet site, JavaPerformanceTuning.com, where they download extremely fast.
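
As a rough illustration of that fetch-and-store step, here is a minimal sketch. It is not the actual build program: the class name PageFetcher, the method fetchPage, the container address http://localhost:8080/ and the page names are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: none of these names come from the original build program.
class PageFetcher
{
  // Requests one page from the given URL and stores the response body in destination,
  // overwriting any copy left over from a previous build.
  static void fetchPage(URL source, Path destination) throws IOException
  {
    try (InputStream in = source.openStream())
    {
      Files.copy(in, destination, StandardCopyOption.REPLACE_EXISTING);
    }
  }

  public static void main(String[] args) throws IOException
  {
    // Assumed local container address and page names, for illustration only.
    URL page = URI.create("http://localhost:8080/index.jsp").toURL();
    fetchPage(page, Paths.get("index.shtml"));
  }
}
```

Each stored file is then a plain static page, which is the point at which a validation step can be hooked in.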

So all I had to do to add HTML validation was add one method to my build process. Once each page is complete and stored in a local file, the build now calls a new validateHTML(File destinationfile) method.

My validateHTML method basically calls the "Tidy" executable on the newly created HTML file (Tidy is a freely available tool that validates and corrects HTML). Then I check Tidy's output for anything I'm interested in. If there is a problem, I throw an exception.

I use Process to execute Tidy as an external process. I could process Tidy's stdout and stderr directly from the program, but there is no need: it is much simpler to have Tidy dump these to files and then check those files. I don't actually use Tidy's HTML output for my web pages; I'm really using it only as a validator. It is worth noting that the W3C has a validator at http://validator.w3.org/ if you only need to check some pages, but in my case I wanted to have all my pages checked each time I rebuilt the site.
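
For comparison, reading the external process's output directly might look something like the sketch below. This is not what my build does; the class name ProcessOutput and the method run are made up for the example, and the file name page.html is a placeholder. Merging stderr into stdout means a single reader sees everything the process prints.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper showing the "read the output directly" alternative.
class ProcessOutput
{
  // Runs the command, merges stderr into stdout, and returns every output line.
  static List<String> run(List<String> command) throws IOException, InterruptedException
  {
    ProcessBuilder builder = new ProcessBuilder(command);
    builder.redirectErrorStream(true);
    Process process = builder.start();
    List<String> lines = new ArrayList<String>();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream())))
    {
      String line;
      while ((line = reader.readLine()) != null)
      {
        lines.add(line);
      }
    }
    // The stream is fully drained, so the process cannot block on a full pipe here.
    process.waitFor();
    return lines;
  }

  public static void main(String[] args) throws IOException, InterruptedException
  {
    // Without -f, Tidy's notifications arrive on stderr; page.html is a placeholder name.
    for (String line : run(Arrays.asList("Tidy", "-o", "tt.txt", "page.html")))
    {
      System.out.println(line);
    }
  }
}
```

The file-based approach avoids the reader loop entirely, which is why I stayed with it.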

I am only interested in the line notification warnings and errors that Tidy emits, so I use a regular expression to detect and parse those lines. In addition, there are some warnings that I don't really care to fix at the moment, so I have added the ability to ignore those, either on a per-file basis or globally (see the two entries in the TidyNoficationsToIgnore HashMap for examples).
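
To make the parsing concrete, here is a small sketch that applies the same regular expression as TidyHTMLLineNotification to a sample line. The class name TidyLineDemo, the method parse and the sample text are made up for illustration; the sample simply follows the "line N column M - message" shape the expression expects.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regular expression is taken from the listing.
class TidyLineDemo
{
  static final Pattern NOTIFICATION = Pattern.compile("^line\\s+(\\d+)\\s+column\\s+(\\d+)\\s+\\-\\s+(.*)$");

  // Returns {line, column, message}, or null when the text is not a line notification.
  static String[] parse(String text)
  {
    Matcher m = NOTIFICATION.matcher(text);
    if (!m.matches())
    {
      return null;
    }
    return new String[] { m.group(1), m.group(2), m.group(3) };
  }

  public static void main(String[] args)
  {
    String[] parts = parse("line 12 column 5 - Warning: trimming empty <p>");
    // prints: 12 / 5 / Warning: trimming empty <p>
    System.out.println(parts[0] + " / " + parts[1] + " / " + parts[2]);
  }
}
```

The third group is the message text, which is what gets looked up in the ignore map, either on its own or prefixed with the file name and a '+'.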

Finally, if I do find a problem, I like to print the error and the relevant line from the HTML file so that I can see where it is and what to fix.

Here's the code in case anyone else needs to solve this problem in a similar way. If you have problems getting Tidy to execute, it's probably a path issue, so you might try using the path to the executable in the command, e.g. .\Tidy or ./Tidy

//Note I am putting this code fragment in the public domain
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

//Placeholder class name: the original fragment omits the enclosing class
public class HtmlValidator
{
  //Matches Tidy notifications of the form "line N column M - message"
  public static final Pattern TidyHTMLLineNotification = Pattern.compile("^line\\s+(\\d+)\\s+column\\s+(\\d+)\\s+\\-\\s+(.*)$");
  //Notifications to ignore: key is "message" (globally) or "file+message" (per file)
  static Map<String, Boolean> TidyNoficationsToIgnore = new HashMap<String, Boolean>();
  static
  {
    TidyNoficationsToIgnore.put("newsletter013.shtml+Warning: discarding unexpected </p>", Boolean.TRUE); //ignore in this file only
    TidyNoficationsToIgnore.put("Warning: trimming empty <p>", Boolean.TRUE); //always ignore
  }

  public static void validateHTML(File destinationfile)
    throws IOException, InterruptedException
  {
    //-o tt.txt: Tidy's fixed HTML, if you want it.
    //-f t2.txt: Tidy's warnings and errors.
    //Passing the arguments as an array keeps file names containing spaces intact.
    String[] command = {"Tidy", "-o", "tt.txt", "-f", "t2.txt", destinationfile.toString()};
    Runtime.getRuntime().exec(command).waitFor();
    BufferedReader rdr = new BufferedReader(new FileReader("t2.txt"));
    try
    {
      String line;
      while( (line = rdr.readLine()) != null)
      {
        //Only interested in lines beginning with "line"
        if (!line.startsWith("line "))
          continue;
        Matcher m = TidyHTMLLineNotification.matcher(line);
        if (!m.matches())
          continue;
        String linenumstr = m.group(1);
        String message = m.group(3);
        //Skip notifications that are ignored globally or for this file.
        //The per-file key must match destinationfile.toString() exactly.
        if (TidyNoficationsToIgnore.containsKey(message) ||
            TidyNoficationsToIgnore.containsKey(destinationfile.toString()+'+'+message))
          continue;
        //line number in destinationfile of problem. Read the file
        //and get that line (l2) and the line before (l1)
        int linenum = Integer.parseInt(linenumstr);
        String l2 = null, l1 = null;
        BufferedReader rdr2 = new BufferedReader(new FileReader(destinationfile));
        try
        {
          for (int i = 0; i < linenum; i++)
          {
            l1 = l2;
            l2 = rdr2.readLine();
          }
        }
        finally
        {
          rdr2.close();
        }
        String nl = System.getProperty("line.separator");
        throw new IOException("HTML Validation Problem Identified by Tidy in file " + destinationfile + ": line " +
            linenum + " / " + message + nl + l1 + nl + l2);
      }
    }
    finally
    {
      //Closes the notification file whether or not a problem was thrown
      rdr.close();
    }
  }
}

Have you got your own solutions to this or other website build problems? Tell us.

About the Blogger

Jack Shirazi is the author of O'Reilly's "Java Performance Tuning" and director of the popular website JavaPerformanceTuning.com/, the world's premier site for Java performance information. Jack writes articles for many magazines, usually about Java performance related matters. He also oversees the output at JavaPerformanceTuning.com, publishing around 1 000 performance tips a year as well as many articles about performance tools, discussion groups, and much more. In his earlier life Jack also published work on protein structure prediction and black hole thermodynamics, and contributed to some Perl5 core modules "back when he had more time".

This weblog entry is Copyright © 2004 Jack Shirazi. All rights reserved.

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use