The Artima Developer Community

Web Buzz Forum
W3C LogValidator

Douglas Clifton

Douglas Clifton is a freelance Web programmer and writer
W3C LogValidator Posted: Nov 4, 2008 1:13 AM
This post originated from an RSS feed registered with Web Buzz by Douglas Clifton.
Original Post: W3C LogValidator
Feed Title: blogZero
Feed URL: http://loadaveragezero.com/app/s9y/index.php?/feeds/index.rss1
Feed Description: Web Development News, Culture and Opinion

This article documents my experience installing, configuring, and using the W3C LogValidator. Hopefully it will be useful to anyone new to this, and in particular those who are not comfortable installing Perl CPAN modules.

Goals

When I set out on this project I wanted to satisfy the following criteria. First, determine the N most popular documents where N is some configurable number. And second, amongst those pages determine:

  • Is the markup, be it HTML or XHTML, valid?
  • Are the CSS stylesheets valid?
  • Do any of them have broken links?

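As it turns out, all three of these criteria can be met with LogValidator, and they correspond to the following modules:

  • W3C::LogValidator::HTMLValidator
  • W3C::LogValidator::CSSValidator
  • W3C::LogValidator::LinkChecker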

Both the HTML and CSS modules use the W3C validation services that most developers are familiar with, so there is no need to install HTML Tidy or any complicated Java packages. For the LinkChecker module to function, you also need to install checklink (which is a useful little utility on its own).

All this is fine and dandy, but how do you view the results? I'll get to this in more detail below, but the two most popular methods are to have LogValidator generate an HTML document, or send you an email, which is handy if you want to set up the system to run automatically using cron.

Installation

This is command-line land folks, so I'm hoping you are comfortable using a shell. If you are, and you do have any experience installing Perl CPAN modules, you should be all set. Otherwise you may consult a complete session of my own installation experience. Good luck with all that, it's kind of gory.

First, you need the W3C::LogValidator module, which may or may not require some other prerequisite modules. Thankfully, the CPAN module itself is smart enough to determine whether any prerequisites are missing and install them automatically. I'm not going to get into installing this software in non-standard locations, so you are going to have to be root (aka super-user) to do most of what follows. Note that the `#' character prompt indicates the command is run as root. To get started, install LogValidator like so:

# perl -MCPAN -e 'install W3C::LogValidator'

I found that installing LinkChecker did not work using CPAN, so I used the old-school method of downloading the package (aka "tarball"), unpacking it in some temporary directory, and building and installing it from there. Again, only the actual installation step requires you to be root; for all the others you are better off as an ordinary user (as in you). Note the change in shell prompt to the `>' character.

/tmp/CPAN> tar xzvf W3C-LinkChecker-4.3.tar.gz
W3C-LinkChecker-4.3/
W3C-LinkChecker-4.3/...
/tmp/CPAN> cd W3C-LinkChecker-4.3
/tmp/CPAN/W3C-LinkChecker-4.3> perl Makefile.PL
/tmp/CPAN/W3C-LinkChecker-4.3> make
/tmp/CPAN/W3C-LinkChecker-4.3> make test

I'm not showing you the output here because it gets a little messy. Again, consult my session if you are new to this. The final step is to install LinkChecker as root:

/tmp/CPAN/W3C-LinkChecker-4.3> su
Password:
# make install
...
# ^d

Now you should have all the software installed and can proceed to the configuration step. Note that the ^d above means Ctrl+D (EOF), which exits you from the root login (typing "exit" also works).
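Before moving on, a quick sanity check can confirm that both pieces ended up where Perl and your shell can find them. The following is only a minimal sketch, not something shipped with LogValidator: perl_module_installed is a hypothetical helper name, and the second check assumes the LinkChecker install placed checklink somewhere on your PATH.

```shell
#!/bin/sh

# perl_module_installed MODULE
# Hypothetical helper: succeeds (exit status 0) if Perl can load MODULE,
# and fails otherwise. All output from perl is discarded.
perl_module_installed() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Check the LogValidator module itself.
if perl_module_installed W3C::LogValidator; then
    echo "W3C::LogValidator: ok"
else
    echo "W3C::LogValidator: missing"
fi

# Check that the checklink utility is reachable (assumes it is on PATH).
if command -v checklink >/dev/null 2>&1; then
    echo "checklink: ok"
else
    echo "checklink: missing"
fi
```

If either line reports "missing", revisit the corresponding installation step above before touching the configuration.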

Configuration

LogValidator comes with a sample configuration file that you need to copy to some directory of your choice. The sample will be under root's hidden .cpan directory, but it's readable, so exit from your root login, create the directory, and copy the file there.

For example:

/tmp/CPAN/W3C-LinkChecker-4.3> cd /var/www/mysite
/var/www/mysite> mkdir config
/var/www/mysite> cd config
/var/www/mysite/config> cp ~root/.cpan/build/W3C-LogValidator-1.3.1/samples/logprocess.conf .

Pay attention to the LogValidator version number because this article can (and will) become obsolete at some point.

Now you're ready to edit the configuration file and run the script. The configuration file is very similar to the Apache httpd.conf file (and some of the directives are straight out of it), so if you're familiar with editing that file you're ahead of the game.

There are several options that you will want to alter, and the file is fully documented. Well, almost fully documented. I'll get to that in a minute.

##  [apache] ServerAdmin : e-mail address to send the reports

ServerAdmin doug@example.com

##  MailFrom : From: address for e-mail output
##
## Unless the relevant option is specified when running the LogValidator,
## the mail output will use ServerAdmin (see above) as From: and To:
## This option allows you to override the From: parameter
## DEFAULT  = ServerAdmin

MailFrom logvalidator@example.com

## Title : a more useful Subject: for the Mail output and <title> for HTML Output
##
## Tell the mail/HTML output what this config is all about
## and make them use a better subject than the vanilla "LogValidator results"
## DEFAULT = Logvalidator results

Title LogValidator - Results for example.com

##  [apache] DocumentRoot : where the files are located
##
## For some log formats, it is necessary to know where the actual files
## reside on the server

DocumentRoot /var/www/mysite/docroot/

##  [apache] ServerName : full address for the web server
##
## should be of the form host.domain
## NOTE: no need to prepend http://

ServerName example.com

##  [apache] CustomLog : log file and format
##
## Add as many entries as you like. The Log Validator will process all log files listed below
## formats: see http://httpd.apache.org/docs/mod/mod_log_config.html
#  NOTE: only the following formats are currently supported:
#                    common, combined, w3, full, plain (list of addresses)
# CustomLog /var/log/apache/access.log.1 combined
# CustomLog /home/me/path/to/list plain

CustomLog /var/www/mysite/logs/access_log combined

## [apache] DirectoryIndex : document equivalent to "/"
##
## See http://httpd.apache.org/docs/mod/mod_dir.html#directoryindex
## Used by the validator to compute the "canonical" URLs for Documents
## DEFAULT = index.html index.htm index

DirectoryIndex index.php index.html

There are a large number of other options to play with, but many of the defaults are sufficient until you get everything running and you want to tweak things. The options that don't seem to be documented very well, and are omitted from the sample configuration file, relate to output. So I took the time to do so.

# UseOutputModule : method and location to send output

# --email from your shell, or
# UseOutputModule W3C::LogValidator::Output::Mail
# -s|--sendto <address> from your shell, or
# SendTo doug@example.com

# --HTML from your shell, or

UseOutputModule W3C::LogValidator::Output::HTML

# -o|--output <path> from your shell, or

OutputTo /var/www/mysite/www/admin/logvalidator.html

# output will go to console if not specified

Notice I have both options set, but the email method is commented out (a `#' character precedes comments). You can override either option by using command-line switches as described above. Also note that I am placing the HTML output in an admin directory, which I happen to have password protected because it contains other stuff that I don't want just anyone to have access to.

Once you have everything configured correctly it's time to run the script and view the output. Back at your command prompt, run the script and tell it where to find the configuration file:

~config> logprocess.pl -f logprocess.conf

This should work if you are in the same directory as the configuration file; otherwise you may have to specify the full (or relative) path to its location. Also, the CPAN installation typically puts the Perl script in /usr/local/bin/, so if that directory isn't in your PATH, or depending on which shell you're using, it may not find the script at first. Either try using the full path to the script or run the rehash shell built-in command (csh, tcsh, and zsh; in bash the equivalent is hash -r) to update its database.
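If the shell still cannot find the script, a small lookup along the following lines can tell you which path to type. This is a sketch under assumptions: find_script is a made-up helper name, and /usr/local/bin is used as the fallback only because that is where the CPAN installation typically puts the script.

```shell
#!/bin/sh

# find_script NAME [FALLBACK_DIR]
# Hypothetical helper: print the full path of NAME if it is on PATH,
# otherwise look for an executable NAME in FALLBACK_DIR
# (default: /usr/local/bin). Returns 1 if neither lookup succeeds.
find_script() {
    name=$1
    dir=${2:-/usr/local/bin}
    if command -v "$name" >/dev/null 2>&1; then
        command -v "$name"
    elif [ -x "$dir/$name" ]; then
        printf '%s\n' "$dir/$name"
    else
        return 1
    fi
}

# Example: find_script logprocess.pl
```

Whatever path it prints can be used in place of the bare logprocess.pl in the command above.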

Also, depending on how much traffic you get, the server log file may be quite large and running all four modules on the results will take some time, so be patient. When the script exits and depending on the output option you selected, you will either get an email with the results or you can visit the generated HTML report with your browser.

Reports

To generate the report, the script took about 20 minutes running in the background. This was with a full day's worth of requests from yesterday's Apache access log. A quick check with wc -l on the log returned around 50,000 total requests. Keep in mind that many of these are for images, CSS, JavaScript, and other resources, not complete pages. Below is a thumbnail with my results, minus the Basic module, which was configured to list the top 100 most popular pages (the default).
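To get a rough feel for how many of those requests are for actual pages, the log can be tallied by hand before LogValidator runs. The sketch below is not how LogValidator counts popularity; it is a plain shell approximation that assumes the Apache "combined" format (where the request path is the seventh whitespace-separated field) and a guessed list of static-file extensions. top_pages is a hypothetical name.

```shell
#!/bin/sh

# top_pages LOGFILE [N]
# Print the N most requested paths (default 10) from an Apache
# "combined" log, each prefixed by its request count.
# The extension filter is an assumption; adjust it for your site.
top_pages() {
    log=$1
    n=${2:-10}
    # Field 7 of a combined-format line is the requested path.
    awk '{ print $7 }' "$log" |
        grep -Ev '\.(png|gif|jpe?g|ico|css|js)(\?.*)?$' |
        sort | uniq -c | sort -rn | head -n "$n"
}

# Example: top_pages /var/www/mysite/logs/access_log 100
```

The counts will not necessarily match the report, since this makes no attempt to fold "/" and the DirectoryIndex documents into one canonical URL.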

[Screenshot: thumbnail of the LogValidator results]

I kept an eye on the process while it ran and by far the longest time was spent during the LinkChecker phase, which is not surprising given all the external HTTP requests necessary to check each link. And I have a lot of outbound links. I ran checklink manually on my home page, and that in itself took a minute and change.

Conclusion

LogValidator is a great tool, although it takes some work to get it running. The main idea here is to target those pages on your Web site that get the most traffic and fix problems there first, before moving on. If you have thousands of pages like I do, it can be a pretty daunting task to find and fix every error on your site.

One final note. I use newsyslog(8) to automatically rotate and gzip compress my Apache access logs. I was hoping that LogValidator would have the ability to deal with gzipped log files, but as far as I could determine it does not. Although it really wouldn't be that difficult to modify the code to do so, I opted for using a simple wrapper shell script that uncompresses yesterday's access log, runs the logprocess script, and then recompresses the log file. Below is some sample code if you want to take this route.

# newsyslog.conf -- rotate apache logs
/var/www/mysite/logs/error_log root:staff 640 6 * @T00 BZN
/var/www/mysite/logs/access_log root:staff 640 6 * @T00 BZ /var/run/httpd.pid 30

#!/bin/sh

# logprocess -- W3C::LogValidator wrapper script

config=/var/www/mysite/config/logprocess.conf
access=/var/www/mysite/logs/access_log.0

gunzip ${access} > /dev/null 2>&1
logprocess.pl --quiet --config ${config} > /dev/null 2>&1
gzip ${access} > /dev/null 2>&1

In case you're wondering, the > /dev/null 2>&1 at the end of each line above means "really quiet": all output, both stdout (console) and stderr (also console, but another file descriptor), is sent to the "bit bucket" (/dev/null). The order matters, because stdout has to be pointed at /dev/null first and stderr is then pointed at wherever stdout goes. I'm doing this so that I can execute the script via cron and not have a bunch of useless messages in my inbox.
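One weakness of a wrapper like this is that the log stays uncompressed if the middle step dies. A slightly more defensive variation is sketched below; it is an illustration rather than the script I actually run, with_uncompressed is a made-up function name, and it assumes the rotated log exists as FILE.gz.

```shell
#!/bin/sh

# with_uncompressed FILE COMMAND [ARGS...]
# Hypothetical helper: uncompress FILE.gz, run COMMAND, then recompress
# FILE whether or not COMMAND succeeded. Returns COMMAND's exit status,
# or 1 if the log could not be uncompressed in the first place.
with_uncompressed() {
    file=$1
    shift
    gunzip "$file.gz" || return 1
    "$@"
    status=$?
    gzip "$file"
    return "$status"
}

# Example, using the paths from this article:
# with_uncompressed /var/www/mysite/logs/access_log.0 \
#     logprocess.pl --quiet --config /var/www/mysite/config/logprocess.conf \
#     > /dev/null 2>&1
```

Either version can then be scheduled shortly after the midnight rotation. Assuming the wrapper was saved as /usr/local/bin/logprocess (a hypothetical location), a crontab entry such as "30 0 * * * /usr/local/bin/logprocess" would run it at 00:30 every day.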

That's all he wrote, folks! And you thought I was a softy after my last post. ;-)
