This post originated from an RSS feed registered with Web Buzz
by Douglas Clifton.
Original Post: Authenticating a Gooblebot in PHP and Perl
Feed Title: blogZero
Feed URL: http://loadaveragezero.com/app/s9y/index.php?/feeds/index.rss1
Feed Description: Web Development News, Culture and Opinion
Following a tip from Russ I was pleased to find an interesting post on the Official Google Webmaster Central Blog titled
How to verify Googlebot. In a nutshell, it explains how to use the Unix shell program host to authenticate that an IP address copied from your Web server log file really is a Googlebot and not some email harvester (or whatever).
I decided to take this a step further and demonstrate how you can automate this procedure using a scripting
language. For these examples I chose PHP
and Perl, although you could certainly use Python or Ruby or whatever your preferred language is, as long as it has an interface to the gethostbyname and gethostbyaddr system calls.
Using these calls under PHP is the simpler of the two approaches, as the interface to these routines are written at a more abstract level than using the Perl Socket module. Below is an example googlebot() function in PHP that returns true if the IP address parameter matches, although there is no 100% guarantee of a spoof (it will catch the vast majority of them). A bit of test code is included.
<?php
function googlebot($ip) {
// check to see if this IP really is a Googlebot
$bot = 'googlebot.com';
$name = gethostbyaddr($ip);
if ($name == $ip) return false;
return (strpos($name, $bot) !== false and gethostbyname($name) == $ip) ? true : false;
}
// test it
$ip = '66.249.66.1';
echo $ip . ' is ';
if (!googlebot($ip)) echo 'not ';
echo 'a Google bot' . "\n";
?>
The Perl version is at a lower level, much closer to the corresponding C library calls. In fact, the module is derived directly from the sys/sockets.h header file and the functions are just wrappers around these Standard C library calls. See Berkeley Sockets for more information.
#!/usr/bin/perl
use Socket;
sub googlebot($) {
# check to see if this IP really is a Googlebot
my $ip = shift;
my $bot = 'googlebot\.com';
my $name = gethostbyaddr(inet_aton($ip), AF_INET) or return 0;
my @addr = gethostbyname($name);
my $addr = inet_ntoa($addr[4]);
return ($name =~ m/$bot/ and $ip eq $addr) ? 1 : 0;
}
# test it
$ip = '66.249.66.1';
print $ip . ' is ';
unless (googlebot($ip)) { print 'not '; }
print 'a Google bot' . "\n";
Finally, in case anyone is interested why it's been so long since I posted anything, much of the summer I was sick as a dog and since recovering, busy as a bee. It's nice to be feeling better and back to work!