A coworker and I wanted to mirror a website belonging to a friend of his. However, wget -r insists on honouring the robots.txt file, which is probably a good thing. It seemed like a good opportunity to write a Perl script.
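#!/usr/bin/perl
# Recursively mirror every same-host page reachable from a starting URL.
use strict;
use warnings;
use File::Path qw(make_path);
use HTML::LinkExtor;
use LWP::Simple;

my $url = shift or die "no url given\n";
print "Base URL: $url\n";

my $host      = get_host($url);
my %extracted = ($url => 1);    # URLs already fetched or queued
get_urls($url);

# Mirror one page, then recurse into every unvisited link on the same host.
sub get_urls {
    my $base_url = shift;
    mirror_file($base_url);

    # Fetch the page again to extract its links; give up if the fetch fails.
    my $content = get($base_url);
    return unless defined $content;

    my $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse($content)->eof;

    # Each link is [tag, attr => url, ...]; collect the absolute URLs.
    my %seen;
    foreach my $linkarray ($parser->links) {
        my ($elt_type, %attrs) = @$linkarray;
        for my $link (values %attrs) {
            (my $target = "$link") =~ s/#.*//;    # ignore in-page anchors
            $seen{$target}++;
        }
    }

    for my $link (keys %seen) {
        next if $extracted{$link} || $host ne get_host($link);
        $extracted{$link}++;
        get_urls($link);
    }
}

# Save $url under ./<host>/<path>, creating directories as needed.
sub mirror_file {
    my $url  = shift;
    my $path = get_local_filename($url);
    print "mirroring $path\n";

    # Split into directory and file name; a URL ending in "/" gets index.html.
    my ($dir, $file) = $path =~ m{^(.*)/([^/]*)$};
    $file = 'index.html' if $file eq '';

    make_path($dir);    # replaces system("mkdir -p $dir"), no shell involved
    getstore($url, "$dir/$file");
}

# Return the host part of a URL, without any leading "www.".
sub get_host {
    my $url = shift;
    (my $name = $url) =~ s{^https?://(?:www\.)?([^/]*).*}{$1};
    return $name;
}

# Map a URL to a local path of the form <host>/<path>.
sub get_local_filename {
    my $url = shift;
    (my $path = $url) =~ s{^https?://[^/]*/?}{};
    return "$host/$path";
}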
Tags: Perl

January 10th, 2004 at 10:50 pm
Use ‘wget -e robots=off -m http://site.com’. Maybe it wasn’t possible to turn that off in an earlier version, but I can’t imagine an open source tool limiting itself like that. It even has a hard-to-detect mode. You can also do -X robots.txt, I think.