jamin on August 16th, 2002

A coworker and I wanted to mirror a website belonging to a friend of his. However, wget -r insists on honouring the robots.txt file, which is probably a good thing. It seemed like a good opportunity to write a Perl script.


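#!/usr/bin/perl
# Recursively mirror every page and file reachable on one host.
use strict;
use warnings;
no warnings 'recursion';    # deep link chains are expected here
use File::Path qw(make_path);
use HTML::LinkExtor;
use LWP::Simple;

my $url = shift or die "no url given\n";
print "Base URL: $url\n";
my $host = get_host($url);
my %extracted = ($url => 1);    # URLs already fetched or queued for fetching
get_urls($url);

# Mirror one URL, then recurse into every same-host link found in it.
sub get_urls {
    my $base_url = shift;
    mirror_file($base_url);
    my $content = get($base_url);
    return unless defined $content;    # fetch failed, so there is nothing to parse
    my $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse($content)->eof;
    my %seen;
    foreach my $linkarray ($parser->links) {
        my @element  = @$linkarray;
        my $elt_type = shift @element;
        while (@element) {
            my ($attr_name, $attr_value) = splice(@element, 0, 2);
            (my $link = "$attr_value") =~ s/#.*//s;    # ignore fragments
            $seen{$link}++;
        }
    }
    for my $link (keys %seen) {
        next if $extracted{$link} || $host ne get_host($link);
        $extracted{$link}++;
        get_urls($link);
    }
}

# Save one URL under a local path of the form "<host>/<path>".
sub mirror_file {
    my $url  = shift;
    my $file = get_local_filename($url);
    print "mirroring $file\n";
    my ($dir, $name) = $file =~ m{^(.*)/(.*)$}s;
    make_path($dir, { error => \my $err });    # avoids passing the path to a shell
    if ($err && @$err) {
        warn "could not create directory $dir\n";
        return;
    }
    my $status = getstore($url, "$dir/$name");
    warn "could not fetch $url: $status\n" unless is_success($status);
}

# Return the host part of a URL without any leading "www.".
# Anything that is not an http(s) URL comes back unchanged, so it never matches $host.
sub get_host {
    my $url = shift;
    (my $name = $url) =~ s{^https?://(?:www\.)?([^/]*).*}{$1}s;
    return $name;
}

# Map a URL to "<host>/<path>"; directory URLs are saved as index.html.
sub get_local_filename {
    my $url = shift;
    (my $path = $url) =~ s{^https?://[^/]*/?}{};
    $path .= 'index.html' if $path eq '' || $path =~ m{/$};
    return "$host/$path";
}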


One Response to “Web Mirroring”

  1. Use ‘wget -e robots=off -m http://site.com’. Maybe it wasn’t possible to turn that off in an earlier version, but I can’t imagine an open source tool limiting itself like that. It even has a hard-to-detect mode. You can also do -X robots.txt, I think.