lwpcook - libwww-perl cookbook


DESCRIPTION

       This document contains some examples that show typical
       usage of the libwww-perl library.  You should consult the
       documentation for the individual modules for more detail.

       All examples should be runnable programs. You can, in most
       cases, test the code sections by piping the program text
       directly to perl.


GET

       It is very easy to use this library to just fetch
       documents from the net.  The LWP::Simple module provides
       the get() function that returns the document specified by
       its URL argument:

         use LWP::Simple;
         $doc = get 'http://www.sn.no/libwww-perl/';

       or, as a perl one-liner using the getprint() function:

         perl -MLWP::Simple -e 'getprint "http://www.sn.no/libwww-perl/"'

       or, how about fetching the latest perl by running this:

         perl -MLWP::Simple -e '
           getstore "ftp://ftp.sunet.se/pub/lang/perl/CPAN/src/latest.tar.gz",
                    "perl.tar.gz"'

       You will probably first want to find a CPAN site closer to
       you by running something like the following command:

         perl -MLWP::Simple -e 'getprint "http://www.perl.com/perl/CPAN/CPAN.html"'
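
       The LWP::Simple functions report failure through their
       return values, so it is usually worth checking them.  The
       following is a minimal sketch, not part of the original
       examples.  It assumes a temporary local file and a file:
       URL (both made up for illustration) so that it runs
       without network access:

```perl
use strict;
use warnings;
use LWP::Simple;
use File::Temp qw(tempdir);
use URI::file;

# Hypothetical local document, so the sketch needs no network access.
my $dir = tempdir(CLEANUP => 1);
my $src = "$dir/source.txt";
open(my $out, '>', $src) or die "Cannot write $src: $!";
print $out "hello from a local file\n";
close($out) or die "Cannot close $src: $!";

my $url = URI::file->new($src)->as_string;

# get() returns undef if the document could not be fetched.
my $doc = get($url);
die "Could not fetch $url\n" unless defined $doc;
print $doc;

# getstore() returns the status code of the response.
my $status = getstore($url, "$dir/copy.txt");
print "getstore status: $status\n";
warn "getstore failed\n" unless is_success($status);

# A document that does not exist yields undef rather than an exception.
my $missing = get("$url.missing");
print "no such document\n" unless defined $missing;
```

       The same checks apply unchanged to http: and ftp: URLs.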

       Enough of this simple stuff!  The LWP object oriented
       interface gives you more control over the request sent to
       the server.  Using this interface you have full control
       over headers sent and how you want to handle the response
       returned.

         use LWP::UserAgent;
         $ua = new LWP::UserAgent;
         $ua->agent("$0/0.1 " . $ua->agent);
         # $ua->agent("Mozilla/5.0") # pretend you are some very new Netscape browser

         $req = new HTTP::Request 'GET' => 'http://www.sn.no/libwww-perl';
         $req->header('Accept' => 'text/html');

         # send request
         $res = $ua->request($req);

         if ($res->is_success) {
            print $res->content;
         } else {
            print "Error: " . $res->code . " " . $res->message . "\n";
         }
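
       To see exactly what the request object carries before it
       is sent, you can build it and print it without involving
       a user agent at all.  This is only a sketch; the
       Accept-Language header is an assumption added for
       illustration:

```perl
use strict;
use warnings;
use HTTP::Request;

my $req = HTTP::Request->new(GET => 'http://www.sn.no/libwww-perl');
$req->header('Accept' => 'text/html');
$req->header('Accept-Language' => 'en');   # assumed extra header, illustration only

# Show the request line and the headers set so far.
print $req->as_string;

# Individual parts can be read back as well.
print "Method: ", $req->method, "\n";
print "Accept: ", $req->header('Accept'), "\n";
```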

       The lwp-request program (alias GET) that is distributed
       with the library can also be used to fetch documents from
       WWW servers.



HEAD

       If you just want to check if a document is present (i.e.
       the URL is valid), try running code that looks like this:

         use LWP::Simple;

         if (head($url)) {
            # ok document exists
         }

       The head() function actually returns a list of
       meta-information about the document.  The first three
       values of the list returned are the document type, the
       size of the document, and the time the document was last
       modified.
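
       A sketch of picking up those values in list context
       follows.  The temporary file and its file: URL are
       assumptions made so that the code runs offline; with an
       http: URL the values come from the server's response
       headers instead:

```perl
use strict;
use warnings;
use LWP::Simple;
use File::Temp qw(tempdir);
use URI::file;

# Hypothetical local document standing in for a remote one.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/note.txt";
open(my $out, '>', $file) or die "Cannot write $file: $!";
print $out "hello\n";
close($out) or die "Cannot close $file: $!";

my $url = URI::file->new($file)->as_string;

# In list context head() returns an empty list on failure.
my ($type, $length, $mtime) = head($url)
    or die "No such document: $url\n";

print "Type:     $type\n";
print "Size:     $length bytes\n";
print "Modified: ", scalar localtime($mtime), "\n";
```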

       More control over the request, or access to all header
       values returned, requires that you use the object oriented
       interface described for GET above.  Just s/GET/HEAD/g.


POST

       There is no simple interface for posting data to a WWW
       server.  You must use the object oriented interface for
       this. The most common POST operation is to access a WWW
       form application:

         use LWP::UserAgent;
         $ua = new LWP::UserAgent;

         my $req = new HTTP::Request 'POST','http://www.perl.com/cgi-bin/BugGlimpse';
         $req->content_type('application/x-www-form-urlencoded');
         $req->content('match=www&errors=0');

         my $res = $ua->request($req);
         print $res->as_string;

       You can also use the HTTP::Request::Common module to set
       up a suitable POST request message.  It handles all the
       escaping issues and provides a suitable default for the
       content_type:

         use HTTP::Request::Common qw(POST);
         use LWP::UserAgent;
         $ua = new LWP::UserAgent;

         my $req = POST 'http://www.perl.com/cgi-bin/BugGlimpse',
                       [ search => 'www', errors => 0 ];

         print $ua->request($req)->as_string;
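
       To see what HTTP::Request::Common does with awkward
       characters, you can inspect the request it builds without
       sending it.  In this sketch the field value containing a
       space and an ampersand is made up for illustration:

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# The value deliberately contains characters that must be escaped.
my $req = POST 'http://www.perl.com/cgi-bin/BugGlimpse',
               [ search => 'www & more', errors => 0 ];

print "Method:       ", $req->method, "\n";
print "Content-Type: ", $req->content_type, "\n";
print "Body:         ", $req->content, "\n";
```

       The body printed should contain no raw spaces, and the
       ampersand inside the value should appear in escaped form
       so that it cannot be confused with a field separator.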

       The lwp-request program (alias POST) that is distributed
       with the library can also be used for posting data.


PROXIES

       Some sites use proxies to go through firewall machines,
       or just as a cache in order to improve performance.  Proxies
       can also be used for accessing resources through protocols
       not supported directly (or supported badly :-) by the
       libwww-perl library.

       You should initialize your proxy setting before you start
       sending requests:

         use LWP::UserAgent;
         $ua = new LWP::UserAgent;
         $ua->env_proxy; # initialize from environment variables
         # or
         $ua->proxy(ftp  => 'http://proxy.myorg.com');
         $ua->proxy(wais => 'http://proxy.myorg.com');
         $ua->no_proxy(qw(no se fi));

         my $req = new HTTP::Request 'GET' => 'wais://xxx.com/';
         print $ua->request($req)->as_string;

       The LWP::Simple interface will call env_proxy() for you
       automatically.  Applications that use the $ua->env_proxy()
       method will normally not use the $ua->proxy() and
       $ua->no_proxy() methods.
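
       The following sketch shows env_proxy() picking up its
       settings.  The environment values are assumptions; they
       would normally be set in the shell before the program
       starts, and are set inside the program here only to keep
       the example self-contained:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Start from a clean environment so only the assumed values are seen.
local %ENV = (
    http_proxy => 'http://proxy.myorg.com:8080/',
    no_proxy   => 'localhost,.no',
);

my $ua = LWP::UserAgent->new;
$ua->env_proxy;

# proxy() called with only a scheme returns the current setting.
my $http_proxy = $ua->proxy('http');
print "http requests go through $http_proxy\n";
```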

       Some proxies also require that you send them a username
       and password in order to let requests through.  You should
       be able to add the required header with something like
       this:

        use LWP::UserAgent;

        $ua = new LWP::UserAgent;
        $ua->proxy(['http', 'ftp'] => 'http://proxy.myorg.com');

        $req = new HTTP::Request 'GET',"http://www.perl.com";
        $req->proxy_authorization_basic("proxy_user", "proxy_password");

        $res = $ua->request($req);
        print $res->content if $res->is_success;

       Replace proxy.myorg.com, proxy_user and proxy_password
       with something suitable for your site.


ACCESS TO PROTECTED DOCUMENTS

       Documents protected by basic authorization can easily be
       accessed like this:

         use LWP::UserAgent;
         $ua = new LWP::UserAgent;
         $req = new HTTP::Request GET => 'http://www.sn.no/secret/';
         $req->authorization_basic('aas', 'mypassword');
         print $ua->request($req)->as_string;
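
       Note that basic authorization only base64-encodes the
       user name and password; it does not encrypt them.  This
       sketch reads the header back from the request object and
       decodes it to make that visible:

```perl
use strict;
use warnings;
use HTTP::Request;
use MIME::Base64 qw(decode_base64);

my $req = HTTP::Request->new(GET => 'http://www.sn.no/secret/');
$req->authorization_basic('aas', 'mypassword');

# The header value has the form "Basic <base64 of user:password>".
my ($scheme, $token) = split ' ', $req->header('Authorization');
my $decoded = decode_base64($token);

print "Scheme:  $scheme\n";
print "Decoded: $decoded\n";
```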

       The other alternative is to provide a subclass of
       LWP::UserAgent that overrides the get_basic_credentials()
       method. Study the lwp-request program for an example of
       this.
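
       A minimal sketch of such a subclass follows.  It is not
       the code from lwp-request; the package name MyAgent and
       the %credentials table are hypothetical.  The method is
       called directly here only to show what it returns; in
       normal use the library calls it when a server asks for
       credentials:

```perl
use strict;
use warnings;

package MyAgent;                 # hypothetical subclass name
use parent 'LWP::UserAgent';

# Hypothetical table mapping a realm to a user name and password.
my %credentials = ('secret' => [ 'aas', 'mypassword' ]);

sub get_basic_credentials {
    my ($self, $realm, $uri, $isproxy) = @_;
    my $entry = $credentials{$realm} or return;   # empty list: give up
    return @$entry;
}

package main;

my $ua = MyAgent->new;

my ($user, $pass) =
    $ua->get_basic_credentials('secret', 'http://www.sn.no/secret/', 0);
print "Would log in as $user\n";

my @none = $ua->get_basic_credentials('other realm', 'http://www.sn.no/', 0);
print "No credentials for other realms\n" unless @none;
```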


MIRRORING

       If you want to mirror documents from a WWW server, then
       try to run code similar to this at regular intervals:

         use LWP::Simple;

         %mirrors = (
            'http://www.sn.no/'             => 'sn.html',
            'http://www.perl.com/'          => 'perl.html',
            'http://www.sn.no/libwww-perl/' => 'lwp.html',
            'gopher://gopher.sn.no/'        => 'gopher.html',
         );

         while (($url, $localfile) = each(%mirrors)) {
            mirror($url, $localfile);
         }

       Or, as a perl one-liner:

         perl -MLWP::Simple -e 'mirror("http://www.perl.com/", "perl.html")';

       The document will not be transferred unless it has been
       updated.
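
       The mirror() function returns the response status code,
       which tells you whether a transfer took place.  The
       sketch below mirrors the same document twice; the
       temporary files and the file: URL are assumptions made so
       that it runs without network access:

```perl
use strict;
use warnings;
use LWP::Simple;
use File::Temp qw(tempdir);
use URI::file;

# Hypothetical local document standing in for a remote one.
my $dir = tempdir(CLEANUP => 1);
my $src = "$dir/source.html";
open(my $out, '>', $src) or die "Cannot write $src: $!";
print $out "<title>test</title>\n";
close($out) or die "Cannot close $src: $!";

my $url  = URI::file->new($src)->as_string;
my $copy = "$dir/mirror.html";

my $first  = mirror($url, $copy);   # no local copy yet
my $second = mirror($url, $copy);   # local copy is already current

for my $status ($first, $second) {
    if ($status == RC_NOT_MODIFIED) {
        print "up to date, nothing transferred\n";
    } elsif (is_success($status)) {
        print "transferred\n";
    } else {
        print "failed with status $status\n";
    }
}
```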


LARGE DOCUMENTS

       If the document you want to fetch is too large to be kept
       in memory, then you have two alternatives.  You can
       instruct the library to write the document content to a
       file (second $ua->request() argument is a file name):

         use LWP::UserAgent;
         $ua = new LWP::UserAgent;

         my $req = new HTTP::Request 'GET',
                       'http://www.sn.no/~aas/perl/www/libwww-perl-5.00.tar.gz';
         $res = $ua->request($req, "libwww-perl.tar.gz");
         if ($res->is_success) {
            print "ok\n";
         }

       Or you can process the document as it arrives (second
       $ua->request() argument is a code reference):

         use LWP::UserAgent;
         $ua = new LWP::UserAgent;
         $URL = 'ftp://ftp.unit.no/pub/rfc/rfc-index.txt';

         my $expected_length;
         my $bytes_received = 0;
         $ua->request(HTTP::Request->new('GET', $URL),
                      sub {
                          my($chunk, $res) = @_;
                          $bytes_received += length($chunk);
                          unless (defined $expected_length) {
                             $expected_length = $res->content_length || 0;
                          }
                          if ($expected_length) {
                               printf STDERR "%d%% - ",
                                         100 * $bytes_received / $expected_length;
                          }
                          print STDERR "$bytes_received bytes received\n";

                          # XXX Should really do something with the chunk itself
                          # print $chunk;
                      });
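
       $ua->request() also accepts a third argument giving a
       preferred chunk size in bytes; it is only a hint.  In
       this sketch the 5000-byte temporary file and its file:
       URL are assumptions made so that the code runs offline:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use File::Temp qw(tempdir);
use URI::file;

# Hypothetical local document of known size.
my $dir = tempdir(CLEANUP => 1);
my $src = "$dir/big.txt";
open(my $out, '>', $src) or die "Cannot write $src: $!";
print $out "x" x 5000;
close($out) or die "Cannot close $src: $!";

my $url = URI::file->new($src)->as_string;
my $ua  = LWP::UserAgent->new;

my $bytes_received = 0;
my $chunks         = 0;

my $res = $ua->request(
    HTTP::Request->new(GET => $url),
    sub {
        my ($chunk, $response) = @_;
        $bytes_received += length($chunk);
        $chunks++;
    },
    1024,    # preferred chunk size in bytes; only a hint
);

die "Transfer failed: ", $res->status_line, "\n" unless $res->is_success;
print "$bytes_received bytes received in $chunks chunks\n";
```

       Because the callback consumes the data, the content of
       the returned response object is left empty.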



HTML FORMATTING

       It is easy to convert HTML code to "readable" text.

         use LWP::Simple;
         use HTML::Parse;
         print parse_html(get 'http://www.sn.no/libwww-perl/')->format;



PARSE URLS

       To access individual elements of a URL, try this:

         use URI::URL;
         $host = url("http://www.sn.no/")->host;

       or

         use URI::URL;
         $u = url("ftp://ftp.sn.no/test/aas;type=i");
         print "Protocol scheme is ", $u->scheme, "\n";


       or even

         use URI::URL;
         my($host,$port) = (url("ftp://ftp.sn.no/test/aas;type=i")->crack)[3,4];
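
       The accessor methods are usually easier to read than
       indexing into the list returned by crack().  A short
       sketch using the same URL:

```perl
use strict;
use warnings;
use URI::URL;

my $u = url("ftp://ftp.sn.no/test/aas;type=i");

printf "scheme: %s\n", $u->scheme;
printf "host:   %s\n", $u->host;
# No port is given in the URL, so the scheme's default port is reported.
printf "port:   %s\n", $u->port;
```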



EXPAND RELATIVE URLS

       This code reads URLs and prints an expanded version of each.

         use URI::URL;
         $BASE = "http://www.sn.no/some/place?query";
         while (<>) {
            print url($_, $BASE)->abs->as_string, "\n";
         }
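
       To see what the expansion does without typing input, the
       same call can be run over a fixed list.  The relative
       references in this sketch are made up for illustration:

```perl
use strict;
use warnings;
use URI::URL;

my $BASE = "http://www.sn.no/some/place?query";

# Hypothetical relative references; the last one is already absolute.
my @relative = ('other.html', '../up', '/root', 'http://www.perl.com/');

my %expanded;
for my $rel (@relative) {
    $expanded{$rel} = url($rel, $BASE)->abs->as_string;
    printf "%-22s => %s\n", $rel, $expanded{$rel};
}
```

       A reference that is already absolute is returned
       unchanged; the others are resolved against the directory
       part of the base URL.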

       We can expand URLs in an HTML document by using the parser
       to build a tree that we then traverse:

         %link_elements =
         (
          'a'    => 'href',
          'img'  => 'src',
          'form' => 'action',
          'link' => 'href',
         );

         use HTML::Parse;
         use URI::URL;

         $BASE = "http://somewhere/root/";
         $h = parse_htmlfile("xxx.html");
         $h->traverse(\&expand_urls, 1);

         print $h->as_HTML;

         sub expand_urls
         {
            my($e, $start) = @_;
            return 1 unless $start;
            my $attr = $link_elements{$e->tag};
            return 1 unless defined $attr;
            my $url = $e->attr($attr);
            return 1 unless defined $url;
            $e->attr($attr, url($url, $BASE)->abs->as_string);
         }



BASE URL

       If you want to resolve relative links in a page you will
       have to determine which base URL to use.  HTTP::Response
       objects have a base() method that returns it:

         $BASE = $res->base;
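
       The sketch below builds a response object by hand, so
       that it runs without network access, and resolves a link
       against its base.  The relative link images/logo.gif is
       hypothetical.  Since this response carries no base
       header of its own, base() falls back to the URL of the
       request that produced it:

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;
use URI;

# A hand-made response standing in for one returned by $ua->request().
my $res = HTTP::Response->new(200, 'OK');
$res->request(HTTP::Request->new(GET => 'http://www.sn.no/libwww-perl/'));

my $BASE = $res->base;
my $abs  = URI->new_abs('images/logo.gif', $BASE);   # hypothetical relative link

print "base: $BASE\n";
print "link: $abs\n";
```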


COPYRIGHT

       Copyright 1996-1997, Gisle Aas

       This library is free software; you can redistribute it
       and/or modify it under the same terms as Perl itself.