
Fetching the Site and Checking Links

This script is simple and does the most essential task: checking for bad links. More verbose and detailed reporting is left as an exercise for the reader; this script provides the basic foundation. Let's begin!

NOTE

Click here to download a zip file containing the complete code for this article.

01: #!/usr/bin/perl -w

02: use strict;
03: use HTML::Parser;
04: use LWP::UserAgent;
05: use URI::URL;

Lines 1 through 5 begin the script. Line 1 turns on warnings, and line 2 enables the strict pragma to help make sure that the code is somewhat sane; these two things are essential for any program. Line 3 loads the HTML::Parser module, which parses the HTML we retrieve and gives us the links on each page. The LWP::UserAgent module on line 4 does the actual fetching of web pages. Finally, we bring in the URI::URL module to help resolve any relative URLs into full URIs.

We now have all the tools we need.

06: my %LINKS;
07: my %GOOD_LINKS;
08: my %BAD_LINKS;
09: my $BASE;
10: my @TO_CHECK;
11: my $URL = $ARGV[0] || "http://mydomain.com";

Lines 6 to 11 define our global variables. The %LINKS hash is a container for the links found on each page we parse. The %GOOD_LINKS and %BAD_LINKS hashes keep track of which links are good and which are bad, respectively. $BASE holds the base URI of the page being fetched; it will be used with the URI::URL module. The @TO_CHECK array holds all the links that still need to be checked; this list grows as the program runs and parses more pages. The $URL variable is the URL from which the crawl should start, and should be a base domain name. Since this is a command-line program, we can pass the URL as an argument, or fall back to a default (http://mydomain.com).

We have the needed modules and our variables. Now on to the good stuff.

12: {
13:   package GetLinks;
14:   use base 'HTML::Parser';

Lines 12 to 14 begin a new block in which we create a new package, essentially a little module within our script. We name the package GetLinks and make it a subclass of the HTML::Parser module. By subclassing HTML::Parser, we inherit all of its functionality and can override its start() method. This will be explained more fully in a moment.

15:   sub start {
16:       my $self = shift;
17:       my ($tag, $tag_attr) = @_;
18:       if ($tag eq 'a' and defined $tag_attr->{href}) {
19:           $LINKS{$tag_attr->{href}} = 0;
20:       }
21:       if ($tag eq 'img' and defined $tag_attr->{src}) {
22:           $LINKS{$tag_attr->{src}} = 0;
23:       }
24:   }

Lines 15 to 24 make up the GetLinks::start() method. This is a callback used in HTML::Parser. Whenever HTML::Parser comes across the start of an HTML tag, this callback is invoked, which allows us to do something based on the tag that's being parsed. In this case, we're working on <A> and <IMG> tags. Line 18 checks whether the tag (which is passed as an argument to the start() method) is an <A> tag. If it is, we want to make sure that one of the attributes to this tag is HREF. Not all <A> tags have an HREF attribute; some only have a NAME attribute, and we don't want to concern ourselves with those.

If these conditions are met, line 19 adds the URI referenced in the HREF attribute to our %LINKS hash. The $tag_attr variable is a hash reference containing the attribute data for the tag being worked on, so the HREF hash key holds a URI. Lines 21 to 23 do the same conditional but for <IMG> tags, making sure there's a SRC attribute. You may wonder why a hash is used to store this information instead of a list: a hash makes it easy to attach more information to each link when this script is expanded later, a small investment in scalability and maintainability.
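To see the callback, the $tag_attr hash reference, and that expandability all in one place, here's a small self-contained sketch you can run on its own. Everything named RicherLinks, along with its tag and count fields, is invented here for illustration; none of it is part of the article's script.

#!/usr/bin/perl -w
use strict;

{
    package RicherLinks;
    use base 'HTML::Parser';
    our %SEEN;

    # HTML::Parser invokes start() for each opening tag; $attr is the
    # hash reference of attributes (what the article calls $tag_attr).
    sub start {
        my ($self, $tag, $attr) = @_;
        my $uri = $tag eq 'a'   ? $attr->{href}
                : $tag eq 'img' ? $attr->{src}
                : undef;
        return unless defined $uri;
        $SEEN{$uri}{tag} = $tag;    # which kind of tag held the link
        $SEEN{$uri}{count}++;       # how many times we've seen it
    }
}

my $parser = RicherLinks->new;
$parser->parse('<a href="/a.html">A</a> <img src="/logo.png"> <a href="/a.html">again</a>');
$parser->eof;
printf "%s (%s) seen %d time(s)\n",
    $_, $RicherLinks::SEEN{$_}{tag}, $RicherLinks::SEEN{$_}{count}
    for sort keys %RicherLinks::SEEN;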

25: }

Line 25 simply finishes off the block in which we created the GetLinks package. Now we can head back into the main section of the script.

26: my $ua = new LWP::UserAgent;
27: $ua->agent("LinkCheck/0.1");

Line 26 creates a new LWP::UserAgent object. The LWP::UserAgent module essentially creates a web client for us to use. Line 27 gives our user agent a name, LinkCheck/0.1. This name generally shows up in a web server's access log, so pick something useful (or fun).
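As a small aside (this isn't in the article's listing), LWP::UserAgent also lets you set a timeout, which keeps one unresponsive server from stalling the entire crawl:

use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->agent("LinkCheck/0.1");   # the same name as on line 27
$ua->timeout(10);              # give up on a server after 10 seconds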

28: print "Starting scan from $URL\n";

Line 28 just prints out a statement saying that the checking has begun.

29: my $req = new HTTP::Request('GET',$URL);
30: my $res = $ua->request($req);

Lines 29 and 30 create a new HTTP::Request object, which does all the necessary bookkeeping to send an HTTP request to a server. In this case, we're making a GET request to the URL provided when the script was executed. Line 30 uses our user agent's request() method to run the request. The response comes back as an HTTP::Response object, which is stored in the $res variable.
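If your LWP is reasonably recent, the two steps can also be collapsed: LWP::UserAgent provides a get() convenience method that builds and sends the GET request for you. A quick equivalent sketch:

#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $res = $ua->get('http://mydomain.com');   # builds the HTTP::Request internally
print $res->status_line, "\n";               # e.g. "200 OK"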

31: if (!$res->is_success) {
32:   die "Can't fetch $URL";
33: }

Line 31 checks whether the request for the page failed. The is_success() method returns a true value if the requested page was successfully found. If we don't get a true value, our program dies with a simple message. Of course, if we can't get to the original URL, we may as well stop there.

34: $BASE = $res->base;

Line 34 sets the $BASE variable to the base of our response. No real magic here, but it will be used later.

35: my $parser = GetLinks->new;
36: $parser->parse($res->content);

Line 35 creates a new instance of the GetLinks package we created in lines 12 to 25. Remember, GetLinks is a subclass of HTML::Parser, so it inherits the methods HTML::Parser provides. Line 36 uses the inherited parse() method. The argument given to parse() is the HTML content returned from our request to the web site, which we access via the content() method of our response object ($res in this case). HTML::Parser then does its magic on the HTML tags in this content; as the parsing happens, our start() callback method is invoked and the %LINKS hash is populated.

37: for my $link (keys %LINKS) {
38:   my $true_url = url($link, $BASE)->abs;
39:   push(@TO_CHECK, $true_url);
40: }

Lines 37 to 40 loop through the keys of the %LINKS hash. Each key is a URL for a web page or image, since that's what our start() callback looks for. Line 38 passes the link we're working on, along with our base URL, to the url() constructor of URI::URL, and immediately calls the abs() method on the result, all in one shot. The returned value is stored in $true_url and is what we'll eventually check. If abs() sees that $link is already an absolute URI, it leaves it alone; if it's a relative path, it resolves it against $BASE. For example, if $link is '/pages/foo.html' and $BASE is 'http://mydomain.com/', $true_url will be 'http://mydomain.com/pages/foo.html'.
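You can watch this resolution happen with a few lines of standalone code (the URLs here are just examples):

#!/usr/bin/perl -w
use strict;
use URI::URL;

my $BASE = 'http://mydomain.com/';
print url('/pages/foo.html', $BASE)->abs, "\n";         # http://mydomain.com/pages/foo.html
print url('images/logo.png', $BASE)->abs, "\n";         # http://mydomain.com/images/logo.png
print url('http://other.example/x', $BASE)->abs, "\n";  # already absolute; unchanged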

Line 39 pushes this URL onto our @TO_CHECK array, which holds all the links still waiting to be checked.

41: while (my $url = shift @TO_CHECK) {
42:   next if exists $GOOD_LINKS{$url} or exists $BAD_LINKS{$url};

Line 41 starts looping through the @TO_CHECK array, picking URLs off it. We shift elements off because we want to remove URLs from the array as we check them; later in this loop, we'll add newly discovered URLs to the same list. Line 42 skips any URL that's already in the %GOOD_LINKS or %BAD_LINKS hash. These hashes get populated later in this loop.

43:   $req = new HTTP::Request('GET', $url);
44:   $res = $ua->request($req);

Lines 43 and 44 make an HTTP request to the current URL and put the response into the $res variable. This is the same thing we did with our beginning URL, in lines 29 and 30.

45:   if ($res->is_success) {

Line 45 checks whether we connected successfully to the URL. is_success() returns a true value if the web server returned a success response (a 2xx status code, such as 200). An error response, such as a 404 or 500, results in a false value.
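In the spirit of the more detailed reporting left as an exercise earlier, note that the exact status is still available when a request fails. A small standalone sketch (the URL is an example):

#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $res = $ua->get('http://mydomain.com/no-such-page');
if ($res->is_success) {
    print "OK: ", $res->status_line, "\n";       # e.g. "200 OK"
} else {
    print "Failed: ", $res->status_line, "\n";   # e.g. "404 Not Found"
}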

46:       if ($res->content_type =~ /text\/html/i && $url =~ /$URL/i) {

Line 46 gets the content type of the page we've fetched (from the Content-Type header of the response). If the content type is 'text/html', we can expect to get back a page of HTML. We want to know if it's HTML because if it isn't we don't want to parse it for links. We wouldn't want to parse image data or a text file for hyperlinks. As well as checking for content type, we make sure that the URL we began with (http://mydomain.com/) is part of the current $url. If not, we don't want to scan it. If we did, we would end up crawling external sites, which we don't want to do!
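One caution about the second half of that test: $URL is interpolated into a regular expression, so any regex metacharacters in it are live, and a link that merely contains the starting URL as a substring (in a query string, say) would also pass. A hedged alternative is to compare host names explicitly; a standalone sketch (the example URLs are invented):

#!/usr/bin/perl -w
use strict;
use URI::URL;

my $start_host = url('http://mydomain.com/')->host;
for my $link ('http://mydomain.com/a.html',
              'http://evil.example/?u=http://mydomain.com/') {
    my $host = url($link)->host;
    print "$link => ", ($host eq $start_host ? "crawl" : "skip"), "\n";
}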

47:           my $parser = GetLinks->new;
48:           $parser->parse($res->content);

At this point, we've connected successfully to the URL, and we know that the page we connected to is an HTML document. Lines 47 and 48 create a new GetLinks instance and pass the contents of the document to GetLinks to be parsed.

49:           for my $link (keys %LINKS) {

Since the contents of the page have gone through our start() callback, the links found in the document have been added to the %LINKS hash. We want to cycle through these links and add them to the @TO_CHECK array. Line 49 begins this loop.

50:               my $abs = url($link, $BASE)->abs;

Line 50 gets the absolute path of the hyperlink and puts the value in the $abs variable.

51:               unless(exists $GOOD_LINKS{$abs} or exists $BAD_LINKS{$abs}) {
52:                   push(@TO_CHECK, $abs);
53:               }

Since we don't want to check the same link twice, lines 51 to 53 check whether the URL is already in the %GOOD_LINKS or %BAD_LINKS hash. If not, we push the URL onto the end of the @TO_CHECK array to process it later.

54:           }

Line 54 ends the loop through the keys of the %LINKS hash.

55:       }

Line 55 closes the conditional on line 46.

56:       $GOOD_LINKS{$url}++;
57:   } else {
58:       $BAD_LINKS{$url}++;
59:   }
60: }

Line 56 adds the URL to the %GOOD_LINKS hash; we reach this line if is_success() returned a true value on line 45. If is_success() returned false, we land on line 58 instead, where the bad link is recorded in the %BAD_LINKS hash. When all the links have been checked and nothing is left in @TO_CHECK, we have one hash with all the good URLs and one with all the bad ones. The only thing left to do is use this information.

61: print qq{Bad links\n};
62: print qq{$_\n} for keys %BAD_LINKS;

Lines 61 and 62 do some very basic display of the results. Since we're mainly concerned with the bad links, we loop through the keys in %BAD_LINKS and display all the bad links. That's it!
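If you want slightly fuller output, lines 61 and 62 are easy to replace. A sketch of what might go at the end of the script instead (a cosmetic extension, not part of the original listing):

print "Checked ", scalar(keys %GOOD_LINKS) + scalar(keys %BAD_LINKS), " link(s)\n";
print scalar(keys %BAD_LINKS), " bad link(s):\n";
print "  $_\n" for sort keys %BAD_LINKS;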
