Programming Blackberry: Simple Web Crawler in Python - Parse Domain Links Using urllib2 and HTMLParser

I know there are quite a few "Simple Python Crawlers" out on the web for easy download and use. Nonetheless, I felt like I'd add yet another to the mix - Hey, innovation doesn't work without choice, right? Writing a basic web-crawler is pretty simple if you leverage Python's built-in modules that handle the most difficult aspects: opening and managing socket connections to remote servers and parsing the returned HTML. The Python modules urllib2 and HTMLParser provide you with the high-level interface to these lower level processes. The crawler I've written for the tutorial leverages these modules, runs from the command-line, and takes the following two arguments:

"seed url" - where the crawler will begin its parsing
"all" or "local"

The "local" flag tells the crawler to parse only the http links that are contained within the "seed url" domain (local). This means that eventually the parser will stop because there are a limited number of links within a domain. Note that if you were to crawl a large domain, like www.microsoft.com, it could take a very long time to complete the crawl. Caveat #1: (IMPORTANT!)This crawler only looks at the base url on http links to stay within the domain. If absolute links are within the page (e.g., a href="/") this crawler won't pick those up You'll have to add that functionality if that's what you're looking for (but that should be fairly easy)
The "all" flag tells the crawler to parse every http link it finds within the html, even if they are outside the domain. Note that this means the spider could take a very, very, very long time to complete its crawl (years?) I'd suggest running this only if you'd like to see how quickly the number of pending links virtually explodes as the spider crawls. You'll not want to run it for long though as your machine will likely deplete its memory.

Before we begin, you can get the entire source code here but I'd recommend taking a look at the step-by-step below so you can understand how to customize it to your needs.

Caveat #2: Although I've run the program against a handfull of sites and haven't had problems, I've not tested this very thoroughly. This means there could be errors, problems, situations where it crashes, or it could even be giving incorrect link counts. In the coming days I intend on testing it more but if you run into problems let me know in the comments.

Run the Program from the Command-Line

Nothing too complex here: If you'd like to run the crawler to parse only the local domain links on this website you'd give the following command from the command-line:

python spider.py http://berrytutorials.blogspot.com local

Otherwise, if you want to crawl the web starting with my site as the seed url then you'd run the following command:

python spider.py http://berrytutorials.blogspot.com all

The program will give you updates on status, printing the number of pending URLs in the queue along with the number of links(URLs) that have been processed, and when it completes, the total number of links it found. Along the way, as HTMLParser processes the HTML, you'll likely encounter errors in parsing due to malformed tags, etc. that are things that HTMLParser cannot gracefully overlook. The following is what the tail-end of the output looks like:

.....
.....
.....

Crawl Exception: Malformed tag found when parsing HTML
bad end tag: "", at line 1266, column 16

15 Pending URLs are in the queue.
369 URLs have been fully processed.

10 Pending URLs are in the queue.
374 URLs have been fully processed.

5 Pending URLs are in the queue.
379 URLs have been fully processed.

Total number of links: 382
Main-Mini:Desktop john$

I understand that there are better HTML parsers out there in Python such as BeautifulSoup that might be able to handle poorly-formed HTML, however I'm a bit more familiar with HTMLParser.

Overall Architecture of the Simple Crawler

The base design of the crawler consists of the following:

Spider class: Main class that defines two dictionaries to hold the pending URLs to be processed and the visited URLs that are complete. The visited URLs dictionary maps the URL to the HTML that was parsed by HTMLParser so you can further process the link content as suits your application. Also, Spider defines a function called "startcrawling()" which is called to begin the crawl.
LinksHTMLParser: HTML parsing class, declared as a local variable within the startcrawling function in Spider. This class extends the base HTMLParser by overriding the handle_starttag function to only parse out anchor tags. It also defines a local variable named "links" that holds the processed links as strings so the Spider can access them and perform further processing.

Spider Class Details

The main algorithm is in the Spider class' startcrawling() function and operates as follows (in semi-pseudo-code):

While there are URLs in the pendingURLs dictionary:    
     pop another URL from the pending URLs dictionary to process.    
     make HEAD request to URL and check content-type    
     if content-type is 'text/html' process, 
     otherwise continue (break from this 
     iteration of loop)       
          open URL and read in HTML       
          add URL to list of visited URLs       
          for each of the HTTP links found when processing the HTML:          
               parse the link to make sure it is syntactically correct.          
               check to make sure it's HTTP and it hasn't already been visited          
               if command-line option is 'local' check the domain of the link.             
               if the domain is not the same then disregard, 
                 otherwise add to pendingURls          
               otherwise, if adding all links, just add to pendingURLs

Refer to the following code detailing the Spider class:

import sys
import re
import urllib2
from urllib2 import URLError

# Snow Leopard Fix for threading issues Trace/BPT trap problem
urllib2.install_opener(urllib2.build_opener())
from urlparse import urlparse
import threading
import time
from HTMLParser import HTMLParser


"""
Spider takes a starting URL and  visits all links found within each page
until it doesn't find anymore 
"""
class Spider():
 
 def __init__(self,sUrl, crawl):
 
  #Urlparse has the following attributes: scheme, netloc, path, params,query,fragment
  self.startUrl = urlparse(sUrl)
  self.visitedUrls = {} # Map of link -> page HTML
  self.pendingUrls = {sUrl:sUrl} # Map of link->link. Redundant, but used for speed of lookups in hash
  self.startUrlString = sUrl
  self.crawlType = crawl
  self.numBrokenLinks = 0
  self.numTotalLinks = 0
  
 """ Main crawling function that parses the URLs, stores the HTML from each in visitedUrls
   and analyzes the HTML to acquire and process the links within the HTML"""
 def startcrawling(self):
   
  while len(self.pendingUrls) > 0:
   try:
    
    self.printProcessed()
   
    currUrl = self.pendingUrls.popitem()[0]  
    
    
    # Make HEAD request first to see if the type is text/html
    url = urllib2.urlopen(HeadRequest(currUrl))
    conType = url.info()['content-type']
    conTypeVal = conType.split(';')
    
    # Only look at pages that have a content-type of 'text/html'
    if conTypeVal[0] == 'text/html':
 
     url = urllib2.urlopen(currUrl)
     html = url.read()
     
     # Map HTML of the current URL in process in the dictionary to the link
     # for further processing if required
     self.visitedUrls[currUrl] = html
     
     # LinksHTMLParser is extended to take out the a tags only and store 
     htmlparser = LinksHTMLParser()
     htmlparser.feed(html)
     
     # Check each of the a tags found by Parser and store if scheme is http
     # and if it already doesn't exist in the visitedUrls dictionary
     for link in htmlparser.links.keys(): 
      url = urlparse(link)
      
      if url.scheme == 'http' and not self.visitedUrls.has_key(link): 
       if self.crawlType == 'local': 
        if url.netloc == self.startUrl.netloc:
         if not self.pendingUrls.has_key(link):
          self.pendingUrls[link] = link
            
       else: 
        if not self.pendingUrls.has_key(link):    
         self.pendingUrls[link] = link
           

   
   # Don't die on exceptions.  Print and move on
   except URLError:
    print "Crawl Exception: URL parsing error" 
    
   except Exception,details:
    print "Crawl Exception: Malformed tag found when parsing HTML"
    print details
    # Even if there was a problem parsing HTML add the link to the list
    self.visitedUrls[currUrl] = 'None'
    
  if self.crawlType == 'local':
   self.numTotalLinks = len(self.visitedUrls)
 
  print "Total number of links: %d" % self.numTotalLinks

You can see the main loop processes links while there are still pendingUrls in the queue (while 1 and len(self.pendingUrls) > 0). It opens the current Url to process from the pendingURLs dictionary by removing it from the queue using the popitem() function.

Note that because I'm using dictionaries there is no order to the processing of the links; a random one is popped from the dictionary. An improvement/enhancement/customization might be to use an actual queue(list) and process the links in order they were added to the queue. In my case, I decided to randomly process because I didn't think the order mattered in the long run. In the case of visitedURLs I used the dictionary mainly because I wanted quick lookup (O(1)) of the hash for processing the HTML down the road.

Next, a HEAD request is made to the current URL to process to check its 'content-type' value in the header. If it's a 'text/html' content type, we will process it further. I went this route because a) I didn't want to process document (.pdf, .doc, .txt, etc.) files, image (.jpg, .png, etc), audio/video, etc. I only want to look at html files. Also, the reason I make the HEAD request before downloading the entire page is mainly so the crawler is more "polite"; i.e., so it doesn't eat up server processing time downloading entire pages unless it's totally necessary.

After validating the HEAD request, the program downloads the entire page and feeds it to LinksHTMLParser. The following is the code for LinksHTMLParser:

class LinksHTMLParser(HTMLParser):

 def __init__(self):
  self.links = {}
  self.regex = re.compile('^href$')
  HTMLParser.__init__(self)
  
 
    
 # Pull the a href link values out and add to links list
 def handle_starttag(self,tag,attrs):
  if tag == 'a':
   try:
    # Run through the attributes and values appending 
    # tags to the dictionary (only non duplicate links
    # will be appended)
    for (attribute,value) in attrs:
     match = self.regex.match(attribute)
     if match is not None and not self.links.has_key(value):
      self.links[value] = value
      
     
   except Exception,details:
    print "LinksHTMLParser: " 
    print Exception,details

You can see that I've inherited from HTMLParser and overridden the handle_starttag function so we only look at anchor tags that have an href value (in order to eliminate some tag processing). Then LinksHTMLParser adds each anchor link to an internal dictionary called links that holds the links on that processed page for Spider to further process.

Finally, Spider loops on the links found by LinksHTMLParser and checks if it's local (domain-only) crawl will check the domain of each link to make sure it's the same as the "seed URL". Otherwise it just adds it if the link doesn't already exist in the pendingURLs dictionary.

Areas for Crawler Customization and Enhancement

As it is written, the crawler doesn't do much more than get the links, parse them, count them, store each link's HTML in a dictionary, and return a total. Obviously you'd be advised to make it actually do something real, even as simple as just printing out the links it finds so you can review them. In fact, before posting this I had it doing just that (to standard out) after the main while loop returned (lines 90-91) in the whole version:

for link in self.visitedUrls.keys():
 print link

You might customize that to write to a file instead of STDOUT so it could be further processed by external scripts.

Here's some other enhancements I'd suggest:

Use the crawler to parse the HTML content you've stored for each link in the pendingUrls dictionary. Say you're looking for some particular content on a site, you'd add a function that processes that HTML after the startcrawling function is complete, using another extended version of the LinksHTMLParser to do some other scraping.
limit the "depth" that the crawler runs when it's an "all" search - e.g., have a variable from the command line limit the number of times the crawler runs through the found links so you can get the all version to stop.
Although this crawler is semi-polite because it requests the HEAD before the whole page, you'd really want to download the robots.txt file from the seed domain (and from each outside domain that the crawler accesses if you're hitting all domains) to ensure crawlers are allowed. You don't want to accidentally access some NSA website, scrape all the content, then have agents knocking at your door that afternoon.
This crawler makes requests on the webserver without any delay between requests and worst-case could bring a server down or severely slow it down. You'd likely put some kind of delay between requests so as to not overwhelm the target servers (use time.sleep(SECS))
Instead of making the HEAD request, checking the content-type to see if the URL pending to visit is html, you could use a regular epression to test the URL to see if it ends in either a '/', '.asp','.php','htm', or 'html' then just request the page. This would avoid the immediate GET after the HEAD and limit stress on the server

Preventing Duplicate HTML Content

One issue I thought of is it would be a good idea to enhance the crawler so it doesn't look at duplicate HTML content if your end goal is to test the actual page details for each link. The crawler is written so it definitely doesn't store duplicate links but that doesn't guarantee that the HTML content is unique. For example, on my site it's finding 382 total unique links even though I only have 15 posts. Where are all these extra links coming from?

It's the widgets that i'm using in my template, for example here are some 'widget' links the crawler found:

http://berrytutorials.blogspot.com/search/label/blackberry?widgetType=BlogArchive&widgetId=BlogArchive1&action=toggle&dir=open&toggle=YEARLY-1230796800000&toggleopen=MONTHLY-1262332800000

http://berrytutorials.blogspot.com/2009/11/create-custom-listfield-change.html?widgetType=BlogArchive&widgetId=BlogArchive1&action=toggle&dir=open&toggle=YEARLY-1262332800000&toggleopen=MONTHLY-1257058800000

Although these are unique links, they point to content that is already catalogued by the crawler when it found the main page (ex., look at the second link, it points to the article 'create-custom-listfield-change.html', and the crawler also holds the link to the actual page - the HTML content is duplicated).

To prevent this, I'd think after the crawler is complete you'd have a 'normalization' process where the found links are checked for duplicate content. Since I've stored the HTML for each link you wouldn't have to have the spider reconnect to the crawled website, just check the HTML. I haven't thought this through completely to suggest an algorithm that would be fast though so I'll leave that up to you.

Wrap up and Roll

Although there are tons of open-source crawlers on the web I think that writing one yourself will definitely help you understand the complexities of link and content parsing and will help you actually visualize the explosion of links that are out there. For example, I set this crawler on www.yahoo.com and within a couple minutes it was up to over 2000 links that were in the queue. I was honestly surprised to find so many links just in my simple blog. It was a great learning experience and I hope this article helped you along the path to writing a more advanced crawler.

In case you're interested, here is an article about distributed crawlers (state of the art from 2003 :) here

As usual, let me know if you have questions/concerns in the comments!

5 comments:

Anonymous said...: June 21, 2010 at 7:06 PM; I typed the following in the command line as you said:

python spider.py http://berrytutorials.blogspot.com local

But then I got the error message saying that "SyndaxError: invalid syntax"

Please help me.
Anonymous said...: September 22, 2010 at 12:04 AM; I typed the following in the command line as you said:

python spider.py http://berrytutorials.blogspot.com local

But then I got the error message saying that
"Crawl Exception: URL parsing error
Total number of links: 0"

please help me...
karl said...: October 27, 2010 at 12:43 PM; You can add something along this to be able to catch sites which serves with application/xhtml+xml

if conTypeVal[0] in ('text/html','application/xhtml+xml'):
John Banks said...: October 27, 2010 at 9:48 PM; @Karl: thanks for the comments. Although I haven't validated your code suggestions they appear sound. That being said I'd also say that this crawler isn't meant to be a complete solution as evidenced by the "simple" in the title of the article. Enhancements are left up to the reader to investigate and improve upon.
Unknown said...: September 25, 2013 at 1:40 AM; Hi,Website functionality must not be ignored. No matter what programming language you use in Web Design Cochin, you should always adhere to clean coding standards and always have it validated. -Thanks....

Programming Blackberry

Search

About Me

Tags

Archive

Simple Web Crawler in Python - Parse Domain Links Using urllib2 and HTMLParser