- "seed url" - where the crawler will begin its parsing
- "all" or "local"
- The "local" flag tells the crawler to parse only the http links that are contained within the "seed url" domain (local). This means that eventually the parser will stop, because there are a limited number of links within a domain. Note that if you were to crawl a large domain, like www.microsoft.com, it could take a very long time to complete the crawl. Caveat #1 (IMPORTANT!): This crawler only looks at the base URL of fully-qualified http links to stay within the domain. If relative links are within the page (e.g., a href="/"), this crawler won't pick those up. You'll have to add that functionality if that's what you're looking for (but that should be fairly easy).
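If you do want to pick up those relative links, the standard library's urljoin can resolve them against the page they came from before queueing. A minimal sketch (shown with Python 3's urllib.parse; the article's code uses Python 2, where the same function lives in urlparse) — the helper name resolve_link is mine, not part of the crawler:

```python
from urllib.parse import urljoin

def resolve_link(page_url, href):
    """Resolve a possibly-relative href against the page it came from."""
    # urljoin leaves already-absolute links untouched and resolves
    # "/", "../x", and "page.html" style hrefs against page_url
    return urljoin(page_url, href)

root = resolve_link("http://berrytutorials.blogspot.com/2009/11/post.html", "/")
sibling = resolve_link("http://berrytutorials.blogspot.com/2009/11/post.html", "other.html")
```

Running each extracted href through a helper like this before the scheme/netloc checks would let the "local" crawl follow root-relative links too.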
- The "all" flag tells the crawler to parse every http link it finds within the HTML, even if it is outside the domain. Note that this means the spider could take a very, very, very long time to complete its crawl (years?). I'd suggest running this only if you'd like to see how quickly the number of pending links explodes as the spider crawls. You won't want to run it for long, though, as your machine will likely run out of memory.
Caveat #2: Although I've run the program against a handful of sites and haven't had problems, I've not tested it very thoroughly. This means there could be errors, crashes, or even incorrect link counts. In the coming days I intend to test it more, but if you run into problems, let me know in the comments.
Run the Program from the Command-Line
Nothing too complex here: If you'd like to run the crawler to parse only the local domain links on this website you'd give the following command from the command-line:
python spider.py http://berrytutorials.blogspot.com local

Otherwise, if you want to crawl the web starting with my site as the seed url, then you'd run the following command:

python spider.py http://berrytutorials.blogspot.com all

The program will give you updates on status, printing the number of pending URLs in the queue along with the number of links (URLs) that have been processed, and, when it completes, the total number of links it found. Along the way, as HTMLParser processes the HTML, you'll likely encounter parsing errors due to malformed tags and other things that HTMLParser cannot gracefully overlook. The following is what the tail end of the output looks like:
.....
Crawl Exception: Malformed tag found when parsing HTML
bad end tag: "", at line 1266, column 16
15 Pending URLs are in the queue. 369 URLs have been fully processed.
10 Pending URLs are in the queue. 374 URLs have been fully processed.
5 Pending URLs are in the queue. 379 URLs have been fully processed.
Total number of links: 382
Main-Mini:Desktop john$

I understand that there are better HTML parsers out there in Python, such as BeautifulSoup, that might be able to handle poorly-formed HTML; however, I'm a bit more familiar with HTMLParser.
Overall Architecture of the Simple Crawler
The base design of the crawler consists of the following:
- Spider class: Main class that defines two dictionaries to hold the pending URLs to be processed and the visited URLs that are complete. The visited URLs dictionary maps the URL to the HTML that was parsed by HTMLParser so you can further process the link content as suits your application. Also, Spider defines a function called "startcrawling()" which is called to begin the crawl.
- LinksHTMLParser: HTML parsing class, declared as a local variable within the startcrawling function in Spider. This class extends the base HTMLParser by overriding the handle_starttag function to only parse out anchor tags. It also defines a local variable named "links" that holds the processed links as strings so the Spider can access them and perform further processing.
Spider Class Details
The main algorithm is in the Spider class' startcrawling() function and operates as follows (in semi-pseudo-code):
while there are URLs in the pendingUrls dictionary:
    pop a URL from the pending URLs dictionary to process
    make a HEAD request to the URL and check its content-type
    if the content-type is 'text/html', process it; otherwise continue (skip this iteration of the loop)
    open the URL and read in the HTML
    add the URL to the list of visited URLs
    for each of the http links found when processing the HTML:
        parse the link to make sure it is syntactically correct
        check that it is http and that it hasn't already been visited
        if the command-line option is 'local', check the domain of the link:
            if the domain is not the same, disregard it; otherwise add it to pendingUrls
        otherwise, if adding all links, just add it to pendingUrls

Refer to the following code detailing the Spider class:
import sys
import re
import urllib2
from urllib2 import URLError
# Snow Leopard fix for threading issues (Trace/BPT trap problem)
urllib2.install_opener(urllib2.build_opener())
from urlparse import urlparse
import threading
import time
from HTMLParser import HTMLParser

# Request subclass that issues HEAD requests (referenced below; this
# definition was not shown in the original listing)
class HeadRequest(urllib2.Request):
    def get_method(self):
        return 'HEAD'

""" Spider takes a starting URL and visits all links found within
each page until it doesn't find any more """
class Spider():

    def __init__(self, sUrl, crawl):
        # urlparse results have the following attributes: scheme, netloc,
        # path, params, query, fragment
        self.startUrl = urlparse(sUrl)
        self.visitedUrls = {}            # Map of link -> page HTML
        self.pendingUrls = {sUrl: sUrl}  # Map of link -> link. Redundant, but used for speed of lookups in hash
        self.startUrlString = sUrl
        self.crawlType = crawl
        self.numBrokenLinks = 0
        self.numTotalLinks = 0

    """ Main crawling function that parses the URLs, stores the HTML from
    each in visitedUrls and analyzes the HTML to acquire and process the
    links within the HTML """
    def startcrawling(self):
        while len(self.pendingUrls) > 0:
            try:
                self.printProcessed()  # status-printing helper (not shown in this listing)
                currUrl = self.pendingUrls.popitem()[0]

                # Make HEAD request first to see if the type is text/html
                url = urllib2.urlopen(HeadRequest(currUrl))
                conType = url.info()['content-type']
                conTypeVal = conType.split(';')

                # Only look at pages that have a content-type of 'text/html'
                if conTypeVal[0] == 'text/html':
                    url = urllib2.urlopen(currUrl)
                    html = url.read()

                    # Map the HTML of the URL currently in process to the link
                    # for further processing if required
                    self.visitedUrls[currUrl] = html

                    # LinksHTMLParser is extended to pull out the a tags only
                    htmlparser = LinksHTMLParser()
                    htmlparser.feed(html)

                    # Check each of the a tags found by the parser and store it
                    # if its scheme is http and it doesn't already exist in the
                    # visitedUrls dictionary
                    for link in htmlparser.links.keys():
                        url = urlparse(link)
                        if url.scheme == 'http' and not self.visitedUrls.has_key(link):
                            if self.crawlType == 'local':
                                if url.netloc == self.startUrl.netloc:
                                    if not self.pendingUrls.has_key(link):
                                        self.pendingUrls[link] = link
                            else:
                                if not self.pendingUrls.has_key(link):
                                    self.pendingUrls[link] = link

            # Don't die on exceptions. Print and move on
            except URLError:
                print "Crawl Exception: URL parsing error"
            except Exception, details:
                print "Crawl Exception: Malformed tag found when parsing HTML"
                print details
                # Even if there was a problem parsing the HTML, add the link to the list
                self.visitedUrls[currUrl] = 'None'

        if self.crawlType == 'local':
            self.numTotalLinks = len(self.visitedUrls)

        print "Total number of links: %d" % self.numTotalLinks

You can see that the main loop processes links while there are still pending URLs in the queue (len(self.pendingUrls) > 0). It pops the current URL to process from the pendingUrls dictionary, removing it from the queue with the popitem() function.
Note that because I'm using dictionaries there is no order to the processing of the links; a random one is popped from the dictionary. An improvement/enhancement/customization would be to use an actual queue (list) and process the links in the order they were added. In my case, I decided to process them in arbitrary order because I didn't think the order mattered in the long run. For visitedUrls I used a dictionary mainly because I wanted quick O(1) lookups when processing the HTML down the road.
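If you do want that breadth-first ordering, one option is a deque for the pending URLs plus a set for the O(1) membership checks the dictionary was providing. A sketch of the idea, separate from the crawler code above (the enqueue helper is mine):

```python
from collections import deque

pending = deque(["http://example.com/"])  # FIFO queue of URLs to visit
seen = {"http://example.com/"}            # O(1) membership checks, like the dictionary

def enqueue(link):
    """Queue a link once; the deque preserves discovery order."""
    if link not in seen:
        seen.add(link)
        pending.append(link)

enqueue("http://example.com/a")
enqueue("http://example.com/b")
enqueue("http://example.com/a")           # duplicate, silently ignored

order = []
while pending:
    order.append(pending.popleft())       # oldest link first
```

This keeps the fast lookups while guaranteeing links are crawled in the order they were discovered.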
Next, a HEAD request is made to the current URL to check the 'content-type' value in its header. If it's a 'text/html' content type, we process it further. I went this route because I didn't want to process document files (.pdf, .doc, .txt, etc.), images (.jpg, .png, etc.), audio/video, and so on; I only want to look at HTML files. Also, the reason I make the HEAD request before downloading the entire page is mainly so the crawler is more "polite", i.e., so it doesn't eat up server processing time downloading entire pages unless it's truly necessary.
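For reference, the same HEAD-then-check pattern in modern Python looks roughly like this (urllib.request with its method parameter replaces the urllib2 HeadRequest subclass the article's code relies on) — a sketch, with the function names my own:

```python
from urllib import request

def is_html(content_type):
    """True when a Content-Type header value names an HTML document."""
    # Headers often look like "text/html; charset=UTF-8"
    return content_type.split(";")[0].strip().lower() == "text/html"

def head_content_type(url):
    """Issue a HEAD request only, so no page body is downloaded."""
    req = request.Request(url, method="HEAD")
    with request.urlopen(req) as resp:
        return resp.headers.get("Content-Type", "")

# Usage (makes a real network request):
#   if is_html(head_content_type("http://berrytutorials.blogspot.com")):
#       ...fetch and parse the full page...
```

Note that some servers answer HEAD incorrectly or not at all, so a production crawler would fall back to GET when HEAD fails.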
After validating the HEAD request, the program downloads the entire page and feeds it to LinksHTMLParser. The following is the code for LinksHTMLParser:
class LinksHTMLParser(HTMLParser):

    def __init__(self):
        self.links = {}
        self.regex = re.compile('^href$')
        HTMLParser.__init__(self)

    # Pull the a href link values out and add to links list
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            try:
                # Run through the attributes and values, appending
                # tags to the dictionary (only non-duplicate links
                # will be appended)
                for (attribute, value) in attrs:
                    match = self.regex.match(attribute)
                    if match is not None and not self.links.has_key(value):
                        self.links[value] = value
            except Exception, details:
                print "LinksHTMLParser: "
                print Exception, details
You can see that I've inherited from HTMLParser and overridden the handle_starttag function so we only look at anchor tags that have an href value (in order to eliminate some tag processing). Then LinksHTMLParser adds each anchor link to an internal dictionary called links that holds the links on that processed page for Spider to further process.
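If you're on Python 3, where the module is html.parser and dict.has_key is gone, the same idea can be sketched like this (the class name LinkExtractor is mine, not the article's):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect unique href values from anchor tags, like LinksHTMLParser."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are inspected; every other tag is skipped
        if tag == "a":
            for attribute, value in attrs:
                if attribute == "href" and value not in self.links:
                    self.links[value] = value

parser = LinkExtractor()
parser.feed('<p><a href="/one">1</a> <a href="/two">2</a> <a href="/one">dup</a></p>')
found = sorted(parser.links)
```

A plain equality test on the attribute name also replaces the '^href$' regex, which was doing the same job.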
Finally, Spider loops over the links found by LinksHTMLParser. If it's a local (domain-only) crawl, it checks the domain of each link to make sure it matches the "seed URL"; otherwise it just adds the link, provided it doesn't already exist in the pendingUrls dictionary.
Areas for Crawler Customization and Enhancement
As it is written, the crawler doesn't do much more than get the links, parse them, count them, store each link's HTML in a dictionary, and return a total. Obviously you'd be advised to make it actually do something real, even something as simple as printing out the links it finds so you can review them. In fact, before posting this I had it doing just that (to standard out) after the main while loop returned (lines 90-91 in the full version):
for link in self.visitedUrls.keys():
    print link
You might customize that to write to a file instead of STDOUT so it could be further processed by external scripts.
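That file-writing variant is a small change — a sketch, with the function name and the 'links.txt' path chosen here purely for illustration:

```python
def dump_links(visited_urls, path):
    """Write one crawled URL per line for external scripts to consume."""
    with open(path, "w") as out:
        for link in sorted(visited_urls):     # sorted for a stable, diffable file
            out.write(link + "\n")

dump_links({"http://a.example/": "<html>", "http://b.example/": "<html>"}, "links.txt")
```

One URL per line keeps the output trivially consumable by grep, sort, or a follow-up script.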
Here's some other enhancements I'd suggest:
- Use the crawler to parse the HTML content you've stored for each link in the visitedUrls dictionary. Say you're looking for some particular content on a site: you'd add a function that processes that HTML after the startcrawling function completes, using another extended version of HTMLParser (like LinksHTMLParser) to do the scraping.
- Limit the "depth" that the crawler runs when it's an "all" search — e.g., have a variable from the command line limit how many levels of found links the crawler follows, so the "all" version can actually stop.
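Depth limiting can be sketched by storing each URL together with the depth at which it was discovered, and refusing to queue anything past the limit. This toy version uses a stubbed link extractor (get_links is a placeholder you'd replace with the real HEAD/GET/parse logic):

```python
def crawl_limited(seed, get_links, max_depth):
    """Breadth-first crawl that stops expanding links beyond max_depth."""
    visited = set()
    pending = [(seed, 0)]                     # (url, depth at which it was found)
    while pending:
        url, depth = pending.pop(0)
        if url in visited:
            continue
        visited.add(url)
        if depth < max_depth:                 # don't queue children past the limit
            for link in get_links(url):
                if link not in visited:
                    pending.append((link, depth + 1))
    return visited

# Toy link graph standing in for real pages
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
reached = crawl_limited("a", lambda url: graph[url], max_depth=2)
```

With max_depth=2 the page found at depth 2 is still visited, but its outgoing links are never queued, which is what finally bounds an "all" crawl.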
- Although this crawler is semi-polite because it requests the HEAD before the whole page, you'd really want to download the robots.txt file from the seed domain (and from each outside domain that the crawler accesses if you're hitting all domains) to ensure crawlers are allowed. You don't want to accidentally access some NSA website, scrape all the content, then have agents knocking at your door that afternoon.
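Python actually ships a robots.txt parser for this (urllib.robotparser in Python 3, robotparser in Python 2). A sketch that feeds the rules in directly so it runs without a network — normally you'd call set_url() and read() against the live file, and the example rules and agent name here are made up:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Normally: rp.set_url("http://example.com/robots.txt"); rp.read()
# Rules are supplied inline here so the sketch works offline.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def allowed(url, agent="MyCrawler"):
    """True when robots.txt permits this agent to fetch the URL."""
    return rp.can_fetch(agent, url)
```

The crawler would call a check like this before queueing any URL, keeping one parser per domain when crawling with the "all" flag.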
- This crawler makes requests on the webserver without any delay between requests and worst-case could bring a server down or severely slow it down. You'd likely put some kind of delay between requests so as to not overwhelm the target servers (use time.sleep(SECS))
- Instead of making the HEAD request and checking the content-type to see if the pending URL is HTML, you could use a regular expression to test whether the URL ends in '/', '.asp', '.php', '.htm', or '.html', and then just request the page. This would avoid the immediate GET after the HEAD and limit stress on the server.
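A sketch of that test — a compiled regex accepting URLs that end in a slash or one of the extensions just listed (the list is illustrative, not exhaustive, and query strings aren't handled here):

```python
import re

# Accept URLs ending in '/' or in a typical HTML extension
HTML_LIKE = re.compile(r".*(/|\.asp|\.php|\.htm|\.html)$", re.IGNORECASE)

def looks_like_html(url):
    """Guess from the URL alone whether a page is worth a GET."""
    return HTML_LIKE.match(url) is not None
```

The trade-off is accuracy for politeness: extensionless URLs (very common on modern sites) would be skipped even when they serve HTML, so you might combine this heuristic with a HEAD fallback for ambiguous URLs.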
Preventing Duplicate HTML Content
One issue I thought of: it would be a good idea to enhance the crawler so it doesn't store duplicate HTML content if your end goal is to process the actual page details for each link. The crawler is written so it definitely doesn't store duplicate links, but that doesn't guarantee that the HTML content is unique. For example, on my site it finds 382 total unique links even though I only have 15 posts. Where are all these extra links coming from?
It's the widgets I'm using in my template. For example, here are some 'widget' links the crawler found:
http://berrytutorials.blogspot.com/search/label/blackberry?widgetType=BlogArchive&widgetId=BlogArchive1&action=toggle&dir=open&toggle=YEARLY-1230796800000&toggleopen=MONTHLY-1262332800000
http://berrytutorials.blogspot.com/2009/11/create-custom-listfield-change.html?widgetType=BlogArchive&widgetId=BlogArchive1&action=toggle&dir=open&toggle=YEARLY-1262332800000&toggleopen=MONTHLY-1257058800000
Although these are unique links, they point to content the crawler already catalogued when it found the main page (e.g., look at the second link: it points to the article 'create-custom-listfield-change.html', and the crawler also holds the link to the actual page, so the HTML content is duplicated).
To prevent this, I'd think that after the crawl is complete you'd run a 'normalization' pass where the found links are checked for duplicate content. Since I've stored the HTML for each link, the spider wouldn't have to reconnect to the crawled website; it can just check the HTML. I haven't thought this through enough to suggest a fast algorithm, though, so I'll leave that up to you.
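One simple normalization pass is to hash each stored HTML body and keep a single URL per digest. A sketch using hashlib, assuming exact-duplicate pages (it won't catch near-duplicates that differ only by a timestamp or counter, and the function name is mine):

```python
import hashlib

def dedupe_by_content(visited_urls):
    """Keep one representative URL per unique HTML body."""
    by_digest = {}
    for url in sorted(visited_urls):          # deterministic winner per digest
        digest = hashlib.sha1(visited_urls[url].encode("utf-8")).hexdigest()
        by_digest.setdefault(digest, url)     # first URL seen wins
    return set(by_digest.values())

pages = {
    "http://site.example/a.html": "<html>same</html>",
    "http://site.example/a.html?widget=1": "<html>same</html>",
    "http://site.example/b.html": "<html>other</html>",
}
unique = dedupe_by_content(pages)
```

Hashing keeps the pass O(n) in the number of pages; near-duplicate detection (e.g., shingling) would be the heavier follow-up if exact matching isn't enough.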
Wrap up and Roll
Although there are tons of open-source crawlers on the web, I think writing one yourself will definitely help you understand the complexities of link and content parsing, and will help you actually visualize the explosion of links that are out there. For example, I set this crawler on www.yahoo.com and within a couple of minutes it had over 2000 links in the queue. I was honestly surprised to find so many links in just my simple blog. It was a great learning experience, and I hope this article helps you along the path to writing a more advanced crawler.
In case you're interested, here is an article about distributed crawlers (state of the art from 2003 :) here
As usual, let me know if you have questions/concerns in the comments!