I know there are quite a few "Simple Python Crawlers" out on the web for easy download and use. Nonetheless, I felt like I'd add yet another to the mix - Hey, innovation doesn't work without choice, right? Writing a basic web-crawler is pretty simple if you leverage Python's built-in modules that handle the most difficult aspects: opening and managing socket connections to remote servers and parsing the returned HTML.  The Python modules urllib2 and HTMLParser provide you with the high-level interface to these lower level processes.  The crawler I've written for the tutorial leverages these modules, runs from the command-line, and takes the following two arguments:

  • "seed url" - where the crawler will begin its parsing
  • "all" or "local"
    • The "local" flag tells the crawler to parse only the http links that are contained within the "seed url" domain (local).  This means that the parser will eventually stop because there are a limited number of links within a domain.  Note that if you were to crawl a large domain, like www.microsoft.com, it could take a very long time to complete the crawl. Caveat #1 (IMPORTANT!): This crawler only looks at the base URL of fully-qualified http links to stay within the domain. If relative links are within the page (e.g., a href="/") this crawler won't pick those up. You'll have to add that functionality if that's what you're looking for (but that should be fairly easy).
    • The "all" flag tells the crawler to parse every http link it finds within the html, even if they are outside the domain.  Note that this means the spider could take a very, very, very long time to complete its crawl (years?). I'd suggest running this only if you'd like to see how quickly the number of pending links virtually explodes as the spider crawls.  You won't want to run it for long, though, as your machine will likely run out of memory.
Before we begin, you can get the entire source code here but I'd recommend taking a look at the step-by-step below so you can understand how to customize it to your needs.

Caveat #2: Although I've run the program against a handful of sites and haven't had problems, I've not tested this very thoroughly. This means there could be errors, problems, situations where it crashes, or it could even be giving incorrect link counts. In the coming days I intend to test it more, but if you run into problems let me know in the comments.

Run the Program from the Command-Line


Nothing too complex here: If you'd like to run the crawler to parse only the local domain links on this website you'd give the following command from the command-line:
python spider.py http://berrytutorials.blogspot.com local

Otherwise, if you want to crawl the web starting with my site as the seed url then you'd run the following command:
python spider.py http://berrytutorials.blogspot.com all

The program will give you updates on its status, printing the number of pending URLs in the queue along with the number of links (URLs) that have been processed, and, when it completes, the total number of links it found. Along the way, as HTMLParser processes the HTML, you'll likely see errors caused by malformed tags and other problems that HTMLParser cannot gracefully overlook. The following is what the tail end of the output looks like:
.....
.....
.....

Crawl Exception: Malformed tag found when parsing HTML
bad end tag: "", at line 1266, column 16

15 Pending URLs are in the queue.
369 URLs have been fully processed.

10 Pending URLs are in the queue.
374 URLs have been fully processed.

5 Pending URLs are in the queue.
379 URLs have been fully processed.

Total number of links: 382
Main-Mini:Desktop john$ 

I understand that there are better HTML parsers for Python, such as BeautifulSoup, that might be able to handle poorly-formed HTML; however, I'm a bit more familiar with HTMLParser.

Overall Architecture of the Simple Crawler


The base design of the crawler consists of the following:
  • Spider class: Main class that defines two dictionaries to hold the pending URLs to be processed and the visited URLs that are complete.  The visited URLs dictionary maps the URL to the HTML that was parsed by HTMLParser so you can further process the link content as suits your application.  Also, Spider defines a function called "startcrawling()" which is called to begin the crawl.
  • LinksHTMLParser: HTML parsing class, declared as a local variable within the startcrawling function in Spider.  This class extends the base HTMLParser by overriding the handle_starttag function to only parse out anchor tags.  It also defines a local variable named "links" that holds the processed links as strings so the Spider can access them and perform further processing.

Spider Class Details


The main algorithm is in the Spider class' startcrawling() function and operates as follows (in semi-pseudo-code):
While there are URLs in the pendingURLs dictionary:
     pop another URL from the pendingURLs dictionary to process
     make a HEAD request to the URL and check its content-type
     if the content-type is not 'text/html', continue (skip this iteration of the loop)
     otherwise:
          open the URL and read in the HTML
          add the URL to the list of visited URLs
          for each of the http links found when processing the HTML:
               parse the link to make sure it is syntactically correct
               check that it is http and hasn't already been visited
               if the command-line option is 'local', check the domain of the link:
                    if the domain is not the same as the seed's, disregard the link
                    otherwise add it to pendingURLs
               otherwise (if adding all links), just add it to pendingURLs

Refer to the following code detailing the Spider class:
import sys
import re
import urllib2
from urllib2 import URLError

# Snow Leopard Fix for threading issues Trace/BPT trap problem
urllib2.install_opener(urllib2.build_opener())
from urlparse import urlparse
import threading
import time
from HTMLParser import HTMLParser


"""
Spider takes a starting URL and  visits all links found within each page
until it doesn't find anymore 
"""
class Spider():
 
 def __init__(self,sUrl, crawl):
 
  #Urlparse has the following attributes: scheme, netloc, path, params,query,fragment
  self.startUrl = urlparse(sUrl)
  self.visitedUrls = {} # Map of link -> page HTML
  self.pendingUrls = {sUrl:sUrl} # Map of link->link. Redundant, but used for speed of lookups in hash
  self.startUrlString = sUrl
  self.crawlType = crawl
  self.numBrokenLinks = 0
  self.numTotalLinks = 0
  
 """ Main crawling function that parses the URLs, stores the HTML from each in visitedUrls
   and analyzes the HTML to acquire and process the links within the HTML"""
 def startcrawling(self):
   
  while len(self.pendingUrls) > 0:
   try:
    
    self.printProcessed()
   
    currUrl = self.pendingUrls.popitem()[0]  
    
    
    # Make HEAD request first to see if the type is text/html
    url = urllib2.urlopen(HeadRequest(currUrl))
    conType = url.info()['content-type']
    conTypeVal = conType.split(';')
    
    # Only look at pages that have a content-type of 'text/html'
    if conTypeVal[0] == 'text/html':
 
     url = urllib2.urlopen(currUrl)
     html = url.read()
     
     # Map HTML of the current URL in process in the dictionary to the link
     # for further processing if required
     self.visitedUrls[currUrl] = html
     
     # LinksHTMLParser is extended to take out the a tags only and store 
     htmlparser = LinksHTMLParser()
     htmlparser.feed(html)
     
     # Check each of the a tags found by Parser and store if scheme is http
     # and if it already doesn't exist in the visitedUrls dictionary
     for link in htmlparser.links.keys(): 
      url = urlparse(link)
      
      if url.scheme == 'http' and not self.visitedUrls.has_key(link): 
       if self.crawlType == 'local': 
        if url.netloc == self.startUrl.netloc:
         if not self.pendingUrls.has_key(link):
          self.pendingUrls[link] = link
            
       else: 
        if not self.pendingUrls.has_key(link):    
         self.pendingUrls[link] = link
           

   
   # Don't die on exceptions.  Print and move on
   except URLError:
    print "Crawl Exception: URL parsing error" 
    
   except Exception,details:
    print "Crawl Exception: Malformed tag found when parsing HTML"
    print details
    # Even if there was a problem parsing HTML add the link to the list
    self.visitedUrls[currUrl] = 'None'
    
  # Record the total number of links visited during the crawl
  self.numTotalLinks = len(self.visitedUrls)

  print "Total number of links: %d" % self.numTotalLinks
  
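Note that HeadRequest and printProcessed are used above but aren't shown in the excerpt; they come from the full source linked earlier. In case you're typing the code in by hand, here's a minimal sketch of the two (the versions in the download may differ slightly):

# HEAD request helper: urllib2 issues a GET by default, so override get_method
class HeadRequest(urllib2.Request):
 def get_method(self):
  return "HEAD"

# Status printer (a method of the Spider class): prints the counts shown in the sample output
 def printProcessed(self):
  print "%d Pending URLs are in the queue." % len(self.pendingUrls)
  print "%d URLs have been fully processed.\n" % len(self.visitedUrls)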

You can see the main loop processes links while there are still pending URLs in the queue (while len(self.pendingUrls) > 0). It pulls the current URL to process out of the pendingUrls dictionary by removing it from the queue with the popitem() method.

Note that because I'm using dictionaries there is no order to the processing of the links; a random one is popped from the dictionary. An improvement/enhancement/customization might be to use an actual queue (list) and process the links in the order they were added. In my case I decided to process them in arbitrary order because I didn't think the order mattered in the long run. For visitedUrls I used a dictionary mainly because I wanted quick O(1) lookups when processing the HTML down the road.
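
If you did want first-in, first-out ordering, a rough sketch of that change might look like this (illustrative names only; extractLinks stands in for the HTMLParser step):

from collections import deque

pendingUrls = deque([seedUrl])   # FIFO queue of URLs still to visit
seen = set([seedUrl])            # a set gives the same O(1) membership test as a dictionary

while pendingUrls:
 currUrl = pendingUrls.popleft()        # oldest link first (breadth-first crawl)
 for link in extractLinks(currUrl):     # hypothetical stand-in for the parsing step
  if link not in seen:
   seen.add(link)
   pendingUrls.append(link)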

Next, a HEAD request is made to the current URL to check the 'content-type' value in its header. If it's a 'text/html' content type, we process it further. I went this route because I didn't want to process documents (.pdf, .doc, .txt, etc.), images (.jpg, .png, etc.), audio/video, and so on; I only want to look at html pages. Also, the reason I make the HEAD request before downloading the entire page is mainly so the crawler is more "polite", i.e., so it doesn't eat up server processing time serving entire pages unless it's truly necessary.

After validating the HEAD request, the program downloads the entire page and feeds it to LinksHTMLParser. The following is the code for LinksHTMLParser:

class LinksHTMLParser(HTMLParser):

 def __init__(self):
  self.links = {}
  self.regex = re.compile('^href$')
  HTMLParser.__init__(self)
  
 
    
 # Pull the a href link values out and add to links list
 def handle_starttag(self,tag,attrs):
  if tag == 'a':
   try:
    # Run through the attributes and values appending 
    # tags to the dictionary (only non duplicate links
    # will be appended)
    for (attribute,value) in attrs:
     match = self.regex.match(attribute)
     if match is not None and not self.links.has_key(value):
      self.links[value] = value
      
     
   except Exception,details:
    print "LinksHTMLParser: " 
    print Exception,details



You can see that I've inherited from HTMLParser and overridden the handle_starttag function so we only look at anchor tags that have an href value (in order to eliminate some tag processing). Then LinksHTMLParser adds each anchor link to an internal dictionary called links that holds the links on that processed page for Spider to further process.

Finally, Spider loops over the links found by LinksHTMLParser. If it's a local (domain-only) crawl, it checks the domain of each link to make sure it matches the "seed URL"; otherwise it simply adds any link that doesn't already exist in the pendingUrls dictionary.


Areas for Crawler Customization and Enhancement


As it is written, the crawler doesn't do much more than get the links, parse them, count them, store each link's HTML in a dictionary, and return a total. Obviously you'd want to make it actually do something useful, even something as simple as printing out the links it finds so you can review them. In fact, before posting this I had it doing just that (to standard out) after the main while loop returned (lines 90-91 in the full version):

for link in self.visitedUrls.keys():
 print link

You might customize that to write to a file instead of STDOUT so it could be further processed by external scripts.

Here's some other enhancements I'd suggest:
  • Use the crawler to parse the HTML content you've stored for each link in the visitedUrls dictionary. Say you're looking for some particular content on a site; you'd add a function that processes that HTML after the startcrawling function is complete, using another extended version of LinksHTMLParser to do some other scraping.
  • limit the "depth" that the crawler runs when it's an "all" search - e.g., have a variable from the command line limit the number of times the crawler runs through the found links so you can get the all version to stop.
  • Although this crawler is semi-polite because it requests the HEAD before the whole page, you'd really want to download the robots.txt file from the seed domain (and from each outside domain that the crawler accesses if you're hitting all domains) to ensure crawlers are allowed. You don't want to accidentally access some NSA website, scrape all the content, then have agents knocking at your door that afternoon.
  • This crawler makes requests on the webserver without any delay between requests and worst-case could bring a server down or severely slow it down. You'd likely put some kind of delay between requests so as to not overwhelm the target servers (use time.sleep(SECS))
  • Instead of making the HEAD request and checking the content-type to see if the pending URL is html, you could use a regular expression to test whether the URL ends in '/', '.asp', '.php', '.htm', or '.html' and then just request the page. This would avoid the immediate GET after the HEAD and limit stress on the server (see the sketch just after this list).
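
Here's a rough sketch of that last idea - guessing from the URL's path whether it points at an HTML page so the HEAD request can be skipped (the extension list is just an example):

import re
from urlparse import urlparse

# Treat a URL as "probably HTML" if its path is empty, ends in '/', or has an html-ish extension
HTML_LIKE = re.compile(r'(/|\.asp|\.php|\.html?)$', re.IGNORECASE)

def looks_like_html(url):
 path = urlparse(url).path
 return path == '' or HTML_LIKE.search(path) is not None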


Preventing Duplicate HTML Content


One issue I thought of: it would be a good idea to enhance the crawler so it doesn't keep duplicate HTML content if your end goal is to examine the actual page content behind each link. The crawler is written so it definitely doesn't store duplicate links, but that doesn't guarantee that the HTML content is unique. For example, on my site it finds 382 total unique links even though I only have 15 posts. Where are all these extra links coming from?

It's the widgets that I'm using in my template. For example, here are some 'widget' links the crawler found:

http://berrytutorials.blogspot.com/search/label/blackberry?widgetType=BlogArchive&widgetId=BlogArchive1&action=toggle&dir=open&toggle=YEARLY-1230796800000&toggleopen=MONTHLY-1262332800000

http://berrytutorials.blogspot.com/2009/11/create-custom-listfield-change.html?widgetType=BlogArchive&widgetId=BlogArchive1&action=toggle&dir=open&toggle=YEARLY-1262332800000&toggleopen=MONTHLY-1257058800000

Although these are unique links, they point to content that was already catalogued by the crawler when it found the main page (e.g., the second link points to the article 'create-custom-listfield-change.html', and the crawler also holds the link to the actual page, so the HTML content is duplicated).

To prevent this, I'd think that after the crawl is complete you'd run a 'normalization' pass where the found links are checked for duplicate content. Since I've stored the HTML for each link, you wouldn't have to have the spider reconnect to the crawled website; you'd just check the HTML. I haven't thought this through completely enough to suggest an algorithm that would be fast, so I'll leave that up to you.
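
One simple starting point would be to hash the HTML you've already stored and group links whose hashes match; a sketch, assuming the visitedUrls dictionary from the Spider class:

import hashlib

# Group visited URLs by an MD5 of their HTML so exact duplicates can be spotted
def find_duplicate_content(visitedUrls):
 byHash = {}
 for link, html in visitedUrls.items():
  digest = hashlib.md5(html).hexdigest()
  byHash.setdefault(digest, []).append(link)
 # Keep only the groups where more than one link produced the same HTML
 return [links for links in byHash.values() if len(links) > 1]

This only catches byte-for-byte duplicates, so pages that differ by a timestamp or widget state would still need smarter normalization.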


Wrap up and Roll


Although there are tons of open-source crawlers on the web, I think that writing one yourself will definitely help you understand the complexities of link and content parsing and will help you actually visualize the explosion of links that are out there. For example, I set this crawler loose on www.yahoo.com and within a couple of minutes it had over 2000 links in the queue.  I was also honestly surprised to find so many links just in my simple blog.  It was a great learning experience and I hope this article helped you along the path to writing a more advanced crawler.

In case you're interested, there's also an article about distributed crawlers (state of the art from 2003 :) here


As usual, let me know if you have questions/concerns in the comments!


While working on a recent Django project I had the need to create thumbnails from images residing on a remote server to store within one of my Django app models.  The wrinkle was that I wanted to programmatically make the thumbnail "on the fly" before storage since I didn't want to waste space storing the original, much larger, image.  So if you are in a similar situation, where do you start?

Considering the vast array of libraries for Python, I hunted down the most referenced one, the Python Imaging Library (PIL), and installed it. For purposes of this tutorial, I'll presuppose that you've already installed PIL on your platform and have some experience manipulating images with it.  I'm working on OS X (Snow Leopard) and had no issues getting PIL working, but if you do, follow the directions on this blog post. (I can't help if you're on Windows.) If you can run 'from PIL import Image' from the Python prompt then you've installed it properly, Django shouldn't complain, and the code below should work.


The Thumbnail Model

Once you've installed PIL your hardest struggles are over. For our simplified example we'll create a custom image model with 'url' and 'thumb' attributes. The 'url' attribute will store the url of the image in the event you need to reference the original picture and 'thumb' will be an ImageField that stores the location of the thumbnail we create. We'll define a function called 'create_thumb' that will perform the image manipulation.   Here's what the Model looks like, including the required imports:

import Image
import os
import urllib
from django.core.files import File
.....
....
..

class Thumbnail(models.Model):
 url = models.CharField(max_length=255, unique=True)
         
 # Set the upload_to parameter to the directory where you'll store the 
 # thumbs
 thumb = models.ImageField(upload_to='thumbs', null=True)
 
 """ Pulls image, converts it to thumbnail, then 
   saves in thumbs directory of Django install """
 def create_thumb(self):
  
  if self.url and not self.thumb:
   
   image = urllib.urlretrieve(self.url)
   
   # Create the thumbnail of dimension size
   size=128,128
   t_img = Image.open(image[0])
   t_img.thumbnail(size) 
 
   # Get the directory name where the temp image was stored
   # by urlretrieve
   dir_name = os.path.dirname(image[0])

   # Get the image name from the url
   img_name = os.path.basename(self.url)

    # Save the thumbnail in the same temp directory 
    # where urlretrieve got the full-sized image, 
    # using the same file extension as the original
    t_img.save(os.path.join(dir_name, "thumb" + img_name))
 
    # Save the thumbnail in the media directory, prepending "thumb" to the name
    self.thumb.save("thumb" + img_name, File(open(os.path.join(dir_name, "thumb" + img_name))))
 


What is that Code Doing? Can it be Improved?

Although the code is commented pretty well, I'll give a bit more explanation. In Django, an ImageField doesn't actually store the image in your database. Instead, the file is stored in a directory under the path your MEDIA_ROOT setting points to in the settings.py file, so make sure that it is appropriately configured and then create the 'thumbs' subdirectory within that directory. That's what the 'upload_to' parameter on the ImageField is used for.
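
For example, the relevant settings might look something like this (the path here is purely illustrative - point it at your own media directory):

# settings.py (illustrative values)
MEDIA_ROOT = '/Users/john/Code/Django/testproj/media/'   # thumbnails end up in MEDIA_ROOT/thumbs/
MEDIA_URL = '/media/'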

The 'create_thumb' method should be called when you create an instance of the Thumbnail model. Here's an example of one way you could use it:

a = Thumbnail(url='http://url.to.the.image.you.want.a.thumbnail.of')
a.create_thumb()
a.save()


The 'create_thumb' method takes the url, creates the thumbnail, and saves it in the 'upload_to' directory. Since this is sample code I didn't put in any provision for catching exceptions, such as an improper url or image processing errors - that would be one area I'd suggest you improve upon. Also, the thumbnails are saved under the same name as the image provided by the url, with "thumb" prepended. You might wonder what happens if two urls have the same image name? Well, I'll tell you: the new thumb will overwrite the old, so you will want to add code that creates a unique name for the image.
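
One way to sidestep the name-collision problem is to derive the thumbnail's filename from the URL itself, for example by hashing it (a sketch - the helper name is mine):

import hashlib
import os

# Build a thumbnail filename that is unique per source URL
def unique_thumb_name(url):
 ext = os.path.splitext(url)[1]               # keep the original extension (e.g. '.jpg')
 digest = hashlib.md5(url).hexdigest()[:12]   # short hash of the full URL
 return "thumb_%s%s" % (digest, ext)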

Besides the few caveats mentioned above, the code works as advertised and will make your life that much easier...at least when it comes to creating thumbnails. As usual, I only ask that if this post helped you, please leave a comment.

Django Configuration - Serve Static Media w/Templates

I must say that the Django documentation is all-around really fantastic. In fact, I've never run into a situation where I was totally stumped and their docs haven't saved the day. Regardless, there are always niches where you'll wish there was a slightly better real-world example - in this case, serving static media (stylesheets, javascript, image files, etc.) on the development server. Their doc on this subject, found here, covers the base configuration pretty well; however, I still found that I couldn't get Django to recognize my static media for some reason. After messing with the config for a bit I managed to get it operational, so if you're having the same problem, follow these steps and you'll have it running in no time.

The tutorial assumes basic knowledge of Django and that your project is located at the path (on OS X) '/Users/[USER_NAME]/Code/Django/[PROJECT_NAME]', where USER_NAME is your OS X account and PROJECT_NAME is the top-level directory of your project, likely where you ran the 'django-admin.py startproject [PROJECT_NAME]' command. For example, the path to my project is '/Users/john/Code/Django/testproj'. Obviously your code doesn't have to be on this exact path, but you'll need to make sure you adjust the paths in the code below accordingly.

Note: In the second section below I give two different ways to configure your settings.py file. The first is with the paths hardcoded into the variables, and the second uses Python's built-in os module to create absolute paths to your static files. I kept both here as a demonstration of how it works, but I highly recommend going the absolute-path route since your code will be portable across systems. It should also work on Windows without modifications.

Configure URLconf - Make Static Media View Available for DEBUG Only


First, open the top-level urls.py file and add the following settings.DEBUG code after the urlpatterns that already reside there. The following is a basic example of my urls.py file:

from django.conf.urls.defaults import *
from testproj.base.views import index
from django.conf import settings

urlpatterns = patterns('',
     (r'^$', index),
)
if settings.DEBUG:
     urlpatterns += patterns('',
          (r'^static/(?P<path>.*)$', 'django.views.static.serve', {'document_root': settings.MEDIA_ROOT}),
     )


As explained in Django's documentation, it is recommended that in a production environment the server should provide the static files, not your Django code. Using the settings.DEBUG test will ensure that when you move it to production you'll catch the static files being served by Django since the DEBUG setting will be False in prod.

In the code you're importing the settings module from django.conf. The settings.DEBUG config is ensuring that any requests matching 'static/PATH' are served by the django.views.static.serve view with a context containing 'document_root' set to whatever path is found in settings.MEDIA_ROOT. Next we're going to set the value of that path in the settings.py file.

Configure Django Settings to the Location of Your Media


The first step is to create a directory named 'static' in the Django project folder and two directories within it named 'css' and 'scripts'. In the future you'll place your static files in folders within the main 'static' directory - the way you organize them doesn't make a difference just make sure it's a logical setup and that you adjust your template tags to point to the right folder (refer to the next section).

Option #1: Hardcoded Path


Open your settings.py file and modify the MEDIA_ROOT, MEDIA_URL, and ADMIN_MEDIA_PREFIX settings as follows (remembering to change the path to the location of your own static folder):
MEDIA_ROOT = '/Users/john/Code/Django/testproj/static/'

MEDIA_URL = '/static/'

ADMIN_MEDIA_PREFIX = '/media/'



Now, double-check that you didn't forget to add the beginning and ending '/' on each of the paths you modified as this will confuse Django. Note that while the ADMIN_MEDIA_PREFIX and MEDIA_URL don't necessarily have to be different, it is recommended by Django. If somehow you've wandered in here looking for instructions on how to do this on Windows I believe that the only difference in the entire tutorial is to change MEDIA_ROOT setting to 'C:/path/to/your/static/folder'. Following the rest of this should work on Windows but I didn't have time to validate that.

Option #2: Absolute Path option - Recommended


Instead of hardcoding as mentioned above, I'd recommend going the absolute path route since you can port your code from system to system without rewriting the settings.py file. Remember, either option should work but only use one way or the other.

Instead of hardcoding the paths into settings.py, you'll import the os module and use the os.path.abspath('') method to acquire the working directory for your code dynamically and set it to ROOT_PATH. Then you'll build MEDIA_ROOT with the os.path.join method to attach your static directory (MEDIA_URL and ADMIN_MEDIA_PREFIX are URL prefixes, so they stay the same as before). Refer to the code in settings.py below:

import os

ROOT_PATH = os.path.abspath('')
....
....
MEDIA_ROOT = os.path.join(ROOT_PATH, 'static')

# These two are URL prefixes, not filesystem paths, so they don't change
MEDIA_URL = '/static/'
ADMIN_MEDIA_PREFIX = '/media/'



Notice that when you use os.path.join(ROOT_PATH, 'static') for MEDIA_ROOT you don't put the '/' before and after the static directory as we did in the previous option.

Modify your Template to Point at the Static Directory


The final step is to make sure that you've modified the HTML tags (script, link, etc.) in your template to point at your 'static' directory. For example, here's roughly how the script and link tags are configured for the example project (the stylesheet filename below is just a placeholder):

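<!-- illustrative only: substitute the names of your own files under the static directory -->
<link rel="stylesheet" type="text/css" href="/static/css/styles.css" />
<script type="text/javascript" src="/static/scripts/scripts.js"></script>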
The script src link is set to "/static/scripts/scripts.js" since this is where I've placed my javascript files. Again, don't forget to put the initial '/' at the beginning of your path.

Wrap it Up and Call it a Day


Test it out by running the 'python manage.py runserver' command from your project directory and everything should be working beautifully. If you run into problems it's likely that you've either forgotten or added a '/' in the wrong place on the paths you've configured.

Configure Eclipse with Python and Django on OS X

I've recently been doing some development using the Python web application framework Django. To make my life easier, I configured Eclipse to work with Python and Django; however, I soon discovered that the many disparate resources on the web are either dated or incomplete. Thus, the main goal of this post is to consolidate those resources into one comprehensive and easy-to-follow post that I'll try to keep as succinct as possible while maintaining clear instruction. Also, I'll assume that you already have Eclipse installed and if not, go get it - what are you waiting for? I'm using Eclipse 3.5.1 but I believe these instructions will work for 3.4.x as well with some slight modifications.

Also, these instructions assume the following:

  • For simplicity's sake we're going to use SQLite for the Django database - Python 2.5 and above come with SQLite support so no install is necessary.

  • No previous versions of Pydev modules or Django on your system.

  • We'll be using the built-in version of Python that ships with Leopard (Python 2.5.1) or Snow Leopard (Python 2.6).



Installing Python Support - Pydev plugin


The Pydev plugin from Aptana gives you syntax highlighting, code completion, etc. for Python in the Eclipse environment and in my experience works exceptionally well.

Open Eclipse and perform the following:

1) Help->Install New Software. In the "Work With" form section, input the following site to download the Pydev modules: http://pydev.org/updates.

2) Select the Pydev plugin modules in the checkbox.


3) Click through, accepting the licensing agreements, and wait for the module to install. Restart Eclipse. Open Eclipse->About and click on the Pydev icon - we want the latest version of Pydev, which is 1.5.0.1251989166 at the time of this post. If you have an earlier version, use the automatic update feature of Eclipse to update just the Pydev module (Help->Update Software).

Install Django


4) Download Django 1.1.1, open a Terminal, navigate to the directory where you downloaded Django, and run the following commands:


tar -xvf Django-1.1.1.tar.gz
cd Django-1.1.1
sudo python setup.py install


Yep, it's as easy as that.
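
If you want to double-check that the install worked, a quick test from the Terminal should print the Django version (1.1.1 here):

python -c "import django; print django.get_version()"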

Configure PyDev to use Django


5) After installing Django, we need to ensure that the newly installed directories are visible to the Eclipse Python interpreter. In the toolbar, under Eclipse->Preferences->Pydev, click the Interpreter-Python option and the following window will appear:




6) Click 'New' and populate Interpreter Name with any value you want (I called it 'OS X Python'), and populate the Interpreter Executable with the path to your python executable - the default path on OS X is /usr/bin/python.



7) Eclipse will attempt to locate the libraries for Python and will display a pre-populated list of directories to add to the Eclipse PYTHONPATH - i.e., the modules and binaries necessary for Python to run. The critical part here is to select the directory where the Django files were placed when we installed Django earlier, so make sure to select '/Library/Python/2.x/site-packages', as in the screenshot below. After you click 'OK' you'll be taken back to the preferences screen. Click 'OK' again and Eclipse will reparse the site packages.



Sanity Check


8) As a quick sanity check, create a new Pydev project and then a new Python file within that project space. Test a django import statement to see if code completion is working. It should look something like the screenshot below:




Create a New Django Project


9) Open Terminal and navigate to a directory where you want to store your first Django project. Once there, create a new directory for your project, navigate into that folder, and enter the following command:

django-admin.py startproject PROJECTNAME
It's important to create the top level directory and place the new project files into a separate directory, as we'll see in a few minutes. If you navigate into the directory you created you'll see that Django has created the prepopulated files __init__.py, manage.py, settings.py, and urls.py within that directory. Now we have to import these into Eclipse.

10) In Eclipse, click 'File->New->Pydev Project'. Name the project, then uncheck 'Use Default' since we're using the code that was already created for us in the previous step. Instead, click 'Browse' and navigate to the location of the top-level directory you created previously. Uncheck 'Create default src folder' and click 'Finish'.



In my case, I created a top level directory named Djangoprojects and created the Django files with the django-admin.py command inside a directory named Django.

11) Within Eclipse, right-click the project you've created, click 'Properties->Pydev-PYTHONPATH->Add Source Folder', and select the project folder you created.



Now in the package explorer for your newly created project you'll see that your __init__.py, manage.py, settings.py, and urls.py are showing as a package.

Set up the Django Testserver



12) Now we have to configure Eclipse to start the Django test server when we run our program. Right-click your new project in the Package Explorer->Properties->Debug Configurations (or Run/Debug Settings in newer versions of Eclipse). Click 'New' and select 'Python Run'. Press the new configuration button at the top, then give the new configuration a name (I call mine DjangoTest). Under Project, click 'Browse' and point to our new project, and under Main Module point to the manage.py of our new project. It should look as follows:



Under the Arguments tab, for Program Arguments, type in 'runserver'. Refer to the screen below:



Now right-click your project, then 'Debug As->Python Run', browse to the manage.py file within your project, and click 'Ok'. If all went well, you'll get info in the console indicating that the pydev debugger is running:



Point your browser to http://localhost:8000/ and you should see the following page:



Unfortunately, you'll likely find that you're unable to shut down the server from Eclipse using the square red button near the console. Also, it won't shut down when you close Eclipse so you'll have to get in the habit of shutting down the Python process using Activity Monitor.

Debugging with Django and Python


13) Nearing the end of the process now. We have to point Eclipse to the working directory within our workspace so it can access the database (SQLite in our case). Go back to the Python perspective and right-click our project->Debug As->Debug Configurations (or Run/Debug Settings in newer versions of Eclipse)->Arguments tab. Under 'Working Directory', click 'Other', then 'Workspace', and browse to the inner folder (in my case the Django folder within Djangotest). Bear in mind that if your code resides in a directory outside the Eclipse workspace, you'd have to use the 'File System' button and point to the appropriate directory in your filesystem. Refer below:



The final debug configuration should appear as the one below:



14) Lastly, we need to configure Eclipse to include the Python debug folder 'pysrc' in the PYTHONPATH variable. Select Eclipse in the menu toolbar->Preferences->Pydev->Interpreter-Python->New Folder. Browse to the plugins folder of your Eclipse install, in my case located at '/Applications/eclipse/plugins/org.python.pydev.debug_1.5.0.1251989166/pysrc'. Click Apply and Eclipse will again reconfigure the PYTHONPATH to include the debug extensions. Be careful here: if you updated your Pydev modules as described earlier, there will be a debug folder for the previous version as well, and you want to make sure you're using the latest and greatest. The config should look like the screenshot below:



In order to simplify starting and stopping the debug server, right-click the Eclipse toolbar (within the Eclipse window) where there aren't any buttons and select 'Customize Perspective'. Click 'Command Groups Availability' and check the Pydev Debug switch. This will place the start and stop debug server options directly within the Eclipse toolbar. It should look like this:



Now, when adding breakpoints to your code you'll be able to debug in Eclipse as you're used to with Java. One note: in order to see the debugging perspective within Pydev you'll have to close the Pydev perspective and reopen it so it can reload the modules we've added - otherwise you'll never see the debug option. To test this, place a breakpoint in your Python code and run the debugger. The Python debug perspective will open and you'll be able to step through your code, see running values of variables, etc.

15) Celebrate. You've completed installation of Python and Django support in Eclipse, and this was no easy task. I'd suggest periodically looking for updates to the Pydev extensions as they are regularly improved by the good folks at Aptana. The following is a list of several resources I used to compile this post in the event that you need additional help:

Python and XML - Simple Example to Parse an XML document

A little background before I show you a simple example of pulling node data out of XML using Python. I've used Perl for years and absolutely love it for accomplishing a wide variety of tasks ranging from complex object-oriented solutions to simple text parsing. The power and simplicity of Perl is astounding, but secretly I enjoy programming in it because the MIS folks I work with can't wrap their heads around the arcane syntax - and who would want MIS guys monkeying around with real code? I'm kidding of course, but Perl users must admit that their code is not easily maintainable.

Recently I've had reason to use Python to set up the backend of a Blackberry mobile project and I must say that I really like it. Initially I had trouble simply typing in code as muscle memory forces me to use bracketing for code blocks. You can imagine the frustration having to constantly delete brackets, so I quickly solved this problem by downloading and installing the Pydev Eclipse plugin found here. Armed with code completion, syntax highlighting, and code analysis my development time has been reduced significantly.

Setup


An initial process I needed to automate using Python was simple parsing of XML content, and I couldn't find a quick and simple example of how it's done (hence, I thought I'd help others out with this small tutorial). After some experimentation I determined that importing the minidom module would suitably accomplish what I wanted, and it's extremely lightweight, so we'll use it for the example code. For the purposes of this article I'll use the following sample XML file, courtesy of Microsoft and edited to save some space:




<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
</catalog>

Parsing the XML


What we want to accomplish is to parse out the child nodes of each book and get at the data in each for processing. With Python it's simple - import parse() from minidom and Node from xml.dom, read in the xml file using the parse() method, and then iterate through the childNodes. For simplicity's sake, in the example I'll just print the data for each node to output. Here's the Python code to do just that:


from xml.dom.minidom import parse
from xml.dom import Node

xmlDoc = parse("library.xml")
for node1 in xmlDoc.getElementsByTagName("book"):
    for node2 in node1.childNodes:
        if node2.nodeType == Node.ELEMENT_NODE:
            print node2.childNodes[0].data


As you can see, with just a few lines of code we've read the entire XML file into memory using the DOM, parsed it, pulled out the book elements, and printed all the child nodes of each book to output. The check that the node is an ELEMENT_NODE is critical since we do not want to pull the TEXT_NODEs and iterate through them for this example. If you pull that test out, the code will attempt to get childNodes from a TEXT_NODE and will fail with "IndexError: tuple index out of range" since the text node doesn't have childNodes.

If you need to pull attributes out of an element you can iterate through them using the node's attributes member, which behaves like a dictionary. In this case, if node2 had attributes (and it doesn't in my example) you could grab the map with 'attributes = node2.attributes' and then iterate through it using its 'keys()' method.
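
For instance, a quick sketch of walking the attributes of an element node (assuming node2 actually has some):

# Print each attribute name/value pair on an element node
attributes = node2.attributes
for name in attributes.keys():
 print name, "=", attributes[name].value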

Perl Counterexample


To parse the same XML in Perl you'd have to write the following:
use XML::DOM;

my $file = 'library.xml';
my $parser = XML::DOM::Parser->new();
my $xmldoc = $parser->parsefile($file);
foreach my $book ($xmldoc->getElementsByTagName('book')){
   foreach my $tag ( $book->getChildNodes() ) {
      if($tag->getNodeType == ELEMENT_NODE){
         print $tag->getFirstChild->getNodeValue;		
      }
   }
}



Looks similar and in reality it is mostly identical code but in my opinion it is definitely not as clean as the Python code. Don't get me wrong - I love Perl and am not saying that other languages beat it out - if that were the case I'd have posted a better example to stoke the flames beneath the Python vs Perl zealots. Nonetheless, I must say I'm beginning to enjoy Python quite a bit.


Wrap up


If you came here looking for something other than a simple way to parse XML using Python, such as adding to the ridiculous flame wars that go on between Perl and Python evangelists, look here, here, and here for more on the debate. If you want a pretty good reference for using XML with Python, look here.

Side Note: Monty Python Rules


By the way, I always thought Python was named after the snake. However, according to the source itself, it's really named after Monty Python's Flying Circus, which really strikes a chord with me since I love several of the Monty Python movies, particularly the Holy Grail. Maybe I should change the picture at the top of this post to this:
