Recently I've had reason to use Python to set up the backend of a BlackBerry mobile project, and I must say that I really like it. Initially I had trouble simply typing in code, as muscle memory forces me to use brackets for code blocks. You can imagine the frustration of having to constantly delete them, so I quickly solved this problem by downloading and installing the PyDev Eclipse plugin found here. Armed with code completion, syntax highlighting, and code analysis, I've cut my development time significantly.
Setup
An initial process I needed to automate using Python was simple parsing of XML content, and I couldn't find a quick and simple example of how it's done (hence, I thought I'd help others out with this small tutorial). After some experimentation I determined that the minidom module would suitably accomplish what I wanted to do and is extremely lightweight, so we'll use it for the example code. For the purposes of this article I'll use the following sample XML file, courtesy of Microsoft and edited to save some space:
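<?xml version="1.0"?>
<catalog>
  <book>
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>An in-depth look at creating applications with XML.</description>
  </book>
  <book>
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-12-16</publish_date>
    <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description>
  </book>
</catalog>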
Parsing the XML
What we want to accomplish is to parse out the child nodes of each book and get the data of each one for processing. With Python it's simple: import the DOM pieces you need (the parse() function and the Node class), read in the XML file using parse(), and then iterate through the child nodes. For simplicity's sake, in the example I'll just print the data for each node to output. Here's the Python code to do just that:
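# parse() and Node both come from the minidom module
from xml.dom.minidom import parse, Node
# Read the whole file into a DOM tree
xmlDoc = parse("library.xml")
for node1 in xmlDoc.getElementsByTagName("book"):
    for node2 in node1.childNodes:
        # Only element children carry the data we want
        if node2.nodeType == Node.ELEMENT_NODE:
            print(node2.childNodes[0].data)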
As you can see, with just six lines of code we've read the entire XML file into memory using DOM, parsed it, pulled out the book elements, and printed the text of each book's child elements to output. The check that the node is an ELEMENT_NODE is critical: any whitespace between the elements (line breaks, indentation) is stored as TEXT_NODE children of each book, and we do not want to process those in this example. If you pull that test out, the code will attempt to get childNodes[0] from a TEXT_NODE and will fail with "IndexError: tuple index out of range", since a text node doesn't have any child nodes.
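To see why that test matters, here is a small self-contained sketch in Python 3 syntax. It parses an inline string with parseString() rather than reading library.xml. The XML snippet and its tag names are assumptions made for illustration, and child_texts is a hypothetical helper name, not something minidom provides.

```python
from xml.dom.minidom import parseString, Node

# Assumed sample data: one <book> with two child elements, indented the
# way a hand-edited file usually is.
SAMPLE = """<catalog>
  <book>
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
  </book>
</catalog>"""


def child_texts(book):
    """Return the text of each element child of a <book> node (hypothetical helper)."""
    texts = []
    for child in book.childNodes:
        # Skip the whitespace-only text nodes that sit between the elements.
        if child.nodeType == Node.ELEMENT_NODE and child.firstChild is not None:
            texts.append(child.firstChild.data)
    return texts


doc = parseString(SAMPLE)
book = doc.getElementsByTagName("book")[0]

# The indentation and newlines show up as TEXT_NODE children of <book>.
node_types = [child.nodeType for child in book.childNodes]
text_node_count = node_types.count(Node.TEXT_NODE)
element_count = node_types.count(Node.ELEMENT_NODE)

print(text_node_count, element_count)
print(child_texts(book))

# Without the nodeType test, the first child is a text node with no children.
try:
    book.childNodes[0].childNodes[0]
    unguarded_error = None
except IndexError as exc:
    unguarded_error = exc
print(repr(unguarded_error))
```

Running it shows that the book node has more children than it has elements, which is exactly what trips up the unguarded loop.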
If you need to pull out the attributes from an element, you can iterate through them using the node's attributes member, a dictionary-like object (a NamedNodeMap in minidom). In this case, if node2 had attributes (and it doesn't in my example), you could assign the map to a variable with 'attributes = node2.attributes' and then iterate through it using the 'keys()' method.
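A short sketch of that attribute loop, in Python 3 syntax. Since the nodes in my sample have no attributes, the inline XML below adds hypothetical id and lang attributes purely for illustration.

```python
from xml.dom.minidom import parseString

# Hypothetical snippet: the id and lang attributes are not in the sample file.
SAMPLE = '<book id="bk101" lang="en"><title>Midnight Rain</title></book>'

doc = parseString(SAMPLE)
book = doc.documentElement

# node.attributes is a NamedNodeMap, which behaves much like a dictionary.
attributes = book.attributes
found = {}
for name in attributes.keys():
    # Each entry is an Attr node; .value holds the attribute's string value.
    found[name] = attributes[name].value
    print(name, "=", found[name])
```

If you already know the attribute name, calling book.getAttribute("id") is a more direct route than walking the whole map.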
Perl Counterexample
To parse the same XML in Perl you'd have to write the following:
use XML::DOM;

my $file = 'library.xml';
my $parser = XML::DOM::Parser->new();
my $xmldoc = $parser->parsefile($file);

foreach my $book ($xmldoc->getElementsByTagName('book')) {
    foreach my $tag ($book->getChildNodes()) {
        if ($tag->getNodeType == ELEMENT_NODE) {
            print $tag->getFirstChild->getNodeValue;
        }
    }
}
It looks similar, and in reality it is mostly identical code, but in my opinion it is definitely not as clean as the Python code. Don't get me wrong: I love Perl and am not saying that other languages beat it out. If that were the case, I'd have posted a better example to stoke the flames beneath the Python vs. Perl zealots. Nonetheless, I must say I'm beginning to enjoy Python quite a bit.
Wrap up
If you came here looking for something other than a simple way to parse XML using Python, such as adding to the ridiculous flame wars that go on between Perl and Python evangelists, look here, here, and here for more on the debate. If you want a pretty good reference for using XML with Python, look here.
Side Note: Monty Python Rules
By the way, I always thought Python was named after the snake. However, according to the source itself, it's really named after Monty Python's Flying Circus, which really strikes a chord with me since I love several of the Python movies, particularly Holy Grail. Maybe I should change the picture at the top of this post to match.