I know there are quite a few "Simple Python Crawlers" out on the web for easy download and use. Nonetheless, I felt like I'd add yet another to the mix - Hey, innovation doesn't work without choice, right? Writing a basic web-crawler is pretty simple if you leverage Python's built-in modules that handle the most difficult aspects: opening and managing socket connections to remote servers and parsing the returned HTML.  The Python modules urllib2 and HTMLParser provide you with the high-level interface to these lower level processes.  The crawler I've written for the tutorial leverages these modules, runs from the command-line, and takes the following two arguments:

  • "seed url" - where the crawler will begin its parsing
  • "all" or "local"
    • The "local" flag tells the crawler to parse only the http links that are contained within the "seed url" domain (local).  This means that eventually the parser will stop, because there are a limited number of links within a domain.  Note that if you were to crawl a large domain, like www.microsoft.com, it could take a very long time to complete the crawl. Caveat #1 (IMPORTANT!): This crawler only looks at the base url of fully-qualified http links to stay within the domain. If relative links are within the page (e.g., a href="/") this crawler won't pick those up. You'll have to add that functionality if that's what you're looking for (but it should be fairly easy).
    • The "all" flag tells the crawler to parse every http link it finds within the html, even if they are outside the domain.  Note that this means the spider could take a very, very, very long time to complete its crawl (years?)  I'd suggest running this only if you'd like to see how quickly the number of pending links virtually explodes as the spider crawls.  You'll not want to run it for long though as your machine will likely deplete its memory.
Before we begin, you can get the entire source code here but I'd recommend taking a look at the step-by-step below so you can understand how to customize it to your needs.

Caveat #2: Although I've run the program against a handful of sites and haven't had problems, I've not tested it very thoroughly. This means there could be errors, crashes, or even incorrect link counts. In the coming days I intend to test it more, but if you run into problems let me know in the comments.

Run the Program from the Command-Line


Nothing too complex here: If you'd like to run the crawler to parse only the local domain links on this website you'd give the following command from the command-line:
python spider.py http://berrytutorials.blogspot.com local

Otherwise, if you want to crawl the web starting with my site as the seed url then you'd run the following command:
python spider.py http://berrytutorials.blogspot.com all
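For reference, the command-line handling boils down to validating sys.argv before constructing the Spider. Here's a minimal sketch (written in Python 3 for brevity, while the full source is Python 2; parse_args is my own name, not a function from the crawler source):

```python
import sys

def parse_args(argv):
    """Validate the two arguments: a seed URL and a crawl mode ('local' or 'all')."""
    if len(argv) != 3 or argv[2] not in ('local', 'all'):
        raise SystemExit("Usage: python spider.py <seed-url> local|all")
    return argv[1], argv[2]

# Simulate: python spider.py http://berrytutorials.blogspot.com local
seed, mode = parse_args(['spider.py', 'http://berrytutorials.blogspot.com', 'local'])
print(seed, mode)
```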

The program will give you status updates, printing the number of pending URLs in the queue along with the number of links (URLs) that have been processed, and, when it completes, the total number of links it found.  Along the way, as HTMLParser processes the HTML, you'll likely encounter parsing errors due to malformed tags and other markup that HTMLParser cannot gracefully overlook. The following is what the tail-end of the output looks like:
.....
.....
.....

Crawl Exception: Malformed tag found when parsing HTML
bad end tag: "", at line 1266, column 16

15 Pending URLs are in the queue.
369 URLs have been fully processed.

10 Pending URLs are in the queue.
374 URLs have been fully processed.

5 Pending URLs are in the queue.
379 URLs have been fully processed.

Total number of links: 382
Main-Mini:Desktop john$ 

I understand that there are better HTML parsers out there in Python such as BeautifulSoup that might be able to handle poorly-formed HTML, however I'm a bit more familiar with HTMLParser.

Overall Architecture of the Simple Crawler


The base design of the crawler consists of the following:
  • Spider class: Main class that defines two dictionaries to hold the pending URLs to be processed and the visited URLs that are complete.  The visited URLs dictionary maps the URL to the HTML that was parsed by HTMLParser so you can further process the link content as suits your application.  Also, Spider defines a function called "startcrawling()" which is called to begin the crawl.
  • LinksHTMLParser: HTML parsing class, declared as a local variable within the startcrawling function in Spider.  This class extends the base HTMLParser by overriding the handle_starttag function to only parse out anchor tags.  It also defines a local variable named "links" that holds the processed links as strings so the Spider can access them and perform further processing.

Spider Class Details


The main algorithm is in the Spider class' startcrawling() function and operates as follows (in semi-pseudo-code):
While there are URLs in the pendingUrls dictionary:
     pop a URL from the pendingUrls dictionary to process
     make a HEAD request to the URL and check its content-type
     if the content-type is not 'text/html', continue (skip this iteration of the loop)
     otherwise:
          open the URL and read in the HTML
          add the URL to the list of visited URLs
          for each of the HTTP links found when processing the HTML:
               parse the link to make sure it is syntactically correct
               check that it's HTTP and hasn't already been visited
               if the command-line option is 'local', check the domain of the link:
                    if the domain is not the same, disregard; otherwise add to pendingUrls
               otherwise, if adding all links, just add to pendingUrls

Refer to the following code detailing the Spider class:
import sys
import re
import urllib2
from urllib2 import URLError

# Snow Leopard Fix for threading issues Trace/BPT trap problem
urllib2.install_opener(urllib2.build_opener())
from urlparse import urlparse
import threading
import time
from HTMLParser import HTMLParser


# urllib2.Request issues a GET by default; override get_method so we can
# make the HEAD requests used in startcrawling below
class HeadRequest(urllib2.Request):
 def get_method(self):
  return "HEAD"


"""
Spider takes a starting URL and  visits all links found within each page
until it doesn't find anymore 
"""
class Spider():
 
 def __init__(self,sUrl, crawl):
 
  #Urlparse has the following attributes: scheme, netloc, path, params,query,fragment
  self.startUrl = urlparse(sUrl)
  self.visitedUrls = {} # Map of link -> page HTML
  self.pendingUrls = {sUrl:sUrl} # Map of link->link. Redundant, but used for speed of lookups in hash
  self.startUrlString = sUrl
  self.crawlType = crawl
  self.numBrokenLinks = 0
  self.numTotalLinks = 0
  
 """ Main crawling function that parses the URLs, stores the HTML from each in visitedUrls
   and analyzes the HTML to acquire and process the links within the HTML"""
 def startcrawling(self):
   
  while len(self.pendingUrls) > 0:
   try:
    
    self.printProcessed()
   
    currUrl = self.pendingUrls.popitem()[0]  
    
    
    # Make HEAD request first to see if the type is text/html
    url = urllib2.urlopen(HeadRequest(currUrl))
    conType = url.info()['content-type']
    conTypeVal = conType.split(';')
    
    # Only look at pages that have a content-type of 'text/html'
    if conTypeVal[0] == 'text/html':
 
     url = urllib2.urlopen(currUrl)
     html = url.read()
     
     # Map HTML of the current URL in process in the dictionary to the link
     # for further processing if required
     self.visitedUrls[currUrl] = html
     
     # LinksHTMLParser is extended to take out the a tags only and store 
     htmlparser = LinksHTMLParser()
     htmlparser.feed(html)
     
     # Check each of the a tags found by Parser and store if scheme is http
     # and if it already doesn't exist in the visitedUrls dictionary
     for link in htmlparser.links.keys(): 
      url = urlparse(link)
      
      if url.scheme == 'http' and not self.visitedUrls.has_key(link): 
       if self.crawlType == 'local': 
        if url.netloc == self.startUrl.netloc:
         if not self.pendingUrls.has_key(link):
          self.pendingUrls[link] = link
            
       else: 
        if not self.pendingUrls.has_key(link):    
         self.pendingUrls[link] = link
           

   
   # Don't die on exceptions.  Print and move on
   except URLError:
    print "Crawl Exception: URL parsing error" 
    
   except Exception,details:
    print "Crawl Exception: Malformed tag found when parsing HTML"
    print details
    # Even if there was a problem parsing HTML add the link to the list
    self.visitedUrls[currUrl] = 'None'
    
  if self.crawlType == 'local':
   self.numTotalLinks = len(self.visitedUrls)
 
  print "Total number of links: %d" % self.numTotalLinks
  

You can see the main loop processes links while there are still pending URLs in the queue (while len(self.pendingUrls) > 0). It pops the current URL to process from the pendingUrls dictionary using the popitem() function.

Note that because I'm using dictionaries there is no order to the processing of the links; a random one is popped from the dictionary. An improvement/enhancement/customization might be to use an actual queue (list) and process the links in the order they were added. In my case, I decided to process them in arbitrary order because I didn't think the order mattered in the long run. For visitedUrls I used a dictionary mainly because I wanted quick (O(1)) lookups when processing the HTML down the road.
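If you decide you do want FIFO ordering, a deque plus a set gives you ordered processing while keeping O(1) duplicate checks. A quick sketch (Python 3; the variable names here are mine, not the crawler's):

```python
from collections import deque

pending = deque(['http://example.com/'])   # FIFO queue of URLs to visit
seen = set(pending)                        # O(1) membership check for duplicates

def enqueue(link):
    """Add a link only if it has never been queued before."""
    if link not in seen:
        seen.add(link)
        pending.append(link)

enqueue('http://example.com/about')
enqueue('http://example.com/')        # duplicate, silently ignored
first = pending.popleft()             # URLs come back in insertion order
```

popleft() always returns the oldest queued URL, which turns the crawl into a breadth-first traversal.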

Next, a HEAD request is made to the current URL to check the 'content-type' value in its header. If it's a 'text/html' content type, we process it further. I went this route because I didn't want to process document files (.pdf, .doc, .txt, etc.), images (.jpg, .png, etc.), audio/video, and so on; I only want to look at HTML files. Also, the reason I make the HEAD request before downloading the entire page is mainly so the crawler is more "polite"; i.e., it doesn't eat up server processing time downloading entire pages unless it's really necessary.
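For what it's worth, in Python 3 the same HEAD-then-check logic doesn't need a Request subclass, because urllib.request.Request accepts a method argument. A hedged sketch (is_html and head_content_type are my own helper names, not from the crawler source):

```python
from urllib.request import Request, urlopen

def is_html(content_type):
    """True if a Content-Type header value denotes an HTML page."""
    return content_type.split(';')[0].strip().lower() == 'text/html'

def head_content_type(url):
    """Issue a HEAD request (no body download) and return the Content-Type header."""
    req = Request(url, method='HEAD')
    with urlopen(req) as resp:
        return resp.headers.get('Content-Type', '')

# The header often carries a charset suffix, which is why we split on ';'
print(is_html('text/html; charset=UTF-8'))  # → True
```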

After validating the HEAD request, the program downloads the entire page and feeds it to LinksHTMLParser. The following is the code for LinksHTMLParser:

class LinksHTMLParser(HTMLParser):

 def __init__(self):
  self.links = {}
  self.regex = re.compile('^href$')
  HTMLParser.__init__(self)
  
 
    
 # Pull the a href link values out and add to links list
 def handle_starttag(self,tag,attrs):
  if tag == 'a':
   try:
    # Run through the attributes and values appending 
    # tags to the dictionary (only non duplicate links
    # will be appended)
    for (attribute,value) in attrs:
     match = self.regex.match(attribute)
     if match is not None and not self.links.has_key(value):
      self.links[value] = value
      
     
   except Exception,details:
    print "LinksHTMLParser: " 
    print Exception,details



You can see that I've inherited from HTMLParser and overridden the handle_starttag function so we only look at anchor tags that have an href value (in order to eliminate some tag processing). Then LinksHTMLParser adds each anchor link to an internal dictionary called links that holds the links on that processed page for Spider to further process.
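If you're on Python 3, the same parser looks nearly identical; the module just moved to html.parser and has_key gives way to the in operator. A sketch of the port (LinkExtractor is my name for it):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect unique href values from anchor tags, like LinksHTMLParser does."""
    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs for the tag
        if tag == 'a':
            for attribute, value in attrs:
                if attribute == 'href' and value not in self.links:
                    self.links[value] = value

parser = LinkExtractor()
parser.feed('<p><a href="http://a.example/">A</a> <a href="http://a.example/">dup</a></p>')
print(list(parser.links))  # → ['http://a.example/']
```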

Finally, Spider loops over the links found by LinksHTMLParser. If it's a local (domain-only) crawl it checks the domain of each link to make sure it's the same as the "seed URL"; otherwise it just adds the link as long as it doesn't already exist in the pendingUrls dictionary.


Areas for Crawler Customization and Enhancement


As it is written, the crawler doesn't do much more than get the links, parse them, count them, store each link's HTML in a dictionary, and return a total. Obviously you'd be advised to make it actually do something useful, even something as simple as printing out the links it finds so you can review them. In fact, before posting this I had it doing just that (to standard out) after the main while loop returned (lines 90-91 in the full version):

for link in self.visitedUrls.keys():
 print link

You might customize that to write to a file instead of STDOUT so it could be further processed by external scripts.
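That customization is only a few lines. A sketch (visited here stands in for the Spider's visitedUrls dictionary, and found_links.txt is an arbitrary name of my choosing):

```python
# Stand-in for self.visitedUrls after a crawl completes
visited = {'http://example.com/': '<html>...</html>',
           'http://example.com/about': '<html>...</html>'}

# One URL per line so external scripts (grep, awk, sort, etc.) can consume it
with open('found_links.txt', 'w') as out:
    for link in sorted(visited):
        out.write(link + '\n')
```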

Here are some other enhancements I'd suggest:
  • Use the crawler to parse the HTML content you've stored for each link in the visitedUrls dictionary. Say you're looking for some particular content on a site; you'd add a function that processes the HTML after the startcrawling function is complete, using another extended version of LinksHTMLParser to do some other scraping.
  • Limit the "depth" that the crawler runs when it's an "all" search - e.g., have a variable from the command line limit the number of passes the crawler makes through the found links so you can get the "all" version to stop.
  • Although this crawler is semi-polite because it requests the HEAD before the whole page, you'd really want to download the robots.txt file from the seed domain (and from each outside domain that the crawler accesses if you're hitting all domains) to ensure crawlers are allowed. You don't want to accidentally access some NSA website, scrape all the content, then have agents knocking at your door that afternoon.
  • This crawler makes requests on the webserver without any delay between requests and worst-case could bring a server down or severely slow it down. You'd likely put some kind of delay between requests so as to not overwhelm the target servers (use time.sleep(SECS))
  • Instead of making the HEAD request and checking the content-type to see if the pending URL is html, you could use a regular expression to test whether the URL ends in '/', '.asp', '.php', '.htm', or '.html' and then just request the page. This would avoid the immediate GET after the HEAD and limit stress on the server.
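On the robots.txt point, the standard library already does the parsing for you (robotparser in Python 2, urllib.robotparser in Python 3). A sketch with the file content inlined so it runs offline; in a real crawler you'd fetch the domain's /robots.txt first:

```python
from urllib.robotparser import RobotFileParser

# Inlined robots.txt content for the sketch; normally fetched from the server
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch('*', 'http://example.com/public/page.html'))   # → True
print(rp.can_fetch('*', 'http://example.com/private/page.html'))  # → False
```

A polite crawler would call can_fetch() on every URL before requesting it and skip the ones that are disallowed.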


Preventing Duplicate HTML Content


One issue I thought of: it would be a good idea to enhance the crawler so it doesn't store duplicate HTML content if your end goal is to examine the actual page details for each link. The crawler is written so it definitely doesn't store duplicate links, but that doesn't guarantee that the HTML content is unique. For example, on my site it finds 382 total unique links even though I only have 15 posts. Where are all these extra links coming from?

It's the widgets I'm using in my template. For example, here are some 'widget' links the crawler found:

http://berrytutorials.blogspot.com/search/label/blackberry?widgetType=BlogArchive&widgetId=BlogArchive1&action=toggle&dir=open&toggle=YEARLY-1230796800000&toggleopen=MONTHLY-1262332800000

http://berrytutorials.blogspot.com/2009/11/create-custom-listfield-change.html?widgetType=BlogArchive&widgetId=BlogArchive1&action=toggle&dir=open&toggle=YEARLY-1262332800000&toggleopen=MONTHLY-1257058800000

Although these are unique links, they point to content that was already catalogued by the crawler when it found the main page (e.g., the second link points to the article 'create-custom-listfield-change.html', and the crawler also holds the link to the actual page - the HTML content is duplicated).

To prevent this, I'd think after the crawler is complete you'd have a 'normalization' process where the found links are checked for duplicate content. Since I've stored the HTML for each link you wouldn't have to have the spider reconnect to the crawled website, just check the HTML. I haven't thought this through completely to suggest an algorithm that would be fast though so I'll leave that up to you.
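Since the HTML is already in memory, one fast normalization pass is to hash each stored page body and keep only the first URL seen per digest - roughly O(n) over the crawl results. This is my sketch, not code from the crawler:

```python
import hashlib

# Stand-in for visitedUrls: two URLs point at identical HTML
pages = {
    'http://example.com/post.html': '<html>same body</html>',
    'http://example.com/post.html?widgetId=1': '<html>same body</html>',
    'http://example.com/other.html': '<html>different</html>',
}

# Map a digest of each page's HTML to the first URL seen with that content;
# later URLs with an identical digest are duplicates
unique, duplicates = {}, []
for url, html in pages.items():
    digest = hashlib.sha1(html.encode('utf-8')).hexdigest()
    if digest in unique:
        duplicates.append(url)
    else:
        unique[digest] = url

print(len(unique), len(duplicates))  # → 2 1
```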


Wrap up and Roll


Although there are tons of open-source crawlers on the web I think that writing one yourself will definitely help you understand the complexities of link and content parsing and will help you actually visualize the explosion of links that are out there. For example, I set this crawler on www.yahoo.com and within a couple minutes it was up to over 2000 links that were in the queue.  I was honestly surprised to find so many links just in my simple blog.  It was a great learning experience and I hope this article helped you along the path to writing a more advanced crawler.

In case you're interested, there's an article about distributed crawlers (state of the art from 2003 :) here


As usual, let me know if you have questions/concerns in the comments!


While working on a recent Django project I had the need to create thumbnails from images residing on a remote server to store within one of my Django app models.  The wrinkle was that I wanted to programmatically make the thumbnail "on the fly" before storage since I didn't want to waste space storing the original, much larger, image.  So if you are in a similar situation, where do you start?

Considering the vast array of libraries for Python, I hunted down the most referenced one, the Python Imaging Library (PIL), and installed it. For purposes of this tutorial, I'll presuppose that you've already installed PIL on your platform and have some experience manipulating images with it.  I'm working on OS X (Snow Leopard) and had no issues getting PIL working, but if you do, follow the directions on this blog post. (I can't help if you're on Windows.) If you can run 'from PIL import Image' from the python prompt then you've installed it properly, Django shouldn't complain, and the code below should work.


The Thumbnail Model

Once you've installed PIL your hardest struggles are over. For our simplified example we'll create a custom image model with 'url' and 'thumb' attributes. The 'url' attribute will store the url of the image in the event you need to reference the original picture and 'thumb' will be an ImageField that stores the location of the thumbnail we create. We'll define a function called 'create_thumb' that will perform the image manipulation.   Here's what the Model looks like, including the required imports:

import Image
import os
import urllib
from django.core.files import File
.....
....
..

class Thumbnail(models.Model):
 url = models.CharField(max_length=255, unique=True)

 # Set the upload_to parameter to the directory where you'll store the
 # thumbs
 thumb = models.ImageField(upload_to='thumbs', null=True)
 
 """ Pulls image, converts it to thumbnail, then 
   saves in thumbs directory of Django install """
 def create_thumb(self):
  
  if self.url and not self.thumb:
   
   image = urllib.urlretrieve(self.url)
   
   # Create the thumbnail of dimension size
   size=128,128
   t_img = Image.open(image[0])
   t_img.thumbnail(size) 
 
   # Get the directory name where the temp image was stored
   # by urlretrieve
   dir_name = os.path.dirname(image[0])

   # Get the image name from the url
   img_name = os.path.basename(self.url)

   # Save the thumbnail in the same temp directory 
   # where urlretrieve got the full-sized image, 
    # using the same file extension in os.path.basename()
   t_img.save(os.path.join(dir_name, "thumb" + img_name))
 
   # Save the thumbnail in the media directory, prepend thumb  
    self.thumb.save("thumb" + os.path.basename(self.url), File(open(os.path.join(dir_name, "thumb" + img_name))))
 


What is that Code Doing? Can it be Improved?

Although the code is commented pretty well, I'll give a bit more explanation. In Django, the ImageField doesn't actually store the image in your database. Instead, the image is stored in a directory located at the path your 'MEDIA_ROOT' setting points to in the settings.py file. So make sure that this is appropriately configured, then create the 'thumbs' subdirectory within that directory. That's what the 'upload_to' parameter is used for in the ImageField type.

The 'create_thumb' method should be called when you create an instance of the Thumbnail model. Here's an example of one way you could use it:

a = Thumbnail(url='http://url.to.the.image.you.want.a.thumbnail.of')
a.create_thumb()
a.save()


The 'create_thumb' method takes the url, creates the thumbnail, and saves it in the 'upload_to' directory. Since this is sample code I didn't add any provision for catching exceptions, such as an improper url being provided or image-processing errors - that would be one area I'd suggest you improve upon. Also, the thumbnails are saved under the same name as the image in the url, with "thumb" prepended. You might wonder what happens if two urls have the same image name? Well, I'll tell you: the new thumb will overwrite the old, so you will want to add code that creates a unique name for the image.
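One simple fix for the name-collision problem (a sketch; unique_thumb_name is my own helper, not part of the tutorial code) is to fold a short hash of the full URL into the filename, since the full URL is unique even when the basename isn't:

```python
import hashlib
import os

def unique_thumb_name(url):
    """Derive a collision-resistant thumbnail filename from the source URL.

    Hashing the full URL means two different URLs that happen to end in the
    same image name (e.g. .../a/logo.png and .../b/logo.png) no longer clash.
    """
    digest = hashlib.md5(url.encode('utf-8')).hexdigest()[:8]
    base = os.path.basename(url)
    return "thumb_%s_%s" % (digest, base)

print(unique_thumb_name('http://site-a.example/images/logo.png'))
```

You'd then use unique_thumb_name(self.url) in place of the "thumb" + img_name concatenation inside create_thumb.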

Besides the few caveats mentioned above, the code works as advertised and will make your life that much easier...at least when it comes to creating thumbnails. As usual, I only ask that if this post helped you, please leave a comment.

Django Configuration - Serve Static Media w/Templates

I must say that Django documentation is all-around really fantastic. In fact, I've never run into a situation where I was totally stumped and their docs haven't saved the day. Regardless, there are always niches where you'll wish there was a slightly better real-world example - in this case regarding serving static media (stylesheets, javascript, image files, etc.) on the development server. Their doc on this subject, found here, covers the base configuration pretty well, however I still found that I couldn't get Django to recognize my static media for some reason. After messing with the config for a bit I managed to get it operational, so if you're having the same problem, follow these steps and you'll have it running in no time.

The tutorial assumes basic knowledge of Django and that your project is located at the path (on OS X): '/Users/[USER_NAME]>/Code/Django/[PROJECT_NAME]' where USER_NAME is your OS X account and PROJECT_NAME is the top level directory of your project, likely where you ran the 'django-admin.py startproject [PROJECT_NAME]' command. For example, the path to my project is '/Users/john/Code/Django/testproj'. Obviously your code doesn't have to be on this exact path but you'll need to make sure you adjust the paths in the code below accordingly.

Note : In section two below I give two different ways to configure your settings.py file; The first way is with the paths hardcoded into the variables and the second is using python's built-in os module to create absolute paths to your static files. I kept both in here as a demonstration of how it works but I highly recommend going the absolute path route since your code will be portable across systems. Also, it should work on Windows without mods as well.

Configure URLconf - Make Static Media View Available for DEBUG Only


First, open the top-level urls.py file and add the following settings.DEBUG code after the urlpatterns that already reside there. The following is a basic example of my urls.py file:

from django.conf.urls.defaults import *
from testproj.base.views import index
from django.conf import settings

urlpatterns = patterns('',
     (r'^$', index),
)
if settings.DEBUG:
     urlpatterns += patterns('',
          (r'^static/(?P<path>.*)$', 'django.views.static.serve', {'document_root': settings.MEDIA_ROOT}),
     )


As explained in Django's documentation, it is recommended that in a production environment the server should provide the static files, not your Django code. Using the settings.DEBUG test will ensure that when you move it to production you'll catch the static files being served by Django since the DEBUG setting will be False in prod.

In the code you're importing the settings module from django.conf. The settings.DEBUG config is ensuring that any requests matching 'static/PATH' are served by the django.views.static.serve view with a context containing 'document_root' set to whatever path is found in settings.MEDIA_ROOT. Next we're going to set the value of that path in the settings.py file.

Configure Django Settings to the Location of Your Media


The first step is to create a directory named 'static' in the Django project folder and two directories within it named 'css' and 'scripts'. In the future you'll place your static files in folders within the main 'static' directory - the way you organize them doesn't make a difference just make sure it's a logical setup and that you adjust your template tags to point to the right folder (refer to the next section).

Option #1: Hardcoded Path


Open your settings.py file and modify the MEDIA_ROOT, MEDIA_URL, and ADMIN_MEDIA_PREFIX settings as follows (remembering to change the path to the location of your own static folder):
MEDIA_ROOT = '/Users/john/Code/Django/testproj/static/'

MEDIA_URL = '/static/'

ADMIN_MEDIA_PREFIX = '/media/'



Now, double-check that you didn't forget the beginning and ending '/' on each of the paths you modified, as this will confuse Django. Note that while ADMIN_MEDIA_PREFIX and MEDIA_URL don't necessarily have to be different, it is recommended by Django. If somehow you've wandered in here looking for instructions on how to do this on Windows, I believe the only difference in the entire tutorial is to change the MEDIA_ROOT setting to 'C:/path/to/your/static/folder'. Following the rest of this should work on Windows but I didn't have time to validate that.

Option #2: Absolute Path option - Recommended


Instead of hardcoding as mentioned above, I'd recommend going the absolute path route since you can port your code from system to system without rewriting the settings.py file. Remember, either option should work but only use one way or the other.

Instead of hardcoding the paths into settings.py you'll import the os module and use the os.path.abspath('') method to acquire the working directory for your code dynamically and set it to ROOT_PATH. Then you'll adjust the same settings as before using the os.path.join method to attach your static directory. Refer to the code in settings.py below:

import os

ROOT_PATH = os.path.abspath('')
....
....
MEDIA_ROOT = os.path.join(ROOT_PATH, 'static')
MEDIA_URL = '/static/'
ADMIN_MEDIA_PREFIX = '/media/'



Notice that when you use os.path.join(ROOT_PATH, 'static') for MEDIA_ROOT you won't put the '/' before and after the static directory as we did in the previous option. Note also that MEDIA_URL and ADMIN_MEDIA_PREFIX stay as hardcoded URL prefixes: they are paths the browser requests, not locations on disk, so joining them onto ROOT_PATH would produce broken links.
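The slash rule is worth a quick demonstration, because os.path.join silently discards everything before a component that starts with '/':

```python
import os.path

# Relative second component: join inserts the separator for you
print(os.path.join('/Users/john/Code/Django/testproj', 'static'))
# → /Users/john/Code/Django/testproj/static

# Absolute second component: everything before it is thrown away
print(os.path.join('/Users/john/Code/Django/testproj', '/static/'))
# → /static/
```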

Modify your Template to Point at the Static Directory


The final step is to make sure that you've modified the HTML tags (script, link, etc.) in your template to point at your 'static' directory. For example, here's how my script tag is configured for the example project:

<script type="text/javascript" src="/static/scripts/scripts.js"></script>

The script src link is set to "/static/scripts/scripts.js" since this is where I've placed my javascript files. Again, don't forget to put the initial '/' at the beginning of your path.

Wrap it Up and Call it a Day


Test it out by running the 'python manage.py runserver' command from your project directory and everything should be working beautifully. If you run into problems it's likely that you've either forgotten or added a '/' in the wrong place on the paths you've configured.

Blackberry Design Guidelines - Tips on Designing the User Interface


I commonly hear from designers that engineers don't have the right-brain skills for interface design. They say that the essence of good design is functional simplicity and since engineers think end users are only interested in cool features (because that's what the engineers themselves are interested in) the result of an engineer designing an interface will be a bloated, feature-laden mess. One of the most common examples they will bring up is the dreaded Sony remote control, take a look at the sample picture of a remote from one of Sony's A/V receivers.

When you look at the remote on the left can you blame the designers for the stereotype? Yes, I'm pretty sure this remote was designed by engineers, here's why:

  1. There are 65 buttons total on a remote that measures no more than 3"x10"

  2. 20 tiny buttons in a grid at the top of the remote, each with multiple functions spelled out in tiny type. How can the user distinguish the functions easily?


  3. Most of the buttons are the same dimension making it difficult to find the correct one in a dimly lit room


  4. The four buttons at the bottom appear to perform four functions each for a total of 16 functions. Feature overload!

If you're wondering what the smaller control on the right is for, get this: it's a remote that also comes with the Sony A/V receiver that's meant to simplify the primary remote. It's almost like Sony realized too late that the primary remote was terrible and the best solution was to throw in another remote. Don't get me wrong: I love Sony products but I've always felt that the software and hardware interface of their devices is their biggest flaw.

Software Engineers can do Better


I know the idea is somewhat clichéd by now, but pick any one of Apple's products - iPhone, iPod, OS X, etc. - examine the user interface, and you'll see what I mean: complex functionality disguised behind a simple, easy-to-use interface. I know it's not software, but even their Magic Mouse is a wonder of human interface principles.

By all accounts the iPod saved Apple but how did that happen considering Creative and other companies dominated the marketplace for MP3 players when the iPod was released? Originally the iPod only worked on Macs but when Apple released iTunes software for Windows, sales skyrocketed and the rest is history. Creative's players were rated higher, had greater capacity, and longer battery life but iPods eventually outsold them by a large factor.

If we ignore Steve Jobs's "reality distortion field", I'd wager that one major reason was the user interface of the iPod (including the mechanics of the wheel) combined with simplified software for the desktop (iTunes). Creative and the other companies designed their interfaces with so many features that they forgot to focus on the most important one: how does the end user navigate thousands of songs quickly and efficiently? Apple's engineers understood that the screen is tiny and that overloading it with too much extraneous "stuff" will overwhelm the user. Users are on the go when they break out their player, therefore menus shouldn't be too complex or deep for quick access.

If you can't stand the Apple example (I know Apple can be polarizing) then take a look at the Flip series of handheld video cameras. This startup company had the idea of making an inexpensive, tiny camcorder that was plain easy to use for people wanting to upload to YouTube, and they succeeded - recently Cisco bought the company for $590 million, and Sony, Canon, and other manufacturers are rushing to catch up. Why is the Flip so popular? Again, I think it's mainly due to the simple interface on the camera and the easy desktop software that came with it. Take a look at the Flip hardware interface versus a typical Canon from a couple years ago:



I think you're getting the idea that I feel simple user interfaces can be really important to successful software and it applies to Blackberry app design as well. As final affirmation, take a look at how focusing on design has served Apple's stock price over the last 10 years:




Develop like an Engineer, Think Like a User


Short of spending time taking a class on Human-Computer Interaction Principles or reading long (boring?) textbooks (some of my favorites are linked at the bottom of this article), there are a few simple suggestions I have for you to keep in the back of your mind while you're programming the interface to your Blackberry app. I'll use images from Blackberry programs found on the App store to illustrate my points because I think it helps to have examples. I've gone to lengths to disguise the name of the apps in the hopes of not offending anyone who developed them - apologies in advance if you recognize your app on here.

As preface to the discussion: in my opinion, it's easier to design an interface with lots of features - after all, you don't have to think as much, just throw everything in. Simple interfaces require a great deal of contemplation before coding, since you're making tough decisions about what to get rid of, and that isn't always easy. Think of simple interface design as forging a sword - it requires a lot of labor to grind out all the imperfections and to fold the steel repeatedly, and most importantly, it must be razor-sharp. In the end, when the swordsman pulls out his weapon, he needs the confidence that it will serve its sole purpose to perfection (yeah, kill people) and that it's not going to get in his way (no unnecessary features).

Without further ado, my list of user interface design recommendations:

#1 - Don't put too many options in menus


Blackberry menus can be customized on a per-screen basis, so it doesn't make sense to repeat the main menu options on every screen. Make contextual menus do a limited number of things; otherwise the menu will be overwhelming. Here's an example of menus done wrong on the Blackberry:



The designer chose to put every option under the sun in the menu, including choices that extend "below the fold", i.e., where the user has to scroll to see them. Additionally, they chose to include five different search types, and the menu takes up the entire screen because the verbiage is too long. One way to improve this menu would be to have a single "Search" option that opens a search screen with the different search types laid out as checkboxes or some other control.
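To make that consolidation concrete, here's a minimal sketch of a per-screen contextual menu using the standard BlackBerry Java API (`net.rim` classes). `ResultsScreen` and `SearchScreen` are hypothetical names of my own; `SearchScreen` would be the dedicated screen laying out the search types as checkboxes. Treat this as a sketch, not a drop-in implementation:

```java
import net.rim.device.api.ui.MenuItem;
import net.rim.device.api.ui.UiApplication;
import net.rim.device.api.ui.container.MainScreen;

// A screen whose contextual menu offers one "Search" item instead of
// five separate search-type entries crowding the menu.
public class ResultsScreen extends MainScreen {
    public ResultsScreen() {
        // MenuItem(label, ordinal, priority): a lower ordinal sorts
        // the item higher in the menu.
        addMenuItem(new MenuItem("Search", 100, 10) {
            public void run() {
                // SearchScreen is hypothetical: a dedicated screen that
                // presents the five search types as checkboxes.
                UiApplication.getUiApplication()
                        .pushScreen(new SearchScreen());
            }
        });
    }
}
```

Because each screen builds its own menu this way, the options a user sees are always scoped to what that screen can actually do.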

#2 - Pay Careful Attention to Fonts and Readability


There's an impulse to differentiate your app from others by using uncommon fonts (think Comic Sans). Don't do it! (Unless you're working on a game and it fits the theme.) Remember, you want your user to be able to easily read the content of your app, and most of the time a common font is the best choice. Another consideration is ensuring that fonts within graphics are readable. For example:



The main font for the contact's content is fine in style, but it's bold and definitely larger than the menu options at the top. It's distracting, and the app would be well served by reducing the font size. Additionally, take a look at the Contacts button pressed at the top left. The gradient used there is cutting off the top of the words, making them hard to read. I'd suggest changing the font color to a darker shade when the button is active.
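On the API side, you don't have to live with an oversized bold default: a field's font can be set explicitly by deriving a variant of the system font. A sketch using the standard `net.rim` `Font` class (`ContactScreen` and the contact text are hypothetical placeholders):

```java
import net.rim.device.api.ui.Font;
import net.rim.device.api.ui.component.LabelField;
import net.rim.device.api.ui.container.MainScreen;

public class ContactScreen extends MainScreen {
    public ContactScreen() {
        LabelField contact = new LabelField("Jane Smith, 555-0100");
        // Derive a plain (non-bold) font slightly shorter than the
        // system default, so body text doesn't compete with the
        // navigation elements at the top of the screen.
        contact.setFont(Font.getDefault().derive(
                Font.PLAIN, Font.getDefault().getHeight() - 2));
        add(contact);
    }
}
```

Using a derived system font rather than a hard-coded face also keeps your app consistent with the user's theme settings.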

#3 - Space is Your Friend


Sometimes you have to just let the content of your app "breathe". In other words, since the screen real estate is so tiny on most Blackberry devices, you might have the tendency to try to fill up the space with a ton of data - don't. Sometimes a graphic, although more difficult to add and program, can do the trick. Remember, your user is going to be quickly glancing at the screen while on the go, and, as they say, a picture is worth a thousand words. For example, compare these two stock quote apps:



Yes, they both give similar information, yet in the first example the user can glance at the image of the current stock price and immediately know the day's trend and whether they should be worried. Also, going back to my point on fonts, in the first app the font size chosen for the latest quote is larger than the rest, giving it higher priority. Your eyes are drawn to the image and that quote, while in the second app you'd have to scan the list to find where the quote resides. Which one would you choose if you were just comparing screenshots?

Also, notice how there's space around each of the elements in the first app. It gives a certain "lightness" to the app - what people these days call a "clean look". The second app has a lot of whitespace too (which I'd consider wasted real estate), but it just doesn't have a finished look.

#4 - Keep it Clear Part 1 - Make Navigation Obvious


Users don't have the patience to spend more than a few minutes figuring out how to work your app. They want to quickly navigate to the features they want, and you'll want to make it easy for them. Large icons, simple menus with few options, and minimal background noise are all important to the success of your app. If you find that you have too many navigation options, consider paring back the scope of your app, e.g., removing unnecessary features. Take a look at these:




While I certainly appreciate that this app has to have a bunch of features (it's the front end for a complex web application), I'm just not sure that placing 15 tiny icons at the top of the screen is a wise choice. Even after using the app for a long time, the end user will likely always have trouble figuring out which icon stands for which function. Yes, it is nifty that when you scroll across an icon the name of its function appears next to it, but think of all the time the user will waste scrolling back and forth until they find the function they were looking for. I'd wager that 5-10 of those options could be moved into a menu or handled some other way, leaving only the most commonly used features on the main nav.


#5 - Keep it Clear Part 2 - Eliminate Extraneous Info


Yes, keeping it clear is important. You never want extraneous information on screen, because the user might be glancing at your app while driving, in a meeting where they can only check it quickly, and so on. For this reason you want to keep it as simple and clear as possible, and to do so you'll want to ensure all the features of your app are entirely necessary. Guard against something Apple calls a "feature cascade": "If you are developing a simple application, it can be very tempting to add features that aren’t wholly relevant to the original intent of the program. This feature cascade can lead to a bloated interface that is slow and difficult to use because of its complexity. Try to stick to the original intent of your program and include only features that are relevant to the main workflow." (Human Interface Guidelines)

While the images below aren't examples of feature cascade, they demonstrate that eliminating extraneous information can improve an app:



I modified the image on the right to cut out the unnecessary, self-congratulatory images at the bottom. Before, the focus of the screen was on the colorful icons, but after the modification you're not distracted. Don't you think it looks and feels much better?

#6 - Don't Forget Usability Testing


When you're spending all your time focused on developing your app, you quickly lose your objectivity about its functionality. You know your app inside and out, and you'll begin to assume the end user will as well. This is where usability testing comes into play. It's quite simple: just load your app on a phone and pass it around to your friends, your family, anyone who hasn't seen it before, and ask them to perform certain tasks. Watch them carefully as they tackle the tasks you throw at them - watch the phone, but also observe the user's face for any frustration or quizzical looks. If they look frustrated, ask them why they're having trouble, and take notes.

When you've had 4-5 people test the app, carefully consider the feedback they've given. You can safely ignore most comments about color, since that's a subjective topic. Instead, focus on areas where navigation was a problem or where finding a certain function took too long. Apply the principle of Occam's Razor to simplify the app if these kinds of problems arise. In other words, don't add features thinking they will simplify things - try removing features instead. Trust me, it will be a whole lot easier!


To be Continued...


Design is an enormous topic, so I'll continue discussing it in future posts. The main takeaway here is that end users have a limited amount of patience and goodwill toward your application. If they can't figure it out in the first five minutes, or it's not pleasing to the eye, they won't recommend it to their friends and you'll probably get bad reviews in the app store. As a secondary takeaway, here are the key bullet points from the Blackberry UI Guidelines:


  • Use or extend existing UI components where possible so that your application can inherit the default behavior of the component.

  • Follow the standard navigation model as closely as possible so that a particular user action produces a consistent result across applications. For example, allow users to open the context menu in all applications by clicking the trackball or trackpad.

  • Support and extend user tasks in useful ways. For example, when users download an application, the application should open automatically. The application should also be saved in the Applications folder.



When you design your application, also consider the following guidelines:


  • Stay focused on users' immediate task. Display only the information that users need at any one moment.

  • Verify that the actions that are available in the menu are relevant to users' current context.

  • Minimize the number of times that users need to click the trackwheel, trackball, trackpad, or touch screen to complete a task.

  • Design your UI to allow users to change their mind and undo commands. Users sometimes click the wrong menu item or button accidentally. For example, use an alert dialog box to notify users of a critical action such as deleting data from their BlackBerry devices.

  • Display information in a way that makes effective use of the small screen.



Suggested Reading and Links


