Colourblind

Welcome to Colourblind.

This is the personal web space of Tom Milsom. As much as possible everything is free (as in speech and as in beer).



A Simple Python Web Crawler

Posted by Tom on 31/03/2011 21:51:25

More code doodlin' in Python. A web crawler this time.

    import sys
    import httplib
    import urlparse
    from BeautifulSoup import BeautifulSoup

    class Crawler:
        def __init__(self, host, root, depth, handler):
            self._host = host
            self._root = root
            self._depth = depth          # 0 means no depth limit
            self._handler = handler      # called once with each Page visited
            self._visited = []
            self._connection = httplib.HTTPConnection(host)

        def run(self):
            self._run(self._root, '', 0)

        def _run(self, url, parentUrl, currentDepth):
            # is some clown using absolute URLs for internal links?
            url = url.replace('http://' + self._host, '')
            # bail if we're running too deep
            if self._depth > 0 and currentDepth > self._depth:
                return
            # bail if it's a manky URL (empty, external, mailto: or a bare fragment)
            if not url or ':' in url or url.startswith('#'):
                return

            # normalise relative urls
            if url[0] != '/':
                index = parentUrl.rfind('/')
                if index > -1:
                    url = parentUrl[:index] + '/' + url
                else:
                    url = '/' + url

            # bail if we've already visited this page
            if url in self._visited:
                return

            page = Page(self._connection, url)
            self._handler(page)
            self._visited.append(url)

            # follow every link on the page, one level deeper
            for link in page.urls:
                self._run(link, url, currentDepth + 1)

    class Page:
        def __init__(self, connection, url):
            self.url = url
            self.urls = []
            self.inputs = []

            # get a list of querystring keys
            querystring = urlparse.urlparse(url).query
            self.querystring_params = [part.split('=')[0] for part in querystring.split('&') if part]

            connection.connect()
            connection.request('GET', url, headers = {'User-Agent': 'Colourblind Crawler 0.1'})
            response = connection.getresponse()

            self.statusCode = response.status
            # handle redirects (location probably isn't relevant to all of them)
            if 300 <= self.statusCode < 400:
                location = response.getheader('Location')
                if location:
                    self.urls.append(location)

            # if it's HTML, parse the sucker (the header may be missing entirely)
            if 'text/html' in response.getheader('Content-Type', ''):
                soup = BeautifulSoup(response.read(), fromEncoding='utf-8')
                links = soup('a')
                # grab all the hrefs and remove any blanks
                self.urls.extend([link.get('href') for link in links if link.get('href') is not None])
                self.inputs.extend(soup('input'))
                self.inputs.extend(soup('select'))
                self.inputs.extend(soup('textarea'))

            connection.close()

    def print_page(page):
        print('{0} {1}'.format(page.url.ljust(75, '.'), page.statusCode))
        # list the name of every form field found on the page
        for field in page.inputs:
            name = field.get('name')
            print('\t{0}'.format(name))

    if __name__ == '__main__':
        if len(sys.argv) < 2:
            sys.exit('usage: {0} host [startPage] [depth]'.format(sys.argv[0]))

        startPage = '/'
        depth = 3
        if len(sys.argv) > 2:
            startPage = sys.argv[2]
        if len(sys.argv) > 3:
            depth = int(sys.argv[3])

        crawler = Crawler(sys.argv[1], startPage, depth, print_page)
        crawler.run()

It's far from perfect (I still don't know how best to handle case sensitivity in the URLs), but I wrote this as part of a larger project which will, realistically, never get more than 10% completed. It'd be a shame for the code to never see the light of day, so here it is.
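
On the case-sensitivity question, one conservative option is to lowercase only the parts of a URL that are case-insensitive (the scheme and the host) and leave the path alone, since plenty of servers treat paths as case-sensitive. Here is a minimal sketch of that idea, written for Python 3 and using the standard urllib.parse rather than hand-rolled string handling. The normalise function is a hypothetical helper and is not part of the crawler.

```python
from urllib.parse import urljoin, urldefrag, urlsplit, urlunsplit

def normalise(parent_url, href):
    """Resolve href against parent_url and return a canonical absolute URL."""
    # urljoin copes with relative paths, '../' segments and absolute URLs
    absolute = urljoin(parent_url, href)
    # drop any '#fragment', which is never sent to the server
    absolute, _fragment = urldefrag(absolute)
    parts = urlsplit(absolute)
    # scheme and host are case-insensitive; path and query are left untouched
    # (assumption: no 'user:password@' prefix, which would be lowercased too)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or '/', parts.query, ''))

print(normalise('http://Example.com/blog/post.html', '../About.html#top'))
```

The returned string could be used as the key in the visited list, so that two spellings of the same host only get crawled once.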

Tags: Python


Pylighter - Python Syntax Highlighting

Posted by Tom on 31/05/2010 17:11:29

Back when I was first picking up Python I went looking for some code to syntax highlight it for blog posts. For C# I use Jean-Claude Manoli's C# Formatter, and wanted to reuse the same stylesheets. (I've since been distracted by other things, but I thought this was worth finishing off.)

After some brief Googling I found a likely candidate for plagiarism - the syntax highlighter that comes with MoinMoin. But it uses <font> tags and hard-coded colours, both of which are proven to be carcinogenic to cute little kittens. Clearly this will not do. Someone has to think of the kittens.

Here is the result of some fairly heavy tweaking - Pylighter.

And as formatted by itself.

    # Pylighter - monochromacy.net
    # HTML syntax highlighting for Python
    # based on the MoinMoin Python Source Parser - moinmo.in
    # compatible with the Manoli highlighting styles - www.manoli.net/csharpformat/

    import cgi, sys, StringIO
    import keyword, token, tokenize

    _KEYWORD = token.NT_OFFSET + 1

    # token type -> CSS class from the Manoli stylesheet
    _classes = {
        token.NUMBER:       'str',
        token.OP:           'op',
        token.STRING:       'str',
        tokenize.COMMENT:   'rem',
        token.ERRORTOKEN:   'kwrd',
        _KEYWORD:           'kwrd',
    }

    class Parser:
        """ Send colored python source.
        """

        def __init__(self, raw, includePreamble, out = sys.stdout):
            """ Store the source text.
            """
            self.raw = raw.expandtabs().strip()
            self.includePreamble = includePreamble
            self.out = out

        def format(self, formatter, form):
            """ Parse and send the colored source.
            """

            if self.includePreamble:
                self.out.write('<html>\n')
                self.out.write('<head>\n')
                self.out.write('<link rel="stylesheet" type="text/css" href="http://monochromacy.net/Skins/Cbv2/Lib/Css/Style.css" />\n')
                self.out.write('<link rel="stylesheet" type="text/css" href="http://monochromacy.net/Skins/Cbv2/Lib/Css/Code.css" />\n')
                self.out.write('</head>\n')
                self.out.write('<body>\n')

            self.lineNum = 1
            self.newlineRequired = True
            self.colPos = 0

            self.out.write('<div class="code">\n')
            # the tokenizer calls self (see __call__) once for every token
            tokenize.tokenize(StringIO.StringIO(self.raw).readline, self)
            # close the <pre> opened for the final line
            self.out.write('</pre>\n')
            self.out.write('</div>\n')

            if self.includePreamble:
                self.out.write('</body>\n')
                self.out.write('</html>\n')

        def __call__(self, toktype, toktext, (srow,scol), (erow,ecol), line):
            """ Token handler.
            """
            if 0:
                print "type", toktype, token.tok_name[toktype], "text", toktext,
                print "start", srow,scol, "end", erow,ecol, "<br>"

            # Handle multi-line strings with sneaky recursion
            if toktype == token.STRING and toktext.count('\n') > 0:
                lines = toktext.split('\n')
                for i in range(len(lines)):
                    self.__call__(token.STRING, lines[i], (0, 0), (0, len(lines[i])), lines[i])
                    if i < len(lines) - 1:
                        self.__call__(token.NEWLINE, '', (0, 0), (0, 0), lines[i])

                self.newlineRequired = False
                self.colPos = 0
                return

            # Write the line number if required
            if self.newlineRequired:
                spaces = ' ' * (4 - len(str(self.lineNum)))
                self.out.write('<pre><span class="lnum">{0}{1}:   </span>'.format(spaces, self.lineNum))
                self.newlineRequired = False

            # Handle newlines
            if toktype in [token.NEWLINE, tokenize.NL]:
                self.out.write('</pre>\n')
                self.lineNum = self.lineNum + 1
                self.colPos = 0
                self.newlineRequired = True
                return

            # Rewrite stripped out whitespace
            if scol > self.colPos:
                self.out.write(line[self.colPos:scol])

            # Do some token type wrangling
            if token.LPAR <= toktype and toktype <= token.OP:
                toktype = token.OP
            elif toktype == token.NAME and keyword.iskeyword(toktext):
                toktype = _KEYWORD

            # Write the token with the relevant style
            cssClass = _classes.get(toktype, None)
            if cssClass != None:
                self.out.write('<span class="%s">' % (cssClass))
                self.out.write(cgi.escape(line[scol:ecol]))
                self.out.write('</span>')
            else:
                self.out.write(cgi.escape(line[scol:ecol]))

            # Update the last character position so we can tell when whitespace
            # is dropped
            self.colPos = ecol

    if __name__ == "__main__":
        import os

        with open(sys.argv[1]) as infile:
            source = infile.read()
        outfile = sys.argv[1] + '.html'

        # close the output file before handing it to a browser
        with open(outfile, 'wt') as out:
            Parser(source, True, out).format(None, None)

        if os.name == "nt":
            os.system("explorer " + outfile)
        else:
            os.system("netscape " + outfile + " &")

So there you go. If you've already got a stylesheet set up for the Manoli C# formatter and want to reuse it for Python: enjoy.
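
For anyone on Python 3, the same tokenise-and-wrap idea can be sketched with the standard tokenize module. This is a rough illustration and not a port of Pylighter: highlight_line is a hypothetical helper, it assumes a single unindented line with no multi-line strings, and the class names are simply the Manoli ones used above.

```python
import html
import io
import keyword
import token
import tokenize

# CSS class names follow the Manoli stylesheet
CSS_CLASSES = {
    token.NUMBER: 'str',
    token.STRING: 'str',
    token.OP: 'op',
    tokenize.COMMENT: 'rem',
}

# tokens that carry no visible text of their own
SKIPPED = (token.NEWLINE, tokenize.NL, token.INDENT, token.DEDENT,
           token.ENDMARKER)

def highlight_line(source_line):
    """Return one line of Python source as HTML with <span> classes."""
    out = []
    col = 0
    readline = io.StringIO(source_line).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type in SKIPPED:
            continue
        scol, ecol = tok.start[1], tok.end[1]
        # the tokenizer drops whitespace, so copy it back from the raw line
        out.append(source_line[col:scol])
        text = html.escape(tok.string)
        if tok.type == token.NAME and keyword.iskeyword(tok.string):
            css = 'kwrd'
        else:
            css = CSS_CLASSES.get(tok.type)
        if css:
            out.append('<span class="%s">%s</span>' % (css, text))
        else:
            out.append(text)
        col = ecol
    return ''.join(out)

print(highlight_line('x = 1  # hi'))
```

As in Pylighter, the column positions reported by the tokenizer are what make it possible to put the original spacing back between tokens.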

Tags: Python


Define Irony - Shiny and Impractical Subversion Visualisation

Posted by Tom on 26/03/2010 00:59:14

I've always had an interest in new and curious methods of visualising data. As a fully paid-up member of the geek fraternity I am required to read xkcd or they'd take my membership card away, and the episodes that really tickle my fancy are the ones that show information in fascinating ways. Since I'm still dabbling with Python in my spare time and I know my way around OpenGL already, I thought I'd try my hand at some visualisation apps.

To begin with I need some data. Subversion logs provide a rich seam of information:

svn log --xml -v [directory] > log.xml

That gives us quite a few properties to present to the user. I decided to graph revision against each object in the repository (either a file or a directory), but instead of slapping it on some boring old orthogonal axes I decided to go radial and then make it garish. That's just the way I roll. So, each 'spoke' on the wheel represents a file or directory in the repository. The files are sorted alphabetically, starting at the top and continuing clockwise. The centre of the wheel is revision 0 and the outside is the last revision found in the log file. Each change to a file is represented by a white blob on its spoke. But enough description.
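
For the curious, the wheel layout can be sketched in a few lines of Python 3. This is an illustration of the idea and not the code from svnwheel.py: load_changes and wheel_positions are hypothetical names, and the only thing assumed about the log is the logentry/paths/path structure that svn log --xml -v writes out.

```python
import math
import xml.etree.ElementTree as ET

def load_changes(xml_text):
    """Return (revision, path) pairs from `svn log --xml -v` output."""
    changes = []
    for entry in ET.fromstring(xml_text).iter('logentry'):
        revision = int(entry.get('revision'))
        for path in entry.iter('path'):
            changes.append((revision, path.text))
    return changes

def wheel_positions(changes):
    """Map each change to an (x, y) point on a wheel of radius 1."""
    # one spoke per path, sorted alphabetically
    spokes = {path: i for i, path in enumerate(sorted({p for _, p in changes}))}
    last_revision = max(revision for revision, _ in changes)
    points = []
    for revision, path in changes:
        # spokes start at the top (12 o'clock) and run clockwise
        angle = 2 * math.pi * spokes[path] / len(spokes)
        # revision 0 sits at the centre, the last revision on the rim
        radius = revision / last_revision
        points.append((radius * math.sin(angle), radius * math.cos(angle)))
    return points
```

Each point is where one white blob would be drawn, with y pointing up the screen.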

You can run the code using:

python svnwheel.py log.xml

And yea, this will result in eye-candy.


Compares Favourably revision 45


This very blogging system


Track and zoom

In terms of practicality it ranks somewhere between a chocolate fireguard and a solar powered torch, but I like it. I'm sure it will only keep me distracted until the next technological muse gets her claws in, but it makes for some fun new stuff to learn. I imagine I'll come back and add different visualisations at some point in the future. Something with animation, perhaps.

Anyway, if you want to have a look at the source yourself the project is on github:

http://github.com/colourblind/define_irony

Left-click and drag moves the viewport and right-click and drag will zoom in and out. To reset the view at any time just hit 'r'. Windowing and OpenGL are provided by pyglet, and apart from that it's pretty straightforward. This also represents my first foray into git. The over-powering smugness of the git community had me heading towards Mercurial, but I figured that there's so much good stuff on github these days that I'd need it installed anyway. Curse you, network effect!


AllRGB - All the Colours of the 24-bit Rainbow

Posted by Tom on 22/02/2010 23:16:48

While surfing the intertubes recently I came across a link to allrgb (it was probably through proggit). The concept is simple and delicious, like good pasta. A 4096 by 4096 image with one pixel of each colour it is possible to represent in 24 bits - all 16,777,216 of them.
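
The simplest way to hit every colour exactly once is to number the pixels and split each number into its red, green and blue parts. Here is a minimal sketch in Python 3. The function name allrgb_colour and its bits parameter are made up for illustration (bits must be even), and this is not the code in the ZIP further down.

```python
def allrgb_colour(x, y, bits=8):
    """Colour for pixel (x, y) in a square image holding every colour once.

    With bits=8 the image is 4096 x 4096 and covers all 16,777,216 colours.
    """
    side = 1 << (3 * bits // 2)       # 4096 when bits is 8
    index = y * side + x              # a unique number for each pixel
    mask = (1 << bits) - 1
    red = index & mask
    green = (index >> bits) & mask
    blue = (index >> (2 * bits)) & mask
    return red, green, blue
```

Passing a smaller bits value, such as 2 for an 8 by 8 image of 64 colours, is a handy way to check the mapping without waiting on 16 million pixels.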

I like me a bit of eye-candy, and I'm enjoying the Python at the moment (after the brief false start), so I saw this as a good excuse to get my script on.

You can click on any of the below for a larger image. And when I say large I mean large. Only click on any of the below if you really, really mean it. The average image size is around 30MB.

If you've got an algorithmic eye you can probably infer what's being done in each of the examples above. To be honest all of this is pretty underwhelming compared to some of the submitted images. There's some frankly amazing stuff over there, and they've been getting progressively more awesome. Back before it became popular it was all nested loops, and I can follow it pretty easily. For the best-fit image matching stuff I could at least have a stab at some algorithms. I even know what a Hilbert Curve is. But now? Now it just makes me feel uneducated. I have no idea what simulated annealing is, but it sure makes pretty pictures.

Anyhoo, here's the source used to generate my paltry efforts above. PNG functionality is provided by the sultry PyPNG, and apart from that there's nothing particularly special going on. I am still very much starting out, so this was a good way of cutting my teeth.

ZIP containing all the relevant bits

Caveat: The memory usage can get a little crazy. With 4GB in Windows 7 I have no problems. With 2GB in Windows XP there was some slight wackiness (where 'slight wackiness' can mean anything from heavy slowdown to so-boned-you-need-to-hard-reset). I got better results using arrays (which I understand are wrappers around C arrays, so pack a lot smaller) but left the code on another machine. Whoops. Also, I don't know what I'm doing in Python - so don't use this as an example of how to do it. As usual, The Disclaimer applies.
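
To give a rough idea of why arrays help, here is a small Python 3 sketch comparing a row of pixel data held in an array of unsigned bytes with the same row held in a plain list. It is not the lost code, blank_rows is a hypothetical helper, and the exact sizes printed will vary between Python builds.

```python
import sys
from array import array

def blank_rows(width, height):
    """Return one array of unsigned bytes per row, three bytes (RGB) per pixel."""
    return [array('B', bytes(width * 3)) for _ in range(height)]

rows = blank_rows(4096, 2)
as_list = [0] * (4096 * 3)

# each value in the array takes a single byte
print(rows[0].itemsize)
# a list stores a pointer per value, so the same row is several times bigger
print(sys.getsizeof(rows[0]), sys.getsizeof(as_list))
```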

Increasingly traditional random link: Egads. If this isn't a harbinger of the eschaton then I don't know what is.


Python 3.1 - The Release That Time Forgot

Posted by Tom on 18/01/2010 17:47:43

So I started learning Python.

When lots of people on the Intertubes said that I should not use 3.1 because there wasn't as much module support I made a 'pssh' noise, rolled my eyes and promptly installed 3.1. My reasoning ran thus:

  • Lots of people are stupid [1]
  • 3.1 is a larger number than 2.6
  • It was likely to take me so long to learn the language to any useful degree that 3.1 usage would have stepped up by then

I was unaware, however, of the full magnitude of the problem. Allow me to illustrate using the medium of Google Image Search:

Python31 - <sound type="crickets" />

Python 3.1

Python26 - Holy crap, inflatables!

Python 2.6

I started off with Hello World, then decided that the next best step was steganography. It's something I've always been interested in, and it's got file IO in there, some bit-twisting, probably hookups to a couple of other libraries and plenty of room to grow - a good starter project.
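
As a taste of the bit-twisting involved, here is a minimal Python 3 sketch of one common approach, which hides each bit of a message in the lowest bit of a pixel value. Least-significant-bit embedding is an assumption made for illustration, and embed and extract are hypothetical names.

```python
def embed(pixels, message):
    """Hide message (bytes) in the lowest bit of each value in pixels."""
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError('message too long for this image')
    out = bytearray(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit   # clear the low bit, then set it
    return out

def extract(pixels, length):
    """Recover length bytes hidden by embed()."""
    message = bytearray()
    for start in range(0, length * 8, 8):
        byte = 0
        for value in pixels[start:start + 8]:
            byte = (byte << 1) | (value & 1)
        message.append(byte)
    return bytes(message)
```

Each pixel value changes by at most one, which is why the result looks the same to the eye.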

First order of the day: start looking for PNG libraries. Plenty of options, none of which work with 3.1. Shame, but OK. I know nothing about PNG, so it's time to look up the spec. Before too long I'm loading chunks, learning to love struct, having to implement LZ77. Wait, what? At this point I made a face like this - >:  | - expunged 3.1, installed 2.6 and downloaded pypng.
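
For reference, the chunk-loading part is the friendly bit. A PNG file is an 8-byte signature followed by chunks, each made of a big-endian length, a 4-byte type, the payload and a CRC. Here is a minimal Python 3 sketch using struct, where read_chunks is a hypothetical helper rather than anything from pypng.

```python
import struct
import zlib

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'

def read_chunks(data):
    """Yield (type, payload) for each chunk in the contents of a PNG file."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError('not a PNG file')
    offset = 8
    while offset < len(data):
        # each chunk starts with a big-endian length and a 4-byte type
        length, chunk_type = struct.unpack('>I4s', data[offset:offset + 8])
        payload = data[offset + 8:offset + 8 + length]
        (crc,) = struct.unpack('>I', data[offset + 8 + length:offset + 12 + length])
        # the CRC covers the type and the payload, but not the length
        if zlib.crc32(chunk_type + payload) & 0xFFFFFFFF != crc:
            raise ValueError('bad CRC in %r chunk' % chunk_type)
        yield chunk_type, payload
        offset += 12 + length
```

The pixel data itself lives in the IDAT chunks, and unpacking that is where the compression comes in.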

You could argue that I should be lighting a candle rather than cursing the darkness - that I should stop whining and start porting code. In most cases I'm willing to get down and dirty. I've written loaders in my time, but I don't start to learn a high-level language so I can trawl through file format specs, especially when there's a perfectly serviceable library 0.5 of a release over thataway.

In case you need any more convincing that no-one gives a crap about 3.1 then let's have a look at their online documentation:

2.6: http://docs.python.org/library/io.html
3.1: http://docs.python.org/3.1/library/io.html

I see someone's in a subdirectory. No illusions about who's the ginger stepchild in that family.

So there's some more anecdotal evidence to add to the pile. I beseech you, dear reader: if you're about to install Python, install 2.6.

[1] The fact that Mamma Mia is the (by at least one metric) most successful film of all time in the UK makes me gnash my teeth and despair for the species.

I hate Abba. Let's put that right out there at the start so that we may understand one another. Or at least that you may understand me. If I take the time to clamber down from my throne constructed entirely of rage and attempt to be objective for 5 minutes I can come to the conclusion that Abba is not bad music. It's musically well constructed. The lyrics aren't half bad. But all of that goes right out the window when I actually hear any Abba, as I am filled with a sudden urge to set fire to orphans or punch a baby panda.

Which proves to be a problem with Mamma Mia, since there is nothing else in there.

Here is a film with absolutely zero content. It's a hollowed-out corpse of a film. A flimsy celluloid husk. It's cinematic celery. You could get the same experience by purchasing a Best of Abba CD. In fact, that would be much preferable, as then you wouldn't have to listen to Pierce Fucking Brosnan.

To be honest I'd given up on him after Die Another Day (you know the only thing more retarded than an invisible car? Having someone hide behind it), but Stellan: you should know better. I remember when you were playing a mathematical pimp and jousting with Robin Williams in Good Will Hunting. I watched you play the reserved and enigmatic Gregor in Ronin. What are you doing now? Capering. CAPERING. Well that's it. You're off my favourite Swede list. Peter Stormare: you've been promoted.

It's . . . it's just a bad film. A really bad film. The thought that if aliens landed in Kent tomorrow (let's face it, it would either be that or Norfolk), checked up what was the best example of our most popular entertainment medium and found themselves watching Mamma Mia makes me feel like crying into my lasagne.

I just wanted to get that off my chest. But yeah, Python 2.6. Thumbs up.

Tags: Python, Waffle
