Sal
Peter Hoffmann Director Data Engineering at Blue Yonder. Python Developer, Conference Speaker, Mountaineer

Microsoft Sho Word Histogram and the Python Standard Library

Through a blog post by John D. Cook on Planet Python I became aware of the Microsoft Sho project for data analysis and scientific computing. I haven't installed it yet, but it looks promising, and I always love to see progress and usage of IronPython on Windows.

Here's the "Computing a Word Histogram" example:

>>> fp = System.IO.File.ReadAllText("./declarationofindependence.txt")
>>> table = System.Collections.Hashtable()
>>> for word in fp.split():
    if table.ContainsKey(word):
        table[word] +=1
    else:
        table[word] = 1
>>> pairs = zip(list(table.Keys), list(table.Values))
>>> pairs.sort(lambda a,b: a[1]<b[1])
>>> bar([elt[0] for elt in pairs[0:10]], [elt[1] for elt in pairs[0:10]])

http://research.microsoft.com/en-us/projects/sho/wordhistogram.jpg

I've used a hashtable for counting in the past too, but the Counter data structure from the Python standard library (added in 2.7) is much better suited to this kind of task:

>>> from collections import Counter
>>> table = Counter()
>>> table.update(fp.split())
>>> pairs = table.most_common(10)

It's shorter and more readable.
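To try this without Sho or IronPython, here is a minimal, self-contained sketch in plain Python. The sample sentence is a made-up stand-in for the contents of the text file, and the last two lists only mirror the two arguments that the bar() call above expects; nothing here is taken from the Sho documentation.

```python
from collections import Counter

# Hypothetical sample text standing in for the file contents (fp above).
text = "the quick brown fox jumps over the lazy dog the fox"

# Counter accepts any iterable of hashable items and tallies them.
table = Counter(text.split())

# most_common(n) returns the n highest counts as (item, count) tuples,
# already sorted from most to least frequent.
pairs = table.most_common(2)
print(pairs)  # [('the', 3), ('fox', 2)]

# Split the pairs into labels and values, e.g. for a bar chart.
words = [word for word, count in pairs]
counts = [count for word, count in pairs]
```

Passing the iterable straight to the constructor is equivalent to creating an empty Counter and calling update() on it, so either form works here.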

For sorting a list of tuples (or lists) by a specific element, I prefer operator.itemgetter over a lambda expression:

>>> from operator import itemgetter
>>> lst = [('orange', 5), ('banana', 7), ('apple', 2)]
>>> lst.sort(key=itemgetter(1))
>>> lst
[('apple', 2), ('orange', 5), ('banana', 7)]
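For a histogram the largest counts should come first, so the same idea can be combined with reverse=True. The following is a small sketch using the list from above; the variable names are only illustrative, and the lambda version is included to show that both approaches give the same result.

```python
from operator import itemgetter

lst = [('orange', 5), ('banana', 7), ('apple', 2)]

# sorted() returns a new list and leaves lst untouched;
# reverse=True puts the highest count first.
by_count_desc = sorted(lst, key=itemgetter(1), reverse=True)
print(by_count_desc)  # [('banana', 7), ('orange', 5), ('apple', 2)]

# itemgetter(0) sorts by the first element (the name) instead.
by_name = sorted(lst, key=itemgetter(0))
print(by_name)  # [('apple', 2), ('banana', 7), ('orange', 5)]

# The equivalent lambda expression, for comparison.
with_lambda = sorted(lst, key=lambda pair: pair[1], reverse=True)
```

Whether itemgetter or a lambda reads better is a matter of taste; itemgetter states the intent ("take element 1") without introducing a throwaway argument name.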

The bottom line is that Python has a great standard library, and it is worth knowing it well.