
Peter Hoffmann

Software Engineer

Microsoft Sho Word Histogram and the Python Standard Library

Posted on January 27, 2011
#python

Through a blog post by John D. Cook on Planet Python, I became aware of the Microsoft Sho project for data analysis and scientific computing. I haven't installed it yet, but it looks promising, and I always love to see progress and usage of IronPython on Windows.

Here's the Computing a Word Histogram example:

>>> fp = System.IO.File.ReadAllText("./declarationofindependence.txt")
>>> table = System.Collections.Hashtable()
>>> for word in fp.split():
    if table.ContainsKey(word):
        table[word] +=1
    else:
        table[word] = 1
>>> pairs = zip(list(table.Keys), list(table.Values))
>>> pairs.sort(lambda a,b: a[1]<b[1])
>>> bar([elt[0] for elt in pairs[0:10]], [elt[1] for elt in pairs[0:10]])

http://research.microsoft.com/en-us/projects/sho/wordhistogram.jpg

I've used a hash table for counting in the past too, but the Counter data structure from the Python standard library (added in 2.7) is much better suited to this kind of task:

>>> from collections import Counter
>>> table = Counter()
>>> table.update(fp.split())
>>> pairs = table.most_common(10)

It's shorter and more readable.
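
As a self-contained illustration of the Counter approach, here is a minimal sketch that runs on plain CPython without Sho. The sample sentence and the helper name word_histogram are assumptions made up for this example; they are not part of the Sho example.

```python
from collections import Counter

# Hypothetical sample text standing in for declarationofindependence.txt
text = "the quick fox and the lazy dog and the cat"


def word_histogram(text, n=10):
    """Return the n most frequent whitespace-separated words with their counts."""
    # Counter accepts any iterable and counts its items in one pass
    return Counter(text.split()).most_common(n)


top_two = word_histogram(text, 2)
print(top_two)  # [('the', 3), ('and', 2)]
```

Passing the word list straight to the Counter constructor saves the separate update step; most_common already returns the (word, count) pairs sorted by descending count.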

To sort a list of tuples by a specific element, I prefer operator.itemgetter to a lambda expression:

>>> from operator import itemgetter
>>> lst = [('orange', 5), ('banana', 7), ('apple', 2)]
>>> lst.sort(key=itemgetter(1))
>>> lst
[('apple', 2), ('orange', 5), ('banana', 7)]
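
The same itemgetter key also works with sorted and reverse=True, which gives the descending order a histogram needs. The sketch below reuses the fruit counts from the session above and compares the result with an equivalent lambda and with Counter.most_common; the variable names are chosen just for illustration.

```python
from collections import Counter
from operator import itemgetter

counts = Counter({'orange': 5, 'banana': 7, 'apple': 2})

# Sort the (name, count) pairs by the count, highest first
by_itemgetter = sorted(counts.items(), key=itemgetter(1), reverse=True)

# The equivalent lambda spells out the index access by hand
by_lambda = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

print(by_itemgetter)  # [('banana', 7), ('orange', 5), ('apple', 2)]
```

With no ties in the counts, both sorts produce the same list as counts.most_common().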

The bottom line is that Python has a great standard library, and it is worth knowing well.