Using functional python idioms efficiently

This article presents a script to calculate the sha1sum of a file, along with tips for writing code that is portable to both python 2 and python 3. It also covers some performance considerations when using python's functional features.

Here is the standalone python script that checksums a file. It has a central hash_file function that is essentially the same as the one recently introduced to the OpenStack nova project.

from __future__ import print_function
import sys
import hashlib

def hash_file(target):
    sha1sum = hashlib.sha1()
    with open(target, 'rb') as f:
        for chunk in iter(lambda: f.read(32768), b''):
            sha1sum.update(chunk)
    return sha1sum.hexdigest()

if __name__ == "__main__":
    for f in sys.argv[1:]:
        try:
            print(hash_file(f))
        except IOError:
            e = sys.exc_info()[1]
            print("%s: %s" % (e.filename, e.strerror))

Some notes to consider with the above script.

Supports all python versions >= 2.6 (including the python 3 series)
Uses a bounded amount of memory, with the chunk size chosen to trade off memory usage against read call overhead
Uses a concise but not purely functional syntax
The lambda is used to convert a function with args, to one without args like iter() expects
The b in the mode passed to open() is required to stop python 3 decoding the input as text
The b'' is the sentinel, with the b prefix again needed to be compatible with python 3
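
To see the two argument form of iter() in isolation, here is a minimal sketch that reads from an in-memory io.BytesIO stream rather than a real file. The read_chunks helper, the 4 byte chunk size and the sample data are assumptions made for illustration, and functools.partial is shown only as one possible alternative to the lambda.

```python
import functools
import io

def read_chunks(stream, size):
    # iter(callable, sentinel) calls callable() with no args
    # until it returns a value equal to the sentinel
    return list(iter(lambda: stream.read(size), b''))

data = b'0123456789'
chunks = read_chunks(io.BytesIO(data), 4)

# functools.partial also binds the argument, without a lambda
stream = io.BytesIO(data)
partial_chunks = list(iter(functools.partial(stream.read, 4), b''))
```

Both lists hold the same three chunks, the last one being shorter than the requested size. The sentinel itself is never yielded.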

It's interesting to note that the above script is significantly faster than the "standard" sha1sum utility on GNU/Linux (about 0.64s versus 1.00s for the 200MB file below), because the hashlib module calls out to openssl's efficient assembly implementation.

$ truncate -s200M t.in

$ time openssl sha1 t.in
SHA1(t.in)= fd7c5327c68fcf94b62dc9f58fc1cdb3c8c01258
real 0m0.590s
user 0m0.538s
sys 0m0.052s

$ time sha1sum t.in
fd7c5327c68fcf94b62dc9f58fc1cdb3c8c01258 t.in
real 0m1.002s
user 0m0.957s
sys 0m0.044s

$ time ./hash_file.py t.in
fd7c5327c68fcf94b62dc9f58fc1cdb3c8c01258
real 0m0.639s
user 0m0.595s
sys 0m0.040s
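
Apart from speed, it is easy to check that reading in chunks does not change the result. The following sketch repeats hash_file from the script above and compares it with hashing the same bytes in a single call. The temporary file and the 100000 byte payload are assumptions chosen so that the data spans several 32768 byte chunks.

```python
import hashlib
import os
import tempfile

def hash_file(target):
    sha1sum = hashlib.sha1()
    with open(target, 'rb') as f:
        for chunk in iter(lambda: f.read(32768), b''):
            sha1sum.update(chunk)
    return sha1sum.hexdigest()

data = b'x' * 100000  # larger than one chunk

# Write the payload to a temporary file, hash it, then clean up
tmp = tempfile.NamedTemporaryFile(delete=False)
try:
    tmp.write(data)
    tmp.close()
    chunked = hash_file(tmp.name)
finally:
    os.remove(tmp.name)

# Hash the same bytes in one go for comparison
one_shot = hashlib.sha1(data).hexdigest()
```

The two digests should be identical, since update() may be called any number of times with consecutive pieces of the input.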

Now let's consider a more "functional" implementation. It relies on update() returning None, which is falsy, so that any() never short-circuits and iterates through all chunks. One might expect this to be faster since no explicit python iteration is done.

def hash_file(target):
    sha1sum = hashlib.sha1()
    with open(target, 'rb') as f:
        any(sha1sum.update(c) for c in iter(lambda: f.read(32768), b''))
    return sha1sum.hexdigest()

For comparison, this is the imperative form, which one might expect to be slower with its explicit iteration.

def hash_file(target):
    sha1sum = hashlib.sha1()
    with open(target, 'rb') as f:
        while True:
            c = f.read(32768)
            if c == b'':
                break
            sha1sum.update(c)
    return sha1sum.hexdigest()
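
The any() form deserves a caveat: it only drains the iterator because every call returns a falsy value. The following sketch uses two hypothetical helpers, record and record_truthy, in place of update() to show any() stopping early as soon as a call returns something truthy.

```python
seen = []

def record(chunk):
    # Like hashlib's update(), implicitly returns None
    seen.append(chunk)

def record_truthy(chunk):
    # Returns a truthy value, so any() stops after the first call
    seen.append(chunk)
    return True

chunks = [b'a', b'b', b'c']

any(record(c) for c in chunks)
drained = list(seen)

del seen[:]
any(record_truthy(c) for c in chunks)
short_circuited = list(seen)
```

Here drained holds all three chunks while short_circuited holds only the first, so the idiom is only safe with functions known to return None or another falsy value.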

Performance testing these forms directly with real files is pointless, since I/O and checksumming overhead will dominate all measurements. Instead we can construct an in-memory list to simulate the file chunks. Setting this up in a shell function wrapper around the standard python timeit module, we have:

ptest() {
  [ "$1" = 3 ] && py=python3 || py=python; shift
  $py -m timeit -s "x=[None]*100000+[b'']" -s "def disregard(e): pass" "$@"
}

This allows us to easily test various forms...

$ ptest 2 -s "import itertools" "i=iter(x)" \
"any(itertools.imap(disregard, iter(lambda: i.next(), b'')))"
10 loops, best of 3: 53.5 msec per loop

$ ptest 2 "i=iter(x)" \
"for e in iter(lambda: i.next(), b''): disregard(e)"
10 loops, best of 3: 58.3 msec per loop

$ ptest 2 "i=iter(x)" \
"while 1:" "    e = i.next()" "    if e==b'': break" "    disregard(e)"
10 loops, best of 3: 53.2 msec per loop

$ ptest 2 "i=iter(x)" \
"while True:" "    e = i.next()" "    if e==b'': break" "    disregard(e)"
10 loops, best of 3: 62.5 msec per loop

$ ptest 2 "i=iter(x)" \
"any(disregard(c) for c in iter(lambda: i.next(), b''))"
10 loops, best of 3: 63.3 msec per loop

Notes:

There isn't that much variation between the various forms, so it mainly comes down to a matter of style
The full functional form (the last one) is a bit slower than the rest, which is surprising
The fastest is the "while 1" loop, which is noticeably faster than the equivalent "while True" loop

Repeating with python 3 (3.2.3)...

$ ptest 3 "i=iter(x)" \
"any(map(disregard, iter(lambda: i.__next__(), b'')))"
10 loops, best of 3: 67.2 msec per loop

$ ptest 3 "i=iter(x)" \
"for e in iter(lambda: i.__next__(), b''): disregard(e)"
10 loops, best of 3: 67.9 msec per loop

$ ptest 3 "i=iter(x)" \
"while 1:" "    e = i.__next__()" "    if e==b'': break" "    disregard(e)"
10 loops, best of 3: 44.4 msec per loop

$ ptest 3 "i=iter(x)" \
"while True:" "    e = i.__next__()" "    if e==b'': break" "    disregard(e)"
10 loops, best of 3: 44.4 msec per loop

$ ptest 3 "i=iter(x)" \
"any(disregard(c) for c in iter(lambda: i.__next__(), b''))"
10 loops, best of 3: 82.2 msec per loop

Notes:

All functional forms are a bit slower in python 3 than python 2
The imperative form is faster in python 3 than python 2
python 3 removes the anomaly in python 2 between "while 1" and "while True" performance, since True is a keyword there rather than a name that has to be looked up
The iterator method is named __next__() in python 3 rather than next()
map() is now lazy, so itertools.imap is no longer needed and has been removed
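
As an aside to the last two notes, the next() builtin (available since python 2.6) calls whichever of the two methods the running interpreter provides, so a single statement can be timed unchanged on both series. The following is a sketch only, driving the timeit module from python rather than from the shell wrapper. The smaller list size and the repeat counts are assumptions chosen to keep it quick, and the extra call through the builtin may shift the timings relative to the figures above.

```python
import timeit

setup = "x=[None]*1000+[b'']\ndef disregard(e): pass"

forms = {
    "for": "i=iter(x)\n"
           "for e in iter(lambda: next(i), b''): disregard(e)",
    "while": "i=iter(x)\n"
             "while True:\n"
             "    e = next(i)\n"
             "    if e==b'': break\n"
             "    disregard(e)",
    "any": "i=iter(x)\n"
           "any(disregard(c) for c in iter(lambda: next(i), b''))",
}

results = {}
for name, stmt in forms.items():
    # best of 3 runs of 10 loops each, in seconds per loop
    results[name] = min(timeit.repeat(stmt, setup, repeat=3, number=10)) / 10
```

The results dict maps each form to its best time per loop in seconds. Absolute values will depend on the machine and interpreter, so only the relative ordering is of interest.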