This article presents a script to calculate the sha1sum of a file, along with tips for writing code that is portable to both python 2 and python 3. Some performance considerations when using python's functional features are also discussed.

Here is the standalone python script that checksums a file. It has a central hash_file function that is essentially the same as the one recently introduced to the OpenStack nova project.

from __future__ import print_function
import sys
import hashlib

def hash_file(target):
    sha1sum = hashlib.sha1()
    with open(target, 'rb') as f:
        for chunk in iter(lambda: f.read(32768), b''):
            sha1sum.update(chunk)
    return sha1sum.hexdigest()

if __name__ == "__main__":
    for f in sys.argv[1:]:
        try:
            print(hash_file(f))
        except IOError:
            e = sys.exc_info()[1]
            print("%s: %s" % (e.filename, e.strerror))
One point to note about the above script is that it is significantly faster than the "standard" sha1sum utility on GNU/Linux, because the hashlib module calls out to openssl's efficient assembly implementation.
$ truncate -s200M t.in

$ time openssl sha1 t.in
SHA1(t.in)= fd7c5327c68fcf94b62dc9f58fc1cdb3c8c01258
real 0m0.590s
user 0m0.538s
sys 0m0.052s

$ time sha1sum t.in
fd7c5327c68fcf94b62dc9f58fc1cdb3c8c01258 t.in
real 0m1.002s
user 0m0.957s
sys 0m0.044s

$ time ./hash_file.py t.in
fd7c5327c68fcf94b62dc9f58fc1cdb3c8c01258
real 0m0.639s
user 0m0.595s
sys 0m0.040s
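The chunked loop in hash_file relies on the two-argument form of iter(), which keeps calling f.read(32768) until it returns the sentinel b'' at end of file. As a minimal sketch of why this gives the same digest as hashing everything in one call, the following uses an in-memory io.BytesIO object in place of a real file. The helper name hash_stream and the sample data are hypothetical, chosen only for illustration.

```python
import hashlib
import io

def hash_stream(stream, chunk_size=32768):
    # Hypothetical helper: the same loop as hash_file, but it takes
    # an already open binary stream instead of a file name.
    sha1sum = hashlib.sha1()
    # iter(callable, sentinel) calls read() until it returns b'' (EOF).
    for chunk in iter(lambda: stream.read(chunk_size), b''):
        sha1sum.update(chunk)
    return sha1sum.hexdigest()

data = b'x' * 100000  # illustrative data spanning several chunks
chunked = hash_stream(io.BytesIO(data))
one_shot = hashlib.sha1(data).hexdigest()
print(chunked == one_shot)
```

Feeding the data in 32768 byte pieces keeps memory use bounded regardless of file size, which is the reason to prefer update() in a loop over reading the whole file at once.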
Now let's consider a more "functional" implementation, which relies on update() returning None so that any() iterates through all chunks. One might expect this to be faster, since no explicit python iteration is done.
def hash_file(target):
    sha1sum = hashlib.sha1()
    with open(target, 'rb') as f:
        any(sha1sum.update(c) for c in iter(lambda: f.read(32768), b''))
    return sha1sum.hexdigest()
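The any() trick only works because every call returns a false value. A small sketch of that behaviour follows, using hypothetical stand-in functions (record and record_truthy) rather than a real hash object, so the calls can be observed.

```python
chunks = [b'a', b'b', b'c']

calls = []
def record(chunk):
    # Stand-in for sha1sum.update(): does its work and returns None.
    calls.append(chunk)

# Every result is None (falsy), so any() must consume the whole generator.
result = any(record(c) for c in chunks)
print("result=%r processed=%d" % (result, len(calls)))

seen = []
def record_truthy(chunk):
    # A function returning a true value makes any() stop early.
    seen.append(chunk)
    return True

any(record_truthy(c) for c in chunks)
print("processed=%d" % len(seen))
```

If update() ever returned a true value, any() would short-circuit after the first chunk and the digest would silently cover only part of the file, so this form depends on an implementation detail that the explicit loop does not.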
For comparison, this is the imperative form, which one might expect to be slower with its explicit iteration.
def hash_file(target):
    sha1sum = hashlib.sha1()
    with open(target, 'rb') as f:
        while True:
            c = f.read(32768)
            if c == b'':
                break
            sha1sum.update(c)
    return sha1sum.hexdigest()
Performance testing these forms directly with real files is pointless, since I/O and checksumming overhead would dominate all measurements. Instead, we can construct an in-memory list to simulate the file chunks. Wrapping the standard python timeit module in a shell function, we have:
ptest() {
  [ "$1" = 3 ] && py=python3 || py=python; shift
  $py -m timeit -s "x=[None]*100000+[b'']" -s "def disregard(e): pass" "$@"
}
This allows us to easily test various forms...
$ ptest 2 -s "import itertools" "i=iter(x)" \
"any(itertools.imap(disregard, iter(lambda: i.next(), b'')))"
10 loops, best of 3: 53.5 msec per loop
$ ptest 2  "i=iter(x)" \
"for e in iter(lambda: i.next(), b''): disregard(e)"
10 loops, best of 3: 58.3 msec per loop
$ ptest 2 "i=iter(x)" \
"while 1:" "    e = i.next()" "    if e==b'': break" "    disregard(e)"
10 loops, best of 3: 53.2 msec per loop
$ ptest 2 "i=iter(x)" \
"while True:" "    e = i.next()" "    if e==b'': break" "    disregard(e)"
10 loops, best of 3: 62.5 msec per loop
$ ptest 2 "i=iter(x)" \
"any(disregard(c) for c in iter(lambda: i.next(), b''))"
10 loops, best of 3: 63.3 msec per loop
Repeating with python 3 (3.2.3)...
$ ptest 3 "i=iter(x)" \
"any(map(disregard, iter(lambda: i.__next__(), b'')))"
10 loops, best of 3: 67.2 msec per loop
$ ptest 3 "i=iter(x)" \
"for e in iter(lambda: i.__next__(), b''): disregard(e)"
10 loops, best of 3: 67.9 msec per loop
$ ptest 3 "i=iter(x)" \
"while 1:" "    e = i.__next__()" "    if e==b'': break" "    disregard(e)"
10 loops, best of 3: 44.4 msec per loop
$ ptest 3 "i=iter(x)" \
"while True:" "    e = i.__next__()" "    if e==b'': break" "    disregard(e)"
10 loops, best of 3: 44.4 msec per loop
$ ptest 3 "i=iter(x)" \
"any(disregard(c) for c in iter(lambda: i.__next__(), b''))"
10 loops, best of 3: 82.2 msec per loop
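The measurements above can also be approximated from within python itself. The following is a sketch only, not the harness used for the timings in this article: it calls timeit.repeat() directly with the same setup as ptest, and uses the built-in next() (available in python 2.6 and later, and in python 3) in place of i.next() and i.__next__(), so the same file runs under both interpreters. The repeat and number values are assumptions, kept small so that it finishes quickly.

```python
import timeit

# Same setup as ptest: dummy chunks terminated by b'', plus a no-op consumer.
setup = "x=[None]*100000+[b'']\ndef disregard(e): pass"

forms = {
    "for": "i=iter(x)\n"
           "for e in iter(lambda: next(i), b''): disregard(e)",
    "while": "i=iter(x)\n"
             "while True:\n"
             "    e = next(i)\n"
             "    if e==b'': break\n"
             "    disregard(e)",
    "any": "i=iter(x)\n"
           "any(disregard(c) for c in iter(lambda: next(i), b''))",
}

results = {}
for name in sorted(forms):
    # Best of 3 runs of 5 loops each, reported per loop like the timeit CLI.
    best = min(timeit.repeat(forms[name], setup=setup, repeat=3, number=5)) / 5
    results[name] = best
    print("%-5s %.1f msec per loop" % (name, best * 1000))
```

Absolute numbers will differ from those shown above depending on the interpreter version and machine, so only the relative ordering of the forms within a single run is meaningful.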
© Oct 31 2012