This article presents a script to calculate the sha1sum for a file,
while giving tips for writing code portable to both python 2 and python 3.
Some performance considerations with using python's functional features are also mentioned.
Here is the standalone python script that checksums a file. It has a central hash_file function that is essentially the same as the one recently introduced to the OpenStack nova project.
    from __future__ import print_function
    import sys
    import hashlib

    def hash_file(target):
        sha1sum = hashlib.sha1()
        with open(target, 'rb') as f:
            # read fixed size chunks until read() returns the b'' sentinel at EOF
            for chunk in iter(lambda: f.read(32768), b''):
                sha1sum.update(chunk)
        return sha1sum.hexdigest()

    if __name__ == "__main__":
        for f in sys.argv[1:]:
            try:
                print(hash_file(f))
            except IOError:
                # sys.exc_info() avoids the except syntax that differs between python versions
                e = sys.exc_info()[1]
                print("%s: %s" % (e.filename, e.strerror))

Some notes on the above script:
- Supports all python versions >= 2.6 (including the python 3 series)
- Uses a bounded amount of memory, with the 32768 byte chunk size chosen to trade off memory usage against read call overhead
- Uses a concise but not purely functional syntax
- The lambda is used to convert a function with args to one without args, as the two argument form of iter() expects
- The b in the 'rb' mode passed to open() is required to stop python 3 decoding the input to text
- The b'' is the sentinel, with the b prefix again needed for python 3, where read() returns bytes
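The chunked read described in the notes above can be tried without a real file. This is only a minimal sketch: the `data` payload and the use of io.BytesIO in place of open() are assumptions made for illustration, and are not part of the script.

```python
from __future__ import print_function
import hashlib
import io

data = b'x' * 100000  # hypothetical payload standing in for a file's contents
f = io.BytesIO(data)  # in-memory binary stream; read() returns b'' at the end

# iter(callable, sentinel) keeps calling the callable until it returns the sentinel
chunked = hashlib.sha1()
sizes = []
for chunk in iter(lambda: f.read(32768), b''):
    sizes.append(len(chunk))
    chunked.update(chunk)

print(sizes)  # three full chunks and a short final one
print(chunked.hexdigest() == hashlib.sha1(data).hexdigest())
```

Feeding the chunks to update() one at a time should give the same digest as hashing the whole payload in one call, which is what keeps the memory use bounded.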
Timing the script against openssl and sha1sum on a 200M file:

    $ truncate -s200M t.in

    $ time openssl sha1 t.in
    SHA1(t.in)= fd7c5327c68fcf94b62dc9f58fc1cdb3c8c01258
    real 0m0.590s
    user 0m0.538s
    sys 0m0.052s

    $ time sha1sum t.in
    fd7c5327c68fcf94b62dc9f58fc1cdb3c8c01258 t.in
    real 0m1.002s
    user 0m0.957s
    sys 0m0.044s

    $ time ./hash_file.py t.in
    fd7c5327c68fcf94b62dc9f58fc1cdb3c8c01258
    real 0m0.639s
    user 0m0.595s
    sys 0m0.040s

Now let's consider a more "functional" implementation, which relies on update() returning None, so that any() iterates through all chunks. One might expect this to be faster since no explicit python iteration is done.
    def hash_file(target):
        sha1sum = hashlib.sha1()
        with open(target, 'rb') as f:
            any(sha1sum.update(c) for c in iter(lambda: f.read(32768), b''))
        return sha1sum.hexdigest()

For comparison, this is the imperative form, which one might expect to be slower with its explicit iteration.
    def hash_file(target):
        sha1sum = hashlib.sha1()
        with open(target, 'rb') as f:
            while True:
                c = f.read(32768)
                if c==b'': break
                sha1sum.update(c)
        return sha1sum.hexdigest()

Performance testing these forms directly with real files is pointless, since I/O and checksumming overhead will dominate all measurements. So instead we can construct an internal list to simulate the file chunks. Setting this up in a shell function wrapper around the standard python timeit module, we have:
    ptest() {
      [ "$1" = 3 ] && py=python3 || py=python; shift
      $py -m timeit -s "x=[None]*100000+[b'']" -s "def disregard(e): pass" "$@"
    }

This allows us to easily test various forms:
    $ ptest 2 -s "import itertools" "i=iter(x)" \
      "any(itertools.imap(disregard, iter(lambda: i.next(), b'')))"
    10 loops, best of 3: 53.5 msec per loop

    $ ptest 2 "i=iter(x)" \
      "for e in iter(lambda: i.next(), b''): disregard(e)"
    10 loops, best of 3: 58.3 msec per loop

    $ ptest 2 "i=iter(x)" \
      "while 1:" " e = i.next()" " if e==b'': break" " disregard(e)"
    10 loops, best of 3: 53.2 msec per loop

    $ ptest 2 "i=iter(x)" \
      "while True:" " e = i.next()" " if e==b'': break" " disregard(e)"
    10 loops, best of 3: 62.5 msec per loop

    $ ptest 2 "i=iter(x)" \
      "any(disregard(c) for c in iter(lambda: i.next(), b''))"
    10 loops, best of 3: 63.3 msec per loop

Notes:
- There isn't that much variation between the various forms, so it mainly comes down to a matter of style
- The full functional form (the last one) is a bit slower than the rest, which is surprising
- The fastest is the "while 1" loop, which is noticeably faster than the equivalent "while True" loop
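The any() based forms only visit every element because the function being called returns None. A minimal sketch of that dependency follows; the helper names consume_with_any, returns_none and returns_item are made up for illustration and appear nowhere in the script.

```python
from __future__ import print_function

def consume_with_any(func, items):
    """Call func on each item through any(); return how many calls were made."""
    calls = [0]  # a list, so the nested function can update it on python 2 too

    def counted(item):
        calls[0] += 1
        return func(item)

    any(counted(item) for item in items)
    return calls[0]

def returns_none(item):
    pass  # like update(), implicitly returns None

def returns_item(item):
    return item  # a truthy result makes any() stop early

all_seen = consume_with_any(returns_none, [1, 2, 3])
stopped_early = consume_with_any(returns_item, [0, 1, 2, 3])
print(all_seen, stopped_early)
```

With a function that can return a true value, any() stops at the first such result and the remaining items are never processed, so this idiom is only safe with functions like update() that always return None.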
    $ ptest 3 "i=iter(x)" \
      "any(map(disregard, iter(lambda: i.__next__(), b'')))"
    10 loops, best of 3: 67.2 msec per loop

    $ ptest 3 "i=iter(x)" \
      "for e in iter(lambda: i.__next__(), b''): disregard(e)"
    10 loops, best of 3: 67.9 msec per loop

    $ ptest 3 "i=iter(x)" \
      "while 1:" " e = i.__next__()" " if e==b'': break" " disregard(e)"
    10 loops, best of 3: 44.4 msec per loop

    $ ptest 3 "i=iter(x)" \
      "while True:" " e = i.__next__()" " if e==b'': break" " disregard(e)"
    10 loops, best of 3: 44.4 msec per loop

    $ ptest 3 "i=iter(x)" \
      "any(disregard(c) for c in iter(lambda: i.__next__(), b''))"
    10 loops, best of 3: 82.2 msec per loop

Notes:
- All functional forms are a bit slower in python 3 than python 2
- The imperative form is faster in python 3 than python 2
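- python3 removes the anomaly in python 2 between "while 1" and "while True" performance, since True is a keyword there rather than a name that is looked up on each iteration
- The iterator method is named __next__() in python 3 rather than next(); the next() builtin, available since python 2.6, works with both
- map is now lazy, so itertools.imap is no longer needed and has been removed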
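For a single set of statements that runs on both versions, the next() builtin can replace the version specific method names. The following is only a sketch of driving timeit from python rather than from the shell wrapper; the shorter list, the loop counts and the `forms` and `results` names are assumptions chosen to keep the run quick, so the numbers it prints are not comparable to the timings above.

```python
from __future__ import print_function
import timeit

# Same idea as the ptest setup, with a shorter list (an assumption for speed)
setup = "x=[None]*1000+[b'']\ndef disregard(e): pass"

# next(i) is a builtin since python 2.6, so these run on python 2 and python 3
forms = {
    "for": "i=iter(x)\nfor e in iter(lambda: next(i), b''): disregard(e)",
    "while": ("i=iter(x)\nwhile True:\n    e = next(i)\n"
              "    if e==b'': break\n    disregard(e)"),
    "any": "i=iter(x)\nany(disregard(c) for c in iter(lambda: next(i), b''))",
}

number = 100  # loops per timing run
results = {}
for name in sorted(forms):
    # best of 3 runs, as the timeit command line reports
    best = min(timeit.repeat(forms[name], setup=setup, repeat=3, number=number))
    results[name] = best
    print("%-5s %.3f msec per loop" % (name, best / number * 1000))
```

Going through the next() builtin adds a function call compared to calling the method directly, so this form is better suited to comparing the loop styles with each other than to reproducing the figures above.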
© Oct 31 2012