Benchmarking tools & techniques

The basic principle of benchmarking is simple: measure the change in a system relative to a reference. However, this is difficult on modern, complex systems because it's hard to minimize external influences.

Benchmarking can be a useful check for determining how the performance of a component changes over time. But one must be extremely careful, and should already understand the operation of the component under test well enough to interpret the results effectively.

Here are a few useful general techniques to employ when benchmarking, with GNU/Linux examples provided. See also profiling techniques which can obviate the need for some of the following considerations.

Minimize run time

If you run for a short time, then there is less chance the system will start doing something else, thus minimizing variance in your results. There is a happy medium though: if the run is too short, start-up overhead and its variance will dominate the measurement.

For example, I was testing a change in the `join` command with the output of yes aaaaaaaaaa | head -n10000 | tee a1 > a2. That resulted in 10000*10000 output records, which took 10s and was very variable. Changing the input data (it was input parsing code I was testing) to seq -f%15.0f 1000000 | tee a1 > a2 increased the input 100 times, but still reduced the output considerably resulting in a much more stable 0.8s run.
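One way to judge whether a workload sits in that happy medium is to repeat it a few times and look at the spread of the timings. The following is a minimal sketch, not a tool from this article: the helper names time_command and relative_spread are hypothetical, and the workload shown is only a placeholder for the command under test.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=5):
    """Run argv `runs` times and return the wall-clock duration of each run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def relative_spread(durations):
    """Sample standard deviation as a fraction of the mean (needs >= 2 runs)."""
    return statistics.stdev(durations) / statistics.mean(durations)


if __name__ == "__main__":
    # Placeholder workload: substitute the command under test.
    times = time_command([sys.executable, "-c", "sum(range(100000))"])
    print(f"fastest {min(times):.3f}s, spread {relative_spread(times):.1%}")
```

A large relative spread suggests the input should be reshaped, as with the join example above, before any conclusions are drawn.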

Disable power management

Another reason for running for a short time is that some systems can only run full tilt for a couple of minutes before they overheat and start throttling (a good test to do before buying a laptop). In general power management is problematic for benchmarking and performance. For example ASPM on network cards adds latency. To disable power saving on a wireless NIC for example, use iwconfig wlan0 power off.

CPU throttling is also a common impediment to benchmarking, but is easy to disable. For example on Fedora do /etc/init.d/cpuspeed stop, or on older Ubuntu releases at least, one could do /etc/init.d/powernowd stop.
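Before a run it can be worth checking whether CPU frequency scaling is still active. The sketch below does not use the cpuspeed or powernowd daemons mentioned above. It assumes instead a kernel that exposes the cpufreq sysfs files under /sys/devices/system/cpu, and the function names are hypothetical.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs location; absent on systems without cpufreq.
CPU_SYSFS = Path("/sys/devices/system/cpu")


def scaling_governors(root=CPU_SYSFS):
    """Map each CPU directory name (e.g. "cpu0") to its current cpufreq governor."""
    governors = {}
    for path in sorted(root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        governors[path.parent.parent.name] = path.read_text().strip()
    return governors


def not_pinned_to_performance(governors):
    """Return the CPUs whose governor may still change frequency during a run."""
    return [cpu for cpu, gov in governors.items() if gov != "performance"]


if __name__ == "__main__":
    print(not_pinned_to_performance(scaling_governors()))
```

An empty result means every CPU reports the "performance" governor. An empty mapping from scaling_governors() simply means the sysfs files were not found.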

Simulate other conditions

Given the ubiquity of networking in the contemporary computing environment, benchmarking network performance is important, especially given the difference between localhost, LAN and WAN performance characteristics. A useful technique is to simulate a WAN interface locally, as follows.

Add 20ms latency to loopback device:
tc qdisc add dev lo root handle 1:0 netem delay 20msec
And to restore:
tc qdisc del dev lo root
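To confirm that an added delay is actually in effect, one can time a small echo over the loopback device before and after running the tc command. This is a minimal sketch with hypothetical names (loopback_rtt and _echo_once). It measures a 1-byte TCP round trip over 127.0.0.1 and needs no root access itself.

```python
import socket
import threading
import time


def _echo_once(server):
    """Accept one connection and echo everything back until the peer closes."""
    conn, _ = server.accept()
    with conn:
        while True:
            data = conn.recv(64)
            if not data:
                break
            conn.sendall(data)


def loopback_rtt(samples=5):
    """Return round-trip times in seconds for 1-byte echoes over 127.0.0.1."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
        server.listen(1)
        thread = threading.Thread(target=_echo_once, args=(server,), daemon=True)
        thread.start()
        rtts = []
        with socket.create_connection(server.getsockname()) as client:
            # Send each byte immediately rather than waiting to coalesce writes.
            client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            for _ in range(samples):
                start = time.perf_counter()
                client.sendall(b"x")
                client.recv(1)
                rtts.append(time.perf_counter() - start)
        thread.join()
    return rtts


if __name__ == "__main__":
    print(f"median-ish RTT: {sorted(loopback_rtt())[2] * 1000:.2f} ms")
```

Since both the request and the reply cross the loopback device, the round trip should grow by roughly twice the configured one-way delay.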

Minimize external code

Your system may implicitly invoke large amounts of code, and there are often modes one can select to help restrict to just the logic under test.

For example, setting the LANG=C environment variable can usually greatly reduce the processing that needs to be done by most text utilities. Also one should ensure that the MALLOC_PERTURB_ environment variable is not set, as it causes malloc to write to each allocated buffer, which is slow for large buffers.
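When the benchmark is driven from a script, these settings can be applied to the child process only, leaving the invoking shell untouched. A minimal sketch follows, with the hypothetical helper names benchmark_env and run_for_benchmark. Setting LC_ALL in addition to LANG is an assumption beyond the text above, made because LC_ALL takes precedence over LANG and the other LC_* variables.

```python
import os
import subprocess


def benchmark_env(base=None):
    """Copy an environment, forcing the C locale and dropping MALLOC_PERTURB_."""
    env = dict(os.environ if base is None else base)
    env["LANG"] = "C"
    # Assumption beyond the article: also set LC_ALL, which overrides LANG
    # and any inherited LC_* variables.
    env["LC_ALL"] = "C"
    env.pop("MALLOC_PERTURB_", None)
    return env


def run_for_benchmark(argv):
    """Run argv with the cleaned environment; raises if the command fails."""
    return subprocess.run(argv, env=benchmark_env(), check=True)
```

For example, run_for_benchmark(["sort", "a1"]) would sort the a1 file from the earlier example in the C locale.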

Also one can minimize the effect of other processes by giving higher priority to the process being measured. For example, taskset -c 0 chrt -f 99 command will run on a single CPU with the highest priority. Note this needs root access, and it runs with such high priority that it excludes CPU speed daemons etc., so they will need to be disabled first, as mentioned above.
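A benchmark written in Python can request the same treatment for itself through the os module on Linux. This is a sketch under assumptions: the function name pin_and_prioritize is hypothetical, and it picks the lowest CPU the process is allowed to use rather than always CPU 0, since CPU 0 may be unavailable inside a container.

```python
import os


def pin_and_prioritize(priority=99):
    """Pin this process to one CPU and try to switch it to SCHED_FIFO.

    Returns (cpu, realtime): the chosen CPU and whether SCHED_FIFO was granted.
    """
    cpu = min(os.sched_getaffinity(0))      # lowest CPU we are allowed to use
    os.sched_setaffinity(0, {cpu})          # like: taskset -c <cpu>
    try:
        # like: chrt -f <priority>; normally requires root or CAP_SYS_NICE
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        realtime = True
    except PermissionError:
        realtime = False
    return cpu, realtime
```

Without sufficient privileges the process is still pinned to one CPU but keeps its normal scheduling policy, which the returned flag makes visible.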

Be aware of caching

Caching is a ubiquitous technique used to improve performance, so accounting for it is vital when taking measurements. One often wants to prime the caches, so as to minimize external delays and variances such as disk accesses. In that case one would run the program under test a couple of times, and only then start measuring.
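The priming step is easy to forget, so it can help to build it into the timing helper. A minimal sketch follows, where time_warm is a hypothetical name and func stands for whatever operation is being measured.

```python
import time


def time_warm(func, warmups=2, runs=5):
    """Call func `warmups` times unmeasured, then return `runs` timed durations."""
    for _ in range(warmups):
        func()                      # unmeasured: primes caches and lazy init
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations
```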

If you're measuring worst case performance, or disk access patterns perhaps, then you will need to clear the caches used on the system. Here are some tips for clearing the file cache on GNU/Linux:

With coreutils >= 8.11 you can drop the cache for a file using dd of=file oflag=nocache conv=notrunc,fdatasync count=0
With Linux >= 2.6.16 one can drop the entire page cache using sync; echo 1 > /proc/sys/vm/drop_caches
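The per-file case can also be requested from within a program. The sketch below uses posix_fadvise with POSIX_FADV_DONTNEED, available in Python's os module on Linux. The function name drop_file_cache is hypothetical, and the call is only advice to the kernel, so it is not guaranteed that every page is dropped.

```python
import os


def drop_file_cache(path):
    """Ask the kernel to drop cached pages for one file (advisory only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fdatasync(fd)    # dirty pages cannot be dropped, so flush them first
        # Offset 0 with length 0 covers the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

Unlike writing to /proc/sys/vm/drop_caches, this needs no root access and leaves the cached data of other files alone.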

Be aware of alignment

Measurement bias is a very common issue, where a small change in the system, often changing memory alignment in the system under test, can cause large differences in results.

For example, I can't benchmark on my laptop across a suspend/resume cycle, as I invariably get a significant performance difference after doing that. This implies one should be very wary of comparing results taken at different times.

Another example is where I got a random 400% difference in performance in a strstr() implementation when I compiled with gcc's default settings. I had thought initially that my code change had made a huge improvement, but further testing revealed that just moving code around caused as much of a performance change. Only after specifying -march=pentium-m to gcc, to generate code specific to my platform, did the performance stabilize.

Discount large values

If something external to the system under test does happen to interfere, then it can often result in a large drop in performance. So as to bias results towards runs that were not preempted by other processes or operating on uncached data, one should at least discard slow outliers. python -m timeit (see python -m timeit -h) is a facility included with Python that does this automatically, by timing several batches of runs and reporting the average time per run of the fastest batch. There are also more general tools like dumbbench that do something similar for arbitrary commands.
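For timings collected by other means, discarding slow outliers can be sketched in a few lines. This is not the algorithm used by timeit or dumbbench. The cutoff of 1.5 times the median is an arbitrary assumption, and the function names are hypothetical.

```python
import statistics


def discard_slow(durations, factor=1.5):
    """Keep only runs no slower than `factor` times the median duration."""
    cutoff = statistics.median(durations) * factor
    return [d for d in durations if d <= cutoff]


def robust_mean(durations, factor=1.5):
    """Mean of the runs that survive discard_slow()."""
    return statistics.mean(discard_slow(durations, factor))
```

The median is used for the cutoff because, unlike the mean, it is barely moved by the very outliers being removed.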

Now there is the argument that, to best minimize external influences, one should not average as done above, but instead pick the fastest run. I tend to agree with this, and often use this method. One can adjust the Python timeit module to do this by using the form min(Timer("stmt","setup").repeat(7,100)).
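Spelled out, that form might look like the sketch below. The statement and setup strings are illustrative placeholders only. Note that repeat() returns the total time for each batch, so dividing by the batch size gives a per-execution figure.

```python
import timeit

# Illustrative statement and setup; replace with the code under test.
timer = timeit.Timer(stmt="sorted(data)", setup="data = list(range(1000, 0, -1))")

number = 100                                      # executions per batch
batches = timer.repeat(repeat=7, number=number)   # total seconds for each batch
best = min(batches) / number                      # fastest batch, per execution
mean = sum(batches) / len(batches) / number       # all batches, per execution

print(f"best {best * 1e6:.1f} usec, mean {mean * 1e6:.1f} usec")
```

The gap between the two figures gives a rough idea of how much external interference there was during the run.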

For a detailed analysis comparing "average", "median" and "best" methods etc., please read this excellent paper on performance analysis. It goes on to describe how to report results in a statistically rigorous manner, by using confidence interval plots (using R).
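As a rough illustration of what a confidence interval for a set of timings looks like, here is a minimal sketch in Python rather than R. It is not the method from the paper. It assumes a normal approximation, which gives intervals that are too narrow for small sample counts, where a Student's t critical value would be the more careful choice. The function name is hypothetical.

```python
import math
import statistics


def mean_confidence_interval(samples, confidence=0.95):
    """Return (low, high) for the mean, using a normal approximation."""
    n = len(samples)
    mean = statistics.mean(samples)
    std_error = statistics.stdev(samples) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_error, mean + z * std_error
```

If the intervals for two variants of a program overlap heavily, the measured difference between them may well be noise.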

[Update Jul 2012: Richard WM Jones pointed out the avgtime tool, which presents "average", "median" and "best" stats for a command.]