How the GNU coreutils are tested

This article details some of the tools and techniques we use to test the GNU coreutils project. It should also illustrate useful ways to automate tools like gdb, strace, valgrind, sed, grep, and the coreutils themselves, whether for testing or for other applications. We also describe general techniques, such as using timeouts in a robust and performant way.

Test framework

We use automake's test framework, including its "color-tests" and "parallel-tests" options, which supports running generic test scripts. Our test scripts are generally shell scripts, which makes a lot of sense since the coreutils themselves are designed to be used from the shell; consequently we get a lot of secondary testing from the ancillary operations surrounding the primary command being tested in each script.

Generally one invokes the test suite with make check, either when developing or after the build step when building coreutils for your system. You can also run individual tests when debugging or developing, e.g.:
make TESTS=path/to/test/script SUBDIRS=. VERBOSE=yes check

Note the GNU coreutils test suite is useful independently from the coreutils project (with some caveats), since the utilities under test are identified using the $PATH. That allows one to swap in other implementations of these utilities, to test conformity to the GNU coreutils implementation.
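
A minimal sketch of that mechanism (the stub utility and directory here are purely illustrative):

```shell
#!/bin/sh
# The test scripts find utilities via $PATH, so prepending a directory
# swaps in alternative implementations.  Here we shadow 'seq' with a
# stub just to demonstrate the lookup order.
altdir=$(mktemp -d)
printf '#!/bin/sh\necho stub-seq\n' > "$altdir/seq"
chmod +x "$altdir/seq"
PATH=$altdir:$PATH
export PATH
out=$(seq 3)    # resolves to the stub, as the test suite would
rm -rf "$altdir"
echo "$out"     # prints "stub-seq"
```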

Performance

It's important that the test suite runs in a reasonable amount of time, both to increase the chances of the tests actually being run and to avoid impacting developer flow too much. To that end we support running tests in parallel, and also categorize some tests as "EXPENSIVE" or "VERY_EXPENSIVE"; these are not normally run.

Parallel testing neatly leverages make's parallel support and is enabled with make -j $(nproc) check. One just has to be careful to split large tests into more granular ones. Note $(nproc) is about the right level of parallelism for these tests, with diminishing returns beyond that. Current performance and test counts on a 40 core system are:

$ nproc
40
$ time make -j $(nproc) check SUBDIRS=.
13s
  # TOTAL: 590
  # PASS:  483
  # SKIP:  107
$ time make -j $(nproc) check
21s (including these additional gnulib tests)
  # TOTAL: 311
  # PASS:  292
  # SKIP:  19
$ time make -j $(nproc) check RUN_EXPENSIVE_TESTS=yes
1m22.244s (for 9 extra expensive tests)
$ time make -j $(nproc) check \
    RUN_EXPENSIVE_TESTS=yes RUN_VERY_EXPENSIVE_TESTS=yes
6m2.051s (for 10 extra very expensive tests)
  # SKIP:  55
Maintaining accuracy and reliability in these tests at this level of performance can be tricky, and we detail below the various techniques used to achieve that.

Truncated exponential backoff

It can be quite tricky to test asynchronous code without introducing large delays that slow down the test suite. A very useful technique we've used to avoid that is the retry_delay_ function, which tries an operation with an initially small timeout that usually suffices, and if not, retries the operation with an increasing timeout.

retry_delay_ is used for operations that require a delay to pass. For operations that should fail after a timeout, i.e. tests that are protected against hanging, we use the timeout command with a value of 10s before failing the test.
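
The backoff idea can be sketched as follows (a simplified model of the retry_delay_ helper in the tests' init.cfg; the interface and names here are illustrative):

```shell
#!/bin/sh
# Run "$func $delay" up to $attempts times, doubling the delay passed
# to the operation on each failure (truncated exponential backoff).
retry_delay_sketch() {
  func=$1 delay=$2 attempts=$3
  while test "$attempts" -gt 0; do
    "$func" "$delay" && return 0    # the operation waits up to $delay
    delay=$(awk -v d="$delay" 'BEGIN { print d * 2 }')
    attempts=$((attempts - 1))
  done
  return 1
}

# Example operation: only succeeds once enough cumulative time passed.
tries=0
wait_for_event_() { tries=$((tries + 1)); sleep "$1"; test "$tries" -ge 3; }
retry_delay_sketch wait_for_event_ .1 5 && echo synchronized
```

In the common case the first small delay suffices and the test stays fast; only on slow or loaded systems do the longer retries kick in.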

Responsive idempotence

This is the idea that the tests can run without stateful side effects on the system, and that a run can be killed within a reasonable amount of time, also without side effects.

Each test runs in its own directory, which is removed when finished. SIGINT and SIGTERM are handled appropriately so that cleanup happens on Ctrl-C or if the test suite is otherwise terminated. Sometimes we need to take explicit action to be responsive to Ctrl-C; for example, one very expensive sort(1) test uses timeout --foreground to remain responsive. In some cases we also need to be careful not to be too responsive to signals, such as where we disable SIGTTOU for some tests, or disable suspension in isolated cases, to avoid false positive failures due to timeouts.
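
A minimal sketch of that cleanup pattern (heavily simplified from what tests/init.sh actually does; the helper name is illustrative):

```shell
#!/bin/sh
# Run a "test" in its own temporary directory, removed on normal exit
# and on SIGINT/SIGTERM, so that Ctrl-C leaves nothing behind.
run_one_test_() (
  t=$(mktemp -d) || exit 1
  echo "$t" > last_test_dir_        # recorded only for this demo
  trap 'rm -rf "$t"' EXIT INT TERM  # cleanup on exit or termination
  : > "$t/scratch"                  # ... the test's work happens here ...
)
run_one_test_
```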

For asynchronous tests we need to explicitly clean up background processes, so that none are left running after a test finishes or is interrupted.

Handling asynchronous cases is generally tricky; for example when fixing a subtle test race we also noticed a race in bash which was subsequently fixed.

Generally any of these tests have the potential to cause side effects on the system, but some are downright scary and dangerous. For example the rm --no-preserve-root -r '/' test is worth reviewing to see the many protections and techniques in place there, including chickening out entirely if running as root, using gdb to limit calls to unlink(), and using LD_PRELOAD to verify and limit calls.

System tools

loopback mounts are used to test various file systems, given the focus of coreutils on files (having assimilated the fileutils project), and the general file abstraction of UNIX itself. We use generated file systems to ensure extents are supported, or to test global SELinux mount options for example.

/dev/full is a very useful device to simulate a full file system, returning ENOSPC from any system call that might return it. Given the general system flakiness I've seen with a genuinely full file system, lots of software could do with being tested against this device.
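
For example (Linux-specific):

```shell
#!/bin/sh
# /dev/full accepts the open() but fails every write() with ENOSPC,
# simulating a full file system without actually filling one.
if echo data > /dev/full 2>/dev/null; then
  echo 'unexpected: write succeeded'
else
  echo 'write failed with ENOSPC as expected'
fi
```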

LD_PRELOAD wrappers are a handy technique for replacing functionality in shared library routines. We take the simpler approach of only building our shared library wrappers with gcc and particular options, as there is a large disparity in shared library details across systems, as indicated by the complexity of libtool for example. We use wrappers, for example, to provide protection against dangerous syscalls, or to simulate partial failure. Note that special care is sometimes needed to support both 32-bit and 64-bit systems.

gdb can be quite awkward to automate robustly, but we use it in a couple of places: to verify a fix for a race by running a shell script at a breakpoint, and to provide protection by limiting dangerous system calls with Python scripting at a breakpoint. Note that because of inlining, breakpoints must be set on a line rather than on a routine; moreover, some systems provide both inline and non-inline versions of a function in the same binary, so setting a breakpoint doesn't necessarily mean it will be hit. Note also that we avoid sending signals directly to gdb due to its SIGTERM and SIGCONT handling.

strace is used both to check that a particular syscall is not called, and to synchronize test operation on the occurrence of a syscall.

glibc has a feature to return "random" data in heap allocations, which helps detect heap issues that would otherwise go unnoticed because allocated memory is often zero by default. That's enabled through setting the MALLOC_PERTURB_ environment variable.
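
For example (glibc-specific; the value chosen is arbitrary):

```shell
#!/bin/sh
# With a nonzero MALLOC_PERTURB_, glibc fills freshly allocated heap
# memory (and freed memory) with byte patterns derived from the value,
# so code wrongly assuming zeroed heap memory misbehaves detectably.
MALLOC_PERTURB_=87      # any value from 1 to 255
export MALLOC_PERTURB_
sort /dev/null && echo 'ran under heap perturbation'
```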

valgrind is an amazing tool and before release we run all utilities with valgrind. More often we use a more performant and integrated subset of these tests by enabling ASAN in the build. Some tests use valgrind explicitly, when verifying specific memory corruption fixes, or to explicitly ensure no memory leaks. Note the leak checking needs to set an appropriate --leak-check level depending on whether we're doing a development build which deallocates memory in more cases, or a standard build which avoids redundant deallocations right before the process terminates.
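
An explicit invocation might look like the following sketch (options illustrative; as noted, the --leak-check level depends on the build type, and a development build that deallocates everything allows stricter checking):

```shell
#!/bin/sh
# Fail if valgrind detects memory errors; degrade to a plain run where
# valgrind isn't installed (sketch only).
vg=
command -v valgrind >/dev/null &&
  vg='valgrind --quiet --error-exitcode=1 --leak-check=full'
$vg cat /dev/null && echo memcheck-clean
```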

chroot is part of coreutils, so we both test this tool and use it to test others under particular user credentials. There are various commands to set an effective user id, but we use chroot --userspec, providing and implicitly testing that functionality. Previously we had used a non-installed wrapper for this, but implemented it within chroot to avoid that maintenance overhead and to provide this useful functionality more generally.

ulimit is used to run tests under constrained memory conditions, and we have support for determining the base memory limit for a command, which is then used to set appropriate memory limits with ulimit for the commands under test. That allows us to set limits that are both tight and robust.
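
A sketch of the approach (the probing helper here is illustrative; the real helpers in the test suite are more refined):

```shell
#!/bin/sh
# Probe for a rough base virtual-memory limit (in KiB) under which a
# command still works, then run the real test with a little headroom:
# a regression that balloons memory use will then fail cleanly.
min_ulimit_v_sketch() {
  for kb in 5000 10000 20000 40000 80000 160000; do
    if (ulimit -v "$kb" && "$@") >/dev/null 2>&1; then
      echo "$kb"; return 0
    fi
  done
  return 1
}
base=$(min_ulimit_v_sketch head -c1 /dev/zero) &&
(ulimit -v $((base + 8000)) && head -c1 /dev/zero >/dev/null) &&
echo 'ran within memory limit'
```

Each probe runs in a subshell so lowering the limit never affects the parent shell or later probes.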

Portability

The GNU coreutils support a large variety of systems, with the differences between them often being handled transparently by the gnulib project.

Still, in certain cases we need to skip tests, or parts of tests, on particular systems, and various other portability considerations apply beyond that.

pretest VMs

An excellent resource for portability testing, for any project, is the pretest project, which provides easy-to-use virtual machine images for testing various free software GNU/Linux and BSD distros etc. There is even a GUI helper there to generate the appropriate qemu or libvirt commands to run the VMs.

Coverage

Coverage is slowly increasing over time, which is good, though it could be improved. make coverage currently reports:
lines......: 81.3% (42913 of 52795 lines)
functions..: 89.2% (2293 of 2571 functions)
Open Coverage results show branch coverage rather than line coverage, and indicate significantly lower coverage of 44.7%. While we're looking to increase these numbers, our focus is on testing the code being changed or added, and other techniques like fuzz testing are used by ourselves and others to augment test coverage.

Syntax checks

These are higher level checks run with make syntax-check that generally operate on the source code, rather than on the binaries as `make check` does. Note this also means they're targeted at developers, and so can use newer tools and have fewer portability constraints than other make targets and tests. Currently there are 153 such checks, which run in 8 seconds on an older 4 core system.

Some of these checks are general enough to be shared through gnulib with all projects, while others are specific to coreutils.

Third-party testing

Given the ubiquity, compartmentalized nature, and test robustness of the GNU coreutils, others have used coreutils to exercise their own, often very sophisticated, test frameworks. We're very appreciative of these efforts, which have found bugs both new and old.

We received bug reports for tail(1) from a symbolic execution analysis (an extension to KLEE) from Aachen University. This was interesting as it extended KLEE to automatically find unintended non-termination of programs, two cases of which it found in tail(1). This work is as yet unpublished.

An encouraging trend, brought to the fore recently with the release of AFL, is the increased use of fuzz testing. The related AFLFast project was used to catch bugs in split, pr, tac, and tail, with the split bug being caught before it landed in an official release! Note that ASAN and UBSAN are symbiotic with fuzz testing (which essentially improves test coverage), so we keep the coreutils project ASAN/UBSAN clean.

Previously the AFLFast author Marcel Böhme used Regression Tests to Expose Change Interaction Errors to find a bug in cut before it landed in an official release. This method produced a "change sequence graph" for the two versions under consideration, and identified problematic differences between the two.

Miscellaneous notes

The coreutils tests are by no means perfect, and would benefit from improvements in a few areas.
© Jan 23 2017