How the GNU coreutils are tested

Detailed here are some of the tools and techniques we use to test the GNU coreutils project, which should present some useful ways to automate the use of tools like gdb, strace, valgrind, sed, grep, or the coreutils themselves etc., either for testing or for other applications. We also describe general techniques like using timeouts in a robust and performant way.

Test framework

automake's test framework is used, including "color-tests" and "parallel-tests", which supports running generic test scripts. Our test scripts are generally shell scripts, which makes a lot of sense since the coreutils themselves are designed to be used from shell, and therefore we get a lot of secondary testing from the ancillary operations to the primary commands being tested in each script.

Generally one invokes the test suite with make check, either when developing or after the build step when building coreutils for your system. You can also run individual tests when debugging/developing like
make TESTS=path/to/test/script SUBDIRS=. VERBOSE=yes check etc.

Note the GNU coreutils test suite is useful independently from the coreutils project (with some caveats), since the utilities under test are identified using the $PATH. That allows one to swap in other implementations of these utilities, to test conformity to the GNU coreutils implementation.

Performance

It's important to have a test suite that runs in a reasonable amount of time, both to increase the chances of tests being run and to avoid impacting developer flow too much. To that end we support running tests in parallel, and also categorize the slowest tests as "EXPENSIVE" and "VERY_EXPENSIVE", which are not normally run.

Parallel testing neatly leverages make's parallel support and is enabled with make -j $(nproc) check. One just has to take care to split large tests into more granular ones. Note $(nproc) is about the right level of parallelism for these tests, with diminishing returns beyond that. Current performance and test counts on a 40 core system are:

$ nproc
40
$ time make -j $(nproc) check SUBDIRS=.
13s
 # TOTAL: 590
 # PASS:  483
 # SKIP:  107

# time make -j $(nproc) check
21s including these additional gnulib tests
 # TOTAL: 311
 # PASS:  292
 # SKIP:  19

# time make -j $(nproc) check RUN_EXPENSIVE_TESTS=yes
1m22.244s for 9 extra expensive tests
# time make -j $(nproc) check \
  RUN_EXPENSIVE_TESTS=yes RUN_VERY_EXPENSIVE_TESTS=yes
6m2.051s for 10 extra very expensive tests
 # SKIP:  55

[Update Feb 2024: comparing an IBM,9043-MRX (Power10) system.
Note there are more tests run now (on coreutils 9.4)

$ nproc
192
$ time make -j $(nproc) check SUBDIRS=.
5.8s
 # TOTAL: 645
 # PASS:  527
 # SKIP:  118

]

Maintaining accuracy and reliability in these tests at this level of performance can be tricky, and we detail various techniques below to achieve that.

Truncated exponential backoff

It can be quite tricky to test asynchronous code without introducing large delays that slow down the test suite. A very useful technique we use to avoid that is the retry_delay_ function, which tries an operation with a small initial timeout that usually suffices, and if it doesn't, retries the operation with an increasing timeout.
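A minimal sketch of the idea, not the actual retry_delay_ implementation: the hypothetical retry_with_backoff helper below passes a delay to a test function, and doubles that delay after each failure, up to a maximum number of attempts. The helper name, its argument order, and the check_after_delay example are assumptions made for illustration.

```shell
# Hypothetical helper illustrating truncated exponential backoff.
# Usage: retry_with_backoff FUNC INITIAL_DELAY MAX_TRIES
# FUNC is called with the current delay (in seconds) as its only argument.
retry_with_backoff() {
  _func=$1 _delay=$2 _max_tries=$3
  _attempt=1
  while [ "$_attempt" -le "$_max_tries" ]; do
    "$_func" "$_delay" && return 0
    # Double the delay; awk is used since the delay may be fractional.
    _delay=$(awk -v d="$_delay" 'BEGIN { print d * 2 }')
    _attempt=$((_attempt + 1))
  done
  return 1   # truncated: give up after MAX_TRIES attempts
}

# Example operation (an assumption for illustration): a background job
# creates a file after a short while, and the check only passes if the
# file has appeared by the time the given delay has elapsed.
check_after_delay() {
  rm -f ready
  { sleep 0.2; touch ready; } &
  sleep "$1"
  _found=0
  test -e ready || _found=1
  wait   # reap the background job before returning
  return "$_found"
}
```

A test would call this as retry_with_backoff check_after_delay 0.05 6, so the common case costs only a fraction of a second, while a loaded system still gets progressively longer delays before the test is declared failed.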

retry_delay_ is used for operations that need some delay before they can pass. For operations that should instead fail after a timeout, i.e. tests that are protected against hanging, we use the timeout command with a value of 10s, after which the test is failed.
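As a sketch of this hang protection: the timeout command exits with status 124 when the limit is reached, which lets a test distinguish a hang from an ordinary failure. The run_guarded name and the LIMIT variable are assumptions for illustration.

```shell
# Hypothetical wrapper: run a command under a time limit and report hangs.
# LIMIT defaults to 10 seconds; timeout(1) exits with 124 on expiry.
run_guarded() {
  timeout "${LIMIT:-10}" "$@"
  _status=$?
  if [ "$_status" -eq 124 ]; then
    echo "timed out: $*" >&2
  fi
  return "$_status"
}
```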

Responsive idempotence

This is the idea that you can run without stateful side effects on the system, and that you can kill the run within a reasonable amount of time also without side effects.

Each test runs in its own directory, which is removed when finished. SIGINT and SIGTERM are handled appropriately so the cleanup happens for Ctrl-C or if the test suite is otherwise terminated. Sometimes we need to take explicit action to be responsive to Ctrl-C. For example one very expensive sort(1) test uses timeout --foreground to be responsive. Also in some cases we need to be careful to not be too responsive to signals, like where we disable SIGTTOU for some tests, or disable suspension in isolated cases to avoid false positive failures due to timeouts.
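The per-test directory and signal handling can be sketched as follows. This is a simplified illustration, not the test framework's own code; the run_isolated name and the exit codes used are assumptions.

```shell
# Hypothetical helper: run a command in a fresh directory that is removed
# on normal exit, on Ctrl-C (SIGINT), or on SIGTERM.
run_isolated() {
  (
    _dir=$(mktemp -d) || exit 99
    # Remove the directory however the subshell ends.
    trap 'cd / && rm -rf "$_dir"' EXIT
    # Turn signals into a normal exit so that the EXIT trap runs.
    trap 'exit 130' INT
    trap 'exit 143' TERM
    cd "$_dir" || exit 99
    "$@"
    exit $?
  )
}
```

Note the shell only runs a trap once the current foreground command returns, which is one reason a long-running command may need extra care to stay responsive to Ctrl-C.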

For asynchronous tests we need to explicitly clean up background processes so that:

Stray processes aren't left on the system
Files aren't held open causing "silly rename" deletion issues on NFS
Partitions used to run the tests from can be unmounted
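A minimal sketch of such cleanup, assuming a single background process whose PID is kept in a variable. The cleanup_bg name is hypothetical, and sleep stands in for the command under test.

```shell
# Start a long-running background process and remember its PID.
sleep 60 &
pid=$!

# Kill the process and reap it, so nothing is left running or holding
# files open once the test finishes.
cleanup_bg() {
  kill "$pid" 2>/dev/null && wait "$pid" 2>/dev/null
  return 0
}

# Run the cleanup however the script ends.
trap cleanup_bg EXIT
```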

Handling asynchronous cases is generally tricky; for example when fixing a subtle test race we also noticed a race in bash which was subsequently fixed.

Generally any of these tests have the potential to cause side effects on the system, but some are downright scary and dangerous. For example the rm --no-preserve-root -r '/' test is worth reviewing to see the many protections and techniques in place there, including chickening out entirely if running as root, using gdb to limit calls to unlink(), and also using LD_PRELOAD to verify and limit calls.

System tools

loopback mounts are used to test various file systems, given the focus of coreutils on files (having assimilated the fileutils project), and the general file abstraction of UNIX itself. We use generated file systems to ensure extents are supported, or to test global SELinux mount options for example.

/dev/full is a very useful device to simulate a full file system, and returns ENOSPC for any system call that might return that. Given the general system flakiness I've experienced with a really full file system, lots of software could do with testing against this device.
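For example, a check along the following lines verifies that a command notices a failed write. This is a sketch that assumes /dev/full exists on the system; the fails_when_disk_full name is hypothetical.

```shell
# Hypothetical check: succeed only if the command reports failure when
# its standard output cannot be written due to lack of space.
# Assumes the system provides /dev/full.
fails_when_disk_full() {
  if "$@" > /dev/full 2>/dev/null; then
    return 1   # the write error went unnoticed
  fi
  return 0
}
```

A test might then use fails_when_disk_full env printf 'data\n' || fail=1, where env ensures the printf binary is run rather than the shell builtin.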

LD_PRELOAD wrappers are a handy technique to replace functionality in shared library routines. We take the simpler approach of limiting building our shared library wrappers to gcc with particular options, as there is a large disparity between shared library details on various systems, as indicated by the complexity of libtool for example. We use wrappers for example, to provide protection for dangerous syscalls, or to simulate partial failure. Note sometimes special care is needed to support both 32bit and 64bit systems.

gdb can be quite awkward to automate robustly, but we use it in a couple of places to verify a fix for a race by running a shell script at a breakpoint, or to provide protection by limiting dangerous system calls through python scripting at a breakpoint. Note because of inlining, breakpoints must be set on a line rather than a routine, and since some systems provide both inline and non-inline functions in the same binary, setting a breakpoint doesn't necessarily mean that it will be hit. Note also we avoid sending signals directly to gdb due to its SIGTERM and SIGCONT handling.

strace is used to check that a syscall is not called, and to synchronize operations on a syscall. strace is also used to inject faults, which we use to ensure we handle failing syscalls appropriately.
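The "syscall is not called" check can be sketched as below, using strace's -o and -e trace= options to log only the syscall of interest. The never_calls name is hypothetical, as is the choice of 77 as a "skip" status for when strace is unavailable or not permitted to trace.

```shell
# Hypothetical check: succeed if the command never issues SYSCALL.
# Returns 77 if strace can't be used here, so callers can skip the test.
# Usage: never_calls SYSCALL COMMAND [ARG]...
never_calls() {
  _syscall=$1; shift
  # Probe that strace exists, may trace, and knows this syscall name.
  strace -o /dev/null -e trace="$_syscall" true 2>/dev/null || return 77
  _log=$(mktemp) || return 99
  strace -o "$_log" -e trace="$_syscall" "$@" >/dev/null 2>&1
  # Each traced call is logged on a line starting with "name(".
  if grep -q "^$_syscall(" "$_log"; then _ret=1; else _ret=0; fi
  rm -f "$_log"
  return "$_ret"
}
```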

glibc has a feature to return "random" data for heap allocations which helps detect heap issues that would otherwise be undetected due to the allocated memory often being zero by default. That's enabled through setting the MALLOC_PERTURB_ environment variable.
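One way to enable this for everything run from a script is sketched below. Choosing the value with awk is an assumption for illustration; any non-zero byte value enables the feature.

```shell
# Pick a non-zero perturbation byte in the range 1..255 and export it,
# so that all commands started from here inherit the setting.
MALLOC_PERTURB_=$(awk 'BEGIN { srand(); print int(rand() * 255) + 1 }')
export MALLOC_PERTURB_
```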

valgrind is an amazing tool and before release we run all utilities with valgrind. More often we use a more performant and integrated subset of these tests by enabling ASAN in the build. Some tests use valgrind explicitly, when verifying specific memory corruption fixes, or to explicitly ensure no memory leaks. Note the leak checking needs to set an appropriate --leak-check level depending on whether we're doing a development build which deallocates memory in more cases, or a standard build which avoids redundant deallocations right before the process terminates.

chroot is part of coreutils so we both test this tool and use it to test others under particular user credentials. There are various commands to set an effective user id, but we use chroot --userspec to provide and implicitly test that functionality. Previously we had used a non installed wrapper to provide this, but implemented it within chroot to avoid that maintenance overhead, and provide this useful functionality more generally.

ulimit is used to run tests under constrained memory conditions, and we have support for determining the base memory limit for a command, which is then used to set the appropriate memory limits with ulimit for the commands under test. That allows us to set limits that are both tight and robust.
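The idea of determining a base limit can be sketched as a simple upward search. This is not the coreutils helper itself; the min_ulimit_v name, the 1000 KiB step, and the 100000 KiB cap are assumptions for illustration.

```shell
# Hypothetical helper: print the smallest virtual memory limit (in KiB,
# searched in steps of 1000) under which the given command succeeds.
# Usage: min_ulimit_v COMMAND [ARG]...
min_ulimit_v() {
  _v=1000
  while [ "$_v" -le 100000 ]; do
    # The limit is set in a subshell so it doesn't affect the caller.
    if ( ulimit -v "$_v" && "$@" ) >/dev/null 2>&1; then
      echo "$_v"
      return 0
    fi
    _v=$((_v + 1000))
  done
  return 1   # no workable limit found, or ulimit -v is unsupported
}
```

A test can then run the command under test with a small fixed amount added to this base, e.g. ( ulimit -v $(($(min_ulimit_v env true) + 4000)) && env true ), so that the limit adapts to the system rather than being hard-coded.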

Portability

The GNU coreutils support a large variety of systems, with the differences between them often being handled transparently by the gnulib project.

Still in certain cases we need to skip (parts of) tests, like:

skip remote file systems due to performance/functionality
skip when the binary is configured not to be built
skip when the feature is indicated not available in config.h

Other portability considerations are:

We automatically pick a shell with sufficient support for the test suite, through the use of the gnulib posix-shell module, and some other shell compatibility checks. Also if explicitly calling a subshell we do so through the $SHELL variable.
Ensure we avoid shell builtins through the use of the env utility
Various file system variations, like GPFS having a large st_blksize, or avoiding speculative preallocation on XFS etc.
Various shell portability issues, like avoiding :>file to touch files due to various issues, and avoiding $(< file) to read files, which breaks with dash.
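Two of these idioms, sketched in isolation. The temporary file and variable names are placeholders.

```shell
tmp=$(mktemp) || exit 1

# Run the printf binary found on PATH, rather than the shell builtin.
env printf '%s\n' 'hello' > "$tmp"

# Read the file portably with cat, rather than with $(< file).
contents=$(cat "$tmp")

rm -f "$tmp"
```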

pretest VMs

An excellent resource for portability testing for any project is the pretest VMs, which provide easy to use virtual machine images for testing various free software GNU/Linux and BSD distros etc. There is even a GUI helper there to generate the appropriate qemu or libvirt commands to run the VMs.

Coverage

Coverage is slowly increasing over time which is good, though could be improved. make coverage currently reports:

lines......: 81.3% (42913 of 52795 lines)
functions..: 89.2% (2293 of 2571 functions)

Open Coverage results show branch coverage rather than line coverage, and indicate a significantly lower coverage of 44.7%. Note while we're looking to increase these numbers, we're concentrating on testing the minimal amount of code being changed/added, noting that other techniques like fuzz testing are used by ourselves and others to augment test coverage.

Syntax checks

These are higher level checks run with make syntax-check that generally operate on the code, rather than on the binaries as make check does. Note this also means they're targeted at developers and so can use newer tools and have fewer portability constraints than other make targets and tests. Currently there are 153 tests which run in 8 seconds on an older 4 core system.

Some tests are general enough to be shared in gnulib to all projects, like:

tight_scope ensures items aren't exported from C modules unless needed
require_test_exit_idiom ensures test shell scripts exit robustly
space_tab maintains spacing by disallowing a TAB to follow a SPACE
...
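To give a flavour of such checks, here is a shell sketch of the space_tab rule. It is an illustration of the idea rather than the gnulib implementation, and the check_space_tab name is hypothetical.

```shell
# Hypothetical check: fail if any given file contains a SPACE immediately
# followed by a TAB, printing the offending lines with line numbers.
check_space_tab() {
  _pattern=$(printf ' \t')
  if grep -n "$_pattern" "$@"; then
    echo 'found SPACE followed by TAB' >&2
    return 1
  fi
  return 0
}
```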

Types of coreutils specific syntax checks include:

prohibit-form-feed to enforce layout style
prohibit-c99-printf-format for runtime portability not handled by gnulib
error_quotes enforcing runtime error format
prohibit_test_background_without_cleanup_ ensuring test robustness
...

Third-party testing

Given the ubiquity, compartmentalized nature, and test robustness of the GNU coreutils, others have used coreutils to exercise their own, often very sophisticated, test frameworks. We're very appreciative of these efforts which have found bugs both new and old.

We received bug reports for tail(1) from symbolic liveness analysis (an extension to KLEE (symbolic execution analysis)), from Aachen University. This was interesting as it extended KLEE to automatically find unintended non-termination of programs, for which it found two cases in tail(1). [Update Sep 2018: This work is now published].

An encouraging trend brought to the fore recently with the release of AFL, is the increased use of fuzz testing. AFL was used to find a TZ parsing bug in date for example. The related AFLFast project was used to catch bugs in split, pr, tac, and tail, with the split bug being caught before it landed in an official release! Note ASAN and UBSAN are symbiotic with the use of fuzz testing (used essentially to improve test coverage), so we keep the coreutils project ASAN/UBSAN clean.

Previously the AFLFast author Marcel Böhme used Regression Tests to Expose Change Interaction Errors to find a bug in cut before it landed in an official release. This method produced a "change sequence graph" for the two versions under consideration, and identified problematic differences between the two.

Miscellaneous notes

Helper utilities are used in a few cases, though they are best avoided, as we did in the chroot --userspec case above. Also we prefer to use scripting where available to ease maintenance, like with the python d_type checks for example, which use python to call directly into C level routines. A useful compiled tool we do use is getlimits, which outputs a list of compile time constants which are then consumed by shell or perl scripts for the common task of testing tools with these limits. Because it's only accurate at compile time, we don't release this tool generally.
Helper shell functions are used extensively. For example we use the returns_ function to make the construct command && fail=1 more concise, but more importantly more robust against crashes, since we're checking for a specific return code, and thus not silently hiding crashes, which may occur at any time, but especially with process abort() due to ASAN failure etc.
Useful logs are supported by enabling set -x to trace the test script executing with make check VERBOSE=yes. But note we must be careful to avoid such tracing in wrappers that might adversely affect error comparisons, like in the case of the returns_ function above. We use braces to avoid the catch-22 of tracing the command to disable tracing. Another log-conscious technique is using the compare function to check that a file is empty, so that the file contents are logged on failure.
Runtime program analysis really benefits from a robust existing test suite to run a significant amount of your code through the various runtime verification tools like valgrind (including helgrind/drd thread checks), and GCC/clang's ASAN and UBSAN modes. Note coreutils is currently ASAN and UBSAN clean when running the test suite.
gnulib tests are bundled together with coreutils', and are generally written in C given their lower level nature. The SUBDIRS parameter in make SUBDIRS=. check will skip gnulib's tests and run only the coreutils ones, while omitting that parameter will run all tests. Gnulib's tests are maintained independently of coreutils. To learn more, see this discussion on running tests as a gnulib developer.
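The returns_ helper and the brace technique mentioned above can be sketched roughly as follows. This is a simplified illustration rather than the coreutils implementation; the expect_status name is hypothetical, and unlike a full version it does not re-enable tracing afterwards.

```shell
# Hypothetical helper: succeed only if the command exits with exactly
# the expected status, so a crash is not mistaken for an expected failure.
# Usage: expect_status STATUS COMMAND [ARG]...
expect_status() {
  # Disable tracing so it can't pollute the stderr of the wrapped command.
  # The braces with a redirection hide the trace of 'set +x' itself.
  { set +x; } 2>/dev/null
  _expected=$1; shift
  "$@"
  [ "$?" -eq "$_expected" ]
}
```

With this, expect_status 1 command || fail=1 fails the test if the command succeeds, but also if it exits with any other status, such as 134 after an abort().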

TODO:

The coreutils tests are by no means perfect, and would benefit from improvements in a few areas:

improved coverage
improved mocking of file system details with a fuse file system emulator along the lines of petardfs and CharybdeFS
a more general C library mocking system along the lines of cmocka
the addition of more of our test matrix to our continuous testing system, like "root user", more platforms, different shells, with and without tty, VPATH build, ...