There are various ways to use parallel processing in UNIX.

Counting lines in parallel

The examples below compare several of these methods, applied to the task of counting lines in a file.

First of all let's generate some test data. We use files with both long and short lines so we can compare the overhead of the various methods against the core cost of the operation being performed:

$ seq 100000000 > lines.txt  # 100M lines
$ yes $(yes longline | head -n9) | head -n10000000 > long-lines.txt  # 10M lines

We'll also define a helper function to add a list of numbers:
$ add() { paste -d+ -s | bc; }
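For example, summing a short list:
$ printf '%s\n' 1 2 3 | add
6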

Note the following runs were done against cached files, and thus are not I/O bound. Therefore we limit the number of parallel processes to $(nproc), though you would generally benefit from raising that if your jobs wait on the network or disk, etc.

wc -l

We'll use this command to count lines for most of the methods below, so here is the baseline single-process performance for comparison:
$ time wc -l lines.txt
real	0m0.559s
user	0m0.399s
sys	0m0.157s

$ time wc -l long-lines.txt
real	0m0.263s
user	0m0.102s
sys	0m0.158s
Note the distro version (v8.25), not being compiled with -march, is significantly slower, but only for the short-line case. We won't use the distro version in the following tests.
$ time fedora-25-wc -l lines.txt
real	0m1.039s
user	0m0.900s
sys	0m0.134s

turbo-linecount

turbo-linecount is an example of multi-threaded processing of a file.
$ time tlc lines.txt
real	0m0.536s  # third fastest
user	0m1.906s  # but a lot less efficient
sys	0m0.100s

$ time tlc long-lines.txt
real	0m0.146s  # second fastest
user	0m0.336s  # though less efficient
sys	0m0.110s

split -n

Note that using -n alone is not enough to parallelize. For example, the following will process each chunk serially: since --filter may write files, -n pertains to the number of files to split into rather than the number of chunks to process in parallel.
$ time split -n$(nproc) --filter='wc -l' lines.txt | add
real	0m0.743s
user	0m0.495s
sys	0m0.702s

$ time split -n$(nproc) --filter='wc -l' long-lines.txt | add
real	0m0.540s
user	0m0.155s
sys	0m0.693s
You can either run multiple invocations of split in parallel, each on a separate portion of the file, like:
$ time for i in $(seq $(nproc)); do
    split -n$i/$(nproc) lines.txt | wc -l&
  done | add
real	0m0.432s  # second fastest

$ time for i in $(seq $(nproc)); do
    split -n$i/$(nproc) long-lines.txt | wc -l&
  done | add
real	0m0.266s  # third fastest
Or split can do parallel processing in round-robin mode, distributing successive lines to each filter process in turn, but that has huge overhead in this case. (Note also the -u option, which is significant with -nr):
$ time split -nr/$(nproc) --filter='wc -l' lines.txt | add
real	0m4.773s
user	0m5.678s
sys	0m1.464s

$ time split -nr/$(nproc) --filter='wc -l' long-lines.txt | add
real	0m1.121s  # significantly less overhead for longer lines
user	0m0.927s
sys	0m1.339s
Round robin would only be useful when the processing per item is significant.
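For instance, here is a sketch of a case where round robin could pay off, since factoring each number is far more expensive than distributing the line (illustrative only, not benchmarked here):
$ split -nr/$(nproc) -u --filter='factor | wc -l' lines.txt | add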

parallel

parallel(1) isn't well suited to processing a single large file, being focused instead on distributing multiple files to commands. It can't efficiently split the input for lightweight processing when reading sequentially from a pipe:
$ time parallel --will-cite --block=200M --pipe 'wc -l' < lines.txt | add
real	0m1.863s
user	0m1.192s
sys	0m2.542s
Though it does support processing parts of a seekable file in parallel, with the --pipepart option (added in version 20161222):
$ time parallel --will-cite --block=200M --pipepart -a lines.txt 'wc -l' | add
real	0m0.693s
user	0m0.941s
sys	0m1.142s
We can also use parallel(1) to drive split(1), similarly to the for loop construct above. It's a little awkward and slower, but it does demonstrate the flexibility of the parallel(1) tool:
$ time parallel --will-cite --plus 'split -n{%}/{##} {1} | wc -l' \
       ::: $(yes lines.txt | head -n$(nproc)) | add
real	0m0.656s
user	0m0.949s
sys	0m0.944s

xargs -P

Like parallel, xargs is designed to distribute separate files to commands, and with the -P option it can do so in parallel. If you have a large file then it may be beneficial to presplit it, which could also help with I/O bottlenecks if the pieces are placed on separate devices:
$ split -d -n l/$(nproc) lines.txt l.
$ split -d -n l/$(nproc) long-lines.txt ll.
Those pieces can then be processed in parallel like:
$ time find -maxdepth 1 -name 'l.*' |
  xargs -P$(nproc) -n1 wc -l | cut -f1 -d' ' | add
real	0m0.267s  # joint fastest
user	0m0.760s
sys	0m0.262s

$ time find -maxdepth 1 -name 'll.*' |
  xargs -P$(nproc) -n1 wc -l | cut -f1 -d' ' | add
real	0m0.131s  # joint fastest
user	0m0.251s
sys	0m0.233s
If your files aren't already split to match the number of processors, then you will probably want to adjust -n1 to batch more files together, reducing the total number of processes run, as sketched below. Note you should always specify -n with -P to avoid xargs accumulating too many input items, which would impair the parallelism of the processes it runs.
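For example, a rough sketch of batching two files per invocation; piping the batch through cat makes each invocation output a single number, avoiding wc's per-file counts and "total" line (the -n2 value is arbitrary here):
$ find -maxdepth 1 -name 'l.*' |
  xargs -P$(nproc) -n2 sh -c 'cat "$@" | wc -l' _ | add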

make -j

make(1) is generally used to process disparate tasks, though it can be leveraged to provide low-level parallel processing on a bunch of files. Note also the make -O option, which avoids the need for commands to output their data atomically by letting make do the synchronization. We'll process the presplit files generated for the xargs example above, and to support that we'll use the following Makefile:
%: FORCE     # Always run the command
	@wc -l < $@
FORCE: ;
Makefile: ;  # Don't include Makefile itself
One could generate this on the fly and pass it to make(1) with the -f option (a sketch is shown at the end of this section), though we'll keep it as a separate Makefile here for simplicity. This performs very well and matches the performance of xargs:
$ time find -name 'l.*' -exec make -j$(nproc) {} + | add
real	0m0.269s  # joint fastest
user	0m0.737s
sys	0m0.292s

$ time find -name 'll.*' -exec make -j$(nproc) {} + | add
real	0m0.132s  # joint fastest
user	0m0.233s
sys	0m0.256s
Note we use the POSIX-specified "find ... -exec ... {} +" construct rather than conflating the example with xargs. Like xargs, this construct passes as many files as possible to each make invocation, which make(1) then processes in parallel.
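As mentioned above, rather than maintaining a separate Makefile one could generate the rules on the fly and pass them to make with -f. A rough sketch, assuming GNU make and a hypothetical count.mk file name (not benchmarked here); the final count.mk: ; rule plays the same role as the Makefile: ; rule above:
$ printf '%%: FORCE\n\t@wc -l < $@\nFORCE: ;\ncount.mk: ;\n' > count.mk
$ find -name 'l.*' -exec make -j$(nproc) -f count.mk {} + | add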
© Aug 20 2017