coreutils inbox - May 2009

Here are some of the enhancements that we're currently working on for the coreutils project. You can see the latest changes as they're added in the NEWS file, and you can subscribe to see those changes.

fallocate()

Newer file systems like ext4 and xfs support extents, which allow the blocks of a file to be laid out contiguously on disk. To support this we're going to add fallocate() to gnulib, which will return ENOTSUP if any of libc, the kernel or the file system doesn't support the fallocate() call. This will allow utilities like cp and mv to call fallocate() unconditionally as a performance enhancement, both to lay out new files efficiently on disk and to get immediate feedback if there is not enough space available to perform the copy.
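To make the "preallocate, then copy" idea concrete, here is a rough shell approximation. It is only a sketch: it assumes the fallocate command-line tool from util-linux is installed (it is not part of coreutils), the helper name copy_prealloc is made up for illustration, and unlike the planned C code it cannot tell "not supported" apart from "no space left", so it ignores any preallocation failure.

```shell
# Hypothetical helper: copy SRC to DST, reserving DST's space up front
# when the fallocate tool (util-linux, assumed to be installed) is available.
copy_prealloc() {
  src=$1
  dst=$2
  size=$(stat -c %s -- "$src") || return 1
  : >"$dst" || return 1          # create (or empty) the destination
  if [ "$size" -gt 0 ] && command -v fallocate >/dev/null 2>&1; then
    # Ask the file system to reserve the whole file now.  Any failure is
    # ignored here; real code would ignore only "not supported" and stop
    # on "no space left on device".
    fallocate -l "$size" -- "$dst" 2>/dev/null || true
  fi
  # conv=notrunc writes into the existing file instead of truncating it,
  # so the reserved blocks are the ones that get filled.
  dd if="$src" of="$dst" bs=64k conv=notrunc 2>/dev/null
}
```

Usage would be something like copy_prealloc big.iso /mnt/backup/big.iso. On a file system without fallocate() support the function degrades to a plain copy.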

posix_fallocate()

This is a higher-level function than fallocate(): if fallocate() is not supported as described above, it allocates the space by writing zeros to the file. We're adding this functionality to gnulib (it's currently implemented in glibc), so that we can add new allocation support to truncate. This will be made available as a new option like: truncate --allocate

cp --attributes-only

This option will create a new empty file with all the metadata of another file. Copying metadata from a source file is awkward otherwise, requiring separate tools like touch and chcon. This option should make it easy to write a replace or inplace script that uses cp --attr --preserve=all to create a temporary file, populates it with the output of an arbitrary filter, and moves it back on top of the original. This common operation is currently surprisingly difficult to achieve using existing tools.
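The inplace script described above might look roughly like the following. This is a minimal sketch, assuming a cp that supports --attributes-only; the function name inplace and its calling convention are invented here for illustration.

```shell
# Hypothetical helper: filter FILE through COMMAND and replace FILE with
# the result.  Usage: inplace FILE COMMAND [ARGS...]
inplace() {
  file=$1
  shift
  # Create the temporary file next to the original so the final mv is a
  # rename within one file system.
  tmp=$(mktemp "$file.XXXXXX") || return 1
  # Copy mode, ownership and other attributes (but no data) onto the
  # temporary file, run the filter into it, then move it into place.
  cp --attributes-only --preserve=all -- "$file" "$tmp" &&
    "$@" <"$file" >"$tmp" &&
    mv -- "$tmp" "$file" || { rm -f -- "$tmp"; return 1; }
}
```

For example, inplace notes.txt tr a-z A-Z upper-cases notes.txt while keeping its permissions. Note that writing the filter output updates the temporary file's modification time, so the copied timestamp does not survive, and a read-only original would need extra handling.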

sort --human

This is an often-requested feature that we've finally capitulated on and will implement. It will allow one to run, for example, du -hs * | sort -h. It's not a general solution, in that we've gone the simple route and assumed that each quantity uses the largest unit that fits, i.e. 5000K and 4M will not be present together in the input. We also flag as an error a mixture of, for example, MB and MiB in the input. Even with these simplifications, the vast majority of input sources are covered.
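A small illustration of the behaviour, assuming a sort that provides the -h option. The sample sizes are made up, and the second command shows why mixing 5000K and 4M breaks the assumption: the unit suffix is compared before the number.

```shell
# Sizes as du -h would print them, each in its largest fitting unit.
sizes='4.0M
512K
2.0G
100'
printf '%s\n' "$sizes" | sort -h
# prints: 100, 512K, 4.0M, 2.0G (one per line)

# Not "appropriately minimised": 5000K is larger than 4M, but K sorts
# before M, so it still comes out first.
printf '5000K\n4M\n' | sort -h
# prints: 5000K, 4M (one per line)
```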

multicore sort

I'm a mentor this year for the coreutils project in Google's Summer of Code program. The project is to enhance the performance of sort on multicore systems. The first part of this project will be to enhance the split utility so that it can easily partition data by number of chunks rather than by size. Then we'll figure out how best to run sort in parallel over this data (perhaps a wrapper script?). To whet your appetite, preliminary testing on an 8-core system shows a 4.85 times speedup using the existing merge facilities of the sort command and the process substitution provided by bash:

$ time sort -m <(sort data1) ... <(sort data8) >/dev/null
16.75 s
$ time sort data1 ... data8 >/dev/null
81.32 s
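The wrapper script idea could be sketched as below. This is only one possible shape, not the project's design: it assumes a split that can partition by chunk count without breaking lines (written here as -n l/N, the syntax GNU split uses in later releases), and the function name parallel_sort is made up.

```shell
# Hypothetical wrapper: sort FILE using CHUNKS concurrent sort processes
# (default 8), writing the merged result to standard output.
parallel_sort() {
  input=$1
  chunks=${2:-8}
  dir=$(mktemp -d) || return 1
  # Partition the input into CHUNKS pieces on line boundaries.
  split -n "l/$chunks" -- "$input" "$dir/chunk." || { rm -rf "$dir"; return 1; }
  # Sort every piece in the background, one process per piece.
  for f in "$dir"/chunk.*; do
    sort -o "$f.sorted" "$f" &
  done
  wait
  # Merge the already-sorted pieces, as sort -m does in the timing above.
  sort -m "$dir"/chunk.*.sorted
  status=$?
  rm -rf "$dir"
  return $status
}
```

Calling parallel_sort data 8 >sorted should then give the same output as sort data, with the per-chunk sorts spread across the available cores.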

stdbuf

This is a new utility that will allow one to control the buffering of the standard streams of any process. Some programs have specific options to control this (like grep --line-buffered), but stdbuf will give general control over the feature. For example, to turn on line buffering for the cut utility:

$ tail -f access.log | stdbuf -oL cut -d ' ' -f1 | uniq

libunistring

Thanks to Bruno Haible's fantastic work on libunistring, we'll finally be able to add correct i18n support to coreutils. There has been an i18n patch in use by certain distros for years, but it was rightly not accepted upstream because of numerous bugs and performance issues. For example, a segfault was identified just a couple of days ago when invalid input was given to the join command.