We take great care over the interface and operation of the GNU coreutils, but unfortunately, for backwards compatibility reasons, some behaviours or defaults of these utilities can be confusing.

This information will continue to be updated, and overlaps somewhat with the coreutils FAQ; this list focuses on the less frequently encountered issues.

chmod

Recursive chmod is tricky. If for example you copy a dir from VFAT and want to turn off the executable bits on files using chmod -R 644, that will fail to recurse, as it removes the search (execute) bits from the dirs themselves. This is achievable in various ways, for example by clearing all execute bits and letting the symbolic X re-add them for directories only, or by restricting the change to regular files:
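$ chmod -R a-x+X dir                     # X re-adds execute only for directories
$ find dir -type f -exec chmod 644 {} +  # or restrict the change to regular files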

cut

cut doesn't work with fields separated by arbitrary whitespace. It's often better to use awk, or even join -a1 -o1.$field $file /dev/null.
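For example, to extract the second whitespace-separated field:

$ printf 'a  b\n' | cut -d' ' -f2   # prints an empty line: each space is a separate delimiter

$ printf 'a  b\n' | awk '{print $2}'
b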

cut -s only suppresses lines without delimiters. Therefore if a line has a missing field but does contain some delimiters, a blank line is output.
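$ printf '%s\n' a:b c: d | cut -s -d: -f2   # 'd' is suppressed, but 'c:' still yields a blank line
b
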

Similarly, if you want a blank line to be output for lines containing no delimiters, you need to append a delimiter first, like:

printf '%s\n' a:b c d:e | sed '/:/!s/$/:/' | cut -d: -f2-

dd

dd iflag=fullblock is usually what you want because when reading from a fifo/pipe you often get a short read, which means you get too little data if you specify "count", or too much data if you specify "sync". For example:
$ dd status=none count=1 if=/dev/random bs=512 | wc -c
78
Note dd does warn in certain cases since version 8.10, but not with count=1 as above, since short reads with count=1 are often used as an idiom to "consume available data"; perhaps dd iflag=nonblock would be a more direct and general way to do that?
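With iflag=fullblock the read is retried until the full block is obtained (byte counts without it will of course vary):

$ dd status=none count=1 iflag=fullblock if=/dev/random bs=512 | wc -c
512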

dd conv=noerror really needs conv=sync as well, so that when reading from a failing disk one gets correctly aligned data, with the unreadable bits replaced by NULs. Note if there is a read error anywhere in a block, the whole block is discarded, so one needs to balance the block size between speed (bigger blocks) and minimized data loss (smaller blocks). This is simpler and more dynamic in a dedicated tool like ddrescue.
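A minimal sketch of imaging a failing disk (device and file names are illustrative):

$ dd if=/dev/sdb of=disk.img bs=512 conv=noerror,sync   # small bs limits the data lost around each bad sector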

df

For full portability the -P option is needed when parsing the output from df, as it avoids line wrapping (though df will no longer wrap lines since version 8.11 (Apr 2011), to help avoid this gotcha). Also if one needs to parse the header, the -P option will use more standardised (but ambiguous) wording. See also the Block size issue.
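For example, to extract the available blocks for the root file system (assuming the standard column layout):

$ df -P / | awk 'NR==2 {print $4}'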

du

If two or more hard links point to the same file, only one of them is counted. The FILE argument order affects which links are counted, so changing the argument order may change the numbers that du outputs. Note this also applies to directories specified as arguments, which is confusing:
$ cd git/coreutils
$ du -s ./ ./tests
593120  ./
$ du -s ./tests ./  # depth first gets items listed (though counted independently)
10036   ./tests
583084  ./
# Note order is significant even with implicit independence
$ du -s --separate-dirs ./tests ./
128     ./tests
16268   ./
$ du -s --separate-dirs ./ ./tests
16268   ./
Note du doesn't handle reflinked files specially, and thus will count all instances of a reflinked file.

echo

echo is not portable, and its behaviour diverges between systems, shell builtins, etc. One should really consider using printf instead. This shell session illustrates some inconsistencies; where you see env being used, it selects the coreutils standalone version:
$ echo -e -n # outputs nothing
$ echo -n -e
$ echo -- -n # option terminator is output
-- -n
$ POSIXLY_CORRECT=1 env echo -e -n
-e -n
$ POSIXLY_CORRECT=1 env echo -n -e # no output either ‽
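printf, in contrast, treats such strings as ordinary data:

$ printf '%s\n' -n -e
-n
-e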

expr

The exit status of expr is a confusing gotcha. POSIX states that exit status of 1 is used if "the expression evaluates to null or zero", which you can see in these examples:
$ expr 2 - 1; echo $?
1
0

$ expr 2 - 2; echo $?
0
1

$ expr substr 01 1 1; echo $?
0
1

$ expr ' ' : '^ *$'; echo $?
1
0

$ expr '' : '^ *$'; echo $?
0
1
The string matching above is especially confusing, though it does conform to POSIX, and is consistent across Solaris, FreeBSD, and the GNU utils.

As for changing the behaviour, that's probably not possible due to backwards compatibility concerns. For example, the '^..*$' case would require changing the handling of '*' in the expression, which would break a script like:

printf '%s\n' 1 2 '' 3 |
while read line; do
  expr "$line" : '^[0-9]*$' >/dev/null || break # at first blank line
  echo process "$line"
done
Note that using a leading ^ in the expression is redundant and not portable, as expr patterns are anchored to the start already.
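The exit status also trips up scripts run under set -e, where a perfectly valid result of 0 aborts the script:

$ set -e
$ count=$(expr 2 - 2)        # would exit the shell here, as expr returns status 1
$ count=$(expr 2 - 2) || :   # a common workaround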

ls

ls -lrt will also reverse-sort names for files with matching timestamps (common in /dev/, /proc/, etc.). This is per POSIX, but probably not what the user wanted. There is no way to sort in reverse by time but forward by name.

ln

ln -nsf is needed to update a symlink (otherwise the new link is created inside the directory the old symlink points to). Note that this will still overwrite existing files, and still create links within existing (real) directories.
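For example, given existing directories dir1 and dir2 (illustrative names):

$ ln -s dir1 alias
$ ln -sf dir2 alias    # dereferences 'alias', creating dir1/dir2 instead of updating 'alias'
$ ln -nsf dir2 alias   # replaces the symlink itself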

*sum

The checksum utilities like md5sum, sha1sum, etc. add backslash escapes to output names containing '\n' or '\' characters, and prefix such lines with '\'. Also a '*' is added to the output where O_BINARY is significant (Cygwin). Therefore automatic processing of the output of these utilities requires unescaping first.
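For example (the checksum shown is that of an empty file):

$ touch 'a\b'
$ md5sum 'a\b'
\d41d8cd98f00b204e9800998ecf8427e  a\\b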

rm

rm -rf does not mean "delete as much as possible". It only avoids prompts. For example, with a non-writable dir you will not be able to remove any of its contents. Therefore something like this is sometimes necessary:
find "$dir" -depth -type d -exec chmod +wx {} + && rm -rf "$dir"

sort

A very commonly encountered issue is the default ordering of the sort utility. Usually what is wanted is a simple byte comparison, but by default the collation order of the current locale is used. To get the simple comparison logic you can use LC_ALL=C sort ..., as detailed in the FAQ.
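For example (the first result assumes a typical UTF-8 locale):

$ printf '%s\n' a B | sort
a
B

$ printf '%s\n' a B | LC_ALL=C sort
B
a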

equal comparisons

As well as being slower, locale-based ordering can often be surprising. For example, some character representations, like the full-width forms of Latin digits, compare equal to each other.
$ printf '%s\n' ２ １ | ltrace -e strcoll sort
sort->strcoll("\357\274\222", "\357\274\221") = 0
２
１

$ printf '%s\n' ２ １ | sort -u
２
The equal-comparison issue with --unique can even have an impact in the "C" locale, for example with --numeric-sort dropping items unexpectedly. Note this example also demonstrates that --unique implies --stable, selecting the first encountered item in each matching set.
$ printf "%s\n" 1 zero 0 .0 | sort -nu
zero
1

i18n patch issues

Related to locale ordering, the i18n patch applied on Fedora/RHEL/SUSE has its own issues. Note that disabling the locale-specific handling as described above effectively avoids these issues too.

Example 1: leading spaces are mishandled with --human-numeric-sort:

$ printf ' %s\n' 4.0K 1.7K | sort -s -h
 4.0K
 1.7K
Example 2: case folding results in incorrect ordering:
$ printf '%s\n' Dániel Dylan | sort
Dániel
Dylan

$ printf '%s\n' Dániel Dylan | sort -f
Dylan
Dániel

field handling

Fields specified with -k are separated by default by runs of blank characters (space and tab), and by default the blank characters preceding a field are included in the comparison, which, depending on your locale, can be significant to the sort order. This is confusing enough on its own, but is compounded by the --field-separator and --ignore-leading-blanks options. Ignoring leading blanks (-b) is particularly confusing, because it takes effect separately for the start and end positions of each key specification. Also, precisely specifying a single field requires both the start and end fields to be given, i.e. to sort on field 2 alone you use -k2,2.
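For example, -k2 means "from field 2 to the end of the line", not "field 2 only":

$ printf '%s\n' 'x 1 b' 'x 1 a' | sort -s -k2
x 1 a
x 1 b

$ printf '%s\n' 'x 1 b' 'x 1 a' | sort -s -k2,2
x 1 b
x 1 a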

These field delineation issues, along with others, are so confusing that the sort --debug option was added in version 8.6 to highlight the matched extents and other consequences of the various options.

--random-sort

sort -R does randomize the input similarly to the shuf command, but (since it sorts by a hash of the keys) also ensures that matching keys are grouped together. shuf also provides optimizations when outputting only a subset of the input.
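For example (the output below is one of the possible orderings):

$ printf '%s\n' a b a | sort -R   # the two a's always end up adjacent
b
a
a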

tac

tac, like wc, has issues dealing with files lacking a trailing '\n' character.
$ printf "1\n2" | tac
21

tail

tail -F is probably what you want rather than -f, as the latter follows the file descriptor and so doesn't follow log rotations etc.
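For example (illustrative log path):

$ tail -F /var/log/app.log   # keeps following after the file is renamed and recreated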

tee

tee by default exits immediately upon receiving SIGPIPE, both to be POSIX compliant and to support applications like yes | tee log | timeout process. This is problematic in the presence of "early close" pipes, often seen when combining tee with bash >(process substitutions). Starting with coreutils 8.24 (Jul 2015), tee has the new -p, --output-error option to control the operation in such cases.
$ seq 100000 | tee >(head -n1) > >(tail -n1)
1
14139

$ seq 100000 | tee -p >(head -n1) > >(tail -n1)
1
100000

wc

wc -l on a file whose last line doesn't end with a '\n' character will return a value one less than might be expected, as wc is standardised to simply count '\n' characters. POSIX in fact doesn't consider a file whose last character isn't '\n' to be a text file at all.
$ printf "hello\nworld" | wc -l
1
wc -L counts the maximum display width for a line, considering only valid, printable characters, but not terminal control codes.
# invalid UTF-8 sequence not counted:
$ printf "\xe2\xf2\xa5" | wc -l
0

# unprintable characters even in the C locale are not counted:
$ printf "\xe2\x99\xa5" | LC_ALL=C wc -L
0

# Bytes can be counted using sed:
$ printf "\xe2\x99\xa5" | LC_ALL=C sed 's/././g' | wc -L
3

# Terminal control chars are not handled specially:
$ printf '\x1b[33mf\bred\x1b[m\n' | tee /dev/tty | wc -L
red
10

Unit representations

The df, du, and ls --block-size option is unusual in that appending a B to the unit changes it from binary to decimal, i.e. KB means 1000, while K means 1024.

In general the unit representations in coreutils are unfortunate, but an accident of history. POSIX specifies 'k' and 'b' to mean 1024 and 512 respectively, whereas standards-wise 'k' should really mean 1000, and 'K' 1024. Extending from that we now have (and can't change for compatibility reasons):

 'b'            => 512
 'KB'           => 1000
 'K' ('KiB')    => 1024
 'MB'           => 1000*1000
 'M' ('MiB')    => 1024*1024
 ...and so on through G, T, P, E, Z, Y

Note there is new flexibility to consider when controlling the output of numeric units, by leveraging the numfmt utility. For example to control the output of du you could define a function like:
 du() { env du -B1 "$@" | numfmt --to=iec-i --suffix=B; } 
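For example, numfmt alone converts raw byte counts like:

$ echo 1000000 | numfmt --to=iec-i --suffix=B
977KiB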

Timezones

Discussed separately at Time zone ambiguities