A plan for Multi-Byte Unicode Character Support in GNU coreutils

Introduction

coreutils is the project that implements about 100 of the best-known and most widely used utilities on any GNU/Linux system. These utilities are used interactively and extensively in other programs and scripts, and are integral to the standard Linux server distros used today. Originally these utilities were implemented considering only ASCII, or sometimes implicitly other unibyte character sets, and many of those assumptions break down in the presence of multi-byte encodings. This has become more of an issue over time, as this graph of the rise of UTF-8 use on the web indicates.
[Figure: utf8-growth-google]

The trend continues, as can be seen in W3Techs' more up-to-date stats.

There is some support for multi-byte encodings, detailed below, but it’s incomplete, performs badly in most cases, has no tests, and has had many bugs and security issues. Having general support for all, or even just the most popular, encodings in use will improve the performance, security and, more importantly, the reusability of these core components, which will help avoid reimplementations in other parts of the system with their associated maintenance and security overhead.

History

There is already some support for multi-byte encodings in some distros such as Fedora, RHEL and SUSE, which carry a downstream patch, but interestingly not in Debian or Ubuntu. This patch has not been accepted upstream, as it is largely a mechanical replacement and duplication of code, without care and attention to the needs of each tool and its relation to multi-byte encodings. BTW, an improvement that could be made to this multi-byte patch independently would be to keep its history in git somewhere, as currently it’s just maintained as one large iterated patch.

To test if the current patch is applied, you can try this basic operation:

fedora> printf '%s\n' útf8 | cut -c1
ú
debian> printf '%s\n' útf8 | cut -c1
�
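
The difference comes down to bytes versus characters. The following is a minimal sketch, not part of the original test, that makes the byte layout visible. It assumes a POSIX shell with `cut`, `tr`, `od` and `wc` available. The helper names `sample` and `first_byte_hex` are hypothetical, and octal escapes are used so that the input bytes do not depend on the terminal or editor encoding.

```shell
# "ú" is U+00FA, which UTF-8 encodes as the two bytes 0xC3 0xBA (octal 303 272).
# sample (hypothetical helper) emits the same text as the test above.
sample() { printf '\303\272tf8\n'; }

# Total size: 5 bytes for the 4 characters, plus 1 for the newline.
byte_count() { sample | wc -c | tr -d ' '; }

# Select the first *byte* explicitly with -b and dump it as hex.
# Only 0xC3 survives, which is an incomplete UTF-8 sequence.
first_byte_hex() { sample | cut -b1 | tr -d '\n' | od -An -tx1 | tr -d ' \n'; }

byte_count        # 6
first_byte_hex    # c3
echo
```

An unpatched `cut -c1` behaves like `cut -b1` here, so the terminal receives the lone byte 0xC3 and renders it as a replacement character.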
The existing patch originated in the LI18NUX2000 effort and has been carried in Red Hat distros at least since 2001, and in SUSE distros. In 2001 there were discussions upstream between Paul Eggert, Jim Meyering and Bruno Haible about a more appropriate and considered approach, though nothing was completed due to the size of the work involved.

Issues with the LI18NUX patch

Upstream direction

The more considered approach mentioned above was to use a shared library to provide various Unicode and character encoding functionality, for use primarily by GNU coreutils but usable by many other projects. A shared library was decided on due to the large amount of data required to represent the various Unicode tables etc. Since this is a large effort it languished a bit due to lack of resources, but Bruno Haible released libunistring in 2009. Some projects, like libidn, have since used it, and an initial prototype was done with the join utility in coreutils in 2010, but that again stalled due to lack of time. join(1) was a good first choice for testing out the various functionality that any of the coreutils might need, but in retrospect turned out to be a bad first choice due to its interdependence with the sort(1) and uniq(1) utilities, which need to process and compare data consistently with it.

An implementation plan

Time estimates

© Mar 16 2015