unicode utilities

[Update Jan 2018: For an online tool that does similar processing to the tools presented here, see ftfy.]

I had an interesting question from a friend, who wanted help converting the following UTF8 text to iso-8859-15:

<br />â€œ curly quotes â€ aeiouÌˆ line â€¨ separator

This gives me a good opportunity to present the unicode tools I often use: recode, uconv and python.

MySQL issues

The text above was obtained/garbled by doing a dump from a MySQL database. MySQL character encoding is a common problem, since the default seems to be latin1, and even if you set "utf8", that stores at most 3 bytes per character and so doesn't cover characters outside the Basic Multilingual Plane. "utf8mb4" is in fact required for that. For a description of some of the problems with trying to fix MySQL encoding issues in place see this OpenStack discussion. Presented below are external tools to untangle this mess, which can get quite tricky.
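To make the "utf8" limitation a little more concrete, here is a small Python sketch (it doesn't talk to MySQL at all) that counts the UTF-8 bytes each character needs. It assumes MySQL's "utf8" stores at most 3 bytes per character, so anything reported as 4 bytes would need "utf8mb4". The helper name utf8_len is just for illustration.

```python
# Count the UTF-8 bytes needed per character. Characters needing
# 4 bytes are the ones assumed not to fit in MySQL's 3-byte "utf8".
def utf8_len(ch):
    return len(ch.encode("utf-8"))

for ch in ["a", "\u00fc", "\u201d", "\U0001F600"]:
    print("U+%04X needs %d byte(s)" % (ord(ch), utf8_len(ch)))
```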

Ungarbling

With a little experimentation one can see that the above is validly encoded UTF8, but whatever encoded it assumed its input was windows cp1252 data, while the data was already in UTF8! One can see this by using the venerable recode utility to reverse the process:

$ recode utf8..cp1252 < garbled.utf8
<br />“ curly quotes ��
recode: Invalid input in step `UTF-8..CP1252'

The above is trying to convert from UTF8 to CP1252 to get back to the original UTF8. However as you can see, this conversion didn't complete. One can see the reason by listing the CP1252 charset:

$ recode -lh cp1252 | grep -C1 -F "     "

80 Eu           a0 NS   b0 DG   c0 A!   d0 D-   e0 a!   f0 d-
        91 '6   a1 !I   b1 +-   c1 A'   d1 N?   e1 a'   f1 n?
82 .9   92 '9   a2 Ct   b2 2S   c2 A>   d2 O!   e2 a>   f2 o!
--
8c OE   9c oe   ac NO   bc 14   cc I!   dc U:   ec i!   fc u:
                ad --   bd 12   cd I'   dd Y'   ed i'   fd y'
8e Z<           ae Rg   be 34   ce I>   de TH   ee i>   fe th
8f z<   9f Y:   af 'm   bf ?I   cf I:   df ss   ef i:   ff y:

Notice that there are missing bytes in the table above, i.e. there are certain bytes that are not valid cp1252 characters, namely 81 8d 90 9d 9e according to this listing (Microsoft's own cp1252 definition leaves 81 8d 8f 90 9d unassigned; 9d is undefined either way). So the original conversion is not a fully reversible operation, i.e. if any of the original UTF8 text contains those byte values then there will be problems converting back. Note that iso-8859-15, for example, defines chars for all bytes, so one would not have had this issue with it.
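As a cross-check that doesn't depend on recode, the following sketch asks Python's own codec tables which byte values have no mapping. The function name undefined_bytes is mine, and Python's tables are an assumption here: they need not agree byte for byte with recode's listing.

```python
# List the byte values (as hex) that a single-byte codec cannot decode.
def undefined_bytes(codec):
    missing = []
    for b in range(256):
        try:
            bytes([b]).decode(codec)
        except UnicodeDecodeError:
            missing.append("%02x" % b)
    return missing

print("cp1252:     ", undefined_bytes("cp1252"))
print("iso-8859-15:", undefined_bytes("iso-8859-15"))
```

With Python's tables I'd expect 81 8d 8f 90 9d for cp1252 and nothing for iso-8859-15. The byte that matters for our data, 9d, is undefined in cp1252 in both Python's and recode's view.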

Specifically consider the right curly quote (”) in the original UTF8 file. This has the byte sequence e2 80 9d, which contains one of the invalid cp1252 bytes. Whatever did the conversion converted these 3 bytes to c3 a2 e2 82 ac c2 9d.
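The garbling step can be reproduced with a few lines of Python. This is only a sketch of what the unknown converter presumably did, not its actual code: each byte is treated as a cp1252 character, bytes cp1252 doesn't define are assumed to pass through as the code point of the same value, and the result is encoded as UTF8 again. The function name garble is hypothetical.

```python
# Simulate the bogus "cp1252 -> UTF8" step applied to data that was
# already UTF8. Undefined cp1252 bytes are assumed to map to the code
# point with the same value (which is what latin-1 decoding gives).
def garble(data):
    chars = []
    for b in data:
        byte = bytes([b])
        try:
            chars.append(byte.decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(byte.decode("latin-1"))
    return "".join(chars).encode("utf-8")

right_quote = "\u201d".encode("utf-8")   # e2 80 9d
print(garble(right_quote).hex())         # c3a2e282acc29d
```

Here e2 becomes c3 a2, 80 becomes e2 82 ac (the euro sign), and the undefined 9d becomes c2 9d, matching the bytes seen in the dump.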

So rather than just ignoring the invalid characters, what we can do is convert to cp1252 but fall back to iso-8859-15 conversion, which will essentially just remove the c2 byte as required. I don't know of existing tools that allow you to do that, but a quick python proggy fits the bill.

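#!/usr/bin/python3

import sys

# Work on raw bytes so the locale's encoding never interferes.
for c in sys.stdin.buffer.read().decode("utf-8"):
    try:
        # Usual case: undo the bogus cp1252 -> UTF8 step.
        sys.stdout.buffer.write(c.encode("cp1252"))
    except UnicodeEncodeError:
        # Bytes that cp1252 doesn't define (e.g. 9d) were passed
        # through as-is; iso-8859-15 maps them straight back.
        sys.stdout.buffer.write(c.encode("iso-8859-15"))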

Note python has traditionally had very good support for unicode processing (it ships its own copy of the Unicode Character Database in the unicodedata module rather than relying on ICU), and I often use it for unicode lookup and non-standard conversion tasks like this. Here is an example of looking up its embedded unicode database:

$ python3 -c "import unicodedata as ud; print(ud.name(chr(0x2028)))"
LINE SEPARATOR

Normalization

After running the data through our ungarbling script above, we get this valid UTF8:

<br />“ curly quotes ” aeioü line \xe2\x80\xa8 separator

Note the \xe2\x80\xa8 portion above. I replaced the U+2028 line separator character in the original with its hex representation because it causes firefox 2.0.0.10 on linux to hang immediately, and firefox 2.0.0.13 to truncate the line. Also recode, which we'll use for conversion later on, doesn't know how to handle it, so we'll convert it to HTML with sed like sed 's/\xe2\x80\xa8/<br\/>/g'.

Note also the ü character above. This umlaut is actually represented above as 2 unicode characters: u plus the combining diaeresis (U+0308). This is a common issue in unicode processing, where there are multiple ways to represent a particular character. We can use the uconv utility, which comes from the ICU project, to normalise representations to the combined form for example. This is needed in our particular case as recode doesn't handle combining characters at present. So to convert to single characters do uconv -f utf8 -t utf8 -x nfc. Note uconv is available in the libicu-dev package on debian/ubuntu, and in the icu package on fedora/redhat.
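If uconv isn't to hand, the same NFC normalisation can be sketched with python's unicodedata module. This is an alternative to the uconv step rather than what the pipeline in this article uses.

```python
import unicodedata

decomposed = "u\u0308"   # u followed by COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)

# Two code points before normalisation, one after.
print(len(decomposed), len(composed))
print(unicodedata.name(composed))   # LATIN SMALL LETTER U WITH DIAERESIS
```

The composed form is the single character U+00FC, which iso-8859-15 can represent directly.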

Conversion

So after the above normalization we have this UTF8 string:

<br />“ curly quotes ” aeioü line <br/> separator

The last step to transliterate the UTF8 characters to the closest iso-8859-15 equivalent is achieved quite easily with the recode utf8..iso-8859-15 command. recode really is a nifty tool and I've previously documented other recode examples. So to recap on the whole conversion command line:

./ungarble.py < garbled.utf8 |
uconv -f utf8 -t utf8 -x nfc |
sed 's/\xe2\x80\xa8/<br\/>/g' |
recode utf8..iso-8859-15 > converted.latin9
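For comparison, here is a rough all-python sketch of the same pipeline. It is not equivalent to recode: python's iso-8859-15 codec does no transliteration, so the FALLBACKS table below is a hypothetical stand-in covering only the characters in this sample (curly quotes to a plain double quote, U+2028 to <br/>). The function names are mine too.

```python
import unicodedata

# Hypothetical replacements for characters iso-8859-15 lacks.
# recode may well choose different substitutes.
FALLBACKS = str.maketrans({
    "\u201c": '"',      # left curly quote
    "\u201d": '"',      # right curly quote
    "\u2028": "<br/>",  # line separator, as the sed step does
})

def ungarble(data):
    """Undo the bogus cp1252 -> UTF8 step; returns the original text."""
    out = bytearray()
    for ch in data.decode("utf-8"):
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Bytes undefined in cp1252 map straight back here.
            out += ch.encode("iso-8859-15")
    return out.decode("utf-8")

def to_latin9(garbled):
    text = ungarble(garbled)
    text = unicodedata.normalize("NFC", text)   # the uconv step
    text = text.translate(FALLBACKS)            # sed + transliteration
    return text.encode("iso-8859-15")           # raises on anything else

if __name__ == "__main__":
    import sys
    sys.stdout.buffer.write(to_latin9(sys.stdin.buffer.read()))
```

Any character that is neither in iso-8859-15 nor in FALLBACKS raises UnicodeEncodeError, which seems preferable to silently dropping data in a one-off cleanup like this.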