<br />â€œ curly quotes â€ aeiouÌˆ line â€¨ separator

This gives me a good opportunity to present the unicode tools I often use: recode, uconv and python.
MySQL issues

The text above was garbled by doing a dump from a MySQL database. MySQL character encoding is a common problem, since the default seems to be latin1, and even if you set "utf8", that doesn't cover all characters: MySQL's "utf8" stores at most 3 bytes per character, so "utf8mb4" is required for anything outside the Basic Multilingual Plane. For a description of some of the problems with trying to fix MySQL encoding issues in place, see this OpenStack discussion. Presented below are external tools to untangle this mess, which can get quite tricky.
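To make the "utf8" versus "utf8mb4" distinction concrete, here is a small Python sketch that counts how many bytes a character needs in UTF-8. The helper names `utf8_len` and `fits_mysql_utf8` are made up for illustration and are not part of any MySQL tooling.

```python
def utf8_len(ch):
    """Number of bytes needed to encode one character as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8(text):
    """True if every character fits in 3 bytes (the limit of MySQL's 'utf8')."""
    return all(utf8_len(ch) <= 3 for ch in text)


# ASCII, Latin-1 range, curly quote, and an emoji outside the BMP
for ch in ["a", "\u00fc", "\u201d", "\U0001F600"]:
    print(f"U+{ord(ch):04X} needs {utf8_len(ch)} byte(s)")

print(fits_mysql_utf8("\u201c curly quotes \u201d"))  # True
print(fits_mysql_utf8("smile \U0001F600"))            # False
```

Only the last sample needs 4 bytes, which is the kind of character that requires "utf8mb4".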
Ungarbling

With a little experimentation one can see that the above is validly encoded UTF8, but the encoder assumed it was encoding windows cp1252 data, when the data was already in UTF8! One can see this by using the venerable recode utility to reverse the process:
$ recode utf8..cp1252 < garbled.utf8
<br />“ curly quotes ��
recode: Invalid input in step `UTF-8..CP1252'

The above tries to convert from UTF8 to CP1252 to get back to the original UTF8. However, as you can see, the conversion didn't complete. One can see the reason by listing the CP1252 charset:
$ recode -lh cp1252 | grep -C1 -F " "
80 Eu         a0 NS  b0 DG  c0 A!  d0 D-  e0 a!  f0 d-
       91 '6  a1 !I  b1 +-  c1 A'  d1 N?  e1 a'  f1 n?
82 .9  92 '9  a2 Ct  b2 2S  c2 A>  d2 O!  e2 a>  f2 o!
--
8c OE  9c oe  ac NO  bc 14  cc I!  dc U:  ec i!  fc u:
              ad --  bd 12  cd I'  dd Y'  ed i'  fd y'
8e Z<         ae Rg  be 34  ce I>  de TH  ee i>  fe th
8f z<  9f Y:  af 'm  bf ?I  cf I:  df ss  ef i:  ff y:

Notice that there are gaps in the table above, i.e. certain bytes are not valid cp1252 characters, namely 81 8d 90 9d 9e in this listing (Microsoft's own cp1252 definition leaves 81 8d 8f 90 9d unassigned, so recode's table differs slightly). So the original conversion is not a fully reversible operation: if any of the original UTF8 text contains those byte values, there will be problems converting back. Note that iso-8859-15, for example, defines characters for all bytes, so one would not have had this issue with it.
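The same check can be done without recode. The following is a minimal Python sketch that probes a codec byte by byte; `undefined_bytes` is a name invented here, and Python's codec tables are not guaranteed to agree with recode's in every cell.

```python
def undefined_bytes(codec):
    """Return the byte values that the given single-byte codec cannot decode."""
    missing = []
    for b in range(256):
        try:
            bytes([b]).decode(codec)
        except UnicodeDecodeError:
            missing.append(b)
    return missing


print([f"{b:02x}" for b in undefined_bytes("cp1252")])  # ['81', '8d', '8f', '90', '9d']
print(undefined_bytes("iso-8859-15"))                   # []
```

Either way, 9d is the gap that matters for the right curly quote.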
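Specifically, consider the right curly quote (”) in the original UTF8 file. It has the byte sequence e2 80 9d, which contains 9d, one of the bytes that is invalid in cp1252. Whatever did the conversion turned these 3 bytes into c3 a2, e2 82 ac, c2 9d: the first two bytes were read as the cp1252 characters â and € and re-encoded as UTF8, while the invalid 9d was passed through as code point U+009D.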
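To check that byte arithmetic, here is a rough Python sketch of the garbling step itself. It assumes the unknown converter passed undefined cp1252 bytes straight through as the code point of the same value. That assumption matches the bytes seen in the dump, but it is not a description of what MySQL actually does internally.

```python
def garble(data):
    """Misread UTF-8 bytes as cp1252, then re-encode the result as UTF-8."""
    chars = []
    for b in data:
        try:
            chars.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: bytes undefined in cp1252 pass through as U+0080..U+009F
            chars.append(chr(b))
    return "".join(chars).encode("utf-8")


right_quote = "\u201d".encode("utf-8")
print(right_quote.hex(" "))          # e2 80 9d
print(garble(right_quote).hex(" "))  # c3 a2 e2 82 ac c2 9d
```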
So rather than just ignoring the invalid characters, what we can do is convert to cp1252 but fall back to iso-8859-15 for any character cp1252 can't represent, which essentially just removes the c2 byte as required. I don't know of existing tools that allow you to do that, but a quick python proggy fits the bill:
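#!/usr/bin/python3
import sys

# Decode the garbled input as UTF-8, then undo the bogus cp1252 step per character
for c in sys.stdin.buffer.read().decode("utf-8"):
    try:
        sys.stdout.buffer.write(c.encode("cp1252"))
    except UnicodeEncodeError:
        # cp1252 has no mapping (e.g. U+009D): emit the raw byte via iso-8859-15
        sys.stdout.buffer.write(c.encode("iso-8859-15"))

Note python has traditionally had very good support for unicode processing (it bundles its own copy of the Unicode character database in the unicodedata module), and I often use it for unicode lookup and non-standard conversion tasks like this. Here is an example that looks up a character in its embedded unicode database:

$ python3 -c "import unicodedata as ud; print(ud.name(chr(0x2028)))"
LINE SEPARATOR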
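The same lookup extends naturally to a whole string. As a small sketch, the hypothetical helper `describe` below lists the non-ASCII characters in the ungarbled sample text along with their names:

```python
import unicodedata


def describe(text):
    """Return (code point, name) pairs for the non-ASCII characters in text."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>"))
            for c in text if ord(c) > 0x7F]


sample = "\u201c curly quotes \u201d aeiou\u0308 line \u2028 separator"
for code, name in describe(sample):
    print(code, name)
```

This makes invisible or look-alike characters such as the combining diaeresis and the line separator easy to spot.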
Normalization

After running the data through our ungarbling script above, we get this valid UTF8:

<br />“ curly quotes ” aeioü line \xe2\x80\xa8 separator

Note the \xe2\x80\xa8 portion above. I replaced the U+2028 line separator character in the original with its hex representation because it causes firefox 184.108.40.206 on linux to hang immediately, and firefox 220.127.116.11 to truncate the line. Also recode, which we'll use for conversion later on, doesn't know how to handle it, so we'll convert it to HTML with sed: sed 's/\xe2\x80\xa8/<br\/>/g'
Note also the ü character above. This umlaut is actually represented as 2 unicode characters: u plus the combining diaeresis (U+0308). This is a common issue in unicode processing, where there are multiple ways to represent a particular character. We can use the uconv utility, which comes from the ICU project, to normalise representations to the composed form (NFC), for example. This is needed in our particular case as recode doesn't handle combining characters at present. So to convert to single characters, run uconv -f utf8 -t utf8 -x nfc. Note uconv is available in the libicu-dev package on debian/ubuntu, and in the icu package on fedora/redhat.
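If uconv isn't at hand, Python's standard unicodedata module can do the same NFC composition. This is only a sketch of the equivalent step, not a claim that the two tools agree on every input:

```python
import unicodedata

decomposed = "aeiou\u0308"  # u followed by U+0308 COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 6 5
print(composed == "aeio\u00fc")        # True
```

The visible text is unchanged, but the string is one code point shorter once u and the diaeresis are merged into U+00FC.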
Conversion

So after the above normalization we have this UTF8 string:

<br />“ curly quotes ” aeioü line <br/> separator

The last step, transliterating the UTF8 characters to their closest iso-8859-15 equivalents, is achieved quite easily with the recode utf8..iso-8859-15 command. recode really is a nifty tool, and I've previously documented other recode examples. So to recap, the whole conversion command line is:
./ungarble.py < garbled.utf8 | uconv -f utf8 -t utf8 -x nfc | sed 's/\xe2\x80\xa8/<br\/>/g' | recode utf8..iso-8859-15 > converted.latin9
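For comparison, the same pipeline can be sketched in Python alone. The function names and the `FALLBACKS` table are invented for illustration. In particular, mapping curly quotes to plain ASCII quotes is an assumption standing in for recode's transliteration, which may choose differently.

```python
import unicodedata

# Hypothetical fallback table: iso-8859-15 has no curly quotes, so use ASCII ones.
FALLBACKS = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}


def ungarble(data):
    """Undo a UTF-8 -> (misread as cp1252) -> UTF-8 round trip."""
    raw = bytearray()
    for c in data.decode("utf-8"):
        try:
            raw += c.encode("cp1252")
        except UnicodeEncodeError:
            raw += c.encode("iso-8859-15")  # bytes that cp1252 leaves undefined
    return raw.decode("utf-8")


def to_latin9(text):
    """Normalize, replace U+2028, transliterate, and encode as iso-8859-15."""
    text = unicodedata.normalize("NFC", text)          # like uconv -x nfc
    text = text.replace("\u2028", "<br/>")             # like the sed step
    text = "".join(FALLBACKS.get(c, c) for c in text)  # stand-in for recode
    return text.encode("iso-8859-15")


def convert(garbled):
    return to_latin9(ungarble(garbled))


if __name__ == "__main__":
    import sys
    sys.stdout.buffer.write(convert(sys.stdin.buffer.read()))
```

Saved under a name of your choosing, it would read garbled.utf8 on stdin and write the converted bytes to stdout, much like the shell pipeline above.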