Zen of Unicode
Date: Thu, 16 Mar 2006 10:26:08 -0500
From: "David Goodger" <goodger at python.org>
To: grig at gheorghiu.net
Subject: Re: Zen of Unicode question
> [Grig Gheorghiu]
> I realize you're in the middle of the docutils sprint and you might
> not have time to respond.
That's over and I'm finally getting through my email backlog. Sorry
for the delay.
> I liked your Unicode talk a lot and I thought I finally got the Zen
> of Unicode:
That's a good title. Now why didn't I think of it? I'll add it to
the list of alternates.
> - unicode is text as far as Python is concerned
> - we don't care how unicode is stored by Python internally
> - there is no such thing as a Unicode file
> - encoding a unicode string is like saving an image in a particular
> format
> - an application should decode when accepting input, should work
> internally with unicode objects, and should encode when producing
> output
Yes.
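A minimal sketch of that last rule, written for Python 3 (where `str` is Unicode text and `bytes` is encoded data). The function name `shout` and the sample bytes are made up for illustration:

```python
def shout(raw, input_encoding, output_encoding):
    """Decode on input, work on text, encode on output."""
    text = raw.decode(input_encoding)      # bytes -> text at the input boundary
    text = text.upper()                    # all processing happens on text
    return text.encode(output_encoding)    # text -> bytes at the output boundary

# "cafe" with an acute accent arrives as Latin-1 and leaves as UTF-8.
result = shout(b'caf\xe9', 'latin-1', 'utf-8')
```

The uppercase step only works across the accented letter because it runs on decoded text; the same call on the raw Latin-1 bytes would leave `\xe9` untouched.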
> So I downloaded some Japanese text from Yahoo Japan, saved it into a
> file, and tried to decode on input into utf-8. I got an error.
That means it wasn't encoded in UTF-8.
> I then tried utf-16 and it worked (but not for printing, only for
> evaluating it).
Be careful, UTF-16 can decode almost *anything*. Unless the source text was
encoded in UTF-16, you got garbage. You'll have to compare the source
text on the web to the decoded result.
> When I encoded so that I can save into another file, still in
> utf-16, I got 2 extra characters at the beginning.
2 extra characters or *bytes*? Probably bytes, the byte order mark.
Remember, characters and bytes are *not* interchangeable!
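To make the character/byte distinction concrete, here is a small Python 3 sketch (the sample string is arbitrary). The generic 'utf-16' codec writes a 2-byte byte order mark; the codecs that name a byte order do not:

```python
import codecs

text = 'abc'
encoded = text.encode('utf-16')        # generic codec: BOM + native byte order
assert len(text) == 3                  # three characters
assert len(encoded) == 8               # 2 BOM bytes + 3 * 2 data bytes
assert encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Naming the byte order explicitly writes no BOM.
no_bom = text.encode('utf-16-le')
assert len(no_bom) == 6

# The generic codec consumes the BOM again when decoding.
assert encoded.decode('utf-16') == text
```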
> What's a good strategy for figuring out what codec I should use for
> decoding on input? I thought utf-8 is pretty safe, but this is the
> second case in which I got errors when trying to use it...
UTF-8 is safe, for that very reason: if the input isn't encoded with
UTF-8, you'll get an error. That's a good thing!
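A short Python 3 illustration of why a loud failure is the safe outcome. The four bytes are the first two characters of the EUC-JP text from your session; the variable names are mine:

```python
data = b'\xa4\xaa\xc6\xc0'       # two Japanese characters, EUC-JP encoded

try:
    data.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False              # UTF-8 refuses input that is not UTF-8

# Latin-1 maps every byte to a character, so it never fails,
# but here the result is four meaningless symbols.
wrong = data.decode('latin-1')
```

An exception tells you to keep looking; a silent wrong answer does not.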
My strategy can be found in the docutils.io.Input.decode method. I
ran out of time to show it during the talk. Here it is, with my
comments flush-left:
    def decode(self, data):
        """
        Decode a string, `data`, heuristically.
        Raise UnicodeError if unsuccessful.
        The client application should call ``locale.setlocale`` at the
        beginning of processing::
            locale.setlocale(locale.LC_ALL, '')
        """
        if self.encoding and self.encoding.lower() == 'unicode':
self.encoding contains the input encoding provided by the user or
application. We use a trick, an encoding of "unicode" (which is a
contradiction in terms), to indicate "the data is already Unicode
text". Then we test that:
            assert isinstance(data, UnicodeType), (
                'input encoding is "unicode" '
                'but input is not a unicode object')
And we're forgiving on input:
        if isinstance(data, UnicodeType):
            # Accept unicode even if self.encoding != 'unicode'.
            return data
        encodings = [self.encoding]
First rule: know your encoding. If the encoding is specified, we
assume it's true. "encodings" here is a list of encodings to try.
If self.encoding *is* provided, we don't try anything else:
        if not self.encoding:
            # Apply heuristics only if no encoding is explicitly given.
            encodings.append('utf-8')
UTF-8 is the first encoding tried, because it will only work if the
input data *is* UTF-8-encoded. Then we append some locale-specific
encodings to try:
            try:
                encodings.append(locale.nl_langinfo(locale.CODESET))
            except:
                pass
            try:
                encodings.append(locale.getlocale()[1])
            except:
                pass
            try:
                encodings.append(locale.getdefaultlocale()[1])
            except:
                pass
            encodings.append('latin-1')
Latin-1 is the last encoding tried, because like UTF-16, it will
decode *anything*. Therefore it is *not* safe to try early on, but
it's commonly used.
        error = None
        error_details = ''
Now we loop through the candidate encodings, taking the first one that
works:
        for enc in encodings:
            if not enc:
                continue
            try:
                decoded = unicode(data, enc, self.error_handler)
                self.successful_encoding = enc
If the input data was successfully decoded, we simply return it,
removing any Byte Order Marks (probably the extra 2 bytes you noticed
at the beginning of your file):
                # Return decoded, removing BOMs.
                return decoded.replace(u'\ufeff', u'')
            except (UnicodeError, LookupError), error:
                pass
At this point, there were no encodings that worked. We report this
information by raising an exception:
        if error is not None:
            error_details = '\n(%s: %s)' % (error.__class__.__name__,
                                            error)
        raise UnicodeError(
            'Unable to decode input data. Tried the following encodings: '
            '%s.%s'
            % (', '.join([repr(enc) for enc in encodings if enc]),
               error_details))
So the order we go through is this:
1. Check for Unicode; if it already is, there's nothing to decode.
2. Try the specified encoding. If it works, we're done. If not, it's
an error.
3. If no encoding was specified, use some heuristics. Try UTF-8
first, because it's safe, then other common encodings. Watch out
for encodings that decode anything though; garbage in, garbage out.
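For readers on Python 3, the same order can be sketched roughly as follows. This is not the docutils code: the function name `decode_heuristically` is hypothetical, and I have assumed that a single call to `locale.getpreferredencoding(False)` is an acceptable stand-in for the three locale lookups above.

```python
import locale

def decode_heuristically(data, encoding=None, errors='strict'):
    """Return (text, encoding_used) for `data`, trying candidates in order."""
    if isinstance(data, str):
        return data, None                    # step 1: already text
    candidates = [encoding]                  # step 2: trust a given encoding
    if not encoding:
        # Step 3: heuristics, only when no encoding was specified.
        candidates.append('utf-8')           # strict, so safe to try first
        candidates.append(locale.getpreferredencoding(False))  # assumption
        candidates.append('latin-1')         # decodes anything, so goes last
    last_error = None
    for enc in candidates:
        if not enc:
            continue
        try:
            text = data.decode(enc, errors)
        except (UnicodeError, LookupError) as exc:
            last_error = exc                 # remember why, try the next one
            continue
        return text.replace('\ufeff', ''), enc   # drop byte order marks
    raise UnicodeError('Unable to decode input data. Tried: %s (%s)'
                       % (', '.join(repr(e) for e in candidates if e),
                          last_error))
```

Returning the encoding that succeeded plays the role of `self.successful_encoding` in the original, so the caller can log or double-check the guess.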
> The other kind of weird thing is that if I ignore unicode altogether
> and just deal with the bytestring, I can read and write files with
> no errors...is it just luck in this case?
If you're not processing the **text**, but just copying bytes, that's
fine. You're just copying a file, including the encoding.
It's only when you have to process the text *as* text that you have to
decode it to Unicode first.
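A small Python 3 sketch of the difference, using the first four bytes of the Japanese text from your session (variable names are mine):

```python
raw = b'\xa4\xaa\xc6\xc0'          # two Japanese characters in EUC-JP

copied = bytes(raw)                # a plain copy needs no codec at all
assert copied == raw

# Text questions asked of the bytes give byte answers:
assert len(raw) == 4               # four bytes, not four characters

# Decode first, and the same questions are answered in characters.
text = raw.decode('euc-jp')
assert len(text) == 2
assert text[:1].encode('euc-jp') == b'\xa4\xaa'   # slicing keeps characters whole
```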
> Here's my interpreter session:
>
> [ggheo@concord test_unicode]$ python
> Python 2.4 (#1, Nov 30 2004, 16:42:53)
> [GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-5)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> f = open('in.txt')
>>>> t = f.read().rstrip()
>>>> t
> 'if(cp)d.getElementById(\'tba\').innerHTML=\'<a href="/r/xpr"><font color="#000099" size="1">\xa4\xaa\xc6\xc0\xa4\xc7\xca\xd8\xcd\xf8\xa4\xca\xa5\xb5\xa1\xbc\xa5\xd3\xa5\xb9\xa4\xac\xa4\xa4\xa4\xc3\xa4\xd1\xa4\xa4</font></a>\';'
>>>> print t
> if(cp)d.getElementById('tba').innerHTML='<a href="/r/xpr"><font
> color="#000099"
size="1">¢¥ªÆÀ¢¥ÇÊØÃø¢¥Ê¥µ¡1â„4¥Ó¥©ˆ¢¥¬¢¥¢¥¢¥Ã¢¥Ñ¢¥¢¥</font></a>';
You're seeing the encoded bytes, binary garbage.
>>>> u = t.decode('utf-8')
It's definitely not UTF-8:
- The first byte of a non-ASCII character encoded in UTF-8 is
always in the range 0xC2 to 0xF4, and all subsequent bytes are in
the range 0x80 to 0xBF. The bytes 0xC0, 0xC1, and 0xF5 to 0xFF are
never used.
Notice the first byte of text (after the <font> start-tag) is "\xa4",
which is not in the range \xC2-\xF4. If this is Japanese text, I'd
try the Shift-JIS, ISO-2022-JP, and EUC-JP encodings.
If you got this off the web, use your browser to see the encoding. At
the top of every HTML file there is (or should be) a tag like this:
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />
The "charset" is what you're interested in. Also, in your browser,
look for the "Character Encoding" submenu (under View in Firefox). It
should show you what the encoding of the current page actually is.
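If you'd rather check from a script, something like this Python 3 sketch can pull the declared charset out of the raw page bytes. The function name `sniff_meta_charset` and the 1024-byte window are my own choices, and a declared charset is only a claim: the page may still be encoded differently, and an HTTP Content-Type header, if present, takes precedence.

```python
import re

# Matches charset=NAME with optional quotes, in raw (undecoded) bytes.
_CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)',
                         re.IGNORECASE)

def sniff_meta_charset(html_bytes):
    """Return the charset named near the top of the page, or None."""
    match = _CHARSET_RE.search(html_bytes[:1024])
    if match is None:
        return None
    return match.group(1).decode('ascii').lower()

page = b'<meta http-equiv="content-type" content="text/html; charset=EUC-JP" />'
declared = sniff_meta_charset(page)
```

Searching the bytes works because the tag itself is plain ASCII in every encoding you are likely to meet on the web, apart from UTF-16 and UTF-32.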
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> File "/usr/local/lib/python2.4/encodings/utf_8.py", line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 89: unexpected code byte
>>>> u = t.decode('utf-16')
The input data you started with isn't encoded in UTF-16. I'm
surprised you didn't get an error; when I tried UTF-16, I got
"UnicodeDecodeError: 'utf16' codec can't decode bytes in position
96-97: illegal UTF-16 surrogate".
I tried the three Japanese encodings listed above, and determined that
the text was encoded with EUC-JP. The only way to know that (without
prior knowledge) is to decode with the guessed encoding, re-encode
with a known encoding (like UTF-8), and compare the input to the
output. Only with EUC-JP do I get comprehensible Japanese text.
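The trial decoding can be scripted. Here is a Python 3 sketch over just the non-ASCII bytes from your session; the variable names are mine. Surviving the decode is necessary but not sufficient, so the final check is still a person reading the result:

```python
data = (b'\xa4\xaa\xc6\xc0\xa4\xc7\xca\xd8\xcd\xf8\xa4\xca\xa5\xb5\xa1\xbc'
        b'\xa5\xd3\xa5\xb9\xa4\xac\xa4\xa4\xa4\xc3\xa4\xd1\xa4\xa4')

results = {}
for enc in ('shift-jis', 'iso-2022-jp', 'euc-jp'):
    try:
        results[enc] = data.decode(enc)
    except UnicodeDecodeError:
        results[enc] = None            # this codec rejects the bytes

# Candidates that decoded without error; inspect their text by eye.
survivors = [enc for enc, text in results.items() if text is not None]

# Re-encode the surviving text in a known encoding for viewing or saving.
as_utf8 = results['euc-jp'].encode('utf-8')
```

Writing `as_utf8` to a file and opening it in a UTF-8-aware editor or browser is one way to do the visual comparison.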
>>>> u
> u'\u6669\u6328\u2970\u2e64\u6567\u4574\u656c\u656d\u746e\u7942\u6449\u2728\u6274\u2761\u2e29\u6e69\u656e\u4872\u4d54\u3d4c\u3c27\u2061\u7268\u6665\u223d\u722f\u782f\u7270\u3e22\u663c\u6e6f\u2074\u6f63\u6f6c\u3d72\u2322\u3030\u3030\u3939\u2022\u6973\u657a\u223d\u2231\ua43e\uc6aa\ua4c0\ucac7\ucdd8\ua4f8\ua5ca\ua1b5\ua5bc\ua5d3\ua4b9\ua4ac\ua4a4\ua4c3\ua4d1\u3ca4\u662f\u6e6f\u3e74\u2f3c\u3e61\u3b27'
>>>> print u
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-65: ordinal not in range(256)
"print" cannot directly handle Unicode text. It has to encode it,
using your terminal's encoding (sys.stdout.encoding). Yours is Latin-1, which
can't represent the Unicode code points in "u".
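The same failure can be reproduced without a terminal by encoding explicitly. A Python 3 sketch with two Japanese characters as the sample (names are mine), including two error handlers that trade fidelity for not crashing:

```python
text = '\u304a\u5f97'                      # two Japanese characters

try:
    text.encode('latin-1')                 # what a Latin-1 terminal requires
    fits = True
except UnicodeEncodeError:
    fits = False                           # code points above 255 do not fit

# Error handlers avoid the exception at the cost of losing information.
fallback = text.encode('latin-1', 'replace')
escaped = text.encode('latin-1', 'backslashreplace')

# An encoding that covers all of Unicode has no such problem.
utf8 = text.encode('utf-8')
```

For output you intend to keep, pick an encoding that can represent the text; the error handlers are mostly useful for debugging displays.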
>>>> w = open('out.txt', 'w')
>>>> w.write('%s\n' % u.encode('utf-16'))
>>>> w.close()
You're re-encoding the Unicode text with the same codec/encoding used
to decode it, UTF-16. Although UTF-16 will decode just about
anything, this doesn't work because the decoded "text" is corrupted.
Bytes may have been swapped. Try comparing in.txt with out.txt.
>>>> w = open('out.txt')
>>>> print w.read().rstrip()
> ÿ©≠if(cp)d.getElementById('tba').innerHTML='<a
href="/r/xpr"><font
> color="#000099"
size="1">¢¥ªÆÀ¢¥ÇÊØÃø¢¥Ê¥µ¡1â„4¥Ó¥©ˆ¢¥¬¢¥¢¥¢¥Ã¢¥Ñ¢¥¢¥</font></a>';
Here's what I get:
>>> d = 'if(cp)d.getElementById(\'tba\').innerHTML=\'<a href="/r/xpr"><font color="#000099" size="1">\xa4\xaa\xc6\xc0\xa4\xc7\xca\xd8\xcd\xf8\xa4\xca\xa5\xb5\xa1\xbc\xa5\xd3\xa5\xb9\xa4\xac\xa4\xa4\xa4\xc3\xa4\xd1\xa4\xa4</font></a>\';'
>>> t = d.decode('utf-16')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/sw/src/root-python24-2.4-5/sw/lib/python2.4/encodings/utf_16.py", line 16, in decode
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 96-97: illegal UTF-16 surrogate
>>> t = d.decode('shift-jis')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 98-99: illegal multibyte sequence
>>> t = d.decode('iso-2022-jp')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'iso2022_jp' codec can't decode byte 0xa4 in position 89: illegal multibyte sequence
>>> t = d.decode('euc-jp')
>>> print t.encode('utf-8')
if(cp)d.getElementById('tba').innerHTML='<a href="/r/xpr"><font color="#000099" size="1">お得で便利なサービスがいっぱい</font></a>';
>>>
The Japanese text at the end says "Full of special & convenient
service".