I was glad to hear I am not alone
in feeling that (to quote) "Unicode stinks". UnicodeDecodeError is a constant
pain in the ass for me.
I appreciate this advice on Unicode,
but I'm not entirely sure what to do with it:
- strings are fine for text data that is encoded using
the default encoding
- Unicode should be used for all text data that is not
or cannot be encoded in the default encoding
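In Python 2 terms, I take that advice to mean something like this (a
minimal sketch; the names are my own, not from any library):

    # Python 2: str for text in the default encoding, unicode for the rest.
    ascii_name = 'Ian'         # plain str is fine -- it's all ASCII
    fancy_name = u'Bj\xf6rn'   # non-ASCII text has to be a unicode object

    # Mixing the two only works while the str side happens to be ASCII:
    greeting = 'Hello, ' + fancy_name  # 'Hello, ' is implicitly decoded -- OK here
    print greeting.encode('utf-8')     # encode explicitly on the way out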
I still have str() calls and __str__ methods all over the
place, and Unicode sneaks into the most unexpected places.
Sometimes I think my life would be much, much easier if my default
encoding was UTF-8 instead of ASCII. Isn't that a nice, happy encoding? Sure, a UTF-8
string isn't equivalent to a Unicode string. The lengths don't match
up, and some Unicode-aware operations (e.g., ones that treat letters
differently from punctuation) won't work. Most of my strings
are sufficiently opaque that I don't care, though. And, doing server
programming, UTF-8 is a good encoding; there's no such thing as locale
for me. But even setting the default encoding has been made
deliberately very difficult.
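Both halves of that, for the record (Python 2; the reload trick is the
well-known, officially discouraged one -- site.py deletes
sys.setdefaultencoding at startup):

    import sys

    # The length mismatch: one character, two bytes once UTF-8 encoded.
    s = u'\u0100'
    print len(s)                   # 1
    print len(s.encode('utf-8'))   # 2

    # The deliberately-difficult part: reload(sys) resurrects the
    # setdefaultencoding() that site.py deleted at startup.
    reload(sys)
    sys.setdefaultencoding('utf-8')
    print sys.getdefaultencoding()  # 'utf-8'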
I just really don't know what I should do. Should I replace all my
__str__ methods with __unicode__? Should I set up a boundary
where I carefully decode all strings, making sure I'm using Unicode
everywhere in my app? These are rather hard things to do, because
"inside" is a rather leaky place: there are all these libraries other
people wrote, external inputs that are hidden from me, etc.
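If I did set up that boundary, I imagine it would look something like
this (a sketch; the class and the helper are made up, not any blessed
API):

    class Person(object):
        def __init__(self, name):
            self.name = name  # always unicode inside the boundary

        def __unicode__(self):
            return u'Person: %s' % self.name

        def __str__(self):
            # callers that want bytes get one encoding, consistently
            return unicode(self).encode('utf-8')

    def from_outside(data, encoding='utf-8'):
        # decode everything crossing the boundary into the app
        if isinstance(data, str):
            return data.decode(encoding, 'replace')
        return data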
For instance, imagine some library that writes data to a file
occasionally. Maybe it's a cache; the data is opaque. It expects
strings. What does it do when it gets a Unicode object? Very
possibly it writes it, so long as it's encodable in the default
encoding (typically ASCII). In fact, this works great for me,
because my name and everything I write is ASCII; I'm not even sure
how to input anything but ASCII.
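A sketch of that failure mode (the cache function is made up, standing
in for somebody's library):

    # Python 2 -- a stand-in for some library's opaque cache writer.
    def cache_write(path, data):
        f = open(path, 'wb')
        f.write(data)   # expects str; a unicode object can sneak through
        f.close()

    cache_write('cache.dat', u'Ian')       # "works": encodable as ASCII
    cache_write('cache.dat', u'Bj\xf6rn')  # UnicodeEncodeError, deep in the library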
How do I, the ignorant English-speaking-and-typing American, even
make a test case? Well, sometimes I write
u"\u0100" or something; I don't even know what that character is,
but at least I know it's Real Unicode. Sucks that it takes 9
characters to give me that one Unicode character I want. And in
practice I usually leave this out of my tests. Then some European
comes along with an umlaut in their name, and BOOM!
UnicodeDecodeError -- and I didn't even know strings were
involved. It's not even my library. Nothing is safe from these
blasted characters. And the problem isn't localized -- Unicode works
implicitly often enough that the Unicode can leak in long before it
causes a problem, and a subtle difference, like the one between "%s" % obj
and str(obj), can cascade throughout the system.
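Concretely (Python 2 again; each BOOM is the error you'd actually see):

    obj = u'Bj\xf6rn'

    s = '%s' % obj   # quietly returns a *unicode* object
    print type(s)    # <type 'unicode'> -- Unicode has leaked in unnoticed

    try:
        str(obj)     # the same-looking operation blows up instead
    except UnicodeEncodeError, e:
        print 'BOOM: %s' % e

    try:
        'Bj\xc3\xb6rn' + obj  # later, mixed with bytes some library handed back
    except UnicodeDecodeError, e:
        print 'BOOM: %s' % e  # ...and there's the UnicodeDecodeError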
(And just try commenting on this post with anything but ASCII, I dare you!)