Besides the concrete problems with the status quo of Unicode in
Python, I think there's a general philosophical problem with the way
Unicode is expected to be used in Python.
The convention in other languages is that you define boundaries, and you put
thought into the encoding at those boundaries (maybe using some
particular metadata like an HTTP header, maybe using convention or
configuration -- there's no single rule). Then inside those boundaries
there's the safe, all-Unicode inner world.
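To make the boundary idea concrete, here's a minimal sketch in modern terms, where str is the decoded text type and bytes the raw type (the helper names are hypothetical): decode exactly once on the way in, encode exactly once on the way out, and everything in between is text.

```python
def receive(raw, encoding="utf-8"):
    """Boundary in: decode bytes from the outside world exactly once."""
    return raw.decode(encoding)

def send(text, encoding="utf-8"):
    """Boundary out: encode back to bytes only when leaving the program."""
    return text.encode(encoding)

# Inside the boundaries, everything is decoded text:
greeting = receive(b"na\xc3\xafve caf\xc3\xa9")
assert isinstance(greeting, str)

# Only at the outgoing edge does it become bytes again:
wire = send(greeting.upper())
assert isinstance(wire, bytes)
```

The encoding itself can come from anywhere -- an HTTP header, a config file, a convention -- the point is only that the decode and encode calls live at the edges, not scattered through the inner code.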
This is a good solution for Java. Unfortunately it just doesn't work
in Python, because you can't build good boundaries in Python. There
are a couple of reasons:
- Python is not statically typed. If it were, we could use the type
declarations to make it very clear where those boundaries were, and
which parts of the code required decoded strings. Adaptation-based
type declarations would probably be just as effective here.
- We have two kinds of strings, Java has one.
- Those two kinds of strings act almost exactly like each other.
This means duck typing does not work. If the two string objects
had very different sets of methods and were not interchangeable,
then the boundaries would become very clear at runtime. (The
near-interchangeability of the two types is itself a string-related
wart in Python.) As it is, str objects can get deep into the system
before a Unicode expectation causes an exception.
- Byte (non-Unicode) strings are prevalent in Python code, both in the
core and in libraries. If you only use mindfully-written code that
deals with Unicode properly you are okay. This is fine for,
say, Zope users, or for people who do all their work as XML
transformations, since XML libraries are another place where Unicode
is mindfully supported. But for people who don't live in a walled
city of vetted code, this doesn't work.
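Here's a sketch of why interchangeable method sets defeat duck typing. All the class names below are hypothetical illustrations, not real Python types; they stand in for decoded and undecoded strings:

```python
class Decoded:
    """Hypothetical stand-in for a decoded (Unicode) string."""
    def __init__(self, text):
        self.text = text
    def upper(self):
        return Decoded(self.text.upper())

class Encoded:
    """Hypothetical stand-in for a byte string -- same method name!"""
    def __init__(self, data):
        self.data = data
    def upper(self):
        return Encoded(self.data.upper())

def shout(s):
    # Duck typing: this happily accepts either type.
    return s.upper()

# The wrong type passes through unnoticed, and only blows up much
# later, when something deep in the system finally cares:
result = shout(Encoded(b"bytes"))  # no error here

class ByteStringSketch:
    """Hypothetical byte type whose API deliberately has no upper()."""
    def __init__(self, data):
        self.data = data

try:
    shout(ByteStringSketch(b"bytes"))
except AttributeError:
    pass  # fails immediately, right where the mistake was made
```

With overlapping method names the mistake travels; with disjoint APIs it surfaces at the first call, which is exactly where a boundary should be.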
If we got rid of str entirely and added a bytestring type (with a
very different API from strings!) then the rest of Python's system
would work. Duck typing would work. You wouldn't have to learn best
practices through hard-won experience, and you wouldn't have to audit
every piece of outside code you use for problems. You could handle
Unicode safely, and confirm it through unit testing and during the
development process. But that's not where we are now; and as a result
Python is very prickly and unfriendly when it comes to this issue.
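What that confirmation could look like, sketched as an ordinary unit test against a hypothetical boundary function (parse_header is invented for illustration; str/bytes here are the modern decoded and raw types):

```python
import unittest

def parse_header(raw, encoding="utf-8"):
    """Hypothetical boundary function: everything it returns is text."""
    return raw.decode(encoding).strip()

class BoundaryTests(unittest.TestCase):
    def test_output_is_text_not_bytes(self):
        value = parse_header(b"  Content-Type: text/html  ")
        self.assertIsInstance(value, str)       # decoded text...
        self.assertNotIsInstance(value, bytes)  # ...never raw bytes

    def test_non_ascii_survives_decoding(self):
        self.assertEqual(parse_header(b"caf\xc3\xa9"), "caf\xe9")
```

With distinct types, "did we decode at the boundary?" becomes a question a test suite can actually answer, instead of something you discover in production when non-ASCII data arrives.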
Most criticisms of dynamic typing apply to this very case, and those
criticisms are right. This is a case where dynamic typing leads very
directly to runtime bugs that are difficult to predict and difficult
to detect. Dynamic typing only works when you adhere to certain
important principles -- one of them is that objects which are not
interchangeable should not share method names.
As a stop-gap I think setdefaultencoding will paper over a lot of
these issues. It's not perfect by any means. It's akin to being able
to add numbers to strings, with the numbers automatically coerced
into strings in the process -- it's not the sort of feature you
really want to introduce; it's clearly sloppy. But until Python 3.0
it's the best option I see.
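To make the analogy concrete, here is roughly what such automatic coercion amounts to (coerce_add is a hypothetical illustration, not a real feature):

```python
def coerce_add(a, b):
    """Hypothetical: add a number to a string by silently converting
    the number -- much as setdefaultencoding silently converts byte
    strings to Unicode when they get mixed."""
    if isinstance(a, str) and isinstance(b, (int, float)):
        b = str(b)
    return a + b

# Convenient, but the type mismatch is papered over, not fixed:
assert coerce_add("total: ", 42) == "total: 42"

# Without the coercion, Python fails fast at the mixing point:
try:
    "total: " + 42
except TypeError:
    pass  # the error surfaces exactly where the types were mixed
```

The coercion makes the immediate error go away; the cost is that the program no longer tells you where decoded and undecoded data got mixed.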