Besides the concrete problems with the status quo of Unicode in
Python, I think there's a general philosophical problem with the way
Unicode is expected to be used in Python.
The convention in other languages is that you define boundaries, and you put
thought into the encoding at those boundaries (maybe using some
particular metadata like an HTTP header, maybe using convention or
configuration -- there's no single rule). Then inside those boundaries
there's the safe, all-Unicode inner world.
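To make the boundary idea concrete, here's a minimal sketch in modern terms, where str is the decoded text type and bytes the raw type (the helper names are hypothetical): decode exactly once on the way in, encode exactly once on the way out, and everything in between is text.

```python
def receive(raw, encoding="utf-8"):
    """Boundary in: decode bytes from the outside world exactly once."""
    return raw.decode(encoding)

def send(text, encoding="utf-8"):
    """Boundary out: encode back to bytes only when leaving the program."""
    return text.encode(encoding)

# Inside the boundaries, everything is decoded text:
greeting = receive(b"na\xc3\xafve caf\xc3\xa9")
assert isinstance(greeting, str)

# Only at the outgoing edge does it become bytes again:
wire = send(greeting.upper())
assert isinstance(wire, bytes)
```

The encoding itself can come from anywhere -- an HTTP header, a config file, a convention -- the point is only that the decode and encode calls live at the edges, not scattered through the inner code.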
This is a good solution for Java. Unfortunately it just doesn't work
in Python, because you can't build good boundaries in Python. There
are a couple of reasons:
- Python is not statically typed. If it were, we could use the type
declarations to make it very clear where those boundaries were, and
which parts of the code required decoded strings. Adaptation-based
type declarations would probably be just as effective here.
- We have two kinds of strings, Java has one.
- Those two kinds of strings act almost exactly like each other.
This means duck typing does not work. If the two string objects
had very different sets of methods and were not interchangeable,
then the boundaries would become very clear at runtime. (The
near-interchangeability of the two types is itself a string-related
wart in Python.) As it is, str objects can get deep into the system
before a Unicode expectation causes an exception.
- Byte (non-Unicode) strings are prevalent in Python code, both in the
core and in libraries. If you only use mindfully-written code that
deals with Unicode properly you are okay. This is fine for,
say, Zope users, or for people who do all their work as XML
transformations, since XML libraries are another place where Unicode
is mindfully supported. But for people who don't live in a walled
city of vetted code, this doesn't work.
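Here's a sketch of why interchangeable method sets defeat duck typing. All the class names below are hypothetical illustrations, not real Python types; they stand in for decoded and undecoded strings:

```python
class Decoded:
    """Hypothetical stand-in for a decoded (Unicode) string."""
    def __init__(self, text):
        self.text = text
    def upper(self):
        return Decoded(self.text.upper())

class Encoded:
    """Hypothetical stand-in for a byte string -- same method name!"""
    def __init__(self, data):
        self.data = data
    def upper(self):
        return Encoded(self.data.upper())

def shout(s):
    # Duck typing: this happily accepts either type.
    return s.upper()

# The wrong type passes through unnoticed, and only blows up much
# later, when something deep in the system finally cares:
result = shout(Encoded(b"bytes"))  # no error here

class ByteStringSketch:
    """Hypothetical byte type whose API deliberately has no upper()."""
    def __init__(self, data):
        self.data = data

try:
    shout(ByteStringSketch(b"bytes"))
except AttributeError:
    pass  # fails immediately, right where the mistake was made
```

With overlapping method names the mistake travels; with disjoint APIs it surfaces at the first call, which is exactly where a boundary should be.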
If we got rid of str entirely and added a bytestring type (with a
very different API from strings!) then the rest of Python's system
would work. Duck typing would work. You wouldn't have to learn best
practices through hard-won experience, and you wouldn't have to audit
every piece of outside code you use for problems. You could handle
Unicode safely, and confirm it through unit testing and during the
development process. But that's not where we are now; and as a result
Python is very prickly and unfriendly when it comes to this issue.
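What that confirmation could look like, sketched as an ordinary unit test against a hypothetical boundary function (parse_header is invented for illustration; str/bytes here are the modern decoded and raw types):

```python
import unittest

def parse_header(raw, encoding="utf-8"):
    """Hypothetical boundary function: everything it returns is text."""
    return raw.decode(encoding).strip()

class BoundaryTests(unittest.TestCase):
    def test_output_is_text_not_bytes(self):
        value = parse_header(b"  Content-Type: text/html  ")
        self.assertIsInstance(value, str)       # decoded text...
        self.assertNotIsInstance(value, bytes)  # ...never raw bytes

    def test_non_ascii_survives_decoding(self):
        self.assertEqual(parse_header(b"caf\xc3\xa9"), "caf\xe9")
```

With distinct types, "did we decode at the boundary?" becomes a question a test suite can actually answer, instead of something you discover in production when non-ASCII data arrives.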
Most criticisms of dynamic typing apply to this very case, and those
criticisms are right. This is a case where dynamic typing leads very
directly to runtime bugs that are difficult to predict and difficult
to detect. Dynamic typing only works when you adhere to certain
important principles -- one of them is that objects which are not
interchangeable should not share method names.
As a stop-gap I think setdefaultencoding will paper over a lot of
these issues. It's not perfect by any means. It's akin to being able
to add numbers to strings, with the numbers automatically coerced
into strings in the process -- it's not the sort of feature you
really want to introduce; it's clearly sloppy. But until Python 3.0
it's the best option I see.
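To make the analogy concrete, here is roughly what such automatic coercion amounts to (coerce_add is a hypothetical illustration, not a real feature):

```python
def coerce_add(a, b):
    """Hypothetical: add a number to a string by silently converting
    the number -- much as setdefaultencoding silently converts byte
    strings to Unicode when they get mixed."""
    if isinstance(a, str) and isinstance(b, (int, float)):
        b = str(b)
    return a + b

# Convenient, but the type mismatch is papered over, not fixed:
assert coerce_add("total: ", 42) == "total: 42"

# Without the coercion, Python fails fast at the mixing point:
try:
    "total: " + 42
except TypeError:
    pass  # the error surfaces exactly where the types were mixed
```

The coercion makes the immediate error go away; the cost is that the program no longer tells you where decoded and undecoded data got mixed.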