So... thinking some more about my Unicode woes,
I think UTF-8 is the Right Default Encoding For Me. I think it will
solve a large number of my problems.
If you set the default encoding to UTF-8, things like
str(u'\u0100') actually work (and give you the encoded
version). If you concatenate the result ('\xc4\x80') to a Unicode
string, the byte string is automatically decoded and it works perfectly.
This is what I want! UTF-8, being a superset of ASCII, happens to be
the encoding I'm already using in my source code. I'm perfectly
happy moving as many of my external data sources to UTF-8 as
possible. I'll set AddDefaultCharset in Apache, I'll fiddle with my
database, whatever. In those cases where I can't, I'll just have to
carefully decode the data, but I have to do that anyway. To the
degree I can make my systems and communications consistently UTF-8,
things will just get better. I really don't see a downside.
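
To make the str(u'\u0100') round trip above concrete, here is a minimal sketch that does explicitly what a UTF-8 default encoding would do implicitly in Python 2. The variable names are mine, and the explicit encode/decode calls stand in for the implicit conversions, so it runs the same with or without the default changed.

```python
# -*- coding: utf-8 -*-
text = u"\u0100"  # LATIN CAPITAL LETTER A WITH MACRON

# What str(text) would return in Python 2 if the default encoding were UTF-8.
encoded = text.encode("utf-8")
assert encoded == b"\xc4\x80"

# What Python 2 would do implicitly when the byte string is concatenated
# to a Unicode string: decode it with the default encoding first.
combined = encoded.decode("utf-8") + u" and more"
assert combined == u"\u0100 and more"
```

With the stock ASCII default, both implicit conversions raise UnicodeEncodeError or UnicodeDecodeError instead, which is the breakage I'm complaining about.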
But why does Python make it SO DAMN HARD to change my encoding? I
don't understand this at all. There is a function
sys.setdefaultencoding, but site.py (which is loaded on Python startup)
deletes the function.
I feel like someone decided they were smarter than me, but I'm not
sure I believe them.
From what I can tell, there are three ways to fix the default encoding:
- Edit site.py (in the standard library) directly. Seems like a
bad idea. Though maybe I'll just delete the del
sys.setdefaultencoding line... anyway, site.py might appear in
other places on your computer as well (e.g.,
/etc/pythonX.Y/site.py).
- Create sitecustomize.py in the standard path
(lib/pythonX.Y). This will apply to all processes. But I'm not
sure I feel safe with affecting all Python processes. You could
also save sys.setdefaultencoding here (under a different name) for
later access.
- Put sitecustomize.py in the current directory you run Python
from. But . is not on sys.path by default (I think
site.py adds it after it tries to import sitecustomize), so you
have to put it in PYTHONPATH.
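
For the sitecustomize.py options, a sketch of what the file might contain. This assumes site.py imports sitecustomize before it deletes sys.setdefaultencoding; the function name set_default_encoding and the attribute sys._saved_setdefaultencoding are names I made up for illustration, not anything Python defines.

```python
# sitecustomize.py: a sketch; it must live in a directory on sys.path.
import sys


def set_default_encoding(encoding="utf-8"):
    """Set the default encoding if the interpreter still exposes the hook.

    Returns True when sys.setdefaultencoding was available and was called.
    """
    setter = getattr(sys, "setdefaultencoding", None)
    if setter is None:
        # Already deleted by site.py, or running on Python 3,
        # where the function does not exist at all.
        return False
    # Keep a reference under a different (hypothetical) name so code can
    # still reach it after site.py deletes the original attribute.
    sys._saved_setdefaultencoding = setter
    setter(encoding)
    return True


applied = set_default_encoding()
```

The getattr guard means the file does nothing harmful if it is imported too late, so it is at least safe to leave lying around.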
There's some discussion in the comments here.
This post
suggests running reload(sys) to restore setdefaultencoding,
which is very clean to enable (none of this site crap) but reloading
sys scares me a bit.
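
A sketch of the reload(sys) approach, for comparison. One commonly reported side effect of reloading sys is that the standard streams get reset, so this version saves and restores them; I am treating that as an assumption rather than something I have verified everywhere. The function name is hypothetical, and the version check makes it a no-op outside Python 2.

```python
import sys


def restore_and_set_default_encoding(encoding="utf-8"):
    """Python 2 only: bring back sys.setdefaultencoding and call it.

    Returns True if the default encoding was changed, False otherwise.
    """
    if sys.version_info[0] != 2:
        return False
    # Reloading sys may replace the stream objects, so remember them.
    saved_streams = (sys.stdin, sys.stdout, sys.stderr)
    reload(sys)  # builtin on Python 2; restores the deleted function
    sys.stdin, sys.stdout, sys.stderr = saved_streams
    sys.setdefaultencoding(encoding)
    return True


changed = restore_and_set_default_encoding()
```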
And searching about I didn't see one justification for why doing any of this
is bad, just references to it being a hack, which is not very convincing.
Are people claiming that there should be no default encoding? As long as
we have non-Unicode strings, I find the argument less than convincing, and I
think it reflects the perspective of people who take Unicode very seriously,
as compared to programmers who aren't quite so concerned but just want their
applications to not be broken; and the status quo is very deeply
broken.