The Artima Developer Community

Python Buzz Forum
Why Python Unicode Sucks

Ian Bicking

Ian Bicking is a freelance programmer
Posted: Aug 9, 2005 9:22 PM

This post originated from an RSS feed registered with Python Buzz by Ian Bicking.
Original Post: Why Python Unicode Sucks
Feed Title: Ian Bicking
Feed URL: http://www.ianbicking.org/feeds/atom.xml
Feed Description: Thoughts on Python and Programming.

Besides concrete problems with the current status quo in Python Unicode, I think there's a general philosophical problem with the way Unicode is expected to be used in Python.

The convention in other languages is that you define boundaries, and you put thought into the encoding at those boundaries (maybe using some particular metadata like an HTTP header, maybe using convention or configuration -- there's no single rule). Then inside the boundaries there's the safe All-Unicode inner world.
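To make the boundary convention concrete, here is a minimal sketch. It is written with Python 3 names, where bytes stands in for raw input and str for decoded text (an assumption; in the Python this post discusses, those roles are played by str and unicode). The function names and the sample data are hypothetical, not taken from any library.

```python
def decode_at_boundary(raw: bytes, declared_charset: str) -> str:
    """Inbound boundary: decode bytes using metadata such as an HTTP header."""
    return raw.decode(declared_charset)


def shout(text: str) -> str:
    """Inner world: works on decoded text only and never sees raw bytes."""
    return text.upper() + "!"


def encode_at_boundary(text: str, charset: str) -> bytes:
    """Outbound boundary: choose an encoding once, on the way out."""
    return text.encode(charset)


raw_in = b"caf\xe9"                            # Latin-1 bytes arriving from outside
text = decode_at_boundary(raw_in, "latin-1")   # encoding decided at the edge
raw_out = encode_at_boundary(shout(text), "utf-8")
```

The encoding is named exactly twice, once on the way in and once on the way out; everything between those calls handles text and needs no knowledge of charsets.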

This is a good solution for Java. Unfortunately it just doesn't work in Python, because you can't build good boundaries in Python. There are a few reasons:

  • Python is not statically typed. If it were, we could use the typing to make it very clear where those boundaries were, and what parts of the code required decoded strings. Adaptation-based type declarations would probably be just as effective here.
  • We have two kinds of strings; Java has one.
  • Those two kinds of strings act almost exactly like each other. This means duck typing does not work. If the two string objects had a very different set of methods and were not interchangeable, then the boundaries would become very clear at runtime. (This is a similar string-related wart in Python.) As it is, str objects can get deep into the system before a Unicode expectation causes an exception.
  • Byte (non-Unicode) strings are prevalent in Python code, both in the core and in libraries. If you only use mindfully-written code that deals with Unicode properly, you are okay. This is fine for, say, Zope users. Or people who do all their work as XML transformations, since XML libraries are another place where Unicode is mindfully supported. But for people who don't live in a walled city of vetted code, this doesn't work.
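The point about the two string types behaving alike can be illustrated with a small sketch. It assumes Python 3 names, with bytes playing the role of the post's str and str the role of unicode; normalize_name is a hypothetical helper. It uses only methods that both types provide, so nothing in it notices which kind of string it was given.

```python
def normalize_name(name):
    """Hypothetical helper that relies only on methods shared by both string types."""
    return name.strip().title()


text_result = normalize_name("  ian bicking ")    # decoded text goes in
bytes_result = normalize_name(b"  ian bicking ")  # raw bytes go in, no complaint

# Both calls succeed; the only difference is the type that comes back out.
same_type = type(text_result) is type(bytes_result)
```

Because the helper runs cleanly on either input, an undecoded value passes straight through it, and any mismatch can only surface later, in whatever code finally insists on one type.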

If we got rid of str entirely and added a bytestring type (with a very different API than strings!) then the rest of Python's system would work. Duck typing would work. You wouldn't have to learn best practices through hard-won experience, and you wouldn't have to audit every piece of outside code you use for problems. You could handle Unicode safely and confirm it through unit testing and during the development process. But that's not where we are now; and as a result Python is very prickly and unfriendly when it comes to this issue.
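As a rough illustration of what non-interchangeable types buy you, the sketch below uses Python 3, where bytes and str are separate types and mixing them in a concatenation raises TypeError. This is not the design proposed here (Python 3's bytes still shares many methods with str), and greet is a hypothetical function, but it shows the failure appearing regardless of the data involved.

```python
def greet(name):
    """Hypothetical inner-world function that expects decoded text."""
    return "Hello, " + name


try:
    greet(b"Ian")              # plain ASCII bytes, yet still rejected
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# Decoding explicitly at the boundary makes the call valid.
greeting = greet(b"Ian".decode("ascii"))
```

Since the error does not depend on whether the bytes happen to be ASCII, a unit test with any sample input would catch the missing decode step.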

Most criticisms of dynamic typing apply to this very case; and those criticisms are right. This is a case where dynamic typing leads very directly to difficult-to-predict and difficult-to-detect runtime bugs. Dynamic typing only works when you adhere to certain important principles -- one of those is that if objects are not interchangeable they should use differently-named methods.

As a stop-gap I think setdefaultencoding will paper over a lot of these issues. It's not perfect by any means. It's akin to being able to add numbers to strings, and having the numbers automatically coerced into strings in the process -- it's not the sort of feature you really want to introduce; it's clearly sloppy. But until Python 3.0 it's the best option I see.
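What the stop-gap changes can be sketched with a simulation. The implicit_concat function below is hypothetical: it imitates the implicit decode that happens when a byte string is combined with a Unicode string, with the default encoding passed as a parameter instead of being set globally. It is written with Python 3 names (bytes and str) and does not call setdefaultencoding itself.

```python
def implicit_concat(byte_string: bytes, text: str, default_encoding: str = "ascii") -> str:
    """Simulate implicit coercion: decode with the default encoding, then join."""
    return byte_string.decode(default_encoding) + text


utf8_bytes = b"caf\xc3\xa9"    # UTF-8 encoded input
latin1_bytes = b"caf\xe9"      # Latin-1 encoded input


def succeeds(byte_string: bytes, encoding: str) -> bool:
    """Report whether the simulated coercion works for this input and default."""
    try:
        implicit_concat(byte_string, "!", default_encoding=encoding)
        return True
    except UnicodeDecodeError:
        return False


ascii_default_ok = succeeds(utf8_bytes, "ascii")    # non-ASCII data breaks the ASCII default
utf8_default_ok = succeeds(utf8_bytes, "utf-8")     # a UTF-8 default papers over it
latin1_still_ok = succeeds(latin1_bytes, "utf-8")   # other encodings still fail
```

Under these assumptions, switching the default to UTF-8 makes UTF-8 byte strings pass silently, while input in another encoding still fails at the point of coercion, which is why the approach is a stop-gap and not a fix.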

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use