The Artima Developer Community
Python Buzz Forum
Do I hate Unicode, or Do I Hate ASCII?

Ian Bicking

Ian Bicking is a freelance programmer
Do I hate Unicode, or Do I Hate ASCII? Posted: Aug 4, 2005 9:23 PM
This post originated from an RSS feed registered with Python Buzz by Ian Bicking.
Original Post: Do I hate Unicode, or Do I Hate ASCII?
Feed Title: Ian Bicking
Feed URL: http://www.ianbicking.org/feeds/atom.xml
Feed Description: Thoughts on Python and Programming.

I was glad to hear I am not alone in feeling that (to quote) "Unicode stinks". UnicodeDecodeError is a constant pain in the ass for me.

I appreciate this advice on Unicode, but I'm not entirely sure what to do with it:

  • strings are fine for text data that is encoded using the default encoding
  • Unicode should be used for all text data that is not or cannot be encoded in the default encoding
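That advice, restated in code: plain byte strings are fine as long as they fit the default codec, and anything else needs real Unicode. (A sketch in modern Python 3 terms; when this was written, the default encoding in question was Python 2's ASCII.)

```python
# Byte data that fits the default (ASCII) encoding is unproblematic.
ascii_bytes = b"hello"
assert ascii_bytes.decode("ascii") == "hello"

# Text that can't be encoded in ASCII has to live as Unicode text
# (or be stored in a wider codec like UTF-8).
umlaut = "\u00fc"  # "ü"
try:
    umlaut.encode("ascii")
except UnicodeEncodeError:
    pass  # ASCII can't represent it; this is the "use Unicode" case
assert umlaut.encode("utf-8") == b"\xc3\xbc"
```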

I still have str() calls and __str__ methods all over the place, and Unicode sneaks into the most unexpected places.

Sometimes I think my life would be much, much easier if my default encoding were UTF-8 instead of ASCII. Isn't that a nice, happy encoding? Sure, a UTF-8 string isn't equivalent to a Unicode string. The lengths don't match up, and some Unicode-aware operations (e.g., ones that treat letters differently from punctuation) won't work. Most of my strings are sufficiently opaque that I don't care, though. And, doing server programming, UTF-8 is a good encoding; there's no such thing as a locale for me. But even setting the default encoding has been made deliberately very difficult.
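The "lengths don't match up" point is concrete: a UTF-8 byte string counts bytes, while a Unicode string counts code points. A quick illustration (Python 3 syntax):

```python
s = "\u0100"              # one Unicode code point (Ā)
utf8 = s.encode("utf-8")  # its UTF-8 serialization

assert len(s) == 1        # one character of text
assert len(utf8) == 2     # but two bytes on the wire
assert utf8.decode("utf-8") == s  # round-trips losslessly
```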

I just really don't know what I should do. Should I replace all my __str__ methods with __unicode__? Should I set up a boundary where I carefully decode all strings, making sure I'm using Unicode everywhere in my app? These are hard things to do, because "inside" is a rather leaky place. There are all these libraries other people wrote, external inputs I am hidden from, etc.
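The "boundary" option mentioned above usually looks something like this: decode bytes exactly once, at the edge, and pass only text inward. A minimal sketch in Python 3 terms; `from_wire` is a hypothetical helper, not anything from a real library:

```python
def from_wire(raw, encoding="utf-8"):
    """Decode external byte input once, at the application boundary.

    Everything past this function should deal only in text, so no
    implicit (and surprising) decodes happen deep inside the app.
    """
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw  # already text; pass it through unchanged
```

The catch, as the post says, is that "inside" leaks: third-party libraries and hidden inputs can hand you byte strings that never passed through the boundary.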

For instance, imagine some library that writes data to a file occasionally. Maybe it's a cache; the data is opaque. It expects strings. What does it do when it gets a Unicode object? Very possibly it writes it, if it is encodable with the default encoding (typically ASCII). In fact, this works great for me, because my name and everything I write is ASCII; I'm not even sure how to input anything but ASCII. How do I, the ignorant English-speaking-and-typing American, even make a test case? Well, sometimes I write u"\u0100" or something; I don't even know what that character is, but at least I know it's Real Unicode. It sucks that it takes 9 characters to give me the one Unicode character I want. And in practice I usually leave this out of my tests.

Then some European comes along with an umlaut in their name, and BOOM! UnicodeDecodeError -- and I didn't even know strings were involved. It's not even my library. Nothing is safe from these blasted characters. And the problem isn't localized -- Unicode works implicitly often enough that it can leak in long before it causes a problem, and a subtle difference like that between "%s" % obj and str(obj) can cascade throughout the system.
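The umlaut failure above is reproducible in one line: a non-ASCII byte hits an ASCII decode somewhere deep in the stack, and up comes the exception. A sketch in Python 3 terms (in Python 2 the decode was implicit, which is why it surprised you; here it is spelled out):

```python
name = "J\u00fcrgen"       # a name with an umlaut
raw = name.encode("utf-8")  # the bytes that arrive from outside

try:
    raw.decode("ascii")     # what an implicit ASCII decode attempts
    err = None
except UnicodeDecodeError as e:
    err = e                 # BOOM: byte 0xc3 is not valid ASCII

assert err is not None
assert err.encoding == "ascii"
```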

(And just try commenting on this post with anything but ASCII, I dare you!)

Read: Do I hate Unicode, or Do I Hate ASCII?
