The Artima Developer Community
Sponsored Link

Python Buzz Forum
String hash vs. Unicode hash

0 replies on 1 page.

Welcome Guest
  Sign In

Go back to the topic listing  Back to Topic List Click to reply to this topic  Reply to this Topic Click to search messages in this forum  Search Forum Click for a threaded view of the topic  Threaded View   
Previous Topic   Next Topic
Flat View: This topic has 0 replies on 1 page
Ian Bicking

Posts: 900
Nickname: ianb
Registered: Apr, 2003

Ian Bicking is a freelance programmer
String hash vs. Unicode hash Posted: Aug 14, 2005 9:36 AM
Reply to this message Reply

This post originated from an RSS feed registered with Python Buzz by Ian Bicking.
Original Post: String hash vs. Unicode hash
Feed Title: Ian Bicking
Feed URL: http://www.ianbicking.org/feeds/atom.xml
Feed Description: Thoughts on Python and Programming.
Latest Python Buzz Posts
Latest Python Buzz Posts by Ian Bicking
Latest Posts From Ian Bicking

Advertisement

This is odd...

>>> d = {'test': None}
>>> d[u'test'] = 1
>>> d
{'test': 1}
>>> d = {u'test': None}
>>> d['test'] = 1
>>> d
{u'test': 1}

I guess it makes sense, but it's tricky. 'test' == u'test'; but if you feel Unicode strings are different from byte strings (str), then this is no help. But here's a problem with setdefaultencoding:

>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding('utf-8')
>>> s = u'\u0100'
>>> str(s)
'\xc4\x80'
>>> str(s) == s
True
>>> hash(str(s)), hash(s)
(1207774670, -1591639807)
>>> d = {s: None}
>>> d[str(s)] = 1
>>> d
{u'\u0100': None, '\xc4\x80': 1}

The strings are equal, but they don't hash equally, so the dictionary (being a hash table) puts both in and doesn't notice their equality. Not surprising; equality is default encoding aware (the byte string is decoded before comparing it with the unicode string). In fact you get UnicodeDecodeError if you compare a byte string that can't be decoded in the default encoding to a unicode string. (I know exactly why there's an exception there, and I understand why, and maybe I even see how it's a good idea, but how can you not find it disturbing that these two objects can't be safely compared when almost all other objects, no matter how different in type, can be compared?)

Oh, but I was talking about hashes. Well, the hash algorithm for strings apparently isn't aware of default encodings. (Just in case this was specific to the reload(sys) hack, I also tested it with a change to site.py). Note that hash does work for ASCII-encodable Unicode string (i.e., hash('foo') == hash(u'foo')).

Read: String hash vs. Unicode hash

Topic: Russell Smith and Marian D'Eve, R.I.P. Previous Topic   Next Topic Topic: Why Python Unicode Sucks

Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use