Python Buzz Forum - String hash vs. Unicode hash

This is odd...

>>> d = {'test': None}
>>> d[u'test'] = 1
>>> d
{'test': 1}
>>> d = {u'test': None}
>>> d['test'] = 1
>>> d
{u'test': 1}
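
The same key retention shows up with any two keys that compare equal and hash equally, not just strings. A minimal sketch using 1 and 1.0 (my choice of example, not something specific to str/unicode):

```python
# Two keys that compare equal and hash equally share one dictionary slot.
# The key object inserted first is kept; assignment only replaces the value.
d = {1: None}
d[1.0] = 'updated'

keys = list(d.keys())
print(keys, d[1])
```

The dictionary ends up with a single entry whose key is still the int that went in first, just as the first string key survives in the session above.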

I guess it makes sense, but it's tricky: 'test' == u'test', so the dictionary treats them as the same key and keeps whichever one was inserted first. If you feel Unicode strings are different from byte strings (str), that's no help. But here's a problem with setdefaultencoding:

>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding('utf-8')
>>> s = u'\u0100'
>>> str(s)
'\xc4\x80'
>>> str(s) == s
True
>>> hash(str(s)), hash(s)
(1207774670, -1591639807)
>>> d = {s: None}
>>> d[str(s)] = 1
>>> d
{u'\u0100': None, '\xc4\x80': 1}

The strings are equal, but they don't hash equally, so the dictionary (being a hash table) puts both in and doesn't notice their equality. Not surprising; equality is default-encoding aware (the byte string is decoded before it is compared with the unicode string). In fact, you get a UnicodeDecodeError if you compare a unicode string to a byte string that can't be decoded in the default encoding. (I know exactly why there's an exception there, and maybe I even see how it's a good idea, but how can you not find it disturbing that these two objects can't be safely compared when almost all other objects, no matter how different in type, can be?)
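
The general rule being broken is that objects which compare equal are expected to hash equally. Here is a rough sketch of what happens when they don't, using a hypothetical Key class (not anything from the standard library) whose hash value is passed in by hand so the mismatch is deterministic:

```python
class Key(object):
    """Hypothetical key: equality looks only at `value`; the hash is supplied."""

    def __init__(self, value, hash_value):
        self.value = value
        self.hash_value = hash_value

    def __eq__(self, other):
        return isinstance(other, Key) and self.value == other.value

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return self.hash_value


a = Key('test', 1)
b = Key('test', 2)  # equal to a, but hashes differently
c = Key('test', 1)  # equal to a, and hashes the same

broken = {a: None}
broken[b] = 1       # different hash, so equality is never consulted

consistent = {a: None}
consistent[c] = 1   # same hash and equal, so the existing entry is updated

print(len(broken), len(consistent))
```

The first dictionary ends up holding two "equal" keys, which is the same shape as the u'\u0100' and '\xc4\x80' result above; the second behaves the way you'd expect.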

Oh, but I was talking about hashes. Well, the hash algorithm for strings apparently isn't aware of default encodings. (Just in case this was specific to the reload(sys) hack, I also tested it with a change to site.py.) Note that hash does work for ASCII-encodable Unicode strings (i.e., hash('foo') == hash(u'foo')).

