This post originated from an RSS feed registered with Python Buzz
by Ben Last.
Original Post: One Of These Things Is Not Like The Other
Feed Title: The Law Of Unintended Consequences
Feed URL: http://benlast.livejournal.com/data/rss
Feed Description: The Law Of Unintended Consequences
I've been playing with Leonard Richardson's useful BeautifulSoup module; a lazy, doesn't-care, do-it-anyway parser for HTML. This is, in turn, because I've been trying to knock up a little lookup application that will do translations using the WordReference site, but without the ads, popups, popunders or the hassle of clicking on forms. I'm after something that I can drop a word on and have it translated. Both HTMLParser and htmllib choked on the output from a page such as these definitions of 'cara', so I turned to BeautifulSoup.
Which did almost exactly what I wanted - it ate the HTML and built me an object tree that I could then walk, filtering out what I didn't need. Unfortunately, it and I suffered from a small mismatch of worldview. I use Unicode. A lot.
I grabbed the webpage using urllib, something like:
import urllib

uo = urllib.FancyURLopener()
uo.addheader('Accept-charset', 'utf-8,*')
f = uo.open("http://www.wordreference.com/es/en/translation.asp?spen=" + urllib.quote_plus('cara'))
#Decode the response so we have a unicode string; we always get iso-8859-1, no matter what we ask for.
response = f.read().decode('iso-8859-1')
#Finished with the request
f.close()
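The decode step is doing real work here: in iso-8859-1, a byte like 0xE9 is 'é', so decoding turns the raw bytes into a proper Unicode string. A minimal sketch, with a made-up byte string standing in for the server response:

```python
# A made-up iso-8859-1 byte string; 0xE9 is 'e'-acute in that encoding
raw = b'cara bonita: \xe9'
# Decoding gives a Unicode string containing the real character
text = raw.decode('iso-8859-1')
```

Skip the decode and you're carrying around encoded bytes that will bite you the first time you mix them with real Unicode.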
(It's actually a little more complex - you need to handle the character sets more flexibly, and override the user-agent so that WordReference doesn't block you). Anyway, that gets me a Unicode string in response. I can then pass it to a BeautifulSoup object, with something like:
soup = BeautifulSoup.BeautifulSoup(response)
But... calling soup.first() (or a number of other functions) can throw me the notorious UnicodeEncodeError. Hmm.
It turns out that BeautifulSoup is, for want of a better term, Unicode-oblivious. If you give it a Unicode data source all the internal strings get silently promoted, but there's no specific Unicode handling in there. This is not a bad approach, and would work very well, if it weren't for the fact that the objects use str(), a lot. Printing any BeautifulSoup instance invokes str() to return a string representation, which uses the default encoding, which is often 'ascii'.
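The failure is easy to reproduce outside BeautifulSoup: encoding a Unicode string through an 'ascii' default codec - which is what an implicit str() amounts to - raises as soon as it meets a non-ASCII character. A sketch, with an invented string:

```python
text = u'cara \xe9'  # invented example; \xe9 is 'e'-acute
try:
    # what str() effectively does when the default encoding is 'ascii'
    text.encode('ascii')
    failed = None
except UnicodeEncodeError as err:
    failed = type(err).__name__
# failed == 'UnicodeEncodeError'
```

So the exception isn't BeautifulSoup doing anything exotic - it's the ordinary consequence of str() meeting a character the default codec can't express.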
Implicit in the design of BeautifulSoup is the assumption that str() is a good way to represent/return the "value" of an object. For a Tag object, __repr__ calls __str__. Given that the objects here are derived from a stream of characters, that's not unreasonable, but it misses the point that __str__() is usually supposed to return a printable representation, in the default character encoding. When the result is Unicode that can't be converted to a string, that assumption breaks.
I think what would make more sense (from a Unicode point of view) would be to separate the value of the data from the representation of the data, so that one (for example) accessed the NavigableText.string data attribute (via a function wrapper) to get the value, but accepted that str() applied to an instance would do something like:
def __str__(self):
    """Return representation of self, substituting '?' for characters that can't be encoded."""
    return self.string.encode(sys.getdefaultencoding(), 'replace')  # needs: import sys
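With the 'replace' error handler, unencodable characters come out as '?' instead of raising, so printing always succeeds. A quick sketch of the behaviour the method above relies on:

```python
# 'replace' substitutes '?' for anything ascii can't express,
# rather than raising UnicodeEncodeError
safe = u'caf\xe9'.encode('ascii', 'replace')
# safe == b'caf?'
```

You lose the accented character, but you get a representation you can always print - which is the point of a representation, as opposed to a value.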
Value and representation. Two things that can look the same, but aren't.
Oh, and I still like BeautifulSoup very much; so much so that I'm using it, with a patch to avoid the problem, submitted to Leonard.