Python Buzz Forum - One Of These Things Is Not Like The Other

Ben Last

Posts: 247
Nickname: benlast
Registered: May, 2004

Ben Last is no longer using Python.

One Of These Things Is Not Like The Other

Posted: Jun 13, 2004 1:48 AM

This post originated from an RSS feed registered with Python Buzz by Ben Last.
Original Post: One Of These Things Is Not Like The Other
Feed Title: The Law Of Unintended Consequences
Feed URL: http://benlast.livejournal.com/data/rss
Feed Description: The Law Of Unintended Consequences

I've been playing with Leonard Richardson's useful BeautifulSoup module, a lazy, doesn't-care, do-it-anyway parser for HTML. This is because I've been trying to knock up a little lookup application that does translations using the WordReference site, but without the ads, popups, popunders or the hassle of clicking through forms. I'm after something that I can drop a word on and have it translated. Both HTMLParser and htmllib choked on the output from pages such as the definitions of 'cara', so I turned to BeautifulSoup.

Which did almost exactly what I wanted - it ate the HTML and built me an object tree that I could then walk, filtering out what I didn't need. Unfortunately, it and I suffered from a small mismatch of worldview. I use Unicode. A lot.

I grabbed the webpage using urllib, something like:

import urllib

uo = urllib.FancyURLopener()
uo.addheader('Accept-charset','utf-8,*')
#Append the URL-quoted word to the query string (passing it as a second argument would send it as POST data).
f = uo.open("http://www.wordreference.com/es/en/translation.asp?spen=" + urllib.quote_plus('cara'))
#Decode the response so we have a unicode string; we always get iso-8859-1, no matter what we ask for.
response = f.read().decode('iso-8859-1')
#Finished with the request
f.close()

(It's actually a little more complex - you need to handle the character sets more flexibly, and override the user-agent so that WordReference doesn't block you).
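As a rough idea of what "more flexibly" might mean, here is a minimal sketch that reads the charset from a Content-Type header value and falls back to iso-8859-1 when the charset is missing, unknown or wrong. It is written for current Python rather than the version used above, and the function names (charset_from_content_type, decode_body) are made up for illustration; they are not part of urllib or of my actual code.

```python
def charset_from_content_type(content_type, default='iso-8859-1'):
    """Return the charset named in a Content-Type value, or the default."""
    # Parameters follow the media type, separated by semicolons.
    for part in content_type.split(';')[1:]:
        name, _, value = part.strip().partition('=')
        if name.strip().lower() == 'charset' and value:
            return value.strip().strip('"\'').lower()
    return default


def decode_body(raw, content_type, default='iso-8859-1'):
    """Decode raw bytes using the declared charset, falling back to the default."""
    charset = charset_from_content_type(content_type, default)
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name, or the server lied about the encoding.
        return raw.decode(default, 'replace')


body = decode_body(b'caf\xe9', 'text/html; charset=utf-8')
```

Here the bytes are not valid UTF-8 despite what the header claims, so the fallback decodes them as iso-8859-1 instead of raising.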
Anyway, that gets me a Unicode string in response. I can then pass it to a BeautifulSoup object, with:

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup()
soup.feed(response)

But... calling soup.first() (or a number of other functions) can throw the notorious UnicodeEncodeError. Hmm.

It turns out that BeautifulSoup is, for want of a better term, Unicode-oblivious. If you give it a Unicode data source all the internal strings get silently promoted, but there's no specific Unicode handling in there. This is not a bad approach, and would work very well, if it weren't for the fact that the objects use str(), a lot. Printing any BeautifulSoup instance invokes str() to return a string representation, which uses the default encoding, which is often 'ascii'.
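The failure can be reproduced without BeautifulSoup at all. In the Python of this post, str() on a Unicode string encodes it with the default codec; the sketch below makes that encode explicit (with 'ascii' standing in for the default) so that it behaves the same on any Python version. The sample word is just an illustration.

```python
text = u'ma\xf1ana'  # contains U+00F1, which ASCII cannot represent

try:
    # What an implicit str() conversion amounts to when the default is 'ascii'.
    text.encode('ascii')
    failed = False
    reason = ''
except UnicodeEncodeError as exc:
    failed = True
    reason = str(exc)

print(failed)
print(reason)
```

Any code path that stringifies such text implicitly hits the same exception, which is why it turns up in so many otherwise unrelated calls.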

Implicit in the design of BeautifulSoup is the assumption that str() is a good way to represent/return the "value" of an object. For a Tag object, __repr__ calls __str__. Given that the objects here are derived from a stream of characters, that's not unreasonable, but it misses the point that __str__() is usually supposed to return a printable representation, in the default character encoding. When the value is Unicode containing characters that the default encoding can't represent, that assumption breaks.

I think what would make more sense (from a Unicode point of view) would be to separate the value of the data from the representation of the data, so that one (for example) accessed the NavigableText.string data attribute (via a function wrapper) to get the value, but accepted that str() applied to an instance would do something like:

import sys

def __str__(self):
    """Return representation of self, replacing characters that can't be encoded."""
    return self.string.encode(sys.getdefaultencoding(),'replace')

Value and representation. Two things that can look the same, but aren't.
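To make the split concrete, here is a small sketch of a text node that keeps the two apart. TextNode, value() and printable() are hypothetical names invented for this example, not BeautifulSoup's classes or methods, and the code is written for current Python, where an explicit method replaces the __str__ override because __str__ may no longer return encoded bytes.

```python
class TextNode(object):
    """Hypothetical stand-in for a parsed text node."""

    def __init__(self, string):
        # The value: the full Unicode text, never lossy.
        self.string = string

    def value(self):
        """Return the exact text, suitable for further processing."""
        return self.string

    def printable(self, encoding='ascii'):
        """Return a lossy byte representation that is safe to emit in the given encoding."""
        return self.string.encode(encoding, 'replace')


node = TextNode(u'ma\xf1ana')
value = node.value()        # the original text, intact
shown = node.printable()    # ASCII-safe, with '?' in place of the unencodable character
```

The value round-trips untouched, while the representation is allowed to lose information because it only exists to be looked at.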

Oh, and I still like BeautifulSoup very much; so much so that I'm using it, with a patch to avoid the problem, submitted to Leonard.

