The Artima Developer Community

Python Buzz Forum
One Of These Things Is Not Like The Other

Ben Last

One Of These Things Is Not Like The Other Posted: Jun 13, 2004 1:48 AM

This post originated from an RSS feed registered with Python Buzz by Ben Last.
Original Post: One Of These Things Is Not Like The Other
Feed Title: The Law Of Unintended Consequences
Feed URL: http://benlast.livejournal.com/data/rss
Feed Description: The Law Of Unintended Consequences

I've been playing with Leonard Richardson's useful BeautifulSoup module; a lazy, doesn't-care, do-it-anyway parser for HTML.  This is, in turn, because I've been trying to knock up a little lookup application that will do translations using the WordReference site, but without the ads,  popups, popunders or the hassle of clicking on forms.  I'm after something that I can drop a word on and have it translated.  Both HTMLParser and htmllib choked on the output from a page such as these definitions of 'cara', so I turned to BeautifulSoup.

Which did almost exactly what I wanted - it ate the HTML and built me an object tree that I could then walk, filtering out what I didn't need.  Unfortunately, it and I suffered from a small mismatch of worldview.  I use Unicode.  A lot.

I grabbed the webpage using urllib, something like:
import urllib

uo = urllib.FancyURLopener()
uo.addheader('Accept-charset','utf-8,*')
#Append the quoted search word to the URL (passing it as a second argument would send it as POST data).
f = uo.open("http://www.wordreference.com/es/en/translation.asp?spen=" + urllib.quote_plus('cara'))
#Decode the response so we have a unicode string; we always get iso-8859-1, no matter what we ask for.
response = f.read().decode('iso-8859-1')
#Finished with the request
f.close()

(It's actually a little more complex - you need to handle the character sets more flexibly, and override the user-agent so that WordReference doesn't block you).
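One way the more flexible character-set handling might look is sketched below, in Python 3 rather than the Python 2 of the snippet above. The function name decode_body, its parameters and the iso-8859-1 fallback are assumptions made for illustration; this is not the code actually used for the lookup application.

```python
from email.message import Message


def decode_body(raw, content_type, fallback="iso-8859-1"):
    """Decode raw response bytes using the charset named in a Content-Type value.

    Hypothetical helper: falls back to `fallback` when no charset is
    declared or when the declared charset is unknown to Python.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or fallback
    try:
        return raw.decode(charset)
    except LookupError:
        # The server named an encoding Python does not recognise.
        return raw.decode(fallback)
```

With a helper like this, the Content-Type header of the response decides the encoding, and the hard-coded iso-8859-1 only applies when the server says nothing useful.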
Anyway, that gets me a Unicode string in response.  I can then pass it to a BeautifulSoup object, with:
soup = BeautifulSoup.BeautifulSoup()
soup.feed(response)

But... calling soup.first() (or a number of other functions) can raise the notorious UnicodeEncodeError.  Hmm.

It turns out that BeautifulSoup is, for want of a better term, Unicode-oblivious.  If you give it a Unicode data source all the internal strings get silently promoted, but there's no specific Unicode handling in there.  This is not a bad approach, and would work very well, if it weren't for the fact that the objects use str(), a lot.  Printing any BeautifulSoup instance invokes str() to return a string representation, which uses the default encoding, which is often 'ascii'.
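The failure can be reproduced without BeautifulSoup at all. In Python 2, str() applied to a Unicode object encodes it with the default encoding behind the scenes; the sketch below spells that step out in Python 3, where the encoding has to be written explicitly. The helper name implicit_ascii is made up for this illustration, and 'ascii' is assumed to be the default encoding.

```python
def implicit_ascii(text):
    """Mimic the hidden step: encode text with an assumed 'ascii' default encoding."""
    return text.encode("ascii")


# Plain ASCII text survives the conversion.
plain = implicit_ascii("cara")

# Anything outside ASCII does not.
try:
    implicit_ascii("ni\u00f1o")
    failed = False
except UnicodeEncodeError:
    failed = True
```

Pages that happen to contain only ASCII never trigger the error, which is why the problem only shows up on some inputs.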

Implicit in the design of BeautifulSoup is the assumption that str() is a good way to represent/return the "value" of an object.  For a Tag object, __repr__ calls __str__.  Given that the objects here are derived from a stream of characters, that's not unreasonable, but it misses the point that __str__() is usually supposed to return a printable representation, in the default character encoding.  When the value is Unicode that can't be encoded in the default encoding, that assumption breaks.

I think what would make more sense (from a Unicode point of view) would be to separate the value of the data from the representation of the data.  To get the value, one would (for example) access the NavigableText.string data attribute (via a function wrapper), and accept that str() applied to an instance would do something like:
import sys

def __str__(self):
    """Return representation of self, replacing characters that can't be encoded."""
    return self.string.encode(sys.getdefaultencoding(),'replace')
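The same split between value and representation can be sketched as a small self-contained class. This is Python 3, where __str__ has to return text, so the lossy encoding is decoded straight back; the class name Text is a hypothetical stand-in and is not BeautifulSoup's own NavigableText, and 'ascii' is assumed as the target encoding.

```python
class Text:
    """Hypothetical stand-in for a text node; not BeautifulSoup's NavigableText."""

    def __init__(self, string):
        self.string = string  # the value: full Unicode, never lossy

    def __str__(self):
        # The representation: unencodable characters become '?',
        # so the result is safe to print on an ASCII-only console.
        return self.string.encode("ascii", "replace").decode("ascii")


node = Text("ni\u00f1o")
value = node.string          # the original text, untouched
representation = str(node)   # a printable approximation
```

Code that needs the real data reads node.string; code that only wants something to print calls str(node) and accepts the loss.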


Value and representation.  Two things that can look the same, but aren't.

Oh, and I still like BeautifulSoup very much; so much so that I'm using it, with a patch to avoid the problem, submitted to Leonard.

