This post originated from an RSS feed registered with Agile Buzz
by James Robertson.
Original Post: Better Encoding support
Feed Title: Cincom Smalltalk Blog - Smalltalk with Rants
Feed URL: http://www.cincomsmalltalk.com/rssBlog/rssBlogView.xml
Feed Description: James Robertson comments on Cincom Smalltalk, the Smalltalk development community, and IT trends and issues in general.
I've posted an update to the Http-Access package (the part of BottomFeeder that does net access for feeds). The basic VW client software looks only in the HTTP headers for encoding information; some HTML pages put that information into a meta http-equiv tag instead, so I now look there if no encoding header was present. Likewise, XML feeds often put that information into the XML declaration at the top of the document, which I also now check.
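The lookup order described above can be sketched roughly as follows. This is an illustration in Python, not the Http-Access code itself (which is VW Smalltalk); the function name `detect_encoding`, the regular expressions, and the 2048-byte scan window are all assumptions made for the example.

```python
import re

# Fallback used when nothing declares an encoding (what the base library assumed).
DEFAULT_ENCODING = "iso-8859-1"

# Hypothetical patterns for this sketch; real-world markup can be messier.
_HEADER_CHARSET = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)
_XML_DECL = re.compile(
    rb'<\?xml[^>]*encoding\s*=\s*["\']([\w.:-]+)["\']', re.IGNORECASE
)
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE
)


def detect_encoding(content_type, body):
    """Return the declared encoding name, lowercased.

    content_type: value of the HTTP Content-Type header, or None.
    body: the raw response bytes.
    """
    # 1. The HTTP header wins when it carries a charset.
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()

    # 2. Otherwise look near the top of the document: first for an
    #    XML declaration, then for a meta http-equiv tag.
    head = body[:2048]
    for pattern in (_XML_DECL, _META_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()

    # 3. Nothing declared: fall back to the old default.
    return DEFAULT_ENCODING
```

For example, `detect_encoding("text/xml", b'<?xml version="1.0" encoding="UTF-8"?><rss/>')` finds no charset in the header and falls through to the XML declaration.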
The problem was that, if there was no HTTP header with an encoding, the base library just assumed iso-8859-1. I noticed this in Joel's feed: the top item has the word Résumé, with the accented characters, and Bf was making a hash of it. Why? Well, in the HTML page, the encoding (UTF-8) is in a meta tag. In the XML feed, it's in the XML header. Either way, the base library was assuming iso-8859-1 and leaving junk characters. Now I look for encoding information when the headers don't have it, and all is well. This should get rid of a lot of the junk characters Bf was displaying.
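To see where the junk characters come from, here is a small Python illustration (again, not BottomFeeder code; the variable names are made up for the example). It decodes UTF-8 bytes for a word with accented characters as if they were iso-8859-1:

```python
# UTF-8 stores each accented "e" as two bytes (0xC3 0xA9).
raw = "R\u00e9sum\u00e9".encode("utf-8")

# Decoding with the wrong assumption turns each two-byte pair
# into two separate iso-8859-1 characters.
wrong = raw.decode("iso-8859-1")

# Decoding with the encoding the document actually declared
# gives back the original word.
right = raw.decode("utf-8")

print(wrong)  # RÃ©sumÃ©
print(right)  # Résumé
```

Each accented character becomes two junk characters, which matches the kind of garbage that shows up when a UTF-8 feed is read as iso-8859-1.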