The Artima Developer Community

Agile Buzz Forum
How I learned to love encoding

James Robertson


David Buck, Smalltalker at large
How I learned to love encoding Posted: Feb 3, 2004 4:57 PM

This post originated from an RSS feed registered with Agile Buzz by James Robertson.
Original Post: How I learned to love encoding
Feed Title: Cincom Smalltalk Blog - Smalltalk with Rants
Feed URL: http://www.cincomsmalltalk.com/rssBlog/rssBlogView.xml
Feed Description: James Robertson comments on Cincom Smalltalk, the Smalltalk development community, and IT trends and issues in general.
One of the things that recently got a lot better in BottomFeeder is character and font handling. I have to thank Alan Knight for most of the improvement - his work on Web Toolkit has given him a lot of experience dealing with encoding on the web. Here's how things work in Bf:
  • Download the XML (RSS or Atom) page from the server
  • Parse the XML into Objects
  • Display the feed data to the user

The difficulty came right there at the beginning. The HttpClient in VisualWorks will use the correct encoding to get the content from an HTTP response object - assuming that the encoding has been set in the header. If it's not set, it assumes (as per the standards) that the content has been encoded as iso-8859-1. The trouble comes when the content is sent without a charset in the Content-Type header and is encoded in something other than iso-8859-1 - you'll end up seeing a lot of bad characters in the resulting stream of data.
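To make that default concrete, here is a minimal sketch in Python (not the VisualWorks HttpClient, whose internals I'm not showing here). The function name charset_from_content_type is made up for illustration; it leans on the standard library's email.message.Message to parse the header value and falls back to iso-8859-1 when no charset is given.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset named in a Content-Type header value.

    Hypothetical helper for illustration only. Falls back to `default`
    when the header is missing or carries no charset parameter.
    """
    if header_value is None:
        return default
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or None if absent.
    return msg.get_content_charset() or default


if __name__ == "__main__":
    print(charset_from_content_type("text/xml; charset=UTF-8"))
    print(charset_from_content_type("text/xml"))
```

The second call is the problem case: the server said nothing, so the client has to guess.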

Well, that's the standard way of dealing with this, and so - in some sense - "correct". The trouble is, end users couldn't care less about following standards if it means crappy looking text to read. The upshot is, I had to start playing some of the same games that I suspect browsers play - looking for other encoding information in the text. At the moment, Bf looks in two places:

  • Once I have the content, I scan through it looking for <meta http-equiv="Content-Type" tags. If I find one, I grab the encoding set there as the proper one to use.
  • I look at the XML declaration (the line that starts <?xml version="1.0" encoding=) for encoding information. Again, if I find it, I grab the encoding information.
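The two checks above could be sketched roughly like this in Python. This is not Bf's actual scanning code; the regular expressions, the function name sniff_declared_encoding, and the choice to try the meta tag before the XML declaration are all my assumptions for the sake of the example. It also only handles a meta tag where http-equiv comes before the charset.

```python
import re

# Assumed patterns, for illustration only.
META_RE = re.compile(
    r'<meta[^>]*http-equiv\s*=\s*["\']?Content-Type["\']?[^>]*'
    r'charset\s*=\s*["\']?([A-Za-z0-9._-]+)',
    re.IGNORECASE,
)
XML_DECL_RE = re.compile(
    r'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']',
    re.IGNORECASE,
)


def sniff_declared_encoding(text):
    """Return the encoding declared inside the document, or None.

    Looks for a <meta http-equiv="Content-Type" ...> tag first, then for
    an encoding attribute in the XML declaration.
    """
    for pattern in (META_RE, XML_DECL_RE):
        match = pattern.search(text)
        if match:
            return match.group(1).lower()
    return None


if __name__ == "__main__":
    feed = '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"/>'
    print(sniff_declared_encoding(feed))
```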

Well, at this point we have a problem - the content has already been decoded as iso-8859-1, and, based on the new encoding information, we know that's wrong. What to do? Here's where Alan's tip saved the day. The following code snippet sets things right:

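decode: text with: encoding
	"Turn the wrongly decoded text back into its original bytes, then decode
	those bytes with the given encoding. If anything fails, answer the text unchanged."

	^[(text asByteArrayEncoding: 'iso-8859-1') asStringEncoding: encoding asLowercase]
		on: Error
		do: [:ex | text]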

That undoes the iso-8859-1 decoding by turning the text back into its original bytes, and then decodes those bytes with the proper encoding gleaned from the document. The results of this simple snippet of code have been really good - I'm seeing far fewer garbage characters in the text of the feeds I access. I still see some (especially in the Meerkat feeds), but I can't do a lot when there's no encoding information at all - iso-8859-1 is as good a guess as any in the complete absence of information.
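For anyone who doesn't read Smalltalk, the same round trip looks something like this in Python. This is a sketch of the idea, not a port of the Bf method; the name redecode is invented for the example. It works because iso-8859-1 maps every byte value to exactly one character, so encoding the text back gives you the original bytes.

```python
def redecode(text, encoding):
    """Undo a mistaken iso-8859-1 decode and decode again with `encoding`.

    Hypothetical helper: if the encoding name is unknown, or the bytes
    are not valid in that encoding, the original text is returned as-is.
    """
    try:
        return text.encode("iso-8859-1").decode(encoding.lower())
    except (UnicodeError, LookupError):
        return text


if __name__ == "__main__":
    # Simulate a UTF-8 feed that was wrongly read as iso-8859-1.
    garbled = "café".encode("utf-8").decode("iso-8859-1")
    print(garbled)
    print(redecode(garbled, "UTF-8"))
```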

This is the sort of thing any user-agent on the web is going to have to go through. For good or ill, encodings simply are not universally set in the HTTP headers. Maybe they should be, but they aren't. I don't expect it to get better either - remote hosting services pretty much ensure that.



Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use