This post originated from an RSS feed registered with Agile Buzz
by James Robertson.
Original Post: The joys of character encodings
Feed Title: Cincom Smalltalk Blog - Smalltalk with Rants
Feed URL: http://www.cincomsmalltalk.com/rssBlog/rssBlogView.xml
Feed Description: James Robertson comments on Cincom Smalltalk, the Smalltalk development community, and IT trends and issues in general.
After subscribing to the Learning Seaside blog, I started seeing bad characters in the RSS feed. While looking into it, I ended up refreshing my memory on the default character encodings for HTTP-transmitted documents and for XML documents. It's an interesting little pathway into heck - here are the rules:
HTTP transport: ISO-8859-1
XML documents on the file system: UTF-8
XML via some transport (like HTTP or MIME): The transport default
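Those three rules can be sketched as a small lookup. This is only an illustration in Python, assuming the simplified defaults exactly as listed above; the function names and the via_http flag are hypothetical, and none of this is BottomFeeder's actual code.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()


def default_encoding(content_type=None, via_http=True):
    """Pick an encoding using the rules listed above (a simplification)."""
    if content_type:
        charset = charset_from_content_type(content_type)
        if charset:
            # The transport said what it is sending, so trust it.
            return charset
    # No charset given: fall back to the transport default for HTTP,
    # or to UTF-8 for an XML document read straight off the file system.
    return "iso-8859-1" if via_http else "utf-8"
```

With these assumptions, a feed served as "text/xml" with no charset parameter is treated as ISO-8859-1, which is exactly the case that goes wrong when the bytes are really UTF-8.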
Here's the relevant section on XML documents from the XML 1.0 spec:
In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.
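A rough Python sketch of what that paragraph implies when there is no transport information: look for a Byte Order Mark, then for an encoding declaration, and otherwise assume UTF-8. The function name sniff_xml_encoding is hypothetical, and the sketch is deliberately incomplete: it only reads declarations written in an ASCII-compatible encoding and does not tell UTF-32 apart from UTF-16.

```python
import codecs
import re

# Matches an encoding="..." pseudo-attribute inside a leading XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of a standalone XML entity from its first bytes."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Neither a BOM nor a declaration: the spec says this must be UTF-8.
    return "utf-8"
```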
That's where, IMHO, a bunch of problems crop up. You create an XML document, encoded in UTF-8. Now you make it available on a website, but forget to add encoding information, so the client falls back to the transport default. In most cases in the US, this isn't going to lead to any (obvious) problems, because plain ASCII text comes out the same in both encodings. However, say this document comes from Europe, and has characters with accents and the like. Suddenly, you have badly decoded characters that show up on the client side as interesting garbage. Browsers mask this problem by having complex and sophisticated encoding guessers - they score the document if it doesn't have a declared encoding, and show you what seems best - and it usually works. Mozilla allows you to change the encoding if your eyeballs think the guess was wrong.
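The garbage is easy to reproduce. A minimal Python illustration, using a made-up sample word: the bytes are written as UTF-8 but read back as ISO-8859-1, so each accented character turns into two unrelated ones.

```python
text = "café"

# What the server actually sends: UTF-8 bytes (the é becomes 0xC3 0xA9).
wire = text.encode("utf-8")

# What a client sees if it assumes the ISO-8859-1 default.
garbled = wire.decode("iso-8859-1")

# Decoding with the encoding that was really used recovers the text.
recovered = wire.decode("utf-8")

print(garbled)    # cafÃ©
print(recovered)  # café
```

Pure ASCII text would survive this round trip unchanged, which is why the problem tends to go unnoticed until accented characters appear.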
That's where I've ended up with BottomFeeder - I don't have the expertise to create a guesser, so instead I offer an encoding menu for the user in case it "looks wrong". For my example above, changing the encoding to utf-8 fixed all the issues in the feed. The choices I've made available are: utf-8, iso8859-1, iso8859-15, ms-1252. That ought to cover most of the bases, and allow people to fix clearly wrong encoding presentations.
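The idea behind such a menu can be sketched in a few lines of Python: keep the raw bytes of the feed around and decode them again with whatever the user picks. This is an assumption-laden illustration, not BottomFeeder's implementation; the MENU mapping and redecode are hypothetical names, and the codec names on the right are Python's equivalents of the four menu choices.

```python
# Menu label -> Python codec name (the mapping itself is an assumption).
MENU = {
    "utf-8": "utf-8",
    "iso8859-1": "iso-8859-1",
    "iso8859-15": "iso-8859-15",
    "ms-1252": "cp1252",
}


def redecode(raw: bytes, choice: str) -> str:
    """Decode the original feed bytes with the encoding picked in the menu."""
    codec = MENU[choice]
    # errors="replace" keeps the text displayable even when the choice is
    # wrong, so the user can simply try the next entry.
    return raw.decode(codec, errors="replace")


def preview_all(raw: bytes) -> dict:
    """Return the decoded text for every menu entry, for side-by-side viewing."""
    return {label: redecode(raw, label) for label in MENU}
```

The important design point is that the original bytes are never thrown away: re-decoding already garbled text does not reliably undo the damage, while decoding the untouched bytes with the right choice does.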
In general, this whole area is a mess, and creates headaches for all of us who have decided to walk into it...