Agile Buzz Forum - The joys of character encodings


James Robertson


David Buck, Smalltalker at large

The joys of character encodings

Posted: Apr 22, 2004 12:32 PM

This post originated from an RSS feed registered with Agile Buzz by James Robertson.
Original Post: The joys of character encodings
Feed Title: Cincom Smalltalk Blog - Smalltalk with Rants
Feed URL: http://www.cincomsmalltalk.com/rssBlog/rssBlogView.xml
Feed Description: James Robertson comments on Cincom Smalltalk, the Smalltalk development community, and IT trends and issues in general.

In subscribing to the Learning Seaside blog, I started seeing bad characters in the RSS feed. In taking a look at this, I ended up refreshing my memory on the standards for HTTP-transmitted documents and for XML documents. It's an interesting little pathway into heck. Here are the rules:

HTTP transport (text with no charset given): iso-8859-1
XML documents on the file system (no Byte Order Mark or encoding declaration): UTF-8
XML via some transport (like HTTP or MIME): the transport default
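
As a rough sketch of how those three rules combine, here is one way to pick the encoding for a chunk of XML bytes in Python. The function names and the `via_http` and `transport_charset` parameters are made up for illustration, and the logic follows this post's reading of the rules rather than any particular parser; real media-type handling has more cases than this.

```python
import re
from typing import Optional

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the data, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
_DECL = re.compile(rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']")


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named by a Byte Order Mark or XML declaration, if any."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    match = _DECL.match(data)
    return match.group(1).decode("ascii").lower() if match else None


def effective_encoding(
    data: bytes,
    transport_charset: Optional[str] = None,  # hypothetical: charset the transport announced
    via_http: bool = False,                   # hypothetical: True if fetched over HTTP
) -> str:
    """Pick an encoding following the three rules listed in the post."""
    if transport_charset:
        # The transport said what it is, so that wins.
        return transport_charset.lower()
    if via_http:
        # No charset from the transport: fall back to the transport default.
        return "iso-8859-1"
    # Plain file: BOM or declaration if present, otherwise UTF-8.
    return declared_encoding(data) or "utf-8"
```

For a file on disk, `effective_encoding(data)` falls through to the declaration or UTF-8; passing `via_http=True` with no charset gives iso-8859-1, which is exactly where the trouble described below starts.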

Here's the relevant section on XML documents from the 1.0 spec:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

That's where, IMHO, a bunch of problems crop up. You create an XML document encoded in UTF-8. Now you make it available on a website, but forget to add encoding information. In most cases in the US, this isn't going to lead to any (obvious) problems, since plain ASCII text reads the same under either encoding. However, say this document comes from Europe and has accented characters (and the like). Suddenly, the client decodes the bytes as iso-8859-1, and the badly decoded characters show up as interesting garbage. Browsers mask this problem with complex and sophisticated encoding guessers: if the document doesn't have a declared encoding, they score it against the candidates and show you what seems best, and it usually works. Mozilla allows you to change the encoding if your eyeballs think the guess was wrong.
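
To make the failure concrete, here is a small Python illustration (the sample word is my own, not taken from the feed in question). It encodes an accented word as UTF-8 and then decodes the same bytes two ways.

```python
# An accented word, written with an escape so the source file's own
# encoding doesn't matter: "caf\u00e9" is "cafe" with an acute accent.
text = "caf\u00e9"

# What actually goes over the wire when the document is UTF-8.
wire = text.encode("utf-8")  # b'caf\xc3\xa9' (the accent takes two bytes)

# A client that assumes the HTTP default sees two characters instead of one.
wrong = wire.decode("iso-8859-1")

# A client that is told (or guesses) UTF-8 gets the original back.
right = wire.decode("utf-8")

# Pure ASCII survives either way, which is why the problem often goes unnoticed.
plain = "cafe".encode("utf-8")
same_either_way = plain.decode("iso-8859-1") == plain.decode("utf-8")
```

Here `wrong` comes out as "caf" followed by a capital A with a tilde and a copyright sign, which is the typical two-characters-per-accent garbage you see in a mis-decoded feed.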

That's where I've ended up with BottomFeeder. I don't have the expertise to create a guesser, so instead I offer an encoding menu for the user in case the text "looks wrong". For my example above, changing the encoding to utf-8 fixed all the issues in the feed. The choices I've made available are: utf-8, iso8859-1, iso8859-15, ms-1252. That ought to cover most of the bases, and allow people to fix clearly wrong encoding presentations.
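
BottomFeeder itself is written in Smalltalk, so the following is not its code. It is just a minimal Python sketch of the same idea: keep the raw bytes around and re-decode them with whichever menu entry the user picks. The `MENU` mapping and function names are hypothetical, and the mapping from the labels above to Python codec names (for example ms-1252 to `cp1252`) is my assumption.

```python
# Menu labels as listed in the post, mapped to Python codec names (assumed).
MENU = {
    "utf-8": "utf-8",
    "iso8859-1": "iso-8859-1",
    "iso8859-15": "iso-8859-15",
    "ms-1252": "cp1252",
}


def decode_with_choice(data: bytes, label: str) -> str:
    """Decode the raw feed bytes using the encoding the user picked.

    errors="replace" substitutes U+FFFD for undecodable bytes, so a wrong
    choice still shows something rather than raising an exception.
    """
    return data.decode(MENU[label], errors="replace")


def previews(data: bytes) -> dict:
    """Decode the same bytes under every menu entry, for side-by-side comparison."""
    return {label: decode_with_choice(data, label) for label in MENU}
```

The design point is that no guessing is needed: the bytes are never thrown away, so switching the menu entry is just a re-decode, and the user's eyeballs do the scoring.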

In general, this whole area is a mess, and creates headaches for all of us who have decided to walk into it...
