The Artima Developer Community

Agile Buzz Forum
The joys of character encodings

James Robertson


David Buck, Smalltalker at large
The joys of character encodings Posted: Apr 22, 2004 12:32 PM

This post originated from an RSS feed registered with Agile Buzz by James Robertson.
Original Post: The joys of character encodings
Feed Title: Cincom Smalltalk Blog - Smalltalk with Rants
Feed URL: http://www.cincomsmalltalk.com/rssBlog/rssBlogView.xml
Feed Description: James Robertson comments on Cincom Smalltalk, the Smalltalk development community, and IT trends and issues in general.

After subscribing to the Learning Seaside blog, I started seeing bad characters in its RSS feed. Looking into it, I ended up refreshing my memory on the encoding standards for HTTP-transmitted documents and for XML documents. It's an interesting little pathway into heck - here are the default-encoding rules:

  • HTTP transport: ISO-8859-1
  • XML documents on the file system: UTF-8
  • XML via some transport (like HTTP or MIME): the transport default

Here's the relevant section on XML documents from the XML 1.0 spec:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.
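To make the precedence in that paragraph concrete, here is a minimal Python sketch. The function name `pick_xml_encoding` and its `transport_charset` parameter are hypothetical, made up for illustration. It assumes the declaration, if present, is written in an ASCII-compatible encoding, and it ignores UTF-32; a real XML processor does considerably more.

```python
import codecs
import re

# Matches the encoding pseudo-attribute in a leading XML declaration.
# Assumption: the declaration itself is readable as ASCII bytes.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def pick_xml_encoding(data, transport_charset=None):
    """Hypothetical helper: choose an encoding name for raw XML bytes."""
    # 1. Information from the transport (e.g. an HTTP charset) comes first.
    if transport_charset:
        return transport_charset.lower()
    # 2. Otherwise a Byte Order Mark identifies the encoding.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 3. Otherwise use the encoding declaration, if there is one.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Neither a BOM nor a declaration: the document has to be UTF-8.
    return "utf-8"
```

Calling `pick_xml_encoding(raw_bytes)` with no transport charset mirrors the file system case, where an undeclared document is treated as UTF-8.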

That's where, IMHO, a bunch of problems crop up. You create an XML document encoded in UTF-8. Then you make it available on a website, but forget to add encoding information, so the HTTP default of ISO-8859-1 applies. In most cases in the US, this isn't going to lead to any (obvious) problems, since plain ASCII text is the same bytes in both encodings. However, say the document comes from Europe and has accented characters and the like. Suddenly, you have badly decoded characters that show up on the client side as interesting garbage. Browsers mask this problem with complex and sophisticated encoding guessers: if a document doesn't have a declared encoding, they score it against candidate encodings and show you what seems best, and it usually works. Mozilla allows you to change the encoding if your eyeballs think the guess was wrong.
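The failure is easy to reproduce. The Python sketch below uses a made-up sample word and hypothetical function names; it just encodes text as UTF-8 and then decodes the bytes as ISO-8859-1, the way a client relying on the HTTP default would.

```python
def as_seen_by_client(text):
    """Encode as UTF-8 (what the author saved), decode as ISO-8859-1
    (what a client assumes when no charset is declared)."""
    return text.encode("utf-8").decode("iso-8859-1")

def repair(garbled):
    """Undo the wrong decode by reversing it and decoding as UTF-8."""
    return garbled.encode("iso-8859-1").decode("utf-8")

sample = "caf\u00e9"                 # "café", with one accented character
garbled = as_seen_by_client(sample)  # the single é becomes two characters
print(garbled)
print(repair(garbled))               # back to the original text
```

Pure ASCII input comes through `as_seen_by_client` unchanged, which is why the problem tends to stay hidden until accented text shows up.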

That's where I've ended up with BottomFeeder. I don't have the expertise to create a guesser, so instead I offer an encoding menu for the user in case a feed "looks wrong". For my example above, changing the encoding to utf-8 fixed all the issues in the feed. The choices I've made available are: utf-8, iso8859-1, iso8859-15, ms-1252. That ought to cover most of the bases, and allow people to fix clearly wrong encoding presentations.
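BottomFeeder itself is written in Smalltalk, so the following is only a rough Python sketch of the menu idea, not its actual code. The `MENU` table and the `redecode` function are hypothetical, and the mapping from the menu labels to Python codec names is my assumption.

```python
# Menu labels from the list above, mapped to Python codec names.
# Assumption: "ms-1252" means Windows code page 1252.
MENU = {
    "utf-8": "utf-8",
    "iso8859-1": "iso-8859-1",
    "iso8859-15": "iso-8859-15",
    "ms-1252": "cp1252",
}

def redecode(raw, choice):
    """Re-decode the raw feed bytes with the encoding the user picked.

    errors="replace" keeps a wrong pick from raising; undecodable bytes
    show up as U+FFFD so the user can simply try another entry.
    """
    return raw.decode(MENU[choice], errors="replace")

def all_views(raw):
    """Decode the same bytes every way the menu allows, for comparison."""
    return {label: redecode(raw, label) for label in MENU}
```

The important design point is to hold on to the original bytes: each menu choice decodes them afresh instead of trying to convert already-garbled text.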

In general, this whole area is a mess, and creates headaches for all of us who have decided to walk into it...



Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use