This post originated from an RSS feed registered with Agile Buzz
by James Robertson.
Original Post: encoding issues for aggregators
Feed Title: Cincom Smalltalk Blog - Smalltalk with Rants
Feed URL: http://www.cincomsmalltalk.com/rssBlog/rssBlogView.xml
Feed Description: James Robertson comments on Cincom Smalltalk, the Smalltalk development community, and IT trends and issues in general.
Dare's comment on Sam Ruby's blog explains the difficulties facing consumers of XML:
RFC 3023 is broken because it ignores practice in the XML world and this has even been noted by the very authors of the spec who've expressed that they'd like to update it. If RSS Bandit actually followed RFC 3023 then we'd cause our users to have difficulties with a large percentage of the feeds they read since lots of them are served with text/xml MIME types but aren't encoded in us-ascii. Specs are not the perfect and irrevokable Word of God set that are set in stone.
Many of them are ambiguous, contradictory and in some cases infeasible to implement.
This gets to be very nasty very fast. First off, most of the feeds served with a MIME type of text/xml (at least in my experience) are not us-ascii. They fall into a few categories:
No explicit charset listed, but actually uses iso-8859-1
No explicit charset listed, but actually uses utf-8
A charset listed, which is used
A charset listed, but the feed is actually encoded in another one (typically iso-8859-1 or utf-8)
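As a rough illustration of how an aggregator might sort a feed into these buckets, here is a minimal Python sketch. The function names and the category labels are made up for this example; it assumes the charset comes either from the HTTP Content-Type header or from the XML declaration, and that the body is in an ASCII-compatible encoding. Note that it can only catch a wrong charset when the bytes fail to decode: iso-8859-1 accepts any byte sequence, so a utf-8 feed mislabeled as iso-8859-1 slips through.

```python
import re
from email.message import Message

# Matches the encoding in an XML declaration such as
# <?xml version="1.0" encoding="ISO-8859-1"?>
XML_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
UTF8_BOM = b"\xef\xbb\xbf"


def http_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def xml_decl_charset(body):
    """Return the encoding named in the XML declaration, or None."""
    if body.startswith(UTF8_BOM):
        body = body[len(UTF8_BOM):]
    match = XML_DECL.match(body)
    return match.group(1).decode("ascii").lower() if match else None


def categorize(content_type, body):
    """Hypothetical helper: place a feed into one of three coarse buckets."""
    listed = http_charset(content_type) or xml_decl_charset(body)
    if listed is None:
        return "no charset listed"
    try:
        body.decode(listed)
    except (UnicodeDecodeError, LookupError):
        return "charset listed, bytes do not decode"
    return "charset listed, bytes decode"
```

For example, `categorize("text/xml; charset=utf-8", feed_bytes)` on a feed that was really written out as iso-8859-1 with accented characters lands in the "bytes do not decode" bucket.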
In a sense, it no longer matters what the standard says - out in the wild, people are actually doing all sorts of wild things. The practical impact of this in BottomFeeder has been items that are readable, but have specific characters (typically double quotes and/or apostrophes) munged. Browser developers have addressed this by building charset guessers - they score the text for 'goodness' in a few common charsets if there's none listed, and pick the winner. I've not done that; instead, I have options to change the encoding on the fly if the text "looks wrong".
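To make the guessing idea concrete, here is a minimal Python sketch of a charset scorer with a manual override. It is not how any particular browser or BottomFeeder does it; the candidate list, the mojibake markers, and the function names are all assumptions chosen for illustration. Each candidate is scored by counting characters that rarely show up in correctly decoded text, and ties go to the earlier candidate.

```python
# Candidate charsets in tie-break order (an assumption, not a standard list).
CANDIDATES = ("utf-8", "windows-1252", "iso-8859-1")

# Sequences that usually mean utf-8 bytes were decoded as a single-byte
# charset (illustrative, far from exhaustive).
MOJIBAKE_MARKERS = ("â€", "Ã", "Â")


def goodness(raw, charset):
    """Return a score <= 0 (0 is best), or None if raw does not decode."""
    try:
        text = raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return None
    # C1 control characters almost never belong in feed text.
    controls = sum(1 for ch in text if "\x80" <= ch <= "\x9f")
    markers = sum(text.count(m) for m in MOJIBAKE_MARKERS)
    return -(controls + markers)


def guess_charset(raw, candidates=CANDIDATES):
    """Pick the candidate with the best score; earlier candidates win ties."""
    best, best_score = None, None
    for charset in candidates:
        score = goodness(raw, charset)
        if score is not None and (best_score is None or score > best_score):
            best, best_score = charset, score
    return best


def decode_item(raw, listed=None, override=None):
    """Decode with the user's override, else the listed charset, else a guess."""
    charset = override or listed or guess_charset(raw)
    return raw.decode(charset, errors="replace")
```

With this sketch, `decode_item(raw)` guesses, `decode_item(raw, listed="iso-8859-1")` trusts the feed, and `decode_item(raw, listed="iso-8859-1", override="utf-8")` models the user switching the encoding when the text "looks wrong". Decoding utf-8 curly quotes as iso-8859-1 is exactly the case that produces the munged quotes and apostrophes described above.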
I don't expect this to get better anytime soon - the tools are already out there, and there's confusion in every direction over what the "right" assumptions are....