Agile Buzz Forum - encoding issues for aggregators


James Robertson

Posts: 29924
Nickname: jarober61
Registered: Jun, 2003

David Buck, Smalltalker at large

encoding issues for aggregators

Posted: Apr 23, 2004 6:12 PM

This post originated from an RSS feed registered with Agile Buzz by James Robertson.
Original Post: encoding issues for aggregators
Feed Title: Cincom Smalltalk Blog - Smalltalk with Rants
Feed URL: http://www.cincomsmalltalk.com/rssBlog/rssBlogView.xml
Feed Description: James Robertson comments on Cincom Smalltalk, the Smalltalk development community, and IT trends and issues in general.

Dare's comment on Sam Ruby's blog explains the difficulties facing consumers of XML:

RFC 3023 is broken because it ignores practice in the XML world and this has even been noted by the very authors of the spec who've expressed that they'd like to update it. If RSS Bandit actually followed RFC 3023 then we'd cause our users to have difficulties with a large percentage of the feeds they read since lots of them are served with text/xml MIME types but aren't encoded in us-ascii. Specs are not the perfect and irrevokable Word of God set that are set in stone. Many of them are ambiguous, contradictory and in some cases infeasible to implement.
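
For context, the rule being objected to: under RFC 3023, a text/xml response with no charset parameter is to be treated as us-ascii, regardless of the encoding named in the XML declaration. Here is a rough Python sketch of that rule. The function name `rfc3023_charset` is made up for illustration, and this is not code from RSS Bandit or BottomFeeder.

```python
from email.message import Message


def rfc3023_charset(content_type_header):
    """Return the charset RFC 3023 prescribes for a Content-Type header.

    Returns None when the header gives no answer and the XML document
    itself (BOM or encoding declaration) has to be consulted.
    """
    msg = Message()
    msg["Content-Type"] = content_type_header
    charset = msg.get_param("charset")
    if isinstance(charset, tuple):
        # RFC 2231 encoded parameter: (charset, language, value)
        charset = charset[2]
    if charset is not None:
        return charset.lower()
    if msg.get_content_type() == "text/xml":
        # The default that causes trouble: no charset means us-ascii.
        return "us-ascii"
    return None
```

A feed served as plain `text/xml` that contains even one accented character then fails a strict decode, since `b"caf\xe9".decode("us-ascii")` raises `UnicodeDecodeError`.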

This gets to be very nasty very fast. First off, most of the feeds served with a MIME type of text/xml (at least in my experience) are not us-ascii. They fall into a few categories:

No explicit charset listed, but actually uses iso-8859-1
No explicit charset listed, but actually uses utf-8
A charset listed, which is used
A charset listed, but the feed is actually encoded in another (typically iso-8859-1 or utf-8)
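
The categories above can be told apart, roughly, by comparing the declared charset with what the bytes will actually decode as. A minimal sketch in Python, assuming only us-ascii, utf-8 and iso-8859-1 are in play; `looks_like` and `classify` are hypothetical names, and the strict-decode test is a heuristic, not a guarantee.

```python
import codecs


def looks_like(raw):
    """Crude check: pure ASCII, else valid UTF-8, else assume ISO-8859-1."""
    for name in ("ascii", "utf-8"):
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # Every byte sequence decodes as iso-8859-1, so it is the fallback.
    return "iso-8859-1"


def classify(declared, raw):
    """Place a feed in one of the categories listed above."""
    actual = looks_like(raw)
    if declared is None:
        return f"no charset listed, looks like {actual}"
    if actual == "ascii":
        # Pure ASCII bytes are consistent with any ASCII-compatible charset.
        return "charset listed, which is used"
    try:
        same = codecs.lookup(declared).name == codecs.lookup(actual).name
    except LookupError:
        same = False  # unknown charset label: treat as a mismatch
    if same:
        return "charset listed, which is used"
    return f"charset listed ({declared}), but looks like {actual}"
```

Text that is really iso-8859-1 can occasionally also be valid utf-8, so a result from this check is a hint rather than proof.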

In a sense, it no longer matters what the standard says - out in the wild, people are actually doing all sorts of wild things. The practical impact of this in BottomFeeder has been items that are readable, but have specific characters (typically double quotes and/or apostrophes) munged. Browser developers have addressed this by building charset guessers - they score the text for 'goodness' in a few common charsets if there's none listed, and pick the winner. I've not done that; instead, I have options to change the encoding on the fly if the text "looks wrong".
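
Both approaches can be sketched in a few lines of Python. This is an illustration only, not how any browser or BottomFeeder (which is written in Smalltalk) actually does it: the scoring rule, the candidate list, and the names `goodness`, `guess_charset` and `Item` are all assumptions made for the example.

```python
CANDIDATES = ("utf-8", "iso-8859-1")  # assumed shortlist of common charsets


def goodness(text):
    """Score decoded text: each suspicious character costs one point."""
    bad = 0
    for ch in text:
        # U+FFFD marks bytes that failed to decode; C1 control characters
        # (U+0080 to U+009F) usually mean utf-8 was read as iso-8859-1.
        if ch == "\ufffd" or "\x80" <= ch <= "\x9f":
            bad += 1
    return -bad


def guess_charset(raw, candidates=CANDIDATES):
    """Pick the candidate whose decoding scores best (first wins ties)."""
    return max(candidates,
               key=lambda name: goodness(raw.decode(name, errors="replace")))


class Item:
    """Keeps the raw bytes so the encoding can be changed on the fly."""

    def __init__(self, raw, declared=None):
        self.raw = raw
        self.encoding = declared or guess_charset(raw)

    @property
    def text(self):
        return self.raw.decode(self.encoding, errors="replace")
```

Keeping the original bytes around is what makes the manual override cheap: if an item declared as iso-8859-1 shows munged quotes, setting `item.encoding = "utf-8"` re-decodes it without fetching the feed again.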

I don't expect this to get better anytime soon - the tools are already out there, and there's confusion in every direction over what the "right" assumptions are....
