The Artima Developer Community
Agile Buzz Forum
encoding issues for aggregators

James Robertson

Posted: Apr 23, 2004 6:12 PM

This post originated from an RSS feed registered with Agile Buzz by James Robertson.
Original Post: encoding issues for aggregators
Feed Title: Cincom Smalltalk Blog - Smalltalk with Rants
Feed URL: http://www.cincomsmalltalk.com/rssBlog/rssBlogView.xml
Feed Description: James Robertson comments on Cincom Smalltalk, the Smalltalk development community, and IT trends and issues in general.

Dare's comment on Sam Ruby's blog explains the difficulties facing consumers of XML:

RFC 3023 is broken because it ignores practice in the XML world and this has even been noted by the very authors of the spec who've expressed that they'd like to update it. If RSS Bandit actually followed RFC 3023 then we'd cause our users to have difficulties with a large percentage of the feeds they read since lots of them are served with text/xml MIME types but aren't encoded in us-ascii. Specs are not the perfect and irrevokable Word of God set that are set in stone. Many of them are ambiguous, contradictory and in some cases infeasible to implement.

This gets very nasty very fast. First off, most of the feeds served with a MIME type of text/xml (at least in my experience) are not us-ascii. They fall into a few categories:

  • No explicit charset listed, but actually uses iso-8859-1
  • No explicit charset listed, but actually uses utf-8
  • A charset listed, which is used
  • A charset listed, but the feed is actually encoded in another (typically iso-8859-1 listed where utf-8 is used, or the reverse)
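
As a rough illustration of how an aggregator might sort a fetched feed into these categories, here is a minimal Python sketch. It is not BottomFeeder's or RSS Bandit's code; the function names and the category labels are made up for the example, and the "looks like utf-8" test is only a heuristic (iso-8859-1 accepts any byte sequence, so a mislabeled feed can never be proven wrong by decoding alone).

```python
import re
from email.message import Message

# Matches the encoding pseudo-attribute of an XML declaration at the start of the body.
XML_DECL = re.compile(rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def declared_charsets(content_type, body):
    """Return (http_charset, xml_charset); either may be None."""
    msg = Message()
    msg["Content-Type"] = content_type
    http_charset = msg.get_content_charset()  # None when no charset parameter
    match = XML_DECL.match(body)
    xml_charset = match.group(1).decode("ascii").lower() if match else None
    return http_charset, xml_charset


def is_valid(body, charset):
    """True if the bytes decode without error in the given charset."""
    try:
        body.decode(charset)
        return True
    except (UnicodeDecodeError, LookupError):
        return False


def classify(content_type, body):
    """Sort a feed into one of the rough categories listed above (hypothetical labels)."""
    http_charset, xml_charset = declared_charsets(content_type, body)
    declared = http_charset or xml_charset
    has_high_bytes = any(b > 0x7F for b in body)
    looks_utf8 = has_high_bytes and is_valid(body, "utf-8")

    if declared is None:
        if not has_high_bytes:
            return "unlabeled, pure ascii"
        return "unlabeled, looks like utf-8" if looks_utf8 else "unlabeled, probably iso-8859-1"
    if not is_valid(body, declared):
        return "labeled, but not valid in that charset"
    if looks_utf8 and declared not in ("utf-8", "utf8"):
        return "labeled, but looks like utf-8"
    return "labeled, consistent"
```

For example, `classify("text/xml", "<rss>café</rss>".encode("iso-8859-1"))` lands in the first category, while the same text encoded as utf-8 lands in the second. The sketch ignores byte order marks and utf-16, which a real consumer would also have to handle.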

In a sense, it no longer matters what the standard says - out in the wild, people are doing all sorts of wild things. The practical impact of this in BottomFeeder has been items that are readable, but have specific characters (typically double quotes and/or apostrophes) munged. Browser developers have addressed this by building charset guessers - if no charset is listed, they score the text for 'goodness' in a few common charsets and pick the winner. I've not done that; instead, I have options to change the encoding on the fly if the text "looks wrong".
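
To make the two approaches concrete, here is a minimal Python sketch of both: a toy charset guesser that scores candidate decodings, and a helper that re-decodes text after the fact. Neither is taken from any browser or from BottomFeeder; the candidate list, the scoring rules, and the names `score`, `guess_charset` and `redecode` are assumptions chosen for illustration, and real guessers use far richer statistics.

```python
# Candidates in tie-break order: on equal scores, the earlier one wins.
CANDIDATES = ("utf-8", "iso-8859-1", "windows-1252")

# Characters that commonly lead a mis-decoded utf-8 sequence (e.g. "Ã©" for "é").
MOJIBAKE_LEADS = "ÃÂâ"


def score(body, charset):
    """Higher is better; -inf means the bytes are not valid in this charset."""
    try:
        text = body.decode(charset)
    except UnicodeDecodeError:
        return float("-inf")
    penalty = 0
    for i, ch in enumerate(text):
        if 0x80 <= ord(ch) <= 0x9F:
            # C1 control characters almost never appear in real text.
            penalty += 1
        elif ch in MOJIBAKE_LEADS and i + 1 < len(text) and ord(text[i + 1]) > 0x7F:
            # A lead character followed by another non-ascii character looks
            # like utf-8 read as a single-byte charset.
            penalty += 1
    return -penalty


def guess_charset(body):
    """Return the candidate charset with the best score."""
    return max(CANDIDATES, key=lambda charset: score(body, charset))


def redecode(text, wrong_charset, right_charset):
    """Undo a wrong decode: recover the original bytes, then decode them again."""
    return text.encode(wrong_charset).decode(right_charset)
```

The munged quotes described above are easy to reproduce: `"“quoted”".encode("utf-8").decode("iso-8859-1")` yields each curly quote as "â" followed by two invisible control characters. Passing that string to `redecode(munged, "iso-8859-1", "utf-8")` restores the original, which is roughly what a manual "change the encoding" option does.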

I don't expect this to get better anytime soon - the tools are already out there, and there's confusion in every direction over what the "right" assumptions are....


Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use