This post originated from an RSS feed registered with Agile Buzz
by James Robertson.
Original Post: Into the noise, some meaning
Feed Title: Cincom Smalltalk Blog - Smalltalk with Rants
Feed URL: http://www.cincomsmalltalk.com/rssBlog/rssBlogView.xml
Feed Description: James Robertson comments on Cincom Smalltalk, the Smalltalk development community, and IT trends and issues in general.
On the Atom mailing list, there's been a lot of talk recently about what should/should not be done with malformed feeds. The answer probably differs based on context:
In a B2B context, you probably want to reject malformed XML data. This isn't an appropriate place to make a "best guess" and move along.
In a consumer context (i.e., the one most news aggregators live in), it's reasonable to flag the bad data (so that a user who cares can report it) and try to present it anyway.
The difference is context - if it's a business level communication, then guessing isn't appropriate. If, on the other hand, I'm trying to find out what the latest baseball scores are, then I don't really care about the stray Unicode character that wandered into a feed.
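As a rough illustration of the two policies, here is a minimal Python sketch. The names `parse_feed`, `ParseResult` and the `strict` flag are made up for this example, and the "repair" step (dropping stray control characters) is just one crude guess at what a forgiving aggregator might try, not how any particular aggregator works.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParseResult:
    """Hypothetical container: the parsed tree plus anything worth reporting."""
    root: Optional[ET.Element]
    warnings: List[str] = field(default_factory=list)


def parse_feed(data: bytes, strict: bool) -> ParseResult:
    """Parse feed bytes; reject bad XML when strict, else flag it and salvage."""
    try:
        return ParseResult(root=ET.fromstring(data))
    except ET.ParseError as err:
        if strict:
            # Business context: no guessing, hand the error back to the caller.
            raise
        # Consumer context: record the problem so a user who cares can report it.
        warnings = [f"not well-formed: {err}"]
        # Crude best guess: decode leniently and drop C0 control characters
        # other than tab, newline and carriage return.
        text = data.decode("utf-8", errors="replace")
        cleaned = "".join(ch for ch in text if ch in "\t\n\r" or ord(ch) >= 0x20)
        try:
            root = ET.fromstring(cleaned)
        except ET.ParseError:
            root = None  # the guess did not help; the caller still gets the warning
        return ParseResult(root=root, warnings=warnings)
```

Calling `parse_feed(data, strict=True)` is the B2B behavior, and `strict=False` is the aggregator behavior: the user still sees the baseball scores, and the warning is there for anyone who wants to nag the publisher.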
The truly interesting piece is the set of stats that Mark Pilgrim dug up:
I analyzed 5096 RSS and Atom feeds chosen at random from Syndic8.com
and parsed them with Universal Feed Parser 3.0.1 using the latest
version of libxml2 as the underlying XML parser.
Actually, I analyzed more feeds than that, but I threw away feeds that
didn't either return an HTTP status code 200 or redirect to a URL that returned 200, or
didn't have a recognizable root-level element of some version of RSS or Atom
3929 feeds (77.10%) were well-formed.
961 feeds (18.86%) were not well-formed due to specifying "Content-Type: text/xml" but containing non-us-ascii characters.
206 feeds (4.04%) were not well-formed for other reasons.
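The 18.86% category is worth a quick sketch. As I understand RFC 3023, a `text/xml` response with no `charset` parameter is to be treated as us-ascii regardless of what the XML declaration says, so any byte outside that range makes the feed ill-formed. The code below assumes that reading. The function names are invented for illustration, and this is not how Universal Feed Parser itself is written.

```python
from email.message import Message
from typing import Optional


def http_charset(content_type: str) -> Optional[str]:
    """Return the charset implied by a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lowercased charset parameter, or None
    if charset:
        return charset
    if msg.get_content_type() == "text/xml":
        # Assumption based on RFC 3023: text/xml without a charset is us-ascii.
        return "us-ascii"
    # Other media types: defer to the document itself (not handled here).
    return None


def violates_declared_charset(content_type: str, body: bytes) -> bool:
    """True if the body cannot be decoded with the charset the header implies."""
    charset = http_charset(content_type)
    if charset is None:
        return False
    try:
        body.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return True
    return False
```

For example, `violates_declared_charset("text/xml", "<title>café</title>".encode("utf-8"))` is true, while the same bytes served as `text/xml; charset=utf-8` pass. That is a server configuration problem more than a feed format problem.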
Nearly a quarter of the feeds sampled (1167 of 5096, or 22.90%) have issues, and this likely holds across all feeds. They are also issues that a tighter spec is not going to solve. We've crossed the Rubicon on this one, at least in the consumer space...