This post originated from an RSS feed registered with Ruby Buzz
by Christian Neukirchen.
Original Post: The Dark Side of Atom
Feed Title: chris blogs: Ruby stuff
Feed URL: http://chneukirchen.org/blog/category/ruby.atom
Feed Description: a weblog by christian neukirchen - Ruby stuff
Yesterday antifuchs told me about a problem
with the Atom feed of Anarchaia, which
now and then includes IRC quotes like this:
#ruby-de
12:18 <ionas_> alles was nicht analog ist ist lossy ;p
12:18 <ionas_> und alles was analog ist geht schnell kaputt ,p
In raw HTML, it looks like this (the code is taken directly from
the generated HTML):
<div class="ircquote">
<span class="channel">#ruby-de</span>
<div class="line">12:18 &lt;ionas_&gt; alles was nicht analog ist ist lossy ;p</div>
<div class="line">12:18 &lt;ionas_&gt; und alles was analog ist geht schnell kaputt ,p</div>
</div>
In default IRC style, I quote the nickname with < and >, but
antifuchs tells me he doesn’t see any nicks when he looks at my blog
with Bloglines. Weird, I think, and
decide to have a look at it.
Just for fun, I subscribe to my blog in
NetNewsWire and I see… no
nicknames! Now, how is my Atom feed generated? The snippet looks
roughly like this:
<entry>
<title>25</title>
<!-- ... -->
<content mode="xml" xmlns="http://www.w3.org/1999/xhtml">
<div class="ircquote">
<span class="channel">#ruby-de</span>
<div class="line">12:18 &lt;ionas_&gt; alles was nicht analog ist ist lossy ;p</div>
<div class="line">12:18 &lt;ionas_&gt; und alles was analog ist geht schnell kaputt ,p</div>
</div>
</content>
</entry>
And I start to wonder. My Atom feed is perfectly valid, and I just
inserted the raw (and valid) XHTML as-is. This should be OK.
To quote the Atom specification (emphasis mine):
3) If the value of “type” is “xhtml”, the content of atom:content
MUST be a single XHTML div element [XHTML], and SHOULD be
suitable for handling as XHTML. The XHTML div element itself
MUST NOT be considered part of the content. Atom Processors that
display the content MAY use the markup to aid in displaying it.
The escaped versions of characters such as “&” and “>” represent
those characters, not markup.
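To make the last sentence of that quote concrete, here is a minimal Python sketch of how an XML parser treats an escaped nick. The fragment is a hypothetical one modelled on the feed entry (namespace omitted for brevity), and the standard library's ElementTree stands in for whatever parser a feed reader really uses; that choice is an assumption, not something the readers are known to do.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, modelled on one line of the feed entry.
fragment = (
    '<div class="line">'
    "12:18 &lt;ionas_&gt; alles was nicht analog ist ist lossy ;p"
    "</div>"
)

div = ET.fromstring(fragment)

# The parser decodes the entities into plain character data ...
text = div.text

# ... and does not create a child element for the nick.
children = list(div)

# Serializing the tree again escapes the characters once more.
reserialized = ET.tostring(div, encoding="unicode")

print(repr(text))
print(children)
print(reserialized)
```

After one parse the nick is ordinary text, not markup. A processor that hands this text on to an HTML engine has to escape it again, as the serializer does here.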
Apparently, both Bloglines and NetNewsWire pass the XHTML on
to a rendering engine: my browser in the first case, HTMLKit in the second.
Those seem to parse it a second time, thereby creating the tag
<ionas_>. I fixed that by escaping every & in my Atom feeds
as &amp;, so now the nick reads &amp;lt;ionas_&amp;gt;. Which
is more than ugly and really pisses me off.
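The suspected double parse can be simulated in a few lines of Python. This is only a sketch of the guess made above, not the actual code of Bloglines or NetNewsWire: `render` and `TagCollector` are hypothetical names, `html.unescape` stands in for the feed's XML parse, and `html.parser.HTMLParser` stands in for the rendering engine.

```python
from html import escape, unescape
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the start tags and text an HTML parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def render(feed_text):
    """Hypothetical reader: decode once, then re-parse as HTML."""
    decoded = unescape(feed_text)  # stand-in for the XML parse of the feed
    collector = TagCollector()
    collector.feed(decoded)        # second parse, by the rendering engine
    collector.close()
    return collector.tags, "".join(collector.text)


line = "12:18 <ionas_> alles was nicht analog ist ist lossy ;p"

single = escape(line, quote=False)     # nick becomes &lt;ionas_&gt;
double = single.replace("&", "&amp;")  # nick becomes &amp;lt;ionas_&amp;gt;

tags_single, text_single = render(single)
tags_double, text_double = render(double)

print(tags_single, repr(text_single))
print(tags_double, repr(text_double))
```

With the single-escaped text the collector reports an `ionas_` start tag and the nick is missing from the visible text. With the doubled escaping no tag is seen and the nick survives, which matches the workaround described above.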
When I see such stuff, I sometimes think RSS really did it better
when they just decided to escape the whole thing and strew their
entities all over. That would be consistent, at least.
The civilization of today surely will go down because escaping doesn’t
work (and don’t even get me started on encodings, oh my…).