This post originated from an RSS feed registered with Python Buzz
by Phillip Pearson.
Original Post: Changing the Structured Blogging plugins' XML output
Feed Title: Second p0st
Feed URL: http://www.myelin.co.nz/post/rss.xml
Feed Description: Tech notes and web hackery from the guy that brought you bzero, Python Community Server, the Blogging Ecosystem and the Internet Topic Exchange
One current issue with the Structured Blogging plugins is that they produce HTML that doesn't validate on the W3C validator and feeds that produce warnings on the Feed Validator.
This is because of the method used to embed the structured post's XML source in the HTML output.
How the output looks
The current output looks like this; the XML source for the post is the <xml-structured-blog-entry> element:
<script type="application/x-subnode; charset=utf-8">
<!-- the following is structured blog data for machine readers. -->
<subnode alternate-for-id="sbentry_5"
xmlns:data-view="http://www.w3.org/2003/g/data-view#"
data-view:interpreter="http://structuredblogging.org/subnode-to-rdf-interpreter.xsl"
xmlns="http://www.structuredblogging.org/xmlns#subnode">
<xml-structured-blog-entry xmlns="http://www.structuredblogging.org/xmlns">
<generator id="wpsb-1" type="x-wpsb-post" version="1"/>
<event type="event/conference">
<name>Doc's show</name>
<image>/~phil/sb_latest/images/syndicate_logo.gif</image>
<person role="organizer" url="http://doc.weblogs.com">Doc Searls</person>
<description>This is Doc's show. He organized it, decided what
panels to have, and he's paying for dinner.</description>
<tags>doc</tags>
<begins>2005-12-13T15:57:00</begins>
<ends>2005-12-13T15:57:00</ends>
</event>
</xml-structured-blog-entry>
</subnode>
</script>
This embedding technique, called x-subnode and invented by the guys at PubSub (I think Bob Wyman and Duncan Werner) when they did the first SB plugin, is pretty clever. Because browsers don't know about the application/x-subnode script type, they will completely ignore the contents. This means you don't need to enclose the whole thing in a comment to stop it from being displayed. Then, you can just drop the whole thing into an RSS <description> or Atom <content> element and have the structured data flow out through the feed.
Other bits to note:
The alternate-for-id attribute points to an ID earlier in the page which encloses the HTML of this post. This would let a Greasemonkey script reformat the post if it wanted to - or allow a crawler to go back from the structured data to the actual HTML.
The xmlns:data-view and data-view:interpreter attributes on <subnode> are there to enable GRDDL, which lets RDF people extract meaning from the XML content. This lets us be "RDF compatible" without having to actually generate the RDF.
So, in summary:
It lets you embed XML inside HTML without commenting it out.
The XML is still accessible using an XML parser, so XSLT etc works.
GRDDL tools will be able to turn it into RDF.
It works inside HTML and also inside RSS/Atom, so a separate embedding method isn't required for feeds.
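To make the "still accessible using an XML parser" point concrete, here's a rough sketch of how a crawler might pull the embedded XML back out of a page. It uses only the Python standard library. The SubnodeExtractor class, the extract_subnodes function and the SAMPLE_PAGE wrapper are made up for illustration and aren't part of any SB plugin or PubSub code; the sample data is a cut-down copy of the output above.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Namespace taken from the example output above.
SB_NS = "http://www.structuredblogging.org/xmlns"

# Cut-down copy of the example output, wrapped in hypothetical page HTML.
SAMPLE_PAGE = """<html><body>
<div id="sbentry_5"><h3>Doc's show</h3></div>
<script type="application/x-subnode; charset=utf-8">
<!-- the following is structured blog data for machine readers. -->
<subnode alternate-for-id="sbentry_5"
 xmlns="http://www.structuredblogging.org/xmlns#subnode">
<xml-structured-blog-entry xmlns="http://www.structuredblogging.org/xmlns">
<event type="event/conference">
<name>Doc's show</name>
<begins>2005-12-13T15:57:00</begins>
</event>
</xml-structured-blog-entry>
</subnode>
</script>
</body></html>"""


class SubnodeExtractor(HTMLParser):
    """Collect the raw text inside each x-subnode <script> element."""

    def __init__(self):
        super().__init__()
        self._in_subnode = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        script_type = dict(attrs).get("type") or ""
        if tag == "script" and script_type.startswith("application/x-subnode"):
            self._in_subnode = True
            self._buffer = []

    def handle_data(self, data):
        # Script bodies are passed through as plain text, possibly in chunks.
        if self._in_subnode:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_subnode:
            self.payloads.append("".join(self._buffer))
            self._in_subnode = False


def extract_subnodes(html):
    """Return one parsed <subnode> element per x-subnode script in the page."""
    parser = SubnodeExtractor()
    parser.feed(html)
    parser.close()
    return [ET.fromstring(payload.strip()) for payload in parser.payloads]


if __name__ == "__main__":
    for subnode in extract_subnodes(SAMPLE_PAGE):
        event = subnode.find(f".//{{{SB_NS}}}event")
        print(subnode.get("alternate-for-id"), event.findtext(f"{{{SB_NS}}}name"))
```

The alternate-for-id value that comes back is what a crawler would use to find the matching HTML for the post.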
Problems
Unfortunately, using <script> for all this fires off warnings everywhere we go, and pretty much everyone who looks at the embedded data, whether in a web page or in a feed, has a really bad first impression. So, it's time to do something about that.
... then, in the profile page, refer to the data-view profile and point to the SB XSLT file using profileTransformation. This will cause the XSLT file to be run on pages generated by the SB plugin.
Getting the XML out of the page
After setting up the GRDDL profile/transform, we could define a microformat to link to the XML source and move it to another URL. This way an RDF crawler would still pick up on it, while crawlers specifically looking for SB posts could look for the links and work from there.
I'm not quite sure how this should look, but here's one possibility: put a class name (e.g. sb_post) on an element surrounding the post, and inside that element, link to the XML source with rel="sb_source". So the HTML for a post might look like:
<div class="sb_post">
<h3>This is the post title</h3>
<p>Here is some text</p>
<p>(<a rel="sb_source" href="/path/to/xml_source">XML</a>)</p>
</div>
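For a sense of what a crawler would have to do with that markup, here's a minimal sketch using Python's standard html.parser. The sb_post and sb_source names come from the proposal above; the SbSourceFinder class, the find_sb_sources function and the example.org base URL are hypothetical. It assumes reasonably well-nested HTML (an unclosed <p> would throw the depth counter off), so treat it as an illustration rather than a robust crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

SAMPLE_HTML = """<div class="sb_post">
<h3>This is the post title</h3>
<p>Here is some text</p>
<p>(<a rel="sb_source" href="/path/to/xml_source">XML</a>)</p>
</div>
<p><a rel="sb_source" href="/not/in/a/post">stray link</a></p>"""


class SbSourceFinder(HTMLParser):
    """Collect rel="sb_source" links that sit inside a class="sb_post" element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self._depth = 0  # 0 means "not inside an sb_post element"
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth > 0:
            if tag not in VOID_TAGS:
                self._depth += 1
        elif "sb_post" in (attrs.get("class") or "").split():
            self._depth = 1
        rel_values = (attrs.get("rel") or "").split()
        href = attrs.get("href")
        if self._depth > 0 and tag == "a" and "sb_source" in rel_values and href:
            # Resolve relative links against the page URL.
            self.sources.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if self._depth > 0 and tag not in VOID_TAGS:
            self._depth -= 1


def find_sb_sources(html, base_url):
    """Return absolute URLs of the XML sources linked from sb_post elements."""
    finder = SbSourceFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.sources


if __name__ == "__main__":
    print(find_sb_sources(SAMPLE_HTML, "http://example.org/blog/"))
```

Each URL found this way costs the crawler one more HTTP request to fetch the XML itself.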
Making the XML more accessible inside feeds
Currently the whole chunk of XML (above) is embedded in the description or content elements in syndication feeds, as part of the encoded HTML. It would look a lot nicer if it could be moved out - perhaps like this:
<item>
...
<description>HTML goes here</description>
<source xmlns="http://structuredblogging.org/xmlns">
core XML -- <event> from the first example -- goes here
</source>
</item>
We could GRDDL-enable this by putting a namespaceTransformation reference in the xmlns document.
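On the consuming side, a feed parser would need something like the following to pick up the new element. This is only a sketch with xml.etree.ElementTree: the structured_payload function and the sample items are invented for illustration, and the namespace URI is copied from the proposed syntax above. Looking the element up by namespace matters, because RSS 2.0 already has an un-namespaced <source> element of its own.

```python
import xml.etree.ElementTree as ET

# Namespace URI as written in the proposed feed syntax above.
SB_FEED_NS = "http://structuredblogging.org/xmlns"

# Hypothetical item using the proposed syntax, with a trimmed-down event.
SAMPLE_ITEM = """<item>
  <title>Doc's show</title>
  <description>HTML goes here</description>
  <source xmlns="http://structuredblogging.org/xmlns">
    <event type="event/conference">
      <name>Doc's show</name>
      <begins>2005-12-13T15:57:00</begins>
    </event>
  </source>
</item>"""

# Ordinary RSS 2.0 item whose <source> is the standard, un-namespaced one.
PLAIN_ITEM = """<item>
  <title>Nothing structured here</title>
  <source url="http://example.org/rss.xml">Example feed</source>
</item>"""


def structured_payload(item):
    """Return the first element inside the SB <source>, or None if absent."""
    source = item.find(f"{{{SB_FEED_NS}}}source")
    if source is None:
        return None
    children = list(source)
    return children[0] if children else None


if __name__ == "__main__":
    payload = structured_payload(ET.fromstring(SAMPLE_ITEM))
    print(payload.tag, payload.get("type"))
```

An item without the SB element simply yields None, so existing feeds would keep working unchanged.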
Pros and cons of the changes
Making these changes would:
make everything look a lot nicer,
and make everything validate,
while maintaining RDF compatibility.
The downside is:
the XML would no longer be directly available inside the HTML, so a crawler would have to make more HTTP requests,
and feed parsers (like the one powering PubSub) would have to be modified to understand the new syntax.