So, I was trying to improve Commentary with respect to its HTML
processing. I was parsing the incoming page with tidy, reading
it with ElementTree and then writing it back out. But tidy was
leaving some entities in there that Expat didn't understand, and
ElementTree was outputting XML that doesn't look like HTML. I think
I have it fixed (maybe), but it was all much harder than it
should have been.
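
One way around the entity problem is to turn HTML's named entities into numeric character references before Expat sees them. This is only a sketch, and not necessarily what Commentary ended up doing. The helper name numeric_entities is made up for illustration; name2codepoint comes from the standard library's html.entities module.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML itself defines; Expat already understands these.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def numeric_entities(markup):
    """Replace HTML named entities (e.g. &nbsp;) with numeric references."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave it untouched
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, markup)

# Without the rewrite, ET.fromstring raises ParseError on &nbsp;.
root = ET.fromstring(numeric_entities("<p>a&nbsp;b &amp; c</p>"))
```

A regex pass like this is blunt (it would also rewrite entity-looking text inside a CDATA section), but it is enough to get tidy's output past the parser.
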
I first started with minidom (or PyXML?), because I
wanted to implement the same algorithms in Javascript, so I preferred
a DOM interface. This worked for a little while, but from what I can
tell minidom is just really, really broken. getElementById didn't
work, and after looking at the code it seemed like it couldn't work
given the way I was creating the DOM, which was the only way I saw to
create the DOM -- there's basically no useful documentation for that
module, which is a problem when it doesn't work as claimed. Then
later I was getting a problem with inserting nodes, because you can't
insert a document-type node (only elements and text nodes and
whatnot)... except from what I could tell of the source it was just
utterly and completely wrong, and was testing the node type against
ELEMENT_TYPE. These are such glaringly obvious errors that I didn't
know what to make of them; did I completely not understand what I was
doing? Is this code just completely abandoned and unloved?
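
One plausible explanation for the getElementById failure is that minidom only finds elements whose attribute has been declared as an ID, and a plain parse with no DTD declares nothing. This is an assumption, not something checked against the original code. The sketch below shows that behavior with the standard library's xml.dom.minidom; the sample markup is invented.

```python
from xml.dom import minidom

doc = minidom.parseString('<html><body><p id="intro">hi</p></body></html>')
p = doc.getElementsByTagName("p")[0]

# No attribute has been marked as an ID yet, so the lookup finds nothing.
before = doc.getElementById("intro")

# Tell minidom that this element's "id" attribute is an ID.
p.setIdAttribute("id")
after = doc.getElementById("intro")
```

Here before is None and after is the p element, so the lookup does work once every element of interest has been marked by hand.
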
Anyway, I felt okay about the algorithm by that time, worked
around the problems, and then reimplemented using ElementTree. This
introduced some problems, because ElementTree's model isn't much
like the DOM. In the DOM every node knows about its siblings, parent,
etc. Elements in ElementTree don't know about any of that (which is
conventional in Python and most languages: an object doesn't know
about its container). But that was inconvenient, so I had to make a
wrapper to give me access to that information.
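
A wrapper like that can be quite small. The following is a minimal sketch, not the actual Commentary code: the class name TreeIndex and its methods are hypothetical, and it assumes the tree isn't modified after the index is built.

```python
import xml.etree.ElementTree as ET

class TreeIndex:
    """Hypothetical wrapper that remembers each element's parent."""

    def __init__(self, root):
        self.root = root
        # Map every child element to the element that contains it.
        self.parents = {child: parent
                        for parent in root.iter()
                        for child in parent}

    def parent(self, elem):
        # The root has no parent, so this returns None for it.
        return self.parents.get(elem)

    def next_sibling(self, elem):
        parent = self.parent(elem)
        if parent is None:
            return None
        siblings = list(parent)
        i = siblings.index(elem)
        return siblings[i + 1] if i + 1 < len(siblings) else None

root = ET.fromstring("<div><p>a</p><span>b</span></div>")
index = TreeIndex(root)
first, second = root[0], root[1]
```

With this, index.parent(first) is root and index.next_sibling(first) is second. The map goes stale as soon as nodes are inserted or removed, so a real wrapper would have to rebuild or update it.
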
Then there's the issue that there's no code I know of that knows how
to parse HTML (HTMLParser does, of course, but not in a useful
way -- it doesn't create a tree). So everyone uses Tidy to
normalize their code to XHTML, which works but feels really sloppy.
HTML is parseable; in this case, I really only wanted to parse
well-formed HTML anyway. Then, finally, there's no builtin way to
serialize ElementTree to HTML from what I can find. There are some
hints, but they still leave you with empty elements like <a
name="foo" />, which browsers do not like. I had to clone a
write method in ElementTree and make edits to it.
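
To give an idea of what such an edited writer has to do, here is a rough sketch (not the cloned ElementTree method itself). The function name to_html and the VOID_TAGS set are illustrative assumptions, the list of void tags is partial, and comments, processing instructions, and the special escaping rules for script and style are ignored.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Tags HTML treats as void: written as <br>, never <br></br> (partial list).
VOID_TAGS = {"area", "base", "br", "col", "hr", "img",
             "input", "link", "meta", "param"}

def to_html(elem):
    """Serialize an Element to an HTML string (hypothetical helper)."""
    attrs = "".join(" %s=%s" % (name, quoteattr(value))
                    for name, value in elem.attrib.items())
    parts = ["<%s%s>" % (elem.tag, attrs)]
    if elem.tag.lower() not in VOID_TAGS:
        # Always emit an explicit end tag, even when the element is empty.
        parts.append(escape(elem.text or ""))
        for child in elem:
            parts.append(to_html(child))
        parts.append("</%s>" % elem.tag)
    parts.append(escape(elem.tail or ""))
    return "".join(parts)

html = to_html(ET.fromstring('<p>See <a name="foo" /> and<br />more</p>'))
# html == '<p>See <a name="foo"></a> and<br>more</p>'
```

The only real differences from XML output are that empty non-void elements get an explicit end tag and void elements get no end tag at all.
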
I have to say, Javascript and the DOM in the browsers are much
easier to use for HTML processing, even taking into account the
fact that it's Javascript.