Python Buzz Forum - XML Processing

So, I was trying to improve Commentary with respect to its HTML processing. I was parsing the incoming page with tidy, reading it with ElementTree, and then writing it back out. But tidy was leaving some entities in there that Expat didn't understand, and ElementTree was outputting XML that doesn't look like HTML. I think I have it fixed (maybe), but it was all much harder than it should have been.

I first started with minidom (or PyXML?) because I wanted to implement the same algorithms in Javascript, so I preferred a DOM interface. This worked for a little while, but from what I can tell minidom is just really broken. getElementById didn't work, and after looking at the code it seemed like it couldn't work given the way I was creating the DOM, which was the only way I saw to create the DOM -- there's basically no useful documentation for that module, which is a problem when it doesn't work as claimed. Then later I had a problem with inserting nodes, because you can't insert a document-type node (only elements and text nodes and whatnot)... except from what I could tell of the source, the check was just utterly and completely wrong, and was testing the node type against ELEMENT_TYPE. These are such glaringly obvious errors that I didn't know what to make of them; did I completely not understand what I was doing? Is this code just completely abandoned and unloved?
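For what it's worth, here is a small sketch of the getElementById behaviour as it can be reproduced with the minidom in current Python versions. The sample document is made up, and this may not be exactly the failure described above. The lookup returns None unless some attribute has been flagged as an ID, which setIdAttribute does by hand.

```python
from xml.dom import minidom

# Made-up sample document; no DTD declares "id" as an ID-type attribute.
doc = minidom.parseString('<root><p id="intro">hello</p></root>')

# With no ID-type attributes known to the document, the lookup finds nothing.
before = doc.getElementById("intro")

# Flag the attribute as an ID by hand, then look it up again.
para = doc.getElementsByTagName("p")[0]
para.setIdAttribute("id")
after = doc.getElementById("intro")

print(before)         # None
print(after is para)  # True
```

So an attribute merely being named "id" is not enough; something has to say it is an ID.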

Anyway, I felt okay about the algorithm by that time, so I worked around the problems and then reimplemented it using ElementTree. This introduced some problems of its own, because ElementTree's model isn't much like the DOM. In the DOM every node knows about its siblings, parent, etc. Elements in ElementTree don't know about any of that (which is conventional in Python and most languages: an object doesn't know about its container). But that was inconvenient, so I had to make a wrapper to give me access to that information.
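A wrapper along those lines might look roughly like the sketch below. This is not the actual Commentary code; the class name TreeNavigator and its methods are hypothetical, and it assumes the tree is not modified after the wrapper is built.

```python
import xml.etree.ElementTree as ET


class TreeNavigator:
    """Hypothetical wrapper adding DOM-style navigation to an ElementTree."""

    def __init__(self, root):
        self.root = root
        # Map each child element to its parent; the root gets no entry.
        # The map goes stale if the tree is modified afterwards.
        self._parents = {child: parent
                         for parent in root.iter()
                         for child in parent}

    def parent(self, elem):
        """Return the parent element, or None for the root."""
        return self._parents.get(elem)

    def next_sibling(self, elem):
        """Return the following sibling element, or None if there is none."""
        parent = self.parent(elem)
        if parent is None:
            return None
        siblings = list(parent)
        index = siblings.index(elem)
        if index + 1 < len(siblings):
            return siblings[index + 1]
        return None


root = ET.fromstring("<div><p>one</p><p>two</p></div>")
nav = TreeNavigator(root)
first, second = list(root)
print(nav.parent(first).tag)        # div
print(nav.next_sibling(first).text) # two
```

Building the parent map once up front keeps each lookup cheap, at the cost of having to rebuild it whenever nodes are inserted or removed.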

Then there's the issue that there's no code I know of that knows how to parse HTML (HTMLParser does, of course, but not in a useful way -- it doesn't create a tree). So everyone uses Tidy to normalize their markup to XHTML, which works but feels really sloppy. HTML is parseable; in this case, I really only wanted to parse well-formed HTML anyway. Then, finally, there's no builtin way to serialize ElementTree to HTML from what I can find. There are some hints, but they still leave you with empty elements like <a name="foo" />, which browsers do not like. I had to clone a write method in ElementTree and make edits to it.
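To make the empty-element problem concrete, here is a small sketch with a made-up snippet. It also uses the method="html" option that the standard library's xml.etree.ElementTree offers in current Python versions, which is not the cloned write method described above and may not have been available when this was written.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup with an empty anchor and a line break.
root = ET.fromstring('<p>Jump to <a name="foo" /> here.<br /></p>')

# Default XML serialization collapses empty elements.
as_xml = ET.tostring(root, encoding="unicode")

# HTML serialization writes an explicit close tag for the anchor
# and leaves void elements like <br> unclosed.
as_html = ET.tostring(root, encoding="unicode", method="html")

print(as_xml)   # <p>Jump to <a name="foo" /> here.<br /></p>
print(as_html)  # <p>Jump to <a name="foo"></a> here.<br></p>
```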

I have to say, Javascript and the DOM in the browsers are much easier to use for HTML processing, even taking into account the fact that it's Javascript.

