The Artima Developer Community

Python Buzz Forum
XML Processing

Ian Bicking


Ian Bicking is a freelance programmer
XML Processing Posted: Jan 3, 2006 6:25 PM

This post originated from an RSS feed registered with Python Buzz by Ian Bicking.
Original Post: XML Processing
Feed Title: Ian Bicking
Feed URL: http://www.ianbicking.org/feeds/atom.xml
Feed Description: Thoughts on Python and Programming.

So, I was trying to improve Commentary's HTML processing. I was parsing the incoming page with tidy, reading it with ElementTree, and then writing it back out. But tidy was leaving in some entities that Expat didn't understand, and ElementTree was outputting XML that doesn't look like HTML. I think I have it fixed (maybe), but it was all much harder than it should have been.
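As a small illustration of the entity problem (not the actual Commentary code), the sketch below feeds ElementTree, which parses with Expat, one document containing the named HTML entity &nbsp; and one containing the equivalent numeric reference. The helper name try_parse is made up for this example.

```python
import xml.etree.ElementTree as ET


def try_parse(markup):
    """Return the parsed root element, or the ParseError that Expat raised."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError as exc:
        return exc


# &nbsp; is defined by the HTML DTDs, not by XML itself, so without a DTD
# the parser treats it as an undefined entity.
named = try_parse("<p>a&nbsp;b</p>")

# A numeric character reference needs no DTD and parses normally.
numeric = try_parse("<p>a&#160;b</p>")

print(type(named).__name__)
print(repr(numeric.text))
```

So if tidy can be made to emit numeric references instead of named entities, ElementTree should read its output without complaint.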

I first started with minidom (or PyXML?) because I wanted to implement the same algorithms in JavaScript, so I preferred a DOM interface. This worked for a little while, but from what I can tell minidom is just really, really broken. getElementById didn't work, and after looking at the code it seemed like it couldn't work given the way I was creating the DOM, which was the only way I saw to create the DOM. There's basically no useful documentation for that module, which is a problem when it doesn't work as claimed. Later I ran into a problem inserting nodes, because you can't insert a document-type node (only elements and text nodes and whatnot)... except that, from what I could tell of the source, the check was just utterly and completely wrong, and was testing the node type against ELEMENT_TYPE. These are such glaringly obvious errors that I didn't know what to make of them. Did I completely not understand what I was doing? Is this code just completely abandoned and unloved?
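The getElementById complaint can be reproduced with the minidom that ships in current CPython. This is a sketch under one assumption: the document is parsed from a string with no DTD, so minidom has no way to know which attributes are IDs. How the DOM was built in the original code is not known. The fallback helper find_by_id is hypothetical and simply scans every element for a matching id attribute.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<html><body><div id="main">hello</div><p id="note">bye</p></body></html>'
)

# With no DTD declaring "id" as an ID-typed attribute, the lookup finds nothing.
builtin_result = doc.getElementById("main")


def find_by_id(document, wanted):
    """Linear scan for the first element whose "id" attribute equals wanted."""
    for element in document.getElementsByTagName("*"):
        if element.getAttribute("id") == wanted:
            return element
    return None


found = find_by_id(doc, "main")
print(builtin_result)
print(found.tagName)
```

The scan is O(n) per lookup, which is fine for a single page. For repeated lookups it may be better to build a dictionary from id to element once.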

By that time I felt okay about the algorithm, so I worked around the problems and then reimplemented it using ElementTree. This introduced some problems of its own, because ElementTree's model isn't much like the DOM. In the DOM every node knows about its siblings, parent, etc. Elements in ElementTree don't know about any of that (which is conventional in Python and most languages: an object doesn't know about its container). But that was inconvenient, so I had to make a wrapper to give me access to that information.
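The wrapper itself isn't shown here, so what follows is only one plausible shape for it, not the code actually used. The class name TreeIndex and its methods are invented for this sketch. It builds a child-to-parent map once and answers parent and sibling queries from it.

```python
import xml.etree.ElementTree as ET


class TreeIndex:
    """Read-only index giving ElementTree elements DOM-like navigation."""

    def __init__(self, root):
        self.root = root
        # Map every child element to its parent. The root has no entry.
        self.parents = {child: parent for parent in root.iter() for child in parent}

    def parent(self, elem):
        return self.parents.get(elem)

    def _sibling(self, elem, offset):
        parent = self.parent(elem)
        if parent is None:
            return None
        children = list(parent)
        position = children.index(elem) + offset
        if 0 <= position < len(children):
            return children[position]
        return None

    def next_sibling(self, elem):
        return self._sibling(elem, 1)

    def previous_sibling(self, elem):
        return self._sibling(elem, -1)


root = ET.fromstring("<div><h1>Title</h1><p>one</p><p>two</p></div>")
index = TreeIndex(root)
first_p = root.findall("p")[0]
print(index.parent(first_p).tag)
print(index.previous_sibling(first_p).tag)
print(index.next_sibling(first_p).text)
```

The map goes stale as soon as the tree is modified, so a wrapper like this has to be rebuilt (or updated by hand) after every insert or removal.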

Then there's the issue that there's no code I know of that knows how to parse HTML (HTMLParser does, of course, but not in a useful way: it doesn't create a tree). So everyone uses Tidy to normalize their code to XHTML, which works but feels really sloppy. HTML is parseable; in this case, I really only wanted to parse well-formed HTML anyway. Then, finally, there's no builtin way to serialize ElementTree to HTML from what I can find. There are some hints, but they still leave you with empty elements like <a name="foo" />, which browsers do not like. I had to clone a write method in ElementTree and make edits to it.
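For comparison, here is a sketch of both halves using only the standard library in current Python 3. It is not the approach described above. It assumes the input is well-formed HTML, it uses a hypothetical TreeHTMLParser class that drives ElementTree's TreeBuilder from HTMLParser callbacks, and it relies on the method="html" option that ElementTree.tostring accepts in current versions. The VOID set is a deliberately short, incomplete list of elements that never take an end tag.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Incomplete, illustrative list of elements written without an end tag.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TreeHTMLParser(HTMLParser):
    """Build an ElementTree from well-formed HTML (no error recovery)."""

    def __init__(self):
        # convert_charrefs=True turns &nbsp; and friends into characters.
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as empty strings.
        self.builder.start(tag, {name: value or "" for name, value in attrs})
        if tag in VOID:
            self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def parse_html(markup):
    parser = TreeHTMLParser()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return parser.builder.close()  # returns the root element


root = parse_html('<p>a&nbsp;b <a name="foo"></a><br></p>')
as_xml = ET.tostring(root, encoding="unicode")
as_html = ET.tostring(root, encoding="unicode", method="html")
print(as_xml)
print(as_html)
```

The default XML output collapses the empty anchor into <a name="foo" />, while the HTML method keeps the explicit </a> and writes <br> without a closing tag. Anything malformed (unclosed or misnested tags) is outside what this sketch handles, which is where Tidy or a dedicated HTML parser still earns its keep.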

I have to say, JavaScript and the DOM in the browsers are much easier to use for HTML processing, even taking into account the fact that it's JavaScript.


Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use