I had a very small number of complaints related to basing Kid on
ElementTree. This came in two forms:
SAX and DOM are “standard” and while ElementTree is a drastically improved
system for processing XML in Python, it doesn't matter because everyone
already knows SAX/DOM.
“libxml2 is teh rawk!”
First, if Python's W3C DOM standard-based xml.dom package were a movie, it
would be called Elf, starring xml.dom. It's the episode of Little House
on the Prairie where Alien asks Michael Landon for permission to
marry his daughter. It does not belong here!
Next, in terms of pythonicness, libxml2 is almost worse than xml.dom but you
at least get something for it: they don't even have a word to describe this
level of “fast” and it comes along with XPath, RelaxNG, XSD, XML-Base,
XInclude, and XSLT. My issue with libxml2 is just that it's a bad dependency
for a project like Kid that wants to be able to run on cheap web space with
bare-bones Python support. There are a lot of hosting providers that aren't
going to have libxml2 or the option of compiling from source.
I went with ElementTree because it's simple, pythonic, and fast enough. I also
had a feeling we'd be seeing more development around ElementTree, which brings
us nicely to why I'm posting.
Fredrik Lundh announced cElementTree, a C implementation of his
ElementTree XML parsing library for Python. The initial
numbers coming out of effland look excellent:
| library | time | space |
|---|---|---|
| xml.dom.minidom (Python 2.1) | 6.3 s | 80000k |
| xml.dom.minidom (Python 2.4) | 1.4 s | 53000k |
| ElementTree 1.2 | 1.6 s | 14500k |
| ElementTree 1.2.1/1.3 | 1.1 s | 14500k |
| PyRXPU (C extension) | 0.22 s | 11500k |
| cElementTree 0.8 (C extension) | 0.058 s | 5700k |
| readlines (read as text) | 0.032 s | 5050k |
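Those are Fredrik's numbers, and his exact benchmark setup isn't reproduced here. To get a rough feel for this kind of comparison on your own machine, something like the following sketch might do. It is not his benchmark: the generated document, the `make_document`, `time_parse`, and `compare` helpers, and the use of the `xml.etree.ElementTree` copy bundled with later Python releases are all my assumptions, and it only measures time, not space.

```python
import time
import xml.dom.minidom
# Assumption: ElementTree as bundled with later Python releases.
import xml.etree.ElementTree as ET


def make_document(n_items=2000):
    """Build a synthetic XML document with n_items child elements (hypothetical test data)."""
    items = "".join('<item id="%d">value %d</item>' % (i, i) for i in range(n_items))
    return "<root>%s</root>" % items


def time_parse(parse, text, repeat=3):
    """Return the best wall-clock time, in seconds, over `repeat` parses of `text`."""
    best = None
    for _ in range(repeat):
        start = time.perf_counter()
        parse(text)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


def compare(text):
    """Time the same document through minidom and ElementTree."""
    return {
        "xml.dom.minidom": time_parse(xml.dom.minidom.parseString, text),
        "ElementTree": time_parse(ET.fromstring, text),
    }


if __name__ == "__main__":
    for name, seconds in compare(make_document()).items():
        print("%-16s %.4f s" % (name, seconds))
```

Taking the best of a few runs keeps one-off hiccups out of the result; the absolute numbers will depend entirely on your machine and Python version.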
This comes on the heels of a well-hidden announcement by Martijn
Faassen on the lxml mailing list Saturday:
The lxml.etree implementation of ElementTree, on top of libxml2, is getting
there now. It features automatic memory management and quite a bit of
ElementTree compatibility. Not all of the ElementTree API has been
implemented yet, but enough for many use cases.
As everyone is already quite aware, libxml2 is fast. But as I
mentioned, the python bindings that ship with libxml2 are painful; many a
hacker has been seduced by its performance only to be bitten later by
monsters growing out of the large impedance mismatch it creates with the rest
of your python code.
This is all really great news, of course, but now there are questions to be
asked and work to be done:
Will Fredrik and others collaborate to create a compatibility definition for
these different ElementTree implementations? I'd like to see a definition of
a mandatory ElementTree API. Ideally, whether to use cElementTree,
lxml.etree, or ElementTree proper would be a decision based on what was
available in a given environment, not a decision made when coding.
I'd like to see libxml2 added to Fredrik's comparison table. (Fredrik: ping)
At some point in the future (Python 3000?), I'd like to see ElementTree or
its equivalent rolled into the core library. This seems unlikely though, as
I don't think XML-SIG or the greater python community wants the Python/XML
waters any murkier. I partially agree but the number of people looking
outside of core python's XML support for functionality it provides says that
it isn't getting the job done.
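To make the first question above a little more concrete, here is a minimal sketch of what "decide by environment, not when coding" could look like. It assumes the three implementations agree on a small common subset (`fromstring` and `findtext` here), which is exactly the compatibility definition I'm asking for and not something anyone has promised. The `load_etree` and `first_title` names are hypothetical, and the final fallback to `xml.etree.ElementTree` assumes a Python release that bundles ElementTree.

```python
def load_etree():
    """Return (module, name) for the first ElementTree-style module that imports."""
    try:
        from lxml import etree  # libxml2-backed implementation, if installed
        return etree, "lxml.etree"
    except ImportError:
        pass
    try:
        import cElementTree as etree  # standalone C implementation
        return etree, "cElementTree"
    except ImportError:
        pass
    try:
        from elementtree import ElementTree as etree  # pure-Python package
        return etree, "elementtree.ElementTree"
    except ImportError:
        pass
    # Assumption: a Python release that ships ElementTree in the standard library.
    import xml.etree.ElementTree as etree
    return etree, "xml.etree.ElementTree"


etree, backend = load_etree()


def first_title(xml_text):
    """Use only calls assumed to be common to all of the candidate modules."""
    root = etree.fromstring(xml_text)
    return root.findtext("title")


if __name__ == "__main__":
    print(backend, first_title("<doc><title>Kid</title></doc>"))
```

The rest of the code only ever touches `etree`, so swapping implementations is a deployment detail. That only holds as long as you stay inside whatever subset the implementations really do share.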