Summary
Every time someone creates a new XML-based language, God (performs some unspeakable act).
Advertisement
My initial strategy for XML was to recognize it for what it is -- a platform independent way to move data from one place to another -- and then, as much as possible, to ignore it and assume that I would only have to actually look at XML when I was debugging something.
Occasionally, XML is actually treated this way. For example, in XML-RPC, you make an ordinary function call, the mechanism converts it into XML and uses HTTP to transport the call across the network, the return value comes back as XML and the mechanism converts the XML back into useful data. It's the wet-dream of remote procedure calls that we've been trying, and failing, to achieve all these years (someone starts with an idea, then committees descend and you end up with things like CORBA and SOAP).
Alas, I haven't seen XML-RPC used yet in web-servicey things. I hope that it is used somewhere, but if you read my previous entry you'll see that the first two services I tried to use -- Fedex and the Post Office (and UPS apparently also works this way) -- uses something that's almost XML-RPC, but it's not, and you're expected to assemble your XML by hand and then use HTTP to send it to the server, then unbundle the XML that comes back. It's the worst of both worlds: an ordinary RESTful call, but instead of just handing the server plain HTTP arguments, you have to build and dissect the XML. I'm sure there's some reason they decided not to use XML-RPC in these cases, but I can't imagine what it is.
The point being that here's a case where the XML should have been invisible, but you have to mess with it by hand.
Another example is Ant. The creator of this tool has since apologized for using XML, but it's what we're stuck with. And it was too late, anyway, since Maven also appears to use XML. So again, you're stuck creating XML by hand.
For Thinking in Java, I wanted to automatically create the Ant build files for all the chapters, so I created a tool in Python that I call AntBuilder (it's not a general-purpose tool; it's designed specifically for the code in Thinking in Java). In the process I used xml.dom.minidom from the standard Python library. I'm not exactly sure what the genesis of this library is, but the resulting code is terribly verbose and obtuse, and thus difficult to modify and maintain. Indeed, others have created libraries that are simpler in comparison to xml.dom.minidom; elementree is probably the most notable, but I still found it a bit more verbose than I want, and I prefer something more like py.xml (also look at the simplicity of py.test in the same package). It's worth noting, though, that elementree isn't all that bad and if it had been the original XML library in Python I probably wouldn't have gone to this trouble. And more importantly, a subset of elementree will be in Python 2.5.
At the time I was creating AntBuilder, the messiness of xml.dom.minidom made the creation of the build.xml files far too difficult to manage, so I began creating a simplified approach for building XML trees. And when I tried to make some modifications to the code from http://opensource.pseudocode.net/files/shipping.tgz which also used xml.dom.minidom, I ran into the same over-complexity problem. So yesterday I began rewriting the XML code from the AntBuilder project to create a general purpose XML manipulation library called xmlnode.
My goal was to make the resulting code as minimal and (to my standards) readable as possible. To me, XML is a hierarchy of nodes, and each node has a tag, potentially some attributes, and either a value or subnodes. As it turns out, the whole of XML can be encapsulated into a single class which I call Node. The operator += has been overloaded to insert subnodes into a node, and the operator + has been overloaded to string together a group of subnodes.
You create a new Node using its constructor, and you give it a tag name, an optional value, and optionally one or more attributes. Thus, building an XML tree is about as simple as thinking about it -- the syntax doesn't get in the way (IMO).
With XML web-servicey things, the results also come back in XML, so the Node class has a static method (available in Python 2.4) that will take a dom and produce a hierarchy of Node objects. You can select a Node that has a particular tag name by using the "[]" operator (examples below). This only returns the first node with that tag name; if there are more then you must write the code to iterate through the nodes and select the appropriate one(s). But that's not very hard since each node just contains a Python list of other nodes.
You can download the code here (right-click to download). As I get feedback I will make changes. If you import the file, you'll get the single class that uses xml.dom.minidom, and if you run it as a standalone program it will execute all the test/demonstration code which provides examples so you can see how to use it.
Update: (6/19/06) By using and testing this code, I've significantly redesigned it to make it clearer and easier to use. Also note that the design of the class allows it to be easily retargeted to a different underly XML library implementation, if desired.
#!/usr/bin/python2.4
"""
xmlnode.py -- Rapidly assemble XML using minimal coding.
By Bruce Eckel, (c)2006 MindView Inc. www.MindView.net
Permission is granted to use or modify without payment as
long as this copyright notice is retained.
Everything is a Node, and each Node can either have a value
or subnodes. Subnodes can be appended to Nodes using '+=',
and a group of Nodes can be strung together using '+'.
Create a node containing a value by saying
Node("tag", "value")
You can also give attributes to the node in the constructor:
Node("tag", "value", attr1 = "attr1", attr2 = "attr2")
or without a value:
Node("tag", attr1 = "attr1", attr2 = "attr2")
To produce xml from a finished Node n, say n.xml() (for
nicely formatted output) or n.rawxml().
You can read and modify the attributes of an xml Node using
getAttribute(), setAttribute(), or delAttribute().
You can find the value of the first subnode with tag == "tag"
by saying n["tag"]. If there are multiple instances of n["tag"],
this will only find the first one, so you should use node() or
nodeList() to narrow your search down to a Node that only has
one instance of n["tag"] first.
You can replace the value of the first subnode with tag == "tag"
by saying n["tag"] = newValue. The same issues exist as noted
in the above paragraph.
You can find the first node with tag == "tag" by saying
node("tag"). If there are multiple nodes with the same tag
at the same level, use nodeList("tag").
The Node class is also designed to create a kind of "domain
specific language" by subclassing Node to create Node types
specific to your problem domain.
This implementation uses xml.dom.minidom which is available
in the standard Python 2.4 library. However, it can be
retargeted to use other XML libraries without much effort.
"""
from xml.dom.minidom import getDOMImplementation, parseString
import copy, re
class Node(object):
"""
Everything is a Node. The XML is maintained as (very efficient)
Python objects until an XML representation is needed.
"""
def __init__(self, tag, value = None, **attributes):
self.tag = tag.strip()
self.attributes = attributes
self.children = []
self.value = value
if self.value:
self.value = self.value.strip()
def getAttribute(self, name):
"""
Read XML attribute of this node.
"""
return self.attributes[name]
def setAttribute(self, name, item):
"""
Modify XML attribute of this node.
"""
self.attributes[name] = item
def delAttribute(self, name):
"""
Remove XML attribute with this name.
"""
del self.attributes[name]
def node(self, tag):
"""
Recursively find the first subnode with this tag.
"""
if self.tag == tag:
return self
for child in self.children:
result = child.node(tag)
if result:
return result
return False
def nodeList(self, tag):
"""
Produce a list of subnodes with the same tag.
Note:
It only makes sense to do this for the immediate
children of a node. If you went another level down,
the results would be ambiguous, so the user must
choose the node to iterate over.
"""
return [n for n in self.children if n.tag == tag]
def __getitem__(self, tag):
"""
Produce the value of a single subnode using operator[].
Recursively find the first subnode with this tag.
If you want duplicate subnodes with this tag, use
nodeList().
"""
subnode = self.node(tag)
if not subnode:
raise KeyError
return subnode.value
def __setitem__(self, tag, newValue):
"""
Replace the value of the first subnode containing "tag"
with a new value, using operator[].
"""
assert isinstance(newValue, str), "Value " + str(newValue) + " must be a string"
subnode = self.node(tag)
if not subnode:
raise KeyError
subnode.value = newValue
def __iadd__(self, other):
"""
Add child nodes using operator +=
"""
assert isinstance(other, Node), "Tried to += " + str(other)
self.children.append(other)
return self
def __add__(self, other):
"""
Allow operator + to combine children
"""
return self.__iadd__(other)
def __str__(self):
"""
Display this object (for debugging)
"""
result = self.tag + "\n"
for k, v in self.attributes.items():
result += " attribute: %s = %s\n" % (k, v)
if self.value:
result += " value: [%s]" % self.value
return result
# The following are the only methods that rely on the underlying
# Implementation, and thus the only methods that need to change
# in order to retarget to a different underlying implementation.
# A static dom implementation object, used to create elements:
doc = getDOMImplementation().createDocument(None, None, None)
def dom(self):
"""
Lazily create a minidom from the information stored
in this Node object.
"""
element = Node.doc.createElement(self.tag)
for key, val in self.attributes.items():
element.setAttribute(key, val)
if self.value:
assert not self.children, "cannot have value and children: " + str(self)
element.appendChild(Node.doc.createTextNode(self.value))
else:
for child in self.children:
element.appendChild(child.dom()) # Generate children as well
return element
def xml(self, separator = ' '):
return self.dom().toprettyxml(separator)
def rawxml(self):
return self.dom().toxml()
@staticmethod
def create(dom):
"""
Create a Node representation, given either
a string representation of an XML doc, or a dom.
"""
if isinstance(dom, str):
# Strip all extraneous whitespace so that
# text input is handled consistently:
dom = re.sub("\s+", " ", dom)
dom = dom.replace("> ", ">")
dom = dom.replace(" <", "<")
return Node.create(parseString(dom))
if dom.nodeType == dom.DOCUMENT_NODE:
return Node.create(dom.childNodes[0])
if dom.nodeName == "#text":
return
node = Node(dom.nodeName)
if dom.attributes:
for name, val in dom.attributes.items():
node.setAttribute(name, val)
for n in dom.childNodes:
if n.nodeType == n.TEXT_NODE and n.wholeText.strip():
node.value = n.wholeText
else:
subnode = Node.create(n)
if subnode:
node += subnode
return node
Here is the test program for the class, which also demonstrates how to use the various features:
#!python
"""
Test the xmlnode library, and demonstrate how to use it.
"""
from xmlnode import Node
import copy
# The XML we want to create:
request = """\
<RequestHeader>
<AccountNumber>
12345
</AccountNumber>
<MeterNumber>
6789
</MeterNumber>
<CarrierCode>
FDXE
</CarrierCode>
<Service>
STANDARDOVERNIGHT
</Service>
<Packaging>
FEDEXENVELOPE
</Packaging>
</RequestHeader>
"""
# Create a Node from a string:
requestNode = Node.create(request)
# Create a Node from a minidom:
requestNode = Node.create(requestNode.dom())
# Assemble a Node programmatically:
root = Node('RequestHeader')
account = Node('AccountNumber', '12345')
assert account.value == '12345'
root += account # Insert account node as child of root
# Add new child nodes:
root += Node('MeterNumber', '6789')
root += Node('CarrierCode', 'FDXE')
root += Node('Service', 'STANDARDOVERNIGHT')
root += Node('Packaging', 'FEDEXENVELOPE')
assert root.xml() == requestNode.xml()
assert root.xml() == request
# Dom objects are different, not equivalent:
assert root.dom() != requestNode.dom()
# A more succinct approach. The '+' adds child nodes
# to the RequestHeader node.
root2 = Node('RequestHeader') + \
Node('AccountNumber', '12345') + \
Node('MeterNumber', '6789') + \
Node('CarrierCode', 'FDXE') + \
Node('Service', 'STANDARDOVERNIGHT') + \
Node('Packaging', 'FEDEXENVELOPE')
assert root2.xml() == root.xml()
# Reading a value using operator[].
assert root2['AccountNumber'] == '12345'
# Begin creating a DSL for Fedex:
class RequestHeader(Node):
def __init__(self, acct, meter, ccode, serv, packg):
Node.__init__(self, 'RequestHeader')
self += Node('AccountNumber', acct)
self += Node('MeterNumber', meter)
self += Node('CarrierCode', ccode)
self += Node('Service', serv)
self += Node('Packaging', packg)
header = RequestHeader('12345', '6789', 'FDXE',
'STANDARDOVERNIGHT', 'FEDEXENVELOPE')
assert header.xml() == root2.xml()
root3 = Node("Top")
two = Node("two")
two += Node("three", "wompity", x="42")
two += Node("four", "woo")
root3 += two
root3 += Node("five", "rebar", stim="pinch", attr="ouch")
# Conversion to string:
assert str(root3.node("five")) == """\
five
attribute: stim = pinch
attribute: attr = ouch
value: [rebar]"""
assert root3.xml() == """\
<Top>
<two>
<three x="42">
wompity
</three>
<four>
woo
</four>
</two>
<five attr="ouch" stim="pinch">
rebar
</five>
</Top>
"""
# Reassign values:
root3["four"] = "wimpozzle"
# Can easily extract a subtree:
assert root3.node("four").xml() == """\
<four>
wimpozzle
</four>
"""
# Change attribute:
root3.node("five").setAttribute("attr", "argh!")
assert root3.node("five").xml() == """\
<five attr="argh!" stim="pinch">
rebar
</five>
"""
# Remove attribute:
root3.node("five").delAttribute("attr")
assert root3.node("five").xml() == """\
<five stim="pinch">
rebar
</five>
"""
# Create a Node from a minidom:
nn = Node.create(root3.dom())
assert nn.xml() == root3.xml()
# Cannot insert values into anything but leaf nodes:
test = copy.deepcopy(root3)
test["two"] = "Replacement"
try:
print test.xml()
except AssertionError:
pass # Expected
# Handling block text values (imperfect but consistent):
beware = Node.create("""\
<SomeText>
Beware the machines that confuse
And
The men behind them
</SomeText>
""")
assert beware.xml() == """\
<SomeText>
Beware the machines that confuse And The men behind them
</SomeText>
"""
One word: Namespaces. Wouldn't this approach break down when XML with multiple namespaces is processed? Multiple namespaces imply a composition of XML vocabularies which overlays the simple hierachy of the document. Anyway, just a thought. I do enjoy reading your blog, good brain food...
xml.dom.minidom is a sad, sad library. Python has suffered from not enough NIH sometimes, where an API like the DOM gets brought in from elsewhere even though it's not a particularly good example of anything. I thought for a while I'd use it to get some symmetry between Javascript and Python (before I knew the DOM very well) and it was very disappointing... and maybe it's also that the DOM code in Python just isn't of particularly good quality. I don't know. It's abandoned, even if it is in the standard library; a sad state.
py.xml is very pleasantly simple. Why didn't you use that? Alternately, if you want ElementTree but want it to be easier to build XML, you might find formencode.htmlgen to be convenient (http://svn.formencode.org/FormEncode/trunk/formencode/htmlgen.py) -- it's not really round-trip (it uses ElementTree subclasses, and hence parsed XML won't have the extra building features that you'd otherwise get). At this point, I might be more inclined to add ElementTree serialization to py.xml, and get py.xml properly and independently packaged, instead of going further with that; but I still use that module often anyway.
For RPC I'd probably be inclined to use JSON these days, because the object model is slightly better than XML-RPC (includes null), and simpler (e.g., no method name, doesn't force an RPC feel). JSON seems to be gaining some traction over XML-RPC these days, in part because it can piggyback on Ajax into places where XML-RPC is seen as too old and ineligant.
I think you forgot about namespaces. From my experience with XML, they are THE biggest headache about it. Their semantics are far from intuitive. It's not like they're wrong; you just have to understand them really well in order to use them correctly.
You avoid that problem by not including namespace support. This may be good enough for your needs, but it can't be a general purpose XML manipulation library.
My entire experience with XML was using Java. However I think Java's built-in XML APIs are terrible. The best library I found is Elliotte Rusty Harold's XOM - see http://www.xom.nu/. It has an excellent API.
I like XOM too. There's a small introduction to it in Thinking in Java 4e.
And you're right, I'm ignoring XML namespaces because they haven't appeared yet in what I've had to use XML for. Apparently py.xml does handle namespaces.
I might as well toss my hat into the ring. I've created a Python framework to simplify parsing and generating XML files that match a template. It converts XML to/from Python objects. I haven't *quite* finished the generation features, but I probably should soon.
You would need to modify the Python language itself for full support, but in general the E4X model of XML manipulation is one of the best I've worked with.
It was generated by the following Python which uses ElementTree and shows namespaces being used. It also illustrates combining Python strengths with XPath.
The XML is a small but complex example, to give you an idea here's the first element showing how many (standardised) namespaces are being used!
This is typical of the kind of complex XML used to represent complex data like assay results of samples taken from boreholes, complete with spatial coordinates and attribution to various parties.
""" adx2dot """ import pydot #import pydotNoParser as pydot # Andy's hack so you don't need pyparsing installed # import pdis.xpath # not using at present, just using elementree xpath support from elementtree import ElementTree as ET
def buildActivitiesDict(tree): """ builds dict of all <gml:definitionMember><adx:SpecimenPreparationActivity> This is the sort of thing you'd use an xsl:key for in XSLT. """ activities = tree.findall( ''.join(( '//', gml, 'definitionMember/', adx, 'SpecimenPreparationActivity' )) ) d = {} for a in activities: d[ a.attrib['{http://www.opengis.net/gml}id'] ] = a # index elements by their id return d
def graphSeqDefsCrossLinking(tree, actsD): """ returns a graph you can dump with .to_string() or graph with write_png actsD should be a dict created with buildActivitiesDict
The nodes are shown """ graph = pydot.Dot(size='5.5,7') seqsPath = ''.join( ('//', gml, 'definitionMember/', om, 'ProcedureSequence') ) seqs = tree.findall( seqsPath) for seq in seqs: seqName = seq.find(gml+'name').text graph.add_node( pydot.Node(seqName) ) edgeFrom = seqName for step in seq.findall(om+'step'): activityDesc = actsD[ step.attrib[xlink+'href'][1:] ] # assume # in href[0:0], eg: #rec1 edgeTo = activityDesc.find(gml+'name').text graph.add_edge( pydot.Edge(edgeFrom, edgeTo) ) return graph
def graphSeqDefs(tree, actsD): """ returns a graph you can dump with .to_string() or graph with write_png actsD should be a dict created with buildActivitiesDict """ graph = pydot.Dot(size='5.5,7') seqsPath = ''.join( ('//', gml, 'definitionMember/', om, 'ProcedureSequence') ) seqs = tree.findall( seqsPath) uniqueStepNumber = 0 for seq in seqs: seqName = seq.find(gml+'name').text seqNode = pydot.Node(seqName) seqNode.shape = "box" graph.add_node( seqNode ) edgeFrom = seqName for step in seq.findall(om+'step'): activityDesc = actsD[ step.attrib[xlink+'href'][1:] ] # assume # in href[0:0], eg: #rec1 edgeTo = 'step'+str(uniqueStepNumber) uniqueStepNumber += 1 destNode = pydot.Node(edgeTo) destNode.label = activityDesc.find(gml+'name').text # unique name, possibly similar label graph.add_node( destNode ) graph.add_edge( pydot.Edge(edgeFrom, edgeTo) ) edgeFrom = edgeTo return graph
Have you seen the XML builder libraries in groovy and ruby? With the power of closures and the ability to write a method (method_missing in ruby) that responds to any message, those languages really give you the ability to create a library that is very succinct to use.
I wrote a Python library called "xe". xe is designed to make it very easy to work with structured XML.
Most XML libraries are designed to work with arbitrary XML, so it's a lot of work to do even the simplest thing. xe can handle arbitrary XML, but it's really not what it was designed for.
I've announced this a few times on comp.lang.python and no one seemed excited by it, but I think this is really cool stuff.
Using xe, I have written several libararies for working with syndication feeds. Take a look at the source for my RSS or Atom libraries to see just how little code they contain and how tidy it is.
Currently xe is pretty stable, but there are two changes I intend to make: first, I need to make it automatically convert character entities into characters ("&" should become "&") one layer deep, and extract CDATA sections; second, I want to make it handle XML name spaces. I have been very busy lately and haven't been working on this, but I should have more time to work on it very soon.
If you have any questions, just send email to one of the addresses from one of the above web pages. And of course I'll accept patches if anyone ever offers some.
I went ahead and coded up an example of xe, using the unit test from the blog post. Note that xe pretty-prints the tags somewhat differently from the way Bruce Eckel's code does it; xe does give fine-grained control over how the pretty-printing works (see the xe.TFC class for details).
import xe
xe.set_indent_str(" ")
root = xe.NestElement("RequestHeader")
# binding names inside NestElement builds the tree implicitly
# .attrs is a dictionary of attribute names; by default attrs are strings
# You can subclass the attrs to make custom attributes with checked # values, or of another type. See opml.py for a class that has a # timestamp in the attributes.
I didn't imagine it being quite that clean when I looked at the docs, but that's pretty compelling. I still like my system (it's my C++ roots; I really like elegant applications of operator overloading), but if I started running into more complex problems -- primarily XML namespaces, I imagine -- then I'll probably use py.xml. So far, all I've had to do with XML is things like this, and I haven't yet run into any namespaces issues.
Small rant off topic: I don't understand why the py.test approach isn't being followed in the unit testing stuff that's been incorporated into the standard Python library. It seems like people are just blindly following the JUnit model, which was poorly designed to begin with. In general I really like the simple thinking that the "py." guy has.
Flat View: This topic has 29 replies
on 2 pages
[
12
|
»
]