Summary
Working with XML is seldom associated with joy, but a language that has native support for XML can make working with XML documents a lot more fun, writes Michael Galpin in a recent IBM developerWorks article.
While many language-agnostic APIs have been developed to process XML documents, working with XML is not a chore many developers enjoy. One reason may be that many XML APIs provide a relatively verbose way to read and modify XML documents.
Several newer languages provide a possibly easier way to navigate XML by supporting XML as a native language construct. EcmaScript 4, for instance, has such native XML support in its E4X extension. Adding XML support to a language, however, carries with it the danger of growing the core elements of the language, adding complexity and making the language harder to master.
Although Scala supports XML as a "built-in" data type, Scala's XML support does not come at the expense of simplicity. Instead, XML processing in Scala takes clever advantage of the language's expression syntax and functional features. A recent article by eBay architect Michael Galpin, Scala and XML, published on IBM's developerWorks Web site, provides a detailed tutorial on XML processing in Scala.
Galpin writes:
XML has become such an important part of technology that it is only natural to use a programming language that has XML support built in to its syntax. That is exactly what you get with Scala... Like most programming languages, Scala gives you multiple options for parsing XML. These are the same basic ones: InfoSet/DOM based representations, push (SAX) or pull (StAX) events, or data-binding similar to Java Architecture for XML Binding (JAXB)... And with Scala's native support for XML, you can use a template-style syntax to insert dynamic data into an XML structure...
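The template-style syntax mentioned in the quote can be sketched roughly as follows. This is a minimal illustration rather than code from Galpin's article: it assumes a Scala version where XML literals are available and the scala-xml module is on the classpath, and the names Greeting and entryFor, as well as the element names, are invented for the example.

```scala
import scala.xml.Elem

object Greeting {
  // An XML literal; the expressions in braces are ordinary Scala values
  // that are inserted as text nodes when the element is built.
  def entryFor(nickname: String, serviceId: String): Elem =
    <entry>
      <user><nickname>{nickname}</nickname></user>
      <service><id>{serviceId}</id></service>
    </entry>
}
```

Calling Greeting.entryFor("alice", "twitter") yields an Elem that can be queried or serialized like any other scala.xml value.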
The article illustrates Scala XML processing with an example that loads an XML document from the network into Scala's native XML type:
object FriendFeed {
  import java.net.{URLConnection, URL}
  import scala.xml._
  def friendFeed():Elem = {
    val url = new URL("http://friendfeed.com/api/feed/public?format=xml&num=100")
    val conn = url.openConnection
    XML.load(conn.getInputStream)
  }
}
This XML document can be accessed in the following manner:
val feedXml = friendFeed
Among the more interesting XML-related features of Scala is that syntax resembling XPath is translated into ordinary Scala method invocations. The expressions are XPath-like rather than valid XPath: Scala uses \ and \\ where XPath uses / and //, because the forward-slash forms already mean division and the start of a comment in Scala. Galpin illustrates this point with an example of searching through an XML document. This example also shows how to use Scala's pattern matching in the course of searching:
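import scala.collection.mutable.Queue

def filterFeed(feed:Elem, feedId:String):Seq[Node] = {
  var results = new Queue[Node]()
  feed\"entry" foreach{(entry) =>
    // Keep the nickname of every entry whose service id matches feedId
    if (search((entry\"service"\"id").last, feedId)){
      results += (entry\"user"\"nickname").last
    }
  }
  // toList so the result is a Seq[Node] on current Scala versions as well
  return results.toList
}
// Matches an <id> element whose only child is a text node equal to Name.
// Name is capitalized so the pattern compares against it instead of binding a new variable.
def search(p:Node, Name:String):Boolean = p match {
  case <id>{Text(Name)}</id> => true
  case _ => false
}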
filterFeed ... takes in an XML element (feed) and an ID of a service. First create a Queue of XML Nodes called results. Queue is parameterized, like a List or Map in Java. Scala uses square brackets to denote generic type, instead of the angle brackets used in Java programming. The line feed\"entry" is an XPath-like expression.
The backslash is actually a method of the class scala.xml.Elem. It returns all of the child-nodes with the given name, that is, all of the entry elements in the feed. This is returned as an instance of the class scala.xml.NodeSeq. This class extends Seq[Node]. Because it is a Seq, it has a foreach method that takes a closure as a parameter.
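As a rough sketch of the backslash methods described above, the same search can also be written without a mutable Queue by treating the NodeSeq as an ordinary sequence. This is not code from the article: the inline feed is invented sample data standing in for the FriendFeed document, the names FeedQueries and nicknamesFor are hypothetical, and the scala-xml module is assumed to be on the classpath.

```scala
import scala.xml.Elem

object FeedQueries {
  // Invented sample data with the same shape as the entries discussed above.
  val feed: Elem =
    <feed>
      <entry><service><id>twitter</id></service><user><nickname>alice</nickname></user></entry>
      <entry><service><id>blog</id></service><user><nickname>bob</nickname></user></entry>
      <entry><service><id>twitter</id></service><user><nickname>carol</nickname></user></entry>
    </feed>

  // `\` selects direct children by label and returns a NodeSeq, which
  // supports the usual sequence operations such as filter and map.
  def nicknamesFor(feed: Elem, serviceId: String): List[String] =
    (feed \ "entry")
      .filter(entry => (entry \ "service" \ "id").text == serviceId)
      .map(entry => (entry \ "user" \ "nickname").text)
      .toList
}
```

The related `\\` method searches all descendants rather than only direct children, so `FeedQueries.feed \\ "nickname"` finds the nested nickname elements while `FeedQueries.feed \ "nickname"` finds none.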
Other examples in the article demonstrate how to alter an XML document using pattern matching, and also discuss the immutable objects used in the Scala XML API.
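One way such an alteration might look, given that the nodes are immutable, is a recursive function that pattern matches on each node and builds a new tree. The sketch below is an assumption about the general technique, not a reproduction of Galpin's examples; Redact and redact are hypothetical names, and scala-xml is assumed to be on the classpath.

```scala
import scala.xml.{Elem, Node, Text}

object Redact {
  // Returns a new tree in which the content of every <nickname> element is
  // replaced. The input tree is left untouched because scala.xml nodes are
  // immutable; Elem.copy creates a modified copy instead.
  def redact(node: Node): Node = node match {
    case e: Elem if e.label == "nickname" => e.copy(child = Seq(Text("hidden")))
    case e: Elem                          => e.copy(child = e.child.map(redact))
    case other                            => other // text, comments, and so on
  }
}
```

Because nothing is modified in place, the original document remains available after the call, which is convenient when both versions are needed.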
What do you think of Scala's XML support? What is your favorite language for working with XML?
We do a lot of work with XML, so last year I got fed up with all the available tools and wrote my own simple XML node library. There are three variants, which address different needs.
The basic variant is just a DOM-like library, except without all the eccentricities of the DOM library. It doesn't handle everything (no namespaces or multiple text block children, for example), but it's much simpler than the DOM library and does everything we need. We just wrap a SAX Parser and use it to produce our own DOM-like tree, but with a much cleaner syntax for searching, modifying, or writing out the tree. Parsing a file is just XMLNode.parseFile(), writing it out is XMLNode.writeToFile(), and manipulating the tree in memory is equally trivial.
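To make the idea of wrapping a SAX parser concrete, here is a minimal sketch in Scala using only the JDK's SAX classes. It is not the commenter's library: SimpleNode, SimpleXml, and parseString are hypothetical names, and the node model is deliberately reduced to a name, attributes, one text block, and children, mirroring the limitations described above.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, InputSource}
import org.xml.sax.helpers.DefaultHandler
import scala.collection.mutable.ArrayBuffer

// Hypothetical simplified node: no namespaces, a single text block per node.
final class SimpleNode(val name: String, val attributes: Map[String, String]) {
  val children = ArrayBuffer.empty[SimpleNode]
  val text = new StringBuilder
}

object SimpleXml {
  def parseString(xml: String): SimpleNode = {
    var root: SimpleNode = null
    var stack: List[SimpleNode] = Nil // currently open elements, innermost first

    val handler = new DefaultHandler {
      override def startElement(uri: String, localName: String,
                                qName: String, attrs: Attributes): Unit = {
        val attrMap = (0 until attrs.getLength)
          .map(i => attrs.getQName(i) -> attrs.getValue(i)).toMap
        val node = new SimpleNode(qName, attrMap)
        stack.headOption match {
          case Some(parent) => parent.children += node
          case None         => root = node
        }
        stack = node :: stack
      }

      // Character data is appended to the innermost open element.
      override def characters(ch: Array[Char], start: Int, length: Int): Unit =
        stack.headOption.foreach(_.text.append(new String(ch, start, length)))

      override def endElement(uri: String, localName: String, qName: String): Unit =
        stack = stack.tail
    }

    SAXParserFactory.newInstance().newSAXParser()
      .parse(new InputSource(new StringReader(xml)), handler)
    root
  }
}
```

The stack of open elements is the only state the handler needs, which is part of why such a wrapper can stay small.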
We've extended that in GScript, our scripting language, to make extensive use of closures, so from GScript you can call methods like findAll() or findFirst() and pass in a closure. We don't yet support XPath because personally I hate XPath and think it's totally incomprehensible; I don't use it often enough for the syntax to stick, so I spend 30 minutes looking things up online to figure out if I need one or two \ characters or if I use [] or @ or whatever else, when instead I could just write a function like:
More verbose, perhaps, but pretty clear and hard to screw up, and if you do screw it up you have some hope of debugging it. I'd rather spend the extra 5 seconds typing than spend 30 minutes looking up XPath references and then trial-and-erroring my way to the correct expression.
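The closure-based search style described here can be illustrated with a small Scala sketch. It is a hypothetical stand-in, not the commenter's GScript code or API: the XMLNode case class below and its findAll and findFirst methods are assumptions modeled only on the method names mentioned above.

```scala
// Hypothetical stand-in for the commenter's XMLNode; not the real GScript API.
final case class XMLNode(
    name: String,
    attributes: Map[String, String] = Map.empty,
    text: String = "",
    children: List[XMLNode] = Nil
) {
  // All descendants, depth-first in document order, that satisfy the predicate.
  def findAll(p: XMLNode => Boolean): List[XMLNode] =
    children.flatMap(c => (if (p(c)) List(c) else Nil) ++ c.findAll(p))

  // The first matching descendant, if there is one.
  def findFirst(p: XMLNode => Boolean): Option[XMLNode] =
    findAll(p).headOption
}

object ClosureSearchDemo {
  val feed = XMLNode("feed", children = List(
    XMLNode("entry", Map("service" -> "twitter"),
      children = List(XMLNode("nickname", text = "alice"))),
    XMLNode("entry", Map("service" -> "blog"),
      children = List(XMLNode("nickname", text = "bob")))
  ))

  // Roughly what the XPath expression //entry[@service='twitter'] would select.
  val twitterEntries: List[XMLNode] =
    feed.findAll(n => n.name == "entry" && n.attributes.get("service").contains("twitter"))
}
```

The predicate is plain code, so a mistake in it shows up as an ordinary compile error or can be stepped through in a debugger.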
The second variant we have has subclasses of the base XMLNode class that declare metadata about the legal set of children, attributes, etc. We then can use those to automatically validate the tree, which turns out to be about 4x faster (using Xerces) than using an XSD and turning on validation in the parser, not to mention less buggy. It also allows us to generate a fully-denormalized XSD and then use inheritance, etc. in our node classes as you normally would, which is much, much easier than trying to deal with inheritance in a hand-coded XSD.
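A rough sketch of metadata-driven validation might look like the following. It simplifies the approach described above: instead of subclasses that declare their own metadata, it keeps the metadata in a map keyed by element name. NodeSpec, TreeNode, and SpecValidator are hypothetical names, and nothing here reflects the commenter's actual classes or the performance figures quoted.

```scala
// Hypothetical metadata: the children and attributes one element type allows.
final case class NodeSpec(
    allowedChildren: Set[String],
    requiredAttributes: Set[String] = Set.empty,
    optionalAttributes: Set[String] = Set.empty
)

final case class TreeNode(
    name: String,
    attributes: Map[String, String] = Map.empty,
    children: List[TreeNode] = Nil
)

object SpecValidator {
  // Walks the tree and returns one message per violation; empty means valid.
  def validate(node: TreeNode, specs: Map[String, NodeSpec], path: String = ""): List[String] = {
    val here = s"$path/${node.name}"
    specs.get(node.name) match {
      case None => List(s"$here: unknown element")
      case Some(spec) =>
        val missing = (spec.requiredAttributes -- node.attributes.keySet)
          .toList.sorted.map(a => s"$here: missing attribute '$a'")
        val unexpected =
          (node.attributes.keySet -- spec.requiredAttributes -- spec.optionalAttributes)
            .toList.sorted.map(a => s"$here: unexpected attribute '$a'")
        val childErrors = node.children.flatMap { c =>
          if (spec.allowedChildren(c.name)) validate(c, specs, here)
          else List(s"$here: unexpected child '${c.name}'")
        }
        missing ++ unexpected ++ childErrors
    }
  }
}
```

Since the rules live in ordinary data structures, the same metadata could in principle also drive schema generation, which is the use the comment goes on to describe.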
The last variant we have is only in GScript and makes use of our open type system to take an XSD and produce a set of strongly-typed nodes. It's kind of like Castor or similar tools, but the codegen happens essentially at runtime, and the resulting nodes have a much cleaner, DOM-like API, i.e. you can get the children that correspond to a particular sequence or choice, but you can also just iterate over all children of the node, treat the attributes like a simple map, etc.