An Introduction to XML Data Binding in C++

by Boris Kolpackov

May 4, 2007

Summary

XML processing has become a common task that many C++ application developers have to deal with. Using low-level XML access APIs such as DOM and SAX is tedious and error-prone, especially for large XML vocabularies. XML Data Binding is a new alternative which automates much of the task by presenting the information stored in XML as a statically-typed, vocabulary-specific object model. This article introduces XML Data Binding and shows how it can simplify XML processing in C++.

Introduction

A typical C++ application that has to manipulate the data stored in XML format uses one of the two common XML access APIs: Document Object Model (DOM) or Simple API for XML (SAX). DOM represents XML as a tree-like data structure which can be navigated and examined by the application. SAX is an event-driven XML processing API. The application registers its interest in certain events, such as start element tag, attribute, or text, which are then triggered during the parsing of an XML instance. While DOM has to read the whole document into memory before the application can examine the data, SAX delivers the data as parsing progresses.

Anyone who has had to handle a large XML vocabulary using DOM or SAX can attest that the task is hardly enjoyable. After all, both DOM and SAX are raw representations of the XML structure, operating in terms of generic elements, attributes, and text. An application developer often has to write a substantial amount of bridging code that identifies and transforms pieces of information encoded in XML to a representation more suitable for consumption by the application logic. Consider, for example, a simple XML document that describes a person:

<person>
  <name>John Doe</name>
  <gender>male</gender>
  <age>32</age>
</person>

If we wanted to make sure the person's age is greater than some predefined limit, with both DOM and SAX we would first have to find the age element and then parse the string representation of 32 to obtain the integer value that can be compared. Another significant drawback of generic APIs is string-based flow control. In the example above, when we search for the age element we pass the element name as a string. If we misspell it, we (or a user of our program) will most likely only discover this bug at runtime. String-based flow control also reduces code readability and maintainability. Furthermore, generic APIs lack type safety because all the information is represented as text. For example, we can compare the content of the gender element to an invalid value without any warning from the compiler:

DOMElement* gender = ...

if (gender->getTextContent () == "man")
{
  ...
}

In recent years a new approach to XML processing, called XML Data Binding, has emerged thanks to the progress in XML vocabulary specification languages (XML schemas). The main idea of XML Data Binding is to skip the raw representation of XML and instead deliver the data in an object-oriented representation that models a particular vocabulary. As a result, the application developer does not have to produce the bridging code anymore because the object model can be used directly in the implementation of the application logic. In the example above, instead of searching for the age element and then manually converting the text to an integer, we would simply call the age() function on the person object that already returns the age as an integer. The name XML Data Binding comes from the observation that the object representation is essentially bound to and becomes a proxy for the data stored in XML.
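To make the contrast with the string comparison shown earlier concrete, here is a minimal hand-written sketch of what a typed representation buys us. The names gender, parse_gender, and is_male are hypothetical and are not produced by any particular tool; the sketch only assumes that the element text is converted to a typed value once, at the XML boundary.

```cpp
#include <cassert>
#include <string>

// Hypothetical hand-written stand-in for a type that a data binding
// compiler would derive from the vocabulary.
enum class gender {male, female};

// Converts element text to the typed value once, at the XML boundary.
// Returns false if the text is not one of the allowed values.
bool parse_gender (const std::string& text, gender& result)
{
  if (text == "male")
    result = gender::male;
  else if (text == "female")
    result = gender::female;
  else
    return false;

  return true;
}

// The application logic works with the typed value only. Writing
// g == "man" here would not compile because a gender value cannot
// be compared to a string literal.
bool is_male (gender g)
{
  return g == gender::male;
}

// Usage example.
void example ()
{
  gender g = gender::female;
  assert (parse_gender ("male", g) && is_male (g));
  assert (!parse_gender ("man", g)); // Invalid text is caught at the boundary.
}
```

With this arrangement an invalid value can only enter the program in one place, the conversion function, instead of being silently compared against throughout the application logic.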

The vocabulary-specific object representation along with other support code such as parsing and serialization functions are generated by a data binding compiler from an XML schema. A schema is a formal specification of a vocabulary that defines the names of elements and attributes, their content, and the structural relationship between them. The majority of XML Data Binding tools use the W3C XML Schema specification language due to its object-oriented approach to the vocabulary description as well as its widespread use. The following fragment describes the XML vocabulary presented above using W3C XML Schema:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:simpleType name="gender_t">
    <xs:restriction base="xs:string">
      <xs:enumeration value="male"/>
      <xs:enumeration value="female"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:complexType name="person_t">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="gender" type="gender_t"/>
      <xs:element name="age" type="xs:short"/>
    </xs:sequence>
  </xs:complexType>

  <xs:element name="person" type="person_t"/>

</xs:schema>

Even if you are not familiar with XML Schema, it should be fairly straightforward to figure out what is going on here. The gender_t type is an enumeration with the only valid string values being "male" and "female". The person_t type is defined as a sequence of the nested name, gender, and age elements. Note that the term sequence in XML Schema means that elements should appear in a particular order as opposed to appearing multiple times. Finally, the globally-defined person element prescribes the root element for our vocabulary. For an easily-approachable introduction to XML Schema refer to XML Schema Part 0: Primer.
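The meaning of sequence can be illustrated with a small hand-written check. The function below is hypothetical and is not something a data binding compiler generates in this form. It takes the names of the child elements of a person element in document order and accepts them only if they are exactly name, gender, and age, in that order, each appearing once (in XML Schema an element in a sequence appears exactly once unless minOccurs or maxOccurs say otherwise).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: checks the child element names of a person
// element against the xs:sequence declared in person_t.
bool matches_person_sequence (const std::vector<std::string>& children)
{
  static const char* const expected[] = {"name", "gender", "age"};
  const std::size_t n = sizeof (expected) / sizeof (expected[0]);

  if (children.size () != n)
    return false;

  for (std::size_t i = 0; i != n; ++i)
    if (children[i] != expected[i])
      return false;

  return true;
}

// Usage example.
void example ()
{
  std::vector<std::string> c;
  c.push_back ("name");
  c.push_back ("gender");
  c.push_back ("age");
  assert (matches_person_sequence (c));

  c[0].swap (c[2]); // age, gender, name: same elements, wrong order.
  assert (!matches_person_sequence (c));
}
```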

Similar to the direct XML representation APIs, XML Data Binding supports both in-memory and event-driven programming models. In the next sections we will examine the complexity of performing common XML processing tasks using DOM and SAX compared to in-memory and event-driven XML Data Binding. The DOM and SAX examples in this article are based on the Apache Xerces-C++ open-source XML parser with character conversions omitted to keep the code focused. The XML Data Binding examples are based on the CodeSynthesis XSD open-source XML Schema to C++ data binding compiler.

In-Memory XML Data Binding

Based on an XML schema, a data binding compiler generates C++ classes that represent the given vocabulary as a tree-like in-memory data structure as well as parsing and serialization functions. The parsing functions are responsible for creating the in-memory representation from an XML instance while the serialization functions save the in-memory representation back to XML. For the schema presented in the introduction, a data binding compiler could generate the following code:

class gender_t
{
public:
  enum value {male, female};

  gender_t (value);
  operator value () const;

private:
  ...
};

class person_t
{
public:
  person_t (const string& name,
            gender_t gender,
            short age);

  // name
  //
  string& name ();
  const string& name () const;
  void name (const string&);

  // gender
  //
  gender_t& gender ();
  gender_t gender () const;
  void gender (gender_t);

  // age
  //
  short& age ();
  short age () const;
  void age (short);

private:
  ...
};

std::auto_ptr<person_t> person (std::istream&);
void person (std::ostream&, const person_t&);

From studying the generated code and XML schema declarations, it becomes clear that the compiler maps schema type declarations to C++ classes, local elements to a set of accessors and modifiers, and global elements to a pair of parsing and serialization functions.
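The private parts of the classes above are elided. To get a feel for the mapping, the following hand-written approximation fills them in with the simplest data members that would satisfy the public interface. This is an assumption made for illustration only: the code actually generated by a data binding compiler will differ, and the parsing and serialization functions are not sketched here.

```cpp
#include <cassert>
#include <string>

// Hand-written approximation of gender_t. The data member is an
// assumption; the generated implementation is not shown in the article.
class gender_t
{
public:
  enum value {male, female};

  gender_t (value v): v_ (v) {}
  operator value () const {return v_;}

private:
  value v_;
};

// Hand-written approximation of person_t: one data member per local
// element, each with an accessor/modifier set.
class person_t
{
public:
  person_t (const std::string& name, gender_t gender, short age)
    : name_ (name), gender_ (gender), age_ (age) {}

  std::string& name () {return name_;}
  const std::string& name () const {return name_;}
  void name (const std::string& n) {name_ = n;}

  gender_t& gender () {return gender_;}
  gender_t gender () const {return gender_;}
  void gender (gender_t g) {gender_ = g;}

  short& age () {return age_;}
  short age () const {return age_;}
  void age (short a) {age_ = a;}

private:
  std::string name_;
  gender_t gender_;
  short age_;
};

// Usage example.
void example ()
{
  person_t p ("John Doe", gender_t::male, 32);
  p.name ("John Smith");
  p.age ()++;

  assert (p.name () == "John Smith");
  assert (p.age () == 33);
  assert (p.gender () == gender_t::male);
}
```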

In the remainder of this section we will look into performing three common XML processing tasks using DOM and XML Data Binding. These tasks are accessing the data stored in XML, modifying the existing data, and creating new data from scratch. Based on this exercise we will evaluate the advantages of using XML Data Binding over DOM.

The following code uses XML Data Binding to read in an XML file with a person record and print the person's name if the age is greater than 30. Error handling in this and subsequent code fragments is omitted for brevity.

ifstream ifs ("person.xml");
auto_ptr<person_t> p = person (ifs);

if (p->age () > 30)
  cerr << p->name () << endl;

The example above is concise and to the point. Once the in-memory representation is created from the XML instance, the code has no traces of XML and looks natural, as if working with a hand-written object model. Note also that the XML data presented by the generated C++ classes is statically typed. The following code fragment performs the same task using DOM:

ifstream ifs ("person.xml");
DOMDocument* doc = read_dom (ifs);
DOMElement* p = doc->getDocumentElement ();

string name;
short age = 0;

for (DOMNode* n = p->getFirstChild ();
     n != 0;
     n = n->getNextSibling ())
{
  if (n->getNodeType () != DOMNode::ELEMENT_NODE)
    continue;

  string el_name = n->getNodeName ();
  DOMNode* text = n->getFirstChild ();

  if (el_name == "name")
  {
    name = text->getNodeValue ();
  }
  else if (el_name == "age")
  {
    istringstream iss (text->getNodeValue ());
    iss >> age;
  }
}

if (age > 30)
  cerr << name << endl;

doc->release ();

The DOM version, besides being more complex, is also less safe because of the use of strings to identify elements. We could easily misspell one of them without any warning from the compiler. In the XML Data Binding version, misspelling a function name which identifies an element would lead to a compile error. Also note that the code in the DOM version is conceptually split into two parts. The first part extracts the data from the raw representation of XML provided by DOM. The second, much smaller part, implements the application logic and is essentially the same as the XML Data Binding version.

The following code snippet increments the age and changes the name using XML Data Binding:

ifstream ifs ("person.xml");
auto_ptr<person_t> p = person (ifs);
ifs.close ();

p->name ("John Smith");
p->age ()++;

ofstream ofs ("person.xml");
person (ofs, *p);

The DOM version that performs the same task is presented next.

ifstream ifs ("person.xml");
DOMDocument* doc = read_dom (ifs);
ifs.close ();
DOMElement* p = doc->getDocumentElement ();

for (DOMNode* n = p->getFirstChild ();
     n != 0;
     n = n->getNextSibling ())
{
  if (n->getNodeType () != DOMNode::ELEMENT_NODE)
    continue;

  string el_name = n->getNodeName ();
  DOMNode* text = n->getFirstChild ();

  if (el_name == "name")
  {
    text->setNodeValue ("John Smith");
  }
  else if (el_name == "age")
  {
    short age = 0;
    istringstream iss (text->getNodeValue ());
    iss >> age;
    age++;
    ostringstream oss;
    oss << age;
    text->setNodeValue (oss.str ());
  }
}

ofstream ofs ("person.xml");
write_dom (ofs, doc);
doc->release ();

Again the DOM version suffers from extra complexity compared to XML Data Binding. In this case the DOM navigation and data conversion code is intermixed with the application logic implementation which further reduces readability and maintainability.
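The text-to-number conversions that clutter the DOM code above are typical of the bridging code mentioned in the introduction. If we had to stay with DOM, we would probably factor them out into helpers along the following lines. The function names are hypothetical, the element text is assumed to be available as a std::string already, and unlike the fragments above the sketch reports conversion failures.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: converts element text to a short. Fails if the
// text is not a number or if anything other than whitespace follows it.
bool text_to_short (const std::string& text, short& value)
{
  std::istringstream iss (text);
  short v = 0;
  char extra;

  if (!(iss >> v) || (iss >> extra))
    return false;

  value = v;
  return true;
}

// Hypothetical helper: converts a short back to its text form.
std::string short_to_text (short value)
{
  std::ostringstream oss;
  oss << value;
  return oss.str ();
}

// The "increment the age" task expressed on the text representation,
// which is all that a generic API gives us.
bool increment_age (std::string& text)
{
  short age = 0;

  if (!text_to_short (text, age))
    return false;

  ++age;
  text = short_to_text (age);
  return true;
}

// Usage example.
void example ()
{
  std::string text ("32");
  assert (increment_age (text) && text == "33");

  std::string bad ("thirty-two");
  assert (!increment_age (bad) && bad == "thirty-two");
}
```

Even with such helpers, every numeric element costs two conversions and a failure check in hand-written code, whereas the XML Data Binding version reduces the same task to p->age ()++.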

The final task that we will consider consists of the creation of a new person record from scratch. The XML Data Binding version is presented below:

person_t p ("John Doe", gender_t::male, 32);

ofstream ofs ("person.xml");
person (ofs, p);

The equivalent DOM version is shown below. Note that a more realistic example would require extra conversions for the gender and age values which are hard-coded as strings in this example.

DOMDocument* doc = create_dom ("person");
DOMElement* p = doc->getDocumentElement ();

DOMElement* e = doc->createElement ("name");
DOMText* t = doc->createTextNode ("John Doe");
e->appendChild (t);
p->appendChild (e);

e = doc->createElement ("gender");
t = doc->createTextNode ("male");
e->appendChild (t);
p->appendChild (e);

e = doc->createElement ("age");
t = doc->createTextNode ("32");
e->appendChild (t);
p->appendChild (e);

ofstream ofs ("person.xml");
write_dom (ofs, doc);
doc->release ();

The examples presented in this section show that processing XML with DOM brings in a large amount of accidental complexity that is associated with navigating and converting the data stored in XML as presented by DOM to a format usable by the application. In contrast, the object model provided by XML Data Binding is directly usable in the application logic implementation. The following list summarizes the key advantages of the in-memory XML Data Binding over DOM:

Ease of use. The generated code hides all the complexity associated with parsing and serializing XML. This includes navigating the structure and converting between the text representation and data types suitable for manipulation by the application logic.
Natural representation. The object representation allows us to access the XML data using our domain vocabulary instead of generic elements, attributes, and text.
Concise code. With the object representation the application logic implementation is simpler and thus easier to read and understand.
Safety. The generated object model is statically typed and uses functions instead of strings to access the information. This helps catch programming errors at compile-time rather than at runtime.
Maintainability. Automatic code generation minimizes the effort needed to adapt the application to changes in the document structure. With static typing, the C++ compiler can pin-point the places in the client code that need to be changed.
Compatibility. Sequences of elements are represented in the object model as containers conforming to the standard C++ sequence requirements. This makes it possible to use standard C++ algorithms on the object representation and frees us from learning yet another container interface, as is the case with DOM.
Efficiency. If the application makes repetitive use of the data extracted from XML, then XML Data Binding is more efficient because the navigation is performed using function calls rather than string comparisons and the XML data is extracted only once. Furthermore, the runtime memory usage is reduced due to more efficient data storage (for instance, storing numeric data as integers instead of strings) as well as the static knowledge of cardinality constraints.
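The compatibility point in the list above can be sketched as follows. The person vocabulary used in this article has no repeating elements, so the example assumes a hypothetical extension in which person elements appear multiple times. It uses std::vector as a stand-in for whatever standard-conforming container the generated code would provide, and a simplified person_t with public data members instead of accessors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for a generated class (assumption).
struct person_t
{
  std::string name;
  short age;
};

// Hypothetical: a sequence of person elements as it might appear in
// the object model. std::vector stands in for the generated container.
typedef std::vector<person_t> person_sequence;

// Because the container meets the standard sequence requirements,
// a standard algorithm can be applied to it directly.
std::size_t count_older_than (const person_sequence& s, short limit)
{
  return static_cast<std::size_t> (
    std::count_if (s.begin (), s.end (),
                   [limit] (const person_t& p) {return p.age > limit;}));
}

// Usage example.
void example ()
{
  person_sequence s;
  person_t john = {"John Doe", 32};
  person_t jane = {"Jane Doe", 28};
  s.push_back (john);
  s.push_back (jane);

  assert (count_older_than (s, 30) == 1);
}
```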

Event-Driven XML Data Binding

While the in-memory XML Data Binding and raw XML access APIs such as DOM are relatively easy to use and understand, there are situations where it is not possible or desirable to load the whole document into memory before doing any processing. Examples of such situations include handling XML documents that are too large to fit into memory and performing immediate processing as parts of the document become available (streaming). For applications that are unable to use the in-memory programming model there is event-driven XML Data Binding as well as raw XML access APIs such as SAX, both of which allow the application to process XML as parsing progresses.

Event-driven XML Data Binding consists of parser templates that represent the given vocabulary as a hierarchy of data availability events which are dispatched using the C++ virtual function mechanism. Compared to SAX, event-driven XML Data Binding shields us from the tasks of manual data extraction and event dispatching. For the schema presented in the introduction, a data binding compiler could generate the following parser templates:

template <typename gender_ret_t>
class gender_t: public xml_schema::parser<gender_ret_t>
{
public:
  // Parser hooks.
  //
  virtual void pre ();
  virtual void _characters (const string&);
  virtual gender_ret_t post ();

private:
  ...
};

template <typename person_ret_t,
          typename name_t,
          typename gender_t,
          typename age_t>
class person_t: public xml_schema::parser<person_ret_t>
{
public:
  // Parser hooks.
  //
  virtual void pre ();
  virtual void name (const name_t&) = 0;
  virtual void gender (const gender_t&) = 0;
  virtual void age (const age_t&) = 0;
  virtual person_ret_t post ();

  // Parser construction API.
  //
  void name_parser (xml_schema::parser<name_t>&);
  void gender_parser (xml_schema::parser<gender_t>&);
  void age_parser (xml_schema::parser<age_t>&);

private:
  ...
};

The generated code needs some explaining. Let us start with the person_t class template. The first five virtual member functions are called parser hooks. We override them in our implementation of the parser to do something useful.

The pre() function is an initialization hook. It is called when a new element of type person_t is about to be parsed. We can use this function to allocate a new instance of the resulting type or clear accumulators that are used to gather information during parsing. The default implementation of this parser hook does nothing.

The post() function is a finalization hook. It is called when parsing of the element is complete and the result, whose type is given by the template parameter person_ret_t, should be returned. If person_ret_t is void then the default implementation of this parser hook also does nothing. Otherwise we must override this function in order to return the result value.

The name(), gender(), and age() functions are called when the corresponding elements have been parsed. Their arguments contain the data extracted from XML. The argument types are for us to decide and are paired with the return types of parser implementations that correspond to the types of name, gender, and age elements, respectively.

The last three functions are used to tell the person_t parser which parsers to use to parse the contents of name, gender, and age elements. We will see how to use them shortly.
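The order in which the hooks are called can be imitated without any XML parsing at all. The following hand-written sketch is not the generated code or the runtime: person_hooks is a reduced, non-template stand-in for the person_t parser template (gender and the parser construction API are left out), and dispatch_person is a hypothetical function that plays the role of the runtime for a single person element, with the already extracted values passed in directly.

```cpp
#include <cassert>
#include <string>

// Reduced stand-in for a generated parser template (assumption: void
// result type, name and age only).
class person_hooks
{
public:
  virtual ~person_hooks () {}

  virtual void pre () {}
  virtual void name (const std::string&) = 0;
  virtual void age (const short&) = 0;
  virtual void post () {}
};

// Hypothetical stand-in for the runtime: calls the hooks in the order
// described above for one person element.
void dispatch_person (person_hooks& h, const std::string& name, short age)
{
  h.pre ();       // A new person element is about to be parsed.
  h.name (name);  // The name element has been parsed.
  h.age (age);    // The age element has been parsed.
  h.post ();      // Parsing of the person element is complete.
}

// Implementation that records the names of people older than 30.
class collector: public person_hooks
{
public:
  collector (): age_ (0) {}

  virtual void pre () {name_.clear (); age_ = 0;}
  virtual void name (const std::string& n) {name_ = n;}
  virtual void age (const short& a) {age_ = a;}

  virtual void post ()
  {
    if (age_ > 30)
      log_ += name_ + "\n";
  }

  const std::string& log () const {return log_;}

private:
  std::string name_;
  short age_;
  std::string log_;
};

// Usage example.
void example ()
{
  collector c;
  dispatch_person (c, "John Doe", 32);
  dispatch_person (c, "Jane Doe", 28);
  assert (c.log () == "John Doe\n");
}
```

Note how pre() resets the accumulators so that the same parser object can be reused for the next element of the same type.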

The gender_t parser template has both pre() and post() hooks as well as the _characters() hook which delivers the raw text content of an element or attribute. The following code fragment shows how we can implement these parser templates to do the same task as in the previous section, namely print the person's name if the age is greater than 30:

enum gender {male, female};

class gender_impl: public gender_t<gender>
{
public:
  virtual void pre ()
  {
    gender_.clear ();
  }

  virtual void _characters (const string& s)
  {
    gender_ += s;
  }

  virtual gender post ()
  {
    return gender_ == "male" ? male : female;
  }

private:
  string gender_;
};

class person_impl: public person_t <void,    // return type
                                    string,  // name
                                    gender,  // gender
                                    short>   // age
{
public:
  virtual void name (const string& n)
  {
    name_ = n;
  }

  virtual void gender (const ::gender& g)
  {
    gender_ = g;
  }

  virtual void age (const short& a)
  {
    age_ = a;
  }

  virtual void post ()
  {
    if (age_ > 30)
      cerr << name_ << endl;
  }

private:
  string name_;
  ::gender gender_;
  short age_;
};

Note that the argument type of the gender() function in person_impl matches the return type of the post() function from gender_impl. The following listing puts all the parsers together and parses the XML instance. Note that we use predefined parser implementations for built-in XML Schema types string and short. These come with the data binding compiler runtime.

// Construct the parser.
//
xml_schema::short_ short_p;
xml_schema::string string_p;

gender_impl gender_p;
person_impl person_p;

person_p.name_parser (string_p);
person_p.gender_parser (gender_p);
person_p.age_parser (short_p);

// Parse the XML instance. The second argument to the document's
// constructor is the document's root element name.
//
xml_schema::document<void> doc_p (person_p, "person");
doc_p.parse ("person.xml");

The following code fragment performs the same task using SAX:

enum gender {male, female};

class parser: public DefaultHandler
{
public:
  virtual void startElement (const string& name)
  {
    text_.clear ();
  }

  virtual void characters (const string& s)
  {
    text_ += s;
  }

  virtual void endElement (const string& name)
  {
    if (name == "person")
    {
      if (age_ > 30)
        cerr << name_ << endl;
    }
    else if (name == "name")
    {
      name_ = text_;
    }
    else if (name == "gender")
    {
      gender_ = text_ == "male" ? male : female;
    }
    else if (name == "age")
    {
      istringstream ss (text_);
      ss >> age_;
    }
  }

private:
  string name_;
  gender gender_;
  short age_;

  string text_;
};

The SAX version is complicated by the additional code necessary to keep track of the element currently being parsed. The complexity will further increase for more realistic XML vocabularies because SAX events do not explicitly reflect the document structure. Instead, the application developer has to deduce the relationship between elements, attributes and text from the order of events being triggered. As with DOM, we also had to manually convert the text representation of age to the integer value and identify elements with strings which reduces the ability of the C++ compiler to detect errors. The following list summarizes the key advantages of the event-driven XML Data Binding over SAX:

Ease of use. The generated code hides all the complexity associated with recreating the document structure, maintaining the dispatch state, and converting the data from the text representation to data types suitable for manipulation by the application logic. Parser templates also provide a convenient mechanism for building custom in-memory representations.
Natural representation. The generated parser templates implement parser hooks as virtual functions with names corresponding to elements and attributes in XML. As a result, we process the XML data using our domain vocabulary instead of generic elements, attributes, and text.
Concise code. With a separate parser template for each XML Schema type, the application logic implementation is simpler and thus easier to read and understand.
Safety. The XML data is delivered to parser hooks as statically typed objects. The parser hooks themselves are virtual functions. This helps catch programming errors at compile-time rather than at runtime.
Maintainability. Automatic code generation minimizes the effort needed to adapt the application to changes in the document structure. With static typing, the C++ compiler can pin-point the places in the client code that need to be changed.
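The remark about SAX events not reflecting the document structure deserves a small illustration. Suppose, hypothetically, that the vocabulary grows so that a name element can appear both inside person and inside company. A SAX-style handler then has to remember where it is in the document, for example by maintaining a stack of open elements. The class below is a hand-written sketch with a simplified interface of our own; it is not the Xerces-C++ handler interface, and the people and company elements are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified SAX-style handler that tracks the current element path
// in order to tell /people/person/name from /people/company/name.
class path_tracking_handler
{
public:
  void start_element (const std::string& name)
  {
    path_.push_back (name);
    text_.clear ();
  }

  void characters (const std::string& s)
  {
    text_ += s; // Text can arrive in several chunks.
  }

  void end_element (const std::string&)
  {
    if (path () == "/people/person/name")
      person_names_.push_back (text_);

    path_.pop_back ();
    text_.clear ();
  }

  const std::vector<std::string>& person_names () const
  {
    return person_names_;
  }

private:
  // Builds a string such as "/people/person/name" from the stack.
  std::string path () const
  {
    std::string r;
    for (std::size_t i = 0; i != path_.size (); ++i)
      r += "/" + path_[i];
    return r;
  }

  std::vector<std::string> path_;
  std::string text_;
  std::vector<std::string> person_names_;
};

// Usage example: the events for a person with a name.
void example ()
{
  path_tracking_handler h;
  h.start_element ("people");
  h.start_element ("person");
  h.start_element ("name");
  h.characters ("John ");
  h.characters ("Doe");
  h.end_element ("name");
  h.end_element ("person");
  h.end_element ("people");

  assert (h.person_names ().size () == 1);
  assert (h.person_names ()[0] == "John Doe");
}
```

With event-driven XML Data Binding this bookkeeping is unnecessary because each parser only receives the events for elements of its own type.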

Conclusions

Generic APIs such as DOM and SAX do not preserve the semantics of XML vocabularies and thus are disconnected from the problem domain. There are applications where DOM and SAX are more suitable than the domain-specific XML Data Binding approaches. Examples of such applications include XML databases and editors where XML vocabularies are not known a priori. There are, however, large classes of applications that operate on a predefined XML vocabulary and are more concerned with the data stored in an XML-based format than with the XML syntax or structure. For such applications XML Data Binding can be an easier, safer and more enjoyable way to handle XML.

Resources

The XML Data Binding examples in this article are based on CodeSynthesis XSD open-source XML Schema to C++ data binding compiler, which is available at:
http://www.codesynthesis.com/products/xsd/

XML Schema Part 0: Primer, an easily approachable introduction to XML Schema, is at:
http://www.w3.org/TR/xmlschema-0/

The DOM and SAX examples in this article are based on the Apache Xerces-C++ open-source XML parser, which is available at:
http://xml.apache.org/xerces-c/
