artima - Article Discussion

An Introduction to XML Data Binding in C++

Summary: XML processing has become a common task that many C++ application developers have to deal with. Using low-level XML access APIs such as DOM and SAX is tedious and error-prone, especially for large XML vocabularies. XML Data Binding is a new alternative which automates much of the task by presenting the information stored in XML as a statically-typed, vocabulary-specific object model. This article introduces XML Data Binding and shows how it can simplify XML processing in C++.

20 posts.

Most recent reply: February 11, 2010 9:43 PM by Brian

Bill

Posts: 409 / Nickname: bv / Registered: January 17, 2002 4:28 PM

An Introduction to XML Data Binding in C++

May 4, 2007 4:08 PM

XML processing has become a common task that many C++ application developers have to deal with. This article introduces XML Data Binding and shows how it can simplify XML processing in C++.

http://www.artima.com/cppsource/xml_data_binding.html

What do you think of the techniques presented in this article? What other approaches have you taken in C++ to process XML?

Martin

Posts: 1 / Nickname: kardigen / Registered: October 5, 2006 4:37 AM

Re: An Introduction to XML Data Binding in C++

May 5, 2007 1:29 AM

It's an interesting approach. I've worked with DOM-like (object-oriented, text-based) approaches and I think they are more universal than XML Data Binding, because the XSD is not needed. However, in some cases the presented techniques would be better.

Boris

Posts: 6 / Nickname: boris / Registered: May 10, 2006 8:23 PM

Re: An Introduction to XML Data Binding in C++

May 5, 2007 11:56 AM

> I've worked with DOM-like (object-oriented, text-based) approaches and I think they are more universal than XML Data Binding, because the XSD is not needed.

Working on a large XML vocabulary and not having a formal definition for it is suicidal. Handling large and complex vocabularies is also exactly the situation where one experiences the most pain from raw APIs such as DOM and SAX.
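
To illustrate the kind of pain meant here, a minimal sketch contrasting string-keyed access on a generic tree with access through a vocabulary-specific type. `generic_node`, `person_t`, and the element names are hypothetical; this is neither the DOM API nor the output of any real generator.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a generic, DOM-like node: children are looked
// up by name at run time and every value is a string.
struct generic_node
{
  std::map<std::string, std::string> children;

  const std::string& child (const std::string& name) const
  {
    std::map<std::string, std::string>::const_iterator i = children.find (name);
    if (i == children.end ())
      throw std::runtime_error ("no such element: " + name);
    return i->second;
  }
};

// Hypothetical vocabulary-specific type, of the kind a data binding
// compiler could generate from a schema.
struct person_t
{
  std::string name;
  int age;
};

// Generic access: a typo in "age" or a non-numeric value is only
// detected when this line actually runs.
int age_from_generic (const generic_node& n)
{
  return std::stoi (n.child ("age"));
}

// Typed access: a misspelled member or a wrong type is a compile error.
int age_from_typed (const person_t& p)
{
  return p.age;
}
```

With a handful of elements the difference is small; with hundreds of element names, every string literal in the generic version is a place where a mistake stays hidden until run time.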

Hector

Posts: 2 / Nickname: hector / Registered: May 6, 2007 4:40 AM

Re: An Introduction to XML Data Binding in C++

May 6, 2007 2:28 PM

> What do you think of the techniques presented in this
> article? What other approaches have you taken in C++ to
> process XML?

A small nit: The title should maybe have indicated, "....using Product XYZ"

Otherwise, the technique is similar to what we do, except that we auto-generate a different p-code language used by our application server. The RTE is written in C++. The reasons were basically the same as those cited by the article, with speed being a big influence.

On a semi-related note, it might interest you that James Ward (Adobe) has produced an interesting benchmarking demo outlining the different ways available today for processing huge data sets (like 5000 records).

http://www.jamesward.org/census

--
HLS

Boris

Posts: 6 / Nickname: boris / Registered: May 10, 2006 8:23 PM

Re: An Introduction to XML Data Binding in C++

May 7, 2007 0:01 AM

> A small nit: The title should maybe have indicated, "....using Product XYZ"

The product neutrality issue was considered carefully. The choices were to provide an article without any code examples or to pick a tool and try to show only the basics that are the same or similar across different products. The former choice would have rendered the article pretty much useless so we went with the latter.

> http://www.jamesward.org/census

This page doesn't have any content.

Bjarne

Posts: 48 / Nickname: bjarne / Registered: October 17, 2003 3:32 AM

Re: An Introduction to XML Data Binding in C++

May 4, 2007 9:42 PM

Looks very elegant. How complicated is the generator? Are there any ways to control the style of C++ generated? (e.g. use of containers, int sizes, string types)

Boris

Posts: 6 / Nickname: boris / Registered: May 10, 2006 8:23 PM

Re: An Introduction to XML Data Binding in C++

May 5, 2007 11:33 AM

> How complicated is the generator?

We tried to make the generator as simple as possible but it is still somewhat complex, mainly due to various idiosyncrasies of the XML Schema language. We have a custom semantic graph for XML Schema with a convenient traversal mechanism. The output streams perform automatic indentation of the C++ code being produced. This makes the code in the generator quite transparent. The complexity comes from the difficulty of mapping some of the XML Schema constructs to C++. One notable example is anonymous types. At some point we realized that it is often easier to get rid of such constructs by transforming the graph before code generation than to handle things in the generator. As a result we now have a number of transformations, such as naming of anonymous types and resolving name conflicts, that significantly simplify the generator.
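
A rough sketch of what such a pre-generation transformation could look like. The graph types and the naming scheme (element name plus `_t`, with `_` appended on conflict) are assumptions made up for illustration; they are not the actual XSD compiler internals.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical, greatly simplified schema graph: each element refers to a
// type by index; an anonymous type has an empty name.
struct type_node
{
  std::string name;
};

struct element_node
{
  std::string name;
  std::size_t type; // Index into schema_graph::types.
};

struct schema_graph
{
  std::vector<type_node> types;
  std::vector<element_node> elements;
};

// Transformation pass: give every anonymous type a name derived from the
// element that uses it, appending '_' until the name is unique. After this
// pass a generator can assume that all types are named.
void name_anonymous_types (schema_graph& g)
{
  std::set<std::string> used;
  for (const type_node& t : g.types)
    if (!t.name.empty ())
      used.insert (t.name);

  for (const element_node& e : g.elements)
  {
    type_node& t = g.types[e.type];
    if (!t.name.empty ())
      continue; // Already named, possibly by an earlier element.

    std::string n = e.name + "_t";
    while (used.count (n) != 0)
      n += '_';

    t.name = n;
    used.insert (n);
  }
}
```

The point of doing this as a separate pass is that the code generator itself never has to special-case an unnamed type.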

> Are there any ways to control the style of C++ generated? (e.g. use of containers, int sizes, string types)

There is support for selectively customizing the generated C++ classes, including the mapping of built-in XML Schema types to C++ types (so types like integers, string, etc., can be remapped to custom types). The mechanism is described in the following document:

http://wiki.codesynthesis.com/Tree/Customization_guide

At the moment there is no way to customize the underlying containers but it shouldn't be hard to support.

Roland

Posts: 25 / Nickname: rp123 / Registered: January 7, 2006 9:42 PM

Re: An Introduction to XML Data Binding in C++

May 5, 2007 2:24 AM

XML Data Binding stems from the Java world. The most notable examples are JAXB (https://jaxb.dev.java.net/) and XMLBeans (http://xmlbeans.apache.org/). I used JAXB years ago and can recommend it. You often have a given schema or need to write one for XML validation. Code generation from the schema is simple then and allows for very convenient handling of small to medium XML documents.
BTW, a similar approach can be used to generate code from a database schema.
As for the generated C++ code from XSD, IMO, the auto_ptrs are unnecessary. I'd prefer classic RAII where a parent owns (a tree of) children. Unfortunately XSD is a commercial product (with open source GPL teaser) so the motivation to change the generator is limited.

Boris

Posts: 6 / Nickname: boris / Registered: May 10, 2006 8:23 PM

Re: An Introduction to XML Data Binding in C++

May 5, 2007 0:11 PM

> As for the generated C++ code from XSD, IMO, the auto_ptrs are unnecessary. I'd prefer classic RAII where a parent owns (a tree of) children.

auto_ptr helps you not to write exception-unsafe code, e.g.,

handle_person (person ("p.xml"), can_throw ());

You can also easily strip auto_ptr away with a call to release():

person_t* p = person ("p.xml").release ();
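
A small self-contained sketch of the hazard in the first call. It uses `std::unique_ptr` in place of `auto_ptr` (which was later removed from the language in C++17); `parse_raw`, `parse_owned`, and the instance counter are hypothetical and exist only to make the leak observable.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical object model type that counts live instances so the
// effect of an exception is observable.
struct person_t
{
  static int live;
  person_t () { ++live; }
  ~person_t () { --live; }
};

int person_t::live = 0;

// Hypothetical parsing functions: one returns a raw pointer, the other an
// owning smart pointer.
person_t* parse_raw () { return new person_t; }

std::unique_ptr<person_t> parse_owned ()
{
  return std::unique_ptr<person_t> (new person_t);
}

int can_throw () { throw std::runtime_error ("failed"); }

void handle_raw (person_t* p, int) { delete p; }
void handle_owned (std::unique_ptr<person_t>, int) {}

// Number of person_t objects leaked by the raw-pointer call. The order in
// which the two arguments are evaluated is unspecified, so this is 1 if
// parse_raw() ran before can_throw() threw, and 0 otherwise.
int leaked_with_raw ()
{
  int before = person_t::live;
  try { handle_raw (parse_raw (), can_throw ()); }
  catch (const std::runtime_error&) {}
  return person_t::live - before;
}

// With an owning pointer the temporary is destroyed during stack
// unwinding, so nothing leaks regardless of evaluation order.
int leaked_with_owned ()
{
  int before = person_t::live;
  try { handle_owned (parse_owned (), can_throw ()); }
  catch (const std::runtime_error&) {}
  return person_t::live - before;
}
```

Whether the raw version actually leaks depends on the compiler's argument evaluation order, which is exactly why it is easy to miss in testing.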

James

Posts: 128 / Nickname: watson / Registered: September 7, 2005 3:37 AM

Re: An Introduction to XML Data Binding in C++

May 7, 2007 10:26 AM

> XML Data Binding stems from the Java world. The most
> notable examples are JAXB (https://jaxb.dev.java.net/) and
> XMLBeans (http://xmlbeans.apache.org/). I used JAXB years
> ago and can recommend it. You often have a given schema or
> need to write one for XML validation. Code generation from
> the schema is simple then and allows for very convenient
> handling of small to medium XML documents.

I do not recommend generating code in a static language from schemata. The fact of the matter is that it doesn't really solve anything and creates very brittle and non-reusable code.

Basically you take the XML structure and create code that is coupled to it. Now you will generally need to walk the tree and extract the data for use in other places. This means writing a bunch of code bound tightly to those structures. In a nutshell you've just bound all your code to an XML structure. Any change to the schema will require regeneration of the code (if you use validation) even where the changes are irrelevant. If you have multiple versions of the schema or different schemata mapping to the same canonical data structures, binding will be very difficult at best and infeasible in most cases: more hardcoding results.

A more effective strategy, one that JAXB 2 allows but seems to be rarely used, is to create schemata for your compiled types. This is a lot cleaner because XML allows for the declaration of rich structures that cannot be built in a language like C++ or Java without executable code ('choice' elements, for example) and can easily represent hierarchical field definitions from classes. Once this is done, you map the data from XML documents into these formats using a powerful XML tool such as XPath and convert them into objects. This effectively decouples the code from the XML structures. These schemata change only when the classes change. While it might seem like this just moves the same amount of work to XPath, this kind of thing is trivial using stylesheets.

Roland

Posts: 25 / Nickname: rp123 / Registered: January 7, 2006 9:42 PM

Re: An Introduction to XML Data Binding in C++

May 7, 2007 0:30 PM

> I do not recommend generating code in a static language
> from schemata. The fact of the matter is that it doesn't
> really solve anything and creates very brittle and
> non-reusable code.

At least the generated code is type safe and therefore less brittle than e.g. DOM code.

> Basically you take the XML structure and create code that
> is coupled to it. Now you will generally need to walk the
> tree and extract the data for use in other places. This
> means writing a bunch of code bound tightly to those
> structures. In a nutshell you've just bound all your code
> to a xml structure. Any change to the schema will require
> regeneration of the code (if you use validation) even
> where the changes are irrelevant.

What you describe as disadvantages can also be seen as advantages. It depends on the application. When you need to create an XML message or store data in a structured (XML) format, JAXB certainly is a very convenient option (compared to e.g. DOM). OTOH, if you want to recursively traverse the nodes or transform the XML file then DOM or XSLT may be better suited.

> A more effective strategy, one that JAXB 2 allows but
> seems to be rarely used, is to create schemata for your
> compiled types.

This seems to put the cart before the horse. Moreover, the schema is often given because it's standardized.

James

Posts: 128 / Nickname: watson / Registered: September 7, 2005 3:37 AM

Re: An Introduction to XML Data Binding in C++

May 7, 2007 5:47 PM

> > I do not recommend generating code in a static language
> > from schemata. The fact of the matter is that it doesn't
> > really solve anything and creates very brittle and
> > non-reusable code.
>
> At least the generated code is type safe and therefore
> less brittle than e.g. DOM code.

I guess you can define 'brittle code' any way you like but what I mean by brittle is that insignificant changes cause the code to break. DOM-based code doesn't break when a new element is added to the schema or if an element's type is changed in a minor way, e.g. its length goes from 5 to 6.

In any event this isn't really the worst thing about JAXB code. The worst thing is that all the work to get the required information into useful places is basically hardcoded. I worked with a fairly large JAXB code base that was basically unmaintainable. With DOM you can at least write code that can retrieve elements from similar structures. With JAXB, you can have two schemata with the exact same address element structure but you must write the code to retrieve the elements repeatedly because it creates wholly separate types for them. But I'm not advocating DOM. It's basically a straw man.

> > Basically you take the XML structure and create code that
> > is coupled to it. Now you will generally need to walk the
> > tree and extract the data for use in other places. This
> > means writing a bunch of code bound tightly to those
> > structures. In a nutshell you've just bound all your code
> > to a xml structure. Any change to the schema will require
> > regeneration of the code (if you use validation) even
> > where the changes are irrelevant.
>
> What you describe as disadvantages can also be seen as
> advantages. It depends on the application. When you need
> to create a XML message or store data in structured (XML)
> format JAXB certainly is a very convenient option
> (compared to e.g. DOM).

Convenient in what way? What could be more convenient than just populating Objects with data and using them? Walking trees with Java is a nightmare. The JAXB code I worked with looked like this:

if (greatgrandparent != null) {
   GrandParent grandparent = greatgrandparent.getChild();
 
   if (grandparent != null) {
       Parent parent = grandparent.getChild();
 
       if (parent != null) {
           Child child = parent.getChild();
       }
   }
}

But many more levels deep and over and over again. The only thing that's convenient about it is it allows you to avoid learning to use a proper XML toolset and do everything with Java.

> OTOH, if you want to recursively
> traverse the nodes or transform the XML file then DOM or
> XSLT may be better suited.

XPath and XSLT are good for these things but it has nothing to do with what I am talking about here.

> > A more effective strategy, one that JAXB 2 allows but
> > seems to be rarely used, is to create schemata for your
> > compiled types.
>
> This seems to put the cart before the horse. Moreover, the
> schema is often given because it's standardized.

You are definitely missing the point. The schema generated for the code is only used to map data into objects. The standardized schema doesn't go away.

I spent 2 years banging my head on JAXB generated classes. We got to the point where we'd go out of our way to avoid changing a schema because of all the work that was required to do it. Any slight modification would require generating new classes, writing a bunch of code and touching all kinds of tangential modules.

With the methodology I am advocating, you add any new elements to the classes that use them, regenerate the schemata and use any number of highly efficient XML tools to map the required data into the Object. You can map many different message formats to the same Objects making your code much more reusable. Generating classes from schemata seems like a good idea on the surface but is a fundamentally flawed approach. It would be workable in a dynamic language like Ruby or Python but in a static language it gets you nowhere. It creates a redundant mirror of the XML structures in code violating DRY among other principles of good design.

To make it clearer what I am talking about. We'd have say 6 different standardized schemata for a purchase order that we had to support with new ones added over time. In order to avoid generating 6 sets of classes and writing thousands of lines of Java to place those orders, we created a canonical schema for a purchase order. Then we took this and generated classes from it. Then we had about 1000 or so lines of code to put that canonical data into stable business Objects. Then we had another 1000 or so lines of code to write the data from the usable Java Object back into the JAXB Object. The JAXB classes did nothing for us. In fact they made things harder because walking a tree in JAXB is extremely labor (read: code) intensive. Using the technique I am describing, the data goes from XML straight into the business objects and all the translation is done with the proper tools.

Boris

Posts: 6 / Nickname: boris / Registered: May 10, 2006 8:23 PM

Re: An Introduction to XML Data Binding in C++

May 7, 2007 11:48 PM

> I guess you can define 'brittle code' any way you like but what I mean by brittle is that insignificant changes cause the code to break.

On the other hand, in the data binding approach, the client code that breaks as a result of a change will be flagged by the C++ compiler thanks to static typing. In case of DOM or your XPath-based manual mapping approach, with every change to your XML vocabulary you are left wondering (or guessing) whether the change was insignificant or the code is now silently broken.

> To make it clearer what I am talking about. We'd have say 6 different standardized schemata for a purchase order that we had to support with new ones added over time. In order to avoid generating 6 sets of classes and writing thousands of lines of Java to place those orders, we created a canonical schema for a purchase order. Then we took this and generated classes from it. Then we had about 1000 or so lines of code to put that canonical data into stable business Objects. Then we had another 1000 or so lines of code to write the data from the usable java Object back into the JAXB Object.

There is a much cleaner way to implement this in XSD (I don't know about JAXB). The idea is to define a base type for all purchase orders in XML Schema. This type can be empty or it can contain some common elements/attributes. Then you define your purchase orders as extensions of this base type. When compiling the schema to C++, you customize the base class by adding virtual functions that will constitute the interface to all the purchase orders. Then you customize the concrete purchase orders by implementing those virtual functions. The application code manipulates all purchase orders via the customized base class. This approach is also a lot more efficient than XPath-based remapping.
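
A hand-written sketch of the shape this design takes. The classes below stand in for generated classes after customization; their names, members, and the two formats are hypothetical and are not produced by any tool.

```cpp
#include <memory>
#include <string>
#include <vector>

// Stand-in for the customized base class: the virtual functions form the
// interface that application code uses for every purchase order format.
class purchase_order
{
public:
  virtual ~purchase_order () {}
  virtual std::string buyer () const = 0;
  virtual double total () const = 0;
};

// Stand-in for one concrete vocabulary: a single customer name and a
// stored total.
class po_format_a: public purchase_order
{
public:
  po_format_a (const std::string& customer, double amount)
    : customer_ (customer), amount_ (amount) {}

  virtual std::string buyer () const { return customer_; }
  virtual double total () const { return amount_; }

private:
  std::string customer_;
  double amount_;
};

// Stand-in for a differently structured vocabulary: a split name and
// per-line prices, from which the total is computed.
class po_format_b: public purchase_order
{
public:
  po_format_b (const std::string& first, const std::string& last,
               const std::vector<double>& line_prices)
    : first_ (first), last_ (last), line_prices_ (line_prices) {}

  virtual std::string buyer () const { return first_ + " " + last_; }

  virtual double total () const
  {
    double sum = 0;
    for (double p : line_prices_)
      sum += p;
    return sum;
  }

private:
  std::string first_;
  std::string last_;
  std::vector<double> line_prices_;
};

// Application code sees only the base class.
double grand_total (const std::vector<std::unique_ptr<purchase_order>>& orders)
{
  double sum = 0;
  for (const auto& o : orders)
    sum += o->total ();
  return sum;
}
```

Each format keeps its own structure, and the per-format differences are confined to the implementations of the virtual functions.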

James

Posts: 128 / Nickname: watson / Registered: September 7, 2005 3:37 AM

Re: An Introduction to XML Data Binding in C++

May 8, 2007 8:05 AM

> I guess you can define 'brittle code' any way you like
> but what I mean by brittle is that insignificant changes
> cause the code to break.
>
> On the other hand, in the data binding approach, the
> client code that breaks as a result of a change will be
> flagged by the C++ compiler thanks to static typing. In
> case of DOM or your XPath-based manual mapping approach,
> with every change to your XML vocabulary you are left
> wondering (or guessing) whether the change was
> insignificant or the code is now silently broken.

This is true to some degree. We used thorough testing to make sure this did not happen. Specifically, automatic regression scripts. The reality is that the above is not nearly enough to guarantee correctness and regression testing is needed anyway.

Generally speaking, if you are working with standard schemata, they don't change. You might have a new version of the schema that may or may not require coding changes, coding changes that require remapping or just mapping changes but the standard schemata should be fairly stable. The problem was that we had a canonical schema that would have to change in addition to the code and mapping changes. The approach I am describing removes this canonical schema completely and instead uses the class as the canonical structure.

> To make it clearer what I am talking about. We'd have
> say 6 different standardized schemata for a purchase order
> that we had to support with new ones added over time. In
> order to avoid generating 6 sets of classes and writing
> thousands of lines of Java to place those orders, we
> created a canonical schema for a purchase order. Then we
> took this and generated classes from it. Then we had about
> 1000 or so lines of code to put that canonical data into
> stable business Objects. Then we had another 1000 or so
> lines of code to write the data from the usable java
> Object back into the JAXB Object.
>
> There is a much cleaner way to implement this in XSD (I
> don't know about JAXB). The idea is to define a base type
> for all purchase orders in XML Schema. This type can be
> empty or it can contain some common elements/attributes.
> Then you define your purchase orders as extensions of this
> base type. When compiling the schema to C++, you customize
> the base class by adding virtual functions that will
> constitute the interface to all the purchase orders. Then
> you customize the concrete purchase orders by implementing
> those virtual functions. The application code manipulates
> all purchase orders via the customized base class. This
> approach is also a lot more efficient than XPath-based
> remapping.

I'm not sure I understand what you are describing here but the situation I mean is that you have many different external purchase order schemata. You have a half-dozen to a dozen RosettaNet layouts, you have a number of web service layouts, some custom layouts for important customers and then you might even have non-XML formats such as a number of EDI formats. All the data elements across these formats map into a single true set of valid data elements for a purchase order. The different layouts can be dramatically different. For example, an element that is a child in one could be a parent in another. The names of the elements and their paths are pretty much guaranteed to be different. Mapping the data from all these different formats into a canonical form with XPath is trivial. Trying to do this with JAXB generated classes is not feasible without a lot of code. Perhaps XSD can map wildly different formats into the same class structures but I imagine that would require some sort of mapping syntax which is not unlike the approach I am describing. Also you can use similar techniques to map EDI formats or any other formats to the same classes and keep order processing logic from being duplicated across the system.
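
A minimal sketch of the mapping idea described in this post, written in C++ to match the article. The `document` type is a hypothetical stand-in that keys each leaf value by its path; a real implementation would evaluate XPath expressions against a parsed document instead. All names and paths are invented for illustration.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// The canonical business object that all order processing code uses.
struct purchase_order
{
  std::string buyer;
  std::string sku;
};

// Hypothetical stand-in for a parsed document: leaf values keyed by path.
typedef std::map<std::string, std::string> document;

// For one external format, the path at which each canonical field is
// found. Supporting a new format means adding a mapping, not new classes.
struct mapping
{
  std::string buyer_path;
  std::string sku_path;
};

const std::string& value_at (const document& d, const std::string& path)
{
  document::const_iterator i = d.find (path);
  if (i == d.end ())
    throw std::runtime_error ("missing path: " + path);
  return i->second;
}

// Populate the canonical object from a document using its format's mapping.
purchase_order to_canonical (const document& d, const mapping& m)
{
  purchase_order po;
  po.buyer = value_at (d, m.buyer_path);
  po.sku = value_at (d, m.sku_path);
  return po;
}
```

Two documents with entirely different layouts end up as the same `purchase_order` as long as each has a mapping, which is the decoupling being argued for here.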

Ray

Posts: 2 / Nickname: lisch / Registered: May 7, 2007 3:35 AM

Re: An Introduction to XML Data Binding in C++

May 7, 2007 9:11 AM

> As for the generated C++ code from XSD, IMO, the auto_ptrs
> are unnecessary. I'd prefer classic RAII where a parent
> owns (a tree of) children.

RAII is nice for simple cases, but pointers work better with optional items and substitution groups, just to name two examples.

> Unfortunately XSD is a
> commercial product (with open source GPL teaser) so the
> motivation to change the generator is limited.

Fortunately, XSD is open source (GPL), so we are able to change the generator at will. I've been able to fix bugs without waiting for a vendor, and I've been modifying the code generator to suit our specific needs.

Roland

Posts: 25 / Nickname: rp123 / Registered: January 7, 2006 9:42 PM

Re: An Introduction to XML Data Binding in C++

May 7, 2007 11:52 AM

> RAII is nice for simple cases, but pointers work better
> with optional items and substitution groups, just to name
> two examples.

Pointers and ownership are independent of each other. XML is a hierarchical format, i.e. child nodes have only meaning in reference to (in context of) a parent node (except for the document node). Therefore it's quite 'natural' to let the parent nodes own their child nodes (the parent as 'container' for the child-ren). Of course, parent nodes may give access to their child nodes via pointers or iterators. But the lifetime of the child nodes can, and IMO should, be bound to the lifetime of the parents.
It also amazes me that you (apparently the author of "C++ In a Nutshell") consider RAII only for 'simple cases'. Quite the contrary. Automatic, deterministic resource management (a.k.a. RAII) is the key idiom to reduce complexity in large systems (why else would you still want to use C++ today).

> XSD is open source (GPL), so we are able to
> change the generator at will. I've been able to fix bugs
> without waiting for a vendor, and I've been modifying the
> code generator to suit our specific needs.

Right, but the license terms (http://www.codesynthesis.com/products/xsd/license.xhtml) also make it clear that runtime and generated code may be used freely only for the mentioned FLOSS projects. The authors have of course the right to put their product under any license they deem appropriate.

Boris

Posts: 6 / Nickname: boris / Registered: May 10, 2006 8:23 PM

Re: An Introduction to XML Data Binding in C++

May 8, 2007 0:48 AM

> Pointers and ownership are independent of each other. XML is a hierarchical format, i.e. child nodes have only meaning in reference to (in context of) a parent node (except for the document node). Therefore it's quite 'natural' to let the parent nodes own their child nodes (the parent as 'container' for the child-ren). Of course, parent nodes may give access to their child nodes via pointers or iterators. But the lifetime of the child nodes can, and IMO should, be bound to the lifetime of the parents.

I don't understand what the original issue was then. The document tree is dynamically-allocated and returned as a pointer (wrapped into auto_ptr). Sub-nodes are owned by the tree and are returned as references.
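
The ownership arrangement described here can be sketched as follows. `node` and `make_tree` are hypothetical, and `std::unique_ptr` stands in for `auto_ptr`; this is not the generated object model itself.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical node type: a parent owns its children, so destroying the
// root destroys the whole tree.
class node
{
public:
  explicit node (const std::string& name): name_ (name) {}

  // Create a child owned by this node and return a reference to it.
  node& add_child (const std::string& name)
  {
    children_.push_back (std::unique_ptr<node> (new node (name)));
    return *children_.back ();
  }

  const std::string& name () const { return name_; }
  std::size_t size () const { return children_.size (); }

  // Access is by reference; ownership stays with the parent.
  node& child (std::size_t i)
  {
    assert (i < children_.size ());
    return *children_[i];
  }

private:
  std::string name_;
  std::vector<std::unique_ptr<node>> children_;
};

// Stand-in for a parsing function: the root is dynamically allocated and
// handed to the caller in an owning smart pointer.
std::unique_ptr<node> make_tree ()
{
  std::unique_ptr<node> root (new node ("person"));
  root->add_child ("name");
  root->add_child ("address").add_child ("city");
  return root;
}
```

Only the root is ever held through a pointer by the caller; everything below it is reached by reference and lives exactly as long as its parent.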

Ray

Posts: 2 / Nickname: lisch / Registered: May 7, 2007 3:35 AM

Re: An Introduction to XML Data Binding in C++

May 9, 2007 5:51 AM

Okay, "simple" was the wrong word.

Some situations require pointers. Some don't. If I were implementing the code manually, I would use pointers only where they were necessary. On the other hand, I don't mind that Code Synthesis uses a uniform mechanism for all child elements. I don't expect a tool to generate code that is identical to the code I would write manually.

Substitution groups and xsi:type require polymorphism, and therefore pointers (or references). Optional elements map cleanly to pointers. Some schemas describe cyclic data, and child elements must be pointers.
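
A small sketch of the three cases, assuming hypothetical `shape`, `circle`, and `drawing` types and using `std::unique_ptr` for the pointers. It is meant only to show why by-value members do not work here, not how any particular generator lays out its classes.

```cpp
#include <memory>
#include <string>

// Base type: in XML Schema terms, the declared type of an element whose
// actual type may be a derived one (substitution group or xsi:type).
struct shape
{
  virtual ~shape () {}
  virtual std::string kind () const { return "shape"; }
};

struct circle: shape
{
  virtual std::string kind () const { return "circle"; }
};

struct drawing
{
  // Polymorphic: held by pointer so that a circle stored here is not
  // sliced down to a plain shape.
  std::unique_ptr<shape> outline;

  // Optional: a null pointer represents an absent element.
  std::unique_ptr<std::string> caption;

  // Recursive: a drawing cannot contain another drawing by value.
  std::unique_ptr<drawing> nested;
};

// Describe a drawing, handling absent members explicitly.
std::string describe (const drawing& d)
{
  std::string r = d.outline ? d.outline->kind () : "none";
  if (d.caption)
    r += ": " + *d.caption;
  return r;
}
```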

John

Posts: 2 / Nickname: jtorjo / Registered: September 24, 2007 0:18 AM

Re: An Introduction to XML Data Binding in C++

September 24, 2007 10:05 PM

template <typename gender_ret_t>
class gender_t: public xml_schema::parser<gender_ret_t>
{
public:
  // Parser hooks.
  //
  virtual void pre ();
  virtual void _characters (const string&);
  virtual gender_ret_t post ();

private:
  ...
};

Why does gender_t need to know about this? It would seem more natural to have different classes which actually hold the logic for filtering which data you need. The code would be way more flexible, and simpler to read...

--
http://John.Torjo.com -- C++ expert
... call me only if you want things done right

Pete

Posts: 1 / Nickname: peteco / Registered: November 28, 2008 8:23 PM

Re: An Introduction to XML Data Binding in C++

November 29, 2008 2:43 AM

FYI - We have a similar C++ XML data binding product called Codalogic LMX. You can find out more at http://codalogic.com/lmx/ .

Brian

Posts: 1 / Nickname: aberle / Registered: February 11, 2010 3:05 PM

Re: An Introduction to XML Data Binding in C++

February 11, 2010 9:43 PM

This is an age-old problem. I was the team lead of the development team with the most critical path dependencies on the largest software project in the world during 1999 and 2000 and this very issue was the focus of my work during that time. I am convinced that the wheel was invented by multiple engineers who were unaware that others had already invented it. The same is true of XML Data Binding in C++. I invented it too, and I've been perfecting it for over 10 years on various projects. I have a solution that addresses the issues noted here and some additional issues that repeatedly arise:

1. XML Updates. This is the ability to re-apply a subset of XML into an existing object model. In many cases the XML is bound to indexed objects and we cannot afford to re-index for each update.

2. COM and CORBA interface management. In the same respect that the XML Data Binding can be automated through object oriented practices - so can the instances of interface objects that provide that data to the application layer.

3. State Tracking. The application often needs to distinguish between an empty value <String></String> vs. a missing value - both create an empty string. This provides the validation along with Data Binding.
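
Items 1 and 3 can be illustrated together with a short sketch. `tracked_string`, `customer`, and `apply_update` are hypothetical names chosen for this example; they are not the API of the library linked below.

```cpp
#include <string>

// Hypothetical wrapper that records whether a value was present in the
// XML, so an empty element and a missing element can be told apart.
class tracked_string
{
public:
  tracked_string (): present_ (false) {}

  void set (const std::string& v) { value_ = v; present_ = true; }
  bool present () const { return present_; }
  const std::string& value () const { return value_; }

private:
  std::string value_;
  bool present_;
};

struct customer
{
  tracked_string name;
  tracked_string phone;
};

// Re-apply a partial update to an existing object: only members that were
// present in the update overwrite the existing state.
void apply_update (customer& target, const customer& update)
{
  if (update.name.present ())
    target.name.set (update.name.value ());
  if (update.phone.present ())
    target.phone.set (update.phone.value ());
}
```

An update carrying an empty phone element clears the phone number, while an update that omits the element leaves it untouched, which a plain `std::string` member could not distinguish.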

The source code uses the least restrictive license, less restrictive than the GPL. The project is supported and managed from here:

http://www.codeproject.com/KB/XML/XMLFoundation.aspx

Now that it's the year 2010, I believe that nobody else will attempt to reinvent the wheel because there are a few to choose from. IMHO - this wheel is the most polished and well rounded implementation available.

Enjoy.