Summary
DataDirect released a set of components for Java and .NET that aim to automate the conversion of assorted data sources, such as flat files and EDI messages, to XML. In an interview with Artima, DataDirect product manager Carlo Innocenti talks about the new products, as well as about developers' changing attitude toward XML.
Although by no means a new technology, XML can still generate heated debates among developers. On the one hand, XML is now very widely used: it provides one of the foundational technologies for Ajax, some of the most popular consumer software products have standardized on XML as a document representation format, and XML forms the basis for many popular developer tools, such as Ant.
Perhaps because of its popularity, many developers argue that XML is used too widely, and that in some cases XML pushes aside technologies that would better suit a problem domain.
Regardless of a developer's attitude toward XML, because XML tools are readily available, developers often have to convert various other data formats to XML, and XML to those other formats. For instance, XML can be used as a serialization format for Java objects, or to pass data to third-party clients.
DataDirect this week released a set of XML Converters aimed at simplifying the conversion of a variety of source formats to and from XML.
In an interview with Artima, DataDirect's XML product manager, Carlo Innocenti, noted that:
Through our database connectivity and XQuery products, we’ve been exposed to the sorts of problems our customers were experiencing with XML. A typical use-case is having to deal with heterogeneous data sources—a relational database, XML documents, or even flat files. Whenever these come into the picture, and you start having to integrate with EDI messages, or comma-separated value files, you end up having to write some kind of conversion program each time in order to incorporate those data sources into your system.
We started building conversion tools, which we are now calling XML Converters. These components can stream data that would not otherwise be available to XML natively.
The streaming access to data is what makes our components very unique. It’s fairly trivial to take a big EDI message, or a comma-separated value file, read it, parse it, and then generate an XML representation of that data. But doing all that in memory doesn’t scale and doesn’t perform well. In some cases, we have customers who deal with EDI messages in the eleven or twelve-megabyte range per message. When converted to XML, those messages become quite huge beasts, with 300 or more megabytes of footprint.
The unique aspect of our converters is not just that we support a large variety of formats, but also that we don’t ever have to allocate memory for the whole message we’re processing. Instead, we do the conversion in a streaming fashion. We provide, for instance, StAX events that are then pushed to the downstream processor that can make use of the newly created XML.
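To illustrate the general idea of streaming conversion described above, here is a minimal sketch that turns a comma-separated file into XML one line at a time using the standard StAX writer API (javax.xml.stream). It is not DataDirect's API or implementation. The class name CsvToXmlStream, the element names rows and row, and the simplified CSV rules (a header line whose column names are valid XML element names, no quoted fields or embedded commas) are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical example class; not part of any DataDirect product.
public class CsvToXmlStream {

    // Reads one CSV line at a time and writes the matching XML immediately,
    // so memory use depends on the longest line rather than on the file size.
    public static void convert(Reader csv, Writer xml)
            throws IOException, XMLStreamException {
        BufferedReader in = new BufferedReader(csv);
        XMLStreamWriter out =
                XMLOutputFactory.newInstance().createXMLStreamWriter(xml);

        // Assumption: the first line names the columns, and each name is a
        // valid XML element name.
        String headerLine = in.readLine();
        if (headerLine == null) {
            throw new IOException("empty input: expected a header line");
        }
        String[] columns = headerLine.split(",", -1);

        out.writeStartDocument();
        out.writeStartElement("rows");

        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            // Assumption: no quoted fields, so a plain split is enough.
            String[] values = line.split(",", -1);
            out.writeStartElement("row");
            for (int i = 0; i < columns.length && i < values.length; i++) {
                out.writeStartElement(columns[i].trim());
                out.writeCharacters(values[i]); // escapes &, <, > as needed
                out.writeEndElement();
            }
            out.writeEndElement(); // </row>
        }

        out.writeEndElement(); // </rows>
        out.writeEndDocument();
        out.flush();
    }

    public static void main(String[] args) throws Exception {
        StringWriter xml = new StringWriter();
        convert(new StringReader("id,name\n1,Ann\n2,Bob & Co\n"), xml);
        System.out.println(xml);
    }
}
```

Because each row is written as soon as it is read, the writer could just as well be backed by a file or network stream, and a downstream consumer could begin processing before the input has been fully read.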
Having to deal with 300MB message files piqued our interest, and we asked Innocenti whether XML was perhaps the wrong choice for that sort of application:
In the use-case I was mentioning, XML is not actually used to transmit messages on the wire—EDI would be used for that, for instance. You would convert the messages from the wire format into XML in order to do some processing in your back-end.
One could say that XML is verbose even for that purpose, and I tend to agree with that. But you have to consider all the tools and technologies readily available to process XML. If you chose some other format instead, you’d have to create your own parser, for instance. XML, by contrast, gives you a unified model for your data. Once you have a message in XML, think of all the tools and APIs you can use to access and process that message.
In addition to the maturity of the tools, developer attitudes toward XML have changed a lot in the past decade. When I started working with XML technologies back in 1998, the idea was that XML would become a lingua franca that everyone would use to store data, to exchange messages, and to represent documents.
Clearly, that hasn’t happened, and I don’t think it will happen in the future. In the past few years, XML has become increasingly popular in the document space, and is well-accepted as a way to represent documents. Many popular desktop document editors, for example, are moving to XML-based formats for their file storage and document representation.
By contrast, on the data storage side, XML has completely lost, especially considering the expectations people had. Even though XML is now supported as a native data type by the three major database vendors, only a very small percent of data is stored as XML.
Finally, if you consider the use of XML in messaging, you can see a significant growth in that area. Where a lot of companies were using EDI in the late eighties and early nineties, XML has started to play an increasingly significant role in that space. Many e-commerce companies or systems have, for instance, moved away from EDI to XML.
There are concerns, like the one about verbosity. But that’s relevant only for specific cases where the messages are very large. And we also see other options surfacing, such as JSON. Even inside the XML community, there are attempts to fix this problem in a more native XML way.
Most people now see XML as not something to be adopted, but something that’s here to stay. On the one hand people are not fighting it. On the other hand, they’re also not trying to use XML for anything and everything as the early adopters did.
What is your take on using XML as an intermediate data representation inside an application?
I don't think it makes sense to convert a 12 MB document that contains all the necessary data into a 300MB document that contains the same data just to pass it around. I think the main reason this is done is that the tools for dealing with legacy formats in Java are sorely lacking.
This is the exact reason why I started (shameless plug) the cb2java project on SourceForge. This is a library that deals with application data and COBOL formats. I know this is a little different than EDI in general but it is closely related in my experience. The main advantage of this library over what I have seen available elsewhere is that it uses the COBOL copybook natively to parse application data, i.e. there are no intermediate generation steps and no generated code.
I'm mentioning this here in the hope that someone else who needs to work with this kind of thing will be interested in contributing to this project. It's currently in alpha but I consider it fully functional, just not verified. There are also a number of features, improvements, and tools that I would like to add once the core capabilities are finished. If you are interested, please contact me through the cb2java project: http://sourceforge.net/projects/cb2java
Anyway, back to the point at hand, this library makes it easy to work with the raw flat-file data in a lightweight manner. This means there is no need to convert to another format for no other reason than to move it around the system.
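As a rough illustration of working with flat-file data in place, the sketch below wraps a single fixed-width record and reads fields straight out of it by offset, with no intermediate format. It does not use or represent the cb2java API. The class name FixedWidthRecord and the record layout (a 6-character id, a 10-character name, and an 8-digit balance in cents) are hypothetical and exist only for this example.

```java
// Hypothetical example class; the layout below is invented for illustration.
public class FixedWidthRecord {

    // Assumed layout: id (6 chars), name (10 chars), balance in cents (8 digits).
    private static final int ID_START = 0;
    private static final int ID_LEN = 6;
    private static final int NAME_START = 6;
    private static final int NAME_LEN = 10;
    private static final int BALANCE_START = 16;
    private static final int BALANCE_LEN = 8;
    private static final int RECORD_LEN = 24;

    private final String line;

    public FixedWidthRecord(String line) {
        if (line == null || line.length() < RECORD_LEN) {
            throw new IllegalArgumentException(
                    "record must be at least " + RECORD_LEN + " characters");
        }
        this.line = line;
    }

    public String id() {
        return field(ID_START, ID_LEN);
    }

    public String name() {
        return field(NAME_START, NAME_LEN);
    }

    public long balanceCents() {
        return Long.parseLong(field(BALANCE_START, BALANCE_LEN));
    }

    // Extracts one field on demand and strips the padding around it.
    private String field(int start, int length) {
        return line.substring(start, start + length).trim();
    }

    public static void main(String[] args) {
        FixedWidthRecord r =
                new FixedWidthRecord("000042" + "Ann       " + "00012345");
        System.out.println(r.id() + " " + r.name() + " " + r.balanceCents());
    }
}
```

Fields are extracted only when asked for, so a record can be passed around the system as the original line and inspected wherever it is needed.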