Summary
Enterprise applications increasingly are required to store the full change history of important business objects. While relational databases are often used to store historic data with triggers or application code, Swaminathan Radhakrishnan argues in an OnJava article that storing object history data is the perfect job for a version control system.
Advertisement
At a talk given at USC a few years ago, Jim Gray mused that inexpensive storage might make the SQL delete and update operators obsolete: Instead of updating a database record, a system could just insert a new version of that record, keeping of continuous version history of database objects.
Gray's main motivation in making that remark was not primarily a desire to store a change history of an object just because we can—due to inexpensive storage—but because storing large amounts of data extremely fast might not be feasable with updates and deletes. (The topic of his talk was working with petabyte databases.)
Recently passed stringent laws about business record keeping are even more important motivators than cheap disks for keeping object version histories. Such change histories must preserve not only how an object changed, but also who and when caused an object to change.
A common method of keeping history records is to create a time-stamped version history table along with a main table for a class of database objects. When an update or delete is initiated on the main table, a database trigger can select the old record and insert it into the history table, along with some metadata identifying the user making the change and a change timestamp.
That approach makes retrieving an object's version history as simple as selecting all history records for an object, and displaying the changes occuring to each field between consecutive records.
In a recent OnJava article, Swaminathan Radhakrishnan suggests that storing versioning data is just the task version control systems are designed for: Version control systems not only store differences between objects, but also metadata associated with change that's useful in a business context. Radhakrishnan goes on to demonstrate how to use the Java Subversion client library, JavaSVN, to store version histories of a business object.
Using a version control system can provide a nice benefit when diffing large text fields, such as the field that stores the text of this news post, for instance. By comparing changes to such text fields as Subversion diffs, graphical diff tools, such as fldiff can be used to illustrate to a user the changes occurring to such text.
The biggest drawback for using a version control system for object histories might be performance. One trick in Radhakrishnan article is that he stores business objects as XML, not as Java objects. While this approach lets one diff object versions using JavaSVN's SVNDiffManager, having to serialize Java classes to XML for every database update might yield sluggish overall performance. By contrast, database triggers alleviate the need to serialize object state and, indeed, keep data entirely within the database server's address space during versioning.
What's your preferred method of storing object version histories? And how do you present version changes to users?
Almost every database server already stores changes (called transaction log or write ahead log) for recovery and backup purposes.
The only problem with these log files is that they can only be used to restore an old state of the entire database, not separate tables, rows, or columns.
Hence, we'd only need an SQL API to access the log files, which we need for backup purposes anyways.
This way, we would not only have the fastest versioning system (free), but it would also be maintenance-free (we only need to make sure that all log files are saved).
Sounds like a good idea to me. Oracle has something called "a flashback query". I wonder if those flashback queries read log files? Querying log files that are not loaded into tablespaces and indexed and, in general, not optimized for querying, might be slow. Such an API sounds like a good idea even if it's slow to query. For high performance versioning something else might be needed.
Version control software has a concept of branches, and I don't know how that would work with logs. I don't know if business apps need to take advantage of branching histories. If a business app needed to take advantage of branching histories, then using version control system would appear more attractive.
> Querying log files that are not > loaded into tablespaces and indexed and, in general, not > optimized for querying, might be slow. Yes. If legal obligations are your concern (as suggested in the article), this will not be a problem.
> Such an API sounds > like a good idea even if it's slow to query. For high > performance versioning something else might be needed. Preferrable, the API would be the same. You would just tell the database that the history of all/some tables should be kept in tablesspaces (with indexes). As those history tables could always be restored from the log files, they would be similar to materialized views.
> Version control software has a concept of branches, and I > don't know how that would work with logs. Postgres has a concept of timelines, which are the same as branches (but not as easy to handle). But thats not the point. The point of my original post is that database servers can easily be modified to handle versioning. And this is cleanest (in terms of ACID) and easiest to use solution.
Actually, there is a full product that does exactly this: vAuditServer. More details (datasheet, etc) can be found at www.variaware.com
The previous posters have already identified the main requirements for tracking versioned data: 1. We want access to the old versions of data, not just database logs. 2. We cannot depend on history tables, as the history tables will pollute (destroy?) the original schema. Application logic will become 90% history keeping and 10% business logic. 3. It may be somewhat acceptable for audit trace to be a bit slower than regular business object. 4. Version tracking should be switchable (on/off), and you should be able to add that feature to new business objects that you build. In other words, it should not involve new development for every data object.