Summary
Relational databases offer a rich toolset to help ensure the integrity of data. Apart from normalized schema, however, few enterprise applications take advantage of those tools and, instead, tend to implement integrity checking inside application layers. If SQL database constraints are useful, why aren't more applications taking advantage of them?
An important trend in data management, especially on the Web, is to use relaxed SQL schemas—or to eschew the principles of relational data design altogether. A recent Artima discussion focused on the "NoSQL" movement, as it were—and the general conclusion seemed to be that relational constraints are very useful, indeed, and that developers should bypass them at their peril.
That's hardly a surprise, given that normalized relational data models came about as a way to help ensure data integrity. Users, in general, like to think of databases as authoritative sources of data: an underlying notion for popular data access methods, such as SQL queries, is that the return values represent a well-defined notion of "truth." Preserving that notion is much harder when relational constraints are relaxed, which puts most of the integrity maintenance burdens on application-level code.
While much of the NoSQL discussion centers on database de-normalization—relaxing SQL normal forms, which can result in data redundancy—modern database management systems, as well as the SQL standard, offer many additional mechanisms to ensure data integrity. In perusing the latest edition of Joe Celko's SQL for Smarties—a highly recommended read—it became clear to me that I'm using but a small portion of SQL's power to ensure data integrity. For instance, I almost never take advantage of SQL's CHECK mechanism when declaring database tables, and I forgo other now-standard SQL facilities that, in effect, keep the database clean.
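For the record, a CHECK declaration is a one-liner at table-creation time. A minimal sketch, using Python's bundled sqlite3 module; the table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        salary INTEGER NOT NULL CHECK (salary BETWEEN 5000 AND 1000000)
    )
""")

conn.execute("INSERT INTO employee VALUES (1, 'Alice', 60000)")  # accepted

try:
    conn.execute("INSERT INTO employee VALUES (2, 'Bob', 100)")  # rejected by CHECK
except sqlite3.IntegrityError as e:
    print("constraint violation:", e)
```

The database refuses the bad row no matter which application layer, script, or ad-hoc console session tried to insert it—exactly the guarantee application-level validation cannot give.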
Although I believe in the value of pushing as much data management as possible into a database, I tend to implement a lot of data integrity-related code in the domain and controller layers. Of course, all that code needs to be tested, too. I recently surveyed a few dozen test suites I wrote using ScalaTest, encompassing several hundred test cases, and noticed that no less than 30% of our test cases had to do with enforcing data integrity constraints at the application layer. An example of that is making sure that API-level preconditions are enforced—checking for the proper handling of null values, for instance. And if a complex object is passed as a parameter to a method, I tend to write tests to make sure the method can gracefully accommodate method parameter objects in unacceptable states.
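To illustrate the kind of API-level precondition those tests cover, here is a hedged sketch in Python; the function name, the dict-based employee object, and the salary bounds are invented for illustration:

```python
def set_salary(employee, salary):
    """Reject null and out-of-range input up front, before any persistence."""
    if employee is None:
        raise ValueError("employee must not be None")
    if salary is None or not 5000 <= salary <= 1000000:
        raise ValueError("salary must be between 5000 and 1000000")
    employee["salary"] = salary
```

Each such guard typically comes with two or three test cases (null, below range, above range), which is how integrity checking ends up dominating a test suite.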
We do all that, even though we also use Hibernate Validator in the entity layer, carefully annotating entity classes with the required validation constraints. Finally, we also write a significant amount of integrity-checking code in the presentation layer—Flex, in our case—to ensure that users submit valid data to the server-side API. And that presentation-layer code has test suites, too. In total, therefore, we encode data integrity in the database, in the entity layer, and in the presentation layer, with significant portions of our test cases devoted to covering integrity-checking code in the controller and presentation layers as well.
That's a lot of code that clearly violates the DRY principle. The closest to non-redundant integrity constraints I've seen was in ActiveRecord, Rails' O/R mapping and persistence layer. ActiveRecord annotates entity-layer classes with integrity constraints, and Rails' controller and presentation layers use that data to, for instance, display nice error messages to the user upon incorrect data entry. However, as with Hibernate Validator, ActiveRecord barely takes advantage of the underlying relational database's expressive power: Constraint-related SQL generated by both frameworks is limited mostly to foreign-key relations and non-null field declarations. The rest of the integrity checks are performed by a sort of validation layer (as part of the entity classes).
I think that in a more ideal situation the database should be the guardian of the data, including constraint enforcement. That can be achieved today by writing lots of hand-coded SQL, which may not be an effective solution because most SQL databases rely on SQL error codes to indicate constraint violations. For instance, setting a constraint so that the salary field can't accept a value less than 5000 or greater than 1000000 will cause a database to return an error code when attempting to insert a record with a salary of 100—but that error code would then have to be translated into some usable value in other layers of the application. The effort involved in that translation pays few dividends, especially since it's not portable across relational database products.
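That translation step might look something like the following hedged sketch, again using sqlite3; the constraint name, the mapping table, and the fallback message are assumptions, and the raw error text varies across database products, which is precisely the portability problem:

```python
import sqlite3

# Illustrative mapping from constraint names to user-facing messages.
FRIENDLY = {
    "salary_range": "Salary must be between 5,000 and 1,000,000.",
}

def translate(err):
    """Map a low-level constraint error to a user-facing message."""
    msg = str(err)
    for name, friendly in FRIENDLY.items():
        if name in msg:
            return friendly
    return "The submitted data is not valid."  # generic fallback

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        id     INTEGER PRIMARY KEY,
        salary INTEGER CONSTRAINT salary_range
                       CHECK (salary BETWEEN 5000 AND 1000000)
    )
""")

try:
    conn.execute("INSERT INTO employee VALUES (1, 100)")
except sqlite3.IntegrityError as err:
    print(translate(err))
```

Every mapping entry is database-specific string matching, which is why the dividends are small: move to another DBMS and the whole table has to be redone.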
Where do you think is the proper place to define data constraints? Do you agree that SQL should provide more standardized error reporting for integrity violations?
I think that people don't use DB integrity constraints because they are not very portable and most projects are obsessed with being able to plug-replace their DBMSes, although projects that actually change their DBMS in midflight are most unusual.
Database constraints: lose them! It is generally good practice to validate all input in a program and not depend on the good behavior of external programs. What if your program gets to run somewhere else and for some reason the database constraints are not properly transferred? Does this mean that your program might crash? Then shame on you! Generally the programming language is the most powerful thing in the house, so if you want expressive power you should use that. Databases are just like files for writing structs to and from. Use the programming language for anything more advanced.
You already mention that the checking performed by database constraints must often be duplicated in the program. This clearly indicates that something is wrong. As long as the very basic integrity criterion that no tables or columns are missing is met, a program should be able to do the right thing with any state it might find in the database.
Database Constraints: Lose them! Call me old fashioned if you like, but when I want to validate that the salary field can only accept a value between 5000 and 1000000 I perform that checking in the application, not the database. This means that I can detect the error as soon as possible, and output the correct error message in the correct place. I only attempt to write to the database AFTER all data has been validated, and any response from the database layer other than 'OK' is treated as a failure. Writing unfiltered data to the database, then attempting to use error codes to identify 'dirty' data that needs to be sent back to the user for reinput, with an appropriate error message (with possible translation into various different languages), just seems like using a sledgehammer to crack a nut. Sure it's possible, but "proper" programmers don't do it that way.
I think it's simpler than you make it: the current generation of programmers forgot how to use SQL constraints because MySQL doesn't have them. Of course, fighting the O/R divide (and being unaware of it) doesn't help. As for using or losing, it depends a lot on the kind of application you're writing, but some days I would give a chunk of my own flesh to have schema-wide assertions in Postgres.
If multiple applications access the same database, you may wish to have the constraints in the database. (Or develop a REST front-end). But I often see costly ETL and replication used to avoid this situation.
Functional requirements may call for saving invalid objects, so that users can finish them later. I am not even sure it would be possible to express this kind of conditional constraint in SQL. Or you could define an unconstrained table just for unfinished objects. But I have often ended up removing SQL constraints to manage this requirement in application code.
Or let's say you have a table person with an Id, a String "name" and a boolean "deleted". You cannot simply put a "unique" constraint on the name and keep a history. There would be solutions, of course, but I am not sure IDEs for SQL would manage the code easily.
And as already stated, it isn't easy to catch a SQL exception to see which business rule has been broken!
So I would put business logic in the application, close to the error messages, and just put "classical" constraints in the database, so I don't have to verify everything in the application.
And think about deployment: people are scared to change database schemas.
> Or let say, you have a table person with an Id, a String "name" and a boolean "deleted". You cannot simply put a "unique" constraint on the name and keep an history.
Yes we can, at least with Postgres:
test=# create table people ( id integer primary key, name varchar, deleted boolean );
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "people_pkey" for table "people"
CREATE TABLE
test=# create unique index partial_idx on people (name) where deleted = false;
CREATE INDEX
test=# insert into people values (1, 'foo', false);
INSERT 0 1
test=# insert into people values (2, 'foo', false);
ERROR:  duplicate key value violates unique constraint "partial_idx"
test=# insert into people values (3, 'bar', false);
INSERT 0 1
test=# insert into people values (4, 'bar', true);
INSERT 0 1
test=# insert into people values (5, 'bar', true);
INSERT 0 1
> And as already stated, it isn't easy to catch SQL exception to see what business rules has been broken !
In my apps, SQL exceptions must not happen. It means a bug has to be fixed. Yes this means duplicating (in the app) what's already declared in the DB -- but at least some validation can happen by querying the information_schema.
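To sketch that catalog-querying idea with Python's sqlite3, which exposes its schema through PRAGMA table_info rather than information_schema; the table and helper names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " nickname TEXT)"
)

# Build {column: required?} from the catalog instead of repeating it in code.
# In PRAGMA table_info rows, index 1 is the column name, index 3 the NOT NULL flag.
required = {row[1]: bool(row[3])
            for row in conn.execute("PRAGMA table_info(person)")}

def missing_required(record):
    """Return the names of required columns the record leaves empty."""
    return [col for col, req in required.items()
            if req and record.get(col) is None]

print(missing_required({"id": 1, "name": None}))  # ['name']
```

The NOT NULL rules are declared once, in the schema; the application reads them back at startup instead of restating them.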
> In my apps, SQL exceptions must not happen. It means a bug has to be fixed.
> Yes this means duplicating (in the app) what's already declared in the DB -- but at least some validation can happen by querying the information_schema.
I think this gets to the heart of the reason. With every database I've worked with, constraint violation error messages are horribly arcane. And that's from the developer's perspective. From an end-user and support perspective, they are unbearable.
For this reason, each constraint ends up needing to be implemented in code. As we all know, this violates DRY and specifically requires that we maintain the constraints in two places.
If I had an easy way to get user/log friendly error messages from the database I would definitely consider using constraints more.
> I think this gets to the heart of the reason. With every database I've worked with, constraint violation error messages are horribly arcane. And that's from the developers perspective. From an end user and support perspective, they are unbearable.
Yes, and the more generic the error messages are, the more out of context and unrelated they are to what the user is doing.
> If I had an easy way to get user/log friendly error messages from the database I would definitely consider using constraints more.
I don't have enough field experience to figure out a taxonomy of the constraints people use, and what could be automated.
But there is a trend towards generating the DB schema by means of declarative constructs on the application side: either via an XML (yuck) definition, or something more flexible like SQLAlchemy's Table objects. In this case a lot of constraints are declared only once, if the library supports them.
With SQLAlchemy, I prefer to keep the freedom of creating the schema in plain SQL and analyze it with the powerful autoload feature. Having such metadata, the application can check some simple column constraints, type domains, foreign keys, and unique indexes.
Say you want a UNIQUE INDEX over table_name (UPPER(column_name)).... declaring it in the application and generating the corresponding SQL is a no-brainer (*), but making sense of the information_schema and automatically applying the constraints is not fun anymore.
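For reference, the generation half of that example really is a no-brainer; a hedged sketch with Python's sqlite3, which also supports unique indexes over expressions (SQLite 3.9+; table and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, login TEXT NOT NULL)")
# Case-insensitive uniqueness via a unique index over an expression.
conn.execute("CREATE UNIQUE INDEX uq_login_ci ON account (UPPER(login))")

conn.execute("INSERT INTO account VALUES (1, 'alice')")
try:
    conn.execute("INSERT INTO account VALUES (2, 'ALICE')")  # same key after UPPER()
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```

Recovering that semantic intent from catalog metadata, in the other direction, is where the fun stops.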
(*) for a bit of fun, the app could have a hierarchy of Constraint objects that both generate the SQL and do the checking on the procedural side
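A minimal sketch of that footnoted idea in Python; the class and method names are invented for illustration:

```python
class Constraint:
    """A constraint that can emit its DDL fragment and check values procedurally."""
    def to_sql(self):
        raise NotImplementedError
    def check(self, value):
        raise NotImplementedError

class Range(Constraint):
    def __init__(self, column, lo, hi):
        self.column, self.lo, self.hi = column, lo, hi
    def to_sql(self):
        # Emits the SQL side of the constraint...
        return f"CHECK ({self.column} BETWEEN {self.lo} AND {self.hi})"
    def check(self, value):
        # ...and performs the identical check on the procedural side.
        return self.lo <= value <= self.hi

salary = Range("salary", 5000, 1000000)
print(salary.to_sql())    # CHECK (salary BETWEEN 5000 AND 1000000)
print(salary.check(100))  # False
```

Both layers are driven from one object, so the constraint is written exactly once.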
Of course complex constraints and their evil twins, triggers, cannot be replicated application-side by automated tools, but they can still be used to expose programming bugs.
There's only one database modeling tool today that does Rails-style generation, and it's Andromeda. Zoho Creator does very rudimentary stuff, but doesn't appear to do the "nice" things you want.
ActiveRecord's biggest flaw is that there is no decoupling between the source text and what gets loaded in memory. As a result, these things get produced on the fly, and also in a dumb way.
> The closest to non-redundant integrity constraints I've seen was in ActiveRecord, Rails' O/R mapping and persistence layer. ActiveRecord annotates entity-layer classes with integrity constraints, and Rails' controller and presentation layers use that data to, for instance, display nice error messages to the user upon incorrect data entry. However, as with Hibernate Validator, ActiveRecord barely takes advantage of the underlying relational database's expressive power: Constraint-related SQL generated by both frameworks is limited mostly to foreign-key relations and non-null field declarations. The rest of the integrity checks are performed by a sort of validation layer (as part of the entity classes).
As an *experienced* authority on this sort of stuff (Model-Driven Architecture weenie raising his hand), I find you've generally got the right idea but are missing the big picture.
If you want to hack on something better than Rails, then there isn't much out there. Oslo is probably the most mature thing. Eclipse has a very fragmented approach by comparison, and it's *integration* that matters: cradle-to-the-grave seamless round-trip engineering.
SQL DBMSes actually don't have enough expressive power. There are all sorts of ways SQL syntax is not composable in the way you might write Haskell code with monads, for example (i.e., try and orElse). Really, SQL Server 2008 with SQLCLR enabled is the only even moderately expressive SQL platform out there. Newer versions of Oracle scale better under a higher number of concurrent users, though, so it's a trade-off. Even to take advantage of SQL Server 2008, you have to know a lot about database tuning, trade-offs in the performance of TSQL stack frames vs. CLR stack frames, etc. There's a lot to know to make something really expressive. Apart from that, there is the basic limitation that SQL, in all dialects, tends to be very poor at specifying constraints across tables (just the sort of logical problem Haskell's approach to STM is good at).
I can sum up your comments by saying I bet you've reached a point in your career where you've got this basic frustration buzzing in your head: I'm smart enough to make it work, I know how to make it work, but I don't want to have to put in the effort to make it work.
> I think that people don't use DB integrity constraints because they are not very portable and most projects are obsessed with being able to plug-replace their DBMSes, although projects that actually change their DBMS in midflight are most unusual.
This is precisely why the integrity constraints don't belong in the database. They belong in the model. This also allows you better control over whether the model is a good model, because in the past 10 years model-checking tools have grown up. Tools like MIT's Alloy static analyzer are the future, and lightweight formal methods are going to start spreading throughout industry. It will take two more generations of programmers for this to happen, though. The average career in high tech is 7 years, so that means a minimum of 14 years wait.
Chris Anderson, the architect on Oslo (along with others who share that title), has said Oslo will take about 12 or so years to build (or something like that). Based on my own experiences, I'd say he's about right.
Database Admin: The SQL schema is the model
John Zabroski: SQL DBMSes actually don't have enough expressive power.
JPA, Hibernate, etc.: model entities are POJOs
Goetz (concurrency): use immutable objects (largely incompatible with POJO)
It seems that with today's technology I am forced into a lot of repetition. In the model I use for computation (in parallel, across all the cores I can get) it is desirable to have lots of immutable objects. This model also needs at least sanity-checking constraints. Then another model to persist the input data and results. Then there is the SQL schema.
Put me in the nay column, as well. Typically, when I get a database error, I roll back the transaction. That means that the client code also has to roll back its state to where it was before the DB threw the error, which can get complicated. I think DB errors should be treated as exceptions. Something bad happened, and now we're trying to recover/minimize the damage. It shouldn't be used as flow control, which is what most data validation is.
> Database Admin: The SQL schema is the model
> John Zabroski: SQL DBMSes actually don't have enough expressive power.
> JPA, Hibernate, etc: model entities are POJOs
> Goetz (concurrency): use immutable objects (largely incompatible with POJO)
>
> It seems that with todays technology I am forced into a lot of repetition. In the model I use for computation (in parallel across all the cores I can get) it is desireable to have lots of immutable objects. This model also needs at least sanity checking constraints. Then another model to persist the input data and results. Then there is the SQL schema.

What we did for Artima's web app many years ago was make a few external DSL code generators. So from a single expression of the data model in our DSL, we get POJO entity classes, Hibernate XML files, DAO classes, and some SQL triggers. Basically there is some duplication of data integrity code in the generated SQL and Java, but we don't have to write it twice. We just write it once in our DSL, push a button, and generate the different layers. You don't need to choose between ensuring data integrity in one layer or the other. You can have both, without duplicating source code, with a little bit of metaprogramming.
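A hedged, toy-scale sketch of that single-source approach in Python; the field-spec format and all names are invented, and a real generator would emit whole classes and trigger scripts rather than two small functions:

```python
# One declarative field definition, from which both a SQL column clause
# and an application-side validator are generated.
FIELDS = [
    {"name": "name",   "type": "TEXT",    "nullable": False},
    {"name": "salary", "type": "INTEGER", "nullable": False,
     "min": 5000, "max": 1000000},
]

def to_column_sql(f):
    """Generate the SQL column clause for one field spec."""
    parts = [f["name"], f["type"]]
    if not f["nullable"]:
        parts.append("NOT NULL")
    if "min" in f:
        parts.append(f'CHECK ({f["name"]} BETWEEN {f["min"]} AND {f["max"]})')
    return " ".join(parts)

def make_validator(fields):
    """Generate an application-layer validator from the same specs."""
    def validate(record):
        errors = []
        for f in fields:
            v = record.get(f["name"])
            if v is None and not f["nullable"]:
                errors.append(f'{f["name"]} is required')
            elif v is not None and "min" in f and not (f["min"] <= v <= f["max"]):
                errors.append(f'{f["name"]} out of range')
        return errors
    return validate

print(to_column_sql(FIELDS[1]))
# salary INTEGER NOT NULL CHECK (salary BETWEEN 5000 AND 1000000)

validate = make_validator(FIELDS)
print(validate({"name": "Alice", "salary": 100}))  # ['salary out of range']
```

The duplication still exists in the generated artifacts, but the source of truth is written exactly once.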