Agile Buzz Forum
Normalization vs Compression

James Robertson

Posted: Jul 9, 2010 12:08 PM

This post originated from an RSS feed registered with Agile Buzz by James Robertson.
Original Post: Normalization vs Compression
Feed Title: Michael Lucas-Smith
Feed URL: http://www.michaellucassmith.com/site.atom
Feed Description: Smalltalk and my misinterpretations of life

Recently I was working on a relational database, juggling the eternal trade-off between normalization and efficiency. There are many books about database normalization, written by people who love writing books about database normalization. I've read a couple myself, and it all seemed reasonable back in university.

But it's the 21st century, and it dawned on me that normalization is the futile act of trying to compress data by hand. If you break your data records into their related component parts to avoid duplicate data, then you're essentially reinventing compression, badly.
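
To make that concrete, here's a minimal sketch in Python (using the standard library's zlib; the table layout is invented for illustration). The customer details repeated across denormalized order rows are exactly the redundancy a compressor removes for free, and exactly the redundancy a separate customers table would remove by hand:

```python
import json
import zlib

# Denormalized orders: every row repeats the customer's name and address.
orders = [
    {"order_id": i,
     "customer_name": "Ada Lovelace",                 # repeated on every row
     "customer_address": "12 Analytical Engine Way",  # repeated on every row
     "item": f"widget-{i}"}
    for i in range(1000)
]

raw = json.dumps(orders).encode("utf-8")
packed = zlib.compress(raw)

# The repeated customer fields cost almost nothing once compressed.
print(len(raw), len(packed))
```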

So let's say we let compression handle our data storage. With smart hashing, you could store just as much data in a "document" format. But how do you identify when two records are actually the same? I suspect that's the job of hashing and indexing. If two people have the same address (or, more interestingly, a similar one), statistical analysis can discover that fact without a programmer having to define the concept of an address as its own table.
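
As a rough sketch of that idea, not any real database's API: a content hash collapses byte-identical records to a single key, while a crude token-overlap score stands in for the kind of statistical similarity that could spot two spellings of the same address. The function names and the similarity measure here are invented for illustration:

```python
import hashlib

def record_key(address: str) -> str:
    # Trivially normalize, then hash: byte-identical addresses share one key.
    return hashlib.sha256(address.strip().lower().encode("utf-8")).hexdigest()

def similarity(a: str, b: str) -> float:
    # Jaccard overlap of word tokens, a toy stand-in for statistical similarity.
    tokens = lambda s: set(s.lower().replace(",", " ").split())
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

a = "12 Analytical Engine Way, London"
b = "12 Analytical Engine Way London"

print(record_key(a) == record_key(b))   # False: not byte-identical, so hashing misses it
print(similarity(a, b))                 # 1.0: token overlap suggests the same address
```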

I've also seen a few impressive search engines that work exactly this way: compressing the data and using hashes to look it up rapidly. That, combined with similarity indexing and automatic data normalization through reduction, seems interesting to me. It could find patterns you'd never think to normalize by hand, yet ultimately make your program more efficient based on how you actually use the data.
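
Here's a toy version of that "compress and hash" approach, with made-up class and method names: documents are stored zlib-compressed and addressed by a content hash, so duplicates are stored once and lookup is a single dictionary access.

```python
import hashlib
import json
import zlib

class CompressedStore:
    def __init__(self):
        self._blobs = {}   # content hash -> compressed bytes

    def put(self, doc: dict) -> str:
        raw = json.dumps(doc, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(raw).hexdigest()
        if key not in self._blobs:          # duplicate documents cost nothing extra
            self._blobs[key] = zlib.compress(raw)
        return key

    def get(self, key: str) -> dict:
        return json.loads(zlib.decompress(self._blobs[key]))

store = CompressedStore()
k = store.put({"name": "Ada Lovelace", "address": "12 Analytical Engine Way"})
print(store.get(k))
```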

You can also use the same technique to apply indexing automatically. It is compression, after all, so finding the kinds of data you're looking for becomes a job for statistical machine learning. Is anyone out there doing this right now? I'd be interested in trying this kind of database instead of a classical relational one.
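
A hypothetical sketch of what automatic indexing might look like, with invented names and thresholds: the store counts which fields queries actually touch and builds a hash index for the hot ones, instead of a schema designer declaring indexes up front.

```python
from collections import Counter, defaultdict

class AutoIndexingStore:
    def __init__(self, index_after=3):
        self.docs = []                      # plain list of documents
        self.query_counts = Counter()       # field -> how often it was queried
        self.indexes = {}                   # field -> {value: [doc positions]}
        self.index_after = index_after      # queries before an index is built

    def add(self, doc: dict):
        pos = len(self.docs)
        self.docs.append(doc)
        for field, idx in self.indexes.items():
            idx.setdefault(doc.get(field), []).append(pos)

    def find(self, field, value):
        self.query_counts[field] += 1
        if field in self.indexes:
            return [self.docs[i] for i in self.indexes[field].get(value, [])]
        if self.query_counts[field] >= self.index_after:
            self._build_index(field)        # this field is hot: index it for next time
        return [d for d in self.docs if d.get(field) == value]

    def _build_index(self, field):
        idx = defaultdict(list)
        for i, d in enumerate(self.docs):
            idx[d.get(field)].append(i)
        self.indexes[field] = dict(idx)
```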

Read: Normalization vs Compression
