Summary
In a recent set of articles, Don Haderle and Michael Stonebraker review a bit of database history and point the way to column-oriented databases.
Current databases are mostly organized around rows, a design that originated with the constraints present when the first relational databases were implemented.
Among the architects of one of those early databases, IBM's DB2, was Don Haderle, now an IBM vice president and CTO of that company's data management division. Haderle is also an advisor to Vertica, a startup focused on creating a new type of database centered around columns, not rows. Vertica was co-founded by Michael Stonebraker, designer of the Postgres database system and a longtime CS professor at Berkeley, now at MIT.
In a recent set of blog posts, Haderle and Stonebraker discuss the constraints of the original relational database implementations, and how changes in the cost of processing can usher in column-oriented databases better suited to analyzing rich data types.
Current relational database management systems are largely built on designs from the 1980s. Back then, computers were expensive and slow relative to today's systems. The minimization of expensive CPU cycles -- not I/O considerations -- was the driving force in early relational DBMS design. The market sweet spot was transaction processing coupled with simple decision support, which was generally satisfied by access on a limited set of attributes (dimensions)...
In the 1980s, rows were small (actually the model was an 80 column punched card averaging 20 fields per record) and the number of entities (tables) was small (100 was large). Two- and three-way joins were the norm. Today, the number of attributes per table is in the hundreds with the most perverse having thousands of attributes. The number of tables in the database has climbed into the multiple thousands. Six- and seven-way joins are common; ten- and twelve-way joins are not extraordinary. As a result, searching is significantly more complex given the number of search arguments (attributes) and the number of relationships involved...
To address these challenges, it makes sense to design an inverted database where the emphasis is on the attribute lists and the relationships between entities. This is precisely what a columnar database does. The rest is details which will determine success: compressing for efficiency; linking lists for joins; time stamping data elements to provide historical detail, as well as alleviate pressure on loading and updating; adding all of the relational functionality; etc.
Only reading the columns you need. We see fact tables with anywhere from 40 to 200 attributes. Warehouse queries typically read 5 or fewer columns. A column database reads exactly the columns needed; a row store reads all the columns. In round numbers this is an order of magnitude performance penalty.
Superior compression. Columnar compression is more effective than the schemes used by row stores because blocks read by column databases have only one data type in them (a portion of a column) while row stores have several data types (a collection of tuples). Compressing one data type is fundamentally easier than compressing several. Moreover, in Vertica's case, it does not store an explicit tuple identifier or space-wasting bit maps of non-null fields. Hence, we typically see columnar compression beating row-store compression by a factor of 2.
Executor runs on compressed data. The row stores uncompress a block when it is brought off the disk because they have legacy executors that deal with uncompressed data. In contrast, the Vertica executor runs on compressed data. This results in better L2 cache performance, copying fewer bytes of data, etc.
Inner loop is column-oriented not row-oriented. A row-oriented query plan has an inner loop that picks up a tuple and does something with it. A column-oriented query plan has an inner loop that picks up a column and does something with it. In a fact table query, there might be 10^9 rows but only 5 relevant columns. Hence, the inner loop, with its inherent overhead, is executed far fewer times in a column store.
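To make the first and last of those quoted points concrete, here is a minimal Python sketch (illustrative only, not Vertica's implementation; all names are made up): a table stored as one list per column lets a query touch only the columns it needs, with an inner loop that runs over a single column, and a sorted single-type column compresses well even with plain run-length encoding.

    # A tiny "fact table" stored both row-wise and column-wise.
    rows = [{"id": i, "price": float(i % 7), "qty": i % 3, "region": i % 5}
            for i in range(1000)]

    # Row store: a single-column aggregate still touches every column
    # of every row, and the inner loop runs once per tuple.
    def sum_price_rowwise(rows):
        total = 0.0
        for row in rows:               # 1000 iterations over full tuples
            total += row["price"]
        return total

    # Column store: the same table as one list per column. The query
    # reads exactly one column out of four.
    cols = {k: [row[k] for row in rows] for k in rows[0]}

    def sum_price_columnwise(cols):
        return sum(cols["price"])      # one tight loop over one column

    # One reason columnar compression wins: a block holds a single data
    # type, often sorted and low-cardinality, so run-length encoding
    # collapses it dramatically.
    def rle(values):
        runs = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return runs

    print(len(rle(sorted(cols["region"]))))   # 5 runs instead of 1000 values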
What do you think of the idea of column-oriented databases?
It's interesting. It seems like there could be some trade-offs involved. There are some operations where you really do need to operate on the whole row (insert/delete), and now that the columns are only loosely associated, those operations may take more processing power.
On the other hand, column-level locking might be more natural to implement.
i'll ask a question which will reveal how clueless i am about databases: is there a way one could take this to some extreme by not having either rows or columns at all? every datum would be (tableName,row,col,value) and would be in random access storage with no further explicit structure. then you'd layer all sorts of indices on top of the heap to get whatever kinds of performance you need out of it.
this is obviously a bad idea as far as caches go, because you have no locality guarantees. and it's obviously a bad idea from a storage-requirements perspective, since you have all these extra (tableName,row,col) bits around.
but it seems like choosing either the row or the column perspective is an implementation detail that causes a possibly more abstract theory to degrade in some way (rows penalize reads, columns penalize writes). so could we start with an architecture that more closely approximates the theory (which i guess means that everything ends up sucking performance-wise, rather than only half of everything) and then refine from there? ja, well, no, i guess. ;-)
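for what it's worth, here's a rough sketch of the idea in python (all names made up): one big map keyed by (table, row, col), with row- and column-shaped "views" layered on top as derived indices. it also makes the costs visible: no locality, every datum carries its full key, and in practice you'd maintain real indices rather than scanning the heap.

    # Every datum lives in one heap keyed by (table, row, col); neither
    # rows nor columns are privileged. Purely illustrative.
    heap = {}

    def put(table, row, col, value):
        heap[(table, row, col)] = value

    # Either perspective is just an index layered on the heap.
    def row_view(table, row):
        return {c: v for (t, r, c), v in heap.items()
                if t == table and r == row}

    def col_view(table, col):
        return {r: v for (t, r, c), v in heap.items()
                if t == table and c == col}

    put("orders", 1, "price", 9.99)
    put("orders", 1, "qty", 3)
    put("orders", 2, "price", 4.50)
    print(row_view("orders", 1))        # {'price': 9.99, 'qty': 3}
    print(col_view("orders", "price"))  # {1: 9.99, 2: 4.5}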
> i'll ask a question which will reveal how clueless i am
> about databases: is there a way one could take this to
> some extreme by not having either rows or columns at all?
> every datum would be (tableName,row,col,value) and would
> be in random access storage with no further explicit
> structure. then you'd layer all sorts of indices on top
> of the heap to get whatever kinds of performance you need
> out of it.
I'm showing some ignorance as well, but it seems like your idea is close to what OODBMSs are.
> > i'll ask a question which will reveal how clueless i am
> > about databases: is there a way one could take this to
> > some extreme by not having either rows or columns at all?
> > every datum would be (tableName,row,col,value) and would
> > be in random access storage with no further explicit
> > structure. then you'd layer all sorts of indices on top
> > of the heap to get whatever kinds of performance you need
> > out of it.
>
> I'm showing some ignorance as well, but it seems like your
> idea is close to what OODBMSs are.
Dr. Codd's point in defining the RDBMS (the Relational Model, actually) was to put what came to be known as the ACID functions *with* the data, *not* in myriad applications. OO databases revert to the COBOL paradigm of putting the ACID functions back into myriad applications. This is why database-centric folk get hives listening to OO zealots; and why OO databases and OO transactional systems always [:)] fail.
Whether column oriented databases turn out to have performance advantages is an open question. Hardware, specifically multi-core multi-processor machines with solid state disks, will make this moot.
> Whether column oriented databases turn out to have
> performance advantages is an open question. Hardware,
> specifically multi-core multi-processor machines with
> solid state disks, will make this moot.
Can you back that prediction up with references or an explanation? A significant number of enterprise databases (if not a vast majority) are already running on multi-processor machines, and an increasing number of those are multi-core. With 64-bit architectures, the amount of accessible RAM has also increased dramatically.
It seems to me that what you are pointing to are just linear performance improvements. If the problem has quadratic complexity (as I believe these articles imply), then linear performance improvements are not a long-term solution.
> Dr. Codd's point in defining the RDBMS (the Relational Model,
> actually) was to put what came to be known as the ACID
> functions *with* the data, *not* in myriad applications. OO
> databases revert to the COBOL paradigm of putting the ACID
> functions back into myriad applications.
Please elaborate. Based upon my extensive knowledge of the ACID functions (I just looked them up on Wikipedia one minute ago) these functions are (I think, I haven't used db4o all that much) handled by the db4o library, not by my applications. There is only one place where they are handled.
I can't find the article, of course, but here is the quote I pulled out:
... but Flash-based storage has such a different performance profile from rotating media, that I suspect that it will end up having a large impact on filesystem design. Right now, most filesystems tend to be designed with the latencies of rotating media in mind. -- Linus Torvalds/2007
It doesn't take much imagination to see that this also implies that the "joins are too slow" argument for OO/flat file/hierarchic, and against contemporary SQL databases, goes away. The same goes for column stores. There's a thread on comp.databases.theory which touched on it. Mr. Celko allowed that his next book talks about this.
As to ACID: if the attributes applied are calculated in Application clients, then all bets are off. Only if these are the product of RI in the database is ACID independent of the Applications. Having a surrogate key (object ID) and the like disposes of the integrity inherent in the RM. It *looks* like RI is enforced, but if the PKs are surrogate, nothing prevents duplication of real world attributes. Only the application, if at all, can determine whether two OIDs mean two distinct entities. This is the built in flaw of OO databases.
The one place where column stores can be helpful is concurrency: if I change row A, column 22, and you change row A, column 23 (and these are logically independent attributes), then a column store will allow a non-conflicting update. This is a Good Thing if data is less than 5NF, e.g. there are lots of 'data' columns per PK. Even row-level locking in contemporary SQL databases fails here. Whether it is correct to have less-than-5NF datastores is another question.
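A toy illustration of that concurrency point (my reading of it, not any particular product's locking scheme): with one lock per column, writers touching different attributes of the same row don't block each other, where a single row-level lock would serialize them.

    import threading

    # Column-wise storage with one lock per column. Illustrative only.
    columns = {"col22": {"A": 1}, "col23": {"A": 2}}
    locks = {name: threading.Lock() for name in columns}

    def update(col, row, value):
        with locks[col]:    # conflicts only with writers of this column
            columns[col][row] = value

    # Your update and mine hit the same row "A" but different columns,
    # so neither waits on the other; a row lock would serialize them.
    t1 = threading.Thread(target=update, args=("col22", "A", 10))
    t2 = threading.Thread(target=update, args=("col23", "A", 20))
    t1.start(); t2.start(); t1.join(); t2.join()
    print(columns)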
With (my term) hyperparallel CPUs and SS disks, the mantra of "one fact, one place, one time" can be implemented without (to within x%) performance penalty. The notion of iterative processing (i.e. COBOL and OO and XML) goes by the wayside. The intent of the RM is that data change is "all at once" (I think Codd actually said that); such machines make this nearly practical.
Prolog has/had a similar view. It emerged about the same time, and uses similar terminology.
> Please elaborate. Based upon my extensive knowledge of
> the ACID functions (I just looked them up on Wikipedia one
> minute ago) these functions are (I think, I haven't used
> db4o all that much) handled by the db4o library, not by my
> applications. There is only one place where they are
> handled.
The library can't handle this. To get it right, you have to specify on the code level which updates go together, assuming you have updates that have to be handled as a unit.
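A minimal sketch of that point, with a hypothetical API (not db4o's actual interface): the store can make each individual write atomic, but only the application knows that two writes form one logical unit, so it has to declare the boundary itself.

    # Hypothetical API, for illustration only. Each store() alone is
    # atomic; the debit and the credit only form a correct unit because
    # the application declares the transaction boundary around them.
    def transfer(db, src_id, dst_id, amount):
        with db.transaction():      # application-declared unit of work
            src = db.get(src_id)
            dst = db.get(dst_id)
            src.balance -= amount
            dst.balance += amount
            db.store(src)
            db.store(dst)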
> No. Database developer. There's a world of difference.
I see.
Keep in mind that I work for a relational database vendor (Oracle), but I have seen a number of interesting specialized solutions. I'm not sure why "kx" didn't come up in this discussion, for example; they seem much more appropriate to the context than db4o ....
Maybe the "column oriented" thing in this article is just some marketing B.S., but assuming it's serious, what makes you certain that storing column/row is somehow going to relax the rules of SQL and relational data management compared to storing it row/column?
> The library can't handle this. To get it right, you have
> to specify on the code level which updates go together,
I'll admit I'm talking a little bit out of my job title, but if your Object is independent of others (it contains only primitives, immutables, etc.) then the Object is the group of updates that goes together, which is simple and obvious. The Object is the transaction.
If the object contains references to other objects then it's messy, you do some chaining, all wrapped in a transaction, right??? I imagine there are a lot of technical details and gotchas, but Hibernate has been doing this for years, so it must be fairly well defined.
Cameron - I mentioned db4o (and now Hibernate) solely because I have limited experience with them.
Kx makes kdb, which is a column oriented database. In my job, I get to see a large variety of systems in almost every major financial services firm, exchange, etc. While I don't see it often, the kx stuff is apparently particularly good at time series work, which can be useful for regression testing (in the financial sense, not coding ;-), pre-trade compliance, etc. Unfortunately, their web site appears to have been "dumbed down" by marketdroids .. it used to have a whole lot of really useful technical information.