Summary
Robert McIntosh wrote a thought-provoking piece on designing a scalable Web application without a database. I share three reasons why such a notion deserves some consideration.
McIntosh's basic thesis is centered around three observations. The first one is that true scalability can best be achieved in a shared-nothing architecture. Not all applications can be designed in a completely shared-nothing fashion—for instance, most consumer-facing Web sites that need the sort of scaling McIntosh envisions require access to a centralized user database (unless a single sign-on solution is used). But a surprising number of sites could be partitioned into sections with little shared data between the various site areas.
McIntosh's second observation is that a new class of ready-to-use infrastructure is becoming available that makes horizontal scaling an economically feasible option. While amassing an array of lightly-used servers would have been considered a waste of resources just a few years ago, OS-level virtualization techniques turned such seeming waste into an economic advantage: instead of having to architect, configure, develop, and maintain a scalable software architecture, one can possibly build out, or use, a scalable (virtual) hardware architecture. The conceptual difference is merely that scaling is pushed into a lower infrastructure layer.
The example he mentions is Amazon's EC2 compute cloud:
The hardware architecture would be very similar and based on Amazon’s Elastic Cloud and S3 services. The idea being that the data would reside in S3 in a text format (we’ll use XML for sake of argument), with the actual site work running off of Elastic Cloud instances.
McIntosh's final observation is that although modern Web frameworks speed up development already, a new level of rapid development can possibly be reached by managing data in plain files, such as XML:
Rapid development. Isn’t that why we have Ruby on Rails, Grails and other frameworks like these? True, but how valuable are the domain driven OO frameworks that have dumb domain objects? I will concede that with some applications, working with the data is easier using objects that say a tabular model as the data is naturally hierarchal in nature. Then again, that is where the XML/JSON data models can fit. Also, frameworks like Ruby on Rails, Grails and ORMs like Hibernate, JPA, entity beans, etc. are most valuable when you need a full CRUD application. While both of these scenarios have CRUD operations, they aren’t data entry CRUD apps in the traditional sense.
McIntosh then outlines a system architecture that relies on possibly many server instances serving up and managing plain files, most likely in some structured format, such as XML or JSON. The system does not have a domain objects layer—instead, the controller layer presumably translates incoming requests to some file-related operation, such as reading a file or changing the contents of a file. And the outgoing operations are simply a matter of transforming file-bound data to a format needed by the presentation layer, such as XML, XHTML or, again, JSON.
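To make that architecture concrete, here is a minimal sketch of a controller that maps request-style operations straight onto JSON files, with no domain objects in between. The `FileController` class, its method names, and the one-file-per-record layout are assumptions made for illustration; McIntosh's piece does not prescribe them.

```python
import html
import json
from pathlib import Path


class FileController:
    """Hypothetical controller: requests become file reads and writes."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, record_id: str) -> Path:
        # One file per record; the record ID doubles as the file name.
        return self.root / f"{record_id}.json"

    def put(self, record_id: str, data: dict) -> None:
        # Incoming request: change the contents of a file.
        self._path(record_id).write_text(json.dumps(data), encoding="utf-8")

    def get(self, record_id: str) -> dict:
        # Incoming request: read a file.
        return json.loads(self._path(record_id).read_text(encoding="utf-8"))

    def render(self, record_id: str) -> str:
        # Outgoing step: transform file-bound data into an XHTML fragment.
        record = self.get(record_id)
        items = "".join(
            f"<li>{html.escape(str(k))}: {html.escape(str(v))}</li>"
            for k, v in sorted(record.items())
        )
        return f"<ul>{items}</ul>"
```

Note that `put` here is a plain overwrite, so it offers none of the guarantees a database would give on a crash or under concurrent writers.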
McIntosh's suggestion could be quickly dismissed as simplistic, ignoring many decades of data management and application development wisdom, but for three reasons:
First, scalable data management has increasingly come to mean in-memory databases. Oracle's recent purchase of Tangosol, a leading distributed cache vendor, and earlier purchase of another in-memory database shop, TimesTen, are but two indications that in-memory data management is here to stay. With falling RAM prices, it's possible to load several GB of data into main memory. Once that data is in memory, it is perhaps less important whether the data is accessed via a relational database layer or by application-level code that co-habits the same memory space. Tangosol's Coherence product, for instance, provides its own API for accessing cache-resident data. Other in-memory databases provide SQL or some other data-access API.
If an application can de-cluster its data across many server nodes, each node can load portions of the data into main memory and manipulate that data with application-specific code. To be sure, one of the key benefits of the relational model is exactly that it abstracts data storage and access away from application code. Yet, many applications, instead of providing direct database access, expose their data through an API, as in service-oriented architectures. Indeed, shared database access (when one database is shared by several applications) is increasingly the exception.
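One simple way to de-cluster data is to hash each record key to a node, so that every node knows which slice of the data it should hold in memory. The sketch below assumes a fixed list of node names and a hypothetical `node_for` helper; it is only one of several possible partitioning schemes.

```python
import hashlib


def node_for(key: str, nodes: list[str]) -> str:
    """Pick the node responsible for a key by hashing the key (illustrative)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer and wrap it onto the node list.
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]
```

With plain modulo partitioning, changing the number of nodes reassigns most keys; a scheme such as consistent hashing would reduce that movement at the cost of more bookkeeping.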
Another reason to entertain some of McIntosh's notions is that quick access to large amounts of data occurs through indexes—be those indexes managed by a relational database or indexes created ex-database, such as with Lucene. An application relying on, say, XML-based files for data storage could generate the exact indexes it needs in order to provide query execution capabilities over the files. And, in general, ex-database indexes have proven more scalable than database-specific indexes: Not only can such indexes be maintained in a distributed fashion, they can also be custom-tailored to the exact retrieval patterns of an application.
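As a rough illustration of an application-built index, the following sketch scans a set of XML documents and builds an in-memory inverted index over one element. The `build_index` function and the sample element name are assumptions for the example; a real deployment would more likely use a library such as Lucene, as mentioned above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_index(documents: dict[str, str], field: str) -> dict[str, set[str]]:
    """Map each lowercased word found in <field> elements to the document IDs containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for element in root.iter(field):
            for token in (element.text or "").lower().split():
                index[token].add(doc_id)
    return dict(index)
```

Because the application chooses which element to index and how to tokenize it, the index can match the exact lookups the site performs and nothing more.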
The final reason to ponder some of McIntosh's thoughts is that next to short access times, the most important requirement for a data-driven site is data availability. As more business-style applications migrate to the Web, the ability to keep data alive at all times is sure to become a central concern of enterprise application development. There is but one sure way to ensure high data availability, and that is via replication. But replicating data whose identity is tied in some way to a database management system, such as by database-specific IDs, for instance, makes replication harder. Many database products provide replication solutions, but none equal the scalability of simply copying files around a vast distributed filesystem, such as Amazon's S3. If data represented by such files have globally-unique identifiers, then, theoretically, any node could take over management of such files (keeping in mind some cardinal rules of replication, though).
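A small sketch of why globally-unique identifiers help: if each file is named by a UUID rather than a per-database sequence number, copying files between nodes can never collide on names. The directory layout and the `create_record` and `replicate` helpers are hypothetical, and local directories stand in for a distributed store such as S3.

```python
import shutil
import uuid
from pathlib import Path


def create_record(node_dir: Path, payload: str) -> str:
    """Store a payload under a freshly generated UUID and return that ID."""
    record_id = str(uuid.uuid4())
    node_dir.mkdir(parents=True, exist_ok=True)
    (node_dir / f"{record_id}.xml").write_text(payload, encoding="utf-8")
    return record_id


def replicate(source_dir: Path, replica_dir: Path) -> int:
    """Copy files the replica does not have yet; return how many were copied."""
    replica_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    for path in source_dir.glob("*.xml"):
        target = replica_dir / path.name
        if not target.exists():
            shutil.copy2(path, target)
            copied += 1
    return copied
```

This only handles new files. Propagating updates to files that already exist on the replica is where the cardinal rules of replication come in, and the sketch does not attempt it.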
I don't agree with many of McIntosh's ideas, but merely find them interesting, especially as we confront new challenges (e.g., the mandate to keep data alive at all times) and are presented with new opportunities (Amazon's EC2, inexpensive RAM, in-memory databases). At some point, application architecture will have to change to take into account those new realities. I'm not sure McIntosh is right that file-based shared-nothing design is the path to the future, but real-world data management practices have greatly evolved from the days of the classic, centralized relational database and three-tier application design, and his ideas deserve consideration.
What do you think of McIntosh's notions of scaling a Web site without a database? If you don't agree with his ideas, then how would you scale an application to the extent the biggest consumer-facing sites require?
Any bigger and you need to rethink your storage solution. Databases only scale so far.
A good analogy is malloc. It is general purpose and flexible - but if you need maximum performance, you can always beat malloc for a specific case with a bit of work.
Anyone who has worked on a colossally scaled website like Amazon, Google, etc... knows that databases by themselves simply don't cut it.
Werner Vogels, CTO of Amazon, has a blog called http://allthingsdistributed.com where he addresses various topics related to massive scalability. Always worth a read.
> > Any bigger and you need to rethink your storage solution.
> > Databases only scale so far.
> > databases by themselves simply don't cut it.
>
> 1. Access to index records grows logarithmically with the number of records. Relational databases scale very well.
But I think the problem is not the amount of data as much as it is the number of concurrent clients.
> 2. A file system is a database, but one without ACID. If you don't need ACID you don't need a RDBMS. That's not new.
That's the thing that I think people tend to gloss over in these discussions about scaling an RDBMS. I think a lot of people have been using databases when they don't need ACID. Actually I know they have. I see it all the time.
So yeah, if you take something that never really needed a RDBMS and move it off of one, it's going to be fairly easy. It's when you do need ACID that things get hairy.
> So yeah, if you take something that never really needed a RDBMS and move it off of one, it's going to be fairly easy. It's when you do need ACID that things get hairy.
Then there is the idea that conventional ACID can be worked around. IOW, it is possible to engineer your solution to not require ACID with a bit of thinking outside the database.
For example, if you can relax the Atomic requirement and settle for "eventual" consistency, you can really improve scalability.
You don't suppose Amazon.com is a simple CRUD app sitting on an Oracle database, do you?
Here's a pointer to a pretty decent paper on the kind of tradeoffs you have to make as you scale up. Most really huge scale work I've seen is all about unloading the database using layered caches and idempotent asynchronous update protocols.
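To illustrate what "idempotent" buys you in such a protocol, here is a minimal sketch in which every update carries its own ID, so a message that is delivered twice is applied only once. The `Inventory` class and its fields are made up for the example and are not taken from the paper or from any real system.

```python
class Inventory:
    """Hypothetical store that applies each update at most once."""

    def __init__(self) -> None:
        self.stock: dict[str, int] = {}
        self._applied: set[str] = set()

    def apply(self, update_id: str, sku: str, delta: int) -> bool:
        # A redelivered update is recognized by its ID and ignored.
        if update_id in self._applied:
            return False
        self.stock[sku] = self.stock.get(sku, 0) + delta
        self._applied.add(update_id)
        return True
```

Since retries are harmless, the sender can simply resend an update whenever it is unsure whether the first attempt arrived, which is what makes asynchronous delivery workable.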
> > So yeah, if you take something that never really needed a RDBMS and move it off of one, it's going to be fairly easy. It's when you do need ACID that things get hairy.
>
> Then there is the idea that conventional ACID can be worked around. IOW, it is possible to engineer your solution to not require ACID with a bit of thinking outside the database.
I never meant to say that you couldn't redesign things to avoid this requirement. But I think it's harder to design than the standard ACID approach, or at least less well known.
> For example, if you can relax the Atomic requirement and settle for "eventual" consistency, you can really improve scalability.
>
> You don't suppose Amazon.com is a simple CRUD app sitting on an Oracle database, do you?
No, I don't. If you've ordered from them, you might have noticed they don't confirm anything synchronously. If they are out of something, you might find out a couple days after ordering it.
Honestly, I'm not sure why everyone makes such a big deal about Amazon. I used to have to work with people at Amazon and they had as many problems as anyone else. I don't think their success has been driven primarily through extraordinary technical prowess.
> Here's a pointer to a pretty decent paper on the kind of tradeoffs you have to make as you scale up. Most really huge scale work I've seen is all about unloading the database using layered caches and idempotent asynchronous update protocols.
>
> http://www.cs.utah.edu/~sai/papers/proposal/
I'm quite sure that databases don't scale indefinitely. I've never said anything different. I think they are overused. But I think there are definitely things that need ACID transactions. I imagine that banking laws impose some pretty difficult scalability problems. But that still doesn't necessarily mean RDBMS.
> > For example, if you can relax the Atomic requirement and settle for "eventual" consistency, you can really improve scalability.

I think that seems to be increasingly the rule, rather than the exception. I think many applications can handle that "eventual consistency" quite well and, in fact, that sort of models many aspects of how the real world works, i.e., most people are fine when things balance out over some acceptable period of time. I'm not sure what banking regulations say about this, for example, but even my bank account follows this "eventual" consistency pattern: it takes some time for deposits to clear and for transactions to settle.
But Atomicity means more than that. When you write a file and the process crashes halfway your file is simply corrupt. When your 'transaction' consists of 3 written files and the process crashes after 2 your 'transaction' isn't atomic any more. 'Manually' implementing (parts of) ACID is possible but it's not as simple as it seems at first sight.
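For the single-file case, a common workaround is to write to a temporary file in the same directory and then rename it over the target, so a reader sees either the old contents or the new ones and never a half-written file. The sketch below uses a hypothetical `atomic_write` helper built on Python's `os.replace`; it assumes a filesystem where rename is atomic and does nothing for the three-file case described above.

```python
import os
import tempfile


def atomic_write(path: str, data: str) -> None:
    """Replace the file at `path` with `data`, or leave it untouched on failure."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # swap the new file into place in one step
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Making several such writes succeed or fail together would need something more, such as a journal or a commit marker, which is exactly the part of ACID that is not as simple as it seems.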
> > > For example, if you can relax the Atomic requirement and settle for "eventual" consistency, you can really improve scalability.
> >
> > I think that seems to be increasingly the rule, rather than the exception. I think many applications can handle that "eventual consistency" quite well and, in fact, that sort of models many aspects of how the real world works, i.e., most people are fine when things balance out over some acceptable period of time. I'm not sure what banking regulations say about this, for example, but even my bank account follows this "eventual" consistency pattern: it takes some time for deposits to clear and for transactions to settle.
Sure, but this doesn't mean that they don't have an ACID transaction somewhere in their architecture (not saying that any or all banks do by any means). All it means is that they aren't implementing ACID synchronously on the front end. I don't think this is exotic or especially difficult to implement. I worked on a B2B architecture that would take orders and confirm (or deny) them asynchronously. We'd take a ton of orders and queue them up to run through a mainframe. This allowed us to process many more orders, but it didn't remove the bottleneck; it made it possible to smooth the load out over time. We still needed a very expensive (by my standards) mainframe.
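The take-now, confirm-later pattern can be sketched in a few lines: the front end only records the order and queues it, and a separate step works through the queue against the real stock. The `OrderDesk` class and its status strings are invented for this example, and a plain in-process queue stands in for whatever fed the mainframe.

```python
import queue


class OrderDesk:
    """Hypothetical front end that accepts orders now and decides on them later."""

    def __init__(self, stock: dict[str, int]) -> None:
        self._stock = dict(stock)
        self._pending: queue.Queue = queue.Queue()
        self.status: dict[int, str] = {}
        self._next_id = 0

    def accept(self, sku: str) -> int:
        # Front end: record the order and return at once, without checking stock.
        self._next_id += 1
        order_id = self._next_id
        self.status[order_id] = "pending"
        self._pending.put((order_id, sku))
        return order_id

    def drain(self) -> None:
        # Back end: process queued orders one at a time, as fast as the bottleneck allows.
        while True:
            try:
                order_id, sku = self._pending.get_nowait()
            except queue.Empty:
                return
            if self._stock.get(sku, 0) > 0:
                self._stock[sku] -= 1
                self.status[order_id] = "confirmed"
            else:
                self.status[order_id] = "denied"
```

The serialized decision still happens in `drain`, so the bottleneck is moved rather than removed; the queue just lets bursts of orders wait their turn.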
And to the point that was made earlier by Roland, we used a database to track things through the front-end and come to think of it, we didn't really need that. It made things easy but it definitely could have been done another way and that probably would have saved us a lot of pain and suffering.
Most enterprise-level databases are designed around "classic" database problems. Apparently they aren't the same problems that Google, Amazon, Flickr, FedEx, MySpace, et al. want solved now ( http://www.adambosworth.net/archives/000038.html ). I don't know the best solution to these problems, but I do want to be part of the group finding those solutions.
> I think it's harder to design than the standard ACID approach, or at least less well known.
Sure it is. ACID is conceptually easy. It is a nice model. But the usual implementation isn't efficient enough for really high volumes.
> Honestly, I'm not sure why everyone makes such a big deal about Amazon.
You should work there for awhile. I did for 2.5 years and it smashed a bunch of previously held prejudices I had about how to build scalable apps. Amazon started out as a conventional database centric app. A lot of the architecture evolved to work around the lack of scalability in the database. That limit was reached many years ago and new models have been developed to cope with their admittedly unique demands.
It's a pity they don't publish their own research (competitive advantage), although they do a good job of sharing knowledge through weekly brown bag presentations.
These days the mantra is: Scalability, Availability, Consistency - pick two. Conventional ACID emphasizes the latter two while saying nothing at all about scalability - which pretty well guarantees you're not going to get it.
> Once that data is in memory, it is perhaps less important whether the data is accessed via a relational database layer or by application-level code that co-habits the same memory space.
In a backward way, I think this introduces an opportunity for relational languages and structures in what's often seen as "the O-O layer." Relational isn't necessarily about persistence; it's about predicates and facts, and certainly doesn't preclude in-memory databases. And there's nothing special about objects (or associative arrays or any other typical application-level data structure) that relations don't offer; I'd wager the opposite is true.
> To be sure, one of the key benefits of the relational model is exactly that it abstracts data storage and access away from application code.
That doesn't really have anything to do with the relational model; that's the DBMS model, which is much the same as any other service, save that the DBMS offers an interface so flexible that requests are made in the form of a string (e.g. SQL), rather than a small set of API calls.
> Yet, many applications, instead of providing direct database access, expose their data through an API, as in service-oriented architectures.
I don't see how the two are related; certainly an API can act as a "hard" interface and firewall against direct DBMS access, and also hides however and wherever the data is kept. I'm not sure I've seen an application that "provides direct database access." Can you elaborate?