Summary
At SD Forum 2006, two eBay architects presented an overview of how eBay's architecture handles a billion page requests a day, and how that architecture evolved from a few Perl scripts to 15,000 application instances running in eight data centers. One conclusion from the presentation is that scaling is only in part a question of architecture.
At SD Forum 2006, Randy Shoup and Dan Pritchett, both with eBay, gave a presentation on eBay's architecture. Pritchett subsequently posted the presentation slides, The eBay Architecture [PDF], on his blog.
Predictably, the presentation contained a few awe-inspiring statistics, such as:
212,000,000 registered users
1 billion page views per day
26 billion SQL queries and updates per day
Over 2 petabytes of data
$1,590 worth of goods traded per second
Over 1 billion photos
7 languages
99.94% uptime
Other stats in the presentation related to development process and features, such as:
Over 300 new features released each quarter
Over 100,000 lines of code released every two weeks
That scale notwithstanding, according to the presentation, the goal of eBay's current architecture is to handle an additional ten-fold increase in traffic, something eBay expects to reach within a few short years. Another architecture objective is to be able to handle peak loads, and for components to gracefully degrade under unusual load or in the case of system failures.
According to the presentation, the system architecture is currently moving to Version 4. Predictably, the most interesting technical pieces of the presentation focus on that version, including, for instance, what the presenters said was the first step in scaling the application tier: Throwing out most of J2EE. Instead, they noted that "eBay scales on servlets and a re-written connection pool."
Another interesting aspect of application-layer scaling is that, according to the presentation, no session state is maintained in the application. Instead, "transient state [is] maintained in cookie or scratch database." For data access, eBay uses an internally-developed Java O/R mapping solution.
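To make the stateless-application-tier idea concrete, here is a minimal sketch of a servlet that keeps transient state in a cookie instead of an HttpSession, so that any application instance can serve any request. This is purely illustrative: the class name, cookie name, and request parameter are hypothetical, and none of this comes from eBay's code or its O/R mapping layer.

    import java.io.IOException;
    import javax.servlet.http.Cookie;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Illustrative only: a stateless servlet that round-trips transient state
    // through a cookie rather than an HttpSession, so no server-side session
    // has to be held or replicated across application instances.
    public class SearchPrefsServlet extends HttpServlet {

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            // Recover transient state from the request's cookies.
            String lastCategory = "all";
            Cookie[] cookies = req.getCookies();
            if (cookies != null) {
                for (Cookie c : cookies) {
                    if ("lastCategory".equals(c.getName())) {
                        lastCategory = c.getValue();
                    }
                }
            }

            // Write updated transient state back to the client before the
            // response body is committed; the app tier itself keeps no
            // per-user state between requests. Cookie values should be
            // simple tokens (no spaces or special characters).
            String category = req.getParameter("category");
            if (category != null) {
                Cookie updated = new Cookie("lastCategory", category);
                updated.setMaxAge(30 * 60); // keep for 30 minutes
                resp.addCookie(updated);
                lastCategory = category;
            }

            // ... render search results for lastCategory ...
            resp.setContentType("text/plain");
            resp.getWriter().println("category=" + lastCategory);
        }
    }

The scratch-database variant mentioned in the presentation would look similar, except that the state would presumably be keyed by an identifier carried in the cookie and read from and written to a database table, rather than being stored in the cookie itself.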
In scaling up the search aspect of the site, the presenters noted a unique requirement not encountered by general Web search engines, such as Google: eBay users expect changes to their data to show up in search results right away. Moreover, auction listers know exactly what search results to expect: for instance, the items they just listed must show up in all relevant searches. Apparently, just updating the search index took about 9 hours prior to the latest re-architecting of eBay's search.
The presentation is full of similarly challenging problems, as well as insights into their solutions. To me, the most interesting aspect of the presentation, however, is the overview it provides on how eBay's architecture itself evolved. It's worth considering some aspects of Version 1, for instance:
Built over a weekend in 1995 by Pierre Omidyar
Every item was a separate file, generated by a Perl script
No search, browsing only by category
System hardware from commodity parts that could be purchased at Fry's.
This architecture was in place between 1995 and September, 1997. By then, eBay was one of the better-known Web sites, and the architecture maxed out at 50,000 listings, according to the presentation.
The next few iterations involved a move to a 3-tier architecture, at first on Microsoft's IIS server, and then moving to Java. The final few versions indicate a move away from J2EE, and are highly customized to meet eBay's unique demands.
One way to look at the four main architecture versions is as an evolution. Another way to look at it, however, is as coming full circle: starting with a custom-designed solution, moving to a standards-inspired solution, and then moving again into a custom solution.
Based on the overview of the various architecture stages, one cannot help but wonder to what extent eBay's architects were solving urgent present scaling problems, and to what extent they were looking to build scalability into the system to handle future loads. And even if the plan was to design for the future, to what extent could architects truly forecast the scalability of the system at some imagined point in the future?
One problem with such predictions is that even if plenty of data is available on the currently operational system, usage patterns may change: for instance, users may start to favor video over simple images, or voice calls as part of interacting with the system. Such usage pattern changes can happen fairly fast, especially given that the average architecture lifespan is around 2-3 years, based on the presentation. Not many people had heard of YouTube two or three years ago, for example, and in that company's short lifespan millions of users grew comfortable posting videos online.
Scaling: Organizational Capability + Architecture
That last issue brings me to what I think is the main message of the eBay presentation. The most amazing aspect of this evolution to me is not necessarily the technical brilliance of the solutions at each architecture stage, but the fact that eBay was able to meet the challenges of its growth with subsequent refinements to its system, all the while keeping the site operational.
The reason that's interesting is that it suggests you can start with almost any architecture, even with Perl or Rails or JSP pages, as long as you know how to migrate to the next step, and have the capability to do so, if and when you need to scale your app. That, in turn, suggests that the key test of scalability is not so much how each architecture stage scales, but how readily a company or an organization can move an application from one architecture stage to the next. That indicates that scaling is as much an individual or organizational question as a technical one.
That should not be surprising, of course, since scaling has always had operational as well as architectural design aspects. (The last segment of the eBay presentation is devoted to the operational aspects of scaling, illustrating, for instance, how 15,000 application instances are managed across 8 data centers.) Approaching scaling from that broader perspective suggests, however, that two common ways of looking at scale may prove unhelpful in practice.
The first is an overemphasis on designing for scalability from the start. Most developers know that no architecture scales infinitely, but on occasion architects expend much effort on trying to design one architecture that will scale to some long-term need of an application. Pierre Omidyar likely didn't share that view, which is perhaps why he went down the path of Perl scripts and one file per item in his initial version.
The second less-than-helpful view of scalability views scalability and performance as merely afterthoughts, and discourages scalability considerations at initial stages of an application's development. This view is sometimes expounded by XP proponents, who would much rather code something up quickly than worry about how that code will scale to handle some future application workload.
In practice, neither view may be very helpful. A third, more realistic, view would consider scaling as partly an organizational, even business-level, capability. Recognizing that predicting future workloads is hard, if not impossible, this view would aim at an architecture that handles some near-term scaling goal, and at the same time allows features to be deployed rapidly so that the application's real users can generate a business rationale for supporting future architecture upgrades. Far from considering scaling as an afterthought, however, this view would also aim to develop from the start the organizational, and even business, capabilities to handle architectural changes to the system. That seems to be the view presented by the eBay architects at SD Forum.
In your projects, when do you start thinking about scalability?
My goal is to worry about scale in tiers. As in, there isn't a "one size fits all" for me. For example, consider eBay's throughput of new features and lines of code. That's an important scaling variable. Can you keep producing new features all the time? Does it run fast enough? Is it stable? How are updates delivered? Etc.
Yes, you have captured the essence of long term scalability. Can your organization adapt whatever you have deployed as necessary? The most impressive thing I feel we do, and the deck doesn't address it, is migrating the site in flight.
> It's worth considering some aspects of Version 1, for instance:
>
> Built over a weekend in 1995 by Pierre Omidyar
> ...
>
> The second less-than-helpful view of scalability views scalability and performance as merely afterthoughts, and discourages scalability considerations at initial stages of an application's development. This view is sometimes expounded by XP proponents, who would much rather code something up quickly than worry about how that code will scale to handle some future application workload.
First, thanks for an interesting article.
Reading the article up to this point, about how eBay started simple (first quote above), and scaled as needed, I was reminded of the XP/agile tenet of "Do the simplest thing that could possibly work", as well as YAGNI.
However, when reading the second quote above, I can't really see how that one follows from what precedes it. "Built over a weekend" using Perl scripts and generated HTML-files seems very much in line with "scalability and performance as merely afterthoughts, and discourages scalability considerations at initial stages of an application's development", or did you mean that the way eBay has evolved was not a good example to follow? It certainly worked, so I doubt that was your point, but again, I don't see the support for your second point in the article. Perhaps you could comment on that?
> "Built over a weekend" using Perl scripts and generated > HTML-files seems very much in line with "scalability and > performance as merely afterthoughts, and discourages > scalability considerations at initial stages of an > application's development", or did you mean that the way > eBay has evolved was not a good example to follow? It > certainly worked, so I doubt that was your point, but > again, I don't see the support for your second point in > the article. Perhaps you could comment on that?
The point I was trying to make was that somehow eBay had the capability to move to subsequent versions without significant interruption to its growth. While that ability seems obvious, many companies don't have the ability to seamlessly migrate from one architecture to another, let alone to go through four different architectures.
That's not a purely technical capability, but more of an organizational one: a question of resources, perhaps, but resources don't just show up automatically, and they have to be managed in order to provide that scaling capability.
I don't know how eBay managed that, but I suspect one insight Omidyar had from the beginning was that he would need help in managing that growth, i.e., that he would need to procure capital, people, etc. That seems so obvious, but only in the way it's also obvious that running a marathon consists of putting one foot after another for 26.2 miles. Just as most people can walk a few miles but few can finish a marathon, many companies are able to build one architecture, but are never able to successfully scale by moving to subsequent ones.
Instead, what often happens, I think, is that either companies get one architecture done and then just patch that infinitely in order to scale up (of course, at some point you reach a limit), or startups (for instance) worry so much about getting that initial architecture right from the start that they take too long to launch.
So eBay's is a very successful example of how resources can be managed to effect the scaling up of a system. Dan Pritchett posted in the comments section a link to an article that has a lot more to say about that.
> Instead, what often happens, I think, is that either companies get one architecture done and then just patch that infinitely in order to scale up (of course, at some point you reach a limit), or startups (for instance) worry so much about getting that initial architecture right from the start that they take too long to launch.
>
> So eBay's is a very successful example of how resources can be managed to effect the scaling up of a system. Dan Pritchett posted in the comments section a link to an article that has a lot more to say about that.
They basically rewrote the thing three times, right? Is that a grand-strategy or 3 big screw-ups? I'm not quite sure. Sometimes it's smart to say, 'I don't have time to design this properly, so I'll just poop something out' but sometimes you have to rewrite because you didn't consider all the information at hand or because you just did it the way you always do it or the way some book said.
One thing I do agree on is that it's almost impossible to predict future need. My current strategy is to not invest too much time preparing for future needs, but also not to paint myself into a corner. I often consider how something might be used later and add a little 'hook' or something that could easily be modified or reworked to allow for growth. Really, the best strategy for this is to minimize coupling in the system, but if you don't, you can always rewrite it.
Regarding XP, I can see an issue with managing customer expectations. XP customers tend to expect relatively quick turnaround on feature requests, and scalability is a feature, but there may be cases, such as eBay's, where the business does so well that it requires a rewrite of the entire system.
I've also observed that in many shops, including XP shops, scalability requirements are poorly defined and the infrastructure to verify them (load testing) is not set up early enough.
Finally, I always keep an eye open for opportunities to produce a faster system in the same amount of time that it takes to produce a slower one. This approach has worked well for me more than once.
I suppose if one writes tight, logically structured, readable code that minimizes redundancy (the kind that can only result from constant, small, simplicity-seeking refactorings), then when architectural changes later become necessary it will be feasible to refactor the code to support them.
I agree; in my experience scaling is more about people and less about technologies. Software is flexible, and architectures can be refactored. The main obstacle to scaling is usually the team, both developers and business people, being too tied to the original design: after it is "sold" and understood, nobody wants to change it.
In my case I often have to face customers refusing to change their architecture because they don't accept the idea that it is intrinsically evolutionary: "it should be right from the beginning." It is amazing; they don't understand that good software derives from a good development process. I know teams that are always successful regardless of the technology, while others fail again and again and just blame the technology they used.
It looks like the eBay people understood this basic lesson early on, so they are now a multi-billion-dollar company, while many others just go from failure to failure with their software projects. Darwin's law rules our little IT world, too.