Summary
In a recent presentation at the MySQL users conference, Domas Mituzas explains Wikipedia's architecture.
As the eighth-busiest Web site, Wikipedia is unusual in that it relies mostly on free, open-source software for its highly available infrastructure. At the 2007 MySQL users conference, MySQL's Domas Mituzas, who also works with Wikipedia, gave a presentation on Wikipedia's scalable architecture. The presentation is available under the title Wikipedia: Site Internals, Configuration, Code Examples and Management Issues.
Mituzas points out that:
The principle of openness forced all operation to use free & open-source software only. Having commercial alternatives out of question, Wikipedia had the challenging task to build efficient platform of freely available components...
Wikipedia’s primary aim is to provide a platform for building collaborative compendium of knowledge. Due to different kind of funding (it is mostly donation driven), performance and efficiency has been prioritized above high availability or security of operation.
Mituzas highlights the key elements of Wikipedia's architecture:
Linux - operating system (Fedora, Ubuntu)
PowerDNS - geo-based request distribution
LVS - used for distributing requests to cache and application servers
Squid - content acceleration and distribution
lighttpd - static file serving
Apache - application HTTP server
PHP5 - core language
MediaWiki - main application
Lucene, Mono - search
Memcached - various object caching
The presentation focuses on many aspects of caching and content delivery:
Content delivery network is the ‘holy grail’ of performance for Wikipedia. Most of pages (except for logged in users) end up generated in such a manner, where both caching and invalidating the content is fairly trivial...
There’re no unaccounted dynamic bits on a content page (if there are, the changes are not invalidated in cache layer, hence causing stale data)
Every content page has strict naming, with single URI to the file (good for having uniform linking and not wasting memory on dupe cache entries)
Caching is application-controlled (via headers) (simplifies configuration, more efficient selection of what can and cannot be cached)
Content purging is completely application-driven (the amount of unpredictable changes in unpredictable areas would render lots of stale data otherwise)
Application must support lightweight revalidations (If-Modified-Since requests)
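The caching rules quoted above can be illustrated with a small sketch. This is not MediaWiki's code (MediaWiki is written in PHP); it is a minimal Python model, under stated assumptions, of three of the points: the application sets the cache headers itself, it answers If-Modified-Since requests with a body-less 304, and it purges the single canonical URI when a page is edited. The names Page, EdgeCache, respond and save_edit, as well as the Cache-Control values, are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


class Page:
    """Toy content page with one canonical URI (hypothetical model)."""

    def __init__(self, uri, body, modified):
        self.uri = uri
        self.body = body
        # HTTP dates have one-second resolution; `modified` must be UTC-aware.
        self.modified = modified.replace(microsecond=0)


class EdgeCache:
    """Stand-in for a caching proxy layer: a dict keyed by canonical URI."""

    def __init__(self):
        self._store = {}

    def get(self, uri):
        return self._store.get(uri)

    def put(self, uri, response):
        self._store[uri] = response

    def purge(self, uri):
        self._store.pop(uri, None)


def respond(page, request_headers, logged_in=False):
    """Return (status, headers, body) for a GET of `page`."""
    headers = {"Last-Modified": format_datetime(page.modified, usegmt=True)}
    # The application, not the proxy configuration, decides cacheability.
    # These header values are illustrative assumptions.
    if logged_in:
        headers["Cache-Control"] = "private, must-revalidate, max-age=0"
    else:
        headers["Cache-Control"] = "s-maxage=3600, must-revalidate, max-age=0"

    since = request_headers.get("If-Modified-Since")
    if since:
        try:
            if page.modified <= parsedate_to_datetime(since):
                return 304, headers, b""  # lightweight revalidation, no body
        except (TypeError, ValueError):
            pass  # unparseable or naive date: fall back to a full response
    return 200, headers, page.body


def save_edit(page, new_body, cache, now):
    """Apply an edit and purge the page's single URI from the cache."""
    page.body = new_body
    page.modified = now.replace(microsecond=0)
    cache.purge(page.uri)


if __name__ == "__main__":
    created = datetime(2007, 5, 1, 12, 0, tzinfo=timezone.utc)
    page = Page("/wiki/Example", b"first revision", created)
    cache = EdgeCache()

    first = respond(page, {})
    cache.put(page.uri, first)
    print(first[0], first[1]["Cache-Control"])

    # A revalidation with the stored Last-Modified date needs no body.
    print(respond(page, {"If-Modified-Since": first[1]["Last-Modified"]})[0])

    # An edit purges the cached copy, so the next request is served fresh.
    save_edit(page, b"second revision", cache, datetime.now(timezone.utc))
    print(cache.get(page.uri))
```

Because each page has exactly one URI, the purge in save_edit only has to remove a single cache entry; with several URIs per page, the application would have to track and purge every variant.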
What do you think of Wikipedia's architecture as presented by Mituzas?