Interesting that keyword search is working better than URL search. I would have thought URL search would be the easier of the two to get running well -- and the easier to parallelize. It's still a pretty big problem, though:
1.4 million new posts per day, eh. If each post has an average of 5 links, that's 7 million links per day, or about 2.5 billion links per year. If the average link takes 100 bytes to store, that's roughly 250 GB per year. So storage isn't a big deal - especially when you normalise a little.
1.4 million posts evenly spread over 24 hours means about 16 per second, or about 81 new links going into the database every second. Anyone got benchmarks on how many SELECTs per second you could do on a MySQL table containing 10 billion rows (1 TB) while taking 81 INSERTs every second?
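Spelling the arithmetic out as a quick Python snippet makes it easy to check -- the 5-links-per-post and 100-bytes-per-link figures are just the guesses above, not measured numbers:

    # Back-of-envelope estimate of link traffic and storage.
    POSTS_PER_DAY = 1_400_000
    LINKS_PER_POST = 5      # guess
    BYTES_PER_LINK = 100    # guess

    links_per_day = POSTS_PER_DAY * LINKS_PER_POST            # 7,000,000
    links_per_year = links_per_day * 365                      # ~2.6 billion
    storage_gb_per_year = links_per_year * BYTES_PER_LINK / 1e9   # ~255 GB

    posts_per_second = POSTS_PER_DAY / 86_400                 # ~16 posts/sec
    links_per_second = links_per_day / 86_400                 # ~81 INSERTs/sec

    print(f"{links_per_day:,} links/day, ~{storage_gb_per_year:.0f} GB/year, "
          f"~{links_per_second:.0f} inserts/sec")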
Anyway, that's not how you'd do it. Each time a new link came in, you'd hash it, then assign it to a server based on the hash value. When someone does a URL search, you'd hash the query URL the same way and ask the right server about it. With enough servers, each box only has to hold and search a small slice of the links, so things should work fairly well.
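Here's a minimal sketch of that hash-partitioning idea in Python -- the server names and the shard count are made up for illustration, and MD5 is just one convenient hash, not necessarily what Technorati actually uses:

    import hashlib

    # Hypothetical shard table; in practice this would list real host names.
    SERVERS = ["link-db-%02d" % n for n in range(16)]

    def server_for_url(url):
        """Hash a URL and pick the server responsible for it.

        The write path (storing a new link) and the read path (answering
        a URL search) use the same function, so both always land on the
        same box.
        """
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        return SERVERS[int(digest, 16) % len(SERVERS)]

    print(server_for_url("http://www.example.com/some/post"))

One wrinkle with plain modulo hashing is that adding servers moves almost every URL to a different shard; consistent hashing is the usual answer to that, but the sketch above is enough to show the idea.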
But - how many is "enough"? You'd probably want to keep the data on each server small enough to fit in memory. Ordinary boxes these days can probably take 3 GB, so 333 servers could handle a terabyte of links. Ouch! But then, Dave did say they just added 400 servers, so maybe this isn't so far off.
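The "how many is enough" question is really just the storage estimate divided by per-box memory. A quick sanity check, taking the 3 GB figure above as the assumed RAM per box:

    total_link_data_gb = 1000   # ~1 TB of link data, from the estimate above
    ram_per_box_gb = 3          # assumed memory on an ordinary box

    servers_needed = total_link_data_gb / ram_per_box_gb
    print(f"~{servers_needed:.0f} servers")   # ~333 servers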