Interesting that keyword search is working better than URL search. I would have thought URL search would be the easier of the two to get running well -- and the easier to parallelize. It's still a pretty big problem, though:
1.4 million new posts per day, eh. If each post has an average of 5 links, that's 7 million links per day, or about 2.5 billion links per year. If the average link takes 100 bytes to store, that's roughly 250 GB per year. So storage isn't a big deal - especially when you normalise a little.
1.4 million posts evenly spread over 24 hours means about 16 per second, or about 81 new links going into the database every second. Anyone got benchmarks on how many SELECTs per second you could do on a MySQL table containing 10 billion rows (1 TB) while taking 81 INSERTs every second?
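Spelling the arithmetic out as a quick Python snippet makes it easy to check -- the 5-links-per-post and 100-bytes-per-link figures are just the guesses above, not measured numbers:

    # Back-of-envelope estimate of link traffic and storage.
    POSTS_PER_DAY = 1_400_000
    LINKS_PER_POST = 5      # guess
    BYTES_PER_LINK = 100    # guess

    links_per_day = POSTS_PER_DAY * LINKS_PER_POST            # 7,000,000
    links_per_year = links_per_day * 365                      # ~2.6 billion
    storage_gb_per_year = links_per_year * BYTES_PER_LINK / 1e9   # ~255 GB

    posts_per_second = POSTS_PER_DAY / 86_400                 # ~16 posts/sec
    links_per_second = links_per_day / 86_400                 # ~81 INSERTs/sec

    print(f"{links_per_day:,} links/day, ~{storage_gb_per_year:.0f} GB/year, "
          f"~{links_per_second:.0f} inserts/sec")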
Anyway, that's not how you'd do it. Each time a new link came in, you'd hash it, then assign it to a server based on the hash value. When someone does a URL search, you'd hash the query URL the same way and ask the right server about it. With enough servers, each box only has to hold and search a small slice of the links, so things should work fairly well.
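Here's a minimal sketch of that hash-partitioning idea in Python -- the server names and the shard count are made up for illustration, and MD5 is just one convenient hash, not necessarily what Technorati actually uses:

    import hashlib

    # Hypothetical shard table; in practice this would list real host names.
    SERVERS = ["link-db-%02d" % n for n in range(16)]

    def server_for_url(url):
        """Hash a URL and pick the server responsible for it.

        The write path (storing a new link) and the read path (answering
        a URL search) use the same function, so both always land on the
        same box.
        """
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        return SERVERS[int(digest, 16) % len(SERVERS)]

    print(server_for_url("http://www.example.com/some/post"))

One wrinkle with plain modulo hashing is that adding servers moves almost every URL to a different shard; consistent hashing is the usual answer to that, but the sketch above is enough to show the idea.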
But - how many is "enough"? You'd probably want to keep the data on each server small enough to fit in memory. Ordinary boxes these days can probably take 3 GB, so 333 servers could handle a terabyte of links. Ouch! But then, Dave did say they just added 400 servers, so maybe this isn't so far off.
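The "how many is enough" question is really just the storage estimate divided by per-box memory. A quick sanity check, taking the 3 GB figure above as the assumed RAM per box:

    total_link_data_gb = 1000   # ~1 TB of link data, from the estimate above
    ram_per_box_gb = 3          # assumed memory on an ordinary box

    servers_needed = total_link_data_gb / ram_per_box_gb
    print(f"~{servers_needed:.0f} servers")   # ~333 servers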