Python Buzz Forum
Scaling Technorati

Phillip Pearson

Posts: 1083
Nickname: myelin
Registered: Aug, 2003

Phillip Pearson is a Python hacker from New Zealand.
Scaling Technorati Posted: Sep 3, 2005 5:53 PM

This post originated from an RSS feed registered with Python Buzz by Phillip Pearson.
Original Post: Scaling Technorati
Feed Title: Second p0st
Feed URL: http://www.myelin.co.nz/post/rss.xml
Feed Description: Tech notes and web hackery from the guy that brought you bzero, Python Community Server, the Blogging Ecosystem and the Internet Topic Exchange

Dave Sifry is in damage control mode after the last couple of weeks' backlash against Technorati.

Interesting that keyword search is working better than URL search. I would have thought URL search would be the easier of the two to get running well -- and the easier to parallelize. It's still a pretty big problem, though:

1.4 million new posts per day, eh. If each post has an average of 5 links, that's 7 million links per day, or about 2.5 billion links per year. If the average link takes 100 bytes to store, that's 250 GB per year. So storage isn't a big deal - especially when you normalise a little.
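Just to make that arithmetic explicit, here's the same back-of-envelope calculation in a few lines of Python (the 5-links-per-post and 100-bytes-per-link figures are the assumptions from the paragraph above; the exact answer comes out a touch over the rounded 250 GB):

    # Back-of-envelope storage estimate, using the assumptions above.
    posts_per_day = 1400000
    links_per_post = 5        # assumed average links per post
    bytes_per_link = 100      # assumed average bytes to store one link

    links_per_day = posts_per_day * links_per_post    # 7,000,000
    links_per_year = links_per_day * 365              # ~2.56 billion
    storage_gb_per_year = links_per_year * bytes_per_link / 1e9
    print(links_per_day, links_per_year, storage_gb_per_year)  # ~255 GB/year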

1.4 million posts evenly spread over 24 hours means about 16 per second, or 81 new links going into the database every second. Anyone got benchmarks on how many SELECTs per second you could do on a MySQL table containing 10 billion rows (1 TB) while taking 81 INSERTs every second?
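Same idea for the write rate, assuming posts really do arrive evenly over the day (they won't, but it gives the order of magnitude):

    # Rough write rate, assuming posts arrive evenly over 24 hours.
    posts_per_day = 1400000
    links_per_post = 5
    seconds_per_day = 24 * 60 * 60                         # 86,400

    posts_per_second = posts_per_day / seconds_per_day     # ~16.2
    links_per_second = posts_per_second * links_per_post   # ~81
    print(posts_per_second, links_per_second)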

Anyway, that's not how you'd do it. Each time a new link came in, you'd hash it, then assign it to a server based on the hash value. Then when someone does a URL search, you'd hash it the same way and ask the right server about it. With enough servers, that would work fairly well.
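A minimal sketch of that hash-partitioning idea, assuming a fixed pool of servers and using md5 as the hash (nothing here says what Technorati actually does -- this is just the shape of it, with a made-up URL and server count):

    import hashlib

    NUM_SERVERS = 333   # hypothetical shard count; see the sizing guess below

    def server_for(url):
        # Hash the URL and map it to a server.  The same function is used
        # when storing a new link and when answering a URL search, so both
        # end up talking to the same box.
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_SERVERS

    # An incoming link and a later search for the same URL hit the same server:
    shard = server_for("http://example.com/some/post")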

But - how many is "enough"? You'd probably want to keep the data on each server small enough to fit in memory. Ordinary boxes these days can take maybe 3 GB of RAM, so 333 servers could handle a terabyte of links. Ouch! But then, Dave did say they just added 400 servers, so maybe this isn't so far off.
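That sizing guess, spelled out (the 3 GB of usable memory per box is the assumption from the paragraph above):

    # How many in-memory shards would a terabyte of links need?
    total_link_bytes = 10 ** 12    # ~1 TB, the 10-billion-row figure above
    ram_per_box = 3 * 10 ** 9      # ~3 GB of usable memory per ordinary box

    servers_needed = total_link_bytes / ram_per_box
    print(servers_needed)          # ~333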
