The Artima Developer Community

Python Buzz Forum
PubSub "Hyperbole number"

0 replies on 1 page.

Phillip Pearson

Posts: 1083
Nickname: myelin
Registered: Aug, 2003

Phillip Pearson is a Python hacker from New Zealand
PubSub "Hyperbole number" Posted: Aug 14, 2005 11:52 AM

This post originated from an RSS feed registered with Python Buzz by Phillip Pearson.
Original Post: PubSub "Hyperbole number"
Feed Title: Second p0st
Feed URL: http://www.myelin.co.nz/post/rss.xml
Feed Description: Tech notes and web hackery from the guy that brought you bzero, Python Community Server, the Blogging Ecosystem and the Internet Topic Exchange

Bordering on the ridiculous, PubSub's Hyperbole number exceeds 1.6 TRILLION at times.

The number counts "matches" per day, where a "match" is a single test of whether an arriving feed entry satisfies a subscription. Somewhere over 300k subscriptions times ~1.5M feed entries per day gives them an average Hyperbole number of about 500 billion, and a peak up around 1.6 trillion (because people don't post evenly over 24 hours).
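
Quick sanity check on those figures (a throwaway Python snippet; the 300k and ~1.5M numbers are the ones above):

    subscriptions = 300_000          # "somewhere over 300k"
    entries_per_day = 1_500_000      # ~1.5M feed entries per day

    avg_hyperbole = subscriptions * entries_per_day
    print(avg_hyperbole)             # 450_000_000_000, i.e. ~500 billion/day
    # Peaks run up around 1.6 trillion because posting is bursty.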

This is kind of cool, and it gives an idea of the magnitude of the problem PubSub deals with. Two things I'd like to see are the number of unique subscriptions and the number of actual matches per day (i.e., the number of items that actually make it out into their search result feeds).

My favourite part of his post: "If our algorithm wasn't sub-linear, we'd need massively more machines than we currently use for matching. (i.e. we'd need more than one...)". Hah!

I got to meet PubSub's Bob Wyman and Salim Ismail in person last night; they seem like nice guys.

---

Now all I need to do is resist the temptation to take Bob up on his challenge to write a similarly fast matching engine :-)

I can't resist thinking about the algorithm, though. His response to the first comment states that you're only allowed to match one message at a time. That means the sub-linearity has to come out of indexing the subscriptions. I guess that explains why they don't do traditional-style search as well: their algorithm only works "backwards".

Back-of-the-envelope calculation time. 2 million entries per day is 23 per second on average, so you'd have to aim for at least 100 per second to handle (current) peak load. That gives you 10 ms per entry, or 24 million CPU cycles on Bob's desktop box. A linear algorithm would leave only about 80 cycles per subscription, i.e. not enough. The average entry is probably 100 words or less.
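
Spelling that budget out (the 2.4 GHz clock is my assumption about Bob's desktop box - it's what makes 10 ms come out to 24 million cycles):

    entries_per_day = 2_000_000
    avg_per_sec = entries_per_day / 86_400        # ~23 entries/second
    budget_s = 1 / 100                            # aim for 100/sec: 10 ms each
    cycles_per_entry = budget_s * 2.4e9           # 24 million cycles at 2.4 GHz
    cycles_per_sub = cycles_per_entry / 300_000   # ~80 cycles/subscription
    print(avg_per_sec, cycles_per_entry, cycles_per_sub)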

I'd start by making a hash table of all the words used in subscriptions. Assign each word a token ID. At the same time, compile all the subscriptions down to expression trees that refer to token IDs. Merge duplicate subscriptions.
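
A minimal sketch of that compilation step (names like intern and compile_subscription are mine, and I'm restricting queries to AND/NOT - general OR-expressions would need the "virtual subscription" trick below):

    from itertools import count

    token_ids = {}       # word -> token ID, shared across all subscriptions
    _next_id = count()

    def intern(word):
        """Assign each distinct subscription word a small integer token ID."""
        if word not in token_ids:
            token_ids[word] = next(_next_id)
        return token_ids[word]

    def compile_subscription(required, excluded):
        """Compile a query like +FOO+BAZ-BAR down to sets of token IDs."""
        return (frozenset(intern(w) for w in required),
                frozenset(intern(w) for w in excluded))

    subscriptions = {}   # compiled form -> subscriber IDs
    def add_subscription(sub_id, required, excluded):
        # Using the compiled form as a dict key merges duplicate subscriptions.
        key = compile_subscription(required, excluded)
        subscriptions.setdefault(key, []).append(sub_id)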

Presumably a subscription has to have at least one "required word" - otherwise it would match everything in the database. So you order the compiled subscriptions by their least common "required word". Some indirection would be required to cope with subscriptions which OR together a bunch of match expressions - or perhaps you could break these up into separate "virtual subscriptions", then follow them back to their subscription record when you get a match.

Now walk through the words in the entry and build the set of token IDs for all words that appear in some subscription. Sort them or drop them in whatever associative container you have available. Then, for each of those token IDs, find all subscriptions that require that word and evaluate their expression trees in the usual way.
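
As a sketch, continuing the shapes from above (subs_by_word is a hypothetical index from a token ID to the compiled subscriptions filed under it - ideally under their least common required word, per the previous paragraph):

    def match_entry(words, token_ids, subs_by_word):
        # Only words that some subscription mentions survive tokenisation,
        # so entry_tokens is typically much smaller than the entry itself.
        entry_tokens = {token_ids[w] for w in words if w in token_ids}
        hits = []
        for tok in entry_tokens:
            for required, excluded, sub_ids in subs_by_word.get(tok, ()):
                # "Evaluate the expression tree": for AND/NOT-only queries
                # that's just two set tests.
                if required <= entry_tokens and not (excluded & entry_tokens):
                    hits.extend(sub_ids)
        return hits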

This should be pretty fast, but it will bog down if you start getting a lot of subscriptions with complex expressions. The solution to that would be to merge similar expressions. Instead of just making a big list of all the unique subscriptions, compile all the subscriptions down into a big tree that mirrors the aforementioned ordered list of required words and has lists of subscriptions on the leaves of the tree.

This way, the searches (1) +FOO, (2) +FOO+BAR, (3) +FOO-BAR, (4) +FOO+BAZ, (5) +FOO+BAZ-BAR would result in a tree like this:

FOO [1]
→ BAR [2]
→ -BAR [3]
→ BAZ [4]
  → -BAR [5]

For each word with a token ID, you'd have to evaluate the whole expression tree for that word and keep track (in another associative container) of the subscriptions that have been matched so far. 24 million CPU cycles per entry, or 240k per word, should easily be enough for this.
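
Here's that tree walk as a sketch (Node and evaluate are my guesses at a shape, not PubSub's actual structures):

    class Node:
        """One test in the merged subscription tree: +WORD or -WORD."""
        def __init__(self, word, negate=False, subs=(), children=()):
            self.word, self.negate = word, negate
            self.subs = list(subs)          # subscriptions fully satisfied here
            self.children = list(children)  # further tests sharing this prefix

    def evaluate(node, entry_tokens, hits):
        present = node.word in entry_tokens
        if present == node.negate:          # +WORD absent, or -WORD present
            return                          # prune: nothing below can match
        hits.extend(node.subs)
        for child in node.children:
            evaluate(child, entry_tokens, hits)

    # The example tree above; plain words stand in for token IDs.
    root = Node("FOO", subs=[1], children=[
        Node("BAR", subs=[2]),
        Node("BAR", negate=True, subs=[3]),
        Node("BAZ", subs=[4], children=[Node("BAR", negate=True, subs=[5])]),
    ])

    hits = []
    evaluate(root, {"FOO", "BAZ"}, hits)
    print(hits)   # -> [1, 3, 4, 5]: +FOO, +FOO-BAR, +FOO+BAZ, +FOO+BAZ-BAR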

So there you go. Pretty fast, and probably not too hard to implement, either. I wonder how close this thought experiment got to PubSub's actual algorithm ...
