The Artima Developer Community
Sponsored Link

Python Buzz Forum
More on "reverse" search

0 replies on 1 page.

Welcome Guest
  Sign In

Go back to the topic listing  Back to Topic List Click to reply to this topic  Reply to this Topic Click to search messages in this forum  Search Forum Click for a threaded view of the topic  Threaded View   
Previous Topic   Next Topic
Flat View: This topic has 0 replies on 1 page
Phillip Pearson

Posts: 1083
Nickname: myelin
Registered: Aug, 2003

Phillip Pearson is a Python hacker from New Zealand
More on "reverse" search Posted: Aug 14, 2005 1:51 PM
Reply to this message Reply

This post originated from an RSS feed registered with Python Buzz by Phillip Pearson.
Original Post: More on "reverse" search
Feed Title: Second p0st
Feed URL: http://www.myelin.co.nz/post/rss.xml
Feed Description: Tech notes and web hackery from the guy that brought you bzero, Python Community Server, the Blogging Ecosystem and the Internet Topic Exchange
Latest Python Buzz Posts
Latest Python Buzz Posts by Phillip Pearson
Latest Posts From Second p0st

Advertisement

Bob Wyman commented on the previous post and implied that PubSub doesn't use a tree-based method. Interesting!

30K messages per second would give you 80k CPU cycles per message, which sounds pretty reasonable.

If I were implementing a tree-based system, I'd do it the way Phillip Eby suggests - make the tree add-only, and either garbage collect or just rebuild the whole tree once in a while. If you think of the tree as being built out of nested hashtables, you could build it one subscription at a time by compiling the subscription into a standalone tree and effectively ORing it on top of the existing tree: where a branch doesn't exist, create it, otherwise follow the existing branch. This should be approximately linear with number of subscriptions (it will get a little slower as the tree grows as your hashtables won't be as fast, but hashtable lookups - which could be binary searches here, because we're just dealing with numbers - are O(log N), which isn't so bad).

My "bogged down" comment was referring more to the case when you only properly index the first "required word" of each subscription, instead of properly merging stuff into a tree. I believe this sort of algorithm could handle complex structured data, but I see what you mean - things get harder there. Matching simple attributes, like a PubSub search for "TITLE:foo", isn't so bad - you start off by dividing the post into its component attributes, and then run the search on each of them in turn. Alternatively you could build it into the tree - a search for "TITLE:foo" would start off matching the word "foo" and then after that, there would be a test to see if it was in the title. (Otherwise you would need to evaluate every word in the post twice - once with the attribute attached and once without, so a title containing "foo" would match searches for both "TITLE:foo" and "foo").

Something I was thinking of that didn't make it into the post was that "evaluating the whole expression tree" for each word doesn't have to mean doing it the slow way for complex queries. It could be done in a proper recursive manner - so the procedure for moving down the tree would be the same whether you are at the root or deeper inside. If the current node has 10K children, you do the query "backwards" - walk through the words in the post being matched and find matching child nodes that way - whereas if it only has 5, you evaluate it in the traditional way.

Comment

Read: More on "reverse" search

Topic: Python 2.4, Windows and MSVC7 Previous Topic   Next Topic Topic: String hash vs. Unicode hash

Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use