The Artima Developer Community

Python Buzz Forum
URL longest prefix match

Phillip Pearson


Phillip Pearson is a Python hacker from New Zealand
URL longest prefix match Posted: Feb 28, 2005 12:09 AM

This post originated from an RSS feed registered with Python Buzz by Phillip Pearson.
Original Post: URL longest prefix match
Feed Title: Second p0st
Feed URL: http://www.myelin.co.nz/post/rss.xml
Feed Description: Tech notes and web hackery from the guy that brought you bzero, Python Community Server, the Blogging Ecosystem and the Internet Topic Exchange

For some reason right now I'm motivated to rewrite or tidy up some of the bits of the old Blogging Ecosystem code and release them as mini-projects. The other day's URL stemmer code was the first bit.

The next one will be a bit of C++ code (with a Python wrapper) that does longest prefix matching on parts of URLs using hashes. This sort of thing comes in handy for blog-crawler-type applications if you don't have a reliable stemmer (and I don't expect my URL stemmer to work 100% of the time). Basically, you give it a big list of blog URLs, and after that you can give it any URL and it will tell you whether that URL starts with one of the blog URLs.

It works by repeatedly querying the hash table with less and less of the URL each time. So if you give it http://foo.bar.com/users/1234/weblog/2002/04/01/#my-post, it will check whether any of the following match a known blog URL:

http://foo.bar.com/users/1234/weblog/2002/04/01/#my-post
http://foo.bar.com/users/1234/weblog/2002/04/01
http://foo.bar.com/users/1234/weblog/2002/04/
http://foo.bar.com/users/1234/weblog/2002/
http://foo.bar.com/users/1234/weblog/
http://foo.bar.com/users/1234/
http://foo.bar.com/users/
http://foo.bar.com/

It would probably match the one in bold above.
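
For anyone who wants to play with the idea before the C++ version exists, here is a rough sketch in plain Python. It is not the code described in this post: a built-in set stands in for the hash table, and the names candidate_prefixes and PrefixMatcher, the rule of cutting at each "/" after the host, and the two registered blog URLs are all made up for illustration. It also assumes the known blog URLs are stored with a trailing slash.

```python
def candidate_prefixes(url):
    """Yield the URL itself, then each shorter prefix ending in '/', longest first."""
    yield url
    # Skip past "scheme://" so we never cut inside the scheme or host name.
    scheme_end = url.find("://")
    host_start = scheme_end + 3 if scheme_end != -1 else 0
    end = len(url)
    while True:
        # Find the last '/' strictly before the current cut point.
        cut = url.rfind("/", host_start, end)
        if cut == -1:
            return
        prefix = url[:cut + 1]
        if prefix != url:
            yield prefix
        end = cut


class PrefixMatcher:
    """Hypothetical helper: a set of known blog URLs queried longest prefix first."""

    def __init__(self, blog_urls):
        # Assumption: a Python set plays the role of the hash table.
        self._known = set(blog_urls)

    def longest_prefix(self, url):
        """Return the longest known blog URL that `url` starts with, or None."""
        for prefix in candidate_prefixes(url):
            if prefix in self._known:
                return prefix
        return None


# Example data only; these registered blog URLs are invented for the demo.
matcher = PrefixMatcher([
    "http://foo.bar.com/users/1234/weblog/",
    "http://foo.bar.com/",
])
print(matcher.longest_prefix(
    "http://foo.bar.com/users/1234/weblog/2002/04/01/#my-post"))
# prints http://foo.bar.com/users/1234/weblog/
```

Each lookup costs one hash query per path segment, so the work depends on how deep the URL is rather than on how many blog URLs are loaded.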

You can do this with a database, but it seems that databases don't tend to squish everything into as small a space as they could, so you only get about 10% of the performance of C++ code like what I'll release (sometime).

Anyway - if you are interested, leave a comment here. I'll post about it on this blog when I'm done.

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use