Using google (or any other search engine) to generate persistent URLs is one of those obvious ideas
that make you wonder whether you came up with them on your own before being exposed to them elsewhere.
At any rate, I had never seen an implementation*1, so here's mine.
But first of all, some examples of the persistent URLs created by the script
shown below:
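Such links lean on Google's "I'm Feeling Lucky" redirection, triggered by the btnI query
parameter; a made-up one (the search terms here are purely illustrative) would look
roughly like this:

    http://www.google.com/search?q=eigenclass+persistent+urls&btnI=1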
The script proceeds roughly as follows:
- extract candidate search terms from the desired destination URL:
  - only consider text
  - try to find significant terms
- check against google, verifying that the chosen query is good enough (a sketch of this check follows the list)
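To make that last step concrete, here is a minimal sketch of such a check in Ruby.
good_enough? is a hypothetical helper, and relying on the btnI ("I'm Feeling Lucky")
redirect is an assumption rather than the script's actual method; Google may also
interpose consent or anti-bot pages nowadays, so treat it as an illustration:

    require 'net/http'
    require 'uri'

    # Hypothetical helper, not the original script's code: a query is
    # "good enough" when Google's I'm-Feeling-Lucky redirect for it
    # points at the desired destination URL.
    def good_enough?(terms, target)
      url = URI("http://www.google.com/search?" +
                URI.encode_www_form(q: terms.join(" "), btnI: "1"))
      res = Net::HTTP.get_response(url)
      # A 3xx response carries the destination in the Location header.
      res.is_a?(Net::HTTPRedirection) && res["location"].to_s.include?(target)
    end

    # e.g. good_enough?(%w[eigenclass ruby blog], "eigenclass.org")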
Extracting text from arbitrary HTML pages
There's no need for a full parse tree of the HTML: just the list of words that
would be considered by google will do.
I took some old code of mine from one of my very first (useful) Ruby scripts:
a filtering proxy that added hints
to German pages, inspired by jisyo.org, which does the same for Japanese
text. It just uses a number of regexps to reject unwanted
parts of the text until we're left with simple words.
It's not too inefficient, thanks to
strscan, and naïve as the regexps might seem, they work well in practice.
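A simplified sketch of that kind of filter, with illustrative stand-in regexps
rather than the original ones:

    require 'strscan'

    # Illustrative stand-in for the original filter: strip scripts, tags
    # and entities from raw HTML, keeping only plain words.
    def extract_words(html)
      s = StringScanner.new(html)
      words = []
      until s.eos?
        case
        when s.scan(%r{<script.*?</script>}mi) then next  # drop scripts wholesale
        when s.scan(%r{<style.*?</style>}mi)   then next  # drop stylesheets too
        when s.scan(/<[^>]*>/m)                then next  # drop any other tag
        when s.scan(/&\w+;|&#\d+;/)            then next  # drop HTML entities
        when s.scan(/[[:alpha:]]+/)            then words << s.matched
        else s.getch                                      # skip punctuation etc.
        end
      end
      words
    end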