The Artima Developer Community

Ruby Buzz Forum
Tracing websites for fun (and profit?)

Christian Neukirchen


Christian Neukirchen is a student from Biberach, Germany playing and hacking with Ruby.
Tracing websites for fun (and profit?) Posted: Dec 5, 2005 12:06 PM

This post originated from an RSS feed registered with Ruby Buzz by Christian Neukirchen.
Original Post: Tracing websites for fun (and profit?)
Feed Title: chris blogs: Ruby stuff
Feed URL: http://chneukirchen.org/blog/category/ruby.atom
Feed Description: a weblog by christian neukirchen - Ruby stuff

I recently stumbled on GotToZ.com, which really is a waste of time, but it looked like a challenge, so I tried it for a few minutes. The purpose of the site is to work your way through a tangled web of sites, each representing a letter of the alphabet, until you finally reach Z. I got up to Y manually, but then I decided Ruby could do that much better than I could. I encourage you to try it manually first, though. (Not like you’re wasting enough time already…)

require 'open-uri'

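@pages = {}            # page URL => sorted list of the pages it links to
@count = Hash.new 0    # page URL => number of times it has been fetched

def track(page)
  # Fetch every page at most three times; the links may differ between loads.
  return  if @count[page] > 2
  @count[page] += 1

  a = (@pages[page] ||= [])
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer takes URLs.
  URI.open(page).read.scan(
      /<A href="(http:\/\/.*?\.com\/)".*?>[A-Z]</m) { |e|
    a.push e.first
  }
  a.uniq!
  a.sort!
  STDERR.puts page       # progress goes to stderr, the graph to stdout
  a.each { |x| track x }
end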

track 'http://www.amongothers.com/'

puts "digraph {"
@pages.each { |k, v|
  v.each { |l| puts %Q[  "#{k}" -> "#{l}";] }
}
puts "}"
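To see what the regular expression picks up without touching the network, here is a minimal sketch that runs it on a made-up HTML fragment. The sample markup, the host names, and the helper name extract_links are all assumptions for illustration; the real pages may be structured differently.

```ruby
# Hypothetical markup; not copied from the actual site.
SAMPLE_HTML = <<~HTML
  <A href="http://www.example-b.com/" target="_top">B</A>
  <A href="http://www.example-c.com/">C</A>
  <A href="http://www.example-c.com/">C</A>
  <a href="http://www.ignored.com/">lowercase tag, not matched</a>
HTML

# Same pattern as in the script: an uppercase <A> tag pointing at a
# http://....com/ URL whose link text starts with a capital letter.
LINK_RE = /<A href="(http:\/\/.*?\.com\/)".*?>[A-Z]</m

# Hypothetical helper: returns the unique, sorted link targets.
def extract_links(html)
  html.scan(LINK_RE).map(&:first).uniq.sort
end

p extract_links(SAMPLE_HTML)
```

The duplicate C link collapses into one entry and the lowercase tag is skipped, because the pattern is case-sensitive.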

Run it, possibly a few times, because I think the Z site is added randomly (that’s also why every page is fetched up to three times), and save the standard output into a file. You now have a graph description that GraphViz can turn into nifty diagrams, like this (click for full 3884x3434 view, be careful):

[Image: Circo graph of GotToZ.com]
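The save-and-render step could look roughly like the sketch below. It assumes GraphViz's circo program is installed and on the PATH; the helper names to_dot and render and the file names gottoz.dot and gottoz.png are hypothetical, and the sample hash only mimics the shape of @pages.

```ruby
# Hypothetical helper: turn a hash of page => linked pages (the shape
# of @pages) into DOT text, in the same format the script prints.
def to_dot(pages)
  lines = ["digraph {"]
  pages.each { |k, v|
    v.each { |l| lines << %Q[  "#{k}" -> "#{l}";] }
  }
  lines << "}"
  lines.join("\n") + "\n"
end

# Hypothetical helper: write the DOT text to a file and ask circo to
# draw it as a PNG. Returns true only if circo ran and exited cleanly.
def render(dot_text, dot_file = "gottoz.dot", png_file = "gottoz.png")
  File.write(dot_file, dot_text)
  system("circo", "-Tpng", dot_file, "-o", png_file) == true
end

# Made-up sample data, not real crawl output.
sample = {
  "http://a.example.com/" => ["http://b.example.com/"],
  "http://b.example.com/" => ["http://a.example.com/"],
}
puts to_dot(sample)
```

Calling render(to_dot(@pages)) at the end of the script would replace the manual redirect-to-file step. Swapping circo for dot or neato gives other layouts.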

NP: Dire Straits—Walk Of Life
