Page snatcher 2

Posted by Aaron Feng Sun, 17 Feb 2008 03:02:00 GMT

A couple of months ago I wrote a utility that downloads a web page, along with its dependencies (CSS and images), to your hard drive. All the references in the web page are changed to point to your local copies.
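
As a rough illustration of that reference rewriting, the sketch below resolves a reference against the page URL with the standard library's URI.join and derives the file name for the local copy. The helper name local_reference and the example.com URLs are made up for this illustration; the utility itself builds its URLs by plain string concatenation.

```ruby
require 'uri'

# Hypothetical helper (not part of the utility): given the URL of the page and
# a reference found in it, return the absolute URL to download and the bare
# file name to use for the local copy.
def local_reference(page_url, ref)
  absolute = URI.join(page_url, ref)          # resolves relative and absolute refs
  local_name = File.basename(absolute.path)   # the path excludes any ?query string
  [absolute.to_s, local_name]
end

p local_reference("http://example.com/blog/", "../css/site.css?v=2")
p local_reference("http://example.com/blog/", "http://cdn.example.com/img/logo.png")
```

The second element of each pair is what would be written back into the href or src attribute of the saved page.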

I wrote it as a prototype in 30 to 40 minutes, so I'm sure there is room for improvement. I pointed it at a few web pages, such as Amazon, eBay, Google, and my blog, and it worked pretty well!

The code requires why the lucky stiff's Hpricot library and the rio gem. Without further ado, here is the code:

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'rio'

module Hpricot
  class Elem
    def is_css
      if self.name == "link"
        self["type"] == "text/css"
      else
        false
      end
    end
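    def is_full_path
      # "link" elements keep their URL in href, "img" elements in src
      attr_name = { "link" => "href", "img" => "src" }[self.name]
      return false unless attr_name
      # to_s guards against a missing attribute; https URLs are full paths too
      !(self[attr_name].to_s =~ %r{\Ahttps?://}).nil?
    end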
  end
end

if ARGV.size.zero?
  puts "Missing web page you wish to snatch."
  exit
end

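# The argument is expected to be a bare host name without a scheme, e.g. example.com
url_scheme = "http://"
url = ARGV[0]
# On Ruby 3.0 and later, open-uri no longer patches Kernel#open; use URI.open there
doc = Hpricot(open(url_scheme + url))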

Dir.mkdir(url) unless File.directory?(url)

doc.search("link") do |item|
  if item.is_css
    if item.is_full_path
      rio(item['href']) > rio(url)
    else
      rio(url_scheme + url + item['href']) > rio(url)
    end

    # nested style sheets in another style sheet
    css_path = File.dirname(item['href'])
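    # keep only the part up to ".css" (drops any query string); Array#to_s no longer
    # joins elements on Ruby 1.9+, so take the first match explicitly
    css_file = File.basename(item['href']).scan(/(.*?\.css)/m).flatten.first.to_s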

    # File.read opens, reads, and closes the downloaded style sheet in one call
    inner_css = File.read(File.join(url, css_file)).scan(/@import '(.*?\.css)';/m).flatten
    inner_css.each do |css|
      css_url = url_scheme + url + css_path + "/" + css
      rio(css_url) > rio(url)
    end

    item['href'] = css_file
  end
end

doc.search("img") do |item|
  if item.is_full_path
    rio(item["src"]) > rio(url)
  else
    rio(url_scheme + url + item["src"]) > rio(url)
  end
  item["src"] = item["src"].split("/")[-1]
end

File.open(url + "/" + url + ".html", "w") do |file|
  file << doc.to_s
end
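
The nested style sheet lookup above only matches imports written exactly as @import '...css'; with single quotes. A minimal sketch of a more tolerant pattern, assuming a style sheet might also use double quotes or the url(...) form, could look like the following. The names IMPORT_PATTERN and imported_stylesheets are made up for this illustration and are not part of the utility.

```ruby
# Hypothetical pattern: matches @import 'a.css';  @import "a.css";
# and @import url(a.css); (with or without quotes inside url()).
IMPORT_PATTERN = /@import\s+(?:url\(\s*)?['"]?([^'")\s;]+\.css)['"]?\s*\)?/i

# Returns the style sheet paths named by @import rules in the given CSS text.
def imported_stylesheets(css_text)
  css_text.scan(IMPORT_PATTERN).flatten
end

sample = <<CSS
@import 'reset.css';
@import "layout/grid.css";
@import url(theme.css);
@import url("print.css") print;
body { color: black; }
CSS

p imported_stylesheets(sample)
```

Each returned path would still need to be resolved against the location of the style sheet that imported it before it can be downloaded.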

Comments


  1. Alicia Sun, 17 Feb 2008 14:53:45 GMT

    The code is seamless. Great job! This is the base for all sites, isn't it?

  2. Aaron Feng Sun, 17 Feb 2008 18:08:16 GMT

    Thank you Alicia. Yes, some modification might be required to work for all sites.
