Page snatcher 2
A couple of months ago I wrote a utility that downloads a web page, along with its dependencies (CSS and images), to your hard drive. All the references in the web page are changed to point to your local copies.
I wrote it as a prototype in 30 to 40 minutes, so I'm sure there is room for improvement. I pointed it at a few web pages, such as Amazon, eBay, Google, and my blog, and it worked pretty well!
The code requires Why the lucky stiff's Hpricot library, along with the rio gem. Without further ado, here is the code:
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'rio'
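# Extend Hpricot elements with two small helpers used below.
module Hpricot
  class Elem
    # True for <link> elements that point at a style sheet.
    def is_css
      name == "link" && self["type"] == "text/css"
    end

    # True when a <link> or <img> already carries an absolute URL.
    # Also accepts https:// and tolerates a missing href/src attribute.
    def is_full_path
      ref = case name
            when "link" then self["href"]
            when "img"  then self["src"]
            end
      !!(ref =~ %r{\Ahttps?://})
    end
  end
end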
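if ARGV.size.zero?
  puts "Missing web page you wish to snatch."
  exit
end

# The page is given without a scheme, e.g. "www.example.com".
url_scheme = "http://"
url = ARGV[0]
doc = Hpricot(open(url_scheme + url))

# Everything is saved into a directory named after the page.
Dir.mkdir(url) unless File.directory?(url)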
# Download every linked style sheet, plus any style sheets it imports.
doc.search("link") do |item|
  if item.is_css
    if item.is_full_path
      rio(item['href']) > rio(url)
    else
      rio(url_scheme + url + item['href']) > rio(url)
    end

    # nested style sheets in another style sheet
    css_path = File.dirname(item['href'])
    # keep the name up to ".css" (drops e.g. a trailing query string);
    # join rather than to_s, which returns "[...]" for arrays on Ruby 1.9+
    css_file = File.basename(item['href']).scan(/(.*?\.css)/m).flatten.join
    file = File.open(url + "/" + css_file, "r")
    inner_css = file.read.scan(/@import '(.*?\.css)';/m).flatten
    inner_css.each do |css|
      css_url = url_scheme + url + css_path + "/" + css
      rio(css_url) > rio(url)
    end
    file.close

    # point the page at the local copy
    item['href'] = css_file
  end
end
# Download every image and point the page at the local copy.
doc.search("img") do |item|
  if item.is_full_path
    rio(item["src"]) > rio(url)
  else
    rio(url_scheme + url + item["src"]) > rio(url)
  end
  item["src"] = item["src"].split("/")[-1]
end
# Save the rewritten page next to its assets.
File.open(url + "/" + url + ".html", "w") do |file|
  file << doc.to_s
end
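One rough edge worth noting: the script builds asset URLs by gluing strings together, so a reference like `css/main.css` (no leading slash) or `../img/logo.png` ends up as a broken URL. Below is a minimal sketch of one way to handle that with `URI.join` from Ruby's standard library. The helper names `absolute_asset_url` and `local_asset_name`, and the example.com URLs, are made up for illustration and are not part of the script above.

```ruby
require 'uri'

# Hypothetical helper (not in the original script): resolve a reference
# found in the page against the URL of the page itself.
def absolute_asset_url(page_url, ref)
  URI.join(page_url, ref).to_s
end

# Hypothetical helper: the file name the asset would get locally,
# ignoring any query string on the URL.
def local_asset_name(asset_url)
  File.basename(URI.parse(asset_url).path)
end

page = "http://example.com/blog/post.html"

puts absolute_asset_url(page, "css/main.css")
# → http://example.com/blog/css/main.css
puts absolute_asset_url(page, "/images/logo.png")
# → http://example.com/images/logo.png
puts absolute_asset_url(page, "../img/photo.jpg")
# → http://example.com/img/photo.jpg
puts local_asset_name("http://example.com/images/logo.png?v=2")
# → logo.png
```

References that are already absolute pass through `URI.join` unchanged, so with something like this the separate full-path check might not be needed at all. I have not wired it into the script, so treat it as a starting point.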
The code is seamless. Great job! This is the base for all sites, isn't it?
Thank you, Alicia. Yes, though some modification might be required for it to work on all sites.