BTMash

Blob of contradictions

Mirroring a website using HTTrack

Wed, 05/25/2011 - 12:01 -- btmash
HTTrack is a free and easy-to-use offline browser utility.

At my workplace, there are often times when I am *not* working with Drupal (contrary to what people might think ^_^). From tuning servers to writing small scripts for batch tasks to building small applications that integrate with outside services such as Google Apps, the variety helps ensure that I don't fry my brain in one area :) I was recently asked to figure out a way to create an archive (on some sort of schedule) of one of our art journals. There were a few hurdles facing me on this task:

  • Codebase was not hosted on our servers.
  • Media (images, videos, etc) were hosted on an entirely separate server/service. (Amazon Cloud)
  • I did not have access to the servers/services.
  • The codebase was written in Ruby.
  • The database engine powering the site is PostgreSQL.
  • There is no support for PostgreSQL on campus.

So ultimately, we cannot host the same codebase on campus for the backups. And while I could be provided with a db dump, I would need to figure out a way to convert it to MySQL (or SQL Server if we hosted on a Microsoft server) and write an app that supports it. That sounds like a whole lotta work and is not very flexible.

So the only real option was to find some sort of mirroring tool (one that would also mirror whatever media on the other server was linked to in the pages). I had tried wget but was not getting the results I wanted: primarily, I could not get it to rewrite urls as relative links to the downloaded html files. I was also not getting localised links to files, as everything was assumed to be served from the root (which this archive was not going to be); if someone knows how, please post below! I was used to using SiteSucker on the Mac (which, oddly enough, crashed on the site; I'll get into that in a bit). In any event, that only runs on the Mac, and we needed a tool that would work on any Linux server. I could try to write something myself, but that would be an awful lot of reinventing the wheel (and it would probably be a square wheel given what I said above).
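For what it's worth, wget does have options aimed at the two problems above: --convert-links rewrites links to downloaded files as relative ones, and --adjust-extension saves pages with an .html suffix. The sketch below only prints the command by default (set RUN=1 to execute it). The wget_mirror helper and the /tmp/wget-mirror destination are made up for illustration, and I have not checked whether this combination of flags would have coped with this particular site.

```shell
#!/bin/sh
# Hypothetical helper: build a wget mirroring command and print it.
# Set RUN=1 in the environment to execute the command instead of printing it.
wget_mirror() {
    start_url=$1
    dest_dir=$2
    # Rebuild the function's argument list as the full wget command line.
    set -- wget \
        --mirror \
        --page-requisites \
        --convert-links \
        --adjust-extension \
        --span-hosts \
        --domains=example.org,s3.amazonaws.com \
        --directory-prefix="$dest_dir" \
        "$start_url"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

wget_mirror "http://example.org/" "/tmp/wget-mirror"
```

Here --span-hosts together with --domains is what lets the crawl leave example.org for the media host without wandering off to every other linked domain.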

I came across HTTrack and saw that it seemed to fit most of my needs:

  • Can be compiled on any machine and is also available as a Debian package (in the apt repositories). There is also a Windows executable.
  • Mirror a website.
  • Localise urls / convert urls to point to appropriate html files.
  • Grab content from other domains via a whitelist (in my case, from Amazon's storage cloud).
  • Blacklist paths to not follow (important later)
  • And much more... - see the immense Manual for more details

So I figured, let's give this a try. There are an immense number of options, but I figured my case should be relatively simple. My initial command was:

httrack "http://example.org/" -O "$backupDirectoryPath" "+*.example.org/*" "+http://s3.amazonaws.com/*" -v

There are quite a few things going on in this script so let me try and explain the pieces that I have in there:

  1. We first specify the starting point - in this case, it is "http://example.org/" (note that httrack prefers arguments containing urls and blacklist/whitelist patterns to be wrapped in quotes).
  2. We can specify where the mirror will go using the -O flag (note that $backupDirectoryPath is a path I defined earlier on in my script - I will post this up on pastebin so you can see what I had in full).
  3. The quoted arguments that start with "+" are url patterns that it is allowed to download from and process. If you do not specify any, it will only download from the domain of the starting url. You can technically tell it to download everything from every domain it comes across (so you could spider the entire web provided you have the room for it).
  4. The -v flag just means it will be verbose and tell you what is going on.

And that's it (as a start). I tried running the above and, while it did start to crawl through the site, it took far longer than I had anticipated (about 25 hrs in total) and used far more space than I expected (6.1 gigs last I checked). I also found that the links in the mirrored pages were broken; I'm not entirely sure why. The basic gist of the time and space problem was this: while there wasn't a huge amount of content on the site, the site had a lot of tags, and there was a page with tag-based faceted filtering where you could keep applying more tags even when there were no results. Last I checked, there were approximately 500 tags. Given that they could be applied in any order and that all of them could be combined in the filtering, there were on the order of 500 factorial combinations; suffice to say, it's a HUGE number. My guess is that, because of the number of files being created for processing, the crawl took up a lot of space and time and HTTrack was unable to process the files properly (so the real functionality that I wanted, pagers, was not working).
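To get a feel for how quickly this blows up, here is a small sketch that counts the possible filter urls for a handful of tags. It assumes that every ordered chain of distinct tags produces its own url, which is my reading of how the filter page behaved; count_tag_paths is just a name picked for this example.

```shell
#!/bin/sh
# Count the ordered, non-empty selections of distinct tags out of n:
#   n + n(n-1) + n(n-1)(n-2) + ... + n!
# Assumption: each such chain of tags maps to its own filter url.
count_tag_paths() {
    n=$1
    total=0
    term=1
    k=0
    while [ "$k" -lt "$n" ]; do
        # After this step, term is the number of chains of length k+1.
        term=$((term * (n - k)))
        total=$((total + term))
        k=$((k + 1))
    done
    echo "$total"
}

for tags in 3 5 10; do
    echo "$tags tags -> $(count_tag_paths "$tags") filter urls"
done
```

This prints 15 for 3 tags, 325 for 5 tags and 9864100 for 10 tags. Shell integer arithmetic overflows shortly after 20 tags, so 500 tags is hopelessly out of reach for any crawler.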

Luckily, HTTrack also allows blacklist patterns (they start with a minus), so urls that match the pattern will not be downloaded/processed (your local file will point to the original url, not a relative version, so the link will still work; be careful, as your user will then end up on the original site ^_^). Thus my final HTTrack command would be:

httrack "http://example.org/" -O "$backupDirectoryPath" "+*.example.org/*" "+http://s3.amazonaws.com/*" "-http://example.org/*/tags/*" -v

In other words: "Download from example.org to the directory I specify, and download everything from example.org and Amazon S3. However, avoid anything from example.org that fits the faceted filtering page format." So urls such as http://example.org/archives/recommended/tags/architecture are avoided, while I can still download first-level tagged pages such as http://example.org/archives?tags=California; so it is still fairly useful :) With this change, the mirroring now takes 1 hour (when the campus is busy) and occupies 200 megs of space. Big difference!
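A quick way to sanity-check a blacklist pattern before committing to a long crawl is to try it against a few sample urls. The sketch below uses the shell's own glob matching as a stand-in for HTTrack's wildcard matching, so treat it as an approximation only; is_blacklisted is a made-up helper, not part of HTTrack.

```shell
#!/bin/sh
# Approximate the "-http://example.org/*/tags/*" filter with a shell glob.
# Assumption: HTTrack's "*" behaves like the shell's here (matches any
# characters, slashes included); edge cases may differ.
is_blacklisted() {
    case $1 in
        http://example.org/*/tags/*) return 0 ;;
        *) return 1 ;;
    esac
}

for url in \
    "http://example.org/archives/recommended/tags/architecture" \
    "http://example.org/archives?tags=California"
do
    if is_blacklisted "$url"; then
        echo "skip  $url"
    else
        echo "fetch $url"
    fi
done
```

The first url is reported as skipped and the second as fetched, which matches the behaviour described above: only urls with a /tags/ path segment are excluded.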

We are now left with 'unwanted' links in the downloaded pages (contact/signup/login forms, the compounded tag pages which point to the live site) which we do not want the end user to see. These would have popped up regardless of the blacklisting. While I get the feeling that HTTrack can probably let other scripts plug in to its processing to do any extra work you would like, the manual is huge! And while I could write a php script to go through the files and strip these pieces out, I figured that I could get away with something simpler. So I decided instead to create a secondary 'cleanup' css file holding rules for the pieces I didn't want displayed to users (a bunch of css rulesets which contain display: none;). Once that was done, I go through the newly created archive's css files and append the contents of my cleanup.css file to each of them. Since there will be far fewer css files than html files, this is a quick task (especially given that it is simply appending content). So my final code looks like:

#!/bin/sh

# Back up the site into a dated archive directory.

today=$(date +%Y-%m-%d)
pathToCleanup="/path/to/cleanup.css"
backupDirectoryPath="/path/to/archive-hub/archive-$today"

# Mirror example.org plus its S3-hosted media, skipping the compounded tag pages.
httrack "http://example.org/" -O "$backupDirectoryPath" "+*.example.org/*" "+http://s3.amazonaws.com/*" "-http://example.org/*/tags/*" -v

# Append the cleanup rules to every stylesheet in the mirror.
for i in "$backupDirectoryPath"/example.org/stylesheets/*.css; do
    [ -f "$i" ] || continue   # skip if the glob matched nothing
    cat "$pathToCleanup" >> "$i"
done
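To make the cleanup step concrete, here is a self-contained sketch that runs the same append logic in a temporary directory, so it can be tried without touching a real archive. The selectors in the sample cleanup.css are hypothetical (the real ones depend on the site's markup), and append_cleanup is only a name chosen for this example.

```shell
#!/bin/sh
# Append a cleanup stylesheet to every css file in a directory.
append_cleanup() {
    cleanup_file=$1
    css_dir=$2
    for css in "$css_dir"/*.css; do
        [ -f "$css" ] || continue   # the glob matched nothing
        cat "$cleanup_file" >> "$css"
    done
}

# Demo in a throwaway directory standing in for the mirrored site.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/stylesheets"
printf 'body { color: black; }\n' > "$demo_dir/stylesheets/main.css"

# Hypothetical selectors; replace with the ones used by the real site.
cat > "$demo_dir/cleanup.css" <<'EOF'
/* Hide pieces that only make sense on the live site. */
#login-form, #signup-form, .contact-link { display: none; }
EOF

append_cleanup "$demo_dir/cleanup.css" "$demo_dir/stylesheets"
cat "$demo_dir/stylesheets/main.css"
rm -r "$demo_dir"
```

Because the rules are appended at the end of each stylesheet, they come later in the cascade than the site's own rules for the same selectors, which is what lets a plain display: none; win in most cases.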

You can view this with nicer formatting on Pastebin. If you see places where the code I wrote could be improved, or you have trouble understanding what is there, please comment! Your feedback will help make this better for anyone else who comes across this article.