Tags: apple perl lowbandwidth

Posted: July 1, 2010

On large downloads and unreliable links...

So, I recently updated to iOS 4, and have a few ideas for applications I’d like to build for my 2nd generation iPod touch. Over to http://developer.apple.com/ to download the latest SDK. The SDK weighs in at just over 2GiB, which is quite large when you’re sharing a 384 kilobit connection with the rest of base; but that in itself isn’t a show stopper. All good things come to him who waits. And waits. And waits.

No, the size per se isn’t the problem. The problem is the combination of:

  1. a large file;
  2. an unreliable satellite link that stops several times a day;
  3. HTTP (the clue is in the name); and
  4. vendors who assume everyone has a high quality broadband connection.

Taking the simple approach and attempting to download the SDK from a machine on base is doomed to failure. Regardless of which browser or download manager you might try, the connection will stop repeatedly during the several days the download takes due to higher priority traffic flooding the link. HTTP just doesn’t cut it when it comes to resuming large binary downloads under such circumstances.

My usual approach in these cases is to download the file onto the virtual server that runs clarkema.org, and then use rsync over SSH to retrieve the file from the server back to a machine on base. rsync provides throttling, efficient data transfer, and, most importantly for me, reliable resumption of transfers following an interruption of the download. The system works very well, except when vendors demand that you log in before downloading a file.
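
A rough sketch of that rsync step might look like the following. The function name, host name, paths, bandwidth cap and the RETRY_DELAY variable are all placeholders for illustration rather than the actual setup. The idea is simply to keep the half-transferred file and keep retrying until rsync reports success.

```shell
#!/bin/sh
# Sketch only: fetch_resumable and RETRY_DELAY are hypothetical names.

# Pull a file over SSH with rsync, retrying until the transfer completes.
fetch_resumable() {
    src=$1                     # e.g. user@example.org:downloads/sdk.dmg (placeholder)
    dest=$2                    # local destination path
    limit=${3:-20}             # bandwidth cap passed to --bwlimit (KiB/s by default)
    delay=${RETRY_DELAY:-60}   # seconds to wait between attempts

    # --partial keeps an interrupted file so the next attempt can reuse it;
    # --bwlimit throttles the transfer; -e ssh runs it over SSH.
    until rsync --partial --bwlimit="$limit" -e ssh "$src" "$dest"; do
        echo "transfer interrupted; retrying in ${delay}s" >&2
        sleep "$delay"
    done
}

# Example call (placeholder host and paths):
#   fetch_resumable user@example.org:downloads/sdk.dmg ./sdk.dmg 20
```

The loop matters as much as the flags here: when the link drops, rsync exits with an error, and the next attempt carries on from what has already arrived instead of starting from scratch.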

Unfortunately, the ADC (Apple Developer Connection) does require a log-in. I could use links or a similar command-line web browser, but although I generally like them a lot, using a browser like that on a site like Apple’s or Adobe’s is an exercise only for the masochistic. Not to worry though – there’s always X Forwarding! A little bit of fiddling around with SSH allowed me to run Firefox on the clarkema.org server but have the display appear on a machine here on base. In theory, this should allow me to control the browser from here, but have the download go directly to my server over a fast, reliable link. Problem solved?

If only!

I managed to log into the ADC and start the download fine, but immediately noticed it wasn’t going as quickly as it should have been. It was certainly starting to appear in the expected directory on the server, just… slowly. Very slowly. At this point it was quite late, so I decided to leave the download running overnight and come back to it in the morning. Next day, the download had finished, so I started the process of moving it from the server onto base.

(Imagine some kind of calendar-flipping effect as time passes, if you like.)

Eventually the disk image arrived, only to declare itself corrupt when I tried to mount it on my laptop. DAMMIT. I didn’t really fancy trying exactly the same thing again, only to end up with another corrupt download. Fortunately, help was at hand in the shape of @CuddlyDragon. He offered to download the file himself (taking a grand total of *sob* 27 seconds *sniff*). So, you might ask, how does this help? It helps in two ways. Firstly, Peter could verify that his copy of the disk image mounted correctly. I could then have downloaded that copy from him, knowing that I would eventually have a working file, but that would have taken another couple of days. Secondly, having a known-good copy opens up a quicker way.

Other than the fact that my file was corrupted in a few places, it was largely the same as the file that Peter had. In other words, I already had most of the file, so there was no point downloading it all again. The incredibly handy ‘rdiff’ comes to the rescue in cases like this, and it works thusly:

First, I create a ‘signature’ describing the file I’ve got:

rdiff signature myfile myfile.sig

The signature file weighs in at 12MiB.

Next, I send Peter my signature file. He uses my signature file and his copy of the SDK download to generate what is effectively a binary patch, which can be applied to my file to turn it into a copy of his file.

rdiff delta myfile.sig petersfile deltafile

The delta file was 146MiB; much better than downloading a full 2GiB again! I applied the delta:

rdiff patch myfile deltafile newfile

A quick sha1sum confirmed that ‘newfile’ and ‘petersfile’ were identical, so I could finally install the SDK. However, this still left the question of why the initial download had been corrupt and had taken so long – clearly something about using Firefox wasn’t working.
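
That check can be wrapped up as a small shell function. This is only a sketch: verify_sha1 is a made-up name, it assumes GNU sha1sum is on the PATH, and the digest to compare against is whatever the person holding the good copy reports for their file.

```shell
#!/bin/sh
# Sketch only: verify_sha1 is a hypothetical helper, not part of any tool.

# Succeeds (exit 0) if FILE's SHA-1 digest equals the EXPECTED hex string.
# Returns 2 if the file cannot be read, 1 if the digests differ.
verify_sha1() {
    file=$1
    expected=$2
    [ -r "$file" ] || return 2
    # Reading from stdin makes sha1sum print "<digest>  -"; keep the digest only.
    actual=$(sha1sum < "$file" | cut -d' ' -f1)
    [ "$actual" = "$expected" ]
}

# Example call (the digest is a placeholder to be supplied by the other party):
#   verify_sha1 newfile "$digest_of_petersfile" && echo "files match"
```

Only the 40-character digest has to cross the slow link, which makes this a cheap way to confirm that a patched file matches the original.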

I decided to write a few lines of code using Perl and WWW::Mechanize, to see if I could download files from the ADC to my server reliably using that, instead of Firefox and X Forwarding.

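#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;

# The URL of the file to fetch is the only command-line argument.
my $download_url = shift or die "Usage: $0 DOWNLOAD_URL\n";

my $www = WWW::Mechanize->new();

# Load the member centre page, which should present the log-in form.
$www->get( 'http://developer.apple.com/membercenter' );

# Log in; the session cookies are kept by the Mechanize object.
$www->submit_form(
    form_name => 'appleConnectForm',
    fields    => {
        theAccountName => 'USERNAME',
        theAccountPW   => 'PASSWORD',
    }
);

# Fetch the download, writing the response body straight to the file 'download'.
$www->get( $download_url, ':content_file' => 'download' );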

This code throws up warnings about cookies from the depths of one of the Perl modules involved, but nevertheless it does actually work! It successfully downloaded the SDK to my server in just over 4 minutes.

The simple act of downloading a file really should not be this hard.