
Semi-automated PDF download

30 December 2023

Catching up on some admin tasks over the Christmas break, I found myself needing to download a bunch of PDF invoices from Vultr.com. I started off by checking their API, which gave me a list of all the invoice IDs but no way of downloading a PDF. I was unenthusiastic about spending 10 minutes clicking through and downloading each file manually via the web interface, so obviously I decided to take the only sensible route and spend several hours playing around with automating it via Chrome and Puppeteer instead.

Yes, XKCD did warn me.

[XKCD comic comparing time spent automating vs. time saved]

And yes, I went ahead anyway.

Puppeteer

My first attempt was just to get a list of invoice IDs from the API using Perl and then download each in turn using credentials from a logged-in browser session.

Unfortunately, this naïve approach fails because Vultr sits behind Cloudflare, which takes a dim view of requests from anything that isn't a real browser. This led me towards Puppeteer because, in addition to being able to drive its own hidden browser in “headless” mode, it can also connect to a normal Chrome instance and share control with the user.

This shared approach means I can start Chrome, navigate to the Vultr billing dashboard, log in manually and get past any (many!) required CAPTCHAs, and then fire Puppeteer at the logged-in browser to just automate the downloads. It’s somewhat vexing to be required to jump through these hoops to access my own account data, but that’s a rant for another day.

Once I’d settled on the correct approach, Puppeteer was refreshingly straightforward to work with.

First, require the libraries we’re going to use and set up a sleep function to introduce a pause between download requests, to avoid angering the capricious gods of Cloudflare protection:

// Pause between download requests, in milliseconds.
const sleepTime = 5000;

// node-fetch v3 is ESM-only, so pull it into this CommonJS script
// via a dynamic import.
const fetch = (...args) => import('node-fetch').then(({default: fetch}) => fetch(...args));
const commander = require('commander');
const puppeteer = require('puppeteer-extra');
puppeteer.use(require('puppeteer-extra-plugin-stealth')());

const syncfs = require('fs');
const fs = require('fs/promises');

function sleep(time) {
    return new Promise(resolve => setTimeout(resolve, time));
}

Next, we need a list of invoice IDs. I could have got this by scraping the billing page, but since I’d already written the code to fetch them in my initial Perl experiments I figured I’d stick with the approach of retrieving the list from the Vultr API. It’s likely to be more reliable than scraping, and as a bonus it gave me an opportunity to explore recursive generators in JavaScript, which I hadn’t played with before.

/*
 * Return a generator that returns invoice records from the Vultr API.
 * Each record is of the form:
 *   {
 *     "id": 20123123,
 *     "date": "2023-01-01T00:00:00+00:00",
 *     "description": "Invoice #20123123",
 *     "amount": 1.00,
 *     "balance": -10.0000000000
 *   }
 */
async function* invoiceIterator({ key, cursor }) {
    let url = 'https://api.vultr.com/v2/billing/invoices?per_page=50';
    if (cursor) { url = url + '&cursor=' + cursor; }

    const response = await fetch(url, {
        headers: {
            'Authorization': 'Bearer ' + key
        }
    });
    const data = await response.json();

    for (const invoice of data["billing_invoices"]) {
        yield invoice;
    }

    const next = data["meta"]["links"]["next"];
    if (next) {
        yield* invoiceIterator({ key, cursor: next });
    }
}

The key here is using yield* when recursing; it provides a very neat way of walking through a paginated API.
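
If you haven’t met yield* before, here’s a minimal, self-contained sketch of the same pattern against a made-up in-memory “API” (the fakePages array and its cursor handling are purely illustrative):

// A made-up paged data source: three "pages" of two items each,
// where `next` plays the role of the API's pagination cursor.
const fakePages = [
    { items: [1, 2], next: 1 },
    { items: [3, 4], next: 2 },
    { items: [5, 6], next: null }
];

function* pageIterator(cursor = 0) {
    const page = fakePages[cursor];
    yield* page.items;                  // yield* accepts any iterable
    if (page.next !== null) {
        yield* pageIterator(page.next); // delegate to the next page
    }
}

console.log([...pageIterator()]); // [ 1, 2, 3, 4, 5, 6 ]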

Given this generator, downloading the PDFs is trivial:

async function downloadInvoices(page, pdfDir, apiKey) {
    // Start from the billing dashboard so the invoice requests come
    // from within the logged-in session.
    await page.goto("https://my.vultr.com/billing/#billinghistory");
    const invoices = invoiceIterator({ key: apiKey });

    for await (const invoice of invoices) {
        const filename = `${pdfDir}/invoice_${invoice["id"]}.pdf`;
        if (syncfs.existsSync(filename)) {
            console.log(`${filename} already exists; skipping.`);
            continue;
        }
        console.log(invoice["id"]);
        try {
            await page.goto(`https://my.vultr.com/billing/invoice/${invoice["id"]}`);
        } catch (e) {
            // Chrome downloads the PDF instead of rendering the page, so
            // the navigation aborts with an error; that's expected here.
        }
        await sleep(sleepTime);
    }
}

The only wrinkle is that there’s no explicit way of requesting a download rather than navigating to a page. There is a Puppeteer plugin that offers some options which look useful for headless operation, but it didn’t seem to help in my “shared” mode, with Puppeteer driving an existing browser session.

What worked in the end was navigating to chrome://settings/content/pdfDocuments and configuring Chrome to download PDFs rather than open them itself.
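
Incidentally, the five-second sleep in downloadInvoices does double duty: it rate-limits the requests for Cloudflare’s benefit, and it gives Chrome time to finish writing each file. If you only needed the latter, a sketch like this could poll for the file instead of sleeping blindly; it leans on the same assumption as the skip-check above, namely that the downloads land as invoice_<id>.pdf:

// Sketch: poll until the downloaded file appears, rather than
// sleeping for a fixed interval; returns false on timeout.
async function waitForFile(filename, timeout = 30000, interval = 500) {
    const deadline = Date.now() + timeout;
    while (Date.now() < deadline) {
        if (syncfs.existsSync(filename)) return true;
        await sleep(interval);
    }
    return false; // caller decides whether to retry or move on
}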

While I was at it, I figured I might as well download the payment receipts too. These don’t appear in the API, so there’s no way of getting a simple list of them. Frustrating, but I guess it gives me an excuse to explore some simple scraping in Puppeteer.

The page layout is very straightforward, so we can just get a list of all the .billing-receipt elements, grab their hrefs, and open each one in turn. Once we’ve worked through all the download links on the current page, we click the navigation link to the next page (unless we’re already on the last one) and start again.

async function downloadReceipts(page, pdfDir) {
    await page.goto("https://my.vultr.com/billing/#billinghistory");

    do {
        const links = await page.$$eval(
            ".billing-receipt",
            (list) => list.map((elm) => elm.href)
        );

        for (const element of links) {
            console.log(element);
            try {
                await page.goto(element);
            } catch (e) {
                // As with the invoices, the aborted navigation just means
                // the PDF was downloaded rather than rendered.
            }
            await sleep(sleepTime);
        }

        // On the last page the "next" link points back at the current
        // URL, which gives us our stopping condition.
        const nextUrl = await page.$eval(
            ".pageoptions > a:last-child",
            (elm) => elm.href
        );

        if (nextUrl == page.url()) {
            break;
        }

        await page.click(".pageoptions > a:last-child");
        // Give the next page a moment to load before we scrape it.
        await sleep(sleepTime);
    } while (true);
}

Finally, we need to set up the browser and call both download functions:

async function main(wsEndpointURL, pdfDir, apiKey) {
    const browser = await puppeteer.connect({
        browserWSEndpoint: wsEndpointURL,
        defaultViewport: null  // don't resize the user's window
    });

    // Drive the first open tab rather than opening a new one.
    const page = (await browser.pages())[0];

    // page._client() is a private Puppeteer API; it hands back a raw
    // CDP session, which we use to point Chrome's downloads at pdfDir.
    await page._client().send(
        "Page.setDownloadBehavior",
        {
            behavior: "allow",
            downloadPath: pdfDir
        }
    );

    await downloadInvoices(page, pdfDir, apiKey);
    await downloadReceipts(page, pdfDir);
}
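
One caveat: _client() is a private Puppeteer API and has moved around between releases (it used to be a plain property). If it breaks in your version, the public route to a raw CDP session looks something like this, with the same Page.setDownloadBehavior call underneath:

// Equivalent to the above using the public API: create the CDP
// session explicitly instead of using the private accessor.
const client = await page.target().createCDPSession();
await client.send("Page.setDownloadBehavior", {
    behavior: "allow",
    downloadPath: pdfDir
});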

And one last IIFE to process command-line arguments and start the whole show:

(async () => {
    commander
        .argument('<dir>')
        .requiredOption('-k, --key <api_key>', 'Vultr API key')
        .option('-p, --port <chrome_port>', 'Chrome debugger port', 9222)
        .parse(process.argv);

    const pdfDir = commander.args[0];
    const apiKey = commander.opts()["key"];
    const url = `http://127.0.0.1:${commander.opts()["port"]}/json/version`;

    const response = await fetch(url);
    const data = await response.json();
    const webSocketDebuggerUrl = data["webSocketDebuggerUrl"];

    await main(
        webSocketDebuggerUrl,
        pdfDir,
        apiKey
    );

    // Exit explicitly; the open browser connection would otherwise
    // keep Node running.
    process.exit();
})();

With all this in place I can start Chromium with chromium --remote-debugging-port=9222, log in, and then trigger my Node script, which happily starts filling a directory with my PDF files.
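
For reference, the whole dance looks something like this (vultr-dl.js is just my name for the script above; pick your own):

$ chromium --remote-debugging-port=9222   # log in to my.vultr.com by hand
$ node vultr-dl.js --key $VULTR_API_KEY ./invoices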

Conclusion

The code above is obviously not production quality. It’s brittle, hacky, and woefully short on error handling. Nonetheless, it’s still useful; not every piece of code needs to be written as though it were going to be running on Voyager. Quick'n'dirty this might be, but it’s enough to sit in a terminal off to the side and quietly download one invoice every five seconds while I’m productively employed watching an episode of something instead.