How to fetch authenticated CSVs with Google Chrome (Headless) in NodeJS


Recently I had a use case where we needed to log in to a third-party site and fetch protected CSV files. While PhantomJS (via CasperJS) can often accomplish this task, we have had stability issues with it: it crashes often, especially in Docker.

But that’s okay, because Google Chrome (Headless) is up to the task and has quite an awesome, simple API to use.

Here is a NodeJS sample using puppeteer to interact with Google Chrome (Headless). First, we need to install the puppeteer module.

npm install puppeteer --save

Once the module has been installed, create a JavaScript file (test.js in this example) with the following contents.

const puppeteer = require('puppeteer');
// placeholder credentials; replace these with your own login
const USER = 'user@example.com';
const PASS = 'password';

(async () => {
  // create a browser instance
  // executablePath points at a system-installed Chromium; omit it to use the Chromium that puppeteer downloads
  // the --no-sandbox and --disable-setuid-sandbox flags may be needed depending on your kernel support
  const browser = await puppeteer.launch({
    executablePath: '/usr/bin/chromium-browser',
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu'],
  });
  // create a page instance
  const page = await browser.newPage();

  // export the browser and page variables so we can debug using Chrome Developer Tools
  Object.assign(global, { browser, page });

  // output script console messages in our terminal console
  page.on('console', msg => console.log(`chrome[${msg.text()}]`));

  // connect to the site
  await page.goto('https://site.com/users/sign_in', { waitUntil: 'networkidle0' });

  // log in to the site
  await page.click('#user_email');
  await page.keyboard.type(USER);
  await page.click('#user_password');
  await page.keyboard.type(PASS);
  // start waiting for the navigation before clicking submit, so a fast redirect is not missed
  await Promise.all([
    page.waitForNavigation(),
    page.click('#new_user input[type="submit"]'),
  ]);

  // now we are going to tell the Chrome instance to use the fetch() function to download the content for us.
  // be sure to include the credentials so that any cookies and session variables are passed through and then
  // the downloaded content will be returned to our NodeJS script.
  const downloadUrl = 'https://site.com/fetch/file.csv';
  const downloadedContent = await page.evaluate(async downloadUrl => {
    const fetchResp = await fetch(downloadUrl, { credentials: 'include' });
    return await fetchResp.text();
  }, downloadUrl);

  console.log(`Downloaded: ${downloadedContent}`);

  await browser.close();
})();
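
One thing to watch for in the download step: fetch() only rejects on network failures, so a 403 or 500 response (for example, when the login did not actually succeed) would still resolve, and the error page body would be returned as if it were the CSV. Here is a minimal sketch of a stricter version. The helper name fetchTextOrThrow is hypothetical and not part of puppeteer; it only relies on the browser's global fetch().

```javascript
// Hypothetical helper. It is meant to run inside the page via page.evaluate(),
// so it must be self-contained and use only the browser's global fetch().
async function fetchTextOrThrow(url) {
  const resp = await fetch(url, { credentials: 'include' });
  if (!resp.ok) {
    // fetch() resolves on 4xx/5xx responses, so check the status explicitly
    throw new Error(`Download failed: HTTP ${resp.status} for ${url}`);
  }
  return resp.text();
}

// Usage inside the script above (assumes its `page` and `downloadUrl` variables):
//   const downloadedContent = await page.evaluate(fetchTextOrThrow, downloadUrl);
```

If the helper throws inside the page, the promise returned by page.evaluate() rejects in the NodeJS script, so the failure shows up there instead of a bogus download.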

Now we just need to run our sample and watch the output.

node test.js
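
If you would rather not keep the login hard-coded in USER and PASS, one option is to read it from environment variables. This is only a sketch: the helper readCredentials and the variable names SITE_USER and SITE_PASS are assumptions made for this example, so pick whatever names suit your setup.

```javascript
// Hypothetical helper: reads the login from environment variables instead of
// hard-coded constants. SITE_USER and SITE_PASS are assumed names.
function readCredentials(env = process.env) {
  const { SITE_USER: user, SITE_PASS: pass } = env;
  if (!user || !pass) {
    // fail early with a clear message rather than typing "undefined" into the form
    throw new Error('Set SITE_USER and SITE_PASS before running the script');
  }
  return { user, pass };
}

// Usage in the script above, replacing the two constants:
//   const { user: USER, pass: PASS } = readCredentials();
```

The script would then be started with both variables set in the shell that runs node test.js.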