Code With Wolf


Puppeteer JS Tutorial

What is Puppeteer JS

When it comes to web scraping, Python and Selenium have always gotten all the credit. I'm not knocking Python and Selenium. If you need to run automated tests on multiple browsers, they're definitely the way to go.

However, the new kid in town for web scraping jobs is Puppeteer, a JavaScript library.

Puppeteer runs the browser headless by default (that is, without a visible window) and can scrape web pages. Its docs say that it can do most things that a human could do manually in Chrome or a Chromium-based browser.
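
If you want to watch what the browser is doing while you debug, you can turn headless mode off. Here is a minimal sketch, not the only way to do it: launchVisible is a hypothetical helper name, and it takes the Puppeteer module as a parameter (an assumption of this sketch) so you can try it without changing anything else. The headless and slowMo launch options come from Puppeteer itself.

```javascript
// Sketch: launch Puppeteer with a visible window instead of the headless default.
// `puppeteerLib` is whatever require("puppeteer") returns; passing it in as a
// parameter is an assumption of this sketch, not something Puppeteer requires.
async function launchVisible(puppeteerLib) {
  return puppeteerLib.launch({
    headless: false, // show the browser window
    slowMo: 50, // slow each operation down by 50 ms so you can follow along
  });
}
```

You would call it as const browser = await launchVisible(require("puppeteer")); and then use the browser exactly as in the rest of this tutorial.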

Why I prefer Puppeteer JS

For scraping jobs, automated testing, or crawlers that need to interact with the DOM, I prefer Puppeteer. In my experience, it performs better than Selenium for this kind of work. It also lets you write JavaScript to interact with DOM elements just as you would if you were writing frontend code, which is something most web developers understand well.

How To Scrape the Web With Puppeteer JS

Let's build a simple scraper to get you started with Puppeteer.

What we will build

We will scrape https://quotes.toscrape.com. This is a sandbox that was built to teach people how to scrape the web.

We will scrape each quote and return an object with the quote, author, and an array of tags associated with the quote.
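
To make the target concrete, here is a rough sketch of what one scraped record could look like. The values are placeholders, not real data from the site, and isQuoteRecord is a hypothetical helper I'm adding only to show the expected shape.

```javascript
// Placeholder record showing the shape we are aiming for (values are made up).
const exampleQuote = {
  text: "An example quote.",
  author: "Example Author",
  tags: ["example", "placeholder"],
};

// Hypothetical helper: returns true when a record has the three expected fields.
function isQuoteRecord(record) {
  return (
    typeof record?.text === "string" &&
    typeof record?.author === "string" &&
    Array.isArray(record?.tags) &&
    record.tags.every((tag) => typeof tag === "string")
  );
}
```

A check like this is handy later if the site's markup changes and a field quietly comes back missing.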

Start Project

First, let's start a new Node project:

mkdir quote-scraper
cd quote-scraper
npm init -y
npm install puppeteer
touch index.js

Now that we have installed Puppeteer in our new Node.js project, let's open index.js and fill it out.

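// index.js

const puppeteer = require("puppeteer");

async function start() {
  let results = [];
  let browser;
  try {
    // Launch headless Chrome and open one tab that we reuse for every page.
    browser = await puppeteer.launch();
    const page = await browser.newPage();

    // The sandbox serves its quotes at /page/1 through /page/10.
    let n = 1;
    while (n <= 10) {
      await page.goto(`https://quotes.toscrape.com/page/${n}`);
      // This callback runs inside the browser, so it can use `document`.
      const current = await page.evaluate(() => {
        const items = document.querySelectorAll(".quote");
        const arr = Array.from(items);
        return arr.map((item) => ({
          text: item.querySelector(".text").textContent,
          author: item.querySelector(".author").textContent,
          // The third child of each quote holds the tags; drop empty entries.
          tags: Array.from(item.children[2].children)
            .map((c) => c.textContent)
            .filter((t) => t),
        }));
      });
      results = [...results, ...current];
      n++;
    }
    console.log(results);
  } catch (e) {
    console.error(e);
  } finally {
    // Close the browser even if a page fails, so the process can exit.
    if (browser) await browser.close();
  }
}

start();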

First we import Puppeteer. Then we use it to go to quotes.toscrape.com. Once we are there, we await page.evaluate, whose callback runs inside the page, so we can access the document object and interact with it just as we would if we were in the browser.

From there, we can pull the elements and text content that we want from the page.
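
If you would rather keep the extraction logic in a named function, something like the sketch below should work. extractQuotes is a hypothetical name, and the ".tag" selector is my assumption about the sandbox's markup (the script above reads the tags by position instead). The root parameter defaults to the page's document when the function runs in the browser, which also lets you try the function in Node with a stand-in object.

```javascript
// Runs in the browser when handed to page.evaluate; there, `root` falls back
// to the page's `document`. Outside the browser, pass a stand-in object.
function extractQuotes(root = document) {
  return Array.from(root.querySelectorAll(".quote")).map((item) => ({
    text: item.querySelector(".text").textContent.trim(),
    author: item.querySelector(".author").textContent.trim(),
    // ".tag" is an assumed class name for each tag link inside a quote.
    tags: Array.from(item.querySelectorAll(".tag")).map((tag) =>
      tag.textContent.trim()
    ),
  }));
}
```

In the scraper you would then write const current = await page.evaluate(extractQuotes); in place of the inline arrow function.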

I do this inside of a while loop and grab 10 pages' worth of quotes (100 quotes total, at 10 quotes per page).
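
The loop just counts from 1 to 10 and builds a URL for each page. As a small sketch of that step on its own, here is a helper that produces the list of URLs up front. pageUrls and BASE_URL are names I made up for illustration; the /page/N pattern is the one the script already uses.

```javascript
// Base address of the scraping sandbox used in this tutorial.
const BASE_URL = "https://quotes.toscrape.com";

// Hypothetical helper: builds the URLs for pages firstPage..lastPage inclusive.
function pageUrls(lastPage, firstPage = 1) {
  const urls = [];
  for (let n = firstPage; n <= lastPage; n++) {
    urls.push(`${BASE_URL}/page/${n}`);
  }
  return urls;
}
```

With this in place, the scraper could loop with for (const url of pageUrls(10)) and call page.goto(url) for each one.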

Now we can run the script:

node index.js

And in your console, you should see 100 quotes along with their authors and tags.
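
Scrolling through the whole array to check it is tedious, so a quick summary can help. This is only a sketch: summarize is a hypothetical helper, and it assumes each record has the author and tags fields built by the script above.

```javascript
// Hypothetical helper: counts quotes, distinct authors, and distinct tags.
function summarize(results) {
  const authors = new Set(results.map((quote) => quote.author));
  const tags = new Set(results.flatMap((quote) => quote.tags));
  return { quotes: results.length, authors: authors.size, tags: tags.size };
}
```

Calling console.log(summarize(results)) right after the loop prints the three counts on one line.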



© 2022 Code With Wolf