Web scraping using a headless browser in NodeJS

July 2nd 2023
by Teri Eyenike (@terieyenike)

Too Long; Didn't Read

Web scraping collects and extracts unstructured data from a website into a more readable, structured format such as CSV. Organizations set restrictions on how users are allowed to collect their data, and nearly every website publishes its guiding principles in a robots.txt file.

Web scraping collects and extracts unstructured data from a website into a more readable, structured format like JSON, CSV, and more. Organizations set guiding principles specifying which endpoints may be scraped.

When scraping a website for personal use, it can be tedious to change the code manually every time, because most big-brand websites want to stop people from scraping their public data. Restrictions and obstacles such as CAPTCHAs, user-agent blocking (allowed and disallowed endpoints), IP blocking, and proxy network requirements may arise.

A practical use case of web scraping is notifying users of price changes for an item on sites like Amazon, eBay, etc.

In this article, you will learn how to use Bright Data’s **Scraping Browser** to unlock websites at scale without being blocked, thanks to its built-in unlocking capabilities.

Sandbox

Test and run the complete code in this sandbox.

Prerequisites

You will need the following to complete this tutorial:

  • Basic knowledge of JavaScript.
  • Node installed on your local machine; it is required to install dependencies.
  • A code editor, such as VS Code.

What is Bright Data?

Bright Data is a data collection and aggregation service with a massive network of internet protocol (IP) addresses and proxies for scraping information from websites, giving it the resources to avoid detection by the bots companies use to prevent data scraping.

In essence, Bright Data does the heavy lifting in the background because of the large datasets available on the platform, which removes the worry of being blocked or of failing to gain access to website data.

What is a headless browser?

A headless browser is a browser that operates without a graphical user interface (GUI). Modern web browsers like Chrome, Safari, Brave, and Firefox all have a graphical interface for interactivity and displaying visual content. A headless browser, by contrast, runs in the background, driven by developer-written scripts or the command-line interface (CLI); a quick CLI example follows the list below.

Using a headless browser for web scraping is essential because it allows you to extract data from any public website by simulating user behavior.

Headless browsers are suitable for the following:

  • Automated testing
  • Web scraping
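
You can see this headless behavior without writing any code. As a quick sketch (assuming a recent Chrome build is on your PATH; the binary name varies by platform: chrome, google-chrome, or chromium), the following command renders a page and prints its DOM without ever opening a window:

chrome --headless --dump-dom https://example.com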

Benefits of Puppeteer

Puppeteer is a Node.js library for driving a headless browser. The following are some of the benefits of using Puppeteer in web scraping (a short sketch follows the list):

  • Crawling single-page applications (SPAs)
  • Automated testing of website code
  • Clicking on page elements
  • Downloading data
  • Generating screenshots and PDFs of pages
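
As a small illustration of the clicking capability above, here is a minimal sketch. It assumes the full puppeteer package (which bundles Chromium) rather than the puppeteer-core package used later in this tutorial, and the URL and selector are placeholders:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");

  // Wait for an element to appear, then click it ("a" is a placeholder selector)
  await page.waitForSelector("a");
  await page.click("a");

  await browser.close();
})();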

Installation

Create a new folder for this app, and run the command below to initialize a Node project.

npm init -y

The command initializes the project and creates a package.json file containing all the dependencies and project information. The -y flag accepts all the defaults when initializing the app.

With the initialization complete, let’s install the nodemon dependency with this command:

npm install -D nodemon

Nodemon is a tool that will automatically restart the node application when the file changes.

In the package.json, update the scripts object with this code:

package.json

{
  ...
  "scripts": {
    "start": "node index.js",
    "start:dev": "nodemon index.js"
  },
  ...
}
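
With these scripts in place, once the index.js entry file exists (created below), you can start the app in watch mode during development:

npm run start:dev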

Next, create a file, index.js, in the directory's root, which will be the entry point for writing the script.

The other package to install is puppeteer-core, a version of the Puppeteer automation library that ships without a bundled browser and is used when connecting to a remote browser.

npm install puppeteer-core

Building with Bright Data’s Scraping Browser

Create an account on Bright Data to access all its services. For this project, the focus is the Scraping Browser functionality.

On your admin dashboard, click on Proxies and Scraping Infra.

proxies and scraping infra

Scroll to the bottom of the page and select the Scraping Browser. After that, click the Get started button from the proxy products listed.

Scraping browser

On opening the tool, give the proxy a name and click the Add Proxy button; when prompted about creating a new zone, select Yes.

naming the proxy

The next screen should be something like this, with the host, username, and password displayed.

host, username, and password

Now, click on the button </> Check out code and integration examples and on the next screen, select Node.js as the language of choice for this app.

Creating environment variables

Environment variables hold secret keys and credentials that should not be shared, hosted, or pushed to GitHub, in order to prevent unauthorized access.

Before creating the .env file in the root of the directory, let’s install this package:

npm install dotenv

Copy-paste this code into the .env file, replacing each placeholder value with the corresponding value from your Access parameters tab. Note that the username is used as the credential portion of the WebSocket URL below, so it must take the <username>:<password> form:

.env

USERNAME="<username>:<password>"
HOST="<host>"

Creating a web scraper using Puppeteer

Back to the entry point file, index.js, copy-paste this code:

index.js

const puppeteer = require("puppeteer-core");
require("dotenv").config();

// Bright Data credentials read from the .env file
const auth = process.env.USERNAME;
const host = process.env.HOST;

async function run() {
  let browser;
  try {
    // Connect to the remote Scraping Browser over WebSocket
    browser = await puppeteer.connect({
      browserWSEndpoint: `wss://${auth}@${host}`,
    });

    const page = await browser.newPage();
    // Allow up to 2 minutes for navigation before timing out
    page.setDefaultNavigationTimeout(2 * 60 * 1000);

    await page.goto("http://lumtest.com/myip.json");
    const html = await page.content();

    console.log(html);
  } catch (e) {
    console.error("run failed", e);
  } finally {
    // Always release the remote browser, even if scraping failed
    await browser?.close();
  }
}

if (require.main === module) run();

The code above does the following:

  • Imports the puppeteer-core and dotenv modules
  • Reads the secret variables into the auth and host variables
  • Defines the asynchronous run function
  • In the try block, connects Puppeteer to the remote endpoint via the browserWSEndpoint key
  • Opens a new browser page programmatically, through which the script can access page elements and fire events
  • Because navigation is asynchronous, setDefaultNavigationTimeout sets a navigation timeout of 2 minutes
  • Navigates to the page with the goto function, then reads the URL's content with the page.content() method
  • Closes the browser in the finally block, which is essential so the remote connection is released even if scraping fails

If you want to expand this project, you can capture the web pages as PNG screenshots or PDFs.
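
As a sketch of that extension (the file paths are placeholders, and PDF generation assumes the remote browser supports it), you could add the following inside the try block, after the page.goto call:

await page.screenshot({ path: "page.png", fullPage: true }); // full-page PNG screenshot
await page.pdf({ path: "page.pdf", format: "A4" }); // save the page as a PDF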

Check out the documentation to learn more.

Conclusion

Scraping the web with Bright Data infrastructure makes the process quicker for your use case; the anti-blocking work is already taken care of for you, so you don't have to write those scripts from scratch.

Try it today to explore the benefits of Bright Data over traditional web scraping tools, which are constrained by proxy networks and make it challenging to work with large datasets.

Resources

Scraping Browser documentation

Scrape at scale with Bright Data Scraping Browser

