ScrapeNinja Cheerio Live Sandbox

 1 year ago
source link: https://scrapeninja.net/cheerio-sandbox?slug=basic
Get started right from your browser: write cheerio extractor code and test it live!

Sample Data (HTML)
<html>
  <body>
    <div id="title">Title</div>
  </body>
</html>
Quick examples: basic, list of items, 3 ways of accessing elements, text nodes, HackerNews, Auto.de
Extractor
// define a function which accepts the HTML body and cheerio as args
function extract(input, cheerio) {
    // parse the HTML and get a jQuery-like query function
    const $ = cheerio.load(input);
    // return an object with the extracted values
    return {
        title: $('#title').text().trim()
    };
}
More cheerio examples can be found in the Cheerio Cheatsheet.

Extracted data

{
    "title": "Title"
}

This sandbox was created to make writing Node.js HTML scrapers more efficient. Debugging cheerio extractors in a local Node.js environment takes a while: it requires many runs across multiple test inputs to make sure the code works under all circumstances. Read more about why this sandbox was created, and how it works, in the blog post "Running untrusted JavaScript in Node.js".

How to write your perfect extractor

Websites change their HTML layouts and break scrapers. The perfect, bulletproof extractor is the one you didn't have to write, so before scraping HTML, check whether the website provides some sort of JSON API.

The extractor uses the cheerio Node.js package, so first of all read its documentation.

Cheerio is in many cases similar to jQuery, but with notable and sometimes annoying differences.

The best tool for finding and testing your CSS selectors is the Chrome DevTools console.

How to use in a real project:

You can use this extractor function with your local cheerio installation (this requires Node.js to be installed) or in the ScrapeNinja extractor field for the /scrape endpoint.

Running your extractor locally:

Step #1. Create a project folder and install node-fetch and cheerio

mkdir your-project-folder && \
cd "$_" && \
npm i -g create-esnext && \
npm init esnext && \
npm i node-fetch cheerio -y

Step #2. Copy and paste the code

Create a new empty file, for example scraper.js, and paste this code into it:

// recent cheerio releases have no default export, so use a namespace import
import * as cheerio from 'cheerio';

// paste the extractor function here
function extract(input, cheerio) { ... } // the extractor function can now be called as extract()

// retrieve your input from node-fetch or file system
const input = '<h2 class="title">YOUR TEST INPUT</h2>';

let results = extract(input, cheerio);


// the json data is now located in results variable
console.log(results);


Step #3. Launch

node ./scraper.js

Running your scraper with extractor in ScrapeNinja:

Just copy and paste the function code into the "extractor" field in the ScrapeNinja sandbox, then put the generated ScrapeNinja code into your local Node.js script.

