
Bypassing website anti-scraping protections


Some websites protect themselves from web scraping. However, sometimes it is still reasonable and fair (and based on a recent US court ruling also legit) to extract data from them. In this article, we'll go through the most commonly used anti-scraping protections and show you how to bypass them.

There are four main categories of protection against scraping:

  1. IP detection
  2. IP rate limiting
  3. Browser detection
  4. Tracking user behavior

IP detection

Some websites deny access to their content based on the location of your IP address. They only want to show their content to users from certain countries.

Another option is that some websites block access based on the IP range your address belongs to. This kind of protection is usually implemented to reduce the amount of non-human traffic. For instance, websites will deny access to IP ranges of Amazon Web Services and other commonly known ranges.

This kind of protection is usually easily bypassed by the use of a proxy server.

On the Apify platform, you can either use our pool of proxy servers based in the United States, ask us to order a custom dedicated pool from the countries you need, or use your own proxy servers from services like oxylabs.io or luminati.io.
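If you are writing your own Apify actor, a proxy can also be set directly when launching Puppeteer. Here is a minimal sketch using the proxyUrl option of Apify.launchPuppeteer(); the proxy address and credentials are placeholders, not a real service:

const Apify = require('apify');

Apify.main(async () => {
    // Launch Puppeteer through a proxy server
    // (replace the placeholder URL with your own proxy credentials)
    const browser = await Apify.launchPuppeteer({
        proxyUrl: 'http://username:password@proxy.example.com:8000',
    });
    const page = await browser.newPage();
    await page.goto('https://www.example.com');
    // ... process the page ...
    await browser.close();
});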

IP rate limiting

The second most common protection is to limit access based on how many requests were made from a single IP address in a certain period of time.

This kind of protection can be either manual (meaning a human checks the logs and blocks the IP if they see large volumes of traffic from the same IP address) or fully automatic.

For example, for google.com, you can typically make only around 300 requests per day, and if you reach this limit, you will see a CAPTCHA instead of search results. Another example could be a website which allows ten requests per minute and throws an error for anything above this threshold.

Protection like this can be temporary, but sometimes it can be permanent, especially if it is done manually by a human.

There are two ways to work around rate limiting. One option is to introduce delays in execution to make the crawling process slower. The second option is to use proxy servers and rotate IP addresses after a certain number of requests.

To introduce delays in your Apify crawler, you can change the Delay between requests setting in the Advanced settings tab of the crawler. Or you can use context.willFinishLater() and context.finish(data) with setTimeout in your Page function as follows:

function pageFunction(context) {
    // Tell crawler that result will be
    // passed in context.finish()
    context.willFinishLater();
    setTimeout(function(){
        var data = {};
        
        // process page and save data into data variable
        // ...
        
        // Finish and output data
        context.finish(data);
    }, 5 * 1000); // wait 5 seconds
}

In an Apify actor, you can introduce a delay before execution using the sleep() function from the Apify SDK as follows:

const Apify = require('apify');

Apify.main(async () => {    
    await Apify.utils.sleep(10 * 1000);
 
    // Any code below will be delayed by 10 seconds...
});

To use the second method and rotate proxy servers in your Apify crawler, go to the Advanced settings tab of the crawler, pick the Proxy groups you want to use, and then specify the Max pages per IP address setting.
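In an Apify actor, you can get a similar effect yourself by rotating through a list of proxy URLs and relaunching the browser after a fixed number of pages. The following is only a sketch - the proxy URLs are placeholders and the rotation logic is our own, not a built-in SDK feature:

const Apify = require('apify');

// Placeholder proxy URLs - replace with your own servers
const proxyUrls = [
    'http://username:password@proxy1.example.com:8000',
    'http://username:password@proxy2.example.com:8000',
];
const MAX_PAGES_PER_IP = 10;

Apify.main(async () => {
    const urls = [/* list of URLs to crawl */];
    let browser = null;

    for (let i = 0; i < urls.length; i++) {
        // Relaunch the browser with the next proxy every MAX_PAGES_PER_IP pages
        if (i % MAX_PAGES_PER_IP === 0) {
            if (browser) await browser.close();
            const proxyUrl = proxyUrls[(i / MAX_PAGES_PER_IP) % proxyUrls.length];
            browser = await Apify.launchPuppeteer({ proxyUrl });
        }
        const page = await browser.newPage();
        await page.goto(urls[i]);
        // ... process the page ...
        await page.close();
    }
    if (browser) await browser.close();
});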

Browser detection

Another relatively pervasive form of protection is based on the web browser that you are using.

User agents

Some websites use the detection of User-Agent HTTP headers to block access from specific devices. In the Apify crawler, there is a handy option in the Advanced settings tab called Rotate User-Agent headers. If you enable it, each request will have a different user agent. This usually solves user agent-based protection.

Apify actors don't have such a handy option, but if you are using the Apify SDK, then both the Apify.launchPuppeteer() function and PuppeteerCrawler accept a userAgent parameter. Here is an example of launching Puppeteer with a random user agent using the modern-random-ua NPM package:

const Apify = require('apify');
const randomUA = require('modern-random-ua');

Apify.main(async () => {
     // Set one random modern user agent for entire browser
    const browser = await Apify.launchPuppeteer({
        userAgent: randomUA.get(),
    });
    const page = await browser.newPage();
    // Or you can set user agent for specific page
    await page.setUserAgent(randomUA.get());
    // And work on your code here
    await page.close();
    await browser.close();
});

Blocked PhantomJS

Apify crawlers use PhantomJS to open web pages, but when you open a web page in PhantomJS, it adds variables to the window object that make it easy for browser detection libraries to figure out that the connection is automated and not from a real person. Usually, websites which employ protection against PhantomJS will either block these connections or, even worse, mark the used IP address as a robot and ban it.
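For illustration, a detection script on the website's side only needs to check a couple of well-known globals that PhantomJS injects into every page, such as window.callPhantom and window._phantom:

// Simplified example of how a website can detect PhantomJS
if (window.callPhantom || window._phantom) {
    // The visitor is almost certainly an automated PhantomJS browser
    console.log('PhantomJS detected');
}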

The only way to crawl websites with this kind of protection is to switch to a standard web browser like headless Chrome or Firefox. That's one of the reasons why we launched Apify actors.

Blocked headless Chrome with Puppeteer

Puppeteer is essentially a Node.js API to headless Chrome. Although it is a relatively new library, there are already anti-scraping solutions on the market that can detect its usage based on a variable it puts into the browser's window.navigator property.

Thankfully, we have developed a solution which removes the property from the web browser and thus prevents this kind of protection from figuring out that the browser is automated. This feature is available as the hideWebDriver() function in the Apify SDK.

Here is an example of how to use it:

const Apify = require('apify');

Apify.main(async () => {
    const browser = await Apify.launchPuppeteer({});
    const page = await browser.newPage();

    // Utility function which strips the variable
    // from the window.navigator object
    await Apify.utils.puppeteer.hideWebDriver(page);

    // The function needs to be placed before the goto() function,
    // so that the property is removed before the page is loaded
    await page.goto('https://www.example.com');

    // Add the rest of your code here...
    await page.close();
    await browser.close();
});
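If you are not using the Apify SDK, you can get a similar effect directly in Puppeteer. The sketch below overrides the navigator.webdriver property before any page script runs; it is a simplified approximation of what hideWebDriver() does, not its exact implementation:

// Hide the webdriver flag before any page script runs
await page.evaluateOnNewDocument(() => {
    Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined,
    });
});
await page.goto('https://www.example.com');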

Browser fingerprinting

Another option sometimes used by anti-scraping solutions is to create a unique fingerprint of the web browser and connect it using a cookie with the browser's IP address. Then if the IP address changes but the cookie with the fingerprint stays the same, the website will block the request.

In this way, sites are also able to track or ban fingerprints that are commonly used by scraping solutions - for example, Chromium with the default window size running in headless mode.

The best way to fight this type of protection is to remove cookies, change the parameters of your browser for each run, and switch to a real Chrome browser instead of Chromium.

Here is an example of how to launch Puppeteer with Chrome instead of Chromium using the Apify SDK:

const browser = await Apify.launchPuppeteer({
    useChrome: true,
});
const page = await browser.newPage();

This example shows how to remove cookies from the current page object:

// Get the current cookies from the page for a certain URL
const cookies = await page.cookies('https://www.example.com');
// And remove them
await page.deleteCookie(...cookies);

Note that the snippet above needs to be run before you call page.goto() !

And this is how you can randomly change the size of the Puppeteer window using the page.setViewport() function:

await page.setViewport({
    width: 1024 + Math.floor(Math.random() * 100),
    height: 768 + Math.floor(Math.random() * 100),
});

Finally, you can use Apify's base Docker image called Node.JS 8 + Chrome + Xvfb on Debian to make Puppeteer use full Chrome in non-headless mode via the X virtual framebuffer (Xvfb).
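With that image, you only need to launch the browser in headful mode and Xvfb provides the display. A minimal sketch, assuming the headless option is passed through to Puppeteer's launch options:

const browser = await Apify.launchPuppeteer({
    useChrome: true,   // use full Chrome instead of Chromium
    headless: false,   // Xvfb provides the virtual display
});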

Tracking user behavior

The last protection option sometimes employed by advanced anti-scraping solutions is to track the behavior of the user in order to detect anything that is not done by humans, like clicking on a link without actually moving a mouse cursor there. This kind of protection is commonly implemented together with browser fingerprinting and IP rate limiting by the most advanced anti-scraping solutions.

Bypassing this protection cannot be done with just a simple piece of code, but we have noticed that there are some patterns to look for, and if you find them, it is possible to bypass such protection. Here's what you need to do:

1) Check the website to see if it's saving data about your browser

You can do that by opening Chrome DevTools in your Chrome browser and going to the Network tab. Then switch to either the XHR or Img filter, as websites sometimes disguise tracking requests as image loads. Check whether POST requests are made when you open the page or carry out some action on it. If you find a request with strangely encoded data, you've hit the jackpot. Here's an example of what it might look like:

[Screenshot: a POST request with an encoded payload shown in the DevTools Network tab]

If you find a request like this one, you can check the payload value on a site like base64decode.org, and if it contains data about your browser, you've found the tracking request.
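You can also decode the payload locally instead of using an online service, for example in Node.js (the payload string below is just an illustrative placeholder):

// Decode a base64-encoded tracking payload locally in Node.js
const payload = 'eyJicm93c2VyIjoiQ2hyb21lIn0='; // placeholder value
console.log(Buffer.from(payload, 'base64').toString('utf8'));
// Prints: {"browser":"Chrome"}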

2) Block the tracking requests

The next step is to disable the tracking. For that, go to the view of all requests and check the Initiator column for that request. It usually shows the JavaScript file which initiated the call.

[Screenshot: the Initiator column in the DevTools Network tab pointing to the script that sends the tracking request]

You will need to block this file in order to disable the protection. Here's an example of how to do that in Puppeteer:

// Tell Puppeteer that you want to be able to block
// requests on this page
await page.setRequestInterception(true);

page.on('request', (request) => {
    const url = request.url();
    // Check whether the request is for the file we want to block;
    // if it is, abort it, otherwise let it continue
    if (url.endsWith('main.min.js')) request.abort();
    else request.continue();
});

Now try to run your Apify actor. If everything works, you've successfully bypassed the protection. If the page stops working properly, it means that the file contained other functions bundled with the protection. In that case, you can use the code above, but block the request that carries your browser data instead of the file that creates it.
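For example, if the tracking data is POSTed to an endpoint such as /track (a hypothetical path here - use the one you identified in the Network tab), a sketch like this aborts just that request:

await page.setRequestInterception(true);

page.on('request', (request) => {
    // '/track' is a hypothetical endpoint - replace it with the
    // path of the tracking request you found in DevTools
    if (request.method() === 'POST' && request.url().includes('/track')) {
        request.abort();
    } else {
        request.continue();
    }
});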

And that's all. If you find a website that still does not work even after you follow all these steps, let us know at [email protected] - we love new challenges :)

Happy crawling!

