
5 Tips To Create A More Reliable Web Crawler

 4 years ago
source link: https://www.tuicool.com/articles/QF7biiZ

When I crawl websites, getting blocked is probably the most annoying thing that can happen. To be really good at web crawling, you need more than the ability to write XPath or CSS selectors quickly; how you design your crawlers matters a lot, especially in the long run.

During my first year of crawling websites, I focused mostly on how to scrape a site. Being able to scrape the data, then clean and organise it, was already enough to make my day. Only after crawling more websites did I realise that there are 4 elements that matter most in a great web crawler.

Speed of the crawler

Can you scrape the data within the time you have?

Completeness of the data scraped

Do you manage to scrape all the data you want to scrape?

Accuracy of the data scraped

How can you ensure the data scraped is accurate?

Scalability of the web crawler

Can you scale the web crawler as the number of websites increases?
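
As a rough illustration, the first three elements can be turned into numbers you track after every crawl. The sketch below is only one possible way to do it, and everything in it is an assumption made for the example: the `CrawlReport` class, the `summarize` and `is_valid` helpers, and the `title` and `price` fields are hypothetical names, not part of any crawling library. Scalability is left out because it depends on how the crawler is deployed rather than on a single run.

```python
from dataclasses import dataclass


@dataclass
class CrawlReport:
    items_per_second: float  # speed: items scraped per second of crawl time
    completeness: float      # share of the expected items that were scraped
    accuracy: float          # share of scraped items that pass validation


def is_valid(record: dict) -> bool:
    """Hypothetical accuracy check: non-empty title and non-negative price."""
    title = record.get("title")
    price = record.get("price")
    return (
        isinstance(title, str)
        and bool(title.strip())
        and isinstance(price, (int, float))
        and price >= 0
    )


def summarize(records: list, expected_count: int, elapsed_seconds: float) -> CrawlReport:
    """Summarise one crawl run; every ratio falls back to 0.0 on empty input."""
    valid = sum(1 for record in records if is_valid(record))
    return CrawlReport(
        items_per_second=len(records) / elapsed_seconds if elapsed_seconds > 0 else 0.0,
        completeness=len(records) / expected_count if expected_count > 0 else 0.0,
        accuracy=valid / len(records) if records else 0.0,
    )


# Example run: 3 items scraped out of 4 expected, one with an empty title.
records = [
    {"title": "Item A", "price": 10.0},
    {"title": "", "price": 5.0},
    {"title": "Item C", "price": 7.5},
]
report = summarize(records, expected_count=4, elapsed_seconds=2.0)
print(report)
```

The expected item count usually has to come from the website itself, for example a result counter on a listing page, so treat the completeness figure as an estimate.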

