icrawler: a powerful yet simple image crawler library

Original 大邓大邓和他的Python 2017-10-17 03:19

Basic usage of icrawler

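The framework ships with 6 built-in image crawlers.

Google
Bing
Baidu
Flickr
General website image crawler (greedy)
UrlList (crawls the images from a given list of URLs)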

Here is an example of using the built-in crawlers. The search engine crawlers share a similar interface.

from icrawler.builtin import BaiduImageCrawler 
from icrawler.builtin import BingImageCrawler 
from icrawler.builtin import GoogleImageCrawler 
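"""
parser_threads: number of parser threads, at most the number of CPUs
downloader_threads: number of downloader threads, at most the number of CPUs
storage: where the images are stored, given as a dict whose key is root_dir
keyword: the search term you would type into the search box
max_num: maximum number of images to download
"""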

# Google image crawler
google_storage = {'root_dir': '/Users/suosuo/Desktop/icrawler学习/google'}
google_crawler = GoogleImageCrawler(parser_threads=4,
                                    downloader_threads=4,
                                    storage=google_storage)
google_crawler.crawl(keyword='beauty',
                     max_num=10)


# Bing image crawler
bing_storage = {'root_dir': '/Users/suosuo/Desktop/icrawler学习/bing'}
bing_crawler = BingImageCrawler(parser_threads=2,
                                downloader_threads=4, 
                                storage=bing_storage)
bing_crawler.crawl(keyword='beauty',
                   max_num=10)


# Baidu image crawler
baidu_storage = {'root_dir': '/Users/suosuo/Desktop/icrawler学习/baidu'}

baidu_crawler = BaiduImageCrawler(parser_threads=2,
                                  downloader_threads=4,
                                  storage=baidu_storage)
baidu_crawler.crawl(keyword='美女', 
                    max_num=10)
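
Because the three search engine crawlers take the same constructor arguments and the same crawl() arguments, they can be driven from a single loop. The following is a minimal sketch, not part of the original example: build_storage and crawl_all_engines are hypothetical helper names, and the icrawler import is deferred into the function so that nothing is crawled until it is called.

```python
import os


def build_storage(base_dir, engine):
    """Return the storage dict for one engine, e.g. {'root_dir': 'images/bing'}."""
    return {'root_dir': os.path.join(base_dir, engine)}


def crawl_all_engines(keyword, max_num, base_dir):
    """Run the same search on Google, Bing and Baidu (needs icrawler and network access)."""
    # Imported here so that defining the function does not require icrawler.
    from icrawler.builtin import (BaiduImageCrawler, BingImageCrawler,
                                  GoogleImageCrawler)

    engines = {
        'google': GoogleImageCrawler,
        'bing': BingImageCrawler,
        'baidu': BaiduImageCrawler,
    }
    for name, crawler_cls in engines.items():
        # Same constructor and crawl() arguments for every engine.
        crawler = crawler_cls(parser_threads=2,
                              downloader_threads=4,
                              storage=build_storage(base_dir, name))
        crawler.crawl(keyword=keyword, max_num=max_num)
```

Calling crawl_all_engines('beauty', 10, 'images') would then fill one subdirectory per engine under images.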

GreedyImageCrawler

If you want to crawl images from a website that is not one of those above, use the greedy image crawler class and give it the target URL.

from icrawler.builtin import GreedyImageCrawler

storage= {'root_dir': '/Users/suosuo/Desktop/icrawler学习/greedy'}
greedy_crawler = GreedyImageCrawler(storage=storage)
greedy_crawler.crawl(domains='http://desk.zol.com.cn/bizhi/7176_88816_2.html', 
                     max_num=6)

UrlListCrawler

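If you already have the download URLs of the images, you can use UrlListCrawler directly. It can download with multiple threads, so the target data is fetched quickly.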

from icrawler.builtin import UrlListCrawler

storage={'root_dir': '/Users/suosuo/Desktop/icrawler学习/urllist'}
urllist_crawler = UrlListCrawler(downloader_threads=4, 
                                 storage=storage)

# Pass in a txt file of URLs, one per line.
urllist_crawler.crawl('url_list.txt')
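
The list file is plain text with one image URL per line. As a minimal sketch, assuming that format, the helper below writes such a file. The name write_url_list is hypothetical and any example.com URLs passed to it are placeholders.

```python
def write_url_list(path, urls):
    """Write one URL per line, skipping blanks and duplicates, and return the count."""
    seen = set()
    with open(path, 'w', encoding='utf-8') as fout:
        for url in urls:
            url = url.strip()
            if url and url not in seen:
                seen.add(url)
                fout.write(url + '\n')
    return len(seen)
```

A file written this way can be handed to urllist_crawler.crawl('url_list.txt') as in the example above.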

Defining your own image crawler

icrawler is easy to extend. The simplest way is to override three classes: Feeder, Parser and Downloader.

Feeder: feeds the crawler with the URLs to be crawled
Parser: requests each URL, receives its HTML, and extracts the image download URLs from that HTML
Downloader: downloads the images
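
How the three components hand work to each other can be pictured as two queues. The sketch below is a simplified, standard-library-only model of that flow, not icrawler's actual implementation: run_pipeline, fetch_image_urls and the _DONE sentinel are hypothetical names, and the "download" step only records the URL.

```python
import queue
import threading

_DONE = object()  # sentinel that tells the next stage there is no more work


def feeder(url_queue, page_urls):
    """Stage 1: put every page URL on the url queue."""
    for url in page_urls:
        url_queue.put(url)
    url_queue.put(_DONE)


def parser(url_queue, task_queue, fetch_image_urls):
    """Stage 2: turn each page URL into download tasks."""
    while True:
        url = url_queue.get()
        if url is _DONE:
            task_queue.put(_DONE)
            break
        for file_url in fetch_image_urls(url):
            task_queue.put({'file_url': file_url})


def downloader(task_queue, saved):
    """Stage 3: consume tasks (here we only record the URL)."""
    while True:
        task = task_queue.get()
        if task is _DONE:
            break
        saved.append(task['file_url'])


def run_pipeline(page_urls, fetch_image_urls):
    """Run one thread per stage and return the 'downloaded' URLs in order."""
    url_queue, task_queue, saved = queue.Queue(), queue.Queue(), []
    workers = [
        threading.Thread(target=feeder, args=(url_queue, page_urls)),
        threading.Thread(target=parser,
                         args=(url_queue, task_queue, fetch_image_urls)),
        threading.Thread(target=downloader, args=(task_queue, saved)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return saved
```

Each stage only knows about its input and output queue, which is why any one of the three classes can be swapped without touching the other two.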

Feeder

To override Feeder, the method to change is:

feeder.feed(self, **kwargs)

If you want to supply all the start URLs at once, for example from http://example.com/page_url/1 up to http://example.com/page_url/10, you can override Feeder like this:

from icrawler import Feeder

class MyFeeder(Feeder):
    def feed(self):
        for i in range(10):
            url = 'http://example.com/page_url/{}'.format(i + 1)

            # output() puts the url on the queue that the parser reads from,
            # much like yielding it to the next stage
            self.output(url)
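
Note that range(10) yields 0 to 9, which is why the code adds 1 to reach pages 1 to 10. A small sketch that makes the page range explicit, with page_urls as a hypothetical helper name and the same placeholder URL template:

```python
def page_urls(first, last, template='http://example.com/page_url/{}'):
    """Yield the page URLs from first to last, both inclusive."""
    for page in range(first, last + 1):
        yield template.format(page)


# Inside a Feeder subclass this could be used as:
#     def feed(self, first, last):
#         for url in page_urls(first, last):
#             self.output(url)
# with the arguments supplied via crawl(feeder_kwargs=dict(first=1, last=10)).
```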

Parser

To override Parser, the method to change is:

parser.parse(self, response, **kwargs)

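After a URL is requested we get its HTML, and the parser extracts the image download URLs from that HTML. The documentation suggests BeautifulSoup for the parsing. Here GoogleParser is used as an example: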

import json

from bs4 import BeautifulSoup
from icrawler import Parser

class GoogleParser(Parser):

    def parse(self, response):
        soup = BeautifulSoup(response.content, 'lxml')
        image_divs = soup.find_all('div', class_='rg_di rg_el ivg-i')
        for div in image_divs:
            meta = json.loads(div.text)
            if 'ou' in meta:

                # yield each parsed url as a dict; note that the key must be file_url
                yield dict(file_url=meta['ou'])
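
The same pattern applies to ordinary pages: find the image addresses and yield each one as a dict with a file_url key. The following is a sketch that uses only the standard library in place of BeautifulSoup. parse_image_urls and its arguments are illustrative names, and a real Parser.parse would receive a response object rather than an HTML string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def parse_image_urls(html, base_url):
    """Yield one {'file_url': ...} task per <img>, resolving relative paths."""
    collector = _ImgCollector()
    collector.feed(html)
    for src in collector.sources:
        yield dict(file_url=urljoin(base_url, src))
```

Resolving against base_url matters because many pages reference images with relative paths, which a downloader cannot fetch as they stand.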

Downloader

If you want to change the file names the images are saved under, override this method:

downloader.get_filename(self, task, default_ext)

By default the files are named from 000001 to 999999. Here are implementations of two other naming rules:

import base64

from icrawler import ImageDownloader
from icrawler.builtin import GoogleImageCrawler
from six.moves.urllib.parse import urlparse

class PrefixNameDownloader(ImageDownloader):

    def get_filename(self, task, default_ext):
        filename = super(PrefixNameDownloader, self).get_filename(
            task, default_ext)
        return 'prefix_' + filename


class Base64NameDownloader(ImageDownloader):

    def get_filename(self, task, default_ext):
        url_path = urlparse(task['file_url'])[2]
        if '.' in url_path:
            extension = url_path.split('.')[-1]
            if extension.lower() not in ['jpg', 'jpeg', 'png', 'bmp', 'tiff', 'gif', 'ppm', 'pgm']:
                extension = default_ext        
        else:
            extension = default_ext        
        # works for python 3
        filename = base64.b64encode(url_path.encode()).decode()
        return '{}.{}'.format(filename, extension)

google_crawler = GoogleImageCrawler(downloader_cls=PrefixNameDownloader,
                                    # downloader_cls=Base64NameDownloader,
                                    downloader_threads=4,
                                    storage={'root_dir': 'images/google'})

google_crawler.crawl('tesla', max_num=10)
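
One caveat with Base64 names: the standard Base64 alphabet includes '/', which cannot appear in a file name. A hedged variation, not taken from the icrawler documentation, is to use the URL-safe alphabet instead. The function name url_to_filename is hypothetical, it lowercases the extension (the class above keeps the original case), and the logic is pulled out of the downloader so it can be tried on its own.

```python
import base64
from urllib.parse import urlparse

IMAGE_EXTS = ('jpg', 'jpeg', 'png', 'bmp', 'tiff', 'gif', 'ppm', 'pgm')


def url_to_filename(file_url, default_ext):
    """Build '<urlsafe-base64 of the URL path>.<ext>' for an image URL."""
    url_path = urlparse(file_url).path
    # Take the text after the last dot as the extension, if there is a dot.
    extension = url_path.rsplit('.', 1)[-1].lower() if '.' in url_path else ''
    if extension not in IMAGE_EXTS:
        extension = default_ext
    # urlsafe_b64encode uses '-' and '_' instead of '+' and '/'.
    filename = base64.urlsafe_b64encode(url_path.encode()).decode()
    return '{}.{}'.format(filename, extension)
```

Inside a downloader, get_filename could then return url_to_filename(task['file_url'], default_ext).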

Crawler

Now we can put our own overridden Feeder, Parser and Downloader to use:

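from icrawler import Crawler, ImageDownloader

# MyFeeder and MyParser are your own Feeder and Parser subclasses
storage = {'backend': 'FileSystem', 'root_dir': 'images'}
crawler = Crawler(feeder_cls=MyFeeder,
                  parser_cls=MyParser,
                  downloader_cls=ImageDownloader,
                  downloader_threads=4,
                  storage=storage)

# feeder_kwargs are forwarded to MyFeeder.feed(), which must accept them;
# the download limit goes to the downloader through downloader_kwargs
crawler.crawl(feeder_kwargs=dict(arg1='blabla', arg2=0),
              downloader_kwargs=dict(max_num=1000))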

For more, see the icrawler documentation:

http://icrawler.readthedocs.io/en/latest/usage.html
