GitHub - howie6879/aspider: aspider - A lightweight, asynchronous micro-framework...


aspider

A lightweight, asynchronous micro-framework written with asyncio and aiohttp. It aims to make crawling URLs as convenient as possible.

Installation

pip install aspider

pip install git+https://github.com/howie6879/aspider

Usage

Request & Response

aspider provides an easy way to request a URL and get back a friendly response:

import asyncio

from aspider import Request

request = Request("https://news.ycombinator.com/")
response = asyncio.get_event_loop().run_until_complete(request.fetch())

# Output
# [2018-07-25 11:23:42,620]-Request-INFO  <GET: https://news.ycombinator.com/>
# <Response url[text]: https://news.ycombinator.com/ status:200 metadata:{}>
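Because fetch() is a coroutine, several requests can in principle be awaited together rather than one after another. The sketch below shows that pattern with plain asyncio. `fake_fetch` and `fetch_all` are hypothetical names, and `fake_fetch` only stands in for `Request(url).fetch()` so the example runs without network access; it is not part of aspider.

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for `Request(url).fetch()`: sleeps instead of using the network
    await asyncio.sleep(0.1)
    return f"<Response url: {url}>"


async def fetch_all(urls):
    # gather() runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


urls = ["https://news.ycombinator.com/", "https://news.ycombinator.com/news?p=2"]
responses = asyncio.run(fetch_all(urls))
for response in responses:
    print(response)
```

To try this against real pages, the body of `fake_fetch` could be swapped for `return await Request(url).fetch()`, assuming aspider is installed.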

JavaScript Support:

request = Request("https://www.jianshu.com/", load_js=True)
response = asyncio.get_event_loop().run_until_complete(request.fetch())
print(response.body)

Note: the first time you run the fetch() method, it will download a recent version of Chromium (~100MB). This only happens once.

Item

Let's take a look at a quick example of using Item to extract target data. Start off by adding the following to your demo.py:

import asyncio

from aspider import AttrField, TextField, Item


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        return value


items = asyncio.get_event_loop().run_until_complete(HackerNewsItem.get_items(url="https://news.ycombinator.com/"))
for item in items:
    print(item.title, item.url)

Run: python demo.py

Notorious ‘Hijack Factory’ Shunned from Web https://krebsonsecurity.com/2018/07/notorious-hijack-factory-shunned-from-web/
 ......
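To make the two field types more concrete, here is a rough standard-library sketch of the kind of extraction the example above describes: the text of each `a.storylink` element (what `TextField` yields as `title`) and its `href` attribute (what `AttrField` yields as `url`). `StoryLinkParser` and `sample_html` are hypothetical names invented for illustration. This is not how aspider is implemented, and it handles only this one selector.

```python
from html.parser import HTMLParser


class StoryLinkParser(HTMLParser):
    """Collects (text, href) pairs for <a class="storylink"> elements."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (title, url) pairs
        self._inside = False  # True while reading a matching <a> element
        self._href = None     # href of the link currently being read
        self._text = []       # text chunks seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "storylink" in classes:
            self._inside = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.links.append(("".join(self._text).strip(), self._href))
            self._inside = False


# Made-up markup shaped like the rows the selectors above target
sample_html = """
<table>
  <tr class="athing"><td><a class="storylink" href="https://example.com/a">First story</a></td></tr>
  <tr class="athing"><td><a class="storylink" href="https://example.com/b">Second story</a></td></tr>
</table>
"""

parser = StoryLinkParser()
parser.feed(sample_html)
for title, url in parser.links:
    print(title, url)
```

With Item, the same result comes from declaring the fields once on a class, which is why the aspider version above is so much shorter.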

Spider

To crawl multiple pages, use Spider.

Create hacker_news_spider.py:

import aiofiles

from aspider import AttrField, TextField, Item, Spider


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        return value


class HackerNewsSpider(Spider):
    start_urls = ['https://news.ycombinator.com/', 'https://news.ycombinator.com/news?p=2']

    async def parse(self, res):
        items = await HackerNewsItem.get_items(html=res.html)
        # Open the output file once per page and append every title to it
        async with aiofiles.open('./hacker_news.txt', 'a') as f:
            for item in items:
                await f.write(item.title + '\n')


if __name__ == '__main__':
    HackerNewsSpider.start()

Run hacker_news_spider.py:

[2018-07-11 17:50:12,430]-aspider-INFO  Spider started!
[2018-07-11 17:50:12,430]-Request-INFO  <GET: https://news.ycombinator.com/>
[2018-07-11 17:50:12,456]-Request-INFO  <GET: https://news.ycombinator.com/news?p=2>
[2018-07-11 17:50:14,785]-aspider-INFO  Time usage: 0:00:02.355062
[2018-07-11 17:50:14,785]-aspider-INFO  Spider finished!
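The log above suggests the general flow: each start URL is requested, and each response is handed to parse(). The toy model below sketches that start_urls / parse / start() pattern with plain asyncio. `MiniSpider`, `DemoSpider`, and the placeholder `fetch` method are hypothetical and exist only for illustration; aspider's real Spider may schedule and handle requests differently.

```python
import asyncio


class MiniSpider:
    """Toy model of the start_urls / parse / start() pattern; not aspider's code."""

    start_urls = []

    def __init__(self):
        self.results = []

    async def fetch(self, url):
        # Hypothetical stand-in for a real HTTP request
        await asyncio.sleep(0.05)
        return f"<html>{url}</html>"

    async def parse(self, html):
        # Subclasses override this to process each fetched page
        raise NotImplementedError

    async def _handle(self, url):
        html = await self.fetch(url)
        await self.parse(html)

    async def _run(self):
        # Fetch and parse all start URLs concurrently
        await asyncio.gather(*(self._handle(url) for url in self.start_urls))

    @classmethod
    def start(cls):
        spider = cls()
        asyncio.run(spider._run())
        return spider


class DemoSpider(MiniSpider):
    start_urls = ["https://example.com/1", "https://example.com/2"]

    async def parse(self, html):
        self.results.append(html)


spider = DemoSpider.start()
print(spider.results)
```

Since the pages are handled concurrently, the order in which parse() sees them should not be relied on.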

TODO

Custom middleware
JavaScript support
Friendly response

Contribution

Pull Request
Open Issue

Thanks

demiurge
