GitHub - biezhi/elves: Design and implementation of a lightweight crawler framework
source link: https://github.com/biezhi/elves
Elves
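The design and implementation of a lightweight crawler framework, with an accompanying blog post that analyzes it.
- Multi-threaded execution
- CSS selector and XPath support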
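To make "multi-threaded execution" concrete, here is a minimal sketch of downloading several URLs on a fixed-size thread pool using only the JDK. It is not Elves' internal code: the class `ParallelFetchSketch`, the `fetch` stub, and `fetchAll` are hypothetical names invented for this illustration, and the "download" is faked so the example runs offline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFetchSketch {

    // Hypothetical stand-in for a real HTTP download; it only returns a fake body.
    static String fetch(String url) {
        return "body of " + url;
    }

    // Submits one task per URL to a fixed-size pool and collects the bodies in input order.
    static List<String> fetchAll(List<String> urls, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> fetch(url)));
            }
            List<String> bodies = new ArrayList<>();
            for (Future<String> future : futures) {
                bodies.add(future.get()); // blocks until that task has finished
            }
            return bodies;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown(); // always release the worker threads
        }
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("https://example.com/a", "https://example.com/b");
        fetchAll(urls, 2).forEach(System.out::println);
    }
}
```

Collecting the futures in submission order keeps the results aligned with the input list even though the downloads themselves may finish in any order.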
Maven coordinates
<dependency>
<groupId>io.github.biezhi</groupId>
<artifactId>elves</artifactId>
<version>0.0.2</version>
</dependency>
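To run the project source locally, make sure you are on Java 8 and have the Lombok plugin installed.
Call flow diagram
Building a crawler takes the following steps:
- Write a crawler class that extends Spider
- Set the list of URLs to crawl
- Implement Spider's parse method
- Add a Pipeline to process the data filtered by parse
For example: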
// Note: `log` is not declared in this snippet; it is assumed to be provided
// elsewhere, for example by Lombok's @Slf4j annotation.
public class DoubanSpider extends Spider {

    public DoubanSpider(String name) {
        super(name);
        // Seed URLs: one Douban tag page per movie genre
        this.startUrls(
            "https://movie.douban.com/tag/爱情",
            "https://movie.douban.com/tag/喜剧",
            "https://movie.douban.com/tag/动画",
            "https://movie.douban.com/tag/动作",
            "https://movie.douban.com/tag/史诗",
            "https://movie.douban.com/tag/犯罪");
    }

    @Override
    public void onStart(Config config) {
        // Register a Pipeline that receives every item produced by parse
        this.addPipeline((Pipeline<List<String>>) (item, request) -> log.info("Saved to file: {}", item));
    }

    @Override
    public Result parse(Response response) {
        Result<List<String>> result = new Result<>();
        // Extract the movie titles from the listing table
        Elements elements = response.body().css("#content table .pl2 a");
        List<String> titles = elements.stream().map(Element::text).collect(Collectors.toList());
        result.setItem(titles);

        // Get the next page URL
        Elements nextEl = response.body().css("#content > div > div.article > div.paginator > span.next > a");
        if (null != nextEl && nextEl.size() > 0) {
            String nextPageUrl = nextEl.get(0).attr("href");
            // Queue the next page and parse it with this same method
            Request nextReq = this.makeRequest(nextPageUrl, this::parse);
            result.addRequest(nextReq);
        }
        return result;
    }

}

// Entry point: create the spider and start it with the default configuration
public static void main(String[] args) {
    DoubanSpider doubanSpider = new DoubanSpider("Douban Movies");
    Elves.me(doubanSpider, Config.me()).start();
}
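The example above returns a result that carries both an item (handed to the Pipeline) and follow-up requests (fetched and parsed in turn). The loop that ties these together might look roughly like the sketch below. This is a simplified, single-threaded model written against the JDK only, not Elves' actual engine: `CrawlLoopSketch`, `ParseResult`, and `crawl` are hypothetical names, and the downloader is a fake in-memory map rather than an HTTP client.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Function;

public class CrawlLoopSketch {

    // Hypothetical stand-in for a parse result: one extracted item plus follow-up URLs.
    static class ParseResult<T> {
        final T item;
        final List<String> nextUrls = new ArrayList<>();

        ParseResult(T item) {
            this.item = item;
        }
    }

    // Takes a URL from the queue, "downloads" it, parses the body, hands the item
    // to the pipeline, and enqueues any follow-up URL that has not been seen before.
    // Returns the number of pages that were parsed.
    static <T> int crawl(List<String> startUrls,
                         Function<String, String> downloader,
                         Function<String, ParseResult<T>> parser,
                         Consumer<T> pipeline) {
        Deque<String> queue = new ArrayDeque<>(startUrls);
        Set<String> seen = new HashSet<>(startUrls);
        int processed = 0;
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String body = downloader.apply(url);
            if (body == null) {
                continue; // nothing downloaded for this URL, skip it
            }
            ParseResult<T> result = parser.apply(body);
            pipeline.accept(result.item);
            for (String next : result.nextUrls) {
                if (seen.add(next)) { // add() is false for URLs already queued
                    queue.add(next);
                }
            }
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        // Fake site: each body is "title|nextUrl", with an empty nextUrl on the last page.
        Map<String, String> site = new HashMap<>();
        site.put("/page1", "First|/page2");
        site.put("/page2", "Second|");

        List<String> saved = new ArrayList<>();
        int pages = CrawlLoopSketch.<String>crawl(
            Arrays.asList("/page1"),
            site::get,
            body -> {
                String[] parts = body.split("\\|", -1);
                ParseResult<String> r = new ParseResult<>(parts[0]);
                if (!parts[1].isEmpty()) {
                    r.nextUrls.add(parts[1]);
                }
                return r;
            },
            saved::add);
        System.out.println(pages + " pages, items = " + saved);
        // prints: 2 pages, items = [First, Second]
    }
}
```

The `seen` set is what keeps pagination links that point back to earlier pages from causing an endless loop. A multi-threaded engine would need a thread-safe queue and set in place of the plain collections used here.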