A Concise Tutorial on colly, a Golang (Go) Crawler Framework, with Source Code Reading and Analysis - Jiajun的编程随想
source link: https://jiajunhuang.com/articles/2019_08_31-go_colly.md.html?
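I recently tried the crawler framework colly and found it very pleasant to use. Its design is worth studying, and its API is very concise. First, though, let's see how to use it.
Installing and using colly
To use colly in a project, install it first:
    $ go get -u github.com/gocolly/colly/...
Here is a simple demo: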
    package main

    import (
    	"fmt"

    	"github.com/gocolly/colly"
    	"github.com/gocolly/colly/extensions"
    )

    func main() {
    	c := colly.NewCollector(
    		colly.AllowedDomains("httpbin.org"), // restrict the domain; otherwise the crawl would follow links across the whole web
    	)

    	extensions.RandomUserAgent(c) // use a random User-Agent, ideally together with a proxy, to reduce the chance of being banned
    	extensions.Referer(c)

    	// Find and visit all links
    	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    		e.Request.Visit(e.Attr("href"))
    	})

    	c.OnRequest(func(r *colly.Request) {
    		fmt.Println("Visiting", r.URL)
    	})

    	c.Visit("http://httpbin.org/")
    }
Let's run it:
    $ go build -o main && ./main
    Visiting https://httpbin.org/
    ...
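colly ships with two handy extensions:
- extensions.RandomUserAgent(c): sends a random User-Agent with each request, imitating visits from different browsers.
- extensions.Referer(c): sets the Referer header on each request, which tells the server which page the visit came from.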
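To make the effect of these two extensions concrete, here is a minimal standard-library sketch of the same idea: decorating an outgoing request with a randomly chosen User-Agent and a Referer header. It is not colly's implementation. The names sampleAgents, setRandomUserAgent, setReferer, and newDemoRequest, as well as the User-Agent strings, are made up for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
)

// sampleAgents is a made-up pool of User-Agent strings for this sketch.
var sampleAgents = []string{
	"Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) ExampleBrowser/2.0",
}

// setRandomUserAgent picks one entry from agents at random and sets it
// as the request's User-Agent header.
func setRandomUserAgent(req *http.Request, agents []string) {
	req.Header.Set("User-Agent", agents[rand.Intn(len(agents))])
}

// setReferer records the page that led to this request.
func setReferer(req *http.Request, from string) {
	req.Header.Set("Referer", from)
}

// newDemoRequest builds a GET request (no network traffic happens here)
// and decorates it with both headers.
func newDemoRequest() *http.Request {
	req, err := http.NewRequest("GET", "http://httpbin.org/get", nil)
	if err != nil {
		panic(err)
	}
	setRandomUserAgent(req, sampleAgents)
	setReferer(req, "http://httpbin.org/")
	return req
}

func main() {
	req := newDemoRequest()
	fmt.Println("User-Agent:", req.Header.Get("User-Agent"))
	fmt.Println("Referer:", req.Header.Get("Referer"))
}
```

In colly itself you only call the two extension functions on the collector; the sketch just shows what kind of header changes to expect on each request.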
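colly has several commonly used callback methods:
- OnError: called when an error occurs. The callback signature is type ErrorCallback func(*Response, error).
- OnHTML: called when the response is HTML, once for each element matching the selector. The callback signature is type HTMLCallback func(*HTMLElement). OnXML is similar.
- OnRequest: called before a request is sent.
- OnResponse: called after a response is received.
- OnScraped: called once scraping of a response has finished, that is, after OnHTML and OnXML.
The order in which these callbacks run is covered in the source code analysis section.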
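Analyzing the colly source code
The goal of reading the colly source is to understand how colly is actually implemented. In the demo above, we first called c := colly.NewCollector() and then called c.Visit() to start crawling.
Let's start from NewCollector and see what a Collector is. Jump to the definition of NewCollector, and from there to the definition of Collector: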
    // Collector provides the scraper instance for a scraping job
    type Collector struct {
    	// UserAgent is the User-Agent string used by HTTP requests
    	UserAgent string
    	// MaxDepth limits the recursion depth of visited URLs.
    	// Set it to 0 for infinite recursion (default).
    	MaxDepth int
    	// AllowedDomains is a domain whitelist.
    	// Leave it blank to allow any domains to be visited
    	AllowedDomains []string
    	// DisallowedDomains is a domain blacklist.
    	DisallowedDomains []string
    	// DisallowedURLFilters is a list of regular expressions which restricts
    	// visiting URLs. If any of the rules matches to a URL the
    	// request will be stopped. DisallowedURLFilters will
    	// be evaluated before URLFilters
    	// Leave it blank to allow any URLs to be visited
    	DisallowedURLFilters []*regexp.Regexp
    	// URLFilters is a list of regular expressions which restricts
    	// visiting URLs. If any of the rules matches to a URL the
    	// request won't be stopped. DisallowedURLFilters will
    	// be evaluated before URLFilters
    	// Leave it blank to allow any URLs to be visited
    	URLFilters []*regexp.Regexp
    	// AllowURLRevisit allows multiple downloads of the same URL
    	AllowURLRevisit bool
    	// MaxBodySize is the limit of the retrieved response body in bytes.
    	// 0 means unlimited.
    	// The default value for MaxBodySize is 10MB (10 * 1024 * 1024 bytes).
    	MaxBodySize int
    	// CacheDir specifies a location where GET requests are cached as files.
    	// When it's not defined, caching is disabled.
    	CacheDir string
    	// IgnoreRobotsTxt allows the Collector to ignore any restrictions set by
    	// the target host's robots.txt file. See http://www.robotstxt.org/ for more
    	// information.
    	IgnoreRobotsTxt bool
    	// Async turns on asynchronous network communication. Use Collector.Wait() to
    	// be sure all requests have been finished.
    	Async bool
    	// ParseHTTPErrorResponse allows parsing HTTP responses with non 2xx status codes.
    	// By default, Colly parses only successful HTTP responses. Set ParseHTTPErrorResponse
    	// to true to enable it.
    	ParseHTTPErrorResponse bool
    	// ID is the unique identifier of a collector
    	ID uint32
    	// DetectCharset can enable character encoding detection for non-utf8 response bodies
    	// without explicit charset declaration. This feature uses https://github.com/saintfish/chardet
    	DetectCharset bool
    	// RedirectHandler allows control on how a redirect will be managed
    	RedirectHandler func(req *http.Request, via []*http.Request) error
    	// CheckHead performs a HEAD request before every GET to pre-validate the response
    	CheckHead         bool
    	store             storage.Storage
    	debugger          debug.Debugger
    	robotsMap         map[string]*robotstxt.RobotsData
    	htmlCallbacks     []*htmlCallbackContainer
    	xmlCallbacks      []*xmlCallbackContainer
    	requestCallbacks  []RequestCallback
    	responseCallbacks []ResponseCallback
    	errorCallbacks    []ErrorCallback
    	scrapedCallbacks  []ScrapedCallback
    	requestCount      uint32
    	responseCount     uint32
    	backend           *httpBackend
    	wg                *sync.WaitGroup
    	lock              *sync.RWMutex
    }
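As you can see, this struct gathers everything colly needs at runtime. Note a few important fields near the bottom:
- store storage.Storage: the storage backend. According to the Storage interface quoted later in this article, it holds the Collector's internal data, such as visited URLs and cookies. See the documentation for how to use it.
- the family of xxxCallbacks fields: this is where the callbacks we register are stored.
Now let's follow the demo's execution. Pick any of the calls, for example c.OnHTML: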
    // OnHTML registers a function. Function will be executed on every HTML
    // element matched by the GoQuery Selector parameter.
    // GoQuery Selector is a selector used by https://github.com/PuerkitoBio/goquery
    func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback) {
    	c.lock.Lock()
    	if c.htmlCallbacks == nil {
    		c.htmlCallbacks = make([]*htmlCallbackContainer, 0, 4)
    	}
    	c.htmlCallbacks = append(c.htmlCallbacks, &htmlCallbackContainer{
    		Selector: goquerySelector,
    		Function: f,
    	})
    	c.lock.Unlock()
    }
So registering a callback simply appends it to c.htmlCallbacks.
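The registration pattern above can be reproduced in a few lines. The following is a simplified model, not colly's code: miniCollector, callbackContainer, htmlCallback, and the selectors helper are hypothetical names, and the callback takes a plain string instead of an *HTMLElement.

```go
package main

import (
	"fmt"
	"sync"
)

// htmlCallback stands in for colly's HTMLCallback; in this sketch it
// receives plain text instead of an *HTMLElement.
type htmlCallback func(text string)

// callbackContainer pairs a selector with the function to run for it.
type callbackContainer struct {
	Selector string
	Function htmlCallback
}

// miniCollector keeps registered callbacks in a slice guarded by a lock.
type miniCollector struct {
	lock          sync.RWMutex
	htmlCallbacks []*callbackContainer
}

// OnHTML appends the callback; nothing is executed at registration time.
func (c *miniCollector) OnHTML(selector string, f htmlCallback) {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.htmlCallbacks = append(c.htmlCallbacks, &callbackContainer{
		Selector: selector,
		Function: f,
	})
}

// selectors returns the registered selectors in registration order.
func (c *miniCollector) selectors() []string {
	c.lock.RLock()
	defer c.lock.RUnlock()
	out := make([]string, 0, len(c.htmlCallbacks))
	for _, cb := range c.htmlCallbacks {
		out = append(out, cb.Selector)
	}
	return out
}

func main() {
	c := &miniCollector{}
	c.OnHTML("a[href]", func(text string) { fmt.Println("link:", text) })
	c.OnHTML("title", func(text string) { fmt.Println("title:", text) })
	fmt.Println(c.selectors())
}
```

Because registration only appends to a slice, callbacks for the same stage run later in the order they were registered, and the lock keeps registration safe when several goroutines share one collector.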
Next, look at c.Visit, which kicks off the whole process:
    // Visit starts Collector's collecting job by creating a
    // request to the URL specified in parameter.
    // Visit also calls the previously provided callbacks
    func (c *Collector) Visit(URL string) error {
    	if c.CheckHead {
    		if check := c.scrape(URL, "HEAD", 1, nil, nil, nil, true); check != nil {
    			return check
    		}
    	}
    	return c.scrape(URL, "GET", 1, nil, nil, nil, true)
    }
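The control flow of Visit is small enough to model directly. In this sketch, scrapeFunc is a hypothetical stand-in for c.scrape that only receives the HTTP method, and recordMethods is a helper invented here to show which methods get attempted.

```go
package main

import (
	"errors"
	"fmt"
)

// scrapeFunc stands in for the real request; it only receives the method.
type scrapeFunc func(method string) error

// visit mirrors the shape of Visit: an optional HEAD probe, then the GET.
func visit(checkHead bool, scrape scrapeFunc) error {
	if checkHead {
		if err := scrape("HEAD"); err != nil {
			return err // the GET is skipped when the probe fails
		}
	}
	return scrape("GET")
}

// recordMethods runs visit and returns the methods that were attempted.
func recordMethods(checkHead, failHead bool) ([]string, error) {
	var methods []string
	err := visit(checkHead, func(method string) error {
		methods = append(methods, method)
		if method == "HEAD" && failHead {
			return errors.New("HEAD failed")
		}
		return nil
	})
	return methods, err
}

func main() {
	fmt.Println(recordMethods(false, false)) // only GET is attempted
	fmt.Println(recordMethods(true, false))  // HEAD first, then GET
	fmt.Println(recordMethods(true, true))   // HEAD fails, GET never runs
}
```

The point of the HEAD probe is that a failing pre-check stops the visit before the body is downloaded.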
Next we follow the call into c.scrape:
    func (c *Collector) scrape(u, method string, depth int, requestData io.Reader, ctx *Context, hdr http.Header, checkRevisit bool) error {
    	if err := c.requestCheck(u, method, depth, checkRevisit); err != nil {
    		return err
    	}
    	parsedURL, err := url.Parse(u)
    	if err != nil {
    		return err
    	}
    	if parsedURL.Scheme == "" {
    		parsedURL.Scheme = "http"
    	}
    	if !c.isDomainAllowed(parsedURL.Host) {
    		return ErrForbiddenDomain
    	}
    	if method != "HEAD" && !c.IgnoreRobotsTxt {
    		if err = c.checkRobots(parsedURL); err != nil {
    			return err
    		}
    	}
    	if hdr == nil {
    		hdr = http.Header{"User-Agent": []string{c.UserAgent}}
    	}
    	rc, ok := requestData.(io.ReadCloser)
    	if !ok && requestData != nil {
    		rc = ioutil.NopCloser(requestData)
    	}
    	req := &http.Request{
    		Method:     method,
    		URL:        parsedURL,
    		Proto:      "HTTP/1.1",
    		ProtoMajor: 1,
    		ProtoMinor: 1,
    		Header:     hdr,
    		Body:       rc,
    		Host:       parsedURL.Host,
    	}
    	setRequestBody(req, requestData)
    	u = parsedURL.String()
    	c.wg.Add(1)
    	if c.Async {
    		go c.fetch(u, method, depth, requestData, ctx, hdr, req)
    		return nil
    	}
    	return c.fetch(u, method, depth, requestData, ctx, hdr, req)
    }
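The URL handling at the top of scrape (parse, default the scheme, check the domain) can be tried in isolation. This is a rough sketch using only the standard library: prepareURL, domainAllowed, and errForbiddenDomain are hypothetical names, and the whitelist check is a simplified guess at what isDomainAllowed does, based on the AllowedDomains comment that an empty list allows every domain.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// errForbiddenDomain plays the role of ErrForbiddenDomain in this sketch.
var errForbiddenDomain = errors.New("forbidden domain")

// domainAllowed reports whether host passes the whitelist.
// An empty whitelist allows every domain.
func domainAllowed(host string, allowed []string) bool {
	if len(allowed) == 0 {
		return true
	}
	for _, d := range allowed {
		if d == host {
			return true
		}
	}
	return false
}

// prepareURL parses raw, defaults a missing scheme to http, and applies
// the whitelist. It returns the normalized URL string.
func prepareURL(raw string, allowed []string) (string, error) {
	parsed, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if parsed.Scheme == "" {
		parsed.Scheme = "http"
	}
	if !domainAllowed(parsed.Host, allowed) {
		return "", errForbiddenDomain
	}
	return parsed.String(), nil
}

func main() {
	allowed := []string{"httpbin.org"}
	for _, raw := range []string{"//httpbin.org/get", "https://example.com/"} {
		u, err := prepareURL(raw, allowed)
		fmt.Printf("%q -> %q, err=%v\n", raw, u, err)
	}
}
```

With the whitelist set to httpbin.org, the scheme-less URL is normalized to http://httpbin.org/get, while the example.com URL is rejected before any request is built.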
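c.scrape first calls c.requestCheck to verify that the URL is acceptable, has not already been visited, and so on. It then parses the URL, checks whether the domain is allowed, checks robots.txt unless IgnoreRobotsTxt is set, assembles the request, and passes it to c.fetch. Let's keep following: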
    func (c *Collector) fetch(u, method string, depth int, requestData io.Reader, ctx *Context, hdr http.Header, req *http.Request) error {
    	defer c.wg.Done()
    	if ctx == nil {
    		ctx = NewContext()
    	}
    	request := &Request{
    		URL:       req.URL,
    		Headers:   &req.Header,
    		Ctx:       ctx,
    		Depth:     depth,
    		Method:    method,
    		Body:      requestData,
    		collector: c,
    		ID:        atomic.AddUint32(&c.requestCount, 1),
    	}
    	c.handleOnRequest(request)
    	if request.abort {
    		return nil
    	}
    	if method == "POST" && req.Header.Get("Content-Type") == "" {
    		req.Header.Add("Content-Type", "application/x-www-form-urlencoded")
    	}
    	if req.Header.Get("Accept") == "" {
    		req.Header.Set("Accept", "*/*")
    	}
    	origURL := req.URL
    	response, err := c.backend.Cache(req, c.MaxBodySize, c.CacheDir)
    	if proxyURL, ok := req.Context().Value(ProxyURLKey).(string); ok {
    		request.ProxyURL = proxyURL
    	}
    	if err := c.handleOnError(response, err, request, ctx); err != nil {
    		return err
    	}
    	if req.URL != origURL {
    		request.URL = req.URL
    		request.Headers = &req.Header
    	}
    	atomic.AddUint32(&c.responseCount, 1)
    	response.Ctx = ctx
    	response.Request = request
    	err = response.fixCharset(c.DetectCharset, request.ResponseCharacterEncoding)
    	if err != nil {
    		return err
    	}
    	c.handleOnResponse(response)
    	err = c.handleOnHTML(response)
    	if err != nil {
    		c.handleOnError(response, err, request, ctx)
    	}
    	err = c.handleOnXML(response)
    	if err != nil {
    		c.handleOnError(response, err, request, ctx)
    	}
    	c.handleOnScraped(response)
    	return err
    }
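This is where the callback order mentioned earlier comes in. Reading the code, the handlers are invoked in this order:
- c.handleOnRequest(request)
- c.handleOnError(response, err, request, ctx), after which fetch returns early if an error is reported
- c.handleOnResponse(response)
- c.handleOnHTML(response)
- c.handleOnXML(response)
- c.handleOnScraped(response)
With that, the order is clear at a glance.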
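To see this ordering in action without a network, here is a toy model of the pipeline. It is a simplification and not colly's code: miniCollector, trace, okDownload, and failingDownload are hypothetical names, the download function replaces the HTTP round trip, and both the HTML and XML stages always run, which the real handlers may not do for every response.

```go
package main

import (
	"errors"
	"fmt"
)

// miniCollector is a toy stand-in for colly's Collector: one slice of
// callbacks per stage and nothing else.
type miniCollector struct {
	onRequest  []func()
	onError    []func(err error)
	onResponse []func(body string)
	onHTML     []func(body string)
	onXML      []func(body string)
	onScraped  []func()
}

// fetch runs the stages in the order listed above.
// download is a hypothetical stand-in for the HTTP round trip.
func (c *miniCollector) fetch(download func() (string, error)) error {
	for _, f := range c.onRequest {
		f()
	}
	body, err := download()
	if err != nil {
		for _, f := range c.onError {
			f(err)
		}
		return err // later stages never run for a failed request
	}
	for _, f := range c.onResponse {
		f(body)
	}
	for _, f := range c.onHTML {
		f(body)
	}
	for _, f := range c.onXML {
		f(body)
	}
	for _, f := range c.onScraped {
		f()
	}
	return nil
}

var errDownload = errors.New("download failed")

func okDownload() (string, error)      { return "<html></html>", nil }
func failingDownload() (string, error) { return "", errDownload }

// trace registers one recording callback per stage, deliberately in a
// scrambled order, and returns the stage names in the order they fired.
func trace(download func() (string, error)) []string {
	var order []string
	c := &miniCollector{}
	c.onScraped = append(c.onScraped, func() { order = append(order, "OnScraped") })
	c.onXML = append(c.onXML, func(string) { order = append(order, "OnXML") })
	c.onError = append(c.onError, func(error) { order = append(order, "OnError") })
	c.onHTML = append(c.onHTML, func(string) { order = append(order, "OnHTML") })
	c.onResponse = append(c.onResponse, func(string) { order = append(order, "OnResponse") })
	c.onRequest = append(c.onRequest, func() { order = append(order, "OnRequest") })
	_ = c.fetch(download)
	return order
}

func main() {
	fmt.Println(trace(okDownload))      // [OnRequest OnResponse OnHTML OnXML OnScraped]
	fmt.Println(trace(failingDownload)) // [OnRequest OnError]
}
```

The callbacks are registered in a scrambled order on purpose: the stage order is fixed by fetch, not by the order in which the callbacks were registered.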
As for store, I'll leave it to the reader. It is quite simple: a backend only needs to implement this interface:
    // Storage is an interface which handles Collector's internal data,
    // like visited urls and cookies.
    // The default Storage of the Collector is the InMemoryStorage.
    // Collector's storage can be changed by calling Collector.SetStorage()
    // function.
    type Storage interface {
    	// Init initializes the storage
    	Init() error
    	// Visited receives and stores a request ID that is visited by the Collector
    	Visited(requestID uint64) error
    	// IsVisited returns true if the request was visited before IsVisited
    	// is called
    	IsVisited(requestID uint64) (bool, error)
    	// Cookies retrieves stored cookies for a given host
    	Cookies(u *url.URL) string
    	// SetCookies stores cookies for a given host
    	SetCookies(u *url.URL, cookies string)
    }
The official documentation lists storage backends for Redis, SQLite3, and others.
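To show how little a backend has to do, here is a hypothetical map-backed implementation of that interface. The interface is repeated so the sketch compiles on its own. memStorage, requestID, and mustParse are names invented for this example, and hashing the URL with FNV-1a to obtain the uint64 ID is an assumption of the sketch, not something the code quoted in this article confirms.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net/url"
	"sync"
)

// Storage repeats the interface quoted above so the sketch compiles alone.
type Storage interface {
	Init() error
	Visited(requestID uint64) error
	IsVisited(requestID uint64) (bool, error)
	Cookies(u *url.URL) string
	SetCookies(u *url.URL, cookies string)
}

// memStorage is a hypothetical map-backed implementation.
type memStorage struct {
	mu      sync.RWMutex
	visited map[uint64]bool
	cookies map[string]string // keyed by host
}

func (s *memStorage) Init() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.visited = make(map[uint64]bool)
	s.cookies = make(map[string]string)
	return nil
}

func (s *memStorage) Visited(requestID uint64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.visited[requestID] = true
	return nil
}

func (s *memStorage) IsVisited(requestID uint64) (bool, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.visited[requestID], nil
}

func (s *memStorage) Cookies(u *url.URL) string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.cookies[u.Host]
}

func (s *memStorage) SetCookies(u *url.URL, cookies string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.cookies[u.Host] = cookies
}

// requestID maps a URL to a uint64 key. FNV-1a is this sketch's choice.
func requestID(rawURL string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(rawURL))
	return h.Sum64()
}

// mustParse is a small demo helper; it panics on a malformed URL.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	var s Storage = &memStorage{}
	if err := s.Init(); err != nil {
		panic(err)
	}
	id := requestID("http://httpbin.org/")
	before, _ := s.IsVisited(id)
	_ = s.Visited(id)
	after, _ := s.IsVisited(id)
	fmt.Println(before, after) // false true

	s.SetCookies(mustParse("http://httpbin.org/"), "session=abc")
	fmt.Println(s.Cookies(mustParse("http://httpbin.org/get"))) // session=abc
}
```

A Redis or SQLite3 backend would follow the same shape, replacing the two maps with reads and writes against the external store.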