Parsing HTML in Go

nop4ss · 2015-08-06 20:00:07

This article was created on 2015-08-06 20:00:07; the information in it may have since evolved or changed.

There are two good libraries:

https://github.com/PuerkitoBio/goquery

http://code.google.com/p/go.net/html (since moved to golang.org/x/net/html)

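The html package is an HTML parser: it turns HTML text into a parse tree. goquery is built on top of the html package and combines it with the cascadia package (a CSS selector library) to provide jQuery-like functionality, which makes working with HTML very convenient.

Use goquery to find and select the HTML nodes you want. If you then need to modify or delete the selected nodes, you still have to work with the html package directly.

The html package parses HTML text into a tree made up of many Node values, and the core of any manipulation is operating on those Nodes.

A few examples illustrate this: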

doc, err := goquery.NewDocument("http://sports.sina.com.cn")

This creates a goquery document.

Find is the most commonly used goquery function. Like jQuery's $(), it selects parts of the DOM structure.

dhead := doc.Find("head")
dcharset := dhead.Find("meta[http-equiv]")
charset, _ := dcharset.Attr("content")

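This example reads the content attribute of the http-equiv meta tag in the head, which is where the page's charset is declared.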
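The content attribute of such a meta tag usually holds a full media type (something like text/html; charset=gb2312) rather than the bare charset name. The following is a minimal sketch of pulling the charset out with the standard library's mime.ParseMediaType. The helper name charsetFromContent and the sample input are assumptions made for illustration; they are not part of goquery or of the original code.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContent is a hypothetical helper: it extracts the charset
// parameter from the value of a <meta http-equiv="Content-Type"> content
// attribute. It returns "" when the value cannot be parsed as a media type
// or carries no charset parameter.
func charsetFromContent(content string) string {
	_, params, err := mime.ParseMediaType(content)
	if err != nil {
		return ""
	}
	// Parameter names are lowercased by ParseMediaType; the value is not,
	// so normalize it here.
	return strings.ToLower(params["charset"])
}

func main() {
	// Illustrative value, not taken from the real page.
	fmt.Println(charsetFromContent("text/html; charset=GB2312")) // prints gb2312
}
```

A flag such as the isgb2312 argument used later in this article could then be derived with a comparison like charsetFromContent(charset) == "gb2312", though how the original code computes it is not shown.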

logo := doc.Find("#retina_logo")

This selects a DOM element by its id in the HTML.

bread := doc.Find("div.blkBreadcrumbLink")

This selects the div elements in doc whose class is blkBreadcrumbLink.

var faceImg string
var innerImg = []string{}

dom_body.Find("div.img_wrapper").Each(func(i int, s *goquery.Selection) {
	imgpath, exists := s.Find("img").Attr("src")
	if !exists {
		// no img with a src under this wrapper; skip it
		return
	}

	if i == 0 {
		// remember the image from the first wrapper separately
		faceImg = imgpath
	}
	innerImg = append(innerImg, imgpath)
})

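This finds every div with class img_wrapper, looks up the img under each one, and collects its src; the src from the first div is also stored in faceImg. Here dom_body is assumed to be a *goquery.Selection obtained earlier (the original does not show where it comes from).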

dom_node := doc.Find("[bosszone='ztTopic']").Find("a")

This finds elements by attribute name and value.

To edit the HTML you need to work with html.Node. Here is a recursive function that cleans up a div:

func clear_dom(pn *html.Node, isgb2312 bool) error {
	var err error
	for nd := pn.FirstChild; nd != nil; {
		switch nd.Type {
		case html.ElementNode:
			tn := strings.ToLower(nd.Data)
			//fmt.Printf("element node: %s\n", tn)
			if tn == "script" || tn == "style" {
				// delete the element
				tmp := nd
				nd = tmp.NextSibling
				pn.RemoveChild(tmp)
			} else if tn == "a" {
				tmp := nd
				nd = nd.NextSibling

				if err = convert_dom(tmp, isgb2312); err != nil {
					return err
				}
			} else if tn == "span" {
				tmp := nd
				nd = nd.NextSibling

				if err = clear_dom(tmp, isgb2312); err != nil {
					return err
				}
			} else {
				tmp := nd
				nd = nd.NextSibling

				if err = convert_dom(tmp, isgb2312); err != nil {
					return err
				}
			}
		case html.CommentNode:
			tmp := nd
			nd = tmp.NextSibling
			pn.RemoveChild(tmp)
		case html.TextNode:
			tmp := nd
			nd = nd.NextSibling

			if err = convert_dom(tmp, isgb2312); err != nil {
				return err
			}
		default:
			nd = nd.NextSibling
		}
	}

	return nil
}

Here convert_dom transcodes the text of a node; if you do not need that, you can ignore it.

func Nodehtml(n *html.Node) string {
	var buf bytes.Buffer
	html.Render(&buf, n)
	return buf.String()
}

func Nodetext(node *html.Node) string {
	if node.Type == html.TextNode {
		// Keep newlines and spaces, like jQuery
		return node.Data
	} else if node.FirstChild != nil {
		var buf bytes.Buffer
		for c := node.FirstChild; c != nil; c = c.NextSibling {
			buf.WriteString(Nodetext(c))
		}
		return buf.String()
	}

	return ""
}

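The two functions above return a node's HTML and its text, respectively. The difference is that the HTML is the node's markup with the tags included, while the text is only the content inside that markup. For example, for a piece of HTML that wraps the word "例子" ("example") in tags, the text is just "例子".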

Source: OSChina blog (开源中国博客)

Thanks to the author: nop4ss