

[Golang] Extract Text via State Machine and goquery
source link: http://siongui.github.io/2016/04/11/go-extract-text-via-state-machine-and-goquery/

Introduction
Extract text (here, a footnote section) from HTML using a state machine and goquery in Go.
Assume we have the following HTML:
index.html
<!DOCTYPE html>
<html>
<head><title>[Golang] Extract Footnote via State Machine</title></head>
<body>
<p>I am paragraph #1</p>
<p>I am paragraph #2</p>
<p>I am paragraph #3</p>
<p>I am paragraph #4</p>
<hr>
<div>Reference:</div>
<div>[1] I am footnote #1</div>
<div>[2] I am footnote #2</div>
<div>[3] I am footnote #3</div>
<hr>
<p>Updated: 2016-04-11</p>
</body>
</html>
We want to extract the footnote text, starting at the line that begins with Reference and stopping before the line that begins with Updated.
Install goquery Package
$ go get -u github.com/PuerkitoBio/goquery
Read HTML
html.go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseCommandLineArguments returns the value of the -input flag and
// exits if no path was given.
func parseCommandLineArguments() string {
	pPath := flag.String("input", "", "Path of HTML file to be processed")
	flag.Parse()
	path := *pPath
	if path == "" {
		fmt.Fprintf(os.Stderr, "Error: empty input file path!\n")
		os.Exit(1)
	}
	return path
}

func main() {
	inputFilePath := parseCommandLineArguments()
	f, err := os.Open(inputFilePath)
	if err != nil {
		panic("Fail to open " + inputFilePath + ": " + err.Error())
	}
	defer f.Close()

	// extractFootnote is defined in footnote.go.
	footnoteBody := extractFootnote(f)
	fmt.Println(footnoteBody)
}
Extract Text (Footnote)
Find all child nodes of the body element in the HTML document (Contents() returns text nodes as well as elements). Convert each child to text with the Text() method and feed the results to the state machine one at a time, treating each as a line. If a line starts with Reference, the state machine enters the InFootnote state and begins storing text, including the Reference line itself. If a line starts with Update (which also matches Updated) while in the InFootnote state, the state machine leaves that state and stops storing text. When every child has been processed, the text stored in the state machine is the text we want.
footnote.go
package main

import (
	"github.com/PuerkitoBio/goquery"
	"os"
	"strings"
)

// States of the state machine.
const (
	InFootnote = iota
	NotInFootnote
)

type StateMachine struct {
	State        int
	FootnoteBody string
}

// NewStateMachine returns a state machine that starts outside the footnote.
func NewStateMachine() *StateMachine {
	return &StateMachine{
		State: NotInFootnote,
	}
}

// ProcessLine updates the state from the line's prefix and stores the
// line while the machine is in the InFootnote state.
func (s *StateMachine) ProcessLine(line string) {
	if strings.HasPrefix(line, "Reference") {
		s.State = InFootnote
	}
	if strings.HasPrefix(line, "Update") && s.State == InFootnote {
		s.State = NotInFootnote
	}
	if s.State == InFootnote {
		s.FootnoteBody += line
	}
}

// extractFootnote parses the HTML in f and runs every child node of
// body through the state machine.
func extractFootnote(f *os.File) string {
	doc, err := goquery.NewDocumentFromReader(f)
	if err != nil {
		panic(err)
	}

	sm := NewStateMachine()
	doc.Find("body").Contents().Each(func(_ int, s *goquery.Selection) {
		sm.ProcessLine(s.Text())
	})
	return sm.FootnoteBody
}
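To watch the state transitions in isolation, the following is a minimal sketch that uses only the standard library and takes a slice of strings in place of the goquery selection. The names state and extractBetween are hypothetical and do not appear in the article's code. Unlike ProcessLine above, this sketch trims surrounding whitespace, skips empty lines, and joins the kept lines with newlines. Those are assumptions made for readable output, not the author's behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// state is a hypothetical named type for the two states used in the article.
type state int

const (
	notInFootnote state = iota
	inFootnote
)

// extractBetween keeps every line from one starting with startPrefix
// (inclusive) up to the next one starting with stopPrefix (exclusive).
// Assumption: lines are trimmed and empty lines are dropped.
func extractBetween(lines []string, startPrefix, stopPrefix string) string {
	st := notInFootnote
	var kept []string
	for _, raw := range lines {
		line := strings.TrimSpace(raw)
		switch {
		case st == notInFootnote && strings.HasPrefix(line, startPrefix):
			st = inFootnote
		case st == inFootnote && strings.HasPrefix(line, stopPrefix):
			st = notInFootnote
		}
		if st == inFootnote && line != "" {
			kept = append(kept, line)
		}
	}
	return strings.Join(kept, "\n")
}

func main() {
	// Roughly the text that the children of body would yield for index.html.
	lines := []string{
		"I am paragraph #4",
		"",
		"Reference:",
		"[1] I am footnote #1",
		"[2] I am footnote #2",
		"[3] I am footnote #3",
		"",
		"Updated: 2016-04-11",
	}
	fmt.Println(extractBetween(lines, "Reference", "Updated"))
}
```

Running this prints the Reference: line followed by the three footnote lines. Making the start state the zero value of the type is a design choice in this sketch, so that an uninitialized machine begins outside the footnote.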
Usage
Put the three files above (index.html, html.go, footnote.go) in the current directory, then run the following command:
$ go run html.go footnote.go -input=index.html
Tested on: Ubuntu Linux 15.10, Go 1.6.