
[Golang] Extract Text via State Machine and goquery

source link: http://siongui.github.io/2016/04/11/go-extract-text-via-state-machine-and-goquery/

Introduction

Extract text (here, a footnote section) from an HTML document using a state machine and goquery in Go.

Assume we have the following HTML:

index.html

<!DOCTYPE html>
<html>
<head><title>[Golang] Extract Footnote via State Machine</title></head>
<body>
<p>I am paragraph #1</p>
<p>I am paragraph #2</p>
<p>I am paragraph #3</p>
<p>I am paragraph #4</p>
<hr>
<div>Reference:</div>
<div>[1] I am footnote #1</div>
<div>[2] I am footnote #2</div>
<div>[3] I am footnote #3</div>
<hr>
<p>Updated: 2016-04-11</p>
</body>
</html>

We want to extract the footnote text: everything from the line starting with Reference up to, but not including, the line starting with Updated.

Install goquery Package

$ go get -u github.com/PuerkitoBio/goquery

Read HTML

html.go

package main

import (
	"flag"
	"fmt"
	"os"
)

func parseCommandLineArguments() string {
	pPath := flag.String("input", "", "Path of HTML file to be processed")
	flag.Parse()
	path := *pPath
	if path == "" {
		// Without a path there is nothing to open, so report it and stop.
		fmt.Fprintln(os.Stderr, "Error: empty input file path!")
		os.Exit(1)
	}

	return path
}

func main() {
	inputFilePath := parseCommandLineArguments()

	f, err := os.Open(inputFilePath)
	if err != nil {
		// Include the underlying error so the cause is visible.
		panic("Failed to open " + inputFilePath + ": " + err.Error())
	}
	defer f.Close()

	footnoteBody := extractFootnote(f)
	fmt.Println(footnoteBody)
}

Extract Text (Footnote)

Select the child nodes of the body element with Contents() and convert each one to text with the Text() method. Feed the resulting pieces of text to the state machine one at a time. When a piece starts with Reference, the state machine enters the InFootnote state and begins storing text. When a piece starts with Update while the machine is in the InFootnote state, it leaves that state and stops storing text. Once every node has been processed, the text stored in the state machine is the footnote text we want. Note that Contents() also returns text nodes, so the line breaks between elements in the HTML source are stored as well, which keeps each footnote on its own line.

footnote.go

package main

import (
	"os"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

const (
	InFootnote = iota
	NotInFootnote
)

type StateMachine struct {
	State        int
	FootnoteBody string
}

func NewStateMachine() *StateMachine {
	return &StateMachine{
		State: NotInFootnote,
	}
}

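// ProcessLine updates the state from the line's prefix, then stores the
// line if the machine is in the InFootnote state.
func (s *StateMachine) ProcessLine(line string) {
	// A line starting with "Reference" opens the footnote section.
	if strings.HasPrefix(line, "Reference") {
		s.State = InFootnote
	}

	// A line starting with "Update" closes it; that line is not stored.
	if strings.HasPrefix(line, "Update") && s.State == InFootnote {
		s.State = NotInFootnote
	}

	if s.State == InFootnote {
		s.FootnoteBody += line
	}
}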

func extractFootnote(f *os.File) string {
	doc, err := goquery.NewDocumentFromReader(f)
	if err != nil {
		panic(err)
	}

	sm := NewStateMachine()
	doc.Find("body").Contents().Each(func(_ int, s *goquery.Selection) {
		sm.ProcessLine(s.Text())
	})

	return sm.FootnoteBody
}
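
To watch the state transitions in isolation, the same idea can be sketched with the standard library only, feeding plain strings instead of goquery nodes. This is a minimal illustration, not the code from the article: the names state, outside, inside and collect are hypothetical, and the input lines are typed in by hand rather than parsed from index.html.

```go
package main

import (
	"fmt"
	"strings"
)

// state is a hypothetical named type for this sketch; the article uses
// plain int constants instead.
type state int

const (
	outside state = iota // zero value: not collecting
	inside               // collecting footnote lines
)

// collect runs the lines through a two-state machine. It returns every line
// from the first one that starts with startPrefix up to, but not including,
// the next one that starts with stopPrefix.
func collect(lines []string, startPrefix, stopPrefix string) []string {
	st := outside
	var kept []string
	for _, line := range lines {
		// Decide the transition first, then act on the resulting state.
		switch {
		case st == outside && strings.HasPrefix(line, startPrefix):
			st = inside
		case st == inside && strings.HasPrefix(line, stopPrefix):
			st = outside
		}
		if st == inside {
			kept = append(kept, line)
		}
	}
	return kept
}

func main() {
	lines := []string{
		"I am paragraph #1",
		"Reference:",
		"[1] I am footnote #1",
		"[2] I am footnote #2",
		"Updated: 2016-04-11",
	}
	for _, l := range collect(lines, "Reference", "Update") {
		fmt.Println(l)
	}
}
```

Running this sketch prints the Reference: line followed by the two footnote lines. As in ProcessLine above, the line that opens the section is kept and the line that closes it is dropped.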

Usage

Put the three files above (index.html, html.go, footnote.go) in the current directory, then run the following command:

$ go run html.go footnote.go -input=index.html

Tested on: Ubuntu Linux 15.10, Go 1.6.



