GitHub - jdkato/prose: A Golang library for text processing, including tokenizat...
source link: https://github.com/jdkato/prose
prose
prose is a natural language processing library (English only, at the moment) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.
You can find a more detailed summary of the library's performance here: Introducing prose v2.0.0: Bringing NLP to Go.
Installation
$ go get github.com/jdkato/prose/v2
Usage
Overview
    package main

    import (
        "fmt"
        "log"

        "github.com/jdkato/prose/v2"
    )

    func main() {
        // Create a new document with the default configuration:
        doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
        if err != nil {
            log.Fatal(err)
        }

        // Iterate over the doc's tokens:
        for _, tok := range doc.Tokens() {
            fmt.Println(tok.Text, tok.Tag, tok.Label)
            // Go NNP B-GPE
            // is VBZ O
            // an DT O
            // ...
        }

        // Iterate over the doc's named-entities:
        for _, ent := range doc.Entities() {
            fmt.Println(ent.Text, ent.Label)
            // Go GPE
            // Google GPE
        }

        // Iterate over the doc's sentences:
        for _, sent := range doc.Sentences() {
            fmt.Println(sent.Text)
            // Go is an open-source programming language created at Google.
        }
    }
The document-creation process adheres to the following sequence of steps:
    tokenization -> POS tagging -> NE extraction
                \
                 segmentation
Each step may be disabled (assuming later steps aren't required) by passing the appropriate functional option. To disable named-entity extraction, for example, you'd do the following:
    doc, err := prose.NewDocument(
        "Go is an open-source programming language created at Google.",
        prose.WithExtraction(false))
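The option mechanism can be pictured with a small, standard-library-only sketch of Go's functional-options pattern. Every name below (`pipelineOpts`, `option`, `withTagging`, `resolve`, and so on) is hypothetical and is not taken from prose's source. The dependency rule is one reading of "assuming later steps aren't required": a step that is still enabled keeps the earlier steps it relies on switched on.

```go
package main

import "fmt"

// pipelineOpts is a hypothetical stand-in for a document's configuration.
type pipelineOpts struct {
	tokenize bool
	segment  bool
	tag      bool
	extract  bool
}

// option is a functional option: a function that edits the configuration.
type option func(*pipelineOpts)

func withTokenization(on bool) option { return func(o *pipelineOpts) { o.tokenize = on } }
func withTagging(on bool) option      { return func(o *pipelineOpts) { o.tag = on } }
func withExtraction(on bool) option   { return func(o *pipelineOpts) { o.extract = on } }

// resolve starts with every step enabled, applies the options in order,
// and then re-enables any step that a later, still-enabled step needs
// (assumed rule: extraction needs tagging, tagging needs tokenization).
func resolve(opts ...option) pipelineOpts {
	cfg := pipelineOpts{tokenize: true, segment: true, tag: true, extract: true}
	for _, apply := range opts {
		apply(&cfg)
	}
	cfg.tag = cfg.tag || cfg.extract
	cfg.tokenize = cfg.tokenize || cfg.tag
	return cfg
}

func main() {
	// Extraction is the last step, so it can be disabled on its own.
	fmt.Printf("%+v\n", resolve(withExtraction(false)))
	// Tagging stays on here because extraction still requires it.
	fmt.Printf("%+v\n", resolve(withTagging(false)))
	// Disabling both leaves only tokenization and segmentation.
	fmt.Printf("%+v\n", resolve(withTagging(false), withExtraction(false)))
}
```

Under this reading, disabling an early step only takes effect once every step that depends on it is disabled as well.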
Tokenizing
prose includes a tokenizer capable of processing modern text, including the non-word character spans shown below.
Type | Example |
---|---|
Email addresses | Jane.Doe@example.com |
Hashtags | #trending |
Mentions | @jdkato |
URLs | https://github.com/jdkato/prose |
Emoticons | :-) , >:( , o_0 , etc. |
    package main

    import (
        "fmt"
        "log"

        "github.com/jdkato/prose/v2"
    )

    func main() {
        // Create a new document with the default configuration:
        doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
        if err != nil {
            log.Fatal(err)
        }

        // Iterate over the doc's tokens:
        for _, tok := range doc.Tokens() {
            fmt.Println(tok.Text, tok.Tag)
            // @jdkato NN
            // , ,
            // go VB
            // to TO
            // http://example.com NN
            // thanks NNS
            // :) SYM
            // . .
        }
    }
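To give a feel for what keeping such spans intact involves, here is a rough regular-expression sketch that uses only the standard library. It is not prose's tokenizer: the patterns, their order, and the `tokenize` helper are assumptions made for illustration, and they cover only a handful of emoticons and simple URLs.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tokenRE tries the alternatives in order at each position, so the
// multi-character spans are listed before plain words and punctuation.
var tokenRE = regexp.MustCompile(strings.Join([]string{
	`https?://[^\s,]+`,             // URLs (naive: runs to the next space or comma)
	`[\w.+-]+@[\w-]+(?:\.[\w-]+)+`, // email addresses
	`[@#]\w+`,                      // mentions and hashtags
	`>?[:;]-?[()DPp]`,              // a few common emoticons
	`\w+(?:'\w+)?`,                 // words, optionally with an apostrophe part
	`[^\s\w]`,                      // any other single symbol
}, "|"))

// tokenize returns the matched spans in the order they appear in text.
func tokenize(text string) []string {
	return tokenRE.FindAllString(text, -1)
}

func main() {
	for _, tok := range tokenize("@jdkato, go to http://example.com thanks :).") {
		fmt.Println(tok)
	}
}
```

The ordering matters: if the word pattern came first, an email address or URL would be broken apart at its punctuation before the more specific patterns had a chance to match.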
Segmenting
prose includes one of the most accurate sentence segmenters available, according to the Golden Rules created by the developers of the pragmatic_segmenter.
Name | Language | License | GRS (English) | GRS (Other) | Speed† |
---|---|---|---|---|---|
Pragmatic Segmenter | Ruby | MIT | 98.08% (51/52) | 100.00% | 3.84 s |
prose | Go | MIT | 75.00% (39/52) | N/A | 0.96 s |
TactfulTokenizer | Ruby | GNU GPLv3 | 65.38% (34/52) | 48.57% | 46.32 s |
OpenNLP | Java | APLv2 | 59.62% (31/52) | 45.71% | 1.27 s |
Stanford CoreNLP | Java | GNU GPLv3 | 59.62% (31/52) | 31.43% | 0.92 s |
Splitta | Python | APLv2 | 55.77% (29/52) | 37.14% | N/A |
Punkt | Python | APLv2 | 46.15% (24/52) | 48.57% | 1.79 s |
SRX English | Ruby | GNU GPLv3 | 30.77% (16/52) | 28.57% | 6.19 s |
Scapel | Ruby | GNU GPLv3 | 28.85% (15/52) | 20.00% | 0.13 s |
† The original tests were performed using a MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5, while prose was timed using a MacBook Pro 2.9 GHz Intel Core i7 running 10.13.3.
    package main

    import (
        "fmt"
        "strings"

        "github.com/jdkato/prose/v2"
    )

    func main() {
        // Create a new document with the default configuration:
        doc, _ := prose.NewDocument(strings.Join([]string{
            "I can see Mt. Fuji from here.",
            "St. Michael's Church is on 5th st. near the light."}, " "))

        // Iterate over the doc's sentences:
        sents := doc.Sentences()
        fmt.Println(len(sents)) // 2
        for _, sent := range sents {
            fmt.Println(sent.Text)
            // I can see Mt. Fuji from here.
            // St. Michael's Church is on 5th st. near the light.
        }
    }
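The example above is tricky because a period does not always end a sentence. The following standard-library sketch shows one naive way to handle that, assuming a tiny hand-written abbreviation list and a "next word starts lowercase" heuristic. It is not how prose segments text, and the names `abbreviations` and `splitSentences` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// abbreviations is a deliberately tiny, hypothetical list.
var abbreviations = map[string]bool{
	"Mt.": true, "St.": true, "Dr.": true, "Mr.": true, "Mrs.": true,
}

// splitSentences ends a sentence at a word that finishes with '.', '!' or '?'
// unless that word is a known abbreviation or the next word starts with a
// lowercase letter. Whitespace between words is normalized to single spaces.
func splitSentences(text string) []string {
	words := strings.Fields(text)
	var sents []string
	start := 0
	for i, w := range words {
		if i < len(words)-1 {
			if !strings.ContainsAny(w[len(w)-1:], ".!?") || abbreviations[w] {
				continue
			}
			next, _ := utf8.DecodeRuneInString(words[i+1])
			if unicode.IsLower(next) {
				continue
			}
		}
		// Either a sentence boundary was found or this is the last word.
		sents = append(sents, strings.Join(words[start:i+1], " "))
		start = i + 1
	}
	return sents
}

func main() {
	text := "I can see Mt. Fuji from here. St. Michael's Church is on 5th st. near the light."
	for _, s := range splitSentences(text) {
		fmt.Println(s)
	}
}
```

A heuristic this small fails quickly on real text (initials, ellipses, quotations), which is the kind of case the Golden Rules are designed to exercise.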
Tagging
prose includes a tagger based on TextBlob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:
Library | Accuracy | 5-Run Average (sec) |
---|---|---|
NLTK | 0.893 | 7.224 |
prose | 0.961 | 2.538 |
(See scripts/test_model.py for more information.)
The full list of supported POS tags is given below.
TAG | DESCRIPTION |
---|---|
( | left round bracket |
) | right round bracket |
, | comma |
: | colon |
. | period |
'' | closing quotation mark |
`` | opening quotation mark |
# | number sign |
$ | currency |
CC | conjunction, coordinating |
CD | cardinal number |
DT | determiner |
EX | existential there |
FW | foreign word |
IN | conjunction, subordinating or preposition |
JJ | adjective |
JJR | adjective, comparative |
JJS | adjective, superlative |
LS | list item marker |
MD | verb, modal auxiliary |
NN | noun, singular or mass |
NNP | noun, proper singular |
NNPS | noun, proper plural |
NNS | noun, plural |
PDT | predeterminer |
POS | possessive ending |
PRP | pronoun, personal |
PRP$ | pronoun, possessive |
RB | adverb |
RBR | adverb, comparative |
RBS | adverb, superlative |
RP | adverb, particle |
SYM | symbol |
TO | infinitival to |
UH | interjection |
VB | verb, base form |
VBD | verb, past tense |
VBG | verb, gerund or present participle |
VBN | verb, past participle |
VBP | verb, non-3rd person singular present |
VBZ | verb, 3rd person singular present |
WDT | wh-determiner |
WP | wh-pronoun, personal |
WP$ | wh-pronoun, possessive |
WRB | wh-adverb |
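Because related tags share a prefix (NN, NNP, NNPS, and NNS are all nouns), a tag can be collapsed to a coarse word class with a simple prefix check. The sketch below is illustrative only: the `token` struct and the `coarse` helper are hypothetical, the struct merely mirrors the Text and Tag fields printed in the examples above, and tags such as MD or RP, which do not share a prefix with their class, fall into "other".

```go
package main

import (
	"fmt"
	"strings"
)

// token is a hypothetical stand-in with the two fields used in this sketch.
type token struct{ Text, Tag string }

// coarse maps a tag from the table above to a broad word class by prefix.
func coarse(tag string) string {
	switch {
	case strings.HasPrefix(tag, "NN"):
		return "noun"
	case strings.HasPrefix(tag, "VB"):
		return "verb"
	case strings.HasPrefix(tag, "JJ"):
		return "adjective"
	case strings.HasPrefix(tag, "RB"):
		return "adverb"
	default:
		return "other"
	}
}

func main() {
	// Tags copied from the tokenizing example's output.
	toks := []token{{"@jdkato", "NN"}, {",", ","}, {"go", "VB"}, {"to", "TO"}, {"thanks", "NNS"}}
	for _, t := range toks {
		fmt.Printf("%-8s %-4s %s\n", t.Text, t.Tag, coarse(t.Tag))
	}
}
```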
NER
prose v2.0.0 includes a much improved version of v1.0.0's chunk package, which can identify people (PERSON) and geographical/political entities (GPE) by default.
    package main

    import (
        "fmt"

        "github.com/jdkato/prose/v2"
    )

    func main() {
        doc, _ := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
        for _, ent := range doc.Entities() {
            fmt.Println(ent.Text, ent.Label)
            // Lebron James PERSON
            // Los Angeles GPE
        }
    }
However, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See Prodigy + prose: Radically efficient machine teaching in Go for a tutorial.
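The per-token labels printed in the Overview (B-GPE, O) look like a BIO-style encoding. Assuming that scheme, where B- opens an entity, I- continues it, and O marks everything else (the I- prefix is an assumption, since the README shows only B- and O), grouping labeled tokens into entities could be sketched as follows. The `token` and `entity` types, the `chunk` function, and the hand-written labels in `main` are hypothetical and not prose's implementation or output.

```go
package main

import (
	"fmt"
	"strings"
)

// token and entity are hypothetical stand-ins for labeled tokens and spans.
type token struct{ Text, Label string }
type entity struct{ Text, Label string }

// chunk merges a B- token and the matching I- tokens that follow it into
// one entity; any other label closes the entity currently being built.
func chunk(toks []token) []entity {
	var ents []entity
	var words []string
	kind := ""
	flush := func() {
		if len(words) > 0 {
			ents = append(ents, entity{strings.Join(words, " "), kind})
		}
		words, kind = nil, ""
	}
	for _, t := range toks {
		switch {
		case strings.HasPrefix(t.Label, "B-"):
			flush()
			words, kind = []string{t.Text}, t.Label[2:]
		case strings.HasPrefix(t.Label, "I-") && len(words) > 0 && t.Label[2:] == kind:
			words = append(words, t.Text)
		default:
			flush()
		}
	}
	flush()
	return ents
}

func main() {
	// Labels written by hand for illustration.
	toks := []token{
		{"Lebron", "B-PERSON"}, {"James", "I-PERSON"}, {"plays", "O"},
		{"basketball", "O"}, {"in", "O"}, {"Los", "B-GPE"}, {"Angeles", "I-GPE"}, {".", "O"},
	}
	for _, e := range chunk(toks) {
		fmt.Println(e.Text, e.Label)
	}
}
```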