README.md

Jargon

Jargon offers a tokenizer for Go, with an emphasis on handling technology terms correctly:

C++, ASP.net, and other non-alphanumeric terms are recognized as single tokens
#hashtags and @handles
Simple URLs and email addresses are handled reasonably well, though they are notoriously hard to get right

There is also an HTML tokenizer, which applies the above to text nodes in markup.

The tokenizer preserves all tokens verbatim, so that the original text can be reconstructed with fidelity (“round tripped”).

In turn, Jargon offers a lemmatizer, for recognizing canonical and synonymous terms. For example, the n-gram “Ruby on Rails” becomes ruby-on-rails. It implements “insensitivity” to spaces, dots and dashes.
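One way to picture “insensitivity” to spaces, dots and dashes is a normalized lookup key. The sketch below is an assumption about how such a lookup could work, not Jargon's actual code. The `canonical` table, `normalize` and `lookup` are hypothetical names, and the canonical values are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// canonical is a hypothetical dictionary keyed by normalized form.
// The values are illustrative, not taken from Jargon's dictionaries.
var canonical = map[string]string{
	"rubyonrails": "ruby-on-rails",
	"aspnet":      "asp.net",
	"reactjs":     "reactjs",
}

// normalize lowercases s and drops spaces, dots and dashes, so that
// variants of the same term collapse to one key.
func normalize(s string) string {
	s = strings.ToLower(s)
	return strings.NewReplacer(" ", "", ".", "", "-", "").Replace(s)
}

// lookup returns the canonical term for s, if the dictionary has one.
func lookup(s string) (string, bool) {
	c, ok := canonical[normalize(s)]
	return c, ok
}

func main() {
	for _, s := range []string{"Ruby on Rails", "ruby-on-rails", "ASP.NET", "ASPNET", "react.js"} {
		c, ok := lookup(s)
		fmt.Printf("%q -> %q (found: %v)\n", s, c, ok)
	}
}
```

Each of the five inputs resolves to an entry in the table, because the variants differ only in case, spaces, dots or dashes.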

(It turns out™️ that the above rules apply well to structured text such as CSV and JSON.)

Try it

GoDoc

Demo

package main

import (
    "fmt"
    "strings" // needed for strings.NewReader below

    "github.com/clipperhouse/jargon"
    "github.com/clipperhouse/jargon/stackexchange"
)

func main() {
    text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
    r := strings.NewReader(text)
    tokens := jargon.Tokenize(r)
    
    // iterate over the resulting tokens, or pass on to the lemmatizer...

    dict := stackexchange.Dictionary
    lem := jargon.NewLemmatizer(dict)
    lemmatized := lem.Lemmatize(tokens)
    for t := range lemmatized {
        fmt.Print(t)
    }
}

Jargon uses a streaming API – reader in, channel out.

Problem

When dealing with technology terms in text – say, a job listing or a resume – it’s easy to use different words for the same thing. This is acute for things like “react” where it’s not obvious what the canonical term is. Is it React or reactjs or react.js?

This presents a problem when searching for such terms. We know the above terms are synonymous, but databases don’t.

A further problem is that some n-grams should be understood as a single term. We know that “Objective C” represents one technology, but databases naively see two words.

Prior art

Existing tokenizers (such as Treebank) appear not to be round-trippable, i.e., they are destructive. They also take a hard line on punctuation, so “ASP.net” would come out as two tokens instead of one. Of course, I’d like to be corrected or pointed to other implementations.

Search-oriented databases like Elastic handle synonyms with analyzers.

In NLP, it’s handled by stemmers or lemmatizers. There, the goal is to replace variations of a term (manager, management, managing) with a single canonical version.

Recognizing multi-words-as-a-single-term (“Ruby on Rails”) is named-entity recognition.

Who’s it for?

Dunno yet, some ideas…

Recognition of domain terms appearing in text
NLP on unstructured data, when we wish to ensure consistency of vocabulary, for statistical analysis.
Search applications, where searches for “Ruby on Rails” are understood as an entity, instead of three unrelated words, or to ensure that “React” and “reactjs” and “react.js” are handled synonymously.

GitHub - clipperhouse/jargon: Tokenizers and lemmatizers for Go
