FoolNLTK

A Chinese word processing toolkit

Features

Although not the fastest, FoolNLTK is probably the most accurate open-source Chinese word segmenter available
Trained with a BiLSTM model
High accuracy in word segmentation, part-of-speech tagging, and entity recognition
User-defined dictionary
Support for training your own models
Batch processing

Getting Started

*** 2020/2/16 *** Update: models can now be trained with BERT and exported for deployment; see the Chinese training documentation.

To download and build FoolNLTK, type:

git clone https://github.com/rockyzhengwu/FoolNLTK.git
cd FoolNLTK/train

For detailed instructions, see the documentation in the train directory.

Only tested on Linux with Python 3.

Installation

pip install foolnltk

Usage Instructions

Word segmentation:

import fool

text = "一个傻子在北京"
print(fool.cut(text))
# ['一个', '傻子', '在', '北京']

To segment a file from the command line, run the module on it. Specify the -b parameter to increase the number of lines segmented in each batch.

python -m fool [filename]
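
The same batching idea can be sketched in Python. This is a minimal illustration, not FoolNLTK's own implementation: cut_in_batches and char_cut are hypothetical helpers, and char_cut is a stand-in segmenter so the sketch runs without FoolNLTK installed.

```python
from typing import Callable, Iterable, Iterator, List


def cut_in_batches(
    lines: Iterable[str],
    cut_fn: Callable[[List[str]], List[List[str]]],
    batch_size: int = 100,
) -> Iterator[List[str]]:
    """Yield one token list per input line, calling cut_fn once per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for line in lines:
        batch.append(line.rstrip("\n"))
        if len(batch) == batch_size:
            yield from cut_fn(batch)
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        yield from cut_fn(batch)


def char_cut(texts: List[str]) -> List[List[str]]:
    """Stand-in segmenter (an assumption): splits each line into characters."""
    return [list(text) for text in texts]


for tokens in cut_in_batches(["一个傻子", "在北京"], char_cut, batch_size=1):
    print(" ".join(tokens))
```

With FoolNLTK installed, fool.cut could presumably be passed as cut_fn, since the examples in this README call it with a list of strings.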

User-defined dictionary

Each line of the dictionary contains a word followed by its weight. The higher the weight and the longer the word, the more likely the word is to appear in the segmentation result. The weight should be greater than 1.

难受香菇 10
什么鬼 10
分词工具 10
北京 10
北京天安门 10
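
A dictionary file in the word-and-weight format above could be generated as follows. This is only a sketch: write_userdict is a hypothetical helper, and the UTF-8 encoding and the single space separator are assumptions based on the sample lines.

```python
import os
import tempfile
from typing import Dict


def write_userdict(entries: Dict[str, int], path: str) -> None:
    """Validate the entries, then write one 'word weight' line per entry."""
    lines = []
    for word, weight in entries.items():
        if not word or any(ch.isspace() for ch in word):
            raise ValueError(f"invalid word: {word!r}")
        if weight <= 1:
            raise ValueError(f"weight for {word!r} must be greater than 1")
        lines.append(f"{word} {weight}\n")
    # UTF-8 is an assumption; the README does not state the expected encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(lines)


entries = {"难受香菇": 10, "什么鬼": 10, "分词工具": 10, "北京": 10, "北京天安门": 10}
dict_path = os.path.join(tempfile.mkdtemp(), "userdict.txt")
write_userdict(entries, dict_path)
print(dict_path)
```

Validating before opening the file means an invalid entry does not leave a half-written dictionary behind.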

To load the dictionary:

import fool
fool.load_userdict(path)
text = ["我在北京天安门看你难受香菇", "我在北京晒太阳你在非洲看雪"]
print(fool.cut(text))
#[['我', '在', '北京', '天安门', '看', '你', '难受', '香菇'],
# ['我', '在', '北京', '晒太阳', '你', '在', '非洲', '看', '雪']]

To delete the dictionary:

fool.delete_userdict()

POS tagging

import fool

text = ["一个傻子在北京"]
print(fool.pos_cut(text))
#[[('一个', 'm'), ('傻子', 'n'), ('在', 'p'), ('北京', 'ns')]]
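
The tagged output is a list of (word, tag) pairs per sentence, so it can be filtered with plain Python. The sketch below keeps only words with selected tags; words_with_tags is a hypothetical helper, and the input is the sample output above rather than a live call to fool.pos_cut.

```python
from typing import List, Set, Tuple

TaggedSentence = List[Tuple[str, str]]


def words_with_tags(sentences: List[TaggedSentence], wanted: Set[str]) -> List[List[str]]:
    """Keep, per sentence, only the words whose tag is in `wanted`."""
    return [[word for word, tag in sent if tag in wanted] for sent in sentences]


# Sample output copied from the POS tagging example.
tagged = [[("一个", "m"), ("傻子", "n"), ("在", "p"), ("北京", "ns")]]
print(words_with_tags(tagged, {"n", "ns"}))
# [['傻子', '北京']]
```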

Entity Recognition

import fool 

text = ["一个傻子在北京","你好啊"]
words, ners = fool.analysis(text)
print(ners)
#[[(5, 8, 'location', '北京')]]
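
Judging by the sample output, each entity appears to be a (start, end, type, text) tuple. Under that assumption, entities can be grouped by type as sketched here; group_entities is a hypothetical helper, and the input is the sample output above rather than a live call to fool.analysis.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Field order inferred from the sample output (an assumption).
Entity = Tuple[int, int, str, str]


def group_entities(ners: List[List[Entity]]) -> Dict[str, List[str]]:
    """Collect entity texts by entity type across all sentences."""
    grouped = defaultdict(list)
    for sentence_entities in ners:
        for _start, _end, ent_type, ent_text in sentence_entities:
            grouped[ent_type].append(ent_text)
    return dict(grouped)


ners = [[(5, 8, "location", "北京")]]
print(group_entities(ners))
# {'location': ['北京']}
```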

Versions in Other languages

Java

If any model files are missing, try looking for them in sys.prefix, for example under /usr/local/.
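
To check where sys.prefix points and search it for a file, something like the following could be used. It is a generic sketch: find_files is a hypothetical helper, and the README does not name the model files, so the search pattern is left for the reader to supply.

```python
import fnmatch
import os
import sys
from typing import List


def find_files(root: str, pattern: str) -> List[str]:
    """Return sorted paths under root whose file name matches a glob pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)


print("sys.prefix =", sys.prefix)
# Hypothetical usage; replace the placeholder with the name of the missing file:
# print(find_files(sys.prefix, "<pattern for the missing file>"))
```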
