FoolNLTK: possibly the most accurate Chinese word segmentation tool to date

6 years ago

source link: https://www.oschina.net/p/foolnltk?amp%3Butm_medium=referral

FoolNLTK

A Chinese text processing toolkit

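Probably not the fastest open-source Chinese word segmenter, but quite possibly the most accurate
Trained with a BiLSTM model
Covers word segmentation, part-of-speech tagging, and entity recognition, all with relatively high accuracy
Supports user-defined dictionaries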

Install

pip install foolnltk

Word segmentation

import fool

text = "一个傻子在北京"
print(fool.cut(text))
# ['一个', '傻子', '在', '北京']

Command-line segmentation

python -m fool [filename]

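User-defined dictionary

The dictionary format is shown below: one word and its weight per line. The higher a word's weight and the longer the word, the more likely it is to appear in the segmentation result. Weights should be greater than 1.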

难受香菇 10
什么鬼 10
分词工具 10
北京 10
北京天安门 10
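
The format above can also be written and checked programmatically. The following is a minimal sketch and not part of FoolNLTK: the helper names `write_userdict` and `read_userdict` and the file name `user_dict.txt` are hypothetical, and it assumes one space-separated `word weight` pair per line with integer weights in a UTF-8 file, as in the sample entries.

```python
import os
import tempfile


def write_userdict(entries, path):
    """Write (word, weight) pairs to path, one 'word weight' pair per line."""
    lines = []
    for word, weight in entries:
        if weight <= 1:
            # The description of the format asks for weights greater than 1.
            raise ValueError(f"weight for {word!r} must be greater than 1")
        lines.append(f"{word} {weight}\n")
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(lines)


def read_userdict(path):
    """Parse a dictionary file back into a {word: weight} mapping."""
    result = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            word, weight = line.rsplit(" ", 1)
            result[word] = int(weight)  # assumption: weights are integers
    return result


entries = [("难受香菇", 10), ("什么鬼", 10), ("分词工具", 10),
           ("北京", 10), ("北京天安门", 10)]
dict_path = os.path.join(tempfile.mkdtemp(), "user_dict.txt")  # hypothetical name
write_userdict(entries, dict_path)
print(read_userdict(dict_path))
```

A file produced this way could then be passed to `fool.load_userdict`.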

import fool

# Hypothetical file name; the file holds entries in the dictionary format
path = "user_dict.txt"
fool.load_userdict(path)
text = "我在北京天安门看你难受香菇"
print(fool.cut(text))
# ['我', '在', '北京天安门', '看', '你', '难受香菇']

# Remove the user dictionary once it is no longer needed
fool.delete_userdict()

Part-of-speech tagging

import fool

text = "一个傻子在北京"
print(fool.pos_cut(text))
# [('一个', 'm'), ('傻子', 'n'), ('在', 'p'), ('北京', 'ns')]
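
Because the result shown above is a list of `(word, tag)` pairs, it can be post-processed with ordinary Python. Here is a minimal sketch that uses the sample output as hard-coded data, so it runs without FoolNLTK installed. The helper name `words_with_tag_prefix` is hypothetical, and the prefix match assumes that noun-like tags begin with `n`, as `n` and `ns` do in the sample.

```python
def words_with_tag_prefix(tagged, prefix):
    """Return the words whose part-of-speech tag starts with prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


# Sample output copied from the example above, not produced by this code.
tagged = [('一个', 'm'), ('傻子', 'n'), ('在', 'p'), ('北京', 'ns')]
print(words_with_tag_prefix(tagged, 'n'))
# ['傻子', '北京']
```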

Entity recognition

import fool

text = "一个傻子在北京"
words, ners = fool.analysis(text)
print(ners)
# [(5, 8, 'location', '北京')]
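
Judging from the sample output, each entity is a 4-tuple whose third and fourth items are the entity type and the entity text. The first two items look like position offsets, but their exact convention is not stated here, so the sketch below ignores them. It groups entity texts by type using hard-coded sample data; the helper name `group_entities` is hypothetical.

```python
from collections import defaultdict


def group_entities(ners):
    """Group entity texts by entity type, ignoring the position fields."""
    grouped = defaultdict(list)
    for _start, _end, entity_type, entity_text in ners:
        grouped[entity_type].append(entity_text)
    return dict(grouped)


# Sample output copied from the example above, not produced by this code.
ners = [(5, 8, 'location', '北京')]
print(group_entities(ners))
# {'location': ['北京']}
```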

So far this has only been tested on Python 3 on Linux.
