
NLP Tools: deepnlp

source link: https://www.biaodianfu.com/deepnlp.html

Introduction to DeepNLP

The deepnlp project is a Python NLP toolkit built on the TensorFlow platform. It aims to combine TensorFlow's deep learning modules with recent algorithms to provide the core building blocks of NLP, and to support extensions to more complex tasks such as abstractive summarization.

  • NLP modules
    • Word Segmentation/Tokenization
    • Part-of-speech tagging (POS)
    • Named-entity recognition (NER)
    • Dependency Parsing (Parse)
    • Abstractive summarization: Textsum (Seq2Seq-Attention)
    • Key sentence extraction: Textrank
    • Text classification: Textcnn (WIP)
    • Callable via a web RESTful API
    • Planned: syntactic parsing (Parsing)
  • Algorithm implementations
    • Word segmentation: linear-chain CRF, implemented on top of the CRF++ package
    • POS tagging: unidirectional LSTM / bidirectional BI-LSTM, implemented in TensorFlow
    • NER: unidirectional LSTM / bidirectional BI-LSTM / combined LSTM-CRF network, implemented in TensorFlow
    • Dependency parsing: a neural-network parser based on the arc-standard system
  • Pretrained models
    • Chinese: trained on a mixed People's Daily and Weibo corpus: word segmentation, POS tagging, entity recognition
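The linear-chain CRF used for segmentation picks the highest-scoring label sequence with Viterbi decoding. As a rough illustration of that decoding step (this is not deepnlp's or CRF++'s actual code, and the scores below are made up), a minimal Viterbi decoder over per-token emission scores and tag-pair transition scores might look like:

```python
# Minimal Viterbi decoder for a linear-chain model (illustrative sketch only).

def viterbi(emissions, transitions):
    """emissions: list of {tag: score}, one dict per token.
    transitions: {(prev_tag, cur_tag): score}.
    Returns the highest-scoring tag sequence."""
    tags = list(emissions[0].keys())
    # best[i][t] = (score of best path ending in tag t at position i, backpointer)
    best = [{t: (emissions[0][t], None) for t in tags}]
    for i in range(1, len(emissions)):
        layer = {}
        for cur in tags:
            score, prev = max(
                (best[i - 1][p][0] + transitions.get((p, cur), 0.0) + emissions[i][cur], p)
                for p in tags
            )
            layer[cur] = (score, prev)
        best.append(layer)
    # Trace back from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(emissions) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return list(reversed(path))

# Toy two-token example with B (begin) / E (end) segmentation tags:
emissions = [{"B": 1.0, "E": 0.2}, {"B": 0.1, "E": 0.9}]
transitions = {("B", "E"): 0.5, ("B", "B"): -0.5, ("E", "B"): 0.3, ("E", "E"): -0.5}
print(viterbi(emissions, transitions))  # ['B', 'E']
```

A real CRF would learn these emission and transition scores from features of the training data; the decoding itself is the same dynamic program.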

Installing DeepNLP

Installation:

pip install deepnlp

Download the models:

import deepnlp
# Download all the modules
deepnlp.download()
 
# Download specific module
deepnlp.download('segment')
deepnlp.download('pos')
deepnlp.download('ner')
deepnlp.download('parse')
 
# Download module and domain-specific model
deepnlp.download(module = 'pos', name = 'en') 
deepnlp.download(module = 'ner', name = 'zh_entertainment')

Running the sample code raises the following error:

from deepnlp import segmenter
 
tokenizer = segmenter.load_model(name='zh_entertainment')
text = "我刚刚在浙江卫视看了电视剧老九门,觉得陈伟霆很帅"
segList = tokenizer.seg(text)
text_seg = " ".join(segList)
Traceback (most recent call last):
  File "D:/CodeHub/NLP/test_new.py", line 3, in <module>
    from deepnlp import segmenter
  File "D:\CodeHub\NLP\venv\lib\site-packages\deepnlp\segmenter.py", line 6, in <module>
    import CRFPP
ModuleNotFoundError: No module named 'CRFPP'

Solution: install CRFPP, the Python bindings of the CRF++ toolkit.
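Before calling the segmenter, it can help to confirm that the CRFPP bindings are actually importable in the current environment. A small diagnostic sketch (nothing here is deepnlp API; `crfpp_available` is a hypothetical helper name):

```python
import importlib.util

def crfpp_available():
    """Return True if the CRFPP Python bindings (from the CRF++ toolkit)
    can be imported in the current environment."""
    return importlib.util.find_spec("CRFPP") is not None

if not crfpp_available():
    print("CRFPP not found: build CRF++ from source and install its Python bindings.")
```

Note that CRF++ is a C++ toolkit, so `pip install` alone is usually not enough; the bindings must be built against a compiled CRF++ library.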

Using DeepNLP

Usage example:

from deepnlp import segmenter, pos_tagger, ner_tagger, nn_parser
from deepnlp import pipeline
 
# Word segmentation
tokenizer = segmenter.load_model(name='zh')
text = "我爱吃北京烤鸭"
seg_list = tokenizer.seg(text)
text_seg = " ".join(seg_list)
print(text_seg)
 
# POS tagging
p_tagger = pos_tagger.load_model(name='zh')
tagging = p_tagger.predict(seg_list)
for (w, t) in tagging:
    pair = w + "/" + t
    print(pair)
 
# Named-entity recognition
n_tagger = ner_tagger.load_model(name='zh')  # basic LSTM-based model
tagset_entertainment = ['city', 'district', 'area']
tagging = n_tagger.predict(seg_list, tagset=tagset_entertainment)
for (w, t) in tagging:
    pair = w + "/" + t
    print(pair)
 
# Dependency parsing
parser = nn_parser.load_model(name='zh')
words = ['它', '熟悉', '一个', '民族', '的', '历史']
tags = ['r', 'v', 'm', 'n', 'u', 'n']
dep_tree = parser.predict(words, tags)
num_token = dep_tree.count()
print("id\tword\tpos\thead\tlabel")
for i in range(num_token):
    cur_id = int(dep_tree.tree[i + 1].id)
    cur_form = str(dep_tree.tree[i + 1].form)
    cur_pos = str(dep_tree.tree[i + 1].pos)
    cur_head = str(dep_tree.tree[i + 1].head)
    cur_label = str(dep_tree.tree[i + 1].deprel)
    print("%d\t%s\t%s\t%s\t%s" % (cur_id, cur_form, cur_pos, cur_head, cur_label))
 
# Pipeline
p = pipeline.load_model('zh')
text = "我爱吃北京烤鸭"
res = p.analyze(text)
print(res[0])
print(res[1])
print(res[2])
words = p.segment(text)
pos_tagging = p.tag_pos(words)
ner_tagging = p.tag_ner(words)
print(list(pos_tagging))
print(ner_tagging)
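The dependency parser is described as being based on the arc-standard transition system. To make that system concrete, here is a sketch of the SHIFT / LEFT-ARC / RIGHT-ARC transitions replayed against a gold tree (this is not deepnlp's code; the neural model's job is to predict these transitions instead of reading them off the gold heads, and the head indices below are an assumed analysis of the example sentence):

```python
# Arc-standard transition system, driven by a static oracle (sketch only).
# Assumes a projective gold tree; tokens are numbered 1..n, 0 is the root.

def parse(heads):
    """heads[i] = head of token i+1 (0 means root). Returns the arc set."""
    n = len(heads)
    stack, buffer, arcs = [0], list(range(1, n + 1)), set()

    def has_all_children(tok):
        # True once every gold dependent of `tok` has been attached.
        return all(heads[c - 1] != tok or (tok, c) in arcs for c in range(1, n + 1))

    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            top, below = stack[-1], stack[-2]
            # LEFT-ARC: attach `below` as a dependent of `top`.
            if below != 0 and heads[below - 1] == top:
                arcs.add((top, below))
                stack.pop(-2)
                continue
            # RIGHT-ARC: attach `top` to `below`, once `top` is complete.
            if heads[top - 1] == below and has_all_children(top):
                arcs.add((below, top))
                stack.pop()
                continue
        stack.append(buffer.pop(0))  # SHIFT
    return arcs

# words = ['它', '熟悉', '一个', '民族', '的', '历史'], assumed gold heads:
print(parse([2, 0, 4, 6, 4, 2]))
# e.g. (0, 2) means 熟悉 is the root; (2, 6) means 历史 depends on 熟悉
```

In deepnlp's parser a neural network scores the candidate transitions at each state; the transition mechanics themselves are the part sketched here.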

Training your own models:

Reference: https://github.com/rockingdingo/deepnlp

