70

Find Lowest Common Ancestor Subtree and Shortest Dependency Path with spaCy Only

 4 years ago
source link: https://www.tuicool.com/articles/q2uUv2b
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

In the previous post: How to Find Shortest Dependency Path with spaCy and StanfordNLP , I talked about how to use spaCy and NetworkX to extract Shortest Dependency Path (SDP).

But there is a problem to use NetworkX. We cannot get the index of head entity or tail entity. For example, we have below sentence. The triple is Convulsions->(caused_by)->fever . But there are two fever in this sentence.

Convulsions that occur after DTaP are caused by a fever, and fever may cause headache.

The solution with NetworkX

A solution is to add an index with each token and specify which fever to find.

import spacy
import networkx as nx
nlp = spacy.load("en_core_web_sm")
doc = nlp(u'Convulsions that occur after DTaP are caused by a fever, and fever may cause headache.')
# Add pair to edges
edges = []
for token in doc:
for child in token.children:
edges.append(('{0}-{1}'.format(token.text, token.i),
'{0}-{1}'.format(child.text, child.i)))
# Construct Graph with nextworkx
graph = nx.Graph(edges)
# Get the length and path
entity1 = 'Convulsions-0'
entity2 = 'fever-9'
print(nx.shortest_path_length(graph, source=entity1, target=entity2))
print(nx.shortest_path(graph, source=entity1, target=entity2))
##### output #####
3
['Convulsions-0', 'caused-6', 'by-7', 'fever-9']

The edges looks like below.

In [6]: edges
Out[6]:
[('Convulsions-0', 'occur-2'),
 ('occur-2', 'that-1'),
 ('occur-2', 'after-3'),
 ('caused-6', 'Convulsions-0'),
 ('caused-6', 'DTaP-4'),
 ('caused-6', 'are-5'),
 ('caused-6', 'by-7'),
 ('caused-6', ',-10'),
 ('caused-6', 'and-11'),
 ('caused-6', 'cause-14'),
 ('by-7', 'fever-9'),
 ('fever-9', 'a-8'),
 ('cause-14', 'fever-12'),
 ('cause-14', 'may-13'),
 ('cause-14', 'headache-15'),
 ('cause-14', '.-16')]

In this way, we can ensure the tail entity token is fever-9 instead of fever-12 .

This solution is a little troublesome because NetworkX only accepts string type, and we have to incorporate such information in the string.

The solution with spaCy only

The token class in sapCy is very powerful. It has index information inside each token. But how to use spaCy to find SDP?

After some research, I found we can utilize the get_lca_matrix function.

In [11]: doc = nlp(u"This is a test")
    ...: lca_matrix = doc.get_lca_matrix()In [12]: lca_matrix
Out[12]:
array([[0, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 2, 3],
       [1, 1, 3, 3]], dtype=int32)

Doc.get_lca_matrix: Calculates the lowest common ancestor (LCA) matrix for a given Doc . Returns LCA matrix containing the integer index of the ancestor, or -1 if no common ancestor is found.

We can use this function to find the SDP.

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u'Convulsions that occur after DTaP are caused by a fever, and fever may cause headache.')def get_sdp_path(doc, subj, obj, lca_matrix):
lca = lca_matrix[subj, obj]

current_node = doc[subj]
subj_path = [current_node]
if lca != -1:
if lca != subj:
while current_node.head.i != lca:
current_node = current_node.head
subj_path.append(current_node)
subj_path.append(current_node.head)
current_node = doc[obj]
obj_path = [current_node]
if lca != -1:
if lca != obj:
while current_node.head.i != lca:
current_node = current_node.head
obj_path.append(current_node)
obj_path.append(current_node.head)

return subj_path + obj_path[::-1][1:]

# set head entity index and tail entity index
head = 0
tail = 9


sdp = get_sdp_path(doc, head, tail, doc.get_lca_matrix())
print(sdp)
##### output #####
[Convulsions, caused, by, fever]

The get_sdp_path() can find the SDP between head entity and tail entity. The only thing we need is to input the head entity index and tail entity index.

Inside the get_sdp_path() function, it actually first find the SDP path from head entity to LCA token , and then find the SDP path from tail entity to LCA token . Finally, we combine two subtrees together and return the result.

Reference


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK