Find Lowest Common Ancestor Subtree and Shortest Dependency Path with spaCy Only

In the previous post: How to Find Shortest Dependency Path with spaCy and StanfordNLP , I talked about how to use spaCy and NetworkX to extract Shortest Dependency Path (SDP).

But there is a problem to use NetworkX. We cannot get the index of head entity or tail entity. For example, we have below sentence. The triple is Convulsions->(caused_by)->fever . But there are two fever in this sentence.

Convulsions that occur after DTaP are caused by a fever, and fever may cause headache.

The solution with NetworkX

A solution is to add an index with each token and specify which fever to find.

import spacy
import networkx as nx
nlp = spacy.load("en_core_web_sm")
doc = nlp(u'Convulsions that occur after DTaP are caused by a fever, and fever may cause headache.')# Add pair to edges
edges = []
for token in doc:
    for child in token.children:
edges.append(('{0}-{1}'.format(token.text, token.i),
                      '{0}-{1}'.format(child.text, child.i)))# Construct Graph with nextworkx
graph = nx.Graph(edges)# Get the length and path
entity1 = 'Convulsions-0'
entity2 = 'fever-9'print(nx.shortest_path_length(graph, source=entity1, target=entity2))
print(nx.shortest_path(graph, source=entity1, target=entity2))##### output #####
3
['Convulsions-0', 'caused-6', 'by-7', 'fever-9']

The edges looks like below.

In [6]: edges
Out[6]:
[('Convulsions-0', 'occur-2'),
 ('occur-2', 'that-1'),
 ('occur-2', 'after-3'),
 ('caused-6', 'Convulsions-0'),
 ('caused-6', 'DTaP-4'),
 ('caused-6', 'are-5'),
 ('caused-6', 'by-7'),
 ('caused-6', ',-10'),
 ('caused-6', 'and-11'),
 ('caused-6', 'cause-14'),
 ('by-7', 'fever-9'),
 ('fever-9', 'a-8'),
 ('cause-14', 'fever-12'),
 ('cause-14', 'may-13'),
 ('cause-14', 'headache-15'),
 ('cause-14', '.-16')]

In this way, we can ensure the tail entity token is fever-9 instead of fever-12 .

This solution is a little troublesome because NetworkX only accepts string type, and we have to incorporate such information in the string.

The solution with spaCy only

The token class in sapCy is very powerful. It has index information inside each token. But how to use spaCy to find SDP?

After some research, I found we can utilize the get_lca_matrix function.

In [11]: doc = nlp(u"This is a test")
    ...: lca_matrix = doc.get_lca_matrix()In [12]: lca_matrix
Out[12]:
array([[0, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 2, 3],
       [1, 1, 3, 3]], dtype=int32)

Doc.get_lca_matrix: Calculates the lowest common ancestor (LCA) matrix for a given Doc . Returns LCA matrix containing the integer index of the ancestor, or -1 if no common ancestor is found.

We can use this function to find the SDP.

import spacy
nlp = spacy.load("en_core_web_sm")doc = nlp(u'Convulsions that occur after DTaP are caused by a fever, and fever may cause headache.')def get_sdp_path(doc, subj, obj, lca_matrix):
  lca = lca_matrix[subj, obj]

  current_node = doc[subj]
  subj_path = [current_node]
  if lca != -1: 
    if lca != subj: 
      while current_node.head.i != lca:
        current_node = current_node.head
        subj_path.append(current_node)
      subj_path.append(current_node.head)current_node = doc[obj]
  obj_path = [current_node]
  if lca != -1: 
    if lca != obj: 
      while current_node.head.i != lca:
        current_node = current_node.head
        obj_path.append(current_node)
      obj_path.append(current_node.head)

  return subj_path + obj_path[::-1][1:]

# set head entity index and tail entity index
head = 0
tail = 9

sdp = get_sdp_path(doc, head, tail, doc.get_lca_matrix())
print(sdp)##### output #####
[Convulsions, caused, by, fever]

The get_sdp_path() can find the SDP between head entity and tail entity. The only thing we need is to input the head entity index and tail entity index.

Inside the get_sdp_path() function, it actually first find the SDP path from head entity to LCA token , and then find the SDP path from tail entity to LCA token . Finally, we combine two subtrees together and return the result.

The solution with NetworkX

The solution with spaCy only

Reference

Recommend

不完全预测：2020年将流行何种编程技术？

MachineX: Data Cleaning in Python

6400万像素！高像素手机是伪命题吗？

网易严选：“中国版无印良品”的危险生意

玩手游还要充会员？这背后有你想不到的门道

3年血亏10亿美元谷歌Deepmind出了什么问题？

如何通过MD5反查身份证号？

你多久没发微博了

公众号七年：那些被改变命运的人，正在走向何处？

稳定币的主要用户场景：是支付还是交易？

About Joyk