The Building a Large-scale, Accurate and Fresh Knowledge Graph

Microsoft gives a wonderful tutorial about Knowledge Graph in KDD 2018. If you are a machine learning engineer or an NLP engineer, I highly recommend reading this tutorial. It talks about what is knowledge graph (KG), the KG construction challenges for a large scale, and the approaches for challenges with paper references.

This post is a summary of the tutorial. You can find the slide here .

Part I: Introduction

There are several measurements to evaluating knowledge quality, correctness, coverage, freshness, and usage .

JrABfqm.png!web

How to ensure correctness, coverage, freshness for a vast KG is a huge challenge. A very common problem is there are multiple entity share the same name, e.g. Will Smith. How to link the information to the correct Will Smith is also a challenge (Entity Linking, EL).

mM3iMjN.png!web

Converting raw data to a high-quality KG mainly contains three steps: extracting data from structured or unstructured data sources, use schema to correlated data and relationships, and conflate the schematized knowledge.

3uYBveR.png!web

The above figure shows the active research and product efforts related to KG. KG has many research directions. If you want to start learning the KG, I recommend starting from Knowledge Graph Construction, which including some common NLP techniques, Named Entity Recognition (NER), Relation Extraction (RE), End-to-End Relation extraction. The goal of these techniques is to get the triple data. For example, (Will Smith, profession, Actor) is an entity-attribute-value triple data. (Will Smith, couple, Jada Pinkett Smith) is an entity-relation-entity triple data.

Part II: Acquiring Knowledge in the Wild

This part list a lot of papers. I just take some of them. If you find anything you are interested in, I recommend reading the tutorial directly.

NbENRbb.png!web

We can get the extracted knowledge from numerous data sources, which mainly contains two kinds, structured sources, and unstructured sources. The amount of structured sources is limited, so we need to extract knowledge by many NLP techniques, like NER, RE, etc.

zIVJnyn.png!web

In part II, the slides list up many papers for extracting knowledge from different sources. These papers are related to extract data from web (rule-based, tree-based, machine-learning-based), from news and forums , from email & calendars , from social media .

V3aAfuu.png!web

In order to increase the coverage, it also lists some paper related to NER, Relation Extraction, Entity Linking, knowledge base (KB) embedding for KB completion.

nuaa2ij.png!web

There are also some works about verify knowledge.

Besides the above content, it also contains human intervention papers related to Distant supervision (DS) and crowdsourcing.

VNnIzia.png!web

Part III: Building Knowledge Graph

This part introduces how Microsoft builds Satori Graph. The whole process mainly has four phases.

Phase 1: Data Ingestion

qAzm2iM.png!web

Data Ingestion including using parsing and standardization to store data in a uniform manner, and mapping extracted data to Microsoft Ontology.

Phase 2: Match & Merge

YBzQFvB.png!web

This graph shows that the ingestion flow in the second phase, Match & Merge.

viYjInY.png!web

The biggest problem in ingestion flow is the entity matching , identity and discover instances referring to the same real-world entity, and the data quality challenge , missing data may be caused by human information extraction tech or human errors.

ueIzymz.png!web