
A Practical Guide to Build an Enterprise Knowledge Graph for Investment Analysis



How to solve the practical problems when building a real Enterprise Knowledge Graph service

Oct 23 · 8 min read


Photo by Peter Nguyen on Unsplash

This is a summary of an application paper about how to solve the challenges of developing an Enterprise Knowledge Graph (EKG) service that incorporates information about 40,000,000 companies. I found the paper quite useful in practice for anyone who wants to build an EKG for a real business, so I wrote this summary to save you time. If you want the details, I recommend reading the paper directly; the PDF is here.


from connected-data-london-as-a-graph

This EKG service is sold to securities companies. Because securities companies provide investment banking and investment consulting services, they need information about small and medium-sized enterprises. The product helps securities companies understand and approach target companies better and faster.

There are two kinds of challenges in this project: technology challenges and business challenges.

Business challenges

There are two challenges on the business side.

- Data Privacy: how to provide deep and useful analysis services without violating the privacy of a company and its employees.

- Killer Services on the Graph: the EKG is complex and huge, so making the graph easy to use is a challenge.

The solutions to these two challenges:

Data Privacy:

- Transform the original data into a rank form or a ratio form instead of exposing the real, exact values

- Obscure critical nodes (e.g., person-related information) that should not be shown when visualizing the EKG as a graph (a minimal sketch of both measures follows)
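As a rough illustration of these two measures, here is a minimal Python sketch; the field name `registered_capital` and the node label `Person` are made-up assumptions, not taken from the paper:

```python
# Minimal sketch of the two privacy measures described above. The field
# name "registered_capital" and the node label "Person" are illustrative
# assumptions, not taken from the paper.

def to_rank_form(companies, field="registered_capital"):
    """Replace an exact numeric value with its rank among peers."""
    ranked = sorted(companies, key=lambda c: c[field], reverse=True)
    for rank, company in enumerate(ranked, start=1):
        company[field + "_rank"] = rank   # expose only the rank ...
        del company[field]                # ... and drop the raw value
    return ranked

def obscure_person_nodes(nodes):
    """Drop person-related nodes before the graph is visualized."""
    return [n for n in nodes if n.get("label") != "Person"]

companies = [
    {"name": "A Corp", "registered_capital": 5_000_000},
    {"name": "B Ltd", "registered_capital": 12_000_000},
]
print(to_rank_form(companies))
# -> [{'name': 'B Ltd', 'registered_capital_rank': 1},
#     {'name': 'A Corp', 'registered_capital_rank': 2}]
```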

Killer Services on the Graph:

- Deliver services that directly meet the business requirements of users. For example, the real-controller-finding service tells investors from investment banks who the real owner of a company is, and the enterprise path discovery service provides hints on how investors could reach the enterprises they want to invest in.

Technology challenges

Technology challenges arise from the diversity and the scale of the data sources.

- Construction problems, such as transforming the relational databases to RDF (D2R), and difficulties representing and querying meta properties and n-ary relations

- Performance issues, since the KG contains more than one billion triples

Before introducing the challenges in detail, let's first look at the whole workflow for building the EKG.

  • At the first stage of our project, we mainly utilize relational databases (RDBs) from CSAIC.
  • Secondly, we supplement the EKG with bidding information from the Chinese Government Procurement Network (CGPN) and stock information from Eastern Wealth Network (EWN).
  • Then the EKG is fused with the patent information extracted from the Patent Search and Analysis Network of the State Intellectual Property Office (PSAN-SIPO) in another project.
  • Finally, the competitor relations and acquisition events are added to the EKG. This information is extracted from encyclopedia sites, namely Wikipedia, Baidu Baike and Hudong Baike.

The following challenges are encountered during the above process:

  • Data Model (complex data types): meta properties (properties of relations, i.e., a property graph) and events (n-ary relations). There are no mature existing solutions for representing and querying meta properties and events efficiently.
  • D2R Mapping: using D2R tools (e.g., D2RQ) to map RDBs from CSAIC into RDF has the following challenges: a) mapping of meta properties; b) data in the same column of an RDB may map to different classes in RDF; c) data in the same RDB table may map to different classes that have subClass relations.
  • Information Extraction: extract useful relations of various types, such as "competitor" and "acquisition". Entity extraction becomes difficult when encyclopedia sites use abbreviations of company names.
  • Query Performance: we encounter performance bottlenecks since the number of triples in our EKG has reached billions. Furthermore, more complex query patterns appear as the usage scenarios of the EKG increase: a) when users query the KG with an IPC code, we should find all subClasses of the IPC code recursively, then find the patents belonging to those subClasses (see the sketch after this list); b) querying all properties of an instance, which is a problem since different properties of the same instance may be stored as different triples in the graph store; c) queries on meta properties and n-ary relations.
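For challenge (a), one common way to handle the recursive subClass lookup is a SPARQL 1.1 property path. The following rdflib sketch is my own illustration, not the paper's implementation; the predicate name `ekg:ipcCode`, the example IPC class, and the file name are assumptions:

```python
# Sketch: resolve an IPC code's subclasses recursively and fetch their
# patents with a single SPARQL property-path query. All names are assumed.
from rdflib import Graph

g = Graph()
g.parse("ekg_patents.ttl", format="turtle")  # hypothetical dump of the patent KG

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ekg:  <http://example.org/ekg/>

SELECT ?patent ?ipc WHERE {
    ?ipc rdfs:subClassOf* ekg:IPC_H01L .   # all (transitive) subclasses
    ?patent ekg:ipcCode ?ipc .             # patents filed under them
}
"""
for patent, ipc in g.query(query):
    print(patent, ipc)
```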

The authors carefully select the most suitable methods and adapt them to the above problems.

  1. First, we split the original tables into atomic tables and complex tables. Then we use the D2RQ tool to handle the mappings on atomic tables. Finally, we develop programs to process ad hoc mappings on complex tables.
  2. We use the multi-strategy learning method in [16] (Bootstrapping Yahoo! Finance by Wikipedia for Competitor Mining) to extract competitor relations and acquisition events from various data sources on encyclopedic sites.
  3. We adopt the graph-based entity linking algorithm in [2] (Graph Ranking for Collective Named Entity Disambiguation) to accomplish the task of entity linking.
  4. We design our own storage structure to fully optimize the performance of miscellaneous queries on the EKG. We use a hybrid storage solution composed of multiple kinds of databases: for large-scale data, we use a NoSQL database, namely MongoDB, as the underlying storage; for high-frequency query data, we use an in-memory database.

Approach Overview

Data Sources and Related Tasks to Construct the EKG


Building the EKG from multiple sources: Aluminum Corporation of China Limited example

This graph is an example showing the process of extracting information about Aluminum Corporation of China Limited from multiple sources.

  • First, they use the Aluminum Corporation of China Limited data in CSAIC as the basic KG. They transform the RDBs into RDF to form the basic enterprise KG and get triples like (Aluminum Corporation of China Limited, director, Weiping Xiong).
  • Secondly, they extract the patent information from a patent website and build a patent KG. Since the basic enterprise KG (transformed from CSAIC) and the patent KG serve different users, they use data fusion algorithms to link the two KGs through shared companies and persons.
  • Finally, they extract stock codes from a stock website, corporate executives from the infoboxes of Baidu Baike and Wikipedia, and acquisition events from the free text of encyclopedia sites.

Building Knowledge Graphs


Data-driven KG construction process

There are 5 major steps in the whole construction process: Schema Design, D2R Transformation, Information Extraction, Data Fusion with Instance Matching, and Storage Design and Query Optimization.

1. Schema Design

While most general KGs such as DBpedia and YAGO are built in a bottom-up manner to ensure wide coverage of cross-domain data, the authors adopt a top-down approach in EKG construction to ensure data quality and a stricter schema.

At the first iteration, the EKG includes four basic concepts, namely "Company", "Person", "Credit" and "Litigation". Major relations include "subsidiary", "shareholder", and "executive". The patent KG only includes the concept "Patent"; its major relation is "applicant". At the second iteration, "ListedCompany", "Stock", "Bidding" and "Investment" are added to the EKG.
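To make the first-iteration schema concrete, here is a minimal rdflib sketch of how such concepts and relations could be declared; the namespace URI, the property directions, and the domain/range choices are my assumptions, not the paper's vocabulary:

```python
# Sketch of the first-iteration EKG schema using rdflib.
# The namespace, relation directions, and domains/ranges are assumptions.
from rdflib import Graph, Namespace, RDF, RDFS

EKG = Namespace("http://example.org/ekg/")
g = Graph()
g.bind("ekg", EKG)

# Basic concepts mentioned in the text
for cls in ("Company", "Person", "Credit", "Litigation", "Patent"):
    g.add((EKG[cls], RDF.type, RDFS.Class))

# Major relations with simplified domains and ranges
relations = [
    ("subsidiary",  "Company", "Company"),
    ("shareholder", "Person",  "Company"),
    ("executive",   "Person",  "Company"),
    ("applicant",   "Patent",  "Company"),
]
for name, domain, rng in relations:
    g.add((EKG[name], RDF.type, RDF.Property))
    g.add((EKG[name], RDFS.domain, EKG[domain]))
    g.add((EKG[name], RDFS.range, EKG[rng]))

print(g.serialize(format="turtle"))
```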

2. D2R Transformation

The authors take three steps to transform RDBs to RDF, namely table splitting, basic D2R transformation by D2RQ and post-processing.

Bfmme2u.png!web

  • Table Splitting: As shown in Figure 4, the original table Person Information also contains enterprise information. We divide the table into Person_P, Enterprise_E and Person Enterprise_PE. The Enterprise_E table is further merged with the original Enterprise Information table because the two tables share similar information about enterprises.
  • Basic D2R Transformation by D2RQ: We write a customized mapping file in D2RQ to map the fields of atomic entity tables and atomic relation tables into RDF. We map table names into classes, columns of the table into properties, and the cell values of each record into the corresponding property values for a given entity.
  • Post Processing: a) Meta property mapping: the program assigns an auto-incrementing ID to each fact that has meta properties; the meta properties then become properties of the n-ary relation identified by this ID, which yields new triples (a sketch of this reification follows the list). b) Conditional taxonomy mapping: the program determines whether an entity maps to a subClass according to whether the entity appears in the table related to that subClass. For example, if a company exists in the relation table of company and stock, this implies that the company is a listed company, so we add a triple typing the company as a ListedCompany.
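Here is a minimal sketch of the meta property mapping step (a), in which a fact carrying meta properties is turned into an n-ary relation node identified by an auto-incrementing ID; the predicate names and the share-ratio example are illustrative assumptions:

```python
# Sketch of the meta-property post-processing step: a fact with meta
# properties (e.g. a shareholding annotated with a share ratio) becomes an
# n-ary relation node keyed by an auto-incrementing ID. Names are assumed.
import itertools

_relation_ids = itertools.count(1)

def reify(subject, predicate, obj, meta_properties):
    """Return triples describing one fact as an n-ary relation node."""
    rel_id = f"rel_{next(_relation_ids)}"
    triples = [
        (rel_id, "subject", subject),
        (rel_id, "predicate", predicate),
        (rel_id, "object", obj),
    ]
    # Meta properties become ordinary properties of the relation node.
    for meta_key, meta_value in meta_properties.items():
        triples.append((rel_id, meta_key, meta_value))
    return triples

print(reify("Weiping Xiong", "shareholder",
            "Aluminum Corporation of China Limited",
            {"shareRatio": "3.5%"}))
```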

3. Information Extraction

The authors adopt a multi-strategy learning method to extract multiple types of data from various data sources. The whole process is as follows:

  • Entities and attribute value pairs of patent, stock and bidding information are extracted from PSAN-SIPO, EWN and CGPN respectively by using HTML wrappers.
  • Attribute-value pairs of enterprises (e.g., the chairman of an enterprise) are extracted from the infoboxes of encyclopedic sites by using HTML wrappers (a minimal wrapper sketch follows this list).
  • Binary relations, events and synonym identification on free text require seed annotations in sentences to learn patterns.
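As a rough idea of what such an HTML wrapper looks like, here is a sketch using requests and BeautifulSoup; the CSS selector and table layout are assumptions, and real wrappers would be tailored to each source (PSAN-SIPO, EWN, CGPN, the encyclopedia infoboxes):

```python
# Sketch of an HTML wrapper that pulls attribute-value pairs from an
# infobox-style table. The selector "table.infobox" is an assumption.
import requests
from bs4 import BeautifulSoup

def extract_infobox(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    pairs = {}
    # Assume each infobox row holds a <th> label and a <td> value.
    for row in soup.select("table.infobox tr"):
        label, value = row.find("th"), row.find("td")
        if label and value:
            pairs[label.get_text(strip=True)] = value.get_text(strip=True)
    return pairs
```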

For the problem of abbreviated company names, the authors use an entity linking algorithm to link a company mentioned in the text to companies in the basic EKG. They adopt a graph-based method to accomplish entity linking in two steps:

  • Candidate detection: finding the candidate entities in the KB that each mention refers to. It deletes common suffixes (Corp., Co., Ltd., Group) and calculates the similarity between the core word of the mention and the core word of the entity in the KB (see the sketch after this list).
  • Disambiguation: selecting the most likely candidate to link. Here, the authors use the disambiguation algorithm proposed in the literature (Graph Ranking for Collective Named Entity Disambiguation).
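A minimal sketch of the candidate-detection step, with an assumed suffix list and similarity threshold; the graph-based disambiguation step of [2] is not shown:

```python
# Sketch of candidate detection: strip common company suffixes and compare
# the remaining "core words" with simple string similarity. The suffix list
# and the 0.8 threshold are assumptions.
from difflib import SequenceMatcher

SUFFIXES = ("Corp.", "Corp", "Co., Ltd.", "Co.", "Ltd.", "Ltd", "Group", "Inc.")

def core_word(name):
    name = name.strip()
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)].rstrip(" ,")
    return name

def candidates(mention, kb_entities, threshold=0.8):
    """Return KB entities whose core word is similar to the mention's."""
    core_m = core_word(mention)
    return [
        entity for entity in kb_entities
        if SequenceMatcher(None, core_m, core_word(entity)).ratio() >= threshold
    ]

print(candidates("Chalco Corp.", ["Chalco Co., Ltd.", "Baosteel Group"]))
# -> ['Chalco Co., Ltd.']
```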

4. Data Fusion with Instance Matching

Instance matching is simple for companies but tough for persons: while every person has a personal ID number, no such IDs exist in the patent data sources. The authors therefore use a simple heuristic rule to match a person in the patent KG to a person in the basic KG: if the name of the patent inventor and the name of the applicant equal the name of a person and that person's company in the basic KG respectively, the patent inventor is said to match that person.
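A minimal sketch of this heuristic rule, with made-up record field names:

```python
# Sketch of the heuristic person-matching rule described above.
# The record field names ("inventor", "applicant") are assumptions.

def match_patent_inventor(patent, basic_kg_person_company_pairs):
    """Match a patent inventor to a person in the basic KG only if the
    inventor name equals the person name AND the patent applicant equals
    that person's company."""
    for person_name, company_name in basic_kg_person_company_pairs:
        if patent["inventor"] == person_name and patent["applicant"] == company_name:
            return person_name
    return None

patent = {"inventor": "Weiping Xiong",
          "applicant": "Aluminum Corporation of China Limited"}
pairs = [("Weiping Xiong", "Aluminum Corporation of China Limited")]
print(match_patent_inventor(patent, pairs))  # -> "Weiping Xiong"
```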

5. Storage Design and Query Optimization

The authors use MongoDB as the main storage for its large install base, good query performance, mass data storage, and scalability with clustering support.
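A minimal sketch of the hybrid storage idea mentioned earlier (MongoDB for bulk triples, something in-memory for high-frequency queries); the collection and field names are assumptions, and `functools.lru_cache` merely stands in for the paper's memory database:

```python
# Sketch of hybrid storage: bulk triples live in MongoDB, high-frequency
# lookups are answered from an in-memory cache. Names are assumptions.
from functools import lru_cache
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
triples = client["ekg"]["triples"]

# Index the subject field so "all properties of an instance" queries are fast.
triples.create_index("subject")

def insert_triples(rows):
    """rows: iterable of (subject, predicate, object) tuples."""
    triples.insert_many(
        [{"subject": s, "predicate": p, "object": o} for s, p, o in rows]
    )

@lru_cache(maxsize=100_000)
def properties_of(subject):
    """Hot query: return all (predicate, object) pairs of one instance."""
    return tuple(
        (doc["predicate"], doc["object"])
        for doc in triples.find({"subject": subject})
    )
```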

Deployment and Usage Scenarios


This section talks about how to make the graph easy to use. The authors predefine queries for several common use cases.


Different structures for a person to control an enterprise
  • Finding an enterprise's real controllers. The person who owns the biggest equity share is the real decision-maker, but there are many ownership patterns, as the figure above shows.
  • Innovative enterprise analysis. Look at the company's patents.
  • Enterprise path discovery. Securities companies would like to know whether there are paths to reach their new customers, and they also want to know whether their potential customers have paths to their competitors (see the path-finding sketch after this list).
  • Multidimensional relationship discovery. Given two companies, there may be various relationships between them.
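Enterprise path discovery can be viewed as a shortest-path search over the relationship graph. The following networkx sketch is my own illustration; the nodes, edges and relation labels are made up, not the paper's data:

```python
# Sketch of "enterprise path discovery" as a shortest-path search over an
# assumed relationship graph; networkx stands in for the real graph store.
import networkx as nx

g = nx.Graph()
g.add_edge("Securities Co.", "Weiping Xiong", relation="executive")
g.add_edge("Weiping Xiong", "Aluminum Corporation of China Limited",
           relation="director")
g.add_edge("Aluminum Corporation of China Limited", "Target SME",
           relation="shareholder")

path = nx.shortest_path(g, "Securities Co.", "Target SME")
# Print each hop together with the relation that connects it.
for a, b in zip(path, path[1:]):
    print(a, f"--{g.edges[a, b]['relation']}-->", b)
```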

Finally, I highly recommend reading the paper if you need to build a KG for a real business.

Check out my other blogs here!

GitHub: https://github.com/BrambleXu

LinkedIn: www.linkedin.com/in/xu-liang

