
Beam: A Distributed Knowledge Graph Store


May 1, 2019


By: Diego Ongaro and Simon Fell


Topics: Distributed Systems, Open Source

We're excited to announce the public release of Beam, a distributed knowledge graph store, under the Apache 2.0 open source license. Beam is the result of four person-years of exploration and engineering effort, so there's a lot to unpack here! This post will discuss what Beam is, how it's implemented, and why we've chosen to release it as open source.

Beam is a Distributed Knowledge Graph Store

Beam is a knowledge graph store, sometimes called an RDF store or a triple store. Knowledge graphs are suitable for modeling data that is highly interconnected by many types of relationships, like encyclopedic information about the world. For example, Wikidata is a great dataset that contains the structured data and relationships from Wikipedia and is a good fit for a knowledge graph. A knowledge graph store enables rich queries on its data, which can be used to power real-time interfaces, to complement machine learning applications, and to make sense of new, unstructured information in the context of the existing knowledge.

In a knowledge graph, data is represented as a single table of facts, where each fact has a subject, predicate, and object. This representation enables the store to sift through the data for complex queries and to apply inference rules that raise the level of abstraction. Here's an example of a tiny graph:

    subject          predicate   object
    <John_Scalzi>    <born>      <Fairfield>
    <John_Scalzi>    <lives>     <Bradford>
    <John_Scalzi>    <wrote>     <Old_Mans_War>

Beam uses an RDF-like representation for data and a SPARQL-like query language. To learn about how to represent and query data in Beam, see docs/query.md in the GitHub repo.
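For a flavor of the language, here's what a query against the tiny graph above might look like; this is an illustrative sketch, and docs/query.md is the authority on the exact syntax. It asks where the author of <Old_Mans_War> lives:

    ?author <wrote> <Old_Mans_War>
    ?author <lives> ?place

Each line is a triple pattern, and a shared variable like ?author joins the patterns together, much as in SPARQL.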

Beam is a distributed store. It's designed to store large graphs that cannot fit on a single server, and it scales out horizontally to support higher query rates and larger data sets. Its write rates don't scale, but a typical Beam deployment should be able to support tens of thousands of changes per second. For about a year, we've run a 20-server deployment of Beam for development purposes and offline use cases, most commonly loaded with a dataset of about 2.5 billion facts. We believe Beam's current capabilities exceed this capacity and scale; we haven't yet pushed Beam to its limits.

Beam's architecture

Beam's architecture is based around a central log, as shown in the following figure. Each box in the diagram is a separate process on the network. The central log isn't a novel idea (see Tango, for example), but it's often overlooked. All write requests are sequenced into an append-only central log. The log is a network service that is internally replicated for fault tolerance and persisted for durability. Several view servers read the log and apply its entries in sequence, each deterministically updating its local state; different view servers maintain different state. An API tier accepts requests from clients: it appends write requests to the log, and it collects data from the view servers to answer reads.

[Figure: Beam's central-log architecture]
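To make the pattern concrete, here's a minimal Go sketch of the central-log idea. Every name below is a hypothetical stand-in, not Beam's actual API:

    package main

    import "fmt"

    // Entry is one sequenced record in the append-only log.
    type Entry struct {
        Index int64  // position assigned by the log service
        Data  string // an encoded write request
    }

    // View tails the log and deterministically applies each entry to its
    // local state, so every replica reaches the same state after reading
    // the same log prefix.
    type View struct {
        nextIndex int64
        facts     map[string]bool // stand-in for a real on-disk index
    }

    // Apply consumes entries strictly in log order; any gap is an error.
    func (v *View) Apply(entries []Entry) error {
        for _, e := range entries {
            if e.Index != v.nextIndex {
                return fmt.Errorf("log gap: want %d, got %d", v.nextIndex, e.Index)
            }
            v.facts[e.Data] = true // a deterministic update
            v.nextIndex++
        }
        return nil
    }

    func main() {
        v := &View{facts: make(map[string]bool)}
        err := v.Apply([]Entry{{Index: 0, Data: "<John_Scalzi> <lives> <Bradford>"}})
        fmt.Println(v.facts, err)
    }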

The central log imposes a fundamental bottleneck: the maximum rate of appends to the log determines the maximum rate of change to the entire dataset. In exchange, it makes many features simpler to implement, including cross-partition transactions, consistent queries and historical global snapshots, replication, data migration, cluster membership, partitioning, and indexing the dataset multiple ways. See docs/central_log_arch.md for more details.
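For example, consistent queries fall out of the sequencing almost for free: the API tier can pick a single log index and ask every view to answer as of that index. Here's a Go sketch of that read-at-index idea; ReadAtIndex and the other names are assumptions for illustration, not Beam's real interface:

    package main

    import "fmt"

    // PartitionView is a hypothetical client handle to one view server.
    type PartitionView struct {
        name    string
        applied int64 // highest log index this view has applied
    }

    // ReadAtIndex answers a lookup as of the given log index. A real view
    // would wait until it had applied that far; this sketch just checks.
    func (p *PartitionView) ReadAtIndex(index int64, key string) (string, error) {
        if p.applied < index {
            return "", fmt.Errorf("%s: at index %d, need %d", p.name, p.applied, index)
        }
        return "value-of-" + key, nil // stand-in for a real index lookup
    }

    func main() {
        // Both lookups observe the same log prefix, so the combined
        // answer reflects a single point in time across partitions.
        snapshot := int64(42)
        sp := &PartitionView{name: "sp-view", applied: 50}
        po := &PartitionView{name: "po-view", applied: 45}
        a, _ := sp.ReadAtIndex(snapshot, "<John_Scalzi> <lives>")
        b, _ := po.ReadAtIndex(snapshot, "<lives> <Bradford>")
        fmt.Println(a, b)
    }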

To be more specific, Beam's implementation is shown in the following figure. The interface to the log is modular. Apache Kafka is the current recommended log implementation (when configured to write new log entries durably to disk before acknowledging them). Beam currently includes a single view implementation called a DiskView, which can run in two modes: indexing knowledge graph facts either by subject-predicate or by predicate-object. A typical deployment will run three replicas of multiple partitions of each mode. The DiskViews store their facts in RocksDB (this, too, is modular). The API server contains a sophisticated query processor, which we'll discuss next, and the Transaction Timer is a small process that times out slow-running transactions in case an API server fails.

[Figure: Beam's implementation]
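As a rough illustration of the two DiskView modes, here's how facts might be laid out as ordered keys so that each kind of lookup becomes a prefix scan. The encodings are invented for this sketch and are not Beam's actual on-disk format:

    package main

    import "fmt"

    // Fact is one subject-predicate-object triple.
    type Fact struct {
        Subject, Predicate, Object string
    }

    // spKey orders facts by subject, then predicate: a prefix scan on
    // "sp/<John_Scalzi>/" finds every fact about that subject.
    func spKey(f Fact) string {
        return fmt.Sprintf("sp/%s/%s/%s", f.Subject, f.Predicate, f.Object)
    }

    // poKey orders facts by predicate, then object: a prefix scan on
    // "po/<lives>/<Bradford>/" finds everyone who lives in Bradford.
    func poKey(f Fact) string {
        return fmt.Sprintf("po/%s/%s/%s", f.Predicate, f.Object, f.Subject)
    }

    func main() {
        f := Fact{"<John_Scalzi>", "<lives>", "<Bradford>"}
        fmt.Println(spKey(f)) // key in a subject-predicate DiskView
        fmt.Println(poKey(f)) // key in a predicate-object DiskView
    }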

The API server has far more functionality than that little box would imply: it contains an entire query processor, as shown in the following figure. Beam's query processor implements a query language that's similar to a subset of SPARQL, which is analogous to SQL but for knowledge graphs. The query processor consists of a parser, a cost-based query planner, and a parallel execution engine. The parser transforms an initial set of query lines into an abstract syntax tree (AST). The planner combines the AST with statistics about the data to find an efficient query plan. The executor then runs the plan, using batching and streaming throughout for high performance. The executor relies on a View Client/RPC Fanout module to collect data efficiently from the many view servers. See docs/protobeam_v3.md for more details.

[Figure: Beam's query processor]
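The stages compose roughly as in the following Go sketch; every type and function here is a hypothetical stand-in for Beam's internal packages:

    package main

    import "fmt"

    // AST is the parsed form of the query lines.
    type AST struct{ Patterns []string }

    // Plan is the ordered sequence of lookups chosen by the planner.
    type Plan struct{ Steps []string }

    // parse turns raw query lines into an AST.
    func parse(lines []string) AST { return AST{Patterns: lines} }

    // plan would use statistics about the data to pick a cheap join
    // order; this sketch just keeps the written order.
    func plan(ast AST) Plan { return Plan{Steps: ast.Patterns} }

    // execute runs the plan; a real executor streams batches of bindings
    // between steps and fans each lookup out to the view servers.
    func execute(p Plan) {
        for _, step := range p.Steps {
            fmt.Println("lookup:", step) // stand-in for an RPC fanout
        }
    }

    func main() {
        ast := parse([]string{"?author <wrote> <Old_Mans_War>", "?author <lives> ?place"})
        execute(plan(ast))
    }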

Why we've open-sourced Beam

We learned a lot during our journey with Beam. We first built an in-memory key-value store from scratch (ProtoBeam v1), then iterated to a disk-based property graph (ProtoBeam v2), then to a knowledge graph (ProtoBeam v3). Then, we transitioned from prototype mode to writing production-ready code, with thorough documentation, testing, and reviews. We explored many interesting engineering trade-offs in the process, and we wrote up many detailed snapshot documents that show how our thinking evolved over time; see docs/README.md in the GitHub repo for an overview.

The Beam project has come a long way, but unfortunately, we can't continue working on it full-time to turn it into the polished system we had hoped for. We still think it's a really interesting project that has a nice foundation, and it may be useful for some people:

  • Though we were targeting a production deployment of Beam, lots of other use cases would not need that level of service. Beam could well be used for offline, noncritical, or research applications today.
  • Beam would benefit from additional love, and we'd be excited to see your contributions continue to take the Beam project forward. Take a look through the GitHub issues for ideas on what to contribute.
  • Even without taking Beam as a whole, it has many internal packages that may be useful in other projects, like its fanout and query planner modules.
  • Finally, others might find Beam an interesting project to study or learn from: it's a case study of the central log architecture and an example of a fairly large Go project.

We hope you'll take a look around the project and give it a spin. Please file bugs, feature requests, and questions as Issues on GitHub.
