25

GitHub - buraksezer/olricdb: Embeddable, in-memory and distributed key/value sto...

 6 years ago
source link: https://github.com/buraksezer/olricdb
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

README.md

OlricDB

GoDoc Coverage Status Build Status Go Report Card License

Embeddable, in-memory and distributed key/value store for Go.

WIP

This project is a work in progress. The implementation is incomplete. The documentation may be inaccurate.

Table of Contents

Features

  • Designed to share some transient, approximate, fast-changing data between servers,
  • Accepts arbitrary types as value,
  • Only in-memory,
  • Embeddable,
  • Highly available,
  • Horizontally scalable,
  • Provides best-effort consistency guarantees without being a complete CP solution,
  • Distributes load fairly among cluster members with a consistent hash function,
  • Supports replication by default(with sync and async options),
  • Thread-safe by default,
  • Very simple package API,
  • Time-To-Live(TTL) eviction policy,
  • Offers an HTTP API with built-in Go client,
  • Provides a single-node lock implementation which can be used for non-critical purposes.

Installing

With a correctly configured Golang environment:

go get -u github.com/buraksezer/olricdb

Usage

OlricDB is designed to work efficiently with the minimum amount of configuration. So the default configuration should be enough for experimenting:

db, err := olricdb.New(nil)

This creates an OlricDB object without running any server at background. In order to run OlricDB, you need to call Start method.

err := db.Start()

When you call Start method, your process joins the cluster and will be responsible for some parts of the data. This call blocks indefinitely. So you may need to run it in a goroutine. Of course, this is just a single-node instance, because you didn't give any configuration.

Create a DMap object to access the cluster:

dm := db.NewDMap("my-dmap")

DMap object has Put, PutEx, Get, Delete, LockWithTimeout, Unlock and Destroy methods to access and modify data in OlricDB. We may add more methods for finer control but first, I'm willing to stabilize this set of features.

When you want to leave the cluster, just need to call Shutdown method:

err := db.Shutdown(context.Background())

This will stop background tasks, then call Shutdown methods of HTTP server and memberlist, respectively.

Put

Put sets the value for the given key. It overwrites any previous value for that key and it's thread-safe.

err := dm.Put("my-key", "my-value")

The key has to be string. Value type is arbitrary. It is safe to modify the contents of the arguments after Put returns but not before.

PutEx

Put sets the value for the given key with TTL. It overwrites any previous value for that key. It's thread-safe.

err := dm.PutEx("my-key", "my-value", time.Second)

The key has to be string. Value type is arbitrary. It is safe to modify the contents of the arguments after PutEx returns but not before.

Get

Get gets the value for the given key. It returns ErrKeyNotFound if the DB does not contains the key. It's thread-safe.

value, err := dm.Get("my-key")

It is safe to modify the contents of the returned value. It is safe to modify the contents of the argument after Get returns.

Delete

Delete deletes the value for the given key. Delete will not return error if key doesn't exist. It's thread-safe.

err := dm.Delete("my-key")

It is safe to modify the contents of the argument after Delete returns.

LockWithTimeout

LockWithTimeout sets a lock for the given key. If the lock is still unreleased the end of given period of time, it automatically releases the lock. Acquired lock is only for the key in this map. Please note that, before setting a lock for a key, you should set the key with Put method. Otherwise it returns ErrKeyNotFound error.

err := dm.LockWithTimeout("my-key", time.Second)

It returns immediately if it acquires the lock for the given key. Otherwise, it waits until timeout. The timeout is determined by http.Client which can be configured via Config structure.

You should know that the locks are approximate, and only to be used for non-critical purposes.

Please take a look at Lock Implementation section for implementation details.

Unlock

Unlock releases an acquired lock for the given key. It returns ErrNoSuchLock if there is no lock for the given key.

err := dm.Unlock("my-key")

Destroy

Destroy flushes the given DMap on the cluster. You should know that there is no global lock on DMaps. So if you call Put/PutEx and Destroy methods concurrently on the cluster, Put/PutEx calls may set new values to the DMap.

err := dm.Destroy()

Configuration

memberlist configuration can be tricky and and the default configuration set should be tuned for your environment. A detailed deployment and configuration guide will be prepared before stable release.

Please take a look at Config section at godoc.org

Here is a sample configuration for a cluster with two hosts:

m1, _ := olricdb.NewMemberlistConfig("local")
m1.BindAddr = "127.0.0.1"
m1.BindPort = 5555
c1 := &olricdb.Config{
	Name:          "127.0.0.1:3535", // Unique in the cluster and used by HTTP server.
	Peers:         []string{"127.0.0.1:5656"},
	MemberlistCfg: m1,
}

m2, _ := olricdb.NewMemberlistConfig("local")
m2.BindAddr = "127.0.0.1"
m2.BindPort = 5656
c2 := &olricdb.Config{
	Name:          "127.0.0.1:3636",
	Peers:         []string{"127.0.0.1:5555"},
	MemberlistCfg: m2,
}

db1, err := olricdb.New(c1)
// Check error

db2, err := olricdb.New(c2)
// Check error

// Call Start method for db1 and db2 in a seperate goroutine.

Architecture

Overview

OlricDB uses:

OlricDB distributes data among partitions. Every partition is owned by a cluster member and may has one or more backup for redundancy. When you read or write a map entry, you transparently talk to the partition owner. Each request hits the most up-to-date version of a particular data entry in a stable cluster.

In order to find the partition which the key belongs to, OlricDB hashes the key and mod it with the number of partitions:

partID = MOD(hash result, partition count)

The partitions are distributed among cluster members by using a consistent hashing algorithm. In order to get details, please see buraksezer/consistent. The backup owners are also calculated by the same package.

When a new cluster is created, one of the instances elected as the cluster coordinator. It manages the partition table:

  • When a node joins or leaves, it distributes the partitions and their backups among the members again,
  • Removes empty owners from the partition owners list,
  • Pushes the new partition table to all the members,
  • Pushes the the partition table to the cluster periodically.

Members propagates their birthdate(Unix timestamp in nanoseconds) to the cluster. The coordinator is the oldest member in the cluster. If the coordinator leaves the cluster, the second oldest member elected as the coordinator.

OlricDB has a component called fsck which is responsible for keeping underlying data structures consistent:

  • Works on every node,
  • When a node joins or leaves, the cluster coordinator pushes the new partition table. Then, fsck goroutine runs immediately and moves the partitions and backups to their new hosts,
  • Merges fragmented partitions,
  • Runs at background periodically and repairs partitions i.e. creates new backups if required.

Partitions have a concept called owners list. When a node joins or leaves the cluster, a new primary owner may be assigned by the coordinator. At any time, a partition may has one or more partition owner. If a partition has two or more owner, this is called fragmented partition. The last added owner is called primary owner. Write operation is only done by the primary owner. The previous owners are only used for read and delete.

When you read a key, the primary owner tries to find the key on itself, first. Then, queries the previous owners and backups, respectively. Delete operation works with the same way.

The data(distributed map objects) in the fragmented partition is moved slowly to the primary owner by fsck goroutine. Until the move is done, the data remains available on the previous owners. DMap methods use this list to query data on the cluster.

Please note that, multiple partition owner is an undesirable situation and the fsck component is designed to fix that in a short time.

OlricDB uses HTTP as transport layer. It's suitable to transfer small messages between servers. HTTP/2 is highly recommended for production use because it uses a single TCP socket to deliver multiple requests and responses in parallel.

When you call Start method of OlricDB, it starts an HTTP server at background which can be configured by the user via Config struct.

Consistency and Replication Model

OlricDB is an AP product, which employs the combination of primary-copy and optimistic replication techniques. With optimistic replication, when the partition owner receives a write or delete operation for a key, applies it locally, and propagates it to backup owners.

This technique enables OlricDB clusters to offer high throughput. However, due to temporary situations in the system, such as network failure, backup owners can miss some updates and diverge from the primary owner. If a partition owner crashes while there is an inconsistency between itself and the backups, strong consistency of the data can be lost.

Two types of backup replication are available: sync and async. Both types are still implementations of the optimistic replication model.

  • sync: Blocks until write/delete operation is applied by backup owners.
  • async: Just fire & forget.

An anti-entropy system has been planned to deal with inconsistencies in DMaps.

Eviction

OlricDB only implements TTL eviction policy. It shares the same algorithm with Redis:

Periodically Redis tests a few keys at random among keys with an expire set. All the keys that are already expired are deleted from the keyspace.

Specifically this is what Redis does 10 times per second:

  • Test 20 random keys from the set of keys with an associated expire.
  • Delete all the keys found expired.
  • If more than 25% of keys were expired, start again from step 1.

This is a trivial probabilistic algorithm, basically the assumption is that our sample is representative of the whole key space, and we continue to expire until the percentage of keys that are likely to be expired is under 25%

When a client tries to access a key, OlricDB returns ErrKeyNotFound if the key is found to be timed out. A background task evicts keys with the algorithm described above.

LRU eviction policy implementation has been planned.

Lock Implementation

DMap implementation is already thread-safe to meet your thread safety requirements. When you want to have more control on the concurrency, you can use LockWithTimeout method. It's slightly modified version of Moby's(formerly Docker) locker package. It utilizes sync.Mutex. Take a look at the code for details.

Please note that the lock implementation has no backup. So if the node, which the lock belongs to, crashed, the acquired lock is dropped.

I recommend the lock implementation to be used for efficiency purposes in general, instead of correctness.

Client

OlricDB is mainly designed to be used as an embedded DHT. So if you are running long-lived servers, OlricDB is pretty suitable to share some transient, approximate, fast-changing data between them. What if you want to access the cluster in a short-lived process? Fortunately, OlricDB has a simple HTTP API which can be used to access the cluster within any environment. It will be documented soon.

A Golang client is already prepared to access and modify DMaps from outside. Here is the documentation.

Planned Features

  • Anti-entropy system to repair inconsistencies in DMaps,
  • LRU eviction policy,
  • Eviction listeners, if it's reasonable and easy to build,
  • Python client.

We may implement different data structures such as list, queue or bitmap in OlricDB. It's highly depends on attention of the Golang community.

Sample Code

The following snipped can be run on your computer directly. It's a single-node setup, of course:

package main

import (
	"context"
	"fmt"
	"log"
	"reflect"
	"strconv"
	"time"

	"github.com/buraksezer/olricdb"
)

type customType struct {
	Field1 string
	Field2 uint64
}

func main() {
	// This creates a single-node OlricDB cluster. It's good enough for experimenting.
	db, err := olricdb.New(nil)
	if err != nil {
		log.Fatalf("Failed to create OlricDB object: %v", err)
	}

	go func() {
		// Call Start at background. It's a blocker call.
		err = db.Start()
		if err != nil {
			log.Fatalf("Failed to call Start: %v", err)
		}
	}()

	// Put 10 items into the DMap object.
	dm := db.NewDMap("bucket-of-arbitrary-items")
	for i := 0; i < 10; i++ {
		c := customType{}
		c.Field1 = fmt.Sprintf("num: %d", i)
		c.Field2 = uint64(i)
		err = dm.Put(strconv.Itoa(i), c)
		if err != nil {
			log.Printf("Put call failed: %v", err)
		}
	}

	// Read them again.
	for i := 0; i < 10; i++ {
		val, err := dm.Get(strconv.Itoa(i))
		if err != nil {
			log.Printf("Get call failed: %v", err)
		}
		fmt.Println(val, reflect.TypeOf(val))
	}

	// Don't forget the call Shutdown when you want to leave the cluster.
	ctx, _ := context.WithTimeout(context.Background(), 10*time.Second)
	err = db.Shutdown(ctx)
	if err != nil {
		log.Printf("Failed to shutdown OlricDB: %v", err)
	}
}

To-Do

  • Document the code,
  • Some parts of FSCK implementation is missing: It currently doesn't repair failed backups,
  • Design & write benchmarks,
  • Document the external HTTP API,
  • Build a website for OlricDB and create extensive documentation.

Caveats

OlricDB uses Golang's built-in map. It's known that the built-in map has problems with the GC:

OlricDB already uses map[uint64]interface{} as underlying data structure. It should work fine for most of the cases.

I have implemented an off-heap hash table with mmap. We may add an option to use it in the future but my implementation needs too much effort to be used in production.

Contributions

Please don't hesitate to fork the project and send a pull request or just e-mail me to ask questions and share ideas.

License

The Apache License, Version 2.0 - see LICENSE for more details.

About the name

The inner voice of Turgut Özben who is the main character of Oğuz Atay's masterpiece -The Disconnected-.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK