0

Mozilla Firefox Design Review: Key-Value Storage

 收集于1周前 阅读数 0

Design Review: Key-Value Storage

We propose the standardization of a simple key-value storage capability, based on LMDB, that is fast, compact, multi-process-capable, and equally usable from JS, Java, Rust, Swift, and C++.

Document status

  • 2018-03-20: review concluded; see notes here .
  • 2018-03-05: wrapped up a concrete proposal.
  • 2018-02-22: lots of input from gcp. Added section about content process use. Added analysis of SQLite.
  • 2018-02-21: positive feedback from mak; no document comments yet.
  • 2018-02-13: input from the Firefox team, inc. Florian and Felipe.
  • 2018-02-12: restructured open questions, which was becoming huge thanks to answers!
  • 2018-02-09: added extensive notes around NFS shares and locking (thanks Mossop and mkaply). Discussion with Mossop re management of change and versioning.
  • 2018-02-07: consulted njn.
  • 2018-02-06: suggestions from nanj and snorp applied.
  • 2018-02-04: suggestions from Myk applied. Appendix A added. Sent to snorp and nanj.
  • 2018-02-02: rough draft

Roles

  • Reviewer : dtownsend
  • Chair : jwalker
  • Proposer : rnewman, Browser Architecture

Summary

There is no one-size-fits-all solution for storage.

The best we can do to minimize engineering complexity is to cluster the varied requirements of various consumers around particular points in the solution space, reducing our large number of unmaintained ad hoc solutions into a smaller number of supported pieces of infrastructure.

The exploration effort that’s now underway as a result of 2017’s Sync & Storage Roadmap Review primarily addresses one such point: the capabilities required for structured, evolving, reusable, long-lived, syncable user data — data around which Firefox user-facing features are built.

The subject of this document is the polar opposite of that point: data that is unstructured or primitive, non-evolving (or at least with no advances on the current state of the art for managing change), special-purpose, and with no particular storage-level effort expended on reliable sync, if it syncs at all.

Components that store this kind of data often have a different set of required capabilities: a need for very fast reads, reads and writes from multiple processes, synchronous reads, control over filesystem impact, etc .

Examples of data like this are configuration data, certificate revocation lists, session storage, visited link caches, and simple extension storage.

The existing implementations of these features, if they exist, tend to use what’s available — flat text files, JSON files, Kinto, IndexedDB, libpref, or non-persisted in-process data structures repopulated via IPC or messaging. In doing so they tend to work around missing capabilities in, and necessary limitations of, those tools. Some of these tools are hard to access from different languages, and most are next to impossible to use correctly in an embedding context.

Brief

We propose ‘buying’, not building, the core of such a solution, and wrapping it in idiomatic libraries that we can use on all platforms.

We propose that LMDB is a suitable core (see Appendix A for options considered): it is compact (32KB of object code), well-tested, professionally maintained, reliable, portable, scales well, and is exceptionally fast for our kinds of load . We have engineers at Mozilla with prior experience with LMDB, and their feedback is entirely positive.

This key-value storage system can be targeted at a number of new consumers that might otherwise need to roll their own storage, or (ab)use preferences:

  • HSTS etc. download storage. (Waiting on survey from mgoodwin.)
  • Devtools UI configuration.
  • A partner-acceptable data storage and configuration approach for GeckoView and other embedding projects. (More detailed conversations need to happen around this.)
  • New mobile applications.

and potentially displace or support a number of existing solutions:

  • The atypical “application” consumers of preferences. These consumers don’t use defaults, and don’t use prefs’ overlaying capabilities, but do want a simple API, fast (possibly synchronous) reads, ACID, and richer types ( e.g. , non-stringified timestamps). They currently use libpref because it was the most suitable option at the time they were built.
  • Cross-process preferences usage, which currently uses command-line arguments and IPC to replicate data between processes . Note that doing so would require filesystem access from each process, which might make it unsuitable for the content process; see below.
  • Limited-size flat file systems like autofill, login manager, etc.
  • XUL Storage
  • Session store ?

It should provide a sensible, scalable, and efficient default choice for consumers that now use hand-rolled key-value systems.

The core wrapping library provides safety (both padding LMDB’s sharp-edged API and also protecting against system limitations like NFS locking), improves developer ergonomics ( e.g. , tracking value types, seamlessly resizing databases, even automatic sharding), and consolidates best practices ( e.g. , asynchronous reads and writes, versioning).

Idiomatic per-environment APIs can then take care of platform integration on top of a shared representation, managing things like string conversions, Future / Promise / Deferred , etc .

We expect the resulting APIs to be simpler (and more opinionated) than LMDB’s, and for ongoing maintenance of language wrappers to be relatively cheap. We have recent and relevant experience with Rust wrappers for C++, Swift, and Java, so we’re not too worried about this.

Explicit non-goals for this proposal are:

  • Displacing SQLite for relational consumers, consumers that require full-text search, or very large databases.
  • Implementing synchronization within the data store itself.
    • It’s likely that some consumers will replicate data in, and it’s possible that some consumers will build syncing on top, just as we do for prefs today, but — unlike with our other efforts — we’re not trying here to make syncing a supported part of the storage system itself.
  • Implementing the full set of capabilities of libpref ( e.g. , distribution configuration) within the storage system itself.
  • Addressing storage of data that must be human-editable at rest without specialized tooling.

Not-yet or nevergoals for this proposal are:

  • Standardization via a standards body as a web API.
  • Encryption (though there might be touch points with Lockbox’s secure device storage concept).

Content process use

Particular thanks to Randell and gcp for this input .

One appealing aspect of LMDB is its relative ease of use from multiple processes, above and beyond its basic capabilities as yet-another-fast-key-value-store.

We expect to have multiple non-content processes in Firefox that could use this capability.

We also expect to have lots — potentially dozens or hundreds — of content processes. Naturally it’s not desirable from a security perspective to allow arbitrary reads and writes, and arbitrary filesystem access, from content processes. There is a tension between performance, reducing memory usage, etc . and reduced attack surface — we can have a locked-down content process, or efficient sharing, but not both.

LMDB uses mmap, which requires some amount of filesystem access. It requires write access to the database lock file for coordination. In ordinary use, the memory-mapped region is read-only, with only the lock file requiring writes.

There are several possibilities for use from the content process.

  • Full read-write via mmap. This would require whitelisting and careful auditing, and is still a concern.
  • Read-only access via mmap, either by whitelisting the lock file for writes, or building our own non-filesystem locking ( e.g. , using shmem). Writes from content would have to be asynchronous over IPC. It’s possible that we’ll build our own lock manager anyway in order to achieve robustness in the face of NFS mounts, so this might be cheap and relatively safe when considering amortized costs.
  • Read-only in-memory use by modifying LMDB to be able to use anonymous mmap. In theory we can then flush to disk for persistence. This would essentially be a standardized atomic key-value interface on top of shared memory.
  • Don’t use LMDB from the content process: use IPC to access storage in another process. This is what we do now, if you ignore the storage technology on the other end.

We do not propose solving this up-front: the worst-case scenario is no different to what we do now, but with the benefit of easier coordination between non-content processes, and it’s possible to make different choices in the future.

Open and partially answered questions

32-bit systems

32-bit systems have limited address space for memory-mapped files. Will this restriction bite us? How large can we make data stores without running into these issues?

LMDB files’ maximum size is specified at opening time. Address space usage is dictated by this maximum size.

64-bit users make up at least 80% of our hardware report (we under-count). On Android we don’t yet ship ARM 64 , but intend to do so this year.

Preliminary survey of crashstats suggests that we’ll be totally fine for reasonable (< 1MB or < 256KB) database volumes.

oom-small is about 4% of our crash rate. Most of those are genuine memory pressure: 2GB or 4GB devices with 75%+ memory usage. Quite a few of the reports are nonsense — failure to allocate 8 bytes with gigs of free RAM and VM.

If I search for oom, <4GB total virtual memory, memory usage <80%, and < 30 seconds uptime — that is, a startup OOM where we shouldn’t actually have run out of memory — we get about 350 oom-small crashes per week across Firefox 58-60. 10% of those are Android, and the rest are Windows.

At those rates, typical LMDB usage should be lost in the noise. It might even reduce OOMs: the entire mmapped file occupies address space, but the actual resident set will be less than snarfing an entire JSON file into memory and causing an oom-large in the JS engine or in file-writers like ChunkedJSONWriteFunc , both of which appear in crash-stats.

Windows support

How well does LMDB work on Windows? About 85% of our desktop users are on Windows.

Reading suggests that as of 2016, LMDB should work just fine, even using sparse files correctly, but we should experimentally verify.

Android support

How well does LMDB work on our supported Android versions?

Remote filesystem support

LMDB’s documentation recommends not using it on network filesystems. The principal obstacle is the lack of advisory locking.

We already have some accommodations for SQLite on NFS. It’s possible that we could apply the same. Depending on exactly which issues occur — locking, syncing, or other — we might need to find a variety of approaches; the ultimate workaround is to put a ‘working copy’ of each LMDB file in temporary storage, and use the atomic snapshot feature to copy them back to the profile in the background at some point before crash/quit. Some notes:

Being careful

LMDB has a number of straightforward restrictions: no multiple opens from a single process, no multiple top-level transactions open at the same time within a single thread, careful handling of locks when processes crash, the need to clean up stale readers if they crash. Will these restrictions be onerous? Current work suggests that a layer or two of Rust helps a lot with the sharp edges.

Defaults

What kinds of file sizes and defaults give reasonable behavior on our supported platforms? How changeable are these later? LMDB requires some amount of tuning and specifying limits ( e.g. , maximum number of open named databases in an environment).

Binary storage

How can we protect ourselves against variations in platform/runtime encoding issues? Endianness, string encoding, Rust repr…? Initial exploratory work used bincode for this, but we need to validate.

Resilience to corruption

What level of external corruption ( e.g. , disk issues) or internal corruption ( e.g. , writing malformed data) do we need to handle? We are very careful about things like session storage ( R: how careful? Do we handle a failure to parse due to corrupt JSON? Do we handle a failure to parse the compressed file due to corruption?) , but many of our existing stores either fail to init, breaking the browser, or discard the corrupt file. Do we need to do better? Is LMDB more or less prone to issues?

Are customers who need recoverability from corruption adequately served by other points in the solution space, or by features of this one ( e.g. , LMDB’s ability to take consistent live backups)?

LMDB’s live backup seems like it could be used to build a sensible backup/migration strategy, but it would be good if we could understand to what extent we consider the requirement important before investing effort into building capabilities. Even if we have some potential consumers, maybe those aren’t the first ones.

Observers/change notifications

What approaches to observers/notifications make sense? We don’t get this out of the box for most stores, particularly not cross-process; we’re breaking new ground. We can learn some lessons from observer and preference change notifications in Firefox.

Performance

Does a memory-mapped solution give us:

  • Acceptable startup time compared to prefs or flat files?
  • Acceptable shutdown time? (It should: LMDB is crash-safe.)
  • A reduction in the number of fsyncs compared to, e.g. , JSONFile ?

Can LMDB beat IPC-coordinated shmem or flat file solutions for for inter-process data sharing?

Shipping vehicle(s)

It makes sense to have at least one target consumer in mind when building out a capability. We have lots of options; which ones will have the biggest potential upside, the lowest risk, and offer the best learning opportunities?

Cross-process, cross-language use examples

How a Java app and GeckoView might collaborate on the visited link set or configuration data

Chrome shares its visited link set between the main process and content process via a complicated shared memory scheme: the main process constructs chunks and coordinates handoff to content processes via IPC.

In Firefox for Android we use JNI to allow the Gecko main thread to read from and add to a Java-side data structure that serves a similar purpose.

GeckoView will, eventually, distinguish between the application process (started by Android, runs the enclosing Java application) and Gecko’s main process. These two processes will communicate via Binder .

In contrast to Fennec, application data — including history data — won’t be in the same process as the Gecko main process. Sharing persistent data in this situation is well-suited to LMDB: the same LMDB database could be opened from both the application (Java) process and the Gecko main process. The two processes would have a shared understanding of the data format used and the file path and settings required. Given those things, the application could add and remove entries from the database, and the Gecko main process could read (and, indeed, write) without additional coordination.

Planning

Executing on this proposal consists of five overlapping phases.

  1. Evaluation. Given the set of open questions, research, measure (either through prototypes or through probes in Firefox itself), or write code in order to answer the question. In particular, the following:
    1. Windows support.
    2. Android support.
    3. NFS.
  2. Design. It’s likely that some kind of profile-linked manager will be needed to control access to storage. An API needs to be defined and documented. Design and evaluation are intertwined.
  3. Productization. Produce and document:
    1. A core library (using the stable Rust wrapper for safety and ergonomics).
    2. Build integration with mozilla-central and Gecko, and an idiomatic interface that can be consumed by C++ code.
    3. … and chrome JS code.
    4. A Swift wrapper around our core Rust library.
    5. A Java library that allows for independent use of that library, and shared data use with Gecko.
    6. Baseline performance tests to make sure we stay fast.
  4. Leverage. Use these libraries in consumers, migrating older systems as appropriate. We propose the following as the initial set of consumers, depending on staffing:
    1. XULStore (existing).
      • Addresses a concern raised by the XUL replacement work. Replaces a JS XPCOM component that’s in the hot path for window creation. Stores a manageable amount of data.
    2. Search service cache ( search.json.lz4 ).
      • Binary data storage: opportunity for size wins. The component currently compresses to halve disk usage… but 80% of that space usage is base64 and percent-encoded image data!
      • Read asynchronously as the first thing the component does, so a good perf challenge.
      • It’s a cache: no need to address migration, and rollback is easy.
      • Opportunity to rework how the component manages state ( e.g ., no need to read the descriptions of each engine during init).
    3. Devtools configuration (new).
      • Meets an unmet need, and involves figuring out content process communication.
    4. Security storage (new/replacement).
      • We currently store some certificate data (revocations, whitelists, HSTS, etc.) in a mixture of Kinto (for delivery) and plain-text files.
      • This data needs a mixture of synchronous and asynchronous access.
      • Infrequently updated. Can be repopulated from the server if necessary.
      • Blob storage.
      • Two interesting needs: ideally this data is shared between profiles (no sense in having it copied), and usable on Android, iOS, and from C++, Rust (soon), and JS in Gecko.
    5. A new Android capability to support new Fennec and GeckoView work.
      • Exercises another platform, and explores how the system meets the needs of a diverse set of developers.
  5. Analysis of other consumers, existing and nascent. There are dozens of stores in Firefox on each platform. Some of these will be better suited to this kind of key-value store than their existing storage system (and some will be better suited to other capabilities that we’re building). Still others will not be worth the effort to port. Assess each consumer along axes: size, complexity, risk, technical debt, cross-platform costs and potential value. This will inform subsequent porting efforts.

We have begun building a simple production-quality Rust wrapper to implement storage of typed values and to begin to impose an opinionated and safe interface. This can be used for evaluation and as a testbed for design, ultimately being stabilized and productized.

Footnotes

  • 1 : In this set I include the preferences that we currently pass to content processes via command-line arguments (“early prefs”) and IPC messages (“late prefs”).
  • 2 : For example, Firefox for Android still uses the SQLite login manager storage, because it needs concurrent access from Java and Gecko; flat files don’t make that easy.
  • 3 : It’s possible that a key-value storage system could be used as the backend for libpref, just as we currently use prefs.js. Randell reports that each content process uses approximately 350KB to keep copies of preferences; that’s likely to be reduced in a shared memory system.
  • 4 : “Do not use LMDB databases on remote filesystems, even between processes on the same host. This breaks flock() on some OSes, possibly memory map sync, and certainly sync between programs on different hosts.” — http://www.lmdb.tech/doc/
  • 5 : Unless MDB_NOTLS is specified, tying transactions to the transaction object instead of to the thread. ‘Defaults’ and ‘Being careful’!

Appendix A: survey of key-value stores

IndexedDB

A web-standard storage abstraction layer. In Chrome, uses LevelDB for storage. In Firefox, currently backed by SQLite. Both use remoting for multi-process access. Only accessible from JS.

SQLite

Well-tested and already ships in Firefox. Crypto available via SQLCipher. In-memory (non-persistent) storage is easy. Built-in VACUUM is more convenient than compaction in LMDB. Disk sizes are approximately equivalent to LMDB. Known good behavior on 32-bit platforms.

WAL versus COW — writes, and database opens after crashes, are less predictable due to WAL checkpointing. Reads likely to be on the order of microseconds, not nanoseconds; not suitable to back a synchronous API. Reader concurrency not quite as good as LMDB. SQLite will have a higher baseline of memory usage: real allocated memory per connection, not mapped virtual memory.

Benchmarking shows SQLite to perform relatively poorly on KV workloads when compared to LMDB. In sequential reads on published benchmarks LMDB has 47x throughput (I measured 80x for string keys). In random reads LMDB has 9x throughput. Sequential writes, 8x; random writes 5.7x. Batching makes writes significantly faster for LMDB thanks to amortized cost of COW, and reading sequential keys is very fast.

Using SQLite instead of a specialized key-value store would trade performance (particularly read performance) for existing deployment experience.

SQLite is the only real contender for delivering this capability in Firefox. It is conceivable that we could use the same opinionated interface to front both SQLite and LMDB. There is negligible increase in object size to ship both, because LMDB is so small.

My own runs of benchmarks, adjusting the SQLite schema and using WAL, with key times highlighted in bold:

String keys:

SQLite:

readrandom : 8.110 micros/op 123310 ops/sec; (1000000 of 1000000 found)
**readseq : 2.586 micros/op 386687 ops/sec; 36.9 MB/s**
readreverse : 2.580 micros/op 387580 ops/sec; 37.0 MB/s

fillrandsync : 82.125 micros/op 12176 ops/sec; 1.3 MB/s (1000 ops)
fillrandom : 24.933 micros/op 40107 ops/sec; 4.4 MB/s
**fillrandbatch : 16.004 micros/op 62483 ops/sec; 6.9 MB/s**
fillseqsync : 31.004 micros/op 32253 ops/sec; 3.6 MB/s (1000 ops)
fillseq : 12.433 micros/op 80430 ops/sec; 8.9 MB/s
fillseqbatch : 2.108 micros/op 474321 ops/sec; 52.5 MB/s
overwrite : 36.793 micros/op 27179 ops/sec; 3.0 MB/s

LMDB:

readrandom : 1.070 micros/op 934143 ops/sec; (1000000 of 1000000 found)
**readseq : 0.032 micros/op 31309684 ops/sec; 3463.7 MB/s**
readreverse : 0.023 micros/op 42877969 ops/sec; 4743.4 MB/s

fillrandsync : 166.412 micros/op 6009 ops/sec; 0.7 MB/s (1000 ops)
fillrandom : 5.176 micros/op 193206 ops/sec; 21.4 MB/s

Integer keys (faster reads, slightly smaller file):

SQLite:

readrandom : 4.258 micros/op 234843 ops/sec; (1000000 of 1000000 found)
readseq : 0.281 micros/op 3560505 ops/sec; 339.6 MB/s
readreverse : 0.261 micros/op 3826081 ops/sec; 364.9 MB/s

LMDB:

readrandom : 0.548 micros/op 1825317 ops/sec; (1000000 of 1000000 found)
readseq : 0.028 micros/op 35937612 ops/sec; 3701.5 MB/s
readreverse : 0.019 micros/op 53358945 ops/sec; 5495.8 MB/s

fillseqsync : 165.802 micros/op 6031 ops/sec; 0.7 MB/s (1000 ops)
fillseq : 3.360 micros/op 297576 ops/sec; 32.9 MB/s
fillseqbatch : 0.504 micros/op 1983717 ops/sec; 219.5 MB/s
overwrite : 5.099 micros/op 196105 ops/sec; 21.7 MB/s

LevelDB

Google-sourced. Implemented in Go. Not lightweight (306KB macOS release dylib). Non-ACID, no transactions: only read-snapshots and atomic batch writes. Poor reputation for data loss and consistency bugs . Writes are automatically compressed, spending CPU to get reasonable writes. Coarse-grained locking. LSM trees are write-optimized. Not multiprocess . Windows support is second-class , and iOS regressions occur .

LSM tree LevelDB-alikes derivatives (RocksDB (Facebook), HyperLevelDB, Basho’s LevelDB fork etc.)

Dgraph’s summary :

LSM tree based stores provide high write performance by sequentially writing key-value pairs in memory in the foreground, and then arranging them in multi-tiered levels on disk in the background. This approach is not without tradeoffs though. Values are written multiple times when arranging them in the tree (known as write amplification) and a single read might need to read multiple levels in the LSM tree before finding a value (known as read amplification).

These various libraries all aim to be faster or more scalable than LevelDB, typically by improving thread parallelism or altering how compaction works ( e.g. , RocksDB doesn’t use level-based compaction ). All are targeted at server workloads.

Bolt

A Go implementation of LMDB . B+tree. Not multiprocess .

MDBM

A memory-mapped hash/cache (it would be a stretch to call it persistent) implemented in C, with C++ and Perl bindings. Loses data by default (manual sync), not resistant to power loss/crash, vulnerable to data corruption. No transactions. Keys aren’t sorted, so no range queries or cursors. Building your own persistent storage/log is recommended.

https://github.com/yahoo/mdbm

https://news.ycombinator.com/item?id=8732891

Kyoto Cabinet

Reportedly very slow. Not multiprocess.

BerkeleyDB

Very long pedigree. For our purposes, essentially obsoleted in every way by LMDB.

LMDB

A distillation of experience with BerkeleyDB and use in OpenLDAP. Implemented in C.

Dgraph gives an excellent brief summary :

LMDB provides a key-value stored using B+ trees . It has ACID semantics and support for transactions. It does not do any background work and has a crash resilient design that uses MVCC instead of locking. This means readers operate on an isolated snapshot of the database and don’t block. The codebase is quite small and portable, so LMDB runs across multiple platforms including Android, BSD, and Solaris. This talk by Howard Chu at Databaseology 2015 goes into much more details about LMDB’s design.

It is generally understood that B+ tree based stores are ideal for workloads which are read intensive.

Read-optimized, very lightweight (32KB), predictable performance. Copy-on-write mmap B+tree. Zero-copy reads . Handles larger values better than log-based approaches, with less write amplification. Database sizes are limited by address space (~128TB on 64-bit machines). Single-writer multiple-reader (linear scaling in readers, and total throughput reportedly better with serialized writes).

nanj’s experience :

Some less well-known merits of LMDB:

  • You can get a quite good performance out of the box, without any fuss or configuration like RocksDB.
  • Its cursor iterator is awesome, very powerful and flexible.
  • It also supports integer key and value, dup keys, and whole bunch of other nice value types.
  • You can have multiple sub-dbs in a single database, each of them could be of different type (integer key, dup key etc.)
  • We’ve never run into a single case of database corruption before, even for those traffic (both write and read) intensive applications. See more in this OSDI paper: Torturing Databases for Fun and Profit

Some lessons we’ve learnt:

  • You will need to choose a maximum database size upon its creation. Since LMDB uses MVCC and shadow paging to implement ACID, long running read transactions would increase the page usage quickly, potentially fill up the mmap, and the write transaction will abort and leave the database in an unwritable state until page reclaiming is performed. Resizing the database on the fly is feasible, although it needs certain amount of cooperation between readers and writers.
  • LMDB by default has a maximum key size set as 512 bytes, there is a compile time configuration to change it, though a larger key may have performance impact too.
  • Random write is just OK, unlike other log-based data stores.
  • Lack of built-in compression usually leads to a higher disk usage.
  • Fully relying on OS’s page management may block the way of certain optimization.
  • Some database maintenances are still needed, LMDB provides various APIs (e.g mdb_reader_check) to clean up zombie readers (due to the improper transaction termination) so that the unused pages could be reclaimed. No downtime needed though, could be done in the background.
  • The code base of LMDB is not very readable ;), at least comparing to SQLite.
  • Howard Chu usually is quite responsive on bugs and feature requests, also he is a quite opinionated guy, you can see his comments on HN :)

https://banksco.de/p/lmdb-the-leveldb-killer.html

https://symas.com/is-lmdb-a-leveldb-killer/

ActorDB

Distributed SQL database.

https://github.com/biokoda/actordb

Built on top of SQLite and LMDB: it replaces SQLite’s pager with LMDB, and hooks into SQLite’s WAL for Raft-based replication between actors. Developed for applications such as file syncing.

nanj’s observations :

Let me share some observations on this project here. Disclaimer, I’ve never used it before, all my understanding on it is from following blog posts and skimming its source code.

http://blog.biokoda.com/post/112206754025/why-we-built-actordb

http://blog.biokoda.com/post/133121776825/actordb-how-and-why-we-run-sqlite-on-top-of-lmdb

What’s ActorDB

In short, the ActorDB team wants to build a file synchronisation solution anyone could install and use.

“The best way we can describe ActorDB is that it is an ideal server-side database for apps. Think of running a large mail service, dropbox, evernote, etc. They all require server side storage for user data, but the vast majority of queries to the database is within a specific user”

The author believes that a distributed SQL database is the best storage solution to this.

Features

  • Server side storage with SQL including transaction support (unlike KV store or document store)
  • No single point of failure (user data is replicated among multiple nodes by Raft)
  • Horizontally scalable (each user has its own database, so they call that ActorDB)

Other Observations

  • In the early versions, they use SQLite as both the SQL engine and the storage engine, and hack SQLite’s WAL to implement Raft. According to the second article, although they manage to combine all the WALs into a single one, it is not ideal since every user has a separate SQLite base file.
  • Then, they choose LMDB as the storage engine to mitigate the drawback above. They again hack the SQLite’s WAL module so that it completely bypasses the SQLite’s b-tree based storage layer, instead stores the actual data (i.e. pages) and Raft logs on LMDB. They find LMDB is perfect engine for their use case, since LMDB supports duplicate keys with sorted value (i.e. dupsort key) as well as multiple sub-dbs within a single database.
  • Other components are written in Erlang. Also worth pointing out that the server application, the SQL engine, and the storage engine are tightly coupled with each other in this project. To my understanding, there is no clear boundaries between subsystems. This makes ActorDB hard to extend and evolve.
  • Due to its design, i.e. one db for each user, certain SQL features are not supported, like cross user table joins, ATTACH database etc.

Although it’s unclear how exactly ActorDB works out in production, and some other limitations in its design & implementation. I think it still provides us with some interesting points for further investigation, such as, the approach of integration LMDB with SQLite. Introduction of WAL replication via Raft.

Questions for the review:

  • Let’s discuss failure modes.
    • Durability/consistency
    • Failure to open
    • Performance
    • Corruption (on read)
    • Upgrades/Downgrades
  • How well does LMDB work across various platforms?
    • Android
    • iOS
    • Windows
    • macOS
    • Linux
  • What are some indicators that RKV is the wrong choice?
    • Definition of ‘wrong’: wrong API vs wrong storage backend
    • Syncability and encryption
    • Effects on downstream Firefox builds
    • Slow write performance
  • There’s going to be lots of knowledge about how to do sync and storage at Mozilla, how are we going to impart that?
  • What’s our confidence level about performance?
  • How many RKV DBs will we have and how will they be organized?
    • Content/main process?
    • Fast/wired? Early-stage startup.
    • Topic?
    • Cached or non-cached
    • Navigation by key and multiple values per key.
  • At what point do we have to tackle NFS support?
查看原文

猜你喜欢

关于头条


聚合每日国内外有价值,有趣的链接。