Discord Migrates Trillions of Messages from Cassandra to ScyllaDB

Jun 22, 2023 2 min read

Discord has migrated trillions of message records from Apache Cassandra to ScyllaDB, reducing the size of the largest cluster from 177 Cassandra nodes to 72 ScyllaDB nodes and reducing tail latencies for reads and writes. The move has unlocked new product use cases because of the improved database stability and performance.

As Discord grew, it migrated its data from MongoDB to Cassandra in 2017 because it was looking for a scalable database to handle ever-growing data volumes. Initially, the Cassandra cluster consisted of 12 nodes and stored billions of messages. Still, after five years, the cluster had 177 nodes and was frequently experiencing performance problems, forcing the team to reduce some of the maintenance operations, which became too expensive to run.

Some performance issues were caused by hot partitions resulting from the table schema design, where partitioning was based on the Discord channel and time bucket. Bo Ingram, a senior software engineer at Discord, explains the impact of hot partitions on the database cluster:

When we encountered a hot partition, it frequently affected latency across our entire database cluster. One channel and bucket pair received a large amount of traffic, and latency in the node would increase as the node tried harder and harder to serve traffic and fell further and further behind. [...] Since we perform reads and writes with a quorum consistency level, all queries to the nodes that serve the hot partition suffer latency increases, resulting in a broader end-user impact.

Based on the experimenting and testing done internally, the team has decided to move its data across all clusters to ScyllaDB. They opted for ScyllaDB primarily to improve performance, including avoiding garbage-collection-related issues they experienced with Cassandra. They also worked with the ScyllaDB team to improve some use cases they depended on, like reverse queries.

After migrating all smaller clusters by 2020, the team prepared to migrate the biggest cluster, containing trillions of messages. To minimize the hot partition problem, they created a new intermediary service layer in their architecture, named data services, written in Rust and interfaced via gRPC API.

One important responsibility of data services is request coalescing, which avoids multiple database calls when many users request the same message. Secondly, the team implemented consistent hash-based routing to data service instances based on a routing key, such as a channel id. Together, these changes significantly reduced hot partition problems, giving Discord extra time to prepare for the big migration.

Source: https://discord.com/blog/how-discord-stores-trillions-of-messages

For the migration itself, the team first considered using ScyllaDB’s Apache Spark migrator but, in the end, decided to implement a bespoke solution in Rust with SQLite used for checkpointing, which allowed them to shorten the migration time from three months to nine days. After addressing some minor hiccups, they validated the completed migration and switched to ScyllaDB in May 2022. Since then, the new cluster has proven stable and provided consistent performance, which, together with the data service layer, allowed it to handle extra traffic generated by the World Cup gracefully.

About the Author

Rafal Gancarz

Rafal is an experienced technology leader and expert. He's currently helping Starbucks make its Commerce Platform scalable, resilient and cost-effective. Previously, Rafal has been involved in designing and building large-scale, distributed and cloud-based systems for Cisco, Accenture, Capita, ICE, Callsign and others. His interests span architecture & design, continuous delivery, observability and operability, as well as sociotechnical and organisational aspects of software delivery.

Discord Migrates Trillions of Messages from Cassandra to ScyllaDB

Discord Migrates Trillions of Messages from Cassandra to ScyllaDB

About the Author

Rafal Gancarz

Recommend

GitHub - scylladb/gocqlx: Go toolkit for ScyllaDB and Apache Cassandra®

号称十倍性能于Cassandra的ScyllaDB，究竟祭出了哪些技术"利器"？

GitHub - scylladb/go-set: Type-safe, zero-allocation sets for Go

Improving Language Models by Retrieving from Trillions of Tokens

The Four Innovation Phases of Netflix’s Trillions Scale Real-time Data Infrastru...

To-go coffee cups shed trillions of plastic particles under normal use

Davos FinTech - we lost trillions on crypto, but don’t worry it was unregulated!...

How Discord Stores Trillions of Messages

El Niños Cause Trillions in Lost Economic Growth, Study Shows

NASA now believes our galaxy is home to trillions - not billions - of rogue plan...

About Joyk