

Discord Migrates Trillions of Messages from Cassandra to ScyllaDB
source link: https://www.infoq.com/news/2023/06/discord-cassandra-scylladb/?itm_source=infoq&itm_medium=popular_widget&itm_campaign=popular_content_list&itm_content=
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Discord Migrates Trillions of Messages from Cassandra to ScyllaDB
Jun 22, 2023 2 min read
Discord has migrated trillions of message records from Apache Cassandra to ScyllaDB, reducing the size of the largest cluster from 177 Cassandra nodes to 72 ScyllaDB nodes and reducing tail latencies for reads and writes. The move has unlocked new product use cases because of the improved database stability and performance.
As Discord grew, it migrated its data from MongoDB to Cassandra in 2017 because it was looking for a scalable database to handle ever-growing data volumes. Initially, the Cassandra cluster consisted of 12 nodes and stored billions of messages. Still, after five years, the cluster had 177 nodes and was frequently experiencing performance problems, forcing the team to reduce some of the maintenance operations, which became too expensive to run.
Some performance issues were caused by hot partitions resulting from the table schema design, where partitioning was based on the Discord channel and time bucket. Bo Ingram, a senior software engineer at Discord, explains the impact of hot partitions on the database cluster:
When we encountered a hot partition, it frequently affected latency across our entire database cluster. One channel and bucket pair received a large amount of traffic, and latency in the node would increase as the node tried harder and harder to serve traffic and fell further and further behind. [...] Since we perform reads and writes with a quorum consistency level, all queries to the nodes that serve the hot partition suffer latency increases, resulting in a broader end-user impact.
Based on the experimenting and testing done internally, the team has decided to move its data across all clusters to ScyllaDB. They opted for ScyllaDB primarily to improve performance, including avoiding garbage-collection-related issues they experienced with Cassandra. They also worked with the ScyllaDB team to improve some use cases they depended on, like reverse queries.
After migrating all smaller clusters by 2020, the team prepared to migrate the biggest cluster, containing trillions of messages. To minimize the hot partition problem, they created a new intermediary service layer in their architecture, named data services, written in Rust and interfaced via gRPC API.
One important responsibility of data services is request coalescing, which avoids multiple database calls when many users request the same message. Secondly, the team implemented consistent hash-based routing to data service instances based on a routing key, such as a channel id. Together, these changes significantly reduced hot partition problems, giving Discord extra time to prepare for the big migration.

Source: https://discord.com/blog/how-discord-stores-trillions-of-messages
For the migration itself, the team first considered using ScyllaDB’s Apache Spark migrator but, in the end, decided to implement a bespoke solution in Rust with SQLite used for checkpointing, which allowed them to shorten the migration time from three months to nine days. After addressing some minor hiccups, they validated the completed migration and switched to ScyllaDB in May 2022. Since then, the new cluster has proven stable and provided consistent performance, which, together with the data service layer, allowed it to handle extra traffic generated by the World Cup gracefully.
About the Author
Rafal Gancarz
Rafal is an experienced technology leader and expert. He's currently helping Starbucks make its Commerce Platform scalable, resilient and cost-effective. Previously, Rafal has been involved in designing and building large-scale, distributed and cloud-based systems for Cisco, Accenture, Capita, ICE, Callsign and others. His interests span architecture & design, continuous delivery, observability and operability, as well as sociotechnical and organisational aspects of software delivery.
Show moreRecommend
-
224
README.md GocqlX
-
77
-
37
Set Package set is a type-safe, zero-allocation port of the excellent package fatih/set . It contains sets for most of the basic types and you can generate set fo...
-
9
Improving Language Models by Retrieving from Trillions of Tokens Abstract We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with prece...
-
6
The Four Innovation Phases of Netflix’s Trillions Scale Real-time Data InfrastructureMy name is Zhenzhong Xu. I joined Netflix in 2015 as a founding engineer on the Real-time D...
-
3
To-go coffee cups shed trillions of plastic particles under normal use ...
-
11
Davos FinTech - we lost trillions on crypto, but don’t worry it was unregulated! says WEF By Chris Middleton
-
6
How Discord Stores Trillions of Messages
-
10
El Niños Cause Trillions in Lost Economic...
-
4
NASA now believes our galaxy is home to trillions - not billions - of rogue planets Castaways that aren't gravitationally tied to a star system By
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK