How GitHub Partitioned its Relational Database to Improve Reliability at Scale

Source: https://www.infoq.com/news/2021/09/GitHub-database-partition/


Sep 30, 2021 2 min read

GitHub has been working for the last couple of years to partition their relational database and move the data to multiple, independent clusters. This effort led to a 50% load reduction and a significant reduction in database-related incidents, explains GitHub engineer Thomas Maurer.

GitHub's architecture originally relied on a single MySQL database cluster, known as mysql1.

With GitHub’s growth, this inevitably led to challenges. We struggled to keep our database systems adequately sized, always moving to newer and bigger machines to scale up. Any sort of incident negatively affecting mysql1 would affect all features that stored their data on this cluster.

To improve this situation, GitHub engineers defined a strategy to partition their relational database without impairing the level of service. Key to this was the definition of virtual partitions of database schemas and the use of SQL linters.

Virtual partitions are groups of tables that are used together in queries and transactions. Identifying table groups that could be separated in the application layer was the first step towards a smooth transition. To codify the relationships among tables, GitHub engineers introduced the notion of schema domains. Below, you can see an example of a schema domain, defined in a simple YAML file:

gists:
  - gist_comments
  - gists
  - starred_gists
repositories:
  - issues
  - pull_requests
  - repositories
users:
  - avatars
  - gpg_keys
  - public_keys
  - users

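To make the idea concrete, the sketch below shows how such a YAML file could be turned into a table-to-domain lookup that application code or a linter could consult. This is not GitHub's actual tooling: the use of Python with the PyYAML package and the file name schema_domains.yml are assumptions for illustration.

import yaml

# Load the schema-domain definitions, e.g. {"gists": ["gist_comments", ...], ...}.
with open("schema_domains.yml") as f:
    domains = yaml.safe_load(f)

# Invert the mapping so that every table points to the domain that owns it.
table_to_domain = {
    table: domain
    for domain, tables in domains.items()
    for table in tables
}

assert table_to_domain["pull_requests"] == "repositories"
assert table_to_domain["gpg_keys"] == "users"
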
This notation was the basis for the application of SQL linters.

[Linters] identify any violating queries and transactions that span schema domains by adding a query annotation and treating them as exemptions. If a domain has no violations, it is virtually partitioned and ready to be physically moved to another database cluster.

Linters were applied to both queries and transactions. Query linters ensured that only tables belonging to the same schema domain were referenced in the same query. Transaction linters were used to identify transactions that needed to be reviewed or required a data model update.
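
As an illustration of the first kind of check, here is a minimal query-linter sketch in Python. It is not GitHub's implementation: it relies on a naive regular expression to extract table names, where a real linter would hook into the ORM or a proper SQL parser, and it omits the annotation-based exemption mechanism mentioned above.

import re

# Table -> domain lookup derived from the YAML above (see the earlier sketch).
TABLE_TO_DOMAIN = {
    "gist_comments": "gists", "gists": "gists", "starred_gists": "gists",
    "issues": "repositories", "pull_requests": "repositories",
    "repositories": "repositories",
    "avatars": "users", "gpg_keys": "users", "public_keys": "users",
    "users": "users",
}

def lint_query(sql):
    # Naively collect the table names that follow FROM or JOIN.
    tables = re.findall(r"\b(?:from|join)\s+`?(\w+)`?", sql, flags=re.IGNORECASE)
    domains = {TABLE_TO_DOMAIN[t] for t in tables if t in TABLE_TO_DOMAIN}
    if len(domains) > 1:
        # A cross-domain query blocks physical partitioning and must be
        # rewritten (or explicitly annotated as a known exemption).
        raise ValueError(f"query spans schema domains {sorted(domains)}")

# A join within one domain passes; a join across domains is flagged.
lint_query("SELECT * FROM issues JOIN repositories ON issues.repository_id = repositories.id")
try:
    lint_query("SELECT * FROM gists JOIN users ON gists.user_id = users.id")
except ValueError as err:
    print(err)  # query spans schema domains ['gists', 'users']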

Once virtual partitions were identified through schema analysis, the related tables could be physically moved to different database clusters. For this final step, it was paramount for GitHub to avoid any downtime.

After an initial investigation of Vitess, a scaling layer running on top of MySQL, GitHub engineers opted to implement their own approach, dubbed write-cutover. This consisted of adding a new replica to a cluster, then running a script to stop replication and make that replica independent of the original cluster. Using this approach, GitHub moved 130 of their most critical tables at once.
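
The statements below sketch only the general shape of making a replica independent, assuming MySQL 8.x replication syntax, the mysql-connector-python client, and placeholder connection details; GitHub's actual write-cutover involves considerably more coordination (pausing writes, verifying replication, repointing the application) and is described in detail in Maurer's post.

import mysql.connector

# Connect to the replica that will become the primary of the new cluster
# (hypothetical host and credentials).
replica = mysql.connector.connect(
    host="new-cluster-primary.example.internal",
    user="admin",
    password="...",
)
cursor = replica.cursor()

# 1. Stop applying changes replicated from the original cluster (mysql1).
cursor.execute("STOP REPLICA")
# 2. Discard the replication configuration so the host is fully independent.
cursor.execute("RESET REPLICA ALL")
# 3. Accept writes on the newly independent cluster (on MySQL 8.x this also
#    clears super_read_only).
cursor.execute("SET GLOBAL read_only = OFF")

cursor.close()
replica.close()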

Thanks to this effort, explains Maurer, GitHub's database is now spread across several clusters. This made it possible to increase the number of queries handled by over 30% while halving the average load on each host. Additionally, GitHub engineers observed a significant reduction in the number of database-related incidents.

Maurer's article contains much more detail than can be provided here, especially concerning the write-cutover procedure. Do not miss his write-up to get the full picture.

