Automatic member fencing with OFFLINE_MODE in Group Replication

Group Replication enables you to create fault-tolerant systems with redundancy by replicating the system state to a set of servers. Even if some of the servers subsequently fail, as long as it is not all of them or a majority, the system is still available.

This blog post focuses on what happens to the failed servers, that is, how the group can be configured so that failed servers that are still reachable by clients do not accept client requests.

A group member unintentionally leaves the group:

after encountering an applier error;
after encountering a recovery error;
in the case of a loss of majority (if group_replication_unreachable_majority_timeout is set to a value other than 0);
when another member of the group expels it due to a suspicion timing out;
after an error on coordinated group changes;
after a primary election error;
when automatic rejoin is enabled, after its attempts are exhausted unsuccessfully.
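
Several of these triggers depend on server settings, so it can help to review them on each member. The query below is a minimal sketch, assuming a MySQL 8.0 server with the Group Replication plugin installed; group_replication_member_expel_timeout and group_replication_autorejoin_tries are not mentioned above and only exist on recent enough 8.0 releases.

```sql
-- Review the settings that influence when a member leaves the group.
-- Variables that do not exist on this server version are simply not returned.
SELECT VARIABLE_NAME, VARIABLE_VALUE
  FROM performance_schema.global_variables
 WHERE VARIABLE_NAME IN (
         'group_replication_unreachable_majority_timeout',
         'group_replication_member_expel_timeout',
         'group_replication_autorejoin_tries'
       );
```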

The behaviour of the failed member after leaving the group is controlled by the option group_replication_exit_state_action.

Until 8.0.17, this behaviour could be:

READ_ONLY: disables writes on the server (the default value);
ABORT_SERVER: shuts down the server.

In 8.0.18 we added:

OFFLINE_MODE: closes all connections and disallows new ones from users who do not have the CONNECTION_ADMIN or SUPER privilege. This mode includes READ_ONLY; otherwise a user with the CONNECTION_ADMIN or SUPER privilege would be able to make changes that would never reach the group.

These three behaviours allow the DBA to customise the behaviour of the failed server and, in the most severe situations, of the full system. For instance, if all members become unreachable due to an internal network failure, all members will follow the configured behaviour.

The DBA can block only writes by choosing READ_ONLY, block all operations with OFFLINE_MODE, or even stop the server completely with ABORT_SERVER.
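
As an illustration, the fencing mode could be selected on a member as follows. This is a sketch, assuming MySQL 8.0.18 or later with the Group Replication plugin installed and a session with sufficient privileges to persist global variables; using SET PERSIST rather than SET GLOBAL is a choice made here so that the setting survives a restart.

```sql
-- Assumption: run on each member that should fence itself with OFFLINE_MODE.
-- SET PERSIST changes the running value and keeps it across restarts.
SET PERSIST group_replication_exit_state_action = 'OFFLINE_MODE';

-- Confirm the value now in effect.
SELECT @@GLOBAL.group_replication_exit_state_action;
```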

When a failed server configured with group_replication_exit_state_action=OFFLINE_MODE leaves the group, we can see its ERROR state in the performance_schema.replication_group_members table:

SELECT * FROM performance_schema.replication_group_members;

and the offline mode can be checked with:

SELECT @@GLOBAL.offline_mode;
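
The two checks can also be narrowed to the failed member itself. The sketch below assumes it is run on that member; it relies on MEMBER_ID holding the server's UUID and also reads super_read_only, since OFFLINE_MODE includes the READ_ONLY behaviour.

```sql
-- Show only this server's own row in the membership table.
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
  FROM performance_schema.replication_group_members
 WHERE MEMBER_ID = @@GLOBAL.server_uuid;

-- Fencing-related flags; both should be enabled on a member fenced by OFFLINE_MODE.
SELECT @@GLOBAL.offline_mode, @@GLOBAL.super_read_only;
```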

After fixing the failure that caused the unintentional leave, the DBA needs to unset offline_mode:

SET @@GLOBAL.offline_mode = OFF;

and then rejoin the member to the group.
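
The rejoin step is not spelled out in this post. One possible sequence, assuming the root cause has been fixed and the session has the privileges needed to manage Group Replication, is to stop and then restart the plugin on the failed member:

```sql
-- Assumption: a member left in ERROR state typically needs Group Replication
-- stopped before it can be started again.
STOP GROUP_REPLICATION;
START GROUP_REPLICATION;

-- Check the member's state as it goes through distributed recovery.
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
  FROM performance_schema.replication_group_members;
```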

Conclusion

I hope this new fencing mode will help you improve and better configure the HA properties of your systems, allowing you to focus on your applications!
