Reflections on a MongoDB Node Outage

source link: https://mp.weixin.qq.com/s/7Hsk1p_I91PD33tdU3vuOQ

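Introduction

Recently, a node in one of our MongoDB clusters lost power unexpectedly. The business was briefly interrupted and then returned to normal, and the ELK alerts also picked up the error logs from the affected services.

The operations team investigated the power loss and found that it was caused by nothing more than a resource allocation mistake. After the problem was fixed, the interruption itself raised some questions:

> "The current MongoDB cluster uses a sharded architecture built on replica sets. How much impact does a primary node failure have?"

> "Don't MongoDB replica sets fail over automatically? Isn't that a matter of seconds?"

With these questions in mind, the sections below analyze the automatic failover mechanism of replica sets.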

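Log analysis

First, we can confirm that the node that lost power was the primary of a replica set, and that a primary/secondary switchover took place when it went down.

The corresponding logs were found on the other two secondary nodes:

Log from secondary 1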

2019-05-06T16:51:11.766+0800 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms

2019-05-06T16:51:11.766+0800 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected

2019-05-06T16:51:11.766+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.30.129.78:30071

2019-05-06T16:51:11.767+0800 I REPL [ReplicationExecutor] VoteRequester(term 3 dry run) received a yes vote from 172.30.129.7:30071; response message: { term: 3, voteGranted: true, reason: "", ok: 1.0 }

2019-05-06T16:51:11.767+0800 I REPL [ReplicationExecutor] dry election run succeeded, running for election

2019-05-06T16:51:11.768+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.30.129.78:30071

2019-05-06T16:51:11.771+0800 I REPL [ReplicationExecutor] VoteRequester(term 4) received a yes vote from 172.30.129.7:30071; response message: { term: 4, voteGranted: true, reason: "", ok: 1.0 }

2019-05-06T16:51:11.771+0800 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 4

2019-05-06T16:51:11.771+0800 I REPL [ReplicationExecutor] transition to PRIMARY

2019-05-06T16:51:11.771+0800 I REPL [ReplicationExecutor] Entering primary catch-up mode.

2019-05-06T16:51:11.771+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Ending connection to host 172.30.129.78:30071 due to bad connection status; 2 connections to that host remain open

2019-05-06T16:51:11.771+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.30.129.78:30071

2019-05-06T16:51:13.350+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 172.30.129.78:30071; ExceededTimeLimit: Couldn't get a connection within the time limit

Log from secondary 2

2019-05-06T16:51:12.816+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Ending connection to host 172.30.129.78:30071 due to bad connection status; 0 connections to that host remain open

2019-05-06T16:51:12.816+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 172.30.129.78:30071; ExceededTimeLimit: Operation timed out, request was RemoteCommand 72553 -- target:172.30.129.78:30071 db:admin expDate:2019-05-06T16:51:12.816+0800 cmd:{ replSetHeartbeat: "shard0", configVersion: 96911, from: "172.30.129.7:30071", fromId: 1, term: 3 }

2019-05-06T16:51:12.821+0800 I REPL [ReplicationExecutor] Member 172.30.129.160:30071 is now in state PRIMARY

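As the logs show, secondary 1 started an election at 16:51:11 and became the new primary. Secondary 2 then learned of the new primary at 16:51:12, so we can confirm that the switchover had completed by that point.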
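
To make the timeline concrete, the gap between two log timestamps can be computed mechanically. The following is a minimal sketch; the helper names timeOfDayToMillis and elapsedMillis are made up for this illustration and only handle the time-of-day part of the log timestamps.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Convert an "HH:MM:SS.mmm" time-of-day string (the time part of the mongod
// log timestamps above) to milliseconds since midnight.
// Returns -1 if the string does not match that shape.
long timeOfDayToMillis(const std::string& t) {
    int h = 0, m = 0, s = 0, ms = 0;
    if (std::sscanf(t.c_str(), "%d:%d:%d.%d", &h, &m, &s, &ms) != 4) {
        return -1;
    }
    return ((h * 60L + m) * 60L + s) * 1000L + ms;
}

// Milliseconds elapsed between two time-of-day strings on the same day.
long elapsedMillis(const std::string& from, const std::string& to) {
    const long fromMs = timeOfDayToMillis(from);
    const long toMs = timeOfDayToMillis(to);
    assert(fromMs >= 0 && toMs >= 0);  // both inputs must parse
    return toMs - fromMs;
}
```

For the two log lines quoted above, elapsedMillis("16:51:11.766", "16:51:12.821") gives 1055, so secondary 2 saw the new primary about one second after secondary 1 started its election.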

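The logs also contain a large number of heartbeat failures against the original primary (172.30.129.78:30071).

So how exactly do the secondaries detect that the primary is down, how do heartbeats between the primary and the secondaries work, and what does this mean for data replication?

Next, let's dig into the replica set **failover** mechanism.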

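How replica sets implement failover

Below is a replica set with a PSS (one primary, two secondaries) architecture. Besides replicating data to the two secondaries, the primary and the secondaries all exchange heartbeats with one another to detect whether each member is alive.

[Figure: a PSS replica set]

Once the primary fails, the secondaries detect within a certain period that it is unreachable. One of them then takes the lead in starting an election and eventually becomes the new primary. This detection period is determined by the electionTimeoutMillis parameter, which defaults to 10s.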

Next, let's look at some source code to see how this mechanism is implemented:

db/repl/replication_coordinator_impl_heartbeat.cpp

Related methods

- ReplicationCoordinatorImpl::_startHeartbeats_inlock starts the heartbeats to each member

- ReplicationCoordinatorImpl::_scheduleHeartbeatToTarget schedules a (planned) heartbeat to a member

- ReplicationCoordinatorImpl::_doMemberHeartbeat sends the heartbeat to a member

- ReplicationCoordinatorImpl::_handleHeartbeatResponse handles the heartbeat response

- ReplicationCoordinatorImpl::_scheduleNextLivenessUpdate_inlock schedules the liveness-check timer

- ReplicationCoordinatorImpl::_cancelAndRescheduleElectionTimeout_inlock cancels and reschedules the election timeout timer

- ReplicationCoordinatorImpl::_startElectSelfIfEligibleV1 proactively starts an election

db/repl/topology_coordinator_impl.cpp

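Related methods

- TopologyCoordinatorImpl::prepareHeartbeatRequestV1 builds the heartbeat request data

- TopologyCoordinatorImpl::processHeartbeatResponse processes the heartbeat response and builds the Action instance for the next step

The figure below shows the call relationships among these methods.

[Figure: main call relationships]

Implementation of heartbeats

First, once the replica set has been formed, each node starts sending heartbeats to the other members through ReplicationCoordinatorImpl::_startHeartbeats_inlock: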

    void ReplicationCoordinatorImpl::_startHeartbeats_inlock() {
        const Date_t now = _replExecutor.now();
        _seedList.clear();

        // Get the replica set members
        // (part of this loop was lost when the page was extracted; only the fragments below survive)
        for (int i = 0; i ... restartHeartbeats();

        // Use the V1 election protocol (3.2 and later)
        if (isV1ElectionProtocol()) {
            for (auto&& slaveInfo : _slaveInfo) {
                slaveInfo.lastUpdate = _replExecutor.now();
                slaveInfo.down = false;
            }

            // Schedule the liveness-check timer
            _scheduleNextLivenessUpdate_inlock();
        }
    }

After obtaining the current replica set's member information, the node calls _scheduleHeartbeatToTarget to send heartbeats to the other members. The implementation of _scheduleHeartbeatToTarget is fairly simple; the heartbeat is actually sent by _doMemberHeartbeat, as shown below:

    void ReplicationCoordinatorImpl::_scheduleHeartbeatToTarget(const HostAndPort& target,
                                                                int targetIndex,
                                                                Date_t when) {
        // Schedule _doMemberHeartbeat to run at the given time
        _trackHeartbeatHandle(
            _replExecutor.scheduleWorkAt(when,
                                         stdx::bind(&ReplicationCoordinatorImpl::_doMemberHeartbeat,
                                                    this,
                                                    stdx::placeholders::_1,
                                                    target,
                                                    targetIndex)));
    }

ReplicationCoordinatorImpl::_doMemberHeartbeat is implemented as follows:

    void ReplicationCoordinatorImpl::_doMemberHeartbeat(ReplicationExecutor::CallbackArgs cbData,
                                                        const HostAndPort& target,
                                                        int targetIndex) {
        LockGuard topoLock(_topoMutex);

        // Stop tracking this callback
        _untrackHeartbeatHandle(cbData.myHandle);
        if (cbData.status == ErrorCodes::CallbackCanceled) {
            return;
        }

        const Date_t now = _replExecutor.now();
        BSONObj heartbeatObj;
        Milliseconds timeout(0);

        // Version 3.2 and later
        if (isV1ElectionProtocol()) {
            const std::pair<...> hbRequest =  // template arguments lost in extraction
                _topCoord->prepareHeartbeatRequestV1(now, _settings.ourSetName(), target);
            // Build the request and set a timeout
            heartbeatObj = hbRequest.first.toBSON();
            timeout = hbRequest.second;
        } else {
            ...
        }

        // Build the remote command
        const RemoteCommandRequest request(
            target, "admin", heartbeatObj, BSON(rpc::kReplSetMetadataFieldName << ...

[The rest of _doMemberHeartbeat and the beginning of the heartbeat response handler were lost when the page was extracted. The listing resumes inside what appears to be the response handler.]

            ... getTerm()) {

                // Cancel and reschedule the electionTimeout timer
                cancelAndRescheduleElectionTimeout();
            }
        }
        ...
        // Call topCoord's processHeartbeatResponse to handle the heartbeat response status;
        // it returns the Action to perform next
        HeartbeatResponseAction action = _topCoord->processHeartbeatResponse(
            now, networkTime, target, hbStatusResponse, lastApplied);
        ...
        // Schedule the next heartbeat, using the time provided by the action
        _scheduleHeartbeatToTarget(
            target, targetIndex, std::max(now, action.getNextHeartbeatStartDate()));

        // Act on the Action
        _handleHeartbeatResponseAction(action, hbStatusResponse, false);
    }

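Many details are omitted here, but we can still see that handling a heartbeat response involves the following:

- For a successful response from the primary, the electionTimeout timer is rescheduled (the previous schedule is canceled and a new one is started)

- The response is parsed and processed by the _topCoord object's processHeartbeatResponse method, which returns an Action indicating the next step

- The next heartbeat task is scheduled using the next heartbeat time given in the Action

- The operation indicated by the Action is carried out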

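So how long does a node wait after a heartbeat response before sending the next heartbeat? The logic implemented in TopologyCoordinatorImpl::processHeartbeatResponse is:

If the heartbeat succeeded, the node waits for heartbeatInterval, a configurable parameter that defaults to 2s;

If the heartbeat failed, the node sends the next heartbeat immediately (without waiting).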
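
As a rough illustration of this rule, here is a minimal sketch. The function delayBeforeNextHeartbeat is a hypothetical helper, not part of MongoDB, and it leaves out the bound on immediate retries that the real code applies (the excerpt mentions kMaxHeartbeatRetries).

```cpp
#include <cassert>
#include <chrono>

using Millis = std::chrono::milliseconds;

// Hypothetical helper, not MongoDB code: how long to wait before sending the
// next heartbeat to a member, following the rule described above.
//   lastHeartbeatSucceeded: whether the previous heartbeat got a good reply
//   heartbeatInterval:      the configured interval (2s by default per the article)
Millis delayBeforeNextHeartbeat(bool lastHeartbeatSucceeded, Millis heartbeatInterval) {
    assert(heartbeatInterval.count() >= 0);
    if (lastHeartbeatSucceeded) {
        return heartbeatInterval;  // success: wait a full interval
    }
    return Millis(0);              // failure: retry right away
}
```

With the default interval, delayBeforeNextHeartbeat(true, Millis(2000)) yields a 2000 ms wait, while a failed heartbeat yields a wait of zero.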

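The code is as follows:

    HeartbeatResponseAction TopologyCoordinatorImpl::processHeartbeatResponse(...) {
        ...
        const Milliseconds alreadyElapsed = now - hbStats.getLastHeartbeatStartDate();
        Date_t nextHeartbeatStartDate;

        // Compute the start time of the next heartbeat
        // numFailuresSinceLastStart is the number of consecutive failures (at most 2)
        if (hbStats.getNumFailuresSinceLastStart() <= kMaxHeartbeatRetries &&
            alreadyElapsed < ...

[Part of the original article was lost in extraction at this point: the remainder of processHeartbeatResponse and, apparently, the introduction of the first of two timers, the liveness-check timer scheduled by _scheduleNextLivenessUpdate_inlock. The listing resumes inside that timer's callback.]

            ... >= _rsConfig.getElectionTimeoutPeriod()) {
                ...
                // The member has still not been updated after the liveness period: mark it as down
                slaveInfo.down = true;

                // If this node is the primary and detects that a secondary is down,
                // enter the member-down flow
                if (_memberState.primary()) {

                    // Call _topCoord's setMemberAsDown to record that a secondary is
                    // unreachable and obtain the next instruction.
                    // When a majority of nodes is not visible, the instruction returned
                    // here tells this node to step down.
                    HeartbeatResponseAction action =
                        _topCoord->setMemberAsDown(now, memberIndex, _getMyLastDurableOpTime_inlock());
                    // Carry out the instruction
                    _handleHeartbeatResponseAction(action,
                                                   makeStatusWith<...>(),  // template argument lost in extraction
                                                   true);
                }
            }
        }
        // Continue by scheduling the next period
        _scheduleNextLivenessUpdate_inlock();
    }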

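As we can see, this timer mainly implements the primary's liveness probing of the other nodes:

When the primary finds that it can no longer reach a majority of the members (the majority rule is not satisfied), it steps itself down.

Therefore, in a three-node replica set, if both secondaries go down, the primary steps down automatically. This design is mainly intended to avoid unexpected data inconsistency.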

[Figure: the primary steps down automatically]
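
The majority rule behind this automatic step-down can be sketched as follows. This is a simplified model with a hypothetical function name, not MongoDB code; it counts every member equally and ignores per-member vote settings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model, not MongoDB code: decide whether a primary should step
// down because it can no longer reach a majority of the replica set.
// memberReachable[i] is true when member i is considered up from the
// primary's point of view; the primary itself must be included as `true`.
bool primaryShouldStepDown(const std::vector<bool>& memberReachable) {
    assert(!memberReachable.empty());
    std::size_t up = 0;
    for (bool reachable : memberReachable) {
        if (reachable) {
            ++up;
        }
    }
    const std::size_t majority = memberReachable.size() / 2 + 1;
    return up < majority;  // fewer than a majority visible: step down
}
```

In a three-node set, losing one secondary ({true, true, false}) still leaves a majority of two, so the primary stays; losing both ({true, false, false}) leaves only one visible member and the function returns true.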

The second is the _cancelAndRescheduleElectionTimeout_inlock function, which is the key to automatic failover. Its logic contains an election timer, as the code below shows:

    void ReplicationCoordinatorImpl::_cancelAndRescheduleElectionTimeout_inlock() {

        // If the previous timer is already scheduled, cancel it
        if (_handleElectionTimeoutCbh.isValid()) {
            LOG(4) << "Canceling election timeout callback at " << _handleElectionTimeoutWhen;
            _replExecutor.cancel(_handleElectionTimeoutCbh);
            _handleElectionTimeoutCbh = CallbackHandle();
            _handleElectionTimeoutWhen = Date_t();
        }

        // Only the V1 protocol (3.2 and later) is supported
        if (!isV1ElectionProtocol()) {
            return;
        }
        // Only secondaries may proceed
        if (!_memberState.secondary()) {
            return;
        }
        ...
        // Check whether this node is electable
        if (!_rsConfig.getMemberAt(_selfIndex).isElectable()) {
            return;
        }

        // Detection period: electionTimeout + randomOffset
        // randomOffset is a random offset, by default 0 to 0.15 * ElectionTimeoutPeriod = 0 to 1.5s
        Milliseconds randomOffset = _getRandomizedElectionOffset();
        auto now = _replExecutor.now();
        auto when = now + _rsConfig.getElectionTimeoutPeriod() + randomOffset;

        LOG(4) << "Scheduling election timeout callback at " << when;
        _handleElectionTimeoutWhen = when;

        // Schedule the callback at now + ElectionTimeoutPeriod + randomOffset
        _handleElectionTimeoutCbh =
            _scheduleWorkAt(when,
                            stdx::bind(&ReplicationCoordinatorImpl::_startElectSelfIfEligibleV1,
                                       this,
                                       StartElectionV1Reason::kElectionTimeout));
    }

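The code above shows the logic of the election timer. In each detection period the timer attempts to run its timeout callback, which points to _startElectSelfIfEligibleV1; this is where a node proactively starts an election.

If a heartbeat to the primary succeeds, the call to cancelAndRescheduleElectionTimeout cancels the pending timeout callback (so no election is started).

If heartbeats keep failing, the timer fires, and the secondary starts an election and becomes the new primary.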

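In addition, the following conditions must all hold for this callback (which starts the election) to fire:

1. The current node is a secondary

2. The current node is eligible for election

3. No heartbeat with the primary has succeeded within the detection period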
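
The three conditions can be combined into a single predicate. The sketch below is only illustrative: NodeState and shouldStartElection are hypothetical names, and the real implementation uses a cancellable timer rather than an explicit elapsed-time check.

```cpp
#include <cassert>
#include <chrono>

using Millis = std::chrono::milliseconds;

// Hypothetical snapshot of a node's state; the field names are illustrative.
struct NodeState {
    bool isSecondary;                  // condition 1
    bool isElectable;                  // condition 2
    Millis sinceLastPrimaryHeartbeat;  // time since the last successful heartbeat with the primary
};

// Returns true only when all three conditions listed above hold.
bool shouldStartElection(const NodeState& node, Millis detectionPeriod) {
    assert(detectionPeriod.count() > 0);
    return node.isSecondary && node.isElectable &&
           node.sinceLastPrimaryHeartbeat >= detectionPeriod;  // condition 3
}
```

For example, an electable secondary that last heard from the primary 10.5s ago would start an election under a 10s detection period, whereas one that heard from it 3s ago would not.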

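The detection period is slightly longer than electionTimeout (10s); with the random offset added it falls roughly between 10s and 11.5s. Presumably this is designed to stagger the times at which multiple secondaries start elections, improving the chance that an election succeeds.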
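
A minimal sketch of such a randomized detection period is shown below. The function name and the uniform distribution are assumptions made for illustration; only the 15% upper bound comes from the code comment quoted earlier.

```cpp
#include <cassert>
#include <chrono>
#include <random>

using Millis = std::chrono::milliseconds;

// Hypothetical helper, not MongoDB code: pick a detection period of
// electionTimeout + randomOffset, where randomOffset is drawn uniformly from
// [0, maxOffsetPercent% of electionTimeout]. The default of 15 mirrors the
// 0.15 factor mentioned in the article's code comment.
Millis randomizedDetectionPeriod(Millis electionTimeout,
                                 std::mt19937& rng,
                                 int maxOffsetPercent = 15) {
    assert(electionTimeout.count() > 0 && maxOffsetPercent >= 0);
    const long long maxOffsetMs = electionTimeout.count() * maxOffsetPercent / 100;
    std::uniform_int_distribution<long long> offsetDist(0, maxOffsetMs);
    return electionTimeout + Millis(offsetDist(rng));
}
```

With a 10s election timeout, every value returned lies between 10000 ms and 11500 ms, so two secondaries seeded differently are unlikely to time out at the same instant.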

Finally, the overall logic of automatic election and switchover is summarized in the figure below:

[Figure: automatic election on timeout]

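Assessing the business impact

When a replica set switches primaries, existing reads served by the surviving members are not affected; only writes are. With a driver that supports MongoDB 3.6 or later, you can reduce the impact by enabling retryWrites.


However, if the primary is powered off abruptly, the whole failover takes longer: the failure will most likely only be detected, and the replica set recovered, after the election timer expires. This window is within 12s.

You also need to take into account how clients or mongos monitor and detect replica set role changes. In any case, until the problem is resolved, any read or write sent to the original primary will time out.


Therefore, for critical workloads, it is advisable to add protective measures at the application layer, such as a retry mechanism.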
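
One possible shape for such an application-level retry is sketched below. It is deliberately independent of any MongoDB driver: retryWithDelay is a hypothetical helper, and the wrapped operation is assumed to report success by returning true.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical application-level helper, independent of any MongoDB driver:
// call `operation` until it returns true or `maxAttempts` is exhausted,
// sleeping `delay` between attempts. Returns whether it eventually succeeded.
bool retryWithDelay(const std::function<bool()>& operation,
                    int maxAttempts,
                    std::chrono::milliseconds delay) {
    assert(maxAttempts > 0);
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (operation()) {
            return true;
        }
        if (attempt < maxAttempts) {
            std::this_thread::sleep_for(delay);  // give the replica set time to elect a new primary
        }
    }
    return false;
}
```

For retries to bridge a failover, the total retry budget (attempts multiplied by delay) would need to exceed the roughly 12s window mentioned above, and the wrapped operation should be idempotent so that repeating it is safe.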

References

https://docs.mongodb.com/manual/replication/#automatic-failover

https://www.percona.com/blog/2016/05/25/mongodb-3-2-elections-just-got-better/

https://www.percona.com/blog/2018/10/10/mongodb-replica-set-scenarios-and-internals/
