

MongoDB Oplog Stones Implementation Analysis
source link: https://mongoing.com/archives/75500

This analysis is based on the MongoDB 4.2 code.
Background
The Oplog collection is implemented first and foremost as a Capped Collection. But with a plain Capped Collection, after every write operation, if the collection size is found to exceed the configured maximum, documents are deleted synchronously.
The deletion procedure is roughly:
- Compute: open an end cursor pointing at the head of the capped collection, then keep calling cursor->next(), accumulating value sizes, to determine how many documents need to be deleted.
- Delete:
  - If the number of documents to delete is 3 or fewer, simply call cursor->remove() in a loop to delete them.
  - Otherwise, set up two cursors, start and end, with start placed at the head of the capped collection, and call session->truncate(session, start, end) to delete the documents in bulk. session->truncate() ultimately ends up calling cursor->remove() as well.

See WiredTigerRecordStore::_cappedDeleteAsNeeded_inlock for the actual code.
As you can see, determining the number of documents to delete is the expensive part: calling cursor->next() to add up value sizes is effectively a continuous read. For the Oplog collection, every user write is recorded in the oplog, so once the oplog is full, every user write triggers a synchronous delete, which is obviously inefficient.
So MongoDB solves this with a strategy of marking truncate points and then deleting in batches.
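To make the synchronous path concrete, below is a minimal sketch of the same idea written directly against the WiredTiger C API rather than MongoDB's actual _cappedDeleteAsNeeded_inlock code. The "table:capped" URI, the assumption that values are raw byte items (value_format=u), and the excessBytes bookkeeping are all illustrative, not taken from the MongoDB source.

#include <wiredtiger.h>
#include <cstdint>

// Sketch: remove roughly `excessBytes` worth of the oldest records from a
// capped table, mirroring the synchronous capped-delete idea described above.
// Assumes `session` is an open WT_SESSION, "table:capped" already exists and
// stores raw byte values; both are illustrative assumptions.
int cappedDeleteAsNeededSketch(WT_SESSION* session, std::uint64_t excessBytes) {
    WT_CURSOR* end;
    int ret = session->open_cursor(session, "table:capped", nullptr, nullptr, &end);
    if (ret != 0)
        return ret;

    // Step 1 (the expensive part): advance from the head of the table,
    // accumulating value sizes, to find how far the deletion has to go.
    std::uint64_t reclaimed = 0;
    while (reclaimed < excessBytes && (ret = end->next(end)) == 0) {
        WT_ITEM value;
        end->get_value(end, &value);
        reclaimed += value.size;
    }
    if (reclaimed < excessBytes) {  // table exhausted or error: nothing to truncate
        end->close(end);
        return ret;
    }

    // Step 2: delete everything from the head of the table up to and
    // including the record `end` is positioned on. A NULL start cursor means
    // "truncate from the beginning". (MongoDB special-cases <= 3 documents
    // with a plain cursor->remove() loop; that branch is elided here.)
    ret = session->truncate(session, nullptr, nullptr, end, nullptr);
    end->close(end);
    return ret;
}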
Oplog Stones Initialization
A so-called Oplog Stone is really just a logical segment of the oplog delimited by truncate points. A truncate point is essentially the ts field of an oplog entry, which also serves as the RecordId, i.e. the key of the corresponding record in the WiredTiger table.
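As a rough illustration of that ts-to-key correspondence: an oplog Timestamp is a 32-bit seconds value plus a 32-bit increment, and packing the two into a single 64-bit integer (seconds in the high bits) yields a key whose ordering matches timestamp ordering. The struct and function names below are made up for this sketch, not MongoDB's real classes.

#include <cstdint>

// An oplog entry's "ts" field: wall-clock seconds plus a per-second counter.
struct OplogTs {
    std::uint32_t secs;
    std::uint32_t inc;
};

// Pack ts into the 64-bit key used to address the record, seconds in the
// high 32 bits so that key order equals timestamp order.
inline std::int64_t tsToRecordKey(OplogTs ts) {
    return static_cast<std::int64_t>(
        (static_cast<std::uint64_t>(ts.secs) << 32) | ts.inc);
}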
MongoDB does not persist the Oplog Stone information; instead it is re-initialized on every restart.
The overall initialization logic is fairly simple. First, the number of stones needed is computed from cappedMaxSize:
// The minimum oplog stone size should be BSONObjMaxInternalSize.
const unsigned int oplogStoneSize = std::max(gOplogStoneSizeMB * 1024 * 1024, BSONObjMaxInternalSize);

// IDL does not support unsigned long long types.
const unsigned long long kMinStonesToKeep = static_cast<unsigned long long>(gMinOplogStones);
const unsigned long long kMaxStonesToKeep = static_cast<unsigned long long>(gMaxOplogStonesDuringStartup);

unsigned long long numStones = maxSize / oplogStoneSize;
size_t numStonesToKeep = std::min(kMaxStonesToKeep, std::max(kMinStonesToKeep, numStones));
Several default values enter into this calculation. What we need to know is that there are at most 100 oplog stones and at least 10, and that with an oplogSizeMB setting above 2GB you basically end up with 100 stones under the defaults. (That baseline is derived from the configured maximum size of the Capped Collection; if the Capped Collection has not yet filled up, the count is derived from its actual size and there will be fewer stones, but in any case the upper bound of 100 is never exceeded.)
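To make the clamping concrete, here is a small self-contained program that reproduces the formula above. The 10/100 bounds are the ones stated here; the 20 MB per-stone size is an assumed value for illustration (roughly what the "2 GB gives ~100 stones" figure implies), not a number read out of the source.

#include <algorithm>
#include <cstdint>
#include <iostream>

int main() {
    // Bounds described above: between 10 and 100 stones.
    const std::uint64_t kMinStonesToKeep = 10;
    const std::uint64_t kMaxStonesToKeep = 100;

    // Assumed per-stone target size for illustration only.
    const std::uint64_t oplogStoneSize = 20ULL * 1024 * 1024;

    // Try a few oplog sizes and show how the stone count is clamped.
    const std::uint64_t sizes[] = {
        128ULL * 1024 * 1024,        // 128 MB -> 6 stones, clamped up to 10
        1ULL * 1024 * 1024 * 1024,   // 1 GB   -> ~51 stones, unclamped
        33ULL * 1024 * 1024 * 1024,  // 33 GB  -> ~1689 stones, clamped down to 100
    };
    for (std::uint64_t maxSize : sizes) {
        std::uint64_t numStones = maxSize / oplogStoneSize;
        std::uint64_t numStonesToKeep =
            std::min(kMaxStonesToKeep, std::max(kMinStonesToKeep, numStones));
        std::cout << maxSize << " bytes -> " << numStonesToKeep << " stones\n";
    }
    return 0;
}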
Having determined the number of Oplog Stones, the next step is to determine each stone's boundary. There are two cases:
- The Oplog collection currently has too few records, fewer than 20 times the required number of samples (with 100 stones and the default of 10 samples per stone, that means fewer than 20,000 records). In this case the stone boundaries are determined by a full collection scan. The logic is simple: keep adding up the size of each scanned oplog entry and emit a stone whenever the running total exceeds the per-stone size limit, until the scan is done (code: WiredTigerRecordStore::OplogStones::_calculateStonesByScanning; see the first sketch after this list).
- Otherwise a full scan is out of the question, it is far too slow. MongoDB instead relies on the random cursor provided by WiredTiger to take samples and quickly estimate the boundary of each oplog stone (code: WiredTigerRecordStore::OplogStones::_calculateStonesBySampling; see the second sketch after this list).
  - In principle, 100 oplog stones could be pinned down with just 100 samples, but to make the boundaries as accurate as possible MongoDB oversamples: 10 samples per stone (by default), so 1,000 samples for 100 stones. The samples are then sorted by key (opTime), and each oplog stone uses the last key of its group of 10 samples as its boundary; counting from 0, samples 9, 19, 29, ... become the boundaries of the successive oplog stones.
- In addition, later version iterations use an optimized random cursor that yields more accurate sampling: the number of random samples needed is passed to WT via next_random_sample_size, and WT divides the underlying table into next_random_sample_size slices and draws one sample from each. With the table partitioned this way, every 10th sample lands much closer to the intended oplog stone boundary.
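A minimal sketch of the full-scan path from the first case above: walk the oplog in key order, accumulate entry sizes, and close off a stone each time the running total reaches the per-stone target. The Entry/Stone types are simplified stand-ins for MongoDB's Record/RecordId classes, introduced only for this sketch.

#include <cstdint>
#include <vector>

struct Entry {
    std::int64_t key;         // the ts-derived record key
    std::uint64_t valueSize;  // size of the oplog entry in bytes
};

struct Stone {
    std::int64_t lastRecord;  // key of the last entry covered by this stone
    std::uint64_t bytes;
    std::uint64_t records;
};

std::vector<Stone> calculateStonesByScanningSketch(const std::vector<Entry>& oplog,
                                                   std::uint64_t stoneSize) {
    std::vector<Stone> stones;
    std::uint64_t bytes = 0, records = 0;
    for (const Entry& e : oplog) {
        bytes += e.valueSize;
        ++records;
        // Close off a stone once the accumulated size reaches the target.
        if (bytes >= stoneSize) {
            stones.push_back({e.key, bytes, records});
            bytes = 0;
            records = 0;
        }
    }
    // Whatever is left over simply becomes the partially filled "current"
    // stone that new writes keep filling up.
    return stones;
}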
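And a sketch of the sampling path, written against the WiredTiger C API: open a random cursor with next_random_sample_size set, draw samplesPerStone samples per stone, sort them, and keep every samplesPerStone-th key as a boundary. The "table:oplog" URI, the int64 key format, and the reduced error handling are assumptions of this sketch, not MongoDB's actual wrapper code.

#include <wiredtiger.h>
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

int calculateStonesBySamplingSketch(WT_SESSION* session,
                                    int numStones,
                                    int samplesPerStone,
                                    std::vector<std::int64_t>* boundaries) {
    const int numSamples = numStones * samplesPerStone;

    // next_random asks WT for a randomly positioned record on each next();
    // next_random_sample_size lets WT pre-divide the table into that many
    // slices so the samples are spread more evenly.
    char config[64];
    std::snprintf(config, sizeof(config),
                  "next_random=true,next_random_sample_size=%d", numSamples);

    WT_CURSOR* cursor;
    int ret = session->open_cursor(session, "table:oplog", nullptr, config, &cursor);
    if (ret != 0)
        return ret;

    std::vector<std::int64_t> samples;
    samples.reserve(numSamples);
    for (int i = 0; i < numSamples; ++i) {
        if ((ret = cursor->next(cursor)) != 0)
            break;  // empty table, or an error
        std::int64_t key;
        cursor->get_key(cursor, &key);
        samples.push_back(key);
    }
    cursor->close(cursor);
    if (ret != 0 && ret != WT_NOTFOUND)
        return ret;

    // Oversampling: sort all samples and keep the last one of every group of
    // samplesPerStone as that stone's boundary (indices 9, 19, 29, ...).
    std::sort(samples.begin(), samples.end());
    for (int i = 1;
         i <= numStones && i * samplesPerStone <= static_cast<int>(samples.size());
         ++i)
        boundaries->push_back(samples[i * samplesPerStone - 1]);
    return 0;
}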
That is how oplog stones are initialized at process startup. As new writes arrive, new oplog stones keep being created, and their sizes are exact, because at write time it is easy to track precisely how many bytes have gone into the latest stone. So even if imprecise stone boundaries at startup cause slightly too much or too little oplog to be deleted, it is not a big problem; the error is washed out by subsequent updates.
Oplog Reclamation
With Oplog Stones in place, reclaiming the oplog no longer requires computing and synchronously deleting documents on every write. Instead, once the current oplog stone fills up and a new one is created, the oldest oplog stone is simply dropped.
Deleting a whole stone usually removes a fairly large number of documents at once, so the deletion is done in the background by the OplogTruncaterThread. It calls session->truncate directly, using the boundary of the oldest oplog stone as the upper bound of the range truncate.
// WiredTigerRecordStore::reclaimOplog
setKey(cursor, stone->lastRecord);
invariantWTOK(session->truncate(session, nullptr, nullptr, cursor, nullptr));
The deletion also has to take the stable checkpoint's dependency on the oplog into account; that logic will be analyzed in a later article.
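For reference, the truncate call above relies on two WiredTiger behaviors: a NULL start cursor means the range begins at the head of the table, and a stop cursor with just its key set is enough to bound the range, which is then removed up to and including that record. A minimal sketch of the reclaim step under the same assumptions as the earlier sketches (a "table:oplog" URI with int64 keys):

#include <wiredtiger.h>
#include <cstdint>

// Drop everything up to and including the oldest stone's last record.
int reclaimOldestStoneSketch(WT_SESSION* session, std::int64_t stoneLastRecord) {
    WT_CURSOR* stop;
    int ret = session->open_cursor(session, "table:oplog", nullptr, nullptr, &stop);
    if (ret != 0)
        return ret;

    stop->set_key(stop, stoneLastRecord);
    // NULL start cursor: truncate from the beginning of the table.
    ret = session->truncate(session, nullptr, nullptr, stop, nullptr);
    stop->close(stop);
    return ret;
}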
Time Cost Analysis of Oplog Stones Initialization
Purely from the code, the vast majority of the Oplog Stones initialization time is spent on random cursor sampling; the other steps are simple in-memory computation that finishes in a few milliseconds. The total initialization time is proportional to the number of samples, and since there are at most 1,000 samples, the initialization time has an upper bound and does not grow without limit as the oplog collection grows.
To verify this conclusion, let's construct a test scenario.
First build the data set: create replica set 1 and generate data with the mongo shell's built-in benchmark tool,
benchRun({
    "host": "127.0.0.1:9111",
    "ops": [
        {
            "ns": "test.foo",
            "op": "insert",
            "doc": {
                "x": { "#RAND_STRING": [ 64 ] },
                "y": { "#RAND_STRING": [ 128 ] }
            }
        }
    ],
    "parallel": 16,
    "seconds": $seconds
})
This produces roughly 150 million oplog entries, 33GB (this is with compression; the actual data is around 50GB):
xdjmgset-dbfs1:PRIMARY> show dbs
admin   0.000GB
config  0.000GB
local   33.363GB
test    35.284GB
xdjmgset-dbfs1:PRIMARY> use local
switched to db local
xdjmgset-dbfs1:PRIMARY> db.oplog.rs.count()
150531637
I added some logging to the code to measure the total time of the random cursor sampling, as well as the cost of each individual random sample:
diff --git a/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp b/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp
index f2c3d1c220..7f029b788d 100644
--- a/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp
+++ b/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp
@@ -460,8 +460,14 @@ void WiredTigerRecordStore::OplogStones::_calculateStonesBySampling(OperationCon
     // each logical section.
     auto cursor = _rs->getRandomCursorWithOptions(opCtx, extraConfig);
     std::vector<RecordId> oplogEstimates;
+    const std::uint64_t startWaitTime = curTimeMicros64();
     for (int i = 0; i < numSamples; ++i) {
+        const std::uint64_t startWaitTime = curTimeMicros64();
         auto record = cursor->next();
+        auto waitTime = curTimeMicros64() - startWaitTime;
+        LOG(1) << "WT cursor random sample " << i << ", "
+               << Timestamp(record->id.repr()).toStringPretty() << ", took "
+               << waitTime / 1000.0 << "ms";
         if (!record) {
             // This shouldn't really happen unless the size storer values are far off from reality.
             // The collection is probably empty, but fall back to scanning the oplog just in case.
@@ -471,6 +477,8 @@ void WiredTigerRecordStore::OplogStones::_calculateStonesBySampling(OperationCon
         }
         oplogEstimates.push_back(record->id);
     }
+    auto waitTime = curTimeMicros64() - startWaitTime;
+    LOG(1) << "WT cursor random sampling total took " << waitTime/1000.0 << "ms";
     std::sort(oplogEstimates.begin(), oplogEstimates.end());

     for (int i = 1; i <= wholeStones; ++i) {
Restart replica set 1 and look at the log:
2020-10-27T15:34:09.058+0800 I STORAGE [initandlisten] Taking 755 samples and assuming that each section of oplog contains approximately 1991955 records totaling to 687194924 bytes
2020-10-27T15:34:09.058+0800 D1 STORAGE [initandlisten] WT cursor random sample 0, Oct 26 19:54:57:7347, took 0.415ms
2020-10-27T15:34:09.063+0800 D1 STORAGE [initandlisten] WT cursor random sample 1, Oct 26 19:57:47:6215, took 4.488ms
2020-10-27T15:34:09.067+0800 D1 STORAGE [initandlisten] WT cursor random sample 2, Oct 26 15:37:47:1030, took 4.608ms
2020-10-27T15:34:09.072+0800 D1 STORAGE [initandlisten] WT cursor random sample 3, Oct 26 15:44:15:4619, took 4.471ms
2020-10-27T15:34:09.076+0800 D1 STORAGE [initandlisten] WT cursor random sample 4, Oct 26 15:46:51:2640, took 4.597ms
2020-10-27T15:34:09.081+0800 D1 STORAGE [initandlisten] WT cursor random sample 5, Oct 26 15:49:22:10335, took 4.556ms
2020-10-27T15:34:09.086+0800 D1 STORAGE [initandlisten] WT cursor random sample 6, Oct 26 15:52:03:10684, took 4.746ms
2020-10-27T15:34:09.090+0800 D1 STORAGE [initandlisten] WT cursor random sample 7, Oct 26 15:54:14:4494, took 4.586ms
2020-10-27T15:34:09.095+0800 D1 STORAGE [initandlisten] WT cursor random sample 8, Oct 26 15:56:46:1960, took 4.889ms
2020-10-27T15:34:09.100+0800 D1 STORAGE [initandlisten] WT cursor random sample 9, Oct 26 15:59:18:7246, took 4.695ms
2020-10-27T15:34:09.105+0800 D1 STORAGE [initandlisten] WT cursor random sample 10, Oct 26 16:02:05:4727, took 4.895ms
2020-10-27T15:34:09.110+0800 D1 STORAGE [initandlisten] WT cursor random sample 11, Oct 26 16:04:30:5742, took 4.673ms
2020-10-27T15:34:09.115+0800 D1 STORAGE [initandlisten] WT cursor random sample 12, Oct 26 16:06:45:1917, took 4.881ms
2020-10-27T15:34:09.119+0800 D1 STORAGE [initandlisten] WT cursor random sample 13, Oct 26 16:08:50:5188, took 4.786ms
2020-10-27T15:34:09.124+0800 D1 STORAGE [initandlisten] WT cursor random sample 14, Oct 26 16:11:13:7634, took 4.449ms
2020-10-27T15:34:09.129+0800 D1 STORAGE [initandlisten] WT cursor random sample 15, Oct 26 16:13:25:6775, took 5.204ms
...
2020-10-27T15:34:12.463+0800 D1 STORAGE [initandlisten] WT cursor random sample 752, Oct 26 15:56:15:232, took 4.923ms
2020-10-27T15:34:12.467+0800 D1 STORAGE [initandlisten] WT cursor random sample 753, Oct 26 15:58:47:1953, took 4.399ms
2020-10-27T15:34:12.472+0800 D1 STORAGE [initandlisten] WT cursor random sample 754, Oct 26 16:01:28:5317, took 4.598ms
2020-10-27T15:34:12.472+0800 D1 STORAGE [initandlisten] WT cursor random sampling total took 3414.51ms
You can see that this instance took 755 samples, 3414ms in total, and each individual sample took a fairly consistent 4ms to 5ms.
MongoDB's serverStatus command also has a section that reports the total time spent initializing oplog stones at startup, along with the initialization method used (scan or sampling):
xdjmgset-dbfs1:PRIMARY> db.serverStatus().oplogTruncation
{
        "totalTimeProcessingMicros" : NumberLong(3418164),
        "processingMethod" : "sampling",
        "totalTimeTruncatingMicros" : NumberLong(0),
        "truncateCount" : NumberLong(0)
}
As you can see, everything other than the sampling accounts for only about 4ms, less than 1% of the total time.
To verify that the initialization time is proportional to the number of samples, another data set was built the same way: 25GB, 113 million oplog entries,
xdjmgset-dbfs1:PRIMARY> show dbs
admin   0.000GB
config  0.000GB
local   25.145GB
test    26.517GB
xdjmgset-dbfs1:PRIMARY> use local
switched to db local
xdjmgset-dbfs1:PRIMARY> db.oplog.rs.count()
113211477
After a restart, the log shows:
2020-10-27T15:43:02.121+0800 I STORAGE [initandlisten] Taking 568 samples and assuming that each section of oplog contains approximately 1991875 records totaling to 687195044 bytes
2020-10-27T15:43:02.121+0800 D1 STORAGE [initandlisten] WT cursor random sample 0, Oct 27 12:33:29:5201, took 0.216ms
2020-10-27T15:43:02.125+0800 D1 STORAGE [initandlisten] WT cursor random sample 1, Oct 27 12:36:06:5577, took 4.489ms
2020-10-27T15:43:02.130+0800 D1 STORAGE [initandlisten] WT cursor random sample 2, Oct 27 12:38:30:1191, took 4.417ms
2020-10-27T15:43:02.134+0800 D1 STORAGE [initandlisten] WT cursor random sample 3, Oct 27 12:40:51:1654, took 4.526ms
2020-10-27T15:43:02.139+0800 D1 STORAGE [initandlisten] WT cursor random sample 4, Oct 27 12:43:12:9085, took 4.51ms
2020-10-27T15:43:02.144+0800 D1 STORAGE [initandlisten] WT cursor random sample 5, Oct 27 12:45:36:3523, took 4.465ms
2020-10-27T15:43:02.148+0800 D1 STORAGE [initandlisten] WT cursor random sample 6, Oct 27 12:48:09:6883, took 4.63ms
2020-10-27T15:43:02.153+0800 D1 STORAGE [initandlisten] WT cursor random sample 7, Oct 27 12:50:09:6716, took 4.484ms
2020-10-27T15:43:02.157+0800 D1 STORAGE [initandlisten] WT cursor random sample 8, Oct 27 12:52:24:1495, took 4.531ms
2020-10-27T15:43:02.162+0800 D1 STORAGE [initandlisten] WT cursor random sample 9, Oct 27 12:54:39:3871, took 4.705ms
2020-10-27T15:43:02.167+0800 D1 STORAGE [initandlisten] WT cursor random sample 10, Oct 27 12:57:15:3946, took 4.661ms
2020-10-27T15:43:02.171+0800 D1 STORAGE [initandlisten] WT cursor random sample 11, Oct 27 12:59:36:5033, took 4.74ms
2020-10-27T15:43:02.176+0800 D1 STORAGE [initandlisten] WT cursor random sample 12, Oct 27 13:01:52:6908, took 4.424ms
2020-10-27T15:43:02.181+0800 D1 STORAGE [initandlisten] WT cursor random sample 13, Oct 27 13:04:22:2838, took 4.637ms
2020-10-27T15:43:02.186+0800 D1 STORAGE [initandlisten] WT cursor random sample 14, Oct 27 13:06:42:6574, took 5.21ms
...
2020-10-27T15:43:04.771+0800 D1 STORAGE [initandlisten] WT cursor random sample 567, Oct 27 12:17:32:2820, took 4.397ms
2020-10-27T15:43:04.771+0800 D1 STORAGE [initandlisten] WT cursor random sampling total took 2650.65ms
This run took 568 samples with a total cost of 2650ms, and 2650ms is essentially equal to (568.0 / 755) * 3414ms = 2568ms, so the conclusion that the time is proportional to the number of samples holds.
In summary, given that a single random cursor sample costs about 4-5ms and the sample count is capped at 1,000, the upper bound on oplog stones initialization time is around 5 seconds (on NVMe SSD).