Merging Empty Chunks in MongoDB

I recently wrote about one of the problems we can encounter while working with sharded clusters: Finding Undetected Jumbo Chunks in MongoDB. Another issue we might run into is managing empty chunks.

Chunk Maintenance

As we know, MongoDB has an autoSplitter process that partitions chunks when they become too big, and a balancer process that moves chunks to keep them evenly distributed across all shards. So as data grows, chunks are split and perhaps moved to other shards, and all is well.

But what happens when we delete data? It can be the case that some chunks are now empty. If we delete a lot of data, perhaps a significant number of the chunks will be empty. This can be a significant issue for sharded collections with a TTL index.

Potential Issues

One of the potential problems when dealing with a high percentage of empty chunks is uneven data distribution. The balancer will make sure the number of chunks on each shard is roughly the same, but it does not take into account whether the chunks are empty or not. So you might end up with a cluster that looks balanced, but in reality, a few shards have way more data than the rest.
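To make the imbalance concrete, here is a small sketch in plain JavaScript (it should also run in the mongo shell). The `summarizeShards` helper, the shard names, and the sizes are all hypothetical, invented for illustration; they are not output from a real cluster.

```javascript
// Hypothetical helper: aggregate chunk count, empty-chunk count, and bytes per shard.
// Each input item is { shard: <name>, size: <bytes> }; names and numbers are invented.
function summarizeShards(chunks) {
  var summary = {};
  chunks.forEach(function (chunk) {
    if (!summary[chunk.shard]) {
      summary[chunk.shard] = { chunks: 0, emptyChunks: 0, bytes: 0 };
    }
    summary[chunk.shard].chunks += 1;
    summary[chunk.shard].bytes += chunk.size;
    if (chunk.size === 0) {
      summary[chunk.shard].emptyChunks += 1;
    }
  });
  return summary;
}

// Two shards with the same number of chunks, but very different amounts of data.
var sampleChunks = [
  { shard: "shardA", size: 64000000 },
  { shard: "shardA", size: 58000000 },
  { shard: "shardA", size: 61000000 },
  { shard: "shardB", size: 60000000 },
  { shard: "shardB", size: 0 },
  { shard: "shardB", size: 0 }
];

var shardSummary = summarizeShards(sampleChunks);
```

In this made-up example both shards hold three chunks, so they look balanced by chunk count, yet shardA stores about three times as much data as shardB because two of shardB's chunks are empty.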

To deal with this problem, the first step is to identify empty chunks.

Identifying Empty Chunks

To illustrate this, let’s consider a clients collection that is sharded by the “org_id” field. Let’s assume the collection currently has the following chunk ranges:

minKey → 1
1 → 5
5 → 10
10 → 15
15 → 20
…

We can use the dataSize command to determine the size of a chunk. This command receives the chunk range as part of the arguments. For example, to check how many documents we have on the third chunk, we would run:

Shell

db.runCommand({ dataSize: "mydatabase.clients", keyPattern: { org_id: 1 }, min: { org_id: 5 }, max: { org_id: 10 } })

This returns a document like the following:

Shell

{
    "size" : 0,
    "numObjects" : 0,
    "millis" : 30,
    "ok" : 1,
    "operationTime" : Timestamp(1641829163, 2),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1641829163, 3),
        "signature" : {
            "hash" : BinData(0,"LbBPsTEahzG/v7I6oe7iyvLr/pU="),
            "keyId" : NumberLong("7016744225173049401")
        }
    }
}

If the size is 0 we know we have an empty chunk, and we can consider merging it with either the chunk that comes right after it (with the range 10 → 15) or the one just before it (with the range 1 → 5).
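The check described above can be wrapped in two small helpers. This is only a sketch: `buildDataSizeCommand` and `isEmptyChunk` are hypothetical names, and the emptiness test assumes the result document has the `size`, `numObjects`, and `ok` fields shown in the sample output.

```javascript
// Hypothetical helper: build the dataSize command document for one chunk range.
function buildDataSizeCommand(ns, keyPattern, min, max) {
  return { dataSize: ns, keyPattern: keyPattern, min: min, max: max };
}

// Hypothetical helper: treat a chunk as empty when dataSize succeeded
// and reported neither bytes nor documents in the range.
function isEmptyChunk(dataSizeResult) {
  return Number(dataSizeResult.ok) === 1 &&
         Number(dataSizeResult.size) === 0 &&
         Number(dataSizeResult.numObjects) === 0;
}

// Command for the third chunk of the example collection (org_id 5 up to 10).
var dataSizeCmd = buildDataSizeCommand(
  "mydatabase.clients", { org_id: 1 }, { org_id: 5 }, { org_id: 10 }
);

// In the mongo shell, the command would then be run along these lines:
//   var result = db.getSiblingDB("mydatabase").runCommand(dataSizeCmd);
//   if (isEmptyChunk(result)) { print("empty chunk"); }
```

Checking `numObjects` as well as `size` is a deliberately conservative choice for this sketch; the original check relies on `size` alone.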

Merging Chunks

Assuming we take the first option, here is the mergeChunks command that helps us get this done:

Shell

db.adminCommand( {
    mergeChunks: "mydatabase.clients",
    bounds: [ { "org_id" : 5 },
              { "org_id" : 15 } ]
} )

The new chunk ranges now would be as follows:

minKey → 1
1 → 5
5 → 15
15 → 20
…

One caveat is that the chunks we want to merge might not be on the same shard. If that is the case we need to move them together first, using the moveChunk command.
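As a rough sketch of that preparatory step, the helper below builds a moveChunk command document using the command's `bounds` and `to` fields. The function name and the shard name "shard0001" are hypothetical; substitute the shard that actually holds the neighboring chunk.

```javascript
// Hypothetical helper: build a moveChunk command for the chunk with the given bounds.
function buildMoveChunkCommand(ns, min, max, targetShard) {
  return { moveChunk: ns, bounds: [min, max], to: targetShard };
}

// Move the empty chunk (org_id 5 up to 10) to the shard that holds its neighbor.
// "shard0001" is a made-up shard name used only for illustration.
var moveCmd = buildMoveChunkCommand(
  "mydatabase.clients", { org_id: 5 }, { org_id: 10 }, "shard0001"
);

// In the mongo shell, connected to a mongos:
//   db.adminCommand(moveCmd);
```

Moving the empty chunk rather than its populated neighbor keeps the migration cheap, since there are no documents to copy.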

Putting it All Together

Following the above logic, we can iterate through all the chunks in shard key order and check their size. If we find an empty chunk, we merge it with the chunk just before it. If the chunks are not on the same shard, we move them together. The following script can be used to print all the commands required:

Shell

var mergeChunkInfo = function(ns){
    // NOTE: this lookup relies on the "ns" field of config.chunks; MongoDB 5.0 and later
    // identify chunks by collection UUID instead, so the query would need adapting there.
    var chunks = db.getSiblingDB("config").chunks.find({"ns" : ns}).sort({min:1}).noCursorTimeout();
    // some counters for overall stats at the end
    var totalChunks = 0;
    var totalMerges = 0;
    var totalMoves = 0;
    var previousChunk = {};
    var previousChunkInfo = {};
    var ChunkJustChanged = false;
    chunks.forEach(
        function printChunkInfo(currentChunk) {
            var db1 = db.getSiblingDB(currentChunk.ns.split(".")[0]);
            var key = db.getSiblingDB("config").collections.findOne({_id:currentChunk.ns}).key;
            db1.getMongo().setReadPref("secondary");
            // estimate:true derives the size from average document size, which is faster than a full scan
            var currentChunkInfo = db1.runCommand({dataSize:currentChunk.ns, keyPattern:key, min:currentChunk.min, max:currentChunk.max, estimate:true });
            totalChunks++;
            // if the current chunk is empty and the chunk before it was not merged in the previous iteration (or was the first chunk) we have candidates for merging
            if(currentChunkInfo.size == 0 && !ChunkJustChanged) {
                // if the chunks are contiguous
                if(JSON.stringify(previousChunk.max) == JSON.stringify(currentChunk.min) ) {
                    // if they belong to the same shard, merge with the previous chunk
                    if(previousChunk.shard.toString() == currentChunk.shard.toString() ) {
                        // mergeChunks must run against the admin database, hence adminCommand
                        print('db.adminCommand( { mergeChunks: "' + currentChunk.ns.toString() + '",' + ' bounds: [ ' + JSON.stringify(previousChunk.min) + ',' + JSON.stringify(currentChunk.max) + ' ] });');
                        // after a merge or move, we don't consider the current chunk for the next iteration. We skip to the next chunk.
                        ChunkJustChanged=true;
                        totalMerges++;
                    }
                    // if they are contiguous but on different shards, both chunks must be on the same shard before merging, so move the current one and don't merge for now
                    else {
                        print('db.adminCommand( { moveChunk: "' + currentChunk.ns.toString() + '",' + ' bounds: [ ' + JSON.stringify(currentChunk.min) + ',' + JSON.stringify(currentChunk.max) + ' ], to: "' + previousChunk.shard.toString() + '" });');
                        // after a merge or move, we don't consider the current chunk for the next iteration. We skip to the next chunk.
                        ChunkJustChanged=true;
                        totalMoves++;
                    }
                }
                else {
                    // chunks are not contiguous (this shouldn't happen unless this is the first iteration)
                    previousChunk=currentChunk;
                    previousChunkInfo=currentChunkInfo;
                    ChunkJustChanged=false;
                }
            }
            else {
                // if the current chunk is not empty or we already operated with the previous chunk let's continue with the next chunk pair
                previousChunk=currentChunk;
                previousChunkInfo=currentChunkInfo;
                ChunkJustChanged=false;
            }
        }
    );
    print("***********Summary Chunk Information***********");
    print("Total Chunks: "+totalChunks);
    print("Total Move Commands to Run: "+totalMoves);
    print("Total Merge Commands to Run: "+totalMerges);
}

We can invoke it from the Mongo shell as follows:

Shell

mergeChunkInfo("mydb.mycollection")

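The script generates all the commands needed to merge pairs of adjacent chunks where at least one is empty, moving an empty chunk next to its neighbor first when the two live on different shards. Because it skips the chunk that follows each merge or move, a single pass handles at most every other empty chunk, so running the generated commands should cut the number of empty chunks roughly in half. Running the script multiple times will eventually get rid of the remaining empty chunks.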
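The halving behavior can be checked with a simplified model of one pass. This is a sketch under stated assumptions, not the script itself: chunks are reduced to an array of sizes, all chunks are assumed to sit on the same shard (so no moves are needed), and the names `simulatePass` and `emptyCountsPerPass` are hypothetical.

```javascript
// Simplified model of one pass: an empty chunk is merged into the chunk before it,
// unless the previous iteration already produced a merge (mirrors ChunkJustChanged).
function simulatePass(sizes) {
  var result = [];
  var justChanged = false;
  sizes.forEach(function (size) {
    if (size === 0 && !justChanged && result.length > 0) {
      // Merged into the previous chunk: nothing is appended to the result.
      justChanged = true;
    } else {
      result.push(size);
      justChanged = false;
    }
  });
  return result;
}

function countEmpty(sizes) {
  return sizes.filter(function (size) { return size === 0; }).length;
}

// Repeat passes until a pass no longer merges anything, recording the empty-chunk count each time.
function emptyCountsPerPass(sizes) {
  var counts = [countEmpty(sizes)];
  var current = sizes;
  while (true) {
    var next = simulatePass(current);
    if (next.length === current.length) {
      break; // nothing merged in this pass
    }
    counts.push(countEmpty(next));
    current = next;
  }
  return counts;
}

var counts = emptyCountsPerPass([10, 0, 0, 0, 0, 20]); // [4, 2, 1, 0]
```

With four consecutive empty chunks, the count drops from 4 to 2 to 1 to 0 over three passes. In this model an empty chunk at the very start of the range is never merged, since it has no preceding chunk to merge into.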

Final Notes

Most people are aware of the problems with jumbo chunks; now we have seen how empty chunks can also be problematic in certain scenarios.

It is a good idea to stop the balancer before attempting any operation that modifies chunks (like merging the empty chunks). This ensures that no conflicting operations happen at the same time. Don’t forget to re-enable the balancer afterward.
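One way to make sure the balancer is always turned back on is to wrap the maintenance work in a helper like the one below. This is a minimal sketch: `withBalancerStopped` is a hypothetical name, and the helper only assumes that the object passed in exposes `stopBalancer()` and `startBalancer()`, as the shell's `sh` helper does.

```javascript
// Hypothetical wrapper: stop the balancer, run the maintenance work, then always re-enable it.
// `shHelper` is expected to expose stopBalancer() and startBalancer(), like the shell's `sh` object.
function withBalancerStopped(shHelper, work) {
  shHelper.stopBalancer();
  try {
    return work();
  } finally {
    // Runs even if the maintenance work throws, so the balancer is not left disabled.
    shHelper.startBalancer();
  }
}

// In the mongo shell, connected to a mongos:
//   withBalancerStopped(sh, function () {
//     // run the generated moveChunk / mergeChunks commands here
//   });
```

The try/finally block is the point of the design: if one of the generated commands fails partway through, the balancer is still restarted.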
