5

Merging Empty Chunks in MongoDB

 2 years ago
source link: https://www.percona.com/blog/merging-empty-chunks-in-mongodb/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Empty Chunks in MongoDBI recently wrote about one of the problems we can encounter while working with sharded clusters, which is Finding Undetected Jumbo Chunks in MongoDB. Another issue that we might run into is dealing with empty chunk management.

Chunk Maintenance

As we know, there is also an autoSplitter process that partitions chunks when they become too big. There is also a balancer process that takes care of moving chunks to ensure even distribution between all shards. So as data grows, chunks are partitioned and perhaps moved over to other shards and all is well.

But what happens when we delete data? It can be the case that some chunks are now empty. If we delete a lot of data, perhaps a significant number of the chunks will be empty. This can be a significant issue for sharded collections with a TTL index.

Potential Issues

One of the potential problems when dealing with a high percentage of empty chunks is uneven data distribution. The balancer will make sure the number of chunks on each shard is roughly the same, but it does not take into account whether the chunks are empty or not. So you might end up with a cluster that looks balanced, but in reality, a few shards have way more data than the rest.

To deal with this problem, the first step is to identify empty chunks.

Identifying Empty Chunks

To illustrate this, let’s consider a client’s collection that is sharded by the “org_id” field. Let’s assume the collection currently has the following chunks ranges:

minKey –> 1
1 -–> 5
5 —-> 10
10 –> 15
15 —-> 20
….

We can use the dataSize command to determine the size of a chunk. This command receives the chunk range as part of the arguments. For example, to check how many documents we have on the third chunk, we would run:

Shell
db.runCommand({ dataSize: "mydatabase.clients", keyPattern: { org_id: 1 }, min: { org_id: 5 }, max: { org_id: 10 } })

This returns a document like the following:

Shell
    "size" : 0,
    "numObjects" : 0,
    "millis" : 30,
    "ok" : 1,
    "operationTime" : Timestamp(1641829163, 2),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1641829163, 3),
        "signature" : {
            "hash" : BinData(0,"LbBPsTEahzG/v7I6oe7iyvLr/pU="),
            "keyId" : NumberLong("7016744225173049401")

If the size is 0 we know we have an empty chunk, and we can consider merging it with either the chunk that comes right after it (with the range 10 → 15) or the one just before it (with the range 1 → 5).

Merging Chunks

Assuming we take the first option, here is the mergeChunks command that helps us get this done:

Shell
db.adminCommand( {
   mergeChunks: "database.collection",
   bounds: [ { "field" : "5" },
             { "field" : "15" } ]

The new chunk ranges now would be as follows:

minKey –> 1
1 —-> 5
5 —-> 15
15 —-> 20
….

One caveat is that the chunks we want to merge might not be on the same shard. If that is the case we need to move them together first, using the moveChunk command.

Putting it All Together

Following the above logic, we can iterate through all the chunks in shard key order and check their size.  If we find an empty chunk, we merge it with the chunk just before it. If the chunks are not on the same shard, we move them together. The following script can be used to print all the commands required:

Shell
var mergeChunkInfo = function(ns){
    var chunks = db.getSiblingDB("config").chunks.find({"ns" : ns}).sort({min:1}).noCursorTimeout();
    //some counters for overall stats at the end
    var totalChunks = 0;
    var totalMerges = 0;
    var totalMoves = 0;
    var previousChunk = {};
    var previousChunkInfo = {};
    var ChunkJustChanged = false;
    chunks.forEach(
        function printChunkInfo(currentChunk) {
        var db1 = db.getSiblingDB(currentChunk.ns.split(".")[0])
        var key = db.getSiblingDB("config").collections.findOne({_id:currentChunk.ns}).key;
        db1.getMongo().setReadPref("secondary");
        var currentChunkInfo = db1.runCommand({datasize:currentChunk.ns, keyPattern:key, min:currentChunk.min, max:currentChunk.max, estimate:true });
        totalChunks++;
        // if the current chunk is empty and the chunk before it was not merged in the previous iteration (or was the first chunk) we have candidates for merging
        if(currentChunkInfo.size == 0 && !ChunkJustChanged) {    
          // if the chunks are contiguous
          if(JSON.stringify(previousChunk.max) == JSON.stringify(currentChunk.min) ) {
            // if they belong to the same shard, merge with the previous chunk
            if(previousChunk.shard.toString() == currentChunk.shard.toString() ) {
              print('db.runCommand( { mergeChunks: "' + currentChunk.ns.toString() + '",' + ' bounds: [ ' + JSON.stringify(previousChunk.min) + ',' + JSON.stringify(currentChunk.max) + ' ] })');
              // after a merge or move, we don't consider the current chunk for the next iteration. We skip to the next chunk.
              ChunkJustChanged=true;
              totalMerges++;
            // if they contiguous but are on different shards, we need to have both chunks to the same shard before merging, so move the current one and don't merge for now
            else {              
              print('db.runCommand( { moveChunk: "' + currentChunk.ns.toString() + '",' + ' bounds: [ ' + JSON.stringify(currentChunk.min) + ',' + JSON.stringify(currentChunk.max) + ' ], to: "' + previousChunk.shard.toString() + '" });');
              // after a merge or move, we don't consider the current chunk for the next iteration. We skip to the next chunk.
              ChunkJustChanged=true;
              totalMoves++;            
          else {
            // chunks are not contiguous (this shouldn't happen unless this is the first iteration)
            previousChunk=currentChunk;
            previousChunkInfo=currentChunkInfo;
            ChunkJustChanged=false;
        else {
          // if the current chunk is not empty or we already operated with the previous chunk let's continue with the next chunk pair
          previousChunk=currentChunk;
          previousChunkInfo=currentChunkInfo;
          ChunkJustChanged=false;
    print("***********Summary Chunk Information***********");
    print("Total Chunks: "+totalChunks);
    print("Total Move Commands to Run: "+totalMoves);
    print("Total Merge Commands to Run: "+totalMerges);

We can invoke it from the Mongo shell as follows:

Shell
mergeChunkInfo("mydb.mycollection")

The script will generate all the commands needed to merge pairs of chunks where at least one is empty. After running the generated commands, this should cut the number of empty chunks in half. Running the script multiple times will eventually get rid of all the empty chunks.

Final Notes

Most people are aware of the problems with jumbo chunks; now we have seen how empty chunks can also be problematic in certain scenarios.

It is a good idea to stop the balancer before attempting any operation that modifies chunks (like merging the empty chunks). This ensures that no conflicting operations happen at the same time. Don’t forget to enable back the balancer afterward.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK