...
Hi, we are using a MongoDB sharded cluster running 4.2.1.

Architecture:
- 3 mongos
- config servers running as a replica set (1 primary + 2 secondaries)
- 2 shards, each a 3-node replica set (1 primary + 2 secondaries)

Since shard1 and shard2 are under-utilized, we decided to remove shard2. We did the following steps:

1) We issued removeShard from a mongos and also moved the databases to the other shard. Of the 2 sharded collections, all chunks belonging to one collection drained to the other shard, but chunk migration is failing for the second collection. The balancer is not moving its chunks and throws the following message:

2020-02-24T15:56:48.481+0000 I SHARDING [Balancer] distributed lock 'keychain.eg_keyring' acquired for 'Migrating chunk(s) in collection keychain.eg_keyring', ts : 5de584eedab8c4c434adabb5
2020-02-24T15:56:48.546+0000 I SHARDING [TransactionCoordinator] distributed lock with ts: '5de584eedab8c4c434adabb5' and _id: 'keychain.eg_keyring' unlocked.
2020-02-24T15:56:48.549+0000 I SHARDING [Balancer] Balancer move keychain.eg_keyring: [{ rId: UUID("80460000-0000-0000-0000-000000000000") }, { rId: UUID("80480000-0000-0000-0000-000000000000") }), from test-mongodb-egdp-keychain-01-shard02, to test-mongodb-egdp-keychain-01-shard01 failed :: caused by :: OperationFailed: Data transfer error: migrate failed: Location51008: operation was interrupted
2020-02-24T15:56:48.550+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-10-0-212-244:27017-2020-02-24T15:56:48.550+0000-5e53f240dab8c4c434ec8b37", server: "ip-10-0-212-244:27017", shard: "config", clientAddr: "", time: new Date(1582559808550), what: "balancer.round", ns: "", details: { executionTimeMillis: 243, errorOccured: false, candidateChunks: 1, chunksMoved: 0 } }

We also tried moving some of the chunks manually, and those moves failed for the same reason. The sh.status() output is attached.

Using the chunk bounds from the sh.status() output above, we issued the following command to move one chunk:

db.adminCommand( { moveChunk : "keychain.eg_keyring",
    bounds : [ { "rId" : UUID("80460000-0000-0000-0000-000000000000") }, { "rId" : UUID("80480000-0000-0000-0000-000000000000") } ],
    to : "test-mongodb-egdp-keychain-01-shard01" } )

Output:

mongos> db.adminCommand( { moveChunk : "keychain.eg_keyring" ,
... bounds : [{ "rId" : UUID("80460000-0000-0000-0000-000000000000") }, { "rId" : UUID("80480000-0000-0000-0000-000000000000") }] ,
... to : "test-mongodb-egdp-keychain-01-shard01"
... } )
{
    "ok" : 0,
    "errmsg" : "Data transfer error: migrate failed: Location51008: operation was interrupted",
    "code" : 96,
    "codeName" : "OperationFailed",
    "operationTime" : Timestamp(1582566446, 139),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1582566446, 139),
        "signature" : {
            "hash" : BinData(0,"jaz2qGWhuM36vt48xNt+mv+CHfo="),
            "keyId" : NumberLong("6765960194405957649")
        }
    }
}

Apart from this, we also issued flushRouterConfig multiple times and restarted all mongos, but the same issue persists. Please let me know if there is a known bug around this or any configuration that we need to tweak on our side.
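For reference, the progress of the drain can be checked by re-issuing removeShard from a mongos. The sketch below uses the shard name taken from the log lines above; the exact shape of the response may vary slightly by server version:

// Run from a mongos: re-issuing removeShard reports draining progress for the shard.
db.adminCommand( { removeShard: "test-mongodb-egdp-keychain-01-shard02" } )
// While draining, the response has state: "ongoing" and a remaining.chunks count;
// it only changes to "completed" once every chunk and database has moved off the shard.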
carl.champain commented on Wed, 4 Mar 2020 16:09:58 +0000:
haidilip83@gmail.com, Thanks for getting back to us! I will now close this ticket.

haidilip83@gmail.com commented on Tue, 3 Mar 2020 17:48:55 +0000:
Thanks Carl for the detailed summary. Hypothesis 1 is confirmed. We are working on enforcing the uniqueness of the _id index on the application side. We can now close this ticket.

carl.champain commented on Fri, 28 Feb 2020 20:39:21 +0000:
haidilip83@gmail.com, After investigating your issue, we've come up with two hypotheses:

1. The _id index key is not unique across your sharded cluster. Our documentation says the following about the uniqueness of the _id index across a sharded cluster:

If the _id field is not the shard key or the prefix of the shard key, the _id index only enforces the uniqueness constraint per shard and not across shards. For example, consider a sharded collection (with shard key {x: 1}) that spans two shards A and B. Because the _id key is not part of the shard key, the collection could have a document with _id value 1 in shard A and another document with _id value 1 in shard B. If the _id field is not the shard key nor the prefix of the shard key, MongoDB expects applications to enforce the uniqueness of the _id values across the shards.

In your case, we noticed that _id is neither the shard key nor the prefix of the shard key, which makes it possible that a document on shard2 has the same _id as a document on shard1.

2. Shard1 may contain orphan documents. Orphan documents appear after a failed migration or an unclean shutdown; they can be duplicates of documents that were moved onto a different shard. There are a few ways in which orphan documents would cause a duplicate key error:

- The chunk migration from shard2 to shard1 failed, but some documents were still written to shard1. When shard2 tries to migrate the chunks again, the duplicate key error arises because shard1 already contains some of shard2's documents.
- Shard1 crashed after a chunk migration from shard1 to shard2 while the RangeDeleter was still running. When shard1 comes back online, the RangeDeleter does not persist or replicate the ranges it has yet to clean, so it cannot restart from where it left off. During the later chunk migration from shard2 to shard1, the error comes up because shard1 still has those orphan documents.

To determine whether hypothesis 1 or 2 is correct, please connect directly to the primary replica set member of shard1 and shard2 and run:

db.eg_keyring.find( { _id: UUID("245a5a22-4eb3-35ab-b79d-6c0bc431f169") } )

If the returned documents have the same _id but not the same shard key, then the _id index key is not unique across your sharded cluster and hypothesis 1 is confirmed. You can enforce the uniqueness of the _id index key in your application logic, or you can update your shard key. If you need further assistance troubleshooting, I encourage you to ask our community by posting on the mongodb-user group or on Stack Overflow with the mongodb tag.

If the returned documents are identical, then shard1 contains orphan documents and hypothesis 2 is confirmed. Please run cleanupOrphaned on the primary replica set member of shard1 to remediate this issue.

Kind regards,
Carl

haidilip83@gmail.com commented on Wed, 26 Feb 2020 15:15:38 +0000:
Please confirm whether this is related to https://jira.mongodb.org/browse/SERVER-45844 as well.

haidilip83@gmail.com commented on Tue, 25 Feb 2020 22:10:02 +0000:
Hi Carl, I have uploaded all of the requested information.
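To make that check concrete, here is a minimal sketch of the comparison on each shard primary and, had hypothesis 2 been confirmed, the 4.2-style cleanupOrphaned loop. The loop follows the documented pre-4.4 pattern; the variable names are only illustrative:

// On the PRIMARY of shard1, and again on the PRIMARY of shard2:
use keychain
db.eg_keyring.find( { _id: UUID("245a5a22-4eb3-35ab-b79d-6c0bc431f169") } )
// Same _id but different rId (shard key) on the two shards -> hypothesis 1.
// Identical documents on both shards                       -> hypothesis 2 (orphans on shard1).

// Only if hypothesis 2 were confirmed: run cleanupOrphaned on the shard1 primary.
// On 4.2 each call cleans a single range, so it is normally looped until no
// stoppedAtKey is returned:
var nextKey = {};
while (nextKey != null) {
    var result = db.adminCommand( { cleanupOrphaned: "keychain.eg_keyring", startingFromKey: nextKey } );
    if (result.ok != 1) { printjson(result); break; }
    nextKey = result.stoppedAtKey;
}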
Regards,
Dilip K

carl.champain commented on Tue, 25 Feb 2020 16:30:15 +0000:
Hi haidilip83@gmail.com, Thank you for the report. To help us understand what is happening, can you please provide:

The logs for:
- Each of the mongos.
- The primary of shard1 and shard2.
- The primary of the config servers.

The mongodump of your config server. The command should look like this:

mongodump --db=config --host=

We've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Kind regards,
Carl
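Purely as an illustration of that dump request (the host and output directory below are placeholders, not values from this ticket), a complete invocation might look like:

mongodump --db=config --host=<config-server-primary>:27017 --out=./config-dump
# if access control is enabled, also pass --username, --password and --authenticationDatabase=admin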