...
The sharding balancer policy avoids producing multiple migrations for the same shard. However, the policy runs per collection and does not retain any state across collections. Because of this, if there are multiple collections which need balancing, it may produce migrations whose source or destination shards overlap. There is no correctness problem with this, but it causes useless ConflictingOperationInProgress errors to pollute the config server's and the shards' logs on each balancer round.
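For illustration, the fix that eventually landed (SERVER-29423, see the commits below) makes the policy track which shards are already committed to a migration within the round and skip any candidate whose source or destination shard is taken. The following is a minimal self-contained sketch of that idea, not the actual MongoDB implementation; MigrateInfo and ShardId echo names from the real code, while selectMigrationsForRound and its input shape are invented for this example:

    #include <set>
    #include <string>
    #include <vector>

    using ShardId = std::string;

    struct MigrateInfo {
        std::string ns;  // collection being balanced
        ShardId from;    // donor shard
        ShardId to;      // recipient shard
    };

    // Hypothetical round-level selection: candidates are still gathered per
    // collection, but the usedShards set persists across collections, so no
    // shard appears as donor or recipient in more than one migration.
    std::vector<MigrateInfo> selectMigrationsForRound(
        const std::vector<std::vector<MigrateInfo>>& candidatesPerCollection) {
        std::set<ShardId> usedShards;
        std::vector<MigrateInfo> scheduled;

        for (const auto& collectionCandidates : candidatesPerCollection) {
            for (const auto& migration : collectionCandidates) {
                if (usedShards.count(migration.from) || usedShards.count(migration.to)) {
                    continue;  // shard already donating or receiving this round
                }
                usedShards.insert(migration.from);
                usedShards.insert(migration.to);
                scheduled.push_back(migration);
            }
        }
        return scheduled;
    }

Because usedShards persists across the per-collection candidate lists, no shard can be scheduled as donor or recipient twice in the same round, which is exactly the overlap the description above calls out.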
kaloian.manassiev commented on Mon, 22 Jan 2018 20:54:22 +0000:

The effect of this fix cannot be easily quantified. It will not make individual migrations go faster, but it will improve parallelism in the case where there are multiple collections which all need to be rebalanced. Currently, in the worst-case scenario only one migration can effectively run per round because of conflicts with other collections, so I expect the performance gain to range from 1x up to the maximum number of parallel migrations possible in a system.

aparna.shah commented on Mon, 22 Jan 2018 20:46:46 +0000:

ramon.fernandez, would you be able to provide some numbers on the performance gain observed in balancing/chunk migration as a result of this fix? It might help us better manage customer expectations in https://jira.mongodb.org/projects/HELP/queues/issue/HELP-5680

xgen-internal-githook commented on Tue, 16 Jan 2018 20:17:10 +0000:

Author: {'email': 'kaloian.manassiev@mongodb.com', 'name': 'Kaloian Manassiev', 'username': 'kaloianm'}
Message: SERVER-29423 Prevent the balancer policy from scheduling migrations with the same source or destination
(cherry picked from commit b5ebe8a5492c4f5e33970c0f885b9ac51460b9dc)
Branch: v3.4
https://github.com/mongodb/mongo/commit/fbb20bc3b0e3f9274eeab9e8e2397821c8ab1853

xgen-internal-githook commented on Tue, 16 Jan 2018 19:23:49 +0000:

Author: {'email': 'kaloian.manassiev@mongodb.com', 'name': 'Kaloian Manassiev', 'username': 'kaloianm'}
Message: SERVER-29423 Prevent the balancer policy from scheduling migrations with the same source or destination
(cherry picked from commit b5ebe8a5492c4f5e33970c0f885b9ac51460b9dc)
Branch: v3.6
https://github.com/mongodb/mongo/commit/2ac2f347399022d91bb3d98ec1e5d5f4c061524c

xgen-internal-githook commented on Tue, 16 Jan 2018 16:39:18 +0000:

Author: {'email': 'kaloian.manassiev@mongodb.com', 'name': 'Kaloian Manassiev', 'username': 'kaloianm'}
Message: SERVER-29423 Prevent the balancer policy from scheduling migrations with the same source or destination
Branch: master
https://github.com/mongodb/mongo/commit/b5ebe8a5492c4f5e33970c0f885b9ac51460b9dc

akira.kurogane commented on Wed, 10 Jan 2018 22:19:16 +0000:

Some notes to share for anyone diagnosing the issue backwards from logs and searching through this JIRA.

1. The following sort of "Balancer move ... failed" log message with "caused by :: ConflictingOperationInProgress" will be prevalent in the primary config server's logs before each new { what: "balancer.round" } document is inserted into the actionlog collection:

    2017-12-17T11:19:47.508Z I SHARDING [NetworkInterfaceASIO-ShardRegistry-0] distributed lock with ts: '5a2e2d929069cca7cde7e996' and _id: '' unlocked.
    2017-12-17T11:19:47.522Z I SHARDING [Balancer] Balancer move : [{ }, { }), from shard3, to shard1 failed :: caused by :: ConflictingOperationInProgress: Unable to start new migration because this shard is currently donating chunk [, ) for namespace to shard4
    2017-12-17T11:19:47.522Z I SHARDING [Balancer] Balancer move : [{ }, { }), from shard3, to shard1 failed :: caused by :: ConflictingOperationInProgress: Unable to start new migration because this shard is currently donating chunk [, ) for namespace to shard1
    ...
    2017-12-17T11:19:47.522Z I SHARDING [Balancer] Balancer move : [{ }, { }), from shard4, to shard2 failed :: caused by :: ConflictingOperationInProgress: Unable to start new migration because this shard is currently donating chunk [, ) for namespace to shard4
    2017-12-17T11:19:47.522Z I SHARDING [Balancer] Balancer move : [{ }, { }), from shard1, to shard5 failed :: caused by :: ConflictingOperationInProgress: Unable to start new migration because this shard is currently donating chunk [, ) for namespace to shard2
    2017-12-17T11:19:47.522Z I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "serverY10-2017-12-17T06:19:47.522-0500-5a3652d39069cca7cdf8741d", server: "serverY", clientAddr: "", time: new Date(1513509587522), what: "balancer.round", ns: "", details: { executionTimeMillis: 9990, errorOccured: false, candidateChunks: 14, chunksMoved: 2 } }

Spread between the shards there will be an equal or greater number of { what: "moveChunk.error" } documents inserted into the changelog collection. There may be other causes (ChunkRangeCleanupPending, ChunkTooBig, etc.), but ConflictingOperationInProgress will outnumber them.

2. The race between moveChunks can leave some shard pairs with no migration at all, even though there were candidates that could have used those shard pairs. In the example above the { what: "balancer.round" } document has ..., errorOccured: false, candidateChunks: 14, chunksMoved: 2, but this was a six-shard cluster with candidates for all shard pairs, so a typical balancer round should have achieved chunksMoved: 3. The third migration was missed because:
- A migration from shard A => B begins.
- A migration from shard C => D begins.
- A migration from shard F => B aborts because of the first one above.
- A migration from E => F aborts because it contacted shard F during the short moment in time F was still awaiting the response from B.

vkatikineni@snapfish-llc.com commented on Fri, 22 Dec 2017 18:51:00 +0000:

The ConflictingOperationInProgress error also slows down the chunk migration rate when there are multiple collections to be balanced. It looks like a shard can only participate in one chunk migration at a time, i.e. either receive or donate. We should be able to receive/donate multiple chunks on the same shard provided they belong to different collections.

oleg@evergage.com commented on Sat, 10 Jun 2017 22:02:35 +0000:

The impact of this problem is captured in SERVER-29149.

oleg@evergage.com commented on Sat, 10 Jun 2017 22:00:46 +0000:

If many collections need balancing, this effectively degrades the config servers into constantly refreshing the chunk metadata due to ConflictingOperationInProgress. This heavily pollutes the logs, at a rate of ~600M/hr. I think part of the problem is that a ConflictingOperationInProgress raised for the reason "Unable to start new migration because this shard is currently receiving chunk" is not an operation that should merit a refresh. It is an operation that conflicts with the act of transferring data, but not necessarily with this collection. So ConflictingOperationInProgress represents both "unable to transfer due to a shard issue" and "unable to transfer due to another operation altering this collection's metadata." Is it possible that only the second category needs a refresh retry in catalog_cache.cpp?
If you look in the method CatalogCache::_scheduleCollectionRefresh_inlock, in this section:

    // It is possible that the metadata is being changed concurrently, so retry the
    // refresh again
    if (status == ErrorCodes::ConflictingOperationInProgress &&
        refreshAttempt < kMaxInconsistentRoutingInfoRefreshAttempts) {
        _scheduleCollectionRefresh_inlock(dbEntry, nullptr, nss, refreshAttempt + 1);
        ...
    }

you can see how ConflictingOperationInProgress errors caused by shards transferring chunks will lead to heavy collection metadata refreshing if lots of collections need balancing.
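To make the suggested split concrete, here is a minimal sketch of the distinction oleg@evergage.com is proposing, assuming the conflict reason could be classified before deciding whether to retry; the ConflictKind enum and shouldRetryRefresh helper are hypothetical and do not exist in the MongoDB codebase:

    // Hypothetical classification of the two ConflictingOperationInProgress
    // categories described above. Real MongoDB code reports both under one
    // error code; this sketch assumes the reason has already been parsed out.
    enum class ConflictKind {
        // Shard is busy donating or receiving a chunk: a shard-level condition
        // that does not invalidate this collection's routing metadata.
        kShardBusyTransferring,
        // Another operation is altering this collection's metadata: the cached
        // routing info may be stale, so a refresh retry is warranted.
        kCollectionMetadataChanged,
    };

    // Only the metadata category would schedule another refresh attempt, so a
    // busy shard no longer triggers the refresh loop described above.
    bool shouldRetryRefresh(ConflictKind kind, int refreshAttempt, int maxAttempts) {
        return kind == ConflictKind::kCollectionMetadataChanged &&
               refreshAttempt < maxAttempts;
    }

Under such a split, the "currently donating/receiving chunk" failures seen in the balancer logs above would simply fail the migration attempt without forcing the config server to re-fetch the collection's chunk metadata.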