Config servers all have the same process id of "ConfigServer." On transition to primary, a config node unlocks all existing distlocks with the process id "ConfigServer." This means that DDL operations which serialize on the config server via a distlock but whose business logic is executed by a shard (moveChunk, movePrimary, and shardCollection) are suspect, because the shard can keep executing the business logic outside the distlock. For example, you could drop a database concurrently with sharding a collection and end up with a config.collections entry without a corresponding config.databases entry. Note that the track unsharded project will add two more DDL operations with the shardCollection pattern (renameCollection and convertToCapped).
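The race described above can be modeled in a few lines. The sketch below is a hypothetical, simplified model (the class and method names are invented, not MongoDB's actual API): a lock table keyed by resource, where step-up releases every lock held under the shared process id "ConfigServer", even though the DDL operation that took one of those locks is still running on a shard.

```python
class DistLockManager:
    """Toy model of the config server distlock table (hypothetical names)."""

    def __init__(self):
        self.locks = {}  # resource name -> owning process id

    def try_lock(self, resource, process_id):
        # A distlock is exclusive: acquisition fails if any holder exists.
        if resource in self.locks:
            return False
        self.locks[resource] = process_id
        return True

    def unlock_all(self, process_id):
        # What a config node does on transition to primary: release every
        # lock recorded under the shared process id, with no regard for
        # whether the owning DDL operation is still in flight on a shard.
        self.locks = {r: p for r, p in self.locks.items() if p != process_id}


mgr = DistLockManager()
assert mgr.try_lock("test.coll", "ConfigServer")   # old primary: shardCollection
mgr.unlock_all("ConfigServer")                     # failover: new primary clears all
assert mgr.try_lock("test", "ConfigServer")        # dropDatabase now gets the lock
# Meanwhile shardCollection's business logic keeps executing on the shard with
# no lock held, so config.collections can gain an entry whose parent
# config.databases entry has been dropped.
```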
kaloian.manassiev commented on Thu, 10 Feb 2022 09:26:41 +0000:
This is now Gone Away after 5.0, because under the DDL project we have local synchronisation between DDL and moveChunk, which was the main reason for having the config server dist lock.

esha.maharishi@10gen.com commented on Mon, 29 Jul 2019 19:31:00 +0000:
kaloian.manassiev, hmm, I think the sharding catalog can only get corrupted by metadata commands for which the shard directly writes to the sharding catalog. So I think it is only a problem for shardCollection, movePrimary, and moveChunk. This is because config servers use ConnectionString::forLocal, and for ConnectionString::LOCAL the ShardRegistry returns ShardLocal instances. So, if the config server is the one that writes to the sharding catalog, it will write on the same branch of history the distlock was on. The problem is if a shard targets the new config primary and therefore updates the sharding catalog on the new branch of history that has released the distlock.

Note that this second issue:

"Another way this can manifest that would result in actual user data loss is if an old config primary executed dropCollection after the collection had been recreated on the new config primary."

would not be solved even if the new config primary reacquired persisted locks.

kaloian.manassiev commented on Mon, 29 Jul 2019 18:54:15 +0000:
When we moved the balancer to the config server in 3.4, the collection lock acquisition due to moveChunk was the only operation taking distributed locks on the config server. At the time, we must have (knowingly or not) made the decision that it would be cleaner to have step-up clean up these locks so that (1) the migration manager recovery doesn't get stuck and (2) the migration manager recovery will re-acquire them. I guess in 3.6 we moved more operations to the config server, which didn't have the recovery process that moveChunk goes through, and because of this we introduced this bug.
Given that the lock manager will re-acquire locks with the same session id, we could remove this behaviour where the lock manager removes all locks, by giving the MigrationManager a fixed session id and, on step-up, unlocking only the locks acquired by that session id. This would preserve the intra-node synchronization behaviour we rely on through the dist locks, without requiring us to build a more sophisticated dist lock manager.

jason.zhang commented on Thu, 25 Jul 2019 22:29:26 +0000:
Attached is a repro. For a consistent repro you need to change mongos's internal retries to 1. This is because after stepping down the config server, mongos will retry shardCollection on the new config server, which will take distlocks that block dropCollection from taking distlocks on the target database.

esha.maharishi@10gen.com commented on Thu, 25 Jul 2019 20:13:08 +0000:
Assigned to jason.zhang to write a repro that demonstrates the sharding catalog becoming inconsistent if the config server fails over during shardCollection.
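The fixed-session-id proposal in kaloian.manassiev's comment can be sketched as follows. This is a minimal, hypothetical model (names like MIGRATION_SESSION and unlock_by_session are invented for illustration, not MongoDB internals): each lock records the session that took it, and step-up releases only the MigrationManager's locks, leaving other in-flight DDL locks held.

```python
class DistLockManager:
    """Toy distlock table extended with session ids (hypothetical names)."""

    def __init__(self):
        self.locks = {}  # resource name -> (process_id, session_id)

    def try_lock(self, resource, process_id, session_id):
        held = self.locks.get(resource)
        if held is not None and held[1] != session_id:
            return False
        # Re-acquisition with the same session id succeeds, the lock manager
        # behaviour the proposal relies on.
        self.locks[resource] = (process_id, session_id)
        return True

    def unlock_by_session(self, session_id):
        # Proposed step-up behaviour: release only the locks taken by the
        # fixed MigrationManager session, not every "ConfigServer" lock.
        self.locks = {r: v for r, v in self.locks.items() if v[1] != session_id}


MIGRATION_SESSION = "migration-manager"  # hypothetical fixed session id

mgr = DistLockManager()
mgr.try_lock("test.coll", "ConfigServer", "shardCollection-op")
mgr.try_lock("test.other", "ConfigServer", MIGRATION_SESSION)
mgr.unlock_by_session(MIGRATION_SESSION)  # step-up of the new config primary
# The in-flight shardCollection lock survives, so dropDatabase stays blocked.
assert not mgr.try_lock("test.coll", "ConfigServer", "dropDatabase-op")
```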