Config servers all have the same process id of "ConfigServer." On transition to primary, a config node unlocks all existing distlocks with the process id "ConfigServer." This means that DDL operations which serialize on the config server via a distlock but whose business logic is executed by a shard (moveChunk, movePrimary, and shardCollection) are suspect, because the shard can keep executing the business logic outside the distlock. For example, you could drop a database concurrently with sharding a collection and end up with a config.collections entry without a corresponding config.databases entry. Note that the track unsharded project will add two more DDL operations with the shardCollection pattern (renameCollection and convertToCapped).
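The race described above can be modeled in a few lines. The sketch below is a hypothetical, simplified model (the class and method names are invented, not MongoDB's actual API): a lock table keyed by resource, where step-up releases every lock held under the shared process id "ConfigServer", even though the DDL operation that took one of those locks is still running on a shard.

```python
class DistLockManager:
    """Toy model of the config server distlock table (hypothetical names)."""

    def __init__(self):
        self.locks = {}  # resource name -> owning process id

    def try_lock(self, resource, process_id):
        # A distlock is exclusive: acquisition fails if any holder exists.
        if resource in self.locks:
            return False
        self.locks[resource] = process_id
        return True

    def unlock_all(self, process_id):
        # What a config node does on transition to primary: release every
        # lock recorded under the shared process id, with no regard for
        # whether the owning DDL operation is still in flight on a shard.
        self.locks = {r: p for r, p in self.locks.items() if p != process_id}


mgr = DistLockManager()
assert mgr.try_lock("test.coll", "ConfigServer")   # old primary: shardCollection
mgr.unlock_all("ConfigServer")                     # failover: new primary clears all
assert mgr.try_lock("test", "ConfigServer")        # dropDatabase now gets the lock
# Meanwhile shardCollection's business logic keeps executing on the shard with
# no lock held, so config.collections can gain an entry whose parent
# config.databases entry has been dropped.
```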
kaloian.manassiev commented on Thu, 10 Feb 2022 09:26:41 +0000:
This is now Gone Away after 5.0, because under the DDL project we have local synchronisation between DDL and moveChunk, which was the main reason for having the config server dist lock.

esha.maharishi@10gen.com commented on Mon, 29 Jul 2019 19:31:00 +0000:
kaloian.manassiev, hmm, I think the sharding catalog can only get corrupted by metadata commands for which the shard directly writes to the sharding catalog. So I think it is only a problem for shardCollection, movePrimary, and moveChunk. This is because config servers use ConnectionString::forLocal, and for ConnectionString::LOCAL the ShardRegistry returns ShardLocal instances. So, if the config server is the one that writes to the sharding catalog, it will write on the same branch of history the distlock was on. The problem is if a shard targets the new config primary and therefore updates the sharding catalog on the new branch of history that has released the distlock.

Note that this second issue:

"Another way this can manifest that would result in actual user data loss is if an old config primary executed dropCollection after the collection had been recreated on the new config primary."

would not be solved even if the new config primary reacquired persisted locks.

kaloian.manassiev commented on Mon, 29 Jul 2019 18:54:15 +0000:
When we moved the balancer to the config server in 3.4, the collection lock acquisition due to moveChunk was the only operation taking distributed locks on the config server. At the time, we must have (knowingly or not) made the decision that it would be cleaner to have step-up clean up these locks so that (1) the migration manager recovery doesn't get stuck and (2) the migration manager recovery will re-acquire them. I guess in 3.6 we moved more operations to the config server, which didn't have the recovery process that moveChunk goes through, and because of this we introduced this bug.
Given that the lock manager will re-acquire locks with the same session id, we could remove this behaviour where the lock manager removes all locks, by giving the MigrationManager a fixed session id and, on step-up, unlocking only the locks acquired by that session id. This would preserve the intra-node synchronization behaviour we rely on through the dist locks, without requiring us to build a more sophisticated dist lock manager.

jason.zhang commented on Thu, 25 Jul 2019 22:29:26 +0000:
Attached is a repro. For a consistent repro you need to change mongos's internal retries to 1. This is because after stepping down the config server, mongos will retry shardCollection on the new config server, which will take distlocks that block dropCollection from taking distlocks on the target database.

esha.maharishi@10gen.com commented on Thu, 25 Jul 2019 20:13:08 +0000:
Assigned to jason.zhang to write a repro that demonstrates the sharding catalog becoming inconsistent if the config server fails over during shardCollection.
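The fixed-session-id proposal in kaloian.manassiev's comment can be sketched as follows. This is a minimal, hypothetical model (names like MIGRATION_SESSION and unlock_by_session are invented for illustration, not MongoDB internals): each lock records the session that took it, and step-up releases only the MigrationManager's locks, leaving other in-flight DDL locks held.

```python
class DistLockManager:
    """Toy distlock table extended with session ids (hypothetical names)."""

    def __init__(self):
        self.locks = {}  # resource name -> (process_id, session_id)

    def try_lock(self, resource, process_id, session_id):
        held = self.locks.get(resource)
        if held is not None and held[1] != session_id:
            return False
        # Re-acquisition with the same session id succeeds, the lock manager
        # behaviour the proposal relies on.
        self.locks[resource] = (process_id, session_id)
        return True

    def unlock_by_session(self, session_id):
        # Proposed step-up behaviour: release only the locks taken by the
        # fixed MigrationManager session, not every "ConfigServer" lock.
        self.locks = {r: v for r, v in self.locks.items() if v[1] != session_id}


MIGRATION_SESSION = "migration-manager"  # hypothetical fixed session id

mgr = DistLockManager()
mgr.try_lock("test.coll", "ConfigServer", "shardCollection-op")
mgr.try_lock("test.other", "ConfigServer", MIGRATION_SESSION)
mgr.unlock_by_session(MIGRATION_SESSION)  # step-up of the new config primary
# The in-flight shardCollection lock survives, so dropDatabase stays blocked.
assert not mgr.try_lock("test.coll", "ConfigServer", "dropDatabase-op")
```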