
OPERATIONAL DEFECT DATABASE
...

...
In sharded clusters, the following two flows may execute concurrently on step-up:

(I) The range deleter service scans config.rangeDeletions and enqueues all range deletion tasks that are marked as non-pending.
(II) The step-up procedure spawns the recovery of migration coordinators based on the content of config.migrationCoordinators.

Regarding (II), when a committed migration needs to be recovered, the donor re-executes the whole local part of the commit upon step-up, which includes:

(A) Enqueuing a PENDING range deletion task.
(B) Updating the persistent state in config.rangeDeletions by marking the document as non-pending.
(C) Asynchronously observing the update performed at step (B), which causes the task enqueued at step (A) to be marked as non-pending so that it is eventually served.

Steps (A) and (B) may already have been executed before stepping down. In that case, on step-up, flows (I) and (II) may interleave in the following way:

(1) [flow (I)] The range deleter service enqueues task T because it is already marked as ready in config.rangeDeletions.
(2) [flow (I)] The range deleter service quickly serves T, deleting the persistent document from config.rangeDeletions.
(3) [flow (II)] The pending range deletion task T is scheduled at step (A).
(4) [flow (II)] Step (B) is a no-op because the document was deleted at (2).

Since (4) is a no-op, step (C) never happens. That is the correct behavior, because the range deletion task no longer exists and the node may be starting to receive the same range, so no deletion should be performed. The problem is rather that (3) creates a dangling range deletion task that should never be marked as ready, because it should not have been enqueued in the first place.

Proposed solution

The migration coordinator should execute the procedure that marks the range deletion task as non-pending only once. This code should be executed conditionally, only if the range deletion document is still pending.
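
The interleaving and the proposed conditional check can be illustrated with a small in-memory toy model. This is only a sketch under stated assumptions, not MongoDB's actual implementation: the class RangeDeleterModel, its methods, and the helper replay_interleaving are all hypothetical names, config.rangeDeletions is replaced by a plain dictionary, and the asynchronous step (C) is collapsed into a synchronous assignment.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RangeDeleterModel:
    """Hypothetical toy model of one shard; not real server code."""

    # Stand-in for config.rangeDeletions: task id -> {"pending": bool}
    persisted: Dict[str, dict] = field(default_factory=dict)
    # Stand-in for the in-memory task queue: task id -> "pending" | "ready"
    enqueued: Dict[str, str] = field(default_factory=dict)

    def scan_on_step_up(self) -> None:
        # Flow (I): enqueue every persisted task already marked non-pending.
        for task_id, doc in self.persisted.items():
            if not doc["pending"]:
                self.enqueued[task_id] = "ready"

    def serve(self, task_id: str) -> None:
        # Serving a ready task removes it from the queue and deletes its document.
        if self.enqueued.get(task_id) == "ready":
            del self.enqueued[task_id]
            self.persisted.pop(task_id, None)

    def recover_unconditional(self, task_id: str) -> None:
        # Flow (II) as described in the report.
        self.enqueued[task_id] = "pending"        # step (A), always executed
        doc = self.persisted.get(task_id)
        if doc is not None:                       # step (B); no-op if deleted
            doc["pending"] = False
            self.enqueued[task_id] = "ready"      # step (C), modelled synchronously

    def recover_conditional(self, task_id: str) -> bool:
        # Sketch of the proposed fix: act only if the document is still pending.
        doc = self.persisted.get(task_id)
        if doc is None or not doc["pending"]:
            return False                          # nothing to recover
        self.enqueued[task_id] = "pending"        # step (A)
        doc["pending"] = False                    # step (B)
        self.enqueued[task_id] = "ready"          # step (C)
        return True


def replay_interleaving(conditional: bool) -> Dict[str, str]:
    """Replay interleaving (1)-(4) and return the queue contents afterwards."""
    # (A) and (B) ran before step-down, so T is persisted as non-pending.
    model = RangeDeleterModel(persisted={"T": {"pending": False}})
    model.scan_on_step_up()                       # (1)
    model.serve("T")                              # (2)
    if conditional:
        model.recover_conditional("T")
    else:
        model.recover_unconditional("T")          # (3) and (4)
    return model.enqueued
```

In this model, replaying the interleaving with the unconditional recovery leaves T in the queue in the pending state with no document left to ever mark it ready, while the conditional variate leaves the queue empty. A task whose document is still pending at recovery time is enqueued and marked ready as before.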