...
After running rs.stepDown() on the primary the primary crashes with the following output at the tail of the log: 2018-10-30T16:04:18.438-0400 I NETWORK [conn71] received client metadata from 192.168.0.122:63244 conn71: { driver: { name: "mongo-csharp-driver", version: "0.0.0.0" }, os: { type: "Windows", name: "Microsoft Windows 10.0.17134", architecture: "x86_32", version: "10.0.17134" }, platform: ".NET Framework 4.7.3190.0" } 2018-10-30T16:04:18.439-0400 F - [conn71] Invariant failure pin == ErrorCodes::CursorNotFound || pin == ErrorCodes::Unauthorized src\mongo\db\cursor_manager.cpp 186 2018-10-30T16:04:18.443-0400 F - [conn71] ***aborting after invariant() failure 2018-10-30T16:04:18.547-0400 F - [conn71] Got signal: 22 (SIGABRT). mongod.exe ...\src\mongo\util\stacktrace_windows.cpp(244) mongo::printStackTrace+0x67x mongod.exe ...\src\mongo\util\signal_handlers_synchronous.cpp(182) mongo::`anonymous namespace'::printSignalAndBacktrace+0x116x mongod.exe ...\src\mongo\util\signal_handlers_synchronous.cpp(238) mongo::`anonymous namespace'::abruptQuit+0x133x ucrtbase.dll raise+0x475x ucrtbase.dll abort+0x49x mongod.exe ...\src\mongo\util\assert_util.cpp(104) mongo::invariantFailed+0x410x mongod.exe ...\src\mongo\db\cursor_manager.cpp(186) mongo::`anonymous namespace'::GlobalCursorIdCache::killCursor+0x208x mongod.exe ...\src\mongo\db\cursor_manager.cpp(371) mongo::CursorManager::killCursorGlobalIfAuthorized+0x73x mongod.exe ...\src\mongo\db\service_entry_point_common.cpp(1127) mongo::`anonymous namespace'::receivedKillCursors+0x613x mongod.exe ...\src\mongo\db\service_entry_point_common.cpp(1339) mongo::ServiceEntryPointCommon::handleRequest+0x1313x mongod.exe ...\src\mongo\db\service_entry_point_mongod.cpp(123) mongo::ServiceEntryPointMongod::handleRequest+0x67x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(378) mongo::ServiceStateMachine::_processMessage+0x479x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(440) mongo::ServiceStateMachine::_runNextInGuard+0x207x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(479) ::operator()+0x109x mongod.exe ...\src\mongo\transport\service_executor_synchronous.cpp(116) mongo::transport::ServiceExecutorSynchronous::schedule+0x283x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(483) mongo::ServiceStateMachine::_scheduleNextWithGuard+0x236x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(303) mongo::ServiceStateMachine::_sourceCallback+0x289x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(259) ::operator()+0x185x mongod.exe ...\src\mongo\util\future.h(788) mongo::future_details::Future::getAsync >+0x103x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(254) mongo::ServiceStateMachine::_sourceMessage+0x192x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(436) mongo::ServiceStateMachine::_runNextInGuard+0x242x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(479) ::operator()+0x109x mongod.exe ...\src\mongo\transport\service_executor_synchronous.cpp(133) ::operator()+0x217x mongod.exe c:\program files (x86)\microsoft visual studio 14.0\vc\include\thr\xthread(247) std::_LaunchPad >,std::default_delete > > > >::_Run+0x133x mongod.exe c:\program files (x86)\microsoft visual studio 14.0\vc\include\thr\xthread(210) std::_Pad::_Call_func+0x9x ucrtbase.dll o_ceil+0x78x KERNEL32.DLL BaseThreadInitThunk+0x20x 2018-10-30T16:04:18.553-0400 F CONTROL [conn71] *** unhandled exception 0x0000000E at 0x00007FFDB0F9A388, terminating 2018-10-30T16:04:18.553-0400 F CONTROL [conn71] *** stack trace for unhandled exception: 2018-10-30T16:04:18.556-0400 I - [conn71] KERNELBASE.dll RaiseException+0x104x mongod.exe ...\src\mongo\util\signal_handlers_synchronous.cpp(239) mongo::`anonymous namespace'::abruptQuit+0x157x ucrtbase.dll raise+0x475x ucrtbase.dll abort+0x49x mongod.exe ...\src\mongo\util\assert_util.cpp(104) mongo::invariantFailed+0x410x mongod.exe ...\src\mongo\db\cursor_manager.cpp(186) mongo::`anonymous namespace'::GlobalCursorIdCache::killCursor+0x208x mongod.exe ...\src\mongo\db\cursor_manager.cpp(371) mongo::CursorManager::killCursorGlobalIfAuthorized+0x73x mongod.exe ...\src\mongo\db\service_entry_point_common.cpp(1127) mongo::`anonymous namespace'::receivedKillCursors+0x613x mongod.exe ...\src\mongo\db\service_entry_point_common.cpp(1339) mongo::ServiceEntryPointCommon::handleRequest+0x1313x mongod.exe ...\src\mongo\db\service_entry_point_mongod.cpp(123) mongo::ServiceEntryPointMongod::handleRequest+0x67x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(378) mongo::ServiceStateMachine::_processMessage+0x479x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(440) mongo::ServiceStateMachine::_runNextInGuard+0x207x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(479) ::operator()+0x109x mongod.exe ...\src\mongo\transport\service_executor_synchronous.cpp(116) mongo::transport::ServiceExecutorSynchronous::schedule+0x283x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(483) mongo::ServiceStateMachine::_scheduleNextWithGuard+0x236x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(303) mongo::ServiceStateMachine::_sourceCallback+0x289x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(259) ::operator()+0x185x mongod.exe ...\src\mongo\util\future.h(788) mongo::future_details::Future::getAsync >+0x103x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(254) mongo::ServiceStateMachine::_sourceMessage+0x192x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(436) mongo::ServiceStateMachine::_runNextInGuard+0x242x mongod.exe ...\src\mongo\transport\service_state_machine.cpp(479) ::operator()+0x109x mongod.exe ...\src\mongo\transport\service_executor_synchronous.cpp(133) ::operator()+0x217x mongod.exe c:\program files (x86)\microsoft visual studio 14.0\vc\include\thr\xthread(247) std::_LaunchPad >,std::default_delete > > > >::_Run+0x133x mongod.exe c:\program files (x86)\microsoft visual studio 14.0\vc\include\thr\xthread(210) std::_Pad::_Call_func+0x9x ucrtbase.dll o_ceil+0x78x KERNEL32.DLL BaseThreadInitThunk+0x20x 2018-10-30T16:04:18.561-0400 I CONTROL [conn71] writing minidump diagnostic file C:\mongodb\v4.2018-10-30T20-04-18.mdmp 2018-10-30T16:04:18.674-0400 I REPL [rsBackgroundSync] sync source candidate: Fractal:27017 2018-10-30T16:04:18.675-0400 I ASIO [RS] Connecting to Fractal:27017 2018-10-30T16:04:18.678-0400 F CONTROL [conn71] *** immediate exit due to unhandled exception
xgen-internal-githook commented on Wed, 12 Dec 2018 16:43:17 +0000: Author: {'name': 'Ian Boros', 'email': 'ian.boros@10gen.com'} Message: SERVER-37838 prevent crash with getMore and simultaneous stepdown Branch: v4.0 https://github.com/mongodb/mongo/commit/3a76585f4e5d53fc380e0c0db44fd06f45dcdb24 xgen-internal-githook commented on Fri, 30 Nov 2018 15:34:23 +0000: Author: {'name': 'Ian Boros', 'email': 'ian.boros@10gen.com'} Message: SERVER-37838 add correct tags to test Branch: master https://github.com/mongodb/mongo/commit/f3acfdec219b7a9122b4954a269c89e3604416d2 xgen-internal-githook commented on Thu, 29 Nov 2018 21:53:14 +0000: Author: {'name': 'Ian Boros', 'email': 'ian.boros@10gen.com'} Message: SERVER-37838 prevent crash with getMore and simultaneous stepdown Branch: master https://github.com/mongodb/mongo/commit/b7300d0e9fac1df82468958b4b8eeda27a8a8c15 ian.boros commented on Mon, 19 Nov 2018 21:02:42 +0000: Okay, my above theory was mostly right, but here's a more thorough explanation in case I need to remember what happened: 1) We start an aggregation, and get a cursor ID 2) We do a getMore, and the getMore gets passed any checkForInterrupt() calls (I imagine this window of time is pretty small, but the race is there). 3) A stepdown happens. The operation is marked as killed. 4) The getMore "completes," and checks its cursor back in. At this point, the CursorManager sees that the operation was killed, and then marks the cursor as killed so that subsequent operations on the cursor will fail. The InterruptedDueToStepDown error code is stashed inside the cursor's executor. 5) Someone runs an OP_KILL_CURSORS on that cursor ID, on the stepped-down node. 6) As part of pinning the cursor to kill it, we see that its executor was killed. We clean up the cursor and return the stashed error code, InterruptedDueToStepDown. As a result, this invariant fails. This can be triggered without a $changeStream stage, which is a little scary. So far, the only event I've found which can trigger this is a stepdown. The following other events happening during a getMore, after the last checkForInterrupt() don't trigger the issue: killCursors command: The cursor will be disposed of immediately once it's checked back in, here. killOp Same as above Dropping a collection: Ownership of the cursor will be transferred to the operation using the cursor, and it will be cleaned up. Therefore, when the OP_KILL_CURSORS goes to check out the cursor, it won't be there so it will just return a CursorNotFound error. Dropping an index: Same as above This won't affect v3.6 because cursors could not be killed while they were checked out. ian@10gen.com commented on Fri, 9 Nov 2018 16:30:27 +0000: need to check whether this affects branches older than 4.0.
While there is an ongoing change stream operation with an active cursor run rs.stepDown() on the primary. This occurred while manually testing the C# driver support for change stream recovery after elections. It's possible there is a driver bug... but whether or not there is the server should not abort.