
OPERATIONAL DEFECT DATABASE
Whenever I run `mongo --host mongo3:27017 --eval 'db.serverStatus()'`, the command hangs for a very long time (I never managed to get a reply). mongo3 is a SECONDARY server in a MongoDB replica set. On the PRIMARY and the other SECONDARIES everything works as expected. The output of the following command is in the attachment (gdb_2020-11-09_15-58-04.txt.xz):

gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' > gdb_`date +"%Y%m-%d_%H-%M-%S"`.txt

This is Gentoo Linux with MongoDB 4.2.10 installed.

Another observation: I tried to stop mongod and it hung with the following in the logs:

2020-11-09T16:32:51.302+0300 I CONTROL [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2020-11-09T16:32:51.304+0300 I REPL [signalProcessingThread] Stepping down the ReplicationCoordinator for shutdown, waitTime: 10000ms
2020-11-09T16:32:51.309+0300 I SHARDING [signalProcessingThread] Shutting down the WaitForMajorityService
2020-11-09T16:32:51.313+0300 I CONTROL [signalProcessingThread] Shutting down the LogicalSessionCache
2020-11-09T16:32:51.316+0300 I CONTROL [LogicalSessionCacheRefresh] Failed to refresh session cache: ClientMarkedKilled: client has been killed
2020-11-09T16:32:51.320+0300 I CONTROL [LogicalSessionCacheReap] Failed to reap transaction table: ClientMarkedKilled: client has been killed
2020-11-09T16:32:51.321+0300 I NETWORK [signalProcessingThread] shutdown: going to close listening sockets...
2020-11-09T16:32:51.321+0300 I NETWORK [listener] removing socket file: /tmp/mongodb-27017.sock
2020-11-09T16:32:51.321+0300 I NETWORK [signalProcessingThread] Shutting down the global connection pool
2020-11-09T16:32:51.321+0300 I STORAGE [signalProcessingThread] Shutting down the FlowControlTicketholder
2020-11-09T16:32:51.321+0300 I - [signalProcessingThread] Stopping further Flow Control ticket acquisitions.
2020-11-09T16:32:51.321+0300 I STORAGE [signalProcessingThread] Shutting down the PeriodicThreadToAbortExpiredTransactions
2020-11-09T16:32:51.321+0300 I STORAGE [signalProcessingThread] Shutting down the PeriodicThreadToDecreaseSnapshotHistoryIfNotNeeded
2020-11-09T16:32:51.322+0300 I REPL [signalProcessingThread] Shutting down the ReplicationCoordinator
2020-11-09T16:32:51.322+0300 I REPL [signalProcessingThread] shutting down replication subsystems
2020-11-09T16:32:51.322+0300 I REPL [signalProcessingThread] Stopping replication reporter thread
2020-11-09T16:32:51.322+0300 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to mongo1:27017: CallbackCanceled: Reporter no longer valid
2020-11-09T16:32:51.323+0300 I REPL [signalProcessingThread] Stopping replication fetcher thread
2020-11-09T16:32:51.324+0300 I REPL [signalProcessingThread] Stopping replication applier thread
2020-11-09T16:32:51.507+0300 I REPL [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source.
2020-11-09T16:32:51.508+0300 I REPL [rsBackgroundSync] Stopping replication producer

I ran the gdb command (mentioned above) again and the output is in gdb_2020-11-09_16-34-39.txt.xz. After I killed the server with -9 and started it again, db.serverStatus() started to work.
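For reference, a minimal shell sketch of the collection steps described above for the next time the hang is observed. The 30-second timeout is an illustrative bound, not part of the original report, and the sketch assumes a single mongod process on the host:

  # Bound the serverStatus call so the shell does not hang indefinitely
  # (30s is an arbitrary illustrative limit).
  timeout 30 mongo --host mongo3:27017 --eval 'db.serverStatus().ok' \
      || echo "serverStatus did not return within 30s"

  # Capture a backtrace of every mongod thread, as in the report above,
  # then compress it before attaching it to the ticket.
  out="gdb_$(date +%Y-%m-%d_%H-%M-%S).txt"
  gdb -p "$(pidof mongod)" -batch -ex 'thread apply all bt' > "$out"
  xz "$out"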
JIRAUSER1257570 commented on Thu, 23 Sep 2021 08:32:37 +0000:
Ah, I see, there is no way to reopen this issue, so I've opened SERVER-60164 to see if new debugging information will help.

JIRAUSER1257570 commented on Thu, 23 Sep 2021 08:08:30 +0000:
It happened again. According to monitoring, this happened around 2021-09-22 12:31:35. Diagnostic data that should cover the moment when the problem happened: https://disk.yandex.ru/d/IIW42L63fVojuw A new backtrace, gdb_202109-23_11-07-08.txt.xz, is in the attachment as well.

JIRAUSER1257066 commented on Fri, 4 Dec 2020 21:03:21 +0000:
Thank you for the update. Given that we lack adequate information to diagnose this issue, I'll close this as Incomplete. Edwin

JIRAUSER1257570 commented on Fri, 4 Dec 2020 20:37:29 +0000:
Nope, this happened only once. Of course, it's possible that this was caused by cosmic radiation, but it happened just after the upgrade, so I decided to open the report. Thank you for your investigation. If this happens again, I'll provide more information.

JIRAUSER1257066 commented on Fri, 4 Dec 2020 19:06:05 +0000:
Hi peter.volkov@gmail.com, We'd like to know if this issue has repeated. Would you please inform us if this is a recurring problem or if it has occurred on a different machine? Thanks, Edwin

JIRAUSER1257066 commented on Thu, 19 Nov 2020 22:30:32 +0000:
Hi peter.volkov@gmail.com, After taking a closer look at the gdb output you provided, we're unable to determine what caused all of the threads to be stuck waiting to acquire the global lock. At the moment, we don't have enough information. Has this problem ever occurred on a different machine? Is this a recurring problem? Best, Edwin

JIRAUSER1257066 commented on Mon, 16 Nov 2020 22:03:39 +0000:
After looking into the gdb output, we were able to find stack traces that confirm the behavior for both instances of the gdb being collected. However, we still don't know what caused the issue. The first gdb (gdb_2020-11-09_15-58-04.txt) correlates to the deadlock when running db.serverStatus(). We can see that 337 threads are stuck on serverStatus:
337.000 │ ▽ mongo::ServiceStateMachine::_sourceCallback
337.000 │ ▽ mongo::ServiceStateMachine::_scheduleNextWithGuard
337.000 │ ▽ mongo::transport::ServiceExecutorSynchronous::schedule
337.000 │ ▽ std::_Function_handler
337.000 │ ▽ mongo::ServiceStateMachine::_runNextInGuard
337.000 │ ▽ mongo::ServiceStateMachine::_processMessage
337.000 │ ▽ mongo::ServiceEntryPointMongod::handleRequest
337.000 │ ▽ mongo::ServiceEntryPointCommon::handleRequest
337.000 │ ▽ mongo::(anonymous namespace)::receivedCommands
337.000 │ ▽ mongo::(anonymous namespace)::execCommandDatabase
337.000 │ ▽ mongo::(anonymous namespace)::runCommandImpl
337.000 │ ▽ mongo::BasicCommand::Invocation::run
337.000 │ ▽ mongo::CmdServerStatus::run
337.000 │ ▽ mongo::WiredTigerServerStatusSection::generateSection
337.000 │ ▽ mongo::Lock::GlobalLock::GlobalLock
337.000 │ ▽ mongo::Lock::GlobalLock::GlobalLock
337.000 │ ▽ mongo::Lock::GlobalLock::_enqueue
337.000 │ ▽ mongo::Lock::ResourceLock::lock
337.000 │ ▽ mongo::LockerImpl::lockComplete
337.000 │ ▽ mongo::CondVarLockGrantNotification::wait
337.000 │ ▽ mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil
337.000 │ ▽ mongo::ClockSource::waitForConditionUntil
337.000 │ ▽ std::cv_status mongo::Waitable::wait_until
337.000 │ ▽ mongo::transport::TransportLayerASIO::BatonASIO::run_until
337.000 │ ▽ mongo::transport::TransportLayerASIO::BatonASIO::run
337.000 │ ▽ poll

The following gdb (gdb_2020-11-09_16-34-39.txt) is from when the mongod is being killed with -15. It gets stuck on _stopDataReplication_inlock. Afterward, the server is killed using -9.

1.000 │ ▽ mongo::(anonymous namespace)::signalProcessingThread
1.000 │ │ ▽ mongo::shutdown
1.000 │ │ ▽ mongo::(anonymous namespace)::runTasks
1.000 │ │ ▽ mongo::(anonymous namespace)::shutdownTask
1.000 │ │ ▽ mongo::repl::ReplicationCoordinatorImpl::shutdown
1.000 │ │ ▽ mongo::repl::ReplicationCoordinatorExternalStateImpl::shutdown
1.000 │ │ ▽ mongo::repl::ReplicationCoordinatorExternalStateImpl::_stopDataReplication_inlock
1.000 │ │ ▽ mongo::Interruptible::waitForConditionOrInterruptUntil
1.000 │ │ ▽ std::condition_variable::wait:53
1.000 │ │ ▽ __gthread_cond_wait:865
1.000 │ │ ▽ pthread_cond_wait@@GLIBC_2.3.2

Questions: bruce.lucas, are we able to find out what is failing to release the lock that prevents serverStatus from completing?

JIRAUSER1257066 commented on Mon, 16 Nov 2020 17:50:44 +0000:
Hi peter.volkov@gmail.com, Unfortunately the attached diagnostic data does not cover the issue that occurred (the issue occurred on 11/09, diagnostics start from 11/10), and there is no way to recover that lost data. However, we are glad to hear that the issue went away when you restarted mongod. If it occurs again, please again attach the archived diagnostic.data, gdb output, and logs so we can further investigate the hang on db.serverStatus(). Best, Edwin

JIRAUSER1257570 commented on Sun, 15 Nov 2020 20:43:06 +0000:
Log file attached. As for diagnostic.data, it looks like it is too large, so I've put it here: https://yadi.sk/d/zY2ltxD8td9yrQ

JIRAUSER1257066 commented on Fri, 13 Nov 2020 22:47:02 +0000:
Hi peter.volkov@gmail.com, Would you please archive (tar or zip) the $dbpath/diagnostic.data directory (the contents are described here) and attach it to this ticket? Could you also attach the log files to this ticket? Kind regards, Edwin
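If the hang recurs, the lock waiters discussed above can sometimes be inspected from the shell before resorting to gdb, and the requested diagnostic.data archived in one step. This is a rough sketch, not from the ticket: the 5-second maxTimeMS bound and the /var/lib/mongodb dbpath are assumptions (substitute the real storage.dbPath), and the lock-inspection commands may themselves block if they also need the global lock:

  # Show operations currently waiting for a lock, bounded so the call fails
  # fast instead of hanging along with everything else.
  mongo --host mongo3:27017 --eval \
      'printjson(db.adminCommand({currentOp: true, waitingForLock: true, maxTimeMS: 5000}))'

  # Dump lock manager state (holders and waiters).
  mongo --host mongo3:27017 --eval \
      'printjson(db.adminCommand({lockInfo: 1, maxTimeMS: 5000}))'

  # Archive the FTDC diagnostic data covering the incident window so it can
  # be attached to the ticket ($dbpath assumed to be /var/lib/mongodb).
  tar -cJf diagnostic.data.tar.xz -C /var/lib/mongodb diagnostic.data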
I don't know. We just found the server in this state 3 days ago.