...
Hello,

Since we upgraded our cluster from 3.2.8 to 3.4.3, secondary members of our most loaded shard regularly stop syncing from the primary. When that happens, the "oplog last event time" field returned by rs.printReplicationInfo() stays blocked indefinitely. The secondary member just falls behind, and we have to restart it in order to free the oplog and let it catch up with the primary.

When this happens, just before the service shutdown, we see this error in the logs:

2017-06-11T05:13:46.410+0200 I REPL [replication-7923] Error returned from oplog query while canceling query: ExceededTimeLimit: Remote command timed out while waiting to get a connection from the pool, took 39149258ms, timeout was set to 65000ms
2017-06-11T05:13:46.411+0200 I REPL [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source.

The "took 39149258ms, timeout was set to 65000ms" part is strange: it looks like the oplog query ran far longer than the timeout. The shard on which this happens receives a huge amount of reads and writes, so it is not shocking for us to see an oplog query fail, but the secondary should be able to retry it.

Sometimes, when the oplog is blocked and we restart the affected member, it produces these messages in the logs:

2017-06-25T17:35:32.914+0200 I CONTROL [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2017-06-25T17:35:32.914+0200 I NETWORK [signalProcessingThread] shutdown: going to close listening sockets...
2017-06-25T17:35:32.914+0200 I NETWORK [signalProcessingThread] closing listening socket: 7
2017-06-25T17:35:32.915+0200 I NETWORK [signalProcessingThread] shutdown: going to flush diaglog...
2017-06-25T17:35:32.915+0200 I REPL [signalProcessingThread] shutting down replication subsystems
2017-06-25T17:35:32.916+0200 I REPL [signalProcessingThread] Stopping replication reporter thread
2017-06-25T17:35:57.051+0200 I NETWORK [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.30:27025 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.052+0200 I NETWORK [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.3:27025 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.052+0200 I NETWORK [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.28:27025 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.052+0200 W NETWORK [ReplicaSetMonitor-TaskExecutor-0] No primary detected for set csReplSet
2017-06-25T17:35:57.052+0200 I NETWORK [ReplicaSetMonitor-TaskExecutor-0] All nodes for set csReplSet are down. This has happened for 1 checks in a row.
2017-06-25T17:35:57.329+0200 I NETWORK [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.18:27017 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.329+0200 I NETWORK [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.16:27017 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.329+0200 I NETWORK [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.17:27017 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.329+0200 W NETWORK [ReplicaSetMonitor-TaskExecutor-0] No primary detected for set sfx6
2017-06-25T17:35:57.329+0200 I NETWORK [ReplicaSetMonitor-TaskExecutor-0] All nodes for set sfx6 are down. This has happened for 1 checks in a row.
2017-06-25T17:35:57.839+0200 I NETWORK [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.30:27017 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.839+0200 I NETWORK [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.29:27017 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.839+0200 I NETWORK [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.28:27017 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.839+0200 W NETWORK [ReplicaSetMonitor-TaskExecutor-0] No primary detected for set sfx10
2017-06-25T17:35:57.839+0200 I NETWORK [ReplicaSetMonitor-TaskExecutor-0] All nodes for set sfx10 are down. This has happened for 1 checks in a row.
[...]

In that case, only a "kill -9" lets the member shut down.

Our MongoDB cluster details:
- Many shards plus a config replica set, each formed of 3 members (1 primary + 2 secondaries)
- 2 mongos
- Balancer disabled
- Package version 3.4.4, OS: Debian 8 Jessie
- Servers: 6-core Xeon CPU, 64 GB RAM, ~3 TB SSD, ext4 file system
- ~40 collections in 1 DB
- Many writes and reads

Thank you in advance for your help.

Best regards,
Slawomir
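For reference (not part of the original report): the "oplog last event time" printed by rs.printReplicationInfo() is the timestamp of the newest entry in local.oplog.rs, so a stalled member can be detected by polling that collection directly. Below is a minimal mongo shell sketch, assuming a 3.4-era shell connected straight to the secondary being monitored; the 5-minute alert threshold is an illustrative assumption, not a value from the reporter's monitoring setup.

```javascript
// Sketch only: detect a stalled "oplog last event time" on this member.
// Reads the newest entry from the local oplog, i.e. the same value that
// rs.printReplicationInfo() reports as "oplog last event time".
var lastOp = db.getSiblingDB("local").oplog.rs
                 .find({}, { ts: 1 })
                 .sort({ $natural: -1 })
                 .limit(1)
                 .next();

var lastEventSecs = lastOp.ts.getTime();            // Timestamp -> seconds since epoch
var nowSecs = Math.floor(Date.now() / 1000);
var stalledSecs = nowSecs - lastEventSecs;

print("oplog last event time: " + new Date(lastEventSecs * 1000));
print("seconds since last oplog event: " + stalledSecs);

if (stalledSecs > 300) {                            // hypothetical 5-minute threshold
    print("WARNING: oplog appears blocked on this member");
}
```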
mark.agarunov commented on Tue, 7 Nov 2017 19:31:21 +0000:

Hello slluk-sa,

Thank you for the additional information. After investigating the provided data, this appears to be due to a network issue, where slow communication between mongod nodes keeps locks from yielding and causes the behavior you're seeing. We've opened SERVER-31368 to add more logging so this issue can be better diagnosed in the future.

Thanks,
Mark

slluk-sa commented on Tue, 27 Jun 2017 10:37:19 +0000:

Hi Thomas,

Thank you for your answer. For each of the 3 members of the affected shard / replica set, I uploaded 2 files:
- an archive of the diagnostic.data folder (it seems to contain data from June 20);
- the mongod log file from June 18 to June 26.

You should look at the events on the night of June 23, on the member whose name ends with 24. That day we hit the problem twice (the second time with a failed mongod shutdown). Chronology:
- 02:29: "oplog last event time" blocked (based on our monitoring)
- 05:08: mongod restart (2017-06-23T05:08:15.647+0200 in the log), then the member started to catch up with the primary
- 06:00: "oplog last event time" blocked again
- 07:27: mongod restart attempt (2017-06-23T07:27:08.761+0200 in the log); mongod did not stop, and the log shows many "No primary detected for..." errors
- 07:33: kill -9 of mongod, then start; the oplog caught up with the primary...

Best regards,
Slawomir

thomas.schubert commented on Mon, 26 Jun 2017 18:33:36 +0000:

Hi slluk-sa,

Thanks for reporting this issue. We'll need some additional information to better understand what is going on here. To help us investigate, would you please provide the complete mongod log files covering this issue as well as an archive of the diagnostic.data for the three nodes in the affected replica set? I've created a secure upload portal for you to provide these files. Files uploaded to the portal are visible only to MongoDB employees investigating the issue and are routinely deleted after some time.

Kind regards,
Thomas
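As additional context for the chronology above (a secondary falling behind and then catching up with the primary after a restart), per-member replication lag can be estimated from rs.status(), whose members carry an optimeDate field in the 3.4 series. The following is a minimal sketch only, assuming a shell connected to any member of the affected replica set; it is not part of the original thread.

```javascript
// Sketch only: estimate how far each SECONDARY is behind the PRIMARY,
// using the optimeDate fields returned by rs.status().
var status = rs.status();

var primary = null;
status.members.forEach(function (m) {
    if (m.stateStr === "PRIMARY") {
        primary = m;
    }
});

if (primary === null) {
    print("No primary found; cannot compute lag");
} else {
    status.members.forEach(function (m) {
        if (m.stateStr === "SECONDARY") {
            var lagSecs = (primary.optimeDate - m.optimeDate) / 1000;
            print(m.name + " is " + lagSecs + "s behind the primary");
        }
    });
}
```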