...
MongoDB server version: 3.4.16

One primary and two secondaries. A few times a week, read requests become unresponsive on one of the secondaries. In addition, the secondary becomes unresponsive to operations such as "/etc/init.d/mongod restart"; mongod never comes to a complete stop:

2018-08-06T07:07:08.621-0700 I COMMAND [conn1040] terminating, shutdown command received
2018-08-06T07:07:08.621-0700 I NETWORK [conn1040] shutdown: going to close listening sockets...
2018-08-06T07:07:08.621-0700 I NETWORK [conn1040] closing listening socket: 7
2018-08-06T07:07:08.621-0700 I NETWORK [conn1040] closing listening socket: 8
2018-08-06T07:07:08.621-0700 I NETWORK [conn1040] removing socket file: /tmp/mongodb-27018.sock
2018-08-06T07:07:08.622-0700 I NETWORK [conn1040] shutdown: going to flush diaglog...
2018-08-06T07:07:08.622-0700 I REPL [conn1040] shutting down replication subsystems
2018-08-06T07:07:08.622-0700 I REPL [conn1040] Stopping replication reporter thread
2018-08-06T07:07:08.622-0700 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to mongo-b-shard3-1:27018: CallbackCanceled: Reporter no longer valid
2018-08-06T07:07:08.622-0700 D REPL [SyncSourceFeedback] The replication progress command (replSetUpdatePosition) failed and will be retried: CallbackCanceled: Reporter no longer valid
2018-08-06T07:07:08.622-0700 I REPL [conn1040] Stopping replication fetcher thread
2018-08-06T07:07:08.624-0700 I REPL [conn1040] Stopping replication applier thread
2018-08-06T07:07:08.624-0700 D REPL [replication-3] oplog query cancelled
2018-08-06T07:07:08.624-0700 I ASIO [NetworkInterfaceASIO-RS-0] Ending connection to host mongo-b-shard3-1.melodis.com:27018 due to bad connection status; 1 connections to that host remain open
2018-08-06T07:07:08.624-0700 D REPL [rsBackgroundSync] fetcher stopped reading remote oplog on mongo-b-shard3-1.melodis.com:27018
2018-08-06T07:07:08.624-0700 I REPL [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source.
2018-08-06T07:07:08.820-0700 I COMMAND [conn3] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-1.melodis.com:27018", fromId: 0, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
2018-08-06T07:07:08.822-0700 I COMMAND [conn4] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-2.melodis.com:27018", fromId: 1, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
2018-08-06T07:07:08.861-0700 I COMMAND [conn980] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:09.126-0700 D REPL [ReplicationExecutor] setUpValues: heartbeat response good for member _id:1, msg:
2018-08-06T07:07:09.126-0700 D REPL [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-2.melodis.com:27018 at 2018-08-06T14:07:11.126Z
2018-08-06T07:07:09.362-0700 I COMMAND [conn976] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:09.863-0700 I COMMAND [conn972] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:10.364-0700 I COMMAND [conn966] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:10.517-0700 D REPL [ReplicationExecutor] Canceling election timeout callback at 2018-08-06T07:07:19.824-0700
2018-08-06T07:07:10.518-0700 D REPL [ReplicationExecutor] setUpValues: heartbeat response good for member _id:0, msg:
2018-08-06T07:07:10.518-0700 D REPL [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-1.melodis.com:27018 at 2018-08-06T14:07:12.517Z
2018-08-06T07:07:10.827-0700 I COMMAND [conn3] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-1.melodis.com:27018", fromId: 0, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
2018-08-06T07:07:10.829-0700 I COMMAND [conn4] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-2.melodis.com:27018", fromId: 1, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
2018-08-06T07:07:10.900-0700 I COMMAND [conn961] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:11.132-0700 D REPL [ReplicationExecutor] setUpValues: heartbeat response good for member _id:1, msg:
2018-08-06T07:07:11.132-0700 D REPL [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-2.melodis.com:27018 at 2018-08-06T14:07:13.132Z
2018-08-06T07:07:11.366-0700 I COMMAND [conn956] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms

The output of the command "db.currentOp(true)", captured while mongod was stuck, is posted below. The suspect operations are "rsSync" (threadId 140303695083264) and "repl writer worker 11" (threadId 140303548225280); see the two snippets below. rsSync holds an exclusive global write lock (W) while repl writer worker 11 holds a shared global write lock (w). The documentation states that "an exclusive (X) lock cannot coexist with any other modes, and a shared (S) lock can only coexist with intent shared (IS) locks":
https://docs.mongodb.com/manual/faq/concurrency/#faq-concurrency-locking

{
    "desc" : "rsSync",
    "threadId" : "140303695083264",
    "active" : true,
    "opid" : 204493,
    "secs_running" : 554,
    "microsecs_running" : NumberLong(554584803),
    "op" : "none",
    "ns" : "local.replset.minvalid",
    "query" : { },
    "numYields" : 0,
    "locks" : {
        "Global" : "W"
    },
    "waitingForLock" : false,
    "lockStats" : {
        "Global" : {
            "acquireCount" : { "r" : NumberLong(1), "w" : NumberLong(1), "W" : NumberLong(1) }
        },
        "Database" : {
            "acquireCount" : { "r" : NumberLong(1), "W" : NumberLong(1) }
        }
    }
},
{
    "desc" : "repl writer worker 11",
    "threadId" : "140303548225280",
    "active" : true,
    "opid" : 204510,
    "op" : "none",
    "ns" : "",
    "query" : { },
    "numYields" : 0,
    "locks" : {
        "Global" : "w",
        "local" : "w",
        "Collection" : "w"
    },
    "waitingForLock" : false,
    "lockStats" : {
        "Global" : {
            "acquireCount" : { "w" : NumberLong(1) }
        },
        "Database" : {
            "acquireCount" : { "w" : NumberLong(1) }
        },
        "oplog" : {
            "acquireCount" : { "w" : NumberLong(1) }
        }
    }
},
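For readers reproducing this analysis, the two snippets above can be pulled out of the full db.currentOp(true) output with a short mongo shell filter like the one below. This is only an illustrative sketch, not part of the original report.

// Illustrative sketch (not from the original report): list operations that
// currently hold or are waiting on the Global lock, as in the snippets above.
db.currentOp(true).inprog
    .filter(function (op) {
        return (op.locks && op.locks.Global !== undefined) || op.waitingForLock;
    })
    .forEach(function (op) {
        print(op.desc + " (threadId " + op.threadId + "): locks=" +
              JSON.stringify(op.locks) + " waitingForLock=" + op.waitingForLock);
    });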
matthias.eichstaedt commented on Wed, 5 Sep 2018 18:02:03 +0000:

Hi Nick, we observed another instance of our issue on Sept 5th. What we saw today is the same as what we saw on Monday (Sept 3):

1. Replication from the primary (which is in a different datacenter) to the secondary became "stuck", and the host fell further and further behind. It seems that one of the internal replication threads (rsSync) obtained a global lock and then went off to do something for an extended period of time while holding it. Certain operations (e.g. db.serverStatus()) are blocked waiting for that lock to be released, so when we typed "db.serverStatus()" into the mongo shell, the prompt never came back. However, reading documents in the collections is not affected (we typed db[""].find() and were able to see documents being returned). This is probably because, in MongoDB 4.0.x, reading documents from a secondary no longer requires obtaining a lock that is also obtained by the replication process (which is how things work in MongoDB 3.x.x). Instead, reads are now served from (slightly older) snapshots of the database, allowing read operations (by our application) and write operations (by the replication process) to proceed in parallel without blocking each other.

2. However, even though the application instances were able to continue reading chunks from the stuck secondary, the significant replication lag meant that recently-inserted chunks (inserted into the primary) could not be found on the secondary. Our application was designed so that, if it cannot find a chunk on a secondary, it will attempt to find it on the primary. In this incident (and also in the previous incident on Monday), that mechanism kicked in and the application instances were able to successfully find the missing chunks on the primary.

3. After we ran the gdb command, the secondary and the replication process recovered. Last time (on Monday), the same thing happened: after we ran the gdb command, the secondary recovered. Apparently, running the gdb command has the side effect of fixing the problem (which is bizarre).

gdb command: gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' >gdb.txt

Uploaded files to the secure area:
db-currentOp-mongo-b-shard3-3-Sept-5-at-7-10am-pacific.txt
gdb_20180905.txt

matthias.eichstaedt commented on Mon, 3 Sep 2018 16:27:46 +0000:

We have upgraded our MongoDB in an attempt to get around this issue. MongoDB server version: 4.0.1. However, we just saw an issue where replication to a secondary is slow/stuck. The system is lightly loaded (on the order of 20 qps). I am uploading the output of the gdb command above. We have also run the db.currentOp() command and captured the output, also uploaded. We observed that replication seems to be slow or stuck; the host in question is almost 1 hour behind the primary. Since the host is falling far behind the primary, our application is now compensating for the significant replication delay by reading from another datacenter. Our client hosts are printing warning messages saying that they are reading the chunks from the primary.

matthias.eichstaedt commented on Tue, 21 Aug 2018 23:47:00 +0000:

Thanks, Nick. Installed gdb and waiting for this issue to occur again.

nick.brewer commented on Tue, 21 Aug 2018 17:54:00 +0000:

matthias.eichstaedt We've looked over the data you provided, but unfortunately it's still not enough to give us a clear picture of where the issue is occurring.
To get a better idea, we'd like to have you run gdb on the mongod when it becomes unresponsive again. If you could, issue the following command in a screen/tmux session:

gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' >gdb.txt

You can upload the resulting gdb.txt to the secure portal.

Thanks,
Nick

matthias.eichstaedt commented on Fri, 17 Aug 2018 21:15:58 +0000:

Thanks for the update.

nick.brewer commented on Fri, 17 Aug 2018 21:05:17 +0000:

matthias.eichstaedt Just wanted to let you know we're still looking into this - should have some more details soon.

-Nick

matthias.eichstaedt commented on Fri, 10 Aug 2018 20:03:08 +0000:

I uploaded another set of logs for a different incident at around 2018-08-10 12:21:37.

matthias.eichstaedt commented on Wed, 8 Aug 2018 21:47:31 +0000:

I uploaded diagnostic.data.tgz and mongod.log.2. I believe the issue started around 2018-08-06T06:51:00.

nick.brewer commented on Wed, 8 Aug 2018 18:36:12 +0000:

matthias.eichstaedt Thanks for your report. Could you archive (tar or zip) and upload the $dbpath/diagnostic.data directory, as well as the log, from the affected secondary? If you'd prefer, you can upload this information to our secure portal. Information shared there is only available to MongoDB employees and is automatically removed after a period of time.

Thanks,
Nick

matthias.eichstaedt commented on Wed, 8 Aug 2018 17:48:46 +0000:

Another random observation: the number of connections on the secondary stays constant at around 21 for the hours leading up to the hanging issue. At 2018-08-06T06:51:01.833-0700 the connection count (previously stable at 21) starts to increase, seemingly adding a new connection with every query. The count reaches 566 when the process is killed manually (because "mongod restart" didn't complete successfully).
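As a footnote to the connection-count observation above, a loop like the following (a hypothetical monitoring sketch, not something from this ticket) could be left running in a mongo shell against the secondary to record how db.serverStatus().connections grows over time. Note that, per the Sept 5 comment, db.serverStatus() itself can block once the global lock is stuck, so the loop is mainly useful for catching the ramp-up beforehand.

// Hypothetical monitoring sketch: sample the server's connection counters once
// a minute so a sudden climb (e.g. 21 -> 566 as described above) is visible.
// db.serverStatus() may itself hang once the global lock is held, as reported.
while (true) {
    var s = db.serverStatus();
    print(new Date().toISOString() + " current=" + s.connections.current +
          " available=" + s.connections.available +
          " totalCreated=" + s.connections.totalCreated);
    sleep(60 * 1000); // mongo shell sleep() takes milliseconds
}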