...
Hi, this was witnessed twice, with the events happening in the following order:

1. Primary (A) is down.
2. Secondary (B) detects this and starts an election.
3. It wins the dry-run election and then, on every subsequent election timeout (10s), prints the following; this goes on until it is restarted:
[ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
4. A is started. It runs an election and becomes primary.
5. B is restarted. In one instance it took a long time to shut down, and in the other it had to be SIGKILLed.
6. A is shut down.
7. B detects this, starts an election, and becomes primary.

Relevant logs from node A for steps 1-4:

2017-08-17T05:32:08.534-0700 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2017-08-17T05:32:08.534-0700 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
2017-08-17T05:32:08.534-0700 I REPL [ReplicationExecutor] VoteRequester(term 8 dry run) received a yes vote from nodeB:27017; response message: { term: 8, voteGranted: true, reason: "", ok: 1.0 }
2017-08-17T05:32:08.534-0700 I REPL [ReplicationExecutor] dry election run succeeded, running for election
2017-08-17T05:32:08.621-0700 I REPL [ReplicationExecutor] VoteRequester(term 9) received a yes vote from node-arbiter:27022; response message: { term: 9, voteGranted: true, reason: "", ok: 1.0 }
2017-08-17T05:32:08.621-0700 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 9

Relevant logs from node B for steps 1-4:

2017-08-17T05:18:32.066-0700 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2017-08-17T05:18:32.066-0700 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
2017-08-17T05:18:32.068-0700 I REPL [ReplicationExecutor] VoteRequester(term 7 dry run) failed to receive response from nodeA:27017: HostUnreachable: Connection refused
2017-08-17T05:18:32.069-0700 I REPL [ReplicationExecutor] VoteRequester(term 7 dry run) received a yes vote from node-Arbiter:27022; response message: { term: 7, voteGranted: true, reason: "", ok: 1.0 }
2017-08-17T05:18:32.069-0700 I REPL [ReplicationExecutor] dry election run succeeded, running for election
2017-08-17T05:18:42.441-0700 I REPL [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
2017-08-17T05:18:53.473-0700 I REPL [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
2017-08-17T05:19:04.837-0700 I REPL [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election again; already candidate
...

Note that, although node B didn't log any VoteRequester messages after the dry run, it did increase the election term from 7 to 8. This can be confirmed by the term number node A later used for its own dry-run election, 8. Assuming that nothing went wrong in starting the election, I believe this loop can happen when a node waits indefinitely for responses to its vote requests after starting a real election. I couldn't find any existing information related to this.
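To make the hypothesis above concrete, here is a minimal, purely illustrative Python sketch (not MongoDB source; all names are made up) of how a node could get wedged as a candidate: the dry run succeeds, the real election hangs indefinitely, and every subsequent election timeout is skipped because the node is still marked as a candidate.

import threading
import time

class Candidate:
    def __init__(self):
        self.is_candidate = False

    def on_election_timeout(self):
        # Fires every electionTimeoutMillis (10s in the reported logs).
        if self.is_candidate:
            print("Not starting an election; already candidate")
            return
        self.is_candidate = True
        print("dry election run succeeded, running for election")
        threading.Thread(target=self._run_real_election, daemon=True).start()

    def _run_real_election(self):
        # Hypothetical hang: the real election never completes (for example,
        # blocked while persisting the last-vote document), so is_candidate
        # is never cleared and every later timeout becomes a no-op.
        threading.Event().wait()      # stand-in for an un-acquirable lock
        self.is_candidate = False     # never reached

node = Candidate()
for _ in range(3):
    node.on_election_timeout()
    time.sleep(0.1)

Running this prints the dry-run line once and then only "already candidate" messages, which mirrors the loop seen in node B's log.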
matthew.russotto commented on Thu, 11 Jul 2019 19:15:30 +0000:
This makes sense; WiredTigerRecordStore::reclaimOplog never releases the IX lock on "local" as long as there is work to do.

bartle commented on Mon, 22 Apr 2019 04:51:54 +0000:
I think I figured out what happened. These particular replsets were being run on EBS volumes created from EBS snapshots, and the oplog was also truncated (via replSetResizeOplog). The latter would have caused the background oplog thread to start pruning oplog entries. Obviously, that process takes a lock (IX) on "local", and it appears that reading local.system.replset also happens relatively frequently, which also takes a lock (IX) on "local". With pv1, every time an election is called, the "local.replset.election" collection is written, which takes an X lock on "local". I believe that latter operation was being blocked by all the other IX locks on "local", and by the time the X lock was acquired, the election timers had expired. I suspect that if I had waited long enough, the background thread might have finished chewing through the oplog.

bartle commented on Sun, 21 Apr 2019 05:21:55 +0000:
Oh, and the situation we saw this in was a 3-node replset where at least 1 member was down initially. However, even when all members eventually come back up, they're still unable to elect a primary because it looks like 2 out of 3 nodes end up in the wedged state.

bartle commented on Sun, 21 Apr 2019 04:32:16 +0000:
We recently ran into this issue (on 3.4.20). I think the deadlock theory is on the right track, since running currentOp indicates that "replExecDBWorker-1" is stuck waiting for a lock while holding an X lock on "local"; calling "show collections" and "show dbs" also hangs, presumably because they can't get an IX lock on "local". I couldn't figure out what was actually holding the lock, but I can try attaching gdb next time and dumping all thread stacks. I might be misreading currentOp, but I think it's the case that if "waitingForLock" is true, there isn't any indication of which lock an operation is waiting on? i.e. the "locks" document only contains locks held, and not those being waited on? It'd probably be good to surface the lock being waited on as well.

william.schultz commented on Tue, 16 Jan 2018 15:56:13 +0000:
Closing this for now, since I was unable to reproduce the issue. If it occurs again, we can possibly re-open it and investigate further.

william.schultz commented on Mon, 18 Dec 2017 19:23:41 +0000:
My suspicion is that there may be a deadlock on node5-2 after the dry run election succeeds, causing the thread that is going to start the real election to hang. Election timeouts keep firing and being serviced by the ReplicationExecutor, causing us to see the "Not running for election" messages repeatedly, but the election thread is hung, so we indefinitely remain a candidate. I have not figured out the root cause yet, but it may be related to how ReplicationCoordinatorExternalStateImpl::storeLocalLastVoteDocument takes locks.

william.schultz commented on Mon, 18 Dec 2017 18:32:12 +0000:
shredder12 Is this issue easily reproducible, or has it only been seen rarely?

william.schultz commented on Thu, 14 Sep 2017 20:33:33 +0000:
I did look into this briefly, but have not come up with a conclusive diagnosis yet.
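For reference, a small pymongo sketch of the currentOp-based check bartle describes above, listing in-progress operations that report waitingForLock: true. It assumes MongoDB 3.6+, where currentOp is available as a server command (on 3.4 the db.currentOp() shell helper returns the same document), and the hostname is illustrative.

from pymongo import MongoClient

client = MongoClient("mongodb://node5-2:27017")   # illustrative hostname
inprog = client.admin.command("currentOp")["inprog"]

for op in inprog:
    if op.get("waitingForLock"):
        # As noted above, "locks" lists locks the operation already holds;
        # currentOp does not report which lock it is actually waiting for.
        print(op.get("opid"), op.get("desc"), op.get("locks"))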
shredder12 commented on Fri, 18 Aug 2017 21:26:07 +0000:
Hi Thomas, thank you for the quick response. Please find attached the logs for the replica set nodes during the incident:
node5-1.log (node5-1, aka nodeA)
node5-2.log (node5-2, aka nodeB)
node5-arb.log (node5-arb, the arbiter of the replica set)

thomas.schubert commented on Fri, 18 Aug 2017 02:28:57 +0000:
Hi shredder12, thanks for reporting this behavior. So we can better understand what is going on here, would you please provide the complete log files for each node in the replica set as well as the replica set configuration, e.g. the output of rs.conf()? Thank you, Thomas
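For anyone gathering the requested information programmatically rather than from the shell, a rough pymongo equivalent of rs.conf() and rs.status() might look like the sketch below (the connection string is illustrative).

from pymongo import MongoClient
from bson.json_util import dumps

client = MongoClient("mongodb://node5-1:27017")   # illustrative hostname

conf = client.admin.command("replSetGetConfig")["config"]   # rs.conf()
status = client.admin.command("replSetGetStatus")           # rs.status()

print(dumps(conf, indent=2))
for m in status["members"]:
    print(m["name"], m["stateStr"])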