...
With a healthy replica set, failovers via rs.stepDown() take 10+ seconds, even though the election itself completes in 1-2 seconds. After an rs.stepDown(), a healthy, moderately loaded replica set on a good network will normally complete the election of a new primary within 1-2 seconds. However, before the new primary can step up, it has to stop the background sync thread. Suppose (as is often the case) that the new primary was syncing from the stepped-down node. When that node stepped down, it closed all of its connections, including the connection the new primary was using to read the oplog. This triggers a DBException inside _producerThread, which lands in this error handler:

    sethbmsg(str::stream() << ...);
    sleepsecs(10);

When the new primary goes to stop the rsBackgroundSync thread, the thread is caught in the middle of that sleep, and so we end up literally waiting on a sleep(10) before the cluster has a new primary.
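One way to avoid being stuck in that uninterruptible sleep is to let the stop request wake the sleeping thread early; nelhage mentions below having tested a hacked mongod that replaced the sleep with a timed wait on a condition variable. The following is a minimal standalone sketch of that idea, not the actual server code; InterruptibleBackoff and its members are illustrative names:

    #include <chrono>
    #include <condition_variable>
    #include <mutex>

    // Illustrative stand-in for the producer's error backoff: instead of an
    // uninterruptible sleepsecs(10), wait on a condition variable so that a
    // shutdown/stop request can wake the thread immediately.
    class InterruptibleBackoff {
    public:
        // Returns true if stop() was called before the timeout, false if the
        // full timeout elapsed.
        bool waitOrStop(std::chrono::seconds timeout) {
            std::unique_lock<std::mutex> lk(_mutex);
            return _stopCond.wait_for(lk, timeout, [this] { return _stopRequested; });
        }

        // Called by whoever is stopping background sync (e.g. the node that is
        // about to step up to primary) to cut the backoff short.
        void stop() {
            std::lock_guard<std::mutex> lk(_mutex);
            _stopRequested = true;
            _stopCond.notify_all();
        }

    private:
        std::mutex _mutex;
        std::condition_variable _stopCond;
        bool _stopRequested = false;
    };

With something like this, the error path would call waitOrStop(std::chrono::seconds(10)) instead of sleepsecs(10), and the step-up path would call stop() before joining the thread.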
milkie commented on Tue, 30 Jul 2013 12:50:00 +0000: Hi Nelson. Yes, the Backport field is what we use to indicate that. When it's complete, we will change the field to "Done" (or "Cannot").

nelhage commented on Tue, 30 Jul 2013 03:36:53 +0000: Is this planned for backport to 2.4? It'd be really nice to have this fix in our production environment – slow failovers are pretty painful for us.

mattd@10gen.com commented on Tue, 23 Jul 2013 21:13:50 +0000: I cherry-picked this commit on top of 2.4.5 and it made the SERVER-9464 repro (test.sh) behave properly (it gets to a state with a primary).

milkie commented on Sat, 29 Jun 2013 13:04:22 +0000: Further code changes are required to reduce the delay from 5s.

auto commented on Tue, 18 Jun 2013 14:35:11 +0000: Author: Eric Milkie (milkie) <milkie@10gen.com> Message: SERVER-9934 remove double getmore when pulling new ops. This reduces the potential delay in ending the bgsync thread from 10 seconds to 5 seconds. This is important because assumingPrimaryness waits on this loop before changing state to PRIMARY after an election. Reducing the delay below 5 seconds requires further redesign of the code. Branch: master https://github.com/mongodb/mongo/commit/acbae789795b2f628712ada73b2d89684b97da9d

nelhage commented on Fri, 14 Jun 2013 20:19:37 +0000: Ah yeah, that would do it. I was testing on a hacked mongod that replaced the sleep with a timed wait on a condvar – that version fails over quickly in the no-ops case, but hits the double wait if there's any work to do. The --force-sync test case is probably worthwhile anyway, since that's the more interesting case.

milkie commented on Fri, 14 Jun 2013 19:29:43 +0000: I hit the double wait just with your original test, in fact. I was quite intrigued when I removed the 10-second sleep and yet observed no change in the test's behavior, so I dug deeper. Thanks very much for writing up a complete test, by the way.

nelhage commented on Fri, 14 Jun 2013 19:26:30 +0000: Not sure if you have your own test tools, but I added an option to https://gist.github.com/nelhage/5777797 to generate a bunch of writes immediately before the failover, which forces the new primary to finish syncing before stepping up, thereby exhibiting the double-wait bug.

milkie commented on Fri, 14 Jun 2013 19:21:33 +0000: Right, I was about to update this with what I've discovered, which was exactly that. I'm going to fix the double wait. To fix the single wait, I have some ideas on how to twist it to treat a getmore wait as being already done – if _appliedBuffer is already true in stopReplicationAndFlushBuffer(), we can make it guaranteed that it will not go false without transitioning to pause mode, in which case we don't need to wait for it.

nelhage commented on Fri, 14 Jun 2013 19:10:45 +0000: (Let me know if this should be a separate issue.) In the case where we do have ops to replicate, after replication is complete we end up waiting for two GETMORE await timeouts. produce() does:

    if (!r.moreInCurrentBatch()) {
        ...
        r.more();
        ...
    }
    if (!r.more())
        break;

In the case where there are no more ops to read, each of those .more() calls will end up issuing a GETMORE, which means we'll wait two AWAIT timeouts (roughly 8-10s) before we return, further delaying the new primary from stepping up. The current architecture makes it very difficult to avoid waiting for at least one AWAIT timeout, but there's no need to wait for two.
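To make the double wait concrete: on an exhausted tailable, awaitData cursor, each more() call can issue a fresh GETMORE that blocks for the server's await timeout, so two back-to-back calls can block twice. Below is a minimal sketch of the kind of restructuring the acbae78 commit message describes (issuing at most one blocking getmore per pass); the Cursor interface and the produceOnce_* functions are illustrative, not the actual bgsync or DBClientCursor code:

    // Hypothetical minimal cursor interface standing in for DBClientCursor;
    // the names are illustrative only.
    struct Cursor {
        virtual bool moreInCurrentBatch() = 0;  // cheap: inspects the batch already fetched
        virtual bool more() = 0;                // may issue a blocking (awaitData) GETMORE
        virtual ~Cursor() = default;
    };

    // Before: when the cursor is empty, two consecutive more() calls can each
    // block for a full await timeout, roughly doubling the delay.
    bool produceOnce_before(Cursor& r) {
        if (!r.moreInCurrentBatch()) {
            // ... rollback/oplog-position checks ...
            r.more();            // first blocking GETMORE; result ignored
        }
        if (!r.more())           // second blocking GETMORE if still empty
            return false;        // break out of the produce loop
        // ... read the next op and push it onto the buffer ...
        return true;
    }

    // After: remember the outcome of the single blocking GETMORE so the empty
    // case waits for at most one await timeout.
    bool produceOnce_after(Cursor& r) {
        if (!r.moreInCurrentBatch()) {
            // ... rollback/oplog-position checks ...
            if (!r.more())       // single blocking GETMORE
                return false;    // nothing arrived within the await timeout
        }
        // ... read the next op and push it onto the buffer ...
        return true;
    }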
Run this test script with -vv: https://gist.github.com/nelhage/5777797 Observe that failovers take 10-12s, even though the cluster is healthy, unloaded, and running across the local interface. Look at the logs for a node that has become primary. Observe that after the election completes, the server logs "replSet waiting for replication to finish before becoming primary" and does not actually step up for up to 10 seconds afterward.