...
Mongos advances the config optime using the last visible optime from config servers. If a write command fails due to a timeout or NotMaster, the last visible optime is ahead of the last committed optime and may be rolled back. With the following patch applied, a majority read fails due to a timeout, which should never happen.

diff --git a/jstests/sharding/auth_no_config_primary.js b/jstests/sharding/auth_no_config_primary.js
index 3bb1ea1..d2c4b79 100644
--- a/jstests/sharding/auth_no_config_primary.js
+++ b/jstests/sharding/auth_no_config_primary.js
@@ -8,6 +8,12 @@
 (function() {
     "use strict";
 
+    function enableFailPoint(node) {
+        assert.commandWorked(node.getDB('admin').runCommand(
+                                 {configureFailPoint: 'rsSyncApplyStop', mode: 'alwaysOn'}),
+                             'Failed to configure rsSyncApplyStop failpoint.');
+    }
+
     var st = new ShardingTest({shards: 1, keyFile: 'jstests/libs/key1'});
 
     st.s.getDB('admin').createUser({user: 'root', pwd: 'pass', roles: ['root']});
@@ -15,9 +21,9 @@
     var testDB = st.s.getDB('test');
     testDB.user.insert({hello: 'world'});
 
-    // Kill all secondaries, forcing the current primary to step down.
+    // Pause all secondaries.
     st.configRS.getSecondaries().forEach(function(secondaryConn) {
-        MongoRunner.stopMongod(secondaryConn.port);
+        enableFailPoint(secondaryConn);
     });
 
     // Test authenticate through a fresh connection.
diff --git a/src/mongo/s/sharding_egress_metadata_hook.cpp b/src/mongo/s/sharding_egress_metadata_hook.cpp
index 2040955..4565a52 100644
--- a/src/mongo/s/sharding_egress_metadata_hook.cpp
+++ b/src/mongo/s/sharding_egress_metadata_hook.cpp
@@ -43,6 +43,8 @@
 #include "mongo/s/sharding_egress_metadata_hook.h"
 #include "mongo/util/net/hostandport.h"
 
+#include "mongo/util/log.h"
+
 namespace mongo {
 namespace rpc {
 
@@ -114,6 +116,7 @@ Status ShardingEgressMetadataHook::_advanceConfigOptimeFromShard(ShardId shardId
 
             const auto& replMetadata = parseStatus.getValue();
             auto opTime = replMetadata.getLastOpVisible();
+            log() << "XXXSZ advance config optime " << metadataObj;
             grid.advanceConfigOpTime(opTime);
         }
     } else {

// Advance the config optime, which is never committed.
s20015| 2016-06-16T18:26:23.818-0400 I SHARDING [NetworkInterfaceASIO-ShardRegistry-0] XXXSZ advance config optime { $replData: { term: 1, lastOpCommitted: { ts: Timestamp 1466115787000|11, t: 1 }, lastOpVisible: { ts: Timestamp 1466115968000|1, t: 1 }, configVersion: 1, replicaSetId: ObjectId('576326bd769d59b46d7a042b'), primaryIndex: 0, syncSourceIndex: -1 }, $gleStats: { lastOpTime: { ts: Timestamp 1466115968000|1, t: 1 }, electionId: ObjectId('7fffffff0000000000000001') } }

s20015| 2016-06-16T18:27:38.614-0400 I SHARDING [Balancer] Operation timed out with status ExceededTimeLimit: Operation timed out, request was RemoteCommand 85 -- target:siyuan-mbp:20013 db:config expDate:2016-06-16T18:27:38.610-0400 cmd:{ find: "shards", readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1466115968000|1, t: 1 } }, maxTimeMS: 30000 }

// Then a majority read fails.
s20015| 2016-06-16T18:27:38.862-0400 I SHARDING [conn1] Operation timed out with status ExceededTimeLimit: Operation timed out, request was RemoteCommand 104 -- target:siyuan-mbp:20012 db:config expDate:2016-06-16T18:27:38.857-0400 cmd:{ find: "databases", filter: { _id: "test" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1466115968000|1, t: 1 } }, limit: 1, maxTimeMS: 30000 }

"errmsg" : "Operation timed out, request was RemoteCommand 104 -- target:siyuan-mbp:20012 db:config expDate:2016-06-16T18:27:38.857-0400 cmd:{ find: \"databases\", filter: { _id: \"test\" }, readConcern: { level: \"majority\", afterOpTime: { ts: Timestamp 1466115968000|1, t: 1 } }, limit: 1, maxTimeMS: 30000 }",

c20012| 2016-06-16T18:27:38.869-0400 I COMMAND [conn11] Command on database config timed out waiting for read concern to be satisfied. Command: { find: "databases", filter: { _id: "test" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1466115968000|1, t: 1 } }, limit: 1, maxTimeMS: 30000 }

c20012| 2016-06-16T18:27:38.869-0400 I COMMAND [conn11] command config.$cmd command: find { find: "databases", filter: { _id: "test" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1466115968000|1, t: 1 } }, limit: 1, maxTimeMS: 30000 } numYields:0 reslen:92 locks:{} protocol:op_command 30011ms

This was found while implementing SERVER-24574, which returns NotMaster rather than closing connections.
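To make the failure mode concrete, here is a minimal, self-contained C++ sketch (illustration only, not MongoDB source; OpTime, ConfigServerModel, and MongosModel are hypothetical stand-ins) modeling why advancing the mongos config optime from lastOpVisible instead of lastOpCommitted leaves afterOpTime majority reads unsatisfiable once the config secondaries are paused:

    #include <algorithm>
    #include <iostream>

    // Hypothetical, simplified optime: a single logical timestamp.
    struct OpTime {
        long long ts = 0;
        bool operator<(const OpTime& rhs) const { return ts < rhs.ts; }
        bool operator<=(const OpTime& rhs) const { return ts <= rhs.ts; }
    };

    // Models a config replica set whose secondaries are paused (rsSyncApplyStop in the
    // test above), so nothing written after the pause can ever become majority-committed.
    struct ConfigServerModel {
        OpTime lastOpCommitted{11};  // majority-committed before the pause
        OpTime lastOpVisible{11};

        // A write whose majority write concern times out: it still bumps the
        // primary-local lastOpVisible, but lastOpCommitted stays put.
        void failedMajorityWrite(long long ts) { lastOpVisible = {ts}; }

        // readConcern {level: "majority", afterOpTime: t} can only be satisfied once the
        // majority-committed optime reaches t; otherwise the read waits out maxTimeMS.
        bool majorityReadCanSucceed(OpTime afterOpTime) const {
            return afterOpTime <= lastOpCommitted;
        }
    };

    // Models the mongos-side config optime tracking (grid.advanceConfigOpTime).
    struct MongosModel {
        OpTime configOpTime;
        void advanceConfigOpTime(OpTime t) { configOpTime = std::max(configOpTime, t); }
    };

    int main() {
        std::cout << std::boolalpha;
        ConfigServerModel config;

        // A metadata write from mongos fails its majority write concern.
        config.failedMajorityWrite(42);

        // Buggy behavior (this ticket): advance from lastOpVisible in the reply metadata.
        MongosModel buggyMongos;
        buggyMongos.advanceConfigOpTime(config.lastOpVisible);
        std::cout << "advance from lastOpVisible  -> majority read satisfiable: "
                  << config.majorityReadCanSucceed(buggyMongos.configOpTime) << "\n";  // false

        // Fixed behavior: advance from lastOpCommitted instead.
        MongosModel fixedMongos;
        fixedMongos.advanceConfigOpTime(config.lastOpCommitted);
        std::cout << "advance from lastOpCommitted -> majority read satisfiable: "
                  << config.majorityReadCanSucceed(fixedMongos.configOpTime) << "\n";  // true
        return 0;
    }

This mirrors the logs above: the afterOpTime 1466115968000|1 is the failed write's lastOpVisible, while the config servers' lastOpCommitted remains at 1466115787000|11, so the reads wait out their full maxTimeMS.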
xgen-internal-githook commented on Fri, 29 Jul 2016 21:30:59 +0000:
Author: {u'username': u'visualzhou', u'name': u'Siyuan Zhou', u'email': u'visualzhou@gmail.com'}
Message: SERVER-24630 Mongos erroneously advances config optime for writes that fail write concern.
Use getLastOpCommitted rather than getLastOpVisible in sharding.
Branch: v3.2
https://github.com/mongodb/mongo/commit/350efef02455cae900d9d781c8dadb7911fee47d

milkie commented on Wed, 22 Jun 2016 14:49:17 +0000:
Yes, backport of this ticket should commence.

siyuan.zhou@10gen.com commented on Tue, 21 Jun 2016 21:52:47 +0000:
milkie, this is a dependency of primary catch-up; do we want to backport this ticket and tag internal connections?

xgen-internal-githook commented on Tue, 21 Jun 2016 20:56:49 +0000:
Author: {u'username': u'visualzhou', u'name': u'Siyuan Zhou', u'email': u'siyuan.zhou@mongodb.com'}
Message: SERVER-24630 Mongos erroneously advances config optime for writes that fail write concern.
Use getLastOpCommitted rather than getLastOpVisible in sharding.
Branch: master
https://github.com/mongodb/mongo/commit/640d07ddffbd4c67fddcac67dac2b9e7cb281ff2

milkie commented on Fri, 17 Jun 2016 18:24:12 +0000:
After discussion with kaloian.manassiev, we think we should return a lastVisibleOpTime of 0 if the write concern fails. This should be true even when returning NotMaster as an error at the beginning of command processing (in addition to post-write write concern processing). This is option 1 as described above. Using lastOpCommitted for this would be incorrect.

milkie commented on Fri, 17 Jun 2016 14:08:16 +0000:
I think we should do (3), but only for writes, and only for replication-related errors (NotMaster, write concern timeout).

siyuan.zhou@10gen.com commented on Thu, 16 Jun 2016 23:10:37 +0000:
In the discussion with spencer and scotthernandez, we came up with four possible solutions:
1. Revert lastVisibleOpTime on mongod if the command fails.
2. Instead of reverting lastVisibleOpTime, add an extra optime that remains the same on failure.
3. On mongos, don't advance the config optime when we get errors.
4. On mongos, use lastOpCommitted instead.
I'd vote for the last option as its design is simplest and follows the current behavior. For successful writes, lastVisibleOpTime equals lastOpCommitted and may be ahead of snapshots anyway. CC schwerin and milkie.
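For reference, a hedged sketch of the change the commits above describe ("Use getLastOpCommitted rather than getLastOpVisible in sharding"), placed in the context shown in the diff from the description; this is a reconstruction, not the verbatim tree:

    // Inside ShardingEgressMetadataHook::_advanceConfigOptimeFromShard (sketch only;
    // surrounding code reconstructed from the diff in the description).
    const auto& replMetadata = parseStatus.getValue();

    // Before the fix: the last *visible* optime, which a write that fails its write
    // concern still advances on the primary and which may later be rolled back.
    //     auto opTime = replMetadata.getLastOpVisible();

    // After the fix (option 4): the majority-committed optime, which is always safe to
    // wait on with readConcern { level: "majority", afterOpTime: ... }.
    auto opTime = replMetadata.getLastOpCommitted();
    grid.advanceConfigOpTime(opTime);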