
OPERATIONAL DEFECT DATABASE
...

...
DBCommandCursor routes getMore and killCursor operations to the current primary of the replica set. Because a cursor opened on the primary stays on that node after a stepdown, DBCommandCursor can then route a getMore or killCursor operation to the wrong node. A similar situation can arise if slaveOk is set on the replica-set connection.
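The failure mode in the description can be illustrated with a toy model in plain JavaScript. This is a minimal sketch, not shell code: FakeNode and FakeReplicaSetConnection are hypothetical stand-ins, and passing the documents inside the find command is a simulation shortcut. The only behavior taken from the report is that cursor state lives on the node that served the find, while later commands follow whichever node is currently primary.

```javascript
// Toy model only: FakeNode and FakeReplicaSetConnection are hypothetical
// stand-ins for illustration, not MongoDB shell or driver classes.
class FakeNode {
    constructor(name) {
        this.name = name;
        this.cursors = new Map();  // cursor id -> documents not yet returned
        this.nextCursorId = 1;
    }

    // Open a cursor on THIS node and return the first batch.
    find(docs, batchSize) {
        const id = this.nextCursorId++;
        this.cursors.set(id, docs.slice(batchSize));
        return {ok: 1, cursorId: id, firstBatch: docs.slice(0, batchSize)};
    }

    // Cursor ids are only meaningful on the node that created them.
    getMore(cursorId, batchSize) {
        if (!this.cursors.has(cursorId)) {
            return {ok: 0, code: 43, errmsg: "Cursor not found, cursor id: " + cursorId};
        }
        const rest = this.cursors.get(cursorId);
        this.cursors.set(cursorId, rest.slice(batchSize));
        return {ok: 1, nextBatch: rest.slice(0, batchSize)};
    }
}

class FakeReplicaSetConnection {
    constructor(nodes) {
        this.nodes = nodes;
        this.primary = nodes[0];
    }

    // Simulated stepdown: the other node becomes primary.
    stepDown() {
        this.primary = this.nodes.find((n) => n !== this.primary);
    }

    // Models the reported behavior: every command follows the current primary,
    // with no memory of which node served the original find.
    runCommand(cmd) {
        if (cmd.find !== undefined) {
            return this.primary.find(cmd.docs, cmd.batchSize);
        }
        return this.primary.getMore(cmd.getMore, cmd.batchSize);
    }
}

const a = new FakeNode("a");
const b = new FakeNode("b");
const conn = new FakeReplicaSetConnection([a, b]);

const first = conn.runCommand({find: "foo", docs: [1, 2, 3, 4, 5], batchSize: 2});
conn.stepDown();  // the cursor stays on node "a", but "b" is now primary
const more = conn.runCommand({getMore: first.cursorId, batchSize: 2});

console.log(more.code, more.errmsg);  // prints: 43 Cursor not found, cursor id: 1
```

In this model the cursor is still open on node "a"; the error arises only because the getMore was sent to node "b".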
xgen-internal-githook commented on Mon, 29 Aug 2016 21:14:41 +0000:
Author: Jonathan Reams (jbreams, jbreams@mongodb.com)
Message: SERVER-23219 DBCommandCursor should route getMore operations to original server
Branch: master
https://github.com/mongodb/mongo/commit/bdf345a78764d2552373b5f938ac9aa6be5346b9

rassi@10gen.com commented on Mon, 21 Mar 2016 23:17:56 +0000:
Max, Dave, and I spoke briefly about this issue today. Our tentative assessment of the user impact is that the shell will throw an error for certain queries, and may crash, when started with a replica set connection to a 3.2+ cluster (see here and here, for more information about replica set connections). Specifically:
- Cursors that are not exhausted will crash the shell, either during routine garbage collection or during shutdown (see SERVER-22347 for details).
- Queries that target the primary will fail with "Cursor not found" if the primary steps down between batches.
- Queries that specify a read preference which targets a node other than the primary will always fail with "Cursor not found" if multiple batches are returned.
See the example below:

    // repro.js
    db.foo.drop()
    for (i=0; i<102; i++) { db.foo.insert({}); }
    db.foo.find().readPref("secondary").itcount()

    rassi@rassi:~/work/mongo$ mongo --host testReplSet/localhost:20000,localhost:20001 repro.js
    MongoDB shell version: 0.0.0
    connecting to: testReplSet/localhost:20000,localhost:20001/test
    2016-03-21T19:15:20.075-0400 I NETWORK [thread1] Starting new replica set monitor for testReplSet/localhost:20000,localhost:20001
    2016-03-21T19:15:20.075-0400 I NETWORK [ReplicaSetMonitorWatcher] starting
    2016-03-21T19:15:20.079-0400 I NETWORK [thread1] changing hosts to testReplSet/rassi:20000,rassi:20001 from testReplSet/localhost:20000,localhost:20001
    2016-03-21T19:15:22.664-0400 E QUERY [thread1] Error: getMore command failed: {
        "ok" : 0,
        "errmsg" : "Cursor not found, cursor id: 51404920384",
        "code" : 43
    } :

Replica set connections have only been documented in the shell for a couple of versions (DOCS-1440 was resolved in February 2015), so I'm somewhat curious as to how commonly they are used. We have barely any test coverage for using replica set connections in the shell, so I've filed SERVER-23280 to help improve this coverage. I'm also curious what folks think about whether we should continue to maintain support for replica set connections in the shell in the long term.

If we do decide to move forward with a backport for this issue, I would suggest forcing read mode "legacy" for replica set connections in the shell as an interim fix. This will likely require minor changes in the mozjs integration in order to expose this information, but the diff will still be relatively small.

The real fix for this issue will be more difficult, as the core problem is a flaw of the original DBCommandCursor design.
I see two possible paths forward, depending on whether or not we implement SERVER-20770:
- If we do implement SERVER-20770, we can do targeting in the same fashion that DBClientReplicaSet::query() does targeting, and keep a pointer to a DBClientConnection as a member variable of DBCommandCursor. We can then use this connection for all cursor operations.
- If we don't implement SERVER-20770, we have to 1) expose the host targeted by a read command to the caller, likely with a new runCommand() overload, and 2) provide a mechanism in the shell to ask a DBClientReplicaSet to return the underlying connection associated with a given host (technically we could open a new connection for every getMore/killCursor operation instead, but that would be insanity). This won't be very clean.

We'll decide on an approach in our next triage meeting.

rassi@10gen.com commented on Mon, 21 Mar 2016 14:19:11 +0000:
Max and I misdiagnosed this issue. We originally thought this issue affected DBClientReplicaSet, but the issue is actually in DBCommandCursor. DBCommandCursor does not track the host used for the original find, so it blindly issues getMore and killCursor requests using runCommand() against the underlying connection object (which is a DBClientReplicaSet, in this case), which can result in the request being routed to the wrong replica set member. I've updated the summary/description to reflect this new discovery.

To clarify: this is a shell-only issue, and does not affect the server or C++ client library. All versions of the shell since 3.2 are affected.

Re-assigning back to the query team for triage.

rassi@10gen.com commented on Fri, 18 Mar 2016 13:53:15 +0000:
Per discussion with milkie, reassigning to the sharding team backlog for triage. Feel free to bounce this back to platforms once triaged, if appropriate.

milkie commented on Fri, 18 Mar 2016 13:49:03 +0000:
Adding sharding component as this affects mongos.
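The second path discussed in the comments (expose the host that served the read, then look up the underlying connection for that host) can be sketched in plain JavaScript as follows. This is an illustrative model under stated assumptions, not the committed fix: SketchNode, SketchReplicaSet, PinnedCursor, runCommandWithTarget, and connectionFor are all hypothetical names, and documents are passed inside the find command purely to keep the simulation self-contained.

```javascript
// Hypothetical sketch: none of these classes or methods are real shell or
// driver APIs. They only model "remember the host, reuse its connection".
class SketchNode {
    constructor(host) {
        this.host = host;
        this.cursors = new Map();  // cursor id -> documents not yet returned
        this.nextId = 1;
    }

    runCommand(cmd) {
        if (cmd.find !== undefined) {
            const id = this.nextId++;
            this.cursors.set(id, cmd.docs.slice(cmd.batchSize));
            return {ok: 1, id: id, batch: cmd.docs.slice(0, cmd.batchSize)};
        }
        if (cmd.getMore !== undefined) {
            const rest = this.cursors.get(cmd.getMore);
            if (rest === undefined) {
                return {ok: 0, code: 43, errmsg: "Cursor not found"};
            }
            this.cursors.set(cmd.getMore, rest.slice(cmd.batchSize));
            return {ok: 1, id: cmd.getMore, batch: rest.slice(0, cmd.batchSize)};
        }
        // Anything else is treated as a kill request for cmd.kill.
        return {ok: this.cursors.delete(cmd.kill) ? 1 : 0};
    }
}

class SketchReplicaSet {
    constructor(nodes) {
        this.nodes = nodes;
        this.primary = nodes[0];
    }

    // Step 1: run the command and report which host served it.
    runCommandWithTarget(cmd) {
        return {host: this.primary.host, reply: this.primary.runCommand(cmd)};
    }

    // Step 2: return the underlying connection for a given host.
    connectionFor(host) {
        return this.nodes.find((n) => n.host === host);
    }
}

class PinnedCursor {
    constructor(rs, docs, batchSize) {
        const {host, reply} = rs.runCommandWithTarget({find: "foo", docs, batchSize});
        this.conn = rs.connectionFor(host);  // pinned for the cursor's lifetime
        this.id = reply.id;
        this.batchSize = batchSize;
    }

    // All later cursor operations bypass primary selection entirely.
    getMore() {
        return this.conn.runCommand({getMore: this.id, batchSize: this.batchSize});
    }

    kill() {
        return this.conn.runCommand({kill: this.id});
    }
}

const rs = new SketchReplicaSet([new SketchNode("a:20000"), new SketchNode("b:20001")]);
const cursor = new PinnedCursor(rs, [1, 2, 3, 4, 5], 2);
rs.primary = rs.nodes[1];  // simulated stepdown after the first batch
const reply = cursor.getMore();
const killed = cursor.kill();
console.log(reply.ok, reply.batch, killed.ok);
```

Because the cursor holds its own connection reference, the simulated stepdown does not affect where getMore and the kill request are sent.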
The following patch modifies the stepdown_query.js test to use DBClientReplicaSet to demonstrate the issue for getMore and killCursor operations. It can be invoked with resmoke.py by doing

    python buildscripts/resmoke.py --executor no_passthrough jstests/noPassthrough/stepdown_query.js

    diff --git a/jstests/noPassthrough/stepdown_query.js b/jstests/noPassthrough/stepdown_query.js
    index 05d22f3..5d1dd36 100644
    --- a/jstests/noPassthrough/stepdown_query.js
    +++ b/jstests/noPassthrough/stepdown_query.js
    @@ -5,11 +5,12 @@
         var dbName = "test";
         var collName = jsTest.name();
     
    -    function runTest(host, rst) {
    -        // We create a new connection to 'host' here instead of passing in the original connection.
    -        // This to work around the fact that connections created by ReplSetTest already have slaveOk
    -        // set on them, but we need a connection with slaveOk not set for this test.
    -        var conn = new Mongo(host);
    +    function runTest(connStr, rst) {
    +        // We create a new connection using 'connStr' as our connection string instead of passing in
    +        // the original connection. This to work around the fact that each connection created by
    +        // ReplSetTest is backed by a DBClientConnection, but we need to use a DBClientReplicaSet
    +        // connection.
    +        var conn = new Mongo(connStr);
             var coll = conn.getDB(dbName).getCollection(collName);
             assert(!coll.exists());
             assert.writeOK(coll.insert([{}, {}, {}, {}, {}]));
    @@ -19,10 +20,12 @@
             cursor.next();
             assert.eq(0, cursor.objsLeftInBatch());
             var primary = rst.getPrimary();
    +        var secondary = rst.getSecondary();
             assert.throws(function() {
                 primary.getDB("admin").runCommand({replSetStepDown: 60, force: true});
             });
             rst.waitForState(primary, ReplSetTest.State.SECONDARY, 60 * 1000);
    +        rst.waitForState(secondary, ReplSetTest.State.PRIMARY, 60 * 1000);
             // When the primary steps down, it closes all client connections. Since 'conn' may be a
             // direct connection to the primary and the shell doesn't automatically retry operations on
             // network errors, we run a dummy operation here to force the shell to reconnect.
    @@ -39,16 +42,17 @@
             });
         }
     
    -    // Test querying a replica set primary directly.
    -    var rst = new ReplSetTest({nodes: 1});
    +    // Test querying a replica set.
    +    var rst = new ReplSetTest({nodes: 2});
         rst.startSet();
         rst.initiate();
    -    runTest(rst.getPrimary().host, rst);
    +    runTest('mongodb://' + rst.getPrimary().host + ',' + rst.getSecondary().host + '/?replicaSet=' +
    +                rst.name, rst);
         rst.stopSet();
     
         // Test querying a replica set primary through mongos.
    -    var st = new ShardingTest({shards: 1, rs: true});
    -    rst = st.rs0;
    -    runTest(st.s0.host, rst);
    -    st.stop();
    +    // var st = new ShardingTest({shards: 1, rs: true});
    +    // rst = st.rs0;
    +    // runTest(st.s0.host, rst);
    +    // st.stop();
     })()

Output

    [js_test:stepdown_query] 2016-03-17T18:06:56.965-0400 2016-03-17T18:06:56.965-0400 E QUERY [thread1] Error: getMore command failed: {
    [js_test:stepdown_query] 2016-03-17T18:06:56.966-0400 "ok" : 0,
    [js_test:stepdown_query] 2016-03-17T18:06:56.966-0400 "errmsg" : "Cursor not found, cursor id: 32766606856",
    [js_test:stepdown_query] 2016-03-17T18:06:56.966-0400 "code" : 43
    [js_test:stepdown_query] 2016-03-17T18:06:56.966-0400 } :
    [js_test:stepdown_query] 2016-03-17T18:06:56.966-0400 _getErrorWithCode@src/mongo/shell/utils.js:25:13
    [js_test:stepdown_query] 2016-03-17T18:06:56.966-0400 DBCommandCursor.prototype._runGetMoreCommand@src/mongo/shell/query.js:758:1
    [js_test:stepdown_query] 2016-03-17T18:06:56.966-0400 DBCommandCursor.prototype._hasNextUsingCommands@src/mongo/shell/query.js:786:9
    [js_test:stepdown_query] 2016-03-17T18:06:56.966-0400 DBCommandCursor.prototype.hasNext@src/mongo/shell/query.js:794:1
    [js_test:stepdown_query] 2016-03-17T18:06:56.966-0400 DBQuery.prototype.hasNext@src/mongo/shell/query.js:287:13
    [js_test:stepdown_query] 2016-03-17T18:06:56.966-0400 runTest@jstests/noPassthrough/stepdown_query.js:39:16
    [js_test:stepdown_query] 2016-03-17T18:06:56.966-0400 @jstests/noPassthrough/stepdown_query.js:49:1
    [js_test:stepdown_query] 2016-03-17T18:06:56.967-0400 @jstests/noPassthrough/stepdown_query.js:3:2
    [js_test:stepdown_query] 2016-03-17T18:06:56.967-0400
    [js_test:stepdown_query] 2016-03-17T18:06:56.967-0400 failed to load: jstests/noPassthrough/stepdown_query.js