...
There is one change stream JSON test that fails on the latest sharded cluster. It is reproducible at least with the C# and Java drivers.

JSON test: https://github.com/mongodb/specifications/blob/master/source/change-streams/tests/change-streams-errors.json#L76

It looks like the reason is that the latest server no longer throws error `280` for the test "Change Stream should error when _id is projected out": https://github.com/mongodb/specifications/blob/master/source/change-streams/tests/change-streams-errors.json#L103 As a result, our test runner doesn't catch the expected error and fails.

The Evergreen task: https://evergreen.mongodb.com/task_log_raw/dot_net_driver_unsecure_tests__version~latest_os~windows_64_topology~sharded_cluster_auth~noauth_ssl~nossl_test_f809794cf52538bd3f6b5c5ba5b9a74b17257fb7_19_10_17_23_46_15/2?type=T#L2824
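For context, here is a minimal sketch of the scenario this JSON test covers, written against the Java sync driver API. The database/collection names and the use of `tryNext()` to model "iterate the change stream once" are illustrative assumptions, not the actual spec test runner code:

```java
import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Projections;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import org.bson.Document;

import java.util.Collections;

public class ProjectedOutIdScenario {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("change-stream-tests").getCollection("test");

            // Open a change stream that projects out _id (the resume token);
            // the server is expected to reject the resulting events with code 280.
            MongoCursor<ChangeStreamDocument<Document>> cursor = coll
                    .watch(Collections.singletonList(
                            Aggregates.project(Projections.excludeId())))
                    .cursor();

            // Generate an event for the change stream to pick up.
            coll.insertOne(new Document("x", 1));

            try {
                // "Iterate the change stream once": tryNext() issues at most one
                // getMore. If that getMore returns an empty batch instead of the
                // error, the runner never sees code 280 and the test fails.
                cursor.tryNext();
            } catch (MongoException e) {
                System.out.println("getMore failed with code: " + e.getCode()); // expected: 280
            }
        }
    }
}
```

Under the old server behavior the first getMore fails with code 280 and the catch block fires; on the latest sharded cluster that same getMore can come back as an empty batch, so a single iteration observes nothing.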
justin.seyster commented on Fri, 8 Nov 2019 00:10:21 +0000:

Quick note for myself: I checked the Evergreen logs to help narrow down what changed. The first task that failed with this error was using revision dab2ae229660af56a786a084ebc36666c4dd6a91 of the server. The task before that (which succeeded) was using revision 7264dd03a981ea9cc86dcdc073e7a6dedaa995af.

Update: I did some more experimenting, and this test failure appears to be related to commit 97cc7b5838db4ef13ede3149c44bceca8f5c2977 of the server (SERVER-42723). When I run the Java driver version of this test against the version of the server previous to 97cc7b, the first getMore immediately returns the expected error. When I run the same test against 97cc7b, it takes multiple getMore commands to get a response that is not an empty batch.

I also wanted to rule out commit f80979 of the C# driver as a possible cause; that's the first commit in Evergreen where this failure occurs, and it seems to involve tailable cursors. I ran f80979 on Evergreen against an older version of the server, and it succeeded, meaning the failure can't be caused by a change in f80979.

tl;dr: The failing test used to pass because it got its response in the first getMore. SERVER-42723 changed mongos/mongod behavior so that it now takes multiple getMore commands to get the expected error, and the C# driver is not designed to retry the getMore, which is the cause of this failure.

justin.seyster commented on Sat, 2 Nov 2019 01:46:04 +0000:

dmitry.lukyanov Thanks for bearing with me! Good question about the spec. I'm going to have to get back to you on that next week. I'll need to ask the team to figure out which spec doc is relevant here. I was basing my initial answer on my reading of the source code. It's clear that it was intentionally designed for the getMore to time out with no results but also no error when the mongos doesn't see any events from the shards for a period of time. The duration that the mongos waits is controlled by the "awaitDataTimeout" parameter to the getMore command (which is 1,000ms by default). My expectation is that setting a very high "awaitDataTimeout" value would get the test passing again. That's probably not the route we want to take, though.

My reading of the linked test specification is that "iterate changeStream once" refers to calling the driver's next() function exactly once, without necessarily specifying how many times the next() function issues getMore commands to the server. I'm not familiar with these documents, though, so it's definitely possible that I'm misunderstanding.

It sounds like the next step is to get on the same page about what the spec says for tailable+awaitData cursors timing out. I'll see what I can find next week. Query should also make sure to investigate whether there's a performance regression in change streams here.

Also, I misspoke in my previous message. I'm doing my testing on the "master" branch, which is 4.3, not 4.4. Sorry about that! If I do more testing, I'll try to do it on the specific commits you're using. Thanks again!

dmitry.lukyanov commented on Fri, 1 Nov 2019 23:16:09 +0000:

> What manual code changes did you add?

The manual change that makes it possible to call the server `getMore` command more than once.

> are you referring to a driver "getMore" function or the server "getMore" command?

The server `getMore` command.

> The client needs to keep calling "getMore" until it sees results.

I'm not familiar with this statement; can you point me to where the spec says it?
> In the latter case, a "getMore" command that returns 0 results is considered correct behavior from the server

Yes, the successful `getMore` response contains no documents.

NOTE: this is the description of the test runner for change streams: https://github.com/mongodb/specifications/blob/master/source/change-streams/tests/README.rst#spec-test-runner In particular, this line tells us to iterate the changeStream only once: "If there was no error and result.error is set, iterate changeStream once and capture any error."

justin.seyster commented on Fri, 1 Nov 2019 22:50:14 +0000:

dmitry.lukyanov Thanks for the additional info. Sorry, I'm not sure what you mean. What manual code changes did you add? When you say that you call "getMore" twice, are you referring to a driver "getMore" function or the server "getMore" command?

In the former case, a driver "getMore" function that returns 0 results may be a driver bug, depending on the driver specification. (I'm not familiar with the C# driver specification.) In the latter case, a "getMore" command that returns 0 results is considered correct behavior from the server. The client needs to keep calling "getMore" until it sees results. When I ran the test locally using the Java driver, I saw that it would keep sending "getMore" commands forever if the server kept sending back empty responses.

That said, if mongos used to reliably return the change in the first "getMore" batch but is now taking several seconds to return the result, that may be a real performance issue.

By the way, my local testing was against 4.4 top-of-tree. I could try 4.3 next week to see if maybe that will reproduce the issue. Thanks for helping me investigate this!

dmitry.lukyanov commented on Fri, 1 Nov 2019 22:29:54 +0000:

> Are you able to reproduce it locally?

Yes, I checked just now on server versions `4.3.0-1818-g6ca9be7` and `4.3.0-2014-g8010c51`. As I understand it, the problem is related to this line: https://github.com/mongodb/mongo-csharp-driver/blob/master/tests/MongoDB.Driver.Tests/Specifications/change-streams/tests/change-streams-errors.json#L105

When I call the `getMore` command, I don't see any errors. But when I call `getMore` a second time (after manual code changes), the error with code 280 and the following message is thrown:

Command getMore failed: Encountered an event whose _id field, which contains the resume token, was modified by the pipeline. Modifying the _id field of an event makes it impossible to resume the stream from that point. Only transformations that retain the unmodified _id field are allowed.

NOTE: the error is not reproducible on server version `4.3.0-762-gf13641a`.

UPDATE: The error is not necessarily thrown on the second attempt. In my tests, it was sometimes thrown only on the third or fourth attempt.

justin.seyster commented on Fri, 1 Nov 2019 21:59:18 +0000:

After I posted the above comment, I remembered that mongod/mongos only log a COMMAND entry if the command exceeds the "slowms" threshold, which is 100ms by default. It is possible that there is a third getMore call in each test that isn't getting logged because it returns right away, perhaps with nreturned > 0. When I have some time, I also plan to look at the logs for one of the successful runs to see how many getMore calls it has to make before it sees the error.

justin.seyster commented on Thu, 31 Oct 2019 23:34:20 +0000:

dmitry.lukyanov You mentioned that this reproduces on C# and Java. Are you able to reproduce it locally? Unfortunately, I was not able to make this happen using the Java test.
(I didn't try to execute the C# test suite, because I don't have a Windows test machine readily available.)

My initial thought was that maybe mongos was returning an empty batch in the getMore response (which is legal), causing the test to give up and fail. In my testing, though, the test keeps sending getMore requests when it gets empty batches from mongos, which is the correct behavior. In the logs from Evergreen, however, the test sends exactly two getMore commands and then gives up. (I'm looking at the mongos logs from this failure.)

(async = true)

```
./mci/8f82fe232dda6f5a77d481ac8871feff/drivers-tools/.evergreen/orchestration/db/mongo-65jwqy/mongos.log:2019-10-29T19:14:58.371+0000 I COMMAND [conn2171] command change-stream-tests.test command: getMore { getMore: 6199935151688990082, collection: "test", $db: "change-stream-tests", lsid: { id: UUID("328c2bd5-eb89-41b4-9bb3-8ea505c3059b") }, $clusterTime: { clusterTime: Timestamp(1572376497, 24), signature: { hash: BinData(0, 79F3B098B291669A30AE2504FD09FB07BE31D898), keyId: 6753301787618312232 } } } originatingCommand: { aggregate: "test", pipeline: [ { $changeStream: {} }, { $project: { _id: 0 } } ], cursor: {}, $db: "change-stream-tests", lsid: { id: UUID("328c2bd5-eb89-41b4-9bb3-8ea505c3059b") } } nShards:2 cursorid:6199935151688990082 numYields:0 nreturned:0 reslen:306 protocol:op_msg 1000ms

./mci/8f82fe232dda6f5a77d481ac8871feff/drivers-tools/.evergreen/orchestration/db/mongo-65jwqy/mongos.log:2019-10-29T19:14:59.632+0000 I COMMAND [conn2173] command change-stream-tests.test command: getMore { getMore: 2518998828756750331, collection: "test", $db: "change-stream-tests", lsid: { id: UUID("07b8c00b-6bbe-4ea9-aa0b-b018bb41b937") }, $clusterTime: { clusterTime: Timestamp(1572376498, 34), signature: { hash: BinData(0, B297D66CED85ED2C9889BE8CDF2B93B35F696D85), keyId: 6753301787618312232 } } } originatingCommand: { aggregate: "test", pipeline: [ { $changeStream: {} }, { $project: { _id: 0 } } ], cursor: {}, $db: "change-stream-tests", lsid: { id: UUID("07b8c00b-6bbe-4ea9-aa0b-b018bb41b937") } } nShards:2 cursorid:2518998828756750331 numYields:0 nreturned:0 reslen:306 protocol:op_msg 1001ms
```

(async = false)

```
./mci/8f82fe232dda6f5a77d481ac8871feff/drivers-tools/.evergreen/orchestration/db/mongo-65jwqy/mongos.log:2019-10-29T19:25:43.681+0000 I COMMAND [conn7919] command change-stream-tests.test command: getMore { getMore: 9096175620065252820, collection: "test", $db: "change-stream-tests", lsid: { id: UUID("2fd89215-7d08-4067-90c1-27912443afbe") }, $clusterTime: { clusterTime: Timestamp(1572377142, 60), signature: { hash: BinData(0, 19DB81E36C8F74DECE6442DC1F232F516E1F06C5), keyId: 6753301787618312232 } } } originatingCommand: { aggregate: "test", pipeline: [ { $changeStream: {} }, { $project: { _id: 0 } } ], cursor: {}, $db: "change-stream-tests", lsid: { id: UUID("2fd89215-7d08-4067-90c1-27912443afbe") } } nShards:2 cursorid:9096175620065252820 numYields:0 nreturned:0 reslen:306 protocol:op_msg 1001ms

./mci/8f82fe232dda6f5a77d481ac8871feff/drivers-tools/.evergreen/orchestration/db/mongo-65jwqy/mongos.log:2019-10-29T19:25:44.979+0000 I COMMAND [conn7921] command change-stream-tests.test command: getMore { getMore: 3119740060742483673, collection: "test", $db: "change-stream-tests", lsid: { id: UUID("08c1b04c-7ed5-4fb1-83e5-18a94f52c9b8") }, $clusterTime: { clusterTime: Timestamp(1572377143, 34), signature: { hash: BinData(0, 4D757549276B6B6BA660E1E7F8E74BB2F4307AA2), keyId: 6753301787618312232 } } } originatingCommand: { aggregate: "test", pipeline: [ { $changeStream: {} }, { $project: { _id: 0 } } ], cursor: {}, $db: "change-stream-tests", lsid: { id: UUID("08c1b04c-7ed5-4fb1-83e5-18a94f52c9b8") } } nShards:2 cursorid:3119740060742483673 numYields:0 nreturned:0 reslen:306 protocol:op_msg 1001ms
```

The two mysteries are:

1) Why is mongos taking so long to respond with the change? The "await data" timeout is 1 second, meaning that the mongos had no events for two full seconds. Not impossible, but longer than I would expect.

2) Why is the driver giving up? Dmitry, if you can reproduce this locally, would it be possible for you to use the debugger or logging to determine why the driver stops retrying? That would be my next step if I could reproduce this and/or if I had a better understanding of the C# driver.

david.storch commented on Wed, 23 Oct 2019 23:21:16 +0000:

I tried to reproduce this using the server's infrastructure for testing sharded clusters, but the change stream failed with the expected code (code:280, codeName:ChangeStreamFatalError).

```javascript
(function() {
    "use strict";

    let st = new ShardingTest({shards: 2});
    const testDb = st.s.getDB("test");
    st.shardColl(testDb.c, {_id: 1}, {_id: 0}, {_id: 0});

    let changeStream = testDb.c.watch([{$project: {_id: 0}}]);
    assert.writeOK(testDb.c.insert({_id: 1}));
    assert.soon(() => changeStream.hasNext());
}());
```

What's the topology of the sharded cluster used by the drivers test? Is there any special configuration of the cluster we need in order to repro this?

kaitlin.mahar commented on Wed, 23 Oct 2019 22:30:08 +0000:

I believe the test expects the failure on the first getMore. The linked JSON test has drivers do the following:

1) Create a change stream, which sends the initial aggregate command.

2) Do an insertOne to generate a new event.

3) Try to iterate the change stream to get the new event. This will trigger a getMore, since the initial batch returned by the aggregate was empty. The getMore returns the error.

ian.boros commented on Wed, 23 Oct 2019 21:41:31 +0000:

Does this test expect the aggregation to fail immediately (as part of the 'aggregate' command being sent)? Does it attempt to run any getMores? I cannot tell from the logs provided.
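To make the iteration semantics discussed above concrete: in Java driver terms, kaitlin.mahar's three steps combined with justin.seyster's "keep calling getMore until it sees results" look roughly like the retry loop below. This is only a sketch under assumed names; it is not the C# test runner's code, and whether a runner should retry like this is exactly the open spec question in this thread. Each tryNext() call that returns null corresponds to a getMore that came back with an empty batch.

```java
import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Projections;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import org.bson.Document;

import java.util.Collections;

public class RetryGetMoreSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("change-stream-tests").getCollection("test");

            // Step 1: open the change stream. The initial aggregate succeeds and
            // returns an empty first batch, so no error is expected at this point;
            // the failure, when it happens, comes from a getMore, not the aggregate.
            MongoCursor<ChangeStreamDocument<Document>> cursor = coll
                    .watch(Collections.singletonList(
                            Aggregates.project(Projections.excludeId())))
                    .cursor();

            // Step 2: generate an event.
            coll.insertOne(new Document("x", 1));

            // Step 3: keep issuing getMore commands (via tryNext) until either a
            // document arrives or the server replies with an error such as code 280.
            // With the post-SERVER-42723 behavior, the first getMores may time out
            // with empty batches before the error finally surfaces.
            try {
                ChangeStreamDocument<Document> event = null;
                while (event == null) {
                    event = cursor.tryNext();
                }
                System.out.println("got event: " + event);
            } catch (MongoException e) {
                System.out.println("getMore failed with code: " + e.getCode()); // e.g. 280
            }
        }
    }
}
```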