...
Since we added 4 new shards to an existing 5-shard cluster, one of our database clients receives CursorNotFound errors from time to time. cursorTimeoutMillis is set to 2 hours (7200000 ms) on all mongo nodes. The client uses the Perl driver without specifying a batch size. The client iterates over all documents whose clickCount field is of $type:3, which is covered by an index. The query in mongo-shell syntax looks like this:

db.offer.find({}, {"clickCount": 1, "offerId": 1, "_id": 1}).hint("clickCount_1_offerId_1__id_1")

The index is defined as follows:

{
    "v" : 2,
    "key" : {
        "clickCount" : 1,
        "offerId" : 1,
        "_id" : 1
    },
    "name" : "clickCount_1_offerId_1__id_1",
    "ns" : "offerStore.offer",
    "partialFilterExpression" : {
        "clickCount" : {
            "$type" : 3
        }
    },
    "background" : true
}

The client application constantly calls getMore (always within the cursorTimeoutMillis of 2 hours). As you can see in the strace of the client, which runs only one query at a time, at 11:29:04 there was a successful getMore call, and one second later the next getMore call failed with CursorNotFound:

11:29:04.596296 select(32, NULL, [24], NULL, {tv_sec=1800, tv_usec=0}) = 1 (out [24], left {tv_sec=1799, tv_usec=999997})
11:29:04.596362 write(24, "g\0\0\0G@\2254\0\0\0\0\324\7\0\0\0\0\0\0offerStore.$cmd\0\0\0\0\0\377\377\377\377;\0\0\0\22getMore\0\213]\320\275\311>1M\2collection\0\6\0\0\0offer\0\20batchSize\0e\0\0\0\0", 103) = 103
11:29:04.596599 select(32, [24], NULL, NULL, {tv_sec=1800, tv_usec=0}) = 1 (in [24], left {tv_sec=1799, tv_usec=999750})
11:29:04.596924 read(24, "f&\0\0>\247\t\6G@\2254\1\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0B&\0\0\3cursor\0\254%\0\0\22id\0\213]\320\275\311>1M\2ns\0\21\0\0\0offerStore.offer\0\4nextBatch\0w%\0\0\0030\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0N\5@\361\220*\332.\2_id\0!\0\0\0d97c8a0eb00c79a3982efa36fb8de9b8\0\0\0031\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0/\7\340\353\20+\332.\2_id\0!\0\0\0c6e2ef3033c554bd9ce1d80b5dc5eb9f\0\0\0032\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0\331\35\340\0268+\332.\2_id\0!\0\0\0e2af0314c13f3460f3c13028cbca813b\0\0\0033\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0[\24\240\356(-\332.\2_id\0!\0\0\0c2460966da93961b94c71ab2ec61cdc8\0\0\0034\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0\26\7`\r)-\332.\2_id\0!\0\0\0d0ef7338a5c089a018b8a2f766572d88\0\0\0035\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0\232\4`l)-\332.\2_id\0!\0\0\0c6bd57d356540a93ea8ae80a53032dfa\0\0\0036\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0009\t v)-\332.\2_id\0!\0\0\0c465c432b3a1b44edb31137a99dae205\0\0\0037\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0!\1\200\224)-\332.\2_id\0!\0\0\0be2a698c3741944e47cebb3d1c19c3ce\0\0\0038\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0\v\6 \315)-\332.\2_id\0!\0\0\0d102d497965addfc57d344663f5bacb3\0\0\0039\0[\0\0\0\3clickCount\0\17\0\0\0\0201_30\0\2\0\0\0\0\22offerId\0\234\2\300\367)-\332.\2_id\0!\0\0\0cf36ee16b06f7f"..., 372480) = 9830
...
11:29:05.290051 select(32, NULL, [24], NULL, {tv_sec=1800, tv_usec=0}) = 1 (out [24], left {tv_sec=1799, tv_usec=999996})
11:29:05.290112 write(24, "g\0\0\0)m\211K\0\0\0\0\324\7\0\0\0\0\0\0offerStore.$cmd\0\0\0\0\0\377\377\377\377;\0\0\0\22getMore\0\213]\320\275\311>1M\2collection\0\6\0\0\0offer\0\20batchSize\0e\0\0\0\0", 103) = 103
11:29:05.290349 select(32, [24], NULL, NULL, {tv_sec=1800, tv_usec=0}) = 1 (in [24], left {tv_sec=1799, tv_usec=947221})
11:29:05.343287 read(24, "\5\1\0\0D\265\t\6)m\211K\1\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\341\0\0\0\1ok\0\0\0\0\0\0\0\0\0\2errmsg\0 \0\0\0cursor id 86597569444 not found\0\20code\0+\0\0\0\2codeName\0\17\0\0\0CursorNotFound\0\21operationTime\0\363\6\0\0a\304\311`\3$clusterTime\0X\0\0\0\21clusterTime\0K\10\0\0a\304\311`\3signature\0003\0\0\0\5hash\0\24\0\0\0\0\7)\354*\372\n\355\315\374\231\16\274/K\35`Ml\341h\22keyId\0\376\3\0\0p\215\16`\0\0\0", 372480) = 261

The size of one returned document is 100 bytes on average. The index size is roughly 1 GB across all shards. We found out that the cursors timed out on all nine MongoDB shards within a few seconds of each other:

mongo-offerstore-de07-03.db00.pro05.eu.idealo.com: 2021-06-21T09:21:55.953+0200 I QUERY [clientcursormon] Cursor id 86275630521 timed out, idle since 2021-06-21T07:21:54.768+0200
mongo-offerstore-de08-01.db00.pro08.eu.idealo.com: 2021-06-21T09:21:56.124+0200 I QUERY [clientcursormon] Cursor id 70641344629 timed out, idle since 2021-06-21T07:21:54.432+0200
mongo-offerstore-de03-03.db00.pro06.eu.idealo.com: 2021-06-21T09:21:57.046+0200 I QUERY [clientcursormon] Cursor id 88178921267 timed out, idle since 2021-06-21T07:21:54.570+0200
mongo-offerstore-de02-01.db00.pro08.eu.idealo.com: 2021-06-21T09:21:57.435+0200 I QUERY [clientcursormon] Cursor id 87481684996 timed out, idle since 2021-06-21T07:21:54.829+0200
mongo-offerstore-de06-03.db00.pro05.eu.idealo.com: 2021-06-21T09:21:57.494+0200 I QUERY [clientcursormon] Cursor id 98138927074 timed out, idle since 2021-06-21T07:21:54.914+0200
mongo-offerstore-de04-01.db00.pro08.eu.idealo.com: 2021-06-21T09:21:57.716+0200 I QUERY [clientcursormon] Cursor id 86355932061 timed out, idle since 2021-06-21T07:21:54.473+0200
mongo-offerstore-de01-02.db00.pro06.eu.idealo.com: 2021-06-21T09:21:57.831+0200 I QUERY [clientcursormon] Cursor id 88372047114 timed out, idle since 2021-06-21T07:21:54.488+0200
mongo-offerstore-de09-01.db00.pro08.eu.idealo.com: 2021-06-21T09:21:58.005+0200 I QUERY [clientcursormon] Cursor id 71157191726 timed out, idle since 2021-06-21T07:21:54.704+0200
mongo-offerstore-de05-01.db00.pro05.eu.idealo.com: 2021-06-21T09:22:00.531+0200 I QUERY [clientcursormon] Cursor id 65745554296 timed out, idle since 2021-06-21T07:21:57.605+0200

Is it possible that the cursor used by the client is completely disassociated from the cursors between the router and the shards? And if there is some caching or queueing between the router and the shard nodes, it could be that the "backend" cursors receive getMore commands less often, causing the timeouts.

Hypothesis: Each shard returns its batch to the router. The router caches these batches. Batches are fixed in size. Since our cluster has nearly doubled its number of shards, the router now holds nearly twice as many documents in its cache. Thus, the router takes almost twice as long before it requests the next batch from the shards, which may exceed the cursor timeout. Does this hypothesis hold true?
If so, the router should forward each getMore it receives from the client to the shards, to notify them that the cursor is still in use, even if no shard had to fetch more documents because the router still had them in its cache.
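To make the hypothesis concrete, here is a back-of-the-envelope check in mongo-shell JavaScript. It is purely illustrative: the per-batch document count and the per-document processing time are assumptions drawn from figures mentioned elsewhere in this ticket (roughly 100-byte documents, about 15 ms of client-side work per document, and a batch bounded only by the 16 MB BSON limit), not measured driver or server defaults.

// Rough check of the hypothesis; all numbers are illustrative assumptions.
var numShards = 9;                  // 5 existing + 4 newly added shards
var docsPerShardBatch = 167772;     // assumed: 16 MB batch / ~100 bytes per document
var perDocMillis = 15;              // assumed client-side processing time per document
var cursorTimeoutMillis = 7200000;  // 2 hours, as configured

// While the router drains the batches it has buffered from all shards,
// each individual shard cursor sits idle for roughly:
var idleMillis = numShards * docsPerShardBatch * perDocMillis;
print("approx. idle time per shard cursor: " + Math.round(idleMillis / 60000) + " minutes"); // ~377 minutes
print("exceeds cursorTimeoutMillis (120 minutes)? " + (idleMillis > cursorTimeoutMillis));   // true

The point of the sketch is only that, under these assumptions, the idle time of each shard cursor grows linearly with both the number of shards and the batch size, so adding shards pushes it further past the timeout.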
kay.agahd@idealo.de commented on Wed, 3 Nov 2021 15:09:19 +0000: Hi mihai.andrei, sorry for my late reply, I overlooked your question about the shard key. The shard key is _id.

mihai.andrei commented on Wed, 4 Aug 2021 21:24:04 +0000: Hi kay.agahd@idealo.de and eric.sedor; I took a look at this ticket and found that it can be fixed with SERVER-6036, so I'm going to go ahead and close the ticket. Feel free to leave a comment if you have any other questions; thanks!

mihai.andrei commented on Thu, 22 Jul 2021 15:44:00 +0000: I think the simplest way to fix this would be to tie the query to a session so that the shard cursors don't time out. After taking a look, I think this could certainly be fixed by SERVER-6036: if the query is run in a session, the cursors on each shard won't time out if they're all tied to one session. As far as the relation to SERVER-46918, at a high level it's related because we have a query against a sharded cluster where the cursors on each shard time out because they sit idle for 2 hours. As it pertains to the root cause of 'sort' being responsible for idling cursors, this could be somewhat similar because even though an explicit sort is not requested, the query results in a covered projection, which uses an index scan and returns results in sorted order (note that this is explain output on a single mongod):

{
    "explainVersion" : "1",
    "queryPlanner" : {
        "namespace" : "db.c",
        "indexFilterSet" : false,
        "parsedQuery" : { },
        "queryHash" : "7420B22A",
        "planCacheKey" : "D32EF81B",
        "maxIndexedOrSolutionsReached" : false,
        "maxIndexedAndSolutionsReached" : false,
        "maxScansToExplodeReached" : false,
        "winningPlan" : {
            "stage" : "PROJECTION_COVERED",
            "transformBy" : {
                "clickCount" : 1,
                "offerId" : 1,
                "_id" : 1
            },
            "inputStage" : {
                "stage" : "IXSCAN",
                "keyPattern" : {
                    "clickCount" : 1,
                    "offerId" : 1,
                    "_id" : 1
                },
                "indexName" : "clickCount_1_offerId_1__id_1",
                "isMultiKey" : false,
                "multiKeyPaths" : {
                    "clickCount" : [ ],
                    "offerId" : [ ],
                    "_id" : [ ]
                },
                "isUnique" : false,
                "isSparse" : false,
                "isPartial" : false,
                "indexVersion" : 2,
                "direction" : "forward",
                "indexBounds" : {
                    "clickCount" : [ "[MinKey, MaxKey]" ],
                    "offerId" : [ "[MinKey, MaxKey]" ],
                    "_id" : [ "[MinKey, MaxKey]" ]
                }
            }
        },
        "rejectedPlans" : [ ]
    },
    "command" : {
        "find" : "c",
        "filter" : { },
        "projection" : {
            "clickCount" : 1,
            "offerId" : 1,
            "_id" : 1
        },
        "hint" : "clickCount_1_offerId_1__id_1",
        "$db" : "db"
    },
    "serverInfo" : {
        "host" : "ip-10-122-6-36",
        "port" : 27017,
        "version" : "5.0.0-alpha0-1764-gcdb43ef",
        "gitVersion" : "cdb43efd361b20a0c6a9bf7ad6e14dd307b6a1a4"
    },
    "serverParameters" : {
        "internalQueryFacetBufferSizeBytes" : 104857600,
        "internalQueryFacetMaxOutputDocSizeBytes" : 104857600,
        "internalLookupStageIntermediateDocumentMaxSizeBytes" : 104857600,
        "internalDocumentSourceGroupMaxMemoryBytes" : 104857600,
        "internalQueryMaxBlockingSortMemoryUsageBytes" : 104857600,
        "internalQueryProhibitBlockingMergeOnMongoS" : 0,
        "internalQueryMaxAddToSetBytes" : 104857600,
        "internalDocumentSourceSetWindowFieldsMaxMemoryBytes" : 104857600
    },
    "ok" : 1
}

Do we know anything about the shard key being used and the distribution of data in the above? If the data is sharded on clickCount (or even on the key pattern of the index), then it's likely related. Regardless of the relation to SERVER-46918, I think that this could be fixed by tying the query to a session.
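For illustration only, here is a minimal mongo-shell sketch of the session-based approach mihai.andrei describes for SERVER-6036: run the query inside an explicit session so that the shard-side cursors are tied to that session. processOffer is a hypothetical placeholder for the client's per-document work.

// Sketch: run the covered query inside an explicit session (mongo shell).
// With SERVER-6036 in place, shard cursors tied to this session should not be
// reaped by cursorTimeoutMillis while the ongoing getMores keep the session alive.
var session = db.getMongo().startSession();
var sessionDb = session.getDatabase("offerStore");
var cursor = sessionDb.offer
    .find({}, { "clickCount": 1, "offerId": 1, "_id": 1 })
    .hint("clickCount_1_offerId_1__id_1");
while (cursor.hasNext()) {
    processOffer(cursor.next());  // hypothetical placeholder for the client's processing
}
session.endSession();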
eric.sedor commented on Tue, 29 Jun 2021 16:42:42 +0000: You're right that the batch size workaround won't scale well with shard count; currently we do intend to backport SERVER-6036 to 4.4 if it is doable. For working with batch size to get the best results in your specific environment, I'd encourage you to explore this with our community by posting on the MongoDB Developer Community Forums. I believe it is the case that SERVER-6036 is expected to improve this scenario independent of how clients use sessions, but I will pass this ticket to an appropriate team to confirm that is the case. If so, we'll close this ticket as a duplicate of that one.

kay.agahd@idealo.de commented on Tue, 29 Jun 2021 12:39:01 +0000: Hi eric.sedor, thanks for confirming our hypothesis. Regarding your suggested workaround of using a smaller batch size: what's the default batch size of the Perl driver? Is it the maximum size of a BSON document (16 MB)? Given an average size of 100 bytes per document, each batch could then contain up to 167,772 documents; multiplied by the number of 9 shards, that makes about 1.5 million documents which must all be processed within the cursorTimeout of 2 hours. This leaves less than 5 ms of processing time per document, but the client actually needs 15 ms per document. So we'd need to reduce the batch size to one third at most. However, this does not scale out when more shards are added in the future; in that case we'd always have to reduce the batch size further. That's suboptimal.

By the way, after having added our 4 new shards to the cluster, we also hit the bug solved in SERVER-46487, because we still used v4.0.18 while the bug has only been fixed in 4.0.19. What I want to say by this is that our chunk distribution must have been similar to the one described in SERVER-46487, else we wouldn't have hit that bug. This may also explain why we were more exposed to the CursorNotFound error.

The tickets you linked are indeed related, especially SERVER-46918. Unfortunately SERVER-46918 has been closed as a duplicate of SERVER-6036, even though I think it's not a duplicate, because SERVER-6036 relies on sessions while SERVER-46918 and my ticket (SERVER-57863) do not rely on sessions. The author of SERVER-46918 pointed out:

It is not a real use case but I have real ones where I have 10 shards, process of each record takes more than 1 second and cursorTimeoutMillis: "7200000" (2 hours)

mihai.andrei found out in SERVER-46918 that it has to do with the sort:

The originating query requests a sort, which means that, after cursors are established on both shards, the cursor on the first shard (i.e. the shard that possesses the chunks which come first in sorted order) must be totally exhausted before we start to issue getMores against the cursor on the second shard

mihai.andrei concluded that it is rather a feature request than a bug and closed the ticket. Now that you have a real-life use case, even without a sort, you may have a reason to solve this issue and backport it to at least version 4, because v5 has not even been published yet.

eric.sedor commented on Mon, 28 Jun 2021 18:38:57 +0000: Hi kay.agahd@idealo.de, Another way to work around your issue might be to test explicit and smaller batch size options that will prompt more getMores to the shards. The mongos does maintain its own cursors to the shards, and there is an inherent buffer in that the batch size is not varied by the mongos when it makes requests to the shards.
So, batches from each cursor to the shard members won't be fully consumed by a single getMore on the driver-facing cursor. Doubling the number of shards would certainly have an effect on this. I believe what you're reporting is similar in shape to both SERVER-46918 and SERVER-38123, which are being addressed as SERVER-6036 in MongoDB 5.0. Does that look right to you? Sincerely, Eric

kay.agahd@idealo.de commented on Mon, 21 Jun 2021 12:15:17 +0000: To work around this issue, the database client now retrieves all documents from the database and keeps them in memory. The query took seven minutes and returned 4618504 documents. After that, these documents will be processed, which will take approximately 20 hours.
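As an illustration of the smaller-batch-size workaround discussed by eric.sedor and kay.agahd@idealo.de above, here is a minimal mongo-shell sketch. The batchSize value of 1000 is an assumption, chosen so that numShards × batchSize × 15 ms stays far below the 2-hour cursorTimeoutMillis; processOffer is again a hypothetical placeholder.

// Sketch of the explicit batch-size workaround (mongo shell; the value is illustrative).
// A smaller driver-facing batch prompts more frequent getMores towards the shards,
// so the shard cursors are not left idle for hours.
var batchSize = 1000;  // 9 shards * 1000 docs * 15 ms/doc ~= 2.25 minutes between shard getMores (rough estimate)
var cursor = db.getSiblingDB("offerStore").offer
    .find({}, { "clickCount": 1, "offerId": 1, "_id": 1 })
    .hint("clickCount_1_offerId_1__id_1")
    .batchSize(batchSize);
cursor.forEach(function (doc) {
    processOffer(doc);  // hypothetical placeholder for the ~15 ms of client work per document
});

As noted in the comments above, this only mitigates the problem: the batch size would have to be re-tuned whenever shards are added.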