...
Very similar to what was observed in https://jira.mongodb.org/browse/SERVER-42232 — we are seeing the same issue in MongoDB 6.0.

More info:
- Using MongoDB 6.0.6.
- The DB has sharded collections and a few shards; every shard is a P-S-A replica set (two data-bearing nodes and an arbiter).
- Our application uses change streams to gather statistics. It reads operations continuously and persists the clusterTime timestamp, so that if the app crashes it can resume from the last handled point.
- We maintain the storage ourselves and decide to add a shard from time to time.

We have been noticing that when a shard is added, the following happens:
- The shard is added at ~9:15 AM, comes up, and starts rebalancing its data with the other shards.
- A few hours later, at ~12:30 PM, "Resume of change stream was not possible" errors start showing up, coming from the port of the newly added shard:

{"t":{"$date":"2023-08-02T06:32:40.283+00:00"},"s":"W", "c":"QUERY", "id":20478, "ctx":"conn188","msg":"getMore command executor error","attr":{"error":{"code":286,"codeName":"ChangeStreamHistoryLost","errmsg":"Resume of change stream was not possible, as the resume point may no longer be in the oplog."},"stats":{},"cmd":{"getMore":2383202557420931747,"collection":"$cmd.aggregate","maxTimeMS":1000}}}

Also, the first available event on the new shard is delayed considerably. Here we added the shard around ~9:30 AM, but its first available event was only at ~14:30 PM, many hours later, while the other shards have events before that — which is hard to make sense of, given that this is a brand-new shard. In this case the client, which is connected to the mongos router, cannot proceed unless we move the start_at_operation_time pointer forward to 14:30 PM so it can continue reading (which also loses all updates from ~12:30 PM to ~14:30 PM, and that is not acceptable).

Why is this happening?
Isn't the change stream supposed to continue normally and include the new shard's updates once it is ready? Failing like this does not look like normal behavior. Is there a safe way to add a shard and keep reading incoming updates for the other shards through the mongos router, without being stopped by this unsynced shard? Is this happening because of the P-S-A configuration, and would it not occur with P-S-S? If so, why? Following up on this community thread, where the question was raised a few days ago with no answer: https://www.mongodb.com/community/forums/t/change-stream-resume-point-lost-after-adding-a-new-shard/232005
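For context on the failure mode described above: error code 286 (ChangeStreamHistoryLost) means the saved resume point is no longer available in the oplog of at least one shard, so the whole cluster-wide stream fails. The sketch below illustrates the resume-token bookkeeping the reporter's application performs. It is a minimal, self-contained illustration: `open_stream`, `load_token`, `save_token`, and `on_history_lost` are hypothetical callables, and `OperationFailure` is stubbed in place of the real driver exception (in PyMongo it would be `pymongo.errors.OperationFailure`).

```python
# Sketch of resume-token bookkeeping for a change-stream reader.
# ChangeStreamHistoryLost (code 286) means the saved resume point has
# fallen off the oplog of at least one shard, so the stream cannot resume.

CHANGE_STREAM_HISTORY_LOST = 286

class OperationFailure(Exception):
    """Stand-in for the driver's server-error exception (stubbed here)."""
    def __init__(self, code):
        super().__init__(code)
        self.code = code

def consume(open_stream, load_token, save_token, on_history_lost):
    """Open a change stream from the last saved token and persist each
    event's token, so a crash can resume from the last handled point.
    If the resume point is gone (code 286), hand control to a recovery
    callback instead of crashing the reader."""
    try:
        for event in open_stream(resume_after=load_token()):
            save_token(event["_id"])  # a change event's _id is its resume token
            yield event
    except OperationFailure as exc:
        if exc.code == CHANGE_STREAM_HISTORY_LOST:
            on_history_lost()  # e.g. fall back to a later start point / re-sync
        else:
            raise
```

The point of the sketch is the trade-off the reporter hits: once code 286 fires, the only recovery is to pick a newer start point, which inherently skips the events in between.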
JIRAUSER1264630 commented on Wed, 9 Aug 2023 07:34:46 +0000: Hi yuan.fang@mongodb.com, Were you able to look at the logs and find anything? Thanks.

JIRAUSER1264630 commented on Sun, 6 Aug 2023 06:36:36 +0000: Hi yuan.fang@mongodb.com, Yes, I uploaded with the full path, but it appeared to me that it worked. Uploaded again now as follows:

curl -X POST https://upload.box.com/api/2.0/files/content \
  -H 'Authorization: Bearer 1!xWlQ4ggMMjG26rADGlkBg_aj6qpIzwRyLnJaRJ_M50u0TF_M6HRUlgDBsNyTALLOHmYIw33Cj5af0pyJk3QP4qrVpwwyL6b_I-_P-ullBWX1hS6iHarTnlq1q_U9vntPVxUQtZN05xypkNqhXavI0slAkYZ8XA5uSbPIqX4E6HmWlbJtSGDq98UTt6JGYPpx_Q0tdy2BQL0rinCJvBkhnOi54QqV_sKtB8r4mebBJezsTuRtMZysISKRe7W2hB1QipoLySFNNRAVOgi2qWEZBHIO4_lVwiTpKVRcdONIl8Lgn4yRYcAn7L-k-ZjtHj0g2lCw21Gj6lRzyFQend4pc7SORrdOHdIBfHPyA1fkDRDimcvv4wrsaqZvFv5XVPO9qKswrQWxkfRzn7Kt_-cWYv818xY.' \
  -H 'Content-Type: multipart/form-data' \
  -F attributes='{"name": "mongodb_logs2.zip", "parent": {"id": "214562146277"}}' \
  -F file=@mongodb_logs2.zip > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17.8M  100  1170  100 17.8M    213  3330k  0:00:05  0:00:05 --:--:-- 3471k

JIRAUSER1270794 commented on Fri, 4 Aug 2023 13:52:20 +0000: Hi oraiches@zadarastorage.com, Thank you for uploading the data. Unfortunately, it seems the data wasn't uploaded properly, and I couldn't see it in the portal. I've tested the link on my end, and it worked. Would you please try to upload it again? Please make sure the files are fully uploaded; you should see progress after initiating the command. One possible cause of a failed upload I've noticed is providing the full path of the file. In other words, navigate to the current location of the file and use just the file name (e.g., "mongodb-logs.tgz") without the preceding path. Please let us know if you make another attempt to upload, and thank you so much for your time!
JIRAUSER1264630 commented on Wed, 2 Aug 2023 14:19:39 +0000: Hi yuan.fang@mongodb.com, Thanks for the fast reply! Uploaded the logs to the link. The timeline was around {"$date":"2023-08-02T06:30:41.227+00:00"} in the logs.

JIRAUSER1270794 commented on Wed, 2 Aug 2023 13:31:24 +0000: Hi oraiches@zadarastorage.com, Thank you for reaching out and bringing to our attention that the issue has occurred again. Here is the new uploader link for the issue that happened this time: Please ensure that you upload the logs/FTDC covering the time the incident occurred. Could you also provide a clear issue timeline? Regards, Yuan

JIRAUSER1264630 commented on Wed, 2 Aug 2023 07:28:44 +0000: Hi yuan.fang@mongodb.com, We got the issue happening again on another system today. This time I collected the logs and $dbpath/diagnostic.data about an hour after the occurrence, so hopefully we have all the info you requested. Can you share a new upload link to send the logs once more?

JIRAUSER1264630 commented on Wed, 19 Jul 2023 15:46:23 +0000: Hi yuan.fang@mongodb.com, I was not able to reproduce the same issue on a different system. I have uploaded the logs of the 2 nodes in a zipped file to your upload portal; note that the $dbpath/diagnostic.data was lost.

JIRAUSER1270794 commented on Wed, 19 Jul 2023 13:54:01 +0000: Hi oraiches@zadarastorage.com, Have you had a chance to reproduce the issue? If not, we can start by looking at the logs — could you upload them to the upload portal? Thank you. Regards, Yuan

JIRAUSER1264630 commented on Tue, 27 Jun 2023 08:40:55 +0000: Hi yuan.fang@mongodb.com, thanks for the reply. Unfortunately, we only have the mongod/mongos logs, since the $dbpath/diagnostic.data got rotated. We will try to reproduce, but are the mongod/mongos logs sufficient in the meantime?
JIRAUSER1270794 commented on Mon, 26 Jun 2023 20:13:04 +0000: Hi oraiches@zadarastorage.com, Thank you for your report. It seems the new shard experienced an issue with synchronization of the change stream. In order to understand why the change stream prior to 14:30 PM is unavailable on the new shard, we need more diagnostic data for further investigation. I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time. For each node in the replica set, spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:
- the mongod/mongos logs
- the $dbpath/diagnostic.data directory (the contents are described here)
Regards, Yuan
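For reference, the per-node packaging requested above can be sketched as follows. This is only an illustration: `collect_diagnostics` is a hypothetical helper, and the log directory and dbPath must be pointed at your deployment's actual `systemLog.path` directory and `storage.dbPath`.

```python
# Sketch: package a node's mongod logs and its dbPath/diagnostic.data
# directory into a single archive, as the upload portal asks for.
import tarfile
from pathlib import Path

def collect_diagnostics(log_dir, dbpath, out_path):
    """Create a gzip tarball containing the log directory (as 'logs/')
    and dbPath/diagnostic.data (as 'diagnostic.data/')."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(log_dir, arcname="logs")
        tar.add(str(Path(dbpath) / "diagnostic.data"), arcname="diagnostic.data")
```

One archive per node keeps the FTDC data and logs for that node together, which makes correlating the incident timeline easier.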