...
We upgraded a replica set from 2.6 to 3.0 and then to 3.2, each time adding new servers, letting them replicate, and then removing the old servers. On the day that the first 3.2 server was added as a 4th (hidden) node in the 3.0 cluster to start the data sync, we appear to have encountered some data corruption. To be precise, we now have examples of multiple documents in a collection with the same _id. An unbounded find on the collection shows them (only showing the records around the problem one for brevity, and anonymising the collection name):

db['COLLECTION'].find()
{ "_id" : "2016-03-15", "percent" : 7.317073170731707 }
{ "_id" : "2016-03-16", "percent" : 7.4074074074074066 }
{ "_id" : "2016-03-17", "percent" : 6.666666666666667 }
{ "_id" : "2016-03-18", "percent" : 6.944444444444445 }
{ "_id" : "2016-03-18", "percent" : 7.792207792207792 }
{ "_id" : "2016-03-19", "percent" : 7.6923076923076925 }
{ "_id" : "2016-03-20", "percent" : 7.6923076923076925 }
{ "_id" : "2016-03-21", "percent" : 6.756756756756757 }
{ "_id" : "2016-03-22", "percent" : 6.944444444444445 }
{ "_id" : "2016-03-23", "percent" : 7.142857142857142 }

Note that "_id" : "2016-03-18" appears twice. If I try to query directly for this record, only one appears:

db['COLLECTION'].find({ "_id" : "2016-03-18" })
{ "_id" : "2016-03-18", "percent" : 7.792207792207792 }

db['COLLECTION'].find({ "_id" : {$gt: "2016-03-17", $lt: "2016-03-19"} })
{ "_id" : "2016-03-18", "percent" : 7.792207792207792 }

Would a copy of the WiredTiger data files for this collection and its indexes help with analysing this issue?
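For what it's worth, duplicates like this can be surfaced without going through the _id index by grouping over a full collection scan; a minimal shell sketch (collection name anonymised as above):

// Group on the stored _id values; with no preceding $match this pipeline scans
// every document, so it also sees documents whose _id index entry is missing.
db['COLLECTION'].aggregate([
    { $group: { _id: "$_id", count: { $sum: 1 } } },
    // Keep only _id values that occur more than once.
    { $match: { count: { $gt: 1 } } }
])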
milkie commented on Wed, 13 Sep 2017 13:39:54 +0000:
Hi Greg,
After over a year of searching for reasons why this would occur, we have still yet to find a smoking gun. After you repaired the collections, did you ever have a recurrence of this problem? Have you upgraded to 3.4 in the meantime?

gregmurphy commented on Fri, 22 Jul 2016 10:32:16 +0000:
Hi - Yes, journalling is enabled. The process that writes this data uses the Java driver, version 3.2.2.

thomas.schubert commented on Mon, 18 Jul 2016 15:49:20 +0000:
Hi gregmurphy,
Thank you for the detailed timeline. Unfortunately, we haven't been able to reproduce this issue yet. To help us reproduce it, would you please specify which MongoDB driver you were using and its specific version? Additionally, would you please clarify whether journaling is enabled?
Thank you,
Thomas

gregmurphy commented on Tue, 26 Apr 2016 15:59:20 +0000:
I've dug back through the timeline and any available logs to clarify (as well as I can) what happened around that time. This is hopefully an accurate representation of events:
On 12th March the cluster contained 3 x 3.0 nodes (no 2.6 nodes), and a single hidden 3.2 node was added. This took around 4 days to replicate. Replication completed on the 16th, and on the 17th the 3.2 node was shut down to rsync its data directory to one of the additional new 3.2 nodes. Two of these nodes were then started on the 18th, but one was then shut down to rsync to the 3rd new 3.2 node. The 3.2 nodes remained hidden until 24th March.
In answer to your questions:
1. That's correct, no 2.6 nodes were present
2. 3.0.9 (I believe) and 3.2.3
3. Yes, they are upserts
4. Every example I've seen (46 so far) has "_id": "2016-03-18"
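To illustrate the upsert pattern from answer 3: each write is roughly equivalent to the shell operation below. The application actually uses the Java driver, and the field name and values here are just taken from the example output in the description, not the real statements.

db['COLLECTION'].update(
    { _id: "2016-03-18" },                    // keyed on the date string
    { $set: { percent: 7.792207792207792 } }, // later updates only change numeric values
    { upsert: true }                          // insert the document if it does not exist yet
)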
thomas.schubert commented on Fri, 22 Apr 2016 20:24:41 +0000:
Hi gregmurphy,
Thank you for answering my questions, we are continuing to investigate this issue. My understanding is that on March 18th, the cluster contained 3 MongoDB 3.0 nodes and one hidden MongoDB 3.2 node, which was performing an initial sync. The affected documents were inserted that day into a MongoDB 3.0 primary and may have been updated during the initial sync process of the hidden MongoDB 3.2 node. Is this correct?
1. Can you confirm that no MongoDB 2.6 nodes were in the cluster when the duplicated data was first inserted?
2. Can you provide the exact versions of MongoDB running on the cluster when the affected documents were first inserted, and the versions of the nodes in subsequent configurations?
3. Are you upserting these documents?
4. Are any documents with a different _id affected? Are documents with a different _id modified on later dates?
Thank you for your help,
Thomas

gregmurphy commented on Wed, 20 Apr 2016 10:36:03 +0000:
Should have added - I kept the output of some finds against the collections that had the missing elements (note the docs with { "_id" : "2016-03-18" }):

Primary:
RS:PRIMARY> db['COLLECTION'].find({_id: /^2016-03/})
{ "_id" : "2016-03-01", "ReturningUsers" : 17, "NewUsers" : 2 }
{ "_id" : "2016-03-02", "ReturningUsers" : 18, "NewUsers" : 0 }
{ "_id" : "2016-03-03", "ReturningUsers" : 10, "NewUsers" : 1 }
{ "_id" : "2016-03-04", "ReturningUsers" : 10, "NewUsers" : 0 }
{ "_id" : "2016-03-05", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-06", "ReturningUsers" : 14, "NewUsers" : 3 }
{ "_id" : "2016-03-07", "ReturningUsers" : 6, "NewUsers" : 0 }
{ "_id" : "2016-03-08", "ReturningUsers" : 12, "NewUsers" : 1 }
{ "_id" : "2016-03-09", "ReturningUsers" : 12, "NewUsers" : 3 }
{ "_id" : "2016-03-10", "ReturningUsers" : 14, "NewUsers" : 0 }
{ "_id" : "2016-03-11", "ReturningUsers" : 10, "NewUsers" : 1 }
{ "_id" : "2016-03-12", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-13", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-14", "ReturningUsers" : 20, "NewUsers" : 4 }
{ "_id" : "2016-03-15", "ReturningUsers" : 13, "NewUsers" : 0 }
{ "_id" : "2016-03-16", "ReturningUsers" : 20, "NewUsers" : 0 }
{ "_id" : "2016-03-17", "ReturningUsers" : 9, "NewUsers" : 1 }
{ "_id" : "2016-03-18", "ReturningUsers" : 2, "NewUsers" : 0 }
{ "_id" : "2016-03-19", "ReturningUsers" : 18, "NewUsers" : 1 }
{ "_id" : "2016-03-20", "ReturningUsers" : 18, "NewUsers" : 4 }

Secondary 1:
RS:SECONDARY> db['COLLECTION'].find({_id: /^2016-03/})
{ "_id" : "2016-03-01", "ReturningUsers" : 17, "NewUsers" : 2 }
{ "_id" : "2016-03-02", "ReturningUsers" : 18, "NewUsers" : 0 }
{ "_id" : "2016-03-03", "ReturningUsers" : 10, "NewUsers" : 1 }
{ "_id" : "2016-03-04", "ReturningUsers" : 10, "NewUsers" : 0 }
{ "_id" : "2016-03-05", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-06", "ReturningUsers" : 14, "NewUsers" : 3 }
{ "_id" : "2016-03-07", "ReturningUsers" : 6, "NewUsers" : 0 }
{ "_id" : "2016-03-08", "ReturningUsers" : 12, "NewUsers" : 1 }
{ "_id" : "2016-03-09", "ReturningUsers" : 12, "NewUsers" : 3 }
{ "_id" : "2016-03-10", "ReturningUsers" : 14, "NewUsers" : 0 }
{ "_id" : "2016-03-11", "ReturningUsers" : 10, "NewUsers" : 1 }
{ "_id" : "2016-03-12", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-13", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-14", "ReturningUsers" : 20, "NewUsers" : 4 }
{ "_id" : "2016-03-15", "ReturningUsers" : 13, "NewUsers" : 0 }
{ "_id" : "2016-03-16", "ReturningUsers" : 20, "NewUsers" : 0 }
{ "_id" : "2016-03-17", "ReturningUsers" : 9, "NewUsers" : 1 }
{ "_id" : "2016-03-18", "ReturningUsers" : 2 }
{ "_id" : "2016-03-19", "ReturningUsers" : 18, "NewUsers" : 1 }
{ "_id" : "2016-03-20", "ReturningUsers" : 18, "NewUsers" : 4 }

Secondary 2:
RS:SECONDARY> db['COLLECTION'].find({_id: /^2016-03/})
{ "_id" : "2016-03-01", "ReturningUsers" : 17, "NewUsers" : 2 }
{ "_id" : "2016-03-02", "ReturningUsers" : 18, "NewUsers" : 0 }
{ "_id" : "2016-03-03", "ReturningUsers" : 10, "NewUsers" : 1 }
{ "_id" : "2016-03-04", "ReturningUsers" : 10, "NewUsers" : 0 }
{ "_id" : "2016-03-05", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-06", "ReturningUsers" : 14, "NewUsers" : 3 }
{ "_id" : "2016-03-07", "ReturningUsers" : 6, "NewUsers" : 0 }
{ "_id" : "2016-03-08", "ReturningUsers" : 12, "NewUsers" : 1 }
{ "_id" : "2016-03-09", "ReturningUsers" : 12, "NewUsers" : 3 }
{ "_id" : "2016-03-10", "ReturningUsers" : 14, "NewUsers" : 0 }
{ "_id" : "2016-03-11", "ReturningUsers" : 10, "NewUsers" : 1 }
{ "_id" : "2016-03-12", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-13", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-14", "ReturningUsers" : 20, "NewUsers" : 4 }
{ "_id" : "2016-03-15", "ReturningUsers" : 13, "NewUsers" : 0 }
{ "_id" : "2016-03-16", "ReturningUsers" : 20, "NewUsers" : 0 }
{ "_id" : "2016-03-17", "ReturningUsers" : 9, "NewUsers" : 1 }
{ "_id" : "2016-03-18", "ReturningUsers" : 2 }
{ "_id" : "2016-03-19", "ReturningUsers" : 18, "NewUsers" : 1 }
{ "_id" : "2016-03-20", "ReturningUsers" : 18, "NewUsers" : 4 }
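For reference, the same document can be compared across the members with a small shell script along these lines (the hostnames below are placeholders, not the real members):

// Connect to each member in turn and print its copy of the affected document.
var hosts = ["member0.example:27017", "member1.example:27017", "member2.example:27017"];
hosts.forEach(function (host) {
    var conn = new Mongo(host);
    conn.setSlaveOk(); // allow reads when the member is a secondary
    var doc = conn.getDB("DB").getCollection("COLLECTION").findOne({ _id: "2016-03-18" });
    print(host + " -> " + tojson(doc));
});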
"_id" : "2016-03-10", "ReturningUsers" : 14, "NewUsers" : 0 } { "_id" : "2016-03-11", "ReturningUsers" : 10, "NewUsers" : 1 } { "_id" : "2016-03-12", "ReturningUsers" : 20, "NewUsers" : 1 } { "_id" : "2016-03-13", "ReturningUsers" : 20, "NewUsers" : 1 } { "_id" : "2016-03-14", "ReturningUsers" : 20, "NewUsers" : 4 } { "_id" : "2016-03-15", "ReturningUsers" : 13, "NewUsers" : 0 } { "_id" : "2016-03-16", "ReturningUsers" : 20, "NewUsers" : 0 } { "_id" : "2016-03-17", "ReturningUsers" : 9, "NewUsers" : 1 } { "_id" : "2016-03-18", "ReturningUsers" : 2 } { "_id" : "2016-03-19", "ReturningUsers" : 18, "NewUsers" : 1 } { "_id" : "2016-03-20", "ReturningUsers" : 18, "NewUsers" : 4 } gregmurphy commented on Wed, 20 Apr 2016 09:15:14 +0000: Hi Thomas Yes, this is the same replica set that SERVER-23433 was raised for, so the workload details in there are relevant for this ticket too. Regarding your questions, these answers are in relation to the multiple documents with the same _id: 1. Primary and secondaries have identical output: db['COLLECTION'].getIndexes() [ { "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "DB.COLLECTION" } ] 2. Unfortunately we no longer have the logs as we only retain for 30 days and the problem data was generated on March 18th. I did check the logs at the time though, and didn't notice any errors. 3. Inserts and updates, and the documents are all very small (longest ~200 bytes), and once they've been inserted any updates should just be to change numeric values, not add more elements. 4. Validate output: db['COLLECTION'].validate(true) { "ns" : "DB.COLLECTION", "nrecords" : 85, "nIndexes" : 1, "keysPerIndex" : { "DB.COLLECTION.$_id_" : 84 }, "indexDetails" : { "DB.COLLECTION.$_id_" : { "valid" : true } }, "valid" : false, "errors" : [ "number of _id index entries (84) does not match the number of documents (85)" ], "advice" : "ns corrupt. See http://dochub.mongodb.org/core/data-recovery", "ok" : 1 } Regarding the collections with missing elements on the secondaries (that I have already repaired), I only kept notes about one of the affected collections from the time. From memory, 10-15 exhibited this problem. For the one I kept notes about: getIndexes (same output from primary and secondaries): [ { "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "DB.COLLECTION" } ] Collection validation: { "ns" : "DB.COLLECTION", "nrecords" : 672, "nIndexes" : 1, "keysPerIndex" : { "DB.COLLECTION.$_id_" : 671 }, "indexDetails" : { "DB.COLLECTION.$_id_" : { "valid" : true } }, "valid" : false, "errors" : [ "number of _id index entries (671) does not match the number of documents (672)" ], "advice" : "ns corrupt. See http://dochub.mongodb.org/core/data-recovery", "ok" : 1 } Interestingly this collection also suffers from the duplicate _id issue, and in turn this duplicate record was the specific record that had the missing elements on the secondaries. thomas.schubert commented on Tue, 19 Apr 2016 22:37:01 +0000: Hi gregmurphy, Thank you for reporting this issue. To get a better understanding of what is going on here, please answer the following questions: Can you please provide the output of db.collection.getIndexes() on each node for the collection with documents that had missing elements? Were any server errors logged since beginning this upgrade process? Can you describe what kinds of operations are performed on the affected documents? Do these documents significantly change in size when updated? Is this the same workload as you described in SERVER-23433? 
thomas.schubert commented on Tue, 19 Apr 2016 22:37:01 +0000:
Hi gregmurphy,
Thank you for reporting this issue. To get a better understanding of what is going on here, please answer the following questions:
1. Can you please provide the output of db.collection.getIndexes() on each node for the collection with documents that had missing elements?
2. Were any server errors logged since beginning this upgrade process?
3. Can you describe what kinds of operations are performed on the affected documents? Do these documents significantly change in size when updated? Is this the same workload as you described in SERVER-23433?
4. Can you please provide the output of db.collection.validate(true) executed against an affected collection? Please note that this command may impact the performance of your MongoDB instance.
Kind regards,
Thomas

gregmurphy commented on Tue, 19 Apr 2016 14:28:07 +0000:
I'm in the process of scanning all of our collections for instances like this and have so far found 16. I'll update with the complete figures once the scan is finished.

gregmurphy commented on Tue, 19 Apr 2016 10:34:24 +0000:
As a side note - we also saw some other data corruption on the same day, with elements missing from documents on secondaries. I have already repaired this data by running updates against the primary, which replicated successfully to the secondaries, but it's interesting that this happened to data from exactly the same day as this issue.
It's hard to be precise, but I expect this is connected with the upgrade from 3.0 on MMAPv1 to WiredTiger on 3.2.
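For reference, the repair mentioned above (re-writing the affected documents via the primary so the fix replicates out) could look roughly like the sketch below; the field name and value are taken from the primary's output earlier in this ticket, and the two-step form is only there because an update that does not change the document on the primary is a no-op and writes nothing to the oplog.

// Remove and re-add the missing field on the primary so that both operations
// are replicated and applied on the secondaries.
db['COLLECTION'].update({ _id: "2016-03-18" }, { $unset: { NewUsers: "" } })
db['COLLECTION'].update({ _id: "2016-03-18" }, { $set: { NewUsers: 0 } })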