...
We upgraded a replica set from 2.6 to 3.0 and then to 3.2, each time adding new servers, letting them replicate, and then removing the old servers. On the day that the first 3.2 server was added as a 4th (hidden) node in the 3.0 cluster to start the data sync, we appear to have encountered some data corruption. To be precise, we now have examples of multiple documents in a collection with the same _id. An unbounded find on the collection shows them (only showing the records around the problem one for brevity, and anonymising the collection name):

db['COLLECTION'].find()
{ "_id" : "2016-03-15", "percent" : 7.317073170731707 }
{ "_id" : "2016-03-16", "percent" : 7.4074074074074066 }
{ "_id" : "2016-03-17", "percent" : 6.666666666666667 }
{ "_id" : "2016-03-18", "percent" : 6.944444444444445 }
{ "_id" : "2016-03-18", "percent" : 7.792207792207792 }
{ "_id" : "2016-03-19", "percent" : 7.6923076923076925 }
{ "_id" : "2016-03-20", "percent" : 7.6923076923076925 }
{ "_id" : "2016-03-21", "percent" : 6.756756756756757 }
{ "_id" : "2016-03-22", "percent" : 6.944444444444445 }
{ "_id" : "2016-03-23", "percent" : 7.142857142857142 }

Note that "_id" : "2016-03-18" appears twice. If I try to query directly for this record, only one appears:

db['COLLECTION'].find({ "_id" : "2016-03-18" })
{ "_id" : "2016-03-18", "percent" : 7.792207792207792 }

db['COLLECTION'].find({ "_id" : {$gt: "2016-03-17", $lt: "2016-03-19"} })
{ "_id" : "2016-03-18", "percent" : 7.792207792207792 }

Would a copy of the WiredTiger data files for this collection and its indexes help with analysing this issue?
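For what it's worth, duplicates like this can be surfaced without going through the _id index by grouping over a full collection scan; a minimal shell sketch (collection name anonymised as above):

// Group on the stored _id values; with no preceding $match this pipeline scans
// every document, so it also sees documents whose _id index entry is missing.
db['COLLECTION'].aggregate([
    { $group: { _id: "$_id", count: { $sum: 1 } } },
    // Keep only _id values that occur more than once.
    { $match: { count: { $gt: 1 } } }
])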
milkie commented on Wed, 13 Sep 2017 13:39:54 +0000:
Hi Greg,
After over a year of searching for reasons why this would occur, we have still yet to find a smoking gun. After you repaired the collections, did you ever have a recurrence of this problem? Have you upgraded to 3.4 in the meantime?

gregmurphy commented on Fri, 22 Jul 2016 10:32:16 +0000:
Hi - Yes, journalling is enabled. The process that writes this data uses the Java driver, version 3.2.2.

thomas.schubert commented on Mon, 18 Jul 2016 15:49:20 +0000:
Hi gregmurphy,
Thank you for the detailed timeline. Unfortunately, we haven't been able to reproduce this issue yet. To help us reproduce it, would you please specify which MongoDB driver you were using and its specific version? Additionally, would you please clarify whether journaling is enabled?
Thank you,
Thomas

gregmurphy commented on Tue, 26 Apr 2016 15:59:20 +0000:
I've dug back through the timeline and any available logs to clarify (as well as I can) what happened around that time. This is hopefully an accurate representation of events:
On 12th March the cluster contained 3 x 3.0 nodes (no 2.6 nodes), and a single hidden 3.2 node was added. This took around 4 days to replicate. Replication completed on the 16th, and on the 17th the 3.2 node was shut down to rsync its data directory to one of the additional new 3.2 nodes. Two of these nodes were then started on the 18th, but one was then shut down to rsync to the 3rd new 3.2 node. The 3.2 nodes remained hidden until 24th March.
In answer to your questions:
1. That's correct, no 2.6 nodes were present
2. 3.0.9 (I believe) and 3.2.3
3. Yes, they are upserts
4. Every example I've seen (46 so far) has "_id": "2016-03-18"
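To illustrate the upsert pattern from answer 3: each write is roughly equivalent to the shell operation below. The application actually uses the Java driver, and the field name and values here are just taken from the example output in the description, not the real statements.

db['COLLECTION'].update(
    { _id: "2016-03-18" },                    // keyed on the date string
    { $set: { percent: 7.792207792207792 } }, // later updates only change numeric values
    { upsert: true }                          // insert the document if it does not exist yet
)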
thomas.schubert commented on Fri, 22 Apr 2016 20:24:41 +0000:
Hi gregmurphy,
Thank you for answering my questions, we are continuing to investigate this issue. My understanding is that on March 18th, the cluster contained 3 MongoDB 3.0 nodes and one hidden MongoDB 3.2 node, which was performing an initial sync. The affected documents were inserted that day into a MongoDB 3.0 primary and may have been updated during the initial sync process of the hidden MongoDB 3.2 node. Is this correct?
1. Can you confirm that no MongoDB 2.6 nodes were in the cluster when the duplicated data was first inserted?
2. Can you provide the exact versions of MongoDB running on the cluster when the affected documents were first inserted, and the versions of the nodes in subsequent configurations?
3. Are you upserting these documents?
4. Are any documents with a different _id affected? Are documents with a different _id modified on later dates?
Thank you for your help,
Thomas

gregmurphy commented on Wed, 20 Apr 2016 10:36:03 +0000:
Should have added - I kept the output of some finds against the collections that had the missing elements (note the docs with { "_id" : "2016-03-18" }):

Primary:
RS:PRIMARY> db['COLLECTION'].find({_id: /^2016-03/})
{ "_id" : "2016-03-01", "ReturningUsers" : 17, "NewUsers" : 2 }
{ "_id" : "2016-03-02", "ReturningUsers" : 18, "NewUsers" : 0 }
{ "_id" : "2016-03-03", "ReturningUsers" : 10, "NewUsers" : 1 }
{ "_id" : "2016-03-04", "ReturningUsers" : 10, "NewUsers" : 0 }
{ "_id" : "2016-03-05", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-06", "ReturningUsers" : 14, "NewUsers" : 3 }
{ "_id" : "2016-03-07", "ReturningUsers" : 6, "NewUsers" : 0 }
{ "_id" : "2016-03-08", "ReturningUsers" : 12, "NewUsers" : 1 }
{ "_id" : "2016-03-09", "ReturningUsers" : 12, "NewUsers" : 3 }
{ "_id" : "2016-03-10", "ReturningUsers" : 14, "NewUsers" : 0 }
{ "_id" : "2016-03-11", "ReturningUsers" : 10, "NewUsers" : 1 }
{ "_id" : "2016-03-12", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-13", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-14", "ReturningUsers" : 20, "NewUsers" : 4 }
{ "_id" : "2016-03-15", "ReturningUsers" : 13, "NewUsers" : 0 }
{ "_id" : "2016-03-16", "ReturningUsers" : 20, "NewUsers" : 0 }
{ "_id" : "2016-03-17", "ReturningUsers" : 9, "NewUsers" : 1 }
{ "_id" : "2016-03-18", "ReturningUsers" : 2, "NewUsers" : 0 }
{ "_id" : "2016-03-19", "ReturningUsers" : 18, "NewUsers" : 1 }
{ "_id" : "2016-03-20", "ReturningUsers" : 18, "NewUsers" : 4 }

Secondary 1:
RS:SECONDARY> db['COLLECTION'].find({_id: /^2016-03/})
{ "_id" : "2016-03-01", "ReturningUsers" : 17, "NewUsers" : 2 }
{ "_id" : "2016-03-02", "ReturningUsers" : 18, "NewUsers" : 0 }
{ "_id" : "2016-03-03", "ReturningUsers" : 10, "NewUsers" : 1 }
{ "_id" : "2016-03-04", "ReturningUsers" : 10, "NewUsers" : 0 }
{ "_id" : "2016-03-05", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-06", "ReturningUsers" : 14, "NewUsers" : 3 }
{ "_id" : "2016-03-07", "ReturningUsers" : 6, "NewUsers" : 0 }
{ "_id" : "2016-03-08", "ReturningUsers" : 12, "NewUsers" : 1 }
{ "_id" : "2016-03-09", "ReturningUsers" : 12, "NewUsers" : 3 }
{ "_id" : "2016-03-10", "ReturningUsers" : 14, "NewUsers" : 0 }
{ "_id" : "2016-03-11", "ReturningUsers" : 10, "NewUsers" : 1 }
{ "_id" : "2016-03-12", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-13", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-14", "ReturningUsers" : 20, "NewUsers" : 4 }
{ "_id" : "2016-03-15", "ReturningUsers" : 13, "NewUsers" : 0 }
{ "_id" : "2016-03-16", "ReturningUsers" : 20, "NewUsers" : 0 }
{ "_id" : "2016-03-17", "ReturningUsers" : 9, "NewUsers" : 1 }
{ "_id" : "2016-03-18", "ReturningUsers" : 2 }
{ "_id" : "2016-03-19", "ReturningUsers" : 18, "NewUsers" : 1 }
{ "_id" : "2016-03-20", "ReturningUsers" : 18, "NewUsers" : 4 }

Secondary 2:
RS:SECONDARY> db['COLLECTION'].find({_id: /^2016-03/})
{ "_id" : "2016-03-01", "ReturningUsers" : 17, "NewUsers" : 2 }
{ "_id" : "2016-03-02", "ReturningUsers" : 18, "NewUsers" : 0 }
{ "_id" : "2016-03-03", "ReturningUsers" : 10, "NewUsers" : 1 }
{ "_id" : "2016-03-04", "ReturningUsers" : 10, "NewUsers" : 0 }
{ "_id" : "2016-03-05", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-06", "ReturningUsers" : 14, "NewUsers" : 3 }
{ "_id" : "2016-03-07", "ReturningUsers" : 6, "NewUsers" : 0 }
{ "_id" : "2016-03-08", "ReturningUsers" : 12, "NewUsers" : 1 }
{ "_id" : "2016-03-09", "ReturningUsers" : 12, "NewUsers" : 3 }
{ "_id" : "2016-03-10", "ReturningUsers" : 14, "NewUsers" : 0 }
{ "_id" : "2016-03-11", "ReturningUsers" : 10, "NewUsers" : 1 }
{ "_id" : "2016-03-12", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-13", "ReturningUsers" : 20, "NewUsers" : 1 }
{ "_id" : "2016-03-14", "ReturningUsers" : 20, "NewUsers" : 4 }
{ "_id" : "2016-03-15", "ReturningUsers" : 13, "NewUsers" : 0 }
{ "_id" : "2016-03-16", "ReturningUsers" : 20, "NewUsers" : 0 }
{ "_id" : "2016-03-17", "ReturningUsers" : 9, "NewUsers" : 1 }
{ "_id" : "2016-03-18", "ReturningUsers" : 2 }
{ "_id" : "2016-03-19", "ReturningUsers" : 18, "NewUsers" : 1 }
{ "_id" : "2016-03-20", "ReturningUsers" : 18, "NewUsers" : 4 }
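For reference, the same document can be compared across the members with a small shell script along these lines (the hostnames below are placeholders, not the real members):

// Connect to each member in turn and print its copy of the affected document.
var hosts = ["member0.example:27017", "member1.example:27017", "member2.example:27017"];
hosts.forEach(function (host) {
    var conn = new Mongo(host);
    conn.setSlaveOk(); // allow reads when the member is a secondary
    var doc = conn.getDB("DB").getCollection("COLLECTION").findOne({ _id: "2016-03-18" });
    print(host + " -> " + tojson(doc));
});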
"_id" : "2016-03-10", "ReturningUsers" : 14, "NewUsers" : 0 } { "_id" : "2016-03-11", "ReturningUsers" : 10, "NewUsers" : 1 } { "_id" : "2016-03-12", "ReturningUsers" : 20, "NewUsers" : 1 } { "_id" : "2016-03-13", "ReturningUsers" : 20, "NewUsers" : 1 } { "_id" : "2016-03-14", "ReturningUsers" : 20, "NewUsers" : 4 } { "_id" : "2016-03-15", "ReturningUsers" : 13, "NewUsers" : 0 } { "_id" : "2016-03-16", "ReturningUsers" : 20, "NewUsers" : 0 } { "_id" : "2016-03-17", "ReturningUsers" : 9, "NewUsers" : 1 } { "_id" : "2016-03-18", "ReturningUsers" : 2 } { "_id" : "2016-03-19", "ReturningUsers" : 18, "NewUsers" : 1 } { "_id" : "2016-03-20", "ReturningUsers" : 18, "NewUsers" : 4 } gregmurphy commented on Wed, 20 Apr 2016 09:15:14 +0000: Hi Thomas Yes, this is the same replica set that SERVER-23433 was raised for, so the workload details in there are relevant for this ticket too. Regarding your questions, these answers are in relation to the multiple documents with the same _id: 1. Primary and secondaries have identical output: db['COLLECTION'].getIndexes() [ { "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "DB.COLLECTION" } ] 2. Unfortunately we no longer have the logs as we only retain for 30 days and the problem data was generated on March 18th. I did check the logs at the time though, and didn't notice any errors. 3. Inserts and updates, and the documents are all very small (longest ~200 bytes), and once they've been inserted any updates should just be to change numeric values, not add more elements. 4. Validate output: db['COLLECTION'].validate(true) { "ns" : "DB.COLLECTION", "nrecords" : 85, "nIndexes" : 1, "keysPerIndex" : { "DB.COLLECTION.$_id_" : 84 }, "indexDetails" : { "DB.COLLECTION.$_id_" : { "valid" : true } }, "valid" : false, "errors" : [ "number of _id index entries (84) does not match the number of documents (85)" ], "advice" : "ns corrupt. See http://dochub.mongodb.org/core/data-recovery", "ok" : 1 } Regarding the collections with missing elements on the secondaries (that I have already repaired), I only kept notes about one of the affected collections from the time. From memory, 10-15 exhibited this problem. For the one I kept notes about: getIndexes (same output from primary and secondaries): [ { "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "DB.COLLECTION" } ] Collection validation: { "ns" : "DB.COLLECTION", "nrecords" : 672, "nIndexes" : 1, "keysPerIndex" : { "DB.COLLECTION.$_id_" : 671 }, "indexDetails" : { "DB.COLLECTION.$_id_" : { "valid" : true } }, "valid" : false, "errors" : [ "number of _id index entries (671) does not match the number of documents (672)" ], "advice" : "ns corrupt. See http://dochub.mongodb.org/core/data-recovery", "ok" : 1 } Interestingly this collection also suffers from the duplicate _id issue, and in turn this duplicate record was the specific record that had the missing elements on the secondaries. thomas.schubert commented on Tue, 19 Apr 2016 22:37:01 +0000: Hi gregmurphy, Thank you for reporting this issue. To get a better understanding of what is going on here, please answer the following questions: Can you please provide the output of db.collection.getIndexes() on each node for the collection with documents that had missing elements? Were any server errors logged since beginning this upgrade process? Can you describe what kinds of operations are performed on the affected documents? Do these documents significantly change in size when updated? Is this the same workload as you described in SERVER-23433? 
thomas.schubert commented on Tue, 19 Apr 2016 22:37:01 +0000:
Hi gregmurphy,
Thank you for reporting this issue. To get a better understanding of what is going on here, please answer the following questions:
1. Can you please provide the output of db.collection.getIndexes() on each node for the collection with documents that had missing elements?
2. Were any server errors logged since beginning this upgrade process?
3. Can you describe what kinds of operations are performed on the affected documents? Do these documents significantly change in size when updated? Is this the same workload as you described in SERVER-23433?
4. Can you please provide the output of db.collection.validate(true) executed against an affected collection? Please note that this command may impact the performance of your MongoDB instance.
Kind regards,
Thomas

gregmurphy commented on Tue, 19 Apr 2016 14:28:07 +0000:
I'm in the process of scanning all of our collections for instances like this and have so far found 16. I'll update with the complete figures once the scan is finished.

gregmurphy commented on Tue, 19 Apr 2016 10:34:24 +0000:
As a side note - we also saw some other data corruption on the same day, with elements missing from documents on secondaries. I have already repaired this data by running updates against the primary, which replicated successfully to the secondaries, but it's interesting that this happened to data from exactly the same day as this issue.
It's hard to be precise, but I expect this is connected with the upgrade from 3.0 on MMAPv1 to WiredTiger on 3.2.
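For reference, the repair mentioned above (re-writing the affected documents via the primary so the fix replicates out) could look roughly like the sketch below; the field name and value are taken from the primary's output earlier in this ticket, and the two-step form is only there because an update that does not change the document on the primary is a no-op and writes nothing to the oplog.

// Remove and re-add the missing field on the primary so that both operations
// are replicated and applied on the secondaries.
db['COLLECTION'].update({ _id: "2016-03-18" }, { $unset: { NewUsers: "" } })
db['COLLECTION'].update({ _id: "2016-03-18" }, { $set: { NewUsers: 0 } })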