Hi, I'm relatively new to Mongo and seem to have stumbled upon an apparent race condition / bug in a sharded Mongo deployment. Please advise if I'm missing something fundamental.

Problem Summary: Very rarely, though reproducibly, duplicate documents can be inserted into a hashed sharded collection, even when a unique index exists that should explicitly prevent this from occurring. Please see the details below.

Test Environment: The race condition was originally reproduced with a larger deployment (3-member replica sets), but it was simplified as follows for reproduction.

General environment:
Mongo Community Edition v3.6.6
CentOS Linux release 7.4.170
Everything running on localhost

Minimum sharded cluster:
1 config server (replica set with only 1 member)
3 shards (replica set with only 1 member per shard) using WiredTiger storage
2 mongoses

Application: a test application that uses the Mongo C driver. Each instance connects to both mongoses and issues insert & delete commands as described later.

Test Purpose: In a sharded collection using hashed sharding, verify that duplicate documents (those with the same shard key value) cannot be inserted but are instead rejected. Two good examples are shown below where a "duplicate key error" is returned because the shard key value already exists in the collection.

Good Example #1: using _id as the hashed shard key

mongos> db.repo1.insert( { "_id" : "a9223000000000000001", "fn1" : "1" } )
WriteResult({ "nInserted" : 1 })
mongos> db.repo1.insert( { "_id" : "a9223000000000000001", "fn1" : "1" } )
WriteResult({
    "nInserted" : 0,
    "writeError" : {
        "code" : 11000,
        "errmsg" : "E11000 duplicate key error collection: dbClientTest.repo1 index: _id_ dup key: { : \"a9223000000000000001\" }"
    }
})

Good Example #2: using the "repo1pk" field name as the hashed shard key (with a unique index also defined)

mongos> db.repo1.insert( { "repo1pk" : "a9223000000000000001", "fn1" : "1" } )
WriteResult({ "nInserted" : 1 })
mongos> db.repo1.insert( { "repo1pk" : "a9223000000000000001", "fn1" : "1" } )
WriteResult({
    "nInserted" : 0,
    "writeError" : {
        "code" : 11000,
        "errmsg" : "E11000 duplicate key error collection: dbClientTest.repo1 index: repo1pk_1 dup key: { : \"a9223000000000000001\" }"
    }
})

Example #2 is really how my application behaves; I reproduced the problem below using _id just to see if it would behave differently (it doesn't).

Race Condition / Bug: I wrote a test application that attempts to insert & delete a single document - repeat forever - where "_id" is my hashed shard key:

{ "_id" : "a9223000000000000001", "fn1" : "1", "fn2" : 2 }

Four instances of my test driver are started and compete trying to create & delete the single document above. The sharded collection should contain at most one document at any given time, and this is generally the case. However, very rarely a duplicate document is inserted. My tool detects this during document deletion when Mongo responds with { "deletedCount" : 2 }, since it deletes all documents matching { "fn1" : "1" }.
When my tool detects that more than one document was deleted it spits out:

[TID 0x7f3d07fdf700] Unexpected: 2 documents deleted; expecting 1

For example, during a 5 minute test run there were 270,000 create (insert) attempts, of which 134,514 were successful because the document didn't yet exist, and 141 duplicate documents were detected during the delete cycle:

[TID 0x7fe08f7fe700] Create Sent=65000 OK=31914; Delete Sent=65000 OK=32214 >1Total=33
[TID 0x7f68daffd700] Create Sent=65000 OK=32312; Delete Sent=65000 OK=32308 >1Total=28
[TID 0x7fbe30fd9700] Create Sent=70000 OK=35301; Delete Sent=70000 OK=34969 >1Total=44
[TID 0x7fbb6904b700] Create Sent=70000 OK=34987; Delete Sent=70000 OK=34859 >1Total=36

To confirm that duplicate documents were actually inserted, I wrote a reader that constantly queries on the "fn1" field name, where it can sometimes catch instances of duplicate documents as shown below:

{
    "totalRecordsRetrieved" : 2,
    "records" : [
        { "_id" : "a9223000000000000001", "fn1" : "1", "fn2" : 2 },
        { "_id" : "a9223000000000000001", "fn1" : "1", "fn2" : 2 }
    ]
}

The above behavior appears to be an unexpected race condition in light of the following. The Mongo documentation notes that this shouldn't be possible because of the unique index on _id (or, in the other case, because I explicitly create a unique index when using "repo1pk" as the hashed shard key):

"By default, MongoDB creates a unique index on the _id field during the creation of a collection. The unique constraint applies to separate documents in the collection. That is, the unique index prevents separate documents from having the same value for the indexed key."
https://docs.mongodb.com/manual/core/index-unique/

Also, Kevin Adistambha with Mongo recently wrote that when inserting or updating a document, "The index and the collection will be updated as a single atomic operation."

So it would seem that an attempt to insert a duplicate document should be caught and rejected due to the presence of a unique index?

Other data:

This is the dummy document that is created and deleted, repeat forever:

mongos> db.repo1.find()
{ "_id" : "a9223000000000000001", "fn1" : "1", "fn2" : 2 }

The set of defined indexes for my test. I doubt this matters, but I've reproduced with & without the index on "fn1" below.

mongos> db.repo1.getIndexes()
[
    { "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "dbClientTest.repo1" },
    { "v" : 2, "key" : { "_id" : "hashed" }, "name" : "_id_hashed", "ns" : "dbClientTest.repo1" },
    { "v" : 2, "key" : { "fn1" : 1 }, "name" : "fn1_1", "ns" : "dbClientTest.repo1" }
]

Here is the shard status when reproducing the problem. Curious: why is "unique: false" stated below, even though a unique index is created by default (for _id) and is otherwise (usually) enforced?
mongos> sh.status()
--- Sharding Status ---
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("5b7ad9a050c13af8f291f1f9")
  }
  shards:
    { "_id" : "shard1", "host" : "shard1/localhost:50001", "state" : 1 }
    { "_id" : "shard2", "host" : "shard2/localhost:50002", "state" : 1 }
    { "_id" : "shard3", "host" : "shard3/localhost:50003", "state" : 1 }
  active mongoses:
    "3.6.6" : 2
  autosplit:
    Currently enabled: yes
  balancer:
    Currently enabled: yes
    Currently running: no
    Failed balancer rounds in last 5 attempts: 0
    Migration Results for the last 24 hours:
      8 : Success
  databases:
    { "_id" : "config", "primary" : "config", "partitioned" : true }
      config.system.sessions
        shard key: { "_id" : 1 }
        unique: false
        balancing: true
        chunks:
          shard1  1
        { "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard1 Timestamp(1, 0)
    { "_id" : "dbClientTest", "primary" : "shard3", "partitioned" : true }
      dbClientTest.repo1
        shard key: { "_id" : "hashed" }
        unique: false
        balancing: true
        chunks:
          shard1  2
          shard2  2
          shard3  2
        { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-6148914691236517204") } on : shard1 Timestamp(3, 2)
        { "_id" : NumberLong("-6148914691236517204") } -->> { "_id" : NumberLong("-3074457345618258602") } on : shard1 Timestamp(3, 3)
        { "_id" : NumberLong("-3074457345618258602") } -->> { "_id" : NumberLong(0) } on : shard2 Timestamp(3, 4)
        { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("3074457345618258602") } on : shard2 Timestamp(3, 5)
        { "_id" : NumberLong("3074457345618258602") } -->> { "_id" : NumberLong("6148914691236517204") } on : shard3 Timestamp(3, 6)
        { "_id" : NumberLong("6148914691236517204") } -->> { "_id" : { "$maxKey" : 1 } } on : shard3 Timestamp(3, 7)
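For illustration, here is a minimal sketch of what one competing create/delete worker does, written against the Mongo C driver (libmongoc 1.10 or newer), which the test tool also uses. This is not the reporter's actual tool; the connection URI, loop bound, and error handling are simplified placeholder assumptions.

/*
 * Sketch of one competing worker: insert the fixed document, then
 * deleteMany on fn1, and flag any deletedCount > 1.  Placeholder URI;
 * a duplicate key error (11000) on insert is expected and ignored.
 */
#include <mongoc/mongoc.h>
#include <stdio.h>

int main (void)
{
   mongoc_init ();

   mongoc_client_t *client =
      mongoc_client_new ("mongodb://localhost:60000/?appname=dup-doc-test");
   mongoc_collection_t *coll =
      mongoc_client_get_collection (client, "dbClientTest", "repo1");

   for (int i = 0; i < 100000; i++) {
      bson_error_t error;
      bson_t reply;

      /* Attempt to create the single test document; this fails with a
       * duplicate key error whenever the document already exists. */
      bson_t *doc = BCON_NEW ("_id", BCON_UTF8 ("a9223000000000000001"),
                              "fn1", BCON_UTF8 ("1"),
                              "fn2", BCON_INT32 (2));
      mongoc_collection_insert_one (coll, doc, NULL, NULL, &error);
      bson_destroy (doc);

      /* Delete everything matching fn1 == "1"; at most one document
       * should ever exist, so deletedCount > 1 indicates a duplicate. */
      bson_t *selector = BCON_NEW ("fn1", BCON_UTF8 ("1"));
      if (mongoc_collection_delete_many (coll, selector, NULL, &reply, &error)) {
         bson_iter_t iter;
         if (bson_iter_init_find (&iter, &reply, "deletedCount") &&
             bson_iter_as_int64 (&iter) > 1) {
            fprintf (stderr, "Unexpected: %lld documents deleted; expecting 1\n",
                     (long long) bson_iter_as_int64 (&iter));
         }
      }
      bson_destroy (selector);
      bson_destroy (&reply);
   }

   mongoc_collection_destroy (coll);
   mongoc_client_destroy (client);
   mongoc_cleanup ();
   return 0;
}

Running several instances of a loop like this against the same collection is what occasionally yields deletedCount > 1, per the description above.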
nick.brewer commented on Mon, 27 Aug 2018 19:30:23 +0000:

jmmec Thanks for the detailed report - I've used a similar test to reproduce the behavior you're describing. After talking this over with our Query team, I believe the behavior you're seeing is expected, albeit somewhat difficult to trigger. I'll try to explain both of the cases you've identified where more than one document is returned.

Two documents deleted:

In MongoDB, write operations are automatically advanced to the latest snapshot to ensure that you are not writing to (or removing) data that was modified between the time you issued your delete and the time it was executed by the mongod. So, for example, if you issue a delete but it is queued behind several other write operations and is not performed until a few seconds later, a snapshot of the latest information is used. Since you're using deleteMany, the order of what is actually occurring looks something like this:

A cursor is established to iterate through the matching documents.
The cursor returns a matching document.
That document gets deleted.
The multi-delete operation starts using the most recent snapshot. In that snapshot, the deleted document has been recreated by a concurrently executed write operation.
This new matching document gets deleted.

Each document is deleted within a separate storage-level transaction, which may see a different snapshot of the data. This makes it possible to delete the document, and then pick up a new snapshot that reflects the insertion of a new document.

Two documents returned via find():

In reproducing this, the relevant piece of information uncovered was that a yield takes place whenever more than one document is returned, as shown by docsExamined and numYields:

keysExamined:0 docsExamined:2 cursorExhausted:1 numYields:1 nreturned:2 reslen:180 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_msg 12ms

With WiredTiger, read operations see a consistent snapshot of data until they yield; at yield points (and only at yield points), a query can change the snapshot from which it is reading. As a result, the default read concern may end up using multiple snapshots for read isolation. If you were to use transactions with the snapshot read concern, it would force your query to use only one snapshot, and you would never see multiple documents returned from such a query. What is happening here is that the find() finds the one document that exists in the snapshot used when the read is initialized, then yields its read lock to another read operation; when the other read finishes and the find() continues, the query uses a newer snapshot and finds the second document that was inserted while it was yielded.

-Nick

jmmec commented on Thu, 23 Aug 2018 17:07:34 +0000:

Hi, here is a brief update with some new logs, from a different system than the one described yesterday (2018-08-22).

System #2: Today I reproduced the duplicate-insert problem with Mongo v3.6.7 (previously v3.6.6) in a different container in a different network, and did not (over several test attempts) hit the OpTime/rollback logs hit on 2018-08-22.
However, on one occasion I did hit new logs that I've not seen before:

2018-08-23T14:40:01.393+0000 I COMMAND [conn4] command dbClientTest.repo1 command: insert { insert: "repo1", ordered: true, $db: "dbClientTest", lsid: { id: UUID("4022827c-8e35-4370-a8db-09e0b44fd79a") }, $clusterTime: { clusterTime: Timestamp(1535035201, 431), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } } ninserted:1 keysInserted:1 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 169ms
..etc..
2018-08-23T14:40:45.719+0000 I COMMAND [conn5] command dbClientTest.$cmd command: delete { delete: "repo1", ordered: true, $db: "dbClientTest", lsid: { id: UUID("83f60713-8dda-4bfe-9849-89e61feb09cd") }, $clusterTime: { clusterTime: Timestamp(1535035245, 1008), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } } numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 41, w: 41 } }, Database: { acquireCount: { w: 41 } }, Collection: { acquireCount: { w: 41 } } } protocol:op_msg 167ms

System #2 Simplified Non-Sharded Configuration:
Mongo Community Edition v3.6.7 (System #1 is using v3.6.6)
Red Hat Enterprise Linux Server release 7.4 (System #1 is running CentOS Linux release 7.4)
Single mongod replica set (single instance @ 127.0.0.1:50001).
Single dbClient instance that connects directly to the mongod.
Two dbClientTest instances that issue insert / delete commands to compete with each other using the non-sharded collection "dbClientTest.repo1" (these communicate with dbClient, which issues the requests to the mongod).
Everything is running within a single docker container with localhost communications.

Logs: Please see the attached log file dated 2018-08-23.

jmmec commented on Wed, 22 Aug 2018 19:36:10 +0000:

Hi Nick, I restarted everything and began following your instructions - and had gotten pretty far reproducing it at each step in a more and more simplified sharded deployment - when I realized that I had forgotten to configure my test collection for sharding. doh... Since it seemed that you wanted to further simplify the deployment to the bare minimum where the problem is still reproducible, it turns out that a sharded deployment isn't needed at all. Below is the simplest configuration where the problem is reproducible, along with answers to some of your more general questions.

Simplified Non-Sharded Configuration:
Single mongod replica set (single instance @ 127.0.0.1:50001).
Single dbClient instance that connects directly to the mongod.
Two dbClientTest instances that issue insert / delete commands to compete with each other using the non-sharded collection "dbClientTest.repo1" (these communicate with dbClient, which issues the requests to the mongod).
Everything is running within a single docker container with localhost communications.

Start the single mongod:

[root@aa2edcb466ee]# mkdir -p /ramdisk/mongo/shard1/db
[root@aa2edcb466ee]# mongod --config ./mongo_s1.cfg
about to fork child process, waiting until server is ready for connections.
forked process: 12660
child process started successfully, parent exiting

I'm reusing my sharded configuration file, but am not starting a mongos or config server:

[root@aa2edcb466ee private]# cat mongo_s1.cfg
sharding:
  clusterRole: shardsvr
replication:
  replSetName: shard1
net:
  bindIp: localhost
  port: 50001
storage:
  dbPath: /ramdisk/mongo/shard1/db
systemLog:
  destination: file
  logAppend: true
  path: /ramdisk/mongo/shard1/logs
processManagement:
  fork: true

Initiate the replica set, and also start my test clients so that the non-sharded collection is created:

[root@aa2edcb466ee]# /usr/bin/mongo 127.0.0.1:50001
MongoDB shell version v3.6.6
connecting to: mongodb://127.0.0.1:50001/test
MongoDB server version: 3.6.6
> rs.initiate( { _id: "shard1", members: [ { _id : 0, host : "127.0.0.1:50001" } ] } )
{ "ok" : 1 }
shard1:SECONDARY>
shard1:PRIMARY> db.serverStatus().storageEngine
{ "name" : "wiredTiger", "supportsCommittedReads" : true, "readOnly" : false, "persistent" : true }
shard1:PRIMARY> use dbClientTest
switched to db dbClientTest
shard1:PRIMARY> db.repo1.getIndexes()
[ { "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "dbClientTest.repo1" } ]

Only one mongod instance exists, as expected:

[root@aa2edcb466ee]# ps -ef | grep mongo
root 12660 0 17 18:19 ? 00:00:39 mongod --config ./mongo_s1.cfg

There are only 3 connections from dbClient to mongod, where 2 are the insert / delete test clients, and 1 is doing SDAM in the C client library:

[root]# netstat -anp | grep 50001
tcp 0 0 127.0.0.1:50001 0.0.0.0:* LISTEN 12660/mongod
tcp 0 0 127.0.0.1:47686 127.0.0.1:50001 ESTABLISHED 12750/./dbClient_sh
tcp 0 105 127.0.0.1:50001 127.0.0.1:47686 ESTABLISHED 12660/mongod
tcp 0 0 127.0.0.1:50001 127.0.0.1:47684 ESTABLISHED 12660/mongod
tcp 0 272 127.0.0.1:50001 127.0.0.1:47688 ESTABLISHED 12660/mongod
tcp 0 0 127.0.0.1:47684 127.0.0.1:50001 ESTABLISHED 12750/./dbClient_sh
tcp 0 134 127.0.0.1:47688 127.0.0.1:50001 ESTABLISHED 12750/./dbClient_sh
unix 2 [ ACC ] STREAM LISTENING 60415657 12660/mongod /tmp/mongodb-50001.sock

Disk usage is always OK and CPU is maxed out near 100%:

[root@aa2edcb466ee private]# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 17G 14G 3.5G 80% /
tmpfs 64M 0 64M 0% /dev
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/mapper/centos-root 17G 14G 3.5G 80% /etc/hosts
shm 64M 192K 64M 1% /dev/shm

The duplicate document problem is reproduced; refer to the attached file "2018-08-22 Server-36797 Logs.txt", which has the details. There are suspicious logs regarding OpTime and potential rollback. See my other comments in the attached text log file about the location of other logs you requested.

Delete and Query Commands (a C-driver sketch of the reader query follows at the end of this comment):

The delete command that detects duplicate documents is similar to this query (but of course executed via the Mongo C driver):

shard1:PRIMARY> db.repo1.deleteMany( { "fn1" : { "$eq" : "1" } } )
{ "acknowledged" : true, "deletedCount" : 1 }

The query command (my reader) that can sometimes detect duplicate documents is similar to this (but of course executed via the Mongo C driver):

shard1:PRIMARY> db.repo1.find( { "fn1" : { "$eq" : "1" } } )

Other: For completeness, in case you ask about this:

[root@aa2edcb466ee private]# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
[root@aa2edcb466ee private]# cat /sys/kernel/mm/transparent_hugepage/defrag
always madvise [never]

Please let me know if you need other data captured, especially any clarifications about where log files are written, etc.
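As referenced above, a rough sketch of what the reader's query looks like through the Mongo C driver. This is not the actual dbClient code; the URI and the simple result counting are placeholder assumptions (libmongoc 1.10 or newer).

/*
 * Reader sketch: equivalent of db.repo1.find( { "fn1" : { "$eq" : "1" } } ),
 * counting how many matching documents come back on the cursor.
 */
#include <mongoc/mongoc.h>
#include <stdio.h>

int main (void)
{
   mongoc_init ();

   /* Placeholder URI for the single-member replica set described above. */
   mongoc_client_t *client =
      mongoc_client_new ("mongodb://127.0.0.1:50001/?replicaSet=shard1");
   mongoc_collection_t *coll =
      mongoc_client_get_collection (client, "dbClientTest", "repo1");

   bson_t *filter = BCON_NEW ("fn1", "{", "$eq", BCON_UTF8 ("1"), "}");
   mongoc_cursor_t *cursor =
      mongoc_collection_find_with_opts (coll, filter, NULL, NULL);

   const bson_t *doc;
   int64_t total = 0;
   while (mongoc_cursor_next (cursor, &doc)) {
      total++;   /* more than one result means a duplicate was observed */
   }
   printf ("totalRecordsRetrieved: %lld\n", (long long) total);

   mongoc_cursor_destroy (cursor);
   bson_destroy (filter);
   mongoc_collection_destroy (coll);
   mongoc_client_destroy (client);
   mongoc_cleanup ();
   return 0;
}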
nick.brewer commented on Tue, 21 Aug 2018 20:53:36 +0000:

jmmec Thanks for your report. I'd like to confirm a few things:

How many connections are being established/running inserts/deletes?
Can you reproduce this issue with a single shard?
Can you reproduce this issue when only connecting your application to a single mongos (as opposed to 2)?
Can you reproduce this issue on a single-node replica set (not connecting through mongos)?
Would you please provide the exact query that the reader is using?
Are you able to reproduce this issue when the index isn't hashed?

Additionally, could you increase the log level via db.setLogLevel(1, "query") and then include logs from a time when this behavior occurs?

Thanks, Nick

jmmec commented on Tue, 21 Aug 2018 19:01:38 +0000:

Before opening this bug, I searched for existing bugs and also did more investigation in addition to what is noted above; here is a summary.

I've reproduced the problem using two different approaches for creating a document, where for both approaches the hashed shard key is configured with a unique index. Below I reference the Mongo CLI commands, although realize that I am using the Mongo C driver (a rough C-driver sketch of approach #2 follows at the end of this comment):

1. db.collection.insertOne()
2. db.collection.updateOne($setOnInsert, upsert:true)

For approach #1: SERVER-28546 (SERVER-28153) has the same symptoms - a duplicate document on insert - but otherwise does not apply.

For approach #2: SERVER-20380 from 2015 is similar to my race condition, except it was solved by creating a unique index, which is already the case here. SERVER-14322 (still open from 2014) doesn't apply, but does confirm that the race condition shouldn't happen according to the reference below, since my shard key has a unique index defined ("_id" by default, or "repo1pk" by explicitly configuring it):

"Warning: To avoid inserting the same document more than once, only use upsert: true if the query field is uniquely indexed."
https://docs.mongodb.com/manual/reference/method/db.collection.update/#use-unique-indexes

General:

SERVER-23688 (2017): Closed since the issue was no longer reproducible, but there were unexpected duplicate keys on a replica set that caused a crash. The user commenting at the link below unexpectedly discovered duplicate documents in a sharded deployment.
https://jira.mongodb.org/browse/SERVER-23688?focusedCommentId=1477308&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1477308

SERVER-23794: The same symptoms, though apparently a non-sharded deployment; the root cause could not be determined.

SERVER-30708: A sharded cluster with duplicate documents, but only for reads on secondaries (this issue has other problems too).
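As mentioned above, a rough sketch of approach #2 (create-if-absent via updateOne with $setOnInsert and upsert:true) through the Mongo C driver. This is not the reporter's actual code; the URI and field values are placeholders, and it assumes the collection is sharded on a uniquely indexed "repo1pk" field (libmongoc 1.10 or newer).

/*
 * Approach #2 sketch: upsert keyed on the uniquely indexed shard key.
 * $setOnInsert only takes effect when the upsert actually inserts.
 */
#include <mongoc/mongoc.h>
#include <stdio.h>

int main (void)
{
   mongoc_init ();

   mongoc_client_t *client = mongoc_client_new ("mongodb://localhost:60000");
   mongoc_collection_t *coll =
      mongoc_client_get_collection (client, "dbClientTest", "repo1");

   /* Match on the (uniquely indexed) shard key value. */
   bson_t *selector = BCON_NEW ("repo1pk", BCON_UTF8 ("a9223000000000000001"));
   bson_t *update = BCON_NEW ("$setOnInsert", "{",
                                 "fn1", BCON_UTF8 ("1"),
                                 "fn2", BCON_INT32 (2),
                              "}");
   bson_t *opts = BCON_NEW ("upsert", BCON_BOOL (true));

   bson_t reply;
   bson_error_t error;
   if (!mongoc_collection_update_one (coll, selector, update, opts, &reply, &error)) {
      fprintf (stderr, "updateOne failed: %s\n", error.message);
   } else {
      char *json = bson_as_canonical_extended_json (&reply, NULL);
      printf ("reply: %s\n", json);  /* includes matchedCount / upsertedCount */
      bson_free (json);
   }

   bson_destroy (&reply);
   bson_destroy (opts);
   bson_destroy (update);
   bson_destroy (selector);
   mongoc_collection_destroy (coll);
   mongoc_client_destroy (client);
   mongoc_cleanup ();
   return 0;
}

Per the documentation warning quoted above, an upsert like this is only expected to be safe against duplicate inserts because the query field ("repo1pk" here) is uniquely indexed.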
I wrote a test application that is able to reproduce the problem as noted in the Description. If you need my test application in order to reproduce, I'll have to refactor it so that it is more generic and will compile for you; otherwise, I am able to reproduce very easily and can provide any data that is needed.