
OPERATIONAL DEFECT DATABASE

Multi-document transactions in their current implementation (MongoDB 4.0) are very unstable. I'm using transactions to consistently load and analyse a high volume of heterogeneous data (about 30 GB / 300,000,000 documents). I use the latest Node driver with Node v10.6.0 against a local database: a replica set with only one member. A transaction is started like this:

```javascript
session.startTransaction({ readConcern: {level: 'snapshot'}, writeConcern: {w: 1} })
await db.collection('collection1').insertMany([...])
await db.collection('collection2').insertMany([...])
await session.commitTransaction()
```

During the last few weeks I have experienced a full range of exceptions and just can't make it work reliably. Here are some of my observations:

1. When I run a transaction in parallel with any database modification command (such as creating another collection or scheduling an index build), I always get a WriteConflict exception: "MongoError: WriteConflict" (no details, only this message). I run the command from another application, and it operates on a separate collection, so in theory there can't be any write conflicts.

2. When I try to run two transactions in parallel (from two different apps, on separate collections), I always get the same WriteConflict exception in one of the applications. So I can see no way to run two transactions in parallel, even on two different collections from separate applications, and that definitely kills the idea of using this code in production.

3. Sometimes (once or twice a day) I get errors like "MongoError: Transaction 161453 has been aborted." I can't figure out the reason behind them.

4. If I try to run two write operations from the same transaction simultaneously, I get "MongoError: Given transaction number 231 does not match any in-progress transactions."
So this works:

```javascript
session.startTransaction({ readConcern: {level: 'snapshot'}, writeConcern: {w: 1} })
await db.collection('collection1').insertMany([...])
await db.collection('collection2').insertMany([...])
await session.commitTransaction()
```

while this code will eventually throw a "does not match any in-progress transactions" exception:

```javascript
session.startTransaction({ readConcern: {level: 'snapshot'}, writeConcern: {w: 1} })
await Promise.all([
  db.collection('collection1').insertMany([...]),
  db.collection('collection2').insertMany([...])
])
await session.commitTransaction()
```

5. Calling db.stats() or db.collection.stats() (even on another database) frequently makes the running transaction fail with "MongoError: Unable to acquire lock '{8543186066763233601: Database, 1625657039122151745}' within a max lock request timeout of '5ms' milliseconds." Is there any way to increase the lock wait timeout? And why does calling stats() on another collection/database lead to an exception?

I get the same results on both my Windows 10 laptop and my CentOS 7.4 server. Is there any way to get rid of these exceptions? Maybe I'm using the wrong read isolation level? Are there any parameters that can be tuned?
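[Editor's note] The failure in item 4 (concurrent operations on one session) can be avoided in application code by serializing the operations that share a session. A minimal sketch of such a helper — SessionQueue is a hypothetical name of mine, not a driver API:

```javascript
// A minimal queue that serializes async operations, so that operations
// sharing a single session never run concurrently (sidestepping the
// "transaction number does not match" error from item 4).
class SessionQueue {
  constructor() {
    this.tail = Promise.resolve();
  }
  // Enqueue an async operation; it starts only after all previously
  // enqueued operations have settled.
  run(op) {
    const result = this.tail.then(() => op());
    this.tail = result.catch(() => {}); // keep the chain alive on failure
    return result;
  }
}

// Usage sketch (the insertMany calls are illustrative):
// const q = new SessionQueue();
// await Promise.all([
//   q.run(() => db.collection('collection1').insertMany([...], {session})),
//   q.run(() => db.collection('collection2').insertMany([...], {session})),
// ]);
```

With this pattern the application can still fan out with Promise.all, while the queue guarantees the driver only ever sees one in-flight operation per session.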
alyson.cabral commented on Mon, 4 Feb 2019 00:20:11 +0000: Multi-document transactions often have multiple statements with a combination of reads and writes. If you perform reads and then writes based on the results of those reads, the server should not automatically retry the writes in the face of write conflicts, as the results of those original reads may have changed. At snapshot isolation you're not able to block, because all operations must be acting off the same cluster time. We've tried to make retrying as easy as possible by introducing error labels (i.e. TransientTransactionError) instead of having to listen for individual error codes. Here is an example in Node. You'll also notice that in cases where it is safe to block and retry on write conflicts, we do: writes outside of a transaction (since they are inherently single statements) will block behind a completing transaction and auto-retry until MaxTimeMS. I describe the write conflict behavior of transactions in this video.

To illustrate why the server isn't in the best position to orchestrate retries, I'll use the inventory example. Let's say I have a transaction with three statements:

statement 1: read the inventory value
statement 2: update the inventory value
statement 3: insert an order document

Now let's throw some values in there. The application may read an inventory value of 63 and then, because the order contains three of the items, need to update that value to 60. However, at the same time another transaction wrote to that document and updated the inventory value, say to 62. If the server were to retry my update automatically once that other transaction committed, I'd be left with an inventory value of 60 instead of the correct value of 59. The logic to come to the correct result only exists client-side. The MongoDB team has spent a lot of time minimizing write conflicts by having transactions act off of the most current snapshot available.
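[Editor's note] The inventory arithmetic above can be sketched in plain JavaScript (values taken from the comment; both function names are illustrative):

```javascript
// Inventory starts at 63; our order takes 3 items; meanwhile a concurrent
// transaction decrements it first (63 -> 62).

// A blind server-side retry would replay the absolute write computed from
// the stale read, losing the concurrent decrement.
function blindServerRetry(staleRead, ordered) {
  return staleRead - ordered; // 63 - 3 = 60 (wrong final inventory)
}

// A client-side retry re-reads the current value and recomputes.
function clientSideRetry(currentRead, ordered) {
  return currentRead - ordered; // 62 - 3 = 59 (correct final inventory)
}

console.log(blindServerRetry(63, 3)); // 60
console.log(clientSideRetry(62, 3)); // 59
```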
For sustained contention on documents, clients must be prepared to retry in the case of write conflicts. We are currently working on driver helpers that let you simply pass in a transaction callback and auto-retry for you. I'd love to get your early feedback. Can I email you to set something up?

ahildoer commented on Sun, 3 Feb 2019 21:49:14 +0000: I have run into the same issue. I have written a test file that triggers the WriteConflict every time, using the MongoDB 4.0.5-xenial docker container from https://hub.docker.com/_/mongo. Reference: test.js

On line 68 you can turn the problem on/off. I am using async so we can test linear transactions versus concurrent ones. The async function has two parameters. The first tells it how many total sessions/transactions to execute; in each transaction we do one mongo operation: increment an integer field on a document identified by name. I have this set to 2, so we only try to perform two operations in total, regardless of concurrency. The second parameter tells it how concurrent to be. If you set the second parameter to 1, transactions run one after the other and everything works. If you set it to 2, we try to run two document writes at the same time, and in this latter configuration the operation fails with WriteConflict every time.
Note that we are running mongod with the following:

```shell
mongod --smallfiles --bind_ip_all --replSet rs0 --setParameter transactionLifetimeLimitSeconds=60 --setParameter maxTransactionLockRequestTimeoutMillis=10000
```

We initiated the one-node replica set with:

```javascript
rs.initiate({ _id: "rs0", members: [{ _id: 0, host: "test-host:27017", priority: 100 }] });
```

We run the test script with:

```shell
npm install mongoose async  # only run this the first time
node test.js
```

Node version is 10.15.0.

The expected result is that when we run two transactions at the same time, the second transaction waits for the first one's lock to be released, for up to maxTransactionLockRequestTimeoutMillis milliseconds, before erroring out. This is how transactions work in the other databases I have been using for years: MySQL, MSSQL, Postgres, etc. All of them line up transactions when they encounter locks and allow each to run in turn, thereby resolving the concurrency. MongoDB 4 is just not doing this.

The actual result is that, even with maxTransactionLockRequestTimeoutMillis set to 10,000 ms, we get a WriteConflict error within 20 ms, according to the timer output on line 151 of test.js. Either maxTransactionLockRequestTimeoutMillis is being ignored, or something else is triggering the write conflict error.

Side note: I am actually flabbergasted at MongoDB transactions and their seeming uselessness. I feel like the point of transactions is to offload concurrency issues to the DB, and I have never had issues with transactions on a case as simple as this. Years ago I used Percona's variation of MongoDB (with TokuDB), which had transactions as far back as 2015, and it worked fine with the scenario in the attached test script. All the major SQL flavors also have zero issues with the test case presented here. I feel like this example illustrates a nearly fundamental, main use case of transactions, and yet out of the box MongoDB is just failing.
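[Editor's note] Besides the startup flag used above, the lock timeout parameter can also be changed on a running server via setParameter; a sketch, with the 10000 ms value being illustrative:

```shell
# At startup (as in the command above):
mongod --replSet rs0 --setParameter maxTransactionLockRequestTimeoutMillis=10000

# Or at runtime, from the mongo shell:
mongo --eval 'db.adminCommand({ setParameter: 1, maxTransactionLockRequestTimeoutMillis: 10000 })'
```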
After being a MongoDB fan-boy for nearly a decade, I feel very let down right now.

alyson.cabral commented on Mon, 1 Oct 2018 15:44:08 +0000: I'm going to close this ticket out for now. Please reopen if you have any further questions.

alyson.cabral commented on Wed, 12 Sep 2018 12:47:13 +0000: Hi Christian, Reads and writes in MongoDB, even within transactions, take locks at the document level. However, other operations, such as DDL operations, require stronger locks on the namespace. You get the lock timeout error when an operation that requires such a strong lock is pending (waiting for intent locks to clean up) and a new transaction tries to acquire a new intent lock on the namespace that the strong lock is pending on. I talk about this behavior in my transactions talk: https://www.youtube.com/watch?v=3x_Pf9rQGCo&t=1425 Are you running commands like create collection, create index, drop collection, or other DDL operations in your transaction tests? When this ticket was opened, we also discovered that dbstats was unnecessarily taking this strong lock and opened SERVER-36437 as a result. Check out our documentation for more details. If desired, you can change the default 5ms wait to avoid or minimize the error by using the maxTransactionLockRequestTimeoutMillis parameter.

Lastly, it is a best practice that all transactional applications against MongoDB have appropriate retry logic that listens for the TransientTransactionError label. This label is present on all errors where it is safe to retry the entire transaction from the beginning. The timeout error carries this label, so you do not need to listen and respond to each error individually.
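[Editor's note] The retry best practice described here might be sketched as follows. This is only an illustration: withRetry and flakyTxn are hypothetical names, and a real transaction body would also recreate the transaction through the driver's session API on each attempt:

```javascript
// Retries an async transaction body when the thrown error carries the
// TransientTransactionError label, as recommended in the comment above.
async function withRetry(txnBody, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await txnBody();
    } catch (err) {
      const transient = Array.isArray(err.errorLabels) &&
        err.errorLabels.includes('TransientTransactionError');
      if (!transient || attempt >= maxAttempts) throw err;
      // Transient: safe to retry the whole transaction from the beginning.
    }
  }
}

// Stub demonstrating the control flow: fails once with a transient
// label, then succeeds.
let calls = 0;
async function flakyTxn() {
  calls++;
  if (calls === 1) {
    const err = new Error('WriteConflict');
    err.errorLabels = ['TransientTransactionError'];
    throw err;
  }
  return 'committed';
}

withRetry(flakyTxn).then(result => console.log(result)); // prints "committed"
```

Errors without the label (or errors that persist past maxAttempts) are rethrown, so non-retryable failures still surface to the caller.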
Aly

christian@sparre.it commented on Wed, 12 Sep 2018 09:57:32 +0000: I'm seeing something similar, and this was the only issue I could find that seems related. I'm getting "Command insert failed: Unable to acquire lock '{8466156977036877082: Database, 1548627949395795226}' within a max lock request timeout of '5ms' milliseconds." when concurrently inserting documents, using transactions, across multiple collections in the same database. I only started getting these when running some integration tests concurrently that create a new collection per test; the collections are in the same database. I don't know much about the internals of MongoDB, but I was not expecting a transaction on one collection to acquire a lock on the entire database, blocking another transaction on a completely different collection. I'm using the C# driver version 2.7.0 and the mongo:4 docker image. This code reproduces the error for me: https://gist.github.com/christiansparre/d8b03a9ce840cb8237a181e4d5bf8544 Is it expected that transactions lock the entire database?

daniel.aprahamian commented on Fri, 3 Aug 2018 18:22:29 +0000: kreig,

"2. When I try to run two transactions in parallel (from two different apps, on separate collections), I always get the same WriteConflict exception in one of the applications. Therefore, I can see no way to start two parallel transactions (on two different collections from separate applications), and that definitely kills the idea of using this code in production."

I have been unable to reproduce this. I am spamming a server from two different apps with transactions on separate collections, and I am seeing no errors. Are there any details you can provide to help reproduce this? Thanks, Dan

kreig commented on Fri, 3 Aug 2018 18:09:24 +0000: Hey Aly,

"Are you passing in the session to each of the CRUD operations that you hope to execute within the transaction?"
Yes: await db.collection('collection1').insertMany([...], {session}). I can't edit my original question, but I do pass the session parameter to all CRUD operations that belong to a transaction.

"When you say that you are using the local database, does that mean you are actually using the 'local' namespace and that you have collections you've created under the 'local' database?"

No, I'm using a few databases (with names like db_aggregated1) in my replica set, which consists of a single mongod instance on 127.0.0.1.

"Can you share the code that is triggering these exceptions?"

I'm sorry, but that's not possible. The logic is spread across a dozen modules and the code base is huge. In most cases it's insertMany or bulkWrite batches with up to 12,000 objects per transaction.

"Are you listening for the 'TransientTransactionError' label to retry after encountering transient exceptions?"

No, and I'm not sure the exceptions I described fall under the "Transient" category. I'll try to implement retries according to the example you provided. Thanks

alyson.cabral commented on Fri, 3 Aug 2018 17:45:43 +0000: Hello, a couple of quick questions before we address the comments:

1. Are you passing in the session to each of the CRUD operations that you hope to execute within the transaction?
2. When you say that you are using the local database, does that mean you are actually using the "local" namespace and that you have collections you've created under the "local" database?
3. Can you share the code that is triggering these exceptions?
4. Are you listening for the "TransientTransactionError" label to retry after encountering transient exceptions?

Aly Cabral

kreig commented on Fri, 3 Aug 2018 14:47:02 +0000: After playing a while with various readConcern and writeConcern settings, I figured out that the following settings fix problems 1 and 2. There are no WriteConflict errors with concurrent transactions, even if other collections are dropped/created/indexed in parallel.
```javascript
session.startTransaction({ readConcern: {level: 'local'}, writeConcern: {w: 1, j: false} })
```

I'm OK with such a readConcern setting, but the j: false option looks potentially problematic: MongoDB does not wait for w: "majority" writes to be written to the on-disk journal before acknowledging them, so majority write operations could possibly roll back in the event of a transient loss (e.g. crash and restart) of a majority of nodes in a given replica set. My application is able to resume processing after a crash, so it's not a problem if all in-flight transactions are cancelled. But it can be a problem if some of them are applied while others are rolled back. Is there any additional documentation describing the server's behavior in the case of such a "transient loss"?

kreig commented on Fri, 3 Aug 2018 06:19:44 +0000: Some relevant trace records from the MongoDB log:

```
2018-07-31T02:24:00.451+0300 I WRITE [LogicalSessionCacheRefresh] update config.system.sessions command: { q: { _id: { id: UUID("d2abf4ea-00a0-4b63-8c7f-96e2464c6eaf"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) } }, u: { $currentDate: { lastUse: true } }, multi: false, upsert: true } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 keysInserted:1 keysDeleted:1 numYields:1 locks:{ Global: { acquireCount: { r: 6, w: 6 } }, Database: { acquireCount: { w: 5, W: 1 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 81 } }, Collection: { acquireCount: { w: 4 } }, oplog: { acquireCount: { w: 1 } } } 369ms

2018-07-31T02:24:00.451+0300 I COMMAND [conn210] command admin.$cmd command: commitTransaction { commitTransaction: 1, writeConcern: { w: 1 }, txnNumber: 3370, autocommit: false, $clusterTime: { clusterTime: Timestamp(1532993040, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, lsid: { id: UUID("dd629965-f26c-4701-93a7-30e3f792496e") }, $db: "admin" } numYields:0 ok:0 errMsg:"Unable to acquire lock '{8543186066763233601: Database, 1625657039122151745}' within a max lock request timeout of '5ms' milliseconds." errName:LockTimeout errCode:24 reslen:412 locks:{ Global: { acquireCount: { r: 3, w: 1 } }, Database: { acquireCount: { r: 1, w: 3 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 5320 } }, Collection: { acquireCount: { w: 5 } }, oplog: { acquireCount: { r: 1, w: 1 } } } protocol:op_query 225ms
```

And one of the WriteConflict cases (the client sees "MongoError: WriteConflict"):

```
2018-07-31T03:30:26.797+0300 I COMMAND [conn285] command db1.collection1 command: insert { insert: "effects", ordered: false, bypassDocumentValidation: true, txnNumber: 3, autocommit: false, $clusterTime: { clusterTime: Timestamp(1532997009, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, lsid: { id: UUID("12ca5552-ed9f-4ea4-8ca0-dc596617953d") }, $db: "b" } ninserted:12928 keysInserted:64640 numYields:0 ok:0 errMsg:"WriteConflict" errName:WriteConflict errCode:112 reslen:293 locks:{ Global: { acquireCount: { r: 3, w: 1 } }, Database: { acquireCount: { r: 1, w: 1 } }, Collection: { acquireCount: { w: 4 } }, oplog: { acquireCount: { r: 1 } } } protocol:op_query 14084ms

2018-07-31T03:30:26.798+0300 I COMMAND [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after freeMonitoring: 0, after globalLock: 12796, after locks: 12796, after logicalSessionRecordCache: 12796, after network: 12796, after opLatencies: 12796, after opcounters: 12796, after opcountersRepl: 12796, after repl: 12796, after security: 12796, after storageEngine: 12796, after tcmalloc: 12796, after transactions: 12796, after wiredTiger: 12796, at end: 12796 }
```