...
We are seeing consistent, periodic degradation on our primary, visible in our mongod logs. Every 60-90 seconds or so a large batch of queries appears to get blocked behind a global lock before resolving on the order of seconds. These are generally fully indexed queries, often by _id, and they span multiple databases and collections at the same time. We originally noticed this at a consistent 80-second interval on 3.6.6; since upgrading to 3.6.12 it is less prevalent but still happening, and the interval now varies, generally in the 60-90 second range. We have approximately 150 databases, with approximately 150 collections in each one. We are running on an AWS i3.8xlarge HVM instance on Ubuntu 16 LTS.

Here are a few log lines, one from each batch of failures, showing redacted operations that we believe should resolve quickly. I have the diagnostics directory for this time frame if you can provide a secure upload link.

2019-04-22T22:51:03.768+0000 I COMMAND [conn547794] command command: find { find: "files", filter: { : "" }, projection: { _meta: 0 }, limit: 1, singleBatch: true, batchSize: 1, returnKey: false, showRecordId: false, $clusterTime: { clusterTime: Timestamp(1555973455, 204), signature: { hash: BinData(0, 5CC075D6D06195ED6EBD7ABD18FDA15406AC89AD), keyId: 6644202815071715329 } }, lsid: { id: UUID("68954c85-d1f9-4d49-9ee9-0c6b2bf4d54a") }, $db: "" } planSummary: IXSCAN { url: 1 } keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:669 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 8144ms

2019-04-22T22:52:31.966+0000 I COMMAND [conn542979] command command: find { find: "", filter: { _id: "" }, projection: { _meta: 0 }, limit: 1, singleBatch: true, batchSize: 1, returnKey: false, showRecordId: false, $clusterTime: { clusterTime: Timestamp(1555973546, 62), signature: { hash: BinData(0, F8034F377C49CFC559FBD6DC0B17F4112229E6F3), keyId: 6644202815071715329 } }, lsid: { id: UUID("ee45d554-bda5-42b7-883d-b8792f2c08aa") }, $db: "" } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:376401 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 5821ms

2019-04-22T22:54:02.859+0000 I WRITE [conn548157] update command: { q: { _id: "" }, u: { $set: { }, $setOnInsert: { }, $inc: { } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 keysInserted:1 keysDeleted:1 numYields:1 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 3482747 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } 3544ms
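As an illustrative aside (not part of the original report): while one of these stalls is in progress, a filter against db.currentOp() from the mongo shell can list the operations that are currently waiting on a lock or have been running for a few seconds, which is a quick way to confirm the blocked batch in real time. This is a minimal sketch assuming the currentOp output fields on 3.6 (waitingForLock, secs_running, locks); adjust the thresholds to taste.

// Sketch: list in-flight operations that are waiting on a lock or have been
// running for more than ~2 seconds during a stall window.
var blocked = db.currentOp({
  $or: [
    { waitingForLock: true },
    { active: true, secs_running: { $gte: 2 } }
  ]
});

blocked.inprog.forEach(function (op) {
  // Print just the fields relevant to the lock-wait pattern in the logs above.
  printjson({
    opid: op.opid,
    ns: op.ns,
    secs_running: op.secs_running,
    waitingForLock: op.waitingForLock,
    locks: op.locks
  });
});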
devony commented on Wed, 22 May 2019 02:11:45 +0000: @Kelsey we tried making this change in a couple of ways. I'm also going to upload newer diagnostic information so you can have a look with more context. Here's what we noticed. On 5/21 (today) around 12:00PM PT, we saw a significant spike in latency that extended until 3:30PM PT. I am curious if you see anything out of the ordinary that may have started around that time. Even since then, we are seeing latency spikes every so often. At 5:13PM we set close_idle_time to 3600, which did not appear to have much impact:

file_manager=(close_handle_minimum=100,close_idle_time=3600,close_scan_interval=30)

We then attempted to set close_idle_time to 30, but the operation was rejected:

db.adminCommand( { setParameter: 1, wiredTigerEngineRuntimeConfig: "file_manager=(close_handle_minimum=100,close_idle_time=30,close_scan_interval=30)" } )
{
  "operationTime" : Timestamp(1558485071, 244),
  "ok" : 0,
  "errmsg" : "WiredTiger reconfiguration failed with error code (95): Operation not supported",
  "code" : 2,
  "codeName" : "BadValue"
}

At 5:35PM we attempted to set it to 300, which worked. This caused a huge performance spike, and we reverted it back to 3600 at 5:42PM. At 5:54PM we set it to 1800, and that's where it is currently set. It's unclear at this moment whether it has improved or if any flattening is just variance.

thomas.schubert commented on Tue, 21 May 2019 02:08:12 +0000: Yes, you'd see improvement from having fewer active files if you had larger collections rather than the same data distributed across many DBs (assuming you aren't regularly running COLLSCANs). Yes, there is a tradeoff with the cost of opening and closing files too frequently. Have you had a chance to try the settings I suggested?

ronnel commented on Tue, 21 May 2019 01:00:34 +0000: @Kelsey any update on this?

ronnel commented on Mon, 13 May 2019 16:57:51 +0000: Is the issue the overhead of active files, or is it the size of the collection and index files? Would we see improvement if we had larger collections in a single DB instead of the same data distributed across many DBs? Also, is there a risk of setting close_idle_time too small and causing a large percentage of operations to need to open a file first?

thomas.schubert commented on Fri, 10 May 2019 03:59:04 +0000: Hi devony, sorry for the delay getting back to you. I've examined the diagnostic data that you've provided.
During a checkpoint, WiredTiger has to examine each recently accessed index or collection file. Unfortunately, with many active indexes or collections this process may currently result in a drop in throughput. I've plotted an example from the diagnostic data that you've provided. One workaround would be to reduce the number of active files held open by MongoDB. This can be accomplished in a few ways, but one option that doesn't require an application or deployment change is to set the following parameter:

file_manager=(close_handle_minimum=100,close_idle_time=30,close_scan_interval=30)

This reduces the amount of time a collection has to be idle before its file handle is closed from a bit over a day to 30 seconds, and may benefit your workload if many collections aren't being accessed within that window. This parameter can be set using setParameter or on startup in your config file. Would you please set this parameter and let us know if it resolves the issue? Thank you, Kelsey

devony commented on Thu, 9 May 2019 21:35:22 +0000: Any update on this?

devony commented on Wed, 24 Apr 2019 01:45:06 +0000: @Eric I've uploaded the metric files from the relevant time frame.

eric.sedor commented on Tue, 23 Apr 2019 23:26:28 +0000: Thanks for your report, devony. Would you please archive (tar or zip) the $dbpath/diagnostic.data directory (described here) and attach it to this ticket?

devony commented on Tue, 23 Apr 2019 06:53:48 +0000: I mistyped; the instance size is an i3.8xlarge.
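As an editorial note on the checkpoint explanation above (not part of the ticket): the 60-90 second cadence of the stalls lines up with WiredTiger's default 60-second checkpoint interval, and checkpoint duration can be read directly from serverStatus. Below is a minimal sketch, assuming the statistic names exposed under db.serverStatus().wiredTiger.transaction on 3.6; sampling it around a stall shows whether slow checkpoints coincide with the blocked batches.

// Sketch: report checkpoint counts and durations to correlate with the
// periodic stalls described in the original report.
var txn = db.serverStatus().wiredTiger.transaction;
printjson({
  checkpointsCompleted: txn["transaction checkpoints"],
  checkpointRunningNow: txn["transaction checkpoint currently running"],
  mostRecentCheckpointMs: txn["transaction checkpoint most recent time (msecs)"],
  maxCheckpointMs: txn["transaction checkpoint max time (msecs)"]
});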
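Similarly, for judging whether the file_manager reconfigurations discussed in the comments are actually shrinking the pool of open files (the thread notes it was unclear whether the 1800-second setting helped), the handle counts can be sampled before and after a change. Another sketch under the same assumptions about serverStatus statistic names; for a startup-time setting, the usual equivalent is the storage.wiredTiger.engineConfig.configString option in the mongod configuration file rather than the runtime parameter.

// Sketch: sample WiredTiger handle counters to see how many collection and
// index files are held open and whether the sweep is closing idle handles.
var wt = db.serverStatus().wiredTiger;
printjson({
  filesCurrentlyOpen: wt.connection["files currently open"],
  activeDataHandles: wt["data-handle"]["connection data handles currently active"],
  dhandlesClosedBySweep: wt["data-handle"]["connection sweep dhandles closed"]
});

Comparing activeDataHandles before and after lowering close_idle_time gives a direct read on the effect of the change, without waiting for the next latency spike.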
This pattern reproduces daily in our production clusters; we can provide additional information if useful.