...
We are seeing consistent, periodic degradation on our primary, visible in our mongod logs. Every 60-90 seconds or so a large batch of queries appears to get blocked behind a global lock before resolving on the order of seconds. These are generally fully indexed queries, often by _id, and they span multiple databases and collections at the same time. We originally noticed this at a consistent 80-second interval on 3.6.6; since upgrading to 3.6.12 it is less prevalent but still happening, and the interval now varies, generally in the 60-90 second range. We have approximately 150 databases, with approximately 150 collections in each one. We are running on an AWS i3.8xlarge HVM instance on Ubuntu 16 LTS.

Here are a few log lines, one from each batch of failures, showing redacted operations that we believe should resolve quickly. I have the diagnostics directory for this time frame if you can provide a secure upload link.

2019-04-22T22:51:03.768+0000 I COMMAND [conn547794] command command: find { find: "files", filter: { : "" }, projection: { _meta: 0 }, limit: 1, singleBatch: true, batchSize: 1, returnKey: false, showRecordId: false, $clusterTime: { clusterTime: Timestamp(1555973455, 204), signature: { hash: BinData(0, 5CC075D6D06195ED6EBD7ABD18FDA15406AC89AD), keyId: 6644202815071715329 } }, lsid: { id: UUID("68954c85-d1f9-4d49-9ee9-0c6b2bf4d54a") }, $db: "" } planSummary: IXSCAN { url: 1 } keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:669 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 8144ms

2019-04-22T22:52:31.966+0000 I COMMAND [conn542979] command command: find { find: "", filter: { _id: "" }, projection: { _meta: 0 }, limit: 1, singleBatch: true, batchSize: 1, returnKey: false, showRecordId: false, $clusterTime: { clusterTime: Timestamp(1555973546, 62), signature: { hash: BinData(0, F8034F377C49CFC559FBD6DC0B17F4112229E6F3), keyId: 6644202815071715329 } }, lsid: { id: UUID("ee45d554-bda5-42b7-883d-b8792f2c08aa") }, $db: "" } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:376401 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 5821ms

2019-04-22T22:54:02.859+0000 I WRITE [conn548157] update command: { q: { _id: "" }, u: { $set: { }, $setOnInsert: { }, $inc: { } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 keysInserted:1 keysDeleted:1 numYields:1 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 3482747 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } 3544ms
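As an illustrative aside (not part of the original report): while one of these stalls is in progress, a filter against db.currentOp() from the mongo shell can list the operations that are currently waiting on a lock or have been running for a few seconds, which is a quick way to confirm the blocked batch in real time. This is a minimal sketch assuming the currentOp output fields on 3.6 (waitingForLock, secs_running, locks); adjust the thresholds to taste.

// Sketch: list in-flight operations that are waiting on a lock or have been
// running for more than ~2 seconds during a stall window.
var blocked = db.currentOp({
  $or: [
    { waitingForLock: true },
    { active: true, secs_running: { $gte: 2 } }
  ]
});

blocked.inprog.forEach(function (op) {
  // Print just the fields relevant to the lock-wait pattern in the logs above.
  printjson({
    opid: op.opid,
    ns: op.ns,
    secs_running: op.secs_running,
    waitingForLock: op.waitingForLock,
    locks: op.locks
  });
});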
devony commented on Wed, 22 May 2019 02:11:45 +0000: @Kelsey we tried making this change in a couple of ways. I'm also going to upload newer diagnostic information so you can have a look with more context. Here's what we noticed. On 5/21 (today) around 12:00PM PT, we saw a significant spike in latency that extended until 3:30PM PT. I am curious if you see anything out of the ordinary that may have started around that time. Even since then, we are seeing latency spikes every so often. At 5:13PM we set close_idle_time to 3600, which did not appear to have much impact:

file_manager=(close_handle_minimum=100,close_idle_time=3600,close_scan_interval=30)

We then attempted to set close_idle_time to 30, but the operation was rejected:

db.adminCommand( { setParameter: 1, wiredTigerEngineRuntimeConfig: "file_manager=(close_handle_minimum=100,close_idle_time=30,close_scan_interval=30)" } )
{
  "operationTime" : Timestamp(1558485071, 244),
  "ok" : 0,
  "errmsg" : "WiredTiger reconfiguration failed with error code (95): Operation not supported",
  "code" : 2,
  "codeName" : "BadValue"
}

At 5:35PM we attempted to set it to 300, which worked. This caused a huge performance spike, and we reverted it back to 3600 at 5:42PM. At 5:54PM we set it to 1800, and that's where it is currently set. It's unclear at this moment whether it has improved or if any flattening is just variance.

thomas.schubert commented on Tue, 21 May 2019 02:08:12 +0000: Yes, you'd see improvement from having fewer active files if you had larger collections rather than the same data distributed across many DBs (assuming you aren't regularly running COLLSCANs). Yes, there is a tradeoff with the cost of opening and closing files too frequently. Have you had a chance to try the settings I suggested?

ronnel commented on Tue, 21 May 2019 01:00:34 +0000: @Kelsey any update on this?

ronnel commented on Mon, 13 May 2019 16:57:51 +0000: Is the issue the overhead of active files, or is it the size of the collection and index files? Would we see improvement if we had larger collections in a single DB instead of the same data distributed across many DBs? Also, is there a risk of setting close_idle_time too small and causing a large percentage of operations to need to open a file first?

thomas.schubert commented on Fri, 10 May 2019 03:59:04 +0000: Hi devony, sorry for the delay getting back to you. I've examined the diagnostic data that you've provided.
During a checkpoint, WiredTiger has to examine each recently accessed index or collection file. Unfortunately, with many active indexes or collections this process may currently result in a drop in throughput. I've plotted an example from the diagnostic data that you've provided. One workaround would be to reduce the number of active files held open by MongoDB. This can be accomplished in a few ways, but one option that doesn't require an application or deployment change is to set the following parameter:

file_manager=(close_handle_minimum=100,close_idle_time=30,close_scan_interval=30)

This reduces the amount of time a collection has to be idle before its file handle is closed from a bit over a day to 30 seconds, and may benefit your workload if many collections aren't being accessed within that window. This parameter can be set using setParameter or on startup in your config file. Would you please set this parameter and let us know if it resolves the issue? Thank you, Kelsey

devony commented on Thu, 9 May 2019 21:35:22 +0000: Any update on this?

devony commented on Wed, 24 Apr 2019 01:45:06 +0000: @Eric I've uploaded the metric files from the relevant time frame.

eric.sedor commented on Tue, 23 Apr 2019 23:26:28 +0000: Thanks for your report, devony. Would you please archive (tar or zip) the $dbpath/diagnostic.data directory (described here) and attach it to this ticket?

devony commented on Tue, 23 Apr 2019 06:53:48 +0000: I mistyped; the instance size is an i3.8xlarge.
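As an editorial note on the checkpoint explanation above (not part of the ticket): the 60-90 second cadence of the stalls lines up with WiredTiger's default 60-second checkpoint interval, and checkpoint duration can be read directly from serverStatus. Below is a minimal sketch, assuming the statistic names exposed under db.serverStatus().wiredTiger.transaction on 3.6; sampling it around a stall shows whether slow checkpoints coincide with the blocked batches.

// Sketch: report checkpoint counts and durations to correlate with the
// periodic stalls described in the original report.
var txn = db.serverStatus().wiredTiger.transaction;
printjson({
  checkpointsCompleted: txn["transaction checkpoints"],
  checkpointRunningNow: txn["transaction checkpoint currently running"],
  mostRecentCheckpointMs: txn["transaction checkpoint most recent time (msecs)"],
  maxCheckpointMs: txn["transaction checkpoint max time (msecs)"]
});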
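Similarly, for judging whether the file_manager reconfigurations discussed in the comments are actually shrinking the pool of open files (the thread notes it was unclear whether the 1800-second setting helped), the handle counts can be sampled before and after a change. Another sketch under the same assumptions about serverStatus statistic names; for a startup-time setting, the usual equivalent is the storage.wiredTiger.engineConfig.configString option in the mongod configuration file rather than the runtime parameter.

// Sketch: sample WiredTiger handle counters to see how many collection and
// index files are held open and whether the sweep is closing idle handles.
var wt = db.serverStatus().wiredTiger;
printjson({
  filesCurrentlyOpen: wt.connection["files currently open"],
  activeDataHandles: wt["data-handle"]["connection data handles currently active"],
  dhandlesClosedBySweep: wt["data-handle"]["connection sweep dhandles closed"]
});

Comparing activeDataHandles before and after lowering close_idle_time gives a direct read on the effect of the change, without waiting for the next latency spike.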
This pattern reproduces daily in our production clusters; we can provide additional information if useful.