...
Yesterday, we had multiple slowdowns on our primary server. From our metrics, we can see that the cache eviction threads were not able to evict pages fast enough, the cache used reached 95%, and application threads started to help the dedicated threads (source: http://source.wiredtiger.com/mongodb-3.4/tune_cache.html).

We already encountered this issue with mongo 3.2 two years ago (on the secondary server, though), and the only thing that fixed it was upgrading to mongo 3.4: https://jira.mongodb.org/browse/SERVER-27700

We plan to upgrade to mongo 3.6 in the coming months, but can you check that a regression has not been introduced in our current mongo version? Another related ticket: https://jira.mongodb.org/browse/SERVER-27753

I can provide diagnostic.data and slow log files. Thanks for your help.
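For reference, here is a minimal mongo shell sketch for spot-checking how close the cache is to that 95% threshold. It assumes the standard WiredTiger cache statistic names shown below appear in your serverStatus() output; please verify them against your build before relying on the numbers.

{code:javascript}
// Minimal spot check of WiredTiger cache pressure from the mongo shell.
// Statistic names are the standard WT cache counters; confirm they appear
// in your serverStatus() output before relying on this.
var cache = db.serverStatus().wiredTiger.cache;
var used  = cache["bytes currently in the cache"];
var max   = cache["maximum bytes configured"];
var dirty = cache["tracked dirty bytes in the cache"];
print("cache used:  " + (100 * used  / max).toFixed(1) + "% of configured maximum");
print("cache dirty: " + (100 * dirty / max).toFixed(1) + "% of configured maximum");
// When this counter keeps climbing while cache used sits near 95%, application
// threads are doing eviction work themselves, which matches the slowdown
// described in the report above.
print("pages evicted by application threads: " +
      cache["pages evicted by application threads"]);
{code}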
bigbourin@gmail.com commented on Thu, 1 Aug 2019 18:54:52 +0000: Ok thanks, we started by updating the secondary to 3.6.13 and so far it's not looking good. Even on the secondary, which never had any eviction performance issue with 3.4, we saw some spikes to 95% cache usage after one day on 3.6.13. This is the MAX cache usage for the primary (yellow, 3.4) and the secondary (green, 3.6.13 since the drop to 0 visible on the chart). We can clearly see that the secondary used to be OK and started showing some dangerous spikes since the update, while the primary is actually close to dying every day because of this. We'll monitor for a couple more days, but it doesn't look very safe to update in these conditions...

dmitry.agranat commented on Thu, 1 Aug 2019 06:15:00 +0000: Hi Adrien, The eviction server is single-threaded. Its performance, contention and issues need to be carefully diagnosed in order to get to the root cause. There is no single metric which can surface eviction server efficiency and point to the root cause. You can try monitoring the wt cache "pages evicted by application threads" metric, which can be a good indication that eviction is struggling and needs to be investigated. Just to reiterate, after the upgrade to the latest MongoDB version, please upload the diagnostic.data and we will be happy to take a look. Thank you, Dima

bigbourin@gmail.com commented on Wed, 31 Jul 2019 11:29:58 +0000: Hi, so if I understand correctly (I'm working with Sebastien), mongo has an "eviction server" which generates candidates for eviction, and then the multiple eviction threads (or application threads) evict items from a queue filled by this server? So if the server is not fast enough, the evictions wait (blocking the entire server if we're at 95%) for the eviction server? Is this a multi-threaded thing? As this is another mongoDB bottleneck we would need to monitor closely, how can we see how close we are to eviction server saturation? If we can measure this, we can also see how much of an improvement 3.6 brings in this regard. Thanks!

dmitry.agranat commented on Wed, 31 Jul 2019 07:59:14 +0000: Hi spuyet@gmail.com, Apologies for the delayed response. We can see that starting from 05:30 AM UTC, July 16th, there is a shift in the workload and the operation rate gradually increased by ~3x. It looks to me like the main problem is that the eviction server is not queuing pages fast enough. This is a zoom-in into the last hour (12:05 - 13:05) of the above workload increase, marked by A-B:

- "eviction calls to get a page found queue empty" is not as large as "eviction calls to get a page", so eviction workers are not starved for work all the time, but they are still having trouble ~27% of the time; in other words, roughly one in four eviction calls found nothing queued to evict.
- "application thread time evicting" is small and "application thread time waiting for cache" is large, indicating that application threads aren't finding eviction work queued for them and so are waiting instead.
- "eviction empty score" is large, meaning the eviction server often found the eviction queue empty, i.e. eviction workers are draining it faster than the server can fill it.

{quote}We plan to upgrade to mongo 3.6 in the coming months, but can you check that a regression has not been introduced in our current mongo version?{quote}

I do not believe this is a regression of SERVER-27700, which you mentioned. SERVER-27700 explicitly talks about a secondary member, while here we are investigating the primary member during the reported event. In addition, we do not have the information about the issue you encountered 2 years ago to be able to determine whether this is a regression. It will be difficult to determine which of the ~400 WT improvements we've made since your current MongoDB version would help this workload with faster page eviction. We would be happy to re-evaluate this after the upgrade if the issue reoccurs. In addition, if possible, please consider upgrading to the latest 4.0 (as of today, 4.0.11).
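To make the saturation question above measurable, here is a rough mongo shell monitoring sketch. It assumes the statistic names quoted in this thread ("eviction calls to get a page", "eviction calls to get a page found queue empty", "eviction empty score", "pages evicted by application threads"), plus "eviction worker thread evicting pages", appear verbatim under serverStatus().wiredTiger.cache; confirm them against your version before relying on it.

{code:javascript}
// Rough monitoring sketch: sample the eviction-related counters discussed in
// this thread every 10 seconds and print the per-interval deltas.
function evictionStats() {
  var c = db.serverStatus().wiredTiger.cache;
  return {
    getPage:       c["eviction calls to get a page"],
    queueEmpty:    c["eviction calls to get a page found queue empty"],
    appEvicted:    c["pages evicted by application threads"],
    workerEvicted: c["eviction worker thread evicting pages"],
    emptyScore:    c["eviction empty score"]   // gauge, not a cumulative counter
  };
}

var prev = evictionStats();
while (true) {                                 // Ctrl-C to stop
  sleep(10 * 1000);                            // sampling interval: 10 seconds
  var cur = evictionStats();
  var calls = cur.getPage - prev.getPage;
  var empty = cur.queueEmpty - prev.queueEmpty;
  print(new Date().toISOString() +
        "  queue empty on " + (calls > 0 ? (100 * empty / calls).toFixed(1) : "0.0") + "% of eviction calls" +
        ", pages evicted by app threads: " + (cur.appEvicted - prev.appEvicted) +
        ", by eviction workers: " + (cur.workerEvicted - prev.workerEvicted) +
        ", empty score: " + cur.emptyScore);
  prev = cur;
}
{code}

A persistently high queue-empty percentage suggests the eviction workers are outrunning the single-threaded eviction server, and a growing count of pages evicted by application threads indicates they are being pulled into eviction work, the two symptoms described in the analysis above.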
spuyet@gmail.com commented on Mon, 29 Jul 2019 07:49:07 +0000: Dmitry, any news about this issue?

spuyet@gmail.com commented on Thu, 18 Jul 2019 15:34:39 +0000: Yes, the cache used started to grow beyond 80% in the morning. But we started to have a real impact on production between 2:00PM UTC+2 and 3:05PM UTC+2 (when we stepped down the server). Not sure that the timezone really matters here though.

dmitry.agranat commented on Thu, 18 Jul 2019 12:26:42 +0000: Thanks spuyet@gmail.com. Just to confirm, we are looking into the event that started with the workload increase/shift on July 16th at 06:00 AM and lasted until 01:00 PM UTC?

spuyet@gmail.com commented on Thu, 18 Jul 2019 12:14:20 +0000: Hello Dmitry, I've just uploaded the diagnostic data and the slow log files. We don't see anything different from other days.

dmitry.agranat commented on Thu, 18 Jul 2019 07:15:21 +0000: Hi spuyet@gmail.com, Thank you for the report, we'll be happy to look into this. Has anything changed on the application side in terms of the workload prior to this event? Please upload the diagnostic data, the mongod logs, as well as the exact time and timezone of the reported issue. You can upload all the data into this secure portal. Thank you, Dima

spuyet@gmail.com commented on Wed, 17 Jul 2019 12:56:48 +0000: An interesting thing to notice is that after the stepdown the server was back to a "normal" state.
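For completeness, the step-down mentioned in the last comment is the standard replica set helper; a minimal illustration is below (the 60-second value is an assumed example, not necessarily what was used during the incident).

{code:javascript}
// Ask the current primary to step down and not stand for election again for
// 60 seconds (illustrative duration, not the one used during this incident).
rs.stepDown(60);
{code}

Stepping down moves client traffic to the newly elected primary, which would be consistent with the old primary returning to a "normal" state afterwards.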