...
We experienced the following on our production replica set (1 primary, 3 secondaries, MongoDB 3.2.12 + WiredTiger):

1. All secondaries saw a spike in disk usage (about 4-5x base load; we see something like this roughly twice per day at random).
2. One secondary's mongod CPU usage shot up to 100%. During this time the replica set continued to route queries to the affected instance, but the instance did not process them.
3. We killed all queries on the affected instance via the console (roughly the kind of commands sketched below); the CPU remained at 100% and it still did not process queries.
4. Finally, we removed the affected instance from the replica set, after which all queries were processed normally again.

Here is a CPU chart of the 4 instances. As you can see, the CPU of one instance hit 100%. (Because it was not responding to queries, our app failed and stopped sending new queries, so the other instances dropped to low CPU.) Here is a chart showing the disk usage spike on the affected instance, which happened before the high CPU. Unfortunately, we did not get a core dump of the affected process.
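For reference, steps 3 and 4 would typically be carried out from the mongo shell. A minimal sketch of the kind of commands involved, assuming a hypothetical member name and a 5-second threshold (neither value comes from this ticket):

    // On the affected secondary: list long-running operations and kill them.
    // The 5-second threshold and the fields printed are illustrative only;
    // in practice you would inspect each operation before killing it.
    db.currentOp({ "secs_running": { "$gte": 5 } }).inprog.forEach(function (op) {
        printjson({ opid: op.opid, secs_running: op.secs_running, ns: op.ns });
        db.killOp(op.opid);  // request termination of this operation
    });

    // On the primary: remove the affected member from the replica set.
    // "affected-secondary.example.com:27017" is a placeholder host:port.
    rs.remove("affected-secondary.example.com:27017");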
thomas.schubert commented on Fri, 16 Feb 2018 19:54:26 +0000: Thanks for the update, johnnyshields. Given the infrequency of this issue, I'm going to close this ticket. If you encounter it again on 3.4, please let us know and we can continue to investigate.

johnnyshields commented on Mon, 5 Feb 2018 19:21:36 +0000: Upgraded to 3.4 and no issue so far, but it's very rare (it happened once in 2+ years of running MongoDB 3.2).

thomas.schubert commented on Mon, 5 Feb 2018 17:01:19 +0000: Hi johnnyshields, Upgrading to a larger host may resolve this issue; however, since we haven't completed an RCA, I cannot guarantee that it will reduce the likelihood of a recurrence. Would you please let us know if this is still an issue for you after upgrading to the latest version of MongoDB 3.4? Thanks, Kelsey

johnnyshields commented on Fri, 5 Jan 2018 04:25:19 +0000: Hi Kelsey, thank you for the clarification, understood. For the avoidance of doubt, would upgrading to a host with more CPUs (e.g. 4) be a workaround that reduces the likelihood of recurrence?

thomas.schubert commented on Fri, 5 Jan 2018 03:48:53 +0000: Hi johnnyshields, I'm familiar with SERVER-30189 and can confirm that it affects MongoDB 3.2. In my previous comment, I suggested that you upgrade to the latest version of MongoDB 3.4 to take advantage of this fix, as a number of the queries recorded in your logs have large $in expressions. Given the relatively small size of this host (2 CPUs), it is more susceptible to connection escalations if the application sends queries faster than it can process them. SERVER-30189 is a known issue that may cause throughput to decrease unexpectedly, resulting in this type of behavior. That said, there is not enough information at this time to determine whether or not upgrading will resolve this issue. However, I believe upgrading is the best way to progress this investigation for two reasons. First, upgrading to 3.4.10 will allow us to quickly determine whether this fix (which we have identified specifically as affecting your workload) resolves this issue. Second, MongoDB 3.4 contains a number of other performance improvements as well as additional diagnostics, which would enable us to better understand what is going on if the issue were not resolved. To that end, I included instructions in my previous comment about the type of diagnostic information we would need if the issue continues after upgrading. Kind regards, Kelsey

johnnyshields commented on Wed, 3 Jan 2018 08:07:38 +0000: Hi Kelsey, one more question. I browsed the code differences and it seems there has been quite a bit of refactoring in this area between 3.2 and 3.4. It would be helpful if someone who is very familiar with the code and the refactoring could confirm whether this issue can in fact occur on 3.2.

johnnyshields commented on Wed, 3 Jan 2018 07:32:27 +0000: Hi Kelsey, is it correct to say that, due to the underlying issue of SERVER-30189, the affected instance appears to have had all of its threads go into a "dead spin-lock" state (similar to a deadlock, but with CPU being consumed)? Also, it's not entirely clear to me how SERVER-30189 definitively fixes the issue. We will try to reproduce the issue by bombarding the instance with $in queries.
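For illustration, here is a rough sketch of the kind of reproduction attempt described in the comment above, run from a mongo shell. The collection name ("events"), field ("_id"), list size, and iteration count are all assumptions; the actual workload is only described in this ticket as issuing very large $in expressions:

    // Target a secondary, since the affected node in the incident was a secondary.
    db.getMongo().setReadPref("secondary");

    // Build a very large $in list (50,000 ObjectIds; the size is an assumption).
    var ids = [];
    for (var i = 0; i < 50000; i++) {
        ids.push(ObjectId());
    }

    // Repeatedly issue the large $in query to approximate "bombarding" the node.
    for (var j = 0; j < 100; j++) {
        db.events.find({ _id: { $in: ids } }).itcount();
    }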
thomas.schubert commented on Tue, 2 Jan 2018 19:09:55 +0000: Hi johnnyshields and MiLk, Thanks for providing diagnostic.data and logs. After reviewing the logs, I see a few very large $in queries which occur around the event. Would you please upgrade to the latest version of MongoDB 3.4 to take advantage of SERVER-30189 and let us know if the issue is resolved? If the issue is not resolved after upgrading, please collect perf during the event:

    # record call stack samples and generate text output on the test node
    perf record -a -g -F 99 sleep 60
    perf script > perf.txt

We'll want diagnostic.data, logs, and the perf.txt file. Note that in MongoDB 3.4, diagnostic.data also contains disk and CPU utilization metrics, which help us better understand what is going on here. Thanks again for your help, Kelsey

milk commented on Wed, 27 Dec 2017 17:40:04 +0000: Hi Kelsey, I've uploaded the diagnostic data from the 4 servers of the replica set, and the logs for the hour around the incident (around 09:33 UTC). The primary is mongodb001-05f4 and the affected server is mongodb001-04db. Please let me know if you need additional information.

thomas.schubert commented on Wed, 27 Dec 2017 12:17:45 +0000: Hi johnnyshields, Thanks for reporting this issue. So we can continue to investigate, would you please upload the following information?

- complete mongod logs from the primary, the affected secondary, and an unaffected secondary
- an archive of the diagnostic.data directory from the primary, the affected secondary, and an unaffected secondary

Thank you, Kelsey
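For reference, the diagnostic.data directory requested above is written by FTDC (enabled by default since MongoDB 3.2) under each mongod's dbPath. A minimal mongo shell sketch to locate it, assuming the shell is connected to the member in question; the /data/db fallback is simply mongod's built-in default, not a path from this ticket:

    // Print the server's dbPath; FTDC writes diagnostic.data under this directory.
    var opts = db.adminCommand({ getCmdLineOpts: 1 });
    var dbPath = (opts.parsed.storage && opts.parsed.storage.dbPath) || "/data/db";
    print("diagnostic.data expected at: " + dbPath + "/diagnostic.data");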