...
I just tried upgrading one of my replica sets from 3.4.15 to 3.6.5, thinking that with 4.0 now released, 3.6 might be mature/stable enough. I had a very bad surprise: CPU load increased about 4 times on the secondary (I didn't try the primary, of course; I don't want to lose my cluster yet). As you can see on the chart below, my mongo usage is very stable and the secondary usually stays at 5% of total server CPU on 3.4.15; as soon as I upgrade to 3.6.5 it jumps to 15-25%. The load stayed exactly the same on the primary, and so did its CPU usage. When I saw that I thought maybe I should try re-syncing the secondary to benefit from the improvements in 3.6, so I did. The first time, it filled the disk during the initial sync and crashed mongod; I tried again and the second time it worked, but the load is exactly the same. I also tried going back to 3.6.0 and 3.6.1 to see where the regression was introduced, but neither booted, for reasons unknown yet. Is this expected? Are you aware of any change that would cause this huge regression? This is my MongoDB Cloud Manager account monitoring the replica set, if it helps: https://cloud.mongodb.com/v2/5012a0ac87d1d86fa8c22e64#metrics/replicaSet/5414cfabe4b0ce23e21b4b3b/overview
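A minimal sketch of the standard shell checks around a 3.4 to 3.6 binary upgrade, shown only for context (these commands are not quoted from the report above). Note that the running binary version and the replica set's featureCompatibilityVersion are independent, and the FCV stays at "3.4" until explicitly raised:

    // run against each member
    db.version()                                                    // running binary version
    db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })

    // only after all members run 3.6 and the upgrade looks healthy:
    // db.adminCommand({ setFeatureCompatibilityVersion: "3.6" })   // run on the primary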
bruce.lucas@10gen.com commented on Thu, 12 Jul 2018 15:51:15 +0000: Thanks for the additional data Adrien. I think you are correct that cache pressure is an issue here, and would not be surprised if increasing the cache size improves the behavior. There are a few factors that may be contributing to cache pressure, and we would like to eliminate them one at a time to see whether the behavior improves. The first relates to the use of protocol version 0 (pv0) with 3.6. This is not an ideal combination, as it requires keeping additional update history in the cache on the secondary, and could be the cause of what you're seeing. The default protocol version for new 3.6 replica sets is pv1, but replica sets created under prior versions will continue to use pv0 until explicitly upgraded. You can confirm that your replica set is using pv0 by looking in the mongod log for the 3.6 node; there will be a startup warning. If you don't see that, can you please upload the mongod log file for 3.6 so we can take a look? If you do see the warning, can you please upgrade to pv1 (a minimal reconfig sketch appears below, after this batch of comments), try running the secondary on 3.6 again, and then upload diagnostic.data once the update job has finished? Ideally this would use the same cache size as before to get a fair comparison. Thanks, Bruce

bigbourin@gmail.com commented on Wed, 11 Jul 2018 16:07:36 +0000: Ah yes indeed, I just uploaded 2 new archives. For the record, I noticed that the CPU load in my last test with 3.6.5 was lower than in my first tests, and I believe this is because I increased the WT cache size limit from 1G to 2G, so I'm currently trying a higher cache size (~4G) to see if it's any better. This could help pinpoint a memory usage increase causing higher CPU load due to cache stress (eviction, etc.). I'll let you know tomorrow after the 11am burst, if it survives =p

bruce.lucas@10gen.com commented on Wed, 11 Jul 2018 15:44:10 +0000: Hi Adrien, The two-day comparison on different versions will be useful since, as you say, the load is constant. Can you please upload the most recent diagnostic.data from the primary and secondary, which should include July 9th? The latest upload we have is from July 8th, so it doesn't include that data. Thanks, Bruce

bigbourin@gmail.com commented on Wed, 11 Jul 2018 12:32:15 +0000: Thanks for having a look. I can't show you two secondaries because I only have one (the third member is an arbiter), but I can clearly show you the difference between both versions, as my workload on the primary is extremely stable; it's the same each day (due to the nature of the service: it's updown.io, a monitoring service, so the mongo traffic is 99% periodic requests). I made a new chart by overlapping two days (Jul 8 and Jul 9) showing the CPU usage difference between 3.4 and 3.6 during the night and during the 11am spike. Ignore the 50% flat blue line on the right; it's a ruby process which loops when mongo stops and has the same blue color as the mongo server when I overlap both charts. Before 2am it's different because it was re-syncing, but after 2am you can see the CPU difference during regular load, which as you said is about 2 times higher and much less stable. Then at 11am you can see the CPU activity under higher load: the workload on the primary is the same, and here the CPU usage on the secondary is not only 4 times higher (~18% → 64%), but it doesn't even manage to keep up with the primary; it gets out of sync and stops. Which means it would require even more than 4 times the CPU to follow the same load. (The reason the machine saturates at ~70% CPU is that it's a VM with stolen CPU.) You can see this in the diagnostic.data files of July 8 for 3.6.5 and July 9 for 3.4.15; the load spike starts at 11am UTC+2 every day and lasts about 1 hour. You can also see the workload charts in my Cloud Manager account, I guess (https://cloud.mongodb.com/v2/5012a0ac87d1d86fa8c22e64#host/replicaSet/5414cfabe4b0ce23e21b4b3b); otherwise, here is a chart showing the stability of the primary workload during July 8 and 9 (the spike is the 11am burst):
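The pv1 upgrade Bruce asks for above is a replica set reconfiguration. A minimal sketch of the standard procedure, run from the mongo shell against the primary (shown for reference, not quoted from the ticket):

    // on the PRIMARY: switch the replica set from pv0 to pv1
    cfg = rs.conf()
    cfg.protocolVersion = 1
    rs.reconfig(cfg)

    // verify afterwards
    rs.conf().protocolVersion   // should now report 1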
bruce.lucas@10gen.com commented on Mon, 9 Jul 2018 19:30:32 +0000: Hi Adrien, Thanks for the data you have collected so far, and thanks for the offer to continue helping us diagnose what you are seeing. In the data that you've uploaded, the most notable correlate of higher CPU utilization is secondary lag. This is seen at A-B (following initial sync) and at C-D and E-F (following some amount of downtime). When the secondary is lagged it will be doing extra work to catch up, which we can see in the "repl apply ops" metric (a shell sketch for inspecting these metrics follows after this batch of comments), and this will require more CPU. After the lag catch-up CPU utilization has been accounted for, we still see a residual CPU increase of about double, going from 4-5% to 8-10%. In absolute terms this is not a large increase, but we will look into it further. However, you have indicated the possibility of some other, larger effects, e.g. related to the daily update job, but I don't think we have clean enough data to investigate that. Would you be willing to collect data with one secondary on 3.6.5 and the other on 3.4.15 for comparison, for at least 24 hours? If you need to restart a secondary to get it on 3.6.5, please try to minimize the downtime between restarts to minimize the amount of lag the secondary has to catch up. Also please try to restart it during a period of steady load so it has plenty of time to stabilize before any unusual load such as the daily update job. If there are any events you want to call our attention to, please give us a timeline with times as accurate as you can provide, including dates and timezone. Once you have 24 hours of data on 3.6.5, please upload the contents of diagnostic.data from all three nodes. Also, if you can upload the CPU charts from both secondaries, that would be useful; as you can see from above, we collect CPU information in diagnostic.data, but it will be good to have independent confirmation of that from your charts. Thanks, Bruce

bigbourin@gmail.com commented on Sun, 8 Jul 2018 22:44:21 +0000: OK, to help you pinpoint the issue I erased the data and went back to 3.4.15 to make the upgrade again one version at a time and see the impact. I uploaded the new diagnostic.data files covering all these changes to the portal. Here are the upgrades I made since last time:
Jul 05 @ 00:47: 3.6.1
Jul 05 @ 08:50: 3.6.2
Jul 05 @ 13:20: 3.6.3
Jul 05 @ 20:00: 3.6.4
Jul 06 @ 10:34: 3.6.5
Jul 06 @ 18:50: 3.4.15
Jul 06 @ 21:31: 3.6.5
Jul 09 @ 00:30: 3.4.15
Here's the CPU chart: All these upgrades showed that the regression seems to be entirely between 3.4.15 and 3.6.0, as the CPU level is basically the same across the 3.6.x versions. It looks a bit lower than in my first tests (Jul 04), I'm not sure why, as the Jul 04 test was already on re-synced data, but it's definitely way higher than on 3.4.15; the re-sync is slower and takes much more CPU, and the update burst I have every morning at 11am, which used to push the secondary CPU to 20%, is enough on 3.6.x to push the CPU to 100% and knock the secondary out of sync... Let me know if I can do anything else to help you diagnose this; in the meantime I'm going back to 3.4.15, as the upgrade is not possible.
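For anyone following along without access to the uploaded diagnostic.data, the lag, apply rate, and cache pressure discussed above can also be spot-checked from the mongo shell. A minimal sketch using standard helpers and serverStatus fields (nothing here is specific to this ticket; printSlaveReplicationInfo is the 3.x-era shell name):

    // on the secondary: how far it is behind the primary
    rs.printSlaveReplicationInfo()

    // oplog entries applied so far; sample twice to estimate the apply rate
    db.serverStatus().metrics.repl.apply.ops

    // WiredTiger cache pressure indicators
    var c = db.serverStatus().wiredTiger.cache
    c["maximum bytes configured"]
    c["bytes currently in the cache"]
    c["tracked dirty bytes in the cache"]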
bigbourin@gmail.com commented on Wed, 4 Jul 2018 19:24:54 +0000: Sure, that's uploaded to the portal. For the secondary, I forgot to keep the diagnostic folder when I resynced today, so it only starts at ~13:00, but that's enough to see 3.4.15 and 3.6.0.

ramon.fernandez commented on Wed, 4 Jul 2018 19:14:23 +0000: bigbourin@gmail.com, will you please upload the contents of the diagnostic.data directories for the primary and the affected secondary? You can attach them to the ticket if they're not too large, or upload them via this secure portal. Please comment on the ticket when you're done so we can investigate. Thanks, Ramón.

bigbourin@gmail.com commented on Wed, 4 Jul 2018 18:56:58 +0000: Here is a bit more info. I let 3.6.5 run for a while, and this morning at 11am I have a cron which does a lot of mongo updates; this usually takes the secondary CPU load from 6% to ~15%. You can see on the following graph that with 3.6.5 it took it from 20% to 80%; the secondary started lagging as it couldn't keep up, at some point lost the oplog race, and I had to resync it (an oplog-window check is sketched after these comments). To help you find the regression more easily I decided to resync it with 3.4.15, first to see if I observe the same load difference, and then to do the upgrade one version at a time. You can see that at 13:30 I started the resync, which finished at 15:00, and the load was back to the more "normal" and stable levels I'm used to. I then tried upgrading to 3.6.0 at 18:00; you can see the resync took a lot more CPU, and afterwards the load on the server is lower than on 3.6.5 but still higher and much less stable than on 3.4.15. I'll let it stabilize a bit to get better average numbers and then continue with 3.6.1. I can provide the diagnostic.data if that is of any help. BTW, I had some trouble starting 3.6.0 due to a bug with "bindIpAll" which makes the server start but not answer any connection (the log says it couldn't bind the port and shut down, but the process keeps running...); this doesn't seem to be present in 3.6.5, so I'm considering it a fixed bug ^^

bigbourin@gmail.com commented on Wed, 4 Jul 2018 07:17:41 +0000: Here is the image that I couldn't upload at ticket creation: And another one showing that the I/O doubled too (same bandwidth but twice the IOPS):
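The "oplog race" mentioned above is the secondary falling so far behind that the primary's oplog no longer contains the entries it needs, forcing a full resync. A minimal sketch of how the available oplog window and per-member lag can be checked from the shell (standard helpers, not commands quoted from the ticket):

    // on the primary: configured oplog size and the time span it currently covers
    rs.printReplicationInfo()

    // per-member optimes, from which lag can be read directly
    rs.status().members.forEach(function (m) {
        print(m.name + "  " + m.stateStr + "  " + m.optimeDate)
    })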
Upgrade RS secondary from 3.4.15 to 3.6.5 and watch CPU usage