...
We have had a problem with mongodump, but believe it to be fixed/worked around for now. I am sharing this so that you at MongoDB can analyze it, and so that other users in the same situation can perhaps solve their problem. If this is not the right place for it, please let me know.

We use mongodump to back up our data. The dump is run on a hidden secondary and during the night, so as not to impact the production environment. Previously (on 3.2 and 3.4.5) we had no difficulties with this: on 3.4.5 the mongodump would run for about 2-3 minutes on our 1.5 GB databases. After upgrading to 3.4.6 we saw the run time increase each night the dump was run: first 3 minutes (as on the previous version), then 9 minutes, then 46 minutes, then over 3 hours. During the last run the mongod instance also apparently crashed and restarted. We use MongoDB's cloud monitoring for our mongod instances, and the instance on which the mongodump was run was reported down for about 2 hours while the dump was running. Furthermore, other mongod instances and the monitoring agent on the same server were sporadically reported down/up while the mongodump was running.

Investigating disk usage shows that after the upgrade to 3.4.6, disk reads increased in step with the run time: 4 GB/h, then 86 GB/h, then 94 GB/h, then ~110 GB/h sustained for 3+ hours.

We ran an unattended-upgrade on the server hosting the mongod instance, and after that the next mongodump took only 3 minutes, with disk reads back at 4 GB/h. I don't know whether the upgrade fixed it or whether this was just coincidental and the problem is sporadic. I am not too keen on reverting the upgrades to reproduce this on our production server.

The packages upgraded were: linux-cloud-tools-common, linux-cloud-tools-virtual, linux-headers-generic, linux-headers-virtual, linux-image-extra-virtual, linux-image-generic, linux-image-virtual, linux-virtual. Packages auto-removed: linux-cloud-tools-4.4.0-83, linux-cloud-tools-4.4.0-83-generic.

If you need any more information, let me know and I will try to provide it.
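For context, a nightly dump of this kind is typically just a plain mongodump pointed at the hidden secondary, along the lines of the sketch below; the hostname, port, credentials and output path here are placeholders, not values taken from this report.

    # connect directly to the hidden secondary so production nodes are not touched
    mongodump --host hidden-secondary.example.internal --port 27017 \
        --username backupUser --password '<password>' \
        --authenticationDatabase admin \
        --out /backups/mongodump-$(date +%F)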
dimko commented on Fri, 27 Apr 2018 01:49:52 +0000:
Hi Alex, were you able to find the root cause? We're facing the same issues on version 3.4.6 running on Azure. We've been running this setup for months without any issues, but recently mongodump got really slow on some of the nodes: 5–8 hours vs 40–50 min. It would be very helpful if you could share any further findings on this problem. Thanks!

JIRAUSER1269335 commented on Mon, 6 Nov 2017 09:11:56 +0000:
Hi Mark, just to clarify: the issue only goes away temporarily when we restart. After ~14 days the issue starts to reappear - after a restart it is gone again for a while. The last few incidents we have had were:
- 7th October + 8th October (then restarted)
- 18th October (then restarted)
- 5th November (no restart, but the backup ran OK on 6th November)
Regarding the disk utilization, we are seeing that as well in our Azure portal while the mongodump is running, as attached earlier in this issue. We only see the issue when mongodump is running, which led us to investigate whether we are using it correctly. I noticed that we are not using --oplog for our mongodump command. I am going to change that to see if it helps.

mark.agarunov commented on Thu, 14 Sep 2017 19:14:26 +0000:
Hello aba@medei.dk, thank you for providing this information. Looking over the diagnostic data, it seems to suggest that the slowdown is caused by disk utilization approaching 100% and creating a bottleneck, and does not seem to be something internal to MongoDB. As running upgrades on the system and restarting it seems to have alleviated the issue, this would support the idea that something outside of MongoDB is causing this behavior. As this does not appear to be a bug in MongoDB and you are no longer experiencing the issue, I've closed this ticket. If the issue arises again or additional information comes to light, we can reopen this ticket and continue investigating. Thanks, Mark

JIRAUSER1269335 commented on Mon, 11 Sep 2017 13:32:27 +0000:
Update: on Friday we ran an unattended upgrade (no packages installed or removed), then restarted the VM. Since then we have not encountered the issue. The last time we did an unattended upgrade + restart it also appeared to fix the problem for a while.

JIRAUSER1269335 commented on Fri, 8 Sep 2017 14:06:45 +0000:
Hi Thomas Kelsey Schubert, thank you for having a look at this. I just uploaded the diagnostics from yesterday. The mongodump was started on 7th of September at 5:30 GMT+2 (it shows up in the regular mongodb.log at 3:30) and ran until ~6:40 GMT+2. You can take a look at some of the older diagnostics to see how it runs when the problem is not occurring (the mongodump always starts at 5:30). By the way, we have a filesystem dump running at 5:00 GMT+2 where we do a db.fsyncLock() and unlock - in case you see some activity at that point. Other than that we rarely see any activity during the night. Let me know if you need anything else from us.

thomas.schubert commented on Fri, 8 Sep 2017 13:52:56 +0000:
Hi aba@medei.dk, I've created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time. Please upload the diagnostic.data, as well as the complete logs for the affected mongod. Additionally, it would speed our investigation if you could identify when the mongodump was initiated. Thank you, Kelsey
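Archiving the requested diagnostic data usually amounts to something like the following sketch, assuming the default Linux dbpath of /var/lib/mongodb; substitute whatever --dbpath the affected mongod actually uses.

    # bundle the FTDC diagnostic data from the affected node for upload
    tar -czf diagnostic-data.tar.gz -C /var/lib/mongodb diagnostic.data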
JIRAUSER1269335 commented on Fri, 8 Sep 2017 10:37:22 +0000:
Just to clarify: the workaround that we tried did not work. The problem seems to occur sporadically - we just had another incident this night. Looking at the dumped data from the last two nights, it seems that the dumps fail to extract all data: when inspecting the dumped folders I can see that some of the collections from some databases are not saved (the .bson file is missing) - only their metadata is saved (a quick check for this is sketched at the end of this thread). Is it possible to reopen this issue?

JIRAUSER1269335 commented on Thu, 7 Sep 2017 07:45:32 +0000:
Hi Mark Agarunov, thank you for taking a look at this. It seems that the problem is resurfacing for us. Tonight, while our backups were running, we saw similar symptoms to those I described earlier - the backup ran for over an hour (compared to 3 minutes the previous night) and the disk reads on the VM running the instance doing the backup were through the roof. I have archived the diagnostic data you mentioned. Is there a private place where I can upload the file, so that it is not openly available? I assume you will keep the diagnostic data confidential. Are there other logs I should capture, or anything else I can do to help this investigation?

mark.agarunov commented on Wed, 30 Aug 2017 20:53:23 +0000:
Hello aba@medei.dk, thank you for the report. Unfortunately I haven't been able to reproduce this issue; however, I am glad to hear that you have found a workaround for it. Without diagnostic data from an affected node, we aren't able to diagnose the issue further. If this is still a problem or becomes a problem again, please archive and provide the $dbpath/diagnostic.data from the affected node so we can further diagnose the behavior. Thanks, Mark

JIRAUSER1269335 commented on Wed, 2 Aug 2017 07:51:43 +0000:
Our mongodump process has been running more or less smoothly for the last week, but we just had another incident of our mongodump running for a little over an hour and restarting the mongod instance. I have not figured out exactly what caused the restart of the instance, but it does correlate with the time the mongodump finished. Also, the disk reads were very high, as we saw last time. The disk reads on the machine running the mongodump over the last week are shown below. Notice that the 25th was when we had the first major incident, after which we updated a couple of packages. Since we are now seeing another incident, I am leaning more towards this being something sporadic. I have checked the service health panel of Microsoft Azure, and there do not seem to have been any issues on their end. Let me know if you need any logs or other info.
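As a small aside on the incomplete dumps mentioned above (collections whose metadata file was written but whose .bson file is missing), a quick check is sketched here, assuming mongodump's default dump/<database>/<collection> output layout:

    # list collections whose .bson data file is missing from the dump directory
    for meta in dump/*/*.metadata.json; do
        [ -e "$meta" ] || continue          # no metadata files matched at all
        bson="${meta%.metadata.json}.bson"
        [ -e "$bson" ] || echo "missing data file: $bson"
    done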