...
A replica set in Azure that was deployed a few months ago started to crash on Fri, Jan 8 (log example attached). No upgrade or change to the system or application was made in the last week. The crash happens only when the instance is primary, after a few hours of operation. During these hours memory usage increases, while all other parameters such as connections remain constant (see the attached files). The replica set was on 3.0.6 and was upgraded today to 3.0.8. We also removed the node from the replica set and recreated it, yet the problem continues to happen on this machine. This is one of a few replica sets in Azure (the issue is not reproduced on the others), yet it is the most active one.
moshekaplan commented on Thu, 11 Feb 2016 07:20:16 +0000: The customer is happy with the 3.2.1 installation and is currently not willing to put more effort into this. Thanks for your help!

ramon.fernandez commented on Wed, 10 Feb 2016 20:16:58 +0000: MosheKaplan, without either the ss.log file (from the affected 3.0 node) or the contents of diagnostic.data (from an affected 3.2 node) it's not possible for us to investigate further, so I'm going to close this ticket. If this is still an issue for you, please provide one of the two data options requested above and we'll reopen the ticket to take a closer look. Thanks, Ramón.

ramon.fernandez commented on Mon, 25 Jan 2016 16:50:34 +0000: MosheKaplan, if you're able to observe this behavior on a 3.2 node, can you please upload the contents of the diagnostic.data directory within your dbpath? This directory contains the same information that you collected above in the ss.log file, and should help us understand what's going on.

moshekaplan commented on Mon, 18 Jan 2016 15:49:02 +0000: Checking for that. P.S. The major difference is that in 3.0.8 the cache was not utilized at all, while in 3.2 it is actually utilized. I would look in that direction (a memory leak in the cache).

ramon.fernandez commented on Mon, 18 Jan 2016 15:02:40 +0000: Thanks for the additional information, MosheKaplan; when running the script above you should have ended up with another file, ss.log, which is the one that has the key information that can help debug this issue. Can you please upload it as well?

moshekaplan commented on Mon, 18 Jan 2016 09:55:52 +0000: Some more info:
1. iostat is attached
2. Scaling the machine to 32GB RAM did not help
3. Upgrading to 3.2 made a major improvement

moshekaplan commented on Mon, 18 Jan 2016 09:55:47 +0000: iostat information

ramon.fernandez commented on Mon, 11 Jan 2016 13:24:24 +0000: Sorry you're running into this issue, MosheKaplan.
In order to diagnose this problem, can you please run the following shell script while you reproduce the crash?

    # Delay in seconds
    delay=1
    mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep($delay*1000)}" > ss.log &
    iostat -k -t -x $delay > iostat.log &

You can adjust the delay depending on how long this issue takes to trigger; if it takes, say, 24h, a delay of 5s will keep the resulting files from becoming too large. If you could then upload the ss.log and iostat.log files along with the mongod.log for the affected server, that should give us sufficient information to understand the source of the problem. Thanks, Ramón.
Server details:
- RAM: 14GB
- Data size: 47.5GB (storage size ~15GB)
- cacheSizeGB: 7
- 4 cores
- MongoDB 3.0.8
- OS: CentOS Linux release 7.2.1511 (Core) on Azure, Linux version 3.10.0-229.11.1.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC)) #1 SMP
- Replica set: Primary, Secondary, and Arbiter
- Engine: WiredTiger
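The cacheSizeGB value listed above corresponds to the WiredTiger cache limit in the YAML mongod.conf format used by MongoDB 3.0. A minimal sketch of the relevant fragment (the surrounding config is assumed, only the cache setting is taken from the ticket):

```yaml
# mongod.conf fragment (MongoDB 3.0 YAML config format)
storage:
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      # Caps only WiredTiger's internal cache; total mongod resident
      # memory can exceed this due to connections, in-flight operations,
      # and allocator (tcmalloc) overhead.
      cacheSizeGB: 7
```

Note that this setting bounds only the WiredTiger cache, not total mongod memory, which is consistent with resident memory growing on a 14GB machine even with a 7GB cache limit.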