...
A replica set in Azure that was deployed a few months ago started to crash on Fri, Jan 8 (log example attached). No upgrade or change to the system or application was made in the last week. The crash happens only when the instance is primary, after a few hours of operation. During these hours memory usage increases, while all other parameters such as connections remain constant (see the attached files). The replica set was on 3.0.6 and was upgraded today to 3.0.8. We also removed the node from the replica set and recreated it, yet the problem continues to happen on this machine. This is one of a few replica sets in Azure (the issue is not reproduced on the others), yet it is the most active one.
moshekaplan commented on Thu, 11 Feb 2016 07:20:16 +0000: The customer is happy with the 3.2.1 installation and is currently not willing to put more effort into this. Thanks for your help!

ramon.fernandez commented on Wed, 10 Feb 2016 20:16:58 +0000: MosheKaplan, without either the ss.log file (from the affected 3.0 node) or the contents of diagnostic.data (from an affected 3.2 node) it's not possible for us to investigate further, so I'm going to close this ticket. If this is still an issue for you, please provide one of the two data options requested above and we'll reopen the ticket to take a closer look. Thanks, Ramón.

ramon.fernandez commented on Mon, 25 Jan 2016 16:50:34 +0000: MosheKaplan, if you're able to observe this behavior on a 3.2 node, can you please upload the contents of the diagnostic.data directory within your dbpath? This directory contains the same information that you collected above in the ss.log file, and should help us understand what's going on.

moshekaplan commented on Mon, 18 Jan 2016 15:49:02 +0000: Checking for that. P.S. The major difference is that in 3.0.8 the cache was not utilized at all, while in 3.2 it is actually utilized. I would look in that direction (a memory leak in the cache).

ramon.fernandez commented on Mon, 18 Jan 2016 15:02:40 +0000: Thanks for the additional information, MosheKaplan; when running the script above you should have ended up with another file, ss.log, which is the one that has the key information that can help debug this issue. Can you please upload it as well?

moshekaplan commented on Mon, 18 Jan 2016 09:55:52 +0000: Some more info:
1. iostat is attached
2. Scaling the machine to 32GB RAM did not help
3. Upgrading to 3.2 made a major improvement

moshekaplan commented on Mon, 18 Jan 2016 09:55:47 +0000: iostat information

ramon.fernandez commented on Mon, 11 Jan 2016 13:24:24 +0000: Sorry you're running into this issue, MosheKaplan.
In order to diagnose this problem, can you please run the following shell script while you reproduce the crash?

    # Delay in seconds
    delay=1
    mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep($delay*1000)}" > ss.log &
    iostat -k -t -x $delay > iostat.log &

You can adjust the delay depending on how long this issue takes to trigger; if it takes, say, 24h, a delay of 5s will keep the resulting files from becoming too large. If you could then upload the ss.log and iostat.log files along with the mongod.log for the affected server, that should give us sufficient information to understand the source of the problem. Thanks, Ramón.
Server details:
- RAM: 14GB
- Data size: 47.5GB (storage size ~15GB)
- cacheSizeGB: 7
- 4 cores
- MongoDB 3.0.8
- OS: CentOS Linux release 7.2.1511 (Core) on Azure, Linux version 3.10.0-229.11.1.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC)) #1 SMP
- Replica set: Primary, Secondary, and Arbiter
- Engine: WiredTiger
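The cacheSizeGB value listed above corresponds to the WiredTiger cache limit in the YAML mongod.conf format used by MongoDB 3.0. A minimal sketch of the relevant fragment (the surrounding config is assumed, only the cache setting is taken from the ticket):

```yaml
# mongod.conf fragment (MongoDB 3.0 YAML config format)
storage:
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      # Caps only WiredTiger's internal cache; total mongod resident
      # memory can exceed this due to connections, in-flight operations,
      # and allocator (tcmalloc) overhead.
      cacheSizeGB: 7
```

Note that this setting bounds only the WiredTiger cache, not total mongod memory, which is consistent with resident memory growing on a 14GB machine even with a 7GB cache limit.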