...
1. High memory usage (~90%) is reported on the Edge, followed by "out of memory" errors.
2. The Edge becomes unresponsive and unmanageable.
3. The Edge reboots automatically.
4. Critical warnings are reported in the UI (NSX Edge is out of memory. The Edge is rebooting in 3 seconds. Top 5 processes are: {#}.)
The purpose of this article is to raise awareness of a known issue in which open-vm-tools leaks memory on the NSX Edge. The issue is addressed in the NSX 6.4.8 release, which ships open-vm-tools version 11.0.5.
This issue is caused by open-vm-tools running on the NSX Edge, which leaks memory and drives memory usage up. Eventually the NSX Edge becomes unmanageable and reboots automatically as part of auto-recovery. Sometimes the Edge must be rebooted manually to clear the high memory usage if an automatic reboot does not occur.

To verify this issue (log snippets vary from NSX Edge to Edge and from version to version):
1. Check for critical alarms in the UI (NSX Edge is out of memory. The Edge is rebooting in 3 seconds. Top 5 processes are: {#}.)
2. Check the NSX Edge logs and verify the memory usage warnings:

2020-06-06T05:22:55+00:00 NSX Edge MsgMgr[1778]: [default]: [daemon.info] payload len:368 data:{"systemEvents":[{"moduleName":"vShield Edge Appliance","severity":"Critical","eventCode":"30149","message":"vShield Edge memory over used","timestamp":1591420975,"metaData":{"message":"Memory usage: 90.05%","details":" 1772 390456 987680 vmtoolsd 801 4156 200708 syslog-ng 7652 3716 67060 sync_path.pl 7406 2808 14144 sh 7382 2752 14140 runevery.sh "}}]
2020-06-06T05:26:18+00:00 NSX Edge kernel[]: [default]: [kern.warning] dcsms invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
2020-06-06T05:26:18+00:00 NSX Edge kernel[]: [default]: [kern.err] Out of memory: Kill process 1772 (vmtoolsd) score 817 or sacrifice child
2020-06-06T05:26:18+00:00 NSX Edge kernel[]: [default]: [kern.err] Killed process 1772 (vmtoolsd) total-vm:987756kB, anon-rss:384548kB, file-rss:244kB
2020-06-06T05:26:18+00:00 NSX Edge OOMChecker[1780]: [default]: [daemon.warning] OOM, top 5 memory used processes: 8347 54180 116972 VseEventProcess 801 3824 200708 syslog-ng 8349 2780 14144 sh 1780 2628 39876 VseOOMChecker.p 7652 2548 67060 sync_path.pl
2020-06-06T05:26:18+00:00 NSX Edge config[]: [default]: [daemon.info] INFO :: Utils :: ha: UpdateHaResourceFlags:
2020-06-06T05:26:18+00:00 NSX Edge MsgMgr[1778]: [default]: [daemon.info] Building event message
2020-06-06T05:26:18+00:00 NSX Edge MsgMgr[1778]: [default]: [daemon.info] correlation id:Event_502a1cba-7f73-2be8-49a6-5b96ce953aaf1591421178
2020-06-06T05:26:18+00:00 NSX Edge MsgMgr[1778]: [default]: [daemon.info] payload len:360 data:{"systemEvents":[{"severity":"Critical","message":"OOM happened, system rebooting in 3 seconds...","metaData":{"message":" 8347 54180 116972 VseEventProcess 801 3824 200708 syslog-ng 8349 2780 14144 sh 1780 2628 39876 VseOOMChecker.p 7652 2548 67060 sync_path.pl "},"timestamp":1591421178,"eventCode":30180,"moduleName":"vShield Edge Appliance"}]}
2020-06-06T05:26:21+00:00 NSX Edge shutdown[8425]: [default]: [user.notice] shutting down for system reboot

vsm.log:

2020-06-06 13:28:18.992 XXX INFO SimpleAsyncTaskExecutor-1 EventServiceImpl:119 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] [SystemEvent] Time:'Sat Jun 06 13:27:21.000 xxx 2020', Severity:'Informational', Event Source:'edge-xxxxxxxx-xxxxxxxxx-xxxx-xxxxxxxxxxxx', Code:'30101', Event Message:'NSX Edge was booted', Module:'vShield Edge Appliance', Universal Object:'false

Edge system processes via the top command:

USER PID PPID %CPU %MEM VSZ RSS NI TTY STAT STIME TIME COMMAND
root 1175 1147 0.0 67.3 943376 335868 0 ? Sl 2018 07:05:47 /usr/local/bin/vmtoolsd --plugin-path=/usr/local/lib/open-vm-tools/plugins/vmsvc/
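The log check above can also be scripted once the Edge log has been exported. The following is a minimal sketch only: the function name `scan_edge_log` and both regular expressions are assumptions derived from the sample lines in this article, and other NSX versions may word these messages differently.

```python
import re

# Patterns modeled on the sample log lines in this article (assumption:
# other Edge versions may format these messages differently).
MEM_USAGE_RE = re.compile(r'"message":"Memory usage: (\d+(?:\.\d+)?)%"')
OOM_KILL_RE = re.compile(r"Out of memory: Kill process (\d+) \(([^)]+)\)")


def scan_edge_log(lines):
    """Return (memory usage percentages, OOM-killed processes) found in lines."""
    usages = []
    killed = []
    for line in lines:
        usage = MEM_USAGE_RE.search(line)
        if usage:
            usages.append(float(usage.group(1)))
        kill = OOM_KILL_RE.search(line)
        if kill:
            killed.append((int(kill.group(1)), kill.group(2)))
    return usages, killed


# Two lines abbreviated from the sample log in this article.
sample = [
    '2020-06-06T05:22:55+00:00 NSX Edge MsgMgr[1778]: [default]: [daemon.info] '
    'data:{"systemEvents":[{"eventCode":"30149","metaData":{"message":"Memory usage: 90.05%"}}]',
    '2020-06-06T05:26:18+00:00 NSX Edge kernel[]: [default]: [kern.err] '
    'Out of memory: Kill process 1772 (vmtoolsd) score 817 or sacrifice child',
]
usages, killed = scan_edge_log(sample)
print(usages, killed)
```

If `vmtoolsd` appears among the killed processes together with memory usage readings near 90%, the log is consistent with the pattern described in this article, although that alone does not prove the leak is the cause.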
The NSX Edge becomes unmanageable and the services running on it are impacted.
Currently, the resolution is open-vm-tools version 11.0.5, which ships in the NSX 6.4.8 release. Customers must upgrade NSX Manager and the other NSX components to version 6.4.8.
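As a small illustration of the version check, the sketch below compares an NSX version string against 6.4.8. The helper names `parse_version` and `needs_upgrade` are hypothetical, the version string is assumed to be entered by hand in dotted numeric form, and treating every release after 6.4.8 as containing the fix is an assumption of this sketch.

```python
def parse_version(text):
    """Turn a dotted version such as '6.4.8' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


# Release named in this article as shipping open-vm-tools 11.0.5.
FIXED_NSX_VERSION = parse_version("6.4.8")


def needs_upgrade(nsx_version):
    """Return True if the given NSX version is older than 6.4.8."""
    return parse_version(nsx_version) < FIXED_NSX_VERSION


print(needs_upgrade("6.4.6"))   # older than 6.4.8
print(needs_upgrade("6.4.8"))   # the release with the fix
print(needs_upgrade("6.4.10"))  # compared numerically, not as text
```

Comparing integer tuples instead of raw strings avoids ranking "6.4.10" below "6.4.8".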
The only workaround is to reboot the Edge when the UI shows a memory usage warning. The condition becomes critical when the Edge constantly reports high memory usage. A critical alarm such as Alarm 30180 (OOM) is then generated. If you see this alarm in the UI, an OOM has occurred and the system will try to recover through an automatic reboot. For a list of critical alarms and system events, see https://docs.vmware.com/en/VMware-NSX-Data-Center-for-vSphere/6.4/com.vmware.nsx.logging.doc/GUID-4CAA25F7-1EE7-4B8A-957E-52865F723C10.html
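The triage described in the workaround can be sketched as a small script that reads a `systemEvents` payload and maps the event code to the action this article suggests. The function name `triage` and the `ACTIONS` table are hypothetical. Event codes 30149 and 30180 are the ones shown in this article's sample logs, and the code is converted with `int()` because the samples log it once as a string and once as a number.

```python
import json

# Actions paraphrased from this article; the mapping itself is illustrative.
ACTIONS = {
    30149: "memory over used: reboot the Edge manually",
    30180: "OOM occurred: the Edge attempts an automatic reboot",
}


def triage(payload):
    """Return a list of (event code, suggested action) for a systemEvents payload."""
    events = json.loads(payload)["systemEvents"]
    results = []
    for event in events:
        code = int(event["eventCode"])  # string in some events, integer in others
        results.append((code, ACTIONS.get(code, "not covered by this sketch")))
    return results


# Payload abbreviated from the OOM event in this article's sample log.
sample_payload = (
    '{"systemEvents":[{"severity":"Critical",'
    '"message":"OOM happened, system rebooting in 3 seconds...",'
    '"timestamp":1591421178,"eventCode":30180,'
    '"moduleName":"vShield Edge Appliance"}]}'
)
result = triage(sample_payload)
print(result)
```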