Symptoms
The critical alarm "SNAT Port Usage On Gateway Is High" is raised continuously for an SNAT IP, even though there are few active connections.
NSX alarm in syslog:
2023-04-14T18:05:10.037Z nsxmgr-03 NSX 5281 MONITORING [nsx@6876 alarmId="927cab4a-e760-42b2-8faa-36f3a55a14b5" alarmState="OPEN" comp="nsx-manager" entId="62a03bb6-fb1e-4e20-983a-a7279c6a0ca6" errorCode="MP701099" eventFeatureName="nat" eventSev="CRITICAL" eventState="On" eventType="snat_port_usage_on_gateway_is_high" level="FATAL" nodeId="62a03bb6-fb1e-4e20-983a-a7279c6a0ca6" subcomp="monitoring"] SNAT ports usage on logical router 42ecb79b-5ad0-470e-bc6d-c3e599c41862 for SNAT IP 10.10.10.10 has reached the high threshold value of 80%. New flows will not be SNATed when usage reaches the maximum limit.
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
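To confirm that a syslog entry is this alarm, the structured fields and the SNAT IP can be pulled out of the line. The following is a minimal sketch, not part of NSX tooling: the sample line is shortened from the excerpt above, and `parse_alarm` is a hypothetical helper name.

```python
import re

# Sample alarm line, shortened from the syslog excerpt in this article.
LINE = (
    '2023-04-14T18:05:10.037Z nsxmgr-03 NSX 5281 MONITORING '
    '[nsx@6876 alarmState="OPEN" eventFeatureName="nat" eventSev="CRITICAL" '
    'eventType="snat_port_usage_on_gateway_is_high"] '
    'SNAT ports usage on logical router 42ecb79b-5ad0-470e-bc6d-c3e599c41862 '
    'for SNAT IP 10.10.10.10 has reached the high threshold value of 80%.'
)

def parse_alarm(line):
    """Hypothetical helper: collect key="value" fields, the SNAT IP, and the threshold."""
    # Structured-data fields appear as key="value" pairs inside the brackets.
    fields = dict(re.findall(r'(\w+)="([^"]*)"', line))
    # The SNAT IP and threshold only appear in the free-text message.
    msg = re.search(
        r'for SNAT IP (\S+) has reached the high threshold value of (\d+)%', line
    )
    if msg:
        fields["snat_ip"] = msg.group(1)
        fields["threshold_pct"] = int(msg.group(2))
    return fields

alarm = parse_alarm(LINE)
print(alarm["eventType"], alarm["snat_ip"], alarm["threshold_pct"])
```

The extracted SNAT IP is the value to search for when counting active flows on the edge node.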
Purpose
A critical alarm is triggered for SNAT port usage on a logical router for a specific SNAT IP. However, when the active flows for that SNAT IP are checked, the flow count is well below the maximum SNAT port limit per IP, which is close to 60,000:
nsxedge-01(tier0_sr[15])> get firewall connection state | count 10.10.10.10
Mon Apr 17 2023 UTC 17:44:48.822
Number of lines that match pattern '10.10.10.10': 15079
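As a rough check, the matched flow count can be compared against the 80% threshold reported in the alarm. This sketch assumes the per-IP limit of about 60,000 stated in this article and treats the number of matching lines as an approximation of ports in use; `snat_usage_pct` is a hypothetical helper name.

```python
ASSUMED_PORT_LIMIT = 60_000  # approximate per-IP SNAT limit cited in this article
HIGH_THRESHOLD_PCT = 80      # threshold reported in the alarm message

def snat_usage_pct(flow_count, port_limit=ASSUMED_PORT_LIMIT):
    """Return estimated SNAT port usage as a percentage of the assumed limit."""
    return 100.0 * flow_count / port_limit

usage = snat_usage_pct(15079)  # flow count from the CLI output above
print(f"{usage:.1f}% used; above threshold: {usage >= HIGH_THRESHOLD_PCT}")
# prints "25.1% used; above threshold: False"
```

At roughly 25% estimated usage, the observed flow count does not justify an 80% threshold alarm.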
If the SNAT port usage limit is actually reached, the 'Failed NAT translation' counter may increment:
nsxedge-01(tier0_sr[15])> get firewall interface stats | find Failed.NAT.trans
Tue May 02 2023 UTC 10:19:42.495
Failed NAT translation : 0
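To tell whether the counter is incrementing, two captures of this output taken some time apart can be compared. The following is a minimal sketch that assumes the output format shown above; `failed_nat_translations` is a hypothetical helper name and the second capture is illustrative only.

```python
import re

def failed_nat_translations(cli_output):
    """Hypothetical helper: return the 'Failed NAT translation' counter, or None if absent."""
    match = re.search(r'Failed NAT translation\s*:\s*(\d+)', cli_output)
    return int(match.group(1)) if match else None

# First capture is copied from the output above; the second is an illustrative later capture.
first = "Tue May 02 2023 UTC 10:19:42.495\nFailed NAT translation : 0"
second = "Failed NAT translation : 0"

before = failed_nat_translations(first)
after = failed_nat_translations(second)
incrementing = after > before
print(f"before={before} after={after} incrementing={incrementing}")
```

A counter that stays flat between captures is consistent with SNAT ports not being exhausted.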
Cause
The alarm is triggered by a software bug.
Impact / Risks
There is no impact to production. The alarm is a false positive.
Resolution
This issue is resolved in the upcoming NSX-T version 4.1.1.
Workaround
"Disable" the alarm under "Alarm Definitions". This should avoid the alarm from appearing. It is safe to do so as the error is happening by bug and not because of SNAT ports running out.