...
The customer was continuously receiving the following error (other interfaces were also specified):
    %ETHPORT-3-IF_ERROR_VLANS_SUSPENDED: VLAN/BDs 476 on Interface Ethernet111/1/11 are being suspended. (Reason: IFTMC PD commit db search failed)
It was followed by TCAM exhaustion errors (many other interfaces were cited):
    %IFTMC-SLOT4-2-IFTMC_RES_ALLOC_FAIL: IFTMC resource allocation failure: Ethernet108/1/3 failed HW table allocation. In asic 2 No. of PMTCAM left 0 Total 12240
We advised the customer to scale back the allowed VLANs on the trunk ports between their servers and the FEX. They did so, and the errors stopped. However, the TCAM exhaustion had caused incorrect VLAN assignments to be created, apparently through a corrupted dynamic MAC address table entry. The entry was assigned to VLAN 11 for some reason. I shut/no-shut interface E111/1/38, which cleared the corrupted entry, and the correct VLAN 2801 was assigned after the flap. After ~5 minutes, however, the entry reverted to the incorrect VLAN 11 assignment. Note that VLAN 11 is not even allowed on the trunk:
    switchport trunk allowed vlan 30,62,100-101,219,300-308,310,316-317
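For anyone triaging similar logs, the remaining PMTCAM count can be pulled out of the IFTMC message with a small script. This is only a sketch: the pattern and the helper name parse_pmtcam_failure are hypothetical, written against the single sample line quoted in this report (with the line-wrap spaces removed), and other NX-OS releases may format the message differently.

```python
import re

# Sample line from this report, with the line-wrap spaces removed.
SAMPLE = (
    "%IFTMC-SLOT4-2-IFTMC_RES_ALLOC_FAIL: IFTMC resource allocation failure: "
    "Ethernet108/1/3 failed HW table allocation. "
    "In asic 2 No. of PMTCAM left 0 Total 12240"
)

# Assumption: the message layout matches the one sample above.
PMTCAM_RE = re.compile(
    r"%IFTMC-SLOT(?P<slot>\d+)-2-IFTMC_RES_ALLOC_FAIL:.*?"
    r"(?P<interface>Ethernet[\d/]+) failed HW table allocation\.\s*"
    r"In asic (?P<asic>\d+) No\. of PMTCAM left (?P<left>\d+) "
    r"Total (?P<total>\d+)"
)


def parse_pmtcam_failure(line):
    """Return the fields of an IFTMC allocation failure, or None if no match."""
    match = PMTCAM_RE.search(line)
    if match is None:
        return None
    return {
        "slot": int(match.group("slot")),
        "interface": match.group("interface"),
        "asic": int(match.group("asic")),
        "pmtcam_left": int(match.group("left")),
        "pmtcam_total": int(match.group("total")),
    }


if __name__ == "__main__":
    print(parse_pmtcam_failure(SAMPLE))
```

A pmtcam_left value of 0, as in the sample, is the exhaustion condition described in this report.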
The enhancement we want is a message displayed at the CLI when the customer tries to configure more than 75 allowed VLANs on a FEX trunk port, or when they exceed 2,000 total VLANs per FEX. The triggering conditions are therefore normal operating conditions, at the point where the end user exceeds the supported VLAN limits per FEX device for the platform, port, and software version in question.
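The requested check could look roughly like the sketch below. It is not NX-OS code: the function names are hypothetical, the limits are the figures quoted in this report, and it assumes the per-FEX total is the sum of allowed VLANs across that FEX's trunk ports (a count of distinct VLANs could be substituted if that is how the limit is defined).

```python
# Limits as quoted in this report; they vary by platform and software version.
MAX_VLANS_PER_FEX_TRUNK = 75
MAX_VLANS_PER_FEX = 2000


def expand_vlan_list(allowed):
    """Expand a 'switchport trunk allowed vlan' string into a set of VLAN IDs."""
    vlans = set()
    for part in allowed.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            vlans.update(range(start, end + 1))
        else:
            vlans.add(int(part))
    return vlans


def check_fex_limits(trunks,
                     per_port_limit=MAX_VLANS_PER_FEX_TRUNK,
                     per_fex_limit=MAX_VLANS_PER_FEX):
    """Return warnings for one FEX.

    trunks maps interface name -> allowed-VLAN string.
    Assumption: the per-FEX total is summed across ports.
    """
    warnings = []
    total = 0
    for ifname, allowed in sorted(trunks.items()):
        count = len(expand_vlan_list(allowed))
        total += count
        if count > per_port_limit:
            warnings.append(
                f"{ifname}: {count} allowed VLANs exceeds limit of {per_port_limit}"
            )
    if total > per_fex_limit:
        warnings.append(
            f"FEX total: {total} VLANs exceeds limit of {per_fex_limit}"
        )
    return warnings


if __name__ == "__main__":
    fex = {
        "Ethernet111/1/38": "30,62,100-101,219,300-308,310,316-317",
        "Ethernet111/1/11": "1-76",  # hypothetical over-limit port
    }
    for warning in check_fex_limits(fex):
        print(warning)
```

Whether the CLI should only warn or reject the configuration outright is a policy choice; the sketch only reports violations.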
The only workaround was to: 1) limit each FEX trunk port to 75 or fewer allowed VLANs; 2) limit each FEX device to 2,000 VLANs; 3) reload the device if these limits have been exceeded (as the customer did). (Note that reloading the affected module(s) may have the same effect.)
Our customer experienced TCAM exhaustion on their Nexus c7710 because they exceeded the supported number of VLANs on their N77-F348XP-23 module. Six 48-port FEX devices were connected to the module, and more than ~75 VLANs were configured per trunk port. This resulted in TCAM exhaustion and subsequent TCAM corruption. The event was highly visible to the customer's management and executives, as well as within Cisco. Various BU engineers are re-creating the issue, as the TCAM corruption that persisted afterward was catastrophic.
A sniffer capture from the compute side, as well as an ELAM capture at the module, showed the VLAN being assigned correctly as VLAN 2801. The dynamic MAC address table, as well as the hardware programming, had the VLAN as VLAN 11. (Again, not even allowed on the trunk.) Clearing the dynamic MAC address table would temporarily alleviate the issue (~5 minutes), but the assignment would then switch back to VLAN 11. BU engineers, including Tejas Nagarmat (tnagarma), joined a WebEx with myself and the HTEs and actively troubleshot the issue. They found that the TCAM exhaustion had also caused TCAM corruption, and that the device needed to be reloaded. (Note: the module could have been reloaded instead, with similar results. The customer asked for a full reload of the device.)
This enhancement request asks that the end user/engineer either be warned not to exceed 50 or 75 VLANs per trunk (depending on software version) and 2,000 VLANs per FEX, or that the configuration simply not be accepted.