...
In a situation where a virtual machine cannot reach its default gateway, Cisco ACI declares a VMNIC as down, but VMware vCenter displays the same VMNIC as connected.
This issue occurs on fabrics where ESX hosts connect to leaf nodes and a VMware domain has a controller connected to the VMware vCenter that manages those hosts. The following steps trigger the issue:
1. Set a leaf node to Maintenance Mode. The fault "NIC operational state is down" is raised on the HpNics (this is expected).
2. Recommission the leaf node after 10 minutes. The faults persist on the HpNics (this is the issue).
During a fabric upgrade or downgrade, the same sequence can occur when the APICs finish upgrading before the leaf node does, which might trigger this issue. However, if the leaf node is recommissioned within approximately 10 minutes of the NIC down event, a recurring task that checks for host adjacency can capture the NIC state while it is up, and the issue does not occur.
Use any of the following:
1. Shut and no shut the connecting port on the leaf node (or perform any other action that triggers a vCenter "NIC is up" event).
2. Manually trigger an inventory pull (right-click the VMware domain’s controller and choose "Trigger Inventory Sync").
3. Wait for the next daily inventory pull.
This issue is caused by a VMware vCenter issue in which vCenter does not send an event when a physical NIC comes up. In this problem, the last event that vCenter sent was "Physical NIC vmnicX is down", because the connecting leaf node was in Maintenance Mode. On receiving this event, Cisco APIC checks the NIC state on VMware vCenter and marks the HpNic as down. When the leaf node is recommissioned, however, vCenter does not send a corresponding event such as "Physical NIC vmnicX is up", so Cisco APIC does not know when to check the NIC information again until an inventory pull or a port reset occurs (a port reset causes VMware vCenter to send a NIC up event properly).
With the fix for this problem, APIC keeps retrying to get the NIC and adjacency information for any NIC that is used as an uplink and is down, so that APIC detects when the NIC comes back up even if no event is sent. If the NIC stays down for too long, an FSM fault on the retrying task is expected, because the task has failed more than a set number of times. This FSM fault should clear within 10 minutes after the NIC is up.
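The difference between the event-driven behavior and the retry behavior described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `FakeVCenter`, `NicTracker`, `on_event`, `retry_tick`, and the `max_failures` threshold are hypothetical names and values chosen for illustration, and they do not correspond to any actual APIC or vCenter API or to the real failure count.

```python
from dataclasses import dataclass, field


@dataclass
class FakeVCenter:
    """Hypothetical stand-in for vCenter: holds the real NIC state."""
    nic_up: dict = field(default_factory=dict)

    def query_nic(self, name: str) -> bool:
        return self.nic_up[name]


@dataclass
class NicTracker:
    """Hypothetical model of the controller's view of one NIC."""
    vcenter: FakeVCenter
    name: str
    is_uplink: bool = True
    max_failures: int = 3   # assumed threshold, not a documented value
    seen_up: bool = True    # the controller's last known state
    failures: int = 0
    fsm_fault: bool = False

    def _clear(self) -> None:
        self.failures = 0
        self.fsm_fault = False

    def on_event(self) -> None:
        """Event-driven path: re-read the state only when an event arrives."""
        self.seen_up = self.vcenter.query_nic(self.name)
        if self.seen_up:
            self._clear()

    def retry_tick(self) -> None:
        """Retry path: keep re-querying while an uplink NIC is seen as down."""
        if self.seen_up or not self.is_uplink:
            return
        self.seen_up = self.vcenter.query_nic(self.name)
        if self.seen_up:
            self._clear()
        else:
            self.failures += 1
            # Too many failed retries raises a fault on the retry task.
            self.fsm_fault = self.failures > self.max_failures


def demo() -> tuple:
    vc = FakeVCenter({"vmnic0": True})
    tracker = NicTracker(vc, "vmnic0")
    vc.nic_up["vmnic0"] = False
    tracker.on_event()            # "NIC is down" event is delivered
    vc.nic_up["vmnic0"] = True    # NIC recovers, but no "up" event is sent
    stale_view = tracker.seen_up  # still down without a retry
    tracker.retry_tick()          # the retry notices the recovery
    return stale_view, tracker.seen_up
```

In this model, `demo()` shows the controller's view staying down after the silent recovery and being corrected by the first retry. Repeated `retry_tick()` calls while the NIC remains down set `fsm_fault`, which is cleared once a retry sees the NIC up.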