DFW rules are missing in host transport nodes.

*Relevant log locations*:

In the CCP log, DFW rules in "RULE_SECTION" are shown as deleted during a Full Sync.

In directory nsx_manager/var/log/cloudnet:

$ grep "Full sync entity update for RULE*" nsx-ccp-20210727*
nsx-ccp-20210727-010852714.log:2021-07-26T19:03:51.310Z INFO MessageProcessor FullSyncTransactionCreator - - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="adapter-mp"] Full sync entity update for RULE_SECTION: Added 0, Deleted 0, Updated 7374
nsx-ccp-20210727-010852714.log:2021-07-26T19:03:51.480Z INFO MessageProcessor FullSyncTransactionCreator - - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="adapter-mp"] Full sync entity update for RULE: Added 879, Deleted 1378, Updated 56935

Note that these messages are generic and indicate only that DFW rules were deleted during a Full Sync. They point to a problem only if no DFW rule was actually deleted in NSX-T Manager. The following messages relate to the race condition that causes stale configuration in the in-memory cache.

1.
In the Proton log, a Full Sync started at 2021-07-29T10:55:19:

nsx_manager/var/log/proton
2021-07-29T10:55:19.237Z INFO FullSyncMsgLoader ProfileFullSyncMessageProvider - UPM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Converting Profile [type = SystemHealthPluginProfile, id = UpmProfile/cxxxxxxf-0xx5-4xxa-bxx9-fxxxxxxxxxx9] to protobuf
2021-07-29T10:55:19.237Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message UpmProfile/UpmProfile/3xxxxxx0-dxx9-4xx4-8xx0-7xxxxxxxxxx3
2021-07-29T10:55:19.237Z INFO FullSyncMsgLoader ProfileFullSyncMessageProvider - UPM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Converting Profile [type = SystemHealthPluginProfile, id = UpmProfile/3xxxxxx0-dxx9-4xx4-8xx0-7xxxxxxxxxx3] to protobuf
2021-07-29T10:55:19.237Z INFO FullSyncMsgLoader ProfileFullSyncMessageProvider - UPM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Converting Profile [type = SystemHealthPluginProfile, id = UpmProfile/axxxxxxb-0xxd-4xxe-axx8-fxxxxxxxxxx8] to protobuf

2. In the Proton log, a new Sync Leader took over at 2021-07-29T10:55:58:

/nsx_manager/var/log/proton
2021-07-29T10:55:58.850Z INFO pool-410-thread-1 WaitingForLeadershipLostAckState - - [nsx@6876 comp="gmle-client" level="INFO" subcomp="WaitingForLeadershipLostAckState"] Entering WaitingForLeadershipLostAckState for lease id <UUID> service TraceflowCacheManager on member <UUID> of group <UUID> from AcquiredState : LeadershipLease{serviceName=TraceflowCacheManager, leaderId=<UUID>, leaseVersion=6560, revocationCount=0, leaseId=<UUID>}

3.
In the Proton log, StateSync tried to stop loading messages into the in-memory cache:

proton$ grep 2021-07-29T10:55:59.845 nsxapi.*
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor StateSyncShardingCoordinator - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Starting shutdown for leave() processing
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor CorfuDbTransactionLogHandler - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Stopping Corfu Transaction Log Processor
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor AbstractFullSyncCommunicator - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] FullSyncService ccp full sync communicator bean is stopping.
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Stopping FullStateSyncDataBuilder
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Clearing cache
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor FullStateSyncCache - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full state sync cache stats.

4. In the Proton log, Fullsync continued loading messages for 5 more seconds, and then a fatal error occurred at 2021-07-29T10:56:04. Please note that the fatal error does not always occur.
/nsx_manager/var/log/proton
proton$ grep AbstractFullStateSyncDataBuilder nsxapi.1.log | grep 2021-07-29T10:55:59 | head -5
2021-07-29T10:55:59.015Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:55:59.046Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:55:59.071Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:55:59.094Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:55:59.114Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>

proton$ grep AbstractFullStateSyncDataBuilder nsxapi.1.log | tail -10
2021-07-29T10:56:04.019Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:56:04.043Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:56:04.062Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:56:04.067Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] closeTransaction
2021-07-29T10:56:04.067Z FATAL FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" errorCode="MP4705" level="FATAL" subcomp="manager"] Unexpected error in full state sync processor
    at com.vmware.nsx.management.statesync.full.AbstractFullStateSyncDataBuilder.loadMessage(AbstractFullStateSyncDataBuilder.java:382) ~[ccp-1.0.jar:?]
    at com.vmware.nsx.management.statesync.full.AbstractFullStateSyncDataBuilder.loadSpecial(AbstractFullStateSyncDataBuilder.java:363) ~[ccp-1.0.jar:?]
    at com.vmware.nsx.management.statesync.full.AbstractFullStateSyncDataBuilder.loadData(AbstractFullStateSyncDataBuilder.java:275) ~[ccp-1.0.jar:?]
    at com.vmware.nsx.management.statesync.full.AbstractFullStateSyncDataBuilder.run(AbstractFullStateSyncDataBuilder.java:217) ~[ccp-1.0.jar:?]
2021-07-29T10:56:04.067Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Exiting MessageLoader
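The race signature in steps 3 and 4 (full-sync loads continuing after the cache was cleared) can be checked mechanically by comparing timestamps. The sketch below is illustrative only: the function name is hypothetical, and the sample lines are condensed from the excerpts above; it assumes the ISO-8601 timestamps shown in the proton log.

```python
from datetime import datetime

# Timestamp format used at the start of each proton log line, e.g.
# "2021-07-29T10:55:59.845Z" (24 characters).
TS_FMT = "%Y-%m-%dT%H:%M:%S.%fZ"

def loads_after_cache_clear(lines):
    """Return 'Loading message' entries timestamped after 'Clearing cache'."""
    clear_ts = None
    for line in lines:
        if "Clearing cache" in line:
            clear_ts = datetime.strptime(line[:24], TS_FMT)
    if clear_ts is None:
        return []
    return [line for line in lines
            if "Loading message" in line
            and datetime.strptime(line[:24], TS_FMT) > clear_ts]

# Condensed sample lines modeled on the excerpts above.
sample = [
    '2021-07-29T10:55:59.114Z INFO FullSyncMsgLoader ... Loading message FirewallSection/...',
    '2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor ... Clearing cache',
    '2021-07-29T10:56:04.019Z INFO FullSyncMsgLoader ... Loading message FirewallSection/...',
]
suspect = loads_after_cache_clear(sample)
print(len(suspect))  # 1: the full sync kept loading after the cache was cleared
```

Any non-empty result indicates the overlap described in step 4, whether or not the FATAL error follows.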
This issue is due to a rare race condition that causes the NSX Manager to fail to publish updated rules to the transport nodes. If the NSX Manager holding the stale configuration becomes the Sync Leader, host transport nodes receive incorrect configuration. The sequence of events causing the race condition is as follows:

1. A full sync starts, and the Sync Leader begins loading the configuration into the in-memory cache.
2. Before the configuration is fully loaded, another node takes over as the Sync Leader. The full sync therefore loads only part of the configuration, leaving stale configuration in the in-memory cache.
3. When the node with the stale configuration becomes the Sync Leader again, it pushes the stale data to the host transport nodes.

*Note: The time span between steps 2 and 3 may be a few days or weeks.
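The sequence above can be illustrated with a toy simulation. This is not NSX code: the class and method names below are hypothetical stand-ins that only model the ordering of events (interrupted load, then re-election, then publication of the partial cache).

```python
class SyncNode:
    """Toy model of a Manager node's full-sync cache (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.cache = []
        self.is_leader = False

    def full_sync_load(self, config, interrupt_after=None):
        """Load config into the cache; optionally lose leadership mid-load."""
        self.cache = []
        for i, item in enumerate(config):
            if interrupt_after is not None and i == interrupt_after:
                self.is_leader = False  # step 2: another node takes over mid-load
                return                  # cache is now partial (stale)
            self.cache.append(item)

    def publish(self):
        """Push whatever is in the cache to the transport nodes."""
        return list(self.cache)

config = ["rule-A", "rule-B", "rule-C", "rule-D"]

node1 = SyncNode("manager-1")
node1.is_leader = True                            # step 1: full sync starts
node1.full_sync_load(config, interrupt_after=2)   # step 2: load interrupted

# Days or weeks later, node1 becomes Sync Leader again (step 3)
node1.is_leader = True
print(node1.publish())  # ['rule-A', 'rule-B'] -- stale; rules C and D are missing
```

The point of the sketch is that nothing in step 3 re-validates the cache: whatever survived the interrupted load is published as if it were complete.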
This issue is resolved in NSX-T 3.1.3.1. It affects all NSX-T versions up to and including 3.1.3. Because State Sync is not used in 3.2.0, this issue does not apply to 3.2.0 and later versions.
After a Manager loses its leadership during a full sync, restart proton via the CLI:

/etc/init.d/proton restart
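To know when the workaround is needed, the leadership-transition message from step 2 can be watched for. A minimal sketch, assuming the message text shown in the excerpt above; the function name is hypothetical, and the restart itself is only indicated in a comment:

```python
# Marker emitted by gmle-client when a Manager loses full-sync leadership
# (taken from the proton log excerpt above).
LEADERSHIP_LOST_MARKER = "Entering WaitingForLeadershipLostAckState"

def needs_proton_restart(log_lines):
    """Return True if the proton log shows a full-sync leadership loss."""
    return any(LEADERSHIP_LOST_MARKER in line for line in log_lines)

# Sample line modeled on the leadership-change excerpt above.
sample = [
    '2021-07-29T10:55:58.850Z INFO pool-410-thread-1 WaitingForLeadershipLostAckState '
    '- - [nsx@6876 comp="gmle-client"] Entering WaitingForLeadershipLostAckState '
    'for lease id <UUID> service TraceflowCacheManager ...',
]
if needs_proton_restart(sample):
    # On the affected Manager, apply the workaround:
    #   /etc/init.d/proton restart
    print("leadership loss detected; restart proton on this Manager")
```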