DFW rules are missing in host transport nodes.

*Relevant log locations*:

In the CCP log, DFW rules in "RULE_SECTION" are shown as deleted during a Full Sync.

In directory nsx_manager/var/log/cloudnet:

$ grep "Full sync entity update for RULE*" nsx-ccp-20210727*
nsx-ccp-20210727-010852714.log:2021-07-26T19:03:51.310Z INFO MessageProcessor FullSyncTransactionCreator - - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="adapter-mp"] Full sync entity update for RULE_SECTION: Added 0, Deleted 0, Updated 7374
nsx-ccp-20210727-010852714.log:2021-07-26T19:03:51.480Z INFO MessageProcessor FullSyncTransactionCreator - - [nsx@6876 comp="nsx-controller" level="INFO" subcomp="adapter-mp"] Full sync entity update for RULE: Added 879, Deleted 1378, Updated 56935

Note that these messages are generic and indicate only that DFW rules were deleted during a Full Sync. They point to a problem only if no DFW rule was actually deleted in NSX-T Manager. The following messages relate to the race condition that causes stale configuration in the in-memory cache.

1.
In the Proton log, a Full Sync started at 2021-07-29T10:55:19:

nsx_manager/var/log/proton
2021-07-29T10:55:19.237Z INFO FullSyncMsgLoader ProfileFullSyncMessageProvider - UPM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Converting Profile [type = SystemHealthPluginProfile, id = UpmProfile/cxxxxxxf-0xx5-4xxa-bxx9-fxxxxxxxxxx9] to protobuf
2021-07-29T10:55:19.237Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message UpmProfile/UpmProfile/3xxxxxx0-dxx9-4xx4-8xx0-7xxxxxxxxxx3
2021-07-29T10:55:19.237Z INFO FullSyncMsgLoader ProfileFullSyncMessageProvider - UPM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Converting Profile [type = SystemHealthPluginProfile, id = UpmProfile/3xxxxxx0-dxx9-4xx4-8xx0-7xxxxxxxxxx3] to protobuf
2021-07-29T10:55:19.237Z INFO FullSyncMsgLoader ProfileFullSyncMessageProvider - UPM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Converting Profile [type = SystemHealthPluginProfile, id = UpmProfile/axxxxxxb-0xxd-4xxe-axx8-fxxxxxxxxxx8] to protobuf

2. In the Proton log, a new Sync Leader took over at 2021-07-29T10:55:58:

/nsx_manager/var/log/proton
2021-07-29T10:55:58.850Z INFO pool-410-thread-1 WaitingForLeadershipLostAckState - - [nsx@6876 comp="gmle-client" level="INFO" subcomp="WaitingForLeadershipLostAckState"] Entering WaitingForLeadershipLostAckState for lease id <UUID> service TraceflowCacheManager on member <UUID> of group <UUID> from AcquiredState : LeadershipLease{serviceName=TraceflowCacheManager, leaderId=<UUID>, leaseVersion=6560, revocationCount=0, leaseId=<UUID>}

3.
In the Proton log, StateSync tried to stop loading messages into the in-memory cache:

proton$ grep 2021-07-29T10:55:59.845 nsxapi.*
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor StateSyncShardingCoordinator - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Starting shutdown for leave() processing
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor CorfuDbTransactionLogHandler - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Stopping Corfu Transaction Log Processor
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor AbstractFullSyncCommunicator - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] FullSyncService ccp full sync communicator bean is stopping.
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Stopping FullStateSyncDataBuilder
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Clearing cache
nsxapi.1.log:2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor FullStateSyncCache - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full state sync cache stats.

4. In the Proton log, Fullsync continued loading messages for 5 more seconds, and then a fatal error occurred at 2021-07-29T10:56:04. Please note that the fatal error does not always occur.
/nsx_manager/var/log/proton
proton$ grep AbstractFullStateSyncDataBuilder nsxapi.1.log | grep 2021-07-29T10:55:59 | head -5
2021-07-29T10:55:59.015Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:55:59.046Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:55:59.071Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:55:59.094Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:55:59.114Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>

proton$ grep AbstractFullStateSyncDataBuilder nsxapi.1.log | tail -10
2021-07-29T10:56:04.019Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:56:04.043Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:56:04.062Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Loading message FirewallSection/FirewallSection/<UUID>
2021-07-29T10:56:04.067Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] closeTransaction
2021-07-29T10:56:04.067Z FATAL FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" errorCode="MP4705" level="FATAL" subcomp="manager"] Unexpected error in full state sync processor
    at com.vmware.nsx.management.statesync.full.AbstractFullStateSyncDataBuilder.loadMessage(AbstractFullStateSyncDataBuilder.java:382) ~[ccp-1.0.jar:?]
    at com.vmware.nsx.management.statesync.full.AbstractFullStateSyncDataBuilder.loadSpecial(AbstractFullStateSyncDataBuilder.java:363) ~[ccp-1.0.jar:?]
    at com.vmware.nsx.management.statesync.full.AbstractFullStateSyncDataBuilder.loadData(AbstractFullStateSyncDataBuilder.java:275) ~[ccp-1.0.jar:?]
    at com.vmware.nsx.management.statesync.full.AbstractFullStateSyncDataBuilder.run(AbstractFullStateSyncDataBuilder.java:217) ~[ccp-1.0.jar:?]
2021-07-29T10:56:04.067Z INFO FullSyncMsgLoader AbstractFullStateSyncDataBuilder - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Exiting MessageLoader
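The race signature in steps 3 and 4 (full-sync loads continuing after the cache was cleared) can be checked mechanically by comparing timestamps. The sketch below is illustrative only: the function name is hypothetical, and the sample lines are condensed from the excerpts above; it assumes the ISO-8601 timestamps shown in the proton log.

```python
from datetime import datetime

# Timestamp format used at the start of each proton log line, e.g.
# "2021-07-29T10:55:59.845Z" (24 characters).
TS_FMT = "%Y-%m-%dT%H:%M:%S.%fZ"

def loads_after_cache_clear(lines):
    """Return 'Loading message' entries timestamped after 'Clearing cache'."""
    clear_ts = None
    for line in lines:
        if "Clearing cache" in line:
            clear_ts = datetime.strptime(line[:24], TS_FMT)
    if clear_ts is None:
        return []
    return [line for line in lines
            if "Loading message" in line
            and datetime.strptime(line[:24], TS_FMT) > clear_ts]

# Condensed sample lines modeled on the excerpts above.
sample = [
    '2021-07-29T10:55:59.114Z INFO FullSyncMsgLoader ... Loading message FirewallSection/...',
    '2021-07-29T10:55:59.845Z INFO StateSyncShardingCoordinator-leadership-executor ... Clearing cache',
    '2021-07-29T10:56:04.019Z INFO FullSyncMsgLoader ... Loading message FirewallSection/...',
]
suspect = loads_after_cache_clear(sample)
print(len(suspect))  # 1: the full sync kept loading after the cache was cleared
```

Any non-empty result indicates the overlap described in step 4, whether or not the FATAL error follows.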
This issue is due to a rare race condition that causes the NSX Manager to fail to publish updated rules to the transport nodes. If the NSX Manager holding the stale configuration becomes the Sync Leader, host transport nodes receive incorrect configuration. The sequence of events causing the race condition is as follows:

1. A full sync starts, and the Sync Leader begins loading the configuration into the in-memory cache.
2. Before the configuration is fully loaded, another node takes over as the Sync Leader. The full sync therefore loads only part of the configuration, leaving stale configuration in the in-memory cache.
3. When the node with the stale configuration becomes the Sync Leader again, it pushes the stale data to the host transport nodes.

*Note: The time span between steps 2 and 3 may be a few days or weeks.
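The sequence above can be illustrated with a toy simulation. This is not NSX code: the class and method names below are hypothetical stand-ins that only model the ordering of events (interrupted load, then re-election, then publication of the partial cache).

```python
class SyncNode:
    """Toy model of a Manager node's full-sync cache (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.cache = []
        self.is_leader = False

    def full_sync_load(self, config, interrupt_after=None):
        """Load config into the cache; optionally lose leadership mid-load."""
        self.cache = []
        for i, item in enumerate(config):
            if interrupt_after is not None and i == interrupt_after:
                self.is_leader = False  # step 2: another node takes over mid-load
                return                  # cache is now partial (stale)
            self.cache.append(item)

    def publish(self):
        """Push whatever is in the cache to the transport nodes."""
        return list(self.cache)

config = ["rule-A", "rule-B", "rule-C", "rule-D"]

node1 = SyncNode("manager-1")
node1.is_leader = True                            # step 1: full sync starts
node1.full_sync_load(config, interrupt_after=2)   # step 2: load interrupted

# Days or weeks later, node1 becomes Sync Leader again (step 3)
node1.is_leader = True
print(node1.publish())  # ['rule-A', 'rule-B'] -- stale; rules C and D are missing
```

The point of the sketch is that nothing in step 3 re-validates the cache: whatever survived the interrupted load is published as if it were complete.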
This issue is resolved in NSX-T 3.1.3.1. It affects all NSX-T versions up to and including 3.1.3. Because State Sync is not used in 3.2.0, this issue does not apply to 3.2.0 and later versions.
After a Manager loses its leadership during a full sync, restart proton via the CLI:

/etc/init.d/proton restart
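To know when the workaround is needed, the leadership-transition message from step 2 can be watched for. A minimal sketch, assuming the message text shown in the excerpt above; the function name is hypothetical, and the restart itself is only indicated in a comment:

```python
# Marker emitted by gmle-client when a Manager loses full-sync leadership
# (taken from the proton log excerpt above).
LEADERSHIP_LOST_MARKER = "Entering WaitingForLeadershipLostAckState"

def needs_proton_restart(log_lines):
    """Return True if the proton log shows a full-sync leadership loss."""
    return any(LEADERSHIP_LOST_MARKER in line for line in log_lines)

# Sample line modeled on the leadership-change excerpt above.
sample = [
    '2021-07-29T10:55:58.850Z INFO pool-410-thread-1 WaitingForLeadershipLostAckState '
    '- - [nsx@6876 comp="gmle-client"] Entering WaitingForLeadershipLostAckState '
    'for lease id <UUID> service TraceflowCacheManager ...',
]
if needs_proton_restart(sample):
    # On the affected Manager, apply the workaround:
    #   /etc/init.d/proton restart
    print("leadership loss detected; restart proton on this Manager")
```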