...
You have had a datastore outage on the ESXi hosts where the NSX-T managers reside. To recover the cluster you used the 'deactivate cluster' command and now have a single-node cluster. The command 'get cluster status' shows everything up except CORFU_NONCONFIG, which is in an UNKNOWN state:

Group Type: CORFU_NONCONFIG
Group Status: UNAVAILABLE
Members:
    UUID                                   FQDN               IP              STATUS
    206dea88-edcf-496f-9a91-b88d2c7eb4de   nsx-03.corp.local  192.168.1.203   UNKNOWN

/nonconfig/corfu/corfu/LAYOUT_CURRENT.ds shows the other 2 nodes are still present:

"layoutServers": [
    "192.168.1.201:9040",          ##########this node was removed
    "192.168.1.202:9040",          ##########this node was removed
    "192.168.1.203:9040"
],
"sequencers": [
    "192.168.1.201:9040",          ##########this node was removed
    "192.168.1.202:9040",          ##########this node was removed
    "192.168.1.203:9040"
],
"segments": [
    {
        "replicationMode": "CHAIN_REPLICATION",
        "start": 0,
        "end": -1,
        "stripes": [
            {
                "logServers": [
                    "192.168.1.202:9040",  ##########this node was removed
                    "192.168.1.201:9040"   ##########this node was removed
                ]
            }
        ]
    }
],
"unresponsiveServers": [
    "192.168.1.203:9040"
],
"epoch": 1825,
"clusterId": "ff44209a-6f7f-48e6-8389-9401d1842f86"

When you run the command 'get cluster config' on the remaining node, it correctly shows only the single remaining node.

In the following log we can see a loop looking for the now-detached nodes:

/var/log/corfu-nonconfig/corfu.9040.log

2022-05-31T14:59:51.264Z | DEBUG | CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Layout server 192.168.1.203:9040 responded with layout Layout(layoutServers=[192.168.1.201:9040, 192.168.1.202:9040, 192.168.1.203:9040], sequencers=[192.168.1.202:9040, 192.168.1.201:9040, 10.211.42.203:9040], segments=[Layout.LayoutSegment(replicationMode=CHAIN_REPLICATION, start=0, end=-1, stripes=[Layout.LayoutStripe(logServers=[192.168.1.202:9040, 192.168.1.201:9040])])], unresponsiveServers=[192.168.1.203:9040], epoch=1725, clusterId=ff44209a-6f7f-48e6-8389-9401d1842f86)
2022-05-31T14:59:51.264Z | INFO | initializationTaskThread | o.c.i.RecoveryHandler | Recovery layout epoch:1725, Cluster epoch: 1725
2022-05-31T14:59:51.264Z | ERROR | initializationTaskThread | o.c.i.ManagementAgent | initializationTask: Recovery failed 1364 times. Retrying in PT1Ss.
2022-05-31T14:59:52.262Z | DEBUG | client-5 | o.c.r.c.NettyClientRouter | connectAsync[192.168.1.202:9040]: Channel connection failed, reconnecting...
2022-05-31T14:59:52.265Z | DEBUG | initializationTaskThread | o.c.runtime.view.RuntimeLayout | Requested move of servers to new epoch 1726 servers are [192.168.1.203:9040, 192.168.1.202:9040, 192.168.1.201:9040]
2022-05-31T14:59:52.265Z | INFO | initializationTaskThread | o.c.runtime.clients.BaseClient | sealRemoteServer: send SEAL from me(clientId=null) to new epoch 1726
...
2022-05-31T14:59:52.464Z | DEBUG | client-6 | o.c.r.c.NettyClientRouter | connectAsync[192.168.1.201:9040]: Channel connection failed, reconnecting...
...
2022-05-31T14:59:53.265Z | DEBUG | initializationTaskThread | o.c.r.v.QuorumFuturesFactory | QuorumGet: Exception TimeoutException
2022-05-31T14:59:53.265Z | ERROR | initializationTaskThread | o.c.r.v.LayoutManagementView | Error: recovery: {}
org.corfudb.runtime.exceptions.QuorumUnreachableException: Couldn't reach quorum, reachable=1, required=2
    at ...
2022-05-31T14:59:53.265Z | INFO | initializationTaskThread | o.c.i.RecoveryHandler | Recovery reconfiguration attempt result: false
2022-05-31T14:59:53.763Z | DEBUG | client-7 | o.c.r.c.NettyClientRouter | connectAsync[192.168.1.202:9040]: Channel connection failed, reconnecting...
2022-05-31T14:59:53.766Z | WARN | CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Tried to get layout from 192.168.1.202:9040 but failed by timeout

In the following log we also see timeout requests to the now-detached nodes:

/var/log/corfu-nonconfig/nonconfig-corfu-compactor-audit.log

2022-05-31T14:11:10.601Z WARN  CorfuRuntime-0 CorfuRuntime - Tried to get layout from 192.168.1.201:9040 but failed by timeout
2022-05-31T14:11:11.102Z WARN  CorfuRuntime-0 CorfuRuntime - Tried to get layout from 192.168.1.202:9040 but failed by timeout
2022-05-31T14:11:31.203Z ERROR main UfoCompactor - - [nsx@6876 comp="nsx-manager" errorCode="MP2" level="ERROR" subcomp="corfu-compactor"] UFO: Trim failed for ufo data in namespace ufo
org.corfudb.runtime.exceptions.UnreachableClusterException: Cluster is unavailable
    at com.vmware.nsx.platform.ufo.CorfuRuntimeHelper$1.run(CorfuRuntimeHelper.java:43)
...
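To confirm this mismatch, you can compare the layoutServers recorded in LAYOUT_CURRENT.ds with the membership reported by 'get cluster config'. The Python sketch below is illustrative only: it assumes the layout file can be read as plain JSON at the path shown above and that 192.168.1.203 is the surviving node, as in this example; adjust both values for your environment.

# Minimal sketch: flag nodes still referenced in the Corfu layout that are
# no longer part of the cluster. Assumes LAYOUT_CURRENT.ds is readable as
# plain JSON (an assumption; verify on your appliance before relying on it).
import json

LAYOUT_FILE = "/nonconfig/corfu/corfu/LAYOUT_CURRENT.ds"
SURVIVING_NODES = {"192.168.1.203"}        # IPs reported by 'get cluster config'

with open(LAYOUT_FILE) as f:
    layout = json.load(f)

layout_ips = {entry.split(":")[0] for entry in layout["layoutServers"]}
stale = layout_ips - SURVIVING_NODES

print("epoch:", layout["epoch"])
print("layout servers:", sorted(layout_ips))
if stale:
    print("detached nodes still referenced in the layout:", sorted(stale))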
This issue happens when an underlying datastore issue is followed by the 'deactivate cluster' command being issued to attempt to recover the cluster. The node where the 'deactivate cluster' command was executed was itself unresponsive at the time, meaning it was already unhealthy before the command was issued. This node needs to cure itself, but to do so it requires information from the other two nodes. As those two nodes no longer exist, the remaining node cannot cure itself and the cluster stays down. The loop seen in the logs is this remaining unhealthy node continually trying to connect to the other 2 nodes to cure itself. Best practice, when a datastore issue impacts all 3 managers, is to restore from backup to a point in time before the datastore issue occurred.
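The 'required=2' in the QuorumUnreachableException above is simple majority arithmetic: the stale layout still lists three layout servers, so recovery needs at least two of them reachable, which a single surviving node can never provide. The Python snippet below is only a worked illustration of that arithmetic, not NSX-T or Corfu code.

# Worked example of the quorum arithmetic behind
# "Couldn't reach quorum, reachable=1, required=2".
def required_quorum(layout_server_count: int) -> int:
    return layout_server_count // 2 + 1    # simple majority

layout_servers = 3   # nodes still listed in LAYOUT_CURRENT.ds
reachable = 1        # only the remaining node is up

print("required:", required_quorum(layout_servers))                        # -> 2
print("reachable:", reachable)                                             # -> 1
print("recovery possible:", reachable >= required_quorum(layout_servers))  # -> False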
This is a known issue impacting NSX-T Data Center when all three manager nodes are on the same datastore.
Restore the NSX-T cluster from backup to a point in time before the datastore outage occurred: Backup restore guide
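After the restore completes, you can confirm that the recovery loop shown above is no longer being written before relying on the cluster again. The Python sketch below is an illustrative check only, assuming the log path from this article; 'get cluster status' remains the authoritative verification.

# Illustrative post-restore check: look for recent "Recovery failed N times"
# entries in the Corfu non-config log (path as shown in this article).
import re
import sys

LOG = "/var/log/corfu-nonconfig/corfu.9040.log"
TAIL_LINES = 2000    # arbitrary window; adjust as needed

with open(LOG, errors="replace") as f:
    tail = f.readlines()[-TAIL_LINES:]

loop_hits = [line for line in tail if re.search(r"Recovery failed \d+ times", line)]
if loop_hits:
    print("Recovery loop still present; most recent entry:")
    print(loop_hits[-1].rstrip())
    sys.exit(1)
print("No recent recovery-loop entries found; verify with 'get cluster status'.")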