...
You have had a datastore outage on the ESXi hosts where the NSX-T managers reside. To recover the cluster you used the 'deactivate cluster' command and now have a single-node cluster. The command 'get cluster status' shows everything up except CORFU_NONCONFIG, which is in an UNKNOWN state:

Group Type: CORFU_NONCONFIG
Group Status: UNAVAILABLE
Members:
    UUID                                   FQDN               IP              STATUS
    206dea88-edcf-496f-9a91-b88d2c7eb4de   nsx-03.corp.local  192.168.1.203   UNKNOWN

/nonconfig/corfu/corfu/LAYOUT_CURRENT.ds shows the other 2 nodes are still present:

"layoutServers": [
    "192.168.1.201:9040",          ##########this node was removed
    "192.168.1.202:9040",          ##########this node was removed
    "192.168.1.203:9040"
],
"sequencers": [
    "192.168.1.201:9040",          ##########this node was removed
    "192.168.1.202:9040",          ##########this node was removed
    "192.168.1.203:9040"
],
"segments": [
    {
        "replicationMode": "CHAIN_REPLICATION",
        "start": 0,
        "end": -1,
        "stripes": [
            {
                "logServers": [
                    "192.168.1.202:9040",  ##########this node was removed
                    "192.168.1.201:9040"   ##########this node was removed
                ]
            }
        ]
    }
],
"unresponsiveServers": [
    "192.168.1.203:9040"
],
"epoch": 1825,
"clusterId": "ff44209a-6f7f-48e6-8389-9401d1842f86"

When you run the command 'get cluster config' on the remaining node, it correctly shows only the single remaining node.

In the following log we can see a loop looking for the now-detached nodes:

/var/log/corfu-nonconfig/corfu.9040.log

2022-05-31T14:59:51.264Z | DEBUG | CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Layout server 192.168.1.203:9040 responded with layout Layout(layoutServers=[192.168.1.201:9040, 192.168.1.202:9040, 192.168.1.203:9040], sequencers=[192.168.1.202:9040, 192.168.1.201:9040, 10.211.42.203:9040], segments=[Layout.LayoutSegment(replicationMode=CHAIN_REPLICATION, start=0, end=-1, stripes=[Layout.LayoutStripe(logServers=[192.168.1.202:9040, 192.168.1.201:9040])])], unresponsiveServers=[192.168.1.203:9040], epoch=1725, clusterId=ff44209a-6f7f-48e6-8389-9401d1842f86)
2022-05-31T14:59:51.264Z | INFO | initializationTaskThread | o.c.i.RecoveryHandler | Recovery layout epoch:1725, Cluster epoch: 1725
2022-05-31T14:59:51.264Z | ERROR | initializationTaskThread | o.c.i.ManagementAgent | initializationTask: Recovery failed 1364 times. Retrying in PT1Ss.
2022-05-31T14:59:52.262Z | DEBUG | client-5 | o.c.r.c.NettyClientRouter | connectAsync[192.168.1.202:9040]: Channel connection failed, reconnecting...
2022-05-31T14:59:52.265Z | DEBUG | initializationTaskThread | o.c.runtime.view.RuntimeLayout | Requested move of servers to new epoch 1726 servers are [192.168.1.203:9040, 192.168.1.202:9040, 192.168.1.201:9040]
2022-05-31T14:59:52.265Z | INFO | initializationTaskThread | o.c.runtime.clients.BaseClient | sealRemoteServer: send SEAL from me(clientId=null) to new epoch 1726
...
2022-05-31T14:59:52.464Z | DEBUG | client-6 | o.c.r.c.NettyClientRouter | connectAsync[192.168.1.201:9040]: Channel connection failed, reconnecting...
...
2022-05-31T14:59:53.265Z | DEBUG | initializationTaskThread | o.c.r.v.QuorumFuturesFactory | QuorumGet: Exception TimeoutException
2022-05-31T14:59:53.265Z | ERROR | initializationTaskThread | o.c.r.v.LayoutManagementView | Error: recovery: {}
org.corfudb.runtime.exceptions.QuorumUnreachableException: Couldn't reach quorum, reachable=1, required=2
    at ...
2022-05-31T14:59:53.265Z | INFO | initializationTaskThread | o.c.i.RecoveryHandler | Recovery reconfiguration attempt result: false
2022-05-31T14:59:53.763Z | DEBUG | client-7 | o.c.r.c.NettyClientRouter | connectAsync[192.168.1.202:9040]: Channel connection failed, reconnecting...
2022-05-31T14:59:53.766Z | WARN | CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Tried to get layout from 192.168.1.202:9040 but failed by timeout

In the following log we also see timeout requests to the now-detached nodes:

/var/log/corfu-nonconfig/nonconfig-corfu-compactor-audit.log

2022-05-31T14:11:10.601Z WARN  CorfuRuntime-0 CorfuRuntime - Tried to get layout from 192.168.1.201:9040 but failed by timeout
2022-05-31T14:11:11.102Z WARN  CorfuRuntime-0 CorfuRuntime - Tried to get layout from 192.168.1.202:9040 but failed by timeout
2022-05-31T14:11:31.203Z ERROR main UfoCompactor - - [nsx@6876 comp="nsx-manager" errorCode="MP2" level="ERROR" subcomp="corfu-compactor"] UFO: Trim failed for ufo data in namespace ufo
org.corfudb.runtime.exceptions.UnreachableClusterException: Cluster is unavailable
    at com.vmware.nsx.platform.ufo.CorfuRuntimeHelper$1.run(CorfuRuntimeHelper.java:43)
...
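To confirm this mismatch, you can compare the layoutServers recorded in LAYOUT_CURRENT.ds with the membership reported by 'get cluster config'. The Python sketch below is illustrative only: it assumes the layout file can be read as plain JSON at the path shown above and that 192.168.1.203 is the surviving node, as in this example; adjust both values for your environment.

# Minimal sketch: flag nodes still referenced in the Corfu layout that are
# no longer part of the cluster. Assumes LAYOUT_CURRENT.ds is readable as
# plain JSON (an assumption; verify on your appliance before relying on it).
import json

LAYOUT_FILE = "/nonconfig/corfu/corfu/LAYOUT_CURRENT.ds"
SURVIVING_NODES = {"192.168.1.203"}        # IPs reported by 'get cluster config'

with open(LAYOUT_FILE) as f:
    layout = json.load(f)

layout_ips = {entry.split(":")[0] for entry in layout["layoutServers"]}
stale = layout_ips - SURVIVING_NODES

print("epoch:", layout["epoch"])
print("layout servers:", sorted(layout_ips))
if stale:
    print("detached nodes still referenced in the layout:", sorted(stale))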
This issue happens when an underlying datastore issue is followed by the 'deactivate cluster' command being issued to attempt to recover the cluster. The node where the 'deactivate cluster' command was executed was itself unresponsive at the time, meaning it was already unhealthy before the command was issued. This node needs to cure itself, but to do so it requires information from the other two nodes. As those two nodes no longer exist, the remaining node cannot cure itself and the cluster stays down. The loop seen in the logs is this remaining unhealthy node continually trying to connect to the other 2 nodes to cure itself. Best practice, when a datastore issue impacts all 3 managers, is to restore from backup to a point in time before the datastore issue occurred.
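The 'required=2' in the QuorumUnreachableException above is simple majority arithmetic: the stale layout still lists three layout servers, so recovery needs at least two of them reachable, which a single surviving node can never provide. The Python snippet below is only a worked illustration of that arithmetic, not NSX-T or Corfu code.

# Worked example of the quorum arithmetic behind
# "Couldn't reach quorum, reachable=1, required=2".
def required_quorum(layout_server_count: int) -> int:
    return layout_server_count // 2 + 1    # simple majority

layout_servers = 3   # nodes still listed in LAYOUT_CURRENT.ds
reachable = 1        # only the remaining node is up

print("required:", required_quorum(layout_servers))                        # -> 2
print("reachable:", reachable)                                             # -> 1
print("recovery possible:", reachable >= required_quorum(layout_servers))  # -> False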
This is a known issue impacting NSX-T Data Center when all three manager nodes are on the same datastore.
Restore the NSX-T cluster from backup to a point in time before the datastore outage occurred: Backup restore guide
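After the restore completes, you can confirm that the recovery loop shown above is no longer being written before relying on the cluster again. The Python sketch below is an illustrative check only, assuming the log path from this article; 'get cluster status' remains the authoritative verification.

# Illustrative post-restore check: look for recent "Recovery failed N times"
# entries in the Corfu non-config log (path as shown in this article).
import re
import sys

LOG = "/var/log/corfu-nonconfig/corfu.9040.log"
TAIL_LINES = 2000    # arbitrary window; adjust as needed

with open(LOG, errors="replace") as f:
    tail = f.readlines()[-TAIL_LINES:]

loop_hits = [line for line in tail if re.search(r"Recovery failed \d+ times", line)]
if loop_hits:
    print("Recovery loop still present; most recent entry:")
    print(loop_hits[-1].rstrip())
    sys.exit(1)
print("No recent recovery-loop entries found; verify with 'get cluster status'.")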