...
You are using VMware NSX-T.

After the Load Balancer (LB) is reconfigured, an nginx core file is generated. From the root CLI of the Edge Node you can see nginx core files similar to the below:

root@edge01:/var/dump# ls
total 454M
-rw-rw-rw- 1 root root 321M Jun 26 12:39 core.nginx.1672058350.9414.134.11.gz
-rw-rw-rw- 1 root root 321M Jun 26 12:37 core.nginx.1672058216.8391.134.11.gz

Pool members may report "Connect to Peer Failure" or "TCP Handshake Timeout".

In /var/log/syslog of the Edge Node you see log entries for "all pool members are down":

2022-12-27T01:22:23.064227+00:00 edge02 NSX 6552 LOAD-BALANCER [nsx@6876 comp="nsx-edge" subcomp="lb" s2comp="lb" level="ERROR" errorCode="EDG1200000"] [9f918207-06f7-4d94-bbba-94ca54c86a34] Operation.Category: 'LbEvent', Operation.Type: 'StatusChange', Obj.Type: 'Pool', Obj.UUID: '8d2c9c89-5f4b-484d-871b-a634201ddd95', Obj.Name: 'cluster:ske-mgmt-west:sketestjhp-engine-rxwyl/kube-apiserver', Lb.UUID: '9f918207-06f7-4d94-bbba-94ca54c86a34', Lb.Name: 'LB-K8Sworker', Vs.UUID: '45b63dc1-0bbf-48eb-a99d-656a0eff03f8', Vs.Name: 'cluster:ske-mgmt-west:sketestjhp-engine-rxwyl/kube-apiserver', Status.NewStatus: 'Down', Status.Msg: 'all pool members are down'.
2022-12-27T01:22:23.064913+00:00 edge02 NSX 6552 LOAD-BALANCER [nsx@6876 comp="nsx-edge" subcomp="lb" s2comp="lb" level="ERROR" errorCode="EDG9999999"] [9f918207-06f7-4d94-bbba-94ca54c86a34] Operation.Category: 'LbEvent', Operation.Type: 'StatusChange', Obj.Type: 'VirtualServer', Obj.UUID: '45b63dc1-0bbf-48eb-a99d-656a0eff03f8', Obj.Name: 'cluster:ske-mgmt-west:sketestjhp-engine-rxwyl/kube-apiserver', Lb.UUID: '9f918207-06f7-4d94-bbba-94ca54c86a34', Lb.Name: 'LB-K8Sworker', Status.NewStatus: 'Down', Status.Msg: 'all pool members are down'.

The LB CONF process for the LB instance is not running. This can be confirmed by following the below steps:

1. Execute the below command from the root CLI of the Edge Node; this requires the UUID of the LB.

#ps -ef | grep lb | grep nginx | grep <LB UUID>

eg:
root@edge02:~# ps -ef | grep lb | grep nginx | grep fc6c40b4-16ee-49e2-9d00-a6332103eba8
lb 9568 9481 0 Jun23 ? 00:00:00 /opt/vmware/nsx-edge/bin/nginx -u fc6c40b4-16ee-49e2-9d00-a6332103eba8 -g daemon off;

Note: Execute get load-balancer from the admin CLI of the active Edge Node to retrieve the LB UUID. In the above example the LB UUID is fc6c40b4-16ee-49e2-9d00-a6332103eba8.

2. Use the nginx process ID (9568, as highlighted above) in the following command to confirm it has an LB CONF process running. If there is no output to the command, there is no LB CONF process running and the issue has been encountered.

#ps -ef | grep <nginx process ID> | grep CONF

eg:
Impacted
root@edge02:~# ps -ef | grep 9568 | grep CONF
root@edge02:~#

Not impacted
root@edge02:~# ps -ef | grep 9568 | grep CONF
lb 9572 9568 0 Jun23 ? 00:00:06 nginx: LB CONF process
root@edge02:~#

NOTE: The preceding log excerpts are only examples. Date, time and environmental details may vary depending on your environment.
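The two checks above can also be combined into a small shell sketch run as root on the active Edge Node. This is a minimal, illustrative example only, assuming exactly one nginx master process exists for the given LB UUID; the LB_UUID value is a placeholder you must supply (retrieve it with get load-balancer from the admin CLI).

# Minimal sketch, assumptions as noted above; not an official NSX tool.
LB_UUID="<LB UUID>"   # placeholder - supply your LB UUID here

# Find the nginx master process for this LB instance (same check as step 1)
NGINX_PID=$(ps -ef | grep lb | grep nginx | grep "$LB_UUID" | grep -v grep | awk '{print $2}')

if [ -z "$NGINX_PID" ]; then
    echo "No nginx process found for LB $LB_UUID"
elif ps -ef | grep "$NGINX_PID" | grep -v grep | grep -q "LB CONF"; then
    echo "LB CONF process is running for PID $NGINX_PID - not impacted"
else
    echo "No LB CONF process for nginx PID $NGINX_PID - issue encountered"
fi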
During periods of memory starvation, it is possible to encounter this behavior due to an issue in the LB CONF process. The process is automatically restarted; however, incorrect worker data is used, so the process is not initialized. As a result, no session is established between nestdb and the LB nginx process, and new LB configurations do not take effect.
This issue is resolved in VMware NSX-T Data Center 3.2.3 and VMware NSX 4.1.1, available from VMware Downloads.
Restart the Edge Node to fail over services to the standby node.

OR

Restart the docker container of this LB instance using the below commands, run from the CLI as root on the Edge Node:

#docker ps | grep <LB UUID>
#docker restart <CONTAINER ID>

eg:
root@edge02:~# docker ps | grep fc6c40b4-16ee-49e2-9d00-a6332103eba8
126fa3da65e3 nsx-edge-lb:current "/opt/vmware/edge/lb…" 2 days ago Up 2 days service_lb_fc6c40b4-16ee-49e2-9d00-a6332103eba8
root@edge02:~# docker restart 126fa3da65e3
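If you prefer to script the restart, the two docker commands can be chained; the following is a minimal sketch, assuming it is run as root on the impacted Edge Node and that the LB UUID matches exactly one container. LB_UUID is a placeholder you supply.

# Minimal sketch, assumptions as noted above; verify the container ID before restarting in production.
LB_UUID="<LB UUID>"   # placeholder - supply your LB UUID here
CONTAINER_ID=$(docker ps | grep "$LB_UUID" | awk '{print $1}')
if [ -n "$CONTAINER_ID" ]; then
    docker restart "$CONTAINER_ID"
else
    echo "No LB container found for $LB_UUID"
fi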