...
After upgrading VC to 7.0 U3 or while running on 7.0 U3, all hosts in the environment enter a not responding state. The vCenter logs state that the host moved to this state due to missing heartbeats, even though heartbeats are showing as received on the network of the VCSA.vpxd logs, located at /var/log/vmware/vpxd/, show repeated messages "Certificate not available, starting hostsync"The operation timeout can happen anywhere between 2-30 mins as in the example below: 2022-08-18T17:55:45.133Z info vpxd[34753] [Originator@6876 sub=Vmomi opID=HostSync-host-147-79013b58] Creating SOAP stub adapter for /vpxa on <ESXihostfqdn>:443 2022-08-18T17:57:45.142Z info vpxd[34753] [Originator@6876 sub=IO.Http opID=HostSync-host-147-79013b58] Terminating user agent request; (null), N7Vmacore16TimeoutExceptionE(Operation timed out) 2022-08-18T17:57:45.153Z info vpxd[34753] [Originator@6876 sub=IO.Http opID=HostSync-host-147-79013b58] Set user agent error; state: 1, (null), N7Vmacore16TimeoutExceptionE(Operation timed out) 2022-08-18T17:57:45.156Z warning vpxd[34781] [Originator@6876 sub=IO.Connection opID=HostSync-host-147-79013b58] Failed to connect; <io_obj p:0x00007fc56ca86dc8, h:-1, <TCP '0.0.0.0 : 45208'>, <TCP '<IP> : 443'>>, e: 104(Connection reset by peer), duration: 120020msec 2022-08-18T17:57:45.157Z warning vpxd[34753] [Originator@6876 sub=Vmomi opID=HostSync-host-147-79013b58] Creating SOAP stub adapter failed: N7Vmacore24InvalidArgumentExceptionE(No version for VMODL calls to <<cs p:00007fc5a8348eb0, TCP:<ESXihostfqdn>:443>, /vpxa>) 2022-08-18T18:01:45.274Z info vpxd[23864] [Originator@6876 sub=Vmomi opID=HostSync-host-147-2f8fc66f] Creating SOAP stub adapter for /vpxa on <ESXihostfqdn>:443
Usually, this is caused by a lock which is blocking heartbeats. Some other common causes for this issue can be: If the password of ESX user "vpxuser" is out of sync with the password maintained by vpxd, which can happen in various scenarios such as: VCSA has been restored from a snapshot, and "vpxuser" password has been updated between the time snapshot was taken and restore time. The password of ESX user "vpxuser" has expired on ESX due to ESX updated setting 'Security.PasswordMaxDays' (default is "never expire" but the user may update that advanced setting) The ESX host has been added to another VC without having been removed from the first VC (the first VC will fail to login to ESX in that case) Due to mis-configured network connectivity, which causes connection failure from vpxd to ESX host after ~120 seconds timeout. Any other underlying issue causing ESXi host not responding, typically related to storage/network loss (e.g SD card as described in https://kb.vmware.com/s/article/2149257)
This issue is fixed in vCenter 7.0 U3i , please download from Customer Connect .
To workaround this issue: Identify the offending host(s) that are unresponsive in the environment and disconnect them from vCenter. Then you can address the issue with the host(s) while allowing all other hosts to remain connected.Investigate this host for problems and resolve them or disconnect this host from the vCenter Server until the issue can be resolved.A single host, or a small number of hosts which are not responding to the connection attempt can cause this due to hostd being unresponsive, etc. To identify if this is the issue you are encountering, look at the vpxd logs for one of the not responding hosts that you have attempted to reconnect. Find the reconnect task: 2022-08-25T19:08:07.405Z info vpxd[36085] [Originator@6876 sub=vpxLro opID=l6ejabx0-47722379-auto-seuto-h5:72353065-61] [VpxLRO] -- BEGIN task-5224518 -- host-1607205 -- vim.HostSystem.reconnect -- 52ebb36c-1fbb-da0a-4c90-4e7468ca9fcc(52d277a1-9d4b-adb3-630c-0b0536f777d8) Using the opID, find where this task finished and notice there was no error with the task: 2022-08-25T19:08:10.998Z info vpxd[36085] [Originator@6876 sub=vpxLro opID=l6ejabx0-47722379-auto-seuto-h5:72353065-61] [VpxLRO] -- FINISH task-5224518 Next, use the host ID to find the host sync task in the logs: 2022-08-25T19:08:31.553Z verbose vpxd[36097] [Originator@6876 sub=InvtHostCnx opID=l6ejabx0-47722570-auto-seuz0-h5:72353107-1a] Queuing host sync; [vim.HostSystem:host-1607205,esxi.domain.com] 2022-08-25T19:08:31.554Z verbose vpxd[59454] [Originator@6876 sub=vpxLro opID=l6ejabx0-47722570-auto-seuz0-h5:72353107-1a-51c0b834] [VpxLRO] -- BEGIN lro-1301294 -- -- VpxdInvtHostSyncHostLRO.Synchronize -- Immediately after the host sync is started, we see a failure stating that the connection is not alive due to missing heartbeats: 2022-08-25T19:08:31.554Z warning vpxd[59454] [Originator@6876 sub=InvtHostCnx opID=l6ejabx0-47722570-auto-seuz0-h5:72353107-1a-51c0b834] Connection not alive due to missing heartbeats; [vim.HostSystem:host-1607205,esxi.domain.com], cnx: 52d42264-85e5-80d4-9fe9-39ed1cd3c6eb 2022-08-25T19:08:31.558Z warning vpxd[59454] [Originator@6876 sub=MoHost opID=l6ejabx0-47722570-auto-seuz0-h5:72353107-1a-51c0b834] host [vim.HostSystem:host-1607205,esxi.domain.com] connection state changed to NO_RESPONSE 2022-08-25T19:08:31.570Z warning vpxd[59454] [Originator@6876 sub=InvtHostCnx opID=l6ejabx0-47722570-auto-seuz0-h5:72353107-1a-51c0b834] Exception occurred during host sync; Host communication failed; [vim.HostSystem:host-1607205,esxi.domain.com], e: N5Vmomi5Fault17HostCommunication9ExceptionE(Fault cause: vmodl.fault.HostCommunication Note that the time between when the reconnect is started and the failure is less than 30 seconds in these logs. It may be longer, but we should not be marking a host as unresponsive for missing heartbeats in this short of a time span.If you had run a packet capture looking for network problems, you may notice that the host is reporting a TCP Window Full Message like below: 3685 19:08:08.897786 2022-08-25 13:08:08.897786 <Host IP> <vCenter IP> TCP 1602 [TCP Window Full] 443 → 50118 [ACK] Seq=3191140846 Ack=3569770941 Win=66560 Len=1536 TSval=3535247282 TSecr=2844746348 [TCP segment of a reassembled PDU] 3688 19:08:08.899182 2022-08-25 13:08:08.899182 <vCenter IP> <Host IP> TCP 66 [TCP ZeroWindow] 50118 → 443 [ACK] Seq=3569770941 Ack=3191202286 Win=0 Len=0 TSval=2844746349 TSecr=353524728 You may also be able to use the following command run against the vpxd logs to identify a set of hosts to investigate for this issue. Note that you must change the <VPXD LOGS> instances in the sample to the log or logs you want to review (replace <VPXD LOGS> with the name of the specific log or a regex for running the command on multiple logs): $ grep "Terminating user agent request; SSL" <VPXD LOGS> | grep HostSync | cut -d "=" -f3 | cut -d "]" -f1 | sort | uniq | while read line; do grep $line <VPXD LOGS> | egrep "Creating SOAP stub adapter for \/vpxa|Terminating user agent request; SSL"; done or $ zgrep "Terminating user agent request; SSL" vpxd-*.log.gz | grep HostSync | cut -d "=" -f3 | cut -d "]" -f1 | sort | uniq | while read line; do zgrep $line vpxd-*.log.gz | egrep "Creating SOAP stub adapter for \/vpxa|Terminating user agent request; SSL"; done