Symptoms
Updating a host using VMware Cloud Foundation for Service Providers 2.4.x lifecycle management (LCM) fails with an upgradeStatus of COMPLETED_WITH_FAILURE and an error similar to the following:

"metadata": "\nThe ESXi cluster is not healthy (green).\nUpgrade failed. Auto-recovery attempt failed as well. Manual intervention needed.\nCheck for errors in the lcm log files located on server 127.0.0.1 under /home/vrack/lcm/logs/\nLCM will bring the domain back online once problms found above steps are fixed manually. Please retry the upgrade once the upgrade is available again."

In the log /home/vrack/lcm/logs/lcm.log, the cluster status is reported as not green because of a cluster config issue similar to the following:

2017-05-22 23:28:47.269 [ThreadPoolTaskExecutor-3] DEBUG [com.vmware.evo.sddc.lcm.primitive.impl.esx.ClusteredEsxUpdateClient] upgradeId=92b92aa8-f1d9-4879-a8b4-1a4ea120069a,resourceType=ESX_HOST,resourceId=0513e5ea-4ac1-4664-ba64-125d6aea975d,bundleId=3c13036f-fdb8-4ccc-ac1c-f34d2d938b93,bundleElementId=e1fd15e1-2732-4c90-88e5-b8f45ca102a6 Cluster config status: yellow

2017-05-22 23:28:47.269 [ThreadPoolTaskExecutor-3] ERROR [com.vmware.evo.sddc.lcm.primitive.impl.esx.ClusteredEsxUpdateClient] upgradeId=92b92aa8-f1d9-4879-a8b4-1a4ea120069a,resourceType=ESX_HOST,resourceId=0513e5ea-4ac1-4664-ba64-125d6aea975d,bundleId=3c13036f-fdb8-4ccc-ac1c-f34d2d938b93,bundleElementId=e1fd15e1-2732-4c90-88e5-b8f45ca102a6 Cluster status is not green; update cannot proceed.

2017-05-22 23:28:47.269 [ThreadPoolTaskExecutor-3] ERROR [com.vmware.evo.sddc.lcm.primitive.impl.esx.ClusteredEsxUpdateClient] upgradeId=92b92aa8-f1d9-4879-a8b4-1a4ea120069a,resourceType=ESX_HOST,resourceId=0513e5ea-4ac1-4664-ba64-125d6aea975d,bundleId=3c13036f-fdb8-4ccc-ac1c-f34d2d938b93,bundleElementId=e1fd15e1-2732-4c90-88e5-b8f45ca102a6 Cluster Config Issue: vSphere HA failover operation in progress in cluster ProdCluster1 in datacenter Datacenter1: 0 VMs being restarted, 2 VMs waiting for a retry, 0 VMs waiting for resources, 4 inaccessible Virtual SAN VMs

Note: The preceding log excerpts are only examples. The date, time, and environment-specific values vary between environments, as does the wording of the HA notification. The following is an additional example:

Cluster config Issue: vSphere HA initiated a failover action on 1 virtual machines in cluster LabCluster on datacenter DC01.

In the vSphere Web Client, the same notification can be found by selecting the referenced data center and navigating to Issues under the Monitor tab. The message cannot be removed from there.
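The log check described above can be sketched as a small Python script that scans lcm.log for the two message types quoted in this article ("Cluster config status:" and "Cluster Config Issue:"). This is only an illustrative helper, not part of the product: the function names, the regular expressions, and the assumption that each log entry sits on its own line are ours, and the sample lines are abbreviated copies of the excerpts above.

```python
import re
import sys

# Patterns for the two message types quoted in this article.
# The capitalization of "Config" differs between the examples, so both are accepted.
STATUS_RE = re.compile(r"Cluster config status:\s*(\w+)")
ISSUE_RE = re.compile(r"Cluster [Cc]onfig Issue:\s*(.+)")

# Abbreviated copies of the log excerpts above (key=value fields shortened).
SAMPLE_LINES = [
    "2017-05-22 23:28:47.269 [ThreadPoolTaskExecutor-3] DEBUG "
    "[com.vmware.evo.sddc.lcm.primitive.impl.esx.ClusteredEsxUpdateClient] "
    "upgradeId=92b92aa8-f1d9-4879-a8b4-1a4ea120069a Cluster config status: yellow",
    "2017-05-22 23:28:47.269 [ThreadPoolTaskExecutor-3] ERROR "
    "[com.vmware.evo.sddc.lcm.primitive.impl.esx.ClusteredEsxUpdateClient] "
    "upgradeId=92b92aa8-f1d9-4879-a8b4-1a4ea120069a "
    "Cluster status is not green; update cannot proceed.",
    "2017-05-22 23:28:47.269 [ThreadPoolTaskExecutor-3] ERROR "
    "[com.vmware.evo.sddc.lcm.primitive.impl.esx.ClusteredEsxUpdateClient] "
    "upgradeId=92b92aa8-f1d9-4879-a8b4-1a4ea120069a "
    "Cluster Config Issue: vSphere HA failover operation in progress in cluster "
    "ProdCluster1 in datacenter Datacenter1: 0 VMs being restarted, "
    "2 VMs waiting for a retry, 0 VMs waiting for resources, "
    "4 inaccessible Virtual SAN VMs",
]


def summarize_cluster_health(log_lines):
    """Collect reported cluster statuses and config issues from log lines."""
    statuses = []
    issues = []
    for line in log_lines:
        status = STATUS_RE.search(line)
        if status:
            statuses.append(status.group(1).lower())
        issue = ISSUE_RE.search(line)
        if issue:
            issues.append(issue.group(1).strip())
    return {"statuses": statuses, "issues": issues}


def ha_related(issues):
    """Return only the issues that mention vSphere HA."""
    return [text for text in issues if "vSphere HA" in text]


if __name__ == "__main__":
    # Pass the log path as the first argument; otherwise the sample lines are used.
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
            lines = handle.readlines()
    else:
        lines = SAMPLE_LINES
    summary = summarize_cluster_health(lines)
    print("Statuses seen:", summary["statuses"])
    for text in ha_related(summary["issues"]):
        print("HA issue:", text)
```

Running the script with /home/vrack/lcm/logs/lcm.log as its argument lists any non-green statuses and HA-related config issues, which can help confirm that the failure matches the symptoms in this article before applying the resolution.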
Resolution
Disable HA at the cluster level and wait for the HA agent to be uninstalled from all the hosts in the cluster. Then re-enable HA on the cluster and retry the failed host update.