...
Upgrading the WCP Supervisor Cluster does not proceed past 50% after the new SV nodes are built. This impacts Supervisor Cluster upgrades from older versions to 1.21 or 1.22.

The following symptoms are present:

The user will see a task in the vCenter GUI indicating that the Namespaces upgrade is in progress.

From a vCenter Server SSH session, when running DCLI to query the namespace status, users will see:

dcli> namespacemanagement software clusters get --cluster domain-c8
upgrade_status:
    desired_version: v1.22.6+vmware.wcp.1-vsc0.0.17-19939323
    messages:
        - severity: ERROR
          details: A general system error occurred.
    progress:
        total: 100
        completed: 50
        message: Namespaces cluster upgrade is in the "upgrade components" step.

When connected to a Supervisor control plane node via SSH, the user will see errors like the following in /var/log/vmware/upgrade-ctl-compupgrade.log:

"CapwUpgrade": {"status": "failed", "messages": [{"level": "error", "message": "Component CapwUpgrade failed: Failed to run command: Resource=customresourcedefinitions\", GroupVersionKind: \"apiextensions.k8s.io/v1, Kind=CustomResourceDefinition\"\nName: \"wcpmachinetemplates.infrastructure.cluster.vmware.com\", Namespace: \"\"\nfor: \"wcp-infrastructure.yaml\": CustomResourceDefinition.apiextensions.k8s.io \"wcpmachinetemplates.infrastructure.cluster.vmware.com\" is invalid: status.storedVersions[0]: Invalid value: \"v1alpha2\": must appear in spec.versions\n", "backtrace": [" File \"/usr/lib/vmware-wcp/upgrade/compupgrade.py\", line 252, in do\n

The following message may also appear in the log; it is a symptom of the errors above:

2022-09-23T17:16:36.47Z ERROR comphelper: Failed to run command: ['/usr/local/bin/etcdctl_lock', '/vmware/wcp/upgrade/components/lock', '--', '/usr/bin/python3', '/usr/lib/vmware-wcp/upgrade/upgrade-ctl.py', '--logfile', '/var/log/vmware/upgrade-ctl-compupgrade.log', '--statestore', 'EtcdStateStore', 'do-upgrade'] ret=1 out={"error": "OSError", "message": "[Errno 7] Argument list too long: '/usr/local/bin/etcdctl'", "backtrace": [" File \"/usr/lib/vmware-wcp/upgrade/upgrade-ctl.py\"

The capi-controller-manager, CAPW, and kube-scheduler pods may be in a CrashLoopBackOff state with 1 of 2 containers running:

# kubectl get pods -A | grep -v Run
NAMESPACE            NAME                                              READY   STATUS             RESTARTS         AGE
kube-system          kube-scheduler-423f01b9b30c727e9c237a00319c15l    1/2     CrashLoopBackOff   5 (99s ago)      57m
svc-tmc-c63          agentupdater-workload-27657688--1-r46p5           0/1     Completed          0                30s
svc-tmc-c63          tmc-agent-installer-27657688--1-wpmxm             0/1     Completed          0                30s
vmware-system-capw   capi-controller-manager-766c6fc449-4qqvf          1/2     CrashLoopBackOff   19 (3m42s ago)   53m
vmware-system-capw   capi-controller-manager-766c6fc449-bcpdq          1/2     CrashLoopBackOff   13 (4m15s ago)   23m
These symptoms occur because of upstream Kubernetes behavior around deprecated CRD versions, which must be un-served and dropped before they can be removed. In this specific case, the v1alpha2 versions of the CAPI/CAPW CRDs are not correctly removed, which causes a failure when the new v1alpha3 and v1beta1 CRD versions are applied. A CRD version cannot be removed while it is still listed in the CRD's status.storedVersions, and in the affected environments v1alpha2 remains listed there and is still served (.served set to true).

This upgrade failure has been narrowed to a specific upgrade path: the environment was initially built with a WCP Supervisor Cluster on a vCenter version prior to or equal to 7.0.0d (7.0.0.10700), Build 16749653, and the vCenter was then upgraded to 7.0 U3e (7.0.3.00600), Build 19717403, where the CAPI/CAPW `v1alpha2` version was removed. Environments whose WCP Supervisor Clusters were initially installed on vCenter 7.0 U1 (7.0.1.00000), Build 16860138, are not susceptible to this failure.
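As an optional check (not part of the official workaround; these are standard kubectl/jq queries, and the exact set of affected CRDs can vary per environment), the CRDs that still store or serve the removed v1alpha2 version can be listed directly:

# kubectl get crd -o json | jq -r '.items[] | select((.status.storedVersions // []) | index("v1alpha2")) | .metadata.name'

# kubectl get crd -o json | jq -r '.items[] | select(any(.spec.versions[]; .name == "v1alpha2" and .served)) | .metadata.name'

The first command lists every CRD that still records v1alpha2 in status.storedVersions; the second lists every CRD that still serves v1alpha2. Any CRD returned by these queries is a candidate for the cleanup described in the workaround below.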
VMware engineering is working to address this issue in a future release of WCP. If you encounter this issue and need to correct it manually, use the workaround below.
NOTE: If you are running this workaround as a proactive fix before a Supervisor Cluster upgrade, skip step 9 and instead start the WCP Supervisor Cluster upgrade from the vCenter GUI.

1. First, verify that v1alpha2 is a .served version:

# kubectl get crd -o json machines.cluster.x-k8s.io | jq '.spec.versions[] | "\(.name) \(.served) \(.storage)"'
"v1alpha2 true false"
"v1alpha3 true true"

# kubectl get crd -o json machines.cluster.x-k8s.io | jq .status.storedVersions
[
  "v1alpha2",
  "v1alpha3"
]

2. Download the script attached to this KB, SCP it to a Supervisor control plane node under /tmp/, then SSH to that node and extract the archive:

# cd /tmp
# tar -zxf patch-capi-versions-Linux-x86_64.tar.gz

3. Start a kubectl proxy on port 8080 so the script can reach the API server:

# kubectl proxy --port=8080 &
Starting to serve on 127.0.0.1:8080

4. Save the PID of the proxy started in the previous command:

# proxy_pid=$!

5. Run the script to list the CRD versions presently available:

# ./patch-capi-versions-Linux-x86_64
clusters.cluster.x-k8s.io v1alpha2 storage=false served=true
clusters.cluster.x-k8s.io v1alpha3 storage=false served=true
clusters.cluster.x-k8s.io storedVersions [v1alpha2 v1alpha3]

In the output above, v1alpha2 is still served=true and still listed in storedVersions; this is what causes the problem. It is removed by running the script again with the -update flag.

6. Update the CRDs (a manual equivalent of this step is sketched after the procedure below):

# ./patch-capi-versions-Linux-x86_64 -update

7. Kill the proxy after the script completes:

# kill $proxy_pid
[1]+  Terminated: 15          kubectl proxy --port=8080

8. Confirm v1alpha2 is no longer served or stored:

# kubectl get crd -o json machines.cluster.x-k8s.io | jq '.spec.versions[] | "\(.name) \(.served) \(.storage)"'
"v1alpha2 false false"
"v1alpha3 true true"

# kubectl get crd -o json machines.cluster.x-k8s.io | jq .status.storedVersions
[
  "v1alpha3"
]

9. After confirming the CRDs are successfully updated, proceed with the WCP upgrade script:

NOTE: If you are running this workaround as a proactive fix for your Supervisor Cluster upgrade, skip step 9 and run the WCP Supervisor Cluster upgrade from the vCenter GUI.

# bash /usr/lib/vmware-wcp/objects/PodVM-GuestCluster/20-capw/install.sh

The WCP cluster upgrade should then proceed and complete the component upgrades, along with the Spherelet upgrades if the environment is configured with NSX.
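For reference, a manual equivalent of step 6 can be sketched against the same kubectl proxy started in step 3. This is only an assumption about what the attached patch tool does (set served: false on the v1alpha2 entry and drop v1alpha2 from status.storedVersions); it is not the shipped script, and the CRD name used here is just an example that would need to be repeated for each affected CRD reported in step 5.

# Hypothetical manual sketch of the -update pass; assumes the kubectl proxy from step 3 is running.
crd=machines.cluster.x-k8s.io    # example CRD; repeat for each affected CRD
api="http://127.0.0.1:8080/apis/apiextensions.k8s.io/v1/customresourcedefinitions/${crd}"

# Stop serving v1alpha2. A JSON merge patch replaces the whole versions list,
# so fetch the current list, flip served to false with jq, and send it back in full.
curl -s "${api}" \
  | jq '{spec: {versions: [.spec.versions[] | if .name == "v1alpha2" then .served = false else . end]}}' \
  | curl -s -X PATCH "${api}" -H 'Content-Type: application/merge-patch+json' -d @- > /dev/null

# Remove v1alpha2 from status.storedVersions; this field is only writable
# through the CRD's /status subresource.
curl -s "${api}/status" \
  | jq '{status: {storedVersions: [.status.storedVersions[] | select(. != "v1alpha2")]}}' \
  | curl -s -X PATCH "${api}/status" -H 'Content-Type: application/merge-patch+json' -d @- > /dev/null

After patching, re-running the checks from step 8 should show served=false for v1alpha2 and a storedVersions list that no longer contains it.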