- You are upgrading your Tanzu Kubernetes Grid clusters.
- You have applications running on the cluster that use persistent volumes.
- Your upgrade is in a hung state or has failed after a long wait.
- Your pods that use persistent volumes are not able to attach them.
Due to a race condition between the volume detach and volume delete operations, CNS volumes are never detached from the nodes. This issue affects Tanzu Kubernetes Grid versions 1.x to 1.6.x.

One scenario in which this can occur is an upgrade of a cluster that runs stateful workloads using persistent volumes:

- The TKG upgrade hangs because of a misconfigured PodDisruptionBudget (PDB).
- After the PDB errors are resolved, the workers are reconciled and the old workers are deleted.
- The CSI controller repeatedly tries to detach the volume from a node that is no longer present.
- All stateful workloads are stuck in container creation or in an init state, depending on the stage at which the volume is mounted.

The error message "failed to find VirtualMachine for node" is logged repeatedly in the vSphere CSI controller logs:

    kubectl logs -n kube-system vsphere-csi-controller-76d888d87c-wsml9 vsphere-csi-controller

    {"level":"error","time":"2020-09-27T18:15:59.174108121Z","caller":"vanilla/controller.go:569","msg":"failed to find VirtualMachine for node:\"tcp-md-0-5bb7dc9f5c-mbjwl\". Error: node wasn't found","TraceId":"a5ab0f92-a59e-4b67-9185-a9bd020cc1fb","stacktrace":"sigs.k8s.io/vsphere-csi-driver/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume
        /build/pkg/csi/service/vanilla/controller.go:569
    github.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler.func1
        /go/pkg/mod/github.com/container-storage-interface/spec@v1.2.0/lib/go/csi/csi.pb.go:5200
    github.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerUnpublishVolume
        /go/pkg/mod/github.com/rexray/gocsi@v1.2.1/middleware/serialvolume/serial_volume_locker.go:141

The application pods are stuck in the container creation or init container phase and constantly report the errors "Multi-Attach error for volume" and "Unable to attach or mount volumes: unmounted volumes":

    kubectl describe pod <your-failing-pod>

    Warning FailedAttachVolume 32m attachdetach-controller Multi-Attach error for volume "pvc-c3c29367-658b-4548-ac7c-134fa73df4c2" Volume is already exclusively attached to one node and can't be attached to another
    Warning FailedMount 12m (x2 over 21m) kubelet, tcp-md-0-7f67dbbfb8-lthnt Unable to attach or mount volumes: unmounted volumes=[wordpress-persistent-storage], unattached volumes=[istio-certs default-token-wdbzv wordpress-persistent-storage istio-envoy]: timed out waiting for the condition
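To see which node names the CSI controller is still trying to detach from, the log output can be filtered for the error message shown above. The following is a minimal sketch, not part of the original article: the function name missing_vm_nodes is hypothetical, and it assumes the log lines use the JSON format shown above, with the node name wrapped in escaped quotes.

```shell
#!/usr/bin/env bash
# Sketch only: missing_vm_nodes is a hypothetical helper name.
# Reads vSphere CSI controller log lines on stdin and prints each distinct
# node name that appears in a "failed to find VirtualMachine for node" error.
missing_vm_nodes() {
  # Keep only the error text up to the end of the node name, which the
  # log wraps in escaped quotes: node:\"<name>\"
  grep -o 'failed to find VirtualMachine for node:\\"[^\\"]*' |
    # Drop everything through the opening \" so only the name remains.
    sed 's/.*\\"//' |
    sort -u
}
```

Assuming the controller runs as a Deployment named vsphere-csi-controller (inferred from the pod name above, so verify it in your cluster), the function could be fed with: kubectl logs -n kube-system deployment/vsphere-csi-controller -c vsphere-csi-controller | missing_vm_nodes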
This is a known issue, as documented in the VMware vSphere Container Storage Plug-in 2.5 release notes:
https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/2.5/rn/vmware-vsphere-container-storage-plugin-25-release-notes/index.html#known-issues
Delete VolumeAttachment resources only after the node has been drained and deleted; otherwise the workload might be impacted. The operation below must be repeated for all pods backed by persistent volumes. Carry out this process in parallel with the upgrade triggered through the Tanzu CLI.

Check the status of the pods:

    kubectl get pods -A | grep -v Running

    NAME                                   READY   STATUS     RESTARTS   AGE
    pod/web-0                              0/2     Init:0/1   0          9h
    pod/wordpress-6c6794cb7d-cdnsc         0/2     Init:0/1   0          31m
    pod/wordpress-mysql-756d555798-gtvvp   0/2     Init:0/1   0          9h

Query the existing volume attachments and compare the node name of each attachment with the nodes that are currently part of the cluster:

    kubectl get volumeattachments.storage.k8s.io
    kubectl get nodes

Comparing the two outputs shows that certain VolumeAttachment objects refer to nodes that are no longer part of the cluster. You need to delete these attachments to work around the Multi-Attach errors. Before deleting any volume attachment, make sure that it belongs to a node that does not appear in the output of kubectl get nodes.

To delete the attachments, remove the finalizer from every VolumeAttachment that belonged to an older node:

    kubectl patch volumeattachments.storage.k8s.io csi-<uuid> -p '{"metadata":{"finalizers":[]}}' --type=merge

The new volume attachments should be created soon afterwards, and the new nodes will then be able to mount the persistent volumes.
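The comparison between volume attachments and current nodes can be tedious to do by eye on larger clusters. The following is a minimal sketch of one way to list the candidates, not a procedure from the original article: the function names stale_attachments and list_stale_from_cluster are hypothetical. It only prints attachment names and deletes nothing, so each result can still be reviewed before the patch command above is applied.

```shell
#!/usr/bin/env bash
# Sketch only: function names and file layout are illustrative assumptions.

# stale_attachments NODES_FILE ATTACHMENTS_FILE
#   NODES_FILE:       one current node name per line
#   ATTACHMENTS_FILE: "<attachment-name> <node-name>" per line
# Prints the names of attachments whose node is not listed in NODES_FILE.
stale_attachments() {
  local nodes_file=$1 attachments_file=$2
  # An empty node list most likely means the node query failed, so stop
  # rather than report every attachment as stale.
  if [ ! -s "$nodes_file" ]; then
    echo "node list is empty; refusing to report anything as stale" >&2
    return 1
  fi
  # First file: remember live node names. Second file: print attachments
  # whose node name was not seen.
  awk 'NR == FNR { live[$1] = 1; next } !($2 in live) { print $1 }' \
    "$nodes_file" "$attachments_file"
}

# Collects both lists from the cluster with kubectl and runs the comparison.
list_stale_from_cluster() {
  local dir
  dir=$(mktemp -d) || return 1
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' > "$dir/nodes" &&
  kubectl get volumeattachments.storage.k8s.io \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeName}{"\n"}{end}' \
    > "$dir/attachments" &&
  stale_attachments "$dir/nodes" "$dir/attachments"
  local rc=$?
  rm -rf "$dir"
  return "$rc"
}
```

After sourcing the file, running list_stale_from_cluster prints one attachment name per line. Keeping the comparison in a separate function that reads plain files makes it possible to check the logic against saved output without access to the cluster.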