BugZero | VMware BugID 85213 - Persistent volumes cannot attach to a new node if ...

VMware - Defect ID: 85213

Persistent volumes cannot attach to a new node if previous node is deleted

Last updated on August 8th, 2023

BugZero Risk Score
8.3 High

Overall: N/A

Severity: N/A

Community: N/A

Lifecycle: N/A

Bug Details

Views: 11

Description

Symptoms

- You are upgrading your Tanzu Kubernetes Grid clusters.
- You have applications running on the cluster using persistent volumes.
- Your upgrade is in a hung state or failed after a long wait.
- Your pods that are utilizing persistent volumes are not able to attach persistent volumes.

Cause

Due to race conditions between detaching and deleting volume operations, CNS volumes never get detached from the nodes. This issue affects Tanzu Kubernetes Grid versions 1.x to 1.6.x.

One scenario in which this can occur is an upgrade of a cluster that runs stateful workloads utilizing persistent volumes:

- The TKG upgrade hangs due to a misconfigured PodDisruptionBudget (PDB).
- After the PDB errors are resolved, the workers are reconciled and the old workers are deleted.
- The CSI controller repeatedly tries to detach the volume from a node that is no longer present.
- All stateful workloads are stuck in container creation or an init state, depending on the stage at which the volume is mounted.

Repeated "failed to find VirtualMachine for node" error messages are logged in the vSphere CSI controller logs:

    kubectl logs -n kube-system vsphere-csi-controller-76d888d87c-wsml9 vsphere-csi-controller
    {"level":"error","time":"2020-09-27T18:15:59.174108121Z","caller":"vanilla/controller.go:569","msg":"failed to find VirtualMachine for node:\"tcp-md-0-5bb7dc9f5c-mbjwl\". Error: node wasn't found","TraceId":"a5ab0f92-a59e-4b67-9185-a9bd020cc1fb","stacktrace":"sigs.k8s.io/vsphere-csi-driver/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume/build/pkg/csi/service/vanilla/controller.go:569github.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler.func1/go/pkg/mod/github.com/container-storage-interface/spec@v1.2.0/lib/go/csi/csi.pb.go:5200github.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerUnpublishVolume/go/pkg/mod/github.com/rexray/gocsi@v1.2.1/middleware/serialvolume/serial_volume_locker.go:141

The application pods are stuck during the container creation or init container phase and repeatedly fail with "Multi-Attach error for volume" and "Unable to attach or mount volumes":

    kubectl describe pod <your-failing-pod>
    Warning  FailedAttachVolume  32m                 attachdetach-controller             Multi-Attach error for volume "pvc-c3c29367-658b-4548-ac7c-134fa73df4c2" Volume is already exclusively attached to one node and can't be attached to another
    Warning  FailedMount         12m (x2 over 21m)   kubelet, tcp-md-0-7f67dbbfb8-lthnt  Unable to attach or mount volumes: unmounted volumes=[wordpress-persistent-storage], unattached volumes=[istio-certs default-token-wdbzv wordpress-persistent-storage istio-envoy]: timed out waiting for the condition
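To see which deleted nodes the CSI controller is still trying to detach from, the controller log can be filtered for the error shown above. The following is a minimal sketch, not a VMware-provided tool: the function name `missing_nodes` is hypothetical, and it assumes the log lines carry the node name in the same escaped-quote form as the sample.

```shell
#!/usr/bin/env bash
# Sketch: list the node names that the vSphere CSI controller reports as missing.
# Reads controller log lines on stdin and prints each distinct node name once.
# Assumption: the message has the same JSON form as the sample log line,
# i.e. node:\"<name>\" with backslash-escaped quotes.
missing_nodes() {
  sed -n 's/.*[Ff]ailed to find VirtualMachine for node:\\"\([^"\\]*\)\\".*/\1/p' | sort -u
}

# Example usage against a live cluster (the pod name is a placeholder):
#   kubectl logs -n kube-system <vsphere-csi-controller-pod> vsphere-csi-controller | missing_nodes
```

Any node name printed here that is also absent from `kubectl get nodes` is a candidate for the workaround described in this article.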

Resolution

This is a known issue, documented in the VMware vSphere Container Storage Plug-in 2.5 release notes: https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/2.5/rn/vmware-vsphere-container-storage-plugin-25-release-notes/index.html#known-issues

Workaround

- Delete VolumeAttachment resources only after the node has been drained and deleted; otherwise the workload might be impacted.
- The operation below must be repeated for all pods backed by persistent volumes.
- This process should happen in parallel when triggering the upgrade with the Tanzu CLI.

1. Check the status of the pods:

    kubectl get pods -A | grep -v Running
    NAME                                   READY   STATUS     RESTARTS   AGE
    pod/web-0                              0/2     Init:0/1   0          9h
    pod/wordpress-6c6794cb7d-cdnsc         0/2     Init:0/1   0          31m
    pod/wordpress-mysql-756d555798-gtvvp   0/2     Init:0/1   0          9h

2. Query the existing volumeattachments and compare the node names they reference with the nodes that are currently part of the cluster:

    kubectl get volumeattachments.storage.k8s.io

   Comparing the two outputs shows that certain volumeattachment objects refer to nodes that are no longer part of the cluster. These attachments must be deleted to work around the Multi-Attach errors.

3. Before deleting any volume attachments, make sure that you delete only the attachments of nodes that do not appear in the output of kubectl get nodes. To delete the attachments, remove the finalizer from all the volumeattachments that belonged to the older nodes:

    kubectl patch volumeattachments.storage.k8s.io csi-<uuid> -p '{"metadata":{"finalizers":[]}}' --type=merge

The new volume attachments should be created soon afterwards, and the new nodes will then be able to mount the persistent volumes.
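The comparison between volumeattachments and live nodes can be scripted so that only attachments pointing at deleted nodes are selected. The sketch below is an illustration under stated assumptions, not part of the vendor workaround: the function names `stale_attachments` and `print_patch_commands` and the file names in the usage comment are hypothetical. It only prints the patch commands from this article for review and does not execute them.

```shell
#!/usr/bin/env bash
# stale_attachments NODES_FILE ATTACHMENTS_FILE
#   NODES_FILE:       one live node name per line
#   ATTACHMENTS_FILE: "<volumeattachment-name> <node-name>" per line
# Prints the names of attachments whose node is not listed in NODES_FILE.
stale_attachments() {
  # Safety check: with an empty node list every attachment would look stale.
  if [ ! -s "$1" ]; then
    echo "node list is empty; refusing to treat every attachment as stale" >&2
    return 1
  fi
  # First file fills the set of live nodes; second file is filtered against it.
  awk 'NR == FNR { live[$1] = 1; next } NF >= 2 && !($2 in live) { print $1 }' "$1" "$2"
}

# Reads attachment names on stdin and prints (does not run) the patch command
# from this article for each one, so the list can be reviewed first.
print_patch_commands() {
  while read -r name; do
    [ -n "$name" ] || continue
    printf "kubectl patch volumeattachments.storage.k8s.io %s -p '%s' --type=merge\n" \
      "$name" '{"metadata":{"finalizers":[]}}'
  done
}

# Example usage (file names are placeholders):
#   kubectl get nodes -o name | sed 's|^node/||' > nodes.txt
#   kubectl get volumeattachments.storage.k8s.io --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName > attachments.txt
#   stale_attachments nodes.txt attachments.txt | print_patch_commands
```

Printing the commands instead of running them keeps the manual check required above in place: each listed attachment can be verified against `kubectl get nodes` before its finalizer is removed.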

Relevant Products

Affected versions: 1.x

Fixed versions: No known fixed versions
