BugZero | VMware BugID 84446 - Pods fail to attach or detach volumes

VMware - Defect ID: 84446

Pods fail to attach or detach volumes

Last updated on September 1st, 2023

BugZero Risk Score
5.3 Medium

Overall: N/A

Severity: N/A

Community: N/A

Lifecycle: N/A

Bug Details

Description

Symptoms

When using the vSphere CSI driver in a multi cluster Tanzu Kubernetes Grid Integrated Edition (TKGI) environment, pods start failing to attach or detach volumes.

You see messages similar to the following in the events for the affected pods:

Warning FailedMount 25m (x2145 over 16d) kubelet, cfbfhbeb-a32b-48df-8d30-94562da4701f Unable to attach or mount volumes: unmounted volumes=[myvolume], unattached volumes=[myvolume]: timed out waiting for the condition

Warning FailedAttachVolume 19m attachdetach-controller AttachVolume.Attach failed for volume "pvc-21ad222f-ffa2-123a-abcf-edfabc456231" : rpc error: code = Internal desc = failed to attach disk: "432567fa-abcd-4449-5674-765432abccfb" with node: "cffebaeb-a20a-41a0-89b0-95617c64701f" err failed to attach cns volume: "432567fa-abcd-4449-5674-765432abccfb" to node vm: "VirtualMachine:vm-172 [VirtualCenterHost: 10.237.25.10, UUID: 4237a733-67c3-8130-702c-63f9383289ba, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-21, VirtualCenterHost: 10.237.25.10]]". fault: "(*types.LocalizedMethodFault)(0xc0007df960)({\n DynamicData: (types.DynamicData) {\n },\n Fault: (types.CnsFault) {\n BaseMethodFault: (types.BaseMethodFault) <nil>,\n Reason: (string) (len=79) \"CNS: The input volume 432567fa-abcd-4449-5674-765432abccfb is not a CNS volume.\"\n },\n LocalizedMessage: (string) (len=95) \"CnsFault error: CNS: The input volume 432567fa-abcd-4449-5674-765432abccfb is not a CNS volume.\"\n})\n". opId: "074be712"

You see messages similar to the following in the kube-controller logs on the control plane nodes:

E0616 14:11:07.519321 11 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.vsphere.vmware.com^abcde123-1456-bcde-5643-aaa5674433f podName: nodeName:}" failed. No retries permitted until 2021-06-16 14:11:08.019297747 +0000 UTC m=+3981445.422084190 (durationBeforeRetry 500ms). Error: "AttachVolume.Attach failed for volume \"pvc-21ad222f-ffa2-123a-abcf-edfabc456231\" (UniqueName: \"kubernetes.io/csi/csi.vsphere.vmware.com^abcde123-1456-bcde-5643-aaa5674433f\") from node \"e13e7d0b-464f-43ba-a130-271d14e3c107\" : rpc error: code = Aborted desc = pending"

I0616 14:11:07.519367 11 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"mynamespace", Name:"mypod", UID:"dfe86621-5731-48f6-8814-b41d5318c32f", APIVersion:"v1", ResourceVersion:"80991861", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' AttachVolume.Attach failed for volume "pvc-21ad222f-ffa2-123a-abcf-edfabc456231" : rpc error: code = Aborted desc = pending

W0616 14:11:07.521828 11 reconciler.go:206] attacherDetacher.DetachVolume started for volume "pvc-21ad222f-ffa2-123a-abcf-edfabc456231" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^abcde123-1456-bcde-5643-aaa5674433f") on node "cffebaeb-a20a-41a0-89b0-95617c64701f" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching

I0616 14:11:07.521874 11 reconciler.go:275] attacherDetacher.AttachVolume started for volume "pvc-27ea3883-d20a-4dbd-82ea-f346048c988c" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^12345abc-abcd-sf67-8904-abc3217865c9b") from node "cffebaeb-a20a-41a0-89b0-95617c64701f"

E0616 14:11:07.668770 11 csi_attacher.go:662] kubernetes.io/csi: detachment for VolumeAttachment for volume [abcde123-1456-bcde-5643-aaa5674433f] failed: rpc error: code = Internal desc = volumeID "abcde123-1456-bcde-5643-aaa5674433f" not found in QueryVolume

E0616 14:11:07.668869 11 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.vsphere.vmware.com^abcde123-1456-bcde-5643-aaa5674433f podName: nodeName:}" failed. No retries permitted until 2021-06-16 14:11:08.168818196 +0000 UTC m=+3981445.571604639 (durationBeforeRetry 500ms). Error: "DetachVolume.Detach failed for volume \"pvc-21ad222f-ffa2-123a-abcf-edfabc456231\" (UniqueName: \"kubernetes.io/csi/csi.vsphere.vmware.com^abcde123-1456-bcde-5643-aaa5674433f\") on node \"cffebaeb-a20a-41a0-89b0-95617c64701f\" : rpc error: code = Internal desc = volumeID \"abcde123-1456-bcde-5643-aaa5674433f\" not found in QueryVolume"
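To check whether a saved log or event dump shows the same pattern, a small script can count the lines that carry the error strings quoted above. The following is a minimal sketch, not a VMware tool: the function name and the signature labels are hypothetical, and the substrings are copied from the example messages in this article, so wording may differ in other driver versions.

```python
# Substrings copied verbatim from the example messages in this article.
# The dictionary keys are hypothetical labels chosen for this sketch.
SIGNATURES = {
    "not_a_cns_volume": "is not a CNS volume",
    "not_found_in_query_volume": "not found in QueryVolume",
    "attach_pending": "code = Aborted desc = pending",
}


def count_signatures(log_text):
    """Return, per signature, how many lines of log_text contain it."""
    counts = {name: 0 for name in SIGNATURES}
    for line in log_text.splitlines():
        for name, needle in SIGNATURES.items():
            if needle in line:
                counts[name] += 1
    return counts
```

Non-zero counts for the first two signatures on volumes that exist in vCenter are consistent with the cause described in this article, but they are not proof of it on their own.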

Cause

The vSphere CSI driver uses the cluster-id for the volume create spec. If multiple Kubernetes clusters in the same vSphere environment use the same cluster-id, then each time one of those clusters syncs with vCenter, it tags or untags the volumes in vCenter. As a result, volumes fail to attach or detach as they should.
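One way to confirm this cause is to compare the cluster-id configured for each cluster. The sketch below assumes you have already collected the text of each cluster's CSI driver configuration (the file commonly named csi-vsphere.conf, with a cluster-id key) into a dictionary keyed by a cluster name of your choosing. The function name and the input layout are hypothetical.

```python
import re
from collections import defaultdict

# Matches a line such as: cluster-id = "my-cluster" (quotes optional).
CLUSTER_ID_RE = re.compile(
    r'^[ \t]*cluster-id[ \t]*=[ \t]*"?([^"\n]+?)"?[ \t]*$',
    re.MULTILINE,
)


def find_shared_cluster_ids(configs):
    """Map each cluster-id used by more than one cluster to those clusters.

    configs: dict of {cluster_name: configuration file text}. Both the
    names and the way the text was gathered are up to the caller.
    """
    by_id = defaultdict(list)
    for cluster_name, conf_text in configs.items():
        match = CLUSTER_ID_RE.search(conf_text)
        if match:
            by_id[match.group(1)].append(cluster_name)
    return {cid: sorted(names) for cid, names in by_id.items() if len(names) > 1}
```

An empty result means no two of the supplied configurations share a cluster-id; any entry in the result lists clusters that will interfere with each other's volumes.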

Impact / Risks

It can take up to two hours to change cluster-ids if multiple clusters are using the same value. During that time volumes cannot be managed and new volumes cannot be created.

Resolution

Deploy each vSphere CSI driver with a unique cluster-id.

Workaround

For existing clusters, it is possible to change the cluster-id. One of the clusters has to keep the original cluster-id.

1. Choose one Kubernetes cluster to keep the original cluster-id value.
2. Modify the CSI driver deployment so that the replica count is 0 on all Kubernetes clusters except the one you selected in Step 1.
3. Wait at least one hour to allow other volumes to de-register.
4. Change the cluster-id value in all Kubernetes clusters except the one you selected in Step 1.
5. Modify the CSI driver deployment so that the replica count is 1 on all Kubernetes clusters except the one you selected in Step 1.
6. Wait at least one hour to allow volumes to get re-registered.
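The order of operations matters here, so it can help to write the plan out before touching any cluster. The sketch below only builds a list of steps as text and runs nothing. It makes several assumptions that are not in this article: that each cluster is reachable through a kubectl context named after it, and that the CSI controller is a Deployment called vsphere-csi-controller in the kube-system namespace. Check the actual names in your environment before using any generated command.

```python
def build_workaround_plan(clusters, keeper,
                          deployment="vsphere-csi-controller",  # assumed name
                          namespace="kube-system"):             # assumed namespace
    """Return the workaround as an ordered list of human-readable steps.

    clusters: kubectl context names, one per cluster (an assumption of
    this sketch). keeper: the cluster that keeps the original cluster-id.
    """
    if keeper not in clusters:
        raise ValueError("keeper must be one of the listed clusters")
    others = [c for c in clusters if c != keeper]

    def scale(cluster, replicas):
        return (f"kubectl --context {cluster} -n {namespace} "
                f"scale deployment {deployment} --replicas={replicas}")

    plan = [f"Keep the original cluster-id on {keeper}"]
    plan += [scale(c, 0) for c in others]
    plan.append("Wait at least one hour for volumes to de-register")
    plan += [f"Set a new, unique cluster-id on {c}" for c in others]
    plan += [scale(c, 1) for c in others]
    plan.append("Wait at least one hour for volumes to re-register")
    return plan
```

For example, `print("\n".join(build_workaround_plan(["prod", "dev", "test"], keeper="prod")))` lists the steps for three clusters, with no scale command issued against the cluster that keeps its cluster-id.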

Related Information

This is only applicable to TKGI 1.10 and lower as vSphere Cloud Native Storage is integrated into TKGI in version 1.11 and higher. See Cloud Native Storage (CNS) on vSphere for more information.

Relevant Products

Affected versions: 1.x

Fixed versions: No known fixed versions
