Symptoms
Sporadic slow machine creation when using TKGm, with an increased number of open and idle sessions; the problem is especially exacerbated at scale.
Cause
Some of the VSphereVMs can end up at the back of the processing queue because of transient errors (for example, too many client-side connections).
Resolution
VMware is aware of this issue in CAPV 1.0.5 and is working to fix it in a future release, CAPV 1.5.
Workaround
To mitigate the issue, enable keep-alive sessions and set the sync-period to 5 minutes (the default is 10 minutes) on the CAPV controller manager. Run the following steps:
1. Switch to the management cluster context:
kubectl config get-contexts
kubectl config use-context CONTEXT_NAME
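To confirm that the current context points at the management cluster before making changes, the following checks can be used (the deployment name matches the one edited in the next step):
kubectl config current-context
kubectl get deployment -n capv-system capv-controller-manager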
2. Find and edit the capv controller manager deployment:
kubectl edit deployment -n capv-system capv-controller-manager
3. Add the following flags to the manager container arguments:
spec:
  containers:
  - args:
    ...
    - --sync-period=5m0s
    - --enable-keep-alive
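As an alternative to interactive editing, the same flags could be appended non-interactively with a JSON patch. This is only a sketch; it assumes the manager container is at index 0 of the containers list and that the flags are not already present:
kubectl patch deployment -n capv-system capv-controller-manager --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--sync-period=5m0s"},{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--enable-keep-alive"}]'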
Example:
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2023-04-27T09:13:18Z"
  generateName: capv-controller-manager-9d8499798-
  labels:
    cluster.x-k8s.io/provider: infrastructure-vsphere
    control-plane: controller-manager
    pod-template-hash: 9d8499798
  name: capv-controller-manager-9d8499798-slngk
  namespace: capv-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: capv-controller-manager-9d8499798
    uid: 6a06b2b6-d168-434c-ae54-0696c727e430
  resourceVersion: "374526282"
  uid: 6c777950-08d9-45d1-863f-9643af27a92b
spec:
  containers:
  - args:
    - --enable-leader-election
    - --metrics-addr=0.0.0.0:8080
    - --enable-keep-alive
    - --sync-period=5m0s
    - --logtostderr
    - --v=4
    env:
    - name: HTTP_PROXY
    - name: HTTPS_PROXY
    - name: NO_PROXY
    image: projects.registry.vmware.com/tkg/cluster-api/cluster-api-vsphere-controller:v1.0.3_vmware.1
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
4. Save and exit, which should cause the CAPV manager pod to be redeployed.
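To verify that the change rolled out, commands like the following can be used; the container index 0 is an assumption and may need adjusting if the deployment has more than one container:
kubectl -n capv-system rollout status deployment capv-controller-manager
kubectl -n capv-system get deployment capv-controller-manager -o jsonpath='{.spec.template.spec.containers[0].args}'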
This change is not persistent and will be overridden during the next management cluster upgrade.
This should ensure a more frequent overall sync by CAPV.