Users may experience issues caused by a conflicting Mutating Webhook when upgrading the Dell Automation Platform. The first symptom is that the upgrade stalls for a long time (more than 25 minutes) at the PORTAL ChartKey deployment step. The main installation log shows:

    ...
    Orchestrator chart exists. Skip unarchive...
    Portal chart exists. Skip unarchive...
    Portal Installation has started
    OperationType: INSTALL
    OperationStatus: IN_PROGRESS
    ChartKey: PORTAL

The upgrade usually stalls at the point where the Portal vault pods appear in the pod list. Two of the vault pods report 2/3 in the READY column, for example:

    # kubectl get po -A
    ...
    dapp edgevault-0 3/3 Running 0 30m
    dapp edgevault-1 2/3 Running 0 30m
    dapp edgevault-2 2/3 Running 0 30m
    ...

The vault logs show that the vault nodes cannot communicate with each other:

    2025-10-30T15:27:26.896Z [INFO] core: attempting to join possible raft leader node: leader_addr=http://edgevault-2.edgevault-internal:8200
    2025-10-30T15:27:26.900Z [ERROR] core: failed to retry join raft cluster: retry=2s err="failed to send answer to raft leader node: error bootstrapping cluster: cluster already has state"
    2025-10-30T15:27:28.664Z [ERROR] core: failed to get raft challenge: leader_addr=http://edgevault-1.edgevault-internal:8200 error="error during raft bootstrap init call: context deadline exceeded"
    2025-10-30T15:27:28.664Z [ERROR] core: failed to get raft challenge: leader_addr=http://edgevault-0.edgevault-internal:8200 error="error during raft bootstrap init call: context deadline exceeded"
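To spot the partially ready vault pods quickly, the pod listing can be filtered for rows whose READY count is below the expected count. The following is a minimal sketch; the function name not_ready_pods is hypothetical, and it assumes the column layout of `kubectl get po -A` (NAMESPACE, NAME, READY, ...).

```shell
# Hypothetical helper: reads `kubectl get po -A` output on stdin and prints
# "namespace/pod ready/total" for every pod with fewer ready containers
# than expected. Assumes READY is the third column, as it is with -A.
not_ready_pods() {
  awk '{
    # Rows whose third column is not "x/y" (such as the header) are skipped.
    if (split($3, r, "/") == 2 && r[1] ~ /^[0-9]+$/ && r[2] ~ /^[0-9]+$/) {
      if (r[1] + 0 < r[2] + 0) print $1 "/" $2 " " $3
    }
  }'
}

# Usage (requires cluster access):
#   kubectl get po -A | not_ready_pods
```

With the listing shown above, this would report edgevault-1 and edgevault-2 in the dapp namespace.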
The root cause is a conflicting Mutating Webhook in the Orchestrator that interferes with the Portal. The conflict arises when the Orchestrator's Mutating Webhook is not properly configured, which causes the sidecar to fail to encrypt outgoing traffic. As a result, the SSL termination logic cannot handle the traffic properly, which disrupts communication between pods in the namespace. This issue typically occurs in installations that were initially installed with NativeEdge Orchestrator (NEO) 2.2 or an earlier release and then upgraded later.

Explanation

A Mutating Webhook is a Kubernetes feature that allows resources, such as pods, to be modified before they are created or updated. In the Dell Automation Platform, the Orchestrator's Mutating Webhook is responsible for injecting sidecars into pods. Historically, installations used a single namespace, so no namespace selector was needed. In newer versions, a namespace selector is required to prevent the Orchestrator's Mutating Webhook from interfering with other components. The selector ensures that sidecar injection occurs only within the correct namespace.

Note: This issue DOES occur if the Orchestrator was installed with NEO 2.2 or an earlier release and then upgraded later. This issue DOES NOT occur if the Orchestrator was installed from NEO 3.0 or a later release and then upgraded later.
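Before an upgrade, it may help to check whether the webhook configuration already carries a namespace selector entry. The sketch below is a plain text match on the `kubernetes.io/metadata.name` key, not a structural check of the YAML, so treat a positive result as a hint and confirm it by inspecting the configuration. The function name has_namespace_selector is hypothetical.

```shell
# Hypothetical helper: reads a MutatingWebhookConfiguration (YAML or JSON)
# on stdin and reports whether the kubernetes.io/metadata.name key appears
# anywhere in it. This is a text match only; it does not verify that the key
# sits under webhooks.namespaceSelector.matchExpressions.
has_namespace_selector() {
  if grep -q 'kubernetes\.io/metadata\.name'; then
    echo "namespace selector entry found"
  else
    echo "namespace selector entry missing"
    return 1
  fi
}

# Usage (requires cluster access):
#   kubectl get mutatingwebhookconfigurations hzp-iam-sidecar-injector -o yaml \
#     | has_namespace_selector
```

A "missing" result on an installation that started on NEO 2.2 or earlier suggests the webhook is not yet limited to a namespace.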
To resolve this issue, modify the Orchestrator's Mutating Webhook configuration before starting the upgrade.

Important Note: If a previous upgrade attempt failed with these symptoms, it is recommended to first either roll back to the preupgrade snapshot or remove the checkpoint data from the ConfigMaps. This helps ensure a clean and successful upgrade.

To remove the checkpoint data, locate the ConfigMap and delete it:

    # kubectl get cm -A | grep check
    hzp checkpoint-data 7 31m
    # kubectl delete cm checkpoint-data -n hzp

Fixing the webhook: Before starting (or restarting) the upgrade, add an entry to the webhooks.namespaceSelector.matchExpressions path in the Orchestrator's Mutating Webhook configuration. Open the configuration for editing:

    kubectl edit mutatingwebhookconfigurations hzp-iam-sidecar-injector

Find the following section:

    ....
    namespaceSelector:
      matchExpressions:
      ...

If this section does not already contain the following snippet, add it under matchExpressions. Indentation is important!

    - key: kubernetes.io/metadata.name
      operator: In
      values:
      - hzp

This limits the Orchestrator's Mutating Webhook to the hzp namespace and stops it from interfering with the Portal. Once applied, sidecars are no longer injected in the "portal" namespace, which resolves the issue for all affected pods.
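Where an interactive edit is inconvenient, the same entry could in principle be appended with `kubectl patch` and a JSON Patch document. This is only a sketch and not the documented procedure: it assumes the sidecar injector webhook is the first entry (index 0) in the webhooks list and that namespaceSelector.matchExpressions already exists, since a JSON Patch "add" to a missing array fails. The function name build_selector_patch is hypothetical.

```shell
# Hypothetical helper: prints a JSON Patch that appends a namespace selector
# entry for the given namespace to the first webhook's matchExpressions.
# Assumptions: the webhook is at index 0 and matchExpressions already exists.
build_selector_patch() {
  ns="$1"
  printf '[{"op":"add","path":"/webhooks/0/namespaceSelector/matchExpressions/-","value":{"key":"kubernetes.io/metadata.name","operator":"In","values":["%s"]}}]\n' "$ns"
}

# Usage (requires cluster access; review the printed patch before applying):
#   kubectl patch mutatingwebhookconfigurations hzp-iam-sidecar-injector \
#     --type=json -p "$(build_selector_patch hzp)"
```

Printing the patch first and reviewing it against the live configuration keeps the change as visible as it would be in `kubectl edit`.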