Dell Automation Platform: Upgrade Issues, Portal Vault Is Not Starting
Summary: This article describes how to resolve Dell Automation Platform upgrade issues in which the Portal Vault pods do not come up.
Symptoms
Users may experience issues due to a conflicting Mutating Webhook when upgrading the Dell Automation Platform.
The first symptom is that the upgrade is stuck for a long time (more than 25 minutes) at the PORTAL ChartKey deployment step. The main installation log shows:
...
Orchestrator chart exists. Skip unarchive...
Portal chart exists. Skip unarchive...
Portal Installation has started
OperationType: INSTALL
OperationStatus: IN_PROGRESS
ChartKey: PORTAL
The upgrade usually stalls at the point where the Portal Vault pods appear in the pod list. Two of the Vault pods report a READY state of 2/3, for example:
#kubectl get po -A
...
dapp edgevault-0 3/3 Running 0 30m
dapp edgevault-1 2/3 Running 0 30m
dapp edgevault-2 2/3 Running 0 30m
...
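On a busy cluster the partially ready pods can be hard to spot in the full pod list. The following is a minimal sketch of a filter for them. The function name not_ready_pods is hypothetical and not part of the product. It assumes the column layout that kubectl get po -A prints (namespace, name, READY, and so on).

```shell
# Print namespace, pod name, and READY column for every pod whose
# ready-container count is lower than its total container count.
# Reads `kubectl get po -A --no-headers` style output on stdin.
# NOTE: not_ready_pods is a hypothetical helper name, not a product tool.
not_ready_pods() {
  awk '{
    split($3, ready, "/")          # READY column looks like "2/3"
    if (ready[1] + 0 < ready[2] + 0) print $1, $2, $3
  }'
}

# Usage against a live cluster:
#   kubectl get po -A --no-headers | not_ready_pods
```

For the pod list shown above, the filter would keep only the edgevault-1 and edgevault-2 lines.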
The logs show that the vault cannot communicate between the nodes:
2025-10-30T15:27:26.896Z [INFO] core: attempting to join possible raft leader node: leader_addr=http://edgevault-2.edgevault-internal:8200
2025-10-30T15:27:26.900Z [ERROR] core: failed to retry join raft cluster: retry=2s err="failed to send answer to raft leader node: error bootstrapping cluster: cluster already has state"
2025-10-30T15:27:28.664Z [ERROR] core: failed to get raft challenge: leader_addr=http://edgevault-1.edgevault-internal:8200 error="error during raft bootstrap init call: context deadline exceeded"
2025-10-30T15:27:28.664Z [ERROR] core: failed to get raft challenge: leader_addr=http://edgevault-0.edgevault-internal:8200 error="error during raft bootstrap init call: context deadline exceeded"
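To confirm that a pod shows this symptom, its logs can be filtered for the two raft error messages in the excerpt above. This is a sketch only. The helper name raft_join_errors is hypothetical, and the pod and namespace names in the usage comment are taken from the example output, so they may differ in a given environment.

```shell
# Keep only the raft join failures of the kind shown in the log excerpt.
# Reads Vault log lines on stdin.
# NOTE: raft_join_errors is a hypothetical helper name.
raft_join_errors() {
  grep -E '\[ERROR\] core: failed to (retry join raft cluster|get raft challenge)'
}

# Usage (--all-containers avoids having to guess the container name):
#   kubectl logs edgevault-1 -n dapp --all-containers | raft_join_errors
```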
Cause
The root cause of this issue is the conflicting Mutating Webhook in the Orchestrator, which interferes with the portal.
This conflict arises when the Orchestrator's Mutating Webhook is not properly configured, causing the sidecar to fail to encrypt outgoing traffic. As a result, the SSL termination logic cannot handle the traffic correctly, which breaks communication between the pods in the namespace. This issue typically occurs in environments that were initially installed with NativeEdge Orchestrator (NEO) 2.2 or an earlier release and were upgraded later.
Explanation
A Mutating Webhook is a Kubernetes feature that allows for the modification of resources, such as pods, before they are created or updated. In the context of the Dell Automation Platform, the Orchestrator's Mutating Webhook plays a crucial role in injecting sidecars into pods. Historically, installations were done in a single namespace, eliminating the need for a namespace selector. However, with newer versions, a namespace selector is required to prevent the Orchestrator's Mutating Webhook from interfering with other components. This ensures sidecar injection occurs within the correct namespace.
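To see whether an existing installation already restricts the webhook by namespace, the webhook configuration can be dumped and searched for a selector entry keyed on kubernetes.io/metadata.name. The sketch below is a plain text match, not a YAML parse, and the helper name has_namespace_name_selector is hypothetical. The webhook configuration name hzp-iam-sidecar-injector is the one used in the Resolution section of this article.

```shell
# Succeeds (exit 0) if the webhook configuration read from stdin contains
# a matchExpressions entry keyed on kubernetes.io/metadata.name.
# NOTE: this is a simple fixed-string match, not a YAML parser, and
# has_namespace_name_selector is a hypothetical helper name.
has_namespace_name_selector() {
  grep -qF -- '- key: kubernetes.io/metadata.name'
}

# Usage:
#   kubectl get mutatingwebhookconfigurations hzp-iam-sidecar-injector -o yaml \
#     | has_namespace_name_selector && echo "selector present" || echo "selector missing"
```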
Resolution
To resolve this issue, it is essential to modify the Orchestrator's Mutating Webhook configuration before initiating the upgrade process.
- Roll back to the pre-upgrade snapshot.
- Remove the checkpoint data from the ConfigMaps. This helps ensure a clean and successful upgrade process. To find and remove the checkpoint-data ConfigMap, use the following commands:
#kubectl get cm -A | grep check
hzp checkpoint-data 7 31m
#kubectl delete cm checkpoint-data -n hzp
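The lookup and the delete can also be combined so that the namespace is taken from the listing instead of being typed by hand. The following is a sketch under that assumption. The helper name checkpoint_delete_cmds is hypothetical, and it only prints the delete commands so that they can be reviewed first.

```shell
# Turn `kubectl get cm -A --no-headers` output (stdin) into one delete
# command per ConfigMap named exactly "checkpoint-data".
# The commands are only printed, so they can be reviewed before running.
# NOTE: checkpoint_delete_cmds is a hypothetical helper name.
checkpoint_delete_cmds() {
  awk '$2 == "checkpoint-data" { print "kubectl delete cm " $2 " -n " $1 }'
}

# Usage:
#   kubectl get cm -A --no-headers | checkpoint_delete_cmds        # review
#   kubectl get cm -A --no-headers | checkpoint_delete_cmds | sh   # execute
```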
Fixing the webhook:
Before starting (or restarting) the upgrade, open the Orchestrator's Mutating Webhook configuration for editing and add the entry described below to the webhooks.namespaceSelector.matchExpressions path:
kubectl edit mutatingwebhookconfigurations hzp-iam-sidecar-injector
Find the following section:
....
namespaceSelector:
  matchExpressions:
  ...
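If this section does not already contain the following snippet, add it directly under matchExpressions. Indentation is important: the new entry must line up with the other entries in the matchExpressions list.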
  - key: kubernetes.io/metadata.name
    operator: In
    values:
    - hzp
This restricts the Orchestrator's Mutating Webhook to the hzp namespace, so it no longer interferes with the Portal. Once the change is applied, sidecars are no longer injected into pods in the "portal" namespace, which resolves the issue for all affected pods.
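Where interactive editing is not practical, the same entry could in principle be added with kubectl patch and a JSON Patch document. The sketch below is an assumption-laden alternative, not a documented procedure: it assumes the sidecar injector is the webhook at index 0 in the configuration and that the namespaceSelector.matchExpressions list already exists. It does not check for an existing entry, so it should be used only when the entry is missing. The helper name selector_patch is hypothetical.

```shell
# Print a JSON Patch that appends the hzp namespace match to the
# matchExpressions list of the webhook at the given index (default 0).
# ASSUMPTIONS: the webhook index is correct and the matchExpressions list
# already exists at that path. selector_patch is a hypothetical helper name.
selector_patch() {
  idx="${1:-0}"
  printf '[{"op":"add","path":"/webhooks/%s/namespaceSelector/matchExpressions/-","value":{"key":"kubernetes.io/metadata.name","operator":"In","values":["hzp"]}}]\n' "$idx"
}

# Usage:
#   kubectl patch mutatingwebhookconfigurations hzp-iam-sidecar-injector \
#     --type=json -p "$(selector_patch 0)"
```

Printing the patch first (selector_patch 0) makes it possible to review the exact change before it is sent to the cluster.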