Dell Automation Platform: Upgrade Issues When Portal Vault Is Not Starting

Summary: This article describes how to resolve a Dell Automation Platform upgrade that stalls because the Portal Vault pods do not come up.

Symptoms

Users may experience issues due to a conflicting Mutating Webhook when upgrading the Dell Automation Platform.

The first symptom is that the upgrade stalls for a long time (more than 25 minutes) at the PORTAL ChartKey deployment step. The main installation log shows:

...
Orchestrator chart exists. Skip unarchive...
Portal chart exists. Skip unarchive...
Portal Installation has started
OperationType: INSTALL OperationStatus: IN_PROGRESS ChartKey: PORTAL

The upgrade usually stalls at the point where the Portal Vault pods appear in the pod list. Two of the three Vault pods show a READY state of 2/3, for example:

#kubectl get po -A
...
dapp              edgevault-0                                          3/3     Running     0             30m
dapp              edgevault-1                                          2/3     Running     0             30m
dapp              edgevault-2                                          2/3     Running     0             30m
...
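To spot this state without scanning the full listing, the READY column can be filtered for pods whose ready count is lower than their container count. The following is a minimal sketch, not part of the product tooling: the function name not_ready_pods is hypothetical, and it assumes the default column layout of the listing above (NAMESPACE, NAME, READY, ...).

```shell
# Hypothetical helper: reads `kubectl get po -A` output on stdin and prints
# "namespace/name READY" for every pod whose ready count is incomplete.
not_ready_pods() {
  awk '$3 ~ /^[0-9]+\/[0-9]+$/ {
         split($3, r, "/")
         if (r[1] + 0 < r[2] + 0) print $1 "/" $2, $3
       }'
}
```

It could be used as `kubectl get po -A | not_ready_pods`. Pods of finished jobs (for example 0/1 Completed) are also listed, so the output still needs a quick review.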

The Vault logs show that the nodes cannot communicate with each other:

2025-10-30T15:27:26.896Z [INFO]  core: attempting to join possible raft leader node: leader_addr=http://edgevault-2.edgevault-internal:8200
2025-10-30T15:27:26.900Z [ERROR] core: failed to retry join raft cluster: retry=2s err="failed to send answer to raft leader node: error bootstrapping cluster: cluster already has state"
2025-10-30T15:27:28.664Z [ERROR] core: failed to get raft challenge: leader_addr=http://edgevault-1.edgevault-internal:8200 error="error during raft bootstrap init call: context deadline exceeded"
2025-10-30T15:27:28.664Z [ERROR] core: failed to get raft challenge: leader_addr=http://edgevault-0.edgevault-internal:8200 error="error during raft bootstrap init call: context deadline exceeded"
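To confirm the same pattern on a given pod, its logs can be reduced to the raft-related errors. This is a sketch under assumptions: raft_errors is a hypothetical helper name, and it only matches the log format shown above ([ERROR] followed by a message that mentions raft).

```shell
# Hypothetical helper: keeps only raft-related ERROR lines from Vault logs
# supplied on stdin.
raft_errors() {
  grep -E '\[ERROR\].*raft'
}
```

For example: `kubectl logs edgevault-1 -n dapp --all-containers | raft_errors`. The pod name and namespace are taken from the listing above; --all-containers is used because this article does not name the Vault container.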

Cause

The root cause of this issue is the conflicting Mutating Webhook in the Orchestrator, which interferes with the portal.

This conflict arises when the Orchestrator's Mutating Webhook is not properly configured, causing the sidecar to fail to encrypt outgoing traffic. As a result, the SSL termination logic cannot handle the traffic properly, and communication between the pods in the namespace breaks down. This issue typically occurs in installations that were initially installed with NativeEdge Orchestrator (NEO) 2.2 or an earlier release and then upgraded later.


Explanation

A Mutating Webhook is a Kubernetes feature that allows for the modification of resources, such as pods, before they are created or updated. In the context of the Dell Automation Platform, the Orchestrator's Mutating Webhook plays a crucial role in injecting sidecars into pods. Historically, installations were done in a single namespace, eliminating the need for a namespace selector. However, with newer versions, a namespace selector is required to prevent the Orchestrator's Mutating Webhook from interfering with other components. This ensures sidecar injection occurs within the correct namespace.
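To see whether the webhook currently has a namespace selector, its configuration can be printed per webhook entry. The sketch below assumes the configuration name hzp-iam-sidecar-injector used in the Resolution section of this article; the function name show_webhook_selectors is hypothetical. An empty selector means that the webhook applies to pods in every namespace.

```shell
# Hypothetical helper: prints each webhook name and its namespaceSelector
# for the Orchestrator sidecar injector configuration.
show_webhook_selectors() {
  kubectl get mutatingwebhookconfigurations hzp-iam-sidecar-injector \
    -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.namespaceSelector}{"\n"}{end}'
}
```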
 

Note: This issue DOES happen if the Orchestrator was installed with NEO 2.2 or an earlier release and then upgraded later. This issue DOES NOT happen if the Orchestrator was installed with NEO 3.0 or a later release and then upgraded later.

Resolution

To resolve this issue, it is essential to modify the Orchestrator's Mutating Webhook configuration before initiating the upgrade process.
 

Important Note: If a previous upgrade attempt failed with these symptoms, take one of the following corrective actions before retrying:
  • Roll back to the pre-upgrade snapshot
Or
  • Remove the checkpoint data from the ConfigMaps so that the upgrade starts from a clean state. To find and remove the ConfigMap, use the following commands:
#kubectl get cm -A | grep check
hzp               checkpoint-data                                        7      31m
#kubectl delete cm checkpoint-data -n hzp
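If the namespace of the checkpoint ConfigMap is not known in advance, the delete command can be derived from the listing rather than typed by hand. This is a sketch under assumptions: checkpoint_delete_cmds is a hypothetical helper, and it assumes the ConfigMap is named checkpoint-data as in the output above. It only prints the commands for review and does not delete anything.

```shell
# Hypothetical helper: reads `kubectl get cm -A` output on stdin and prints
# a delete command for each ConfigMap named checkpoint-data.
checkpoint_delete_cmds() {
  awk '$2 == "checkpoint-data" { print "kubectl delete cm " $2 " -n " $1 }'
}
```

For example: `kubectl get cm -A | checkpoint_delete_cmds`. Run the printed commands only after checking them.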


Fixing the webhook:
Before starting (or restarting) the upgrade, edit the Orchestrator's Mutating Webhook configuration and make sure that the entry described below is present under the webhooks.namespaceSelector.matchExpressions path:

kubectl edit mutatingwebhookconfigurations hzp-iam-sidecar-injector

Find the following section:

....
 namespaceSelector:
      matchExpressions:
...

If this section does not already contain the following snippet, add it. Indentation is important!

    - key: kubernetes.io/metadata.name
      operator: In
      values:
      - hzp

This restricts the Orchestrator's Mutating Webhook to the hzp namespace, so it no longer interferes with the portal. Once the change is applied, sidecars are no longer injected into pods in the "portal" namespace, which resolves the issue for all affected pods.
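After saving the edit, the configuration can be checked for the new entry. The sketch below is illustrative only: has_hzp_selector is a hypothetical helper, and it assumes the YAML layout produced by kubectl, where the value appears within three lines of the key. It relies on the -A option of grep, which GNU and BSD grep provide.

```shell
# Hypothetical helper: reads the webhook configuration YAML on stdin and
# succeeds only if the kubernetes.io/metadata.name expression lists hzp.
has_hzp_selector() {
  grep -A 3 'key: kubernetes.io/metadata.name' | grep -q -e '- hzp$'
}
```

For example: `kubectl get mutatingwebhookconfigurations hzp-iam-sidecar-injector -o yaml | has_hzp_selector && echo "selector present"`.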

Affected Products

Dell Automation Platform, NativeEdge Solutions, Dell Automation Platform Components, NativeEdge
Article Properties
Article Number: 000390269
Article Type: Solution
Last Modified: 20 November 2025
Version:  2