Dell Automation Platform upgrade from 1.0 to 1.2 fails due to high memory utilization on single-node orchestrator.
Summary: This article explains how to resolve Dell Automation Platform 1.0 to 1.2 upgrade failures on a single-node orchestrator caused by high memory utilization. The issue can occur when Fusion services have high replica counts and the node is under resource pressure. To resolve it, check memory usage, scale down Metrics Server if installed, scale down Fusion deployments if required, and retry the upgrade. If Metrics Server is not installed, skip that step and continue with the rest of the procedure. ...
Instructions
Procedure to Verify and Reduce Memory Usage Before Upgrade
Before performing the upgrade, verify the cluster memory utilization. If memory usage is below 60%, the upgrade can proceed normally. If memory usage is above 70%, reduce memory consumption before starting the upgrade.
Step 1: Verify cluster memory utilization
Run the following command to check node memory usage:
kubectl top nodes
Run the following command to check allocated node resources:
kubectl describe nodes | grep -A 12 "Allocated resources"
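The thresholds stated above can be applied mechanically to the kubectl top nodes output. The following is a minimal sketch, assuming the default column layout in which MEMORY% is the fifth column. The function name classify_node_memory is hypothetical, and because the article does not say how to treat values between 60% and 70%, the sketch only reports that range as borderline.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl top nodes` output on stdin and labels
# each node using the thresholds stated in this article (below 60%, above 70%).
# Assumption: MEMORY% is the fifth column of the default output.
classify_node_memory() {
  awk 'NR > 1 {
    if ($5 !~ /^[0-9]+%$/) {      # e.g. <unknown> when metrics are missing
      print $1, $5, "unknown"
      next
    }
    pct = $5
    sub(/%$/, "", pct)            # strip the trailing percent sign
    pct += 0                      # force numeric comparison
    if (pct < 60)      verdict = "ok-to-upgrade"
    else if (pct > 70) verdict = "reduce-memory-first"
    else               verdict = "borderline"   # range not covered by the article
    print $1, pct "%", verdict
  }'
}

# Usage (requires a working Metrics API):
#   kubectl top nodes | classify_node_memory
```

The function only parses text, so it can also be run against saved command output when the cluster is not reachable.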
Step 2: Check and scale down Metrics Server if installed
Run the following command to check whether Metrics Server is running:
kubectl -n kube-system get deploy metrics-server
If Metrics Server is present, run the following command to scale it down:
kubectl -n kube-system scale deploy metrics-server --replicas=0
Run the following command to verify that Metrics Server is scaled down:
kubectl -n kube-system get deploy metrics-server
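The check-then-scale logic of this step can be combined so that the scale command runs only when the deployment exists. This is a minimal sketch built from the same kubectl commands; the function name scale_down_metrics_server and the skip message are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper around the Step 2 commands: scales Metrics Server to
# zero only when the deployment exists, otherwise reports that it was skipped.
scale_down_metrics_server() {
  if kubectl -n kube-system get deploy metrics-server >/dev/null 2>&1; then
    kubectl -n kube-system scale deploy metrics-server --replicas=0
  else
    echo "metrics-server not found in kube-system; skipping"
  fi
}

# Usage:
#   scale_down_metrics_server
```

When the function reports that Metrics Server was skipped, the wait in Step 3 does not apply.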
Step 3: Wait for resource stabilization
If Metrics Server was scaled down, wait approximately 5 minutes for cluster resources and metrics to stabilize. If Metrics Server was not installed, skip this wait.
Step 4: Check Fusion deployments
Run the following command to check Fusion deployments and replica counts:
kubectl get deploy -A | grep -i fusion
Step 5: Scale down Fusion deployments if memory usage is still high
Run the following command to find Fusion deployments with more than one replica and scale them down:
kubectl get deploy -A | awk 'NR>1 && tolower($2) ~ /fusion/ { split($3, r, "/"); if (r[2]+0 > 1) print $1, $2 }' | while read -r ns dep; do echo "Scaling $ns/$dep to 1 replica"; kubectl -n "$ns" scale deploy "$dep" --replicas=1; done
Run the following command to verify Fusion deployments after scaling:
kubectl get deploy -A | grep -i fusion
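Before scaling anything, it can help to preview which deployments would be affected. The sketch below is a read-only filter, assuming the default kubectl get deploy -A layout in which the READY column is shown as ready/desired. It compares the desired count (the number after the slash), and the function name list_fusion_multi_replica is hypothetical.

```shell
#!/bin/sh
# Hypothetical dry-run helper: reads `kubectl get deploy -A` output on stdin
# and prints namespace, name, and desired replica count for every Fusion
# deployment whose desired count is greater than 1. It changes nothing.
list_fusion_multi_replica() {
  awk 'NR > 1 && tolower($2) ~ /fusion/ {
    split($3, ready, "/")         # READY is displayed as ready/desired
    if (ready[2] + 0 > 1) print $1, $2, ready[2]
  }'
}

# Usage:
#   kubectl get deploy -A | list_fusion_multi_replica
```

An empty result means no Fusion deployment is configured with more than one replica, so there is nothing to scale down.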
Step 6: Recheck cluster memory utilization after Fusion scale down
Wait a few minutes after scaling down Fusion deployments for cluster resources to stabilize.
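Run the following command to check node memory usage again (kubectl top depends on the Metrics API, so if Metrics Server was scaled down in Step 2 the command may report that metrics are unavailable; in that case, rely on the allocated resources check):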
kubectl top nodes
Run the following command to check allocated node resources again:
kubectl describe nodes | grep -A 12 "Allocated resources"
Proceed with the upgrade only after memory utilization is reduced to an acceptable level and the cluster is stable.
Step 7: Retry the upgrade
After memory utilization is reduced and the cluster is stable, retry the Dell Automation Platform upgrade.
Step 8: Verify the upgrade
After the upgrade completes, run the following commands to verify Fusion pods and deployments:
kubectl get pods -A | grep -i fusion
kubectl get deploy -A | grep -i fusion
Additional Information
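The pod listing from the verification step can be narrowed to only the Fusion pods that need attention. This is a minimal sketch, assuming the default kubectl get pods -A layout in which STATUS is the fourth column. The function name list_unhealthy_fusion_pods is hypothetical, and treating Running and Completed as the healthy states is an assumption of the sketch.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints
# namespace, pod name, and status for Fusion pods that are neither Running
# nor Completed.
list_unhealthy_fusion_pods() {
  awk 'NR > 1 && tolower($2) ~ /fusion/ && $4 != "Running" && $4 != "Completed" {
    print $1, $2, $4
  }'
}

# Usage:
#   kubectl get pods -A | list_unhealthy_fusion_pods
```

No output means every Fusion pod in the listing is in one of the two states the sketch treats as healthy.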
- The scaling is temporary. During the upgrade process, deployment configurations are reapplied from Helm charts or registry manifests. These configurations restore the intended replica counts automatically, so manual restoration is not required after the upgrade.
- Because the Fusion replica count was reduced to 2 starting with Dell Automation Platform 1.2, this replica-count-related memory issue is not expected when upgrading from Dell Automation Platform 1.2 or later versions.
EE-Ticket DAPEE-235
Defect DAP07A-2316, DAP07A-2300