Container Storage Interface Drivers Family: When a Node Goes Down, Block Volumes that are Attached to the Node Cannot be Attached to Another Node

Summary: When a node goes down (node crash, unexpected shutdown, or power-off), block volumes that are attached to the node cannot be attached to another node.


Symptoms

When a node goes down (node crash, unexpected shutdown, or power-off), block volumes that are attached to that node cannot be attached to another node.

The issue is specific to block volumes; NFS volumes are not affected.

The issue affects the following drivers:

  • CSI Driver for PowerFlex
  • CSI Driver for PowerMax
  • CSI Driver for PowerScale
  • CSI Driver for Unity

This issue does not affect the CSI Driver for PowerStore.

The issue is reported in GitHub issue #282.

Steps to Reproduce: 

  1. Create PVC1 and create POD1 (example manifests are shown after this list).
  2. Check the node where POD1 was scheduled, and power off that node from vSphere.
  3. When the node becomes NotReady, try to delete POD1. It is stuck in the Terminating state because the node is not ready.
  4. Try to create POD2 using the same PVC1. POD2 stays in the ContainerCreating state with the following error in its describe output:
Warning FailedAttachVolume 43s attachdetach-controller Multi-Attach error for volume "csivol-18eb3daee0" Volume is already used by pod(s) iscsipod1-p 
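
A minimal sketch of the manifests used in step 1. The PVC and pod names, access mode, size, and StorageClass match the command outputs below; the container image, command, and mount path are assumptions for illustration:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: iscsipvc1-p
spec:
  accessModes:
    - ReadWriteOnce              # RWO, as in the PVC output below
  resources:
    requests:
      storage: 5Gi
  storageClassName: powerstore-iscsi
---
apiVersion: v1
kind: Pod
metadata:
  name: iscsipod1-p
spec:
  containers:
    - name: app                  # hypothetical container; any long-running image works
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data       # assumed mount path
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: iscsipvc1-p

Apply with kubectl apply -f <file>.yaml -n <namespace>, then repeat the pod definition with the name iscsipod2-p for step 4.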

Expected result: The pod should be deleted even when the node is not ready.

Actual result: The pod is stuck in the Terminating state because the node is not ready.

The following output shows the original pod Terminating and the new pod stuck in ContainerCreating:

kubectl get pods -o wide

NAME        READY STATUS            RESTARTS AGE   IP     NODE    NOMINATED NODE READINESS GATES
iscsipod1-p 1/1   Terminating       0        9m43s <IP>   <Node3> <none>         <none>
iscsipod2-p 0/1   ContainerCreating 0        55s   <none> <Node2> <none>         <none>


The following command shows that the node is Not Ready:

kubectl get nodes

NAME  STATUS   ROLES                AGE  VERSION
Node1 Ready    control-plane,master 163d v1.23.0
Node2 Ready    <none>               162d v1.23.0
Node3 NotReady <none>               162d v1.23.0


The following command shows that the PVC is still bound to the PV:

kubectl get pvc -n <namespace>

NAME          STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS       AGE
iscsipvc1-p   Bound    csivol-18eb3daee0   5Gi        RWO            powerstore-iscsi   10m


The following command shows the warning:

kubectl describe pod -n <namespace>

...
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type     Reason              Age   From                     Message
----     ------              ----  ----                     -------
Normal   Scheduled           108s  default-scheduler        Successfully assigned default/iscsipod2-p to lglw3178
Warning  FailedAttachVolume  108s  attachdetach-controller  Multi-Attach error for volume "csivol-18eb3daee0" Volume is already used by pod(s) iscsipod1-p

Cause

The root cause is that the attacher sidecar cannot send ControllerUnpublishVolume() for the node that went down, so the VolumeAttachment object for that node is never cleaned up. See GitHub issue #215 for details.
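
As an illustration (not from the original article), the stale VolumeAttachment can be inspected directly; the object name is a placeholder:

kubectl get volumeattachment
kubectl get volumeattachment <volumeattachment> -o yaml

In the YAML, spec.nodeName still references the failed node and status.attached remains true, because ControllerUnpublishVolume() never completed for that node.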

Resolution

Workaround:
  1. Force delete the pod that was running on the node that went down:
kubectl delete po <pod name> --force --grace-period=0
  2. Delete the volume attachment to the node that went down (see the lookup example after these steps):
kubectl delete volumeattachment <volumeattachment>
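
To find the VolumeAttachment name used in step 2, list the objects and filter on the failed node; the grep pattern is illustrative:

kubectl get volumeattachment | grep <failed node name>

The NAME column of the matching row (typically csi-<hash>) is the value to pass to kubectl delete volumeattachment.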

The volume can now be attached to the new node.
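
As a final check (standard kubectl commands, not from the original article), confirm that the replacement pod leaves ContainerCreating once a new VolumeAttachment is created for the surviving node:

kubectl get pods -o wide -n <namespace>
kubectl get volumeattachment

In the example above, iscsipod2-p should now reach the Running state.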

Resolution:
This article will be updated when a fix has been released.
Article Properties
Article Number: 000200778
Article Type: Solution
Last Modified: 07 Jul 2023
Version: 8