Environment Details:
Master nodes = 1
Worker nodes = 2
Host OS = SLES 15 SP2
Kubernetes version = v1.21.1
Dell Unity CSI version = v1.6
To reproduce the issue, follow these steps. The error observed is:
MountVolume.MountDevice failed for volume "csivol-4562e8f2a0" : kubernetes.io/csi: attacher.MountDevice failed to create dir "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/csivol-4562e8f2a0/globalmount": mkdir /var/lib/kubelet/plugins/kubernetes.io/csi/pv/csivol-4562e8f2a0/globalmount: file exists
Also, we noticed a number of failed multipath devices on the node even though the VolumeAttachment had been deleted.
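As an illustration, the leftover state can be inspected on the affected worker with commands like these (a general sketch; <stale-map-name> is a placeholder and the actual map names will vary):

mount | grep csivol-4562e8f2a0      # stale globalmount / bind mounts left behind for the volume
multipath -ll                       # multipath maps with failed paths show up here
multipath -f <stale-map-name>       # flush a stale map, only after confirming nothing uses it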
Observation: On startup, kubelet clears only one bind mount; it is unable to clear the other bind mount and the "globalmount" point. Once the VolumeAttachment becomes "true", the "globalmount" point goes into read-only (RO) mode.
Note: We delete the VolumeAttachment because we run a singleton application and want to move the application pods to the other worker node as quickly as possible when a worker node goes into the "NotReady" state. Moving to the other node succeeds, but coming back to the first node causes filesystem corruption due to dangling mount points.
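For context, the manual deletion is along these lines (a sketch; <attachment-name> is a placeholder for the VolumeAttachment object bound to the stale volume):

kubectl get volumeattachments | grep csivol-4562e8f2a0   # find the attachment whose PV column matches the volume
kubectl delete volumeattachment <attachment-name>        # remove it so the volume is treated as detached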
Hi,
Under normal operation, I did not reproduce the issue on Unity iSCSI.
That is to say, I executed the exact same steps as yours, from 1 to 10, without an error.
On step 7, I ran kubectl delete pod [podname] --grace-period=0 --force, as per: https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/
On step 8, I did nothing, as the Kubernetes scheduler takes care of it after 6 minutes: https://github.com/kubernetes/kubernetes/blob/5e31799701123c50025567b8534e1a62dbc0e9f6/pkg/controlle...
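For reference, a simple way to watch Kubernetes handle this on its own is to leave the objects alone and observe them (nothing here is driver-specific):

kubectl get volumeattachments -w    # the attachment should be detached after the timeout
kubectl get pods -o wide -w         # the pod should be recreated on the healthy worker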
Have you tried not deleting the VolumeAttachment manually and letting Kubernetes handle it?
Are you running the Unity driver on top of Fibre Channel, iSCSI, or NFS?
Rgds.
Hi,
We have tried both manual deletion and letting Kubernetes handle the deletion. However, we noticed that our application continues to do I/O even when kubelet is down, which I suppose is normal behavior since pods on the failed node continue to run. The application dies with an I/O error once the VolumeAttachment is deleted.
So I guess that to reproduce the exact scenario, an I/O workload might be needed inside the pod.
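For example, a loop like the following run inside the pod's container keeps writes going against the mounted volume (a sketch; /data is an assumed mount path for the CSI volume):

# Assumed: /data is where the CSI volume is mounted inside the container
while true; do
  dd if=/dev/zero of=/data/io-test.bin bs=1M count=100 oflag=direct
  sync
done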
Regards,
A question to clarify: you are not using our CSM Resiliency product (with the podmon container) in this test? You do not mention it.
No, we are not using CSM Resiliency product in this test.