RecoverPoint for VMs: Journal Volumes Locked to Specific ESXi Hosts Can Lead to Data Replication Unavailable and Reboot Regulation.

Summary: Journal Volumes Locked to Specific ESXi Hosts Can Lead to Data Replication Unavailable and Reboot Regulation.

This article applies to This article does not apply to This article is not tied to any specific product. Not all product versions are identified in this article.

Symptoms

RecoverPoint for Virtual Machines uses journal volumes to track and manage point in time images.
Within an RP4VMs environment, an issue can occur which locks a journal volume to a single ESXi host.
If the vRPA pair assigned to replicate the VMs is vMotioned to a different ESXi host, Data Replication Unavailable(DRU), can occur.

This can affect all the Consistency Groups (CGs) assigned to a specific RPA, which will lead the status to be DISABLE or ERROR.

 

Replication process crashes may be seen:

replication Sun May 19 20:12:01 UTC 2024

replication Sun May 19 20:37:55 UTC 2024

replication Sun May 19 20:38:21 UTC 2024

replication Sun May 19 21:03:10 UTC 2024

replication Sun May 19 21:32:18 UTC 2024

replication Sun May 19 22:03:39 UTC 2024

replication Sun May 19 22:32:26 UTC 2024

replication Sun May 19 23:03:56 UTC 2024

replication Sun May 19 23:28:46 UTC 2024

Resulting in the RecoverPoint Appliance(RPA) detaching from the cluster.

Site control vRPA Replication logs showing deadlock:

2024/05/20 01:27:54.039 - #1 - 21594/21577 - DLManager: deadlock suspected at N6Kashya26ReplicationControl_AO_IMPLE, pid=21659/21577, config=(alertThreshold=10,alertLevel=1,killThreshold=0,coreThreshold=0,full=1), deltaTime=501, cmd=N6Kashya33SerializableMethodRequestRefCountE

2024/05/20 01:27:54.039 - #1 - 21594/21577 - DLManager: deadlock suspected at N6Kashya26ReplicationControl_AO_IMPLE, pid=21658/21577, config=(alertThreshold=10,alertLevel=1,killThreshold=0,coreThreshold=0,full=1), deltaTime=288, cmd=N6Kashya33SerializableMethodRequestRefCountE

2024/05/20 01:27:54.039 - #1 - 21594/21577 - DLManager: deadlock suspected at N6Kashya26ReplicationControl_AO_IMPLE, pid=21657/21577, config=(alertThreshold=10,alertLevel=1,killThreshold=0,coreThreshold=0,full=1), deltaTime=493, cmd=N6Kashya33SerializableMethodRequestRefCountE

2024/05/20 01:27:54.039 - #1 - 21594/21577 - DLManager: deadlock suspected at N6Kashya26ReplicationControl_AO_IMPLE, pid=21656/21577, config=(alertThreshold=10,alertLevel=1,killThreshold=0,coreThreshold=0,full=1), deltaTime=503, cmd=N6Kashya33SerializableMethodRequestRefCountE

2024/05/20 01:27:54.039 - #1 - 21594/21577 - DLManager: deadlock suspected at N6Kashya26ReplicationControl_AO_IMPLE, pid=21655/21577, config=(alertThreshold=10,alertLevel=1,killThreshold=0,coreThreshold=0,full=1), deltaTime=775, cmd=N6Kashya33SerializableMethodRequestRefCountE

2024/05/20 01:27:54.039 - #1 - 21594/21577 - DLManager: deadlock suspected at N6Kashya26ReplicationControl_AO_IMPLE, pid=21654/21577, config=(alertThreshold=10,alertLevel=1,killThreshold=0,coreThreshold=0,full=1), deltaTime=501, cmd=N6Kashya33SerializableMethodRequestRefCountE

2024/05/20 01:27:54.039 - #1 - 21594/21577 - DLManager: deadlock suspected at N6Kashya26ReplicationControl_AO_IMPLE, pid=21653/21577, config=(alertThreshold=10,alertLevel=1,killThreshold=0,coreThreshold=0,full=1), deltaTime=799, cmd=N6Kashya33SerializableMethodRequestRefCountE

RPA control logs showing replication being killed:

RecoverPoint-0x5d0aa2fff4d45cf4-RPA6:/home/kos/control# grep killRep result.log.latest

2024/05/20 02:54:41.348 - #2 - 23464/23290 - RepHandler: killRep called. isKill=1 reason=Replication process didn't handle last command in a timely manner. cmd=Vector(busyType=BUSY_WITH_RCS(busyWithWhat= RC (RC_preOpenTask),busyWithWhichRCs= VectorMap([KBox6 => 1]))) isCoreDump=1

2024/05/20 02:54:41.936 - #2 - 23464/23290 - RepHandler: killRep called. isKill=0 reason=Rep disconnected. Verifying it is down isCoreDump=0

2024/05/20 02:54:48.974 - #2 - 23464/23290 - RepHandler: killRep called. isKill=1 reason=Replication process didn't handle last command in a timely manner. cmd=Vector(busyType=BUSY_WITH_RCS(busyWithWhat= RC (RC_preOpenTask),busyWithWhichRCs= VectorMap([KBox6 => 1]))) isCoreDump=1

2024/05/20 02:54:49.172 - #2 - 23464/23290 - RepHandler: killRep called. isKill=0 reason=Rep disconnected. Verifying it is down isCoreDump=0

2024/05/20 03:18:43.263 - #2 - 23464/23290 - RepHandler: killRep called. isKill=1 reason=Replication process didn't handle last command in a timely manner. cmd=Vector(busyType=BUSY_WITH_RCS(busyWithWhat= RC (RC_preOpenTask),busyWithWhichRCs= VectorMap([KBox6 => 1]))) isCoreDump=1

2024/05/20 03:18:43.265 - #2 - 23464/23290 - RepHandler: killRep called. isKill=1 reason=Replication process didn't handle last command in a timely manner. cmd=Vector(busyType=BUSY_WITH_RCS(busyWithWhat= RC (RC_preOpenTask),busyWithWhichRCs= VectorMap([KBox6 => 1]))) isCoreDump=1

2024/05/20 03:18:43.266 - #2 - 23464/23290 - RepHandler: killRep called. isKill=1 reason=Replication process didn't handle last command in a timely manner. cmd=Vector(busyType=BUSY_WITH_RCS(busyWithWhat= RC (RC_preOpenTask),busyWithWhichRCs= VectorMap([KBox6 => 1]))) isCoreDump=1

2024/05/20 03:18:43.268 - #2 - 23464/23290 - RepHandler: killRep called. isKill=1 reason=Replication process didn't handle last command in a timely manner. cmd=Vector(busyType=BUSY_WITH_RCS(busyWithWhat= RC (RC_preOpenTask),busyWithWhichRCs= VectorMap([KBox6 => 1]))) isCoreDump=1

2024/05/20 03:18:43.269 - #2 - 23464/23290 - RepHandler: killRep called. isKill=1 reason=Replication process didn't handle last command in a timely manner. cmd=Vector(busyType=BUSY_WITH_RCS(busyWithWhat= RC (RC_preOpenTask),busyWithWhichRCs= VectorMap([KBox6 => 1]))) isCoreDump=1

2024/05/20 03:18:43.807 - #2 - 23464/23290 - RepHandler: killRep called. isKill=0 reason=Rep disconnected. Verifying it is down isCoreDump=0

2024/05/20 03:59:26.445 - #2 - 23464/23290 - RepHandler: killRep called. isKill=1 reason=Replication process didn't handle last command in a timely manner. cmd=Vector(busyType=BUSY_WITH_RCS(busyWithWhat= RC (RC_preOpenTask),busyWithWhichRCs= VectorMap([KBox6 => 1]))) isCoreDump=1

 

ESXi Host xxxxxxx iofilterd-emcjiraf.log show VSAN object locks:

2024-05-23T17:15:33.517Z iofilterd-emcjiraf[2104572]: FiltLib: Opening virtual disk ("/vmfs/volumes/vsan:529b87d1fe9acb29-b0cbb19ee7de1e57/RPvStorage/5d0aa2fff4d45cf4_IOFilter_JVOL_00668.vmdk") failed: "Failed to lock the file" (16392).

2024-05-23T17:15:33.517Z iofilterd-emcjiraf[2104572]: RPVS_iof_LunInfo_OpenLun: Failed to open vmdk file of lun 668 with status Busy path /vmfs/volumes/vsan:529b87d1fe9acb29-b0cbb19ee7de1e57/RPvStorage/5d0aa2fff4d45cf4_IOFilter_JVOL_00668.vmdk

2024-05-23T17:15:33.517Z iofilterd-emcjiraf[2104572]: execute_open_cmd: Cluster id: 0x5d0aa2fff4d45cf4 lun 668 open failed. open_rc: Busy con 0xc6db9c9340

2024-05-23T17:15:33.518Z iofilterd-emcjiraf[2104572]: execute_close_cmd: cluster id: 0x5d0aa2fff4d45cf4 lun 668 is not opened (openState: 0). Ignore close request

2024-05-23T17:15:33.519Z iofilterd-emcjiraf[2104572]: execute_close_cmd: cluster id: 0x5d0aa2fff4d45cf4 lun 48 is not opened (openState: 0). Ignore close request

2024-05-23T17:15:33.519Z iofilterd-emcjiraf[2104572]: execute_close_cmd: cluster id: 0x5d0aa2fff4d45cf4 lun 507 is not opened (openState: 0). Ignore close request

2024-05-23T17:15:33.519Z iofilterd-emcjiraf[2104572]: execute_close_cmd: cluster id: 0x5d0aa2fff4d45cf4 lun 558 is not opened (openState: 0). Ignore close request

2024-05-23T17:15:33.520Z iofilterd-emcjiraf[2104572]: execute_close_cmd: cluster id: 0x5d0aa2fff4d45cf4 lun 359 is not opened (openState: 0). Ignore close request

2024-05-23T17:15:34.228Z iofilterd-emcjiraf[2104572]: OBJLIB-VSANOBJ: VsanObjLock: Object 'b05fe864-d249-c108-0391-6cfe544f26e4' locked, retrying.

2024-05-23T17:15:36.581Z iofilterd-emcjiraf[2104572]: OBJLIB-VSANOBJ: VsanObjLock: Object 'e6ebf164-9c47-0fd7-6c17-6cfe544f26e4' locked, retrying.

2024-05-23T17:15:37.526Z iofilterd-emcjiraf[2104572]: OBJLIB-VSANOBJ: VsanObjLock: Object 'b2e10165-ee40-1674-3ead-6cfe544f26e4' locked, retrying.

2024-05-23T17:15:37.530Z iofilterd-emcjiraf[2104572]: OBJLIB-VSANOBJ: VsanObjLock: Object '681efa64-16f0-8bb4-ab56-6cfe544f26e4' locked, retrying.

2024-05-23T17:15:38.537Z iofilterd-emcjiraf[2104572]: OBJLIB-VSANOBJ: VsanObjLock: Could not lock object 'b05fe864-d249-c108-0391-6cfe544f26e4': Failed to lock the file (4).

2024-05-23T17:15:38.537Z iofilterd-emcjiraf[2104572]: DISKLIB-VMFS  : "vsan://529b87d1fe9acb29-b0cbb19ee7de1e57/b05fe864-d249-c108-0391-6cfe544f26e4" : failed to open (Failed to lock the file): ObjLib_Open failed. Type 3

2024-05-23T17:15:38.537Z iofilterd-emcjiraf[2104572]: DISKLIB-LINK  : "/vmfs/volumes/vsan:529b87d1fe9acb29-b0cbb19ee7de1e57/f9f9b364-aadc-f567-41a4-6cfe544ef870/5d0aa2fff4d45cf4_IOFilter_JVOL_00116.vmdk" : failed to open (Failed to lock the file).

2024-05-23T17:15:38.537Z iofilterd-emcjiraf[2104572]: DISKLIB-CHAIN : "/vmfs/volumes/vsan:529b87d1fe9acb29-b0cbb19ee7de1e57/f9f9b364-aadc-f567-41a4-6cfe544ef870/5d0aa2fff4d45cf4_IOFilter_JVOL_00116.vmdk" : failed to open (Failed to lock the file).

2024-05-23T17:15:38.545Z iofilterd-emcjiraf[2104572]: DISKLIB-VMFS  : "vsan://529b87d1fe9acb29-b0cbb19ee7de1e57/b05fe864-d249-c108-0391-6cfe544f26e4" : open successful (524293) size = 25769803776, hd = 0. Type 3

2024-05-23T17:15:38.546Z iofilterd-emcjiraf[2104572]: PluginLdr_Load: Loaded plugin 'libvmiof-disk-emcjiraf.so' from '/usr/lib64/vmware/plugin/libvmiof-disk-emcjiraf.so'

2024-05-23T17:15:38.546Z iofilterd-emcjiraf[2104572]: DISKLIB-VMFS  : "vsan://529b87d1fe9acb29-b0cbb19ee7de1e57/b05fe864-d249-c108-0391-6cfe544f26e4" : closed.

2024-05-23T17:15:38.546Z iofilterd-emcjiraf[2104572]: DISKLIB-LIB   : Failed to open '/vmfs/volumes/vsan:529b87d1fe9acb29-b0cbb19ee7de1e57/RPvStorage/5d0aa2fff4d45cf4_IOFilter_JVOL_00116.vmdk' with flags 0xa Failed to lock the file (16392).

2024-05-23T17:15:38.546Z iofilterd-emcjiraf[2104572]: FiltLib: Opening virtual disk ("/vmfs/volumes/vsan:529b87d1fe9acb29-b0cbb19ee7de1e57/RPvStorage/5d0aa2fff4d45cf4_IOFilter_JVOL_00116.vmdk") failed: "Failed to lock the file" (16392).

2024-05-23T17:15:38.546Z iofilterd-emcjiraf[2104572]: RPVS_iof_LunInfo_OpenLun: Failed to open vmdk file of lun 116 with status Busy path /vmfs/volumes/vsan:529b87d1fe9acb29-b0cbb19ee7de1e57/RPvStorage/5d0aa2fff4d45cf4_IOFilter_JVOL_00116.vmdk

2024-05-23T17:15:38.546Z iofilterd-emcjiraf[2104572]: execute_open_cmd: Cluster id: 0x5d0aa2fff4d45cf4 lun 116 open failed. open_rc: Busy con 0xc6dc2deaa0

2024-05-23T17:15:38.547Z iofilterd-emcjiraf[2104572]: execute_close_cmd: cluster id: 0x5d0aa2fff4d45cf4 lun 457 is not opened (openState: 0). Ignore close request

2024-05-23T17:15:38.548Z iofilterd-emcjiraf[2104572]: execute_close_cmd: cluster id: 0x5d0aa2fff4d45cf4 lun 250 is not opened (openState: 0). Ignore close request

2024-05-23T17:15:38.548Z iofilterd-emcjiraf[2104572]: execute_close_cmd: cluster id: 0x5d0aa2fff4d45cf4 lun 623 is not opened (openState: 0). Ignore close request

2024-05-23T17:15:38.548Z iofilterd-emcjiraf[2104572]: execute_close_cmd: cluster id: 0x5d0aa2fff4d45cf4 lun 436 is not opened (openState: 0). Ignore close request

2024-05-23T17:15:38.549Z iofilterd-emcjiraf[2104572]: execute_close_cmd: cluster id: 0x5d0aa2fff4d45cf4 lun 62 is not opened (openState: 0). Ignore close request

2024-05-23T17:15:40.886Z iofilterd-emcjiraf[2104572]: OBJLIB-VSANOBJ: VsanObjLock: Object 'e6ebf164-9c47-0fd7-6c17-6cfe544f26e4' locked, retrying.

2024-05-23T17:15:41.837Z iofilterd-emcjiraf[2104572]: OBJLIB-VSANOBJ: VsanObjLock: Object 'b2e10165-ee40-1674-3ead-6cfe544f26e4' locked, retrying.

2024-05-23T17:15:41.842Z iofilterd-emcjiraf[2104572]: OBJLIB-VSANOBJ: VsanObjLock: Object '681efa64-16f0-8bb4-ab56-6cfe544f26e4' locked, retrying.

2024-05-23T17:15:42.561Z iofilterd-emcjiraf[2104572]: OBJLIB-VSANOBJ: VsanObjLock: Object 'bcdce564-9ed7-ff34-42df-6cfe544f26e4' locked, retrying.

In the host where the affected RPA is placed in the vmkernel.log, the bellow entries are printed:

2024-10-21T13:08:13.487Z cpu31:2101016)WARNING: NFSLock: 4084: File is being locked by a consumer on host HOSTNAME0503.corp.qbe.com with exclusive lock.
2024-10-21T13:08:13.487Z cpu31:2101016)NFSLock: 5032: failed to get lock on file 38233f05329208c8_IOFilter_JVOL_00244-flat.vmdk(.lck-d84c0c0000000000) 0x430b567b2f60 on DATASTORE xxx.xxx.xxx.xxx (xxx.xxx.xxx.xxx): Busy
2024-10-21T13:08:23.795Z cpu31:2101016)WARNING: NFSLock: 4084: File is being locked by a consumer on host HOSTNAME0503.corp.qbe.com with exclusive lock.
2024-10-21T13:08:23.795Z cpu31:2101016)NFSLock: 5032: failed to get lock on file 38233f05329208c8_IOFilter_JVOL_00244-flat.vmdk(.lck-d84c0c0000000000) 0x430b567b7330 on DATASTORE xxx.xxx.xxx.xxx (xxx.xxx.xxx.xxx): Busy
2024-10-21T13:08:33.854Z cpu7:2099418)WARNING: NFSLock: 4084: File is being locked by a consumer on host HOSTNAME0503.corp.qbe.com with exclusive lock.
2024-10-21T13:08:33.854Z cpu7:2099418)NFSLock: 5032: failed to get lock on file 38233f05329208c8_IOFilter_JVOL_00069-flat.vmdk(.lck-8d7b030000000000) 0x430b567db0b0 on DATASTORE xxx.xxx.xxx.xxx (xxx.xxx.xxx.xxx): Busy
2024-10-21T13:08:44.163Z cpu2:2099418)WARNING: NFSLock: 4084: File is being locked by a consumer on host HOSTNAME0503.corp.qbe.com with exclusive lock.

Cause

The cause of this situation is presently being investigated.
No permanent resolution is available currently; however, it is a situation specific to RP4VMs.

The situation can occur in various scenarios:

1. When vRPA responsible for protecting a VM is vMotioned to another ESXi host as the original ESXi host locks the journal to that specific ESXi host.

2. When a Consistency Group moves ownership from one vRPA pair to another, thus invoking the situation above.

Resolution

Resolution:
A solution exists for this issue, but intervention from Dell Technologies technical support personnel is required.  Proper RP4VMs, ESXi, and vCenter logs should be gathered to triage this situation and escalated as necessary. Contact the Dell Technologies Customer Support Center or your service representative for technical support and reference this Dell Technologies knowledgebase solution ID.

Affected Products

RecoverPoint for Virtual Machines

Products

RecoverPoint for Virtual Machines
Article Properties
Article Number: 000053722
Article Type: Solution
Last Modified: 31 Oct 2024
Version:  4
Find answers to your questions from other Dell users
Support Services
Check if your device is covered by Support Services.