RecoverPoint for Virtual Machines: Consistency group looping between Init and Error state in a scale environment

Summary: RecoverPoint for Virtual Machines: Consistency group looping between Init and Error state in a scale environment

This article is not tied to any specific product. Not all product versions are identified in this article.

Symptoms

Consistency groups loop between the Init and Error states in a scale environment, causing data replication to become unavailable (DRU).

Symptoms found in the logs:

ESX splitter logs:

The following logs indicate that reading /vmfs/volumes/vsan:5xxxxxxxxxx-dxxxxxxxxxxxxxxx failed, so all RPVS volumes in the vSAN datastore are removed:

spl_esx_discover_RPvStorage_clusters_in_datastore: failed to read directory /vmfs/volumes/vsan:5xxxxxxxxxxxxxxxxxx-dxxxxxxxxxxxxxxxxxxxx, returned with status Timeout        

update_rpvs_db: lun 1 wasn't scanned on last device view update        
RPVS_ClusterLuns_removeLunInfo: called for lun=1 (name RPVS_Lun00001.vmdk). Cluster id=2xxxxxxxxxxxx        
update_rpvs_db: lun 12 wasn't scanned on last device view update        
RPVS_ClusterLuns_removeLunInfo: called for lun=12 (name RPVS_Lun00012.vmdk). Cluster id=2xxxxxxxxxxx        
update_rpvs_db: lun 13 wasn't scanned on last device view update
...

* On a subsequent run, the RPVS discovery process succeeds, so all the RPVS volumes are added back:
 
parse_vmdk_file: called with file /vmfs/volumes/vsan:5xxxxxxxxxxxxxxxx/RPvStorage/4xxxxxxxxxxx/RPVS_Lun00001.vmdk        
parse_vmdk_file: capacity=12000000, thinLun=0, flat_filename=RPVS_Lun00001-flat.vmdk, rawguid=0x6xxxxxxxxxxxxxxxxxx       
RPVS_ClusterLuns_addLunInfo: added lun 1, cluster 4xxxxxxxxxxxxxxxx
parse_vmdk_file: called with file /vmfs/volumes/vsan:5xxxxxxxxxxxxxx-dxxxxxxxxxxxxxxxx/RPvStorage_23d5fb88838940xxx_010/RPVS_Lun00012.vmdk
parse_vmdk_file: capacity=524288000, thinLun=0, flat_filename=RPVS_Lun00012-flat.vmdk, rawguid=0x6xxxxxxxxxxxx
RPVS_ClusterLuns_addLunInfo: added lun 12, cluster 2xxxxxxxxxxxxxxx


* Logs indicating that the RPVS discovery process has been taking a long time (up to roughly 109 seconds in the last entry, with commands queuing behind it):

CommandExecuterBase_v_handleCommands_i: cmd 0x417fdde35040, cmd->execute(CommandRPVSDiscovery), running time 32585607 microseconds, num commands in queue: 11
CommandExecuterBase_v_handleCommands_i: cmd 0x417fdde35040, cmd->execute(CommandRPVSDiscovery), running time 33277695 microseconds, num commands in queue: 11
CommandExecuterBase_v_handleCommands_i: cmd 0x417fdde35040, cmd->execute(CommandRPVSDiscovery), running time 35834242 microseconds, num commands in queue: 11
CommandExecuterBase_v_handleCommands_i: cmd 0x417fdde35040, cmd->execute(CommandRPVSDiscovery), running time 36488014 microseconds, num commands in queue: 11
CommandExecuterBase_v_handleCommands_i: cmd 0x417fdde35040, cmd->execute(CommandRPVSDiscovery), running time 37767728 microseconds, num commands in queue: 11
CommandExecuterBase_v_handleCommands_i: cmd 0x417fdde35040, cmd->execute(CommandRPVSDiscovery), running time 49355575 microseconds, num commands in queue: 11
CommandExecuterBase_v_handleCommands_i: cmd 0x417fdde35040, cmd->execute(CommandRPVSDiscovery), running time 109257427 microseconds, num commands in queue: 19


All RP4VM versions are affected.

Cause


The ESX splitter scans for RPVS volumes (journal and repository) every t_rpvsDiscoveryPeriodicTimerInterval seconds (default: 30).
The scan is done by reading /vmfs/volumes/ and traversing each directory inside it, looking for RPVS_LunXXXXX.vmdk files.
An RPVS volume resides in /vmfs/volumes/<datastore>/<cluster-id>/. In a vSAN environment, it resides in /vmfs/volumes/vsan:<vsan-id>/<cluster-id>/.
If reading any directory inside /vmfs/volumes/ fails (timeout, transient error, and so on), all RPVS volumes in the failed directory are removed.
In any subsequent run, if the RPVS discovery process succeeds in reading the directory and finding the RPVS_LunXXXXX.vmdk files, the corresponding RPVS volumes are added back.
This removal and re-addition is why the consistency groups loop between Error and Init, as illustrated in the sketch below.
The issue is magnified when a large number of hosts in the vSAN cluster read directories under /vmfs/volumes/ at the same time.
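
To illustrate why the groups bounce between states, the following is a minimal Python sketch of the discovery cycle described above. All names, data structures, and paths are illustrative only and do not reflect the actual splitter implementation.

import os
import time

DISCOVERY_INTERVAL = 30      # t_rpvsDiscoveryPeriodicTimerInterval default, in seconds
known_luns = {}              # vmdk path -> cluster id, as tracked by the splitter

def scan_datastore(ds_path):
    # Read each <cluster-id> directory under the datastore and collect RPVS_Lun*.vmdk
    # files. os.listdir raises OSError if the directory read fails (for example, a timeout).
    found = set()
    for cluster_dir in os.listdir(ds_path):
        cluster_path = os.path.join(ds_path, cluster_dir)
        if not os.path.isdir(cluster_path):
            continue
        for name in os.listdir(cluster_path):
            if name.startswith("RPVS_Lun") and name.endswith(".vmdk"):
                found.add(os.path.join(cluster_path, name))
    return found

def discovery_cycle(datastores):
    for ds in datastores:
        try:
            found = scan_datastore(ds)
        except OSError:
            # One failed read (timeout, transient error) drops every RPVS volume under
            # this datastore, and the consistency groups using them go to Error.
            for path in [p for p in known_luns if p.startswith(ds)]:
                del known_luns[path]
            continue
        # A later successful scan adds the volumes back, and the groups re-initialize.
        for path in found:
            known_luns[path] = os.path.basename(os.path.dirname(path))

if __name__ == "__main__":
    while True:
        discovery_cycle(["/vmfs/volumes/vsan:<vsan-id>"])   # placeholder datastore path
        time.sleep(DISCOVERY_INTERVAL)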

Resolution

Workaround:

On each ESX host in the cluster, update the splitter tweak value t_rpvsDiscoveryPeriodicTimerInterval
to a random value between 180 and 600 seconds, and then restart the kdriver.
The ESX splitter tweak file can be found at
/etc/kdriver/tweak/tweak.params.splitter or /etc/config/emc/rp/kdriver/tweak/tweak.params.splitter.
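
As an illustration only, the following Python sketch picks a per-host random interval and rewrites the tweak file. It assumes a simple key=value file format, which is an assumption not confirmed by this article; verify the actual file syntax and the procedure with Dell support before applying any change.

# Hypothetical helper for the workaround above, to be run on each ESX host.
# The key=value syntax of the tweak file is an assumption; confirm the actual
# file format (and the change itself) with Dell support before applying it.
import random
import re

TWEAK_FILE = "/etc/kdriver/tweak/tweak.params.splitter"   # or /etc/config/emc/rp/kdriver/tweak/tweak.params.splitter
PARAM = "t_rpvsDiscoveryPeriodicTimerInterval"

def randomized_interval(low=180, high=600):
    # A different random interval per host keeps the hosts from scanning
    # /vmfs/volumes/ in lockstep.
    return random.randint(low, high)

def set_tweak(path=TWEAK_FILE, param=PARAM):
    value = randomized_interval()
    with open(path) as f:
        lines = f.readlines()
    pattern = re.compile(r"^\s*" + re.escape(param) + r"\s*=")
    updated, found = [], False
    for line in lines:
        if pattern.match(line):
            updated.append("%s=%d\n" % (param, value))    # assumed key=value syntax
            found = True
        else:
            updated.append(line)
    if not found:
        updated.append("%s=%d\n" % (param, value))
    with open(path, "w") as f:
        f.writelines(updated)
    print("%s set to %d; restart the kdriver for the change to take effect." % (param, value))

if __name__ == "__main__":
    set_tweak()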

Resolution:

Dell EMC engineering is currently investigating this issue. A permanent fix is still in progress. Contact the Dell EMC Customer Support Center or your service representative for assistance and reference this solution ID.

Affected Products

RecoverPoint for Virtual Machines

Products

RecoverPoint for Virtual Machines
Article Properties
Article Number: 000167797
Article Type: Solution
Last Modified: 30 May 2025
Version:  4