RecoverPoint for Virtual Machines: Consistency group looping between Init and Error state in a scale environment

Summary: RecoverPoint for Virtual Machines: Consistency group looping between Init and Error state in a scale environment

This article is not tied to any specific product. Not all product versions are identified in this article.

Symptoms

Consistency groups loop between the Init and Error states in a scale environment, causing data replication to become unavailable (DRU).

Symptoms found in the logs:

ESX splitter logs:

The following logs indicate that reading /vmfs/volumes/vsan:5xxxxxxxxxx-dxxxxxxxxxxxxxxx failed, so all RPVS volumes in the vSAN datastore are removed:

spl_esx_discover_RPvStorage_clusters_in_datastore: failed to read directory /vmfs/volumes/vsan:5xxxxxxxxxxxxxxxxxx-dxxxxxxxxxxxxxxxxxxxx, returned with status Timeout        

update_rpvs_db: lun 1 wasn't scanned on last device view update        
RPVS_ClusterLuns_removeLunInfo: called for lun=1 (name RPVS_Lun00001.vmdk). Cluster id=2xxxxxxxxxxxx        
update_rpvs_db: lun 12 wasn't scanned on last device view update        
RPVS_ClusterLuns_removeLunInfo: called for lun=12 (name RPVS_Lun00012.vmdk). Cluster id=2xxxxxxxxxxx        
update_rpvs_db: lun 13 wasn't scanned on last device view update
...

* On a subsequent run, the RPVS discovery process succeeds, so all the RPVS volumes are added back:
 
parse_vmdk_file: called with file /vmfs/volumes/vsan:5xxxxxxxxxxxxxxxx/RPvStorage/4xxxxxxxxxxx/RPVS_Lun00001.vmdk        
parse_vmdk_file: capacity=12000000, thinLun=0, flat_filename=RPVS_Lun00001-flat.vmdk, rawguid=0x6xxxxxxxxxxxxxxxxxx       
RPVS_ClusterLuns_addLunInfo: added lun 1, cluster 4xxxxxxxxxxxxxxxx
parse_vmdk_file: called with file /vmfs/volumes/vsan:5xxxxxxxxxxxxxx-dxxxxxxxxxxxxxxxx/RPvStorage_23d5fb88838940xxx_010/RPVS_Lun00012.vmdk
parse_vmdk_file: capacity=524288000, thinLun=0, flat_filename=RPVS_Lun00012-flat.vmdk, rawguid=0x6xxxxxxxxxxxx
RPVS_ClusterLuns_addLunInfo: added lun 12, cluster 2xxxxxxxxxxxxxxx


* Logs indicating that the RPVS discovery process has been taking a long time (up to roughly 109 seconds in the last entry, with commands queuing behind it):

CommandExecuterBase_v_handleCommands_i: cmd 0x417fdde35040, cmd->execute(CommandRPVSDiscovery), running time 32585607 microseconds, num commands in queue: 11
CommandExecuterBase_v_handleCommands_i: cmd 0x417fdde35040, cmd->execute(CommandRPVSDiscovery), running time 33277695 microseconds, num commands in queue: 11
CommandExecuterBase_v_handleCommands_i: cmd 0x417fdde35040, cmd->execute(CommandRPVSDiscovery), running time 35834242 microseconds, num commands in queue: 11
CommandExecuterBase_v_handleCommands_i: cmd 0x417fdde35040, cmd->execute(CommandRPVSDiscovery), running time 36488014 microseconds, num commands in queue: 11
CommandExecuterBase_v_handleCommands_i: cmd 0x417fdde35040, cmd->execute(CommandRPVSDiscovery), running time 37767728 microseconds, num commands in queue: 11
CommandExecuterBase_v_handleCommands_i: cmd 0x417fdde35040, cmd->execute(CommandRPVSDiscovery), running time 49355575 microseconds, num commands in queue: 11
CommandExecuterBase_v_handleCommands_i: cmd 0x417fdde35040, cmd->execute(CommandRPVSDiscovery), running time 109257427 microseconds, num commands in queue: 19


All RP4VM versions are affected.

Cause


The ESX splitter scans for RPVS volumes (journal and repository) every t_rpvsDiscoveryPeriodicTimerInterval seconds (default: 30).
The scan is done by reading /vmfs/volumes/ and traversing each directory inside it, looking for RPVS_LunXXXXX.vmdk files.
An RPVS volume resides in /vmfs/volumes/<datastore>/<cluster-id>/. In a vSAN environment, it resides in /vmfs/volumes/vsan:<vsan-id>/<cluster-id>/.
If reading any directory inside /vmfs/volumes/ fails (timeout, transient error, and so on), all RPVS volumes in the failed directory are removed.
In any subsequent run, if the RPVS discovery process succeeds in reading the directory and finding the RPVS_LunXXXXX.vmdk files, the corresponding RPVS volumes are added back.
This removal and re-addition is why the consistency groups loop between Error and Init, as illustrated in the sketch below.
The issue is magnified when a large number of hosts in the vSAN cluster read directories under /vmfs/volumes/ at the same time.
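
To illustrate why the groups bounce between states, the following is a minimal Python sketch of the discovery cycle described above. All names, data structures, and paths are illustrative only and do not reflect the actual splitter implementation.

import os
import time

DISCOVERY_INTERVAL = 30      # t_rpvsDiscoveryPeriodicTimerInterval default, in seconds
known_luns = {}              # vmdk path -> cluster id, as tracked by the splitter

def scan_datastore(ds_path):
    # Read each <cluster-id> directory under the datastore and collect RPVS_Lun*.vmdk
    # files. os.listdir raises OSError if the directory read fails (for example, a timeout).
    found = set()
    for cluster_dir in os.listdir(ds_path):
        cluster_path = os.path.join(ds_path, cluster_dir)
        if not os.path.isdir(cluster_path):
            continue
        for name in os.listdir(cluster_path):
            if name.startswith("RPVS_Lun") and name.endswith(".vmdk"):
                found.add(os.path.join(cluster_path, name))
    return found

def discovery_cycle(datastores):
    for ds in datastores:
        try:
            found = scan_datastore(ds)
        except OSError:
            # One failed read (timeout, transient error) drops every RPVS volume under
            # this datastore, and the consistency groups using them go to Error.
            for path in [p for p in known_luns if p.startswith(ds)]:
                del known_luns[path]
            continue
        # A later successful scan adds the volumes back, and the groups re-initialize.
        for path in found:
            known_luns[path] = os.path.basename(os.path.dirname(path))

if __name__ == "__main__":
    while True:
        discovery_cycle(["/vmfs/volumes/vsan:<vsan-id>"])   # placeholder datastore path
        time.sleep(DISCOVERY_INTERVAL)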

Resolution

Workaround:

On each ESX host in the cluster, update the splitter tweak value t_rpvsDiscoveryPeriodicTimerInterval
to a random value between 180 and 600 seconds, and then restart the kdriver.
The ESX splitter tweak file can be found at
/etc/kdriver/tweak/tweak.params.splitter or /etc/config/emc/rp/kdriver/tweak/tweak.params.splitter.
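
As an illustration only, the following Python sketch picks a per-host random interval and rewrites the tweak file. It assumes a simple key=value file format, which is an assumption not confirmed by this article; verify the actual file syntax and the procedure with Dell support before applying any change.

# Hypothetical helper for the workaround above, to be run on each ESX host.
# The key=value syntax of the tweak file is an assumption; confirm the actual
# file format (and the change itself) with Dell support before applying it.
import random
import re

TWEAK_FILE = "/etc/kdriver/tweak/tweak.params.splitter"   # or /etc/config/emc/rp/kdriver/tweak/tweak.params.splitter
PARAM = "t_rpvsDiscoveryPeriodicTimerInterval"

def randomized_interval(low=180, high=600):
    # A different random interval per host keeps the hosts from scanning
    # /vmfs/volumes/ in lockstep.
    return random.randint(low, high)

def set_tweak(path=TWEAK_FILE, param=PARAM):
    value = randomized_interval()
    with open(path) as f:
        lines = f.readlines()
    pattern = re.compile(r"^\s*" + re.escape(param) + r"\s*=")
    updated, found = [], False
    for line in lines:
        if pattern.match(line):
            updated.append("%s=%d\n" % (param, value))    # assumed key=value syntax
            found = True
        else:
            updated.append(line)
    if not found:
        updated.append("%s=%d\n" % (param, value))
    with open(path, "w") as f:
        f.writelines(updated)
    print("%s set to %d; restart the kdriver for the change to take effect." % (param, value))

if __name__ == "__main__":
    set_tweak()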

Resolution:

Dell EMC engineering is currently investigating this issue. A permanent fix is still in progress. Contact the Dell EMC Customer Support Center or your service representative for assistance and reference this solution ID.

Affected Products

RecoverPoint for Virtual Machines

Products

RecoverPoint for Virtual Machines
Article Properties
Article Number: 000167797
Article Type: Solution
Last Modified: 30 May 2025
Version:  4