RecoverPoint for Virtual Machines: Consistency Groups loop through Paused by System/Error/Init/Active states with many intermittent splitter/volume connection errors
Summary: RP4VM with IP splitter in a large-scale environment - Consistency Groups loop through Paused by System/Error/Init/Active states with many intermittent splitter/volume connection errors, even though no network issues are identified ...
Symptoms
RP4VM with IP splitter in a large-scale environment - Consistency Groups loop through Paused by System/Error/Init/Active states with many intermittent splitter/volume connection errors, even though no network issues are identified.
The following can be seen in splitter logs:
2018-04-18T12:49:07.856Z cpu20:6831323)esx_splitter: KL_INFO:865: #2 - PathRpaEsx_v_check_status_impl: Rpa Port 43151e692cf0 (state OK) lease expired (120 < 262615747 - 262615622)
2018-04-18T12:49:07.856Z cpu20:6831323)esx_splitter: KL_INFO:865: #2 - PathRpaEsx_disconnect: starting to flush splits. path 10.xx.xx.xx:5042
Or similar to:
krnl:[21:05:19.493] 0/0 #1 - PathRpa_isKeepAliveOk: 0x430944449088 (10.xx.xxx.xx:5042) expired. 124 seconds since last keep-alive (120 allowed).
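In the first excerpt the splitter compares the 120-second lease against the time since the last keep-alive (262615747 - 262615622 = 125 seconds), so the lease has expired and the path is disconnected. To look for these expirations on an affected ESXi host, a grep similar to the one below can be used (a sketch only; it assumes the splitter messages are written to vmkernel.log, so adjust the path to wherever the splitter log was collected):

# Sketch: search the ESXi log for splitter lease/keep-alive expirations (path is an assumption)
grep -E 'lease expired|isKeepAliveOk.*expired' /var/log/vmkernel.log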
Under the NGS logs (extracted to */home/kos/ngs/result.log.gz), run the grep below:
zgrep Received result.log.gz | grep VSCSI | grep <ESX_IP>
For example:
zgrep Received result.log.gz | grep VSCSI | grep '10.yy.yy.yyy'
The following message should appear approximately every 20 seconds for every ESXi splitter:
2018/05/16 11:17:40.388 - #2 - 16249/16238 - xTE: 2199311247: processSystemMessage: Received connect for channel: 31681873 module: MPIControlTransporter machine: VSCSI_SPLITTER&10.yy.yy.yyy UIDName: 4b0f9fb95da8xxxx version: 0 isExternal: srcInternalSiteName: localMPIVersion: 1 remoteMPIVersion: 1
2018/05/16 11:27:40.390 - #2 - 16249/16238 - xTE: 2284932560: processSystemMessage: Received connect for channel: 31684868 module: MPIControlTransporter machine: VSCSI_SPLITTER&10.yy.yy.yyy UIDName: 2836bd09480exxxx version: 0 isExternal: srcInternalSiteName: localMPIVersion: 1 remoteMPIVersion: 1
In this case, the 10-minute gap between the two messages (anything more than a minute) is a good indication that this issue is occurring.
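To measure the gaps without reading the log manually, a small awk sketch such as the one below can print the interval between consecutive "Received connect" messages for one splitter (a sketch only: the date/time field positions are assumed from the sample lines above, and mktime requires GNU awk):

# Sketch: print the gap in seconds between consecutive "Received connect" messages for one ESXi splitter
zgrep Received result.log.gz | grep VSCSI | grep '10.yy.yy.yyy' | \
awk '{
  # Fields 1 and 2 are assumed to be the date (YYYY/MM/DD) and time (HH:MM:SS.mmm)
  split($1, d, "/"); split($2, t, /[:.]/)
  ts = mktime(d[1] " " d[2] " " d[3] " " t[1] " " t[2] " " t[3])
  if (prev) printf "%s %s  gap=%d seconds\n", $1, $2, ts - prev
  prev = ts
}'

Gaps well above ~20 seconds (such as the 10 minutes shown above) point to this issue.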
Other symptoms in the replication logs (extracted to */files/home/kos/replication/result.log), which also indicate network/communication issues:
2018/04/30 09:18:13.630 - #1 - 4997/4967 - SplitCommIO::~SplitCommIO: IPDataCommIOInterface: not all responses where processed...
2018/04/18 10:16:50.383 - #0 - 7383/7323 - SocketIOProvider::openDevice: errno=35 socket is not connected for device ( m_lr=(0x0,0x108db1cd4dxxxxxx,e_IPDataComm) m_handle=31 m_openCount=34 m_status=e_NoConnection m_splitterId = 108db1cd4dxxxxxx m_localIPAddress = 10.xx.xx.xxx) : a_openParams.m_openMode = 2
2018/04/18 10:56:11.509 - #0 - 7373/7323 - SplitCommSocketIOInterface_AO_IMPL::processOneResponse_sys: errno=62 IPDataCommIOInterface: processOneResponse readStream failed
2018/04/18 10:56:11.513 - #0 - 7375/7323 - SplitCommSocketIOInterface_AO_IMPL::readStream: errno=62 IPDataCommIOInterface: readSocket failed ioStatus = general error a_bResetOnError = 0
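To get a rough sense of how often these communication errors occur, they can be counted per hour, for example (a sketch; the message strings are taken from the excerpts above, and the log path may differ per log collection):

# Sketch: count replication-log communication errors per hour (date + hour are the first 13 characters)
grep -E 'not all responses where processed|socket is not connected|readStream failed|readSocket failed' \
  /files/home/kos/replication/result.log | cut -c1-13 | sort | uniq -c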
Cause
Many synchronous calls (AmIAttach) from the NGS process to the Control process create a bottleneck between the RPA and the splitter, which causes the splitter to close its connection to the RPA after 120 seconds of inactivity.
Resolution
Workaround:
No direct workaround exists.
Resolution:
This issue is resolved in RecoverPoint for VMs 5.1.1.4 and later.
Upgrading to RecoverPoint for VMs 5.1.1.5 is recommended to resolve this issue.