RecoverPoint: Replication Process Crashes When Phase1 Cache Memory Becomes Insufficient
Summary: Replication crashes with a "phase1 cache memory not sufficient" assertion, causing reboot regulation.
Symptoms
The Consistency Group remains in the initialization state; normal distribution never starts and the CG does not transition into an active state. The replication process crashes with an assertion logged when Phase1 cache memory is not sufficient and the target-side RecoverPoint Appliances are unable to write to the target journal.

Symptoms found in the /home/kos/replication logs:

Assertion:
XXXX/XX/XX 18:59:25.693 - #2 - 17936/16776 - AssertLogSender: sending log: topic=DistributorGroupHandler, msg=Assertion failed: bIsPhase1CacheMemorySufficient Line 1825 File DistributorGroupHandlerPhase1.cc PID: 16776 Info: regular phase1 cache memory not sufficient m_GroupGridCopyRID = (groupCopyRID=(kVolSlot=XXXXXXXXXX,globalCopyID=GlobalCopy(SiteUID(0xXXXXXXXXXXXXXX) 0) ),gridCopyID=0)
XXXX/XX/XX 18:59:25.694 - #2 - 16911/16776 - RemoteLogSender: got event (uniqueId=0, eventTime=1584471565693987), EventID_KBOX_ASSERTION_FAILED(3031), SiteUID(0xxxxxxxxxxxxxxxxx), seDetails=Sender=replication, Topic=DistributorGroupHandler, msg=Assertion failed: bIsPhase1CacheMemorySufficient Line 1825 File DistributorGroupHandlerPhase1.cc PID: 16776 Info: regular phase1 cache memory not sufficient m_GroupGridCopyRID = (groupCopyRID=(kVolSlot=XXXXXXXXXX,globalCopyID=GlobalCopy(SiteUID(0xXXXXXXXXXXX) 0) ),gridCopyID=0)

Statistics showing high data flow:
XXXX/XX/XX 18:52:41.520 - #2 - 7676/7665 - AccumulatorFormatManager::printStatistics: Group Statistics for group Option( kVolSlot = XXXXXXXXXX groupUID = GroupCopy(1346840554 SiteUID(0xXXXXXXXXXXX) 0) gridID = 0): {
STATISTICS: name=InitNCOnePhaseSpeed kVolSlot = 1346840554 groupUID = GroupCopy(1346840554 SiteUID(0xXXXXXXXXXXXXX) 0) gridID = 0 description: init nc one phase speed .
STATISTICS: name=InitNCOnePhaseSpeed kVolSlot = 1346840554 groupUID = GroupCopy(1346840554 SiteUID(0xXXXXXXXXXXXXX) 0) gridID = 0 8 sec window: Average: 1.14e+03 MB/sec
STATISTICS: name=InitNCOnePhaseSpeed kVolSlot = 1346840554 groupUID = GroupCopy(1346840554 SiteUID(0xXXXXXXXXXXXXX) 0) gridID = 0 77 sec window: Average: 1.06e+03 MB/sec

The Consistency Group is in the Initialization state:
2020/03/17 18:56:05.070 - #2 - 7954/7665 - InitNCState::DistributeOnePhase: distributing one phase m_groupID = (groupCopyRID=( kVolSlot=XXXXXXXXXX,globalCopyID=GlobalCopy(SiteUID(0xXXXXXXXXXXXX) 0) ),gridCopyID=0)

The Phase1 consumer for this Consistency Group shows high consumption at the time of the assertion:
XXXX/XX/XX 18:56:05.241 - #2 - 7954/7665 - MemoryManager: viscus on assert
+ countdowns = 2413/390
+ min memory demand = 433429 (fixed 329537 flexible 103892)
+ flexible usage space = 37977/3864963
+ pool space usage = 37985/4194500 (max 143544)
>> 1160635626647715840 :phase1#22
>> (groupTaskID=(sessionID=1817723153,replicationLinkID=(kVolSlot=XXXXXXXXX,srcCopyID=GlobalCopy(SiteUID(0xXXXXXXXXXXXX)
>> 0) ,destCopyID=GlobalCopy(SiteUID

A replication StackTrace is also logged:
2020/03/17 18:56:05.278 - #0 - 7954/7665 - StackTrace: errno=0 3: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZNK6Kashya23DistributorGroupHandler21waitForMemoryIfNeededEv+0x5b2) [0xxxxxxxxxxxxx]
2020/03/17 18:56:05.278 - #0 - 7954/7665 - StackTrace: errno=0 4: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya23DistributorGroupHandler25addSequencesToPhase1CacheENS_9SequencesERNS_15ReplicationModeE+0x939)
2020/03/17 18:56:05.278 - #0 - 7954/7665 - StackTrace: errno=0 5: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya23DistributorGroupHandler23handleSplittedSequencesENS_9SequencesERKNS_15ReplicationModeERKb+0x20a)
2020/03/17 18:56:05.278 - #0 - 7954/7665 - StackTrace: errno=0 6: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya23DistributorGroupHandler15handleSequencesENS_9SequencesERKNS_15ReplicationModeERKb+0x577)
2020/03/17 18:56:05.278 - #0 - 7954/7665 - StackTrace: errno=0 7: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya19Distributor_AO_IMPL23continueHandleSequencesENS_9SequencesENS_15ReplicationModeEbRKNS_10GridCopyIDE+0xf7)
2020/03/17 18:56:05.278 - #0 - 7954/7665 - StackTrace: errno=0 8: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya16SequencesRequest21continueHandleRequestERNS_28JournalRegulationRequestBase14RequestHandlerE+0x30b)
2020/03/17 18:56:05.278 - #0 - 7954/7665 - StackTrace: errno=0 9: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya31JournalRegulationThread_AO_IMPL9process_iERKNS_16GroupGridCopyRIDE+0x36f)
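When triaging a case like this, the following minimal Python sketch (not a Dell EMC utility) shows one way to scan the collected replication log files for the Phase1 cache assertion and for high InitNCOnePhaseSpeed averages. The 1000 MB/sec threshold is an illustrative assumption, not a Dell EMC value; adjust it and the input files to the actual environment.

# Minimal illustrative sketch (not a Dell EMC utility): scan replication log
# files for the Phase1 cache assertion and for high InitNCOnePhaseSpeed averages.
import re
import sys

ASSERT_MARKER = "bIsPhase1CacheMemorySufficient"
SPEED_RE = re.compile(r"InitNCOnePhaseSpeed.*Average:\s*([0-9.eE+-]+)\s*MB/sec")
HIGH_FLOW_MB_PER_SEC = 1000.0   # illustrative threshold, not a Dell EMC value

for path in sys.argv[1:]:       # pass the collected replication log files as arguments
    with open(path, errors="replace") as log:
        for lineno, line in enumerate(log, 1):
            if ASSERT_MARKER in line:
                print(f"{path}:{lineno}: Phase1 cache assertion: {line.strip()}")
            match = SPEED_RE.search(line)
            if match and float(match.group(1)) >= HIGH_FLOW_MB_PER_SEC:
                print(f"{path}:{lineno}: high data flow: {line.strip()}")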
Cause
The replication process asserts (bIsPhase1CacheMemorySufficient) when Phase1 cache memory remains insufficient while data is being distributed to the target journal. Sustained high data flow during Consistency Group initialization can exhaust the Phase1 cache faster than it is freed, and the distributor thread does not wait long enough for memory before asserting.
Resolution
Workaround:
Change the value of the tweak t_phase1CacheMemoryThreadSleepTime to 5000. This increases the waiting time from 10 microseconds to 5 milliseconds, so the process will not assert until the thread has waited for memory for 5 milliseconds.

If the issue still occurs:
1. Collect the production site logs as well, since they show the amount of data being sent from the production side at the time of the issue.
2. Change the value of the tweak t_maxNoOfTriesToWaitForPhase1CacheMemory to 10.

Note: These tweaks are only relevant to versions 5.1.3 and above. If the code version is not 5.1.3 or higher, RecoverPoint MUST be upgraded to the latest code to utilize these tweaks. An illustrative sketch of how the two tweaks interact appears after the Resolution statement below.

Resolution:
Dell EMC engineering is currently investigating this issue. A permanent fix is still in progress. Contact the Dell EMC Customer Support Center or your service representative for assistance and reference this solution ID.
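To make the effect of the workaround concrete, below is a purely illustrative Python sketch of the kind of wait-then-assert loop implied by DistributorGroupHandler::waitForMemoryIfNeeded in the stack trace above. It is not RecoverPoint source code: the function wait_for_phase1_cache_memory and the is_memory_sufficient callback are hypothetical, and only the 10-microsecond default sleep and the workaround values (5000 and 10) come from this article.

# Illustrative sketch only (not RecoverPoint source code): how the two tweaks
# bound the time a distributor thread waits for Phase1 cache memory before the
# bIsPhase1CacheMemorySufficient assertion would fail.
import time

T_PHASE1_CACHE_MEMORY_THREAD_SLEEP_TIME_US = 5000       # workaround value (default per this article: 10 us)
T_MAX_NO_OF_TRIES_TO_WAIT_FOR_PHASE1_CACHE_MEMORY = 10  # workaround value (default not stated in this article)

def wait_for_phase1_cache_memory(is_memory_sufficient):
    # Retry until memory is sufficient or the configured tries are exhausted.
    # Returns False where the real process would fail the assertion and crash.
    for _ in range(T_MAX_NO_OF_TRIES_TO_WAIT_FOR_PHASE1_CACHE_MEMORY):
        if is_memory_sufficient():
            return True
        time.sleep(T_PHASE1_CACHE_MEMORY_THREAD_SLEEP_TIME_US / 1_000_000)
    return False

# In this sketch, the workaround values tolerate up to 10 tries * 5 ms = 50 ms of
# sustained memory pressure before asserting, instead of asserting almost
# immediately with the 10-microsecond default sleep.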