RecoverPoint: Replication process crashes when phase 1 cache memory is insufficient
Summary: Replication crashes with an assertion that phase 1 cache memory is insufficient, which leads to reboot regulation.
Symptoms
The Consistency Group (CG) remains in the Initialization state, but distribution never appears to start and the CG does not transition to Active. The replication process crashes and logs an assertion when phase 1 cache memory is insufficient and the target-side RecoverPoint Appliance cannot write to the target journal.

Symptoms found in the /home/kos/replication logs:

Assertion:
XXXX/XX/XX 18:59:25.693-#2-17936/16776-AssertLogSender: Sending log: topic=DistributorGroupHandler, msg=Assertion failed: bIsPhase1CacheMemorySufficient Line 1825 File DistributorGroupHandlerPhase1.cc PID:16776 Info: regular phase1 cache memory not enough m_GroupGridCopyRID = (groupCopyRID=(kVolSlot=XXXXXXXXXX,globalCopyID=GlobalCopy(SiteUID(0xXXXXXXXXXXXXXX) 0) ),gridCopyID=0)
XXXX/XX/XX 18:59:25.694-#2-16911/16776-RemoteLogSender: got event (uniqueId = 0, eventTime = 1584471565693987), EventID_KBOX_ASSERTION_FAILED(3031), SiteUID (0xxxxxxxxxxxxxxxxx), seDetails = Sender = replication, Topic=DistributorGroupHandler, msg=Assertion failed: bIsPhase1CacheMemorySufficient Line 1825 File DistributorGroupHandlerPhase1.cc PID:16776 Info: regular phase1 cache memory not enough m_GroupGridCopyRID = (groupCopyRID=(kVolSlot=XXXXXXXXXX,globalCopyID=GlobalCopy(SiteUID(0xXXXXXXXXXXX) 0) ),gridCopyID=0)

Statistics showing high data flow:
XXXX/XX/XX 18:52:41.520-#2-7676/7665-AccumulatorFormatManager::printStatistics: Group statistics Option( kVolSlot = XXXXXXXXXX groupUID = GroupCopy(1346840554 SiteUID(0xXXXXXXXXXXX) 0) gridID = 0):{
STATISTICS: name=InitNCOnePhaseSpeed kVolSlot = 1346840554 groupUID = GroupCopy(1346840554 SiteUID(0xXXXXXXXXXXXXX) 0) gridID = 0 description: init nc one phase speed .
STATISTICS: name=InitNCOnePhaseSpeed kVolSlot = 1346840554 groupUID = GroupCopy(1346840554 SiteUID(0xXXXXXXXXXXXXX) 0) gridID = 0 8 sec window: avg: 1.14e+03 MB/sec
STATISTICS: name=InitNCOnePhaseSpeed kVolSlot = 1346840554 groupUID = GroupCopy(1346840554 SiteUID(0xXXXXXXXXXXXXX) 0) gridID = 0 77 sec window: avg: 1.06e+03 MB/sec

The consistency group is in the Initialization state:
2020/03/17 18:56:05.070-#2-7954/7665-InitNCState::DistributeOnePhase: distribute one phase m_groupID = ( groupCopyRID=( kVolSlot=XXXXXXXXXX,globalCopyID=GlobalCopy(SiteUID (0xXXXXXXXXXXXX) 0) ),gridCopyID=0)

The Phase1 consumer for this consistency group shows high consumption at the time of the assertion:
XXXX/XX/XX 18:56:05.241-#2-7954/7665-MemoryManager: viscus on assert
+ countdown = 2413/390
+ min memory requirements = 433429 (fixed 329537 flexible 103892)
+ flexible used space = 37977/3864963
+ pool space usage = 37985/4194500 (max 143544)
>> 1160635626647715840 :phase1#22
>> (groupTaskID=(sessionID=1817723153,replicationLinkID=(kVolSlot=XXXXXXXXX,srcCopyID=GlobalCopy(SiteUID(0xXXXXXXXXXXXX)
>> 0) ,destCopyID=GlobalCopy (SiteUID

A replication StackTrace is also logged:
2020/03/17 18:56:05.278-#0-7954/7665-StackTrace: provide = 0 3: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZNK6Kashya23DistributorGroupHandler21waitForMemoryIfNeededEv+0x5b2) [0xxxxxxxxxxxxx]
2020/03/17 18:56:05.278-#0-7954/7665-StackTrace: provide = 0 4: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya23DistributorGroupHandler25addSequencesToPhase1CacheENS_9SequencesERNS_15ReplicationModeE+0x939)
2020/03/17 18:56:05.278-#0-7954/7665-StackTrace: provide = 0 5: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya23DistributorGroupHandler23handleSplittedSequencesENS_9SequencesERKNS_15ReplicationModeERKb+0x20a)
2020/03/17 18:56:05.278-#0-7954/7665-StackTrace: provide = 0 6: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya23DistributorGroupHandler15handleSequencesENS_9SequencesERKNS_15ReplicationModeERKb+0x577)
2020/03/17 18:56:05.278-#0-7954/7665-StackTrace: provide = 0 7: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya19Distributor_AO_IMPL23continueHandleSequencesENS_9SequencesENS_15ReplicationModeEbRKNS_10GridCopyIDE+0xf7)
2020/03/17 18:56:05.278-#0-7954/7665-StackTrace: provide = 0 8: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya16SequencesRequest21continueHandleRequestERNS_28JournalRegulationRequestBase14RequestHandlerE+0x30b)
2020/03/17 18:56:05.278-#0-7954/7665-StackTrace: provide = 0 9: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya31JournalRegulationThread_AO_IMPL9process_iERKNS_16GroupGridCopyRIDE+0x36f)
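
To check whether a collected replication log contains this assertion, a small script along the following lines may help. It is only a sketch: the function name find_phase1_assertions is hypothetical, the marker string is taken from the assertion text quoted above, and the log file paths are whatever you pass on the command line.

```python
import sys
from typing import Iterable, List

# Assertion name copied from the log excerpt in this article.
ASSERT_MARKER = "bIsPhase1CacheMemorySufficient"


def find_phase1_assertions(lines: Iterable[str]) -> List[str]:
    """Return every log line that mentions the phase 1 cache memory assertion."""
    return [line.rstrip("\n") for line in lines if ASSERT_MARKER in line]


if __name__ == "__main__":
    # Each command-line argument is treated as a path to a replication log file.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for hit in find_phase1_assertions(fh):
                print(f"{path}: {hit}")
```

Run it against one or more extracted log files; every line it prints carries the timestamp at which the assertion was raised, which can then be compared with the InitNCOnePhaseSpeed statistics from the same period.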
Cause
Resolution
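Workaround:
Change the value of the tweak t_phase1CacheMemoryThreadSleepTime to 5000. This increases the wait time from 10 microseconds to 5 milliseconds and ensures that the assertion is not raised until the thread has waited 5 milliseconds for memory.

If the issue still occurs:
1. Collect the production site logs as well. They show how much data was being sent from production when the issue occurred.
2. Change the value of the tweak t_maxNoOfTriesToWaitForPhase1CacheMemory to 10.

Reminder: These tweaks are relevant only to version 5.1.3 and later. If the code version is earlier than 5.1.3, RecoverPoint must be upgraded to the latest code before these tweaks can be used.

Resolution:
Dell EMC Engineering is currently investigating this issue, and a permanent fix is in development. For assistance, contact the Dell EMC Customer Support Center or your service representative and reference this solution ID.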
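
As a rough illustration of why the two tweaks matter, the sketch below models a bounded wait for memory. It is not RecoverPoint code: wait_for_memory and worst_case_wait_ms are hypothetical names, and the assumption that the total wait equals the sleep time multiplied by the number of tries is ours, not something this article states.

```python
import time
from typing import Callable


def wait_for_memory(has_memory: Callable[[], bool],
                    sleep_time_us: int = 5000,
                    max_tries: int = 10,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll has_memory() up to max_tries times, sleeping after each failed poll."""
    for _ in range(max_tries):
        if has_memory():
            return True
        sleep(sleep_time_us / 1_000_000)  # microseconds -> seconds
    return False  # in this model, the caller would raise the assertion here


def worst_case_wait_ms(sleep_time_us: int, max_tries: int) -> float:
    """Longest time spent sleeping before giving up, in milliseconds (assumed model)."""
    return sleep_time_us * max_tries / 1000
```

Under this assumed model, a sleep time of 5000 microseconds with 10 tries allows up to 50 milliseconds for memory to become available, compared with 5 milliseconds for a single try.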