RecoverPoint: Replication process crashes when phase 1 cache memory is insufficient
Summary: Replication crashes with an assertion that phase 1 cache memory is insufficient, which leads to reboot regulation.
Symptoms
The consistency group (CG) remains in the initializing state, but normal distribution never appears to start and the CG does not transition to the active state.

When phase 1 cache memory is insufficient and the target-side RecoverPoint Appliance cannot write to the target journal, the replication process crashes and logs an assertion.

Symptoms found in the /home/kos/replication logs:

Assertion:

XXXX/XX/XX 18:59:25.693-#2-17936/16776-AssertLogSender: send log: topic=DistributorGroupHandler, msg=Assertion failed: bIsPhase1CacheMemoryEnough Line 1825 File DistributorGroupHandlerPhase1.cc PID:16776 Info: regular phase1 cache memory is not enough m_GroupGridCopyRID = (groupCopyRID=(kVolSlot=XXXXXXXXXX,globalCopyID=GlobalCopy(SiteUID(0xXXXXXXXXXXXXXX) 0) ),gridCopyID=0)

XXXX/XX/XX 18:59:25.694-#2-16911/16776-RemoteLogSender: got event (uniqueId=0, eventTime=1584471565693987), EventID_KBOX_ASSERTION_FAILED(3031), SiteUID(0xxxxxxxxxxxxxxxxxx), seDetails=Sender=replication, Topic=DistributorGroupHandler,msg=Assertion failed: bIsPhase1CacheMemoryEnough Line 1825 File DistributorGroupHandlerPhase1.cc PID:16776 Info: regular phase1 cache memory is not enough m_GroupGridCopyRID = (groupCopyRID=(kVolSlot=XXXXXXXXXX,globalCopyID=GlobalCopy(SiteUID(0xXXXXXXXXXXX) 0) ),gridCopyID=0)

Statistics showing high data traffic:

XXXX/XX/XX 18:52:41.520-#2-7676/7665-累加器格式管理員::printStatistics: Group stats for group Option( kVolSlot = XXXXXXXXXX groupUID = GroupCopy(1346840554 SiteUID(0xXXXXXXXXXXX) 0) gridID = 0):{
STATISTICS: name=InitNCOnePhaseSpeed kVolSlot = 1346840554 groupUID = GroupCopy(1346840554 SiteUID(0xXXXXXXXXXXXXX) 0) gridID = 0 description: init nc one phase speed .
STATISTICS: name=InitNCOnePhaseSpeed kVolSlot = 1346840554 groupUID = GroupCopy(1346840554 SiteUID(0xXXXXXXXXXXXXX) 0) gridID = 0 8 sec window: avg: 1.14E+03 MB/sec
STATISTICS: name=InitNCOnePhaseSpeed kVolSlot = 1346840554 groupUID = GroupCopy(1346840554 SiteUID(0xXXXXXXXXXXXXX) 0) gridID = 0 77 sec window: avg: 1.06e+03 MB/sec

The consistency group is in the initializing state:

2020/03/17 18:56:05.070 - #2 - 7954/7665 - InitNCState::DistributeOnePhase: distributing one phase m_groupID = (groupCopyRID=( kVolSlot=XXXXXXXXXX,globalCopyID=GlobalCopy(SiteUID(0xXXXXXXXXXXXX) 0) ),gridCopyID=0)

The phase 1 consumer for this consistency group shows high consumption at the time of the assertion:

XXXX/XX/XX 18:56:05.241-#2-7954/7665-MemoryManager: viscus at assert + countdown = 2413/390 + min memory requirement = 433429 (fixed 329537 flexible 103892) + flexible space used = 37977/3864963 + pool space used = 37985/4194500 (max 143544) >> 1160635626647715840 :phase1#22 >> (groupTaskID=(sessionID=1817723153,replicationLinkID=(kVolSlot=XXXXXXXXX,srcCopyID=GlobalCopy(SiteUID(0xXXXXXXXXXXXX) >> 0) ,destCopyID=GlobalCopy(SiteUID

A replication stack trace is also produced:

2020/03/17 18:56:05.278-#0-7954/7665-StackTrace: errno = 0 3: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZNK6Kashya23DistributorGroupHandler21waitForMemoryIfNeededEv+0x5b2) [0xxxxxxxxxxxxxxx]
2020/03/17 18:56:05.278-#0-7954/7665-StackTrace: errno = 0 4: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya23DistributorGroupHandler25addSequencesToPhase1CacheENS_9SequencesERNS_15ReplicationModeE+0x939)
2020/03/17 18:56:05.278-#0-7954/7665-StackTrace: errno = 0 5: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya23DistributorGroupHandler23handleSplittedSequencesENS_9SequencesERKNS_15ReplicationModeERKb+0x20a)
2020/03/17 18:56:05.278-#0-7954/7665-StackTrace: errno = 0 6: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya23DistributorGroupHandler15handleSequencesENS_9SequencesERKNS_15ReplicationModeERKb+0x577)
2020/03/17 18:56:05.278-#0-7954/7665-StackTrace: errno = 0 7: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya19Distributor_AO_IMPL23continueHandleSequencesENS_9SequencesENS_15ReplicationModeEbRKNS_10GridCopyIDE+0xf7)
2020/03/17 18:56:05.278-#0-7954/7665-StackTrace: errno = 0 8: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya16SequencesRequest21continueHandleRequestERNS_28JournalRegulationRequestBase14RequestHandlerE+0x30b)
2020/03/17 18:56:05.278-#0-7954/7665-StackTrace: errno = 0 9: /home/kos/kashya/archive/lib/libreplication_libsrelease.so(_ZN6Kashya31JournalRegulationThread_AO_IMPL9process_iERKNS_16GroupGridCopyRIDE+0x36f)
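
To check a collected replication log for the signature above, a short script can search for the assertion and the throughput statistics. The following is a minimal sketch, not a Dell EMC tool: the function name scan_replication_log and both regular expressions are assumptions derived only from the excerpts in this article, and they match on the identifiers bIsPhase1CacheMemoryEnough and InitNCOnePhaseSpeed rather than on the full line layout.

```python
import re

# Matches the assertion line and captures its timestamp and process ID.
# Pattern is an assumption based on the log excerpts in this article.
ASSERT_RE = re.compile(
    r"^(?P<ts>\S+ \d{2}:\d{2}:\d{2}\.\d+).*?"
    r"Assertion failed: bIsPhase1CacheMemoryEnough.*?"
    r"PID:(?P<pid>\d+)"
)

# Matches an InitNCOnePhaseSpeed statistics line that reports an average in MB/s.
SPEED_RE = re.compile(
    r"name=InitNCOnePhaseSpeed.*?(?P<mbps>\d+(?:\.\d+)?[eE][+-]?\d+) MB/"
)


def scan_replication_log(lines):
    """Return (assertions, speeds) found in an iterable of log lines.

    assertions: list of (timestamp, pid) tuples for phase 1 cache assertions.
    speeds: list of average one-phase init speeds in MB/s, in log order.
    """
    assertions, speeds = [], []
    for line in lines:
        match = ASSERT_RE.search(line)
        if match:
            assertions.append((match.group("ts"), int(match.group("pid"))))
        match = SPEED_RE.search(line)
        if match:
            speeds.append(float(match.group("mbps")))
    return assertions, speeds
```

For example, calling scan_replication_log(open(path)) on a log file copied from the RPA (path is whatever local file you saved) returns the assertion timestamps together with the reported speeds, which makes it easier to see whether the assertion coincided with a high initialization rate.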
Cause
Resolution
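Workaround: Change the value of the tweak t_phase1CacheMemoryThreadSleepTime to 5000. (This increases the wait time from 10 microseconds to 5 milliseconds.) This ensures that the process does not assert until the thread has waited 5 milliseconds for memory.

If the issue persists:
1. Also collect the production-site logs. They show how much data was being sent from production when the issue occurred.
2. Change the value of the tweak t_maxNoOfTriesToWaitForPhase1CacheMemory to 10.

Note: These tweaks are relevant only to version 5.1.3 and later. If the code version is earlier than 5.1.3, RecoverPoint must be upgraded to the latest code before these tweaks can be used.

Resolution: Dell EMC Engineering is currently investigating this issue. A permanent fix is still in progress. For technical assistance, contact the Dell EMC Customer Support Center or your service representative and quote this solution ID.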
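
The version requirement and the unit conversion in the workaround can be checked with a few lines of Python. This is only an illustrative sketch: the helper names parse_version, tweaks_available, and us_to_ms are hypothetical, the version is assumed to be a plain dotted numeric string such as "5.1.3", and the code does not read or set anything on an RPA.

```python
# Tweak values taken from the workaround in this article.
# The first value is in microseconds; the second is a retry count.
WORKAROUND_TWEAKS = {
    "t_phase1CacheMemoryThreadSleepTime": 5000,
    "t_maxNoOfTriesToWaitForPhase1CacheMemory": 10,
}


def parse_version(text):
    """Turn a dotted numeric version such as '5.1.3' into a tuple of ints.

    Assumption: the version string contains only digits and dots.
    """
    return tuple(int(part) for part in text.split("."))


def tweaks_available(version, minimum="5.1.3"):
    """Return True if the given version is at least the minimum version."""
    return parse_version(version) >= parse_version(minimum)


def us_to_ms(microseconds):
    """Convert microseconds to milliseconds."""
    return microseconds / 1000


if __name__ == "__main__":
    sleep_us = WORKAROUND_TWEAKS["t_phase1CacheMemoryThreadSleepTime"]
    print(f"Sleep time: {sleep_us} us = {us_to_ms(sleep_us)} ms")
    print("Tweaks usable on 5.1.2:", tweaks_available("5.1.2"))
    print("Tweaks usable on 5.1.3:", tweaks_available("5.1.3"))
```

Tuple comparison treats a shorter version such as "5.1" as older than "5.1.3", which matches the conservative reading that an unspecified patch level should not be assumed to include the tweaks.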