PowerFlex: MDM Switchover Fails Due to Memory Allocation Failure - mos_MemMalloc

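Summary: When MDM ownership is switched (manually or otherwise), the receiving MDM fails to start correctly because of a memory allocation failure, leaving the cluster without a primary MDM.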


Symptoms

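The event output of /opt/emc/scaleio/mdm/bin/showevents.py on the receiving MDM contains multiple entries showing attempts to take over the primary MDM role, all within a short time of each other: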

2017-10-04 12:08:33.915 MDM_CLUSTER_BECOMING_MASTER WARNING This MDM, ID 394760fd6714xxxx, took control of the cluster and is now the Master MDM.
2017-10-04 12:08:33.915 MDM_BECOMING_MASTER WARNING This MDM is switching to Master mode. MDM will start running.
..
2017-10-04 12:08:34.309 MDM_CLUSTER_BECOMING_MASTER WARNING This MDM, ID 394760fd6714xxxx, took control of the cluster and is now the Master MDM.
2017-10-04 12:08:34.309 MDM_BECOMING_MASTER WARNING This MDM is switching to Master mode. MDM will start running.

The exp.0 file from the receiving MDM contains entries like the following:

04/10 12:08:34.079823 Panic in file /data/build/workspace/ScaleIO-SLES12-2/src/mos/usr/mos_utils.c, line 73, function mos_MemMalloc, PID 9978.Panic Expression bCanFail .
/opt/emc/scaleio/mdm/bin/mdm-2.0.13000.211(mosDbg_PanicPrepare+0x115) [0x6a86f5]
/opt/emc/scaleio/mdm/bin/mdm-2.0.13000.211(mos_MemMalloc+0x81) [0x6ac0d1]
/opt/emc/scaleio/mdm/bin/mdm-2.0.13000.211(multiHeadMgr_GetUpdateMultiHeadsMsg+0x66) [0x57123c]
/opt/emc/scaleio/mdm/bin/mdm-2.0.13000.211(tgtMgr_ConfigureTgt+0x9c1) [0x4d579e]
/opt/emc/scaleio/mdm/bin/mdm-2.0.13000.211(tgtMgr_HandleWorkReq+0x41b) [0x4d6206]
/opt/emc/scaleio/mdm/bin/mdm-2.0.13000.211() [0x6c57d8]
/opt/emc/scaleio/mdm/bin/mdm-2.0.13000.211(mosUmt_StartFunc+0xea) [0x6c51af]
/opt/emc/scaleio/mdm/bin/mdm-2.0.13000.211(mosUmt_SignalHandler+0x51) [0x6c65d1]
/lib64/libpthread.so.0(+0x10b00) [0x7f844e8a6b00]
/lib64/libc.so.6(sleep+0xd4) [0x7f844d8911a4]
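
To confirm this signature on a node, the exp file can be searched for the panic line. The following is a minimal sketch, not a supported tool: the function name has_memalloc_panic is hypothetical, and the caller must supply the path to the exp.0 file because its location is not stated in this article.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given file contains a mos_MemMalloc panic.
# The path is supplied by the caller; this article does not state where exp.0 lives.
has_memalloc_panic() {
    grep -q 'function mos_MemMalloc' "$1"
}

# Usage (replace the placeholder path with the real exp.0 location):
#   has_memalloc_panic /path/to/exp.0 && echo "mos_MemMalloc panic found"
```

The function only reports whether the string is present. It does not interpret the stack trace.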

The /var/log/messages file shows the MDM service restarting multiple times, as does events.txt:

systemd[1]: mdm.service: Main process exited, code=exited, status=255/n/a
systemd[1]: mdm.service: Unit entered failed state.
systemd[1]: mdm.service: Failed with result 'exit-code'.
systemd[1]: mdm.service: Service has no hold-off time, scheduling restart.
systemd[1]: Stopped scaleio mdm.
systemd[1]: mdm.service: Start request repeated too quickly.
systemd[1]: Failed to start scaleio mdm.
systemd[1]: mdm.service: Unit entered failed state.
systemd[1]: mdm.service: Failed with result 'start-limit'.

Cause

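The root cause is that the Linux operating system reaches its memory commit limit and cannot grant the MDM service's memory request during initialization. This happens because the kernel overcommit parameters are not tuned correctly.
Note: If the operating system had actually allocated more memory than is physically available, oom-killer messages would appear in the messages file, and other services and processes would have been killed before these failures.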
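
To tell the two failure modes apart, you can check whether the kernel logged any oom-killer activity. This is a minimal sketch under stated assumptions: the log is at /var/log/messages as referenced in this article, the kernel lines contain "invoked oom-killer" or "Out of memory:", and the function name count_oom_lines is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count oom-killer related lines in a log file.
# Assumes the kernel strings "invoked oom-killer" and "Out of memory:".
count_oom_lines() {
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -E -c 'invoked oom-killer|Out of memory:' "$1" || true
}

log=/var/log/messages
if [ -r "$log" ]; then
    n=$(count_oom_lines "$log")
    if [ "$n" -eq 0 ]; then
        echo "No oom-killer lines in $log; consistent with a commit-limit failure."
    else
        echo "$n oom-killer lines in $log; the host may really be out of memory."
    fi
fi
```

A count of zero is only a hint. Rotated log files are not searched by this sketch.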

Resolution

This is not a ScaleIO issue. ScaleIO is working as designed.

To check and, if needed, modify the vm.overcommit settings, follow these steps:

1. Log in to the SDS as root using SSH.

2. Run:

cat /etc/sysctl.conf | grep "vm.overcommit"
Example:
[root@sds-node logs]# cat /etc/sysctl.conf | grep "vm.overcommit"
vm.overcommit_memory = 2
vm.overcommit_ratio = 50

3. Run the following commands:

sed -i 's/vm\.overcommit_memory = .*/vm\.overcommit_memory = 2/g' /etc/sysctl.conf
sed -i 's/vm\.overcommit_ratio = .*/vm\.overcommit_ratio = 100/g' /etc/sysctl.conf
sysctl -p

Verify:

[root@sds-node logs]# cat /etc/sysctl.conf | grep "vm.overcommit"
vm.overcommit_memory = 2
vm.overcommit_ratio = 100


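Repeat these steps on every affected SDS in the environment so that all of them use the recommended best-practice settings. You do not need to put the SDS into maintenance mode to do this.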
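
The sed commands in step 3 only rewrite lines that already exist in the form "name = value". As a hedged alternative, the sketch below rewrites the line if it is present and appends it otherwise. The function name set_sysctl_conf is hypothetical and is not part of PowerFlex or of sysctl. Test it on a copy of the file first.

```shell
#!/bin/sh
# Hypothetical helper: set "key = value" in a sysctl-style file.
# Rewrites an existing uncommented line for the key, or appends one if none exists.
set_sysctl_conf() {
    file=$1 key=$2 value=$3
    # Escape dots so the key is matched literally in the regular expressions.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -Eq "^[[:space:]]*${pattern}[[:space:]]*=" "$file"; then
        scratch=$(mktemp) || return 1
        sed -E "s/^[[:space:]]*${pattern}[[:space:]]*=.*/${key} = ${value}/" "$file" > "$scratch"
        # Write back with cat so the original file keeps its owner and mode.
        cat "$scratch" > "$file"
        rm -f "$scratch"
    else
        printf '%s = %s\n' "$key" "$value" >> "$file"
    fi
}

# Usage as root, mirroring the values in the steps above:
#   set_sysctl_conf /etc/sysctl.conf vm.overcommit_memory 2
#   set_sysctl_conf /etc/sysctl.conf vm.overcommit_ratio 100
#   sysctl -p
```

Commented-out lines are left untouched, so a "#vm.overcommit_ratio = 50" line results in a new line being appended rather than the comment being edited.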

To learn more about these settings, see the Linux kernel documentation on overcommit accounting.

Additional Information

Check the sysctl kernel parameters related to memory overcommit:

# sysctl -a |grep commit
vm.overcommit_memory = 2 (default is 0)
vm.overcommit_ratio = 50 (default is 50)

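In this case, vm.overcommit_memory is set to 2, which means the kernel does not overcommit memory. Any allocation that would push the total past the commit limit fails. The system's total committed address space may not exceed swap plus a configurable percentage of physical RAM, set by vm.overcommit_ratio (50% by default). When vm.overcommit_memory is 0, the kernel refuses obvious overcommit requests but lets root processes allocate slightly more than the limit.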

To check the current commit limit and the amount already committed, look at CommitLimit and Committed_AS in the output of the following command:

#cat /proc/meminfo 
MemTotal: 8174572 kB 
.. 
CommitLimit: 4087284 kB 
Committed_AS: 3879388 kB

This host has 8 GB of RAM, and CommitLimit is ~4 GB, which is 50% of physical RAM and matches vm.overcommit_ratio = 50.
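
The gap between the two values is the room left before allocations start to fail. As an illustration only, the sketch below extracts both fields and prints the difference. The function name commit_headroom is hypothetical. It reads any file in /proc/meminfo format and defaults to /proc/meminfo itself.

```shell
#!/bin/sh
# Hypothetical helper: print CommitLimit, Committed_AS and their difference (kB)
# from a meminfo-format file. Defaults to /proc/meminfo when no file is given.
commit_headroom() {
    awk '
        $1 == "CommitLimit:"  { limit = $2 }
        $1 == "Committed_AS:" { used = $2 }
        END { printf "CommitLimit=%d kB Committed_AS=%d kB Headroom=%d kB\n", limit, used, limit - used }
    ' "${1:-/proc/meminfo}"
}

# Example with the values shown above (sample data, not live output).
sample=$(mktemp)
printf '%s\n' 'MemTotal: 8174572 kB' 'CommitLimit: 4087284 kB' 'Committed_AS: 3879388 kB' > "$sample"
commit_headroom "$sample"
rm -f "$sample"
```

With the sample values, the headroom is 207896 kB. On a live node, call commit_headroom with no argument.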

 

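To resolve this issue, add or edit the following in /etc/sysctl.conf:

 Change vm.overcommit_ratio to 100 so that the operating system can commit the total available address space, and then reboot.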
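
To see what the change does to the limit, the commit limit can be estimated as swap plus RAM multiplied by the ratio. This is a rough sketch: the function name expected_commit_limit is hypothetical, huge pages are ignored, and a swap size of 0 is an assumption inferred from the CommitLimit value on the example host.

```shell
#!/bin/sh
# Hypothetical helper: estimate CommitLimit in kB for vm.overcommit_memory = 2.
# Approximation: swap + RAM * ratio / 100 (huge pages are ignored here).
expected_commit_limit() {
    ram_kb=$1 swap_kb=$2 ratio=$3
    echo $(( swap_kb + ram_kb * ratio / 100 ))
}

# MemTotal from the example host; swap of 0 is an assumption.
expected_commit_limit 8174572 0 50    # prints 4087286
expected_commit_limit 8174572 0 100   # prints 8174572
```

The kernel computes the limit in pages, so the value it reports (4087284 kB on the example host) can differ from this estimate by a few kB.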

To learn more about these settings, see the Linux kernel documentation on overcommit accounting.

Affected Products

PowerFlex rack, VxFlex Product Family
Article Properties
Article Number: 000030300
Article Type: Solution
Last Modified: 22 Sept 2025
Version:  7